Hacker News story: Fish Speech TTS: clone OpenAI TTS in 30 minutes

While we are still working out how to bring the agent's emotional response up to OpenAI GPT-4's level, we have already made significant progress in matching OpenAI's TTS performance. To begin this experiment, we collected 10 hours of OpenAI TTS data and performed supervised fine-tuning (SFT) on both the LLM and VITS models, which took approximately 30 minutes. After that, we used 15 seconds of audio as a prompt during inference.

Demos available: https://ift.tt/a8cPkzN

As you can hear, the model's emotion, rhythm, accent, and timbre match the OpenAI speakers, though there is some degradation in audio quality, which we are working on. To avoid legal issues, we cannot release the fine-tuned model, but we believe anyone can tune Fish Speech to this level within hours and for around $20.

Our experiment also shows that with only 25 seconds of prompt audio (few-shot, without any fine-tuning), the model can mimic most behaviors except how it reads numbers. To the best of our knowledge, this framework lets you clone how someone speaks in English, Chinese, and Japanese with 30 minutes of data.

Repo: https://ift.tt/B30zDb6
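To make the "15 seconds of audio as a prompt" idea concrete, here is a minimal toy sketch of how a prompt-conditioned TTS LLM typically consumes a reference clip: the reference audio is encoded into discrete semantic tokens, those tokens are prepended to the text tokens, and the LLM continues the sequence with new audio tokens that a VITS-style decoder would turn back into a waveform. All function names here are hypothetical illustrations, not the actual Fish Speech API.

```python
# Toy sketch of prompt-conditioned TTS (NOT the real Fish Speech API).
# The reference clip carries timbre/rhythm; the text is what to say.

def encode_audio_to_tokens(samples, frame_size=160):
    # Stand-in for a neural audio tokenizer: one discrete token per frame
    # (a real codec like the one in Fish Speech learns this mapping).
    return [hash(tuple(samples[i:i + frame_size])) % 1024
            for i in range(0, len(samples) - frame_size + 1, frame_size)]

def build_prompt(reference_tokens, text):
    # Prepend reference-audio tokens to the text tokens; the LLM would
    # then autoregressively generate audio tokens in the cloned voice.
    return reference_tokens + [ord(c) for c in text]

# 15 s of "audio" at 16 kHz (silence here; a real reference clip in practice).
reference = [0.0] * (15 * 16000)
prompt = build_prompt(encode_audio_to_tokens(reference), "Hello there")
print(len(prompt))  # reference frames plus text tokens
```

The point of the sketch is only the data flow: the 15-second (or 25-second few-shot) prompt becomes a token prefix, which is why no weight update is needed for the model to pick up the speaker's voice.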

Reviewed by Tha Kur on May 22, 2024.
