Demo Vs Local Install
Is there any drastic difference in quality between a local install and the provided demo? I have an entry-level RTX 2050 laptop GPU and I don't wanna stress it out. However, if there is a big difference in quality, it might be worth the compilation time and GPU stress, as I'm planning to use this as a backup voiceover for my YouTube videos, where quality audio must be prioritized. Should I install it locally? Is it worth the install headache?
Hi,
Thanks for your interest in the demo!
The audio quality on the online demo should be identical to a local installation.
However, the online demo is subject to the ZeroGPU rate limits. This means that if you do not have a Hugging Face Pro subscription, usage may be limited and you may need to wait for your quota to refill.
Installing it locally on your laptop should be quite straightforward and fast; however, running the model itself will be quite slow on your GPU. The online demo uses A100 GPUs, and an RTX 2050 is significantly slower, so generation times will be much longer.
Thanks!
Thank you for this transparent reply. And well, I just hit the limits right now. Does the limit renew every 24 hours? And how much daily quota do we have for accessing F5 TTS? One more doubt: what's the difference between E2 and F5?
Hi,
I believe the limit for ZeroGPU on the free tier is 5 GPU minutes per day. So yes, it resets every day. Also, it may refill throughout the day, but I think they recently made some changes to that so I’m not sure if it still does. Note that 5 GPU minutes means 5 minutes of time spent generating audio, not 5 minutes of audio generated. Also, the limit is per account, so if you have multiple accounts, you get a separate quota per account.
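To make the GPU-minutes distinction concrete, here's a rough back-of-envelope sketch. The real-time factor below is a made-up number for illustration, not a measured benchmark of this Space:

```python
# Rough quota math for ZeroGPU's free tier (5 GPU minutes per day).
# The quota counts time spent GENERATING, not the duration of audio produced.

QUOTA_SECONDS = 5 * 60  # 5 GPU minutes of compute per day (free tier)

# Hypothetical real-time factor: suppose the demo spends ~0.1 s of GPU time
# per 1 s of audio generated (an assumption, purely for illustration).
RTF = 0.1

audio_seconds_per_day = QUOTA_SECONDS / RTF
print(f"~{audio_seconds_per_day / 60:.0f} minutes of audio per day")
```

So under that assumed speed, 5 GPU minutes would stretch to roughly 50 minutes of generated audio; the slower the model runs, the less audio your quota buys.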
The difference between E2 and F5 TTS:
E2 TTS was a paper originally published by Microsoft. They did not release code or models openly, but some unofficial implementations were made. F5 TTS is an improvement on the original E2 TTS. The authors of F5 TTS trained two models: an F5 TTS model with the improvements, and an E2 TTS model that aligns with Microsoft's paper. Later, the authors released a new version of F5 TTS (Base V1) with improved quality. While there's no definitive "best" model, F5 TTS v1 usually works best in most circumstances. However, feel free to try both and compare them.
Another note: if you find generation is slow on your 2050, you can try running the demo in Google Colab on the free plan. Colab's free T4 will likely be faster than a 2050 (although slower than this demo's A100s). Colab also has a much more generous usage quota; you shouldn't run into any limits if you only run it occasionally.