---
license: cc-by-nc-nd-4.0
---

# Spark TTS Vietnamese

Spark-TTS is an advanced text-to-speech system that uses the power of large language models (LLMs) for highly accurate and natural-sounding voice synthesis. It is designed to be efficient, flexible, and powerful for both research and production use. This model was trained on the [viVoice](https://huggingface.co/datasets/thinhlpg/viVoice) Vietnamese dataset.

# Usage

First, install the required packages:

```bash
pip install --upgrade transformers accelerate soundfile
```

## Text-to-Speech

We have customized the code so you can run inference with the Hugging Face `transformers` library without installing anything else.

```python
from transformers import AutoProcessor, AutoModel
import soundfile as sf
import torch

device = "cuda"
model_id = "DragonLineageAI/Vi-SparkTTS-0.5B"

# Load the processor and model (custom code shipped with the checkpoint).
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().to(device)
processor.model = model

# Reference audio for voice cloning and the text to synthesize.
prompt_audio_path = "path_to_audio_path"  # CHANGE TO YOUR ACTUAL PATH
prompt_transcript = "text corresponding to prompt audio"  # Optional
text_input = "xin chào mọi người chúng tôi là Nguyễn Công Tú Anh và Chu Văn An đến từ dragonlineageai"

inputs = processor(
    text=text_input.lower(),
    prompt_speech_path=prompt_audio_path,
    prompt_text=prompt_transcript,
    return_tensors="pt"
).to(device)
global_tokens_prompt = inputs.pop("global_token_ids_prompt", None)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=3000,
        do_sample=True,
        temperature=0.8,
        top_k=50,
        top_p=0.95,
        eos_token_id=processor.tokenizer.eos_token_id,
        pad_token_id=processor.tokenizer.pad_token_id
    )

# Decode the generated token ids back into a waveform.
output_clone = processor.decode(
    generated_ids=output_ids,
    global_token_ids_prompt=global_tokens_prompt,
    input_ids_len=inputs["input_ids"].shape[-1]
)

sf.write("output_cloned.wav", output_clone["audio"], output_clone["sampling_rate"])
```
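
For a quick sanity check, you can read the generated file back and inspect its length and sample rate. This is a minimal sketch using `soundfile`; the file name comes from the example above:

```python
import soundfile as sf

# Read back the file written by the example above and report its duration.
audio, sr = sf.read("output_cloned.wav")
print(f"Generated {len(audio) / sr:.2f} s of audio at {sr} Hz")
```
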
## Finetune

You can finetune this model on any dataset to improve quality or to train it for a new language; see the [training code](https://github.com/tuanh123789/Spark-TTS-finetune).
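
If you want to reproduce or extend the original training data, the viVoice dataset referenced above can be loaded with the Hugging Face `datasets` library. This is a minimal sketch; the `train` split name is an assumption, so check the dataset card before use:

```python
from datasets import load_dataset

# Stream the viVoice dataset used to train this model.
# The "train" split is an assumption; verify available splits on the dataset card.
ds = load_dataset("thinhlpg/viVoice", split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())  # inspect the available fields before writing a preprocessing step
```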