---
license: cc-by-nc-nd-4.0
---

# Spark TTS Vietnamese

Spark-TTS is an advanced text-to-speech system that uses the power of large language models (LLMs) for highly accurate and natural-sounding voice synthesis. It is designed to be efficient, flexible, and powerful for both research and production use. This model was trained on the [viVoice](https://huggingface.co/datasets/thinhlpg/viVoice) Vietnamese dataset.

# Usage

First, install the required packages:

```bash
pip install --upgrade transformers accelerate soundfile
```

## Text-to-Speech

We have customized the code so you can run inference with the Hugging Face `transformers` library without installing anything else.

```python
from transformers import AutoProcessor, AutoModel
import soundfile as sf
import torch

device = "cuda"
model_id = "DragonLineageAI/Vi-SparkTTS-0.5B"

# Load the processor and model (custom code shipped with the checkpoint).
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().to(device)
processor.model = model

# Reference audio for voice cloning and the text to synthesize.
prompt_audio_path = "path_to_audio_path"  # CHANGE TO YOUR ACTUAL PATH
prompt_transcript = "text corresponding to prompt audio"  # Optional
text_input = "xin chào mọi người chúng tôi là Nguyễn Công Tú Anh và Chu Văn An đến từ dragonlineageai"

inputs = processor(
    text=text_input.lower(),
    prompt_speech_path=prompt_audio_path,
    prompt_text=prompt_transcript,
    return_tensors="pt"
).to(device)
global_tokens_prompt = inputs.pop("global_token_ids_prompt", None)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=3000,
        do_sample=True,
        temperature=0.8,
        top_k=50,
        top_p=0.95,
        eos_token_id=processor.tokenizer.eos_token_id,
        pad_token_id=processor.tokenizer.pad_token_id
    )

# Decode the generated token ids back into a waveform.
output_clone = processor.decode(
    generated_ids=output_ids,
    global_token_ids_prompt=global_tokens_prompt,
    input_ids_len=inputs["input_ids"].shape[-1]
)

sf.write("output_cloned.wav", output_clone["audio"], output_clone["sampling_rate"])
```
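
For a quick sanity check, you can read the generated file back and inspect its length and sample rate. This is a minimal sketch using `soundfile`; the file name comes from the example above:

```python
import soundfile as sf

# Read back the file written by the example above and report its duration.
audio, sr = sf.read("output_cloned.wav")
print(f"Generated {len(audio) / sr:.2f} s of audio at {sr} Hz")
```
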
## Finetune

You can finetune this model on any dataset to improve quality or to train it for a new language; see the [training code](https://github.com/tuanh123789/Spark-TTS-finetune).
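
If you want to reproduce or extend the original training data, the viVoice dataset referenced above can be loaded with the Hugging Face `datasets` library. This is a minimal sketch; the `train` split name is an assumption, so check the dataset card before use:

```python
from datasets import load_dataset

# Stream the viVoice dataset used to train this model.
# The "train" split is an assumption; verify available splits on the dataset card.
ds = load_dataset("thinhlpg/viVoice", split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())  # inspect the available fields before writing a preprocessing step
```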