|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
pipeline_tag: text-to-speech |
|
tags: |
|
- text-to-speech |
|
--- |
|
|
|
## CSM 1B |
|
|
|
**2025/05/20** - CSM is available natively in [Hugging Face Transformers](https://huggingface.co/docs/transformers/main/en/model_doc/csm) 🤗 as of version `4.52.1`
|
|
|
**2025/03/13** - We are releasing the 1B CSM variant. The checkpoint is [hosted on Hugging Face](https://huggingface.co/sesame/csm-1b).
|
|
|
--- |
|
|
|
CSM (Conversational Speech Model) is a speech generation model from [Sesame](https://www.sesame.com) that generates RVQ audio codes from text and audio inputs. The model architecture employs a [Llama](https://www.llama.com/) backbone and a smaller audio decoder that produces [Mimi](https://huggingface.co/kyutai/mimi) audio codes.
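
For a quick look at these components, you can load the checkpoint with Transformers and inspect its sub-modules. This is a minimal illustrative sketch; `model.depth_decoder` and `model.codec_model` are the attribute names used in the examples further down, and printing the model shows the Llama-style backbone alongside them.

```python
from transformers import CsmForConditionalGeneration

model_id = "sesame/csm-1b"
model = CsmForConditionalGeneration.from_pretrained(model_id)

# printing the module lists the Llama-style backbone, the smaller audio (depth) decoder,
# and the Mimi codec model used to turn generated RVQ codes back into a waveform
print(model)

# sub-modules referenced in the examples below
print(type(model.depth_decoder))  # audio decoder producing Mimi audio codes
print(type(model.codec_model))    # Mimi codec used to encode/decode audio
```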
|
|
|
A fine-tuned variant of CSM powers the [interactive voice demo](https://www.sesame.com/voicedemo) shown in our [blog post](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice). |
|
|
|
A hosted [Hugging Face Space](https://huggingface.co/spaces/sesame/csm-1b) is also available for testing audio generation.
|
|
|
## Usage |
|
|
|
### Generate a sentence |
|
|
|
```python |
|
import torch |
|
from transformers import CsmForConditionalGeneration, AutoProcessor |
|
|
|
model_id = "sesame/csm-1b" |
|
device = "cuda" if torch.cuda.is_available() else "cpu" |
|
|
|
# load the model and the processor |
|
processor = AutoProcessor.from_pretrained(model_id) |
|
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device) |
|
|
|
# prepare the inputs |
|
text = "[0]Hello from Sesame." # `[0]` for speaker id 0 |
|
inputs = processor(text, add_special_tokens=True).to(device) |
|
|
|
# another equivalent way to prepare the inputs |
|
conversation = [ |
|
{"role": "0", "content": [{"type": "text", "text": "Hello from Sesame."}]}, |
|
] |
|
inputs = processor.apply_chat_template( |
|
conversation, |
|
tokenize=True, |
|
return_dict=True, |
|
).to(device) |
|
|
|
# run inference
|
audio = model.generate(**inputs, output_audio=True) |
|
processor.save_audio(audio, "example_without_context.wav") |
|
``` |
|
|
|
### CSM sounds best when provided with context |
|
|
|
```python |
|
import torch |
|
from transformers import CsmForConditionalGeneration, AutoProcessor |
|
from datasets import load_dataset, Audio |
|
|
|
model_id = "sesame/csm-1b" |
|
device = "cuda" if torch.cuda.is_available() else "cpu" |
|
|
|
# load the model and the processor |
|
processor = AutoProcessor.from_pretrained(model_id) |
|
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device) |
|
|
|
# prepare the inputs |
|
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train") |
|
# ensure the audio is 24kHz |
|
ds = ds.cast_column("audio", Audio(sampling_rate=24000)) |
|
conversation = [] |
|
|
|
# 1. context |
|
for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]): |
|
conversation.append( |
|
{ |
|
"role": f"{speaker_id}", |
|
"content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}], |
|
} |
|
) |
|
|
|
# 2. text prompt |
|
conversation.append({"role": f"{ds[4]['speaker_id']}", "content": [{"type": "text", "text": ds[4]["text"]}]}) |
|
|
|
inputs = processor.apply_chat_template( |
|
conversation, |
|
tokenize=True, |
|
return_dict=True, |
|
).to(device) |
|
|
|
# run inference
|
audio = model.generate(**inputs, output_audio=True) |
|
processor.save_audio(audio, "example_with_context.wav") |
|
``` |
|
|
|
--- |
|
|
|
### Batched Inference 📦 |
|
|
|
CSM supports batched inference: |
|
|
|
<details> |
|
|
|
<summary> code snippet </summary> |
|
|
|
```python |
|
import torch |
|
from transformers import CsmForConditionalGeneration, AutoProcessor |
|
from datasets import load_dataset, Audio |
|
|
|
model_id = "sesame/csm-1b" |
|
device = "cuda" if torch.cuda.is_available() else "cpu" |
|
|
|
# load the model and the processor |
|
processor = AutoProcessor.from_pretrained(model_id) |
|
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device) |
|
|
|
# prepare the inputs |
|
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train") |
|
# ensure the audio is 24kHz |
|
ds = ds.cast_column("audio", Audio(sampling_rate=24000)) |
|
# here, a batch with two conversations
|
conversation = [ |
|
[ |
|
{ |
|
"role": f"{ds[0]['speaker_id']}", |
|
"content": [ |
|
{"type": "text", "text": ds[0]["text"]}, |
|
{"type": "audio", "path": ds[0]["audio"]["array"]}, |
|
], |
|
}, |
|
{ |
|
"role": f"{ds[1]['speaker_id']}", |
|
"content": [ |
|
{"type": "text", "text": ds[1]["text"]}, |
|
], |
|
}, |
|
], |
|
[ |
|
{ |
|
"role": f"{ds[0]['speaker_id']}", |
|
"content": [ |
|
{"type": "text", "text": ds[0]["text"]}, |
|
], |
|
} |
|
], |
|
] |
|
inputs = processor.apply_chat_template( |
|
conversation, |
|
tokenize=True, |
|
return_dict=True, |
|
).to(device) |
|
|
|
audio = model.generate(**inputs, output_audio=True) |
|
processor.save_audio(audio, [f"speech_batch_idx_{i}.wav" for i in range(len(audio))]) |
|
``` |
|
|
|
</details> |
|
|
|
|
|
### Making The Model Go Brrr 🏎️ |
|
|
|
CSM supports full-graph compilation with CUDA graphs! |
|
|
|
<details> |
|
|
|
<summary> code snippet </summary> |
|
|
|
```python |
|
import torch |
|
import copy |
|
from transformers import CsmForConditionalGeneration, AutoProcessor |
|
from datasets import load_dataset |
|
|
|
model_id = "sesame/csm-1b" |
|
device = "cuda" |
|
|
|
# enable logging to check that there are no recompilations or graph breaks
|
torch._logging.set_logs(graph_breaks=True, recompiles=True, cudagraphs=True) |
|
|
|
# load the model and the processor |
|
processor = AutoProcessor.from_pretrained(model_id) |
|
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device) |
|
|
|
# use a static cache, which automatically enables torch.compile with fullgraph and reduce-overhead
|
model.generation_config.max_length = 250 # big enough to avoid recompilation |
|
model.generation_config.max_new_tokens = None # unset, as it would take precedence over max_length
|
model.generation_config.cache_implementation = "static" |
|
model.depth_decoder.generation_config.cache_implementation = "static" |
|
|
|
# generation kwargs |
|
gen_kwargs = { |
|
"do_sample": False, |
|
"depth_decoder_do_sample": False, |
|
"temperature": 1.0, |
|
"depth_decoder_temperature": 1.0, |
|
} |
|
|
|
# Define a timing context manager using CUDA events
|
class TimerContext: |
|
def __init__(self, name="Execution"): |
|
self.name = name |
|
self.start_event = None |
|
self.end_event = None |
|
|
|
def __enter__(self): |
|
# Use CUDA events for more accurate GPU timing |
|
self.start_event = torch.cuda.Event(enable_timing=True) |
|
self.end_event = torch.cuda.Event(enable_timing=True) |
|
self.start_event.record() |
|
return self |
|
|
|
def __exit__(self, *args): |
|
self.end_event.record() |
|
torch.cuda.synchronize() |
|
elapsed_time = self.start_event.elapsed_time(self.end_event) / 1000.0 |
|
print(f"{self.name} time: {elapsed_time:.4f} seconds") |
|
|
|
# prepare the inputs |
|
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train") |
|
|
|
conversation = [ |
|
{ |
|
"role": f"{ds[0]['speaker_id']}", |
|
"content": [ |
|
{"type": "text", "text": ds[0]["text"]}, |
|
{"type": "audio", "path": ds[0]["audio"]["array"]}, |
|
], |
|
}, |
|
{ |
|
"role": f"{ds[1]['speaker_id']}", |
|
"content": [ |
|
{"type": "text", "text": ds[1]["text"]}, |
|
{"type": "audio", "path": ds[1]["audio"]["array"]}, |
|
], |
|
}, |
|
{ |
|
"role": f"{ds[2]['speaker_id']}", |
|
"content": [ |
|
{"type": "text", "text": ds[2]["text"]}, |
|
], |
|
}, |
|
] |
|
|
|
padded_inputs_1 = processor.apply_chat_template( |
|
conversation, |
|
tokenize=True, |
|
return_dict=True, |
|
).to(device) |
|
|
|
print("\n" + "="*50) |
|
print("First generation - compiling and recording CUDA graphs...") |
|
with TimerContext("First generation"): |
|
_ = model.generate(**padded_inputs_1, **gen_kwargs) |
|
print("="*50) |
|
|
|
print("\n" + "="*50) |
|
print("Second generation - fast !!!") |
|
with TimerContext("Second generation"): |
|
_ = model.generate(**padded_inputs_1, **gen_kwargs) |
|
print("="*50) |
|
|
|
# now with different inputs |
|
conversation = [ |
|
{ |
|
"role": f"{ds[0]['speaker_id']}", |
|
"content": [ |
|
{"type": "text", "text": ds[2]["text"]}, |
|
{"type": "audio", "path": ds[2]["audio"]["array"]}, |
|
], |
|
}, |
|
{ |
|
"role": f"{ds[1]['speaker_id']}", |
|
"content": [ |
|
{"type": "text", "text": ds[3]["text"]}, |
|
{"type": "audio", "path": ds[3]["audio"]["array"]}, |
|
], |
|
}, |
|
{ |
|
"role": f"{ds[2]['speaker_id']}", |
|
"content": [ |
|
{"type": "text", "text": ds[4]["text"]}, |
|
], |
|
}, |
|
] |
|
padded_inputs_2 = processor.apply_chat_template( |
|
conversation, |
|
tokenize=True, |
|
return_dict=True, |
|
).to(device) |
|
|
|
print("\n" + "="*50) |
|
print("Generation with other inputs!") |
|
with TimerContext("Generation with different inputs"): |
|
_ = model.generate(**padded_inputs_2, **gen_kwargs) |
|
print("="*50) |
|
``` |
|
|
|
</details> |
|
|
|
### Fine-tuning & training 📉 |
|
|
|
CSM can be fine-tuned using [Transformers' Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer). |
|
|
|
<details> |
|
|
|
<summary> code snippet </summary> |
|
|
|
```python |
|
from datasets import load_dataset, Audio |
|
from transformers import ( |
|
CsmForConditionalGeneration, |
|
TrainingArguments, |
|
CsmProcessor, |
|
Trainer |
|
) |
|
|
|
processor = CsmProcessor.from_pretrained("sesame/csm-1b") |
|
model = CsmForConditionalGeneration.from_pretrained("sesame/csm-1b") |
|
model.train() |
|
model.codec_model.eval()  # the codec model is frozen; it is only used to encode audio, so keep it in eval mode
|
|
|
ds = load_dataset("eustlb/dailytalk-conversations-grouped", split="train") |
|
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate)) |
|
|
|
def data_collator(samples): |
|
conversations = [] |
|
|
|
for sample in samples: |
|
concatenated_audio_array = sample["audio"]["array"] |
|
        audio_segments = [concatenated_audio_array[s:e] for s, e in sample["audio_cut_idxs"]]

        conversation = []
        for speaker_id, text, audio in zip(sample["speaker_ids"], sample["texts"], audio_segments):
|
conversation.append({ |
|
"role": f"{speaker_id}", |
|
"content": [ |
|
{"type": "text", "text": text}, |
|
{"type": "audio", "audio": audio} |
|
] |
|
}) |
|
|
|
conversations.append(conversation) |
|
|
|
inputs = processor.apply_chat_template( |
|
conversations, |
|
tokenize=True, |
|
return_dict=True, |
|
output_labels=True, |
|
) |
|
return inputs |
|
|
|
training_args = TrainingArguments( |
|
"test-trainer", |
|
remove_unused_columns=False, |
|
gradient_checkpointing=True, |
|
) |
|
|
|
trainer = Trainer( |
|
model, |
|
training_args, |
|
train_dataset=ds, |
|
data_collator=data_collator, |
|
) |
|
|
|
trainer.train() |
|
``` |
|
|
|
</details> |
|
|
|
--- |
|
|
|
## FAQ |
|
|
|
**Does this model come with any voices?** |
|
|
|
The model open-sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.
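
To nudge the model toward a particular voice, you can reuse the context mechanism shown in the usage examples: pass a recording of the target voice together with its transcript as a previous turn, then prompt new text under the same speaker id. Below is a minimal sketch that uses the same dummy dataset as above as a stand-in for a real reference recording; the base model does not guarantee strong voice similarity.

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# a 24kHz recording of the target voice and its transcript, here taken from the dummy dataset
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

conversation = [
    # voice prompt: transcript + audio under speaker id 0
    {
        "role": "0",
        "content": [
            {"type": "text", "text": ds[0]["text"]},
            {"type": "audio", "path": ds[0]["audio"]["array"]},
        ],
    },
    # new text to be spoken under the same speaker id
    {"role": "0", "content": [{"type": "text", "text": "Hello from Sesame."}]},
]

inputs = processor.apply_chat_template(conversation, tokenize=True, return_dict=True).to(device)
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "voice_prompted.wav")
```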
|
|
|
**Can I converse with the model?** |
|
|
|
CSM is trained to be an audio generation model and not a general-purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.
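
One way to build a conversational setup is to pair CSM with a text LLM: the LLM writes the reply and CSM speaks it. The sketch below only illustrates the idea; the text model (`HuggingFaceTB/SmolLM2-1.7B-Instruct`) is an arbitrary illustrative choice, and any instruction-tuned LLM would work.

```python
import torch
from transformers import pipeline, CsmForConditionalGeneration, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. produce the reply text with a separate LLM (illustrative model choice)
chat = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct", device=device)
messages = [{"role": "user", "content": "Greet the listener in one short sentence."}]
reply = chat(messages, max_new_tokens=50)[0]["generated_text"][-1]["content"]

# 2. speak the reply with CSM, using speaker id 0
model_id = "sesame/csm-1b"
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

inputs = processor(f"[0]{reply}", add_special_tokens=True).to(device)
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "llm_reply.wav")
```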
|
|
|
**Does it support other languages?** |
|
|
|
The model has some capacity for non-English languages due to data contamination in the training data, but it likely won't do well. |
|
|
|
## Misuse and abuse ⚠️ |
|
|
|
This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we **explicitly prohibit** the following: |
|
|
|
- **Impersonation or Fraud**: Do not use this model to generate speech that mimics real individuals without their explicit consent. |
|
- **Misinformation or Deception**: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls. |
|
- **Illegal or Harmful Activities**: Do not use this model for any illegal, harmful, or malicious purposes. |
|
|
|
By using this model, you agree to comply with all applicable laws and ethical guidelines. We are **not responsible** for any misuse, and we strongly condemn unethical applications of this technology. |
|
|
|
**Authors** |
|
Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team. |