E2-F5-TTS

Sleeping

Sync from GitHub repo

1a19e0f verified about 2 months ago

708 Bytes

	## Backbones quick introduction


	### unett.py
	- flat unet transformer
	- structure same as in e2-tts & voicebox paper except using rotary pos emb
	- possible abs pos emb & convnextv2 blocks for embedded text before concat

	### dit.py
	- adaln-zero dit
	- embedded timestep as condition
	- concatted noised_input + masked_cond + embedded_text, linear proj in
	- possible abs pos emb & convnextv2 blocks for embedded text before concat
	- possible long skip connection (first layer to last layer)

	### mmdit.py
	- stable diffusion 3 block structure
	- timestep as condition
	- left stream: text embedded and applied a abs pos emb
	- right stream: masked_cond & noised_input concatted and with same conv pos emb as unett