	---
	tags:
	- model_hub_mixin
	- pytorch_model_hub_mixin
	license: apache-2.0
	language:
	- en
	metrics:
	- accuracy
	base_model:
	- microsoft/wavlm-large
	pipeline_tag: audio-classification
	---
	# WavLM-Large for Speech Flow (Fluency) Classification

	# Model Description
	This model implements the speech fluency classifier described in Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits (https://arxiv.org/pdf/2505.14648).

	The model first classifies each 3-second window of speech, taken with a 1-second step, as one of:
	```
	["fluent", "disfluent"]
	```
	If disfluent speech is detected, the model then predicts the disfluency types from:
	```
	[
	"Block",
	"Prolongation",
	"Sound Repetition",
	"Word Repetition",
	"Interjection"
	]
	```

	# How to use this model

	## Download repo
	```bash
	git clone git@github.com:tiantiaf0627/vox-profile-release.git
	```
	## Install the package
	```bash
	conda create -n vox_profile python=3.8
	conda activate vox_profile
	cd vox-profile-release
	pip install -e .
	```

	## Load the model
	```python
	# Load libraries
	import torch
	import torch.nn.functional as F
	from src.model.fluency.wavlm_fluency import WavLMWrapper

	# Find device
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	# Load model from Huggingface
	model = WavLMWrapper.from_pretrained("tiantiaf/wavlm-large-speech-flow").to(device)
	model.eval()
	```

	## Split the audio into 3s window chunks
	```python
	# Training clips are 3 s long, so inference stacks 3 s windows taken with a 1 s step
	audio_data = torch.zeros([1, 16000*10]).float().to(device)  # placeholder: 10 s of 16 kHz audio
	# Number of windows; audio shorter than 3 s still yields one segment
	audio_segment = max((audio_data.shape[1] - 3*16000) // 16000 + 1, 1)
	input_audio = list()
	for idx in range(audio_segment):
	    input_audio.append(audio_data[0, 16000*idx:16000*idx+3*16000])
	input_audio = torch.stack(input_audio, dim=0)
	```
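
The window arithmetic above can be checked without loading the model. The following is a minimal sketch in plain Python, assuming 16 kHz audio, a 3-second window, and a 1-second step as described in the Model Description. `window_bounds` is a hypothetical helper for illustration and is not part of the vox-profile package.

```python
def window_bounds(num_samples, sample_rate=16000, window_sec=3, step_sec=1):
    """Return (start, end) sample offsets for each sliding window.

    Hypothetical helper; mirrors the segment-count formula used above.
    """
    window = window_sec * sample_rate
    step = step_sec * sample_rate
    # Audio shorter than one window still yields a single (shorter) segment
    num_windows = max((num_samples - window) // step + 1, 1)
    return [
        (idx * step, min(idx * step + window, num_samples))
        for idx in range(num_windows)
    ]


bounds = window_bounds(16000 * 10)  # 10 seconds of audio
print(len(bounds))   # 8
print(bounds[0])     # (0, 48000)
print(bounds[-1])    # (112000, 160000)
```

A 10-second clip therefore produces 8 overlapping windows, which is the batch size of `input_audio` after stacking.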
	## Prediction
	```python
	fluency_outputs, disfluency_type_outputs = model(input_audio)
	fluency_prob = F.softmax(fluency_outputs, dim=1).detach().cpu().numpy().astype(float).tolist()

	disfluency_type_prob = torch.sigmoid(disfluency_type_outputs)
	# 0.7 is the decision threshold; raise it in practice for more conservative predictions
	disfluency_type_predictions = (disfluency_type_prob > 0.7).int().detach().cpu().numpy().tolist()
	disfluency_type_prob = disfluency_type_prob.detach().cpu().numpy().astype(float).tolist()
	```
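
The two outputs are decoded differently in the snippet above: the fluency logits go through a softmax over `["fluent", "disfluent"]`, while each disfluency type gets its own sigmoid and threshold, so more than one type can be flagged for the same window. A minimal sketch in plain Python follows. The logits are made up purely for illustration, and `decode_window` is a hypothetical helper, not part of the package.

```python
import math


def decode_window(fluency_logits, type_logits, threshold=0.7):
    """Decode one window: softmax over [fluent, disfluent], sigmoid per type."""
    # Softmax, stabilized by subtracting the largest logit
    peak = max(fluency_logits)
    exps = [math.exp(x - peak) for x in fluency_logits]
    total = sum(exps)
    fluency_prob = [e / total for e in exps]
    # Independent sigmoid per disfluency type, then threshold
    type_prob = [1.0 / (1.0 + math.exp(-x)) for x in type_logits]
    type_pred = [int(p > threshold) for p in type_prob]
    return fluency_prob, type_pred


# Made-up logits for a single window, for illustration only
fluency_prob, type_pred = decode_window([0.0, 2.0], [3.0, -1.0, 0.5, -2.0, 1.0])
print(type_pred)  # [1, 0, 0, 0, 1]
```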

	## Gather the predictions for the utterance
	```python
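	# Label order as listed in the Model Description (assumed to match the output indices)
	disfluency_type_labels = ["Block", "Prolongation", "Sound Repetition", "Word Repetition", "Interjection"]

	utterance_fluency_list = list()
	utterance_disfluency_list = list()
	for audio_idx in range(audio_segment):
	    disfluency_type = list()
	    if fluency_prob[audio_idx][0] > 0.5:
	        utterance_fluency_list.append("fluent")
	    else:
	        # If the window is disfluent, collect the predicted disfluency types
	        utterance_fluency_list.append("disfluent")
	        predictions = disfluency_type_predictions[audio_idx]
	        for label_idx in range(len(predictions)):
	            if predictions[label_idx] == 1:
	                disfluency_type.append(disfluency_type_labels[label_idx])
	    # One entry per window (empty for fluent windows) keeps both lists aligned
	    utterance_disfluency_list.append(disfluency_type)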

	# Print the per-window fluency labels and disfluency types
	print(utterance_fluency_list)
	print(utterance_disfluency_list)
	```
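
Because window `i` covers seconds `i` to `i + 3`, the per-window labels can be mapped back to time. The following is a minimal sketch, assuming the 3-second window and 1-second step described in the Model Description. `disfluent_spans` is a hypothetical helper and not part of the package, and the label list is an invented example rather than model output.

```python
def disfluent_spans(fluency_labels, window_sec=3, step_sec=1):
    """Merge overlapping disfluent windows into (start_sec, end_sec) spans."""
    spans = []
    for idx, label in enumerate(fluency_labels):
        if label != "disfluent":
            continue
        start = idx * step_sec
        end = start + window_sec
        if spans and start <= spans[-1][1]:
            # Overlaps or touches the previous span: extend it
            spans[-1] = (spans[-1][0], end)
        else:
            spans.append((start, end))
    return spans


# Invented per-window labels for a 10-second clip (8 windows)
labels = ["fluent", "disfluent", "disfluent", "fluent",
          "fluent", "fluent", "fluent", "disfluent"]
print(disfluent_spans(labels))  # [(1, 5), (7, 10)]
```

The spans are coarse, since neighbouring windows overlap by two seconds and a single disfluent event may be flagged in several of them.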

	## If you have any questions, please contact: Tiantian Feng ([email protected])

	## Kindly cite our paper if you are using our model or find it useful in your work
	```
	@article{feng2025vox,
	title={Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits},
	author={Feng, Tiantian and Lee, Jihwan and Xu, Anfeng and Lee, Yoonjeong and Lertpetchpun, Thanathai and Shi, Xuan and Wang, Helin and Thebaud, Thomas and Moro-Velazquez, Laureano and Byrd, Dani and others},
	journal={arXiv preprint arXiv:2505.14648},
	year={2025}
	}
	```