# AudSemThinker
Corresponding paper: https://arxiv.org/abs/2505.14142
## Model Description

AudSemThinker is a novel audio-language model that grounds its reasoning process in a structured framework of auditory semantics, inspired by human cognition. It processes audio by explicitly analyzing functional components such as sound-generating agents (who), physical sound sources (what), generation mechanisms (how), and contextual cues (when/where).

The model is built upon the Qwen2.5-Omni-7B multimodal foundation model and is fine-tuned on the novel AudSem dataset using Supervised Fine-Tuning (SFT). AudSemThinker is designed to produce comprehensive responses in a three-phase structure: a detailed `<think>` reasoning process, a listing of `<semantic_elements>`, and a concise `<answer>`.
## How to Use

To use AudSemThinker for audio understanding and captioning tasks, load it with the `transformers` library. Ensure you have `torch`, `torchaudio`, and `soundfile` installed.
```python
import soundfile as sf
import torchaudio
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# Default: load the model on the available device(s)
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "gijs/audsemthinker",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
#     "gijs/audsemthinker",
#     torch_dtype="auto",
#     device_map="auto",
#     attn_implementation="flash_attention_2",
#     trust_remote_code=True
# )

processor = Qwen2_5OmniProcessor.from_pretrained("gijs/audsemthinker", trust_remote_code=True)

# Load and preprocess audio
audio_file = "path/to/your/audio.wav"
audio_input, sampling_rate = torchaudio.load(audio_file)
if sampling_rate != processor.feature_extractor.sampling_rate:
    audio_input = torchaudio.transforms.Resample(
        orig_freq=sampling_rate,
        new_freq=processor.feature_extractor.sampling_rate
    )(audio_input)
audio_input = audio_input.squeeze().numpy()

# Conversation format
conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": audio_input},
            {"type": "text", "text": "You are given an audio clip. Your task is to describe the audio in detail. First, think about the audio clip and put your thoughts in <think> and </think> tags. Then reason about the semantic elements involved in the audio clip and put your reasoning in <semantic_elements> and </semantic_elements> tags. Then describe the audio clip, put your answer in <answer> and </answer> tags."}
        ],
    },
]

# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation)
inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True
)
inputs = inputs.to(model.device).to(model.dtype)

# Inference: generation of the output
output_ids = model.generate(**inputs, max_new_tokens=512)
response = processor.batch_decode(output_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(response[0])

# Expected output format:
# <think>...detailed reasoning about the audio scene...</think>
# <semantic_elements>...list of identified semantic descriptors (e.g., Who, What, How, When, Where)...</semantic_elements>
# <answer>...concise audio caption...</answer>
```
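The three phases can be pulled out of the decoded text with simple string processing. Below is a minimal sketch; the helper name and regular expressions are illustrative and not part of the released code. It takes the last match for each tag because the decoded text may still contain the prompt, which itself mentions the tags.

```python
import re

def parse_audsemthinker_output(text: str) -> dict:
    """Extract the <think>, <semantic_elements>, and <answer> phases from a response."""
    phases = {}
    for tag in ("think", "semantic_elements", "answer"):
        # Take the last match: the decoded text may still include the prompt,
        # which mentions the same tags when describing the task.
        matches = re.findall(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        phases[tag] = matches[-1].strip() if matches else None
    return phases

parsed = parse_audsemthinker_output(response[0])
print(parsed["answer"])
```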
## Training Data

AudSemThinker is fine-tuned on the full AudSem dataset, a novel, high-quality audio-language dataset comprising approximately 797k examples.
AudSem Dataset Characteristics:
- Source: Synthetically curated from YouTube closed captions, designed to minimize overlap with existing datasets like AudioSet and WavCaps.
- Generation Pipeline: Utilizes a robust multi-stage pipeline that integrates audio, video, and YouTube closed caption data. It employs an ensemble of 9 specialized AI models for comprehensive multimodal analysis (Qwen2Audio-7B, BEATs, AST, CoNeTTE, LP-MusicCaps, BLIP, CLIP, RT-DETR, Places365, LLaVA-Video-7B).
- Quality Control: The pipeline includes rigorous filtering steps, for example requiring a cosine similarity greater than 0.5 between each generated audio caption and the original YouTube closed caption, to keep only high-quality, relevant examples (see the sketch after this list).
- Diversity: Contains a diverse range of task types, including open-ended audio captioning, multiple-choice question answering, open-ended question answering, and creative writing based on audio.
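To illustrate the caption-similarity filter, the sketch below keeps a sample only if the embedding cosine similarity between the generated caption and the original closed caption exceeds 0.5. The embedding model (`all-MiniLM-L6-v2` via `sentence-transformers`) and the helper function are assumptions for illustration only; the actual AudSem pipeline is described in the paper.

```python
# Illustrative sketch of the AudSem similarity filter; the embedding model and
# helper below are assumptions, not the released pipeline.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def passes_similarity_filter(generated_caption: str, closed_caption: str,
                             threshold: float = 0.5) -> bool:
    """Keep a sample only if the two captions are semantically close enough."""
    embeddings = encoder.encode([generated_caption, closed_caption], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item() > threshold

print(passes_similarity_filter(
    "A dog barks repeatedly while traffic passes in the background.",
    "dog barking near a busy street",
))
```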
## Training Procedure
- Base Model: Qwen2.5-Omni-7B.
- Fine-tuning Paradigm: Supervised Fine-Tuning (SFT).
- Parameter-Efficient Fine-tuning: LoRA (Low-Rank Adaptation) applied to projection layers (a configuration sketch follows this list).
- Optimizer: AdamW.
- Learning Rate: 2e-04.
- Epochs: 1.
- Precision: bf16.
- Batch Size: 4.
- Hardware: Trained on a single H100 GPU.
- Training Time: Approximately 12 hours for the full dataset.
- Output Format: Trained to generate structured XML-like output with `<think>`, `<semantic_elements>`, and `<answer>` tags. The loss is computed only on the model completion part (the assistant's response).
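For reference, the sketch below shows a parameter-efficient SFT setup consistent with the hyperparameters listed above. The LoRA rank, alpha, dropout, and exact `target_modules` are assumptions for illustration and are not taken from the released training code; the small helper at the end illustrates the completion-only loss masking described in the last bullet.

```python
# Sketch of an SFT + LoRA setup consistent with the hyperparameters above.
# Rank, alpha, dropout, and target_modules are illustrative assumptions.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                      # assumed rank (not stated above)
    lora_alpha=32,             # assumed scaling (not stated above)
    lora_dropout=0.05,         # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # projection layers
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="audsemthinker-sft",
    per_device_train_batch_size=4,   # batch size 4
    learning_rate=2e-4,              # learning rate 2e-04
    num_train_epochs=1,              # 1 epoch
    bf16=True,                       # bf16 precision
    optim="adamw_torch",             # AdamW
)

def mask_prompt_labels(input_ids, prompt_length):
    """Completion-only loss: set prompt-token labels to -100 so they are ignored."""
    labels = list(input_ids)
    labels[:prompt_length] = [-100] * prompt_length
    return labels
```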
## Evaluation Results

AudSemThinker demonstrates state-of-the-art performance across multiple benchmarks, highlighting its strength in semantic audio reasoning, with particularly strong results on music-related tasks. Full benchmark results are reported in the corresponding paper.
## Limitations and Bias

- Data Contamination: While AudSem is designed to minimize overlap with existing benchmarks, the underlying Qwen2.5-Omni pretrained model might have encountered data from those test sets during its initial pretraining.
- Generalization: Supervised fine-tuning on AudSem yields strong general performance, but it might not always outperform models trained specifically for niche benchmarks.
## Ethical Considerations

- Data Sourcing: The AudSem dataset is primarily sourced from YouTube closed captions. Although systematic checks for harmful content (e.g., child abuse, hate speech, sexual content, harassment) were performed and YouTube's community guidelines provide a safeguard, inherent biases or problematic content from the original video sources could still be present.
- Societal Impact: AudSemThinker can contribute to positive societal impacts by enhancing audio-language understanding. Potential applications include improved audio transcription and captioning for individuals who are deaf or hard of hearing, sophisticated monitoring of environmental sounds (e.g., avian populations), and automated closed-caption generation for multimedia content.
## Citation

```bibtex
@misc{wijngaard2025audsemthinkerenhancingaudiolanguagemodels,
      title={AudSemThinker: Enhancing Audio-Language Models through Reasoning over Semantics of Sound},
      author={Gijs Wijngaard and Elia Formisano and Michele Esposito and Michel Dumontier},
      year={2025},
      eprint={2505.14142},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2505.14142},
}
```