# AudSemThinker
Corresponding paper: https://arxiv.org/abs/2505.14142
## Model Description

AudSemThinker is a novel audio-language model that grounds its reasoning process in a structured framework of auditory semantics, inspired by human cognition. It processes audio by explicitly analyzing functional components such as sound-generating agents (who), physical sound sources (what), generation mechanisms (how), and contextual cues (when/where).

The model is built upon the Qwen2.5-Omni-7B multimodal foundation model and is fine-tuned on the novel AudSem dataset using Supervised Fine-Tuning (SFT). AudSemThinker is designed to produce comprehensive responses in a three-phase structure: a detailed `<think>` reasoning process, a listing of `<semantic_elements>`, and a concise `<answer>`.
## How to Use

To use AudSemThinker for audio understanding and captioning tasks, load it with the `transformers` library. Ensure you have `torch`, `torchaudio`, and `soundfile` installed.
```python
import soundfile as sf
import torchaudio
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# Default: load the model on the available device(s)
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "gijs/audsemthinker",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
#     "gijs/audsemthinker",
#     torch_dtype="auto",
#     device_map="auto",
#     attn_implementation="flash_attention_2",
#     trust_remote_code=True
# )

processor = Qwen2_5OmniProcessor.from_pretrained("gijs/audsemthinker", trust_remote_code=True)

# Load and preprocess audio
audio_file = "path/to/your/audio.wav"
audio_input, sampling_rate = torchaudio.load(audio_file)
if sampling_rate != processor.feature_extractor.sampling_rate:
    audio_input = torchaudio.transforms.Resample(
        orig_freq=sampling_rate,
        new_freq=processor.feature_extractor.sampling_rate
    )(audio_input)
audio_input = audio_input.squeeze().numpy()

# Conversation format
conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": audio_input},
            {"type": "text", "text": "You are given an audio clip. Your task is to describe the audio in detail. First, think about the audio clip and put your thoughts in <think> and </think> tags. Then reason about the semantic elements involved in the audio clip and put your reasoning in <semantic_elements> and </semantic_elements> tags. Then describe the audio clip, put your answer in <answer> and </answer> tags."}
        ],
    },
]

# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation)
inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True
)
inputs = inputs.to(model.device).to(model.dtype)

# Inference: generation of the output
output_ids = model.generate(**inputs, max_new_tokens=512)
response = processor.batch_decode(output_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(response[0])

# Expected output format:
# <think>...detailed reasoning about the audio scene...</think>
# <semantic_elements>...list of identified semantic descriptors (e.g., Who, What, How, When, Where)...</semantic_elements>
# <answer>...concise audio caption...</answer>
```
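The three phases can be pulled out of the decoded text with simple string processing. Below is a minimal sketch; the helper name and regular expressions are illustrative and not part of the released code. It takes the last match for each tag because the decoded text may still contain the prompt, which itself mentions the tags.

```python
import re

def parse_audsemthinker_output(text: str) -> dict:
    """Extract the <think>, <semantic_elements>, and <answer> phases from a response."""
    phases = {}
    for tag in ("think", "semantic_elements", "answer"):
        # Take the last match: the decoded text may still include the prompt,
        # which mentions the same tags when describing the task.
        matches = re.findall(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        phases[tag] = matches[-1].strip() if matches else None
    return phases

parsed = parse_audsemthinker_output(response[0])
print(parsed["answer"])
```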
## Training Data

AudSemThinker is fine-tuned on the full AudSem dataset, a novel, high-quality audio-language dataset comprising approximately 797k examples.
AudSem Dataset Characteristics:
- Source: Synthetically curated from YouTube closed captions, designed to minimize overlap with existing datasets like AudioSet and WavCaps.
- Generation Pipeline: Utilizes a robust multi-stage pipeline that integrates audio, video, and YouTube closed caption data. It employs an ensemble of 9 specialized AI models for comprehensive multimodal analysis (Qwen2Audio-7B, BEATs, AST, CoNeTTE, LP-MusicCaps, BLIP, CLIP, RT-DETR, Places365, LLaVA-Video-7B).
- Quality Control: The pipeline includes rigorous filtering steps, for example requiring a cosine similarity greater than 0.5 between each generated audio caption and the original YouTube closed caption, to keep only high-quality, relevant examples (see the sketch after this list).
- Diversity: Contains a diverse range of task types, including open-ended audio captioning, multiple-choice question answering, open-ended question answering, and creative writing based on audio.
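To illustrate the caption-similarity filter, the sketch below keeps a sample only if the embedding cosine similarity between the generated caption and the original closed caption exceeds 0.5. The embedding model (`all-MiniLM-L6-v2` via `sentence-transformers`) and the helper function are assumptions for illustration only; the actual AudSem pipeline is described in the paper.

```python
# Illustrative sketch of the AudSem similarity filter; the embedding model and
# helper below are assumptions, not the released pipeline.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def passes_similarity_filter(generated_caption: str, closed_caption: str,
                             threshold: float = 0.5) -> bool:
    """Keep a sample only if the two captions are semantically close enough."""
    embeddings = encoder.encode([generated_caption, closed_caption], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item() > threshold

print(passes_similarity_filter(
    "A dog barks repeatedly while traffic passes in the background.",
    "dog barking near a busy street",
))
```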
## Training Procedure
- Base Model: Qwen2.5-Omni-7B.
- Fine-tuning Paradigm: Supervised Fine-Tuning (SFT).
- Parameter-Efficient Fine-tuning: LoRA (Low-Rank Adaptation) applied to projection layers (a configuration sketch follows this list).
- Optimizer: AdamW.
- Learning Rate: 2e-04.
- Epochs: 1.
- Precision: bf16.
- Batch Size: 4.
- Hardware: Trained on a single H100 GPU.
- Training Time: Approximately 12 hours for the full dataset.
- Output Format: Trained to generate structured XML-like output with `<think>`, `<semantic_elements>`, and `<answer>` tags. The loss is computed only on the model completion part (the assistant's response).
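For reference, the sketch below shows a parameter-efficient SFT setup consistent with the hyperparameters listed above. The LoRA rank, alpha, dropout, and exact `target_modules` are assumptions for illustration and are not taken from the released training code; the small helper at the end illustrates the completion-only loss masking described in the last bullet.

```python
# Sketch of an SFT + LoRA setup consistent with the hyperparameters above.
# Rank, alpha, dropout, and target_modules are illustrative assumptions.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                      # assumed rank (not stated above)
    lora_alpha=32,             # assumed scaling (not stated above)
    lora_dropout=0.05,         # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # projection layers
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="audsemthinker-sft",
    per_device_train_batch_size=4,   # batch size 4
    learning_rate=2e-4,              # learning rate 2e-04
    num_train_epochs=1,              # 1 epoch
    bf16=True,                       # bf16 precision
    optim="adamw_torch",             # AdamW
)

def mask_prompt_labels(input_ids, prompt_length):
    """Completion-only loss: set prompt-token labels to -100 so they are ignored."""
    labels = list(input_ids)
    labels[:prompt_length] = [-100] * prompt_length
    return labels
```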
## Evaluation Results

AudSemThinker demonstrates state-of-the-art performance across multiple benchmarks, highlighting its strength in semantic audio reasoning, with particularly strong results on music-related tasks. Full benchmark results are reported in the corresponding paper.
## Limitations and Bias

- Data Contamination: While AudSem is designed to minimize overlap with existing benchmarks, the underlying Qwen2.5-Omni pretrained model might have encountered data from those test sets during its initial pretraining.
- Generalization: Supervised fine-tuning on AudSem yields strong general performance, but it might not always outperform models trained specifically for niche benchmarks.
## Ethical Considerations

- Data Sourcing: The AudSem dataset is primarily sourced from YouTube closed captions. Although systematic checks for harmful content (e.g., child abuse, hate speech, sexual content, harassment) were performed and YouTube's community guidelines provide a safeguard, inherent biases or problematic content from the original video sources could still be present.
- Societal Impact: AudSemThinker can contribute to positive societal impacts by enhancing audio-language understanding. Potential applications include improved audio transcription and captioning for individuals who are deaf or hard of hearing, sophisticated monitoring of environmental sounds (e.g., avian populations), and automated closed-caption generation for multimedia content.
## Citation

```bibtex
@misc{wijngaard2025audsemthinkerenhancingaudiolanguagemodels,
      title={AudSemThinker: Enhancing Audio-Language Models through Reasoning over Semantics of Sound},
      author={Gijs Wijngaard and Elia Formisano and Michele Esposito and Michel Dumontier},
      year={2025},
      eprint={2505.14142},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2505.14142},
}
```