Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning
Abstract
The Video-SKoT framework improves domain-adaptive video reasoning by constructing skill-aware Chain-of-Thought supervision and training specialized expert modules.
Recent advances in Chain-of-Thought (CoT) reasoning have improved complex video understanding, but existing methods often struggle to adapt to domain-specific skills (e.g., event detection, spatial relation understanding, emotion understanding) across diverse video content. To address this, we propose Video-Skill-CoT (a.k.a. Video-SKoT), a framework that automatically constructs and leverages skill-aware CoT supervision for domain-adaptive video reasoning. First, we construct skill-based CoT annotations: we extract domain-relevant reasoning skills from training questions, cluster them into a shared skill taxonomy, and create detailed multi-step CoT rationales tailored to each video-question pair for training. Second, we introduce a skill-specific expert learning framework. Each expert module specializes in a subset of reasoning skills and is trained with lightweight adapters using the collected CoT supervision. We demonstrate the effectiveness of the proposed approach on three video understanding benchmarks, where Video-SKoT consistently outperforms strong baselines. We also provide in-depth analyses comparing different CoT annotation pipelines and the skills learned across multiple video domains.
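The skill-taxonomy step described above (extracting skill phrases from training questions and grouping them into shared skill categories) can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: it uses a toy bag-of-words similarity and hypothetical seed categories in place of the paper's learned clustering.

```python
# Minimal sketch (assumption: not the paper's implementation): group
# question-derived skill phrases under seed skill categories by
# bag-of-words cosine similarity.
from collections import Counter
import math

def vectorize(text):
    """Toy bag-of-words vector for a skill phrase."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_skills(skill_phrases, seed_skills):
    """Assign each extracted skill phrase to its nearest seed category,
    yielding a simple skill taxonomy (category -> member phrases)."""
    seed_vecs = {s: vectorize(s) for s in seed_skills}
    taxonomy = {s: [] for s in seed_skills}
    for phrase in skill_phrases:
        v = vectorize(phrase)
        best = max(seed_skills, key=lambda s: cosine(v, seed_vecs[s]))
        taxonomy[best].append(phrase)
    return taxonomy

# Hypothetical skill phrases extracted from training questions:
skills = [
    "detect the key event in the clip",
    "understand spatial relation between objects",
    "recognize the speaker's emotion",
    "event boundary detection",
]
seeds = ["event detection", "spatial relation understanding", "emotion understanding"]
print(cluster_skills(skills, seeds))
```

In the full framework, each resulting skill cluster would then be paired with its own lightweight expert adapter trained on the CoT rationales for that skill subset.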
Community
The following similar papers were recommended by the Semantic Scholar API:
- Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning (2025)
- SiLVR: A Simple Language-based Video Reasoning Framework (2025)
- VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning (2025)
- ReFineVLA: Reasoning-Aware Teacher-Guided Transfer Fine-Tuning (2025)
- ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation (2025)
- MINERVA: Evaluating Complex Video Reasoning (2025)
- ThinkVideo: High-Quality Reasoning Video Segmentation with Chain of Thoughts (2025)