# SkyCaptioner-V1: A Structural Video Captioning Model
📑 Technical Report · 👋 Playground · 💬 Discord · 🤗 Hugging Face · 🤖 ModelScope
---

Welcome to the SkyCaptioner-V1 repository! Here you'll find the structural video captioning model weights and inference code for our video captioner, which labels video data efficiently and comprehensively.

## 🔥🔥🔥 News!!

* Apr 21, 2025: 🔥 We release the [vllm](https://github.com/vllm-project/vllm) batch inference code for the SkyCaptioner-V1 model and the caption fusion inference code (see the usage sketch at the end of this README).
* Apr 21, 2025: 🔥 We release the first shot-aware video captioning model, [SkyCaptioner-V1](https://huggingface.co/Skywork/SkyCaptioner-V1). For more details, please check our [paper](https://arxiv.org/pdf/2504.13074).

## 📑 TODO List

- SkyCaptioner-V1
  - [x] Checkpoints
  - [x] Batch Inference Code
  - [x] Caption Fusion Method
  - [ ] Web Demo (Gradio)

## 📖 Overview

SkyCaptioner-V1 is a structural video captioning model designed to generate high-quality, structural descriptions for video data. It integrates specialized sub-expert models and multimodal large language models (MLLMs) with human annotations to address the limitations of general captioners in capturing professional, film-related details. Key aspects include:

1. **Structural Representation**: Combines general video descriptions (from MLLMs) with sub-expert captioner outputs (e.g., shot types, shot angles, shot positions, camera motions) and human annotations.
2. **Knowledge Distillation**: Distills the expertise of the sub-expert captioners into a single unified model.
3. **Application Flexibility**: Generates dense captions for text-to-video (T2V) tasks and concise prompts for image-to-video (I2V) tasks.

## 🔑 Key Features

### Structural Captioning Framework

Our video captioning model captures multi-dimensional details:

* **Subjects**: Appearance, action, expression, position, and hierarchical categorization.
* **Shot Metadata**: Shot type (e.g., close-up, long shot), shot angle, shot position, camera motion, environment, lighting, etc.

### Sub-Expert Integration

* **Shot Captioner**: Classifies shot type, angle, and position with high precision.
* **Expression Captioner**: Analyzes facial expressions, emotion intensity, and temporal dynamics.
* **Camera Motion Captioner**: Tracks 6DoF camera movements and composite motion types.

### Training Pipeline

* Trained on ~2M high-quality, concept-balanced videos curated from 10M raw samples.
* Fine-tuned from Qwen2.5-VL-7B-Instruct with a global batch size of 512 across 32 A800 GPUs.
* Optimized with AdamW (learning rate: 1e-5) for 2 epochs.

### Dynamic Caption Fusion

* Adapts output length to the application (T2V/I2V).
* Employs an LLM to fuse the structural fields into a natural, fluent caption for downstream tasks (see the sketch below).
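To make the structural representation and the fusion step concrete, here is a minimal sketch. The field names and the fusion prompt are illustrative assumptions based on the framework described above, not the repository's exact schema; see the released caption fusion code for the actual field layout and prompt.

```python
# Illustrative sketch of a structural caption and LLM-based fusion.
# NOTE: the field names and the fusion prompt are assumptions for
# illustration; the repository's fusion code defines the real schema.

structural_caption = {
    "subjects": [
        {
            "type": "human",  # hierarchical categorization
            "appearance": "a woman in a red coat",
            "action": "walks toward the camera",
            "expression": "slight smile, low intensity",
            "position": "center of the frame",
            "is_main_subject": True,
        }
    ],
    "shot_type": "medium shot",
    "shot_angle": "eye level",
    "shot_position": "front",
    "camera_motion": "slow dolly-in",
    "environment": "rainy city street at dusk",
    "lighting": "soft, diffused streetlight",
}

def build_fusion_prompt(caption: dict, task: str = "t2v") -> str:
    """Ask an LLM to fuse structural fields into one fluent caption:
    dense for T2V, concise for I2V."""
    style = (
        "a dense, detailed caption" if task == "t2v"
        else "a short, concise prompt"
    )
    fields = "\n".join(f"- {k}: {v}" for k, v in caption.items())
    return (
        f"Fuse the following structural video annotations into {style}, "
        f"written as one fluent paragraph:\n{fields}"
    )

print(build_fusion_prompt(structural_caption, task="i2v"))
```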
## 📊 Benchmark Results

SkyCaptioner-V1 demonstrates significant improvements over existing models on key film-specific captioning tasks, particularly in **shot-language understanding** and **domain-specific precision**. The differences stem from its structural architecture and expert-guided training:

1. **Superior shot-language understanding**:
   * Our captioner outperforms Qwen2.5-VL-72B by +11.2% in shot type, +16.1% in shot angle, and +50.4% in shot position accuracy, because its specialized shot classifiers are fine-tuned on film-domain data that generalist MLLMs lack.
   * +40.0% accuracy in camera motion vs. Tarsier2-recap-7B (85.3% vs. 45.3%): the 6DoF motion analysis and active-learning pipeline resolve ambiguities in composite motions (e.g., tracking + panning) that challenge generic captioners.
2. **High domain-specific precision**:
   * **Expression accuracy**: 68.8% vs. 54.3% (Tarsier2-recap-7B), leveraging a temporal-aware S2D framework to capture dynamic facial changes.
| Metric | Qwen2.5-VL-7B-Ins. | Qwen2.5-VL-72B-Ins. | Tarsier2-recap-7B | SkyCaptioner-V1 |
|---|---|---|---|---|
| Avg accuracy | 51.4% | 58.7% | 49.4% | 76.3% |
| shot type | 76.8% | 82.5% | 60.2% | 93.7% |
| shot angle | 60.0% | 73.7% | 52.4% | 89.8% |
| shot position | 28.4% | 32.7% | 23.6% | 83.1% |
| camera motion | 62.0% | 61.2% | 45.3% | 85.3% |
| expression | 43.6% | 51.5% | 54.3% | 68.8% |
| TYPES_type | 43.5% | 49.7% | 47.6% | 82.5% |
| TYPES_sub_type | 38.9% | 44.9% | 45.9% | 75.4% |
| appearance | 40.9% | 52.0% | 45.6% | 59.3% |
| action | 32.4% | 52.0% | 69.8% | 68.8% |
| position | 35.4% | 48.6% | 45.5% | 57.5% |
| is_main_subject | 58.5% | 68.7% | 69.7% | 80.9% |
| environment | 70.4% | 72.7% | 61.4% | 70.5% |
| lighting | 77.1% | 80.0% | 21.2% | 76.5% |
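
## 🛠️ Inference Sketch (vLLM)

The sketch below shows what batch captioning with the released weights might look like under vLLM's multimodal API. It is a hedged approximation, not the repository's exact script: the prompt follows the generic Qwen2.5-VL chat format (SkyCaptioner-V1 is fine-tuned from Qwen2.5-VL-7B-Instruct), and `load_frames` is a hypothetical helper; consult the released batch inference code for the exact prompt template and video preprocessing.

```python
# Hedged sketch of batch video captioning with vLLM; assumes the
# Qwen2.5-VL prompt format and a stand-in frame loader.
import numpy as np
from vllm import LLM, SamplingParams

llm = LLM(model="Skywork/SkyCaptioner-V1", limit_mm_per_prompt={"video": 1})
sampling = SamplingParams(temperature=0.1, max_tokens=512)

PROMPT = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n<|vision_start|><|video_pad|><|vision_end|>"
    "Describe this video in the structural caption format.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

def load_frames(path: str, num_frames: int = 16) -> np.ndarray:
    """Hypothetical helper: decode `num_frames` RGB frames from `path`.
    Replace with real decoding (e.g., via decord or OpenCV)."""
    return np.zeros((num_frames, 360, 640, 3), dtype=np.uint8)

requests = [
    {"prompt": PROMPT, "multi_modal_data": {"video": load_frames(p)}}
    for p in ["clip_0001.mp4", "clip_0002.mp4"]
]
outputs = llm.generate(requests, sampling)
for out in outputs:
    print(out.outputs[0].text)
```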