Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
Abstract
Long-context video understanding in multimodal large language models (MLLMs) faces a critical challenge: balancing computational efficiency with the retention of fine-grained spatio-temporal patterns. Existing approaches (e.g., sparse sampling, dense sampling with low resolution, and token compression) suffer from significant information loss in temporal dynamics, spatial details, or subtle interactions, particularly in videos with complex motion or varying resolutions. To address this, we propose Mavors, a novel framework that introduces Multi-granularity video representation for holistic long-video modeling. Specifically, Mavors directly encodes raw video content into latent representations through two core components: 1) an Intra-chunk Vision Encoder (IVE) that preserves high-resolution spatial features via 3D convolutions and Vision Transformers, and 2) an Inter-chunk Feature Aggregator (IFA) that establishes temporal coherence across chunks using transformer-based dependency modeling with chunk-level rotary position encodings. Moreover, the framework unifies image and video understanding by treating images as single-frame videos via sub-image decomposition. Experiments across diverse benchmarks demonstrate Mavors' superiority in maintaining both spatial fidelity and temporal continuity, significantly outperforming existing methods in tasks requiring fine-grained spatio-temporal reasoning.
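For readers who want a concrete picture of the two components the abstract names, below is a minimal PyTorch sketch of the described pipeline: an intra-chunk encoder (a 3D convolution followed by a small Transformer standing in for the Vision Transformer) and an inter-chunk aggregator over the concatenated chunk tokens. Every module size and name here is an illustrative assumption, and a learned chunk-index embedding is substituted for the paper's chunk-level rotary position encoding purely to keep the sketch short; this is not the authors' implementation.

```python
# Minimal sketch of the Mavors pipeline from the abstract. All sizes, names,
# and the chunk-level position scheme are illustrative assumptions.
import torch
import torch.nn as nn


class IntraChunkVisionEncoder(nn.Module):
    """IVE sketch: 3D convolution over a chunk's frames, then a small
    Transformer encoder standing in for the Vision Transformer."""

    def __init__(self, dim=256, patch=16, frames_per_chunk=4):
        super().__init__()
        # The 3D conv tokenizes (frames, H, W) into spatio-temporal patches.
        self.conv3d = nn.Conv3d(
            3, dim,
            kernel_size=(frames_per_chunk, patch, patch),
            stride=(frames_per_chunk, patch, patch),
        )
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.vit = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, chunk):                        # chunk: (B, 3, T, H, W)
        tokens = self.conv3d(chunk)                  # (B, dim, T', H', W')
        tokens = tokens.flatten(2).transpose(1, 2)   # (B, N, dim)
        return self.vit(tokens)


class InterChunkFeatureAggregator(nn.Module):
    """IFA sketch: a Transformer over concatenated chunk tokens. A learned
    chunk-index embedding stands in for the paper's chunk-level rotary
    position encoding."""

    def __init__(self, dim=256, max_chunks=64):
        super().__init__()
        self.chunk_pos = nn.Embedding(max_chunks, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.agg = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, chunk_feats):   # list of (B, N, dim), one per chunk
        seq = []
        for idx, feats in enumerate(chunk_feats):
            pos = self.chunk_pos(torch.tensor(idx))
            seq.append(feats + pos)   # tag every token with its chunk index
        return self.agg(torch.cat(seq, dim=1))  # (B, num_chunks * N, dim)


if __name__ == "__main__":
    ive = IntraChunkVisionEncoder()
    ifa = InterChunkFeatureAggregator()
    # A toy 12-frame video split into 3 chunks of 4 frames at 64x64.
    video = torch.randn(1, 3, 12, 64, 64)
    chunks = video.split(4, dim=2)
    latents = ifa([ive(c) for c in chunks])
    print(latents.shape)  # torch.Size([1, 48, 256]): tokens passed to the LLM
```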
Community
The following related papers were recommended by the Semantic Scholar API:
- Breaking the Encoder Barrier for Seamless Video-Language Understanding (2025)
- Token-Efficient Long Video Understanding for Multimodal LLMs (2025)
- Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers (2025)
- LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding (2025)
- Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding (2025)
- Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation (2025)
- DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description (2025)