SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation
Abstract
SkillFormer, a parameter-efficient architecture, uses the TimeSformer backbone with a CrossViewFusion module to achieve state-of-the-art accuracy in multi-view skill assessment from egocentric and exocentric videos.
Assessing human skill levels in complex activities is a challenging problem with applications in sports, rehabilitation, and training. In this work, we present SkillFormer, a parameter-efficient architecture for unified multi-view proficiency estimation from egocentric and exocentric videos. Building on the TimeSformer backbone, SkillFormer introduces a CrossViewFusion module that fuses view-specific features using multi-head cross-attention, learnable gating, and adaptive self-calibration. We leverage Low-Rank Adaptation to fine-tune only a small subset of parameters, significantly reducing training costs. Evaluated on the EgoExo4D dataset, SkillFormer achieves state-of-the-art accuracy in multi-view settings while demonstrating remarkable computational efficiency, using 4.5x fewer parameters and requiring 3.75x fewer training epochs than prior baselines. It excels across multiple structured tasks, confirming the value of multi-view integration for fine-grained skill assessment.
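The page itself includes no code, so the following is a minimal PyTorch sketch of how a cross-view fusion block of this kind could be structured. The class name mirrors the paper's CrossViewFusion, but the dimensions, layer choices, and the exact gating and self-calibration formulas are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): multi-head cross-attention across
# per-view features, a learnable gate, and a lightweight self-calibration step.
import torch
import torch.nn as nn


class CrossViewFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Cross-attention: each view's feature attends to the features of all views.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Learnable gate deciding how much attended context to mix into each view.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        # Adaptive self-calibration: a learned per-channel rescaling of the fused feature.
        self.calibrate = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (batch, num_views, dim) -- one pooled feature per camera view.
        q = self.norm(view_feats)
        attended, _ = self.attn(q, q, q)                      # cross-view attention
        g = self.gate(torch.cat([view_feats, attended], -1))  # per-view gating weights
        fused_views = view_feats + g * attended               # gated residual fusion
        fused = fused_views.mean(dim=1)                       # aggregate across views
        return fused * (1.0 + self.calibrate(fused))          # self-calibrated output


# Usage: fuse per-view features from a video backbone (e.g., one ego + two exo views).
if __name__ == "__main__":
    feats = torch.randn(4, 3, 768)   # (batch=4, views=3, dim=768)
    fusion = CrossViewFusion(dim=768)
    print(fusion(feats).shape)       # torch.Size([4, 768])
```

In the full model described by the abstract, the TimeSformer backbone would be adapted with Low-Rank Adaptation so that only the low-rank adapters and a fusion block like the one sketched above are trained, which is what keeps the trainable parameter count small.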
Community
We present SkillFormer, a parameter-efficient architecture for unified multi-view proficiency estimation from egocentric and exocentric videos.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- PAVE: Patching and Adapting Video Large Language Models (2025)
- Body-Hand Modality Expertized Networks with Cross-attention for Fine-grained Skeleton Action Recognition (2025)
- Order Matters: On Parameter-Efficient Image-to-Video Probing for Recognizing Nearly Symmetric Actions (2025)
- CA2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition (2025)
- Knowledge Distillation for Multimodal Egocentric Action Recognition Robust to Missing Modalities (2025)
- Learning Streaming Video Representation via Multitask Training (2025)
- DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer (2025)