DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models
Abstract
DINO-R1 incorporates reinforcement learning to enhance visual in-context reasoning capabilities in vision foundation models, achieving better performance than supervised fine-tuning across various visual prompting scenarios.
The recent explosive interest in the reasoning capabilities of large language models, exemplified by DeepSeek-R1, has been driven by the remarkable success of reinforcement learning-based fine-tuning frameworks such as Group Relative Policy Optimization (GRPO). However, such reasoning abilities remain underexplored and notably absent in vision foundation models, including representation models like the DINO series. In this work, we propose DINO-R1, the first attempt to incentivize visual in-context reasoning capabilities of vision foundation models using reinforcement learning. Specifically, DINO-R1 introduces Group Relative Query Optimization (GRQO), a novel reinforcement-style training strategy explicitly designed for query-based representation models, which computes query-level rewards based on group-normalized alignment quality. We further apply KL regularization to the objectness distribution to reduce training instability. This joint optimization enables dense and expressive supervision across queries while mitigating overfitting and distributional drift. Building upon Grounding-DINO, we train a series of DINO-R1 family models that integrate a visual prompt encoder and a visual-guided query selection mechanism. Extensive experiments on COCO, LVIS, and ODinW demonstrate that DINO-R1 significantly outperforms supervised fine-tuning baselines, achieving strong generalization in both open-vocabulary and closed-set visual prompting scenarios.
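To make the GRQO idea concrete, below is a minimal PyTorch sketch of a group-relative, query-level objective with KL regularization. This is an illustrative reading of the abstract, not the paper's implementation: the function name `grqo_loss`, the tensor shapes, and the choice of a softmax over queries as the "objectness distribution" are all assumptions.

```python
import torch
import torch.nn.functional as F

def grqo_loss(query_rewards, obj_logits, ref_obj_logits, kl_weight=0.1):
    """Sketch of a GRQO-style objective (hypothetical; details differ from the paper).

    query_rewards:  (B, Q) per-query alignment quality (e.g., matching-based scores).
    obj_logits:     (B, Q) current model's objectness logits over queries.
    ref_obj_logits: (B, Q) frozen reference model's objectness logits.
    """
    # Group-normalize rewards within each group of queries (advantage-style),
    # mirroring GRPO's group-relative normalization.
    mean = query_rewards.mean(dim=1, keepdim=True)
    std = query_rewards.std(dim=1, keepdim=True) + 1e-6
    advantages = (query_rewards - mean) / std

    # Push objectness up for queries with above-average alignment quality.
    log_probs = F.log_softmax(obj_logits, dim=1)
    policy_loss = -(advantages.detach() * log_probs).mean()

    # KL regularization keeps the objectness distribution close to the frozen
    # reference, mitigating distributional drift during training.
    ref_probs = F.softmax(ref_obj_logits, dim=1)
    kl = F.kl_div(log_probs, ref_probs, reduction="batchmean")

    return policy_loss + kl_weight * kl
```

The key design point, as described in the abstract, is that every query receives a reward normalized against its group, so supervision is dense across queries rather than concentrated on the few matched to ground truth.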
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model (2025)
- Compile Scene Graphs with Reinforcement Learning (2025)
- SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards (2025)
- Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model (2025)
- Reinforced Reasoning for Embodied Planning (2025)
- DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes (2025)
- R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO (2025)
It would be good if the authors could specify which "DINO" this is at the very beginning of the paper. Is it the feature-extractor DINO or the object-detection DINO?
Anyway, thanks for the great work!