FlashVL-2B-Static-GRPO

[📜 FlashVL](https://arxiv.org/abs/2505.09498)


Introduction

We are excited to introduce FlashVL, a novel approach to optimizing Vision-Language Models (VLMs) for real-time applications, targeting ultra-low latency and high throughput without sacrificing accuracy. Leveraging advanced architectural enhancements and efficient computational strategies, Flash-VL 2B is designed to maximize throughput by reducing processing time while maintaining competitive performance across multiple vision-language benchmarks. Our approach includes tailored architectural choices, token compression mechanisms, data curation, training schemes, and a novel image processing technique called implicit semantic stitching that effectively balances computational load and model performance. Through extensive evaluations on 11 standard VLM benchmarks, we demonstrate that Flash-VL 2B achieves state-of-the-art results in both speed and accuracy, making it a promising solution for deployment in resource-constrained environments and large-scale real-time applications.

Environment Setup

pip install torch==2.1.2
pip install transformers==4.50.0.dev0
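
Note that 4.50.0.dev0 is a development build of transformers; if that exact pin is unavailable on PyPI, installing transformers from source is an alternative. Below is an optional, minimal sanity check after installation (it assumes a CUDA-capable GPU, since the loading code below uses device_map='cuda' and bfloat16):

import torch
import transformers

# Confirm the pinned versions are active and that the GPU supports bfloat16
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("bf16 supported:", torch.cuda.is_bf16_supported())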

How to use it?

import torch
from PIL import Image
import requests
from io import BytesIO
from transformers import AutoModel, AutoTokenizer, SiglipProcessor

model_path = "Flash-VL/FlashVL-2B-Static-GRPO"
# Load the model with its custom remote code onto the GPU in bfloat16
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map='cuda',
)
# Attach the tokenizer and the SigLIP image processor used by model.chat
model.tokenizer = AutoTokenizer.from_pretrained(model_path)
model.im_trans = SiglipProcessor.from_pretrained(model_path).image_processor

# Download the example image (a vegetable price board)
image_url = "https://s3plus.meituan.net/automl-datasets/mlm/3FF4.png"
response = requests.get(image_url)
image_data = BytesIO(response.content)
pil_image = Image.open(image_data).convert('RGB')

# Question: "What vegetable is in the first row, second column of the picture, and how much is one jin (500 g)?"
messages = [{'role': 'user', 'content': "说说图中第一行第二列是什么蔬菜,买一斤多少钱"}]
answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
print(answer)
# 图片中第一行第二列的蔬菜是**荷兰豆**,买一斤的价格是**¥16.8**。
# ("The vegetable in the first row, second column is snow peas (荷兰豆); one jin costs ¥16.8.")
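
If the image is stored locally rather than fetched over HTTP, the same chat call applies. A minimal variant (the file path and prompt here are placeholders, not from the original card):

from PIL import Image

# Open a local image instead of downloading one (placeholder path)
pil_image = Image.open("/path/to/your_image.jpg").convert('RGB')

messages = [{'role': 'user', 'content': "Describe this image in one sentence."}]
answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
print(answer)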

Evaluation

| Method/model | Average | DynaMath | MathVision | MathVerse | MMMU Pro | WeMath |
|---|---|---|---|---|---|---|
| Flash-VL-2Bs | 23.80 | 23.19 | 26.72 | 16.84 | 16.24 | 36.03 |
| InternVL3-2B | 27.03 | 32.55 | 26.49 | 17 | 22.56 | 36.55 |
| + SFT | 26.08 (+2.28) | 28.28 | 31.06 | 16.97 | 15.95 | 38.16 |
| + RL | 27.23 (+3.43) | 26.94 | 27.94 | 17.73 | 16.99 | 46.55 |
| FlashVL-2B-Static-GRPO | 29.05 (+5.25) | 30.61 | 32.48 | 18.45 | 16.53 | 47.18 |

Note: FlashVL-2B-Static-GRPO applies both SFT and RL; the deltas in parentheses are measured against the Flash-VL-2Bs baseline.
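
For context on the RL stage: GRPO (Group Relative Policy Optimization) samples a group of responses per prompt and scores each one against the group's mean reward instead of a learned value model. The snippet below is only an illustrative sketch of that group-relative advantage normalization, not the training code used for this model:

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_prompts, group_size) scalar rewards for the responses sampled per prompt
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Each response is rewarded or penalized relative to its own sampling group
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each, binary correctness rewards
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))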

We use VLMEvalKit to evaluate FlashVL-2B-Static.
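
The Average column is the unweighted mean of the five benchmark scores (this holds for every row of the table); for example, for FlashVL-2B-Static-GRPO:

# Reproduce the Average column entry for FlashVL-2B-Static-GRPO
scores = {"DynaMath": 30.61, "MathVision": 32.48, "MathVerse": 18.45,
          "MMMU Pro": 16.53, "WeMath": 47.18}
print(round(sum(scores.values()) / len(scores), 2))  # 29.05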

Citation

If you find this project useful in your research, please consider citing:

@misc{zhang2025flashvl2boptimizingvisionlanguage,
      title={Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput}, 
      author={Bo Zhang and Shuo Li and Runhe Tian and Yang Yang and Jixin Tang and Jinhao Zhou and Lin Ma},
      year={2025},
      eprint={2505.09498},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.09498}, 
}