FlashVL-2B-Dynamic-ISS

Introduction

We are excited to introduce Flash-VL 2B, an approach to optimizing Vision-Language Models (VLMs) for real-time applications that targets ultra-low latency and high throughput without sacrificing accuracy. It combines tailored architectural choices, token compression mechanisms, data curation, training schemes, and a novel image processing technique called implicit semantic stitching (ISS), which balances computational load against model performance. Through extensive evaluations on 11 standard VLM benchmarks, we demonstrate that Flash-VL 2B achieves state-of-the-art results in both speed and accuracy, making it a promising solution for deployment in resource-constrained environments and large-scale real-time applications.

Environment Setup

pip install torch==2.1.2
pip install transformers==4.50.0.dev0
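
To confirm that the active environment matches the versions pinned above, a small check along these lines may help. It is only a sketch: the helper name `check_versions` is hypothetical, and the prefix comparison is an assumption made so that local build suffixes such as `+cu121` do not count as a mismatch.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install commands above.
PINNED = {"torch": "2.1.2", "transformers": "4.50.0.dev0"}


def check_versions(pinned):
    """Map each package name to (installed_version_or_None, expected_version)."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, expected)
    return report


for name, (installed, expected) in check_versions(PINNED).items():
    # Prefix match (an assumption) so that e.g. "2.1.2+cu121" still counts as 2.1.2.
    ok = installed is not None and installed.startswith(expected)
    print(f"{name}: installed={installed} expected={expected} {'ok' if ok else 'MISMATCH'}")
```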

How to use it?

import torch
from PIL import Image
import requests
from io import BytesIO
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

model_path = "FlashVL/FlashVL-2B-Dynamic-ISS"
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='cuda')
model.tokenizer = AutoTokenizer.from_pretrained(model_path, device_map='cuda')
model.im_trans = CLIPImageProcessor.from_pretrained(model_path)

# Single-image, single-round conversation
image_url = "https://s3plus.meituan.net/automl-datasets/mlm/0516.png"
response = requests.get(image_url, timeout=30)
response.raise_for_status()  # fail early if the download did not succeed
image_data = BytesIO(response.content)
pil_image = Image.open(image_data).convert('RGB')
messages = [{'role': 'user', 'content': "Generate a recipe for the dish in the image"}] # answer: EXTRA
answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
print(answer)

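# Single-image, multi-round conversation
messages = [
    {'role': 'user', 'content': 'What is this?'},
    # Earlier assistant reply, passed back as conversation history.
    {'role': 'assistant', 'content': (
        'This looks like a dessert of tremella and lotus seed soup. '
        'Tremella is a common ingredient, often used in desserts and soups, '
        'with a soft, glutinous texture and a light, moistening quality. '
        'Lotus seeds are the dried seeds of the lotus, commonly used in traditional '
        'Chinese medicine and food therapy to strengthen the spleen and relieve diarrhea. '
        'The image also shows some goji berries and walnuts; goji berries are rich in '
        'vitamins and antioxidants, while walnuts provide plenty of protein and healthy fats. '
        'Overall, this dessert is not only tasty but also has some nutritional value.'
    )},
    {'role': 'user', 'content': 'Analyze the calories of the dish in the image'},
]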
answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
print(answer)

# Text-only, single-round conversation (no image)
messages = [{'role': 'user', 'content': "who are you"}]
answer = model.chat(None, messages, do_sample=False, max_new_tokens=256)
print(answer)
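
In the multi-round example above, the caller maintains the conversation history by hand. The sketch below wraps that bookkeeping in a small helper. The name `chat_turn` and the `EchoModel` stand-in are hypothetical and not part of FlashVL, and the code assumes that `model.chat` returns the reply as a plain string, as the `print(answer)` calls above suggest.

```python
def chat_turn(model, image, history, user_text, max_new_tokens=256):
    """Send one user turn and record both sides in `history` (hypothetical helper)."""
    history.append({"role": "user", "content": user_text})
    # Same call as in the examples above, with greedy decoding.
    answer = model.chat(image, history, do_sample=False, max_new_tokens=max_new_tokens)
    # Assumption: `model.chat` returns the reply as a plain string.
    history.append({"role": "assistant", "content": answer})
    return answer


class EchoModel:
    """Stand-in for the real model so the sketch runs without a GPU."""

    def chat(self, image, messages, do_sample=False, max_new_tokens=256):
        return f"(reply to: {messages[-1]['content']})"


history = []
chat_turn(EchoModel(), None, history, "What is this?")
chat_turn(EchoModel(), None, history, "Analyze the calories of the dish")
print([m["role"] for m in history])
```

With the real model, pass the loaded `model` and `pil_image` in place of `EchoModel()` and `None`; the `history` list then plays the role of `messages` in the examples above.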

Evaluation

Benchmark	Qwen2-VL-2B	Aquila-VL-2B	InternVL2.5-2B	Flash-VL-2B_s	Flash-VL-2B_d	Flash-VL-2B_d-ISS
MMMU_val	41.9	44.4	41.8	43.6	42.9	42.9
MMBench^en	74.9	78.6	74.7	78.4	78.4	79.1
MMBench^cn	73.5	76.3	71.6	74.7	74.9	76.7
MMStar	48.0	54.9	54.1	53.8	54.4	54.1
MathVista_testmini	43.0	59.4	50.9	59.3	58.1	61.5
AI2D_test	74.1	75.0	75.1	74.2	74.1	74.4
MMVet	49.5	40.9	61.7	47.3	52.7	50.7
HallusionBench	39.2	38.5	42.7	43.5	45.5	49.0
OCRBench	794	773	800	764	831	843
MME	1872	1813	2091	1715	1866	1850
SEEDBench	71.5	78.9	73.2	73.6	73.6	74.5
Average	60.2	62.6	63.6	62.4	64.0	64.8

We use VLMEvalKit to evaluate FlashVL-2B-Dynamic-ISS.
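
OCRBench and MME are reported on different scales from the other benchmarks, so the Average row cannot be a plain mean of the raw numbers. The sketch below assumes that OCRBench is divided by 10 (maximum score 1000) and MME by 28 (maximum score 2800) before averaging. This normalization is not stated in the table; it is an assumption that happens to reproduce the reported averages to within 0.1.

```python
# Row order of the table above, excluding the Average row.
BENCHMARKS = [
    "MMMU_val", "MMBench_en", "MMBench_cn", "MMStar", "MathVista_testmini",
    "AI2D_test", "MMVet", "HallusionBench", "OCRBench", "MME", "SEEDBench",
]

# Scores copied column by column from the table above.
SCORES = {
    "Qwen2-VL-2B":       [41.9, 74.9, 73.5, 48.0, 43.0, 74.1, 49.5, 39.2, 794, 1872, 71.5],
    "Aquila-VL-2B":      [44.4, 78.6, 76.3, 54.9, 59.4, 75.0, 40.9, 38.5, 773, 1813, 78.9],
    "InternVL2.5-2B":    [41.8, 74.7, 71.6, 54.1, 50.9, 75.1, 61.7, 42.7, 800, 2091, 73.2],
    "Flash-VL-2B_s":     [43.6, 78.4, 74.7, 53.8, 59.3, 74.2, 47.3, 43.5, 764, 1715, 73.6],
    "Flash-VL-2B_d":     [42.9, 78.4, 74.9, 54.4, 58.1, 74.1, 52.7, 45.5, 831, 1866, 73.6],
    "Flash-VL-2B_d-ISS": [42.9, 79.1, 76.7, 54.1, 61.5, 74.4, 50.7, 49.0, 843, 1850, 74.5],
}

# Average row as reported in the table.
REPORTED_AVERAGE = {
    "Qwen2-VL-2B": 60.2, "Aquila-VL-2B": 62.6, "InternVL2.5-2B": 63.6,
    "Flash-VL-2B_s": 62.4, "Flash-VL-2B_d": 64.0, "Flash-VL-2B_d-ISS": 64.8,
}

# Assumed divisors that bring every benchmark onto a 0-100 scale.
SCALE = {"OCRBench": 10.0, "MME": 28.0}


def normalized_average(scores):
    """Mean over all benchmarks after rescaling OCRBench and MME to 0-100."""
    total = sum(value / SCALE.get(name, 1.0) for name, value in zip(BENCHMARKS, scores))
    return total / len(BENCHMARKS)


for model_name, scores in SCORES.items():
    computed = normalized_average(scores)
    print(f"{model_name}: computed {computed:.2f}, reported {REPORTED_AVERAGE[model_name]}")
```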

Citation

If you find this project useful in your research, please consider citing:

@misc{zhang2025flashvl2boptimizingvisionlanguage,
      title={Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput}, 
      author={Bo Zhang and Shuo Li and Runhe Tian and Yang Yang and Jixin Tang and Jinhao Zhou and Lin Ma},
      year={2025},
      eprint={2505.09498},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.09498}, 
}
