# GUI-Actor-7B with Qwen2-VL-7B as backbone VLM

This model was introduced in the paper *GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents*. It is built on Qwen2-VL-7B-Instruct, augmented with an attention-based action head, and fine-tuned to perform GUI grounding on the accompanying dataset.

For more details on model design and evaluation, please check: 🏠 Project Page | 💻 GitHub Repo | 📑 Paper.
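
At a high level, the action head replaces coordinate generation with attention: a dedicated action token attends over the visual patch tokens, and the resulting attention weights form a probability map over screen regions from which click points are read out. The snippet below is a minimal, hypothetical sketch of this pattern, not the actual GUI-Actor implementation; the class `PointerHead` and all shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointerHead(nn.Module):
    """Toy attention-based pointer head (illustrative sketch, not the real one)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)  # projects the action token
        self.k_proj = nn.Linear(d_model, d_model)  # projects visual patch tokens

    def forward(self, action_token, patch_tokens):
        # action_token: (batch, d_model); patch_tokens: (batch, num_patches, d_model)
        q = self.q_proj(action_token).unsqueeze(1)       # (batch, 1, d)
        k = self.k_proj(patch_tokens)                    # (batch, p, d)
        scores = (q * k).sum(-1) / (k.shape[-1] ** 0.5)  # scaled dot-product
        return F.softmax(scores, dim=-1)                 # probability map over patches

# Toy usage: a 4x4 patch grid; the argmax patch is the click candidate.
head = PointerHead(d_model=32)
attn = head(torch.randn(1, 32), torch.randn(1, 16, 32))
row, col = divmod(attn.argmax(-1).item(), 4)
print(f"Candidate click patch: row={row}, col={col}")
```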

| Model Name | Hugging Face Link |
|------------|-------------------|
| GUI-Actor-7B-Qwen2-VL | 🤗 Hugging Face |
| GUI-Actor-2B-Qwen2-VL | 🤗 Hugging Face |
| GUI-Actor-7B-Qwen2.5-VL | 🤗 Hugging Face |
| GUI-Actor-3B-Qwen2.5-VL | 🤗 Hugging Face |
| GUI-Actor-Verifier-2B | 🤗 Hugging Face |

πŸ“Š Performance Comparison on GUI Grounding Benchmarks

Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with Qwen2-VL as the backbone. † indicates scores obtained from our own evaluation of the official models on Hugging Face.

| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |
|--------|--------------|----------------|------------|---------------|
| **72B models:** | | | | |
| AGUVIS-72B | Qwen2-VL | - | 89.2 | - |
| UGround-V1-72B | Qwen2-VL | 34.5 | 89.4 | - |
| UI-TARS-72B | Qwen2-VL | 38.1 | 88.4 | 90.3 |
| **7B models:** | | | | |
| OS-Atlas-7B | Qwen2-VL | 18.9 | 82.5 | 84.1 |
| AGUVIS-7B | Qwen2-VL | 22.9 | 84.4 | 86.0† |
| UGround-V1-7B | Qwen2-VL | 31.1 | 86.3 | 87.6† |
| UI-TARS-7B | Qwen2-VL | 35.7 | 89.5 | 91.6 |
| GUI-Actor-7B | Qwen2-VL | 40.7 | 88.3 | 89.5 |
| GUI-Actor-7B + Verifier | Qwen2-VL | 44.2 | 89.7 | 90.9 |
| **2B models:** | | | | |
| UGround-V1-2B | Qwen2-VL | 26.6 | 77.1 | - |
| UI-TARS-2B | Qwen2-VL | 27.7 | 82.3 | 84.7 |
| GUI-Actor-2B | Qwen2-VL | 36.7 | 86.5 | 88.6 |
| GUI-Actor-2B + Verifier | Qwen2-VL | 41.8 | 86.9 | 89.3 |

Table 2. Main results on ScreenSpot-Pro and ScreenSpot-v2 with Qwen2.5-VL as the backbone.

| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot-v2 |
|--------|--------------|----------------|---------------|
| **7B models:** | | | |
| Qwen2.5-VL-7B | Qwen2.5-VL | 27.6 | 88.8 |
| Jedi-7B | Qwen2.5-VL | 39.5 | 91.7 |
| GUI-Actor-7B | Qwen2.5-VL | 44.6 | 92.1 |
| GUI-Actor-7B + Verifier | Qwen2.5-VL | 47.7 | 92.5 |
| **3B models:** | | | |
| Qwen2.5-VL-3B | Qwen2.5-VL | 25.9 | 80.9 |
| Jedi-3B | Qwen2.5-VL | 36.1 | 88.6 |
| GUI-Actor-3B | Qwen2.5-VL | 42.2 | 91.0 |
| GUI-Actor-3B + Verifier | Qwen2.5-VL | 45.9 | 92.4 |
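
In both tables, the "+ Verifier" rows rescore the grounder's top-k candidate points with GUI-Actor-Verifier-2B and keep the best-scoring one. The verifier's actual interface is not reproduced here; the sketch below only illustrates the reranking pattern, with `score_candidate` as a hypothetical stand-in for a verifier forward pass over the screenshot and a candidate point.

```python
def rerank_with_verifier(topk_points, score_candidate):
    """Return the candidate click point with the highest verifier score.

    topk_points: list of (x, y) points, e.g. pred["topk_points"] from the grounder.
    score_candidate: hypothetical callable mapping a point to a scalar score.
    """
    return max(topk_points, key=score_candidate)

# Toy usage with a stand-in scorer that prefers points near x = 0.95.
points = [(0.97, 0.15), (0.50, 0.50), (0.10, 0.90)]
print(rerank_with_verifier(points, lambda p: -abs(p[0] - 0.95)))  # (0.97, 0.15)
```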

πŸš€ Usage

```python
import torch

from qwen_vl_utils import process_vision_info
from datasets import load_dataset
from transformers import Qwen2VLProcessor
from gui_actor.constants import chat_template
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference


# Load the model and processor
model_name_or_path = "microsoft/GUI-Actor-7B-Qwen2-VL"
data_processor = Qwen2VLProcessor.from_pretrained(model_name_or_path)
tokenizer = data_processor.tokenizer
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"  # requires flash-attn; drop this argument if unavailable
).eval()

# Prepare an example from ScreenSpot
dataset = load_dataset("rootsautomation/ScreenSpot")["test"]
example = dataset[0]
print(f"Instruction: {example['instruction']}")
print(f"Ground-truth action region (x1, y1, x2, y2): {[round(i, 2) for i in example['bbox']]}")

conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task.",
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": example["image"], # PIL.Image.Image or str path
                # "image_url": "https://xxxxx.png" or "https://xxxxx.jpg" or "file://xxxxx.png" or "data:image/png;base64,xxxxxxxx", will be split by "base64,"
            },
            {
                "type": "text",
                "text": example["instruction"]
            },
        ],
    },
]

# Run inference; topk=3 returns the three highest-scoring candidate points
pred = inference(conversation, model, tokenizer, data_processor, use_placeholder=True, topk=3)
px, py = pred["topk_points"][0]  # best candidate, normalized to [0, 1]
print(f"Predicted click point: [{round(px, 4)}, {round(py, 4)}]")

# >> Model Response
# Instruction: close this window
# Ground-truth action region (x1, y1, x2, y2): [0.9479, 0.1444, 0.9938, 0.2074]
# Predicted click point: [0.9709, 0.1548]
```
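
The returned points are normalized to [0, 1], so acting on a prediction means scaling by the image resolution. The follow-up below is a sketch assuming `pyautogui` is installed and the screenshot matches the live screen size; `example`, `px`, and `py` come from the snippet above.

```python
import pyautogui  # extra dependency for executing clicks, not required by gui_actor

# Scale the normalized prediction to pixel coordinates.
width, height = example["image"].size  # PIL images report (width, height)
x_pix, y_pix = int(px * width), int(py * height)
print(f"Clicking at pixel ({x_pix}, {y_pix})")

# Only meaningful when the screenshot was captured from the live screen
# at its native resolution.
pyautogui.click(x=x_pix, y=y_pix)
```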

πŸ“ Citation

```bibtex
@article{wu2025gui,
  title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
  author={Wu, Qianhui and Cheng, Kanzhi and Yang, Rui and Zhang, Chaoyun and Yang, Jianwei and Jiang, Huiqiang and Mu, Jian and Peng, Baolin and Qiao, Bo and Tan, Reuben and others},
  journal={arXiv preprint arXiv:2506.03143},
  year={2025}
}
```