---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- zackriya/diagramJSON
library_name: peft
tags:
- diagram
- structured-data
- image-processing
- knowledge-graph
- json
license: apache-2.0
pipeline_tag: visual-document-retrieval
---

# πŸ–ΌοΈπŸ”— Diagram-to-Graph Model

<div align="center">
  <img src="https://github.com/Zackriya-Solutions/diagram2graph/blob/main/docs/diagram2graph_cmpr.png?raw=true" width="800" style="border-radius:10px;" alt="Diagram to Graph Header"/>
</div>

This model is a research-driven project built during an internship at [Zackriya Solutions](https://www.zackriya.com/). It specializes in extracting **structured data (JSON)** from images, particularly **nodes, edges, and their sub-attributes**, to represent visual information as knowledge graphs.

> πŸš€ **Note:** This model is intended for **learning purposes** only, not for production applications. The schema of the extracted data may need to be adapted to your project's needs.
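
For a concrete sense of the target output, here is a sketch of the kind of graph JSON the model produces. The exact schema follows the [zackriya/diagramJSON](https://huggingface.co/datasets/zackriya/diagramJSON) dataset, so treat these field names as illustrative assumptions:

```json
{
  "nodes": [
    { "id": "n1", "label": "Start", "type": "event" },
    { "id": "n2", "label": "Review order", "type": "task" }
  ],
  "edges": [
    { "source": "n1", "target": "n2", "label": "submitted" }
  ]
}
```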

## πŸ“ Model Details

- **Developed by:** Zackriya Solutions internship team (Mohammed Safvan)
- **Fine-tuned from:** `Qwen/Qwen2.5-VL-3B-Instruct`
- **License:** Apache 2.0
- **Language(s):** Multilingual (focus on structured extraction)
- **Model type:** Vision-Language Transformer (PEFT fine-tuned)

## 🎯 Use Cases

### βœ… Direct Use
- Experimenting with **diagram-to-graph conversion** πŸ“Š
- Understanding **AI-driven structured extraction** from images

### πŸš€ Downstream Use (Potential)
- Enhancing **BPMN/Flowchart** analysis πŸ—οΈ
- Supporting **automated document processing** πŸ“„

### ❌ Out-of-Scope Use
- Not designed for **real-world production** deployment ⚠️
- May not generalize well across **all diagram types**

## πŸ“Š How to Use
```bash
pip install -q "transformers>=4.49.0" accelerate datasets peft "qwen-vl-utils[decord]==0.0.8"
```

```python
import torch
from qwen_vl_utils import process_vision_info
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor

MODEL_ID = "zackriya/diagram2graph"
# Qwen2.5-VL processes images as 28x28-pixel patches; these bounds constrain
# the resolution range the processor resizes inputs into.
MAX_PIXELS = 1280 * 28 * 28
MIN_PIXELS = 256 * 28 * 28


model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
	MODEL_ID,
	device_map="auto",
	torch_dtype=torch.bfloat16
)

processor = Qwen2_5_VLProcessor.from_pretrained(
	MODEL_ID,
	min_pixels=MIN_PIXELS,
	max_pixels=MAX_PIXELS
)


SYSTEM_MESSAGE = """You are a Vision Language Model specialized in extracting structured data from visual representations of process and flow diagrams.
Your task is to analyze the provided image of a diagram and extract the relevant information into a well-structured JSON format.
The diagram includes details such as nodes and edges, each with their own attributes.
Focus on identifying key data fields and ensuring the output adheres to the requested JSON structure.
Provide only the JSON output based on the extracted information. Avoid additional explanations or comments."""

def run_inference(image):
	messages= [
    	{
        	"role": "system",
        	"content": [{"type": "text", "text": SYSTEM_MESSAGE}],
    	},
    	{
        	"role": "user",
        	"content": [
            	{
                	"type": "image",
                	# qwen_vl_utils's process_vision_info accepts a PIL image, a URL, or a file path here
                	"image": image,
            	},
            	{
                	"type": "text",
                	"text": "Extract data in JSON format, Only give the JSON",
            	},
        	],
    	},
	]

	text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	image_inputs, _ = process_vision_info(messages)

	inputs = processor(
    	text=[text],
    	images=image_inputs,
    	return_tensors="pt",
	)
	inputs = inputs.to(model.device)  # follow the device chosen by device_map

	generated_ids = model.generate(**inputs, max_new_tokens=512)
	generated_ids_trimmed = [
    	out_ids[len(in_ids):]
    	for in_ids, out_ids
    	in zip(inputs.input_ids, generated_ids)
	]

	output_text = processor.batch_decode(
    	generated_ids_trimmed,
    	skip_special_tokens=True,
    	clean_up_tokenization_spaces=False
	)
	return output_text

# Run on a sample from the fine-tuning dataset; `image` may also be a URL
# or a local file path (process_vision_info handles all three).
from datasets import load_dataset

eval_dataset = load_dataset("zackriya/diagramJSON", split="test")  # split name assumed
image = eval_dataset[9]["image"]  # PIL image
output = run_inference(image)

# Parse the model's JSON string
import json
json.loads(output[0])
```
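
In practice, vision-language models sometimes wrap their answer in a Markdown code fence, which makes a bare `json.loads` brittle. A small hedged helper (not part of the original card) that strips an optional fence first:

```python
import json
import re

def parse_graph_output(raw: str) -> dict:
    """Parse the model's JSON answer, tolerating a wrapping Markdown code fence."""
    match = re.search(r"`{3}(?:json)?\s*(.*?)\s*`{3}", raw, re.DOTALL)
    payload = match.group(1) if match else raw
    return json.loads(payload)

graph = parse_graph_output(output[0])
# "nodes"/"edges" keys assumed from the card's description of the output
print(len(graph.get("nodes", [])), "nodes,", len(graph.get("edges", [])), "edges")
```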


## πŸ—οΈ Training Details
- **Dataset:** Internally curated diagram dataset, published as [zackriya/diagramJSON](https://huggingface.co/datasets/zackriya/diagramJSON) πŸ–ΌοΈ
- **Fine-tuning:** LoRA-based optimization via PEFT ⚑ (see the loading sketch below)
- **Precision:** bf16 mixed-precision training 🎯
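
Since the card lists PEFT/LoRA, the checkpoint can presumably also be attached as an adapter on top of the base model. A minimal sketch, assuming the repository ships PEFT adapter weights:

```python
import torch
from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration

# Load the frozen base model first
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# Attach the fine-tuned LoRA adapter from this repo
model = PeftModel.from_pretrained(base, "zackriya/diagram2graph")
```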

## πŸ“ˆ Evaluation

- **Metrics:** Per-sample F1 over extracted nodes and edges πŸ†
- **Limitations:** May struggle with **complex, dense diagrams** ⚠️

## πŸ… Results

- **+14-point absolute gain in mean node F1** (0.749 → 0.891)
- **+23-point absolute gain in mean edge F1** (0.4605 → 0.6945)

| Sample          | Node F1 (base) | Node F1 (fine-tuned) | Edge F1 (base) | Edge F1 (fine-tuned) |
| --------------- | -------------- | -------------------- | -------------- | -------------------- |
| image_sample_1  | 0.46           | 1.0                  | 0.59           | 0.71                 |
| image_sample_2  | 0.67           | 0.57                 | 0.25           | 0.25                 |
| image_sample_3  | 1.0            | 1.0                  | 0.25           | 0.75                 |
| image_sample_4  | 0.5            | 0.83                 | 0.15           | 0.62                 |
| image_sample_5  | 0.72           | 0.78                 | 0.0            | 0.48                 |
| image_sample_6  | 0.75           | 0.75                 | 0.29           | 0.67                 |
| image_sample_7  | 0.6            | 1.0                  | 1.0            | 1.0                  |
| image_sample_8  | 0.6            | 1.0                  | 1.0            | 1.0                  |
| image_sample_9  | 1.0            | 1.0                  | 0.55           | 0.77                 |
| image_sample_10 | 0.67           | 0.8                  | 0.0            | 1.0                  |
| image_sample_11 | 0.8            | 0.8                  | 0.5            | 1.0                  |
| image_sample_12 | 0.67           | 1.0                  | 0.62           | 0.75                 |
| image_sample_13 | 1.0            | 1.0                  | 0.73           | 0.67                 |
| image_sample_14 | 0.74           | 0.95                 | 0.56           | 0.67                 |
| image_sample_15 | 0.86           | 0.71                 | 0.67           | 0.67                 |
| image_sample_16 | 0.75           | 1.0                  | 0.8            | 0.75                 |
| image_sample_17 | 0.8            | 1.0                  | 0.63           | 0.73                 |
| image_sample_18 | 0.83           | 0.83                 | 0.33           | 0.43                 |
| image_sample_19 | 0.75           | 0.8                  | 0.06           | 0.22                 |
| image_sample_20 | 0.81           | 1.0                  | 0.23           | 0.75                 |
| **Mean**        | 0.749          | **0.891**            | 0.4605         | **0.6945**           |
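
The card does not specify the matching criteria behind these scores. One plausible reading, sketched below, is a set-overlap F1 with nodes matched by label and edges by `(source, target)` pair; treat this as an assumption, not the project's actual evaluation code:

```python
def f1_score(predicted: set, actual: set) -> float:
    """Set-overlap F1, e.g. over node labels or (source, target) edge pairs."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)

# 3 of 4 predicted nodes appear among 5 ground-truth nodes -> F1 ~= 0.67
print(f1_score({"A", "B", "C", "D"}, {"A", "B", "C", "E", "F"}))
```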


## 🀝 Collaboration
Interested in fine-tuning a model for your own use case, or in exploring how we can help? Let's collaborate.

[Zackriya Solutions](https://www.zackriya.com/collaboration-form)

## πŸ”— References
 - [Roboflow](https://github.com/roboflow/notebooks/blob/main/notebooks/how-to-finetune-qwen2-5-vl-for-json-data-extraction.ipynb)
 - [Qwen](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)

<h3 align='center'>
πŸš€Stay Curious & Keep Exploring!πŸš€
</h3>