--- |
|
base_model: |
|
- Qwen/Qwen2.5-VL-3B-Instruct |
|
datasets: |
|
- zackriya/diagramJSON |
|
library_name: peft |
|
tags: |
|
- diagram |
|
- structured-data |
|
- image-processing |
|
- knowledge-graph |
|
- json |
|
license: apache-2.0 |
|
pipeline_tag: visual-document-retrieval |
|
--- |
|
|
|
# 🖼️🔍 Diagram-to-Graph Model
|
|
|
<div align="center"> |
|
<img src="https://github.com/Zackriya-Solutions/diagram2graph/blob/main/docs/diagram2graph_cmpr.png?raw=true" width="800" style="border-radius:10px;" alt="Diagram to Graph Header"/> |
|
</div> |
|
|
|
This model is a research-driven project built during an internship at [Zackriya Solutions](https://www.zackriya.com/). It specializes in extracting **structured data (JSON)** from images, particularly **nodes, edges, and their sub-attributes**, to represent visual information as knowledge graphs.
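
For a simple two-step flowchart, the output might look like the following. The field names here are purely illustrative; the exact schema depends on the training data and your prompt:

```json
{
  "nodes": [
    {"id": "n1", "label": "Start", "type": "start"},
    {"id": "n2", "label": "Process Data", "type": "task"}
  ],
  "edges": [
    {"source": "n1", "target": "n2", "label": "next"}
  ]
}
```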
|
|
|
> 📌 **Note:** This model is intended for **learning purposes** only, not for production applications. The schema of the extracted structured data may need to be adapted to your project's needs.
|
|
|
## 📋 Model Details
|
|
|
- **Developed by:** Zackriya Solutions internship team (Mohammed Safvan)
- **Fine-tuned from:** `Qwen/Qwen2.5-VL-3B-Instruct`
- **License:** Apache 2.0
- **Language(s):** Multilingual (focus on structured extraction)
- **Model type:** Vision-Language Transformer (PEFT fine-tuned; see the loading sketch below)
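
Since this repository ships LoRA adapters rather than merged weights, they can also be attached to the base model explicitly with `peft`. A minimal sketch (the quick-start further down instead relies on recent `transformers` releases resolving adapter repositories automatically when `peft` is installed):

```python
import torch
from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# Attach the fine-tuned adapters on top of the frozen base weights.
model = PeftModel.from_pretrained(base, "zackriya/diagram2graph-adapters")
```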
|
|
|
## 🎯 Use Cases
|
|
|
### ✅ Direct Use
|
- Experimenting with **diagram-to-graph conversion** 🔄
|
- Understanding **AI-driven structured extraction** from images |
|
|
|
### 📈 Downstream Use (Potential)
|
- Enhancing **BPMN/Flowchart** analysis 🏗️
|
- Supporting **automated document processing** 📄
|
|
|
### ❌ Out-of-Scope Use
|
- Not designed for **real-world production** deployment ⚠️
|
- May not generalize well across **all diagram types** |
|
|
|
## 🚀 How to Use
|
```python
%pip install -q "transformers>=4.49.0" accelerate datasets peft "qwen-vl-utils[decord]==0.0.8"

import json

import torch
from qwen_vl_utils import process_vision_info
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor

MODEL_ID = "zackriya/diagram2graph-adapters"
# Bounds on the number of image pixels fed to the vision encoder
# (the encoder works on 28x28 patches).
MAX_PIXELS = 1280 * 28 * 28
MIN_PIXELS = 256 * 28 * 28

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

processor = Qwen2_5_VLProcessor.from_pretrained(
    MODEL_ID,
    min_pixels=MIN_PIXELS,
    max_pixels=MAX_PIXELS,
)

SYSTEM_MESSAGE = """You are a Vision Language Model specialized in extracting structured data from visual representations of process and flow diagrams.
Your task is to analyze the provided image of a diagram and extract the relevant information into a well-structured JSON format.
The diagram includes details such as nodes and edges, each with their own attributes.
Focus on identifying key data fields and ensuring the output adheres to the requested JSON structure.
Provide only the JSON output based on the extracted information. Avoid additional explanations or comments."""


def run_inference(image):
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": SYSTEM_MESSAGE}],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    # qwen_vl_utils's process_vision_info handles PIL images,
                    # URLs, and local paths alike, so any of them works here.
                    "image": image,
                },
                {
                    "type": "text",
                    "text": "Extract data in JSON format, Only give the JSON",
                },
            ],
        },
    ]

    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, _ = process_vision_info(messages)

    inputs = processor(
        text=[text],
        images=image_inputs,
        return_tensors="pt",
    )
    inputs = inputs.to(model.device)

    generated_ids = model.generate(**inputs, max_new_tokens=512)
    # Strip the prompt tokens so only the newly generated answer is decoded.
    generated_ids_trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]

    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    return output_text


# `image` may be a PIL image, a URL, or a path to a local file.
# Here it comes from an evaluation split loaded earlier, e.g.:
#   from datasets import load_dataset
#   eval_dataset = load_dataset("zackriya/diagramJSON", split="test")  # split name may differ
image = eval_dataset[9]["image"]  # PIL image
output = run_inference(image)

# Parse the generated string into a Python object.
data = json.loads(output[0])
```
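
Continuing from the quick-start, the parsed output can be materialized as an actual graph object. Below is a minimal sketch using `networkx`, assuming the model returns a top-level object with `nodes` and `edges` lists; the key names (`id`, `source`, `target`) are illustrative and may differ from the schema your prompt produces.

```python
import json

import networkx as nx  # %pip install networkx


def to_graph(raw: str) -> nx.DiGraph:
    """Build a directed graph from the model's JSON output."""
    data = json.loads(raw)
    graph = nx.DiGraph()
    for node in data.get("nodes", []):
        # Keep every extracted attribute on the graph node.
        graph.add_node(node.get("id"), **node)
    for edge in data.get("edges", []):
        graph.add_edge(edge.get("source"), edge.get("target"), **edge)
    return graph


graph = to_graph(output[0])
print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
```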
|
|
|
|
|
## 🏋️ Training Details

- **Dataset:** Internally curated diagram dataset 🖼️
- **Fine-tuning:** LoRA-based optimization ⚡ (a sketch follows this list)
- **Precision:** bf16 mixed-precision training 🎯
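
As a rough illustration of such a LoRA setup with `peft` (the actual rank, alpha, and target modules used in training are not published here, so the values below are placeholder assumptions):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", device_map="auto", torch_dtype=torch.bfloat16
)

# Hypothetical hyperparameters -- for illustration only.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # Only the adapter weights are trainable.
```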
|
|
|
## 📊 Evaluation

- **Metrics:** F1-score, computed per sample for nodes and edges 📈 (a sketch follows this list)
- **Limitations:** May struggle with **complex, dense diagrams** ⚠️
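
A minimal sketch of how node/edge F1 can be computed, assuming predictions and ground truth are compared as sets of exact-match items (the evaluation's actual matching criterion may be more lenient):

```python
def f1_score(predicted: set, expected: set) -> float:
    """F1 over two sets of items (e.g. node labels or edge tuples)."""
    if not predicted or not expected:
        return 0.0
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: edges as (source, target) pairs.
pred = {("start", "load"), ("load", "clean")}
gold = {("start", "load"), ("load", "clean"), ("clean", "train")}
print(round(f1_score(pred, gold), 2))  # 0.8
```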
|
## Results |
|
|
|
- **Node detection:** mean F1 improved from 0.749 to 0.891 (**+14 points**)
- **Edge detection:** mean F1 improved from 0.4605 to 0.6945 (**+23 points**)
|
|
|
| Sample          | Node F1 (base) | Node F1 (fine-tuned) | Edge F1 (base) | Edge F1 (fine-tuned) |
| --------------- | -------------- | -------------------- | -------------- | -------------------- |
| image_sample_1  | 0.46           | 1.0                  | 0.59           | 0.71                 |
| image_sample_2  | 0.67           | 0.57                 | 0.25           | 0.25                 |
| image_sample_3  | 1.0            | 1.0                  | 0.25           | 0.75                 |
| image_sample_4  | 0.5            | 0.83                 | 0.15           | 0.62                 |
| image_sample_5  | 0.72           | 0.78                 | 0.0            | 0.48                 |
| image_sample_6  | 0.75           | 0.75                 | 0.29           | 0.67                 |
| image_sample_7  | 0.6            | 1.0                  | 1.0            | 1.0                  |
| image_sample_8  | 0.6            | 1.0                  | 1.0            | 1.0                  |
| image_sample_9  | 1.0            | 1.0                  | 0.55           | 0.77                 |
| image_sample_10 | 0.67           | 0.8                  | 0.0            | 1.0                  |
| image_sample_11 | 0.8            | 0.8                  | 0.5            | 1.0                  |
| image_sample_12 | 0.67           | 1.0                  | 0.62           | 0.75                 |
| image_sample_13 | 1.0            | 1.0                  | 0.73           | 0.67                 |
| image_sample_14 | 0.74           | 0.95                 | 0.56           | 0.67                 |
| image_sample_15 | 0.86           | 0.71                 | 0.67           | 0.67                 |
| image_sample_16 | 0.75           | 1.0                  | 0.8            | 0.75                 |
| image_sample_17 | 0.8            | 1.0                  | 0.63           | 0.73                 |
| image_sample_18 | 0.83           | 0.83                 | 0.33           | 0.43                 |
| image_sample_19 | 0.75           | 0.8                  | 0.06           | 0.22                 |
| image_sample_20 | 0.81           | 1.0                  | 0.23           | 0.75                 |
| **Mean**        | 0.749          | **0.891**            | 0.4605         | **0.6945**           |
|
|
|
|
|
## 🤝 Collaboration

Interested in fine-tuning a model for your own use case, or in exploring how we can help you? Let's collaborate.
|
|
|
[Zackriya Solutions](https://www.zackriya.com/collaboration-form) |
|
|
|
## 📚 References
|
- [Roboflow](https://github.com/roboflow/notebooks/blob/main/notebooks/how-to-finetune-qwen2-5-vl-for-json-data-extraction.ipynb) |
|
- [Qwen](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) |
|
|
|
<h3 align='center'>
✨ Stay Curious & Keep Exploring! ✨
</h3>