--- |
|
base_model: |
|
- Qwen/Qwen2.5-VL-3B-Instruct |
|
datasets: |
|
- zackriya/diagramJSON |
|
library_name: peft |
|
tags: |
|
- diagram |
|
- structured-data |
|
- image-processing |
|
- knowledge-graph |
|
- json |
|
license: apache-2.0 |
|
pipeline_tag: visual-document-retrieval |
|
--- |
|
|
|
# 🖼️🔍 Diagram-to-Graph Model
|
|
|
<div align="center"> |
|
<img src="https://github.com/Zackriya-Solutions/diagram2graph/blob/main/docs/diagram2graph_cmpr.png?raw=true" width="800" style="border-radius:10px;" alt="Diagram to Graph Header"/> |
|
</div> |
|
|
|
This model is a research-driven project built during an internship at [Zackriya Solutions](https://www.zackriya.com/). It specializes in extracting **structured data (JSON)** from images, particularly **nodes, edges, and their sub-attributes**, to represent visual information as knowledge graphs.
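
For a simple two-step flowchart, the output might look like the following. The field names here are purely illustrative; the exact schema depends on the training data and your prompt:

```json
{
  "nodes": [
    {"id": "n1", "label": "Start", "type": "start"},
    {"id": "n2", "label": "Process Data", "type": "task"}
  ],
  "edges": [
    {"source": "n1", "target": "n2", "label": "next"}
  ]
}
```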
|
|
|
> 📌 **Note:** This model is intended for **learning purposes** only, not for production applications. The schema of the extracted structured data may need to be adapted to your project's needs.
|
|
|
## 📋 Model Details
|
|
|
- **Developed by:** Zackriya Solutions internship team (Mohammed Safvan)
- **Fine-tuned from:** `Qwen/Qwen2.5-VL-3B-Instruct`
- **License:** Apache 2.0
- **Language(s):** Multilingual (focus on structured extraction)
- **Model type:** Vision-Language Transformer (PEFT fine-tuned; see the loading sketch below)
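
Since this repository ships LoRA adapters rather than merged weights, they can also be attached to the base model explicitly with `peft`. A minimal sketch (the quick-start further down instead relies on recent `transformers` releases resolving adapter repositories automatically when `peft` is installed):

```python
import torch
from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# Attach the fine-tuned adapters on top of the frozen base weights.
model = PeftModel.from_pretrained(base, "zackriya/diagram2graph-adapters")
```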
|
|
|
## 🎯 Use Cases
|
|
|
### ✅ Direct Use
|
- Experimenting with **diagram-to-graph conversion** 🔄
|
- Understanding **AI-driven structured extraction** from images |
|
|
|
### 📈 Downstream Use (Potential)
|
- Enhancing **BPMN/Flowchart** analysis 🏗️
|
- Supporting **automated document processing** 📄
|
|
|
### ❌ Out-of-Scope Use
|
- Not designed for **real-world production** deployment ⚠️
|
- May not generalize well across **all diagram types** |
|
|
|
## 🚀 How to Use
|
```python
%pip install -q "transformers>=4.49.0" accelerate datasets peft "qwen-vl-utils[decord]==0.0.8"

import json

import torch
from qwen_vl_utils import process_vision_info
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor

MODEL_ID = "zackriya/diagram2graph-adapters"
# Bounds on the number of image pixels fed to the vision encoder
# (the encoder works on 28x28 patches).
MAX_PIXELS = 1280 * 28 * 28
MIN_PIXELS = 256 * 28 * 28

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

processor = Qwen2_5_VLProcessor.from_pretrained(
    MODEL_ID,
    min_pixels=MIN_PIXELS,
    max_pixels=MAX_PIXELS,
)

SYSTEM_MESSAGE = """You are a Vision Language Model specialized in extracting structured data from visual representations of process and flow diagrams.
Your task is to analyze the provided image of a diagram and extract the relevant information into a well-structured JSON format.
The diagram includes details such as nodes and edges, each with their own attributes.
Focus on identifying key data fields and ensuring the output adheres to the requested JSON structure.
Provide only the JSON output based on the extracted information. Avoid additional explanations or comments."""


def run_inference(image):
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": SYSTEM_MESSAGE}],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    # qwen_vl_utils's process_vision_info handles PIL images,
                    # URLs, and local paths alike, so any of them works here.
                    "image": image,
                },
                {
                    "type": "text",
                    "text": "Extract data in JSON format, Only give the JSON",
                },
            ],
        },
    ]

    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, _ = process_vision_info(messages)

    inputs = processor(
        text=[text],
        images=image_inputs,
        return_tensors="pt",
    )
    inputs = inputs.to(model.device)

    generated_ids = model.generate(**inputs, max_new_tokens=512)
    # Strip the prompt tokens so only the newly generated answer is decoded.
    generated_ids_trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]

    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    return output_text


# `image` may be a PIL image, a URL, or a path to a local file.
# Here it comes from an evaluation split loaded earlier, e.g.:
#   from datasets import load_dataset
#   eval_dataset = load_dataset("zackriya/diagramJSON", split="test")  # split name may differ
image = eval_dataset[9]["image"]  # PIL image
output = run_inference(image)

# Parse the generated string into a Python object.
data = json.loads(output[0])
```
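
Continuing from the quick-start, the parsed output can be materialized as an actual graph object. Below is a minimal sketch using `networkx`, assuming the model returns a top-level object with `nodes` and `edges` lists; the key names (`id`, `source`, `target`) are illustrative and may differ from the schema your prompt produces.

```python
import json

import networkx as nx  # %pip install networkx


def to_graph(raw: str) -> nx.DiGraph:
    """Build a directed graph from the model's JSON output."""
    data = json.loads(raw)
    graph = nx.DiGraph()
    for node in data.get("nodes", []):
        # Keep every extracted attribute on the graph node.
        graph.add_node(node.get("id"), **node)
    for edge in data.get("edges", []):
        graph.add_edge(edge.get("source"), edge.get("target"), **edge)
    return graph


graph = to_graph(output[0])
print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
```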
|
|
|
|
|
## 🏋️ Training Details

- **Dataset:** Internally curated diagram dataset 🖼️
- **Fine-tuning:** LoRA-based optimization ⚡ (a sketch follows this list)
- **Precision:** bf16 mixed-precision training 🎯
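
As a rough illustration of such a LoRA setup with `peft` (the actual rank, alpha, and target modules used in training are not published here, so the values below are placeholder assumptions):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", device_map="auto", torch_dtype=torch.bfloat16
)

# Hypothetical hyperparameters -- for illustration only.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # Only the adapter weights are trainable.
```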
|
|
|
## 📊 Evaluation

- **Metrics:** F1-score, computed per sample for nodes and edges 📈 (a sketch follows this list)
- **Limitations:** May struggle with **complex, dense diagrams** ⚠️
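
A minimal sketch of how node/edge F1 can be computed, assuming predictions and ground truth are compared as sets of exact-match items (the evaluation's actual matching criterion may be more lenient):

```python
def f1_score(predicted: set, expected: set) -> float:
    """F1 over two sets of items (e.g. node labels or edge tuples)."""
    if not predicted or not expected:
        return 0.0
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: edges as (source, target) pairs.
pred = {("start", "load"), ("load", "clean")}
gold = {("start", "load"), ("load", "clean"), ("clean", "train")}
print(round(f1_score(pred, gold), 2))  # 0.8
```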
|
## Results |
|
|
|
- **Node detection:** mean F1 improved from 0.749 to 0.891 (**+14 points**)
- **Edge detection:** mean F1 improved from 0.4605 to 0.6945 (**+23 points**)
|
|
|
| Sample          | Node F1 (base) | Node F1 (fine-tuned) | Edge F1 (base) | Edge F1 (fine-tuned) |
| --------------- | -------------- | -------------------- | -------------- | -------------------- |
| image_sample_1  | 0.46           | 1.0                  | 0.59           | 0.71                 |
| image_sample_2  | 0.67           | 0.57                 | 0.25           | 0.25                 |
| image_sample_3  | 1.0            | 1.0                  | 0.25           | 0.75                 |
| image_sample_4  | 0.5            | 0.83                 | 0.15           | 0.62                 |
| image_sample_5  | 0.72           | 0.78                 | 0.0            | 0.48                 |
| image_sample_6  | 0.75           | 0.75                 | 0.29           | 0.67                 |
| image_sample_7  | 0.6            | 1.0                  | 1.0            | 1.0                  |
| image_sample_8  | 0.6            | 1.0                  | 1.0            | 1.0                  |
| image_sample_9  | 1.0            | 1.0                  | 0.55           | 0.77                 |
| image_sample_10 | 0.67           | 0.8                  | 0.0            | 1.0                  |
| image_sample_11 | 0.8            | 0.8                  | 0.5            | 1.0                  |
| image_sample_12 | 0.67           | 1.0                  | 0.62           | 0.75                 |
| image_sample_13 | 1.0            | 1.0                  | 0.73           | 0.67                 |
| image_sample_14 | 0.74           | 0.95                 | 0.56           | 0.67                 |
| image_sample_15 | 0.86           | 0.71                 | 0.67           | 0.67                 |
| image_sample_16 | 0.75           | 1.0                  | 0.8            | 0.75                 |
| image_sample_17 | 0.8            | 1.0                  | 0.63           | 0.73                 |
| image_sample_18 | 0.83           | 0.83                 | 0.33           | 0.43                 |
| image_sample_19 | 0.75           | 0.8                  | 0.06           | 0.22                 |
| image_sample_20 | 0.81           | 1.0                  | 0.23           | 0.75                 |
| **Mean**        | 0.749          | **0.891**            | 0.4605         | **0.6945**           |
|
|
|
|
|
## 🤝 Collaboration

Interested in fine-tuning a model for your own use case, or in exploring how we can help you? Let's collaborate.
|
|
|
[Zackriya Solutions](https://www.zackriya.com/collaboration-form) |
|
|
|
## 📚 References
|
- [Roboflow](https://github.com/roboflow/notebooks/blob/main/notebooks/how-to-finetune-qwen2-5-vl-for-json-data-extraction.ipynb) |
|
- [Qwen](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) |
|
|
|
<h3 align='center'>
✨ Stay Curious & Keep Exploring! ✨
</h3>