---
datasets:
- NTT-hil-insight/opendocvqa-corpus
- NTT-hil-insight/opendocvqa
language:
- en
base_model:
- microsoft/Phi-3-vision-128k-instruct
library_name: peft
---
# VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
**VDocRAG** is a new RAG framework that can directly understand diverse real-world documents purely from visual features. It was introduced in the paper [VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents](http://arxiv.org/abs/2504.09795) by Tanaka et al. and first released in [this repository](https://github.com/nttmdlab-nlp/VDocRAG).
**Key Enhancements of VDocRAG:**
- **New Pretraining Tasks:** We propose novel self-supervised pretraining tasks (**RCR** and **RCG**) that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with the textual content in documents (a generic illustration of this idea is sketched below this list).
- **New Dataset:** We introduce **OpenDocVQA**, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats.
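The following is a minimal, generic sketch of the idea behind that alignment, not the actual RCR/RCG implementation: visual token states are compressed into one dense vector per document and contrastively aligned with an embedding of the paired document text. Function names, dimensions, and the temperature value are illustrative choices, not values from the paper.

```py
# Generic illustration only: NOT the actual RCR/RCG objectives from the paper.
# It sketches compressing visual token states into one dense vector per document
# and aligning it with the paired document text via an InfoNCE-style loss.
import torch
import torch.nn.functional as F

def compress(visual_states: torch.Tensor) -> torch.Tensor:
    """Mean-pool per-document visual token states (B, T, D) into dense vectors (B, D)."""
    return F.normalize(visual_states.mean(dim=1), dim=-1)

def alignment_loss(visual_states: torch.Tensor, text_embeddings: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Contrastively align compressed visual vectors with text embeddings (positives on the diagonal)."""
    doc_vecs = compress(visual_states)                # (B, D)
    text_vecs = F.normalize(text_embeddings, dim=-1)  # (B, D)
    logits = doc_vecs @ text_vecs.T / temperature     # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Toy tensors standing in for model outputs on a batch of 4 document images
loss = alignment_loss(torch.randn(4, 256, 3072), torch.randn(4, 3072))
print(loss)
```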
## Model Description
The model, `NTT-hil-insight/VDocGenerator-Phi3-vision`, was trained on the QA pairs in `NTT-hil-insight/OpenDocVQA` together with the corpus `NTT-hil-insight/OpenDocVQA-Corpus`, adapting the vision-language model `microsoft/Phi-3-vision-128k-instruct` as the VDocRAG generator for open-domain question answering scenarios.
`NTT-hil-insight/VDocGenerator-Phi3-vision` is an autoregressive model designed to generate answers based on retrieved document images.
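As a rough sketch of where this generator sits in the overall pipeline, the snippet below retrieves the top-k page images for a query and then lets the generator answer conditioned on them. `retrieve_topk_pages` and `answer_with_vdocrag` are hypothetical placeholders for illustration (the actual retriever and indexing code are in [our repository](https://github.com/nttmdlab-nlp/VDocRAG)); the generation call mirrors the Usage section below.

```py
# Illustrative pipeline sketch. `retrieve_topk_pages` is a hypothetical placeholder
# for the retrieval stage; see the official VDocRAG repository for the real pipeline.
from typing import List
from PIL import Image

def retrieve_topk_pages(query: str, corpus: List[Image.Image], k: int = 3) -> List[Image.Image]:
    # Placeholder: the real pipeline ranks corpus pages by dense similarity to the query.
    return corpus[:k]

def answer_with_vdocrag(query: str, corpus: List[Image.Image], model, processor, k: int = 3) -> str:
    doc_images = retrieve_topk_pages(query, corpus, k=k)
    # Build the multi-image prompt expected by Phi-3-vision
    image_tokens = "\n".join(f"<|image_{i+1}|>" for i in range(len(doc_images)))
    messages = [{"role": "user", "content": f"{image_tokens}\n{query}"}]
    prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    processed = processor(prompt, images=doc_images, return_tensors="pt").to('cuda:0')
    # Generate and decode only the newly generated tokens
    generate_ids = model.generate(processed,
                                  generation_args={"max_new_tokens": 64,
                                                   "do_sample": False,
                                                   "eos_token_id": processor.tokenizer.eos_token_id})
    generate_ids = generate_ids[:, processed['input_ids'].shape[1]:]
    return processor.batch_decode(generate_ids, skip_special_tokens=True)[0].strip()
```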
## Usage
Here, we show an easy way to download our model from the HuggingFace Hub and run it quickly. Make sure to install the dependencies and packages as described in [VDocRAG/README.md](https://github.com/nttmdlab-nlp/VDocRAG/README.md). To run our full inference pipeline with a retrieval system, please use [our code](https://github.com/nttmdlab-nlp/VDocRAG).
```py
from PIL import Image
import requests
from io import BytesIO
import torch
from transformers import AutoProcessor
from vdocrag.vdocgenerator.modeling import VDocGenerator
# Load the base VLM and apply the VDocGenerator LoRA adapter
model = VDocGenerator.load('microsoft/Phi-3-vision-128k-instruct',
                           lora_name_or_path='NTT-hil-insight/VDocGenerator-Phi3-vision',
                           trust_remote_code=True,
                           attn_implementation="flash_attention_2",
                           torch_dtype=torch.bfloat16,
                           use_cache=False).to('cuda:0')
# Load the processor of the base model
processor = AutoProcessor.from_pretrained('microsoft/Phi-3-vision-128k-instruct', trust_remote_code=True)

# Load the retrieved document image(s). The URL below is a placeholder; substitute the
# page images returned by your retriever (the expected prediction further down assumes a
# page reporting 2017 international visitor statistics for Japan).
image_urls = ["https://example.com/document_page.png"]
doc_images = [Image.open(BytesIO(requests.get(url).content)) for url in image_urls]

# Process the images together with the prompt
query = "How many international visitors came to Japan in 2017? \n Answer briefly."
image_tokens = "\n".join([f"<|image_{i+1}|>" for i in range(len(doc_images))])
messages = [{"role": "user", "content": f"{image_tokens}\n{query}"}]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
processed = processor(prompt, images=doc_images, return_tensors="pt").to('cuda:0')
# Generate the answer
generate_ids = model.generate(processed,
                              generation_args={
                                  "max_new_tokens": 64,
                                  "temperature": 0.0,
                                  "do_sample": False,
                                  "eos_token_id": processor.tokenizer.eos_token_id
                              })
generate_ids = generate_ids[:, processed['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids,
                                  skip_special_tokens=True,
                                  clean_up_tokenization_spaces=False)[0].strip()
print("Model prediction: {0}".format(response))
# >> Model prediction: 28.69m
```
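Note that loading the model with `attn_implementation="flash_attention_2"` and `torch_dtype=torch.bfloat16` assumes a compatible GPU and the `flash-attn` package are available; if they are not, you can likely drop these arguments and fall back to the default attention implementation, at some cost in speed and memory.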
## License
The models and weights of VDocRAG in this repository are released under the [NTT License](https://huggingface.co/NTT-hil-insight/VDocGenerator-Phi3-vision/blob/main/LICENSE).
## Citation
```bibtex
@inproceedings{tanaka2025vdocrag,
author = {Ryota Tanaka and
Taichi Iki and
Taku Hasegawa and
Kyosuke Nishida and
Kuniko Saito and
Jun Suzuki},
title = {VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents},
booktitle = {CVPR},
year = {2025}
}
```