---
datasets:
- NTT-hil-insight/opendocvqa-corpus
- NTT-hil-insight/opendocvqa
language:
- en
base_model:
- microsoft/Phi-3-vision-128k-instruct
library_name: peft
---

# VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents

**VDocRAG** is a new RAG framework that can directly understand diverse real-world documents purely from visual features. It was introduced in the paper [VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents](http://arxiv.org/abs/2504.09795) by Tanaka et al. and first released in [this repository](https://github.com/nttmdlab-nlp/VDocRAG).

**Key Enhancements of VDocRAG:**

- **New Pretraining Tasks:** We propose novel self-supervised pretraining tasks (**RCR** and **RCG**) that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents.
- **New Dataset:** We introduce **OpenDocVQA**, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats.

## Model Description

`NTT-hil-insight/VDocGenerator-Phi3-vision` is the generator component of VDocRAG: an autoregressive model, built on the vision-language model `microsoft/Phi-3-vision-128k-instruct`, that generates answers conditioned on retrieved document images. It was trained on the QA pairs in `NTT-hil-insight/OpenDocVQA` together with the corpus `NTT-hil-insight/OpenDocVQA-Corpus` for open-domain question answering.

## Usage

The snippet below shows an easy way to download the model from the Hugging Face Hub and run it quickly. Make sure to install the dependencies and packages as described in [VDocRAG/README.md](https://github.com/nttmdlab-nlp/VDocRAG/blob/main/README.md). The snippet expects a list of retrieved document page images (`doc_images`); if you do not have retrieved pages at hand, see the sketch in "Loading Example Document Images" below. To run the full inference pipeline with a retrieval system, please use [our code](https://github.com/nttmdlab-nlp/VDocRAG).

```py
from PIL import Image
import requests
from io import BytesIO
import torch
from transformers import AutoProcessor
from vdocrag.vdocgenerator.modeling import VDocGenerator

# Load the processor of the base vision-language model
processor = AutoProcessor.from_pretrained('microsoft/Phi-3-vision-128k-instruct', trust_remote_code=True)

# Load the base model and apply the VDocGenerator LoRA adapter
model = VDocGenerator.load('microsoft/Phi-3-vision-128k-instruct',
                           lora_name_or_path='NTT-hil-insight/VDocGenerator-Phi3-vision',
                           trust_remote_code=True,
                           attn_implementation="flash_attention_2",
                           torch_dtype=torch.bfloat16,
                           use_cache=False).to('cuda:0')

# Retrieved document page image(s) to answer from. Replace the placeholder below with your
# own retrieved pages (local files via Image.open(path), or pages fetched over HTTP as shown).
url = "https://example.com/retrieved_page.png"  # placeholder URL
doc_images = [Image.open(BytesIO(requests.get(url).content))]

# Build the prompt: one <|image_i|> token per document image, followed by the question
query = "How many international visitors came to Japan in 2017? \n Answer briefly."
image_tokens = "\n".join([f"<|image_{i+1}|>" for i in range(len(doc_images))])
messages = [{"role": "user", "content": f"{image_tokens}\n{query}"}]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
processed = processor(prompt, images=doc_images, return_tensors="pt").to('cuda:0')

# Generate the answer
generate_ids = model.generate(processed, generation_args={
    "max_new_tokens": 64,
    "temperature": 0.0,
    "do_sample": False,
    "eos_token_id": processor.tokenizer.eos_token_id
})
generate_ids = generate_ids[:, processed['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0].strip()

print("Model prediction: {0}".format(response))
# >> Model prediction: 28.69m
```

## License

The models and weights of VDocRAG in this repository are released under the [NTT License](https://huggingface.co/NTT-hil-insight/VDocGenerator-Phi3-vision/blob/main/LICENSE).
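## Loading Example Document Images

The Usage example above assumes you already have a list of retrieved page images (`doc_images`). As a minimal sketch, assuming the corpus is published with a `train` split and an `image` column (check the `NTT-hil-insight/OpenDocVQA-Corpus` dataset card for the actual split and field names), a few candidate pages can be streamed from the Hub like this:

```py
from datasets import load_dataset

# Stream the corpus so the full image collection is not downloaded up front.
# The split name and the "image" column are assumptions; verify them on the dataset card.
corpus = load_dataset("NTT-hil-insight/OpenDocVQA-Corpus", split="train", streaming=True)

# Take the first three pages as stand-ins for retrieval results.
doc_images = [example["image"] for example in corpus.take(3)]
```

In the full VDocRAG pipeline, these stand-in pages are replaced by the top-k pages returned by VDocRetriever for each query; see [our code](https://github.com/nttmdlab-nlp/VDocRAG) for the end-to-end setup.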
## Citation

```bibtex
@inproceedings{tanaka2025vdocrag,
  author    = {Ryota Tanaka and Taichi Iki and Taku Hasegawa and Kyosuke Nishida and Kuniko Saito and Jun Suzuki},
  title     = {VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents},
  booktitle = {CVPR},
  year      = {2025}
}
```