---
datasets:
- NTT-hil-insight/opendocvqa-corpus
- NTT-hil-insight/opendocvqa
language:
- en
base_model:
- microsoft/Phi-3-vision-128k-instruct
library_name: peft
---

# VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents

**VDocRAG** is a new RAG framework that can directly understand diverse real-world documents purely from visual features. It was introduced in the paper [VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents](http://arxiv.org/abs/2504.09795) by Tanaka et al. and first released in [this repository](https://github.com/nttmdlab-nlp/VDocRAG).

**Key Enhancements of VDocRAG:**

- **New Pretraining Tasks:** We propose novel self-supervised pretraining tasks (**RCR** and **RCG**) that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents.
- **New Dataset:** We introduce **OpenDocVQA**, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats.

## Model Description

`NTT-hil-insight/VDocGenerator-Phi3-vision` is the generator component of VDocRAG: an autoregressive model, built on the vision-language model `microsoft/Phi-3-vision-128k-instruct`, that generates answers conditioned on retrieved document images. It was trained on the QA pairs in `NTT-hil-insight/OpenDocVQA` together with the corpus `NTT-hil-insight/OpenDocVQA-Corpus` for open-domain question answering.

## Usage

The snippet below shows an easy way to download the model from the Hugging Face Hub and run it quickly. Make sure to install the dependencies and packages as described in [VDocRAG/README.md](https://github.com/nttmdlab-nlp/VDocRAG/blob/main/README.md). The snippet expects a list of retrieved document page images (`doc_images`); if you do not have retrieved pages at hand, see the sketch in "Loading Example Document Images" below. To run the full inference pipeline with a retrieval system, please use [our code](https://github.com/nttmdlab-nlp/VDocRAG).

```py
from PIL import Image
import requests
from io import BytesIO
import torch
from transformers import AutoProcessor
from vdocrag.vdocgenerator.modeling import VDocGenerator

# Load the processor of the base vision-language model
processor = AutoProcessor.from_pretrained('microsoft/Phi-3-vision-128k-instruct', trust_remote_code=True)

# Load the base model and apply the VDocGenerator LoRA adapter
model = VDocGenerator.load('microsoft/Phi-3-vision-128k-instruct',
                           lora_name_or_path='NTT-hil-insight/VDocGenerator-Phi3-vision',
                           trust_remote_code=True,
                           attn_implementation="flash_attention_2",
                           torch_dtype=torch.bfloat16,
                           use_cache=False).to('cuda:0')

# Retrieved document page image(s) to answer from. Replace the placeholder below with your
# own retrieved pages (local files via Image.open(path), or pages fetched over HTTP as shown).
url = "https://example.com/retrieved_page.png"  # placeholder URL
doc_images = [Image.open(BytesIO(requests.get(url).content))]

# Build the prompt: one <|image_i|> token per document image, followed by the question
query = "How many international visitors came to Japan in 2017? \n Answer briefly."
image_tokens = "\n".join([f"<|image_{i+1}|>" for i in range(len(doc_images))])
messages = [{"role": "user", "content": f"{image_tokens}\n{query}"}]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
processed = processor(prompt, images=doc_images, return_tensors="pt").to('cuda:0')

# Generate the answer
generate_ids = model.generate(processed, generation_args={
    "max_new_tokens": 64,
    "temperature": 0.0,
    "do_sample": False,
    "eos_token_id": processor.tokenizer.eos_token_id
})
generate_ids = generate_ids[:, processed['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0].strip()

print("Model prediction: {0}".format(response))
# >> Model prediction: 28.69m
```

## License

The models and weights of VDocRAG in this repository are released under the [NTT License](https://huggingface.co/NTT-hil-insight/VDocGenerator-Phi3-vision/blob/main/LICENSE).
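## Loading Example Document Images

The Usage example above assumes you already have a list of retrieved page images (`doc_images`). As a minimal sketch, assuming the corpus is published with a `train` split and an `image` column (check the `NTT-hil-insight/OpenDocVQA-Corpus` dataset card for the actual split and field names), a few candidate pages can be streamed from the Hub like this:

```py
from datasets import load_dataset

# Stream the corpus so the full image collection is not downloaded up front.
# The split name and the "image" column are assumptions; verify them on the dataset card.
corpus = load_dataset("NTT-hil-insight/OpenDocVQA-Corpus", split="train", streaming=True)

# Take the first three pages as stand-ins for retrieval results.
doc_images = [example["image"] for example in corpus.take(3)]
```

In the full VDocRAG pipeline, these stand-in pages are replaced by the top-k pages returned by VDocRetriever for each query; see [our code](https://github.com/nttmdlab-nlp/VDocRAG) for the end-to-end setup.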
## Citation

```bibtex
@inproceedings{tanaka2025vdocrag,
  author    = {Ryota Tanaka and Taichi Iki and Taku Hasegawa and Kyosuke Nishida and Kuniko Saito and Jun Suzuki},
  title     = {VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents},
  booktitle = {CVPR},
  year      = {2025}
}
```