---
datasets:
- NTT-hil-insight/opendocvqa-corpus
- NTT-hil-insight/opendocvqa
language:
- en
base_model:
- microsoft/Phi-3-vision-128k-instruct
library_name: peft
---
# VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
**VDocRAG** is a new RAG framework that can directly understand diverse real-world documents purely from visual features. It was introduced in the paper [VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents](http://arxiv.org/abs/2504.09795) by Tanaka et al. and first released in [this repository](https://github.com/nttmdlab-nlp/VDocRAG).
**Key Enhancements of VDocRAG:**
- **New Pretraining Tasks:** We propose novel self-supervised pretraining tasks (**RCR** and **RCG**) that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with the textual content in documents (a generic illustration of this idea is sketched below this list).
- **New Dataset:** We introduce **OpenDocVQA**, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats.
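The following is a minimal, generic sketch of the idea behind that alignment, not the actual RCR/RCG implementation: visual token states are compressed into one dense vector per document and contrastively aligned with an embedding of the paired document text. Function names, dimensions, and the temperature value are illustrative choices, not values from the paper.

```py
# Generic illustration only: NOT the actual RCR/RCG objectives from the paper.
# It sketches compressing visual token states into one dense vector per document
# and aligning it with the paired document text via an InfoNCE-style loss.
import torch
import torch.nn.functional as F

def compress(visual_states: torch.Tensor) -> torch.Tensor:
    """Mean-pool per-document visual token states (B, T, D) into dense vectors (B, D)."""
    return F.normalize(visual_states.mean(dim=1), dim=-1)

def alignment_loss(visual_states: torch.Tensor, text_embeddings: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Contrastively align compressed visual vectors with text embeddings (positives on the diagonal)."""
    doc_vecs = compress(visual_states)                # (B, D)
    text_vecs = F.normalize(text_embeddings, dim=-1)  # (B, D)
    logits = doc_vecs @ text_vecs.T / temperature     # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Toy tensors standing in for model outputs on a batch of 4 document images
loss = alignment_loss(torch.randn(4, 256, 3072), torch.randn(4, 3072))
print(loss)
```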
## Model Description
The model, `NTT-hil-insight/VDocGenerator-Phi3-vision`, was trained on the QA pairs in `NTT-hil-insight/OpenDocVQA` together with the corpus `NTT-hil-insight/OpenDocVQA-Corpus`, adapting the vision-language model `microsoft/Phi-3-vision-128k-instruct` as the VDocRAG generator for open-domain question answering scenarios.
`NTT-hil-insight/VDocGenerator-Phi3-vision` is an autoregressive model designed to generate answers based on retrieved document images.
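As a rough sketch of where this generator sits in the overall pipeline, the snippet below retrieves the top-k page images for a query and then lets the generator answer conditioned on them. `retrieve_topk_pages` and `answer_with_vdocrag` are hypothetical placeholders for illustration (the actual retriever and indexing code are in [our repository](https://github.com/nttmdlab-nlp/VDocRAG)); the generation call mirrors the Usage section below.

```py
# Illustrative pipeline sketch. `retrieve_topk_pages` is a hypothetical placeholder
# for the retrieval stage; see the official VDocRAG repository for the real pipeline.
from typing import List
from PIL import Image

def retrieve_topk_pages(query: str, corpus: List[Image.Image], k: int = 3) -> List[Image.Image]:
    # Placeholder: the real pipeline ranks corpus pages by dense similarity to the query.
    return corpus[:k]

def answer_with_vdocrag(query: str, corpus: List[Image.Image], model, processor, k: int = 3) -> str:
    doc_images = retrieve_topk_pages(query, corpus, k=k)
    # Build the multi-image prompt expected by Phi-3-vision
    image_tokens = "\n".join(f"<|image_{i+1}|>" for i in range(len(doc_images)))
    messages = [{"role": "user", "content": f"{image_tokens}\n{query}"}]
    prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    processed = processor(prompt, images=doc_images, return_tensors="pt").to('cuda:0')
    # Generate and decode only the newly generated tokens
    generate_ids = model.generate(processed,
                                  generation_args={"max_new_tokens": 64,
                                                   "do_sample": False,
                                                   "eos_token_id": processor.tokenizer.eos_token_id})
    generate_ids = generate_ids[:, processed['input_ids'].shape[1]:]
    return processor.batch_decode(generate_ids, skip_special_tokens=True)[0].strip()
```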
## Usage
Here, we show an easy way to download our model from the HuggingFace Hub and run it quickly. Make sure to install the dependencies and packages as described in [VDocRAG/README.md](https://github.com/nttmdlab-nlp/VDocRAG/README.md). To run our full inference pipeline with a retrieval system, please use [our code](https://github.com/nttmdlab-nlp/VDocRAG).
```py
from PIL import Image
import requests
from io import BytesIO
import torch
from transformers import AutoProcessor
from vdocrag.vdocgenerator.modeling import VDocGenerator
# Load the base VLM and apply the VDocGenerator LoRA adapter
model = VDocGenerator.load('microsoft/Phi-3-vision-128k-instruct',
                           lora_name_or_path='NTT-hil-insight/VDocGenerator-Phi3-vision',
                           trust_remote_code=True,
                           attn_implementation="flash_attention_2",
                           torch_dtype=torch.bfloat16,
                           use_cache=False).to('cuda:0')
# Load the processor of the base model
processor = AutoProcessor.from_pretrained('microsoft/Phi-3-vision-128k-instruct', trust_remote_code=True)

# Load the retrieved document image(s). The URL below is a placeholder; substitute the
# page images returned by your retriever (the expected prediction further down assumes a
# page reporting 2017 international visitor statistics for Japan).
image_urls = ["https://example.com/document_page.png"]
doc_images = [Image.open(BytesIO(requests.get(url).content)) for url in image_urls]

# Process the images together with the prompt
query = "How many international visitors came to Japan in 2017? \n Answer briefly."
image_tokens = "\n".join([f"<|image_{i+1}|>" for i in range(len(doc_images))])
messages = [{"role": "user", "content": f"{image_tokens}\n{query}"}]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
processed = processor(prompt, images=doc_images, return_tensors="pt").to('cuda:0')
# Generate the answer
generate_ids = model.generate(processed,
                              generation_args={
                                  "max_new_tokens": 64,
                                  "temperature": 0.0,
                                  "do_sample": False,
                                  "eos_token_id": processor.tokenizer.eos_token_id
                              })
generate_ids = generate_ids[:, processed['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids,
                                  skip_special_tokens=True,
                                  clean_up_tokenization_spaces=False)[0].strip()
print("Model prediction: {0}".format(response))
# >> Model prediction: 28.69m
```
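Note that loading the model with `attn_implementation="flash_attention_2"` and `torch_dtype=torch.bfloat16` assumes a compatible GPU and the `flash-attn` package are available; if they are not, you can likely drop these arguments and fall back to the default attention implementation, at some cost in speed and memory.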
## License
The models and weights of VDocRAG in this repository are released under the [NTT License](https://huggingface.co/NTT-hil-insight/VDocGenerator-Phi3-vision/blob/main/LICENSE).
## Citation
```bibtex
@inproceedings{tanaka2025vdocrag,
author = {Ryota Tanaka and
Taichi Iki and
Taku Hasegawa and
Kyosuke Nishida and
Kuniko Saito and
Jun Suzuki},
title = {VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents},
booktitle = {CVPR},
year = {2025}
}
```