|
---
datasets:
- NTT-hil-insight/opendocvqa-corpus
- NTT-hil-insight/opendocvqa
language:
- en
base_model:
- microsoft/Phi-3-vision-128k-instruct
library_name: peft
---
|
# VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents |
|
**VDocRAG** is a new RAG framework that can directly understand diverse real-world documents purely from visual features. It was introduced in the paper [VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents](http://arxiv.org/abs/2504.09795) by Tanaka et al. and first released in [this repository](https://github.com/nttmdlab-nlp/VDocRAG). |
|
|
|
**Key Enhancements of VDocRAG:** |
|
- **New Pretraining Tasks:** We propose novel self-supervised pretraining tasks (**RCR** and **RCG**) that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents (a conceptual sketch follows this list).
|
- **New Dataset:** We introduce **OpenDocVQA**, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. |
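
To make the first point above concrete, here is only a conceptual sketch of what compressing visual tokens into a dense representation and aligning it with document text can look like, written as a generic InfoNCE-style contrastive loss. It is not the RCR/RCG objectives from the paper; the function, tensor names, and dimensions are illustrative assumptions.

```py
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(visual_tokens: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Generic InfoNCE-style alignment between pooled visual tokens and text embeddings.

    visual_tokens: (B, T, d) dense token representations of document images
    text_emb:      (B, d)    embeddings of the corresponding document text
    (Illustrative only; not the paper's RCR/RCG objectives.)
    """
    visual_emb = visual_tokens.mean(dim=1)            # compress the tokens into one vector per page
    visual_emb = F.normalize(visual_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = visual_emb @ text_emb.T / temperature    # (B, B) image-text similarity matrix
    targets = torch.arange(visual_emb.size(0))        # matching pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors
loss = contrastive_alignment_loss(torch.randn(4, 256, 1024), torch.randn(4, 1024))
```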
|
|
|
## Model Description |
|
The model `NTT-hil-insight/VDocGenerator-Phi3-vision` was trained on the QA pairs in `NTT-hil-insight/OpenDocVQA` together with the document corpus `NTT-hil-insight/OpenDocVQA-Corpus`, fine-tuning the vision-language model `microsoft/Phi-3-vision-128k-instruct` for VDocRAG in open-domain question answering scenarios.
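
If you just want to peek at this training data, the sketch below pulls both datasets from the Hub with the `datasets` library. The configuration and split layout (and whether a configuration name is needed at all) is an assumption here, so treat the dataset cards on the Hub as the authoritative reference.

```py
from datasets import get_dataset_config_names, load_dataset

# A minimal sketch for inspecting the training data referenced above.
# The configuration/split layout is an assumption; see the dataset cards on the Hub.
qa_configs = get_dataset_config_names("NTT-hil-insight/OpenDocVQA")
print(qa_configs)                     # QA subsets bundled in OpenDocVQA

qa = load_dataset("NTT-hil-insight/OpenDocVQA", qa_configs[0])
print(qa)                             # splits and features of the first QA subset

corpus_configs = get_dataset_config_names("NTT-hil-insight/OpenDocVQA-Corpus")
corpus = load_dataset("NTT-hil-insight/OpenDocVQA-Corpus", corpus_configs[0])
print(corpus)                         # document page images used as the retrieval corpus
```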
|
|
|
`NTT-hil-insight/VDocGenerator-Phi3-vision` is an autoregressive model designed to generate answers based on retrieved document images. |
|
|
|
## Usage |
|
Here we show an easy way to download our model from the Hugging Face Hub and run it quickly. Make sure to install the dependencies and packages as described in [VDocRAG/README.md](https://github.com/nttmdlab-nlp/VDocRAG/README.md). To run our full inference pipeline with a retrieval system, please use [our code](https://github.com/nttmdlab-nlp/VDocRAG).
|
|
|
```py
from PIL import Image
import requests
from io import BytesIO
import torch
from transformers import AutoProcessor
from vdocrag.vdocgenerator.modeling import VDocGenerator

# Load the processor of the base model and the LoRA-adapted generator
processor = AutoProcessor.from_pretrained('microsoft/Phi-3-vision-128k-instruct',
                                          trust_remote_code=True)
model = VDocGenerator.load('microsoft/Phi-3-vision-128k-instruct',
                           lora_name_or_path='NTT-hil-insight/VDocGenerator-Phi3-vision',
                           trust_remote_code=True,
                           attn_implementation="flash_attention_2",
                           torch_dtype=torch.bfloat16,
                           use_cache=False).to('cuda:0')

# Load the retrieved document page images.
# The URL below is a placeholder; replace it with the pages returned by your retriever.
urls = ["https://example.com/retrieved_page.png"]
doc_images = [Image.open(BytesIO(requests.get(url).content)) for url in urls]

# Build the prompt: one <|image_i|> token per retrieved page, followed by the question
query = "How many international visitors came to Japan in 2017? \n Answer briefly."
image_tokens = "\n".join([f"<|image_{i+1}|>" for i in range(len(doc_images))])
messages = [{"role": "user", "content": f"{image_tokens}\n{query}"}]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
processed = processor(prompt, images=doc_images, return_tensors="pt").to('cuda:0')

# Generate the answer from the retrieved pages
generate_ids = model.generate(processed,
                              generation_args={
                                  "max_new_tokens": 64,
                                  "temperature": 0.0,
                                  "do_sample": False,
                                  "eos_token_id": processor.tokenizer.eos_token_id
                              })
generate_ids = generate_ids[:, processed['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids,
                                  skip_special_tokens=True,
                                  clean_up_tokenization_spaces=False)[0].strip()

print("Model prediction: {0}".format(response))

# >> Model prediction: 28.69m
```
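
The snippet above assumes that `doc_images` already contains the pages returned by a retrieval step. As a rough, hypothetical illustration of that glue (the full pipeline lives in the [VDocRAG repository](https://github.com/nttmdlab-nlp/VDocRAG)), the sketch below ranks precomputed page embeddings against a query embedding by cosine similarity and keeps the top-k pages; `query_emb`, `page_embs`, and `corpus_images` are assumed to come from a retriever such as VDocRetriever.

```py
import torch
from torch.nn.functional import cosine_similarity

def top_k_pages(query_emb: torch.Tensor, page_embs: torch.Tensor, k: int = 3):
    """Return indices of the k pages whose embeddings are closest to the query.

    query_emb: (1, d) query embedding, page_embs: (N, d) page embeddings,
    e.g. produced by a retriever such as VDocRetriever (names are assumptions).
    """
    scores = cosine_similarity(query_emb, page_embs)              # (N,) similarity scores
    return torch.topk(scores, k=min(k, page_embs.size(0))).indices.tolist()

# Toy usage with random embeddings; in practice these come from the retriever.
query_emb = torch.randn(1, 128)
page_embs = torch.randn(10, 128)
top_ids = top_k_pages(query_emb, page_embs, k=3)
# doc_images = [corpus_images[i] for i in top_ids]   # feed these into the generator above
```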
|
|
|
## License |
|
The models and weights of VDocRAG in this repository are released under the [NTT License](https://huggingface.co/NTT-hil-insight/VDocGenerator-Phi3-vision/blob/main/LICENSE).
|
|
|
## Citation |
|
```bibtex |
|
@inproceedings{tanaka2025vdocrag,
  author    = {Ryota Tanaka and
               Taichi Iki and
               Taku Hasegawa and
               Kyosuke Nishida and
               Kuniko Saito and
               Jun Suzuki},
  title     = {VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents},
  booktitle = {CVPR},
  year      = {2025}
}
|
``` |