Pre-training image-language transformers for open-vocabulary tasks
Abstract
We present a pre-training approach for vision and language transformer models based on a mixture of diverse tasks. We explore both the use of image-text captioning data in pre-training, which does not need additional supervision, and object-aware strategies to pre-train the model. We evaluate the method on a number of text-generative vision+language tasks, such as Visual Question Answering, visual entailment and captioning, and demonstrate large gains over standard pre-training methods.
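The abstract does not spell out how the pre-training tasks are combined, so the sketch below is only an illustration of one plausible setup: sampling one objective per training step from a mixture of an image-text captioning loss and an object-aware (object-tag generation) loss. All names (ToyVLModel, captioning_loss, object_aware_loss), the uniform task sampling, and the toy architecture are assumptions, not the paper's actual model or recipe.

```python
import random
import torch
import torch.nn as nn

class ToyVLModel(nn.Module):
    """Toy stand-in for a vision+language transformer (hypothetical)."""
    def __init__(self, dim=64, vocab=1000, image_feat_dim=2048):
        super().__init__()
        self.image_proj = nn.Linear(image_feat_dim, dim)  # assumed image-feature size
        self.text_embed = nn.Embedding(vocab, dim)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, image_feats, text_ids):
        # Prefix the text embeddings with the projected image feature,
        # then predict the next token at every position.
        img = self.image_proj(image_feats).unsqueeze(1)       # (B, 1, D)
        txt = self.text_embed(text_ids)                       # (B, T, D)
        hidden, _ = self.decoder(torch.cat([img, txt], dim=1))
        return self.lm_head(hidden[:, :-1])                   # (B, T, V), aligned with targets

def captioning_loss(model, batch):
    """Image-text captioning objective: generate the caption from the image."""
    logits = model(batch["image_feats"], batch["caption_ids"])
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), batch["caption_ids"].reshape(-1))

def object_aware_loss(model, batch):
    """Object-aware objective sketched as object-tag generation (an assumption)."""
    logits = model(batch["image_feats"], batch["object_tag_ids"])
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), batch["object_tag_ids"].reshape(-1))

# Mixture of pre-training tasks: pick one objective per step.
TASKS = [captioning_loss, object_aware_loss]

def pretrain(model, loader, steps=1000, lr=1e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _, batch in zip(range(steps), loader):
        loss_fn = random.choice(TASKS)   # uniform task sampling (an assumption)
        loss = loss_fn(model, batch)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

In practice the sampling weights over tasks, the objective formulations, and the backbone would follow the paper's actual design; this sketch only shows the general shape of mixing captioning and object-aware objectives during pre-training.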