---
license: apache-2.0
---
<h1 align="left"> [WIP] GAEA: A Geolocation Aware Conversational Model</h1>
<h3 align="left"> Summary</h3>
<p align="justify"> Image geolocalization, in which a model traditionally predicts the precise GPS coordinates of an image, is a challenging task with many downstream applications. A bare coordinate, however, is of limited use on its own: such models lack an understanding of the location and the conversational ability to communicate it to a user. Recently, with the rapid progress of large multimodal models (LMMs)—both proprietary and open-source—researchers have attempted to geolocalize images with LMMs, but the problem remains unsolved: LMMs that perform well on general tasks still struggle on specialized downstream tasks such as geolocalization. In this work, we address this gap with `GAEA`, a conversational model that can provide information about the location of an image on demand. Since no large-scale dataset exists for training such a model, we propose `GAEA-Train`, a comprehensive dataset of 800K images and around 1.6M question-answer pairs constructed by leveraging OpenStreetMap (OSM) attributes and geographical context clues. For quantitative evaluation, we propose `GAEA-Bench`, a diverse benchmark of 4K image-text pairs with varied question types for assessing conversational capabilities. Evaluating 11 state-of-the-art open-source and proprietary LMMs, we demonstrate that `GAEA` outperforms the best open-source model, LLaVA-OneVision, by 25.69% and the best proprietary model, GPT-4o, by 8.28%. We will publicly release our dataset and code. </p>
## `GAEA` is the first open-source conversational model equipped with global-scale geolocalization capabilities.
[**Paper (arXiv)**](https://arxiv.org/abs/2503.16423)
[**Hugging Face Collection**](https://huggingface.co/collections/ucf-crcv/gaea-67d514a61d48eb1708b13a08)
[**Project Page**](https://ucf-crcv.github.io/GAEA/)
**Main contributions:**
1) **`GAEA-Train: A Diverse Training Dataset:`** We propose GAEA-Train, a new dataset designed for training conversational image geolocalization models, incorporating diverse visual and contextual data.
2) **`GAEA-Bench: Evaluating Conversational Geolocalization:`** To assess conversational capabilities in geolocalization, we introduce GAEA-Bench, a benchmark featuring various question-answer formats.
3) **`GAEA: An Interactive Geolocalization Chatbot:`** We present GAEA, a conversational chatbot that extends beyond geolocalization to provide rich contextual insights about locations from images.
4) **`Benchmarking Against State-of-the-Art LMMs:`** We quantitatively compare our model’s performance against 8 open-source and 3 proprietary LMMs, including GPT-4o and Gemini-2.0-Flash.
<b> This page is dedicated to the GAEA model </b>
<p align="center">
<img src="Assets/teaser.jpg" alt="teaser"></a>
</p>
<h2 align="left"> Model Description</h2>
<h3 align="left">Architecture</h3>
<p align="center">
<img src="Assets/arch_iccv.jpg" alt="arch-iccv"></a>
</p>
<h2 align="left"> How To Use</h2>
<h2 align="left">Evaluation Results</h2>
<h3 align="left">Comparison with SoTA LMMs on GAEA-Bench (Conversational) </h3>
<p align="center">
<img src="Assets/GAEA-Benc-Eval.png" alt="GAEA-Benc-Eval"></a>
</p>
<p align="center">
<img src="Assets/question_types_stats.jpg" alt="question-types-stats"></a>
</p>
<h3 align="left">Qualitative Results (Conversational) </h3>
<p align="center">
<img src="Assets/queston_types_qual.jpg" alt="queston-types-qual"></a>
</p>
<h3 align="left">Comparison with Specialized Models on Standard Geolocalization Datasets</h3>
<p align="center">
<img src="Assets/Geolocalization_results.png" alt="Geolocalization_results"></a>
</p>
<h3 align="left">Comparison with best SoTA LMMs on City/Country Prediction </h3>
<p align="center">
<img src="Assets/City_Country_results.jpg" alt="City-Country-results"></a>
</p>