We introduce Trillion-LLaVA-7B, a Vision Language Model (VLM) capable of understanding images.
To study the transfer of multilinguality to vision tasks under controlled conditions, we adopted the same dataset, two-stage training strategy, and model architecture as LLaVA. Although Trillion-LLaVA-7B was trained exclusively on English vision-language instruction pairs, it demonstrates strong performance on Korean visual reasoning tasks. These results indicate that the model's robust multilingual foundation enables visual reasoning capabilities to transfer across languages without language-specific visual training data.
| Model | MMBench En | MMBench Ko | SEED-I En | SEED-I Ko | MMStar En | MMStar Ko | K-DTCB |
|---|---|---|---|---|---|---|---|
| Llava-1.5-7b | 0.64 | 0.43 | 0.66 | 0.52 | 0.34 | 0.33 | 0.30 |
| Llava-1.6-mistral-7b | 0.68 | 0.49 | 0.72 | 0.61 | 0.36 | 0.33 | 0.30 |
| Trillion-LLaVA-7B | 0.66 | 0.61 | 0.68 | 0.66 | 0.37 | 0.37 | 0.33 |
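
For reference, below is a minimal inference sketch. It assumes the checkpoint is published under the repo id `trillionlabs/Trillion-LLaVA-7B` (not confirmed here) and that it follows the standard LLaVA interface in Hugging Face `transformers`, including the LLaVA-1.5-style `USER: <image> ... ASSISTANT:` prompt; adjust the repo id and prompt format to match the released checkpoint.

```python
# Minimal inference sketch for a LLaVA-style checkpoint.
# Assumptions: repo id "trillionlabs/Trillion-LLaVA-7B" and the standard
# LlavaForConditionalGeneration / LlavaProcessor interface in transformers.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "trillionlabs/Trillion-LLaVA-7B"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Any RGB image works; this COCO image URL is only an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# LLaVA-1.5-style prompt (assumed format). The Korean question illustrates
# the cross-lingual transfer described above ("What is in this picture?").
prompt = "USER: <image>\n이 그림에는 무엇이 있나요? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```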
This model repository is licensed under the Apache-2.0 License.
Base model: trillionlabs/Trillion-7B-preview