Perception Encoder: The best visual embeddings are not at the output of the network Paper • 2504.13181 • Published 11 days ago • 31
Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure Paper • 2504.10049 • Published 14 days ago • 3
Article Welcome Gemma 3: Google's all new multimodal, multilingual, long context open LLM Mar 12 • 403
EuroBERT: Scaling Multilingual Encoders for European Languages Paper • 2503.05500 • Published Mar 7 • 78
Article A Deepdive into Aya Vision: Advancing the Frontier of Multilingual Multimodality Mar 4 • 74
Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning Paper • 2502.06533 • Published Feb 10 • 18
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model Paper • 2502.02737 • Published Feb 4 • 229
Contextual Position Encoding: Learning to Count What's Important Paper • 2405.18719 • Published May 29, 2024 • 5
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22, 2024 • 131
Harvesting Textual and Structured Data from the HAL Publication Repository Paper • 2407.20595 • Published Jul 30, 2024 • 22
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens Paper • 2406.11271 • Published Jun 17, 2024 • 21
Article Docmatix - a huge dataset for Document Visual Question Answering Jul 18, 2024 • 72