- Time Blindness: Why Video-Language Models Can't See What Humans Can? — arXiv:2505.24867, published 7 days ago
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models — arXiv:2505.24864, published 7 days ago
- To Trust Or Not To Trust Your Vision-Language Model's Prediction — arXiv:2505.23745, published 8 days ago
- Diffusion Classifiers Understand Compositionality, but Conditions Apply — arXiv:2505.17955, published 14 days ago
- Alchemist: Turning Public Text-to-Image Data into Generative Gold — arXiv:2505.19297, published 12 days ago
- Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis — arXiv:2505.10046, published 23 days ago
- Flow-GRPO: Training Flow Matching Models via Online RL — arXiv:2505.05470, published 29 days ago
- You Do Not Fully Utilize Transformer's Representation Capacity — arXiv:2502.09245, published Feb 13
- SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference — arXiv:2502.18137, published Feb 25
- REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers — arXiv:2504.10483, published Apr 14
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models — arXiv:2504.10479, published Apr 14
- Article: You could have designed state of the art positional encoding — by FL33TW00D-HF, Nov 25, 2024