Low-Rank Adapters Meet Neural Architecture Search for LLM Compression • arXiv:2501.16372 • Published Jan 23, 2025
TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models • arXiv:2501.16937 • Published Jan 28, 2025
Identifying Sensitive Weights via Post-quantization Integral • arXiv:2503.01901 • Published Feb 28, 2025
EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test • arXiv:2503.01840 • Published Mar 3, 2025
SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models • arXiv:2503.07605 • Published Mar 10, 2025
DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs • arXiv:2503.07067 • Published Mar 10, 2025
Efficient Distillation of Classifier-Free Guidance using Adapters • arXiv:2503.07274 • Published Mar 10, 2025
RayFlow: Instance-Aware Diffusion Acceleration via Adaptive Flow Trajectories • arXiv:2503.07699 • Published Mar 10, 2025
QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge • arXiv:2503.16709 • Published Mar 20, 2025
LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation • arXiv:2503.19950 • Published Mar 2025
AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation • arXiv:2503.19693 • Published Mar 2025
Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models • arXiv:2503.24377 • Published Mar 2025
ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers • arXiv:2504.00502 • Published Apr 2025
Distillation and Refinement of Reasoning in Small Language Models for Document Re-ranking • arXiv:2504.03947 • Published Apr 2025
HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference • arXiv:2504.05897 • Published Apr 2025
Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence • arXiv:2503.20533 • Published Mar 2025
SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning • arXiv:2504.07891 • Published Apr 2025
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters • arXiv:2504.08791 • Published Apr 2025
Adaptive Computation Pruning for the Forgetting Transformer • arXiv:2504.06949 • Published Apr 2025
70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float • arXiv:2504.11651 • Published Apr 2025