giulio98's Collections
KV Cache Compression
SnapKV: LLM Knows What You are Looking for Before Generation (arXiv:2404.14469)
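A minimal sketch of SnapKV's core step, which scores prefix positions by how much the last few prompt tokens (the "observation window") attend to them, pools the scores so kept positions form clusters, and retains the top scorers plus the window itself. Single-head code; the function name, window size, pooling kernel, and budget are illustrative choices, not the paper's API.

```python
import torch
import torch.nn.functional as F

def snapkv_select(keys, values, queries, window=32, budget=256):
    """keys/values/queries: (seq, d) for one head over the prompt.
    Returns a compressed cache of at most budget + window positions."""
    seq, d = keys.shape
    if seq <= budget + window:
        return keys, values
    obs_q = queries[-window:]                    # observation-window queries
    prefix_k = keys[:-window]
    scores = (obs_q @ prefix_k.T) / d ** 0.5     # (window, seq - window)
    votes = scores.softmax(dim=-1).sum(dim=0)    # aggregate over the window
    # max-pool so selected positions come in small clusters, as SnapKV does
    votes = F.max_pool1d(votes[None, None], 7, stride=1, padding=3)[0, 0]
    keep = votes.topk(budget).indices.sort().values
    idx = torch.cat([keep, torch.arange(seq - window, seq)])
    return keys[idx], values[idx]

k, v, q = (torch.randn(1024, 64) for _ in range(3))
ck, cv = snapkv_select(k, v, q)
print(ck.shape)  # torch.Size([288, 64])
```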
Finch: Prompt-guided Key-Value Cache Compression (arXiv:2408.00167)
Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning (arXiv:2503.04973)
A Simple and Effective L_2 Norm-Based Strategy for KV Cache Compression (arXiv:2406.11430)
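The paper's finding is that keys with a low L_2 norm tend to receive high attention, so the cache can be compressed by keeping only the lowest-norm keys, with no attention scores computed at all. A minimal sketch under that reading; the names and keep ratio are my own.

```python
import torch

def l2_compress(keys, values, keep_ratio=0.5):
    """keys/values: (seq, d). Keep the fraction of positions whose key
    vectors have the smallest L2 norm (low norm ~ high attention)."""
    n_keep = max(1, int(keys.shape[0] * keep_ratio))
    norms = keys.norm(dim=-1)
    keep = norms.topk(n_keep, largest=False).indices.sort().values
    return keys[keep], values[keep]

k, v = torch.randn(512, 64), torch.randn(512, 64)
ck, cv = l2_compress(k, v)
print(ck.shape)  # torch.Size([256, 64])
```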
FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation (arXiv:2502.01068)
ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference (arXiv:2502.00299)
Efficient Streaming Language Models with Attention Sinks (arXiv:2309.17453)
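StreamingLLM's eviction rule keeps the first few tokens, which act as "attention sinks" absorbing attention mass, plus a sliding window of recent tokens, and drops everything in between. A minimal sketch with illustrative defaults:

```python
import torch

def sink_window_evict(keys, values, n_sink=4, window=1020):
    """keys/values: (seq, d). Keep n_sink initial tokens + recent window."""
    seq = keys.shape[0]
    if seq <= n_sink + window:
        return keys, values
    idx = torch.cat([torch.arange(n_sink), torch.arange(seq - window, seq)])
    return keys[idx], values[idx]
```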
Transformers are Multi-State RNNs (arXiv:2401.06104)
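This paper views a size-capped KV cache as a multi-state RNN and proposes the TOVA policy: whenever the cache exceeds its budget, drop the position the current query attends to least. A minimal single-head sketch; the names are hypothetical.

```python
import torch

def tova_step(keys, values, q, budget=512):
    """keys/values: (cache, d); q: (d,) query of the token being decoded."""
    if keys.shape[0] <= budget:
        return keys, values
    attn = (keys @ q) / keys.shape[-1] ** 0.5
    drop = attn.argmin()                          # least-attended position
    keep = torch.arange(keys.shape[0]) != drop
    return keys[keep], values[keep]
```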
H_2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (arXiv:2306.14048)
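H_2O keeps "heavy hitters," cached tokens with large accumulated attention mass, alongside the most recent tokens, and evicts the rest. The sketch below assumes the caller maintains acc_scores, a running sum of the attention each cached position has received during decoding; all names are illustrative.

```python
import torch

def h2o_evict(keys, values, acc_scores, budget=256, recent=32):
    """keys/values: (seq, d); acc_scores: (seq,). Keep top-budget heavy
    hitters among older tokens plus the last `recent` tokens."""
    seq = keys.shape[0]
    if seq <= budget + recent:
        return keys, values, acc_scores
    keep = acc_scores[:-recent].topk(budget).indices.sort().values
    idx = torch.cat([keep, torch.arange(seq - recent, seq)])
    return keys[idx], values[idx], acc_scores[idx]
```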
Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression (arXiv:2503.02812)
ThinK: Thinner Key Cache by Query-Driven Pruning (arXiv:2407.21018)
LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation (arXiv:2410.13846)
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads (arXiv:2410.10819)
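DuoAttention separates heads into retrieval heads, which keep the full cache, and streaming heads, which get by on attention sinks plus a recent window. The paper learns this split with trainable gates; the sketch below takes the per-head labels as given and shows only the resulting cache policy.

```python
import torch

def duo_compress(keys, values, is_retrieval_head, n_sink=4, window=256):
    """keys/values: (heads, seq, d); is_retrieval_head: (heads,) bool.
    Returns ragged per-head key/value lists."""
    heads, seq, _ = keys.shape
    full_idx = torch.arange(seq)
    stream_idx = full_idx if seq <= n_sink + window else torch.cat(
        [torch.arange(n_sink), torch.arange(seq - window, seq)])
    out_k, out_v = [], []
    for h in range(heads):
        idx = full_idx if is_retrieval_head[h] else stream_idx
        out_k.append(keys[h, idx])
        out_v.append(values[h, idx])
    return out_k, out_v
```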
Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time (arXiv:2305.17118)
PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling (arXiv:2406.02069)
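PyramidKV's premise is that attention "funnels" with depth: lower layers attend broadly and deserve larger cache budgets, upper layers attend narrowly and need less. The linear ramp below is only an illustrative schedule that roughly preserves the total budget of a uniform allocation; the paper's actual allocation rule differs in detail.

```python
def pyramid_budgets(n_layers=32, total=32 * 256, min_budget=64):
    """Linearly decreasing per-layer KV budgets, largest at layer 0."""
    avg = total // n_layers
    top = min_budget
    bottom = 2 * avg - top          # symmetric ramp keeps ~the same total
    step = (bottom - top) / (n_layers - 1)
    return [round(bottom - i * step) for i in range(n_layers)]

print(pyramid_budgets()[:3], pyramid_budgets()[-3:])
# [448, 436, 423] [89, 76, 64]
```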