Abstract
Recent advances in large language models have demonstrated the effectiveness of length scaling during post-training, yet its potential in pre-training remains underexplored. We present the Parallel Hidden Decoding Transformer (PHD-Transformer), a novel framework that enables efficient length scaling during pre-training while maintaining inference efficiency. PHD-Transformer achieves this through an innovative KV cache management strategy that distinguishes between original tokens and hidden decoding tokens. By retaining only the KV cache of original tokens for long-range dependencies while immediately discarding hidden decoding tokens after use, our approach maintains the same KV cache size as the vanilla transformer while enabling effective length scaling. To further enhance performance, we introduce two optimized variants: PHD-SWA employs sliding window attention to preserve local dependencies, while PHD-CSWA implements chunk-wise sliding window attention to eliminate linear growth in pre-filling time. Extensive experiments demonstrate consistent improvements across multiple benchmarks.
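To make the cache policy described above concrete, here is a minimal single-head, single-layer PyTorch sketch of the idea as stated in the abstract: each original token is repeated into hidden decoding tokens, attention runs over the persistent cache plus the transient hidden-token KV, and only the original token's KV entry is appended to the cache. All names (`decode_step`, `repeat_factor`, `attend`) and the simplified setup are illustrative assumptions, not the paper's implementation.

```python
# Sketch only: single head, single layer, no causal mask among hidden tokens.
import torch
import torch.nn.functional as F

def attend(q, k, v):
    # q: (1, d); k, v: (t, d) -> scaled dot-product attention, single head
    scores = q @ k.T / k.shape[-1] ** 0.5          # (1, t)
    return F.softmax(scores, dim=-1) @ v           # (1, d)

def decode_step(x, cache_k, cache_v, W_q, W_k, W_v, repeat_factor=3):
    """One decoding step with length scaling but a vanilla-sized KV cache."""
    # Repeat the current token to form hidden decoding tokens.
    hidden = x.expand(repeat_factor, -1)           # (K, d)
    q, k, v = hidden @ W_q, hidden @ W_k, hidden @ W_v

    # Hidden tokens attend to the persistent cache plus their own transient KV.
    k_all = torch.cat([cache_k, k], dim=0)
    v_all = torch.cat([cache_v, v], dim=0)
    out = attend(q[-1:], k_all, v_all)             # read out from the last repeat

    # Keep only the original token's KV; discard the hidden tokens' KV,
    # so the cache grows by exactly one entry per original token.
    cache_k = torch.cat([cache_k, k[:1]], dim=0)
    cache_v = torch.cat([cache_v, v[:1]], dim=0)
    return out, cache_k, cache_v

if __name__ == "__main__":
    d = 16
    W_q, W_k, W_v = (torch.randn(d, d) * 0.02 for _ in range(3))
    cache_k = cache_v = torch.empty(0, d)
    for _ in range(5):                             # decode 5 original tokens
        x = torch.randn(1, d)
        out, cache_k, cache_v = decode_step(x, cache_k, cache_v, W_q, W_k, W_v)
    print(cache_k.shape)                           # torch.Size([5, 16]) -- same as vanilla
```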
Community
This paper studies efficient pre-training length scaling. It is the first to show that pre-training performance can be scaled robustly by increasing the pre-training sequence length through repeated training tokens. The paper also applies several sparse attention strategies to keep the KV cache size fixed while scaling length. Overall, it observes scaling benefits as long as the added prefilling and decoding latency remains acceptable.
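As a toy illustration of the sliding-window idea behind the PHD-SWA and PHD-CSWA variants, the sketch below bounds cache growth by keeping only the most recent `window` entries; the window size and cache layout are assumptions for illustration, not details from the paper.

```python
# Sketch only: a sliding-window KV cache that preserves recent local context
# while keeping memory (and per-step attention cost) bounded.
import torch

def append_with_window(cache_k, cache_v, k_new, v_new, window=1024):
    """Append new KV entries and keep only the trailing `window` positions."""
    cache_k = torch.cat([cache_k, k_new], dim=0)[-window:]
    cache_v = torch.cat([cache_v, v_new], dim=0)[-window:]
    return cache_k, cache_v

if __name__ == "__main__":
    d = 16
    cache_k = cache_v = torch.empty(0, d)
    for _ in range(2000):                          # simulate 2000 decoding steps
        k_new, v_new = torch.randn(1, d), torch.randn(1, d)
        cache_k, cache_v = append_with_window(cache_k, cache_v, k_new, v_new)
    print(cache_k.shape)                           # torch.Size([1024, 16]) -- bounded
```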
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- SWAN-GPT: An Efficient and Scalable Approach for Long-Context Language Modeling (2025)
- Sliding Window Attention Training for Efficient Large Language Models (2025)
- LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference (2025)
- Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping (2025)
- MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models (2025)
- Adaptive Layer-skipping in Pre-trained LLMs (2025)
- PromptDistill: Query-based Selective Token Retention in Intermediate Layers for Efficient Large Language Model Inference (2025)