Model is loaded with use_cache=False by default
#4
by shawnghu · opened
This behavior differs from that of the base Qwen2.5-32B-Instruct model, which is misleading.
It carries over to inference with vLLM and can be an easy-to-miss source of slowdown by a factor of 3x or more (3x in my case; I suppose it would be worse if the KV computations dominate relative to attention).
For anyone who hits this, the fix is simply to set use_cache to True for the model (with vLLM, use the hf_overrides parameter on LLM or the server).
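A minimal sketch of both routes, assuming the hf_overrides option available in recent vLLM releases (the model id below is a placeholder for this checkpoint):

```python
from transformers import AutoModelForCausalLM
from vllm import LLM

MODEL = "your/model-repo-id"  # placeholder: substitute this checkpoint's repo id or local path

# Plain transformers: override the config flag when loading.
model = AutoModelForCausalLM.from_pretrained(MODEL, use_cache=True)

# vLLM (Python API): patch the HF config before the engine is built.
llm = LLM(model=MODEL, hf_overrides={"use_cache": True})

# vLLM (OpenAI-compatible server): the equivalent CLI flag is
#   vllm serve <MODEL> --hf-overrides '{"use_cache": true}'
```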