Model is loaded with use_cache=False by default

#4
by shawnghu

This differs from the behavior of the base Qwen2.5-32B-Instruct model, which made it misleading.
The setting carries over to inference with vLLM and can be a simple source of inference slowdown by a factor of 3x or more (3x in my case; I suppose it would be worse if KV computation dominated even more relative to attention).

For anyone who runs into this: the fix is straightforward, just set use_cache to True for the model (with vLLM, use the hf_overrides parameter on LLM or the corresponding server flag).
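
A minimal sketch of both options (the model id below is a placeholder for this repo, and hf_overrides assumes a reasonably recent vLLM):

```python
from transformers import AutoModelForCausalLM
from vllm import LLM

# Placeholder id; substitute the actual model repo.
MODEL_ID = "your-org/your-qwen2.5-32b-finetune"

# Plain transformers: override the config's use_cache=False at load time
# so generation reuses the KV cache instead of recomputing the prefix.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    use_cache=True,  # config kwarg override
)

# vLLM: pass the same config override through hf_overrides.
llm = LLM(
    model=MODEL_ID,
    hf_overrides={"use_cache": True},
)

# For the OpenAI-compatible server, the equivalent is:
#   vllm serve your-org/your-qwen2.5-32b-finetune --hf-overrides '{"use_cache": true}'
```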
