Model is loaded with use_cache=False by default

#4
by shawnghu

This differs from the behavior of the base Qwen2.5-32B-Instruct model, which made it misleading.
The setting carries over to inference with vLLM and can be a simple source of inference slowdown by a factor of 3x or more (3x in my case; I suppose it would be worse if KV computation dominated even more relative to attention).

For anyone who runs into this: the fix is straightforward, just set use_cache to True for the model (with vLLM, use the hf_overrides parameter on LLM or the corresponding server flag).
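
A minimal sketch of both options (the model id below is a placeholder for this repo, and hf_overrides assumes a reasonably recent vLLM):

```python
from transformers import AutoModelForCausalLM
from vllm import LLM

# Placeholder id; substitute the actual model repo.
MODEL_ID = "your-org/your-qwen2.5-32b-finetune"

# Plain transformers: override the config's use_cache=False at load time
# so generation reuses the KV cache instead of recomputing the prefix.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    use_cache=True,  # config kwarg override
)

# vLLM: pass the same config override through hf_overrides.
llm = LLM(
    model=MODEL_ID,
    hf_overrides={"use_cache": True},
)

# For the OpenAI-compatible server, the equivalent is:
#   vllm serve your-org/your-qwen2.5-32b-finetune --hf-overrides '{"use_cache": true}'
```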
