The inference speed of the AWQ model is slower than that of the original model
#4 opened by wakakakakawa
When I deployed the 32B and 32B-AWQ versions with vLLM on 4 A100 (40G) GPUs, I found that inference on the same image took significantly longer with the AWQ version than with the original: 13.7 s on average for the original versus 25 s for AWQ. Why is this?
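
For reference, here is a minimal sketch of how the two deployments might be timed with vLLM's Python API. The model ID and prompt are placeholders, not from the original setup, and the image input used in the actual test is omitted for brevity:

```python
import time

from vllm import LLM, SamplingParams

# Placeholder model ID; substitute the actual 32B-AWQ checkpoint,
# or the unquantized 32B checkpoint for the baseline run.
MODEL = "path/to/model-32B-AWQ"

llm = LLM(
    model=MODEL,
    quantization="awq",      # drop this line for the unquantized baseline
    tensor_parallel_size=4,  # 4x A100 40G, as in the post
)

params = SamplingParams(temperature=0.0, max_tokens=512)

# Time a single end-to-end generation request.
start = time.perf_counter()
outputs = llm.generate(["Describe the image."], params)
print(f"latency: {time.perf_counter() - start:.1f} s")
```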