Inference with the AWQ model is slower than with the original model

#4
by wakakakakawa - opened

When I deployed the 32B and 32B-AWQ versions with vLLM on 4 A100 (40 GB) GPUs, I found that inference on the same image was significantly slower with the AWQ version: an average of 13.7 s for the original model versus 25 s for the AWQ version. Why is this?
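
For reference, a minimal sketch of how such a timing comparison might be run with vLLM's Python API. The model path is a placeholder, and the original report measured image inputs, which would be passed through vLLM's multi-modal inputs rather than the plain text prompt used here; this is an illustration of the setup, not the exact benchmark.

```python
# Minimal timing sketch (assumptions: vLLM Python API, placeholder model
# path passed on the command line, text prompt standing in for the image
# input used in the original report).
import sys
import time
from vllm import LLM, SamplingParams

model = sys.argv[1]  # e.g. ".../32B" or ".../32B-AWQ" (hypothetical paths)
quant = "awq" if model.lower().endswith("awq") else None

# tensor_parallel_size=4 matches the 4 x A100 (40 GB) setup in the report.
llm = LLM(model=model, quantization=quant, tensor_parallel_size=4)
params = SamplingParams(max_tokens=512)

start = time.perf_counter()
llm.generate(["Describe this image in detail."], params)
print(f"{model}: {time.perf_counter() - start:.1f} s")
```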
