The inference speed of the AWQ model is slower than that of the original model
#4 opened by wakakakakawa
When I deployed the 32B and 32B-AWQ versions with vLLM on 4 A100 (40G) GPUs, I found that inference on the same image took significantly longer with the AWQ version than with the original: 13.7 s on average for the original versus 25 s for AWQ. Why is this?
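
For reference, here is a minimal sketch of how the two deployments might be timed with vLLM's Python API. The model ID and prompt are placeholders, not from the original setup, and the image input used in the actual test is omitted for brevity:

```python
import time

from vllm import LLM, SamplingParams

# Placeholder model ID; substitute the actual 32B-AWQ checkpoint,
# or the unquantized 32B checkpoint for the baseline run.
MODEL = "path/to/model-32B-AWQ"

llm = LLM(
    model=MODEL,
    quantization="awq",      # drop this line for the unquantized baseline
    tensor_parallel_size=4,  # 4x A100 40G, as in the post
)

params = SamplingParams(temperature=0.0, max_tokens=512)

# Time a single end-to-end generation request.
start = time.perf_counter()
outputs = llm.generate(["Describe the image."], params)
print(f"latency: {time.perf_counter() - start:.1f} s")
```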