13B and 34B Pleeease!!! Most people cannot even run this.

#28
by UniversalLove333

This was disappointing. 😪

@UniversalLove333 Vague comments like "This was disappointing" aren't very constructive without specifics, especially when discussing complex MoE models. Often the complaint boils down to a misunderstanding about what it takes to run them.

To clarify that part: there is no such thing as being unable to "run" a model. You can run ANY model, provided you have enough swap space beyond your DRAM and VRAM, so I assume you actually mean "the speed it runs at isn't usable." If this 17Bx128E model is too slow for you, you can run Llama 4 Scout (17Bx16E), which is smaller and should give roughly 17B-class speed even with some layers offloaded to swap. Yes, the total memory footprint is large, but this IS a MoE: only 17B parameters are active per token, so per-token compute and memory traffic scale with the active parameters, not with the full 400B or 109B. Anyone who has actually run a MoE knows how fast they perform relative to their total size; I can't understand how people treat these models as if they were dense 400B or 109B models.

From my own measurements, I can run the model people say can't be run, on worse hardware than yours. Your hardware list shows an NVIDIA RTX 4090 with 24GB VRAM and a 12th-gen i9 with 32GB DRAM, which is more memory than I have, yet you call this disappointing. When I run these Llama 4 models, they are faster than dense models less than half their size, even while partially offloaded to swap/disk, while those smaller non-Llama-4 dense models fit entirely in memory and still run slower. How? Because Llama 4 is a MoE; that 109B is NOT dense.

Concretely, I compared Gemma 3 27B Q4 QAT (16GB) against Llama 4 Scout at Q2_K_XL (42.6GB). (After seeing that I couldn't fit Maverick, I didn't open a Hugging Face discussion to complain and add to the unreasoned hate; I simply tried the smaller Scout model they released.) My total system memory is 40GB (8GB VRAM + 32GB DRAM), so Scout WAS spilling to swap (i.e., disk), which should have made it even slower for me. But any model is runnable, and the only thing that matters is speed, so I measured it: Scout ran at 2.7 to 3.4 tokens per second, versus 2.4 tokens per second for Gemma 3 27B Q4 QAT. And I haven't cherry-picked Gemma 3; other dense models run slower still. The two sketches below show the arithmetic behind all of this.
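To make the swap point concrete, here is the fit arithmetic from my setup as a minimal Python sketch. The figures are the ones quoted above; the helper function itself is just illustrative, not part of any tool:

```python
# How much of a model file spills past VRAM + DRAM into swap/disk.
# Figures match my setup above: 42.6 GB Scout Q2_K_XL, 8 GB VRAM, 32 GB DRAM.

def spill_gb(model_file_gb: float, vram_gb: float, dram_gb: float) -> float:
    """GB of model weights that must be paged in from disk/swap."""
    return max(0.0, model_file_gb - (vram_gb + dram_gb))

print(f"{spill_gb(42.6, vram_gb=8, dram_gb=32):.1f} GB of Scout served from swap")
# -> 2.6 GB of Scout served from swap
```

As long as that spill fits in your swap space, the model runs; the only question is how fast.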
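And here is why the spilled MoE can still outrun a fully resident dense model. Single-stream decoding is roughly memory-bandwidth-bound, and a MoE only reads its active experts' weights each token. This is a back-of-envelope sketch, not a benchmark: the 60 GB/s effective bandwidth is an assumed illustrative number, while the file sizes and parameter counts are the ones quoted above.

```python
# Idealized decode-speed ceiling: bandwidth / weight bytes read per token.
# A MoE reads only the ACTIVE slice of its weights each token, so Scout
# (17B active of 109B total) touches far fewer bytes than dense Gemma 27B.
# BW is an assumed effective figure for a mixed VRAM/DRAM/swap setup.

def tok_s_ceiling(file_gb: float, active_b: float, total_b: float, bw_gb_s: float) -> float:
    """Upper bound on tokens/sec if decoding is purely bandwidth-bound."""
    bytes_per_token_gb = file_gb * (active_b / total_b)  # active slice of the file
    return bw_gb_s / bytes_per_token_gb

BW = 60.0  # assumed effective GB/s, illustrative only

gemma = tok_s_ceiling(16.0, active_b=27, total_b=27, bw_gb_s=BW)   # dense
scout = tok_s_ceiling(42.6, active_b=17, total_b=109, bw_gb_s=BW)  # MoE

print(f"Gemma 3 27B Q4 QAT (dense): ~{gemma:.1f} tok/s ceiling")
print(f"Llama 4 Scout Q2_K_XL (MoE): ~{scout:.1f} tok/s ceiling")
```

The real gap I measured (2.4 vs 2.7-3.4 tok/s) is smaller than this idealized one, because swap latency and routing overhead eat into the MoE's advantage, but the direction is exactly what the arithmetic predicts.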
