How come this pruned model has 162 layers
when the source model only has 126 layers?
Also, how come the full layers have 16 kv_heads when the original has only 8?
Even though only 64 layers are full, the f16 KV cache size at 128k context is 64GB vs 63GB for the original. No KV cache saving at all compared to the 49B and 51B models.
It has 81 "real" layers and 81 no-op layers
There are only 81 layers in the model that perform actual computations—the rest are no_op
(identity) layers added to maintain balance across pipeline stages. These no_op
layers serve as padding to ensure that each stage in the pipeline has approximately the same number of parameters, which helps optimize pipeline parallelism.
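If you want to verify the layer counts yourself, here is a minimal sketch that counts real vs. no-op sub-blocks straight from the downloaded config.json. The field names (block_configs, attention, ffn, no_op) are assumptions based on this style of per-block config and may differ slightly from the actual file:

```python
# Minimal sketch (not an official API): count real vs. no-op sub-blocks
# from a per-block config.json. Field names are assumptions and may differ.
import json

with open("config.json") as f:
    cfg = json.load(f)

blocks = cfg["block_configs"]
real_attn = sum(1 for b in blocks if not b["attention"]["no_op"])
real_ffn = sum(1 for b in blocks if not b["ffn"]["no_op"])

print(f"total transformer blocks : {len(blocks)}")
print(f"real attention sub-blocks: {real_attn}")
print(f"real FFN sub-blocks      : {real_ffn}")
```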
The config specifies the number of key_value_groups per layer
num_key_value_heads = num_heads // num_key_value_groups
In layers with 16 num_key_value_groups, there are 8 num_key_value_heads, just like the original 405B model.
We get significant KV cache savings from skipped attention subblocks
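To make the arithmetic concrete, here is a quick check assuming the source 405B attention geometry (128 query heads) and the group size of 16 reported in the config:

```python
# GQA arithmetic check (128 query heads assumed from the source 405B model,
# group size of 16 as reported per full layer in the config).
num_heads = 128
num_key_value_groups = 16
num_key_value_heads = num_heads // num_key_value_groups
print(num_key_value_heads)  # 8 -- same KV head count as the original model
```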
I think you are right. Sorry about that. So the fp16 KV cache use should only be 32GB. That's a significant saving here.
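For reference, a rough sketch of that estimate, assuming head_dim=128, 8 KV heads, fp16 (2 bytes per element), and that only layers with a real attention sub-block allocate KV cache; the 64-layer figure follows the count quoted earlier in the thread:

```python
def kv_cache_gib(attn_layers, kv_heads=8, head_dim=128,
                 context_len=128 * 1024, bytes_per_elem=2):
    """fp16 KV-cache size in GiB; the factor of 2 accounts for keys and values."""
    return attn_layers * kv_heads * head_dim * 2 * bytes_per_elem * context_len / 2**30

print(f"original (126 attention layers): {kv_cache_gib(126):.0f} GiB")  # 63
print(f"pruned   (~64 attention layers): {kv_cache_gib(64):.0f} GiB")   # 32
```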