How come this pruned model has 162 layers
when the source model only has 126 layers?
Also, how come the full layers have 16 kv_heads when the original has only 8?
Even though only 64 layers are full, the f16 KV cache size at 128k context is 64GB vs 63GB for the original. No KV cache saving at all compared to the 49B and 51B models.
It has 81 "real" layers and 81 no-op layers
There are only 81 layers in the model that perform actual computations—the rest are no_op
(identity) layers added to maintain balance across pipeline stages. These no_op
layers serve as padding to ensure that each stage in the pipeline has approximately the same number of parameters, which helps optimize pipeline parallelism.
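If you want to verify the layer counts yourself, here is a minimal sketch that counts real vs. no-op sub-blocks straight from the downloaded config.json. The field names (block_configs, attention, ffn, no_op) are assumptions based on this style of per-block config and may differ slightly from the actual file:

```python
# Minimal sketch (not an official API): count real vs. no-op sub-blocks
# from a per-block config.json. Field names are assumptions and may differ.
import json

with open("config.json") as f:
    cfg = json.load(f)

blocks = cfg["block_configs"]
real_attn = sum(1 for b in blocks if not b["attention"]["no_op"])
real_ffn = sum(1 for b in blocks if not b["ffn"]["no_op"])

print(f"total transformer blocks : {len(blocks)}")
print(f"real attention sub-blocks: {real_attn}")
print(f"real FFN sub-blocks      : {real_ffn}")
```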
The config specifies the number of key_value_groups per layer
num_key_value_heads = num_heads // num_key_value_groups
In layers with 16 num_key_value_groups, there are 8 num_key_value_heads, just like the original 405B model.
We get significant KV cache savings from skipped attention subblocks
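To make the arithmetic concrete, here is a quick check assuming the source 405B attention geometry (128 query heads) and the group size of 16 reported in the config:

```python
# GQA arithmetic check (128 query heads assumed from the source 405B model,
# group size of 16 as reported per full layer in the config).
num_heads = 128
num_key_value_groups = 16
num_key_value_heads = num_heads // num_key_value_groups
print(num_key_value_heads)  # 8 -- same KV head count as the original model
```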
I think you are right. Sorry about that. So the fp16 KV cache use should only be 32GB. That's a significant saving here.
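For reference, a rough sketch of that estimate, assuming head_dim=128, 8 KV heads, fp16 (2 bytes per element), and that only layers with a real attention sub-block allocate KV cache; the 64-layer figure follows the count quoted earlier in the thread:

```python
def kv_cache_gib(attn_layers, kv_heads=8, head_dim=128,
                 context_len=128 * 1024, bytes_per_elem=2):
    """fp16 KV-cache size in GiB; the factor of 2 accounts for keys and values."""
    return attn_layers * kv_heads * head_dim * 2 * bytes_per_elem * context_len / 2**30

print(f"original (126 attention layers): {kv_cache_gib(126):.0f} GiB")  # 63
print(f"pruned   (~64 attention layers): {kv_cache_gib(64):.0f} GiB")   # 32
```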