Llama-3.1-8b-Instruct-Residual
Full-rank instruction residual for Llama-3.1-8B
This repository provides the full-rank instruction residual (Δθ = θ_{instruct} - θ_{base}) between the instruction-tuned Llama-3.1-8B-Instruct model and its corresponding base Llama-3.1-8B model. By adding this residual to a fresh base checkpoint, you can restore instruction-following capabilities without running a full fine-tuning cycle.
How it was created
We follow the instruction residual approach introduced by Jindal et al. (2024):
“In this section, we describe the instruction residual approach to simply regain the instruction following capabilities. We compute the instruction residual between an instruction following LLM (θ_{i,d_1,v_1}) and its corresponding base model (θ_{b,d_1}) in the parametric space as Θ_{r,v_1} = θ_{i,d_1,v_1} - θ_{b,d_1}. This tensor subtraction extracts the instruction-specific information, which can then be added to any base model.”
The full paper is available at: https://arxiv.org/abs/2410.10739
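The residual hosted here is this subtraction applied to the public Llama-3.1-8B-Instruct and Llama-3.1-8B checkpoints. Below is a minimal sketch of that computation, illustrating the formula above rather than reproducing the exact script used to build this repository; it assumes both FP16 models fit in CPU RAM (roughly 32 GB).

from transformers import AutoModelForCausalLM
from safetensors.torch import save_file
import torch

# Load both checkpoints on CPU in FP16.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.float16
)
instruct = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.float16
)

# Θ_r = θ_instruct - θ_base, computed parameter by parameter.
base_params = dict(base.named_parameters())
residual = {
    name: (param.data - base_params[name].data).contiguous()
    for name, param in instruct.named_parameters()
}

save_file(residual, "pytorch_model.safetensors")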
Files
- pytorch_model.safetensors: full-rank FP16 residual weights (~16 GB).
- config.json: configuration matching the Llama-3.1-8B architecture.
- README.md: this model card.
Usage
Below is a minimal example showing how to apply the residual to a base model:
from transformers import AutoModelForCausalLM
from safetensors.torch import load_file
import torch

# 1) Load the base model in FP16
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.float16,
    device_map="auto",
)

# 2) Load the residual weights onto CPU
residual_sd = load_file("pytorch_model.safetensors", device="cpu")

# 3) Apply the residual: θ_base + Δθ, parameter by parameter
params = dict(model.named_parameters())  # build the lookup once
for name, delta in residual_sd.items():
    param = params[name]
    param.data += delta.to(device=param.device, dtype=param.dtype)

# 4) Save (or push) the merged model
model.save_pretrained("llama-3.1-8b-base-plus-instruct")
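Once the residual has been applied, the merged checkpoint should behave like an instruction-tuned model. Here is a quick smoke test, assuming the merged directory saved above and the tokenizer from the Instruct repository (which ships the chat template):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# The Instruct tokenizer provides the chat template the merged model expects.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "llama-3.1-8b-base-plus-instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain the instruction residual idea in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))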
For full scripts, see the examples/ folder.
Intended Use & Limitations
- Intended Use: Add instruction-following capabilities to Llama-3.1-8B base models.
- Limitations:
- The residual must be applied to the exact base checkpoint it was computed from (a quick compatibility check is sketched below).
- The residual is stored in FP16 (~16 GB); a 4-bit quantized base model must be dequantized to full precision before the residual can be added.
- Applying the residual to a mismatched architecture will produce invalid weights.
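A rough sketch of such a compatibility check, assuming the residual file and the base model from the usage example above; it verifies that every residual tensor has a counterpart of identical shape before anything is modified:

from transformers import AutoModelForCausalLM
from safetensors.torch import load_file
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.float16
)
residual_sd = load_file("pytorch_model.safetensors", device="cpu")
params = dict(model.named_parameters())

# Every residual tensor must map onto a base parameter with the same shape.
for name, delta in residual_sd.items():
    assert name in params, f"residual key not found in base model: {name}"
    assert params[name].shape == delta.shape, f"shape mismatch for {name}"
print("residual is compatible with the base checkpoint")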
License
This residual is released under the Apache License 2.0. See the LICENSE
file for details.
References
This method was introduced by Jindal et al. (2024), arXiv:2410.10739:
@misc{jindal2024balancingcontinuouspretraininginstruction,
title={Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs},
author={Ishan Jindal and Chandana Badrinath and Pranjal Bharti and Lakkidi Vinay and Sachin Dev Sharma},
year={2024},
eprint={2410.10739},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.10739},
}