Llama-3.1-8b-Instruct-Residual
Full-rank instruction residual for Llama-3.1-8B
This repository provides the full-rank instruction residual (Δθ = θ_{instruct} - θ_{base}) between the instruction-tuned Llama-3.1-8B-Instruct model and its corresponding base Llama-3.1-8B model. By adding this residual to a fresh base checkpoint, you can restore instruction-following capabilities without running a full fine-tuning cycle.
How it was created
We follow the instruction residual approach introduced by Jindal et al. (2024):
“In this section, we describe the instruction residual approach to simply regain the instruction following capabilities. We compute the instruction residual between an instruction following LLM (θ_{i,d_1,v_1}) and its corresponding base model (θ_{b,d_1}) in the parametric space as Θ_{r,v_1} = θ_{i,d_1,v_1} - θ_{b,d_1}. This tensor subtraction extracts the instruction-specific information, which can then be added to any base model.”
The full paper is available at: https://arxiv.org/abs/2410.10739
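The residual hosted here is this subtraction applied to the public Llama-3.1-8B-Instruct and Llama-3.1-8B checkpoints. Below is a minimal sketch of that computation, illustrating the formula above rather than reproducing the exact script used to build this repository; it assumes both FP16 models fit in CPU RAM (roughly 32 GB).

from transformers import AutoModelForCausalLM
from safetensors.torch import save_file
import torch

# Load both checkpoints on CPU in FP16.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.float16
)
instruct = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.float16
)

# Θ_r = θ_instruct - θ_base, computed parameter by parameter.
base_params = dict(base.named_parameters())
residual = {
    name: (param.data - base_params[name].data).contiguous()
    for name, param in instruct.named_parameters()
}

save_file(residual, "pytorch_model.safetensors")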
Files
- pytorch_model.safetensors: full-rank FP16 residual weights (~16 GB).
- config.json: configuration matching the Llama-3.1-8B architecture.
- README.md: this model card.
Usage
Below is a minimal example showing how to apply the residual to a base model:
from transformers import AutoModelForCausalLM
from safetensors.torch import load_file
import torch

# 1) Load the base model in FP16
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.float16,
    device_map="auto",
)

# 2) Load the residual weights onto CPU
residual_sd = load_file("pytorch_model.safetensors", device="cpu")

# 3) Apply the residual: θ_base + Δθ, parameter by parameter
params = dict(model.named_parameters())  # build the lookup once
for name, delta in residual_sd.items():
    param = params[name]
    param.data += delta.to(device=param.device, dtype=param.dtype)

# 4) Save (or push) the merged model
model.save_pretrained("llama-3.1-8b-base-plus-instruct")
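Once the residual has been applied, the merged checkpoint should behave like an instruction-tuned model. Here is a quick smoke test, assuming the merged directory saved above and the tokenizer from the Instruct repository (which ships the chat template):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# The Instruct tokenizer provides the chat template the merged model expects.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "llama-3.1-8b-base-plus-instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain the instruction residual idea in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))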
For full scripts, see the examples/ folder.
Intended Use & Limitations
- Intended Use: Add instruction-following capabilities to Llama-3.1-8B base models.
- Limitations:
- The residual must be applied to the exact base checkpoint it was computed from (a quick compatibility check is sketched below).
- The residual is stored in FP16 (~16 GB); a 4-bit quantized base model must be dequantized to full precision before the residual can be added.
- Applying the residual to a mismatched architecture will produce invalid weights.
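A rough sketch of such a compatibility check, assuming the residual file and the base model from the usage example above; it verifies that every residual tensor has a counterpart of identical shape before anything is modified:

from transformers import AutoModelForCausalLM
from safetensors.torch import load_file
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.float16
)
residual_sd = load_file("pytorch_model.safetensors", device="cpu")
params = dict(model.named_parameters())

# Every residual tensor must map onto a base parameter with the same shape.
for name, delta in residual_sd.items():
    assert name in params, f"residual key not found in base model: {name}"
    assert params[name].shape == delta.shape, f"shape mismatch for {name}"
print("residual is compatible with the base checkpoint")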
License
This residual is released under the Apache License 2.0. See the LICENSE
file for details.
References
This method was introduced by Jindal et al. (2024), arXiv:2410.10739:
@misc{jindal2024balancingcontinuouspretraininginstruction,
title={Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs},
author={Ishan Jindal and Chandana Badrinath and Pranjal Bharti and Lakkidi Vinay and Sachin Dev Sharma},
year={2024},
eprint={2410.10739},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.10739},
}