LLaMA 4 Fine-Tuning with Mental Health Counseling Data
Building a Mental Health Chatbot by Fine-Tuning Llama 4
Python libraries
import os
import torch
import pandas as pd
from datasets import Dataset
from trl import SFTTrainer
from huggingface_hub import login
from transformers import (
    AutoTokenizer,
    Llama4ForConditionalGeneration,
    BitsAndBytesConfig,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model
These imports cover everything needed to load the model, tokenizer, and dataset, and to configure and run fine-tuning.
Hugging Face Login
To access LLaMA 4, you need a Hugging Face token and approved access to the model. Request access by filling out the form at the following link: https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct
hf_token = os.environ.get("HF_TOKEN")
login(hf_token)
This logs into Hugging Face using your token (make sure it's stored in your environment as HF_TOKEN).
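If HF_TOKEN is not set, os.environ.get returns None and login typically falls back to an interactive prompt, which is easy to miss in a batch job. A minimal sketch of a guard follows; read_hf_token is a hypothetical helper name, not part of huggingface_hub.

```python
import os


def read_hf_token(env=os.environ, name="HF_TOKEN"):
    """Return the token stored under `name`, or raise a clear error if it is missing.

    `read_hf_token` is a hypothetical helper for illustration only.
    """
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"{name} is not set; export it before running the notebook.")
    return token
```

With this in place, the login call would become login(read_hf_token()), so a missing token fails immediately with a readable message.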
GPU Check
!nvidia-smi
Running nvidia-smi shows which GPUs are visible and how much memory is free before the model is loaded.
Loading LLaMA 4 Model with 4-bit Quantization
The following code loads the model with 4-bit quantization to reduce memory usage.
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    trust_remote_code=True,
)
model.config.use_cache = False
model.config.pretraining_tp = 1
# Load the tokenizer, which converts text into the tokens the model understands.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
The model is loaded with 4-bit NF4 quantization to reduce GPU memory use, while computation runs in bfloat16.
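To get a feel for why 4-bit loading matters, the weight memory can be estimated as parameters times bits per parameter. The sketch below assumes a total of roughly 109 billion parameters for Llama 4 Scout (17B active per token across 16 experts); treat that figure as an assumption taken from the model's published description. The estimate ignores quantization constants, any layers left unquantized, activations, and optimizer state.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: about 109e9 total parameters for Llama 4 Scout.
TOTAL_PARAMS = 109e9

bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)  # 16 bits per weight in bfloat16
nf4_gb = weight_memory_gb(TOTAL_PARAMS, 4)    # 4 bits per weight in NF4

print(f"bfloat16 weights: ~{bf16_gb:.1f} GB")
print(f"4-bit weights:    ~{nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights need about a quarter of the bfloat16 footprint, which is still far more than a single consumer GPU provides.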
Loading and processing the dataset
df = pd.read_json("hf://datasets/Amod/mental_health_counseling_conversations/combined_dataset.json", lines=True)
dataset = Dataset.from_pandas(df)
This loads the mental health counseling dataset from its JSON Lines file and converts it into a Hugging Face Dataset.
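The lines=True argument tells pandas that the file holds one JSON object per line rather than a single JSON array. A small standard-library sketch of the same idea is shown here; the two sample records are made up for illustration, and the "Context" and "Response" field names are assumed to match the dataset's columns.

```python
import json

# Two made-up records in JSON Lines form: one JSON object per line.
sample_jsonl = (
    '{"Context": "I feel anxious before exams.", "Response": "That sounds stressful."}\n'
    '{"Context": "I cannot sleep at night.", "Response": "Let us look at your routine."}\n'
)

# Parse each non-empty line independently, which is what lines=True implies.
records = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]

for record in records:
    print(sorted(record.keys()), "->", record["Context"])
```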
Prompt Template
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.
### Instruction:
You are a counseling assistant trained to provide empathetic and helpful responses to users' mental health concerns.
### Context:
{}
### Response:
<think>
{}
</think>
{}"""
This template structures how the model learns to respond: it places reasoning inside <think> tags before the final answer.
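The three {} placeholders are filled positionally by str.format: context first, then reasoning, then the answer. The sketch below uses a shortened stand-in template (demo_template is hypothetical, not the full prompt) to show what one training string looks like when no reasoning text is available.

```python
# Shortened stand-in for the training template; same placeholder order.
demo_template = """### Context:
{}
### Response:
<think>
{}
</think>
{}"""

# Positional fill: context, chain of thought (empty here), final answer.
filled = demo_template.format(
    "I have trouble sleeping.",
    "",
    "Let's talk about your evening routine.",
)
print(filled)
```

With an empty reasoning string the <think> block is still emitted, just with nothing inside it, so the model sees the same structure in every example.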
Format Dataset for Training
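EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
    # With batched=True, each value in `examples` is a list of column values.
    inputs = examples["Context"]
    # If the dataset has no "thoughts" column, the <think> block stays empty.
    complex_cots = examples.get("thoughts", [""] * len(inputs))
    # The answer column in this dataset is capitalized ("Response").
    outputs = examples["Response"]
    texts = []
    for prompt, cot, response in zip(inputs, complex_cots, outputs):
        # Append EOS so the model learns where an answer ends.
        if not response.endswith(EOS_TOKEN):
            response += EOS_TOKEN
        text = train_prompt_style.format(prompt, cot, response)
        texts.append(text)
    return {"text": texts}
dataset = dataset.map(formatting_prompts_func, batched=True)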
dataset
This formats the raw data into prompt-response pairs, stored in a new "text" column that is ready for model input.
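With batched=True, the mapping function receives a dict of column lists and must return a dict of lists of the same length. The sketch below shows that contract on a toy batch without loading a tokenizer; format_batch, toy_template, and the "</s>" end-of-sequence string are all hypothetical stand-ins.

```python
def format_batch(batch, template, eos_token):
    """Turn a dict of column lists into a dict holding one formatted "text" list."""
    texts = []
    for context, answer in zip(batch["Context"], batch["Response"]):
        # Add the end-of-sequence marker only when it is not already present.
        if not answer.endswith(eos_token):
            answer += eos_token
        texts.append(template.format(context, answer))
    return {"text": texts}


toy_batch = {
    "Context": ["I feel lonely.", "Work is overwhelming."],
    "Response": ["Thank you for sharing that.", "Let's break it down.</s>"],
}
toy_template = "### Context:\n{}\n### Response:\n{}"

formatted = format_batch(toy_batch, toy_template, eos_token="</s>")
for text in formatted["text"]:
    print(repr(text))
```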
Data Collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)
Preparing batches of data for training. Since we are doing causal language modeling, we turn off MLM (Masked LM).
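With mlm=False, the collator pads each batch to a common length and builds labels as a copy of the input IDs, with padded positions set to -100 so the loss ignores them. The pure-Python sketch below illustrates that idea only; collate_causal is a hypothetical name, it always pads on the right, and it does not reproduce every detail of the real collator (which masks any label equal to the pad token ID and follows the tokenizer's padding side).

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def collate_causal(batch_ids, pad_id):
    """Right-pad token ID lists and build labels for causal language modeling."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch_ids:
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
        # Labels mirror the inputs; padded positions are excluded from the loss.
        labels.append(ids + [IGNORE_INDEX] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = collate_causal([[5, 6, 7], [8]], pad_id=0)
print(batch["labels"])
```

The shift between inputs and targets (predicting token t+1 from tokens up to t) happens inside the model's loss computation, not in the collator.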
Testing Pre-Fine-Tuned Model Output
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.
### Instruction:
You are a counseling assistant trained to provide empathetic and helpful responses to users' mental health concerns.
### Context:
{}
### Response:
<think>{}"""
example = dataset[0]["Context"]
# Do not append EOS to the prompt: it would signal that the text has already ended.
inputs = tokenizer(
    [prompt_style.format(example, "")],
    return_tensors="pt"
).to("cuda")
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1000,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
# Print only the text generated after the response marker.
print(response[0].split("### Response:")[1])
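Indexing the result of split with [1] raises an IndexError if the marker is missing from the decoded text. A slightly more defensive sketch using str.partition is shown here; extract_response is a hypothetical helper, and unlike split(...)[1] it returns everything after the first marker.

```python
def extract_response(decoded, marker="### Response:"):
    """Return the text after the first marker, or the whole text if the marker is absent."""
    _, found, tail = decoded.partition(marker)
    return tail.strip() if found else decoded.strip()


with_marker = extract_response("### Context:\nHi\n### Response:\n<think>ok</think>\nHello!")
without_marker = extract_response("  no marker here  ")
print(with_marker)
print(without_marker)
```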
LoRA for Parameter-Efficient Fine-Tuning
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
model = get_peft_model(model, peft_config)
LoRA makes training more efficient by freezing the base model and training only small low-rank adapter matrices attached to the listed projection layers.
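To make "a small number of weights" concrete: for a linear layer with a d_out by d_in weight matrix, LoRA adds two matrices of shapes r by d_in and d_out by r, so r * (d_in + d_out) trainable parameters, and the adapter output is scaled by lora_alpha / r. The sketch below uses r=64 and lora_alpha=16 from the configuration above; the layer width of 5120 is an illustrative assumption, not a value read from the model.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    return r * (d_in + d_out)


r, lora_alpha = 64, 16
scaling = lora_alpha / r  # factor applied to the low-rank update

d = 5120  # assumed layer width, for illustration only
full_params = d * d
added_params = lora_param_count(d, d, r)
fraction = added_params / full_params

print(f"scaling = {scaling}")
print(f"adapter params per layer = {added_params:,} ({fraction:.1%} of the frozen weight)")
```

Under these assumptions each adapted square projection trains about 2.5% as many parameters as the frozen matrix it sits beside.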
Training Arguments
training_arguments = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    num_train_epochs=1,
    logging_steps=0.2,
    warmup_steps=10,
    logging_strategy="steps",
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    group_by_length=True,
    report_to="none",
)
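Two of these settings interact in ways that are easy to misread. The effective batch size is the per-device batch size times the gradient accumulation steps times the number of devices, and a logging_steps value below 1 is treated as a fraction of the total training steps. The sketch below works through that arithmetic for a hypothetical dataset of 1,000 examples; training_schedule is an illustrative helper, and the exact rounding of steps can differ between transformers versions.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum,
                      num_devices=1, epochs=1, logging_ratio=0.2):
    """Estimate effective batch size, optimizer steps, and logging interval."""
    effective_batch = per_device_batch * grad_accum * num_devices
    # One optimizer step consumes one effective batch.
    steps_per_epoch = max(num_examples // effective_batch, 1)
    total_steps = steps_per_epoch * epochs
    # A fractional logging_steps is scaled by the total number of steps.
    log_every = math.ceil(total_steps * logging_ratio)
    return effective_batch, total_steps, log_every


# Hypothetical dataset size; batch settings match the arguments above.
schedule = training_schedule(num_examples=1000, per_device_batch=1, grad_accum=2)
print(schedule)
```

For these hypothetical numbers the result is an effective batch of 2, 500 optimizer steps, and a log entry every 100 steps.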
Initialize the Trainer
trainer = SFTTrainer(
    model=model,  # already wrapped by get_peft_model, so peft_config is not passed again
    args=training_arguments,
    train_dataset=dataset,
    data_collator=data_collator,
)
Start Fine-Tuning
trainer.train()
This is where the LLaMA 4 model is fine-tuned on the counseling dataset.
Model inference after fine-tuning
example = dataset[0]["Context"]
# Do not append EOS to the prompt: it would signal that the text has already ended.
inputs = tokenizer(
    [prompt_style.format(example, "")],
    return_tensors="pt"
).to("cuda")
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1000,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
# Print only the text generated after the response marker.
print(response[0].split("### Response:")[1])
Push Model to Hugging Face Hub
Saving the model
model.push_to_hub("Name-the-finetuned-model")
tokenizer.push_to_hub("Name-the-finetuned-model")
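This saves the fine-tuned model and tokenizer to the Hub, where the repository can be public or private. Because the model is wrapped with PEFT, push_to_hub uploads the LoRA adapter weights rather than a full copy of the base model.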