LLaMA 4 Fine-Tuning with Mental Health Counseling Data

Community Article Published April 14, 2025

Building a Mental Health Chatbot by Fine-Tuning Llama 4

Python libraries

import os
import torch
import pandas as pd
from datasets import Dataset
from trl import SFTTrainer
from huggingface_hub import login
from transformers import (
    AutoTokenizer,
    Llama4ForConditionalGeneration,
    BitsAndBytesConfig,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model

These imports cover everything needed to load the model and tokenizer, prepare the dataset, configure fine-tuning, and run training.

Hugging Face Login

To access LLaMA 4, you need a Hugging Face token and approved access to the gated model. Request access by filling out the form on the model page: https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct

hf_token = os.environ.get("HF_TOKEN")
login(hf_token)

This logs into Hugging Face using your token (make sure it's stored in your environment as HF_TOKEN).
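
If HF_TOKEN is not set, os.environ.get returns None, which may not be what you want when running the notebook unattended. A minimal sketch of a guard, assuming a hypothetical helper named read_hf_token (not part of huggingface_hub), could fail early with a clearer message:

```python
import os

def read_hf_token(env=os.environ, name="HF_TOKEN"):
    """Return the token stored under `name`, or raise a clear error if it is missing."""
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"{name} is not set; export it before running the notebook.")
    return token

# Example with a stand-in environment (the token value here is a placeholder).
print(read_hf_token({"HF_TOKEN": "hf_placeholder"}))
```

The returned value could then be passed to login in place of the raw environment lookup.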

GPU Check

!nvidia-smi

This helps verify the available GPU memory and current usage before loading the model.

Loading LLaMA 4 Model with 4-bit Quantization

The following code loads the model with 4-bit quantization to reduce memory usage.

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    trust_remote_code=True,
)

model.config.use_cache = False
model.config.pretraining_tp = 1

# Load the tokenizer, which converts text into tokens the model understands.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

The model is loaded with 4-bit quantization, which mainly reduces GPU memory usage.
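
To see why quantization matters here, a rough back-of-the-envelope estimate of weight memory helps. The sketch below assumes roughly 109 billion total parameters for Llama 4 Scout (an assumption based on Meta's published figures, not on this article) and ignores quantization constants, any layers left unquantized, activations, and optimizer state:

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 109e9  # assumed total parameter count for Llama 4 Scout

bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)
nf4_gb = weight_memory_gb(TOTAL_PARAMS, 4)
print(f"bf16: {bf16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
```

Under these assumptions the weights shrink from about 218 GB in bf16 to about 54.5 GB in 4-bit, a fourfold reduction before any overhead is counted.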

Loading and processing the dataset

df = pd.read_json("hf://datasets/Amod/mental_health_counseling_conversations/combined_dataset.json", lines=True)

dataset = Dataset.from_pandas(df)

This loads the mental health counseling dataset (stored as JSON Lines) and converts it into a Hugging Face Dataset.
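
The lines=True argument tells pandas that the file is in JSON Lines format, with one JSON object per line. As an illustration of that layout, here is a sketch using only the standard library; the two records are invented placeholders, and the column names Context and Response are assumed to match the dataset:

```python
import json

# Two made-up records in JSON Lines form: one JSON object per line.
raw = (
    '{"Context": "I feel anxious before exams.", "Response": "That sounds stressful."}\n'
    '{"Context": "I cannot sleep well.", "Response": "Let us look at your routine."}\n'
)

# Parse each non-empty line independently, as a JSON Lines reader would.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]
print(len(records), sorted(records[0]))
```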

Prompt Template

train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a counseling assistant trained to provide empathetic and helpful responses to users' mental health concerns.

### Context:
{}

### Response:
<think>
{}
</think>
{}"""

This template structures how the model will learn to respond, adding reasoning (inside <think> tags) before the final answer.
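
The three {} placeholders are filled positionally by str.format: first the user's context, then the reasoning text, then the final response. A shortened stand-in template (not the full prompt above) with invented sample strings shows the mechanics:

```python
# Shortened stand-in for train_prompt_style, keeping the same three slots.
mini_template = """### Context:
{}

### Response:
<think>
{}
</think>
{}"""

# An empty string in the second slot leaves the reasoning block blank.
filled = mini_template.format("I feel overwhelmed.", "", "It is okay to feel that way.")
print(filled)
```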

Format Dataset for Training

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    # Column names in this dataset are capitalized: "Context" and "Response".
    inputs = examples["Context"]
    # If the dataset has no "thoughts" column, the <think> slot is left empty.
    complex_cots = examples.get("thoughts", [""] * len(inputs))
    outputs = examples["Response"]
    texts = []
    for prompt, cot, response in zip(inputs, complex_cots, outputs):
        # Append EOS so the model learns where a response ends.
        if not response.endswith(EOS_TOKEN):
            response += EOS_TOKEN
        text = train_prompt_style.format(prompt, cot, response)
        texts.append(text)
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
dataset

This formats the raw data into prompt-response pairs, ready for model input.
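
With batched=True, the mapping function receives a dictionary of column lists rather than a single row. A self-contained sketch on a toy batch shows the shape of the input and output; the placeholder EOS string "<eos>", the trimmed template, and the column names are all assumptions for illustration:

```python
EOS = "<eos>"  # placeholder; the real value comes from tokenizer.eos_token
TEMPLATE = "### Context:\n{}\n\n### Response:\n<think>\n{}\n</think>\n{}"

def format_batch(batch):
    """Turn a dict of column lists into a dict with one formatted "text" list."""
    contexts = batch["Context"]
    thoughts = batch.get("thoughts", [""] * len(contexts))
    texts = []
    for ctx, cot, resp in zip(contexts, thoughts, batch["Response"]):
        if not resp.endswith(EOS):
            resp += EOS  # mark where generation should stop
        texts.append(TEMPLATE.format(ctx, cot, resp))
    return {"text": texts}

toy = {"Context": ["q1", "q2"], "Response": ["a1", "a2<eos>"]}
out = format_batch(toy)
print(out["text"][0])
```

Note that the second toy response already ends with the EOS string, so it is not appended twice.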

Data Collator

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

The data collator prepares batches of data for training. Since this is causal language modeling, masked language modeling (MLM) is turned off.
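
Conceptually, a causal language modeling collator pads each sequence in the batch to a common length and copies the input IDs into the labels, replacing padding positions with -100 so the loss ignores them. The pure-Python sketch below imitates that behavior for illustration; it is not the library's implementation, and the pad ID of 0 and right-side padding are arbitrary assumptions:

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def collate_causal(sequences, pad_id=0):
    """Right-pad token ID lists and build labels that ignore the padding."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        batch["labels"].append(seq + [IGNORE_INDEX] * n_pad)
    return batch

batch = collate_causal([[5, 6, 7], [8, 9]])
print(batch["labels"])
```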

Testing Pre-Fine-Tuned Model Output

prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a counseling assistant trained to provide empathetic and helpful responses to users' mental health concerns.

### Context:
{}

### Response:
<think>{}"""

example = dataset[0]["Context"]
# Do not append the EOS token here: the prompt must stay open so the model continues it.
inputs = tokenizer(
    [prompt_style.format(example, "")],
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1000,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(response[0].split("### Response:")[1])
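
The final line keeps only the text after the "### Response:" marker. To also separate the reasoning from the answer, a small helper can split on the closing </think> tag. This is a sketch with a hypothetical function name and an invented sample string; real model output may omit the tag, which the helper tolerates:

```python
def split_response(decoded, marker="### Response:"):
    """Return (reasoning, answer) from decoded text that follows the prompt format."""
    body = decoded.split(marker, 1)[-1]
    if "</think>" in body:
        reasoning, answer = body.split("</think>", 1)
    else:
        reasoning, answer = "", body  # no closing tag: treat everything as the answer
    return reasoning.replace("<think>", "").strip(), answer.strip()

sample = "### Context:\nhi\n\n### Response:\n<think>greet back</think>\nHello there."
print(split_response(sample))
```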

LoRA for Parameter-Efficient Fine-Tuning

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

model = get_peft_model(model, peft_config)

LoRA makes training more efficient by freezing the base model and updating only a small set of added low-rank adapter weights.
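
For a sense of scale: LoRA adds two low-rank matrices of shapes (d_in x r) and (r x d_out) next to each targeted weight, so it trains r * (d_in + d_out) parameters instead of d_in * d_out, and scales the update by lora_alpha / r. The sketch below uses a hypothetical 5120 x 5120 projection purely for illustration; the real layer shapes depend on the model:

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds for one d_in x d_out weight matrix."""
    return r * (d_in + d_out)

d_in = d_out = 5120     # hypothetical layer size, chosen only for illustration
r, lora_alpha = 64, 16  # values from the LoraConfig above

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"full: {full:,}  lora: {added:,}  ratio: {added / full:.2%}  scale: {lora_alpha / r}")
```

For this hypothetical layer, the adapter trains 655,360 parameters against 26,214,400 in the frozen weight, about 2.5 percent, with an update scale of 0.25.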

Training Arguments

training_arguments = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    num_train_epochs=1,
    logging_steps=0.2,
    warmup_steps=10,
    logging_strategy="steps",
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    group_by_length=True,
    report_to="none"
)
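
Two of these settings interact. The effective batch size is per_device_train_batch_size x gradient_accumulation_steps x number of GPUs, and a logging_steps value below 1 is interpreted as a fraction of the total training steps. A rough calculation, assuming a single GPU and a hypothetical dataset of 1,000 examples (the exact step count can differ slightly between library versions):

```python
import math

def training_schedule(num_examples, per_device_batch, grad_accum, num_gpus, epochs, logging_ratio):
    """Estimate optimizer steps and the logging interval for a ratio-style logging_steps."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    total_steps = math.ceil(num_examples / effective_batch) * epochs
    log_every = math.ceil(total_steps * logging_ratio)
    return effective_batch, total_steps, log_every

# 1,000 examples and 1 GPU are assumptions; the other values mirror the arguments above.
schedule = training_schedule(1000, 1, 2, 1, 1, 0.2)
print(schedule)
```

Under these assumptions the run takes about 500 optimizer steps with an effective batch size of 2, and the loss is logged roughly every 100 steps.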

Initialize the Trainer

# The model is already wrapped by get_peft_model above, so peft_config is not passed again.
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=dataset,
    data_collator=data_collator,
)

Start Fine-Tuning

trainer.train()

This is where the LLaMA 4 model is fine-tuned on the counseling dataset.

Model inference after fine-tuning

example = dataset[0]["Context"]
# Do not append the EOS token here: the prompt must stay open so the model continues it.
inputs = tokenizer(
    [prompt_style.format(example, "")],
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1000,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(response[0].split("### Response:")[1])

Push Model to Hugging Face Hub

Saving the model

model.push_to_hub("Name-the-finetuned-model")
tokenizer.push_to_hub("Name-the-finetuned-model")

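This saves the fine-tuned model and tokenizer to the Hub, where they can be shared publicly or kept private. Because the model is a PEFT model, this uploads the LoRA adapter weights rather than the full base model.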

Happy to Connect 😊

Muhammad Imran Zaman

