The Only Guide You Need to Fine-Tune Llama 3 or Any Other Open Source Model

Fine-tuning large language models (LLMs) like Llama 3 involves adapting a pre-trained model to specific tasks using a domain-specific dataset. This process leverages the model's pre-existing knowledge, making it efficient and cost-effective compared to training from scratch. In this guide, we'll walk through the steps to fine-tune Llama 3 using QLoRA (Quantized LoRA), a parameter-efficient method that minimizes memory usage and computational costs.

Overview of fine-tuning

Fine-tuning involves several important steps:

  1. Select a pre-trained model: Choose a base model that matches your target architecture and task.
  2. Collect a relevant dataset: Gather and prepare a dataset specific to your task.
  3. Fine-tune: Train the model on the dataset to improve its performance on the target task.
  4. Evaluate: Assess the fine-tuned model using both qualitative and quantitative metrics.

Concepts and techniques

Fine-tuning large language models

Full fine-tuning

Full fine-tuning updates all of the model's parameters, specializing it for the new task. This method requires significant computational resources and is often impractical for very large models.

Parameter-efficient fine-tuning (PEFT)

PEFT updates only a subset of the model's parameters, reducing memory requirements and computational costs. This approach helps prevent catastrophic forgetting and preserves the model's general knowledge.

Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA)

LoRA fine-tunes only a small set of low-rank matrices, while QLoRA additionally quantizes the frozen base model to 4-bit precision to further reduce the memory footprint.

Fine-tuning methods

  1. Full fine-tuning: This involves training all of the model's parameters on the task-specific dataset. While this method can be very effective, it is also computationally expensive and requires a significant amount of memory.
  2. Parameter-efficient fine-tuning (PEFT): PEFT updates only a subset of the model's parameters, making fine-tuning far more memory-efficient. Techniques such as Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA) fall into this category.

What is LoRA?

Comparison of fine-tuning methods: QLoRA improves on LoRA with 4-bit precision quantization and paged optimizers for managing memory peaks

LoRA is a fine-tuning method in which, instead of updating all the weights of the pre-trained model, two smaller low-rank matrices whose product approximates the weight update are trained. These matrices form the LoRA adapter. The trained adapter is then loaded into the pre-trained model and used for inference.
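
To see why this saves memory, consider the parameter counts involved. The quick calculation below is purely illustrative; the matrix dimensions and rank are assumed values, not figures from this article.

# Parameter count for adapting one d x k weight matrix with LoRA rank r.
# d, k and r are illustrative assumptions (a 4096 x 4096 projection, rank 8).
d, k, r = 4096, 4096, 8
full_update_params = d * k           # updating the full matrix: ~16.8M parameters
lora_params = d * r + r * k          # training A (d x r) and B (r x k): ~65.5K parameters
print(full_update_params, lora_params, lora_params / full_update_params)
# LoRA trains roughly 0.4% of the parameters of the full update for this matrix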

Main benefits of LoRA:

  • Memory efficiency: LoRA reduces the memory footprint by training only small adapter matrices instead of the entire model.
  • Reusability: The original model remains unchanged and multiple LoRA adapters can be used with it, facilitating multi-task performance with lower memory requirements.

What is Quantized LoRA (QLoRA)?

QLoRA goes one step further by quantizing the weights of the frozen base model to lower precision (4-bit instead of 16-bit) while the LoRA adapters are trained on top. This further reduces memory usage and storage requirements while maintaining a comparable level of effectiveness.

Key benefits of QLoRA:

  • Even greater memory efficiency: By quantizing the weights, QLoRA significantly reduces the memory and storage requirements of the model.
  • Maintains performance: Despite the reduced precision, QLoRA maintains performance levels close to those of full-precision models.

Task-specific adaptation

During fine-tuning, the model's parameters are adjusted based on the new dataset, allowing it to better understand and generate content relevant to the specific task. This process preserves the general language knowledge acquired during pre-training while tailoring the model to the nuances of the target domain.

Fine-tuning in practice

Full fine-tuning versus PEFT

  • Full fine-tuning: trains the entire model, which is computationally expensive and requires a lot of memory.
  • PEFT (LoRA and QLoRA): fine-tunes only a subset of parameters, reducing memory requirements and helping prevent catastrophic forgetting, making it a more efficient alternative.

Implementation steps

  1. Set up the environment: Install the necessary libraries and configure the computing environment.
  2. Load and preprocess dataset: Load the dataset and preprocess it into a format suitable for the model.
  3. Load pre-trained model: Load the base model with quantization configurations if you are using QLoRA.
  4. Tokenization: Tokenize the dataset to prepare it for training.
  5. Training: Fine-tune the model using the prepared dataset.
  6. Evaluation: Evaluate model performance on specific tasks using qualitative and quantitative metrics.

Step-by-step guide to fine-tuning an LLM

Set up the environment

For this tutorial, we’ll be using a Jupyter notebook. Platforms like Kaggle, which offer free GPU usage, or Google Colab are ideal for running these experiments.

1. Install the required libraries

First make sure you have the necessary libraries installed:

!pip install -qqq -U bitsandbytes transformers peft accelerate datasets scipy einops evaluate trl rouge_score

2. Import libraries and set up environment

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, 
    pipeline, HfArgumentParser
)
from trl import ORPOConfig, ORPOTrainer, setup_chat_format, SFTTrainer
from tqdm import tqdm
import gc
import pandas as pd
import numpy as np
from huggingface_hub import interpreter_login
# Disable Weights and Biases logging
os.environ['WANDB_DISABLED'] = "true"
interpreter_login()

3. Load the dataset

We’ll use the DialogSum dataset for this tutorial:

Preprocess the dataset according to the model's requirements, applying appropriate prompt templates and ensuring the data format is suitable for fine-tuning (Hugging Face) (DataCamp).

dataset_name = "neil-code/dialogsum-test"
dataset = load_dataset(dataset_name)

Inspect the dataset structure:

print(dataset['test'][0])

4. Create a BitsAndBytes configuration

To load the model in 4-bit format:

compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)

5. Load the pre-trained model

Use Microsoft’s Phi-2 model for this tutorial:

model_name = 'microsoft/phi-2'
device_map = {"": 0}
original_model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    device_map=device_map,
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_auth_token=True
)

6. Tokenization

Configure the tokenizer:

tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    trust_remote_code=True, 
    padding_side="left", 
    add_eos_token=True, 
    add_bos_token=True, 
    use_fast=False
)
tokenizer.pad_token = tokenizer.eos_token

Fine-tuning Llama 3 or other models

When fine-tuning models like Llama 3 or other state-of-the-art open-source LLMs, specific considerations and adjustments are needed to ensure optimal performance. Below are detailed steps and insights for different models, including Llama 3, GPT-3 and Mistral.


5.1 Using Llama 3

Model selection:

  • Make sure you have the correct model ID from the Hugging Face model hub. For example, the Llama 3 model is identified as meta-llama/Meta-Llama-3-8B on Hugging Face.
  • Be sure to request access and log in to your Hugging Face account when required for gated models like Llama 3 (Hugging Face).

Tokenization:

  • Use the correct tokenizer for Llama 3, and make sure it is compatible with the model and supports the required features such as padding and special tokens.

Memory and compute:

  • Fine-tuning large models like Llama 3 requires significant computing power. Make sure your environment, such as a high-performance GPU setup, can handle the memory and processing requirements; techniques such as QLoRA can mitigate the memory footprint, as estimated below (Hugging Face Forums).
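
As a rough back-of-the-envelope check (an illustrative calculation, not a figure from this article), the weight memory of an 8B-parameter model can be estimated from the bytes per parameter:

# Approximate weight memory for an 8B-parameter model (activations, optimizer
# state and CUDA overhead are extra; the figures are illustrative assumptions).
params = 8e9
fp16_gib = params * 2 / 1024**3    # ~14.9 GiB at 16-bit precision
nf4_gib = params * 0.5 / 1024**3   # ~3.7 GiB at 4-bit precision
print(f"fp16: {fp16_gib:.1f} GiB, 4-bit: {nf4_gib:.1f} GiB")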

Example:

model_name = 'meta-llama/Meta-Llama-3-8B'
device_map = {"": 0}
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
original_model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    device_map=device_map,
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_auth_token=True
)

Tokenization:

Depending on the specific use case and model requirements, ensure the tokenizer is configured correctly and without redundant settings. use_fast=True is generally recommended for better performance, although the example below keeps the slow tokenizer for consistency with the earlier setup (Hugging Face) (Weights & Biases).

tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    trust_remote_code=True, 
    padding_side="left", 
    add_eos_token=True, 
    add_bos_token=True, 
    use_fast=False
)
tokenizer.pad_token = tokenizer.eos_token

5.2 Using other popular models (e.g. GPT-3, Mistral)

Model selection:

  • For models such as Mistral, make sure you use the correct model ID from the Hugging Face model hub. Note that GPT-3 itself is only available through OpenAI's API and cannot be downloaded from the Hub; open GPT-style alternatives such as GPT-J or GPT-NeoX can be used instead.

Tokenization:

  • As with Llama 3, make sure the tokenizer is set up correctly and is compatible with the model.

Memory and calculation:

  • Each model may have different memory requirements. Adjust your environment settings accordingly.

Example for a GPT-style model: since GPT-3 is not hosted on the Hugging Face Hub, this example substitutes EleutherAI's open GPT-J model:

model_name = 'EleutherAI/gpt-j-6b'  # open substitute; GPT-3 is not on the Hub
device_map = {"": 0}
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
original_model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    device_map=device_map,
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_auth_token=True
)

Example for Mistral:

model_name = 'mistralai/Mistral-7B-v0.1'  # full Hugging Face model ID for Mistral 7B
device_map = {"": 0}
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
original_model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    device_map=device_map,
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_auth_token=True
)

Tokenization Considerations: Each model may have unique tokenization requirements. Make sure the tokenizer matches the model and is configured correctly.

Llama 3 Tokenizer example:

tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    trust_remote_code=True, 
    padding_side="left", 
    add_eos_token=True, 
    add_bos_token=True, 
    use_fast=False
)
tokenizer.pad_token = tokenizer.eos_token

GPT-J and Mistral tokenizer example:

tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    use_fast=True
)

7. Test the model with zero-shot inference

Evaluate the base model with a sample input:

from transformers import set_seed
set_seed(42)
index = 10
prompt = dataset['test'][index]['dialogue']
formatted_prompt = f"Instruct: Summarize the following conversation.\n{prompt}\nOutput:\n"
# Generate output
def gen(model, prompt, max_new_tokens):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # max_new_tokens counts only generated tokens, so long prompts are not cut off
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
res = gen(original_model, formatted_prompt, 100)
output = res[0].split('Output:\n')[1]
print(f'INPUT PROMPT:\n{formatted_prompt}')
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

8. Preprocess the dataset

Convert dialog-summary pairs to prompts:

def create_prompt_formats(sample):
    blurb = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    instruction = "### Instruct: Summarize the below conversation."
    input_context = sample['dialogue']
    response = f"### Output:\n{sample['summary']}"
    end = "### End"
    
    parts = [blurb, instruction, input_context, response, end]
    formatted_prompt = "\n\n".join(parts)
    sample["text"] = formatted_prompt
    return sample
dataset = dataset.map(create_prompt_formats)

Tokenize the formatted dataset:
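
A minimal tokenization sketch, assuming the "text" field created above and an illustrative maximum sequence length of 1024 tokens:

def tokenize_function(sample):
    # Truncate to a fixed length; 1024 is an assumed value, adjust it to your model.
    return tokenizer(sample["text"], truncation=True, max_length=1024)

tokenized_dataset = dataset.map(tokenize_function, batched=True)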

Prepare the model for parameter-efficient fine-tuning:

from peft import prepare_model_for_kbit_training  # import needed for k-bit preparation

original_model = prepare_model_for_kbit_training(original_model)
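
With the model prepared for k-bit training, a LoRA adapter can be attached so that only the adapter weights are trained. The configuration below is a minimal sketch using the peft library; the rank, alpha, dropout and target module names are illustrative choices, not values prescribed by this article. trl's trainers can also accept such a configuration directly through their peft_config argument.

from peft import LoraConfig, get_peft_model

# Illustrative LoRA settings; tune r, lora_alpha and target_modules per model.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumed attention projection names
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(original_model, lora_config)
peft_model.print_trainable_parameters()  # shows the small trainable fraction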

Hyperparameters and their impact

Hyperparameters play a crucial role in optimizing the performance of your model. Here are some important hyperparameters to consider:

  1. Learning rate: Controls the rate at which the model updates its parameters. A high learning rate can speed up convergence but may overshoot the optimal solution; a low learning rate gives steadier convergence but may require more epochs.
  2. Batch size: The number of samples processed before the model updates its parameters. Larger batch sizes can improve stability but require more memory; smaller batch sizes introduce more noise into the training process.
  3. Gradient accumulation steps: Simulates larger batch sizes by accumulating gradients over multiple steps before performing a parameter update (see the effective-batch-size sketch after this list).
  4. Number of epochs: The number of times the entire dataset is passed through the model. More epochs can improve performance but may lead to overfitting if not managed properly.
  5. Weight decay: A regularization technique that prevents overfitting by penalizing large weights.
  6. Learning rate scheduler: Adjusts the learning rate during training to improve performance and convergence.
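
Concretely, the effective batch size seen by the optimizer is the per-device batch size multiplied by the accumulation steps (and the number of devices). A small illustration using the values from the training configuration shown below:

# Effective batch size when using gradient accumulation (single GPU assumed).
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1  # assumption for this illustration
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 8 samples contribute to each optimizer update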

Customize the training configuration by adjusting hyperparameters such as learning rate, batch size and gradient accumulation steps based on the specific model and task requirements. For example, Llama 3 models may require different learning rates than smaller models (Weights & Biases) (GitHub).

Example training configuration

orpo_args = ORPOConfig(
    learning_rate=8e-6,
    lr_scheduler_type="linear",
    max_length=1024,
    max_prompt_length=512,
    beta=0.1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",
    num_train_epochs=1,
    evaluation_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=10,
    report_to="none",  # wandb logging was disabled earlier in the setup
    output_dir="./results/",
)

10. Train the model

Set up the trainer and start training:

# Note: ORPOTrainer expects preference data with prompt/chosen/rejected columns;
# for a plain instruction-style dataset like this one, trl's SFTTrainer is the usual fit.
# The train/eval splits below are assumed names from the processed DialogSum dataset.
train_dataset = dataset["train"]
eval_dataset = dataset["validation"]

trainer = ORPOTrainer(
    model=original_model,
    args=orpo_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("fine-tuned-llama-3")
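
As noted earlier, the trained LoRA adapter can be loaded back onto the base model for inference. A minimal sketch, assuming the adapter was saved to "fine-tuned-llama-3" as above and that the peft library is available:

from peft import PeftModel

# Reload the quantized base model and attach the saved adapter for inference.
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    quantization_config=bnb_config,
    trust_remote_code=True,
)
ft_model = PeftModel.from_pretrained(base_model, "fine-tuned-llama-3")
ft_model.eval()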

Evaluation of the fine-tuned model

After training, evaluate model performance using both qualitative and quantitative methods.

1. Human evaluation

Compare the generated summaries with human-written summaries to assess quality.

2. Quantitative evaluation

Use metrics like ROUGE to assess performance:
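
The snippet below is a usage sketch for producing the two summaries to compare, assuming ft_model is the fine-tuned model reloaded above and gen is the helper defined earlier; the ROUGE scoring then follows:

# Pick a test example and generate a summary with the fine-tuned model.
index = 10
dialogue = dataset['test'][index]['dialogue']
reference_summary = dataset['test'][index]['summary']
prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"
generated_summary = gen(ft_model, prompt, 100)[0].split('Output:\n')[-1]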

from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference_summary, generated_summary)
print(scores)

Common challenges and solutions

1. Memory limitations

Using QLoRA helps reduce memory issues by quantizing model weights down to 4-bit. Make sure you have enough GPU memory for your batch size and model size.

2. Overfitting

Monitor validation metrics to prevent overfitting. Use techniques such as early stopping and weight decay.

3. Slow training

Improve training speed by adjusting the batch size and learning rate and by using gradient accumulation.

4. Data quality

Make sure your dataset is clean and well pre-processed. Poor data quality can significantly affect model performance.

Conclusion

Fine-tuning LLMs with QLoRA is an efficient way to adapt large pre-trained models to specific tasks at lower computational cost. By following this guide, you can fine-tune Phi-2, Llama 3, or any other open-source model to achieve high performance on your specific tasks.
