13. Fine-tuning Llama3#

You can use your local GPU assuming it has 24GB of RAM, or for about 40 cents on vast.ai, if you choose this option I recommend picking an RTX 4090 since it’s fast and relatively cheap. Also choose at least 32GB of storage, I’m not entirely sure why but the model needs a lot more than 8GB of disk space.

We will use QLoRA, a fine-tuning method that combines quantization and LoRA.

Since it’s a large model will load it in 4bit using bitsandbytes and use QLoRA to train using the PEFT library from Hugging Face.

13.1. 1. Preparing data#

First you need your data, mine is just a giant file called shakespeare.txt (can you guess what it contains?). Modifying the steps below to work for any other type of data (json, csv) is very simple, I recommend checking the HuggingFace tutorials (or just asking ChatGPT).

13.1.1. Load Dataset#

Notice the above comment about GPU requirements! It’ll suck to redo the process if you are on the wrong machine - make sure you have 24GB of RAM.

# Set up
%pip install -q -U bitsandbytes
%pip install -q -U huggingface_hub
%pip install -q -U git+https://github.com/huggingface/transformers.git
%pip install -q -U git+https://github.com/huggingface/peft.git
%pip install -q -U git+https://github.com/huggingface/accelerate.git
%pip install -q -U datasets scipy ipywidgets matplotlib

from datasets import load_dataset, Features, Value

# I'm using a giant txt file, but we want each example to fit in the context length! Luckily, sample_by="paragraph" exists.
features = Features({ "text": Value("string") })
dataset = load_dataset("text", data_files={"train": "shakespeare.txt"}, sample_by="paragraph")['train']

13.2. 2. Load Base Model#

Let’s now load the model (remember you need to request permission with your account, even though it’s immediate!). We will be using 4bit quantization since it barely lowers performance and saves us a lot of memory.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from huggingface_hub import login

base_model_id = "meta-llama/Meta-Llama-3-8B"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

login() # for llama 3 you need permission (it's given automatically when you request, but you need to login)

model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config, device_map="auto")

13.3. 3. Tokenization#

Set up the tokenizer. Add padding on the left as it makes training use less memory.

For model_max_length, it’s helpful to get a distribution of your data lengths. Let’s first tokenize without the truncation/padding, so we can get a length distribution.

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

From here, you can choose where you’d like to set the max_length to be. You can truncate and pad training examples to fit them to your chosen size. I’m going to pick 256 tokens (it’s a nice number). Each token is about a word.

Now let’s tokenize again with padding and truncation, and set up the tokenize function to make labels and input_ids the same. This is basically what self-supervised fine-tuning is.

max_length = 256

def generate_and_tokenize_prompt(prompt):
    result = tokenizer(
        prompt['text'],
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result

tokenized_dataset = dataset.map(generate_and_tokenize_prompt)

Observe how the model does out of the box.

13.4. 4. Set Up LoRA#

Now, to start our fine-tuning, we have to apply some preprocessing to the model to prepare it for training. For that use the prepare_model_for_kbit_training method from PEFT.

from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

Let’s print the model to examine its layers, as we will apply QLoRA to all the linear layers of the model. Those layers are q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, and lm_head.

Here we define the LoRA config.

r is the rank of the low-rank matrix used in the adapters, which thus controls the number of parameters trained. A higher rank will allow for more expressivity, but there is a compute tradeoff.

alpha is the scaling factor for the learned weights. The weight matrix is scaled by alpha/r, and thus a higher value for alpha assigns more weight to the LoRA activations.

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)

13.5. 5. Run Training!#

I didn’t have a lot of training samples: only about 200 total train/validation. I used 500 training steps, and I was fine with overfitting in this case. I found that the end product worked well. It took about 20 minutes on the 1x A10G 24GB.

Overfitting is when the validation loss goes up (bad) while the training loss goes down significantly, meaning the model is learning the training set really well, but is unable to generalize to new datapoints. In most cases, this is not desired, but since I am just playing around with a model to generate outputs like my shakespeare entries, I was fine with a moderate amount of overfitting.

With that said, a note on training: you can set the max_steps to be high initially, and examine at what step your model’s performance starts to degrade. There is where you’ll find a sweet spot for how many steps to perform. For example, say you start with 1000 steps, and find that at around 500 steps the model starts overfitting, as described above. Therefore, 500 steps would be your sweet spot, so you would use the checkpoint-500 model repo in your output dir (mistral-shakespeare-finetune) as your final model in step 6 below.

If you’re just doing something for fun like I did and are OK with overfitting, you can try different checkpoint versions with different degrees of overfitting.

You can interrupt the process via Kernel -> Interrupt Kernel in the top nav bar once you realize you didn’t need to train anymore.

import transformers
from datetime import datetime

project = "shakespeare-finetune"
base_model_name = "mistral"
run_name = base_model_name + "-" + project
output_dir = "./" + run_name

trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_dataset,
    args=transformers.TrainingArguments(
        output_dir=output_dir,
        warmup_steps=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=1,
        gradient_checkpointing=True,
        max_steps=500,
        learning_rate=2.5e-5, # we want a small lr for finetuning
        bf16=True,
        optim="paged_adamw_8bit",
        save_strategy="steps",       # Save the model checkpoint every logging step
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

max_steps is given, it will override any value given in num_train_epochs
/opt/conda/lib/python3.10/site-packages/peft/utils/save_and_load.py:180: UserWarning: Setting `save_embedding_layers` to `True` as embedding layers found in `target_modules`.
  warnings.warn("Setting `save_embedding_layers` to `True` as embedding layers found in `target_modules`.")
/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(

[500/500 05:01, Epoch 1/2]

Step	Training Loss
500	0.193500

TrainOutput(global_step=500, training_loss=0.19348802185058595, metrics={'train_runtime': 312.0695, 'train_samples_per_second': 3.204, 'train_steps_per_second': 1.602, 'total_flos': 1.1662918680576e+16, 'train_loss': 0.19348802185058595, 'epoch': 1.8181818181818183})

I cleared the output of the cell above because I stopped the training early, and it produced a long, ugly error message.

13.6. 6. Try the trained model (!!!)#

By default, the PEFT library will only save the QLoRA adapters, so we need to first load the base model from the Huggingface Hub:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

login()

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id, 
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

eval_tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True)

Now load the QLoRA adapter from the appropriate checkpoint directory, i.e. the best performing model checkpoint:

from peft import PeftModel

ft_model = PeftModel.from_pretrained(base_model, "mistral-shakespeare-finetune/checkpoint-500")

and run your inference!

Let’s try the same eval_prompt and thus model_input as above, and see if the new finetuned model performs better.

eval_prompt = """ 
THOMAS. What calamity befalls these sprites so fair?
GERALD. List close; the GPUs, once abundant and cheap"""
model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to("cuda")

ft_model.eval()
with torch.no_grad():
    print(eval_tokenizer.decode(ft_model.generate(**model_input, max_new_tokens=100, repetition_penalty=1.15)[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

 
THOMAS. What calamity befalls these sprites so fair?
GERALD. List close; the GPUs, once abundant and cheap,
    Are now grown dearer than they were in time.
    The King hath laid an embargo on our goods;
    And we are not at liberty to buy
    So much as will maintain us. 'Tis most strange!
    I never knew it thus before.
ROSALINE. O, what a scene of peace this might afford!  
    If all the dues and customs of the King
    Were paid to him in simple money-coin,
    His coffers would th

Large Language Models

Fine-tuning Llama3

Contents