10. Ranked Predictions

This tutorial walks through a basic example of using BERT to rank the answers to each question. We'll fine-tune BERT on the 200 public training questions, then use the AutoModelForMultipleChoice class to score how likely each option is to correctly answer the prompt, and finally we'll turn those scores into a MAP@3-formatted prediction like A B C, with our guesses ordered from most to least confident.
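As a refresher, MAP@3 gives full credit when the correct answer is our first guess, and partial credit (1/2 or 1/3) when it's our second or third. Here's a minimal sketch of the scoring, just for intuition; the map_at_3 helper below is illustrative and not part of the competition code:

# Illustrative only: each question scores 1/rank of the correct answer
# within the top 3 guesses, and 0 if it isn't among them.
def map_at_3(predictions, answers):
    total = 0.0
    for pred, answer in zip(predictions, answers):
        guesses = pred.split()[:3]
        if answer in guesses:
            total += 1.0 / (guesses.index(answer) + 1)
    return total / len(predictions)

map_at_3(['A B C', 'B A C'], ['A', 'C'])  # (1/1 + 1/3) / 2 ≈ 0.667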

# Let's import the public training set and take a look
import pandas as pd

train_df = pd.read_csv('/kaggle/input/kaggle-llm-science-exam/train.csv')
train_df.head()
id prompt A B C D E answer
0 0 Which of the following statements accurately d... MOND is a theory that reduces the observed mis... MOND is a theory that increases the discrepanc... MOND is a theory that explains the missing bar... MOND is a theory that reduces the discrepancy ... MOND is a theory that eliminates the observed ... D
1 1 Which of the following is an accurate definiti... Dynamic scaling refers to the evolution of sel... Dynamic scaling refers to the non-evolution of... Dynamic scaling refers to the evolution of sel... Dynamic scaling refers to the non-evolution of... Dynamic scaling refers to the evolution of sel... A
2 2 Which of the following statements accurately d... The triskeles symbol was reconstructed as a fe... The triskeles symbol is a representation of th... The triskeles symbol is a representation of a ... The triskeles symbol represents three interloc... The triskeles symbol is a representation of th... A
3 3 What is the significance of regularization in ... Regularizing the mass-energy of an electron wi... Regularizing the mass-energy of an electron wi... Regularizing the mass-energy of an electron wi... Regularizing the mass-energy of an electron wi... Regularizing the mass-energy of an electron wi... C
4 4 Which of the following statements accurately d... The angular spacing of features in the diffrac... The angular spacing of features in the diffrac... The angular spacing of features in the diffrac... The angular spacing of features in the diffrac... The angular spacing of features in the diffrac... D
# For convenience we'll turn our pandas DataFrame into a Dataset
from datasets import Dataset
train_ds = Dataset.from_pandas(train_df)
from transformers import AutoTokenizer

# The path of the model checkpoint we want to use
model_dir = '/kaggle/input/huggingface-bert/bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_dir)
# We'll create a dictionary to convert option names (A, B, C, D, E) into indices and back again
options = 'ABCDE'
indices = list(range(5))

option_to_index = {option: index for option, index in zip(options, indices)}
index_to_option = {index: option for option, index in zip(options, indices)}
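
# A quick round-trip sanity check of the two mappings (purely illustrative):
option_to_index['D'], index_to_option[3]  # (3, 'D')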

def preprocess(example):
    # The AutoModelForMultipleChoice class expects one prompt/option pair per
    # choice, so we'll repeat our question 5 times before tokenizing
    first_sentence = [example['prompt']] * 5
    second_sentence = [example[option] for option in options]
    # Our tokenizer will turn our text into token IDs BERT can understand
    tokenized_example = tokenizer(first_sentence, second_sentence, truncation=True)
    tokenized_example['label'] = option_to_index[example['answer']]
    return tokenized_example

tokenized_train_ds = train_ds.map(preprocess, batched=False, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
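This warning appears because the local checkpoint doesn't declare a model_max_length, so truncation=True has no length to truncate to; it's harmless for questions this short. If you wanted truncation to actually apply, you could pass an explicit cap in preprocess (512 is BERT's position-embedding limit), e.g. tokenizer(first_sentence, second_sentence, truncation=True, max_length=512).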
# The following data collator (adapted from https://huggingface.co/docs/transformers/tasks/multiple_choice)
# will dynamically pad our questions at batch time, so we don't have to pad every question to the
# length of our longest question.
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    
    def __call__(self, features):
        label_name = "label" if 'label' in features[0].keys() else 'labels'
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch
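To see what the collator produces, here's a quick shape check (illustrative; the Trainer normally drops unused columns like id for us, so we select the tokenized keys by hand):

# Collate two examples by hand. We expect (batch_size, num_choices, seq_len)
# for the inputs and (batch_size,) for the labels.
keys = ('input_ids', 'token_type_ids', 'attention_mask', 'label')
features = [{k: tokenized_train_ds[i][k] for k in keys} for i in range(2)]
batch = DataCollatorForMultipleChoice(tokenizer=tokenizer)(features)
print({k: tuple(v.shape) for k, v in batch.items()})
# e.g. {'input_ids': (2, 5, 90), 'token_type_ids': (2, 5, 90),
#       'attention_mask': (2, 5, 90), 'labels': (2,)}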
# Now we'll instantiate the model that we'll fine-tune on our public dataset, then use to
# make predictions on the private dataset.
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer
model = AutoModelForMultipleChoice.from_pretrained(model_dir)
# The arguments here are selected to run quickly; feel free to play with them.
model_dir = 'finetuned_bert'  # reusing this variable as the Trainer's output directory
training_args = TrainingArguments(
    output_dir=model_dir,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to='none'
)
# Generally it's a bad idea to validate on your training set, but because our training set
# for this problem is so small, we're going to train on all of it and reuse it for evaluation.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_ds,
    eval_dataset=tokenized_train_ds,
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
)
# Training should take about a minute
trainer.train()
/opt/conda/lib/python3.10/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
[75/75 00:47, Epoch 3/3]
Epoch Training Loss Validation Loss
1 No log 1.564447
2 No log 1.527968
3 No log 1.417341

TrainOutput(global_step=75, training_loss=1.5644291178385417, metrics={'train_runtime': 54.1225, 'train_samples_per_second': 11.086, 'train_steps_per_second': 1.386, 'total_flos': 156631893796800.0, 'train_loss': 1.5644291178385417, 'epoch': 3.0})
# Now we can actually make predictions on our questions
predictions = trainer.predict(tokenized_train_ds)
# The following function gets the indices of the highest scoring answers for each row
# and converts them back to our answer format (A, B, C, D, E)
import numpy as np
def predictions_to_map_output(predictions):
    sorted_answer_indices = np.argsort(-predictions)
    top_answer_indices = sorted_answer_indices[:,:3] # Get the first three answers in each row
    top_answers = np.vectorize(index_to_option.get)(top_answer_indices)
    return np.apply_along_axis(lambda row: ' '.join(row), 1, top_answers)
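
# Before running it on real logits, a quick check on a made-up 2x5 array
# (the values are arbitrary): row 0 peaks at index 3 ('D'), row 1 at index 0 ('A').
dummy_logits = np.array([[0.1, 0.2, 0.0, 0.9, 0.5],
                         [0.8, 0.3, 0.6, 0.1, 0.2]])
predictions_to_map_output(dummy_logits)  # array(['D E B', 'A C B'], dtype='<U5')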
    
# Let's double check our output looks correct:
predictions_to_map_output(predictions.predictions)
array(['D B E', 'B A D', 'A C D', 'C D A', 'E D C', 'A B E', 'D C B',
       'D B E', 'C E D', 'A E C', 'E B D', 'D A B', 'E C A', 'E A D',
       'B A E', 'B E D', 'E B A', 'E B A', 'A D B', 'D A E', 'B D C',
       'E D B', 'C A B', 'C A B', 'E A D', 'E D A', 'A D E', 'D C B',
       'E B C', 'C B E', 'B E D', 'E C A', 'E D B', 'E D B', 'E B C',
       'B D A', 'A D E', 'A D C', 'E A D', 'E A C', 'A B C', 'B A C',
       'B C A', 'D C E', 'D C E', 'A B C', 'B C A', 'C E B', 'E B D',
       'B A C', 'B D A', 'E D B', 'A C D', 'A C D', 'B A C', 'B D E',
       'C B D', 'C B A', 'A B D', 'B C A', 'B E D', 'B E C', 'A C D',
       'C B A', 'A D C', 'E A D', 'C D A', 'E A B', 'E C D', 'D C E',
       'B C D', 'A C E', 'D E A', 'B D C', 'D C A', 'E D C', 'B A C',
       'B C E', 'C A D', 'A D B', 'D C E', 'A D E', 'C E B', 'A C B',
       'C B D', 'E A C', 'C B D', 'B C D', 'E A C', 'D A B', 'D B A',
       'B C D', 'D B C', 'E D B', 'E C A', 'C B D', 'C D A', 'D B E',
       'C D E', 'E C D', 'D E A', 'B C A', 'C B D', 'E B D', 'D E A',
       'D B C', 'A B C', 'D E B', 'E C A', 'D A C', 'E C A', 'D B C',
       'A B D', 'B D E', 'D C B', 'E B C', 'E B C', 'C E D', 'D A B',
       'A B C', 'D A C', 'C B A', 'B D E', 'C B A', 'C B A', 'E D C',
       'D B C', 'C D E', 'E B D', 'D E B', 'B E D', 'C D A', 'E D B',
       'B C E', 'C A E', 'E D B', 'E B D', 'E A D', 'A D B', 'A D C',
       'C A E', 'E D C', 'E C B', 'A D B', 'C E B', 'D C E', 'A D C',
       'A C D', 'B A D', 'B D A', 'B E D', 'C E B', 'A C D', 'E C A',
       'B E D', 'B D C', 'D B E', 'A C B', 'D A E', 'E A B', 'A B D',
       'C E D', 'A D C', 'B A D', 'C B D', 'D C A', 'C D A', 'D A E',
       'E A C', 'A C D', 'B C E', 'B E A', 'E C B', 'C B E', 'C E A',
       'E C A', 'E A B', 'C D B', 'D C A', 'C D E', 'C B E', 'C E B',
       'A E C', 'C B D', 'A C D', 'A E D', 'D C E', 'A D B', 'D C B',
       'C B E', 'D E A', 'C D E', 'E A B', 'C B D', 'D E B', 'E D A',
       'B E C', 'D E A', 'D B C', 'A E B'], dtype='<U5')
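Since we have the training answers, we can score these in-sample predictions with the illustrative map_at_3 helper from earlier. Keep in mind the model was trained on these exact questions, so this overstates performance on the hidden test set.

# In-sample MAP@3 (optimistic: the model has already seen these questions).
map_at_3(predictions_to_map_output(predictions.predictions), train_df['answer'])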
# Now we can load up our test set to use our model on!
# The public test.csv isn't the real dataset (it's actually just a copy of train.csv without the answer column)
# but it has the same format as the real test set, so using it is a good way to ensure our code will work when we submit.
test_df = pd.read_csv('/kaggle/input/kaggle-llm-science-exam/test.csv')
test_df.head()
id prompt A B C D E
0 0 Which of the following statements accurately d... MOND is a theory that reduces the observed mis... MOND is a theory that increases the discrepanc... MOND is a theory that explains the missing bar... MOND is a theory that reduces the discrepancy ... MOND is a theory that eliminates the observed ...
1 1 Which of the following is an accurate definiti... Dynamic scaling refers to the evolution of sel... Dynamic scaling refers to the non-evolution of... Dynamic scaling refers to the evolution of sel... Dynamic scaling refers to the non-evolution of... Dynamic scaling refers to the evolution of sel...
2 2 Which of the following statements accurately d... The triskeles symbol was reconstructed as a fe... The triskeles symbol is a representation of th... The triskeles symbol is a representation of a ... The triskeles symbol represents three interloc... The triskeles symbol is a representation of th...
3 3 What is the significance of regularization in ... Regularizing the mass-energy of an electron wi... Regularizing the mass-energy of an electron wi... Regularizing the mass-energy of an electron wi... Regularizing the mass-energy of an electron wi... Regularizing the mass-energy of an electron wi...
4 4 Which of the following statements accurately d... The angular spacing of features in the diffrac... The angular spacing of features in the diffrac... The angular spacing of features in the diffrac... The angular spacing of features in the diffrac... The angular spacing of features in the diffrac...
# There are more verbose/elegant ways of doing this, but if we give our test set a random `answer` column
# we can make predictions directly with our trainer.
test_df['answer'] = 'A'

# Other than that, we'll preprocess it the same way we preprocessed train.csv
test_ds = Dataset.from_pandas(test_df)
tokenized_test_ds = test_ds.map(preprocess, batched=False, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
# Here we'll generate our "real" predictions on the test set
test_predictions = trainer.predict(tokenized_test_ds)
# Now we can create our submission using the id column from test.csv
submission_df = test_df[['id']].copy()  # .copy() avoids pandas' SettingWithCopyWarning
submission_df['prediction'] = predictions_to_map_output(test_predictions.predictions)

submission_df.head()
id prediction
0 0 D B E
1 1 B A D
2 2 A C D
3 3 C D A
4 4 E D C
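The final step, not shown above, is writing the predictions out; Kaggle's code competitions expect a file named submission.csv:

# Write the predictions to submission.csv, the filename Kaggle looks for.
submission_df.to_csv('submission.csv', index=False)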