Transformers

Welcome to the fascinating world of transformers—no, not the giant robots, but the powerful language models that have revolutionized natural language processing (NLP) and artificial intelligence (AI) as we know it! In this chapter, we’ll learn about the inner workings of these state-of-the-art models and see firsthand how they’ve transformed (pun intended) the field of NLP.

Transformers have become a backbone for modern AI applications, powering everything from chatbots and smart assistants to machine translation and content summarization. They’ve even branched out beyond text, making waves in fields like computer vision, music generation, and protein folding. So, let’s get started and see what all the fuss is about!

Language Models: A Conceptual Overview

Before we delve into the nitty-gritty of transformers, let’s take a step back and understand what language modeling is. At its core, a language model is a probabilistic model that learns to predict the next word (or token) in a sequence based on the preceding words. By doing so, it captures the underlying structure and patterns in the language, allowing it to generate realistic and coherent text.
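One standard way to write this down (not specific to any particular model) is as a product of next-token probabilities: the probability of a whole sequence is factored into one prediction per token, each conditioned on everything that came before it:

P(w_1, w_2, ..., w_n) = P(w_1) · P(w_2 | w_1) · ... · P(w_n | w_1, ..., w_{n-1})

Generating text then amounts to repeatedly choosing (or sampling) the next token from P(next token | tokens so far) and appending it to the sequence.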

So, how do transformers fit into this picture? Unlike traditional language models that use fixed-sized sliding windows or recurrent neural networks (RNNs), transformers are designed to handle long-range dependencies and complex relationships between words more efficiently and expressively. At the heart of this innovation lies the self-attention mechanism, which allows the model to weigh the importance of each word in the context of the entire sequence.

Here’s a high-level overview of how transformers work:

  • Tokenization: The input text is broken down into individual tokens (words or subwords), which are then mapped to numerical values called embeddings.
  • Positional Encoding: To retain information about the original order of the tokens, a positional encoding is incorporated into the embeddings.
  • Self-Attention: The model calculates a weighted sum of the embeddings, emphasizing the most relevant tokens in the context of the entire sequence.
  • Feed-Forward Neural Network: The self-attention output is passed through a feed-forward neural network, which further refines the representation.
  • Prediction: An additional layer or a separate decoder processes the final representation into a task-dependent final output.

[diagram of a transformer block with self-attention, positional encoding, and feed-forward neural network blocks]
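To make the components shown above a bit more concrete, here is a minimal sketch of a single transformer block in PyTorch. This is a simplified illustration (the default sizes are GPT-2-like, and real implementations differ in details such as where layer normalization is applied), not the exact code used by any particular model:

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """A simplified transformer block: self-attention followed by a feed-forward
    network, each wrapped with a residual connection and layer normalization."""
    def __init__(self, embed_dim=768, num_heads=12, ff_dim=3072):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Self-attention: every position attends to every other position in the sequence
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)
        # The feed-forward network further refines each position's representation
        x = self.norm2(x + self.feed_forward(x))
        return x

# A batch of 1 sequence with 7 token embeddings of dimension 768
block = TransformerBlock()
print(block(torch.randn(1, 7, 768)).shape)  # torch.Size([1, 7, 768])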

The true power of transformers comes from stacking multiple layers (or blocks) of these components, allowing the model to learn increasingly complex and abstract relationships between the input tokens. As we’ll see in the following sections, this architecture has enabled transformers to achieve unprecedented performance in a wide variety of tasks and domains. In the next section, we’ll explore how transformers work in action, including tokenization, predicting probabilities, and generating text. Throughout the rest of this chapter, and indeed the rest of this book, you’ll see them cropping up again and again.

A Language Model in Action

In this section, we’re going to load and interact with a pre-trained transformer model to get a feel for how they work. We’ll be using the GPT-2 model, which made headlines in 2019 for its impressive text-generation capabilities. Small and almost quaint by today’s standards, GPT-2 is nevertheless a good illustration of how these language models work and the same principles apply to the larger and more powerful models that have since been released.

Tokenizing Text

To feed text into a model, we first need a way to turn strings into numbers. This process is called tokenization, and it’s a crucial step in any NLP pipeline. There are many possible strategies. An easy option would be to split the text into individual characters and assign each a unique numerical ID. However, this approach requires many tokens to represent a given string (a downside for performance) and erases some of the structure and meaning of the text (a downside for accuracy). Another approach would be to split the text into individual words: this captures more meaning per token, but we then have to deal with unknown words (e.g. typos, slang, etc.) and with different forms of the same word (e.g. “run”, “runs”, “running”, etc.). Modern tokenization strategies strike a balance between these two extremes, splitting the text into subwords that capture both the structure and the meaning of the text while still being able to handle unknown words and different word forms. To see this in action, let’s look at how the GPT-2 tokenizer handles a sentence:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_ids = tokenizer("It was a dark and stormy", return_tensors="pt").input_ids
input_ids
tensor([[1026,  373,  257, 3223,  290, 6388,   88]])
for t in input_ids[0]:
  print(t, ':', tokenizer.decode(t))
tensor(1026) : It
tensor(373) :  was
tensor(257) :  a
tensor(3223) :  dark
tensor(290) :  and
tensor(6388) :  storm
tensor(88) : y

Most words are represented by a single token, but stormy is represented by two tokens: one for storm (including the space before the word) and one for the suffix y. This allows the model to learn that stormy is related to storm and that the suffix y is often used to turn nouns into adjectives. With a ‘vocabulary’ of around 50,000 tokens, the GPT-2 tokenizer can efficiently represent almost any input text, and averages about 1.3 tokens per word.

Predicting Probabilities

GPT-2 was trained as an auto-regressive language model, which means that it was trained to predict the next word in a sequence given the preceding words. The transformers library has pipelines that make it easy to use such a model to generate text or perform other tasks, but before we get to that it is useful to see how the model makes its predictions by inspecting them directly on this base language-modeling task. We begin by loading the model.

A quick note on the “Auto” classes in the transformers library: classes such as AutoTokenizer and AutoModelForCausalLM inspect the name of the pre-trained model you pass to from_pretrained and automatically load the matching architecture and weights, so the same code works across many different model families.

from transformers import AutoModelForCausalLM
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2", return_dict_in_generate=True)

If we feed the tokenized sentence from the previous section through the model, we see that we get a result back with 50,257 values for each token in the input string:

gpt2(input_ids).logits.shape # An output for each input token
torch.Size([1, 7, 50257])

These are the raw model outputs (logits), with each value corresponding to a possible token. Higher values mean that the model considers the corresponding token a more likely continuation of the sequence. Let’s focus on the final set of outputs, which are the model’s predictions for what token should follow the input sequence we fed in:

final_logits = gpt2(input_ids).logits[0, -1] # The last set of logits
final_logits.shape # One for each possible token
torch.Size([50257])

We can find the index of the token with the highest value using the argmax method:

final_logits.argmax() # The position of the maximum
tensor(1755)

This is the token the model considers most likely to follow the input string “It was a dark and stormy”. Decoding this token, we can see that this model knows a few story tropes:

tokenizer.decode(final_logits.argmax()) # Decoding this, and it looks like the model has read lots of short stories :)
' night'

To turn the logits into probability values, we use the softmax operation to scale them such that their values sum to 1. The following code does this and prints out the top 10 most likely tokens and their associated probabilities according to the model:

# Looking at the probabilities for the top 10 tokens:
import torch
top10 = torch.topk(final_logits.softmax(dim=0), 10)
for value, index in zip(top10.values, top10.indices):
    print(f'{tokenizer.decode(index):<10} {value.item():.2%}')
 night     46.18%
 day       23.46%
 evening   5.87%
 morning   4.42%
 afternoon 4.11%
 summer    1.34%
 time      1.33%
 winter    1.22%
 weekend   0.39%
,          0.38%

Try this with different input texts - do you tend to agree with the model’s predictions? What happens if you feed in a longer or shorter input string? What happens if you feed in a string that is not a grammatically correct sentence?

Generating Text

Once we know how to get the model’s predictions for the next token in a sequence, it is easy to generate text by repeatedly feeding the model’s predictions back into itself. The simplest strategy is to repeatedly take the model’s most likely next token and append it to the input sequence - so-called “greedy decoding”. Unfortunately, greedy decoding often leads to repetitive and nonsensical text. A better strategy is to sample the next token from the model’s predicted probability distribution, “softened” by a temperature parameter: a temperature of 0 is equivalent to greedy decoding, while higher temperatures lead to more random sampling. Compare the effect of this temperature parameter on the generated text in the pipeline example below.
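To make this concrete, here is a minimal hand-rolled sketch of greedy decoding, reusing the gpt2 model and tokenizer we loaded earlier:

import torch

# Greedy decoding by hand: at each step, take the single most likely next token
# (the argmax of the final logits) and append it to the sequence.
generated = tokenizer("It was a dark and stormy", return_tensors="pt").input_ids
for _ in range(8):
    next_token = gpt2(generated).logits[0, -1].argmax()
    generated = torch.cat([generated, next_token.view(1, 1)], dim=-1)
print(tokenizer.decode(generated[0]))

In practice you would rarely write this loop yourself; the text-generation pipeline used below handles it (along with sampling and the temperature parameter) for you: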

from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42) # For reproducibility
generator("The humble tomato", max_length=30, num_return_sequences=3, temperature=0.7)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': "The humble tomato is a simple, healthy tomato that's great for everyone! The flavor is great, it's a great combination of sweet and sour,"},
 {'generated_text': 'The humble tomato sauce is just what you need for a decent lunch and dinner with your friends. You can add some tomato paste to help you get your'},
 {'generated_text': "The humble tomato was used for its flavor, it's important for its texture and its color.\n\nThe tomato is used in a wide variety of"}]
generator("The humble tomato", max_length=30, num_return_sequences=3, temperature=0.01)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': "The humble tomato is a staple in many Italian restaurants, and it's also a staple in many Italian restaurants.\n\nThe tomato is a staple in"},
 {'generated_text': "The humble tomato is a staple in many Italian restaurants, and it's a staple in many Italian dishes. It's a great way to get a little"},
 {'generated_text': "The humble tomato is a staple in many Italian restaurants, and it's a staple in many Italian dishes. It's a great way to get a little"}]

Rather than committing to a single token at a time, techniques such as beam search ‘explore’ multiple possible continuations of the sequence in parallel and return the most likely one.
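For example (a sketch using the generate method of the gpt2 model we loaded earlier; the exact completion will depend on the model), beam search can be enabled via the num_beams argument:

# Beam search: keep the 5 most likely partial sequences at each step and
# return the overall most likely completion.
beam_input = tokenizer("It was a dark and stormy", return_tensors="pt").input_ids
beam_output = gpt2.generate(beam_input, max_new_tokens=20, num_beams=5)
print(tokenizer.decode(beam_output.sequences[0]))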

There are additional arguments for controlling the generation:

  • top_k: how many tokens to consider (avoids choosing extremely unlikely tokens)
  • repetition_penalty: how much to penalize tokens that have already been generated (avoids repetition)
  • bad_words_ids: a list of tokens that should not be generated (avoids generating offensive words)
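As a quick illustration (a sketch reusing the generator pipeline from above; the particular values are just examples to experiment with), these options can be passed straight to the pipeline call:

# Sampling with some extra generation controls (values here are just illustrative).
set_seed(42)
generator(
    "The humble tomato",
    max_length=30,
    do_sample=True,          # sample instead of always taking the most likely token
    top_k=50,                # only consider the 50 most likely tokens at each step
    repetition_penalty=1.2,  # penalize tokens that have already been generated
)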


Zero-Shot Classification

Generating language is a fun and interesting application of transformers, but writing fake articles about unicorns is not the use-case that has made them so popular. You see, to predict the next token well, these models must ‘learn’ a fair amount about the world. We can take advantage of this to perform various downstream tasks. For example, instead of training a model dedicated to translation, we can simply prompt a sufficiently powerful language model with an input like:

Translate the following sentence from English to French: 
Input: The cat sat on the mat.
Translation: 

I typed this example with GitHub Copilot active, and it helpfully suggested “Le chat était assis sur le tapis.” as a continuation of the above prompt - a perfect illustration of how a language model can perform tasks it was not explicitly trained for. The more powerful the model, the more tasks it can perform without any additional training. This is the true power of transformers, and it’s what has made them so popular in recent years.

To see this in action for ourselves, let’s put GPT-2 to use as a classification model. Specifically, we’ll try to classify movie reviews as either positive or negative - a classic benchmark task in the NLP field. To make things interesting, we’ll use a zero-shot approach, which means we won’t train the model on any labeled data. Instead, we’ll simply prompt the model with the text of a review and ask it to predict the sentiment. Let’s see how it does!

To do this, we’ll insert the review into a prompt template that hopefully provides some context for the model and helps it understand what we’re asking it to do. After feeding the prompt through the model, we’ll look at its prediction for the next token and see which of the possible answers is assigned a higher probability:

# Given a review, predict whether it is positive or negative using a bit of clever prompting
def score(review):
  prompt = f"Question: Is the following review positive or negative about the movie? Review: {review} Answer:"
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  final_logits = gpt2(input_ids).logits[0, -1]
  # 3967 and 4633 are the token IDs for ' positive' and ' negative' (see below)
  if final_logits[3967] > final_logits[4633]:
    print('Positive')
  else:
    print('Negative')

For reference, here’s how to find the token ID for a possible answer:

# Check the token IDs for the words 'positive' and 'negative' (note the space before the words)
tokenizer.encode(' positive'), tokenizer.encode(' negative')
([3967], [4633])

We can try out this zero-shot classifier on a few fake reviews to see how it does:

score("This movie was terrible!")
Negative
score('That was a delight to watch, 10/10 would recommend :)')
Positive
score("A complex yet wonderful film about the depravity of man") # A mistake
Negative

In the supplementary material, you’ll find a dataset of labeled reviews and code to assess the accuracy of this zero-shot approach. Can you tweak the prompt template to improve the model’s performance? Can you think of other tasks that could be performed using a similar approach?
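If you want a head start, a minimal version of such an accuracy check might look like the following sketch (the reviews and labels here are made up for illustration; the supplementary material contains a real labeled dataset):

# A tiny hand-labeled evaluation set, made up for illustration.
labeled_reviews = [
    ("An instant classic, I laughed the whole way through.", "Positive"),
    ("Two hours of my life I will never get back.", "Negative"),
    ("The plot made no sense and the acting was wooden.", "Negative"),
]

# Same prompting trick as score(), but returning the label instead of printing it.
def predict(review):
  prompt = f"Question: Is the following review positive or negative about the movie? Review: {review} Answer:"
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  final_logits = gpt2(input_ids).logits[0, -1]
  return 'Positive' if final_logits[3967] > final_logits[4633] else 'Negative'

correct = sum(predict(review) == label for review, label in labeled_reviews)
print(f"Accuracy: {correct}/{len(labeled_reviews)}")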

The Power of Pre-Training

Traditional approaches to a task such as movie review classification (from the previous section) were limited by the availability of labeled data. A model would be trained from scratch on a large corpus of labeled examples, learning to predict the label directly from the input text. This approach is often referred to as supervised learning. However, it has a major drawback: it requires a large amount of labeled data to train effectively, and labeled data is expensive and time-consuming to obtain. In many cases, it’s simply not available at all. To address this, researchers began to look for a way to train models on existing data that could then be fine-tuned for a specific task. This approach is known as transfer learning, and it’s the foundation of modern AI.

Transfer learning had already been successfully applied to computer vision, where large models trained to classify millions of images could be used to initialize a smaller model for a specific task. Language modeling turned out to be the perfect task for pre-training, as it does not require human-annotated data and can be performed on a wide variety of text. Initial work focused on finding domain-specific corpora for this language model pre-training phase, but papers such as ULMFiT (reference) showed that even pre-training on generic text such as Wikipedia could yield impressive results when the models were then fine-tuned on downstream tasks. This set the stage for the rise of transformers, which turned out to be extremely well-suited to the task of learning rich representations of language.

The titles of the GPT papers themselves tell the story of this progression well:

  • Improving Language Understanding by Generative Pre-Training: The original GPT paper, published in 2018 by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It introduced the idea of using a Transformer-based model pre-trained on a large corpus of text to learn general language representations, and then fine-tuning it on specific downstream tasks. The paper also showed that the GPT model achieved state-of-the-art results on several natural language understanding benchmarks at the time.
  • Language Models are Unsupervised Multitask Learners: Published in 2019 by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. It presented GPT-2, a Transformer-based model with 1.5 billion parameters that was pre-trained on a large corpus of web text called WebText. The paper demonstrated that GPT-2 could perform well on various natural language tasks without any fine-tuning, such as text generation, summarization, translation, reading comprehension and commonsense reasoning, and it also discussed the potential ethical and social implications of large-scale language models.
  • Language Models are Few-Shot Learners: Published in 2020 by Tom B. Brown and others. This paper presents GPT-3, an autoregressive language model with 175 billion parameters, and shows that scaling up language models greatly improves their ability to perform new language tasks from only a few examples or from simple instructions, without any fine-tuning or gradient updates. GPT-3 achieves strong performance on many NLP datasets and tasks.


Sequence-to-Sequence Tasks

The architecture shown at the start of this chapter has a single stack of transformer blocks that processes an input sequence. This is a popular approach today, but the original transformer paper (“Attention Is All You Need”, Vaswani et al., 2017) used a more complicated encoder-decoder architecture, which is still in common use:

[Diagram]

In these models, one stack of transformer blocks processes an input sequence into a set of rich representations, which are then fed into another stack of transformer blocks that decodes them into an output sequence. This architecture is used for a variety of tasks, including summarization, translation, and question-answering. For example, you might feed an English sentence through the encoder of a translation model, and then generate the corresponding French sentence using the decoder. The generation happens one token at a time just like we saw when generating sequences earlier in the chapter, but this time generation starts with a special token and the predictions for each successive token are informed not just by the current sequence being generated, but also by the output from the encoder.
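As a concrete example (a sketch using a publicly available encoder-decoder translation checkpoint, not a model we use elsewhere in this chapter), the pipeline API makes it easy to run such a model:

from transformers import pipeline

# Helsinki-NLP/opus-mt-en-fr is an encoder-decoder model trained for English-to-French
# translation: the encoder reads the English sentence, the decoder generates the French one.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
translator("The cat sat on the mat.")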

The mechanism by which the embeddings from the encoder side are incorporated into the decoder stack is called ‘cross-attention’. It resembles self-attention, except that each token in the decoder’s sequence ‘attends’ to the output from the encoder rather than to the other tokens in its own sequence. These cross-attention layers are interleaved with self-attention layers, allowing the decoder to use both the context within its own sequence and the information from the encoder.
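In code terms (a simplified sketch using the same attention module as the transformer block example earlier), cross-attention simply means that the queries come from the decoder’s sequence while the keys and values come from the encoder’s output:

import torch
import torch.nn as nn

cross_attention = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
encoder_output = torch.randn(1, 10, 768)  # representations of a 10-token input sentence
decoder_states = torch.randn(1, 4, 768)   # the 4 tokens generated so far by the decoder
# Queries come from the decoder; keys and values come from the encoder output.
attn_out, _ = cross_attention(decoder_states, encoder_output, encoder_output)
print(attn_out.shape)  # torch.Size([1, 4, 768])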

Beyond Text

As hinted at the start of this chapter, the same architecture has spread far beyond text: transformers are now applied to proteins, music, images, and more.

Project Time: Putting LMs to Work

Practice your promptcraft: can you get a language model to work better at zero-shot classification by tweaking the prompt? Can you turn it into a chatbot?

Conclusions

LLMs, APIs and the future of NLP