LLMs: What's a large language model?

A newer technology, large language models (LLMs) predict a token or a sequence of tokens, sometimes many paragraphs' worth of predicted tokens. Remember that a token can be a word, a subword (a subset of a word), or even a single character. LLMs make much better predictions than N-gram language models or recurrent neural networks because:

  • LLMs contain far more parameters than recurrent models.
  • LLMs gather far more context.
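To make the idea of tokens concrete, here is a toy Python sketch. The particular subword split shown is hypothetical; real LLMs learn their subword vocabularies (for example, with byte-pair encoding) rather than splitting words by hand.

    # Illustrative only: three ways the word "unbelievable" could be tokenized.
    # Real LLMs use learned subword vocabularies; this particular split is made up.
    text = "unbelievable"

    word_tokens = [text]                       # one token per word
    subword_tokens = ["un", "believ", "able"]  # hypothetical subword pieces
    char_tokens = list(text)                   # one token per character

    print(word_tokens)     # ['unbelievable']
    print(subword_tokens)  # ['un', 'believ', 'able']
    print(char_tokens)     # ['u', 'n', 'b', ...]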

This section introduces the most successful and widely used architecture for building LLMs: the Transformer.

What's a Transformer?

Transformers are the state-of-the-art architecture for a wide variety of language model applications, such as translation:

Figure 1. A Transformer-based application that translates from English to French: the input "I am a good dog" becomes the output "Je suis un bon chien," the same sentence translated into French.

Full transformers consist of an encoder and a decoder:

  • An encoder converts input text into an intermediate representation. An encoder is an enormous neural net.
  • A decoder converts that intermediate representation into useful text. A decoder is also an enormous neural net.

For example, in a translator:

  • The encoder processes the input text (for example, an English sentence) into some intermediate representation.
  • The decoder converts that intermediate representation into output text (for example, the equivalent French sentence).
Figure 2. A full Transformer contains both an encoder and a decoder: the encoder generates an intermediate representation of an English sentence, and the decoder converts that intermediate representation into a French output sentence.
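As a rough sketch of this split, PyTorch's nn.Transformer module exposes the same encoder/decoder structure. The sizes, random stand-in inputs, and batch of one below are purely illustrative; a real translator would also need token embeddings and an output vocabulary layer.

    # A minimal sketch of a full Transformer (encoder + decoder) using PyTorch.
    # The inputs here are random vectors standing in for embedded tokens.
    import torch
    import torch.nn as nn

    d_model = 512                      # size of each token's vector representation
    model = nn.Transformer(d_model=d_model, nhead=8,
                           num_encoder_layers=6, num_decoder_layers=6)

    src = torch.rand(10, 1, d_model)   # 10 embedded source tokens (e.g., an English sentence), batch of 1
    tgt = torch.rand(9, 1, d_model)    # 9 embedded target tokens produced so far (e.g., a French prefix)

    # The encoder maps `src` to an intermediate representation ("memory");
    # the decoder combines that memory with `tgt` to produce output representations.
    output = model(src, tgt)
    print(output.shape)                # torch.Size([9, 1, 512])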

What is self-attention?

To enhance context, Transformers rely heavily on a concept called self-attention. Effectively, on behalf of each token of input, self-attention asks the following question:

"How much does each other token of input affect the interpretation of this token?"

The "self" in "self-attention" refers to the input sequence. Some attention mechanisms weight relations of input tokens to tokens in an output sequence like a translation or to tokens in some other sequence. But self-attention only weights the importance of relations between tokens in the input sequence.

To simplify matters, assume that each token is a word and the complete context is only a single sentence. Consider the following sentence:

The animal didn't cross the street because it was too tired.

The preceding sentence contains eleven words. Each of the eleven words is paying attention to the other ten, wondering how much each of those ten words matters to itself. For example, notice that the sentence contains the pronoun it. Pronouns are often ambiguous. The pronoun it typically refers to a recent noun or noun phrase, but in the example sentence, which recent noun does it refer to—the animal or the street?

The self-attention mechanism determines the relevance of each nearby word to the pronoun it. Figure 3 shows the results—the bluer the line, the more important that word is to the pronoun it. That is, animal is more important than street to the pronoun it.

Figure 3. Self-attention scores for the pronoun "it" in the sentence "The animal didn't cross the street because it was too tired." The word "animal" is the most relevant to the pronoun "it." From Transformer: A Novel Neural Network Architecture for Language Understanding.
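The calculation behind Figure 3 can be sketched in a few lines of NumPy. The following toy version uses random, untrained embeddings and projection matrices, so its printed scores won't match the figure; it only shows the mechanics of scaled dot-product self-attention.

    # Toy scaled dot-product self-attention over the example sentence.
    # All embeddings and projection matrices are random placeholders, not learned values.
    import numpy as np

    rng = np.random.default_rng(0)
    tokens = "The animal didn't cross the street because it was too tired".split()
    d = 8                                    # embedding size (illustrative)
    X = rng.normal(size=(len(tokens), d))    # one embedding vector per token

    # In a real Transformer, W_q, W_k, and W_v are learned projection matrices.
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    scores = Q @ K.T / np.sqrt(d)            # how strongly each token attends to every other token
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # softmax, row by row

    # The row for "it" holds its attention weights over all eleven tokens,
    # analogous to the blue lines in Figure 3 (though these values are random).
    for token, weight in zip(tokens, weights[tokens.index("it")]):
        print(f"{token:>8}: {weight:.2f}")

    context = weights @ V                    # each token's new, context-aware representation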

Conversely, suppose the final word in the sentence changes as follows:

The animal didn't cross the street because it was too wide.

In this revised sentence, self-attention would hopefully rate street as more relevant than animal to the pronoun it.

Some self-attention mechanisms are bidirectional, meaning that they calculate relevance scores for tokens preceding and following the word being attended to. For example, in Figure 3, notice that words on both sides of it are examined. So, a bidirectional self-attention mechanism can gather context from words on either side of the word being attended to. By contrast, a unidirectional self-attention mechanism can only gather context from words on one side of the word being attended to. Bidirectional self-attention is especially useful for generating representations of whole sequences, while applications that generate sequences token by token require unidirectional self-attention. For this reason, encoders use bidirectional self-attention, while decoders use unidirectional self-attention.
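To see what "unidirectional" means in code, here is a minimal sketch that again uses random stand-in scores rather than a trained model: a causal mask removes every token that comes after the current one before the softmax is applied.

    # Unidirectional (causal) self-attention: mask out future positions.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 11                                     # eleven tokens, as in the example sentence
    scores = rng.normal(size=(n, n))           # stand-in attention scores (not learned)

    causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal = future tokens
    masked_scores = np.where(causal_mask, -np.inf, scores)    # future tokens get zero attention weight

    weights = np.exp(masked_scores)
    weights /= weights.sum(axis=-1, keepdims=True)            # softmax, row by row
    # Row i now spreads all of its attention over tokens 0..i only, which is what a
    # decoder needs when generating text token by token. Omitting the mask gives the
    # bidirectional behavior an encoder uses.
    print(np.round(weights[4], 2))             # token 4 attends only to tokens 0-4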

What is multi-head self-attention?

Each self-attention layer typically comprises multiple self-attention heads. The output of a layer is a mathematical combination (for example, a weighted average or dot product) of the outputs of the different heads.

Since each self-attention head is initialized to random values, different heads can learn different relationships between each word being attended to and the nearby words. For example, the self-attention layer described in the previous section focused on determining which noun the pronoun it referred to. However, other self-attention heads might learn the grammatical relevance of each word to every other word, or learn other interactions.
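One common way to combine the heads, used by the original Transformer, is to concatenate their outputs and apply a learned projection. The toy NumPy sketch below illustrates that approach; the sizes and weights are random and illustrative, not learned.

    # Toy multi-head self-attention: each head runs its own attention over its own
    # projections, then the head outputs are concatenated and linearly projected.
    import numpy as np

    rng = np.random.default_rng(0)
    n_tokens, d_model, n_heads = 11, 16, 4
    d_head = d_model // n_heads
    X = rng.normal(size=(n_tokens, d_model))   # one embedding vector per token

    def softmax(s):
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    head_outputs = []
    for _ in range(n_heads):
        # Each head gets its own projection matrices, so it can specialize in its own
        # kind of relationship (e.g., coreference for one head, grammar for another).
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        head_outputs.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)

    W_o = rng.normal(size=(d_model, d_model))                     # output projection
    layer_output = np.concatenate(head_outputs, axis=-1) @ W_o    # shape: (11, 16)
    print(layer_output.shape)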

Why are Transformers so large?

Transformers contain hundreds of billions or even trillions of parameters. This course has generally recommended building models with a smaller number of parameters over those with a larger number. After all, a model with fewer parameters uses fewer resources to make predictions than a model with more parameters. However, research shows that Transformers with more parameters consistently outperform Transformers with fewer parameters.

But how does an LLM generate text?

You've seen how researchers train LLMs to predict a missing word or two, and you might be unimpressed. After all, predicting a word or two is essentially the autocomplete feature built into various text, email, and authoring software. You might be wondering how LLMs can generate sentences or paragraphs or haikus about arbitrage.

In fact, LLMs are essentially autocomplete mechanisms that can automatically predict (complete) thousands of tokens. For example, consider a sentence followed by a masked sentence:

My dog, Max, knows how to perform many traditional dog tricks.
___ (masked sentence)

An LLM can generate probabilities for the masked sentence, including:

Probability   Word(s)
3.1%          For example, he can sit, stay, and roll over.
2.9%          For example, he knows how to sit, stay, and roll over.

A sufficiently large LLM can generate probabilities for paragraphs and entire essays. You can think of a user's questions to an LLM as the "given" sentence followed by an imaginary mask. For example:

User's question: What is the easiest trick to teach a dog?
LLM's response:  ___

The LLM generates probabilities for various possible responses.
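The loop below is a minimal sketch of that autocomplete process. The next-token distribution comes from a made-up stand-in function (toy_next_token_probs) rather than a real LLM, and the tiny vocabulary is invented for illustration; the point is only the repeated predict-sample-append structure.

    # Toy autoregressive generation: repeatedly predict a next-token distribution,
    # sample a token, append it to the prompt, and continue.
    import random

    VOCAB = ["he", "can", "sit", "stay", "and", "roll", "over", ",", ".", "<end>"]

    def toy_next_token_probs(prompt_tokens):
        # A real LLM would run a Transformer over `prompt_tokens` here;
        # this stand-in just returns a uniform distribution over a tiny vocabulary.
        return {token: 1.0 / len(VOCAB) for token in VOCAB}

    def generate(prompt, max_tokens=20):
        tokens = prompt.split()
        for _ in range(max_tokens):
            probs = toy_next_token_probs(tokens)
            next_token = random.choices(list(probs), weights=list(probs.values()))[0]
            if next_token == "<end>":
                break
            tokens.append(next_token)      # the prediction becomes part of the next prompt
        return " ".join(tokens)

    print(generate("My dog, Max, knows how to perform many traditional dog tricks."))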

As another example, an LLM trained on a massive number of mathematical "word problems" can give the appearance of doing sophisticated mathematical reasoning. However, those LLMs are basically just autocompleting a word problem prompt.

Benefits of LLMs

LLMs can generate clear, easy-to-understand text for a wide variety of target audiences. LLMs can make predictions on the tasks they are explicitly trained for. Some researchers claim that LLMs can also make predictions for input they were not explicitly trained on, but other researchers have refuted this claim.

Problems with LLMs

Training an LLM entails many problems, including:

  • Gathering an enormous training set.
  • Consuming enormous computational resources and electricity over multiple months of training.
  • Solving parallelism challenges.

Using LLMs to infer predictions causes the following problems:

  • LLMs hallucinate, meaning their predictions often contain mistakes.
  • LLMs consume enormous amounts of computational resources and electricity. Training LLMs on larger datasets typically reduces the amount of resources required for inference, though the larger training sets incur more training resources.
  • Like all ML models, LLMs can exhibit all sorts of bias.

Exercise: Check your understanding

Suppose a Transformer is trained on a billion documents, including thousands of documents containing at least one instance of the word elephant. Which of the following statements are probably true?

  • Acacia trees, an important part of an elephant's diet, will gradually gain a high self-attention score with the word elephant.
    True. This will enable the Transformer to answer questions about an elephant's diet.
  • The Transformer will associate the word elephant with various idioms that contain the word elephant.
    True. The system will begin to attach high self-attention scores between the word elephant and other words in elephant idioms.
  • The Transformer will gradually learn to ignore any sarcastic or ironic uses of the word elephant in training data.
    False. Sufficiently large Transformers trained on a sufficiently broad training set become quite adept at recognizing sarcasm, humor, and irony. So, rather than ignoring sarcasm and irony, the Transformer learns from it.