LLMs: What's a large language model?

A newer technology, large language models (LLMs) predict a token or a sequence of tokens, sometimes many paragraphs' worth of predicted tokens. Remember that a token can be a word, a subword (a subset of a word), or even a single character. LLMs make much better predictions than N-gram language models or recurrent neural networks because:

  • LLMs contain far more parameters than recurrent models.
  • LLMs gather far more context.
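To make the idea of tokens concrete, here is a toy Python sketch. The particular subword split shown is hypothetical; real LLMs learn their subword vocabularies (for example, with byte-pair encoding) rather than splitting words by hand.

    # Illustrative only: three ways the word "unbelievable" could be tokenized.
    # Real LLMs use learned subword vocabularies; this particular split is made up.
    text = "unbelievable"

    word_tokens = [text]                       # one token per word
    subword_tokens = ["un", "believ", "able"]  # hypothetical subword pieces
    char_tokens = list(text)                   # one token per character

    print(word_tokens)     # ['unbelievable']
    print(subword_tokens)  # ['un', 'believ', 'able']
    print(char_tokens)     # ['u', 'n', 'b', ...]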

This section introduces the most successful and widely used architecture for building LLMs: the Transformer.

What's a Transformer?

Transformers are the state-of-the-art architecture for a wide variety of language model applications, such as translation:

Figure 1. A Transformer-based application that translates from English to French: the input "I am a good dog" becomes the output "Je suis un bon chien," the same sentence translated into French.

Full transformers consist of an encoder and a decoder:

  • An encoder converts input text into an intermediate representation. An encoder is an enormous neural net.
  • A decoder converts that intermediate representation into useful text. A decoder is also an enormous neural net.

For example, in a translator:

  • The encoder processes the input text (for example, an English sentence) into some intermediate representation.
  • The decoder converts that intermediate representation into output text (for example, the equivalent French sentence).
Figure 2. A full Transformer contains both an encoder and a decoder: the encoder generates an intermediate representation of an English sentence, and the decoder converts that intermediate representation into a French output sentence.
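As a rough sketch of this split, PyTorch's nn.Transformer module exposes the same encoder/decoder structure. The sizes, random stand-in inputs, and batch of one below are purely illustrative; a real translator would also need token embeddings and an output vocabulary layer.

    # A minimal sketch of a full Transformer (encoder + decoder) using PyTorch.
    # The inputs here are random vectors standing in for embedded tokens.
    import torch
    import torch.nn as nn

    d_model = 512                      # size of each token's vector representation
    model = nn.Transformer(d_model=d_model, nhead=8,
                           num_encoder_layers=6, num_decoder_layers=6)

    src = torch.rand(10, 1, d_model)   # 10 embedded source tokens (e.g., an English sentence), batch of 1
    tgt = torch.rand(9, 1, d_model)    # 9 embedded target tokens produced so far (e.g., a French prefix)

    # The encoder maps `src` to an intermediate representation ("memory");
    # the decoder combines that memory with `tgt` to produce output representations.
    output = model(src, tgt)
    print(output.shape)                # torch.Size([9, 1, 512])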

What is self-attention?

To enhance context, Transformers rely heavily on a concept called self-attention. Effectively, on behalf of each token of input, self-attention asks the following question:

"How much does each other token of input affect the interpretation of this token?"

The "self" in "self-attention" refers to the input sequence. Some attention mechanisms weight relations of input tokens to tokens in an output sequence like a translation or to tokens in some other sequence. But self-attention only weights the importance of relations between tokens in the input sequence.

To simplify matters, assume that each token is a word and the complete context is only a single sentence. Consider the following sentence:

The animal didn't cross the street because it was too tired.

The preceding sentence contains eleven words. Each of the eleven words is paying attention to the other ten, wondering how much each of those ten words matters to itself. For example, notice that the sentence contains the pronoun it. Pronouns are often ambiguous. The pronoun it typically refers to a recent noun or noun phrase, but in the example sentence, which recent noun does it refer to—the animal or the street?

The self-attention mechanism determines the relevance of each nearby word to the pronoun it. Figure 3 shows the results—the bluer the line, the more important that word is to the pronoun it. That is, animal is more important than street to the pronoun it.

Figure 3. Self-attention scores for the pronoun "it" in the sentence "The animal didn't cross the street because it was too tired." The word "animal" is the most relevant to the pronoun "it." From Transformer: A Novel Neural Network Architecture for Language Understanding.
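The calculation behind Figure 3 can be sketched in a few lines of NumPy. The following toy version uses random, untrained embeddings and projection matrices, so its printed scores won't match the figure; it only shows the mechanics of scaled dot-product self-attention.

    # Toy scaled dot-product self-attention over the example sentence.
    # All embeddings and projection matrices are random placeholders, not learned values.
    import numpy as np

    rng = np.random.default_rng(0)
    tokens = "The animal didn't cross the street because it was too tired".split()
    d = 8                                    # embedding size (illustrative)
    X = rng.normal(size=(len(tokens), d))    # one embedding vector per token

    # In a real Transformer, W_q, W_k, and W_v are learned projection matrices.
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    scores = Q @ K.T / np.sqrt(d)            # how strongly each token attends to every other token
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # softmax, row by row

    # The row for "it" holds its attention weights over all eleven tokens,
    # analogous to the blue lines in Figure 3 (though these values are random).
    for token, weight in zip(tokens, weights[tokens.index("it")]):
        print(f"{token:>8}: {weight:.2f}")

    context = weights @ V                    # each token's new, context-aware representation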

Conversely, suppose the final word in the sentence changes as follows:

The animal didn't cross the street because it was too wide.

In this revised sentence, self-attention would hopefully rate street as more relevant than animal to the pronoun it.

Some self-attention mechanisms are bidirectional, meaning that they calculate relevance scores for tokens preceding and following the word being attended to. For example, in Figure 3, notice that words on both sides of it are examined. So, a bidirectional self-attention mechanism can gather context from words on either side of the word being attended to. By contrast, a unidirectional self-attention mechanism can only gather context from words on one side of the word being attended to. Bidirectional self-attention is especially useful for generating representations of whole sequences, while applications that generate sequences token by token require unidirectional self-attention. For this reason, encoders use bidirectional self-attention, while decoders use unidirectional self-attention.
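To see what "unidirectional" means in code, here is a minimal sketch that again uses random stand-in scores rather than a trained model: a causal mask removes every token that comes after the current one before the softmax is applied.

    # Unidirectional (causal) self-attention: mask out future positions.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 11                                     # eleven tokens, as in the example sentence
    scores = rng.normal(size=(n, n))           # stand-in attention scores (not learned)

    causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal = future tokens
    masked_scores = np.where(causal_mask, -np.inf, scores)    # future tokens get zero attention weight

    weights = np.exp(masked_scores)
    weights /= weights.sum(axis=-1, keepdims=True)            # softmax, row by row
    # Row i now spreads all of its attention over tokens 0..i only, which is what a
    # decoder needs when generating text token by token. Omitting the mask gives the
    # bidirectional behavior an encoder uses.
    print(np.round(weights[4], 2))             # token 4 attends only to tokens 0-4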

What is multi-head self-attention?

Each self-attention layer typically comprises multiple self-attention heads. The output of a layer is a mathematical combination (for example, a weighted average or dot product) of the outputs of the different heads.

Since each self-attention head is initialized to random values, different heads can learn different relationships between each word being attended to and the nearby words. For example, the self-attention layer described in the previous section focused on determining which noun the pronoun it referred to. However, other self-attention heads might learn the grammatical relevance of each word to every other word, or learn other interactions.
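One common way to combine the heads, used by the original Transformer, is to concatenate their outputs and apply a learned projection. The toy NumPy sketch below illustrates that approach; the sizes and weights are random and illustrative, not learned.

    # Toy multi-head self-attention: each head runs its own attention over its own
    # projections, then the head outputs are concatenated and linearly projected.
    import numpy as np

    rng = np.random.default_rng(0)
    n_tokens, d_model, n_heads = 11, 16, 4
    d_head = d_model // n_heads
    X = rng.normal(size=(n_tokens, d_model))   # one embedding vector per token

    def softmax(s):
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    head_outputs = []
    for _ in range(n_heads):
        # Each head gets its own projection matrices, so it can specialize in its own
        # kind of relationship (e.g., coreference for one head, grammar for another).
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        head_outputs.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)

    W_o = rng.normal(size=(d_model, d_model))                     # output projection
    layer_output = np.concatenate(head_outputs, axis=-1) @ W_o    # shape: (11, 16)
    print(layer_output.shape)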

Why are Transformers so large?

Transformers contain hundreds of billions or even trillions of parameters. This course has generally recommended building models with a smaller number of parameters over those with a larger number. After all, a model with fewer parameters uses fewer resources to make predictions than a model with more parameters. However, research shows that Transformers with more parameters consistently outperform Transformers with fewer parameters.

But how does an LLM generate text?

You've seen how researchers train LLMs to predict a missing word or two, and you might be unimpressed. After all, predicting a word or two is essentially the autocomplete feature built into various text, email, and authoring software. You might be wondering how LLMs can generate sentences or paragraphs or haikus about arbitrage.

In fact, LLMs are essentially autocomplete mechanisms that can automatically predict (complete) thousands of tokens. For example, consider a sentence followed by a masked sentence:

My dog, Max, knows how to perform many traditional dog tricks.
___ (masked sentence)

An LLM can generate probabilities for the masked sentence, including:

Probability   Word(s)
3.1%          For example, he can sit, stay, and roll over.
2.9%          For example, he knows how to sit, stay, and roll over.

A sufficiently large LLM can generate probabilities for paragraphs and entire essays. You can think of a user's questions to an LLM as the "given" sentence followed by an imaginary mask. For example:

User's question: What is the easiest trick to teach a dog?
LLM's response:  ___

The LLM generates probabilities for various possible responses.
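The loop below is a minimal sketch of that autocomplete process. The next-token distribution comes from a made-up stand-in function (toy_next_token_probs) rather than a real LLM, and the tiny vocabulary is invented for illustration; the point is only the repeated predict-sample-append structure.

    # Toy autoregressive generation: repeatedly predict a next-token distribution,
    # sample a token, append it to the prompt, and continue.
    import random

    VOCAB = ["he", "can", "sit", "stay", "and", "roll", "over", ",", ".", "<end>"]

    def toy_next_token_probs(prompt_tokens):
        # A real LLM would run a Transformer over `prompt_tokens` here;
        # this stand-in just returns a uniform distribution over a tiny vocabulary.
        return {token: 1.0 / len(VOCAB) for token in VOCAB}

    def generate(prompt, max_tokens=20):
        tokens = prompt.split()
        for _ in range(max_tokens):
            probs = toy_next_token_probs(tokens)
            next_token = random.choices(list(probs), weights=list(probs.values()))[0]
            if next_token == "<end>":
                break
            tokens.append(next_token)      # the prediction becomes part of the next prompt
        return " ".join(tokens)

    print(generate("My dog, Max, knows how to perform many traditional dog tricks."))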

As another example, an LLM trained on a massive number of mathematical "word problems" can give the appearance of doing sophisticated mathematical reasoning. However, those LLMs are basically just autocompleting a word problem prompt.

Benefits of LLMs

LLMs can generate clear, easy-to-understand text for a wide variety of target audiences. LLMs can make predictions on the tasks they are explicitly trained for. Some researchers claim that LLMs can also make predictions for input they were not explicitly trained on, but other researchers have refuted this claim.

Problems with LLMs

Training an LLM entails many problems, including:

  • Gathering an enormous training set.
  • Consuming enormous computational resources and electricity over multiple months of training.
  • Solving parallelism challenges.

Using LLMs to infer predictions causes the following problems:

  • LLMs hallucinate, meaning their predictions often contain mistakes.
  • LLMs consume enormous amounts of computational resources and electricity. Training LLMs on larger datasets typically reduces the amount of resources required for inference, though the larger training sets incur more training resources.
  • Like all ML models, LLMs can exhibit all sorts of bias.

Exercise: Check your understanding

Suppose a Transformer is trained on a billion documents, including thousands of documents containing at least one instance of the word elephant. Which of the following statements are probably true?

  • Acacia trees, an important part of an elephant's diet, will gradually gain a high self-attention score with the word elephant.
    True. This will enable the Transformer to answer questions about an elephant's diet.
  • The Transformer will associate the word elephant with various idioms that contain the word elephant.
    True. The system will begin to attach high self-attention scores between the word elephant and other words in elephant idioms.
  • The Transformer will gradually learn to ignore any sarcastic or ironic uses of the word elephant in training data.
    False. Sufficiently large Transformers trained on a sufficiently broad training set become quite adept at recognizing sarcasm, humor, and irony. So, rather than ignoring sarcasm and irony, the Transformer learns from it.