Large language models

What is a language model?

A language model estimates the probability of a token or sequence of tokens occurring within a longer sequence of tokens. A token could be a word, a subword (a subset of a word), or even a single character.
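To make the idea of a token concrete, here is a small sketch showing the same sentence split into word-level tokens and character-level tokens. (Real tokenizers typically produce subword tokens using learned vocabularies; simple splitting is used here only for illustration.)

```python
sentence = "When I hear rain on my roof"

# Word-level tokens: split on whitespace.
word_tokens = sentence.split()

# Character-level tokens: every character (including spaces) is a token.
char_tokens = list(sentence)

print(word_tokens)       # ['When', 'I', 'hear', 'rain', 'on', 'my', 'roof']
print(len(char_tokens))  # 27
```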

Consider the following sentence and the token(s) that might complete it:

When I hear rain on my roof, I _______ in my kitchen.


A language model determines the probabilities of different tokens or sequences of tokens to complete that blank. For example, the following probability table identifies some possible tokens and their probabilities:

Probability   Token(s)
9.4%          cook soup
5.2%          warm up a kettle
3.6%          cower
2.5%          nap
2.2%          relax

In some situations, the sequence of tokens could be an entire sentence, paragraph, or even an entire essay.

An application can use the probability table to make predictions. The prediction might be the highest probability (for example, "cook soup") or a random selection from tokens having a probability greater than a certain threshold.
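The two prediction strategies just described can be sketched in a few lines. The probabilities below come from the example table above; the 0.03 threshold is an arbitrary value chosen for illustration.

```python
import random

# Probability table from the example above.
completions = {
    "cook soup": 0.094,
    "warm up a kettle": 0.052,
    "cower": 0.036,
    "nap": 0.025,
    "relax": 0.022,
}

# Strategy 1: greedy — always pick the highest-probability completion.
greedy = max(completions, key=completions.get)

# Strategy 2: randomly sample among completions whose probability
# exceeds a threshold (here 0.03, chosen arbitrarily).
threshold = 0.03
candidates = {t: p for t, p in completions.items() if p > threshold}
sampled = random.choices(list(candidates), weights=list(candidates.values()))[0]

print(greedy)   # cook soup
print(sampled)  # one of: cook soup, warm up a kettle, cower
```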

Estimating the probability of what fills in the blank in a text sequence can be extended to more complex tasks, including:

• Generating text.
• Translating text from one language to another.
• Summarizing documents.

By modeling the statistical patterns of tokens, modern language models develop extremely powerful internal representations of language and can generate plausible language.

N-gram language models

N-grams are ordered sequences of words used to build language models, where N is the number of words in the sequence. For example, when N is 2, the N-gram is called a 2-gram (or a bigram); when N is 5, the N-gram is called a 5-gram. Given the following phrase in a training document:

you are very nice


The resulting 2-grams are as follows:

• you are
• are very
• very nice

When N is 3, the N-gram is called a 3-gram (or a trigram). Given that same phrase, the resulting 3-grams are:

• you are very
• are very nice
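The 2-grams and 3-grams above can be extracted mechanically with a sliding window. A minimal sketch:

```python
def ngrams(tokens, n):
    """Return the ordered n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "you are very nice".split()
print(ngrams(tokens, 2))  # [('you', 'are'), ('are', 'very'), ('very', 'nice')]
print(ngrams(tokens, 3))  # [('you', 'are', 'very'), ('are', 'very', 'nice')]
```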

Given two words as input, a language model based on 3-grams can predict the likelihood of the third word. For example, given the following two words:

orange is


A language model examines all the different 3-grams derived from its training corpus that start with orange is to determine the most likely third word. Hundreds of 3-grams could start with the two words orange is, but you can focus solely on the following two possibilities:

orange is ripe
orange is cheerful


The first possibility (orange is ripe) is about orange the fruit, while the second possibility (orange is cheerful) is about the color orange.
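A 3-gram model's prediction step amounts to counting, across the training corpus, which third words follow a given two-word prefix. The tiny corpus below is invented purely for illustration:

```python
from collections import Counter, defaultdict

# Toy training corpus (invented for illustration).
corpus = (
    "the orange is ripe . the orange is ripe . "
    "orange is cheerful and bright ."
).split()

# Count third-word continuations for every two-word prefix.
continuations = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    continuations[(w1, w2)][w3] += 1

# Estimate the probability of each third word after "orange is".
counts = continuations[("orange", "is")]
total = sum(counts.values())
for word, count in counts.most_common():
    print(word, count / total)  # ripe 0.666..., cheerful 0.333...
```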

Context

Humans can retain relatively long contexts. While watching Act 3 of a play, you retain knowledge of characters introduced in Act 1. Similarly, the punchline of a long joke makes you laugh because you can remember the context from the joke's setup.

In language models, context is helpful information before or after the target token. Context can help a language model determine whether "orange" refers to a citrus fruit or a color.

Context can help a language model make better predictions, but does a 3-gram provide sufficient context? Unfortunately, the only context a 3-gram provides is its first two words. For example, the two words orange is don't provide enough context for the language model to reliably predict the third word. Due to this lack of context, language models based on 3-grams make a lot of mistakes.

Longer N-grams would certainly provide more context than shorter N-grams. However, as N grows, each particular sequence of N tokens becomes rarer in the training data. When N becomes very large, the language model typically sees only a single instance of each sequence of N tokens, which isn't very helpful in predicting the target token.
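This sparsity effect can be observed directly on a small text: as N grows, fewer and fewer N-grams occur more than once. The sentence below is invented for illustration.

```python
from collections import Counter

text = ("when i hear rain on my roof i cook soup in my kitchen "
        "when i hear rain i nap").split()

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# For each N, count how many distinct N-grams appear more than once.
results = {}
for n in (1, 2, 4, 8):
    counts = Counter(ngrams(text, n))
    repeated = sum(1 for c in counts.values() if c > 1)
    results[n] = repeated
    print(n, len(counts), repeated)
```

On this toy text, 5 unigrams repeat, 3 bigrams repeat, only 1 four-gram repeats, and no 8-gram appears twice, so an 8-gram model would have no repeated evidence to predict from.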

Recurrent neural networks

Recurrent neural networks provide more context than N-grams. A recurrent neural network is a type of neural network that trains on a sequence of tokens. For example, a recurrent neural network can gradually learn (and learn to ignore) selected context from each word in a sentence, kind of like you would when listening to someone speak. A large recurrent neural network can gain context from a passage of several sentences.

Although recurrent neural networks learn more context than N-grams, the amount of useful context recurrent neural networks can intuit is still relatively limited. Recurrent neural networks evaluate information "token by token." In contrast, large language models—the topic of the next section—can evaluate the whole context at once.
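The "token by token" evaluation can be sketched as a loop that folds each token into a running hidden state. Everything here is simplified for illustration: the weights are fixed made-up numbers rather than learned parameters, and each token is encoded as a single number instead of an embedding vector.

```python
import math

# One recurrent step: the new hidden state mixes the previous hidden
# state with the current token. (Weights are invented for illustration;
# a real RNN learns them from data.)
def rnn_step(hidden, token_value, w_h=0.5, w_x=0.8):
    return math.tanh(w_h * hidden + w_x * token_value)

# Each token is a single number here, standing in for an embedding.
tokens = [0.1, 0.7, -0.3, 0.4]

hidden = 0.0
for t in tokens:
    hidden = rnn_step(hidden, t)  # hidden carries context from earlier tokens
print(round(hidden, 4))
```

Because the hidden state is recomputed from the previous one at every step, information from early tokens must survive many squashing updates to influence later predictions, which is one intuition for why long-range context is hard for RNNs.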

Note that training recurrent neural networks for long contexts is constrained by the vanishing gradient problem.