LLMs: Fine-tuning, distillation, and prompt engineering

The previous unit described general-purpose LLMs, variously known as:

  • foundation LLMs
  • base LLMs
  • pre-trained LLMs

A foundation LLM is trained on enough natural language to "know" a remarkable amount about grammar, words, and idioms. A foundation language model can generate helpful sentences about topics it is trained on. Furthermore, a foundation LLM can perform certain tasks traditionally called "creative," like writing poetry. However, a foundation LLM's generative text output isn't a solution for other kinds of common ML problems, such as regression or classification. For these use cases, a foundation LLM can serve as a platform rather than a solution.

Transforming a foundation LLM into a solution that meets an application's needs requires a process called fine-tuning. A secondary process called distillation generates a smaller (fewer parameters) version of the fine-tuned model.

Fine-tuning

Research shows that the pattern-recognition abilities of foundation language models are so powerful that they sometimes require relatively little additional training to learn specific tasks. That additional training, called fine-tuning, helps the model make better predictions on a specific task and unlocks an LLM's practical side.

Fine-tuning trains the model on examples specific to the task your application will perform. Engineers can sometimes fine-tune a foundation LLM on just a few hundred or a few thousand training examples.

Despite the relatively tiny number of training examples, standard fine-tuning is often computationally expensive. That's because standard fine-tuning updates every weight and bias in the model on each backpropagation iteration. Fortunately, a smarter process called parameter-efficient tuning can fine-tune an LLM by adjusting only a subset of parameters on each backpropagation iteration.
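
To make this concrete, here is a minimal sketch of one parameter-efficient technique, low-rank adaptation (LoRA), in PyTorch. The layer size and rank are illustrative assumptions; the point is that the frozen base weights vastly outnumber the small trainable adapter.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Wraps a frozen linear layer with a small trainable low-rank update."""
        def __init__(self, base: nn.Linear, rank: int = 8):
            super().__init__()
            self.base = base
            # Freeze every original weight and bias.
            for p in self.base.parameters():
                p.requires_grad = False
            # Backpropagation updates only these two small matrices.
            self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
            self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))

        def forward(self, x):
            return self.base(x) + (x @ self.lora_a) @ self.lora_b

    layer = LoRALinear(nn.Linear(4096, 4096))
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in layer.parameters())
    print(f"Training {trainable:,} of {total:,} parameters")  # roughly 0.4%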

A fine-tuned model's predictions are usually better than the foundation LLM's predictions. However, a fine-tuned model contains the same number of parameters as the foundation LLM. So, if a foundation LLM contains ten billion parameters, then the fine-tuned version will also contain ten billion parameters.

Distillation

Most fine-tuned LLMs contain enormous numbers of parameters, so they require enormous computational and environmental resources to generate predictions. Note that large swaths of those parameters are typically irrelevant for a specific application.

Distillation creates a smaller version of an LLM. The distilled LLM generates predictions much faster and requires fewer computational and environmental resources than the full LLM. However, the distilled model's predictions are generally not quite as good as the original LLM's predictions. Recall that LLMs with more parameters almost always generate better predictions than LLMs with fewer parameters.
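
The following is a minimal sketch of one common distillation recipe in PyTorch: a smaller "student" model is trained to match the larger "teacher" model's softened output distribution. The batch size, vocabulary size, and temperature here are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        """KL divergence between softened teacher and student distributions."""
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
        return F.kl_div(log_soft_student, soft_teacher,
                        reduction="batchmean") * temperature ** 2

    # Toy tensors; in practice the logits come from the teacher and student LLMs.
    teacher_logits = torch.randn(32, 50_000)                      # frozen teacher
    student_logits = torch.randn(32, 50_000, requires_grad=True)  # trainable student
    distillation_loss(student_logits, teacher_logits).backward()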

Prompt engineering

Prompt engineering enables an LLM's end users to customize the model's output. That is, end users clarify how the LLM should respond to their prompt.

Humans learn well from examples. So do LLMs. Showing one example to an LLM is called one-shot prompting. For example, suppose you want a model to use the following format to output a fruit's category:

User inputs the name of a fruit: LLM outputs that fruit's category.

A one-shot prompt shows the LLM a single example of the preceding format and then asks the LLM to complete a query based on that example. For instance:

peach: drupe
apple: ______

A single example is sometimes sufficient. If it is, the LLM outputs a useful prediction. For instance:

apple: pome

In other situations, a single example is insufficient. That is, the user must show the LLM multiple examples. For instance, the following prompt contains two examples:

plum: drupe
pear: pome
lemon: ____

Providing multiple examples is called few-shot prompting. You can think of the first two lines of the preceding prompt as training examples.

Can an LLM provide useful predictions with no examples (zero-shot prompting)? Sometimes, but LLMs like context. Without context, the following zero-shot prompt might return information about the technology company rather than the fruit:

apple: _______
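
In application code, these prompting styles differ only in how many examples you prepend to the query. Here is a minimal sketch; the build_prompt helper and fruit examples are hypothetical, and the resulting string would be sent to whatever LLM API your application uses.

    from typing import Sequence

    def build_prompt(query: str, examples: Sequence[tuple[str, str]] = ()) -> str:
        """Prepends zero or more labeled examples to the query."""
        lines = [f"{fruit}: {category}" for fruit, category in examples]
        lines.append(f"{query}:")
        return "\n".join(lines)

    build_prompt("apple")                                         # zero-shot
    build_prompt("apple", [("peach", "drupe")])                   # one-shot
    build_prompt("lemon", [("plum", "drupe"), ("pear", "pome")])  # few-shot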

Offline inference

The number of parameters in an LLM is sometimes so large that online inference is too slow to be practical for real-world tasks like regression or classification. Consequently, many engineering teams rely on offline inference (also known as bulk inference or static inference) instead. In other words, rather than responding to queries at serving time, the trained model makes predictions in advance and then caches those predictions.

If the LLM only has to perform its task once a week or once a month, it doesn't matter that the task takes a long time to complete.
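
A minimal sketch of this pattern follows. The llm_predict function is a stand-in for a slow model call, and the JSON file stands in for whatever cache your serving stack uses (a database, a key-value store, and so on).

    import json

    def llm_predict(query: str) -> str:
        """Stand-in for an expensive LLM call (seconds or minutes per query)."""
        return f"prediction for {query!r}"

    def build_cache(queries: list[str], path: str = "predictions.json") -> None:
        # Slow: run the model over every known input, e.g. once a week.
        cache = {q: llm_predict(q) for q in queries}
        with open(path, "w") as f:
            json.dump(cache, f)

    def serve(query: str, path: str = "predictions.json") -> str | None:
        # Fast: serving time is a dictionary lookup, not a model call.
        with open(path) as f:
            return json.load(f).get(query)

    build_cache(["covid vaccine", "coronavirus jab"])
    print(serve("covid vaccine"))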

For example, Google Search used an LLM to perform offline inference in order to cache a list of over 800 synonyms for Covid vaccines in more than 50 languages. Google Search then used the cached list to identify queries about vaccines in live traffic.

Use LLMs responsibly

Like any form of machine learning, LLMs generally share the biases of:

  • The data they were trained on.
  • The data they were distilled from.

Use LLMs fairly and responsibly in accordance with the lessons presented earlier in this course.

Exercise: Check your understanding

Which of the following statements is true about LLMs?
  • A distilled LLM contains fewer parameters than the foundation language model it sprang from.
    Correct. Distillation reduces the number of parameters.
  • A fine-tuned LLM contains fewer parameters than the foundation language model it was trained on.
    Incorrect. A fine-tuned model contains the same number of parameters as the original foundation language model.
  • As users perform more prompt engineering, the number of parameters in an LLM grows.
    Incorrect. Prompt engineering doesn't add (or remove or alter) LLM parameters.