This page contains Generative AI glossary terms.
A
automatic evaluation
Using software to judge the quality of a model's output.
When model output is relatively straightforward, a script or program can compare the model's output to a golden response. This type of automatic evaluation is sometimes called programmatic evaluation. Metrics such as ROUGE or BLEU are often useful for programmatic evaluation.
When model output is complex or has no one right answer, a separate ML program called an autorater sometimes performs the automatic evaluation.
Contrast with human evaluation.
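For example, here is a minimal sketch of programmatic evaluation: an exact-match check against a golden response, plus a crude unigram-overlap score in the spirit of ROUGE-1 (real ROUGE implementations also account for token counts and longer n-grams):

```python
def tokenize(text: str) -> list[str]:
    return [token.strip(".,!?") for token in text.lower().split()]

def unigram_f1(generated: str, golden: str) -> float:
    """Crude unigram-overlap F1; real ROUGE also respects token counts."""
    gen, gold = set(tokenize(generated)), set(tokenize(golden))
    overlap = len(gen & gold)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(gen), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

golden = "The capital of France is Paris."
generated = "Paris is the capital of France."
print(generated == golden)            # exact match fails...
print(unigram_f1(generated, golden))  # ...but unigram overlap is perfect: 1.0
```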
autorater evaluation
A hybrid mechanism for judging the quality of a generative AI model's output that combines human evaluation with automatic evaluation. An autorater is an ML model trained on data created by human evaluation. Ideally, an autorater learns to mimic a human evaluator.

Prebuilt autoraters are available, but the best autoraters are fine-tuned specifically to the task you are evaluating.
auto-regressive model
A model that infers a prediction based on its own previous predictions. For example, auto-regressive language models predict the next token based on the previously predicted tokens. All Transformer-based large language models are auto-regressive.
In contrast, GAN-based image models are usually not auto-regressive since they generate an image in a single forward pass rather than iteratively in steps. However, certain image generation models are auto-regressive because they generate an image in steps.
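For example, here is a minimal sketch of auto-regressive decoding, assuming a hypothetical `next_token_distribution` function that maps the tokens generated so far to a probability for each candidate next token:

```python
import random

def generate(prompt_tokens, next_token_distribution, max_new_tokens=50, eos="<eos>"):
    """Auto-regressive decoding: each new token is conditioned on all previous tokens."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(tokens)              # dict: token -> probability
        choices, weights = zip(*probs.items())
        token = random.choices(choices, weights=weights)[0]  # sample the next token
        if token == eos:
            break
        tokens.append(token)                                 # feed the prediction back in
    return tokens
```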
C
chain-of-thought prompting
A prompt engineering technique that encourages a large language model (LLM) to explain its reasoning, step by step. For example, consider the following prompt, paying particular attention to the second sentence:
How many g forces would a driver experience in a car that goes from 0 to 60 miles per hour in 7 seconds? In the answer, show all relevant calculations.
The LLM's response would likely:
- Show a sequence of physics formulas, plugging in the values 0, 60, and 7 in appropriate places.
- Explain why it chose those formulas and what the various variables mean.
Chain-of-thought prompting forces the LLM to perform all the calculations, which might lead to a more correct answer. In addition, chain-of-thought prompting enables the user to examine the LLM's steps to determine whether or not the answer makes sense.
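For reference, the physics a correct response should reproduce is simple (assuming constant acceleration):

```python
mph_to_mps = 0.44704       # 1 mile per hour in meters per second
v = 60 * mph_to_mps        # final speed: about 26.82 m/s
a = v / 7                  # constant acceleration over 7 seconds: about 3.83 m/s^2
g_force = a / 9.81         # divide by standard gravity
print(round(g_force, 2))   # about 0.39 g
```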
chat
The contents of a back-and-forth dialogue with an ML system, typically a large language model. The previous interaction in a chat (what you typed and how the large language model responded) becomes the context for subsequent parts of the chat.
A chatbot is an application of a large language model.
contextualized language embedding
An embedding that comes close to "understanding" words and phrases in ways that native human speakers can. Contextualized language embeddings can understand complex syntax, semantics, and context.
For example, consider embeddings of the English word cow. Older embeddings such as word2vec can represent English words such that the distance in the embedding space from cow to bull is similar to the distance from ewe (female sheep) to ram (male sheep) or from female to male. Contextualized language embeddings can go a step further by recognizing that English speakers sometimes casually use the word cow to mean either cow or bull.
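For illustration, here is the kind of offset arithmetic a static embedding such as word2vec supports; the four 3-dimensional vectors are invented purely so the distances work out:

```python
import numpy as np

# Toy 3-dimensional "embeddings"; real word2vec vectors have hundreds of dimensions.
embed = {
    "cow":  np.array([1.0, 2.0, 0.5]),
    "bull": np.array([1.0, 2.0, 1.5]),
    "ewe":  np.array([3.0, 0.5, 0.5]),
    "ram":  np.array([3.0, 0.5, 1.5]),
}

# In a good static embedding, the female->male offset is roughly constant:
print(embed["bull"] - embed["cow"])  # [0. 0. 1.]
print(embed["ram"] - embed["ewe"])   # [0. 0. 1.]
```

A static embedding assigns cow a single fixed vector regardless of the surrounding sentence; a contextualized embedding produces a different vector for each occurrence, which is what lets it separate the casual and literal senses of the word.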
context window
The number of tokens a model can process in a given prompt. The larger the context window, the more information the model can use to provide coherent and consistent responses to the prompt.
D
direct prompting
Synonym for zero-shot prompting.
distillation
The process of reducing the size of one model (known as the teacher) into a smaller model (known as the student) that emulates the original model's predictions as faithfully as possible. Distillation is useful because the smaller model has two key benefits over the larger model (the teacher):
- Faster inference time
- Reduced memory and energy usage
However, the student's predictions are typically not as good as the teacher's predictions.
Distillation trains the student model to minimize a loss function based on the difference between the predictions of the student and teacher models.
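For example, here is a minimal numpy sketch of one common distillation loss, the cross-entropy between temperature-softened teacher and student distributions; the logits and the temperature of 2.0 are illustrative:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))

teacher = np.array([4.0, 1.0, 0.5])
student = np.array([3.0, 1.5, 0.2])
print(distillation_loss(student, teacher))  # training drives this value down
```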
Compare and contrast distillation with the following terms:
- fine-tuning
- prompt engineering
See LLMs: Fine-tuning, distillation, and prompt engineering in Machine Learning Crash Course for more information.
E
evals
Primarily used as an abbreviation for LLM evaluations. More broadly, evals is an abbreviation for any form of evaluation.
evaluation
The process of measuring a model's quality or comparing different models against each other.
To evaluate a supervised machine learning model, you typically judge it against a validation set and a test set. Evaluating an LLM typically involves broader quality and safety assessments.
F
factuality
Within the ML world, a property describing a model whose output is based on reality. Factuality is a concept rather than a metric. For example, suppose you send the following prompt to a large language model:
What is the chemical formula for table salt?
A model optimizing factuality would respond:
NaCl
It is tempting to assume that all models should be based on factuality. However, some prompts, such as the following, should cause a generative AI model to optimize creativity rather than factuality.
Tell me a limerick about an astronaut and a caterpillar.
It is unlikely that the resulting limerick would be based on reality.
Contrast with groundedness.
few-shot prompting
A prompt that contains more than one (a "few") example demonstrating how the large language model should respond. For example, the following lengthy prompt contains two examples showing a large language model how to answer a query.
Parts of one prompt | Notes |
---|---|
What is the official currency of the specified country? | The question you want the LLM to answer. |
France: EUR | One example. |
United Kingdom: GBP | Another example. |
India: | The actual query. |
Few-shot prompting generally produces more desirable results than zero-shot prompting and one-shot prompting. However, few-shot prompting requires a lengthier prompt.
Few-shot prompting is a form of few-shot learning applied to prompt-based learning.
See Prompt engineering in Machine Learning Crash Course for more information.
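For illustration, the prompt in the table above could be assembled programmatically; `call_llm` is a hypothetical placeholder for whatever LLM client you use:

```python
examples = [("France", "EUR"), ("United Kingdom", "GBP")]
query = "India"

# Build a few-shot prompt: instruction, then examples, then the actual query.
prompt = "What is the official currency of the specified country?\n"
prompt += "\n".join(f"{country}: {currency}" for country, currency in examples)
prompt += f"\n{query}:"

# response = call_llm(prompt)  # hypothetical LLM client call
print(prompt)
```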
fine-tuning
A second, task-specific training pass performed on a pre-trained model to refine its parameters for a specific use case. For example, the full training sequence for some large language models is as follows:
- Pre-training: Train a large language model on a vast general dataset, such as all the English language Wikipedia pages.
- Fine-tuning: Train the pre-trained model to perform a specific task, such as responding to medical queries. Fine-tuning typically involves hundreds or thousands of examples focused on the specific task.
As another example, the full training sequence for a large image model is as follows:
- Pre-training: Train a large image model on a vast general image dataset, such as all the images in Wikimedia Commons.
- Fine-tuning: Train the pre-trained model to perform a specific task, such as generating images of orcas.
Fine-tuning can entail any combination of the following strategies:
- Modifying all of the pre-trained model's existing parameters. This is sometimes called full fine-tuning.
- Modifying only some of the pre-trained model's existing parameters (typically, the layers closest to the output layer), while keeping other existing parameters unchanged (typically, the layers closest to the input layer). See parameter-efficient tuning.
- Adding more layers, typically on top of the existing layers closest to the output layer.
Fine-tuning is a form of transfer learning. As such, fine-tuning might use a different loss function or a different model type than those used to train the pre-trained model. For example, you could fine-tune a pre-trained large image model to produce a regression model that returns the number of birds in an input image.
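As a minimal PyTorch-style sketch of the second strategy above (modify only the layers closest to the output, keeping the rest frozen); the model and layer sizes here are made up:

```python
import torch.nn as nn

model = nn.Sequential(              # stand-in for a pre-trained model
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),             # layer closest to the output
)

for param in model.parameters():
    param.requires_grad = False     # freeze the whole pre-trained model...

for param in model[-1].parameters():
    param.requires_grad = True      # ...then unfreeze only the final layer for fine-tuning
```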
Compare and contrast fine-tuning with the following terms:
- distillation
- prompt engineering
See Fine-tuning in Machine Learning Crash Course for more information.
fraction of successes
A metric for evaluating an ML model's generated text. The fraction of successes is the number of "successful" generated text outputs divided by the total number of generated text outputs. For example, if a large language model generated 10 blocks of code, five of which were successful, then the fraction of successes would be 50%.
Although fraction of successes is broadly useful throughout statistics, within ML, this metric is primarily useful for measuring verifiable tasks like code generation or math problems.
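As a sketch, the metric is just a ratio; `is_success` stands in for whatever verifier the task provides (a test suite, an exact-match check, and so on):

```python
def fraction_of_successes(outputs, is_success):
    """Number of successful outputs divided by total outputs."""
    successes = sum(1 for output in outputs if is_success(output))
    return successes / len(outputs)

# 5 successful outputs out of 10 -> 0.5 (50%)
print(fraction_of_successes(range(10), lambda i: i < 5))
```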
G
Gemini
The ecosystem comprising Google's most advanced AI. Elements of this ecosystem include:
- Various Gemini models.
- The interactive conversational interface to a Gemini model. Users type prompts and Gemini responds to those prompts.
- Various Gemini APIs.
- Various business products based on Gemini models; for example, Gemini for Google Cloud.
Gemini models
Google's state-of-the-art Transformer-based multimodal models. Gemini models are specifically designed to integrate with agents.
Users can interact with Gemini models in a variety of ways, including through an interactive dialog interface and through SDKs.
generated text
In general, the text that an ML model outputs. When evaluating large language models, some metrics compare generated text against reference text. For example, suppose you are trying to determine how effectively an ML model translates from French to Dutch. In this case:
- The generated text is the Dutch translation that the ML model outputs.
- The reference text is the Dutch translation that a human translator (or software) creates.
Note that some evaluation strategies don't involve reference text.
generative AI
An emerging transformative field with no formal definition. That said, most experts agree that generative AI models can create ("generate") content that is all of the following:
- complex
- coherent
- original
For example, a generative AI model can create sophisticated essays or images.
Some earlier technologies, including LSTMs and RNNs, can also generate original and coherent content. Some experts view these earlier technologies as generative AI, while others feel that true generative AI requires more complex output than those earlier technologies can produce.
Contrast with predictive ML.
golden response
An answer known to be good. For example, given the following prompt:
2 + 2
The golden response is hopefully:
4
H
human evaluation
A process in which people judge the quality of an ML model's output; for example, having bilingual people judge the quality of an ML translation model. Human evaluation is particularly useful for judging models that have no one right answer.
Contrast with automatic evaluation and autorater evaluation.
human in the loop (HITL)
A loosely defined idiom that could mean either of the following:
- A policy of viewing Generative AI output critically or skeptically. For example, the humans who write this ML Glossary are amazed by what large language models can do but are mindful of the mistakes large language models make.
- A strategy or system for ensuring that people help shape, evaluate, and refine a model's behavior. Keeping a human in the loop enables an AI to benefit from both machine intelligence and human intelligence. For example, a system in which an AI generates code which software engineers then review is a human-in-the-loop system.
I
in-context learning
Synonym for few-shot prompting.
instruction tuning
A form of fine-tuning that improves a generative AI model's ability to follow instructions. Instruction tuning involves training a model on a series of instruction prompts, typically covering a wide variety of tasks. The resulting instruction-tuned model then tends to generate useful responses to zero-shot prompts across a variety of tasks.
Compare and contrast with:
L
LLM
Abbreviation for large language model.
LLM evaluations (evals)
A set of metrics and benchmarks for assessing the performance of large language models (LLMs). At a high level, LLM evaluations:
- Help researchers identify areas where LLMs need improvement.
- Are useful in comparing different LLMs and identifying the best LLM for a particular task.
- Help ensure that LLMs are safe and ethical to use.
LoRA
Abbreviation for Low-Rank Adaptability.
Low-Rank Adaptability (LoRA)
A parameter-efficient technique for fine-tuning that "freezes" the model's pre-trained weights (such that they can no longer be modified) and then inserts a small set of trainable weights into the model. This set of trainable weights (also known as "update matrices") is considerably smaller than the base model and is therefore much faster to train.
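A numpy sketch of the core idea (dimensions are illustrative): the frozen weight matrix W is augmented with a low-rank product B @ A, and only A and B are trained:

```python
import numpy as np

d_out, d_in, rank = 64, 64, 4           # rank << d_in, so A and B are tiny
W = np.random.randn(d_out, d_in)        # pre-trained weights: frozen
A = np.random.randn(rank, d_in) * 0.01  # trainable update matrices
B = np.zeros((d_out, rank))             # zero-initialized so training starts from W

def lora_forward(x):
    return W @ x + B @ (A @ x)          # base output plus low-rank correction

print(W.size, A.size + B.size)          # 4096 frozen weights vs. 512 trainable
```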
LoRA provides the following benefits:
- Improves the quality of a model's predictions for the domain where the fine-tuning is applied.
- Fine-tunes faster than techniques that require fine-tuning all of a model's parameters.
- Reduces the computational cost of inference by enabling concurrent serving of multiple specialized models sharing the same base model.
M
machine translation
Using software (typically, a machine learning model) to convert text from one human language to another human language, for example, from English to Japanese.
mean average precision at k (mAP@k)
The statistical mean of all average precision at k scores across a validation dataset. One use of mean average precision at k is to judge the quality of recommendations generated by a recommendation system.
Although the phrase "mean average" sounds redundant, the name of the metric is appropriate. After all, this metric finds the mean of multiple average precision at k values.
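Definitions of average precision at k vary slightly across libraries; the following sketch uses one common formulation:

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@k for one query: mean precision@i over ranks i <= k that hit a relevant item."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i          # precision at this cut-off
    return score / min(len(relevant), k) if relevant else 0.0

def mean_average_precision_at_k(queries, k):
    """mAP@k: the mean of AP@k over all (recommended, relevant) pairs."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in queries) / len(queries)

queries = [(["a", "b", "c"], {"a", "c"}), (["x", "y", "z"], {"y"})]
print(mean_average_precision_at_k(queries, k=3))  # (0.8333 + 0.5) / 2
```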
mixture of experts
A scheme to increase neural network efficiency by using only a subset of its parameters (known as an expert) to process a given input token or example. A gating network routes each input token or example to the proper expert(s).
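Here is a toy numpy sketch of top-k gating; the dimensions are made up, and a real gating network is trained jointly with the experts:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

num_experts, top_k, d = 8, 2, 16
gate_W = np.random.randn(num_experts, d)       # gating network weights (illustrative)
experts = [np.random.randn(d, d) for _ in range(num_experts)]

def moe_layer(x):
    gate_scores = softmax(gate_W @ x)          # router's affinity for each expert
    chosen = np.argsort(gate_scores)[-top_k:]  # route to the top-k experts only
    # Only the chosen experts run, so most parameters stay idle for this input.
    return sum(gate_scores[i] * (experts[i] @ x) for i in chosen)

print(moe_layer(np.random.randn(d)).shape)     # (16,)
```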
For details, see either of the following papers:
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- Mixture-of-Experts with Expert Choice Routing
MMIT
Abbreviation for multimodal instruction-tuned.
model cascading
A system that picks the ideal model for a specific inference query.
Imagine a group of models, ranging from very large (lots of parameters) to much smaller (far fewer parameters). Very large models consume more computational resources at inference time than smaller models. However, very large models can typically infer more complex requests than smaller models. Model cascading determines the complexity of the inference query and then picks the appropriate model to perform the inference. The main motivation for model cascading is to reduce inference costs by generally selecting smaller models, and only selecting a larger model for more complex queries.
Imagine that a small model runs on a phone and a larger version of that model runs on a remote server. Good model cascading reduces cost and latency by enabling the smaller model to handle simple requests and only calling the remote model to handle complex requests.
See also model router.
model router
The algorithm that determines the ideal model for inference in model cascading. A model router is itself typically a machine learning model that gradually learns how to pick the best model for a given input. However, a model router could sometimes be a simpler, non-machine learning algorithm.
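As a sketch of the simpler, non-machine-learning case, a router might use an easily computed proxy for query complexity; the threshold and model names below are placeholders:

```python
def route(query: str, complexity_threshold: int = 50) -> str:
    """Toy model router: long queries go to the large model, short ones stay on-device."""
    if len(query.split()) > complexity_threshold:
        return "large-remote-model"    # placeholder model names
    return "small-on-device-model"

print(route("What time is it?"))  # small-on-device-model
```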
MOE
Abbreviation for mixture of experts.
MT
Abbreviation for machine translation.
N
no one right answer (NORA)
A prompt having multiple appropriate responses. For example, the following prompt has no one right answer:
Tell me a joke about elephants.
Evaluating no-one-right-answer prompts can be challenging.
NORA
Abbreviation for no one right answer.
O
one-shot prompting
A prompt that contains one example demonstrating how the large language model should respond. For example, the following prompt contains one example showing a large language model how it should answer a query.
Parts of one prompt | Notes |
---|---|
What is the official currency of the specified country? | The question you want the LLM to answer. |
France: EUR | One example. |
India: | The actual query. |
Compare and contrast one-shot prompting with the following terms:
- zero-shot prompting
- few-shot prompting
P
parameter-efficient tuning
A set of techniques to fine-tune a large pre-trained language model (PLM) more efficiently than full fine-tuning. Parameter-efficient tuning typically fine-tunes far fewer parameters than full fine-tuning, yet generally produces a large language model that performs as well (or almost as well) as a large language model built from full fine-tuning.
Compare and contrast parameter-efficient tuning with:
Parameter-efficient tuning is also known as parameter-efficient fine-tuning.
PLM
Abbreviation for pre-trained language model.
post-trained model
A loosely defined term that typically refers to a pre-trained model that has gone through some post-processing, such as one or more of the following:
pre-trained model
Typically, a model that has already been trained. The term could also mean a previously trained embedding vector.
The term pre-trained language model usually refers to an already trained large language model.
pre-training
The initial training of a model on a large dataset. Some pre-trained models are clumsy giants and must typically be refined through additional training. For example, ML experts might pre-train a large language model on a vast text dataset, such as all the English pages in Wikipedia. Following pre-training, the resulting model might be further refined through any of the following techniques:
prompt
Any text entered as input to a large language model to condition the model to behave in a certain way. Prompts can be as short as a phrase or arbitrarily long (for example, the entire text of a novel). Prompts fall into multiple categories, including those shown in the following table:
Prompt category | Example | Notes |
---|---|---|
Question | How fast can a pigeon fly? | |
Instruction | Write a funny poem about arbitrage. | A prompt that asks the large language model to do something. |
Example | Translate Markdown code to HTML. For example:<br>Markdown: `* list item`<br>HTML: `<ul> <li>list item</li> </ul>` | The first sentence in this example prompt is an instruction. The remainder of the prompt is the example. |
Role | Explain why gradient descent is used in machine learning training to a PhD in Physics. | The first part of the sentence is an instruction; the phrase "to a PhD in Physics" is the role portion. |
Partial input for the model to complete | The Prime Minister of the United Kingdom lives at | A partial input prompt can either end abruptly (as this example does) or end with an underscore. |
A generative AI model can respond to a prompt with text, code, images, embeddings, videos…almost anything.
prompt-based learning
A capability of certain models that enables them to adapt their behavior in response to arbitrary text input (prompts). In a typical prompt-based learning paradigm, a large language model responds to a prompt by generating text. For example, suppose a user enters the following prompt:
Summarize Newton's Third Law of Motion.
A model capable of prompt-based learning isn't specifically trained to answer the previous prompt. Rather, the model "knows" a lot of facts about physics, a lot about general language rules, and a lot about what constitutes generally useful answers. That knowledge is sufficient to provide a (hopefully) useful answer. Additional human feedback ("That answer was too complicated." or "What's a reaction?") enables some prompt-based learning systems to gradually improve the usefulness of their answers.
prompt design
Synonym for prompt engineering.
prompt engineering
The art of creating prompts that elicit the desired responses from a large language model. Humans perform prompt engineering. Writing well-structured prompts is an essential part of ensuring useful responses from a large language model. Prompt engineering depends on many factors, including:
- The dataset used to pre-train and possibly fine-tune the large language model.
- The temperature and other decoding parameters that the model uses to generate responses.
See Introduction to prompt design for more details on writing helpful prompts.
Prompt design is a synonym for prompt engineering.
prompt tuning
A parameter-efficient tuning mechanism that learns a "prefix" that the system prepends to the actual prompt.
One variation of prompt tuning—sometimes called prefix tuning—is to prepend the prefix at every layer. In contrast, most prompt tuning only adds a prefix to the input layer.
R
reference text
An expert's response to a prompt. For example, given the following prompt:
Translate the question "What is your name?" from English to French.
An expert's response might be:
Comment vous appelez-vous?
Various metrics (such as ROUGE) measure the degree to which the reference text matches an ML model's generated text.
Reinforcement Learning from Human Feedback (RLHF)
Using feedback from human raters to improve the quality of a model's responses. For example, an RLHF mechanism can ask users to rate the quality of a model's response with a 👍 or 👎 emoji. The system can then adjust its future responses based on that feedback.
role prompting
An optional part of a prompt that identifies a target audience for a generative AI model's response. Without a role prompt, a large language model provides an answer that may or may not be useful for the person asking the questions. With a role prompt, a large language model can answer in a way that's more appropriate and more helpful for a specific target audience. For example, the role prompt portions of the following prompts are in boldface:
- Summarize this article **for a PhD in economics**.
- Describe how tides work **for a ten-year-old**.
- Explain the 2008 financial crisis. **Speak as you might to a young child, or a golden retriever.**
S
soft prompt tuning
A technique for tuning a large language model for a particular task, without resource-intensive fine-tuning. Instead of retraining all the weights in the model, soft prompt tuning automatically adjusts a prompt to achieve the same goal.
Given a textual prompt, soft prompt tuning typically appends additional token embeddings to the prompt and uses backpropagation to optimize the input.
A "hard" prompt contains actual tokens instead of token embeddings.
T
temperature
A hyperparameter that controls the degree of randomness of a model's output. Higher temperatures result in more random output, while lower temperatures result in less random output.
Choosing the best temperature depends on the specific application and the preferred properties of the model's output. For example, you would probably raise the temperature when creating an application that generates creative output. Conversely, you would probably lower the temperature when building a model that classifies images or text in order to improve the model's accuracy and consistency.
Temperature is often used with softmax.
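For example, here is the standard softmax-with-temperature formulation, shown with toy logits:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    z = np.asarray(logits) / temperature   # temperature divides the logits
    e = np.exp(z - z.max())                # subtract max for numerical stability
    return e / e.sum()

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 0.5))  # peaked: low temperature, less random
print(softmax_with_temperature(logits, 2.0))  # flatter: high temperature, more random
```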
Z
zero-shot prompting
A prompt that does not provide an example of how you want the large language model to respond. For example:
Parts of one prompt | Notes |
---|---|
What is the official currency of the specified country? | The question you want the LLM to answer. |
India: | The actual query. |
The large language model might respond with any of the following:
- Rupee
- INR
- ₹
- Indian rupee
- The rupee
- The Indian rupee
All of the answers are correct, though you might prefer a particular format.
Compare and contrast zero-shot prompting with the following terms:
- one-shot prompting
- few-shot prompting