Machine Learning Glossary: Metrics

This page contains Metrics glossary terms.

A

accuracy

#fundamentals
#Metric

The number of correct classification predictions divided by the total number of predictions. That is:

$$\text{Accuracy} = \frac{\text{correct predictions}}{\text{correct predictions} + \text{incorrect predictions}}$$

For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:

$$\text{Accuracy} = \frac{40}{40 + 10} = 80\%$$

Binary classification provides specific names for the different categories of correct predictions and incorrect predictions. So, the accuracy formula for binary classification is as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where:

  • TP is the number of true positives (correct predictions of the positive class).
  • TN is the number of true negatives (correct predictions of the negative class).
  • FP is the number of false positives (incorrect predictions of the positive class).
  • FN is the number of false negatives (incorrect predictions of the negative class).

Compare and contrast accuracy with precision and recall.

Although a valuable metric for some situations, accuracy is highly misleading for others. Notably, accuracy is usually a poor metric for evaluating classification models that process class-imbalanced datasets.

For example, suppose snow falls only 25 days per century in a certain subtropical city. Since days without snow (the negative class) vastly outnumber days with snow (the positive class), the snow dataset for this city is class-imbalanced. Imagine a binary classification model that is supposed to predict either snow or no snow each day but simply predicts "no snow" every day. This model is highly accurate but has no predictive power. The following table summarizes the results for a century of predictions:

Category Number
TP 0
TN 36499
FP 0
FN 25

The accuracy of this model is therefore:

accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (0 + 36499) / (0 + 36499 + 0 + 25) = 0.9993 = 99.93%

Although 99.93% accuracy seems like a very impressive percentage, the model actually has no predictive power.

Precision and recall are usually more useful metrics than accuracy for evaluating models trained on class-imbalanced datasets.
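For example, here's the accuracy calculation as a minimal Python sketch (the function name is illustrative, not part of any particular library):

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# The "always predict no snow" model from the example above:
print(accuracy(tp=0, tn=36499, fp=0, fn=25))  # ~0.9993, despite no predictive power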


See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.

area under the PR curve

#Metric

See PR AUC (Area under the PR Curve).

area under the ROC curve

#Metric

See AUC (Area under the ROC curve).

AUC (Area under the ROC curve)

#fundamentals
#Metric

A number between 0.0 and 1.0 representing a binary classification model's ability to separate positive classes from negative classes. The closer the AUC is to 1.0, the better the model's ability to separate classes from each other.

For example, the following illustration shows a classifier model that separates positive classes (green ovals) from negative classes (purple rectangles) perfectly. This unrealistically perfect model has an AUC of 1.0:

[Figure: a number line with 8 positive examples on one side and 9 negative examples on the other side.]

Conversely, the following illustration shows the results for a classifier model that generated random results. This model has an AUC of 0.5:

[Figure: a number line with 6 positive and 6 negative examples in a perfectly alternating sequence: positive, negative, positive, negative, and so on.]

Yes, the preceding model has an AUC of 0.5, not 0.0.

Most models are somewhere between the two extremes. For instance, the following model separates positives from negatives somewhat, and therefore has an AUC somewhere between 0.5 and 1.0:

[Figure: a number line with 6 positive and 6 negative examples in the sequence: negative, negative, negative, negative, positive, negative, positive, positive, negative, positive, positive, positive.]

AUC ignores any value you set for the classification threshold. Instead, AUC considers all possible classification thresholds.

AUC represents the area under an ROC curve. For example, the ROC curve for a model that perfectly separates positives from negatives looks as follows:

[Figure: a Cartesian plot; x-axis is false positive rate, y-axis is true positive rate. The curve starts at (0,0), goes straight up to (0,1), and then straight right to (1,1).]

AUC is the area of the gray region in the preceding illustration. In this unusual case, the area is simply the length of the gray region (1.0) multiplied by the width of the gray region (1.0). So, the product of 1.0 and 1.0 yields an AUC of exactly 1.0, which is the highest possible AUC score.

Conversely, the ROC curve for a classifier that can't separate classes at all is as follows. The area of this gray region is 0.5.

[Figure: a Cartesian plot; x-axis is false positive rate, y-axis is true positive rate. The curve is a straight diagonal from (0,0) to (1,1).]

A more typical ROC curve looks approximately like the following:

[Figure: a Cartesian plot; x-axis is false positive rate, y-axis is true positive rate. The curve starts at (0,0) and follows an irregular arc to (1,1).]

It would be painstaking to calculate the area under this curve manually, which is why a program typically calculates most AUC values.


AUC is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.

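The following Python sketch illustrates that pairwise interpretation; the function name is illustrative, and production code would more likely call a library routine such as scikit-learn's roc_auc_score:

import itertools

def auc_from_pairs(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half."""
    pairs = list(itertools.product(pos_scores, neg_scores))
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

# A model that ranks every positive above every negative scores 1.0;
# random scores would hover around 0.5.
print(auc_from_pairs([0.9, 0.8, 0.7], [0.4, 0.2, 0.1]))  # 1.0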

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

average precision at k

#language
#Metric

A metric for summarizing a model's performance on a single prompt that generates ranked results, such as a numbered list of book recommendations. Average precision at k is, well, the average of the precision at k values for each relevant result. The formula for average precision at k is therefore:

$$\text{average precision at k} = \frac{1}{n} \sum_{i=1}^{n} (\text{precision at } k \text{ for the } i\text{th relevant item})$$

where:

  • n is the number of relevant items in the list.

Contrast with recall at k.

Suppose a large language model is given the following query:

List the 6 funniest movies of all time in order.

And the large language model returns the following list:

  1. The General
  2. Mean Girls
  3. Platoon
  4. Bridesmaids
  5. Citizen Kane
  6. This is Spinal Tap

Four of the movies in the returned list are very funny (that is, they are relevant), but two movies are dramas (not relevant). The following table details the results:

Position Movie Relevant? Precision at k
1 The General Yes 1.0
2 Mean Girls Yes 1.0
3 Platoon No not relevant
4 Bridesmaids Yes 0.75
5 Citizen Kane No not relevant
6 This is Spinal Tap Yes 0.67

The number of relevant results is 4. Therefore, you can calculate the average precision at 6 as follows:

$$\text{average precision at 6} = \frac{1}{4}(1.0 + 1.0 + 0.75 + 0.67) \approx 0.85$$
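A minimal Python sketch of this calculation, assuming a list of per-position relevance flags (the function name is illustrative):

def average_precision_at_k(relevant):
    """Average of precision@k taken at each relevant position.

    relevant[i] is True if the item at position i + 1 is relevant."""
    hits = 0
    precisions = []
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# The movie example: positions 1, 2, 4, and 6 are relevant.
print(average_precision_at_k([True, True, False, True, False, True]))  # ~0.85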

B

baseline

#Metric

A model used as a reference point for comparing how well another model (typically, a more complex one) is performing. For example, a logistic regression model might serve as a good baseline for a deep model.

For a particular problem, the baseline helps model developers quantify the minimal expected performance that a new model must achieve for the new model to be useful.

C

cost

#Metric

Synonym for loss.

counterfactual fairness

#fairness
#Metric

A fairness metric that checks whether a classifier produces the same result for one individual as it does for another individual who is identical to the first, except with respect to one or more sensitive attributes. Evaluating a classifier for counterfactual fairness is one method for surfacing potential sources of bias in a model.


cross-entropy

#Metric

A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the difference between two probability distributions. See also perplexity.

cumulative distribution function (CDF)

#Metric

A function that defines the frequency of samples less than or equal to a target value. For example, consider a normal distribution of continuous values. A CDF tells you that approximately 50% of samples should be less than or equal to the mean and that approximately 84% of samples should be less than or equal to one standard deviation above the mean.

D

demographic parity

#fairness
#Metric

A fairness metric that is satisfied if the results of a model's classification are not dependent on a given sensitive attribute.

For example, if both Lilliputians and Brobdingnagians apply to Glubbdubdrib University, demographic parity is achieved if the percentage of Lilliputians admitted is the same as the percentage of Brobdingnagians admitted, irrespective of whether one group is on average more qualified than the other.

Contrast with equalized odds and equality of opportunity, which permit classification results in aggregate to depend on sensitive attributes, but don't permit classification results for certain specified ground truth labels to depend on sensitive attributes. See "Attacking discrimination with smarter machine learning" for a visualization exploring the tradeoffs when optimizing for demographic parity.

See Fairness: demographic parity in Machine Learning Crash Course for more information.

E

earth mover's distance (EMD)

#Metric

A measure of the relative similarity of two distributions. The lower the earth mover's distance, the more similar the distributions.

edit distance

#language
#Metric

A measurement of how similar two text strings are to each other. In machine learning, edit distance is useful for the following reasons:

  • Edit distance is easy to compute.
  • Edit distance can compare two strings known to be similar to each other.
  • Edit distance can determine the degree to which different strings are similar to a given string.

There are several definitions of edit distance, each using different string operations. See Levenshtein distance for an example.

empirical cumulative distribution function (eCDF or EDF)

#Metric

A cumulative distribution function based on empirical measurements from a real dataset. The value of the function at any point along the x-axis is the fraction of observations in the dataset that are less than or equal to the specified value.

entropy

#df
#Metric

In information theory, a description of how unpredictable a probability distribution is. Alternatively, entropy is also defined as how much information each example contains. A distribution has the highest possible entropy when all values of a random variable are equally likely.

The entropy of a set with two possible values "0" and "1" (for example, the labels in a binary classification problem) has the following formula:

$$H = -p \log p - q \log q = -p \log p - (1-p) \log (1-p)$$

where:

  • H is the entropy.
  • p is the fraction of "1" examples.
  • q is the fraction of "0" examples. Note that q = (1 - p)
  • log is generally log2. In this case, the entropy unit is a bit.

For example, suppose the following:

  • 100 examples contain the value "1"
  • 300 examples contain the value "0"

Therefore, the entropy value is:

  • p = 0.25
  • q = 0.75
  • H = (-0.25)log2(0.25) - (0.75)log2(0.75) = 0.81 bits per example

A set that is perfectly balanced (for example, 200 "0"s and 200 "1"s) would have an entropy of 1.0 bit per example. As a set becomes more imbalanced, its entropy moves towards 0.0.
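Here's the two-value entropy formula as a small Python sketch (the function name is illustrative):

import math

def binary_entropy(p):
    """Entropy in bits of a two-value set, where p is the fraction of "1"s."""
    if p in (0.0, 1.0):
        return 0.0  # a pure set is perfectly predictable
    q = 1 - p
    return -p * math.log2(p) - q * math.log2(q)

print(binary_entropy(0.25))  # ~0.81 bits per example
print(binary_entropy(0.5))   # 1.0 bit per example (perfectly balanced)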

In decision trees, entropy helps formulate information gain to help the splitter select the conditions during the growth of a classification decision tree.

Compare entropy with gini impurity.

Entropy is often called Shannon's entropy.

See Exact splitter for binary classification with numerical features in the Decision Forests course for more information.

equality of opportunity

#fairness
#Metric

A fairness metric to assess whether a model is predicting the desirable outcome equally well for all values of a sensitive attribute. In other words, if the desirable outcome for a model is the positive class, the goal would be to have the true positive rate be the same for all groups.

Equality of opportunity is related to equalized odds, which requires that both the true positive rates and false positive rates are the same for all groups.

Suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians' secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians' secondary schools don't offer math classes at all, and as a result, far fewer of their students are qualified. Equality of opportunity is satisfied for the preferred label of "admitted" with respect to nationality (Lilliputian or Brobdingnagian) if qualified students are equally likely to be admitted irrespective of whether they're a Lilliputian or a Brobdingnagian.

For example, suppose 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:

Table 1. Lilliputian applicants (90% are qualified)

  Qualified Unqualified
Admitted 45 3
Rejected 45 7
Total 90 10
Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 7/10 = 70%
Total percentage of Lilliputian students admitted: (45+3)/100 = 48%

 

Table 2. Brobdingnagian applicants (10% are qualified):

  Qualified Unqualified
Admitted 5 9
Rejected 5 81
Total 10 90
Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 81/90 = 90%
Total percentage of Brobdingnagian students admitted: (5+9)/100 = 14%

The preceding examples satisfy equality of opportunity for acceptance of qualified students because qualified Lilliputians and Brobdingnagians both have a 50% chance of being admitted.

While equality of opportunity is satisfied, the following two fairness metrics are not satisfied:

  • demographic parity: Lilliputians and Brobdingnagians are admitted to the university at different rates; 48% of Lilliputians students are admitted, but only 14% of Brobdingnagian students are admitted.
  • equalized odds: While qualified Lilliputian and Brobdingnagian students both have the same chance of being admitted, the additional constraint that unqualified Lilliputians and Brobdingnagians both have the same chance of being rejected is not satisfied. Unqualified Lilliputians have a 70% rejection rate, whereas unqualified Brobdingnagians have a 90% rejection rate.

See Fairness: Equality of opportunity in Machine Learning Crash Course for more information.

equalized odds

#fairness
#Metric

A fairness metric to assess whether a model is predicting outcomes equally well for all values of a sensitive attribute with respect to both the positive class and negative class, not just one class or the other exclusively. In other words, both the true positive rate and false positive rate should be the same for all groups.

Equalized odds is related to equality of opportunity, which only focuses on error rates for a single class (positive or negative).

For example, suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians' secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians' secondary schools don't offer math classes at all, and as a result, far fewer of their students are qualified. Equalized odds is satisfied provided that no matter whether an applicant is a Lilliputian or a Brobdingnagian, if they are qualified, they are equally as likely to get admitted to the program, and if they are not qualified, they are equally as likely to get rejected.

Suppose 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:

Table 3. Lilliputian applicants (90% are qualified)

  Qualified Unqualified
Admitted 45 2
Rejected 45 8
Total 90 10
Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 8/10 = 80%
Total percentage of Lilliputian students admitted: (45+2)/100 = 47%

 

Table 4. Brobdingnagian applicants (10% are qualified):

  Qualified Unqualified
Admitted 5 18
Rejected 5 72
Total 10 90
Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 72/90 = 80%
Total percentage of Brobdingnagian students admitted: (5+18)/100 = 23%

Equalized odds is satisfied because qualified Lilliputian and Brobdingnagian students both have a 50% chance of being admitted, and unqualified Lilliputian and Brobdingnagian students both have an 80% chance of being rejected.

Equalized odds is formally defined in "Equality of Opportunity in Supervised Learning" as follows: "predictor Ŷ satisfies equalized odds with respect to protected attribute A and outcome Y if Ŷ and A are independent, conditional on Y."

evals

#language
#generativeAI
#Metric

Primarily used as an abbreviation for LLM evaluations. More broadly, evals is an abbreviation for any form of evaluation.

evaluation

#language
#generativeAI
#Metric

The process of measuring a model's quality or comparing different models against each other.

To evaluate a supervised machine learning model, you typically judge it against a validation set and a test set. Evaluating an LLM typically involves broader quality and safety assessments.

F

F1

#Metric

A "roll-up" binary classification metric that relies on both precision and recall. Here is the formula:

$$\text{F1} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

Suppose precision and recall have the following values:

  • precision = 0.6
  • recall = 0.4

You calculate F1 as follows:

$$\text{F1} = \frac{2 \times 0.6 \times 0.4}{0.6 + 0.4} = 0.48$$

When precision and recall are fairly similar (as in the preceding example), F1 is close to their mean. When precision and recall differ significantly, F1 is closer to the lower value. For example:

  • precision = 0.9
  • recall = 0.1

$$\text{F1} = \frac{2 \times 0.9 \times 0.1}{0.9 + 0.1} = 0.18$$
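As a quick Python sketch (the function name is illustrative):

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.6, 0.4))  # 0.48 -- close to the mean of similar values
print(f1(0.9, 0.1))  # 0.18 (not 0.5) -- pulled toward the lower value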

fairness metric

#fairness
#Metric

A mathematical definition of "fairness" that is measurable. Some commonly used fairness metrics include:

  • demographic parity
  • equality of opportunity
  • equalized odds
  • counterfactual fairness

Many fairness metrics are mutually exclusive; see incompatibility of fairness metrics.

false negative (FN)

#fundamentals
#Metric

An example in which the model mistakenly predicts the negative class. For example, the model predicts that a particular email message is not spam (the negative class), but that email message actually is spam.

false negative rate

#Metric

The proportion of actual positive examples for which the model mistakenly predicted the negative class. The following formula calculates the false negative rate:

$$\text{false negative rate} = \frac{\text{false negatives}}{\text{false negatives} + \text{true positives}}$$

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive (FP)

#fundamentals
#Metric

An example in which the model mistakenly predicts the positive class. For example, the model predicts that a particular email message is spam (the positive class), but that email message is actually not spam.

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive rate (FPR)

#fundamentals
#Metric

The proportion of actual negative examples for which the model mistakenly predicted the positive class. The following formula calculates the false positive rate:

$$\text{false positive rate} = \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}$$

The false positive rate is the x-axis in an ROC curve.

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

feature importances

#df
#Metric

Synonym for variable importances.

fraction of successes

#generativeAI
#Metric

A metric for evaluating an ML model's generated text. The fraction of successes is the number of "successful" generated text outputs divided by the total number of generated text outputs. For example, if a large language model generated 10 blocks of code, five of which were successful, then the fraction of successes would be 50%.

Although fraction of successes is broadly useful throughout statistics, within ML, this metric is primarily useful for measuring verifiable tasks like code generation or math problems.

G

gini impurity

#df
#Metric

A metric similar to entropy. Splitters use values derived from either gini impurity or entropy to compose conditions for classification decision trees. Information gain is derived from entropy. There is no universally accepted equivalent term for the metric derived from gini impurity; however, this unnamed metric is just as important as information gain.

Gini impurity is also called gini index, or simply gini.

Gini impurity is the probability of misclassifying a new piece of data taken from the same distribution. The gini impurity of a set with two possible values "0" and "1" (for example, the labels in a binary classification problem) is calculated from the following formula:

$$I = 1 - (p^2 + q^2) = 1 - (p^2 + (1-p)^2)$$

where:

  • I is the gini impurity.
  • p is the fraction of "1" examples.
  • q is the fraction of "0" examples. Note that q = 1-p

For example, consider the following dataset:

  • 100 labels (0.25 of the dataset) contain the value "1"
  • 300 labels (0.75 of the dataset) contain the value "0"

Therefore, the gini impurity is:

  • p = 0.25
  • q = 0.75
  • I = 1 - (0.25² + 0.75²) = 0.375

Consequently, a random label from the same dataset would have a 37.5% chance of being misclassified, and a 62.5% chance of being properly classified.

A perfectly balanced label (for example, 200 "0"s and 200 "1"s) would have a gini impurity of 0.5. A highly imbalanced label would have a gini impurity close to 0.0.
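Here's the same calculation as a small Python sketch (the function name is illustrative):

def gini_impurity(p):
    """Gini impurity of a two-value set, where p is the fraction of "1"s."""
    q = 1 - p
    return 1 - (p ** 2 + q ** 2)

print(gini_impurity(0.25))  # 0.375
print(gini_impurity(0.5))   # 0.5 (perfectly balanced)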


H

hinge loss

#Metric

A family of loss functions for classification designed to find the decision boundary as distant as possible from each training example, thus maximizing the margin between examples and the boundary. KSVMs use hinge loss (or a related function, such as squared hinge loss). For binary classification, the hinge loss function is defined as follows:

$$\text{loss} = \max(0, 1 - (y \cdot y'))$$

where y is the true label, either -1 or +1, and y' is the raw output of the classifier model:

$$y' = b + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n$$

Consequently, a plot of hinge loss versus (y * y') looks as follows:

[Figure: a plot of hinge loss versus (y * y'): a line segment falls from (-3, 4) to (1, 0), then continues flat at 0 indefinitely.]
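A minimal Python sketch of the binary hinge loss (the function name is illustrative):

def hinge_loss(y, y_prime):
    """Hinge loss for a true label y in {-1, +1} and raw model output y'."""
    return max(0.0, 1.0 - y * y_prime)

print(hinge_loss(+1, 2.0))   # 0.0 -- correct and beyond the margin
print(hinge_loss(+1, 0.5))   # 0.5 -- correct but inside the margin
print(hinge_loss(+1, -1.0))  # 2.0 -- wrong side of the decision boundary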

I

incompatibility of fairness metrics

#fairness
#Metric

The idea that some notions of fairness are mutually incompatible and cannot be satisfied simultaneously. As a result, there is no single universal metric for quantifying fairness that can be applied to all ML problems.

While this may seem discouraging, incompatibility of fairness metrics doesn't imply that fairness efforts are fruitless. Instead, it suggests that fairness must be defined contextually for a given ML problem, with the goal of preventing harms specific to its use cases.

See "On the (im)possibility of fairness" for a more detailed discussion of the incompatibility of fairness metrics.

individual fairness

#fairness
#Metric

A fairness metric that checks whether similar individuals are classified similarly. For example, Brobdingnagian Academy might want to satisfy individual fairness by ensuring that two students with identical grades and standardized test scores are equally likely to gain admission.

Note that individual fairness relies entirely on how you define "similarity" (in this case, grades and test scores), and you can run the risk of introducing new fairness problems if your similarity metric misses important information (such as the rigor of a student's curriculum).

See "Fairness Through Awareness" for a more detailed discussion of individual fairness.

information gain

#df
#Metric

In decision forests, the difference between a node's entropy and the weighted (by number of examples) sum of the entropy of its children nodes. A node's entropy is the entropy of the examples in that node.

For example, consider the following entropy values:

  • entropy of parent node = 0.6
  • entropy of one child node with 16 relevant examples = 0.2
  • entropy of another child node with 24 relevant examples = 0.1

So 40% of the examples are in one child node and 60% are in the other child node. Therefore:

  • weighted entropy sum of child nodes = (0.4 * 0.2) + (0.6 * 0.1) = 0.14

So, the information gain is:

  • information gain = entropy of parent node - weighted entropy sum of child nodes
  • information gain = 0.6 - 0.14 = 0.46

Most splitters seek to create conditions that maximize information gain.
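The following Python sketch reproduces that arithmetic (the function name is illustrative):

def information_gain(parent_entropy, child_entropies, child_sizes):
    """Parent entropy minus the example-weighted entropy of the children."""
    total = sum(child_sizes)
    weighted = sum(e * size / total
                   for e, size in zip(child_entropies, child_sizes))
    return parent_entropy - weighted

# Parent entropy 0.6; children with entropy 0.2 (16 examples) and 0.1 (24):
print(information_gain(0.6, [0.2, 0.1], [16, 24]))  # ~0.46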

inter-rater agreement

#Metric

A measurement of how often human raters agree when doing a task. If raters disagree, the task instructions may need to be improved. Also sometimes called inter-annotator agreement or inter-rater reliability. See also Cohen's kappa, which is one of the most popular inter-rater agreement measurements.

See Categorical data: Common issues in Machine Learning Crash Course for more information.

L

L1 loss

#fundamentals
#Metric

A loss function that calculates the absolute value of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L1 loss for a batch of five examples:

Actual value of example Model's predicted value Absolute value of delta
7 6 1
5 4 1
8 11 3
4 6 2
9 8 1
  8 = L1 loss

L1 loss is less sensitive to outliers than L2 loss.

The Mean Absolute Error is the average L1 loss per example.

$$L_1 \text{ loss} = \sum_{i=0}^{n} |y_i - \hat{y}_i|$$

where:

  • n is the number of examples.
  • y is the actual value of the label.
  • ŷ is the value that the model predicts for y.

See Linear regression: Loss in Machine Learning Crash Course for more information.

L2 loss

#fundamentals
#Metric

A loss function that calculates the square of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L2 loss for a batch of five examples:

Actual value of example Model's predicted value Square of delta
7 6 1
5 4 1
8 11 9
4 6 4
9 8 1
  16 = L2 loss

Due to squaring, L2 loss amplifies the influence of outliers. That is, L2 loss reacts more strongly to bad predictions than L1 loss. For example, the L1 loss for the preceding batch would be 8 rather than 16. Notice that a single outlier accounts for 9 of the 16.

Regression models typically use L2 loss as the loss function.

The Mean Squared Error is the average L2 loss per example. Squared loss is another name for L2 loss.

$$L_2 \text{ loss} = \sum_{i=0}^{n} (y_i - \hat{y}_i)^2$$

where:

  • n is the number of examples.
  • y is the actual value of the label.
  • ŷ is the value that the model predicts for y.

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

LLM evaluations (evals)

#language
#generativeAI
#Metric

A set of metrics and benchmarks for assessing the performance of large language models (LLMs). At a high level, LLM evaluations:

  • Help researchers identify areas where LLMs need improvement.
  • Are useful in comparing different LLMs and identifying the best LLM for a particular task.
  • Help ensure that LLMs are safe and ethical to use.

See Large language models (LLMs) in Machine Learning Crash Course for more information.

loss

#fundamentals
#Metric

During the training of a supervised model, a measure of how far a model's prediction is from its label.

A loss function calculates the loss.

See Linear regression: Loss in Machine Learning Crash Course for more information.

loss function

#fundamentals
#Metric

During training or testing, a mathematical function that calculates the loss on a batch of examples. A loss function returns a lower loss for models that make good predictions than for models that make bad predictions.

The goal of training is typically to minimize the loss that a loss function returns.

Many different kinds of loss functions exist. Pick the appropriate loss function for the kind of model you are building. For example, regression models typically use L2 loss, while binary classification models often use Log Loss.

M

Mean Absolute Error (MAE)

#Metric

The average loss per example when L1 loss is used. Calculate Mean Absolute Error as follows:

  1. Calculate the L1 loss for a batch.
  2. Divide the L1 loss by the number of examples in the batch.

$$\text{Mean Absolute Error} = \frac{1}{n} \sum_{i=0}^{n} |y_i - \hat{y}_i|$$

where:

  • n is the number of examples.
  • y is the actual value of the label.
  • ŷ is the value that the model predicts for y.

For example, consider the calculation of L1 loss on the following batch of five examples:

Actual value of example Model's predicted value Loss (difference between actual and predicted)
7 6 1
5 4 1
8 11 3
4 6 2
9 8 1
  8 = L1 loss

So, L1 loss is 8 and the number of examples is 5. Therefore, the Mean Absolute Error is:

Mean Absolute Error = L1 loss / Number of Examples
Mean Absolute Error = 8/5 = 1.6
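The same computation as a Python sketch (the function name is illustrative):

def mean_absolute_error(actuals, predictions):
    """Average absolute difference between labels and predictions."""
    return sum(abs(y - y_hat)
               for y, y_hat in zip(actuals, predictions)) / len(actuals)

# The batch of five examples above:
print(mean_absolute_error([7, 5, 8, 4, 9], [6, 4, 11, 6, 8]))  # 1.6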

Contrast Mean Absolute Error with Mean Squared Error and Root Mean Squared Error.

mean average precision at k (mAP@k)

#language
#generativeAI
#Metric

The statistical mean of all average precision at k scores across a validation dataset. One use of mean average precision at k is to judge the quality of recommendations generated by a recommendation system.

Although the phrase "mean average" sounds redundant, the name of the metric is appropriate. After all, this metric finds the mean of multiple average precision at k values.

Suppose you build a recommendation system that generates a personalized list of recommended novels for each user. Based on feedback from selected users, you calculate the following five average precision at k scores (one score per user):

  • 0.73
  • 0.77
  • 0.67
  • 0.82
  • 0.76

The mean average precision at k is therefore:

$$\text{mean average precision at k} = \frac{0.73 + 0.77 + 0.67 + 0.82 + 0.76}{5} = 0.75$$

Mean Squared Error (MSE)

#Metric

The average loss per example when L2 loss is used. Calculate Mean Squared Error as follows:

  1. Calculate the L2 loss for a batch.
  2. Divide the L2 loss by the number of examples in the batch.

$$\text{Mean Squared Error} = \frac{1}{n} \sum_{i=0}^{n} (y_i - \hat{y}_i)^2$$

where:

  • n is the number of examples.
  • y is the actual value of the label.
  • ŷ is the model's prediction for y.

For example, consider the loss on the following batch of five examples:

Actual value Model's prediction Loss Squared loss
7 6 1 1
5 4 1 1
8 11 3 9
4 6 2 4
9 8 1 1
16 = L2 loss

Therefore, the Mean Squared Error is:

Mean Squared Error = L2 loss / Number of Examples
Mean Squared Error = 16/5 = 3.2
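The same computation as a Python sketch (the function name is illustrative):

def mean_squared_error(actuals, predictions):
    """Average squared difference between labels and predictions."""
    return sum((y - y_hat) ** 2
               for y, y_hat in zip(actuals, predictions)) / len(actuals)

# The batch of five examples above:
print(mean_squared_error([7, 5, 8, 4, 9], [6, 4, 11, 6, 8]))  # 3.2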

Mean Squared Error is a popular training loss function, particularly for linear regression.

Contrast Mean Squared Error with Mean Absolute Error and Root Mean Squared Error.

TensorFlow Playground uses Mean Squared Error to calculate loss values.

Outliers strongly influence Mean Squared Error. For example, a loss of 1 is a squared loss of 1, but a loss of 3 is a squared loss of 9. In the preceding table, the example with a loss of 3 accounts for ~56% of the Mean Squared Error, while each of the examples with a loss of 1 accounts for only 6% of the Mean Squared Error.

Outliers don't influence Mean Absolute Error as strongly as Mean Squared Error. For example, a loss of 3 accounts for only ~38% of the Mean Absolute Error.

Clipping is one way to prevent extreme outliers from damaging your model's predictive ability.


metric

#TensorFlow
#Metric

A statistic that you care about.

An objective is a metric that a machine learning system tries to optimize.

Metrics API (tf.metrics)

#Metric

A TensorFlow API for evaluating models. For example, tf.metrics.accuracy determines how often a model's predictions match labels.

minimax loss

#Metric

A loss function for generative adversarial networks, based on the cross-entropy between the distribution of generated data and real data.

Minimax loss is used in the first paper to describe generative adversarial networks.

See Loss Functions in the Generative Adversarial Networks course for more information.

model capacity

#Metric

The complexity of problems that a model can learn. The more complex the problems that a model can learn, the higher the model's capacity. A model's capacity typically increases with the number of model parameters. For a formal definition of classifier capacity, see VC dimension.

N

negative class

#fundamentals
#Metric

In binary classification, one class is termed positive and the other is termed negative. The positive class is the thing or event that the model is testing for and the negative class is the other possibility. For example:

  • The negative class in a medical test might be "not tumor."
  • The negative class in an email classifier might be "not spam."

Contrast with positive class.

O

objective

#Metric

A metric that your algorithm is trying to optimize.

objective function

#Metric

The mathematical formula or metric that a model aims to optimize. For example, the objective function for linear regression is usually Mean Squared Error. Therefore, when training a linear regression model, training aims to minimize Mean Squared Error.

In some cases, the goal is to maximize the objective function. For example, if the objective function is accuracy, the goal is to maximize accuracy.

See also loss.

P

pass at k (pass@k)

#Metric

A metric to determine the quality of code (for example, Python) that a large language model generates. More specifically, pass at k tells you the likelihood that at least one generated block of code out of k generated blocks of code will pass all of its unit tests.

Large language models often struggle to generate good code for complex programming problems. Software engineers adapt to this problem by prompting the large language model to generate multiple (k) solutions for the same problem. Then, software engineers test each of the solutions against unit tests. The calculation of pass at k depends on the outcome of the unit tests:

  • If one or more of those solutions pass the unit test, then the LLM Passes that code generation challenge.
  • If none of the solutions pass the unit test, then the LLM Fails that code generation challenge.

The formula for pass at k is as follows:

$$\text{pass at k} = \frac{\text{total number of passes}}{\text{total number of challenges}}$$

In general, higher values of k produce higher pass at k scores; however, higher values of k require more large language model and unit testing resources.

Suppose a software engineer asks a large language model to generate k=10 solutions for n=50 challenging coding problems. Here are the results:

  • 30 Passes
  • 20 Fails

The pass at 10 score is therefore:

$$\text{pass at 10} = \frac{30}{50} = 0.6$$
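A minimal Python sketch, assuming you already know which challenges had at least one passing solution (the function name is illustrative):

def pass_at_k(challenge_passed):
    """Fraction of challenges where at least one of the k generated
    solutions passed all unit tests."""
    return sum(challenge_passed) / len(challenge_passed)

# 30 of 50 challenges had a passing solution among their k=10 attempts:
print(pass_at_k([True] * 30 + [False] * 20))  # 0.6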

performance

#Metric

Overloaded term with the following meanings:

  • The standard meaning within software engineering. Namely: How fast (or efficiently) does this piece of software run?
  • The meaning within machine learning. Here, performance answers the following question: How correct is this model? That is, how good are the model's predictions?

permutation variable importances

#df
#Metric

A type of variable importance that evaluates the increase in the prediction error of a model after permuting the feature's values. Permutation variable importance is a model-independent metric.

perplexity

#Metric

One measure of how well a model is accomplishing its task. For example, suppose your task is to read the first few letters of a word a user is typing on a phone keyboard, and to offer a list of possible completion words. Perplexity, P, for this task is approximately the number of guesses you need to offer in order for your list to contain the actual word the user is trying to type.

Perplexity is related to cross-entropy as follows:

$$P = 2^{\text{cross entropy}}$$

positive class

#fundamentals
#Metric

The class you are testing for.

For example, the positive class in a cancer model might be "tumor." The positive class in an email classifier might be "spam."

Contrast with negative class.

The term positive class can be confusing because the "positive" outcome of many tests is often an undesirable result. For example, the positive class in many medical tests corresponds to tumors or diseases. In general, you want a doctor to tell you, "Congratulations! Your test results were negative." Regardless, the positive class is the event that the test is seeking to find.

Admittedly, you're simultaneously testing for both the positive and negative classes.


PR AUC (area under the PR curve)

#Metric

Area under the interpolated precision-recall curve, obtained by plotting (recall, precision) points for different values of the classification threshold.

precision

#Metric

A metric for classification models that answers the following question:

When the model predicted the positive class, what percentage of the predictions were correct?

Here is the formula:

$$\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$$

where:

  • true positive means the model correctly predicted the positive class.
  • false positive means the model mistakenly predicted the positive class.

For example, suppose a model made 200 positive predictions. Of these 200 positive predictions:

  • 150 were true positives.
  • 50 were false positives.

In this case:

$$\text{Precision} = \frac{150}{150 + 50} = 0.75$$

Contrast with accuracy and recall.

See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.

precision at k (precision@k)

#language
#Metric

A metric for evaluating a ranked (ordered) list of items. Precision at k identifies the fraction of the first k items in that list that are "relevant." That is:

$$\text{precision at k} = \frac{\text{relevant items in first } k \text{ items of the list}}{k}$$

The value of k must be less than or equal to the length of the returned list. Note that the length of the returned list is not part of the calculation.

Relevance is often subjective; even expert human evaluators often disagree on which items are relevant.

Compare with average precision at k and recall at k.

Suppose a large language model is given the following query:

List the 6 funniest movies of all time in order.

And the large language model returns the list shown in the first two columns of the following table:

Position Movie Relevant?
1 The General Yes
2 Mean Girls Yes
3 Platoon No
4 Bridesmaids Yes
5 Citizen Kane No
6 This is Spinal Tap Yes

Two of the first three movies are relevant, so precision at 3 is:

$$\text{precision at 3} = \frac{2}{3} = 0.67$$

Four of the first five movies are very funny, so precision at 5 is:

$$\text{precision at 5} = \frac{4}{5} = 0.8$$
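A minimal Python sketch of the calculation (the function name is illustrative):

def precision_at_k(relevant, k):
    """Fraction of the first k ranked items that are relevant."""
    return sum(relevant[:k]) / k

# The movie example: positions 1, 2, 4, and 6 are relevant.
movies = [True, True, False, True, False, True]
print(precision_at_k(movies, 3))  # ~0.67
print(precision_at_k(movies, 5))  # 0.8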

precision-recall curve

#Metric

A curve of precision versus recall at different classification thresholds.

prediction bias

#Metric

A value indicating how far apart the average of predictions is from the average of labels in the dataset.

Not to be confused with the bias term in machine learning models or with bias in ethics and fairness.

predictive parity

#fairness
#Metric

A fairness metric that checks whether, for a given classifier, the precision rates are equivalent for subgroups under consideration.

For example, a model that predicts college acceptance would satisfy predictive parity for nationality if its precision rate is the same for Lilliputians and Brobdingnagians.

Predictive parity is sometimes also called predictive rate parity.

See "Fairness Definitions Explained" (section 3.2.1) for a more detailed discussion of predictive parity.

predictive rate parity

#fairness
#Metric

Another name for predictive parity.

probability density function

#Metric

A function that identifies the frequency of data samples having exactly a particular value. When a dataset's values are continuous floating-point numbers, exact matches rarely occur. However, integrating a probability density function from value x to value y yields the expected frequency of data samples between x and y.

For example, consider a normal distribution having a mean of 200 and a standard deviation of 30. To determine the expected frequency of data samples falling within the range 211.4 to 218.7, you can integrate the probability density function for a normal distribution from 211.4 to 218.7.

R

recall

#Metric

A metric for classification models that answers the following question:

When ground truth was the positive class, what percentage of predictions did the model correctly identify as the positive class?

Here is the formula:

$$\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$$

where:

  • true positive means the model correctly predicted the positive class.
  • false negative means that the model mistakenly predicted the negative class.

For instance, suppose your model made 200 predictions on examples for which ground truth was the positive class. Of these 200 predictions:

  • 180 were true positives.
  • 20 were false negatives.

In this case:

$$\text{Recall} = \frac{180}{180 + 20} = 0.9$$

Recall is particularly useful for determining the predictive power of classification models in which the positive class is rare. For example, consider a class-imbalanced dataset in which the positive class for a certain disease occurs in only 10 patients out of a million. Suppose your model makes five million predictions that yield the following outcomes:

  • 30 True Positives
  • 20 False Negatives
  • 4,999,000 True Negatives
  • 950 False Positives

The recall of this model is therefore:

recall = TP / (TP + FN)
recall = 30 / (30 + 20) = 0.6 = 60%

By contrast, the accuracy of this model is:

accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (30 + 4,999,000) / (30 + 4,999,000 + 950 + 20) = 99.98%

That high value of accuracy looks impressive but is essentially meaningless. Recall is a much more useful metric for class-imbalanced datasets than accuracy.


See Classification: Accuracy, recall, precision and related metrics for more information.

recall at k (recall@k)

#language
#Metric

A metric for evaluating systems that output a ranked (ordered) list of items. Recall at k identifies the fraction of relevant items in the first k items in that list out of the total number of relevant items returned.

$$\text{recall at k} = \frac{\text{relevant items in first } k \text{ items of the list}}{\text{total number of relevant items in the list}}$$

Contrast with precision at k.

Suppose a large language model is given the following query:

List the 10 funniest movies of all time in order.

And the large language model returns the list shown in the first two columns:

Position Movie Relevant?
1 The General Yes
2 Mean Girls Yes
3 Platoon No
4 Bridesmaids Yes
5 This is Spinal Tap Yes
6 Airplane! Yes
7 Groundhog Day Yes
8 Monty Python and the Holy Grail Yes
9 Oppenheimer No
10 Clueless Yes

Eight of the movies in the preceding list are very funny, so they are "relevant items in the list." Therefore, 8 will be the denominator in all the calculations of recall at k. What about the numerator? Well, 3 of the first 4 items are relevant, so recall at 4 is:

$$\text{recall at 4} = \frac{3}{8} = 0.375$$

7 of the first 8 movies are very funny, so recall at 8 is:

$$\text{recall at 8} = \frac{7}{8} = 0.875$$
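A minimal Python sketch of the calculation (the function name is illustrative):

def recall_at_k(relevant, k):
    """Relevant items in the first k positions divided by the total
    number of relevant items in the returned list."""
    return sum(relevant[:k]) / sum(relevant)

# The movie example: 8 of the 10 returned movies are relevant.
movies = [True, True, False, True, True, True, True, True, False, True]
print(recall_at_k(movies, 4))  # 0.375
print(recall_at_k(movies, 8))  # 0.875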

ROC (receiver operating characteristic) Curve

#fundamentals
#Metric

A graph of true positive rate versus false positive rate for different classification thresholds in binary classification.

The shape of an ROC curve suggests a binary classification model's ability to separate positive classes from negative classes. Suppose, for example, that a binary classification model perfectly separates all the negative classes from all the positive classes:

[Figure: a number line with 8 positive examples on the right side and 7 negative examples on the left.]

The ROC curve for the preceding model looks as follows:

[Figure: an ROC curve with an inverted-L shape; x-axis is False Positive Rate, y-axis is True Positive Rate. The curve goes from (0.0,0.0) straight up to (0.0,1.0), then across to (1.0,1.0).]

In contrast, the following illustration graphs the raw logistic regression values for a terrible model that can't separate negative classes from positive classes at all:

[Figure: a number line with positive and negative examples completely intermixed.]

The ROC curve for this model looks as follows:

[Figure: an ROC curve that is a straight diagonal line from (0.0,0.0) to (1.0,1.0).]

Meanwhile, back in the real world, most binary classification models separate positive and negative classes to some degree, but usually not perfectly. So, a typical ROC curve falls somewhere between the two extremes:

[Figure: an ROC curve; x-axis is False Positive Rate, y-axis is True Positive Rate. The curve approximates a shaky arc from West to North.]

The point on an ROC curve closest to (0.0,1.0) theoretically identifies the ideal classification threshold. However, several other real-world issues influence the selection of the ideal classification threshold. For example, perhaps false negatives cause far more pain than false positives.

A numerical metric called AUC summarizes the ROC curve into a single floating-point value.

Root Mean Squared Error (RMSE)

#fundamentals
#Metric

The square root of the Mean Squared Error.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

#language
#Metric

A family of metrics that evaluate automatic summarization and machine translation models. ROUGE metrics determine the degree to which a reference text overlaps an ML model's generated text. Each member of the ROUGE family measures overlap in a different way. Higher ROUGE scores indicate more similarity between the reference text and generated text than lower ROUGE scores.

Each ROUGE family member typically generates the following metrics:

  • Precision
  • Recall
  • F1

For details and examples, see ROUGE-L, ROUGE-N, and ROUGE-S.

ROUGE-L

#language
#Metric

A member of the ROUGE family focused on the length of the longest common subsequence in the reference text and generated text. The following formulas calculate recall and precision for ROUGE-L:

$$\text{ROUGE-L recall} = \frac{\text{length of the longest common subsequence}}{\text{number of words in the reference text}}$$

$$\text{ROUGE-L precision} = \frac{\text{length of the longest common subsequence}}{\text{number of words in the generated text}}$$

You can then use F1 to roll up ROUGE-L recall and ROUGE-L precision into a single metric:

$$\text{ROUGE-L F1} = \frac{2 \times \text{ROUGE-L recall} \times \text{ROUGE-L precision}}{\text{ROUGE-L recall} + \text{ROUGE-L precision}}$$

Consider the following reference text and generated text.

Category Who produced? Text
Reference text Human translator I want to understand a wide variety of things.
Generated text ML model I want to learn plenty of things.

Therefore:

  • The longest common subsequence is 5 words (I want to of things).
  • The number of words in the reference text is 9.
  • The number of words in the generated text is 7.

Consequently:

$$\text{ROUGE-L recall} = \frac{5}{9} = 0.56$$

$$\text{ROUGE-L precision} = \frac{5}{7} = 0.71$$

$$\text{ROUGE-L F1} = \frac{2 \times 0.56 \times 0.71}{0.56 + 0.71} = 0.63$$

ROUGE-L ignores any newlines in the reference text and generated text, so the longest common subsequence could cross multiple sentences. When the reference text and generated text involve multiple sentences, a variation of ROUGE-L called ROUGE-Lsum is generally a better metric. ROUGE-Lsum determines the longest common subsequence for each sentence in a passage and then calculates the mean of those longest common subsequences.

Consider the following reference text and generated text.

Category Who produced? Text
Reference text Human translator The surface of Mars is dry. Nearly all the water is deep underground.
Generated text ML model Mars has a dry surface. However, the vast majority of water is underground.

Therefore:

  First sentence Second sentence
Longest common subsequence 2 (Mars dry) 3 (water is underground)
Sentence length of reference text 6 7
Sentence length of generated text 5 8

Consequently:

$$\text{recall of first sentence} = \frac{2}{6} = 0.33$$

$$\text{recall of second sentence} = \frac{3}{7} = 0.43$$

$$\text{ROUGE-Lsum recall} = \frac{0.33 + 0.43}{2} = 0.38$$

$$\text{precision of first sentence} = \frac{2}{5} = 0.4$$

$$\text{precision of second sentence} = \frac{3}{8} = 0.38$$

$$\text{ROUGE-Lsum precision} = \frac{0.4 + 0.38}{2} = 0.39$$

$$\text{ROUGE-Lsum F1} = \frac{2 \times 0.38 \times 0.39}{0.38 + 0.39} = 0.38$$

ROUGE-N

#language
#Metric

A set of metrics within the ROUGE family that compares the shared N-grams of a certain size in the reference text and generated text. For example:

  • ROUGE-1 measures the number of shared tokens in the reference text and generated text.
  • ROUGE-2 measures the number of shared bigrams (2-grams) in the reference text and generated text.
  • ROUGE-3 measures the number of shared trigrams (3-grams) in the reference text and generated text.

You can use the following formulas to calculate ROUGE-N recall and ROUGE-N precision for any member of the ROUGE-N family:

$$\text{ROUGE-N recall} = \frac{\text{number of matching N-grams}}{\text{number of N-grams in the reference text}}$$

$$\text{ROUGE-N precision} = \frac{\text{number of matching N-grams}}{\text{number of N-grams in the generated text}}$$

You can then use F1 to roll up ROUGE-N recall and ROUGE-N precision into a single metric:

$$\text{ROUGE-N F1} = \frac{2 \times \text{ROUGE-N recall} \times \text{ROUGE-N precision}}{\text{ROUGE-N recall} + \text{ROUGE-N precision}}$$

Suppose you decide to use ROUGE-2 to measure the effectiveness of an ML model's translation compared to a human translator's.

Category Who produced? Text Bigrams
Reference text Human translator I want to understand a wide variety of things. I want, want to, to understand, understand a, a wide, wide variety, variety of, of things
Generated text ML model I want to learn plenty of things. I want, want to, to learn, learn plenty, plenty of, of things

Therefore:

  • The number of matching 2-grams is 3 (I want, want to, and of things).
  • The number of 2-grams in the reference text is 8.
  • The number of 2-grams in the generated text is 6.

Consequently:

$$\text{ROUGE-2 recall} = \frac{3}{8} = 0.375$$

$$\text{ROUGE-2 precision} = \frac{3}{6} = 0.5$$

$$\text{ROUGE-2 F1} = \frac{2 \times 0.375 \times 0.5}{0.375 + 0.5} = 0.43$$
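The following Python sketch reproduces the ROUGE-2 example above; it uses a naive whitespace tokenizer, whereas real implementations normalize text more carefully:

from collections import Counter

def rouge_n(reference, generated, n=2):
    """ROUGE-N recall, precision, and F1 from clipped n-gram counts."""
    def ngrams(text):
        tokens = text.lower().replace(".", "").split()
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    ref, gen = ngrams(reference), ngrams(generated)
    matches = sum((ref & gen).values())  # overlap, clipped per n-gram
    recall = matches / sum(ref.values())
    precision = matches / sum(gen.values())
    f1 = (2 * recall * precision / (recall + precision)) if matches else 0.0
    return recall, precision, f1

print(rouge_n("I want to understand a wide variety of things.",
              "I want to learn plenty of things."))
# (0.375, 0.5, ~0.43)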

ROUGE-S

#language
#Metric

A forgiving form of ROUGE-N that enables skip-gram matching. That is, ROUGE-N only counts N-grams that match exactly, but ROUGE-S also counts N-grams separated by one or more words. For example, when calculating ROUGE-N, the 2-gram White clouds doesn't match White billowing clouds. However, when calculating ROUGE-S, White clouds does match White billowing clouds.

R-squared

#Metric

A regression metric indicating how much variation in a label is due to an individual feature or to a feature set. R-squared is a value between 0 and 1, which you can interpret as follows:

  • An R-squared of 0 means that none of a label's variation is due to the feature set.
  • An R-squared of 1 means that all of a label's variation is due to the feature set.
  • An R-squared between 0 and 1 indicates the extent to which the label's variation can be predicted from a particular feature or the feature set. For example, an R-squared of 0.10 means that 10 percent of the variance in the label is due to the feature set, an R-squared of 0.20 means that 20 percent is due to the feature set, and so on.

R-squared is the square of the Pearson correlation coefficient between the values that a model predicted and ground truth.
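A minimal Python sketch of that definition (the function name is illustrative):

import math

def r_squared(actuals, predictions):
    """Square of the Pearson correlation between labels and predictions."""
    n = len(actuals)
    mean_y = sum(actuals) / n
    mean_p = sum(predictions) / n
    cov = sum((y - mean_y) * (p - mean_p)
              for y, p in zip(actuals, predictions))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in actuals))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predictions))
    return (cov / (sd_y * sd_p)) ** 2

# Predictions that track the labels closely yield an R-squared near 1:
print(r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.9]))  # ~0.99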

S

scoring

#recsystems
#Metric

The part of a recommendation system that provides a value or ranking for each item produced by the candidate generation phase.

similarity measure

#clustering
#Metric

In clustering algorithms, the metric used to determine how alike (how similar) any two examples are.

sparsity

#Metric

The number of elements set to zero (or null) in a vector or matrix divided by the total number of entries in that vector or matrix. For example, consider a 100-element matrix in which 98 cells contain zero. The calculation of sparsity is as follows:

$$\text{sparsity} = \frac{98}{100} = 0.98$$

Feature sparsity refers to the sparsity of a feature vector; model sparsity refers to the sparsity of the model weights.

squared hinge loss

#Metric

The square of the hinge loss. Squared hinge loss penalizes outliers more harshly than regular hinge loss.

squared loss

#fundamentals
#Metric

Synonym for L2 loss.

T

test loss

#fundamentals
#Metric

A metric representing a model's loss against the test set. When building a model, you typically try to minimize test loss. That's because a low test loss is a stronger quality signal than a low training loss or low validation loss.

A large gap between test loss and training loss or validation loss sometimes suggests that you need to increase the regularization rate.

top-k accuracy

#language
#Metric

The percentage of times that a "target label" appears within the first k positions of generated lists. The lists could be personalized recommendations or a list of items ordered by softmax.

Top-k accuracy is also known as accuracy at k.

Consider a machine learning system that uses softmax to identify tree probabilities based on a picture of tree leaves. The following table shows output lists generated from five input tree pictures. Each row contains a target label and the five most likely trees. For example, when the target label was maple, the machine learning model identified elm as the most likely tree, oak as the second most likely tree, and so on.

Target label 1 2 3 4 5
maple elm oak maple beech poplar
dogwood oak dogwood poplar hickory maple
oak oak basswood locust alder linden
linden maple paw-paw oak basswood poplar
oak locust linden oak maple paw-paw

The target label appears in the first position only once, so the top-1 accuracy is:

$$\text{top-1 accuracy} = \frac{1}{5} = 0.2$$

The target label appears in one of the top three positions four times, so the top-3 accuracy is:

$$\text{top-3 accuracy} = \frac{4}{5} = 0.8$$
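A minimal Python sketch over the tree table above (the function name is illustrative):

def top_k_accuracy(targets, ranked_lists, k):
    """Fraction of examples whose target label appears in the first k
    positions of the model's ranked output."""
    hits = sum(1 for target, ranking in zip(targets, ranked_lists)
               if target in ranking[:k])
    return hits / len(targets)

targets = ["maple", "dogwood", "oak", "linden", "oak"]
rankings = [
    ["elm", "oak", "maple", "beech", "poplar"],
    ["oak", "dogwood", "poplar", "hickory", "maple"],
    ["oak", "basswood", "locust", "alder", "linden"],
    ["maple", "paw-paw", "oak", "basswood", "poplar"],
    ["locust", "linden", "oak", "maple", "paw-paw"],
]
print(top_k_accuracy(targets, rankings, k=1))  # 0.2
print(top_k_accuracy(targets, rankings, k=3))  # 0.8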

toxicity

#language
#Metric

The degree to which content is abusive, threatening, or offensive. Many machine learning models can identify and measure toxicity. Most of these models identify toxicity along multiple parameters, such as the level of abusive language and the level of threatening language.

training loss

#fundamentals
#Metric

A metric representing a model's loss during a particular training iteration. For example, suppose the loss function is Mean Squared Error. Perhaps the training loss (the Mean Squared Error) for the 10th iteration is 2.2, and the training loss for the 100th iteration is 1.9.

A loss curve plots training loss versus the number of iterations. A loss curve provides the following hints about training:

  • A downward slope implies that the model is improving.
  • An upward slope implies that the model is getting worse.
  • A flat slope implies that the model has reached convergence.

For example, the following somewhat idealized loss curve shows:

  • A steep downward slope during the initial iterations, which implies rapid model improvement.
  • A gradually flattening (but still downward) slope until close to the end of training, which implies continued model improvement at a somewhat slower pace than during the initial iterations.
  • A flat slope towards the end of training, which suggests convergence.

[Figure: a plot of training loss versus iterations; the loss curve starts with a steep downward slope and gradually flattens until the slope reaches zero.]

Although training loss is important, see also generalization.

true negative (TN)

#fundamentals
#Metric

An example in which the model correctly predicts the negative class. For example, the model infers that a particular email message is not spam, and that email message really is not spam.

true positive (TP)

#fundamentals
#Metric

An example in which the model correctly predicts the positive class. For example, the model infers that a particular email message is spam, and that email message really is spam.

true positive rate (TPR)

#fundamentals
#Metric

Synonym for recall. That is:

$$\text{true positive rate} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$$

True positive rate is the y-axis in an ROC curve.

V

validation loss

#fundamentals
#Metric

A metric representing a model's loss on the validation set during a particular iteration of training.

See also generalization curve.

variable importances

#df
#Metric

A set of scores that indicates the relative importance of each feature to the model.

For example, consider a decision tree that estimates house prices. Suppose this decision tree uses three features: size, age, and style. If a set of variable importances for the three features are calculated to be {size=5.8, age=2.5, style=4.7}, then size is more important to the decision tree than age or style.

Different variable importance metrics exist, which can inform ML experts about different aspects of models.

W

Wasserstein loss

#Metric

One of the loss functions commonly used in generative adversarial networks, based on the earth mover's distance between the distribution of generated data and real data.