Machine Learning Glossary: ML Fundamentals

This page contains the glossary terms for ML Fundamentals. For all glossary terms, click here.

A

accuracy

#fundamentals
#Metric

The number of correct classification predictions divided by the total number of predictions. That is:

$$\text{Accuracy} = \frac{\text{correct predictions}} {\text{correct predictions + incorrect predictions }}$$

For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:

$$\text{Accuracy} = \frac{\text{40}} {\text{40 + 10}} = \text{80%}$$

Binary classification provides specific names for the different categories of correct predictions and incorrect predictions. So, the accuracy formula for binary classification is as follows:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}} {\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

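where:

  • TP is the number of true positives (correct predictions).
  • TN is the number of true negatives (correct predictions).
  • FP is the number of false positives (incorrect predictions).
  • FN is the number of false negatives (incorrect predictions).

Compare and contrast accuracy with precision and recall.

See Classification: Accuracy, recall, precision, and related metrics in Machine Learning Crash Course for more information.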

activation function

#fundamentals

A function that enables neural networks to learn nonlinear (complex) relationships between features and the label.

Popular activation functions include ReLU and sigmoid.

The plots of activation functions are never single straight lines. For example, the plot of the ReLU activation function consists of two straight lines:

Figure: A Cartesian plot of two lines. The first line has a constant y value of 0, running along the x-axis from -infinity,0 to 0,-0. The second line starts at 0,0. This line has a slope of +1, so it runs from 0,0 to +infinity,+infinity.

A plot of the sigmoid activation function looks as follows:

Figure: A two-dimensional curved plot with x values spanning the domain -infinity to +infinity, while y values span the range almost 0 to almost 1. When x is 0, y is 0.5. The slope of the curve is always positive, with the highest slope at 0,0.5 and gradually decreasing slopes as the absolute value of x increases.
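The two plots described above can be reproduced with a few lines of code. The following is a minimal sketch in plain Python; the function names relu and sigmoid are chosen here for illustration and are not tied to any particular library.

```python
import math


def relu(x: float) -> float:
    """Return 0 for negative inputs and x itself otherwise (two straight lines)."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Sample a few points from each curve.
for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), round(sigmoid(x), 4))
```

Note that sigmoid(0) is exactly 0.5, matching the plot description, and that relu has slope 0 to the left of the origin and slope +1 to the right.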

See Neural networks: Activation functions in Machine Learning Crash Course for more information.

artificial intelligence

#fundamentals

A non-human program or model that can solve sophisticated tasks. For example, a program or model that translates text and a program or model that identifies diseases from radiologic images both exhibit artificial intelligence.

Formally, machine learning is a sub-field of artificial intelligence. However, in recent years, some organizations have begun using the terms artificial intelligence and machine learning interchangeably.

AUC (Area under the ROC curve)

#fundamentals
#Metric

A number between 0.0 and 1.0 representing a binary classification model's ability to separate positive classes from negative classes. The closer the AUC is to 1.0, the better the model's ability to separate classes from each other.

For example, the following illustration shows a classification model that separates positive classes (green ovals) from negative classes (purple rectangles) perfectly. This unrealistically perfect model has an AUC of 1.0:

Figure: A number line with 8 positive examples on one side and 9 negative examples on the other side.

Conversely, the following illustration shows the results for a classification model that generated random results. This model has an AUC of 0.5:

Figure: A number line with 6 positive examples and 6 negative examples. The sequence of examples is positive, negative, positive, negative, positive, negative, positive, negative, positive, negative, positive, negative.

Yes, the preceding model has an AUC of 0.5, not 0.0.

Most models are somewhere between the two extremes. For instance, the following model separates positives from negatives somewhat, and therefore has an AUC somewhere between 0.5 and 1.0:

Figure: A number line with 6 positive examples and 6 negative examples. The sequence of examples is negative, negative, negative, negative, positive, negative, positive, positive, negative, positive, positive, positive.

AUC ignores any value you set for classification threshold. Instead, AUC considers all possible classification thresholds.
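One way to make the number concrete: AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes that fraction for examples listed in ascending score order. It assumes that the number lines described above run from low score (left) to high score (right) and that there are no tied scores; the function name auc_from_ranking is hypothetical.

```python
def auc_from_ranking(labels):
    """Compute AUC from labels sorted by ascending model score.

    labels: sequence of 1 (positive) or 0 (negative), lowest score first.
    Returns the fraction of positive/negative pairs ranked correctly.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("AUC needs at least one positive and one negative example")

    negatives_seen = 0
    correctly_ordered_pairs = 0
    for label in labels:
        if label == 1:
            # Every negative seen so far has a lower score than this positive.
            correctly_ordered_pairs += negatives_seen
        else:
            negatives_seen += 1
    return correctly_ordered_pairs / (positives * negatives)


# Perfect separation: 9 negatives followed by 8 positives.
perfect = [0] * 9 + [1] * 8
# Partial separation, following the third number line described above.
partial = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]

print(auc_from_ranking(perfect))
print(auc_from_ranking(partial))
```

Because the calculation only uses the ordering of the scores, no classification threshold appears anywhere in it.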

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

B

backpropagation

#fundamentals

The algorithm that implements gradient descent in neural networks.

Training a neural network involves many iterations of the following two-pass cycle:

  1. During the forward pass, the system processes a batch of examples to yield predictions. The system compares each prediction to its corresponding label value. The difference between the prediction and the label value is the loss for that example. The system aggregates the losses for all the examples to compute the total loss for the current batch.
  2. During the backward pass (backpropagation), the system reduces loss by adjusting the weights of all the neurons in all the hidden layers.

Neural networks often contain many neurons across many hidden layers. Each of those neurons contributes to the overall loss in a different way. Backpropagation determines whether to increase or decrease the weights applied to particular neurons.

The learning rate is a multiplier that controls the degree to which each backward pass increases or decreases each weight. A large learning rate will increase or decrease each weight more than a small learning rate.

In calculus terms, backpropagation implements the chain rule. That is, backpropagation calculates the partial derivative of the error with respect to each parameter.
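To illustrate the two-pass cycle and the chain rule, here is a minimal sketch for a toy network with one hidden ReLU neuron and one output neuron, trained on a single example with squared error. The network shape, the parameter values, and the function names are all assumptions made for illustration; real networks are far larger and rely on a framework to do this work.

```python
def forward(params, x):
    """Forward pass: one hidden ReLU neuron feeding one linear output neuron."""
    w1, b1, w2, b2 = params
    z = w1 * x + b1          # hidden pre-activation
    h = max(0.0, z)          # ReLU activation
    y = w2 * h + b2          # prediction
    return z, h, y


def loss_fn(params, x, target):
    """Squared-error loss for a single example."""
    _, _, y = forward(params, x)
    return (y - target) ** 2


def backward(params, x, target):
    """Backward pass: apply the chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    z, h, y = forward(params, x)
    d_y = 2.0 * (y - target)                 # dLoss/dy
    d_w2 = d_y * h                           # dLoss/dw2
    d_b2 = d_y                               # dLoss/db2
    d_h = d_y * w2                           # dLoss/dh
    d_z = d_h * (1.0 if z > 0 else 0.0)      # ReLU passes gradient only when z > 0
    d_w1 = d_z * x                           # dLoss/dw1
    d_b1 = d_z                               # dLoss/db1
    return (d_w1, d_b1, d_w2, d_b2)


def gradient_step(params, grads, learning_rate):
    """Move every parameter against its gradient, scaled by the learning rate."""
    return tuple(p - learning_rate * g for p, g in zip(params, grads))


params = (0.5, 0.1, -0.3, 0.2)   # illustrative starting values (w1, b1, w2, b2)
x, target = 2.0, 1.0

grads = backward(params, x, target)
new_params = gradient_step(params, grads, learning_rate=0.1)
print(loss_fn(params, x, target), loss_fn(new_params, x, target))
```

A larger learning_rate would move each parameter further in a single step, which is the behavior the learning rate paragraph above describes.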

Years ago, ML practitioners had to write code to implement backpropagation. Modern ML APIs like Keras now implement backpropagation for you. Phew!

See Neural networks in Machine Learning Crash Course for more information.

batch

#fundamentals

The set of examples used in one training iteration. The batch size determines the number of examples in a batch.

See epoch for an explanation of how a batch relates to an epoch.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

batch size

#fundamentals

The number of examples in a batch. For instance, if the batch size is 100, then the model processes 100 examples per iteration.

The following are popular batch size strategies:

  • Stochastic gradient descent (SGD), in which the batch size is 1.
  • Full batch, in which the batch size is the number of examples in the entire training set. For instance, if the training set contains a million examples, then the batch size would be a million examples. Full batch is usually an inefficient strategy.
  • Mini-batch, in which the batch size is usually between 10 and 1000. Mini-batch is usually the most efficient strategy.

See the following for more information:

bias (ethics/fairness)

#responsible
#fundamentals

1. Stereotyping, prejudice, or favoritism towards some things, people, or groups over others. These biases can affect the collection and interpretation of data, the design of a system, and how users interact with a system. Forms of this type of bias include:

2. Systematic error introduced by a sampling or reporting procedure. Forms of this type of bias include:

Not to be confused with the bias term in machine learning models or with prediction bias.

See Fairness: Types of bias in Machine Learning Crash Course for more information.

bias (math) or bias term

#fundamentals

An intercept or offset from an origin. Bias is a parameter in machine learning models, which is symbolized by either of the following:

  • b
  • w_0

For example, bias is the b in the following formula:

$$y' = b + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$

In a simple two-dimensional line, bias just means "y-intercept." For example, the bias of the line in the following illustration is 2.

Figure: A plot of a line with a slope of 0.5 and a bias (y-intercept) of 2.

Bias exists because not all models start from the origin (0,0). For example, suppose an amusement park costs 2 Euros to enter and an additional 0.5 Euro for every hour a customer stays. Therefore, a model mapping the total cost has a bias of 2 because the lowest cost is 2 Euros.
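The amusement park example can be written directly as the linear formula above. This is a small sketch; the function names predict and park_cost are made up for illustration.

```python
def predict(features, weights, bias):
    """Evaluate y' = b + w_1*x_1 + ... + w_n*x_n."""
    return bias + sum(w * x for w, x in zip(weights, features))


def park_cost(hours):
    """Total cost in Euros: 2 to enter (the bias) plus 0.5 per hour (the weight)."""
    return predict(features=[hours], weights=[0.5], bias=2.0)


print(park_cost(0))   # cost with zero hours is just the bias
print(park_cost(3))
```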

Bias is not to be confused with bias in ethics and fairness or with prediction bias.

See Linear regression in Machine Learning Crash Course for more information.

binary classification

#fundamentals

A type of classification task that predicts one of two mutually exclusive classes: the positive class or the negative class.

For example, the following two machine learning models each perform binary classification:

  • A model that determines whether email messages are spam (the positive class) or not spam (the negative class).
  • A model that evaluates medical symptoms to determine whether a person has a particular disease (the positive class) or doesn't have that disease (the negative class).

Contrast with multi-class classification.

See also logistic regression and classification threshold.

See Classification in Machine Learning Crash Course for more information.

bucketing

#fundamentals

Converting a single feature into multiple binary features called buckets or bins, typically based on a value range. The chopped feature is typically a continuous feature.

For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete buckets, such as:

  • <= 10 degrees Celsius would be the "cold" bucket.
  • 11–24 degrees Celsius would be the "temperate" bucket.
  • >= 25 degrees Celsius would be the "warm" bucket.

The model will treat every value in the same bucket identically. For example, the values 13 and 22 are both in the temperate bucket, so the model treats the two values identically.
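The three temperature buckets above can be sketched as follows. The bucket boundaries come from the list; treating fractional values between 10 and 11 or between 24 and 25 as "temperate" is an assumption, since the list only covers whole degrees. The names bucket_name and bucketize are hypothetical.

```python
BUCKETS = ("cold", "temperate", "warm")


def bucket_name(temperature_celsius):
    """Map a continuous temperature to one of the three bucket names."""
    if temperature_celsius <= 10:
        return "cold"
    if temperature_celsius >= 25:
        return "warm"
    return "temperate"  # assumption: everything strictly between 10 and 25


def bucketize(temperature_celsius):
    """Return one binary feature per bucket; exactly one of them is 1."""
    name = bucket_name(temperature_celsius)
    return [1 if name == bucket else 0 for bucket in BUCKETS]


print(bucketize(13), bucketize(22))
```

Both 13 and 22 produce the same binary features, which is why the model cannot tell them apart after bucketing.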

See Numerical data: Binning in Machine Learning Crash Course for more information.

C

categorical data

#fundamentals

Features having a specific set of possible values. For example, consider a categorical feature named traffic-light-state, which can only have one of the following three possible values:

  • red
  • yellow
  • green

By representing traffic-light-state as a categorical feature, a model can learn the differing impacts of red, green, and yellow on driver behavior.

Categorical features are sometimes called discrete features.

Contrast with numerical data.

See Working with categorical data in Machine Learning Crash Course for more information.

class

#fundamentals

A category that a label can belong to. For example:

A classification model predicts a class. In contrast, a regression model predicts a number rather than a class.

See Classification in Machine Learning Crash Course for more information.

classification model

#fundamentals

A model whose prediction is a class. For example, the following are all classification models:

  • A model that predicts an input sentence's language (French? Spanish? Italian?).
  • A model that predicts tree species (Maple? Oak? Baobab?).
  • A model that predicts the positive or negative class for a particular medical condition.

In contrast, regression models predict numbers rather than classes.

Two common types of classification models are binary classification and multi-class classification.

classification threshold

#fundamentals

In a binary classification, a number between 0 and 1 that converts the raw output of a logistic regression model into a prediction of either the positive class or the negative class. Note that the classification threshold is a value that a human chooses, not a value chosen by model training.

A logistic regression model outputs a raw value between 0 and 1. Then:

  • If this raw value is greater than the classification threshold, then the positive class is predicted.
  • If this raw value is less than the classification threshold, then the negative class is predicted.

For example, suppose the classification threshold is 0.8. If the raw value is 0.9, then the model predicts the positive class. If the raw value is 0.7, then the model predicts the negative class.
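The thresholding rule is a one-line comparison. The sketch below uses a hypothetical function name, classify. The rule above does not say what happens when the raw value exactly equals the threshold, so assigning that case to the negative class is an assumption.

```python
def classify(raw_value, threshold):
    """Convert a raw model output in [0, 1] into a class prediction."""
    # Assumption: a raw value exactly equal to the threshold is treated as negative.
    return "positive" if raw_value > threshold else "negative"


print(classify(0.9, threshold=0.8))
print(classify(0.7, threshold=0.8))
```

Lowering the threshold to 0.6 would flip the second prediction to positive, which shows how the choice of threshold shifts the balance between false positives and false negatives.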

The choice of classification threshold strongly influences the number of false positives and false negatives.

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

classifier

#fundamentals

A casual term for a classification model.

class-imbalanced dataset

#fundamentals

A dataset for a classification problem in which the total number of labels of each class differs significantly. For example, consider a binary classification dataset whose two labels are divided as follows:

  • 1,000,000 negative labels
  • 10 positive labels

The ratio of negative to positive labels is 100,000 to 1, so this is a class-imbalanced dataset.

In contrast, the following dataset is not class-imbalanced because the ratio of negative labels to positive labels is relatively close to 1:

  • 517 negative labels
  • 483 positive labels

Multi-class datasets can also be class-imbalanced. For example, the following multi-class classification dataset is also class-imbalanced because one label has far more examples than the other two:

  • 1,000,000 labels with class "green"
  • 200 labels with class "purple"
  • 350 labels with class "orange"

See also entropy, majority class, and minority class.

clipping

#fundamentals

A technique for handling outliers by doing either or both of the following:

  • Reducing feature values that are greater than a maximum threshold down to that maximum threshold.
  • Increasing feature values that are less than a minimum threshold up to that minimum threshold.

For example, suppose that <0.5% of values for a particular feature fall outside the range 40–60. In this case, you could do the following:

  • Clip all values over 60 (the maximum threshold) to be exactly 60.
  • Clip all values under 40 (the minimum threshold) to be exactly 40.

Outliers can damage models, sometimes causing weights to overflow during training. Some outliers can also dramatically spoil metrics like accuracy. Clipping is a common technique to limit the damage.
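As a small sketch of the 40–60 example above, the following function clips a single feature value to the range; the function name clip and its default arguments are chosen here for illustration.

```python
def clip(value, minimum=40.0, maximum=60.0):
    """Pull values outside [minimum, maximum] back to the nearest threshold."""
    return max(minimum, min(maximum, value))


raw_values = [12.0, 41.5, 52.5, 75.0]
print([clip(v) for v in raw_values])
```

Values already inside the range pass through unchanged; only the outliers are moved.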

Gradient clipping forces gradient values within a designated range during training.

See Numerical data: Normalization in Machine Learning Crash Course for more information.

confusion matrix

#fundamentals

An NxN table that summarizes the number of correct and incorrect predictions that a classification model made. For example, consider the following confusion matrix for a binary classification model:

                            Tumor (predicted)    Non-tumor (predicted)
Tumor (ground truth)        18 (TP)              1 (FN)
Non-tumor (ground truth)    6 (FP)               452 (TN)

The preceding confusion matrix shows the following:

  • Of the 19 predictions in which ground truth was Tumor, the model correctly classified 18 and incorrectly classified 1.
  • Of the 458 predictions in which ground truth was Non-tumor, the model correctly classified 452 and incorrectly classified 6.

The confusion matrix for a multi-class classification problem can help you identify patterns of mistakes. For example, consider the following confusion matrix for a 3-class multi-class classification model that categorizes three different iris types (Virginica, Versicolor, and Setosa). When the ground truth was Virginica, the confusion matrix shows that the model was far more likely to mistakenly predict Versicolor than Setosa:

                             Setosa (predicted)    Versicolor (predicted)    Virginica (predicted)
Setosa (ground truth)        88                    12                        0
Versicolor (ground truth)    6                     141                       7
Virginica (ground truth)     2                     27                        109

As yet another example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistakenly predict 9 instead of 4, or mistakenly predict 1 instead of 7.

Confusion matrixes contain sufficient information to calculate a variety of performance metrics, including precision and recall.
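As an illustration, the sketch below derives accuracy, precision, and recall from the four cells of the tumor confusion matrix above. The precision and recall formulas are the standard definitions and are not spelled out in this entry; the function names are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Of everything predicted positive, the fraction that really was positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of everything that really was positive, the fraction the model found."""
    return tp / (tp + fn)


# Cells of the binary (tumor) confusion matrix.
tp, fn, fp, tn = 18, 1, 6, 452

print(accuracy(tp, tn, fp, fn))
print(precision(tp, fp))
print(recall(tp, fn))
```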

continuous feature

#fundamentals

A floating-point feature with an infinite range of possible values, such as temperature or weight.

Contrast with discrete feature.

convergence

#fundamentals

A state reached when loss values change very little or not at all with each iteration. For example, the following loss curve suggests convergence at around 700 iterations:

Figure: A Cartesian plot. The x-axis is the number of training iterations. The y-axis is loss. Loss is very high during the first few iterations, but drops sharply. After about 100 iterations, loss is still descending but far more gradually. After about 700 iterations, loss stays flat.

A model converges when additional training won't improve the model.

In deep learning, loss values sometimes stay constant or nearly so for many iterations before finally descending. During a long period of constant loss values, you may temporarily get a false sense of convergence.

See also early stopping.

See Model convergence and loss curves in Machine Learning Crash Course for more information.

D

DataFrame

#fundamentals

A popular pandas data type for representing datasets in memory.

A DataFrame is analogous to a table or a spreadsheet. Each column of a DataFrame has a name (a header), and each row is identified by a unique number.

A DataFrame is structured like a 2D array, except that each column can be assigned its own data type.

See also the official pandas.DataFrame reference page.

data set or dataset

#fundamentals

A collection of raw data, commonly (but not exclusively) organized in one of the following formats:

  • a spreadsheet
  • a file in CSV (comma-separated values) format

deep model

#fundamentals

A neural network containing more than one hidden layer.

A deep model is also called a deep neural network.

Contrast with wide model.

dense feature

#fundamentals

A feature in which most or all values are nonzero, typically a Tensor of floating-point values. For example, the following 10-element Tensor is dense because 9 of its values are nonzero:

8 3 7 5 2 4 0 4 9 6

Contrast with sparse feature.

depth

#fundamentals

The sum of the following in a neural network:

For example, a neural network with five hidden layers and one output layer has a depth of 6.

Notice that the input layer doesn't influence depth.

discrete feature

#fundamentals

A feature with a finite set of possible values. For example, a feature whose values may only be animal, vegetable, or mineral is a discrete (or categorical) feature.

Contrast with continuous feature.

dynamic

#fundamentals

Something done frequently or continuously. The terms dynamic and online are synonyms in machine learning. The following are common uses of dynamic and online in machine learning:

  • A dynamic model (or online model) is a model that is retrained frequently or continuously.
  • Dynamic training (or online training) is the process of training frequently or continuously.
  • Dynamic inference (or online inference) is the process of generating predictions on demand.

dynamic model

#fundamentals

A model that is frequently (maybe even continuously) retrained. A dynamic model is a "lifelong learner" that constantly adapts to evolving data. A dynamic model is also known as an online model.

Contrast with static model.

E

early stopping

#fundamentals

A method for regularization that involves ending training before training loss finishes decreasing. In early stopping, you intentionally stop training the model when the loss on a validation dataset starts to increase; that is, when generalization performance worsens.
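The stopping rule can be sketched as a scan over recorded validation losses. This is a simplified illustration: the function name stopping_iteration is hypothetical, and stopping at the very first increase is an assumption made to keep the example short.

```python
def stopping_iteration(validation_losses):
    """Return the index of the first iteration whose validation loss
    is higher than the previous one, or None if loss never increases."""
    for i in range(1, len(validation_losses)):
        if validation_losses[i] > validation_losses[i - 1]:
            return i
    return None


# Validation loss falls for three iterations, then starts to rise.
losses = [0.90, 0.70, 0.60, 0.65, 0.72]
print(stopping_iteration(losses))
```

In this made-up series, the model from iteration 2 (the lowest validation loss) would be the one to keep.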

embedding layer

#language
#fundamentals

A special hidden layer that trains on a high-dimensional categorical feature to gradually learn a lower dimension embedding vector. An embedding layer enables a neural network to train far more efficiently than training just on the high-dimensional categorical feature.

For example, Earth currently supports about 73,000 tree species. Suppose tree species is a feature in your model, so your model's input layer includes a vector 73,000 elements long. For example, perhaps baobab would be represented something like this:

Figure: An array of 73,000 elements. The first 6,232 elements hold the value 0. The next element holds the value 1. The final 66,767 elements hold the value 0.

A 73,000-element array is very long. If you don't add an embedding layer to the model, training is going to be very time consuming due to multiplying 72,999 zeros. Perhaps you pick the embedding layer to consist of 12 dimensions. Consequently, the embedding layer will gradually learn a new embedding vector for each tree species.
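The sketch below illustrates why the lookup is cheaper than the multiplication: multiplying the 73,000-element vector by a 73,000 x 12 table gives the same 12 numbers as reading row 6,232 of that table directly. The table here is filled with random values purely for illustration (a real embedding layer learns these values during training), and all names are hypothetical.

```python
import random

NUM_SPECIES = 73_000
EMBEDDING_DIM = 12
BAOBAB_INDEX = 6_232  # zero-based position of the single 1 in the baobab vector

random.seed(0)
# Stand-in for learned weights: one 12-dimensional row per tree species.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(EMBEDDING_DIM)]
    for _ in range(NUM_SPECIES)
]


def one_hot(index, size):
    """Build a vector of zeros with a single 1.0 at the given index."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


def embed_by_matmul(one_hot_vector, table):
    """Slow path: multiply the full vector by the table (mostly zeros)."""
    result = [0.0] * len(table[0])
    for weight, row in zip(one_hot_vector, table):
        for d, value in enumerate(row):
            result[d] += weight * value
    return result


def embed_by_lookup(index, table):
    """Fast path: read the one row that the 1.0 would have selected."""
    return table[index]


slow = embed_by_matmul(one_hot(BAOBAB_INDEX, NUM_SPECIES), embedding_table)
fast = embed_by_lookup(BAOBAB_INDEX, embedding_table)
print(len(fast))
```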

In certain situations, hashing is a reasonable alternative to an embedding layer.

See Embeddings in Machine Learning Crash Course for more information.

epoch

#fundamentals

A full training pass over the entire training set such that each example has been processed once.

An epoch represents N/batch size training iterations, where N is the total number of examples.

For instance, suppose the following:

  • The dataset consists of 1,000 examples.
  • The batch size is 50 examples.

Therefore, a single epoch requires 20 iterations:

1 epoch = (N/batch size) = (1,000 / 50) = 20 iterations
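The same arithmetic can be checked in code. This sketch also shows one way to slice a dataset into batches; rounding up when N is not an exact multiple of the batch size is an assumption, since the formula above only covers the evenly divisible case. The function names are hypothetical.

```python
import math


def iterations_per_epoch(num_examples, batch_size):
    """Number of training iterations needed to see every example once."""
    # Assumption: a final, smaller batch still counts as one iteration.
    return math.ceil(num_examples / batch_size)


def batches(examples, batch_size):
    """Yield consecutive slices of the dataset, one slice per iteration."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


dataset = list(range(1000))  # stand-in for 1,000 examples
print(iterations_per_epoch(len(dataset), 50))
print(sum(1 for _ in batches(dataset, 50)))
```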

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

example

#fundamentals

The values of one row of features and possibly a label. Examples in supervised learning fall into two general categories:

  • A labeled example consists of one or more features and a label. Labeled examples are used during training.
  • An unlabeled example consists of one or more features but no label. Unlabeled examples are used during inference.

For instance, suppose you are training a model to determine the influence of weather conditions on student test scores. Here are three labeled examples:

Features                                Label
Temperature    Humidity    Pressure     Test score
15             47          998          Good
19             34          1020         Excellent
18             92          1012         Poor

Here are three unlabeled examples:

Temperature    Humidity    Pressure
12             62          1014
21             47          1017
19             41          1021

The row of a dataset is typically the raw source for an example. That is, an example typically consists of a subset of the columns in the dataset. Furthermore, the features in an example can also include synthetic features, such as feature crosses.

See Supervised Learning in the Introduction to Machine Learning course for more information.

F

false negative (FN)

#fundamentals
#Metric

An example in which the model mistakenly predicts the negative class. For example, the model predicts that a particular email message is not spam (the negative class), but that email message actually is spam.

false positive (FP)

#fundamentals
#Metric

An example in which the model mistakenly predicts the positive class. For example, the model predicts that a particular email message is spam (the positive class), but that email message is actually not spam.

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive rate (FPR)

#fundamentals
#Metric

The proportion of actual negative examples for which the model mistakenly predicted the positive class. The following formula calculates the false positive rate:

$$\text{false positive rate} = \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}$$

The false positive rate is the x-axis in an ROC curve.
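The formula translates directly into code. In this sketch the function name false_positive_rate is hypothetical, and the counts (6 false positives, 452 true negatives) are illustrative values.

```python
def false_positive_rate(false_positives, true_negatives):
    """Fraction of actual negatives that the model wrongly flagged as positive."""
    return false_positives / (false_positives + true_negatives)


# Illustrative counts: 6 false positives among 458 actual negatives.
print(false_positive_rate(false_positives=6, true_negatives=452))
```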

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

feature

#fundamentals

An input variable to a machine learning model. An example consists of one or more features. For instance, suppose you are training a model to determine the influence of weather conditions on student test scores. The following table shows three examples, each of which contains three features and one label:

Features                                Label
Temperature    Humidity    Pressure     Test score
15             47          998          92
19             34          1020         84
18             92          1012         87

Contrast with label.

See Supervised Learning in the Introduction to Machine Learning course for more information.

feature cross

#fundamentals

A synthetic feature formed by "crossing" categorical or bucketed features.

For example, consider a "mood forecasting" model that represents temperature in one of the following four buckets:

  • freezing
  • chilly
  • temperate
  • warm

And represents wind speed in one of the following three buckets:

  • still
  • light
  • windy

Without feature crosses, the linear model trains independently on each of the preceding seven buckets. So, the model trains on, for example, freezing independently of the training on, for example, windy.

Alternatively, you could create a feature cross of temperature and wind speed. This synthetic feature would have the following 12 possible values:

  • freezing-still
  • freezing-light
  • freezing-windy
  • chilly-still
  • chilly-light
  • chilly-windy
  • temperate-still
  • temperate-light
  • temperate-windy
  • warm-still
  • warm-light
  • warm-windy

Thanks to feature crosses, the model can learn mood differences between a freezing-windy day and a freezing-still day.

If you create a synthetic feature from two features that each have a lot of different buckets, the resulting feature cross will have a huge number of possible combinations. For example, if one feature has 1,000 buckets and the other feature has 2,000 buckets, the resulting feature cross has 2,000,000 buckets.

Formally, a cross is a Cartesian product.
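Because a cross is a Cartesian product, the 12 values can be generated mechanically from the two bucket lists. The following is a minimal sketch using the standard library; the function name feature_cross is hypothetical.

```python
from itertools import product

TEMPERATURE_BUCKETS = ["freezing", "chilly", "temperate", "warm"]
WIND_BUCKETS = ["still", "light", "windy"]


def feature_cross(first_buckets, second_buckets):
    """Return every combination of one bucket from each feature."""
    return [f"{a}-{b}" for a, b in product(first_buckets, second_buckets)]


crossed = feature_cross(TEMPERATURE_BUCKETS, WIND_BUCKETS)
print(len(crossed))
print(crossed[:3])
```

The size of the cross is always the product of the two bucket counts, which is why 1,000 and 2,000 buckets yield 2,000,000 combinations.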

Feature crosses are mostly used with linear models and are rarely used with neural networks.

See Categorical data: Feature crosses in Machine Learning Crash Course for more information.

feature engineering

#fundamentals
#Tensorflow

A process that involves the following steps:

  1. Determining which features might be useful in training a model.
  2. Converting raw data from the dataset into efficient versions of those features.

For example, you might determine that temperature might be a useful feature. Then, you might experiment with bucketing to optimize what the model can learn from different temperature ranges.

Feature engineering is sometimes called feature extraction or featurization.

See Numerical data: How a model ingests data using feature vectors in Machine Learning Crash Course for more information.

feature set

#fundamentals

The group of features your machine learning model trains on. For example, a simple feature set for a model that predicts housing prices might consist of postal code, property size, and property condition.

feature vector

#fundamentals

The array of feature values comprising an example. The feature vector is input during training and during inference. For example, the feature vector for a model with two separate features might be:

[0.92, 0.56]

Figure: Four layers: an input layer, two hidden layers, and one output layer. The input layer contains two nodes, one containing the value 0.92 and the other containing the value 0.56.

Each example supplies different values for the feature vector, so the feature vector for the next example could be something like:

[0.73, 0.49]

Feature engineering determines how to represent features in the feature vector. For example, a categorical feature with five possible values might be represented with one-hot encoding. In this case, the portion of the feature vector for a particular example would consist of four zeroes and a single 1.0 in the third position, as follows:

[0.0, 0.0, 1.0, 0.0, 0.0]

As another example, suppose your model consists of three features:

  • a categorical feature with five possible values represented with one-hot encoding; for example: [0.0, 1.0, 0.0, 0.0, 0.0]
  • another categorical feature with three possible values represented with one-hot encoding; for example: [0.0, 0.0, 1.0]
  • a floating-point feature; for example: 8.3.

In this case, the feature vector for each example would be represented by nine values. Given the example values in the preceding list, the feature vector would be:

0.0
1.0
0.0
0.0
0.0
0.0
0.0
1.0
8.3
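The nine-value vector above is simply the three encodings laid end to end. The sketch below assembles it; the function names one_hot and build_feature_vector and the zero-based category indices are assumptions made for illustration.

```python
def one_hot(index, size):
    """Encode a category as zeros with a single 1.0 at the given index."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


def build_feature_vector(first_category, second_category, numeric_value):
    """Concatenate a 5-way one-hot, a 3-way one-hot, and one float."""
    return one_hot(first_category, 5) + one_hot(second_category, 3) + [numeric_value]


# Second of five categories, third of three categories, and the value 8.3.
feature_vector = build_feature_vector(1, 2, 8.3)
print(feature_vector)
```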

See Numerical data: How a model ingests data using feature vectors in Machine Learning Crash Course for more information.

feedback loop

#fundamentals

In machine learning, a situation in which a model's predictions influence the training data for the same model or another model. For example, a model that recommends movies will influence the movies that people see, which will then influence subsequent movie recommendation models.

See Production ML systems: Questions to ask in Machine Learning Crash Course for more information.

G

generalization

#fundamentals

A model's ability to make correct predictions on new, previously unseen data. A model that can generalize is the opposite of a model that is overfitting.

See Generalization in Machine Learning Crash Course for more information.

generalization curve

#fundamentals

A plot of both training loss and validation loss as a function of the number of iterations.

A generalization curve can help you detect possible overfitting. For example, the following generalization curve suggests overfitting because validation loss ultimately becomes significantly higher than training loss.

Figure: A Cartesian graph in which the y-axis is labeled loss and the x-axis is labeled iterations. Two plots appear. One plot shows the training loss and the other shows the validation loss. The two plots start off similarly, but the training loss eventually dips far lower than the validation loss.

See Generalization in Machine Learning Crash Course for more information.

gradient descent

#fundamentals

A mathematical technique to minimize loss. Gradient descent iteratively adjusts weights and biases, gradually finding the best combination to minimize loss.
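As a small illustration, the sketch below fits a one-feature linear model y' = b + w*x by gradient descent on mean squared error. The data points, the learning rate of 0.05, the step count, and the function name fit_line are all assumptions chosen so that the example converges quickly; they are not prescribed values.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Iteratively adjust one weight and one bias to minimize mean squared error."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every example under the current parameters.
        errors = [(bias + weight * x) - y for x, y in zip(xs, ys)]
        # Gradients of mean squared error with respect to weight and bias.
        grad_weight = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_bias = 2.0 / n * sum(errors)
        # Step against the gradient to reduce the loss.
        weight -= learning_rate * grad_weight
        bias -= learning_rate * grad_bias
    return weight, bias


# Illustrative data generated from y = 2 + 0.5 * x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0, 2.5, 3.0, 3.5, 4.0]

weight, bias = fit_line(xs, ys)
print(round(weight, 4), round(bias, 4))
```

Starting from zero, the weight and bias drift toward the values that generated the data, which is the "gradually finding the best combination" described above.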

Градиент спуск старше - намного старше, чем машинное обучение.

См. Линейную регрессию: градиент спуск в курсе сбоя машинного обучения для получения дополнительной информации.
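
As an illustration of the iterative adjustment described in this entry, here is a minimal sketch of gradient descent fitting a one-feature linear model by minimizing Mean Squared Error. The function name `gradient_descent`, the toy data, and the chosen learning rate and iteration count are assumptions made for this example, not part of the glossary.

```python
def gradient_descent(xs, ys, learning_rate=0.05, iterations=2000):
    """Fit y' = b + w * x by repeatedly stepping against the MSE gradient."""
    w, b = 0.0, 0.0  # start from an arbitrary point
    n = len(xs)
    for _ in range(iterations):
        # Prediction error for every example under the current w and b.
        errors = [(b + w * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of Mean Squared Error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Move each parameter a small step in the direction that lowers loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data (hypothetical) generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))  # approaches 2.0 and 1.0
```

On this toy data the weight and bias settle close to the values that generated the data; real training sets are noisy, so the loss usually levels off above zero.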

ground truth

#fundamentals

Reality.

The thing that actually happened.

For example, consider a binary classification model that predicts whether a student in their first year of university will graduate within six years. Ground truth for this model is whether or not that student actually graduated within six years.

H

hidden layer

#fundamentals

A layer in a neural network between the input layer (the features) and the output layer (the prediction). Each hidden layer consists of one or more neurons. For example, the following neural network contains two hidden layers, the first with three neurons and the second with two neurons:

Four layers. The first layer is an input layer containing two features. The second layer is a hidden layer containing three neurons. The third layer is a hidden layer containing two neurons. The fourth layer is an output layer. Each feature has three edges, each of which points to a different neuron in the second layer. Each of the neurons in the second layer has two edges, each of which points to a different neuron in the third layer. Each of the neurons in the third layer has one edge, which points to the output layer.

A deep neural network contains more than one hidden layer. For example, the preceding illustration is a deep neural network because the model contains two hidden layers.

See Neural networks: Nodes and hidden layers in Machine Learning Crash Course for more information.

hyperparameter

#fundamentals

The variables that you or a hyperparameter tuning service adjust during successive runs of training a model. For example, learning rate is a hyperparameter. You could set the learning rate to 0.01 before one training session. If you determine that 0.01 is too high, you could set the learning rate to 0.003 for the next training session.

In contrast, parameters are the various weights and biases that the model learns during training.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

I

independently and identically distributed (IID)

#fundamentals

Data drawn from a distribution that doesn't change, and where each value drawn doesn't depend on values that have been drawn previously. IID data is the ideal gas of machine learning: a useful mathematical construct that is almost never found exactly in the real world. For example, the distribution of visitors to a web page may be IID over a brief window of time; that is, the distribution doesn't change during that brief window and one person's visit is generally independent of another's visit. However, if you expand that window of time, seasonal differences in the web page's visitors may appear.

See also nonstationarity.

inference

#fundamentals

In machine learning, the process of making predictions by applying a trained model to unlabeled examples.

Inference has a somewhat different meaning in statistics. See the Wikipedia article on statistical inference for details.

See Supervised Learning in the Intro to ML course to see inference's role in a supervised learning system.

input layer

#fundamentals

The layer of a neural network that holds the feature vector. That is, the input layer provides examples for training or inference. For example, the input layer in the following neural network consists of two features:

Four layers: an input layer, two hidden layers, and an output layer.

interpretability

#fundamentals

The ability to explain or to present an ML model's reasoning in terms a human can understand.

Most linear regression models, for example, are highly interpretable. (You merely need to look at the trained weights for each feature.) Decision forests are also highly interpretable. Some models, however, require sophisticated visualization to become interpretable.

You can use the Learning Interpretability Tool (LIT) to interpret ML models.

iteration

#fundamentals

A single update of a model's parameters (the model's weights and biases) during training. The batch size determines how many examples the model processes in a single iteration. For instance, if the batch size is 20, then the model processes 20 examples before adjusting the parameters.

When training a neural network, a single iteration involves the following two passes:

  1. A forward pass to evaluate loss on a single batch.
  2. A backward pass (backpropagation) to adjust the model's parameters based on the loss and the learning rate.

See Gradient descent in Machine Learning Crash Course for more information.

L

L0 regularization

#fundamentals

A type of regularization that penalizes the total number of nonzero weights in a model. For example, a model having 11 nonzero weights would be penalized more than a similar model having 10 nonzero weights.

L0 regularization is sometimes called L0-norm regularization.

L1 loss

#fundamentals
#Metric

A loss function that calculates the absolute value of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L1 loss for a batch of five examples:

Actual value of example    Model's predicted value    Absolute value of delta
7                          6                          1
5                          4                          1
8                          11                         3
4                          6                          2
9                          8                          1
                                                      8 = L1 loss

L1 loss is less sensitive to outliers than L2 loss.

The Mean Absolute Error is the average L1 loss per example.

See Linear regression: Loss in Machine Learning Crash Course for more information.
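
The table's arithmetic can be reproduced in a few lines. This sketch uses the five actual and predicted values from the table; the variable names are chosen for this example only.

```python
# Values copied from the table in this entry.
actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]

# L1 loss sums the absolute differences between labels and predictions.
abs_deltas = [abs(a - p) for a, p in zip(actual, predicted)]
l1_loss = sum(abs_deltas)

# Mean Absolute Error is the average L1 loss per example.
mean_absolute_error = l1_loss / len(actual)

print(abs_deltas)           # [1, 1, 3, 2, 1]
print(l1_loss)              # 8
print(mean_absolute_error)  # 1.6
```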

L1 regularization

#fundamentals

A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights. L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0. A feature with a weight of 0 is effectively removed from the model.

Contrast with L2 regularization.

L2 loss

#fundamentals
#Metric

A loss function that calculates the square of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L2 loss for a batch of five examples:

Actual value of example    Model's predicted value    Square of delta
7                          6                          1
5                          4                          1
8                          11                         9
4                          6                          4
9                          8                          1
                                                      16 = L2 loss

Due to squaring, L2 loss amplifies the influence of outliers. That is, L2 loss reacts more strongly to bad predictions than L1 loss. For example, the L1 loss for the preceding batch would be 8 rather than 16. Notice that a single outlier accounts for 9 of the 16.

Regression models typically use L2 loss as the loss function.

The Mean Squared Error is the average L2 loss per example. Squared loss is another name for L2 loss.

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.
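
The same batch can be run through the L2 calculation to see how much the single outlier contributes. This is a small sketch using the table's values; the variable names are chosen for this example only.

```python
# Values copied from the table in this entry.
actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]

# L2 loss sums the squared differences between labels and predictions.
squared_deltas = [(a - p) ** 2 for a, p in zip(actual, predicted)]
l2_loss = sum(squared_deltas)

# Mean Squared Error is the average L2 loss per example.
mean_squared_error = l2_loss / len(actual)

# Share of the total loss caused by the worst prediction (8 versus 11).
outlier_share = max(squared_deltas) / l2_loss

print(squared_deltas)      # [1, 1, 9, 4, 1]
print(l2_loss)             # 16
print(mean_squared_error)  # 3.2
print(outlier_share)       # 0.5625
```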

L2 regularization

#fundamentals

A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0. Features with weights very close to 0 remain in the model but don't influence the model's prediction very much.

L2 regularization typically improves generalization in linear models.

Contrast with L1 regularization.

See Overfitting: L2 regularization in Machine Learning Crash Course for more information.
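
To make the contrast with L1 regularization concrete, the following sketch computes both penalty terms for one small, made-up weight vector. The weights are hypothetical; in training, each penalty would normally be multiplied by the regularization rate and added to the loss.

```python
# Hypothetical weights: three small values and one outlier.
weights = [0.5, -0.5, 5.0, 0.0]

# L1 penalty: sum of the absolute values of the weights.
l1_penalty = sum(abs(w) for w in weights)

# L2 penalty: sum of the squares of the weights.
l2_penalty = sum(w * w for w in weights)

# Fraction of each penalty that comes from the single outlier weight (5.0).
l1_outlier_share = abs(5.0) / l1_penalty
l2_outlier_share = 5.0 ** 2 / l2_penalty

print(l1_penalty, l2_penalty)  # 6.0 25.5
print(round(l1_outlier_share, 3), round(l2_outlier_share, 3))
```

Because squaring magnifies large values, the outlier weight dominates the L2 penalty far more than the L1 penalty, which is why L2 regularization pushes hardest on outlier weights.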

label

#fundamentals

In supervised machine learning, the "answer" or "result" portion of an example.

Each labeled example consists of one or more features and a label. For example, in a spam detection dataset, the label would probably be either "spam" or "not spam." In a rainfall dataset, the label might be the amount of rain that fell during a certain period.

See Supervised Learning in Introduction to Machine Learning for more information.

labeled example

#fundamentals

An example that contains one or more features and a label. For example, the following table shows three labeled examples from a house valuation model, each with three features and one label:

Number of bedrooms    Number of bathrooms    House age    House price (label)
3                     2                      15           $345,000
2                     1                      72           $179,000
4                     2                      34           $392,000

In supervised machine learning, models train on labeled examples and make predictions on unlabeled examples.

Contrast labeled example with unlabeled example.

See Supervised Learning in Introduction to Machine Learning for more information.

lambda

#fundamentals

Synonym for regularization rate.

Lambda is an overloaded term. Here we're focusing on the term's definition within regularization.

layer

#fundamentals

A set of neurons in a neural network. Three common types of layers are as follows:

  • The input layer, which holds the feature vector.
  • The hidden layers, which sit between the input layer and the output layer.
  • The output layer, which contains the prediction.

For example, the following illustration shows a neural network with one input layer, two hidden layers, and one output layer:

A neural network with one input layer, two hidden layers, and one output layer. The input layer consists of two features. The first hidden layer consists of three neurons and the second hidden layer consists of two neurons. The output layer consists of a single node.

In TensorFlow, layers are also Python functions that take tensors and configuration options as input and produce other tensors as output.

learning rate

#fundamentals

A floating-point number that tells the gradient descent algorithm how strongly to adjust weights and biases on each iteration. For example, a learning rate of 0.3 would adjust weights and biases three times more powerfully than a learning rate of 0.1.

Learning rate is a key hyperparameter. If you set the learning rate too low, training will take too long. If you set the learning rate too high, gradient descent often has trouble reaching convergence.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.
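
The effect of a learning rate that is too low or too high can be seen on a deliberately simple loss. This sketch assumes the toy loss function loss(w) = w², whose gradient is 2w and whose minimum is at w = 0; the function name `minimize` and the three learning rates are illustrative choices.

```python
def minimize(learning_rate, steps=50, w=1.0):
    """Run gradient descent on the toy loss w**2 and return the final w."""
    for _ in range(steps):
        gradient = 2 * w               # derivative of w**2
        w -= learning_rate * gradient  # step size scales with the learning rate
    return w


too_low = minimize(0.001)   # barely moves away from the starting point
reasonable = minimize(0.1)  # ends very close to the minimum at 0
too_high = minimize(1.1)    # overshoots further on every step and diverges

print(too_low, reasonable, too_high)
```

After the same 50 steps, the low rate has made little progress, the moderate rate has essentially converged, and the high rate has moved farther from the minimum than where it started.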

linear

#fundamentals

A relationship between two or more variables that can be represented solely through addition and multiplication.

The plot of a linear relationship is a line.

Contrast with nonlinear.

linear model

#fundamentals

A model that assigns one weight per feature to make predictions. (Linear models also incorporate a bias.) In contrast, the relationship of features to predictions in deep models is generally nonlinear.

Linear models are usually easier to train and more interpretable than deep models. However, deep models can learn complex relationships between features.

Linear regression and logistic regression are two types of linear models.

linear regression

#fundamentals

A type of machine learning model in which both of the following are true:

  • The model is a linear model.
  • The prediction is a floating-point value.

Contrast linear regression with logistic regression. Also, contrast regression with classification.

See Linear regression in Machine Learning Crash Course for more information.

logistic regression

#fundamentals

A type of regression model that predicts a probability. Logistic regression models have the following characteristics:

  • The label is categorical. The term logistic regression usually refers to binary logistic regression, that is, to a model that calculates probabilities for labels with two possible values. A less common variant, multinomial logistic regression, calculates probabilities for labels with more than two possible values.
  • The loss function during training is Log Loss. (Multiple Log Loss units can be placed in parallel for labels with more than two possible values.)
  • The model has a linear architecture, not a deep neural network. However, the remainder of this definition also applies to deep models that predict probabilities for categorical labels.

For example, consider a logistic regression model that calculates the probability of an input email being either spam or not spam. During inference, suppose the model predicts 0.72. Therefore, the model is estimating:

  • A 72% chance of the email being spam.
  • A 28% chance of the email not being spam.

A logistic regression model uses the following two-step architecture:

  1. The model generates a raw prediction (y') by applying a linear function of input features.
  2. The model uses that raw prediction as input to a sigmoid function, which converts the raw prediction to a value between 0 and 1, exclusive.

Like any regression model, a logistic regression model predicts a number. However, this number typically becomes part of a binary classification model as follows:

  • If the predicted number is greater than the classification threshold, the binary classification model predicts the positive class.
  • If the predicted number is less than the classification threshold, the binary classification model predicts the negative class.

See Logistic regression in Machine Learning Crash Course for more information.
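
The two-step architecture and the thresholding step can be sketched as follows. The weights, bias, feature values, and the 0.5 classification threshold are hypothetical values chosen for illustration, and the function names are not from any particular library.

```python
import math


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    # Step 1: raw prediction y' from a linear function of the features.
    raw_prediction = bias + sum(w * x for w, x in zip(weights, features))
    # Step 2: convert the raw prediction into a probability.
    return sigmoid(raw_prediction)


def classify(probability, threshold=0.5):
    """Map a probability to a class using a classification threshold."""
    return "spam" if probability > threshold else "not spam"


# Hypothetical trained parameters and one hypothetical email's features.
weights, bias = [1.5, -0.8], -0.2
features = [1.0, 0.5]

probability = predict_probability(features, weights, bias)
print(round(probability, 3), classify(probability))
```

The threshold is a separate choice from the model itself: the same probability can map to a different class if the threshold is raised or lowered.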

Log Loss

#fundamentals

The loss function used in binary logistic regression.

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

log-odds

#fundamentals

The logarithm of the odds of some event.
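
As a short worked example, the odds of an event with probability p are p / (1 - p), so the log-odds are the natural logarithm of that ratio. The function name `log_odds` is chosen for this sketch.

```python
import math


def log_odds(p):
    """Return the natural log of the odds p / (1 - p), for 0 < p < 1."""
    return math.log(p / (1 - p))


print(log_odds(0.5))            # 0.0, because the odds are even (1 to 1)
print(round(log_odds(0.9), 3))  # 2.197, because the odds are 9 to 1
```

Applying the sigmoid function to a log-odds value recovers the original probability, which is why the raw output of a logistic regression model can be read as log-odds.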

loss

#fundamentals
#Metric

During the training of a supervised model, a measure of how far a model's prediction is from its label.

A loss function calculates the loss.

See Linear regression: Loss in Machine Learning Crash Course for more information.

loss curve

#fundamentals

A plot of loss as a function of the number of training iterations. The following plot shows a typical loss curve:

A Cartesian graph of loss versus training iterations, showing a rapid drop in loss for the initial iterations, followed by a gradual drop, and then a flat slope during the final iterations.

Loss curves can help you determine when your model is converging or overfitting.

Loss curves can plot several types of loss, such as training loss and validation loss.

See also generalization curve.

See Overfitting: Interpreting loss curves in Machine Learning Crash Course for more information.

loss function

#fundamentals
#Metric

During training or testing, a mathematical function that calculates the loss on a batch of examples. A loss function returns a lower loss for models that make good predictions than for models that make bad predictions.

The goal of training is typically to minimize the loss that a loss function returns.

Many different kinds of loss functions exist. Pick the appropriate loss function for the kind of model you are building. For example:

  • L2 loss (or Mean Squared Error) is a typical loss function for regression models.
  • Log Loss is the loss function for logistic regression.

M

machine learning

#fundamentals

A program or system that trains a model from input data. The trained model can make useful predictions from new (never-before-seen) data drawn from the same distribution as the one used to train the model.

Machine learning also refers to the field of study concerned with these programs or systems.

See the Introduction to Machine Learning course for more information.

majority class

#fundamentals

The more common label in a class-imbalanced dataset. For example, given a dataset containing 99% negative labels and 1% positive labels, the negative labels are the majority class.

Contrast with minority class.

See Datasets: Imbalanced datasets in Machine Learning Crash Course for more information.

mini-batch

#fundamentals

A small, randomly selected subset of a batch processed in one iteration. The batch size of a mini-batch is usually between 10 and 1,000 examples.

For example, suppose the entire training set (the full batch) consists of 1,000 examples. Further suppose that you set the batch size of each mini-batch to 20. Therefore, each iteration determines the loss on a random 20 of the 1,000 examples and then adjusts the weights and biases accordingly.

It is much more efficient to calculate the loss on a mini-batch than the loss on all the examples in the full batch.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.
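
The sampling step in the example above (20 random examples out of 1,000) could look like the following sketch. The training set here is just a list of integer IDs standing in for real examples, and the fixed seed is only there to make the run repeatable.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Stand-in for a training set of 1,000 examples (hypothetical IDs).
training_set = list(range(1000))
batch_size = 20

# One iteration draws a random mini-batch without replacement.
mini_batch = random.sample(training_set, batch_size)

print(len(mini_batch))  # 20
```

A training loop would draw a fresh mini-batch like this on every iteration, compute the loss on those 20 examples only, and then update the weights and biases.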

minority class

#fundamentals

The less common label in a class-imbalanced dataset. For example, given a dataset containing 99% negative labels and 1% positive labels, the positive labels are the minority class.

Contrast with majority class.

See Datasets: Imbalanced datasets in Machine Learning Crash Course for more information.

model

#fundamentals

In general, any mathematical construct that processes input data and returns output. Phrased differently, a model is the set of parameters and structure needed for a system to make predictions. In supervised machine learning, a model takes an example as input and infers a prediction as output. Within supervised machine learning, models differ somewhat. For example:

  • A linear regression model consists of a set of weights and a bias.
  • A neural network model consists of a set of hidden layers, each containing one or more neurons, and an output layer.

You can save, restore, or make copies of a model.

Unsupervised machine learning also generates models, typically a function that can map an input example to the most appropriate cluster.

multi-class classification

#fundamentals

In supervised learning, a classification problem in which the dataset contains more than two classes of labels. For example, the labels in the Iris dataset must be one of the following three classes:

  • Iris setosa
  • Iris virginica
  • Iris versicolor

A model trained on the Iris dataset that predicts Iris type on new examples is performing multi-class classification.

In contrast, classification problems that distinguish between exactly two classes are binary classification problems. For example, an email model that predicts either spam or not spam is a binary classification model.

In clustering problems, multi-class classification refers to more than two clusters.

See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.

N

negative class

#fundamentals
#Metric

In binary classification, one class is termed positive and the other is termed negative. The positive class is the thing or event that the model is testing for and the negative class is the other possibility. For example:

  • The negative class in a medical test might be "not tumor."
  • The negative class in an email classification model might be "not spam."

Contrast with positive class.

neural network

#fundamentals

A model containing at least one hidden layer. A deep neural network is a type of neural network containing more than one hidden layer. For example, the following diagram shows a deep neural network containing two hidden layers.

A neural network with an input layer, two hidden layers, and an output layer.

Each neuron in a neural network connects to all of the nodes in the next layer. For example, in the preceding diagram, notice that each of the three neurons in the first hidden layer separately connects to both of the two neurons in the second hidden layer.

Neural networks implemented on computers are sometimes called artificial neural networks to differentiate them from neural networks found in brains and other nervous systems.

Some neural networks can mimic extremely complex nonlinear relationships between different features and the label.

See also convolutional neural network and recurrent neural network.

See Neural networks in Machine Learning Crash Course for more information.

neuron

#fundamentals

In machine learning, a distinct unit within a hidden layer of a neural network. Each neuron performs the following two-step action:

  1. Calculates the weighted sum of its input values, that is, each input multiplied by its corresponding weight and then summed.
  2. Passes the weighted sum as input to an activation function.

A neuron in the first hidden layer accepts inputs from the feature values in the input layer. A neuron in any hidden layer beyond the first accepts inputs from the neurons in the preceding hidden layer. For example, a neuron in the second hidden layer accepts inputs from the neurons in the first hidden layer.

The following illustration highlights two neurons and their inputs.

A neural network with an input layer, two hidden layers, and an output layer. Two neurons are highlighted: one in the first hidden layer and one in the second hidden layer. The highlighted neuron in the first hidden layer receives inputs from both features in the input layer. The highlighted neuron in the second hidden layer receives inputs from each of the three neurons in the first hidden layer.

A neuron in a neural network mimics the behavior of neurons in brains and other parts of nervous systems.
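
The two-step action of a single neuron can be sketched in a few lines. ReLU is used as the activation function here, and the optional `bias` term is an assumption added for this example (the definition above mentions only the weighted sum); the input values and weights are made up.

```python
def relu(x):
    """ReLU activation: 0 for negative or zero input, otherwise the input."""
    return max(0.0, float(x))


def neuron(inputs, weights, bias=0.0, activation=relu):
    # Step 1: weighted sum of the inputs (plus an assumed bias term).
    weighted_sum = bias + sum(w * x for w, x in zip(weights, inputs))
    # Step 2: pass the weighted sum through the activation function.
    return activation(weighted_sum)


inputs = [2.0, -1.0]  # hypothetical values from the preceding layer

print(neuron(inputs, [0.5, 1.5]))   # weighted sum is -0.5, so ReLU outputs 0.0
print(neuron(inputs, [0.5, -1.5]))  # weighted sum is 2.5, so ReLU outputs 2.5
```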

node (neural network)

#fundamentals

A neuron in a hidden layer.

See Neural networks in Machine Learning Crash Course for more information.

nonlinear

#fundamentals

A relationship between two or more variables that can't be represented solely through addition and multiplication. A linear relationship can be represented as a line; a nonlinear relationship can't be represented as a line. For example, consider two models that each relate a single feature to a single label. The model on the left is linear and the model on the right is nonlinear:

Two plots. One plot is a line, so this is a linear relationship.
          The other plot is a curve, so this is a nonlinear relationship.

See Neural networks: Nodes and hidden layers in Machine Learning Crash Course to experiment with different kinds of nonlinear functions.

nonstationarity

#fundamentals

A feature whose values change across one or more dimensions, usually time. For example, consider the following examples of nonstationarity:

  • The number of swimsuits sold at a particular store varies with the season.
  • The quantity of a particular fruit harvested in a particular region is zero for much of the year but large for a brief period.
  • Due to climate change, annual mean temperatures are shifting.

Contrast with stationarity .

normalization

#fundamentals

Broadly speaking, the process of converting a variable's actual range of values into a standard range of values, such as:

  • -1 to +1
  • 0 to 1
  • Z-scores (roughly, -3 to +3)

For example, suppose the actual range of values of a certain feature is 800 to 2,400. As part of feature engineering , you could normalize the actual values down to a standard range, such as -1 to +1.

Normalization is a common task in feature engineering . Models usually train faster (and produce better predictions) when every numerical feature in the feature vector has roughly the same range.

See also Z-score normalization .

See Numerical Data: Normalization in Machine Learning Crash Course for more information.
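
As a sketch of two of the standard ranges listed in this entry, the following code rescales the example feature (actual range 800 to 2,400) to the range -1 to +1 and also computes Z-scores. The helper names and the three sample values are assumptions made for this example; the Z-score here uses the population standard deviation from Python's `statistics` module.

```python
import statistics


def scale_to_range(value, lo, hi, new_lo=-1.0, new_hi=1.0):
    """Linearly map value from [lo, hi] to [new_lo, new_hi]."""
    return new_lo + (value - lo) * (new_hi - new_lo) / (hi - lo)


def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


values = [800, 1600, 2400]  # hypothetical raw feature values

scaled = [scale_to_range(v, 800, 2400) for v in values]
print(scaled)  # [-1.0, 0.0, 1.0]
print([round(z, 3) for z in z_scores(values)])
```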

numerical data

#fundamentals

Features represented as integers or real-valued numbers. For example, a house valuation model would probably represent the size of a house (in square feet or square meters) as numerical data. Representing a feature as numerical data indicates that the feature's values have a mathematical relationship to the label. That is, the number of square meters in a house probably has some mathematical relationship to the value of the house.

Not all integer data should be represented as numerical data. For example, postal codes in some parts of the world are integers; however, integer postal codes shouldn't be represented as numerical data in models. That's because a postal code of 20000 is not twice (or half) as potent as a postal code of 10000. Furthermore, although different postal codes do correlate to different real estate values, we can't assume that real estate values at postal code 20000 are twice as valuable as real estate values at postal code 10000. Postal codes should be represented as categorical data instead.

Numerical features are sometimes called continuous features .

See Working with numerical data in Machine Learning Crash Course for more information.

O

offline

#fundamentals

Synonym for static .

offline inference

#fundamentals

The process of a model generating a batch of predictions and then caching (saving) those predictions. Apps can then access the inferred prediction from the cache rather than rerunning the model.

For example, consider a model that generates local weather forecasts (predictions) once every four hours. After each model run, the system caches all the local weather forecasts. Weather apps retrieve the forecasts from the cache.

Offline inference is also called static inference .

Contrast with online inference .

See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

one-hot encoding

#fundamentals

Representing categorical data as a vector in which:

  • One element is set to 1.
  • All other elements are set to 0.

One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For example, suppose a certain categorical feature named Scandinavia has five possible values:

  • "Denmark"
  • "Sweden"
  • "Norway"
  • "Finland"
  • "Iceland"

One-hot encoding could represent each of the five values as follows:

Country      Vector
"Denmark"    1 0 0 0 0
"Sweden"     0 1 0 0 0
"Norway"     0 0 1 0 0
"Finland"    0 0 0 1 0
"Iceland"    0 0 0 0 1

Thanks to one-hot encoding, a model can learn different connections based on each of the five countries.

Representing a feature as numerical data is an alternative to one-hot encoding. Unfortunately, representing the Scandinavian countries numerically is not a good choice. For example, consider the following numeric representation:

  • "Denmark" is 0
  • "Sweden" is 1
  • "Norway" is 2
  • "Finland" is 3
  • "Iceland" is 4

With numeric encoding, a model would interpret the raw numbers mathematically and would try to train on those numbers. However, Iceland isn't actually twice as much (or half as much) of something as Norway, so the model would come to some strange conclusions.

See Categorical data: Vocabulary and one-hot encoding in Machine Learning Crash Course for more information.
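
The encoding for the Scandinavia feature can be produced with a small helper. This is a minimal sketch; the function name `one_hot` is chosen for this example, and the vocabulary order follows the order used in this entry.

```python
# Vocabulary of possible values, in the order used in this entry.
countries = ["Denmark", "Sweden", "Norway", "Finland", "Iceland"]


def one_hot(value, vocabulary):
    """Return a vector with a 1 at the value's position and 0 elsewhere."""
    if value not in vocabulary:
        raise ValueError(f"unknown value: {value!r}")
    return [1 if candidate == value else 0 for candidate in vocabulary]


print(one_hot("Norway", countries))   # [0, 0, 1, 0, 0]
print(one_hot("Iceland", countries))  # [0, 0, 0, 0, 1]
```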

one-vs.-all

#fundamentals

Given a classification problem with N classes, a solution consisting of N separate binary classifiers —one binary classifier for each possible outcome. For example, given a model that classifies examples as animal, vegetable, or mineral, a one-vs.-all solution would provide the following three separate binary classifiers:

  • animal versus not animal
  • vegetable versus not vegetable
  • mineral versus not mineral
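
One way to picture the three binary classifiers is to look at the labels each one would train on. The sketch below relabels a small, made-up set of examples once per class; the final step of picking the class whose classifier reports the highest score is a common convention assumed here, not something this entry specifies.

```python
classes = ["animal", "vegetable", "mineral"]

# Hypothetical multi-class labels for four training examples.
labels = ["animal", "mineral", "vegetable", "animal"]

# Each binary classifier sees 1 for "its" class and 0 for everything else.
binary_targets = {
    cls: [1 if label == cls else 0 for label in labels] for cls in classes
}
print(binary_targets["animal"])  # [1, 0, 0, 1]


def one_vs_all_predict(scores):
    """Return the class whose binary classifier gave the highest score."""
    return max(scores, key=scores.get)


# Hypothetical scores from the three classifiers for one new example.
print(one_vs_all_predict({"animal": 0.2, "vegetable": 0.7, "mineral": 0.4}))
```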

online

#fundamentals

Synonym for dynamic .

online inference

#fundamentals

Generating predictions on demand. For example, suppose an app passes input to a model and issues a request for a prediction. A system using online inference responds to the request by running the model (and returning the prediction to the app).

Contrast with offline inference .

See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

output layer

#fundamentals

The "final" layer of a neural network. The output layer contains the prediction.

The following illustration shows a small deep neural network with an input layer, two hidden layers, and an output layer:

A neural network with one input layer, two hidden layers, and one output layer. The input layer consists of two features. The first hidden layer consists of three neurons and the second hidden layer consists of two neurons. The output layer consists of a single node.

overfitting

#fundamentals

Creating a model that matches the training data so closely that the model fails to make correct predictions on new data.

Regularization can reduce overfitting. Training on a large and diverse training set can also reduce overfitting.

See Overfitting in Machine Learning Crash Course for more information.

P

pandas

#fundamentals

A column-oriented data analysis API built on top of numpy . Many machine learning frameworks, including TensorFlow, support pandas data structures as inputs. See the pandas documentation for details.

parameter

#fundamentals

The weights and biases that a model learns during training . For example, in a linear regression model, the parameters consist of the bias ( b ) and all the weights ( w 1 , w 2 , and so on) in the following formula:

$$y' = b + w_1x_1 + w_2x_2 + \dots + w_nx_n$$

In contrast, hyperparameters are the values that you (or a hyperparameter tuning service) supply to the model. For example, learning rate is a hyperparameter.

positive class

#fundamentals
#Metric

The class you are testing for.

For example, the positive class in a cancer model might be "tumor." The positive class in an email classification model might be "spam."

Contrast with negative class .

post-processing

#responsible
#fundamentals

Adjusting the output of a model after the model has been run. Post-processing can be used to enforce fairness constraints without modifying models themselves.

For example, one might apply post-processing to a binary classifier by setting a classification threshold such that equality of opportunity is maintained for some attribute by checking that the true positive rate is the same for all values of that attribute.

prediction

#fundamentals

A model's output. For example:

  • The prediction of a binary classification model is either the positive class or the negative class.
  • The prediction of a multi-class classification model is one class.
  • The prediction of a linear regression model is a number.

proxy labels

#fundamentals

Data used to approximate labels not directly available in a dataset.

For example, suppose you must train a model to predict employee stress level. Your dataset contains a lot of predictive features but doesn't contain a label named stress level. Undaunted, you pick "workplace accidents" as a proxy label for stress level. After all, employees under high stress get into more accidents than calm employees. Or do they? Maybe workplace accidents actually rise and fall for multiple reasons.

As a second example, suppose you want is it raining? to be a Boolean label for your dataset, but your dataset doesn't contain rain data. If photographs are available, you might establish pictures of people carrying umbrellas as a proxy label for is it raining? Is that a good proxy label? Possibly, but people in some cultures may be more likely to carry umbrellas to protect against sun than the rain.

Proxy labels are often imperfect. When possible, choose actual labels over proxy labels. That said, when an actual label is absent, pick the proxy label very carefully, choosing the least horrible proxy label candidate.

See Datasets: Labels in Machine Learning Crash Course for more information.

R

RAG

#fundamentals

Abbreviation for retrieval-augmented generation .

rater

#fundamentals

A human who provides labels for examples . "Annotator" is another name for rater.

See Categorical data: Common issues in Machine Learning Crash Course for more information.

Rectified Linear Unit (ReLU)

#fundamentals

An activation function with the following behavior:

  • If input is negative or zero, then the output is 0.
  • If input is positive, then the output is equal to the input.

For example:

  • If the input is -3, then the output is 0.
  • If the input is +3, then the output is 3.0.

Here is a plot of ReLU:

Figure: a Cartesian plot of two lines. The first line has a constant y value of 0, running along the x-axis from (-infinity, 0) to (0, 0). The second line starts at (0, 0) and has a slope of +1, so it runs from (0, 0) to (+infinity, +infinity).

ReLU is a very popular activation function. Despite its simple behavior, ReLU still enables a neural network to learn nonlinear relationships between features and the label .
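The behavior described above fits in one line of code. The following is a minimal sketch in plain Python; the function name `relu` is chosen here for illustration and is not tied to any particular library.

```python
def relu(x: float) -> float:
    """Return 0.0 for negative or zero input, otherwise the input itself."""
    return max(0.0, float(x))

print(relu(-3))  # 0.0
print(relu(3))   # 3.0
```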

regression model

#fundamentals

Informally, a model that generates a numerical prediction. (In contrast, a classification model generates a class prediction.) For example, the following are all regression models:

  • A model that predicts a certain house's value in Euros, such as 423,000.
  • A model that predicts a certain tree's life expectancy in years, such as 23.2.
  • A model that predicts the amount of rain in inches that will fall in a certain city over the next six hours, such as 0.18.

Two common types of regression models are:

  • Linear regression , which finds the line that best fits label values to features.
  • Logistic regression , which generates a probability between 0.0 and 1.0 that a system typically then maps to a class prediction.

Not every model that outputs numerical predictions is a regression model. In some cases, a numeric prediction is really just a classification model that happens to have numeric class names. For example, a model that predicts a numeric postal code is a classification model, not a regression model.

regularization

#fundamentals

Any mechanism that reduces overfitting. Popular types of regularization include:

Regularization can also be defined as the penalty on a model's complexity.

See Overfitting: Model complexity in Machine Learning Crash Course for more information.

regularization rate

#fundamentals

A number that specifies the relative importance of regularization during training. Raising the regularization rate reduces overfitting but may reduce the model's predictive power. Conversely, reducing or omitting the regularization rate increases overfitting.

See Overfitting: L2 regularization in Machine Learning Crash Course for more information.

ReLU

#fundamentals

Abbreviation for Rectified Linear Unit .

retrieval-augmented generation (RAG)

#fundamentals

A technique for improving the quality of large language model (LLM) output by grounding it with sources of knowledge retrieved after the model was trained. RAG improves the accuracy of LLM responses by providing the trained LLM with access to information retrieved from trusted knowledge bases or documents.

Common motivations to use retrieval-augmented generation include:

  • Increasing the factual accuracy of a model's generated responses.
  • Giving the model access to knowledge it was not trained on.
  • Changing the knowledge that the model uses.
  • Enabling the model to cite sources.

For example, suppose that a chemistry app uses the PaLM API to generate summaries related to user queries. When the app's backend receives a query, the backend:

  1. Searches for ("retrieves") data that's relevant to the user's query.
  2. Appends ("augments") the relevant chemistry data to the user's query.
  3. Instructs the LLM to create a summary based on the appended data.
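The three steps above can be sketched without calling any real LLM. In the following toy example, the knowledge base, the word-overlap retrieval rule, and the helper names `retrieve` and `augment` are all assumptions made for illustration; a production system would typically use a search index and would send the resulting prompt to an LLM API.

```python
# Hypothetical in-memory knowledge base; a real app would query a search index.
KNOWLEDGE_BASE = [
    "Sodium chloride (NaCl) is an ionic compound commonly known as table salt.",
    "Water (H2O) has a boiling point of 100 degrees Celsius at sea level.",
]

def retrieve(query: str, documents: list[str]) -> list[str]:
    """Step 1: return documents that share at least one word with the query."""
    query_words = set(query.lower().split())
    return [d for d in documents if query_words & set(d.lower().split())]

def augment(query: str, retrieved: list[str]) -> str:
    """Steps 2 and 3: append the retrieved data and add a summarization instruction."""
    context = "\n".join(retrieved)
    return (
        f"Question: {query}\n\n"
        f"Context:\n{context}\n\n"
        "Summarize the context to answer the question."
    )

query = "boiling point of water"
retrieved = retrieve(query, KNOWLEDGE_BASE)
prompt = augment(query, retrieved)  # this string would be sent to the LLM
print(prompt)
```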

ROC (receiver operating characteristic) Curve

#fundamentals
#Metric

A graph of true positive rate versus false positive rate for different classification thresholds in binary classification.

The shape of an ROC curve suggests a binary classification model's ability to separate positive classes from negative classes. Suppose, for example, that a binary classification model perfectly separates all the negative classes from all the positive classes:

Figure: a number line with 8 positive examples on the right side and 7 negative examples on the left.

The ROC curve for the preceding model looks as follows:

Figure: an ROC curve. The x-axis is False Positive Rate and the y-axis is True Positive Rate. The curve has an inverted L shape. The curve starts at (0.0,0.0) and goes straight up to (0.0,1.0). Then the curve goes from (0.0,1.0) to (1.0,1.0).

In contrast, the following illustration graphs the raw logistic regression values for a terrible model that can't separate negative classes from positive classes at all:

Figure: a number line with positive examples and negative examples completely intermixed.

The ROC curve for this model looks as follows:

Figure: an ROC curve, which is actually a straight line from (0.0,0.0) to (1.0,1.0).

Meanwhile, back in the real world, most binary classification models separate positive and negative classes to some degree, but usually not perfectly. So, a typical ROC curve falls somewhere between the two extremes:

Figure: an ROC curve. The x-axis is False Positive Rate and the y-axis is True Positive Rate. The ROC curve approximates a shaky arc traversing the compass points from West to North.

The point on an ROC curve closest to (0.0,1.0) theoretically identifies the ideal classification threshold. However, several other real-world issues influence the selection of the ideal classification threshold. For example, perhaps false negatives cause far more pain than false positives.

A numerical metric called AUC summarizes the ROC curve into a single floating-point value.
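To make the construction of an ROC curve concrete, the following sketch computes one (false positive rate, true positive rate) point per candidate threshold. The scores and labels are made-up toy data for a perfectly separating model, and `roc_points` is a hypothetical helper written for this example rather than a library function.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and y == 1 for p, y in zip(predicted, labels))
        fp = sum(p and y == 0 for p, y in zip(predicted, labels))
        points.append((fp / negatives, tp / positives))
    return points

# Perfectly separated toy data: negatives score low, positives score high.
scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]
points = roc_points(scores, labels)
print(points)
```

For this toy data, every point lies either on the left edge (false positive rate of 0) or on the top edge (true positive rate of 1), which is the inverted L shape of a perfect classifier.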

Root Mean Squared Error (RMSE)

#fundamentals
#Metric

The square root of the Mean Squared Error .
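As a quick illustration, the following sketch computes RMSE for two made-up label and prediction pairs. The function name `rmse` and the sample values are assumptions for this example only.

```python
import math

def rmse(labels, predictions):
    """Square root of the mean of the squared differences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    mse = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mse)

# Squared errors are 4.0 and 0.0, so MSE is 2.0 and RMSE is sqrt(2).
print(rmse([3.0, 5.0], [1.0, 5.0]))
```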

S

sigmoid function

#fundamentals

A mathematical function that "squishes" an input value into a constrained range, typically 0 to 1 or -1 to +1. That is, you can pass any number (two, a million, negative billion, whatever) to a sigmoid and the output will still be in the constrained range. A plot of the sigmoid activation function looks as follows:

Figure: a two-dimensional curved plot with x values spanning the domain -infinity to +infinity, while y values span the range almost 0 to almost 1. When x is 0, y is 0.5. The slope of the curve is always positive, with the highest slope at (0, 0.5) and gradually decreasing slopes as the absolute value of x increases.
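The following sketch implements the logistic sigmoid, the common variant whose range is 0 to 1. Splitting on the sign of x is a design choice made here so that very large negative inputs don't overflow `math.exp`; the function name is illustrative.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid: squishes any real number into the range 0 to 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x is negative here, so exp(x) cannot overflow
    return z / (1.0 + z)

print(sigmoid(0))     # 0.5
print(sigmoid(1e6))   # rounds to 1.0 in floating point
print(sigmoid(-1e9))  # rounds to 0.0 in floating point
```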

The sigmoid function has several uses in machine learning, including:

softmax

#fundamentals

A function that determines probabilities for each possible class in a multi-class classification model . The probabilities add up to exactly 1.0. For example, the following table shows how softmax distributes various probabilities:

Image is a...   Probability
dog             0.85
cat             0.13
horse           0.02

Softmax is also called full softmax .

Contrast with candidate sampling .

See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.
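As a rough sketch of how softmax turns raw model outputs into probabilities, consider the following. The input values are made up for illustration, and subtracting the maximum before exponentiating is a common numerical-stability choice rather than part of the definition.

```python
import math

def softmax(raw_outputs):
    """Convert a list of raw scores into probabilities that sum to 1.0."""
    largest = max(raw_outputs)
    exps = [math.exp(z - largest) for z in raw_outputs]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the classes dog, cat, and horse.
probs = softmax([2.0, 0.1, -1.8])
print(probs, sum(probs))
```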

sparse feature

#language
#fundamentals

A feature whose values are predominantly zero or empty. For example, a feature containing a single 1 value and a million 0 values is sparse. In contrast, a dense feature has values that are predominantly not zero or empty.

In machine learning, a surprising number of features are sparse features. Categorical features are usually sparse features. For example, of the 300 possible tree species in a forest, a single example might identify just a maple tree. Or, of the millions of possible videos in a video library, a single example might identify just "Casablanca."

In a model, you typically represent sparse features with one-hot encoding. If the one-hot encoding is big, you might put an embedding layer on top of the one-hot encoding for greater efficiency.

sparse representation

#language
#fundamentals

Storing only the position(s) of nonzero elements in a sparse feature.

For example, suppose a categorical feature named species identifies the 36 tree species in a particular forest. Further assume that each example identifies only a single species.

You could use a one-hot vector to represent the tree species in each example. A one-hot vector would contain a single 1 (to represent the particular tree species in that example) and 35 0s (to represent the 35 tree species not in that example). So, the one-hot representation of maple might look something like the following:

Figure: a vector in which positions 0 through 23 hold the value 0, position 24 holds the value 1, and positions 25 through 35 hold the value 0.

Alternatively, sparse representation would simply identify the position of the particular species. If maple is at position 24, then the sparse representation of maple would simply be:

24

Notice that the sparse representation is much more compact than the one-hot representation.
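The comparison between the two representations can be sketched in a few lines. The helper names `one_hot` and `to_sparse` are hypothetical; the sizes and the maple position come from the example above.

```python
NUM_SPECIES = 36
MAPLE_INDEX = 24  # position of maple, as in the example above

def one_hot(index, size):
    """Build a vector of zeros with a single 1 at the given position."""
    vector = [0] * size
    vector[index] = 1
    return vector

def to_sparse(vector):
    """Return only the positions of the nonzero elements."""
    return [i for i, v in enumerate(vector) if v != 0]

dense = one_hot(MAPLE_INDEX, NUM_SPECIES)
sparse = to_sparse(dense)
print(len(dense), sparse)  # 36 [24]
```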

See Working with categorical data in Machine Learning Crash Course for more information.

sparse vector

#fundamentals

A vector whose values are mostly zeroes. See also sparse feature and sparsity .

squared loss

#fundamentals
#Metric

Synonym for L2 loss.

static

#fundamentals

Something done once rather than continuously. The terms static and offline are synonyms. The following are common uses of static and offline in machine learning:

  • A static model (or offline model) is a model trained once and then used for a while.
  • Static training (or offline training) is the process of training a static model.
  • Static inference (or offline inference) is a process in which a model generates a batch of predictions at a time.

Contrast with dynamic.

static inference

#fundamentals

Synonym for offline inference .

stationarity

#fundamentals

A feature whose values don't change across one or more dimensions, usually time. For example, a feature whose values look about the same in 2021 and 2023 exhibits stationarity.

In the real world, very few features exhibit stationarity. Even features synonymous with stability (like sea level) change over time.

Contrast with nonstationarity .

stochastic gradient descent (SGD)

#fundamentals

A gradient descent algorithm in which the batch size is one. In other words, SGD trains on a single example chosen uniformly at random from a training set .
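The following is a minimal sketch of SGD fitting a one-feature linear model. The noise-free toy training set (generated from y = 2x + 1), the learning rate, the step count, and the fixed random seed are all assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Toy noise-free training set generated from y = 2x + 1.
training_set = [(x, 2.0 * x + 1.0) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]

w, b = 0.0, 0.0
learning_rate = 0.1
for _ in range(2000):
    # Batch size of one: a single example chosen uniformly at random.
    x, y = random.choice(training_set)
    error = (w * x + b) - y
    # Gradient of the squared loss (error ** 2) with respect to w and b.
    w -= learning_rate * 2 * error * x
    b -= learning_rate * 2 * error

print(w, b)  # approaches w = 2 and b = 1
```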

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

supervised machine learning

#fundamentals

Training a model from features and their corresponding labels . Supervised machine learning is analogous to learning a subject by studying a set of questions and their corresponding answers. After mastering the mapping between questions and answers, a student can then provide answers to new (never-before-seen) questions on the same topic.

Compare with unsupervised machine learning .

See Supervised Learning in the Introduction to ML course for more information.

synthetic feature

#fundamentals

A feature not present among the input features, but assembled from one or more of them. Methods for creating synthetic features include the following:

  • Bucketing a continuous feature into range bins.
  • Creating a feature cross .
  • Multiplying (or dividing) one feature value by other feature value(s) or by itself. For example, if a and b are input features, then the following are examples of synthetic features:
    • ab
    • a^2
  • Applying a transcendental function to a feature value. For example, if c is an input feature, then the following are examples of synthetic features:
    • sin(c)
    • ln(c)

Features created by normalizing or scaling alone are not considered synthetic features.
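For instance, the multiplication and transcendental-function methods above might look like this in code. The feature values for a, b, and c are hypothetical.

```python
import math

a, b, c = 3.0, 4.0, 2.0  # hypothetical input feature values

synthetic = {
    "ab": a * b,             # one feature multiplied by another
    "a^2": a ** 2,           # a feature multiplied by itself
    "sin(c)": math.sin(c),   # transcendental function of a feature
    "ln(c)": math.log(c),    # natural logarithm; requires c > 0
}
print(synthetic)
```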

T

test loss

#fundamentals
#Metric

A metric representing a model's loss against the test set. When building a model, you typically try to minimize test loss. That's because a low test loss is a stronger quality signal than a low training loss or low validation loss.

A large gap between test loss and training loss or validation loss sometimes suggests that you need to increase the regularization rate.

training

#fundamentals

The process of determining the ideal parameters (weights and biases) comprising a model. During training, a system reads in examples and gradually adjusts parameters. Training uses each example anywhere from a few times to billions of times.

See Supervised Learning in the Introduction to ML course for more information.

training loss

#fundamentals
#Metric

A metric representing a model's loss during a particular training iteration. For example, suppose the loss function is Mean Squared Error . Perhaps the training loss (the Mean Squared Error) for the 10th iteration is 2.2, and the training loss for the 100th iteration is 1.9.

A loss curve plots training loss versus the number of iterations. A loss curve provides the following hints about training:

  • A downward slope implies that the model is improving.
  • An upward slope implies that the model is getting worse.
  • A flat slope implies that the model has reached convergence .

For example, the following somewhat idealized loss curve shows:

  • A steep downward slope during the initial iterations, which implies rapid model improvement.
  • A gradually flattening (but still downward) slope until close to the end of training, which implies continued model improvement at a somewhat slower pace than during the initial iterations.
  • A flat slope towards the end of training, which suggests convergence.

Figure: the plot of training loss versus iterations. This loss curve starts with a steep downward slope. The slope gradually flattens until the slope becomes zero.

Although training loss is important, see also generalization .

training-serving skew

#fundamentals

The difference between a model's performance during training and that same model's performance during serving .

training set

#fundamentals

The subset of the dataset used to train a model .

Traditionally, examples in the dataset are divided into the following three distinct subsets:

  • a training set
  • a validation set
  • a test set

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

true negative (TN)

#fundamentals
#Metric

An example in which the model correctly predicts the negative class. For example, the model infers that a particular email message is not spam, and that email message really is not spam.

true positive (TP)

#fundamentals
#Metric

An example in which the model correctly predicts the positive class . For example, the model infers that a particular email message is spam, and that email message really is spam.

true positive rate (TPR)

#fundamentals
#Metric

Synonym for recall. That is:

$$\text{true positive rate} = \frac {\text{true positives}} {\text{true positives} + \text{false negatives}}$$

True positive rate is the y-axis in an ROC curve .
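The formula above translates directly into code. In the following sketch, the counts (18 true positives and 2 false negatives) are hypothetical.

```python
def true_positive_rate(true_positives: int, false_negatives: int) -> float:
    """Fraction of actual positives that the model correctly identified."""
    return true_positives / (true_positives + false_negatives)

# Hypothetical counts: 18 actual positives found, 2 missed.
print(true_positive_rate(18, 2))  # 0.9
```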

U

underfitting

#fundamentals

Producing a model with poor predictive ability because the model hasn't fully captured the complexity of the training data. Many problems can cause underfitting, including:

See Overfitting in Machine Learning Crash Course for more information.

unlabeled example

#fundamentals

An example that contains features but no label . For example, the following table shows three unlabeled examples from a house valuation model, each with three features but no house value:

Number of bedrooms   Number of bathrooms   House age
3                    2                     15
2                    1                     72
4                    2                     34

In supervised machine learning , models train on labeled examples and make predictions on unlabeled examples .

In semi-supervised and unsupervised learning, unlabeled examples are used during training.

Contrast unlabeled example with labeled example .

unsupervised machine learning

#clustering
#fundamentals

Training a model to find patterns in a dataset, typically an unlabeled dataset.

The most common use of unsupervised machine learning is to cluster data into groups of similar examples. For example, an unsupervised machine learning algorithm can cluster songs based on various properties of the music. The resulting clusters can become an input to other machine learning algorithms (for example, to a music recommendation service). Clustering can help when useful labels are scarce or absent. For example, in domains such as anti-abuse and fraud, clusters can help humans better understand the data.

Contrast with supervised machine learning .

See What is Machine Learning? in the Introduction to ML course for more information.

V

validation

#fundamentals

The initial evaluation of a model's quality. Validation checks the quality of a model's predictions against the validation set.

Because the validation set differs from the training set, validation helps guard against overfitting.

You might think of evaluating the model against the validation set as the first round of testing and evaluating the model against the test set as the second round of testing.

validation loss

#fundamentals
#Metric

A metric representing a model's loss on the validation set during a particular iteration of training.

See also generalization curve .

validation set

#fundamentals

The subset of the dataset used for the initial evaluation of a trained model. Typically, you evaluate the trained model against the validation set several times before evaluating the model against the test set.

Traditionally, you divide the examples in the dataset into the following three distinct subsets:

  • a training set
  • a validation set
  • a test set

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

W

weight

#fundamentals

A value that a model multiplies by another value. Training is the process of determining a model's ideal weights; inference is the process of using those learned weights to make predictions.

See Linear regression in Machine Learning Crash Course for more information.

weighted sum

#fundamentals

The sum of all the relevant input values multiplied by their corresponding weights. For example, suppose the relevant inputs consist of the following:

input value   input weight
2             -1.3
-1            0.6
3             0.4

The weighted sum is therefore:

weighted sum = (2)(-1.3) + (-1)(0.6) + (3)(0.4) = -2.0

A weighted sum is the input argument to an activation function .
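The arithmetic above can be checked with a short sketch. The rounding is only there to hide floating-point noise in the printed result.

```python
inputs = [2, -1, 3]
weights = [-1.3, 0.6, 0.4]

# Multiply each input by its weight, then add the products.
weighted_sum = sum(x * w for x, w in zip(inputs, weights))
print(round(weighted_sum, 6))  # -2.0
```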

Z

Z-score normalization

#fundamentals

A scaling technique that replaces a raw feature value with a floating-point value representing the number of standard deviations from that feature's mean. For example, consider a feature whose mean is 800 and whose standard deviation is 100. The following table shows how Z-score normalization would map the raw value to its Z-score:

Raw value   Z-score
800         0
950         +1.5
575         -2.25

The machine learning model then trains on the Z-scores for that feature instead of on the raw values.
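The mapping in the table can be reproduced with a one-line function. This is a minimal sketch; the function name `z_score` is chosen for illustration, and the mean and standard deviation are the ones from the example above.

```python
def z_score(raw_value, mean, std_dev):
    """Number of standard deviations between raw_value and the mean."""
    return (raw_value - mean) / std_dev

for raw in (800, 950, 575):
    print(raw, z_score(raw, mean=800, std_dev=100))
# 800 0.0
# 950 1.5
# 575 -2.25
```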

See Numerical data: Normalization in Machine Learning Crash Course for more information.


A

accuracy

#fundamentals
#Metric

The number of correct classification predictions divided by the total number of predictions. That is:

$$\text{Accuracy} = \frac{\text{correct predictions}} {\text{correct predictions + incorrect predictions }}$$

For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:

$$\text{Accuracy} = \frac{\text{40}} {\text{40 + 10}} = \text{80%}$$

Binary classification provides specific names for the different categories of correct predictions and incorrect predictions . So, the accuracy formula for binary classification is as follows:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}} {\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

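where:

  • TP is the number of true positives (correct predictions).
  • TN is the number of true negatives (correct predictions).
  • FP is the number of false positives (incorrect predictions).
  • FN is the number of false negatives (incorrect predictions).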

Compare and contrast accuracy with precision and recall .
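As a worked check of the binary classification formula, the following sketch reuses the 40 correct and 10 incorrect predictions from the earlier example. How those predictions split into TP, TN, FP, and FN is an assumption made here for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# 40 correct (25 TP + 15 TN) and 10 incorrect (6 FP + 4 FN) predictions.
print(accuracy(tp=25, tn=15, fp=6, fn=4))  # 0.8
```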

See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.

activation function

#fundamentals

A function that enables neural networks to learn nonlinear (complex) relationships between features and the label.

Popular activation functions include:

The plots of activation functions are never single straight lines. For example, the plot of the ReLU activation function consists of two straight lines:

Figure: a Cartesian plot of two lines. The first line has a constant y value of 0, running along the x-axis from (-infinity, 0) to (0, 0). The second line starts at (0, 0) and has a slope of +1, so it runs from (0, 0) to (+infinity, +infinity).

A plot of the sigmoid activation function looks as follows:

Figure: a two-dimensional curved plot with x values spanning the domain -infinity to +infinity, while y values span the range almost 0 to almost 1. When x is 0, y is 0.5. The slope of the curve is always positive, with the highest slope at (0, 0.5) and gradually decreasing slopes as the absolute value of x increases.

See Neural networks: Activation functions in Machine Learning Crash Course for more information.

artificial intelligence

#fundamentals

A non-human program or model that can solve sophisticated tasks. For example, a program or model that translates text or a program or model that identifies diseases from radiologic images both exhibit artificial intelligence.

Formally, machine learning is a sub-field of artificial intelligence. However, in recent years, some organizations have begun using the terms artificial intelligence and machine learning interchangeably.

AUC (Area under the ROC curve)

#fundamentals
#Metric

A number between 0.0 and 1.0 representing a binary classification model's ability to separate positive classes from negative classes . The closer the AUC is to 1.0, the better the model's ability to separate classes from each other.

For example, the following illustration shows a classification model that separates positive classes (green ovals) from negative classes (purple rectangles) perfectly. This unrealistically perfect model has an AUC of 1.0:

Figure: a number line with 8 positive examples on one side and 9 negative examples on the other side.

Conversely, the following illustration shows the results for a classification model that generated random results. This model has an AUC of 0.5:

Figure: a number line with 6 positive examples and 6 negative examples. The sequence of examples is positive, negative, positive, negative, positive, negative, positive, negative, positive, negative, positive, negative.

Yes, the preceding model has an AUC of 0.5, not 0.0.

Most models are somewhere between the two extremes. For instance, the following model separates positives from negatives somewhat, and therefore has an AUC somewhere between 0.5 and 1.0:

Figure: a number line with 6 positive examples and 6 negative examples. The sequence of examples is negative, negative, negative, negative, positive, negative, positive, positive, negative, positive, positive, positive.

AUC ignores any value you set for classification threshold . Instead, AUC considers all possible classification thresholds.
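One way to see why AUC needs no threshold: it equals the fraction of (positive, negative) pairs in which the model scores the positive example higher. The following sketch computes that fraction from a left-to-right ordering of examples, assuming that scores increase from left to right and that no two scores are tied. The helper name and the string encoding of the orderings are illustrative.

```python
def auc_from_ranking(labels_in_score_order):
    """Compute AUC from labels ('P' or 'N') sorted by ascending model score."""
    negatives_seen = 0
    correctly_ordered_pairs = 0
    for label in labels_in_score_order:
        if label == "N":
            negatives_seen += 1
        else:
            # This positive outranks every negative seen so far.
            correctly_ordered_pairs += negatives_seen
    positives = labels_in_score_order.count("P")
    negatives = labels_in_score_order.count("N")
    return correctly_ordered_pairs / (positives * negatives)

perfect = list("NNNNNNNNNPPPPPPPP")  # 9 negatives, then 8 positives
partial = list("NNNNPNPPNPPP")       # the partially separated ordering above

print(auc_from_ranking(perfect))  # 1.0
print(auc_from_ranking(partial))  # 32 of 36 pairs ordered correctly
```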

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

B

backpropagation

#fundamentals

The algorithm that implements gradient descent in neural networks.

Training a neural network involves many iterations of the following two-pass cycle:

  1. During the forward pass, the system processes a batch of examples to yield prediction(s). The system compares each prediction to each label value. The difference between the prediction and the label value is the loss for that example. The system aggregates the losses for all the examples to compute the total loss for the current batch.
  2. During the backward pass (backpropagation), the system reduces loss by adjusting the weights of all the neurons in all the hidden layer(s).

Neural networks often contain many neurons across many hidden layers. Each of those neurons contributes to the overall loss in different ways. Backpropagation determines whether to increase or decrease the weights applied to particular neurons.

The learning rate is a multiplier that controls the degree to which each backward pass increases or decreases each weight. A large learning rate will increase or decrease each weight more than a small learning rate.

In calculus terms, backpropagation implements the chain rule. That is, backpropagation calculates the partial derivative of the error with respect to each parameter.

Years ago, ML practitioners had to write code to implement backpropagation. Modern ML APIs like Keras now implement backpropagation for you. Phew!

See Neural networks in Machine Learning Crash Course for more information.

batch

#fundamentals

The set of examples used in one training iteration. The batch size determines the number of examples in a batch.

See epoch for an explanation of how a batch relates to an epoch.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

batch size

#fundamentals

The number of examples in a batch. For instance, if the batch size is 100, then the model processes 100 examples per iteration.

The following are popular batch size strategies:

  • Stochastic Gradient Descent (SGD), in which the batch size is 1.
  • Full batch, in which the batch size is the number of examples in the entire training set. For instance, if the training set contains a million examples, then the batch size would be a million examples. Full batch is usually an inefficient strategy.
  • Mini-batch, in which the batch size is usually between 10 and 1000. Mini-batch is usually the most efficient strategy.

See the following for more information:

bias (ethics/fairness)

#responsible
#fundamentals

1. Stereotyping, prejudice or favoritism towards some things, people, or groups over others. These biases can affect collection and interpretation of data, the design of a system, and how users interact with a system. Forms of this type of bias include:

2. Systematic error introduced by a sampling or reporting procedure. Forms of this type of bias include:

Not to be confused with the bias term in machine learning models or prediction bias .

See Fairness: Types of bias in Machine Learning Crash Course for more information.

bias (math) or bias term

#fundamentals

An intercept or offset from an origin. Bias is a parameter in machine learning models, which is symbolized by either of the following:

  • b
  • w_0

For example, bias is the b in the following formula:

$$y' = b + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$

In a simple two-dimensional line, bias just means "y-intercept." For example, the bias of the line in the following illustration is 2.

The plot of a line with a slope of 0.5 and a bias (y-intercept) of 2.

Bias exists because not all models start from the origin (0,0). For example, suppose an amusement park costs 2 Euros to enter and an additional 0.5 Euro for every hour a customer stays. Therefore, a model mapping the total cost has a bias of 2 because the lowest cost is 2 Euros.
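The amusement park example corresponds to a one-feature linear model. The following sketch uses the function name `total_cost` purely for illustration; the bias and weight come from the example above.

```python
def total_cost(hours: float, bias: float = 2.0, weight: float = 0.5) -> float:
    """y' = b + w_1 * x_1, with b = 2 Euros to enter and w_1 = 0.5 Euro per hour."""
    return bias + weight * hours

print(total_cost(0))  # 2.0, the bias alone
print(total_cost(4))  # 4.0
```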

Bias is not to be confused with bias in ethics and fairness or prediction bias .

See Linear Regression in Machine Learning Crash Course for more information.

binary classification

#fundamentals

A type of classification task that predicts one of two mutually exclusive classes:

  • the positive class
  • the negative class

For example, the following two machine learning models each perform binary classification:

  • A model that determines whether email messages are spam (the positive class) or not spam (the negative class).
  • A model that evaluates medical symptoms to determine whether a person has a particular disease (the positive class) or doesn't have that disease (the negative class).

Contrast with multi-class classification .

See also logistic regression and classification threshold .

See Classification in Machine Learning Crash Course for more information.

bucketing

#fundamentals

Converting a single feature into multiple binary features called buckets or bins , typically based on a value range. The chopped feature is typically a continuous feature .

For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete buckets, such as:

  • <= 10 degrees Celsius would be the "cold" bucket.
  • 11 - 24 degrees Celsius would be the "temperate" bucket.
  • >= 25 degrees Celsius would be the "warm" bucket.

The model will treat every value in the same bucket identically. For example, the values 13 and 22 are both in the temperate bucket, so the model treats the two values identically.
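The temperature example might be coded as follows. The ranges above are stated for whole degrees, so placing fractional values between 10 and 11 (or between 24 and 25) in the temperate bucket is an assumption of this sketch, as is the function name.

```python
def temperature_buckets(celsius: float) -> list[int]:
    """Return [cold, temperate, warm] as three binary features."""
    if celsius <= 10:
        return [1, 0, 0]
    if celsius < 25:
        return [0, 1, 0]
    return [0, 0, 1]

# 13 and 22 land in the same bucket, so the model sees identical features.
print(temperature_buckets(13))  # [0, 1, 0]
print(temperature_buckets(22))  # [0, 1, 0]
```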

See Numerical data: Binning in Machine Learning Crash Course for more information.

C

categorical data

#fundamentals

Features having a specific set of possible values. For example, consider a categorical feature named traffic-light-state, which can only have one of the following three possible values:

  • red
  • yellow
  • green

By representing traffic-light-state as a categorical feature, a model can learn the differing impacts of red, green, and yellow on driver behavior.

Categorical features are sometimes called discrete features .

Contrast with numerical data .

See Working with categorical data in Machine Learning Crash Course for more information.

class

#fundamentals

A category that a label can belong to. For example:

A classification model predicts a class. In contrast, a regression model predicts a number rather than a class.

See Classification in Machine Learning Crash Course for more information.

classification model

#fundamentals

A model whose prediction is a class . For example, the following are all classification models:

  • A model that predicts an input sentence's language (French? Spanish? Italian?).
  • A model that predicts tree species (Maple? Oak? Baobab?).
  • A model that predicts the positive or negative class for a particular medical condition.

In contrast, regression models predict numbers rather than classes.

Two common types of classification models are:

classification threshold

#fundamentals

In binary classification, a number between 0 and 1 that converts the raw output of a logistic regression model into a prediction of either the positive class or the negative class. Note that the classification threshold is a value that a human chooses, not a value chosen by model training.

A logistic regression model outputs a raw value between 0 and 1. Then:

  • If this raw value is greater than the classification threshold, then the positive class is predicted.
  • If this raw value is less than the classification threshold, then the negative class is predicted.

For example, suppose the classification threshold is 0.8. If the raw value is 0.9, then the model predicts the positive class. If the raw value is 0.7, then the model predicts the negative class.
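The thresholding rule can be sketched as a single comparison. The text above does not say what happens when the raw value exactly equals the threshold, so treating that case as negative is an assumption of this example.

```python
def classify(raw_value: float, threshold: float = 0.8) -> str:
    """Map a raw logistic regression output to a class prediction."""
    # The equal-to-threshold case is unspecified above; negative is assumed here.
    return "positive" if raw_value > threshold else "negative"

print(classify(0.9))  # positive
print(classify(0.7))  # negative
```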

The choice of classification threshold strongly influences the number of false positives and false negatives .

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

classifier

#fundamentals

A casual term for a classification model .

class-imbalanced dataset

#fundamentals

A dataset for a classification problem in which the total number of labels of each class differs significantly. For example, consider a binary classification dataset whose two labels are divided as follows:

  • 1,000,000 negative labels
  • 10 positive labels

The ratio of negative to positive labels is 100,000 to 1, so this is a class-imbalanced dataset.

In contrast, the following dataset is not class-imbalanced because the ratio of negative labels to positive labels is relatively close to 1:

  • 517 negative labels
  • 483 positive labels

Multi-class datasets can also be class-imbalanced. For example, the following multi-class classification dataset is also class-imbalanced because one label has far more examples than the other two:

  • 1,000,000 labels with class "green"
  • 200 labels with class "purple"
  • 350 labels with class "orange"

See also entropy , majority class , and minority class .

clipping

#fundamentals

A technique for handling outliers by doing either or both of the following:

  • Reducing feature values that are greater than a maximum threshold down to that maximum threshold.
  • Increasing feature values that are less than a minimum threshold up to that minimum threshold.

For example, suppose that <0.5% of values for a particular feature fall outside the range 40–60. In this case, you could do the following:

  • Clip all values over 60 (the maximum threshold) to be exactly 60.
  • Clip all values under 40 (the minimum threshold) to be exactly 40.
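As a minimal sketch of the 40–60 example, the following pure-Python helper clips a list of feature values. The helper name `clip` and the sample values are made up for illustration.

```python
def clip(values, minimum=40.0, maximum=60.0):
    """Return a new list with every value forced into [minimum, maximum]."""
    return [min(max(v, minimum), maximum) for v in values]


feature_values = [38.5, 41.0, 59.9, 72.3, 50.0]  # made-up sample data
clipped = clip(feature_values)
print(clipped)  # [40.0, 41.0, 59.9, 60.0, 50.0]
```

Values already inside the range pass through unchanged; only the two outliers are moved to the nearest threshold.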

Outliers can damage models, sometimes causing weights to overflow during training. Some outliers can also dramatically spoil metrics like accuracy . Clipping is a common technique to limit the damage.

Gradient clipping forces gradient values within a designated range during training.

See Numerical data: Normalization in Machine Learning Crash Course for more information.

confusion matrix

#fundamentals

An NxN table that summarizes the number of correct and incorrect predictions that a classification model made. For example, consider the following confusion matrix for a binary classification model:

| | Tumor (predicted) | Non-Tumor (predicted) |
|---|---|---|
| Tumor (ground truth) | 18 (TP) | 1 (FN) |
| Non-Tumor (ground truth) | 6 (FP) | 452 (TN) |

The preceding confusion matrix shows the following:

  • Of the 19 predictions in which ground truth was Tumor, the model correctly classified 18 and incorrectly classified 1.
  • Of the 458 predictions in which ground truth was Non-Tumor, the model correctly classified 452 and incorrectly classified 6.

The confusion matrix for a multi-class classification problem can help you identify patterns of mistakes. For example, consider the following confusion matrix for a 3-class multi-class classification model that categorizes three different iris types (Virginica, Versicolor, and Setosa). When the ground truth was Virginica, the confusion matrix shows that the model was far more likely to mistakenly predict Versicolor than Setosa:

| | Setosa (predicted) | Versicolor (predicted) | Virginica (predicted) |
|---|---|---|---|
| Setosa (ground truth) | 88 | 12 | 0 |
| Versicolor (ground truth) | 6 | 141 | 7 |
| Virginica (ground truth) | 2 | 27 | 109 |

As yet another example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistakenly predict 9 instead of 4, or mistakenly predict 1 instead of 7.

Confusion matrices contain sufficient information to calculate a variety of performance metrics, including precision and recall.
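To illustrate, here is a short sketch that derives precision, recall, and accuracy from the four cells of the binary (tumor) confusion matrix in this entry. The variable names are illustrative.

```python
# Counts taken from the binary (tumor) confusion matrix in this entry.
tp, fn = 18, 1
fp, tn = 6, 452

precision = tp / (tp + fp)                   # share of predicted tumors that are real
recall = tp / (tp + fn)                      # share of real tumors the model found
accuracy = (tp + tn) / (tp + fn + fp + tn)   # share of all predictions that are correct

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
# precision=0.750 recall=0.947 accuracy=0.985
```

Note how accuracy looks excellent while precision is noticeably lower, a common pattern when the negative class dominates.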

continuous feature

#fundamentals

A floating-point feature with an infinite range of possible values, such as temperature or weight.

Contrast with discrete feature .

convergence

#fundamentals

A state reached when loss values change very little or not at all with each iteration . For example, the following loss curve suggests convergence at around 700 iterations:

Cartesian plot. X-axis is the number of training iterations. Y-axis is loss. Loss is very high during the first few iterations, but drops sharply. After about 100 iterations, loss is still descending but far more gradually. After about 700 iterations, loss stays flat.

A model converges when additional training won't improve the model.

In deep learning , loss values sometimes stay constant or nearly so for many iterations before finally descending. During a long period of constant loss values, you may temporarily get a false sense of convergence.

See also early stopping .

See Model convergence and loss curves in Machine Learning Crash Course for more information.

D

DataFrame

#fundamentals

A popular pandas data type for representing datasets in memory.

A DataFrame is analogous to a table or a spreadsheet. Each column of a DataFrame has a name (a header), and each row is identified by a unique number.

A DataFrame is structured like a 2D array, except that each column can be assigned its own data type.

See also the official pandas.DataFrame reference page .

data set or dataset

#fundamentals

A collection of raw data, commonly (but not exclusively) organized in one of the following formats:

  • a spreadsheet
  • a file in CSV (comma-separated values) format

deep model

#fundamentals

A neural network containing more than one hidden layer .

A deep model is also called a deep neural network .

Contrast with wide model .

dense feature

#fundamentals

A feature in which most or all values are nonzero, typically a Tensor of floating-point values. For example, the following 10-element Tensor is dense because 9 of its values are nonzero:

8 3 7 5 2 4 0 4 9 6

Contrast with sparse feature .

depth

#fundamentals

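The sum of the following in a neural network:

  • the number of hidden layers
  • the number of output layers, which is typically 1
  • the number of any embedding layers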

For example, a neural network with five hidden layers and one output layer has a depth of 6.

Notice that the input layer doesn't influence depth.

discrete feature

#fundamentals

A feature with a finite set of possible values. For example, a feature whose values may only be animal , vegetable , or mineral is a discrete (or categorical) feature.

Contrast with continuous feature .

dynamic

#fundamentals

Something done frequently or continuously. The terms dynamic and online are synonyms in machine learning. The following are common uses of dynamic and online in machine learning:

  • A dynamic model (or online model ) is a model that is retrained frequently or continuously.
  • Dynamic training (or online training ) is the process of training frequently or continuously.
  • Dynamic inference (or online inference ) is the process of generating predictions on demand.

dynamic model

#fundamentals

A model that is frequently (maybe even continuously) retrained. A dynamic model is a "lifelong learner" that constantly adapts to evolving data. A dynamic model is also known as an online model .

Contrast with static model .

E

early stopping

#fundamentals

A method for regularization that involves ending training before training loss finishes decreasing. In early stopping, you intentionally stop training the model when the loss on a validation dataset starts to increase; that is, when generalization performance worsens.
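A minimal sketch of the stopping rule, assuming validation loss is measured once per evaluation step. The `patience` parameter (how many consecutive non-improving evaluations to tolerate) is a common convention but an assumption here, not part of the definition above; the loss values are made up.

```python
def early_stopping_step(validation_losses, patience=2):
    """Return the index of the best evaluation seen before stopping.

    Training is considered stopped once validation loss has failed to
    improve for `patience` consecutive evaluations (an assumed rule).
    """
    best_loss = float("inf")
    best_step = -1
    evaluations_without_improvement = 0
    for step, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_step = loss, step
            evaluations_without_improvement = 0
        else:
            evaluations_without_improvement += 1
            if evaluations_without_improvement >= patience:
                break  # validation loss keeps rising: stop training
    return best_step


losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]  # made-up values
print(early_stopping_step(losses))  # 3
```

In practice you would keep the model checkpoint from the returned step rather than the final one.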

embedding layer

#language
#fundamentals

A special hidden layer that trains on a high-dimensional categorical feature to gradually learn a lower dimension embedding vector. An embedding layer enables a neural network to train far more efficiently than training just on the high-dimensional categorical feature.

For example, Earth currently supports about 73,000 tree species. Suppose tree species is a feature in your model, so your model's input layer includes a one-hot vector 73,000 elements long. For example, perhaps baobab would be represented something like this:

An array of 73,000 elements. The first 6,232 elements hold the value 0. The next element holds the value 1. The final 66,767 elements hold the value 0.

A 73,000-element array is very long. If you don't add an embedding layer to the model, training is going to be very time consuming due to multiplying 72,999 zeros. Perhaps you pick the embedding layer to consist of 12 dimensions. Consequently, the embedding layer will gradually learn a new embedding vector for each tree species.

In certain situations, hashing is a reasonable alternative to an embedding layer.

See Embeddings in Machine Learning Crash Course for more information.

epoch

#fundamentals

A full training pass over the entire training set such that each example has been processed once.

An epoch represents N / batch size training iterations , where N is the total number of examples.

For instance, suppose the following:

  • The dataset consists of 1,000 examples.
  • The batch size is 50 examples.

Therefore, a single epoch requires 20 iterations:

1 epoch = (N/batch size) = (1,000 / 50) = 20 iterations
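The same arithmetic as a small Python helper. The function name is illustrative, and rounding up for a final partial batch is an assumption; the text only covers the case where the batch size divides the dataset evenly.

```python
import math


def iterations_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of parameter updates needed to process every example once.

    Assumption: when the batch size doesn't divide the dataset evenly,
    the final, smaller batch still counts as one iteration.
    """
    return math.ceil(num_examples / batch_size)


print(iterations_per_epoch(1_000, 50))  # 20
```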

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

example

#fundamentals

The values of one row of features and possibly a label . Examples in supervised learning fall into two general categories:

  • A labeled example consists of one or more features and a label. Labeled examples are used during training.
  • An unlabeled example consists of one or more features but no label. Unlabeled examples are used during inference.

For instance, suppose you are training a model to determine the influence of weather conditions on student test scores. Here are three labeled examples:

| Temperature (feature) | Humidity (feature) | Pressure (feature) | Test score (label) |
|---|---|---|---|
| 15 | 47 | 998 | Good |
| 19 | 34 | 1020 | Excellent |
| 18 | 92 | 1012 | Poor |

Here are three unlabeled examples:

| Temperature | Humidity | Pressure |
|---|---|---|
| 12 | 62 | 1014 |
| 21 | 47 | 1017 |
| 19 | 41 | 1021 |

The row of a dataset is typically the raw source for an example. That is, an example typically consists of a subset of the columns in the dataset. Furthermore, the features in an example can also include synthetic features , such as feature crosses .

See Supervised Learning in the Introduction to Machine Learning course for more information.

F

false negative (FN)

#fundamentals
#Metric

An example in which the model mistakenly predicts the negative class . For example, the model predicts that a particular email message is not spam (the negative class), but that email message actually is spam .

false positive (FP)

#fundamentals
#Metric

An example in which the model mistakenly predicts the positive class . For example, the model predicts that a particular email message is spam (the positive class), but that email message is actually not spam .

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive rate (FPR)

#fundamentals
#Metric

The proportion of actual negative examples for which the model mistakenly predicted the positive class. The following formula calculates the false positive rate:

$$\text{false positive rate} = \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}$$

The false positive rate is the x-axis in an ROC curve .
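As a quick sketch, the formula translates directly into Python. The function name is illustrative; the sample counts (6 false positives, 452 true negatives) are borrowed from the tumor confusion matrix in the confusion matrix entry.

```python
def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """FPR = FP / (FP + TN): the share of actual negatives flagged positive."""
    actual_negatives = false_positives + true_negatives
    if actual_negatives == 0:
        raise ValueError("FPR is undefined when there are no actual negatives")
    return false_positives / actual_negatives


print(round(false_positive_rate(6, 452), 4))  # 0.0131
```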

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

feature

#fundamentals

An input variable to a machine learning model. An example consists of one or more features. For instance, suppose you are training a model to determine the influence of weather conditions on student test scores. The following table shows three examples, each of which contains three features and one label:

| Temperature (feature) | Humidity (feature) | Pressure (feature) | Test score (label) |
|---|---|---|---|
| 15 | 47 | 998 | 92 |
| 19 | 34 | 1020 | 84 |
| 18 | 92 | 1012 | 87 |

Contrast with label .

See Supervised Learning in the Introduction to Machine Learning course for more information.

feature cross

#fundamentals

A synthetic feature formed by "crossing" categorical or bucketed features.

For example, consider a "mood forecasting" model that represents temperature in one of the following four buckets:

  • freezing
  • chilly
  • temperate
  • warm

And represents wind speed in one of the following three buckets:

  • still
  • light
  • windy

Without feature crosses, the linear model trains independently on each of the preceding seven buckets. So, the model trains on, for example, freezing independently of the training on, for example, windy.

Alternatively, you could create a feature cross of temperature and wind speed. This synthetic feature would have the following 12 possible values:

  • freezing-still
  • freezing-light
  • freezing-windy
  • chilly-still
  • chilly-light
  • chilly-windy
  • temperate-still
  • temperate-light
  • temperate-windy
  • warm-still
  • warm-light
  • warm-windy

Thanks to feature crosses, the model can learn mood differences between a freezing-windy day and a freezing-still day.

If you create a synthetic feature from two features that each have a lot of different buckets, the resulting feature cross will have a huge number of possible combinations. For example, if one feature has 1,000 buckets and the other feature has 2,000 buckets, the resulting feature cross has 2,000,000 buckets.

Formally, a cross is a Cartesian product .
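The Cartesian product can be sketched with the standard library's `itertools.product`. The variable names and the hyphenated naming of crossed values are illustrative choices that mirror the list in this entry.

```python
from itertools import product

temperature_buckets = ["freezing", "chilly", "temperate", "warm"]
wind_buckets = ["still", "light", "windy"]

# The feature cross is the Cartesian product of the two bucket sets.
crossed_values = [
    f"{temperature}-{wind}"
    for temperature, wind in product(temperature_buckets, wind_buckets)
]

print(len(crossed_values))  # 12
print(crossed_values[:3])   # ['freezing-still', 'freezing-light', 'freezing-windy']
```

The count grows multiplicatively (4 × 3 here), which is why crossing two features with many buckets produces a very large number of combinations.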

Feature crosses are mostly used with linear models and are rarely used with neural networks.

See Categorical data: Feature crosses in Machine Learning Crash Course for more information.

feature engineering

#fundamentals
#TensorFlow

A process that involves the following steps:

  1. Determining which features might be useful in training a model.
  2. Converting raw data from the dataset into efficient versions of those features.

For example, you might determine that temperature might be a useful feature. Then, you might experiment with bucketing to optimize what the model can learn from different temperature ranges.

Feature engineering is sometimes called feature extraction or featurization .

See Numerical data: How a model ingests data using feature vectors in Machine Learning Crash Course for more information.

feature set

#fundamentals

The group of features your machine learning model trains on. For example, a simple feature set for a model that predicts housing prices might consist of postal code, property size, and property condition.

feature vector

#fundamentals

The array of feature values comprising an example . The feature vector is input during training and during inference . For example, the feature vector for a model with two discrete features might be:

[0.92, 0.56]

Four layers: an input layer, two hidden layers, and one output layer. The input layer contains two nodes, one containing the value 0.92 and the other containing the value 0.56.

Each example supplies different values for the feature vector, so the feature vector for the next example could be something like:

[0.73, 0.49]

Feature engineering determines how to represent features in the feature vector. For example, a categorical feature with five possible values might be represented with one-hot encoding. In this case, the portion of the feature vector for a particular example would consist of four zeroes and a single 1.0 in the third position, as follows:

[0.0, 0.0, 1.0, 0.0, 0.0]

As another example, suppose your model consists of three features:

  • a categorical feature with five possible values represented with one-hot encoding; for example: [0.0, 1.0, 0.0, 0.0, 0.0]
  • another categorical feature with three possible values represented with one-hot encoding; for example: [0.0, 0.0, 1.0]
  • a floating-point feature; for example: 8.3 .

In this case, the feature vector for each example would be represented by nine values. Given the example values in the preceding list, the feature vector would be:

[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 8.3]
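The nine-value vector can be assembled programmatically. This is a minimal sketch; the helper `one_hot` is a hypothetical name, not a library function.

```python
def one_hot(index: int, size: int) -> list[float]:
    """Return `size` floats with 1.0 at position `index` and 0.0 elsewhere."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    return [1.0 if i == index else 0.0 for i in range(size)]


# Concatenate the three features from the list above into one feature vector:
# a five-value one-hot, a three-value one-hot, and one floating-point value.
feature_vector = one_hot(1, 5) + one_hot(2, 3) + [8.3]
print(feature_vector)
# [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 8.3]
```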

See Numerical data: How a model ingests data using feature vectors in Machine Learning Crash Course for more information.

feedback loop

#fundamentals

In machine learning, a situation in which a model's predictions influence the training data for the same model or another model. For example, a model that recommends movies will influence the movies that people see, which will then influence subsequent movie recommendation models.

See Production ML systems: Questions to ask in Machine Learning Crash Course for more information.

G

generalization

#fundamentals

A model's ability to make correct predictions on new, previously unseen data. A model that can generalize is the opposite of a model that is overfitting .

See Generalization in Machine Learning Crash Course for more information.

generalization curve

#fundamentals

A plot of both training loss and validation loss as a function of the number of iterations .

A generalization curve can help you detect possible overfitting . For example, the following generalization curve suggests overfitting because validation loss ultimately becomes significantly higher than training loss.

A Cartesian graph in which the y-axis is labeled loss and the x-axis is labeled iterations. Two plots appear. One plot shows the training loss and the other shows the validation loss. The two plots start off similarly, but the training loss eventually dips far lower than the validation loss.

See Generalization in Machine Learning Crash Course for more information.

gradient descent

#fundamentals

A mathematical technique to minimize loss . Gradient descent iteratively adjusts weights and biases , gradually finding the best combination to minimize loss.

Gradient descent is older—much, much older—than machine learning.
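A minimal sketch of the iterative adjustment, assuming a one-feature linear model and mean squared error as the loss. The toy data, learning rate, and iteration count are illustrative choices, not values from this glossary.

```python
def gradient_descent(xs, ys, learning_rate=0.05, iterations=2000):
    """Fit y ≈ weight * x + bias by repeatedly stepping against the gradient."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        errors = [(weight * x + bias) - y for x, y in zip(xs, ys)]
        # Gradients of mean squared error with respect to weight and bias.
        grad_weight = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_bias = (2 / n) * sum(errors)
        # Move each parameter a small step in the direction that lowers loss.
        weight -= learning_rate * grad_weight
        bias -= learning_rate * grad_bias
    return weight, bias


# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
weight, bias = gradient_descent(xs, ys)
print(round(weight, 3), round(bias, 3))  # close to 2.0 and 1.0
```

With a much larger learning rate the same loop can overshoot and diverge, which is one reason the learning rate is treated as a key hyperparameter.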

See Linear regression: Gradient descent in Machine Learning Crash Course for more information.

ground truth

#fundamentals

Reality.

The thing that actually happened.

For example, consider a binary classification model that predicts whether a student in their first year of university will graduate within six years. Ground truth for this model is whether or not that student actually graduated within six years.

H

hidden layer

#fundamentals

A layer in a neural network between the input layer (the features) and the output layer (the prediction). Each hidden layer consists of one or more neurons . For example, the following neural network contains two hidden layers, the first with three neurons and the second with two neurons:

Four layers. The first layer is an input layer containing two features. The second layer is a hidden layer containing three neurons. The third layer is a hidden layer containing two neurons. The fourth layer is an output layer. Each feature has three edges, each of which points to a different neuron in the second layer. Each of the neurons in the second layer has two edges, each of which points to a different neuron in the third layer. Each of the neurons in the third layer has one edge, pointing to the output layer.

A deep neural network contains more than one hidden layer. For example, the preceding illustration is a deep neural network because the model contains two hidden layers.

See Neural networks: Nodes and hidden layers in Machine Learning Crash Course for more information.

hyperparameter

#fundamentals

The variables that you or a hyperparameter tuning service adjust during successive runs of training a model. For example, learning rate is a hyperparameter. You could set the learning rate to 0.01 before one training session. If you determine that 0.01 is too high, you could perhaps set the learning rate to 0.003 for the next training session.

In contrast, parameters are the various weights and bias that the model learns during training.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

I

independently and identically distributed (iid)

#fundamentals

Data drawn from a distribution that doesn't change, and where each value drawn doesn't depend on values that have been drawn previously. Data that is iid is the ideal gas of machine learning: a useful mathematical construct but almost never exactly found in the real world. For example, the distribution of visitors to a web page may be iid over a brief window of time; that is, the distribution doesn't change during that brief window and one person's visit is generally independent of another's visit. However, if you expand that window of time, seasonal differences in the web page's visitors may appear.

See also nonstationarity .

inference

#fundamentals

In machine learning, the process of making predictions by applying a trained model to unlabeled examples .

Inference has a somewhat different meaning in statistics. See the Wikipedia article on statistical inference for details.

See Supervised Learning in the Intro to ML course to see inference's role in a supervised learning system.

input layer

#fundamentals

The layer of a neural network that holds the feature vector . That is, the input layer provides examples for training or inference . For example, the input layer in the following neural network consists of two features:

Four layers: an input layer, two hidden layers, and an output layer.

interpretability

#fundamentals

The ability to explain or to present an ML model's reasoning in understandable terms to a human.

Most linear regression models, for example, are highly interpretable. (You merely need to look at the trained weights for each feature.) Decision forests are also highly interpretable. Some models, however, require sophisticated visualization to become interpretable.

You can use the Learning Interpretability Tool (LIT) to interpret ML models.

iteration

#fundamentals

A single update of a model's parameters—the model's weights and biases —during training . The batch size determines how many examples the model processes in a single iteration. For instance, if the batch size is 20, then the model processes 20 examples before adjusting the parameters.

When training a neural network , a single iteration involves the following two passes:

  1. A forward pass to evaluate loss on a single batch.
  2. A backward pass ( backpropagation ) to adjust the model's parameters based on the loss and the learning rate.

See Gradient descent in Machine Learning Crash Course for more information.

L

L0 regularization

#fundamentals

A type of regularization that penalizes the total number of nonzero weights in a model. For example, a model having 11 nonzero weights would be penalized more than a similar model having 10 nonzero weights.

L0 regularization is sometimes called L0-norm regularization.

L1 loss

#fundamentals
#Metric

A loss function that calculates the absolute value of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L1 loss for a batch of five examples:

| Actual value of example | Model's predicted value | Absolute value of delta |
|---|---|---|
| 7 | 6 | 1 |
| 5 | 4 | 1 |
| 8 | 11 | 3 |
| 4 | 6 | 2 |
| 9 | 8 | 1 |
| | | 8 = L1 loss |

L1 loss is less sensitive to outliers than L2 loss.

The Mean Absolute Error is the average L1 loss per example.
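The table's arithmetic can be reproduced with a few lines of Python. The variable names are illustrative; the numbers come from the five-example batch in this entry.

```python
actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]

# Absolute difference between each label and its prediction.
absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
l1_loss = sum(absolute_errors)
mean_absolute_error = l1_loss / len(actual)

print(absolute_errors)      # [1, 1, 3, 2, 1]
print(l1_loss)              # 8
print(mean_absolute_error)  # 1.6
```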

See Linear regression: Loss in Machine Learning Crash Course for more information.

L1 regularization

#fundamentals

A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights. L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0. A feature with a weight of 0 is effectively removed from the model.

Contrast with L2 regularization.

L2 loss

#fundamentals
#Metric

A loss function that calculates the square of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L2 loss for a batch of five examples:

| Actual value of example | Model's predicted value | Square of delta |
|---|---|---|
| 7 | 6 | 1 |
| 5 | 4 | 1 |
| 8 | 11 | 9 |
| 4 | 6 | 4 |
| 9 | 8 | 1 |
| | | 16 = L2 loss |

Due to squaring, L2 loss amplifies the influence of outliers. That is, L2 loss reacts more strongly to bad predictions than L1 loss. For example, the L1 loss for the preceding batch would be 8 rather than 16. Notice that a single outlier accounts for 9 of the 16.

Regression models typically use L2 loss as the loss function.

The Mean Squared Error is the average L2 loss per example. Squared loss is another name for L2 loss.
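A short sketch reproducing the squared-error calculation for the same five-example batch, and showing how much of the total the single outlier contributes. Variable names are illustrative.

```python
actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]

# Squared difference between each label and its prediction.
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
l2_loss = sum(squared_errors)
mean_squared_error = l2_loss / len(actual)
outlier_share = max(squared_errors) / l2_loss

print(squared_errors)      # [1, 1, 9, 4, 1]
print(l2_loss)             # 16
print(mean_squared_error)  # 3.2
print(outlier_share)       # 0.5625
```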

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

L2 regularization

#fundamentals

A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0. Features whose weights are very close to 0 remain in the model but don't influence the model's prediction very much.
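To make the contrast with the absolute-value penalty concrete, the sketch below computes both penalty terms for one made-up weight list. It omits the regularization rate that would normally scale the penalty, and the weights are invented for illustration.

```python
weights = [0.2, -0.5, 5.0, 0.0, -0.1]  # made-up weights; 5.0 is the outlier

l1_penalty = sum(abs(w) for w in weights)  # sum of absolute values
l2_penalty = sum(w * w for w in weights)   # sum of squares

# Share of each penalty that comes from the single outlier weight.
outlier_share_l1 = abs(5.0) / l1_penalty
outlier_share_l2 = (5.0 * 5.0) / l2_penalty

print(round(l1_penalty, 2), round(l2_penalty, 2))  # 5.8 25.3
print(outlier_share_l2 > outlier_share_l1)         # True
```

Because squaring makes the outlier dominate the penalty, reducing that one weight lowers the penalty far more than shrinking the already-small weights.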

L2 regularization often improves generalization in linear models.

Contrast with L1 regularization.

See Overfitting: L2 regularization in Machine Learning Crash Course for more information.

label

#fundamentals

In supervised machine learning , the "answer" or "result" portion of an example .

Each labeled example consists of one or more features and a label. For example, in a spam detection dataset, the label would probably be either "spam" or "not spam." In a rainfall dataset, the label might be the amount of rain that fell during a certain period.

See Supervised Learning in Introduction to Machine Learning for more information.

labeled example

#fundamentals

An example that contains one or more features and a label . For example, the following table shows three labeled examples from a house valuation model, each with three features and one label:

| Number of bedrooms | Number of bathrooms | House age | House price (label) |
|---|---|---|---|
| 3 | 2 | 15 | $345,000 |
| 2 | 1 | 72 | $179,000 |
| 4 | 2 | 34 | $392,000 |

In supervised machine learning , models train on labeled examples and make predictions on unlabeled examples .

Contrast labeled example with unlabeled examples.

See Supervised Learning in Introduction to Machine Learning for more information.

lambda

#fundamentals

Synonym for regularization rate .

Lambda is an overloaded term. Here we're focusing on the term's definition within regularization .

layer

#fundamentals

A set of neurons in a neural network. Three common types of layers are as follows:

  • the input layer
  • one or more hidden layers
  • the output layer

For example, the following illustration shows a neural network with one input layer, two hidden layers, and one output layer:

A neural network with one input layer, two hidden layers, and one output layer. The input layer consists of two features. The first hidden layer consists of three neurons and the second hidden layer consists of two neurons. The output layer consists of a single node.

In TensorFlow , layers are also Python functions that take Tensors and configuration options as input and produce other tensors as output.

learning rate

#fundamentals

A floating-point number that tells the gradient descent algorithm how strongly to adjust weights and biases on each iteration . For example, a learning rate of 0.3 would adjust weights and biases three times more powerfully than a learning rate of 0.1.

Learning rate is a key hyperparameter . If you set the learning rate too low, training will take too long. If you set the learning rate too high, gradient descent often has trouble reaching convergence .

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

linear

#fundamentals

A relationship between two or more variables that can be represented solely through addition and multiplication.

The plot of a linear relationship is a line.

Contrast with nonlinear .

linear model

#fundamentals

A model that assigns one weight per feature to make predictions . (Linear models also incorporate a bias .) In contrast, the relationship of features to predictions in deep models is generally nonlinear .

Linear models are usually easier to train and more interpretable than deep models. However, deep models can learn complex relationships between features.

Linear regression and logistic regression are two types of linear models.

linear regression

#fundamentals

A type of machine learning model in which both of the following are true:

  • The model is a linear model .
  • The prediction is a floating-point value. (This is the regression part of linear regression .)

Contrast linear regression with logistic regression . Also, contrast regression with classification .

See Linear regression in Machine Learning Crash Course for more information.

logistic regression

#fundamentals

A type of regression model that predicts a probability. Logistic regression models have the following characteristics:

  • The label is categorical . The term logistic regression usually refers to binary logistic regression , that is, to a model that calculates probabilities for labels with two possible values. A less common variant, multinomial logistic regression , calculates probabilities for labels with more than two possible values.
  • The loss function during training is Log Loss . (Multiple Log Loss units can be placed in parallel for labels with more than two possible values.)
  • The model has a linear architecture, not a deep neural network. However, the remainder of this definition also applies to deep models that predict probabilities for categorical labels.

For example, consider a logistic regression model that calculates the probability of an input email being either spam or not spam. During inference, suppose the model predicts 0.72. Therefore, the model is estimating:

  • A 72% chance of the email being spam.
  • A 28% chance of the email not being spam.

A logistic regression model uses the following two-step architecture:

  1. The model generates a raw prediction (y') by applying a linear function of input features.
  2. The model uses that raw prediction as input to a sigmoid function , which converts the raw prediction to a value between 0 and 1, exclusive.

Like any regression model, a logistic regression model predicts a number. However, this number typically becomes part of a binary classification model as follows:

  • If the predicted number is greater than the classification threshold , the binary classification model predicts the positive class.
  • If the predicted number is less than the classification threshold, the binary classification model predicts the negative class.
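The two-step architecture and the thresholding step can be sketched as follows. The weights, bias, feature values, and the 0.5 threshold are all hypothetical numbers chosen for illustration; a real model would learn the weights and bias during training.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    # Step 1: raw prediction from a linear function of the input features.
    raw_prediction = bias + sum(w * x for w, x in zip(weights, features))
    # Step 2: sigmoid converts the raw prediction to a probability.
    return sigmoid(raw_prediction)


# Hypothetical learned parameters and one example's features.
weights, bias = [1.5, -0.8], -0.2
features = [1.0, 0.5]

probability = predict_probability(features, weights, bias)
threshold = 0.5  # chosen by a person, not learned
predicted_class = "positive" if probability > threshold else "negative"

print(round(probability, 3), predicted_class)
```

Here the raw prediction is 0.9, so the probability lands a little above 0.71 and the example is classified as positive at this threshold.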

See Logistic regression in Machine Learning Crash Course for more information.

Log Loss

#fundamentals

The loss function used in binary logistic regression .

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

log-odds

#fundamentals

The logarithm of the odds of some event.
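As a brief sketch, assume the event has probability p, so that its odds are p / (1 - p); the log-odds is then the natural logarithm of that ratio. The function name is illustrative.

```python
import math


def log_odds(p: float) -> float:
    """Natural log of the odds p / (1 - p) for a probability 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


print(log_odds(0.5))            # 0.0 (even odds)
print(round(log_odds(0.9), 3))  # 2.197 (odds of about 9 to 1)
```

The log-odds function is the inverse of the sigmoid function, which is why a logistic regression model's raw prediction can be read as a log-odds value.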

loss

#fundamentals
#Metric

During the training of a supervised model , a measure of how far a model's prediction is from its label .

A loss function calculates the loss.

See Linear regression: Loss in Machine Learning Crash Course for more information.

loss curve

#fundamentals

A plot of loss as a function of the number of training iterations . The following plot shows a typical loss curve:

A Cartesian graph of loss versus training iterations, showing a
          rapid drop in loss for the initial iterations, followed by a gradual
          drop, and then a flat slope during the final iterations.

Loss curves can help you determine when your model is converging or overfitting .

Loss curves can plot all of the following types of loss:

  • training loss
  • validation loss
  • test loss

See also generalization curve .

See Overfitting: Interpreting loss curves in Machine Learning Crash Course for more information.

loss function

#fundamentals
#Metric

During training or testing, a mathematical function that calculates the loss on a batch of examples. A loss function returns a lower loss for models that make good predictions than for models that make bad predictions.

The goal of training is typically to minimize the loss that a loss function returns.

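Many different kinds of loss functions exist. Pick the appropriate loss function for the kind of model you are building. For example:

  • L2 loss (or Mean Squared Error) is the loss function for linear regression.
  • Log Loss is the loss function for logistic regression.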

M

machine learning

#fundamentals

A program or system that trains a model from input data. The trained model can make useful predictions from new (never-before-seen) data drawn from the same distribution as the one used to train the model.

Machine learning also refers to the field of study concerned with these programs or systems.

See the Introduction to Machine Learning course for more information.

majority class

#fundamentals

The more common label in a class-imbalanced dataset . For example, given a dataset containing 99% negative labels and 1% positive labels, the negative labels are the majority class.

Contrast with minority class .

See Datasets: Imbalanced datasets in Machine Learning Crash Course for more information.

mini-batch

#fundamentals

A small, randomly selected subset of a batch processed in one iteration . The batch size of a mini-batch is usually between 10 and 1,000 examples.

For example, suppose the entire training set (the full batch) consists of 1,000 examples. Further suppose that you set the batch size of each mini-batch to 20. Therefore, each iteration determines the loss on a random 20 of the 1,000 examples and then adjusts the weights and biases accordingly.

It is much more efficient to calculate the loss on a mini-batch than the loss on all the examples in the full batch.
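The 1,000-example scenario above can be sketched as follows. The integers stand in for real examples, and the fixed seed is an assumption made only so that the sketch is repeatable.

```python
import random

random.seed(0)  # assumption: fixed seed for repeatability

training_set = list(range(1000))  # stand-ins for the 1,000 examples
batch_size = 20

# One iteration draws a random mini-batch without replacement.
mini_batch = random.sample(training_set, batch_size)
```

A training loop would compute the loss on `mini_batch` only, adjust the weights and biases, and then draw a new mini-batch for the next iteration.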

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

minority class

#fundamentals

The less common label in a class-imbalanced dataset . For example, given a dataset containing 99% negative labels and 1% positive labels, the positive labels are the minority class.

Contrast with majority class .

See Datasets: Imbalanced datasets in Machine Learning Crash Course for more information.

model

#fundamentals

In general, any mathematical construct that processes input data and returns output. Phrased differently, a model is the set of parameters and structure needed for a system to make predictions. In supervised machine learning, a model takes an example as input and infers a prediction as output. Within supervised machine learning, models differ somewhat. For example:

  • A linear regression model consists of a set of weights and a bias .
  • A neural network model consists of:
    • A set of hidden layers , each containing one or more neurons .
    • The weights and bias associated with each neuron.
  • A decision tree model consists of:
    • The shape of the tree; that is, the pattern in which the conditions and leaves are connected.
    • The conditions and leaves.

You can save, restore, or make copies of a model.

Unsupervised machine learning also generates models, typically a function that can map an input example to the most appropriate cluster .

multi-class classification

#fundamentals

In supervised learning, a classification problem in which the dataset contains more than two classes of labels. For example, the labels in the Iris dataset must be one of the following three classes:

  • Iris setosa
  • Iris virginica
  • Iris versicolor

A model trained on the Iris dataset that predicts Iris type on new examples is performing multi-class classification.

In contrast, classification problems that distinguish between exactly two classes are binary classification models . For example, an email model that predicts either spam or not spam is a binary classification model.

In clustering problems, multi-class classification refers to more than two clusters.

See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.

N

negative class

#fundamentals
#Metric

In binary classification, one class is termed positive and the other is termed negative. The positive class is the thing or event that the model is testing for and the negative class is the other possibility. For example:

  • The negative class in a medical test might be "not tumor."
  • The negative class in an email classification model might be "not spam."

Contrast with positive class .

neural network

#fundamentals

A model containing at least one hidden layer . A deep neural network is a type of neural network containing more than one hidden layer. For example, the following diagram shows a deep neural network containing two hidden layers.

A neural network with an input layer, two hidden layers, and an output layer.

Each neuron in a neural network connects to all of the nodes in the next layer. For example, in the preceding diagram, notice that each of the three neurons in the first hidden layer separately connect to both of the two neurons in the second hidden layer.

Neural networks implemented on computers are sometimes called artificial neural networks to differentiate them from neural networks found in brains and other nervous systems.

Some neural networks can mimic extremely complex nonlinear relationships between different features and the label.

See also convolutional neural network and recurrent neural network .

See Neural networks in Machine Learning Crash Course for more information.

neuron

#fundamentals

In machine learning, a distinct unit within a hidden layer of a neural network . Each neuron performs the following two-step action:

  1. Calculates the weighted sum of input values multiplied by their corresponding weights.
  2. Passes the weighted sum as input to an activation function .

A neuron in the first hidden layer accepts inputs from the feature values in the input layer . A neuron in any hidden layer beyond the first accepts inputs from the neurons in the preceding hidden layer. For example, a neuron in the second hidden layer accepts inputs from the neurons in the first hidden layer.

The following illustration highlights two neurons and their inputs.

A neural network with an input layer, two hidden layers, and an output layer. Two neurons are highlighted: one in the first hidden layer and one in the second hidden layer. The highlighted neuron in the first hidden layer receives inputs from both features in the input layer. The highlighted neuron in the second hidden layer receives inputs from each of the three neurons in the first hidden layer.

A neuron in a neural network mimics the behavior of neurons in brains and other parts of nervous systems.

node (neural network)

#fundamentals

A neuron in a hidden layer .

See Neural Networks in Machine Learning Crash Course for more information.

nonlinear

#fundamentals

A relationship between two or more variables that can't be represented solely through addition and multiplication. A linear relationship can be represented as a line; a nonlinear relationship can't be represented as a line. For example, consider two models that each relate a single feature to a single label. The model on the left is linear and the model on the right is nonlinear:

Two plots. One plot is a line, so this is a linear relationship. The other plot is a curve, so this is a nonlinear relationship.

See Neural networks: Nodes and hidden layers in Machine Learning Crash Course to experiment with different kinds of nonlinear functions.

nonstationarity

#fundamentals

A feature whose values change across one or more dimensions, usually time. For example, consider the following examples of nonstationarity:

  • The number of swimsuits sold at a particular store varies with the season.
  • The quantity of a particular fruit harvested in a particular region is zero for much of the year but large for a brief period.
  • Due to climate change, annual mean temperatures are shifting.

Contrast with stationarity .

normalization

#fundamentals

Broadly speaking, the process of converting a variable's actual range of values into a standard range of values, such as:

  • -1 to +1
  • 0 to 1
  • Z-scores (roughly, -3 to +3)

For example, suppose the actual range of values of a certain feature is 800 to 2,400. As part of feature engineering , you could normalize the actual values down to a standard range, such as -1 to +1.
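One simple way to do this is linear scaling. The sketch below maps the 800 to 2,400 range onto -1 to +1; the function name and its parameters are illustrative assumptions, and other normalization techniques exist.

```python
def scale_to_range(value, actual_min, actual_max, new_min=-1.0, new_max=1.0):
    """Linearly map value from [actual_min, actual_max] to [new_min, new_max]."""
    fraction = (value - actual_min) / (actual_max - actual_min)
    return new_min + fraction * (new_max - new_min)

low = scale_to_range(800, 800, 2400)    # bottom of the actual range
mid = scale_to_range(1600, 800, 2400)   # midpoint of the actual range
high = scale_to_range(2400, 800, 2400)  # top of the actual range
```

The bottom of the actual range maps to -1, the midpoint to 0, and the top to +1.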

Normalization is a common task in feature engineering . Models usually train faster (and produce better predictions) when every numerical feature in the feature vector has roughly the same range.

See also Z-score normalization .

See Numerical Data: Normalization in Machine Learning Crash Course for more information.

numerical data

#fundamentals

Features represented as integers or real-valued numbers. For example, a house valuation model would probably represent the size of a house (in square feet or square meters) as numerical data. Representing a feature as numerical data indicates that the feature's values have a mathematical relationship to the label. That is, the number of square meters in a house probably has some mathematical relationship to the value of the house.

Not all integer data should be represented as numerical data. For example, postal codes in some parts of the world are integers; however, integer postal codes shouldn't be represented as numerical data in models. That's because a postal code of 20000 is not twice (or half) as potent as a postal code of 10000. Furthermore, although different postal codes do correlate to different real estate values, we can't assume that real estate values at postal code 20000 are twice as valuable as real estate values at postal code 10000. Postal codes should be represented as categorical data instead.

Numerical features are sometimes called continuous features .

See Working with numerical data in Machine Learning Crash Course for more information.

O

offline

#fundamentals

Synonym for static .

offline inference

#fundamentals

The process of a model generating a batch of predictions and then caching (saving) those predictions. Apps can then access the inferred prediction from the cache rather than rerunning the model.

For example, consider a model that generates local weather forecasts (predictions) once every four hours. After each model run, the system caches all the local weather forecasts. Weather apps retrieve the forecasts from the cache.
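The caching pattern in the weather example can be sketched as follows. `forecast_model` is a hypothetical stand-in for an expensive model run, and the city names are arbitrary.

```python
model_runs = 0  # counts how often the (hypothetical) model is actually run

def forecast_model(city):
    """Hypothetical stand-in for an expensive model run."""
    global model_runs
    model_runs += 1
    return f"forecast for {city}"

def refresh_cache(cities):
    """Run the model once per city and cache every prediction."""
    return {city: forecast_model(city) for city in cities}

# The scheduled batch job (for example, once every four hours).
cache = refresh_cache(["Oslo", "Helsinki"])

# Apps read from the cache; these lookups do not rerun the model.
first_lookup = cache["Oslo"]
second_lookup = cache["Oslo"]
```

However many times apps read the cache, the model runs only once per city per refresh.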

Offline inference is also called static inference .

Contrast with online inference .

See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

one-hot encoding

#fundamentals

Representing categorical data as a vector in which:

  • One element is set to 1.
  • All other elements are set to 0.

One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For example, suppose a certain categorical feature named Scandinavia has five possible values:

  • "Denmark"
  • "Sweden"
  • "Norway"
  • "Finland"
  • "Iceland"

One-hot encoding could represent each of the five values as follows:

Country      Vector
"Denmark"    1 0 0 0 0
"Sweden"     0 1 0 0 0
"Norway"     0 0 1 0 0
"Finland"    0 0 0 1 0
"Iceland"    0 0 0 0 1

Thanks to one-hot encoding, a model can learn different connections based on each of the five countries.
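A minimal sketch of building such vectors, assuming the vocabulary is held in a plain Python list (the names `COUNTRIES` and `one_hot` are illustrative):

```python
COUNTRIES = ["Denmark", "Sweden", "Norway", "Finland", "Iceland"]

def one_hot(value, vocabulary):
    """Return a vector with a 1 at the value's position and 0 elsewhere."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(value)] = 1
    return vector

norway = one_hot("Norway", COUNTRIES)
iceland = one_hot("Iceland", COUNTRIES)
```

In practice, a library encoder would usually do this work; the sketch only shows the resulting representation.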

Representing a feature as numerical data is an alternative to one-hot encoding. Unfortunately, representing the Scandinavian countries numerically is not a good choice. For example, consider the following numeric representation:

  • "Denmark" is 0
  • "Sweden" is 1
  • "Norway" is 2
  • "Finland" is 3
  • "Iceland" is 4

With numeric encoding, a model would interpret the raw numbers mathematically and would try to train on those numbers. However, Iceland isn't actually twice as much (or half as much) of something as Norway, so the model would come to some strange conclusions.

See Categorical data: Vocabulary and one-hot encoding in Machine Learning Crash Course for more information.

one-vs.-all

#fundamentals

Given a classification problem with N classes, a solution consisting of N separate binary classifiers —one binary classifier for each possible outcome. For example, given a model that classifies examples as animal, vegetable, or mineral, a one-vs.-all solution would provide the following three separate binary classifiers:

  • animal versus not animal
  • vegetable versus not vegetable
  • mineral versus not mineral
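One way to prepare the training data for those three classifiers is to relabel the dataset once per class. The sketch below shows only that relabeling step; the sample labels and helper name are assumptions.

```python
CLASSES = ["animal", "vegetable", "mineral"]

def binary_labels(labels, positive_class):
    """Relabel a multi-class problem as positive_class versus everything else."""
    return [1 if label == positive_class else 0 for label in labels]

# Hypothetical labels for four examples.
labels = ["animal", "mineral", "vegetable", "animal"]

# One binary labeling per class, each used to train its own binary classifier.
binary_problems = {cls: binary_labels(labels, cls) for cls in CLASSES}
```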

online

#fundamentals

Synonym for dynamic .

online inference

#fundamentals

Generating predictions on demand. For example, suppose an app passes input to a model and issues a request for a prediction. A system using online inference responds to the request by running the model (and returning the prediction to the app).

Contrast with offline inference .

See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

output layer

#fundamentals

The "final" layer of a neural network. The output layer contains the prediction.

The following illustration shows a small deep neural network with an input layer, two hidden layers, and an output layer:

A neural network with one input layer, two hidden layers, and one output layer. The input layer consists of two features. The first hidden layer consists of three neurons and the second hidden layer consists of two neurons. The output layer consists of a single node.

overfitting

#fundamentals

Creating a model that matches the training data so closely that the model fails to make correct predictions on new data.

Regularization can reduce overfitting. Training on a large and diverse training set can also reduce overfitting.

See Overfitting in Machine Learning Crash Course for more information.

P

pandas

#fundamentals

A column-oriented data analysis API built on top of numpy . Many machine learning frameworks, including TensorFlow, support pandas data structures as inputs. See the pandas documentation for details.

parameter

#fundamentals

The weights and biases that a model learns during training. For example, in a linear regression model, the parameters consist of the bias (b) and all the weights (w_1, w_2, and so on) in the following formula:

$$y' = b + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$
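As a small sketch, the formula can be evaluated directly once the parameters are known. The weight, bias, and feature values below are made up for illustration.

```python
def predict(features, weights, bias):
    """Compute y' = b + w_1*x_1 + ... + w_n*x_n."""
    return bias + sum(w * x for w, x in zip(weights, features))

# Hypothetical learned parameters for a model with two features.
weights = [0.5, -2.0]
bias = 1.0

prediction = predict([4.0, 1.5], weights, bias)  # 1.0 + 2.0 - 3.0
```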

In contrast, hyperparameters are the values that you (or a hyperparameter tuning service) supply to the model. For example, learning rate is a hyperparameter.

positive class

#fundamentals
#Metric

The class you are testing for.

For example, the positive class in a cancer model might be "tumor." The positive class in an email classification model might be "spam."

Contrast with negative class .

post-processing

#responsible
#fundamentals

Adjusting the output of a model after the model has been run. Post-processing can be used to enforce fairness constraints without modifying models themselves.

For example, one might apply post-processing to a binary classifier by setting a classification threshold such that equality of opportunity is maintained for some attribute by checking that the true positive rate is the same for all values of that attribute.

prediction

#fundamentals

A model's output. For example:

  • The prediction of a binary classification model is either the positive class or the negative class.
  • The prediction of a multi-class classification model is one class.
  • The prediction of a linear regression model is a number.

proxy labels

#fundamentals

Data used to approximate labels not directly available in a dataset.

For example, suppose you must train a model to predict employee stress level. Your dataset contains a lot of predictive features but doesn't contain a label named stress level. Undaunted, you pick "workplace accidents" as a proxy label for stress level. After all, employees under high stress get into more accidents than calm employees. Or do they? Maybe workplace accidents actually rise and fall for multiple reasons.

As a second example, suppose you want is it raining? to be a Boolean label for your dataset, but your dataset doesn't contain rain data. If photographs are available, you might establish pictures of people carrying umbrellas as a proxy label for is it raining? Is that a good proxy label? Possibly, but people in some cultures may be more likely to carry umbrellas to protect against sun than the rain.

Proxy labels are often imperfect. When possible, choose actual labels over proxy labels. That said, when an actual label is absent, pick the proxy label very carefully, choosing the least horrible proxy label candidate.

See Datasets: Labels in Machine Learning Crash Course for more information.

R

RAG

#fundamentals

Abbreviation for retrieval-augmented generation .

rater

#fundamentals

A human who provides labels for examples . "Annotator" is another name for rater.

See Categorical data: Common issues in Machine Learning Crash Course for more information.

Rectified Linear Unit (ReLU)

#fundamentals

An activation function with the following behavior:

  • If input is negative or zero, then the output is 0.
  • If input is positive, then the output is equal to the input.

For example:

  • If the input is -3, then the output is 0.
  • If the input is +3, then the output is 3.0.
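The behavior above fits in a single line of Python. This is a scalar sketch; real frameworks apply the same rule elementwise to tensors.

```python
def relu(x):
    """Return 0 for negative or zero input, otherwise the input itself."""
    return max(0.0, float(x))

negative_case = relu(-3)  # 0.0
positive_case = relu(3)   # 3.0
```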

Here is a plot of ReLU:

A Cartesian plot of two lines. The first line has a constant y value of 0, running along the x-axis from (-infinity, 0) to (0, 0). The second line starts at (0, 0). This line has a slope of +1, so it runs from (0, 0) to (+infinity, +infinity).

ReLU is a very popular activation function. Despite its simple behavior, ReLU still enables a neural network to learn nonlinear relationships between features and the label .

regression model

#fundamentals

Informally, a model that generates a numerical prediction. (In contrast, a classification model generates a class prediction.) For example, the following are all regression models:

  • A model that predicts a certain house's value in Euros, such as 423,000.
  • A model that predicts a certain tree's life expectancy in years, such as 23.2.
  • A model that predicts the amount of rain in inches that will fall in a certain city over the next six hours, such as 0.18.

Two common types of regression models are:

  • Linear regression , which finds the line that best fits label values to features.
  • Logistic regression , which generates a probability between 0.0 and 1.0 that a system typically then maps to a class prediction.

Not every model that outputs numerical predictions is a regression model. In some cases, a numeric prediction is really just a classification model that happens to have numeric class names. For example, a model that predicts a numeric postal code is a classification model, not a regression model.

regularization

#fundamentals

Any mechanism that reduces overfitting . Popular types of regularization include:

Regularization can also be defined as the penalty on a model's complexity.

See Overfitting: Model complexity in Machine Learning Crash Course for more information.

regularization rate

#fundamentals

A number that specifies the relative importance of regularization during training. Raising the regularization rate reduces overfitting but may reduce the model's predictive power. Conversely, reducing or omitting the regularization rate increases overfitting.
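To make the role of the rate concrete, the sketch below adds a penalty to a loss value, scaled by the regularization rate. It assumes an L2 penalty (the sum of squared weights); other penalties are possible, and all values are illustrative.

```python
def regularized_loss(data_loss, weights, regularization_rate):
    """Data loss plus the regularization rate times an L2 penalty."""
    l2_penalty = sum(w ** 2 for w in weights)  # assumption: L2 penalty
    return data_loss + regularization_rate * l2_penalty

weights = [0.5, -1.5, 2.0]  # hypothetical weights; L2 penalty is 6.5

no_regularization = regularized_loss(1.0, weights, 0.0)
some_regularization = regularized_loss(1.0, weights, 0.1)
```

With a higher rate, large weights cost more, so training is pushed toward a simpler model.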

See Overfitting: L2 regularization in Machine Learning Crash Course for more information.

ReLU

#fundamentals

Abbreviation for Rectified Linear Unit .

retrieval-augmented generation (RAG)

#fundamentals

A technique for improving the quality of large language model (LLM) output by grounding it with sources of knowledge retrieved after the model was trained. RAG improves the accuracy of LLM responses by providing the trained LLM with access to information retrieved from trusted knowledge bases or documents.

Common motivations to use retrieval-augmented generation include:

  • Increasing the factual accuracy of a model's generated responses.
  • Giving the model access to knowledge it was not trained on.
  • Changing the knowledge that the model uses.
  • Enabling the model to cite sources.

For example, suppose that a chemistry app uses the PaLM API to generate summaries related to user queries. When the app's backend receives a query, the backend:

  1. Searches for ("retrieves") data that's relevant to the user's query.
  2. Appends ("augments") the relevant chemistry data to the user's query.
  3. Instructs the LLM to create a summary based on the appended data.

ROC (receiver operating characteristic) Curve

#fundamentals
#Metric

A graph of true positive rate versus false positive rate for different classification thresholds in binary classification.

The shape of an ROC curve suggests a binary classification model's ability to separate positive classes from negative classes. Suppose, for example, that a binary classification model perfectly separates all the negative classes from all the positive classes:

A number line with 8 positive examples on the right side and 7 negative examples on the left.

The ROC curve for the preceding model looks as follows:

An ROC curve. The x-axis is False Positive Rate and the y-axis is True Positive Rate. The curve has an inverted L shape. The curve starts at (0.0,0.0) and goes straight up to (0.0,1.0). Then the curve goes from (0.0,1.0) to (1.0,1.0).

In contrast, the following illustration graphs the raw logistic regression values for a terrible model that can't separate negative classes from positive classes at all:

A number line with positive examples and negative examples completely intermixed.

The ROC curve for this model looks as follows:

An ROC curve, which is actually a straight line from (0.0,0.0) to (1.0,1.0).

Meanwhile, back in the real world, most binary classification models separate positive and negative classes to some degree, but usually not perfectly. So, a typical ROC curve falls somewhere between the two extremes:

An ROC curve. The x-axis is False Positive Rate and the y-axis is True Positive Rate. The ROC curve approximates a shaky arc traversing the compass points from West to North.

The point on an ROC curve closest to (0.0,1.0) theoretically identifies the ideal classification threshold. However, several other real-world issues influence the selection of the ideal classification threshold. For example, perhaps false negatives cause far more pain than false positives.

A numerical metric called AUC summarizes the ROC curve into a single floating-point value.
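The points of an ROC curve can be computed by sweeping the classification threshold. The following sketch uses made-up scores for a model that separates the classes perfectly; the function name, the data, and the chosen thresholds are assumptions.

```python
def roc_points(labels, scores, thresholds):
    """Return (false positive rate, true positive rate) for each threshold."""
    points = []
    for threshold in thresholds:
        predicted = [score >= threshold for score in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
        tn = sum(1 for p, y in zip(predicted, labels) if not p and y == 0)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

# Hypothetical, perfectly separated data: negatives score low, positives high.
labels = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]

points = roc_points(labels, scores, thresholds=[1.1, 0.5, 0.0])
```

For this data the three points trace the inverted L shape described earlier: (0.0, 0.0), then (0.0, 1.0), then (1.0, 1.0).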

Root Mean Squared Error (RMSE)

#fundamentals
#Metric

The square root of the Mean Squared Error .

S

sigmoid function

#fundamentals

A mathematical function that "squishes" an input value into a constrained range, typically 0 to 1 or -1 to +1. That is, you can pass any number (two, a million, negative billion, whatever) to a sigmoid and the output will still be in the constrained range. A plot of the sigmoid activation function looks as follows:

A two-dimensional curved plot with x values spanning the domain -infinity to +infinity, while y values span the range almost 0 to almost 1. When x is 0, y is 0.5. The slope of the curve is always positive, with the highest slope at (0, 0.5) and gradually decreasing slopes as the absolute value of x increases.
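A sketch of the 0-to-1 variant, the logistic function 1 / (1 + e^(-x)), follows. It is written in two branches so that very large negative inputs don't overflow; that numerical detail is an implementation choice, not part of the definition.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, which squishes any real number into the range 0 to 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x is negative, so exp(x) cannot overflow
    return z / (1.0 + z)

at_zero = sigmoid(0)             # 0.5
at_million = sigmoid(1_000_000)
at_negative_billion = sigmoid(-1_000_000_000)
```

However extreme the input, the output stays inside the constrained range.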

The sigmoid function has several uses in machine learning, including:

softmax

#fundamentals

A function that determines probabilities for each possible class in a multi-class classification model . The probabilities add up to exactly 1.0. For example, the following table shows how softmax distributes various probabilities:

Image is a...   Probability
dog             .85
cat             .13
horse           .02

Softmax is also called full softmax .
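A minimal sketch of softmax over three raw scores follows. The scores are arbitrary and are not meant to reproduce the probabilities in the table; subtracting the maximum score is a common numerical-stability step.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that add up to 1.0."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the classes dog, cat, and horse.
probabilities = softmax([2.0, 0.1, -1.8])
```

The largest score receives the largest probability, and the probabilities sum to 1.0 (up to floating-point rounding).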

Contrast with candidate sampling .

See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.

sparse feature

#language
#fundamentals

A feature whose values are predominately zero or empty. For example, a feature containing a single 1 value and a million 0 values is sparse. In contrast, a dense feature has values that are predominantly not zero or empty.

In machine learning, a surprising number of features are sparse features. Categorical features are usually sparse features. For example, of the 300 possible tree species in a forest, a single example might identify just a maple tree . Or, of the millions of possible videos in a video library, a single example might identify just "Casablanca."

In a model, you typically represent sparse features with one-hot encoding . If the one-hot encoding is big, you might put an embedding layer on top of the one-hot encoding for greater efficiency.

sparse representation

#language
#fundamentals

Storing only the position(s) of nonzero elements in a sparse feature.

For example, suppose a categorical feature named species identifies the 36 tree species in a particular forest. Further assume that each example identifies only a single species.

You could use a one-hot vector to represent the tree species in each example. A one-hot vector would contain a single 1 (to represent the particular tree species in that example) and 35 0s (to represent the 35 tree species not in that example). So, the one-hot representation of maple might look something like the following:

A vector in which positions 0 through 23 hold the value 0, position 24 holds the value 1, and positions 25 through 35 hold the value 0.

Alternatively, sparse representation would simply identify the position of the particular species. If maple is at position 24, then the sparse representation of maple would simply be:

24

Notice that the sparse representation is much more compact than the one-hot representation.
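The maple example can be sketched as follows, assuming the one-hot vector is a plain Python list (the helper name `to_sparse` is illustrative):

```python
def to_sparse(vector):
    """Return the positions of the nonzero elements."""
    return [position for position, value in enumerate(vector) if value != 0]

# One-hot vector for maple: 36 positions, with a 1 at position 24.
maple_one_hot = [0] * 36
maple_one_hot[24] = 1

maple_sparse = to_sparse(maple_one_hot)
```

The one-hot vector stores 36 values, while the sparse representation stores one.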

See Working with categorical data in Machine Learning Crash Course for more information.

sparse vector

#fundamentals

A vector whose values are mostly zeroes. See also sparse feature and sparsity .

squared loss

#fundamentals
#Metric

Synonym for L2 loss.

static

#fundamentals

Something done once rather than continuously. The terms static and offline are synonyms. The following are common uses of static and offline in machine learning:

  • static model (or offline model ) is a model trained once and then used for a while.
  • static training (or offline training ) is the process of training a static model.
  • static inference (or offline inference ) is a process in which a model generates a batch of predictions at a time.

Contrast with dynamic .

static inference

#fundamentals

Synonym for offline inference .

stationarity

#fundamentals

A feature whose values don't change across one or more dimensions, usually time. For example, a feature whose values look about the same in 2021 and 2023 exhibits stationarity.

In the real world, very few features exhibit stationarity. Even features synonymous with stability (like sea level) change over time.

Contrast with nonstationarity .

stochastic gradient descent (SGD)

#fundamentals

A gradient descent algorithm in which the batch size is one. In other words, SGD trains on a single example chosen uniformly at random from a training set .
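The following is a minimal sketch of SGD for a one-feature linear model with squared loss. The toy dataset (generated from y = 2x + 1), the learning rate, and the iteration count are all assumptions chosen for illustration.

```python
import random

def sgd_step(w, b, example, learning_rate):
    """Update w and b using the squared-loss gradient of a single example."""
    x, y = example
    error = (w * x + b) - y
    w -= learning_rate * 2 * error * x  # d(error^2)/dw
    b -= learning_rate * 2 * error      # d(error^2)/db
    return w, b

random.seed(0)  # assumption: fixed seed for repeatability
training_set = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0]]

w, b = 0.0, 0.0
for _ in range(2000):
    example = random.choice(training_set)  # batch size of one
    w, b = sgd_step(w, b, example, learning_rate=0.05)
```

Because the toy data has no noise, the parameters approach w = 2 and b = 1; on real data, single-example updates are much noisier.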

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

supervised machine learning

#fundamentals

Training a model from features and their corresponding labels . Supervised machine learning is analogous to learning a subject by studying a set of questions and their corresponding answers. After mastering the mapping between questions and answers, a student can then provide answers to new (never-before-seen) questions on the same topic.

Compare with unsupervised machine learning .

See Supervised Learning in the Introduction to ML course for more information.

synthetic feature

#fundamentals

A feature not present among the input features, but assembled from one or more of them. Methods for creating synthetic features include the following:

  • Bucketing a continuous feature into range bins.
  • Creating a feature cross .
  • Multiplying (or dividing) one feature value by other feature value(s) or by itself. For example, if a and b are input features, then the following are examples of synthetic features:
    • ab
    • a²
  • Applying a transcendental function to a feature value. For example, if c is an input feature, then the following are examples of synthetic features:
    • sin(c)
    • ln(c)
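The multiplication and transcendental-function examples above could be computed as follows. The dictionary keys and the input values are illustrative assumptions.

```python
import math

def synthetic_features(a, b, c):
    """Build a few synthetic features from the input features a, b, and c."""
    return {
        "a_times_b": a * b,
        "a_squared": a ** 2,
        "sin_c": math.sin(c),
        "ln_c": math.log(c),  # natural log; requires c > 0
    }

features = synthetic_features(a=3.0, b=4.0, c=1.0)
```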

Features created by normalizing or scaling alone are not considered synthetic features.

T

test loss

#fundamentals
#Metric

A metric representing a model's loss against the test set . When building a model , you typically try to minimize test loss. That's because a low test loss is a stronger quality signal than a low training loss or low validation loss .

A large gap between test loss and training loss or validation loss sometimes suggests that you need to increase the regularization rate .

training

#fundamentals

The process of determining the ideal parameters (weights and biases) comprising a model . During training, a system reads in examples and gradually adjusts parameters. Training uses each example anywhere from a few times to billions of times.

See Supervised Learning in the Introduction to ML course for more information.

training loss

#fundamentals
#Metric

A metric representing a model's loss during a particular training iteration. For example, suppose the loss function is Mean Squared Error . Perhaps the training loss (the Mean Squared Error) for the 10th iteration is 2.2, and the training loss for the 100th iteration is 1.9.

A loss curve plots training loss versus the number of iterations. A loss curve provides the following hints about training:

  • A downward slope implies that the model is improving.
  • An upward slope implies that the model is getting worse.
  • A flat slope implies that the model has reached convergence .

For example, the following somewhat idealized loss curve shows:

  • A steep downward slope during the initial iterations, which implies rapid model improvement.
  • A gradually flattening (but still downward) slope until close to the end of training, which implies continued model improvement at a somewhat slower pace than during the initial iterations.
  • A flat slope towards the end of training, which suggests convergence.

The plot of training loss versus iterations. This loss curve starts with a steep downward slope. The slope gradually flattens until the slope becomes zero.

Although training loss is important, see also generalization .

training-serving skew

#fundamentals

The difference between a model's performance during training and that same model's performance during serving .

training set

#fundamentals

The subset of the dataset used to train a model .

Traditionally, examples in the dataset are divided into the following three distinct subsets:

  • a training set
  • a validation set
  • a test set

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.
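A sketch of such a split follows, assuming the three subsets are a training set, a validation set, and a test set. The 80/10/10 proportions and the function name are assumptions; appropriate proportions depend on the dataset.

```python
import random

def split_dataset(examples, train_fraction=0.8, validation_fraction=0.1, seed=0):
    """Shuffle the examples and split them into three disjoint subsets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    training_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    test_set = shuffled[n_train + n_validation:]  # whatever remains
    return training_set, validation_set, test_set

# Integers stand in for 100 real examples.
training_set, validation_set, test_set = split_dataset(range(100))
```

Slicing one shuffled list guarantees that no example lands in more than one subset.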

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

true negative (TN)

#fundamentals
#Metric

An example in which the model correctly predicts the negative class . For example, the model infers that a particular email message is not spam , and that email message really is not spam .

true positive (TP)

#fundamentals
#Metric

An example in which the model correctly predicts the positive class . For example, the model infers that a particular email message is spam, and that email message really is spam.

true positive rate (TPR)

#fundamentals
#Metric

Synonym for recall. That is:

$$\text{true positive rate} = \frac {\text{true positives}} {\text{true positives} + \text{false negatives}}$$

True positive rate is the y-axis in an ROC curve .

U

underfitting

#fundamentals

Producing a model with poor predictive ability because the model hasn't fully captured the complexity of the training data. Many problems can cause underfitting, including:

See Overfitting in Machine Learning Crash Course for more information.

unlabeled example

#fundamentals

An example that contains features but no label . For example, the following table shows three unlabeled examples from a house valuation model, each with three features but no house value:

Number of bedrooms Number of bathrooms House age
3 2 15
2 1 72
4 2 34

In supervised machine learning , models train on labeled examples and make predictions on unlabeled examples .

In semi-supervised and unsupervised learning, unlabeled examples are used during training.

Contrast unlabeled example with labeled example .

unsupervised machine learning

#clustering
#fundamentals

Training a model to find patterns in a dataset, typically an unlabeled dataset.

The most common use of unsupervised machine learning is to cluster data into groups of similar examples. For example, an unsupervised machine learning algorithm can cluster songs based on various properties of the music. The resulting clusters can become an input to other machine learning algorithms (for example, to a music recommendation service). Clustering can help when useful labels are scarce or absent. For example, in domains such as anti-abuse and fraud, clusters can help humans better understand the data.

Contrast with supervised machine learning .

See What is Machine Learning? in the Introduction to ML course for more information.

V

validation

#fundamentals

The initial evaluation of a model's quality. Validation checks the quality of a model's predictions against the validation set .

Because the validation set differs from the training set , validation helps guard against overfitting .

You might think of evaluating the model against the validation set as the first round of testing and evaluating the model against the test set as the second round of testing.

validation loss

#fundamentals
#Metric

A metric representing a model's loss on the validation set during a particular iteration of training.

See also generalization curve .

validation set

#fundamentals

The subset of the dataset that performs initial evaluation against a trained model . Typically, you evaluate the trained model against the validation set several times before evaluating the model against the test set .

Traditionally, you divide the examples in the dataset into the following three distinct subsets:

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.
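
The following sketch shows one way to divide a dataset so that each example lands in exactly one subset. The function name, the 70/15/15 proportions, and the fixed seed are assumptions made for illustration, not a recommendation from this glossary.

```python
import random


def split_dataset(examples, validation_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the examples and split them into disjoint training, validation, and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_validation = round(len(shuffled) * validation_fraction)
    n_test = round(len(shuffled) * test_fraction)
    validation = shuffled[:n_validation]
    test = shuffled[n_validation:n_validation + n_test]
    training = shuffled[n_validation + n_test:]
    return training, validation, test


training, validation, test = split_dataset(range(100))
print(len(training), len(validation), len(test))
```

Because the three slices never overlap, no example can appear in both the training set and the validation set.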

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

W

weight

#fundamentals

A value that a model multiplies by another value. Training is the process of determining a model's ideal weights; inference is the process of using those learned weights to make predictions.

See Linear regression in Machine Learning Crash Course for more information.

weighted sum

#fundamentals

The sum of all the relevant input values multiplied by their corresponding weights. For example, suppose the relevant inputs consist of the following:

| input value | input weight |
|---|---|
| 2 | -1.3 |
| -1 | 0.6 |
| 3 | 0.4 |

The weighted sum is therefore:

weighted sum = (2)(-1.3) + (-1)(0.6) + (3)(0.4) = -2.0

A weighted sum is the input argument to an activation function .
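
The arithmetic above can be checked with a short sketch. The variable names are illustrative only.

```python
inputs = [2, -1, 3]
weights = [-1.3, 0.6, 0.4]

# Multiply each input by its weight, then add up the products.
weighted_sum = sum(x * w for x, w in zip(inputs, weights))

# Round only for display, because floating-point products are not exact.
print(round(weighted_sum, 6))
```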

Z

Z-score normalization

#fundamentals

A scaling technique that replaces a raw feature value with a floating-point value representing the number of standard deviations from that feature's mean. For example, consider a feature whose mean is 800 and whose standard deviation is 100. The following table shows how Z-score normalization would map the raw value to its Z-score:

| Raw value | Z-score |
|---|---|
| 800 | 0 |
| 950 | +1.5 |
| 575 | -2.25 |

The machine learning model then trains on the Z-scores for that feature instead of on the raw values.
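
As a small sketch, the mapping in the table follows from subtracting the mean and dividing by the standard deviation. The function name is hypothetical.

```python
def z_score(raw_value: float, mean: float, std_dev: float) -> float:
    """Return how many standard deviations raw_value lies from the mean."""
    return (raw_value - mean) / std_dev


# Feature from the example above: mean 800, standard deviation 100.
for raw in (800, 950, 575):
    print(raw, z_score(raw, mean=800, std_dev=100))
```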

See Numerical data: Normalization in Machine Learning Crash Course for more information.


This page contains ML Fundamentals glossary terms. For all glossary terms, click here .

A

accuracy

#fundamentals
#Metric

The number of correct classification predictions divided by the total number of predictions. That is:

$$\text{Accuracy} = \frac{\text{correct predictions}} {\text{correct predictions + incorrect predictions }}$$

For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:

$$\text{Accuracy} = \frac{40}{40 + 10} = 80\%$$

Binary classification provides specific names for the different categories of correct predictions and incorrect predictions . So, the accuracy formula for binary classification is as follows:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}} {\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

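where:

  • TP is the number of true positives (correct predictions).
  • TN is the number of true negatives (correct predictions).
  • FP is the number of false positives (incorrect predictions).
  • FN is the number of false negatives (incorrect predictions).

Compare and contrast accuracy with precision and recall.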

See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.
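
A minimal sketch of the binary-classification formula follows. The way the 40 correct and 10 incorrect predictions are split among TP, TN, FP, and FN is a made-up assumption; only the totals come from the example above.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical breakdown: 40 correct (25 TP + 15 TN), 10 incorrect (4 FP + 6 FN).
result = accuracy(tp=25, tn=15, fp=4, fn=6)
print(result)  # 0.8
```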

activation function

#fundamentals

A function that enables neural networks to learn nonlinear (complex) relationships between features and the label.

Popular activation functions include:

The plots of activation functions are never single straight lines. For example, the plot of the ReLU activation function consists of two straight lines:

A Cartesian plot of two lines. The first line has a constant y value of 0, running along the x-axis from (-infinity, 0) to (0, 0). The second line starts at (0, 0) and has a slope of +1, so it runs from (0, 0) to (+infinity, +infinity).

A plot of the sigmoid activation function looks as follows:

A two-dimensional curved plot with x values spanning the domain -infinity to +infinity, while y values span the range almost 0 to almost 1. When x is 0, y is 0.5. The slope of the curve is always positive, with the highest slope at (0, 0.5) and gradually decreasing slopes as the absolute value of x increases.

See Neural networks: Activation functions in Machine Learning Crash Course for more information.
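
The two plots described above correspond to the following sketch of ReLU and sigmoid. The sigmoid is written in a branch form, an implementation choice made here so that `math.exp` never receives a large positive argument.

```python
import math


def relu(x: float) -> float:
    """Return 0 for negative inputs and x itself otherwise."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Return 1 / (1 + e^-x), computed without overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), sigmoid(x))
```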

artificial intelligence

#fundamentals

A non-human program or model that can solve sophisticated tasks. For example, a program or model that translates text or a program or model that identifies diseases from radiologic images both exhibit artificial intelligence.

Formally, machine learning is a sub-field of artificial intelligence. However, in recent years, some organizations have begun using the terms artificial intelligence and machine learning interchangeably.

AUC (Area under the ROC curve)

#fundamentals
#Metric

A number between 0.0 and 1.0 representing a binary classification model's ability to separate positive classes from negative classes . The closer the AUC is to 1.0, the better the model's ability to separate classes from each other.

For example, the following illustration shows a classification model that separates positive classes (green ovals) from negative classes (purple rectangles) perfectly. This unrealistically perfect model has an AUC of 1.0:

A number line with 8 positive examples on one side and
          9 negative examples on the other side.

Conversely, the following illustration shows the results for a classification model that generated random results. This model has an AUC of 0.5:

A number line with 6 positive examples and 6 negative examples. The sequence of examples is positive, negative, positive, negative, positive, negative, positive, negative, positive, negative, positive, negative.

Yes, the preceding model has an AUC of 0.5, not 0.0.

Most models are somewhere between the two extremes. For instance, the following model separates positives from negatives somewhat, and therefore has an AUC somewhere between 0.5 and 1.0:

A number line with 6 positive examples and 6 negative examples.
          The sequence of examples is negative, negative, negative, negative,
          positive, negative, positive, positive, negative, positive, positive,
          positive.

AUC ignores any value you set for classification threshold . Instead, AUC considers all possible classification thresholds.

See Classification: ROC and AUC in Machine Learning Crash Course for more information.
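
One way to see why AUC does not depend on a threshold: AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below applies that pairwise view to the last sequence described above, under the assumption that an example's position on the number line is its score, with scores increasing to the right. The function name is hypothetical.

```python
def auc_from_scores(positive_scores, negative_scores):
    """Return the fraction of positive/negative pairs ranked correctly (ties count half)."""
    wins = 0.0
    pairs = 0
    for p in positive_scores:
        for n in negative_scores:
            pairs += 1
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / pairs


# N = negative, P = positive, ordered left to right along the number line.
labels = "NNNNPNPPNPPP"
positive_scores = [i for i, label in enumerate(labels) if label == "P"]
negative_scores = [i for i, label in enumerate(labels) if label == "N"]

mixed_auc = auc_from_scores(positive_scores, negative_scores)
perfect_auc = auc_from_scores([10, 11, 12], [1, 2, 3])
print(mixed_auc, perfect_auc)
```

Under these assumptions the partially separated sequence scores between 0.5 and 1.0, and the perfectly separated one scores 1.0.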

B

backpropagation

#fundamentals

The algorithm that implements gradient descent in neural networks .

Training a neural network involves many iterations of the following two-pass cycle:

  1. During the forward pass , the system processes a batch of examples to yield prediction(s). The system compares each prediction to each label value. The difference between the prediction and the label value is the loss for that example. The system aggregates the losses for all the examples to compute the total loss for the current batch.
  2. During the backward pass (backpropagation), the system reduces loss by adjusting the weights of all the neurons in all the hidden layer(s) .

Neural networks often contain many neurons across many hidden layers. Each of those neurons contributes to the overall loss in different ways. Backpropagation determines whether to increase or decrease the weights applied to particular neurons.

The learning rate is a multiplier that controls the degree to which each backward pass increases or decreases each weight. A large learning rate will increase or decrease each weight more than a small learning rate.

In calculus terms, backpropagation implements the chain rule. That is, backpropagation calculates the partial derivative of the error with respect to each parameter.
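
To make the chain rule concrete, here is a minimal sketch for a single neuron with a sigmoid activation and squared-error loss. The network shape, the loss function, and all numeric values are assumptions chosen for illustration; real networks apply the same idea layer by layer. The analytic gradient is compared against a finite-difference estimate as a sanity check.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def loss(w: float, b: float, x: float, target: float) -> float:
    """Squared error of a one-neuron model: prediction = sigmoid(w * x + b)."""
    return (sigmoid(w * x + b) - target) ** 2


def loss_gradients(w: float, b: float, x: float, target: float):
    """Return (dLoss/dw, dLoss/db) using the chain rule."""
    prediction = sigmoid(w * x + b)
    dloss_dpred = 2.0 * (prediction - target)    # derivative of the squared error
    dpred_dz = prediction * (1.0 - prediction)   # derivative of the sigmoid
    dloss_dz = dloss_dpred * dpred_dz
    return dloss_dz * x, dloss_dz                # dz/dw = x, dz/db = 1


w, b, x, target = 0.5, -0.2, 1.5, 1.0
grad_w, grad_b = loss_gradients(w, b, x, target)

# Finite-difference estimate of dLoss/dw for comparison.
eps = 1e-6
numeric_grad_w = (loss(w + eps, b, x, target) - loss(w - eps, b, x, target)) / (2 * eps)
print(grad_w, numeric_grad_w)
```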

Years ago, ML practitioners had to write code to implement backpropagation. Modern ML APIs like Keras now implement backpropagation for you. Phew!

See Neural networks in Machine Learning Crash Course for more information.

batch

#fundamentals

The set of examples used in one training iteration . The batch size determines the number of examples in a batch.

See epoch for an explanation of how a batch relates to an epoch.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

batch size

#fundamentals

The number of examples in a batch . For instance, if the batch size is 100, then the model processes 100 examples per iteration .

The following are popular batch size strategies:

  • Stochastic Gradient Descent (SGD) , in which the batch size is 1.
  • Full batch, in which the batch size is the number of examples in the entire training set . For instance, if the training set contains a million examples, then the batch size would be a million examples. Full batch is usually an inefficient strategy.
  • Mini-batch, in which the batch size is usually between 10 and 1000. Mini-batch is usually the most efficient strategy.

See the following for more information:

bias (ethics/fairness)

#responsible
#fundamentals

1. Stereotyping, prejudice or favoritism towards some things, people, or groups over others. These biases can affect collection and interpretation of data, the design of a system, and how users interact with a system. Forms of this type of bias include:

2. Systematic error introduced by a sampling or reporting procedure. Forms of this type of bias include:

Not to be confused with the bias term in machine learning models or prediction bias .

See Fairness: Types of bias in Machine Learning Crash Course for more information.

bias (math) or bias term

#fundamentals

An intercept or offset from an origin. Bias is a parameter in machine learning models, which is symbolized by either of the following:

  • b
  • w_0

For example, bias is the b in the following formula:

$$y' = b + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$

In a simple two-dimensional line, bias just means "y-intercept." For example, the bias of the line in the following illustration is 2.

The plot of a line with a slope of 0.5 and a bias (y-intercept) of 2.

Bias exists because not all models start from the origin (0,0). For example, suppose an amusement park costs 2 Euros to enter and an additional 0.5 Euro for every hour a customer stays. Therefore, a model mapping the total cost has a bias of 2 because the lowest cost is 2 Euros.
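
The amusement park example can be written as a one-feature linear model. The function and parameter names below are illustrative.

```python
def total_cost(hours: float, bias: float = 2.0, weight: float = 0.5) -> float:
    """Linear model y' = b + w * x: entry fee (bias) plus an hourly charge (weight)."""
    return bias + weight * hours


print(total_cost(0))  # 2.0, the bias alone
print(total_cost(3))  # 3.5
```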

Bias is not to be confused with bias in ethics and fairness or prediction bias .

See Linear Regression in Machine Learning Crash Course for more information.

binary classification

#fundamentals

A type of classification task that predicts one of two mutually exclusive classes:

  • the positive class
  • the negative class

For example, the following two machine learning models each perform binary classification:

  • A model that determines whether email messages are spam (the positive class) or not spam (the negative class).
  • A model that evaluates medical symptoms to determine whether a person has a particular disease (the positive class) or doesn't have that disease (the negative class).

Contrast with multi-class classification .

See also logistic regression and classification threshold .

See Classification in Machine Learning Crash Course for more information.

bucketing

#fundamentals

Converting a single feature into multiple binary features called buckets or bins , typically based on a value range. The chopped feature is typically a continuous feature .

For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete buckets, such as:

  • <= 10 degrees Celsius would be the "cold" bucket.
  • 11 - 24 degrees Celsius would be the "temperate" bucket.
  • >= 25 degrees Celsius would be the "warm" bucket.

The model will treat every value in the same bucket identically. For example, the values 13 and 22 are both in the temperate bucket, so the model treats the two values identically.
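
A minimal sketch of the temperature buckets above, including the conversion to binary features. The source lists integer ranges, so the handling of fractional values between 10 and 11 (or 24 and 25) as "temperate" is an assumption, as are the function names.

```python
BUCKETS = ("cold", "temperate", "warm")


def bucket_temperature(celsius: float) -> str:
    """Map a temperature in degrees Celsius to one of three buckets."""
    if celsius <= 10:
        return "cold"
    if celsius >= 25:
        return "warm"
    return "temperate"


def one_hot_bucket(celsius: float) -> list:
    """Represent the bucket as three binary features, one per bucket."""
    bucket = bucket_temperature(celsius)
    return [1 if name == bucket else 0 for name in BUCKETS]


print(bucket_temperature(13), bucket_temperature(22))
print(one_hot_bucket(30))
```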

See Numerical data: Binning in Machine Learning Crash Course for more information.

C

categorical data

#fundamentals

Features having a specific set of possible values. For example, consider a categorical feature named traffic-light-state , which can only have one of the following three possible values:

  • red
  • yellow
  • green

By representing traffic-light-state as a categorical feature, a model can learn the differing impacts of red , green , and yellow on driver behavior.

Categorical features are sometimes called discrete features .

Contrast with numerical data .

See Working with categorical data in Machine Learning Crash Course for more information.

class

#fundamentals

A category that a label can belong to. For example:

A classification model predicts a class. In contrast, a regression model predicts a number rather than a class.

See Classification in Machine Learning Crash Course for more information.

classification model

#fundamentals

A model whose prediction is a class . For example, the following are all classification models:

  • A model that predicts an input sentence's language (French? Spanish? Italian?).
  • A model that predicts tree species (Maple? Oak? Baobab?).
  • A model that predicts the positive or negative class for a particular medical condition.

In contrast, regression models predict numbers rather than classes.

Two common types of classification models are:

classification threshold

#fundamentals

In a binary classification , a number between 0 and 1 that converts the raw output of a logistic regression model into a prediction of either the positive class or the negative class . Note that the classification threshold is a value that a human chooses, not a value chosen by model training.

A logistic regression model outputs a raw value between 0 and 1. Then:

  • If this raw value is greater than the classification threshold, then the positive class is predicted.
  • If this raw value is less than the classification threshold, then the negative class is predicted.

For example, suppose the classification threshold is 0.8. If the raw value is 0.9, then the model predicts the positive class. If the raw value is 0.7, then the model predicts the negative class.
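
The rule above can be sketched as follows. The source does not say what happens when the raw value equals the threshold; predicting the negative class in that case is an assumption made here.

```python
def classify(raw_value: float, threshold: float = 0.8) -> str:
    """Convert a raw model output in [0, 1] into a predicted class."""
    # Assumption: a raw value exactly equal to the threshold maps to the negative class.
    return "positive" if raw_value > threshold else "negative"


print(classify(0.9))  # positive
print(classify(0.7))  # negative
```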

The choice of classification threshold strongly influences the number of false positives and false negatives .

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

classifier

#fundamentals

A casual term for a classification model .

class-imbalanced dataset

#fundamentals

A dataset for a classification problem in which the total number of labels of each class differs significantly. For example, consider a binary classification dataset whose two labels are divided as follows:

  • 1,000,000 negative labels
  • 10 positive labels

The ratio of negative to positive labels is 100,000 to 1, so this is a class-imbalanced dataset.

In contrast, the following dataset is not class-imbalanced because the ratio of negative labels to positive labels is relatively close to 1:

  • 517 negative labels
  • 483 positive labels

Multi-class datasets can also be class-imbalanced. For example, the following multi-class classification dataset is also class-imbalanced because one label has far more examples than the other two:

  • 1,000,000 labels with class "green"
  • 200 labels with class "purple"
  • 350 labels with class "orange"

See also entropy , majority class , and minority class .

clipping

#fundamentals

A technique for handling outliers by doing either or both of the following:

  • Reducing feature values that are greater than a maximum threshold down to that maximum threshold.
  • Increasing feature values that are less than a minimum threshold up to that minimum threshold.

For example, suppose that <0.5% of values for a particular feature fall outside the range 40–60. In this case, you could do the following:

  • Clip all values over 60 (the maximum threshold) to be exactly 60.
  • Clip all values under 40 (the minimum threshold) to be exactly 40.
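
The two clipping steps can be combined into one small function. The sample values are made up for illustration.

```python
def clip(value: float, minimum: float = 40, maximum: float = 60) -> float:
    """Raise values below minimum up to it and lower values above maximum down to it."""
    return max(minimum, min(maximum, value))


values = [12, 40, 47.5, 60, 93]
clipped = [clip(v) for v in values]
print(clipped)  # [40, 40, 47.5, 60, 60]
```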

Outliers can damage models, sometimes causing weights to overflow during training. Some outliers can also dramatically spoil metrics like accuracy . Clipping is a common technique to limit the damage.

Gradient clipping forces gradient values within a designated range during training.

See Numerical data: Normalization in Machine Learning Crash Course for more information.

confusion matrix

#fundamentals

An NxN table that summarizes the number of correct and incorrect predictions that a classification model made. For example, consider the following confusion matrix for a binary classification model:

| | Tumor (predicted) | Non-Tumor (predicted) |
|---|---|---|
| Tumor (ground truth) | 18 (TP) | 1 (FN) |
| Non-Tumor (ground truth) | 6 (FP) | 452 (TN) |

The preceding confusion matrix shows the following:

  • Of the 19 predictions in which ground truth was Tumor, the model correctly classified 18 and incorrectly classified 1.
  • Of the 458 predictions in which ground truth was Non-Tumor, the model correctly classified 452 and incorrectly classified 6.

The confusion matrix for a multi-class classification problem can help you identify patterns of mistakes. For example, consider the following confusion matrix for a 3-class multi-class classification model that categorizes three different iris types (Virginica, Versicolor, and Setosa). When the ground truth was Virginica, the confusion matrix shows that the model was far more likely to mistakenly predict Versicolor than Setosa:

| | Setosa (predicted) | Versicolor (predicted) | Virginica (predicted) |
|---|---|---|---|
| Setosa (ground truth) | 88 | 12 | 0 |
| Versicolor (ground truth) | 6 | 141 | 7 |
| Virginica (ground truth) | 2 | 27 | 109 |

As yet another example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistakenly predict 9 instead of 4, or mistakenly predict 1 instead of 7.

Confusion matrixes contain sufficient information to calculate a variety of performance metrics, including precision and recall .
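
As an illustration, the four cells of the binary tumor confusion matrix above are enough to compute precision, recall, and accuracy. The variable names are illustrative.

```python
# Cells of the binary confusion matrix from the tumor example.
tp, fn = 18, 1
fp, tn = 6, 452

precision = tp / (tp + fp)                       # of predicted tumors, how many were real
recall = tp / (tp + fn)                          # of real tumors, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)       # share of all predictions that were right

print(precision)           # 0.75
print(round(recall, 3))    # 0.947
print(round(accuracy, 3))  # 0.985
```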

continuous feature

#fundamentals

A floating-point feature with an infinite range of possible values, such as temperature or weight.

Contrast with discrete feature .

convergence

#fundamentals

A state reached when loss values change very little or not at all with each iteration . For example, the following loss curve suggests convergence at around 700 iterations:

Cartesian plot. X-axis is the number of training iterations. Y-axis is loss. Loss is very high during the first few iterations, but drops sharply. After about 100 iterations, loss is still descending but far more gradually. After about 700 iterations, loss stays flat.

A model converges when additional training won't improve the model.

In deep learning , loss values sometimes stay constant or nearly so for many iterations before finally descending. During a long period of constant loss values, you may temporarily get a false sense of convergence.

See also early stopping .

See Model convergence and loss curves in Machine Learning Crash Course for more information.

D

DataFrame

#fundamentals

A popular pandas data type for representing datasets in memory.

A DataFrame is analogous to a table or a spreadsheet. Each column of a DataFrame has a name (a header), and each row is identified by a unique number.

A DataFrame is structured like a 2D array, except that each column can be assigned its own data type.

See also the official pandas.DataFrame reference page .

data set or dataset

#fundamentals

A collection of raw data, commonly (but not exclusively) organized in one of the following formats:

  • a spreadsheet
  • a file in CSV (comma-separated values) format

deep model

#fundamentals

A neural network containing more than one hidden layer .

A deep model is also called a deep neural network .

Contrast with wide model .

dense feature

#fundamentals

A feature in which most or all values are nonzero, typically a Tensor of floating-point values. For example, the following 10-element Tensor is dense because 9 of its values are nonzero:

8 3 7 5 2 4 0 4 9 6

Contrast with sparse feature .

depth

#fundamentals

The sum of the following in a neural network :

For example, a neural network with five hidden layers and one output layer has a depth of 6.

Notice that the input layer doesn't influence depth.

discrete feature

#fundamentals

A feature with a finite set of possible values. For example, a feature whose values may only be animal , vegetable , or mineral is a discrete (or categorical) feature.

Contrast with continuous feature .

dynamic

#fundamentals

Something done frequently or continuously. The terms dynamic and online are synonyms in machine learning. The following are common uses of dynamic and online in machine learning:

  • A dynamic model (or online model ) is a model that is retrained frequently or continuously.
  • Dynamic training (or online training ) is the process of training frequently or continuously.
  • Dynamic inference (or online inference ) is the process of generating predictions on demand.

dynamic model

#fundamentals

A model that is frequently (maybe even continuously) retrained. A dynamic model is a "lifelong learner" that constantly adapts to evolving data. A dynamic model is also known as an online model .

Contrast with static model .

E

early stopping

#fundamentals

A method for regularization that involves ending training before training loss finishes decreasing. In early stopping, you intentionally stop training the model when the loss on a validation dataset starts to increase; that is, when generalization performance worsens.
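
A minimal sketch of the stopping rule follows. The `patience` parameter (how many consecutive non-improving checks to tolerate) is an assumption added here to avoid stopping on a single noisy reading; the loss values are made up.

```python
def early_stopping_iteration(validation_losses, patience: int = 2) -> int:
    """Return the index of the best validation loss seen before training is stopped."""
    best_loss = float("inf")
    best_iteration = -1
    checks_without_improvement = 0
    for iteration, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            best_iteration = iteration
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                break  # validation loss keeps rising, so stop training here
    return best_iteration


# Hypothetical validation losses: they fall, then start to rise.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best = early_stopping_iteration(losses)
print(best)  # 3
```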

embedding layer

#language
#fundamentals

A special hidden layer that trains on a high-dimensional categorical feature to gradually learn a lower dimension embedding vector. An embedding layer enables a neural network to train far more efficiently than training just on the high-dimensional categorical feature.

For example, Earth currently supports about 73,000 tree species. Suppose tree species is a feature in your model, so your model's input layer includes a one-hot vector 73,000 elements long. For example, perhaps baobab would be represented something like this:

An array of 73,000 elements. The first 6,232 elements hold the value
     0. The next element holds the value 1. The final 66,767 elements hold
     the value zero.

A 73,000-element array is very long. If you don't add an embedding layer to the model, training is going to be very time consuming due to multiplying 72,999 zeros. Perhaps you pick the embedding layer to consist of 12 dimensions. Consequently, the embedding layer will gradually learn a new embedding vector for each tree species.

In certain situations, hashing is a reasonable alternative to an embedding layer.

See Embeddings in Machine Learning Crash Course for more information.

epoch

#fundamentals

A full training pass over the entire training set such that each example has been processed once.

An epoch represents N / batch size training iterations , where N is the total number of examples.

For instance, suppose the following:

  • The dataset consists of 1,000 examples.
  • The batch size is 50 examples.

Therefore, a single epoch requires 20 iterations:

1 epoch = (N/batch size) = (1,000 / 50) = 20 iterations
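
The same calculation as a short sketch. Rounding up when the dataset size is not a multiple of the batch size (so that the final, smaller batch still counts as an iteration) is an assumption; the source only covers the evenly divisible case.

```python
import math


def iterations_per_epoch(num_examples: int, batch_size: int) -> int:
    """Return the number of training iterations needed to process every example once."""
    return math.ceil(num_examples / batch_size)


iterations = iterations_per_epoch(num_examples=1000, batch_size=50)
print(iterations)  # 20
```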

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

example

#fundamentals

The values of one row of features and possibly a label . Examples in supervised learning fall into two general categories:

  • A labeled example consists of one or more features and a label. Labeled examples are used during training.
  • An unlabeled example consists of one or more features but no label. Unlabeled examples are used during inference.

For instance, suppose you are training a model to determine the influence of weather conditions on student test scores. Here are three labeled examples:

| Temperature (feature) | Humidity (feature) | Pressure (feature) | Test score (label) |
|---|---|---|---|
| 15 | 47 | 998 | Good |
| 19 | 34 | 1020 | Excellent |
| 18 | 92 | 1012 | Poor |

Here are three unlabeled examples:

| Temperature | Humidity | Pressure |
|---|---|---|
| 12 | 62 | 1014 |
| 21 | 47 | 1017 |
| 19 | 41 | 1021 |

The row of a dataset is typically the raw source for an example. That is, an example typically consists of a subset of the columns in the dataset. Furthermore, the features in an example can also include synthetic features , such as feature crosses .

See Supervised Learning in the Introduction to Machine Learning course for more information.

F

false negative (FN)

#fundamentals
#Metric

An example in which the model mistakenly predicts the negative class . For example, the model predicts that a particular email message is not spam (the negative class), but that email message actually is spam .

false positive (FP)

#fundamentals
#Metric

An example in which the model mistakenly predicts the positive class . For example, the model predicts that a particular email message is spam (the positive class), but that email message is actually not spam .

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive rate (FPR)

#fundamentals
#Metric

The proportion of actual negative examples for which the model mistakenly predicted the positive class. The following formula calculates the false positive rate:

$$\text{false positive rate} = \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}$$

The false positive rate is the x-axis in an ROC curve .

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

feature

#fundamentals

An input variable to a machine learning model. An example consists of one or more features. For instance, suppose you are training a model to determine the influence of weather conditions on student test scores. The following table shows three examples, each of which contains three features and one label:

| Temperature (feature) | Humidity (feature) | Pressure (feature) | Test score (label) |
|---|---|---|---|
| 15 | 47 | 998 | 92 |
| 19 | 34 | 1020 | 84 |
| 18 | 92 | 1012 | 87 |

Contrast with label .

See Supervised Learning in the Introduction to Machine Learning course for more information.

feature cross

#fundamentals

A synthetic feature formed by "crossing" categorical or bucketed features.

For example, consider a "mood forecasting" model that represents temperature in one of the following four buckets:

  • freezing
  • chilly
  • temperate
  • warm

And represents wind speed in one of the following three buckets:

  • still
  • light
  • windy

Without feature crosses, the linear model trains independently on each of the preceding seven buckets. So, the model trains on, for example, freezing independently of the training on, for example, windy.

Alternatively, you could create a feature cross of temperature and wind speed. This synthetic feature would have the following 12 possible values:

  • freezing-still
  • freezing-light
  • freezing-windy
  • chilly-still
  • chilly-light
  • chilly-windy
  • temperate-still
  • temperate-light
  • temperate-windy
  • warm-still
  • warm-light
  • warm-windy

Thanks to feature crosses, the model can learn mood differences between a freezing-windy day and a freezing-still day.

If you create a synthetic feature from two features that each have a lot of different buckets, the resulting feature cross will have a huge number of possible combinations. For example, if one feature has 1,000 buckets and the other feature has 2,000 buckets, the resulting feature cross has 2,000,000 buckets.

Formally, a cross is a Cartesian product .
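
Because a cross is a Cartesian product, the 12 values listed above can be generated with `itertools.product`. The hyphenated naming simply mirrors the list above; the variable names are illustrative.

```python
from itertools import product

temperature_buckets = ["freezing", "chilly", "temperate", "warm"]
wind_buckets = ["still", "light", "windy"]

# Every (temperature, wind) pair becomes one value of the crossed feature.
crossed = [f"{t}-{w}" for t, w in product(temperature_buckets, wind_buckets)]

print(len(crossed))  # 12
print(crossed[:3])   # ['freezing-still', 'freezing-light', 'freezing-windy']
```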

Feature crosses are mostly used with linear models and are rarely used with neural networks.

See Categorical data: Feature crosses in Machine Learning Crash Course for more information.

feature engineering

#fundamentals
#TensorFlow

A process that involves the following steps:

  1. Determining which features might be useful in training a model.
  2. Converting raw data from the dataset into efficient versions of those features.

For example, you might determine that temperature might be a useful feature. Then, you might experiment with bucketing to optimize what the model can learn from different temperature ranges.

Feature engineering is sometimes called feature extraction or featurization .

See Numerical data: How a model ingests data using feature vectors in Machine Learning Crash Course for more information.

feature set

#fundamentals

The group of features your machine learning model trains on. For example, a simple feature set for a model that predicts housing prices might consist of postal code, property size, and property condition.

feature vector

#fundamentals

The array of feature values comprising an example . The feature vector is input during training and during inference . For example, the feature vector for a model with two discrete features might be:

[0.92, 0.56]

Four layers: an input layer, two hidden layers, and one output layer.
          The input layer contains two nodes, one containing the value
          0.92 and the other containing the value 0.56.

Each example supplies different values for the feature vector, so the feature vector for the next example could be something like:

[0.73, 0.49]

Feature engineering determines how to represent features in the feature vector. For example, a categorical feature with five possible values might be represented with one-hot encoding. In this case, the portion of the feature vector for a particular example would consist of four zeroes and a single 1.0 in the third position, as follows:

[0.0, 0.0, 1.0, 0.0, 0.0]

As another example, suppose your model consists of three features:

  • a categorical feature with five possible values represented with one-hot encoding; for example: [0.0, 1.0, 0.0, 0.0, 0.0]
  • another categorical feature with three possible values represented with one-hot encoding; for example: [0.0, 0.0, 1.0]
  • a floating-point feature; for example: 8.3.

In this case, the feature vector for each example would be represented by nine values. Given the example values in the preceding list, the feature vector would be:

[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 8.3]
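
The nine-value vector above can be assembled by concatenating the three encoded features. The `one_hot` helper is a hypothetical name introduced for this sketch.

```python
def one_hot(index: int, size: int) -> list:
    """Return a list of `size` zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


feature_vector = (
    one_hot(1, 5)    # first categorical feature, five possible values
    + one_hot(2, 3)  # second categorical feature, three possible values
    + [8.3]          # floating-point feature
)
print(feature_vector)
```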

See Numerical data: How a model ingests data using feature vectors in Machine Learning Crash Course for more information.

feedback loop

#fundamentals

In machine learning, a situation in which a model's predictions influence the training data for the same model or another model. For example, a model that recommends movies will influence the movies that people see, which will then influence subsequent movie recommendation models.

See Production ML systems: Questions to ask in Machine Learning Crash Course for more information.

G

generalization

#fundamentals

A model's ability to make correct predictions on new, previously unseen data. A model that can generalize is the opposite of a model that is overfitting .

See Generalization in Machine Learning Crash Course for more information.

generalization curve

#fundamentals

A plot of both training loss and validation loss as a function of the number of iterations .

A generalization curve can help you detect possible overfitting . For example, the following generalization curve suggests overfitting because validation loss ultimately becomes significantly higher than training loss.

A Cartesian graph in which the y-axis is labeled loss and the x-axis is labeled iterations. Two plots appear. One plot shows the training loss and the other shows the validation loss. The two plots start off similarly, but the training loss eventually dips far lower than the validation loss.

See Generalization in Machine Learning Crash Course for more information.

gradient descent

#fundamentals

A mathematical technique to minimize loss . Gradient descent iteratively adjusts weights and biases , gradually finding the best combination to minimize loss.
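As a minimal sketch of the idea, the following code fits a one-feature linear model by gradient descent on Mean Squared Error. The toy data, the starting values, the learning rate of 0.05, and the function names are all assumptions made for illustration.

```python
# Toy data generated from y = 2x + 1 (an assumption for this sketch).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def mean_squared_error(w, b):
    """Average squared difference between predictions w*x + b and labels."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(learning_rate=0.05, iterations=2000):
    """Iteratively adjust the weight and the bias to reduce the loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Partial derivatives of the Mean Squared Error w.r.t. w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = gradient_descent()
print(w, b, mean_squared_error(w, b))
```

On this toy data the weight and bias approach 2 and 1, the values used to generate the labels.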

Gradient descent is older—much, much older—than machine learning.

See Linear regression: Gradient descent in Machine Learning Crash Course for more information.

ground truth

#fundamentals

Reality.

The thing that actually happened.

For example, consider a binary classification model that predicts whether a student in their first year of university will graduate within six years. Ground truth for this model is whether or not that student actually graduated within six years.

H

hidden layer

#fundamentals

A layer in a neural network between the input layer (the features) and the output layer (the prediction). Each hidden layer consists of one or more neurons . For example, the following neural network contains two hidden layers, the first with three neurons and the second with two neurons:

Four layers. The first layer is an input layer containing two features. The second layer is a hidden layer containing three neurons. The third layer is a hidden layer containing two neurons. The fourth layer is an output layer. Each feature contains three edges, each of which points to a different neuron in the second layer. Each of the neurons in the second layer contains two edges, each of which points to a different neuron in the third layer. Each of the neurons in the third layer contains one edge, each pointing to the output layer.

A deep neural network contains more than one hidden layer. For example, the preceding illustration is a deep neural network because the model contains two hidden layers.

See Neural networks: Nodes and hidden layers in Machine Learning Crash Course for more information.

hyperparameter

#fundamentals

The variables that you or a hyperparameter tuning service adjust during successive runs of training a model. For example, learning rate is a hyperparameter. You could set the learning rate to 0.01 before one training session. If you determine that 0.01 is too high, you could perhaps set the learning rate to 0.003 for the next training session.

In contrast, parameters are the various weights and bias that the model learns during training.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

I

independently and identically distributed (iid)

#fundamentals

Data drawn from a distribution that doesn't change, and where each value drawn doesn't depend on values that have been drawn previously. Iid data is the ideal gas of machine learning: a useful mathematical construct that is almost never exactly found in the real world. For example, the distribution of visitors to a web page may be iid over a brief window of time; that is, the distribution doesn't change during that brief window and one person's visit is generally independent of another's visit. However, if you expand that window of time, seasonal differences in the web page's visitors may appear.

See also nonstationarity .

inference

#fundamentals

In machine learning, the process of making predictions by applying a trained model to unlabeled examples .

Inference has a somewhat different meaning in statistics. See the Wikipedia article on statistical inference for details.

See Supervised Learning in the Intro to ML course to see inference's role in a supervised learning system.

input layer

#fundamentals

The layer of a neural network that holds the feature vector . That is, the input layer provides examples for training or inference . For example, the input layer in the following neural network consists of two features:

Four layers: an input layer, two hidden layers, and an output layer.

interpretability

#fundamentals

The ability to explain or to present an ML model's reasoning in understandable terms to a human.

Most linear regression models, for example, are highly interpretable. (You merely need to look at the trained weights for each feature.) Decision forests are also highly interpretable. Some models, however, require sophisticated visualization to become interpretable.

You can use the Learning Interpretability Tool (LIT) to interpret ML models.

iteration

#fundamentals

A single update of a model's parameters—the model's weights and biases —during training . The batch size determines how many examples the model processes in a single iteration. For instance, if the batch size is 20, then the model processes 20 examples before adjusting the parameters.

When training a neural network , a single iteration involves the following two passes:

  1. A forward pass to evaluate loss on a single batch.
  2. A backward pass ( backpropagation ) to adjust the model's parameters based on the loss and the learning rate.

See Gradient descent in Machine Learning Crash Course for more information.

L

L0 regularization

#fundamentals

A type of regularization that penalizes the total number of nonzero weights in a model. For example, a model having 11 nonzero weights would be penalized more than a similar model having 10 nonzero weights.

L0 regularization is sometimes called L0-norm regularization.

L1 loss

#fundamentals
#Metric

A loss function that calculates the absolute value of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L1 loss for a batch of five examples:

Actual value of example | Model's predicted value | Absolute value of delta
7 | 6 | 1
5 | 4 | 1
8 | 11 | 3
4 | 6 | 2
9 | 8 | 1

L1 loss (sum of the absolute deltas) = 8

L1 loss is less sensitive to outliers than L2 loss.

The Mean Absolute Error is the average L1 loss per example.
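The calculation in the table can be checked with a few lines of code. This is a minimal sketch; the variable names are chosen for illustration, and the numbers are the five examples from the table.

```python
# The five examples from the table above.
actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]

# L1 loss: sum of the absolute differences over the batch.
l1_loss = sum(abs(a - p) for a, p in zip(actual, predicted))

# Mean Absolute Error: the average L1 loss per example.
mean_absolute_error = l1_loss / len(actual)

print(l1_loss, mean_absolute_error)
```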

See Linear regression: Loss in Machine Learning Crash Course for more information.

L1 regularization

#fundamentals

A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights. L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0. A feature with a weight of 0 is effectively removed from the model.

Contrast with L2 regularization.

L2 loss

#fundamentals
#Metric

A loss function that calculates the square of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L2 loss for a batch of five examples:

Actual value of example | Model's predicted value | Square of delta
7 | 6 | 1
5 | 4 | 1
8 | 11 | 9
4 | 6 | 4
9 | 8 | 1

L2 loss (sum of the squared deltas) = 16

Due to squaring, L2 loss amplifies the influence of outliers. That is, L2 loss reacts more strongly to bad predictions than L1 loss. For example, the L1 loss for the preceding batch would be 8 rather than 16. Notice that a single outlier accounts for 9 of the 16.

Regression models typically use L2 loss as the loss function.

The Mean Squared Error is the average L2 loss per example. Squared loss is another name for L2 loss.
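The same batch can be run through a short sketch to confirm the totals. The variable names are chosen for illustration; the numbers are the five examples from the table.

```python
# The five examples from the table above.
actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]

# Squared difference for each example.
squared_deltas = [(a - p) ** 2 for a, p in zip(actual, predicted)]

# L2 loss: sum of the squared differences over the batch.
l2_loss = sum(squared_deltas)

# Mean Squared Error: the average L2 loss per example.
mean_squared_error = l2_loss / len(actual)

# Share of the total loss contributed by the single largest error.
outlier_share = max(squared_deltas) / l2_loss

print(l2_loss, mean_squared_error, outlier_share)
```

The largest squared delta (9) contributes more than half of the total, which is the outlier effect described above.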

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

L2 regularization

#fundamentals

A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0. Features with weights very close to 0 remain in the model but don't influence the model's prediction very much.
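To see why this penalty targets outlier weights, consider a minimal sketch that computes the sum of squared weights. The weight values and the function name `l2_penalty` are hypothetical, chosen only for illustration.

```python
def l2_penalty(weights):
    """Sum of the squares of the weights."""
    return sum(w ** 2 for w in weights)

# Hypothetical weights; 5.0 is the outlier.
weights = [0.2, 0.5, 5.0, -0.3]

penalty = l2_penalty(weights)

# Fraction of the penalty that comes from the outlier weight alone.
outlier_share = 5.0 ** 2 / penalty

print(penalty, outlier_share)
```

Because the single large weight contributes almost all of the penalty, shrinking that weight reduces the penalty far more than shrinking any of the small ones.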

L2 regularization often improves generalization in linear models.

Contrast with L1 regularization.

See Overfitting: L2 regularization in Machine Learning Crash Course for more information.

label

#fundamentals

In supervised machine learning , the "answer" or "result" portion of an example .

Each labeled example consists of one or more features and a label. For example, in a spam detection dataset, the label would probably be either "spam" or "not spam." In a rainfall dataset, the label might be the amount of rain that fell during a certain period.

See Supervised Learning in Introduction to Machine Learning for more information.

labeled example

#fundamentals

An example that contains one or more features and a label . For example, the following table shows three labeled examples from a house valuation model, each with three features and one label:

Number of bedrooms Number of bathrooms House age House price (label)
3 2 15 $345,000
2 1 72 $179,000
4 2 34 $392,000

In supervised machine learning , models train on labeled examples and make predictions on unlabeled examples .

Contrast labeled example with unlabeled examples.

See Supervised Learning in Introduction to Machine Learning for more information.

lambda

#fundamentals

Synonym for regularization rate .

Lambda is an overloaded term. Here we're focusing on the term's definition within regularization .

layer

#fundamentals

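A set of neurons in a neural network. Three common types of layers are as follows:

  • The input layer, which holds the feature values.
  • One or more hidden layers, which sit between the input layer and the output layer.
  • The output layer, which contains the prediction.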

For example, the following illustration shows a neural network with one input layer, two hidden layers, and one output layer:

A neural network with one input layer, two hidden layers, and one output layer. The input layer consists of two features. The first hidden layer consists of three neurons and the second hidden layer consists of two neurons. The output layer consists of a single node.

In TensorFlow, layers are also Python functions that take tensors and configuration options as input and produce other tensors as output.

learning rate

#fundamentals

A floating-point number that tells the gradient descent algorithm how strongly to adjust weights and biases on each iteration . For example, a learning rate of 0.3 would adjust weights and biases three times more powerfully than a learning rate of 0.1.

Learning rate is a key hyperparameter . If you set the learning rate too low, training will take too long. If you set the learning rate too high, gradient descent often has trouble reaching convergence .
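A small sketch can make both points concrete. It assumes the plain gradient descent update (new weight = weight minus learning rate times gradient) and a toy loss of w squared; both are assumptions made for illustration, as are the function names.

```python
def step(weight, gradient, learning_rate):
    """One plain gradient descent update for a single weight."""
    return weight - learning_rate * gradient

# Same weight and gradient, two learning rates.
change_small = 1.0 - step(1.0, 2.0, 0.1)
change_large = 1.0 - step(1.0, 2.0, 0.3)
ratio = change_large / change_small  # roughly 3

def minimize_square(learning_rate, iterations=20, start=1.0):
    """Run gradient descent on the toy loss w**2, whose gradient is 2*w."""
    w = start
    for _ in range(iterations):
        w = step(w, 2 * w, learning_rate)
    return w

converged = minimize_square(0.1)  # shrinks toward the minimum at 0
diverged = minimize_square(1.1)   # overshoots further on every step

print(ratio, converged, diverged)
```

With a rate of 0.1 the weight moves steadily toward 0, while a rate of 1.1 makes each step overshoot the minimum by more than the previous one.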

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

linear

#fundamentals

A relationship between two or more variables that can be represented solely through addition and multiplication.

The plot of a linear relationship is a line.

Contrast with nonlinear .

linear model

#fundamentals

A model that assigns one weight per feature to make predictions . (Linear models also incorporate a bias .) In contrast, the relationship of features to predictions in deep models is generally nonlinear .

Linear models are usually easier to train and more interpretable than deep models. However, deep models can learn complex relationships between features.

Linear regression and logistic regression are two types of linear models.

linear regression

#fundamentals

A type of machine learning model in which both of the following are true:

  • The model is a linear model .
  • The prediction is a floating-point value. (This is the regression part of linear regression .)

Contrast linear regression with logistic regression . Also, contrast regression with classification .

See Linear regression in Machine Learning Crash Course for more information.

logistic regression

#fundamentals

A type of regression model that predicts a probability. Logistic regression models have the following characteristics:

  • The label is categorical . The term logistic regression usually refers to binary logistic regression , that is, to a model that calculates probabilities for labels with two possible values. A less common variant, multinomial logistic regression , calculates probabilities for labels with more than two possible values.
  • The loss function during training is Log Loss . (Multiple Log Loss units can be placed in parallel for labels with more than two possible values.)
  • The model has a linear architecture, not a deep neural network. However, the remainder of this definition also applies to deep models that predict probabilities for categorical labels.

For example, consider a logistic regression model that calculates the probability of an input email being either spam or not spam. During inference, suppose the model predicts 0.72. Therefore, the model is estimating:

  • A 72% chance of the email being spam.
  • A 28% chance of the email not being spam.

A logistic regression model uses the following two-step architecture:

  1. The model generates a raw prediction (y') by applying a linear function of input features.
  2. The model uses that raw prediction as input to a sigmoid function , which converts the raw prediction to a value between 0 and 1, exclusive.

Like any regression model, a logistic regression model predicts a number. However, this number typically becomes part of a binary classification model as follows:

  • If the predicted number is greater than the classification threshold , the binary classification model predicts the positive class.
  • If the predicted number is less than the classification threshold, the binary classification model predicts the negative class.
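The two-step architecture and the thresholding described above can be sketched as follows. The weights, bias, feature values, and the threshold of 0.5 are hypothetical. Treating a prediction exactly equal to the threshold as negative is also an assumption, since the definition above does not cover that case.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1, exclusive."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    # Step 1: raw prediction from a linear function of the input features.
    raw_prediction = bias + sum(w * x for w, x in zip(weights, features))
    # Step 2: convert the raw prediction to a probability.
    return sigmoid(raw_prediction)

def classify(probability, threshold=0.5):
    """Turn the predicted probability into a class prediction."""
    return "positive" if probability > threshold else "negative"

# Hypothetical trained parameters and one example.
weights = [0.8, -0.4]
bias = 0.1
features = [2.0, 1.0]

probability = predict_probability(features, weights, bias)
print(probability, classify(probability))
```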

See Logistic regression in Machine Learning Crash Course for more information.

Log Loss

#fundamentals

The loss function used in binary logistic regression .
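For reference, here is a minimal sketch of the standard binary Log Loss formula, where y is the label (0 or 1) and p is the predicted probability. Averaging over the batch rather than summing is an assumption of this sketch, and the function name is chosen for illustration.

```python
import math

def log_loss(labels, probabilities):
    """Average of -(y*log(p) + (1 - y)*log(1 - p)) over the batch.

    Each probability must be strictly between 0 and 1.
    """
    total = 0.0
    for y, p in zip(labels, probabilities):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

labels = [1, 0]
confident = log_loss(labels, [0.9, 0.1])  # good, confident predictions
hesitant = log_loss(labels, [0.6, 0.4])   # correct side, less confident
coin_flip = log_loss([1], [0.5])          # equals ln(2)

print(confident, hesitant, coin_flip)
```

Predictions that put more probability on the correct label receive a lower loss.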

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

log-odds

#fundamentals

The logarithm of the odds of some event.
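As a short worked sketch: if an event has probability p, its odds are p / (1 - p), and the log-odds is the natural logarithm of that ratio. The function names below are chosen for illustration. The sigmoid function is the inverse mapping, which is why it can turn a raw prediction back into a probability.

```python
import math

def log_odds(probability):
    """Natural log of the odds p / (1 - p); requires 0 < p < 1."""
    return math.log(probability / (1.0 - probability))

def sigmoid(z):
    """Inverse of log_odds: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

even = log_odds(0.5)        # odds of 1 to 1
likely = log_odds(0.9)      # odds of 9 to 1
round_trip = sigmoid(log_odds(0.72))

print(even, likely, round_trip)
```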

loss

#fundamentals
#Metric

During the training of a supervised model , a measure of how far a model's prediction is from its label .

A loss function calculates the loss.

See Linear regression: Loss in Machine Learning Crash Course for more information.

loss curve

#fundamentals

A plot of loss as a function of the number of training iterations . The following plot shows a typical loss curve:

A Cartesian graph of loss versus training iterations, showing a rapid drop in loss for the initial iterations, followed by a gradual drop, and then a flat slope during the final iterations.

Loss curves can help you determine when your model is converging or overfitting .

Loss curves can plot all of the following types of loss:

  • training loss
  • validation loss
  • test loss

See also generalization curve .

See Overfitting: Interpreting loss curves in Machine Learning Crash Course for more information.

loss function

#fundamentals
#Metric

During training or testing, a mathematical function that calculates the loss on a batch of examples. A loss function returns a lower loss for models that make good predictions than for models that make bad predictions.

The goal of training is typically to minimize the loss that a loss function returns.

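Many different kinds of loss functions exist. Pick the appropriate loss function for the kind of model you are building. For example:

  • L2 loss (or Mean Squared Error) is the loss function for linear regression.
  • Log Loss is the loss function for logistic regression.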

M

machine learning

#fundamentals

A program or system that trains a model from input data. The trained model can make useful predictions from new (never-before-seen) data drawn from the same distribution as the one used to train the model.

Machine learning also refers to the field of study concerned with these programs or systems.

See the Introduction to Machine Learning course for more information.

majority class

#fundamentals

The more common label in a class-imbalanced dataset . For example, given a dataset containing 99% negative labels and 1% positive labels, the negative labels are the majority class.

Contrast with minority class .

See Datasets: Imbalanced datasets in Machine Learning Crash Course for more information.

mini-batch

#fundamentals

A small, randomly selected subset of a batch processed in one iteration . The batch size of a mini-batch is usually between 10 and 1,000 examples.

For example, suppose the entire training set (the full batch) consists of 1,000 examples. Further suppose that you set the batch size of each mini-batch to 20. Therefore, each iteration determines the loss on a random 20 of the 1,000 examples and then adjusts the weights and biases accordingly.
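The sampling step in this example can be sketched with the standard library. Drawing without replacement and using a fixed seed are assumptions of this sketch, as are the function and variable names; the integers merely stand in for real examples.

```python
import random

def sample_mini_batch(training_set, batch_size, rng):
    """Pick `batch_size` distinct examples at random from the training set."""
    return rng.sample(training_set, batch_size)

# Stand-in for a training set of 1,000 examples.
training_set = list(range(1000))

# A seeded generator keeps the sketch reproducible.
rng = random.Random(0)

# One iteration would compute the loss on these 20 examples only,
# then adjust the weights and biases.
mini_batch = sample_mini_batch(training_set, 20, rng)
print(len(mini_batch))
```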

It is much more efficient to calculate the loss on a mini-batch than the loss on all the examples in the full batch.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

minority class

#fundamentals

The less common label in a class-imbalanced dataset . For example, given a dataset containing 99% negative labels and 1% positive labels, the positive labels are the minority class.

Contrast with majority class .

See Datasets: Imbalanced datasets in Machine Learning Crash Course for more information.

model

#fundamentals

In general, any mathematical construct that processes input data and returns output. Phrased differently, a model is the set of parameters and structure needed for a system to make predictions. In supervised machine learning, a model takes an example as input and infers a prediction as output. Within supervised machine learning, models differ somewhat. For example:

  • A linear regression model consists of a set of weights and a bias .
  • A neural network model consists of:
    • A set of hidden layers , each containing one or more neurons .
    • The weights and bias associated with each neuron.
  • A decision tree model consists of:
    • The shape of the tree; that is, the pattern in which the conditions and leaves are connected.
    • The conditions and leaves.

You can save, restore, or make copies of a model.

Unsupervised machine learning also generates models, typically a function that can map an input example to the most appropriate cluster .

multi-class classification

#fundamentals

In supervised learning, a classification problem in which the dataset contains more than two classes of labels. For example, the labels in the Iris dataset must be one of the following three classes:

  • Iris setosa
  • Iris virginica
  • Iris versicolor

A model trained on the Iris dataset that predicts Iris type on new examples is performing multi-class classification.

In contrast, classification problems that distinguish between exactly two classes are binary classification problems. For example, an email model that predicts either spam or not spam is a binary classification model.

In clustering problems, multi-class classification refers to more than two clusters.

See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.

N

negative class

#fundamentals
#Metric

In binary classification, one class is termed positive and the other is termed negative. The positive class is the thing or event that the model is testing for and the negative class is the other possibility. For example:

  • The negative class in a medical test might be "not tumor."
  • The negative class in an email classification model might be "not spam."

Contrast with positive class .

neural network

#fundamentals

A model containing at least one hidden layer . A deep neural network is a type of neural network containing more than one hidden layer. For example, the following diagram shows a deep neural network containing two hidden layers.

A neural network with an input layer, two hidden layers, and an output layer.

Each neuron in a neural network connects to all of the nodes in the next layer. For example, in the preceding diagram, notice that each of the three neurons in the first hidden layer separately connects to both of the two neurons in the second hidden layer.

Neural networks implemented on computers are sometimes called artificial neural networks to differentiate them from neural networks found in brains and other nervous systems.

Some neural networks can mimic extremely complex nonlinear relationships between different features and the label.

See also convolutional neural network and recurrent neural network .

See Neural networks in Machine Learning Crash Course for more information.

neuron

#fundamentals

In machine learning, a distinct unit within a hidden layer of a neural network . Each neuron performs the following two-step action:

  1. Calculates the weighted sum of input values multiplied by their corresponding weights.
  2. Passes the weighted sum as input to an activation function .
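The two steps above can be sketched in a few lines. Using ReLU as the activation function and adding a per-neuron bias to the weighted sum are assumptions of this sketch, and the input and weight values are hypothetical.

```python
def relu(x):
    """Return 0 for negative or zero input; otherwise return the input."""
    return max(0.0, x)

def neuron_output(inputs, weights, bias, activation=relu):
    # Step 1: weighted sum of the inputs (plus the neuron's bias).
    weighted_sum = bias + sum(w * x for w, x in zip(weights, inputs))
    # Step 2: pass the weighted sum to the activation function.
    return activation(weighted_sum)

inputs = [1.0, 2.0]

active = neuron_output(inputs, weights=[0.5, 1.0], bias=0.25)
inactive = neuron_output(inputs, weights=[0.5, -1.0], bias=0.25)

print(active, inactive)
```

In the second call the weighted sum is negative, so ReLU outputs 0.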

A neuron in the first hidden layer accepts inputs from the feature values in the input layer . A neuron in any hidden layer beyond the first accepts inputs from the neurons in the preceding hidden layer. For example, a neuron in the second hidden layer accepts inputs from the neurons in the first hidden layer.

The following illustration highlights two neurons and their inputs.

A neural network with an input layer, two hidden layers, and an output layer. Two neurons are highlighted: one in the first hidden layer and one in the second hidden layer. The highlighted neuron in the first hidden layer receives inputs from both features in the input layer. The highlighted neuron in the second hidden layer receives inputs from each of the three neurons in the first hidden layer.

A neuron in a neural network mimics the behavior of neurons in brains and other parts of nervous systems.

node (neural network)

#fundamentals

A neuron in a hidden layer .

See Neural Networks in Machine Learning Crash Course for more information.

nonlinear

#fundamentals

A relationship between two or more variables that can't be represented solely through addition and multiplication. A linear relationship can be represented as a line; a nonlinear relationship can't be represented as a line. For example, consider two models that each relate a single feature to a single label. The model on the left is linear and the model on the right is nonlinear:

Two plots. One plot is a line, so this is a linear relationship. The other plot is a curve, so this is a nonlinear relationship.

See Neural networks: Nodes and hidden layers in Machine Learning Crash Course to experiment with different kinds of nonlinear functions.

nonstationarity

#fundamentals

A feature whose values change across one or more dimensions, usually time. For example, consider the following examples of nonstationarity:

  • The number of swimsuits sold at a particular store varies with the season.
  • The quantity of a particular fruit harvested in a particular region is zero for much of the year but large for a brief period.
  • Due to climate change, annual mean temperatures are shifting.

Contrast with stationarity .

normalization

#fundamentals

Broadly speaking, the process of converting a variable's actual range of values into a standard range of values, such as:

  • -1 to +1
  • 0 to 1
  • Z-scores (roughly, -3 to +3)

For example, suppose the actual range of values of a certain feature is 800 to 2,400. As part of feature engineering , you could normalize the actual values down to a standard range, such as -1 to +1.

Normalization is a common task in feature engineering . Models usually train faster (and produce better predictions) when every numerical feature in the feature vector has roughly the same range.
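As an illustration of the 800 to 2,400 example, here is a minimal sketch of linear scaling into a target range. Linear scaling is only one possible normalization technique, and the function name `scale_to_range` and its parameters are hypothetical.

```python
def scale_to_range(value, old_min, old_max, new_min=-1.0, new_max=1.0):
    """Linearly map `value` from [old_min, old_max] to [new_min, new_max]."""
    fraction = (value - old_min) / (old_max - old_min)
    return new_min + fraction * (new_max - new_min)

# The feature's actual range in the example is 800 to 2,400.
lowest = scale_to_range(800, 800, 2400)
middle = scale_to_range(1600, 800, 2400)
highest = scale_to_range(2400, 800, 2400)

print(lowest, middle, highest)
```

The endpoints of the actual range map to -1 and +1, and the midpoint maps to 0.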

See also Z-score normalization .

See Numerical Data: Normalization in Machine Learning Crash Course for more information.

numerical data

#fundamentals

Features represented as integers or real-valued numbers. For example, a house valuation model would probably represent the size of a house (in square feet or square meters) as numerical data. Representing a feature as numerical data indicates that the feature's values have a mathematical relationship to the label. That is, the number of square meters in a house probably has some mathematical relationship to the value of the house.

Not all integer data should be represented as numerical data. For example, postal codes in some parts of the world are integers; however, integer postal codes shouldn't be represented as numerical data in models. That's because a postal code of 20000 is not twice (or half) as potent as a postal code of 10000. Furthermore, although different postal codes do correlate to different real estate values, we can't assume that real estate values at postal code 20000 are twice as valuable as real estate values at postal code 10000. Postal codes should be represented as categorical data instead.

Numerical features are sometimes called continuous features .

See Working with numerical data in Machine Learning Crash Course for more information.

O

offline

#fundamentals

Synonym for static .

offline inference

#fundamentals

The process of a model generating a batch of predictions and then caching (saving) those predictions. Apps can then access the inferred prediction from the cache rather than rerunning the model.

For example, consider a model that generates local weather forecasts (predictions) once every four hours. After each model run, the system caches all the local weather forecasts. Weather apps retrieve the forecasts from the cache.

Offline inference is also called static inference .

Contrast with online inference .

See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

one-hot encoding

#fundamentals

Representing categorical data as a vector in which:

  • One element is set to 1.
  • All other elements are set to 0.

One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For example, suppose a certain categorical feature named Scandinavia has five possible values:

  • "Denmark"
  • "Sweden"
  • "Norway"
  • "Finland"
  • "Iceland"

One-hot encoding could represent each of the five values as follows:

Country | Vector
"Denmark" | 1 0 0 0 0
"Sweden" | 0 1 0 0 0
"Norway" | 0 0 1 0 0
"Finland" | 0 0 0 1 0
"Iceland" | 0 0 0 0 1

Thanks to one-hot encoding, a model can learn different connections based on each of the five countries.
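The encoding in the table can be sketched in plain Python. The function name `one_hot_encode` and the constant `COUNTRIES` are hypothetical, and raising an error for an unknown value is a design choice of this sketch rather than something the definition requires.

```python
# The five possible values of the Scandinavia feature, in table order.
COUNTRIES = ["Denmark", "Sweden", "Norway", "Finland", "Iceland"]

def one_hot_encode(value, vocabulary):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in vocabulary:
        raise ValueError(f"unknown value: {value!r}")
    return [1 if item == value else 0 for item in vocabulary]

norway = one_hot_encode("Norway", COUNTRIES)
iceland = one_hot_encode("Iceland", COUNTRIES)

print(norway, iceland)
```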

Representing a feature as numerical data is an alternative to one-hot encoding. Unfortunately, representing the Scandinavian countries numerically is not a good choice. For example, consider the following numeric representation:

  • "Denmark" is 0
  • "Sweden" is 1
  • "Norway" is 2
  • "Finland" is 3
  • "Iceland" is 4

With numeric encoding, a model would interpret the raw numbers mathematically and would try to train on those numbers. However, Iceland isn't actually twice as much (or half as much) of something as Norway, so the model would come to some strange conclusions.

See Categorical data: Vocabulary and one-hot encoding in Machine Learning Crash Course for more information.

one-vs.-all

#fundamentals

Given a classification problem with N classes, a solution consisting of N separate binary classifiers —one binary classifier for each possible outcome. For example, given a model that classifies examples as animal, vegetable, or mineral, a one-vs.-all solution would provide the following three separate binary classifiers:

  • animal versus not animal
  • vegetable versus not vegetable
  • mineral versus not mineral

online

#fundamentals

Synonym for dynamic .

online inference

#fundamentals

Generating predictions on demand. For example, suppose an app passes input to a model and issues a request for a prediction. A system using online inference responds to the request by running the model (and returning the prediction to the app).

Contrast with offline inference .

See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

output layer

#fundamentals

The "final" layer of a neural network. The output layer contains the prediction.

The following illustration shows a small deep neural network with an input layer, two hidden layers, and an output layer:

A neural network with one input layer, two hidden layers, and one output layer. The input layer consists of two features. The first hidden layer consists of three neurons and the second hidden layer consists of two neurons. The output layer consists of a single node.

overfitting

#fundamentals

Creating a model that matches the training data so closely that the model fails to make correct predictions on new data.

Regularization can reduce overfitting. Training on a large and diverse training set can also reduce overfitting.

See Overfitting in Machine Learning Crash Course for more information.

P

pandas

#fundamentals

A column-oriented data analysis API built on top of numpy . Many machine learning frameworks, including TensorFlow, support pandas data structures as inputs. See the pandas documentation for details.

parameter

#fundamentals

The weights and biases that a model learns during training. For example, in a linear regression model, the parameters consist of the bias (b) and all the weights (w1, w2, and so on) in the following formula:

$$y' = b + w_1x_1 + w_2x_2 + \dots + w_nx_n$$

In contrast, hyperparameters are the values that you (or a hyperparameter tuning service) supply to the model. For example, learning rate is a hyperparameter.
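As a small worked sketch of the linear regression formula, the following function computes a prediction from a bias, a list of weights, and a list of feature values. The numbers and the function name `predict` are hypothetical; in practice the bias and weights would be learned during training.

```python
def predict(features, weights, bias):
    """Compute y' = b + w_1*x_1 + w_2*x_2 + ... + w_n*x_n."""
    return bias + sum(w * x for w, x in zip(weights, features))

# Hypothetical parameters (bias and weights) and one example's features.
bias = 1.0
weights = [2.0, 3.0]
features = [4.0, 5.0]

prediction = predict(features, weights, bias)
print(prediction)  # 1 + 2*4 + 3*5
```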

positive class

#fundamentals
#Metric

The class you are testing for.

For example, the positive class in a cancer model might be "tumor." The positive class in an email classification model might be "spam."

Contrast with negative class .

post-processing

#responsible
#fundamentals

Adjusting the output of a model after the model has been run. Post-processing can be used to enforce fairness constraints without modifying models themselves.

For example, one might apply post-processing to a binary classifier by setting a classification threshold such that equality of opportunity is maintained for some attribute by checking that the true positive rate is the same for all values of that attribute.

prediction

#fundamentals

A model's output. For example:

  • The prediction of a binary classification model is either the positive class or the negative class.
  • The prediction of a multi-class classification model is one class.
  • The prediction of a linear regression model is a number.

proxy labels

#fundamentals

Data used to approximate labels not directly available in a dataset.

For example, suppose you must train a model to predict employee stress level. Your dataset contains a lot of predictive features but doesn't contain a label named stress level. Undaunted, you pick "workplace accidents" as a proxy label for stress level. After all, employees under high stress get into more accidents than calm employees. Or do they? Maybe workplace accidents actually rise and fall for multiple reasons.

As a second example, suppose you want "is it raining?" to be a Boolean label for your dataset, but your dataset doesn't contain rain data. If photographs are available, you might establish pictures of people carrying umbrellas as a proxy label for "is it raining?" Is that a good proxy label? Possibly, but people in some cultures may be more likely to carry umbrellas to protect against the sun than against the rain.

Proxy labels are often imperfect. When possible, choose actual labels over proxy labels. That said, when an actual label is absent, pick the proxy label very carefully, choosing the least horrible proxy label candidate.

See Datasets: Labels in Machine Learning Crash Course for more information.

R

RAG

#fundamentals

Abbreviation for retrieval-augmented generation .

rater

#fundamentals

A human who provides labels for examples . "Annotator" is another name for rater.

See Categorical data: Common issues in Machine Learning Crash Course for more information.

Rectified Linear Unit (ReLU)

#fundamentals

An activation function with the following behavior:

  • If input is negative or zero, then the output is 0.
  • If input is positive, then the output is equal to the input.

For example:

  • If the input is -3, then the output is 0.
  • If the input is +3, then the output is 3.0.
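The behavior described above fits in a one-line function. This is a minimal sketch; converting the input to a float is a choice made here so that the output matches the 3.0 shown in the example.

```python
def relu(x):
    """Return 0.0 for negative or zero input; otherwise return the input."""
    return max(0.0, float(x))

print(relu(-3), relu(0), relu(3))
```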

Here is a plot of ReLU:

A Cartesian plot of two lines. The first line has a constant y value of 0, running along the x-axis from (-infinity, 0) to (0, 0). The second line starts at (0, 0). This line has a slope of +1, so it runs from (0, 0) to (+infinity, +infinity).

ReLU is a very popular activation function. Despite its simple behavior, ReLU still enables a neural network to learn nonlinear relationships between features and the label .
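
The behavior described above fits in one line of code. The following is a minimal sketch rather than any library's implementation; the function name `relu` is chosen here for illustration:

```python
def relu(x: float) -> float:
    """Return 0.0 for negative or zero input, otherwise the input itself."""
    return max(0.0, float(x))


print(relu(-3))  # 0.0
print(relu(3))   # 3.0
```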

regression model

#fundamentals

Informally, a model that generates a numerical prediction. (In contrast, a classification model generates a class prediction.) For example, the following are all regression models:

  • A model that predicts a certain house's value in Euros, such as 423,000.
  • A model that predicts a certain tree's life expectancy in years, such as 23.2.
  • A model that predicts the amount of rain in inches that will fall in a certain city over the next six hours, such as 0.18.

Two common types of regression models are:

  • Linear regression , which finds the line that best fits label values to features.
  • Logistic regression , which generates a probability between 0.0 and 1.0 that a system typically then maps to a class prediction.

Not every model that outputs numerical predictions is a regression model. In some cases, a numeric prediction is really just a classification model that happens to have numeric class names. For example, a model that predicts a numeric postal code is a classification model, not a regression model.

regularization

#fundamentals

Any mechanism that reduces overfitting . Popular types of regularization include:

Regularization can also be defined as the penalty on a model's complexity.

See Overfitting: Model complexity in Machine Learning Crash Course for more information.

regularization rate

#fundamentals

A number that specifies the relative importance of regularization during training. Raising the regularization rate reduces overfitting but may reduce the model's predictive power. Conversely, reducing or omitting the regularization rate increases overfitting.
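
As a rough illustration, the sketch below assumes L2 regularization, where the penalty is the sum of the squared weights and the regularization rate multiplies that penalty before it is added to the loss. The function name and the sample numbers are hypothetical:

```python
def regularized_loss(data_loss, weights, regularization_rate):
    """Add an L2 penalty, scaled by the regularization rate, to the data loss."""
    l2_penalty = sum(w * w for w in weights)
    return data_loss + regularization_rate * l2_penalty


weights = [3.0, -4.0]  # squared weights sum to 25
print(regularized_loss(1.0, weights, regularization_rate=0.0))  # penalty ignored
print(regularized_loss(1.0, weights, regularization_rate=0.1))  # penalty matters more
```

With a rate of 0, training only minimizes the data loss; as the rate grows, large weights become increasingly expensive.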

See Overfitting: L2 regularization in Machine Learning Crash Course for more information.

ReLU

#fundamentals

Abbreviation for Rectified Linear Unit .

retrieval-augmented generation (RAG)

#fundamentals

A technique for improving the quality of large language model (LLM) output by grounding it with sources of knowledge retrieved after the model was trained. RAG improves the accuracy of LLM responses by providing the trained LLM with access to information retrieved from trusted knowledge bases or documents.

Common motivations to use retrieval-augmented generation include:

  • Increasing the factual accuracy of a model's generated responses.
  • Giving the model access to knowledge it was not trained on.
  • Changing the knowledge that the model uses.
  • Enabling the model to cite sources.

For example, suppose that a chemistry app uses the PaLM API to generate summaries related to user queries. When the app's backend receives a query, the backend:

  1. Searches for ("retrieves") data that's relevant to the user's query.
  2. Appends ("augments") the relevant chemistry data to the user's query.
  3. Instructs the LLM to create a summary based on the appended data.
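
The three steps above can be sketched as follows. This is a toy illustration, not the PaLM API: `retrieve` and `augment` are hypothetical helpers, the word-overlap ranking stands in for a real retrieval system, and the final call to the LLM is deliberately left out.

```python
def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Step 1: rank documents by how many words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def augment(query: str, passages: list[str]) -> str:
    """Steps 2 and 3: append the retrieved data and state the instruction."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return (
        "Summarize an answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


documents = [
    "Sodium chloride dissolves readily in water.",
    "Helium is a noble gas.",
    "Water boils at 100 degrees Celsius at sea level.",
]
query = "At what temperature does water boil?"
prompt = augment(query, retrieve(query, documents))
print(prompt)  # this prompt is what the backend would send to the LLM
```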

ROC (receiver operating characteristic) Curve

#fundamentals
#Metric

A graph of true positive rate versus false positive rate for different classification thresholds in binary classification.

The shape of an ROC curve suggests a binary classification model's ability to separate positive classes from negative classes. Suppose, for example, that a binary classification model perfectly separates all the negative classes from all the positive classes:

A number line with 8 positive examples on the right side and
          7 negative examples on the left.

The ROC curve for the preceding model looks as follows:

An ROC curve. The x-axis is False Positive Rate and the y-axis
          is True Positive Rate. The curve has an inverted L shape. The curve
          starts at (0.0,0.0) and goes straight up to (0.0,1.0). Then the curve
          goes from (0.0,1.0) to (1.0,1.0).

In contrast, the following illustration graphs the raw logistic regression values for a terrible model that can't separate negative classes from positive classes at all:

A number line with positive examples and negative examples completely intermixed.

The ROC curve for this model looks as follows:

An ROC curve, which is actually a straight line from (0.0,0.0)
          to (1.0,1.0).

Meanwhile, back in the real world, most binary classification models separate positive and negative classes to some degree, but usually not perfectly. So, a typical ROC curve falls somewhere between the two extremes:

An ROC curve. The x-axis is False Positive Rate and the y-axis
          is True Positive Rate. The ROC curve approximates a shaky arc
          traversing the compass points from West to North.

The point on an ROC curve closest to (0.0,1.0) theoretically identifies the ideal classification threshold. However, several other real-world issues influence the selection of the ideal classification threshold. For example, perhaps false negatives cause far more pain than false positives.

A numerical metric called AUC summarizes the ROC curve into a single floating-point value.
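
To make the construction concrete, here is a minimal sketch that computes the points of an ROC curve by sweeping the classification threshold. The function name `roc_points`, the labels, and the scores are made up for illustration; the data mimics the perfectly separated model described earlier in this entry.

```python
def roc_points(labels, scores, thresholds):
    """Return one (false positive rate, true positive rate) pair per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in thresholds:
        # An example is predicted positive when its score exceeds the threshold.
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s > threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s > threshold)
        points.append((fp / negatives, tp / positives))
    return points


labels = [0, 0, 1, 1]          # 1 = positive class, 0 = negative class
scores = [0.1, 0.4, 0.6, 0.9]  # the model's raw outputs
print(roc_points(labels, scores, thresholds=[1.0, 0.5, 0.0]))
# [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
```

The three points trace the inverted L shape of a perfect separator.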

Root Mean Squared Error (RMSE)

#fundamentals
#Metric

The square root of the Mean Squared Error .
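
A minimal sketch of that calculation, assuming Mean Squared Error is the average of the squared differences between labels and predictions (the function name and sample values are illustrative):

```python
import math


def rmse(labels, predictions):
    """Square root of the mean of the squared prediction errors."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    mean_squared_error = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mean_squared_error)


# Every prediction is off by 2, so the MSE is 4 and the RMSE is 2.
print(rmse([1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]))  # 2.0
```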

S

sigmoid function

#fundamentals

A mathematical function that "squishes" an input value into a constrained range, typically 0 to 1 or -1 to +1. That is, you can pass any number (two, a million, negative billion, whatever) to a sigmoid and the output will still be in the constrained range. A plot of the sigmoid activation function looks as follows:

A two-dimensional curved plot with x values spanning the domain -infinity to +infinity, while y values span the range almost 0 to almost 1. When x is 0, y is 0.5. The slope of the curve is always positive, with the highest slope at (0, 0.5) and gradually decreasing slopes as the absolute value of x increases.
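
The following sketch implements the logistic sigmoid, which is the 0-to-1 variant plotted above. The two-branch form is a design choice made here so that very large negative inputs don't overflow `math.exp`:

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: squishes any real number into the range 0 to 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use an equivalent form that never exponentiates a large number.
    z = math.exp(x)
    return z / (1.0 + z)


print(sigmoid(0))  # 0.5
for value in (2, 1_000_000, -1_000_000_000):
    assert 0.0 <= sigmoid(value) <= 1.0
```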

The sigmoid function has several uses in machine learning, including:

softmax

#fundamentals

A function that determines probabilities for each possible class in a multi-class classification model . The probabilities add up to exactly 1.0. For example, the following table shows how softmax distributes various probabilities:

Image is a...    Probability
dog              .85
cat              .13
horse            .02

Softmax is also called full softmax .
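
As a minimal sketch, softmax exponentiates each raw score and divides by the sum of the exponentials. The input scores below are made up, and subtracting the maximum score first is a common numerical-stability choice that doesn't change the result:

```python
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    shift = max(scores)  # keeps math.exp from overflowing on large scores
    exponentials = [math.exp(s - shift) for s in scores]
    total = sum(exponentials)
    return [e / total for e in exponentials]


probabilities = softmax([2.0, 0.1, -1.8])  # hypothetical scores for dog, cat, horse
print(probabilities)
print(sum(probabilities))  # 1.0, up to floating-point rounding
```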

Contrast with candidate sampling .

See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.

sparse feature

#language
#fundamentals

A feature whose values are predominantly zero or empty. For example, a feature containing a single 1 value and a million 0 values is sparse. In contrast, a dense feature has values that are predominantly not zero or empty.

In machine learning, a surprising number of features are sparse features. Categorical features are usually sparse features. For example, of the 300 possible tree species in a forest, a single example might identify just a maple tree . Or, of the millions of possible videos in a video library, a single example might identify just "Casablanca."

In a model, you typically represent sparse features with one-hot encoding . If the one-hot encoding is big, you might put an embedding layer on top of the one-hot encoding for greater efficiency.

sparse representation

#language
#fundamentals

Storing only the position(s) of nonzero elements in a sparse feature.

For example, suppose a categorical feature named species identifies the 36 tree species in a particular forest. Further assume that each example identifies only a single species.

You could use a one-hot vector to represent the tree species in each example. A one-hot vector would contain a single 1 (to represent the particular tree species in that example) and 35 0s (to represent the 35 tree species not in that example). So, the one-hot representation of maple might look something like the following:

A vector in which positions 0 through 23 hold the value 0, position
          24 holds the value 1, and positions 25 through 35 hold the value 0.

Alternatively, sparse representation would simply identify the position of the particular species. If maple is at position 24, then the sparse representation of maple would simply be:

24

Notice that the sparse representation is much more compact than the one-hot representation.
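
The two representations can be converted into each other. The helper names below are hypothetical, and the sketch assumes, as in the example above, that maple sits at position 24 of 36 species:

```python
def to_one_hot(positions, size):
    """Build a dense vector with 1s at the given positions and 0s elsewhere."""
    vector = [0] * size
    for position in positions:
        vector[position] = 1
    return vector


def to_sparse(vector):
    """Keep only the positions of the nonzero elements."""
    return [index for index, value in enumerate(vector) if value != 0]


maple_one_hot = to_one_hot([24], size=36)
print(len(maple_one_hot))        # 36 stored values
print(to_sparse(maple_one_hot))  # [24], a single stored value
```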

See Working with categorical data in Machine Learning Crash Course for more information.

sparse vector

#fundamentals

A vector whose values are mostly zeroes. See also sparse feature and sparsity .

squared loss

#fundamentals
#Metric

Synonym for L2 loss.

static

#fundamentals

Something done once rather than continuously. The terms static and offline are synonyms. The following are common uses of static and offline in machine learning:

  • static model (or offline model ) is a model trained once and then used for a while.
  • static training (or offline training ) is the process of training a static model.
  • static inference (or offline inference ) is a process in which a model generates a batch of predictions at a time.

Contrast with dynamic .

static inference

#fundamentals

Synonym for offline inference .

stationarity

#fundamentals

A feature whose values don't change across one or more dimensions, usually time. For example, a feature whose values look about the same in 2021 and 2023 exhibits stationarity.

In the real world, very few features exhibit stationarity. Even features synonymous with stability (like sea level) change over time.

Contrast with nonstationarity .

stochastic gradient descent (SGD)

#fundamentals

A gradient descent algorithm in which the batch size is one. In other words, SGD trains on a single example chosen uniformly at random from a training set .
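
The following is a minimal sketch of SGD, assuming a one-feature linear model trained with squared loss. The tiny training set, the learning rate, and the number of iterations are all made up for illustration:

```python
import random


def sgd_step(w, b, x, y, learning_rate):
    """Update the weight and bias using the gradient from a single example."""
    error = (w * x + b) - y
    # Gradients of the squared loss (prediction - label)^2 for this one example.
    w -= learning_rate * 2.0 * error * x
    b -= learning_rate * 2.0 * error
    return w, b


rng = random.Random(0)
# Labels follow y = 2x + 1 exactly, so the ideal parameters are w = 2 and b = 1.
training_set = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]

w, b = 0.0, 0.0
for _ in range(5000):
    x, y = rng.choice(training_set)  # batch size of one, chosen uniformly at random
    w, b = sgd_step(w, b, x, y, learning_rate=0.02)

print(round(w, 3), round(b, 3))
```

Each iteration looks at only one example, so individual updates are noisy, but on this noise-free data the parameters settle close to w = 2 and b = 1.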

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

supervised machine learning

#fundamentals

Training a model from features and their corresponding labels . Supervised machine learning is analogous to learning a subject by studying a set of questions and their corresponding answers. After mastering the mapping between questions and answers, a student can then provide answers to new (never-before-seen) questions on the same topic.

Compare with unsupervised machine learning .

See Supervised Learning in the Introduction to ML course for more information.

synthetic feature

#fundamentals

A feature not present among the input features, but assembled from one or more of them. Methods for creating synthetic features include the following:

  • Bucketing a continuous feature into range bins.
  • Creating a feature cross .
  • Multiplying (or dividing) one feature value by other feature value(s) or by itself. For example, if a and b are input features, then the following are examples of synthetic features:
    • ab
    • a²
  • Applying a transcendental function to a feature value. For example, if c is an input feature, then the following are examples of synthetic features:
    • sin(c)
    • ln(c)

Features created by normalizing or scaling alone are not considered synthetic features.

T

test loss

#fundamentals
#Metric

A metric representing a model's loss against the test set . When building a model , you typically try to minimize test loss. That's because a low test loss is a stronger quality signal than a low training loss or low validation loss .

A large gap between test loss and training loss or validation loss sometimes suggests that you need to increase the regularization rate .

training

#fundamentals

The process of determining the ideal parameters (weights and biases) comprising a model . During training, a system reads in examples and gradually adjusts parameters. Training uses each example anywhere from a few times to billions of times.

See Supervised Learning in the Introduction to ML course for more information.

training loss

#fundamentals
#Metric

A metric representing a model's loss during a particular training iteration. For example, suppose the loss function is Mean Squared Error . Perhaps the training loss (the Mean Squared Error) for the 10th iteration is 2.2, and the training loss for the 100th iteration is 1.9.

A loss curve plots training loss versus the number of iterations. A loss curve provides the following hints about training:

  • A downward slope implies that the model is improving.
  • An upward slope implies that the model is getting worse.
  • A flat slope implies that the model has reached convergence .

For example, the following somewhat idealized loss curve shows:

  • A steep downward slope during the initial iterations, which implies rapid model improvement.
  • A gradually flattening (but still downward) slope until close to the end of training, which implies continued model improvement at a somewhat slower pace than during the initial iterations.
  • A flat slope towards the end of training, which suggests convergence.

The plot of training loss versus iterations. This loss curve starts
     with a steep downward slope. The slope gradually flattens until the
     slope becomes zero.

Although training loss is important, see also generalization .

training-serving skew

#fundamentals

The difference between a model's performance during training and that same model's performance during serving .

training set

#fundamentals

The subset of the dataset used to train a model .

Traditionally, examples in the dataset are divided into the following three distinct subsets:

  • a training set
  • a validation set
  • a test set

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

true negative (TN)

#fundamentals
#Metric

An example in which the model correctly predicts the negative class . For example, the model infers that a particular email message is not spam , and that email message really is not spam .

true positive (TP)

#fundamentals
#Metric

An example in which the model correctly predicts the positive class . For example, the model infers that a particular email message is spam, and that email message really is spam.

true positive rate (TPR)

#fundamentals
#Metric

Synonym for recall. That is:

$$\text{true positive rate} = \frac {\text{true positives}} {\text{true positives} + \text{false negatives}}$$

True positive rate is the y-axis in an ROC curve .

U

underfitting

#fundamentals

Producing a model with poor predictive ability because the model hasn't fully captured the complexity of the training data. Many problems can cause underfitting, including:

See Overfitting in Machine Learning Crash Course for more information.

unlabeled example

#fundamentals

An example that contains features but no label . For example, the following table shows three unlabeled examples from a house valuation model, each with three features but no house value:

Number of bedrooms    Number of bathrooms    House age
3                     2                      15
2                     1                      72
4                     2                      34

In supervised machine learning , models train on labeled examples and make predictions on unlabeled examples .

In semi-supervised and unsupervised learning, unlabeled examples are used during training.

Contrast unlabeled example with labeled example .

unsupervised machine learning

#clustering
#fundamentals

Training a model to find patterns in a dataset, typically an unlabeled dataset.

The most common use of unsupervised machine learning is to cluster data into groups of similar examples. For example, an unsupervised machine learning algorithm can cluster songs based on various properties of the music. The resulting clusters can become an input to other machine learning algorithms (for example, to a music recommendation service). Clustering can help when useful labels are scarce or absent. For example, in domains such as anti-abuse and fraud, clusters can help humans better understand the data.

Contrast with supervised machine learning .

See What is Machine Learning? in the Introduction to ML course for more information.

V

validation

#fundamentals

The initial evaluation of a model's quality. Validation checks the quality of a model's predictions against the validation set .

Because the validation set differs from the training set , validation helps guard against overfitting .

You might think of evaluating the model against the validation set as the first round of testing and evaluating the model against the test set as the second round of testing.

validation loss

#fundamentals
#Metric

A metric representing a model's loss on the validation set during a particular iteration of training.

See also generalization curve .

validation set

#fundamentals

The subset of the dataset that performs initial evaluation against a trained model . Typically, you evaluate the trained model against the validation set several times before evaluating the model against the test set .

Traditionally, you divide the examples in the dataset into the following three distinct subsets:

  • a training set
  • a validation set
  • a test set

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

W

weight

#fundamentals

A value that a model multiplies by another value. Training is the process of determining a model's ideal weights; inference is the process of using those learned weights to make predictions.

See Linear regression in Machine Learning Crash Course for more information.

weighted sum

#fundamentals

The sum of all the relevant input values multiplied by their corresponding weights. For example, suppose the relevant inputs consist of the following:

input value    input weight
2              -1.3
-1             0.6
3              0.4

The weighted sum is therefore:

weighted sum = (2)(-1.3) + (-1)(0.6) + (3)(0.4) = -2.0

A weighted sum is the input argument to an activation function .
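
The same arithmetic in code, as a minimal sketch using the input values and weights from the table above:

```python
inputs = [2, -1, 3]
weights = [-1.3, 0.6, 0.4]

# Multiply each input by its corresponding weight, then add up the products.
weighted_sum = sum(x * w for x, w in zip(inputs, weights))
print(round(weighted_sum, 6))  # -2.0
```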

Z

Z-score normalization

#fundamentals

A scaling technique that replaces a raw feature value with a floating-point value representing the number of standard deviations from that feature's mean. For example, consider a feature whose mean is 800 and whose standard deviation is 100. The following table shows how Z-score normalization would map the raw value to its Z-score:

Raw value    Z-score
800          0
950          +1.5
575          -2.25

The machine learning model then trains on the Z-scores for that feature instead of on the raw values.
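
A minimal sketch of the mapping, using the mean of 800 and standard deviation of 100 from the example above (the function name is illustrative):

```python
def z_score(raw_value, mean, standard_deviation):
    """Number of standard deviations that raw_value lies from the mean."""
    return (raw_value - mean) / standard_deviation


for raw_value in (800, 950, 575):
    print(raw_value, z_score(raw_value, mean=800, standard_deviation=100))
# 800 0.0
# 950 1.5
# 575 -2.25
```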

See Numerical data: Normalization in Machine Learning Crash Course for more information.


This page contains ML Fundamentals glossary terms. For all glossary terms, click here .

A

accuracy

#fundamentals
#Metric

The number of correct classification predictions divided by the total number of predictions. That is:

$$\text{Accuracy} = \frac{\text{correct predictions}} {\text{correct predictions + incorrect predictions }}$$

For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:

$$\text{Accuracy} = \frac{\text{40}} {\text{40 + 10}} = \text{80%}$$

Binary classification provides specific names for the different categories of correct predictions and incorrect predictions . So, the accuracy formula for binary classification is as follows:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}} {\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

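where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.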

Compare and contrast accuracy with precision and recall .
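
The binary classification formula translates directly into code. In this sketch, the split of the 40 correct and 10 incorrect predictions into the four categories is made up for illustration:

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions (TP + TN) divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


# 25 + 15 = 40 correct predictions and 6 + 4 = 10 incorrect predictions.
print(accuracy(tp=25, tn=15, fp=6, fn=4))  # 0.8
```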

See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.

activation function

#fundamentals

A function that enables neural networks to learn nonlinear (complex) relationships between features and the label.

Popular activation functions include:

The plots of activation functions are never single straight lines. For example, the plot of the ReLU activation function consists of two straight lines:

A Cartesian plot of two lines. The first line has a constant y value of 0, running along the x-axis from (-infinity, 0) to (0, 0). The second line starts at (0, 0). This line has a slope of +1, so it runs from (0, 0) to (+infinity, +infinity).

A plot of the sigmoid activation function looks as follows:

A two-dimensional curved plot with x values spanning the domain -infinity to +infinity, while y values span the range almost 0 to almost 1. When x is 0, y is 0.5. The slope of the curve is always positive, with the highest slope at (0, 0.5) and gradually decreasing slopes as the absolute value of x increases.

See Neural networks: Activation functions in Machine Learning Crash Course for more information.

artificial intelligence

#fundamentals

A non-human program or model that can solve sophisticated tasks. For example, a program or model that translates text or a program or model that identifies diseases from radiologic images both exhibit artificial intelligence.

Formally, machine learning is a sub-field of artificial intelligence. However, in recent years, some organizations have begun using the terms artificial intelligence and machine learning interchangeably.

AUC (Area under the ROC curve)

#fundamentals
#Metric

A number between 0.0 and 1.0 representing a binary classification model's ability to separate positive classes from negative classes . The closer the AUC is to 1.0, the better the model's ability to separate classes from each other.

For example, the following illustration shows a classification model that separates positive classes (green ovals) from negative classes (purple rectangles) perfectly. This unrealistically perfect model has an AUC of 1.0:

A number line with 8 positive examples on one side and
          9 negative examples on the other side.

Conversely, the following illustration shows the results for a classification model that generated random results. This model has an AUC of 0.5:

A number line with 6 positive examples and 6 negative examples.
          The sequence of examples is positive, negative,
          positive, negative, positive, negative, positive, negative, positive
          negative, positive, negative.

Yes, the preceding model has an AUC of 0.5, not 0.0.

Most models are somewhere between the two extremes. For instance, the following model separates positives from negatives somewhat, and therefore has an AUC somewhere between 0.5 and 1.0:

A number line with 6 positive examples and 6 negative examples.
          The sequence of examples is negative, negative, negative, negative,
          positive, negative, positive, positive, negative, positive, positive,
          positive.

AUC ignores any value you set for classification threshold . Instead, AUC considers all possible classification thresholds.
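
One way to see why no threshold is needed: AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below uses that pairwise view; the function name is illustrative, and each example's score is simply its position on the number line from the illustrations above.

```python
def auc(labels, scores):
    """Fraction of positive/negative pairs in which the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Perfect separation: every negative lies to the left of every positive.
perfect = [0] * 9 + [1] * 8
print(auc(perfect, list(range(len(perfect)))))  # 1.0

# Partial separation, in the order shown in the last illustration.
mixed = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
print(auc(mixed, list(range(len(mixed)))))  # about 0.89
```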

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

B

backpropagation

#fundamentals

The algorithm that implements gradient descent in neural networks .

Training a neural network involves many iterations of the following two-pass cycle:

  1. During the forward pass , the system processes a batch of examples to yield prediction(s). The system compares each prediction to each label value. The difference between the prediction and the label value is the loss for that example. The system aggregates the losses for all the examples to compute the total loss for the current batch.
  2. During the backward pass (backpropagation), the system reduces loss by adjusting the weights of all the neurons in all the hidden layer(s) .

Neural networks often contain many neurons across many hidden layers. Each of those neurons contributes to the overall loss in different ways. Backpropagation determines whether to increase or decrease the weights applied to particular neurons.

The learning rate is a multiplier that controls the degree to which each backward pass increases or decreases each weight. A large learning rate will increase or decrease each weight more than a small learning rate.

In calculus terms, backpropagation implements the chain rule. That is, backpropagation calculates the partial derivative of the error with respect to each parameter.
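
To make the two passes concrete, here is a hand-written sketch for a deliberately tiny network: one input, one hidden ReLU neuron, one linear output, and squared loss. The architecture, starting weights, and learning rate are all assumptions chosen for illustration; real networks have many more neurons, and ML APIs compute these derivatives automatically.

```python
def forward(w1, w2, x):
    """Forward pass: input -> hidden ReLU neuron -> linear output."""
    pre_activation = w1 * x
    hidden = max(0.0, pre_activation)
    output = w2 * hidden
    return pre_activation, hidden, output


def squared_loss(w1, w2, x, y):
    """Loss for one example: squared difference between prediction and label."""
    _, _, output = forward(w1, w2, x)
    return (output - y) ** 2


def backward(w1, w2, x, y):
    """Backward pass: chain rule from the loss back to each weight."""
    pre_activation, hidden, output = forward(w1, w2, x)
    d_loss_d_output = 2.0 * (output - y)
    d_loss_d_w2 = d_loss_d_output * hidden
    d_loss_d_hidden = d_loss_d_output * w2
    d_hidden_d_pre = 1.0 if pre_activation > 0 else 0.0  # slope of ReLU
    d_loss_d_w1 = d_loss_d_hidden * d_hidden_d_pre * x
    return d_loss_d_w1, d_loss_d_w2


x, y = 1.5, 2.0
w1, w2 = 0.8, -0.5
learning_rate = 0.1

loss_before = squared_loss(w1, w2, x, y)
grad_w1, grad_w2 = backward(w1, w2, x, y)
w1 -= learning_rate * grad_w1  # the learning rate scales each adjustment
w2 -= learning_rate * grad_w2
loss_after = squared_loss(w1, w2, x, y)
print(loss_before, loss_after)  # the loss drops after one update
```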

Years ago, ML practitioners had to write code to implement backpropagation. Modern ML APIs like Keras now implement backpropagation for you. Phew!

See Neural networks in Machine Learning Crash Course for more information.

batch

#fundamentals

The set of examples used in one training iteration . The batch size determines the number of examples in a batch.

See epoch for an explanation of how a batch relates to an epoch.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

batch size

#fundamentals

The number of examples in a batch . For instance, if the batch size is 100, then the model processes 100 examples per iteration .

The following are popular batch size strategies:

  • Stochastic Gradient Descent (SGD) , in which the batch size is 1.
  • Full batch, in which the batch size is the number of examples in the entire training set . For instance, if the training set contains a million examples, then the batch size would be a million examples. Full batch is usually an inefficient strategy.
  • Mini-batch, in which the batch size is usually between 10 and 1000. Mini-batch is usually the most efficient strategy.
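
The following sketch shows how the batch size determines the number of iterations needed to process every example once. The helper `make_batches` and the 1,000-example training set are hypothetical:

```python
def make_batches(examples, batch_size):
    """Split the examples into consecutive batches of at most batch_size."""
    return [
        examples[start:start + batch_size]
        for start in range(0, len(examples), batch_size)
    ]


training_set = list(range(1000))  # stand-in for 1,000 examples
strategies = [("SGD", 1), ("mini-batch", 100), ("full batch", len(training_set))]
for name, batch_size in strategies:
    batches = make_batches(training_set, batch_size)
    print(name, "->", len(batches), "iterations to see every example once")
```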

See the following for more information:

bias (ethics/fairness)

#responsible
#fundamentals

1. Stereotyping, prejudice or favoritism towards some things, people, or groups over others. These biases can affect collection and interpretation of data, the design of a system, and how users interact with a system. Forms of this type of bias include:

2. Systematic error introduced by a sampling or reporting procedure. Forms of this type of bias include:

Not to be confused with the bias term in machine learning models or prediction bias .

See Fairness: Types of bias in Machine Learning Crash Course for more information.

bias (math) or bias term

#fundamentals

An intercept or offset from an origin. Bias is a parameter in machine learning models, which is symbolized by either of the following:

  • b
  • w₀

For example, bias is the b in the following formula:

$$y' = b + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$

In a simple two-dimensional line, bias just means "y-intercept." For example, the bias of the line in the following illustration is 2.

The plot of a line with a slope of 0.5 and a bias (y-intercept) of 2.

Bias exists because not all models start from the origin (0,0). For example, suppose an amusement park costs 2 Euros to enter and an additional 0.5 Euro for every hour a customer stays. Therefore, a model mapping the total cost has a bias of 2 because the lowest cost is 2 Euros.
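
The amusement park example can be written as a one-feature linear model. This is a minimal sketch; the function name `predict` is illustrative:

```python
def predict(features, weights, bias):
    """Linear model: the bias plus the weighted sum of the features."""
    return bias + sum(w * x for w, x in zip(weights, features))


# Bias of 2 (entry fee in Euros) and a weight of 0.5 (Euros per hour).
print(predict(features=[0.0], weights=[0.5], bias=2.0))  # 2.0, the cost of zero hours
print(predict(features=[3.0], weights=[0.5], bias=2.0))  # 3.5, the cost of three hours
```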

Bias is not to be confused with bias in ethics and fairness or prediction bias .

See Linear Regression in Machine Learning Crash Course for more information.

binary classification

#fundamentals

A type of classification task that predicts one of two mutually exclusive classes:

  • the positive class
  • the negative class

For example, the following two machine learning models each perform binary classification:

  • A model that determines whether email messages are spam (the positive class) or not spam (the negative class).
  • A model that evaluates medical symptoms to determine whether a person has a particular disease (the positive class) or doesn't have that disease (the negative class).

Contrast with multi-class classification .

See also logistic regression and classification threshold .

See Classification in Machine Learning Crash Course for more information.

bucketing

#fundamentals

Converting a single feature into multiple binary features called buckets or bins , typically based on a value range. The chopped feature is typically a continuous feature .

For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete buckets, such as:

  • <= 10 degrees Celsius would be the "cold" bucket.
  • 11 - 24 degrees Celsius would be the "temperate" bucket.
  • >= 25 degrees Celsius would be the "warm" bucket.

The model will treat every value in the same bucket identically. For example, the values 13 and 22 are both in the temperate bucket, so the model treats the two values identically.
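
A minimal sketch of the temperature example follows. The ranges above leave values such as 10.5 or 24.5 unassigned, so this sketch assumes the temperate bucket covers everything above 10 and below 25; the function names are illustrative.

```python
BUCKETS = ["cold", "temperate", "warm"]


def bucket_temperature(celsius):
    """Map a continuous temperature to the name of its bucket."""
    if celsius <= 10:
        return "cold"
    if celsius < 25:
        return "temperate"
    return "warm"


def one_hot_bucket(celsius):
    """Represent the bucket as binary features, one per bucket."""
    name = bucket_temperature(celsius)
    return [1 if bucket == name else 0 for bucket in BUCKETS]


print(bucket_temperature(13), bucket_temperature(22))  # temperate temperate
print(one_hot_bucket(13))  # [0, 1, 0]
```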

See Numerical data: Binning in Machine Learning Crash Course for more information.

C

categorical data

#fundamentals

Features having a specific set of possible values. For example, consider a categorical feature named traffic-light-state , which can only have one of the following three possible values:

  • red
  • yellow
  • green

By representing traffic-light-state as a categorical feature, a model can learn the differing impacts of red , green , and yellow on driver behavior.

Categorical features are sometimes called discrete features .

Contrast with numerical data .

See Working with categorical data in Machine Learning Crash Course for more information.

class

#fundamentals

A category that a label can belong to. For example:

A classification model predicts a class. In contrast, a regression model predicts a number rather than a class.

See Classification in Machine Learning Crash Course for more information.

classification model

#fundamentals

A model whose prediction is a class . For example, the following are all classification models:

  • A model that predicts an input sentence's language (French? Spanish? Italian?).
  • A model that predicts tree species (Maple? Oak? Baobab?).
  • A model that predicts the positive or negative class for a particular medical condition.

In contrast, regression models predict numbers rather than classes.

Two common types of classification models are:

classification threshold

#fundamentals

In a binary classification , a number between 0 and 1 that converts the raw output of a logistic regression model into a prediction of either the positive class or the negative class . Note that the classification threshold is a value that a human chooses, not a value chosen by model training.

A logistic regression model outputs a raw value between 0 and 1. Then:

  • If this raw value is greater than the classification threshold, then the positive class is predicted.
  • If this raw value is less than the classification threshold, then the negative class is predicted.

For example, suppose the classification threshold is 0.8. If the raw value is 0.9, then the model predicts the positive class. If the raw value is 0.7, then the model predicts the negative class.
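
In code, applying the threshold is a single comparison. This sketch uses the 0.8 threshold from the example; treating a raw value exactly equal to the threshold as negative is an assumption made here, since the rules above don't cover that case.

```python
def classify(raw_value, threshold=0.8):
    """Convert a raw model output between 0 and 1 into a class prediction."""
    return "positive" if raw_value > threshold else "negative"


print(classify(0.9))  # positive
print(classify(0.7))  # negative
```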

The choice of classification threshold strongly influences the number of false positives and false negatives .

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

classifier

#fundamentals

A casual term for a classification model .

class-imbalanced dataset

#fundamentals

A dataset for a classification problem in which the total number of labels of each class differs significantly. For example, consider a binary classification dataset whose two labels are divided as follows:

  • 1,000,000 negative labels
  • 10 positive labels

The ratio of negative to positive labels is 100,000 to 1, so this is a class-imbalanced dataset.

In contrast, the following dataset is not class-imbalanced because the ratio of negative labels to positive labels is relatively close to 1:

  • 517 negative labels
  • 483 positive labels

Multi-class datasets can also be class-imbalanced. For example, the following multi-class classification dataset is also class-imbalanced because one label has far more examples than the other two:

  • 1,000,000 labels with class "green"
  • 200 labels with class "purple"
  • 350 labels with class "orange"

See also entropy, majority class, and minority class.

clipping

#fundamentals

A technique for handling outliers by doing either or both of the following:

  • Reducing feature values that are greater than a maximum threshold down to that maximum threshold.
  • Increasing feature values that are less than a minimum threshold up to that minimum threshold.

For example, suppose that <0.5% of values for a particular feature fall outside the range 40–60. In this case, you could do the following:

  • Clip all values over 60 (the maximum threshold) to be exactly 60.
  • Clip all values under 40 (the minimum threshold) to be exactly 40.
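
As a minimal sketch of the 40–60 example above, using only the standard library (the function name `clip` and the sample values are hypothetical):

```python
def clip(values, minimum=40.0, maximum=60.0):
    """Pull every value into the closed range [minimum, maximum]."""
    return [min(max(v, minimum), maximum) for v in values]


clipped = clip([12.0, 45.5, 60.0, 73.2])
print(clipped)  # [40.0, 45.5, 60.0, 60.0]
```

Values already inside the range pass through unchanged; only the outliers move.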

Outliers can damage models, sometimes causing weights to overflow during training. Some outliers can also dramatically spoil metrics like accuracy. Clipping is a common technique to limit the damage.

Gradient clipping forces gradient values within a designated range during training.

See Numerical data: Normalization in Machine Learning Crash Course for more information.

confusion matrix

#fundamentals

An NxN table that summarizes the number of correct and incorrect predictions that a classification model made. For example, consider the following confusion matrix for a binary classification model:

                         | Tumor (predicted) | Non-Tumor (predicted)
Tumor (ground truth)     | 18 (TP)           | 1 (FN)
Non-Tumor (ground truth) | 6 (FP)            | 452 (TN)

The preceding confusion matrix shows the following:

  • Of the 19 predictions in which ground truth was Tumor, the model correctly classified 18 and incorrectly classified 1.
  • Of the 458 predictions in which ground truth was Non-Tumor, the model correctly classified 452 and incorrectly classified 6.

The confusion matrix for a multi-class classification problem can help you identify patterns of mistakes. For example, consider the following confusion matrix for a 3-class multi-class classification model that categorizes three different iris types (Virginica, Versicolor, and Setosa). When the ground truth was Virginica, the confusion matrix shows that the model was far more likely to mistakenly predict Versicolor than Setosa:

                          | Setosa (predicted) | Versicolor (predicted) | Virginica (predicted)
Setosa (ground truth)     | 88                 | 12                     | 0
Versicolor (ground truth) | 6                  | 141                    | 7
Virginica (ground truth)  | 2                  | 27                     | 109

As yet another example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistakenly predict 9 instead of 4, or mistakenly predict 1 instead of 7.

Confusion matrices contain sufficient information to calculate a variety of performance metrics, including precision and recall.
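
For instance, the four cells of the tumor confusion matrix above are enough to compute precision, recall, and accuracy. This is a small sketch using the standard definitions of those metrics; the variable names are illustrative.

```python
# Cell counts taken from the binary (tumor) confusion matrix above.
tp, fn, fp, tn = 18, 1, 6, 452

precision = tp / (tp + fp)                  # 18 / 24
recall = tp / (tp + fn)                     # 18 / 19
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 470 / 477

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```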

continuous feature

#fundamentals

A floating-point feature with an infinite range of possible values, such as temperature or weight.

Contrast with discrete feature.

convergence

#fundamentals

A state reached when loss values change very little or not at all with each iteration. For example, the following loss curve suggests convergence at around 700 iterations:

Cartesian plot. The x-axis is the number of training iterations. The y-axis is loss. Loss is very high during the first few iterations, but drops sharply. After about 100 iterations, loss is still descending but far more gradually. After about 700 iterations, loss stays flat.

A model converges when additional training won't improve the model.

In deep learning, loss values sometimes stay constant or nearly so for many iterations before finally descending. During a long period of constant loss values, you may temporarily get a false sense of convergence.

See also early stopping.

See Model convergence and loss curves in Machine Learning Crash Course for more information.

D

DataFrame

#fundamentals

A popular pandas data type for representing datasets in memory.

A DataFrame is analogous to a table or a spreadsheet. Each column of a DataFrame has a name (a header), and each row is identified by a unique number.

A DataFrame is structured like a 2D array, except that each column can be assigned its own data type.

See also the official pandas.DataFrame reference page.

data set or dataset

#fundamentals

A collection of raw data, commonly (but not exclusively) organized in one of the following formats:

  • a spreadsheet
  • a file in CSV (comma-separated values) format

deep model

#fundamentals

A neural network containing more than one hidden layer.

A deep model is also called a deep neural network.

Contrast with wide model.

dense feature

#fundamentals

A feature in which most or all values are nonzero, typically a Tensor of floating-point values. For example, the following 10-element Tensor is dense because 9 of its values are nonzero:

8 3 7 5 2 4 0 4 9 6

Contrast with sparse feature.

depth

#fundamentals

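The sum of the following in a neural network:

  • the number of hidden layers
  • the number of output layers, which is typically 1
  • the number of any embedding layers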

For example, a neural network with five hidden layers and one output layer has a depth of 6.

Notice that the input layer doesn't influence depth.

discrete feature

#fundamentals

A feature with a finite set of possible values. For example, a feature whose values may only be animal, vegetable, or mineral is a discrete (or categorical) feature.

Contrast with continuous feature.

dynamic

#fundamentals

Something done frequently or continuously. The terms dynamic and online are synonyms in machine learning. The following are common uses of dynamic and online in machine learning:

  • A dynamic model (or online model) is a model that is retrained frequently or continuously.
  • Dynamic training (or online training) is the process of training frequently or continuously.
  • Dynamic inference (or online inference) is the process of generating predictions on demand.

dynamic model

#fundamentals

A model that is frequently (maybe even continuously) retrained. A dynamic model is a "lifelong learner" that constantly adapts to evolving data. A dynamic model is also known as an online model.

Contrast with static model.

E

early stopping

#fundamentals

A method for regularization that involves ending training before training loss finishes decreasing. In early stopping, you intentionally stop training the model when the loss on a validation dataset starts to increase; that is, when generalization performance worsens.
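
A minimal sketch of that stopping rule follows. The `patience` parameter (how many consecutive non-improving validation checks to tolerate) and the loss values are assumptions added for illustration; the definition above only says to stop when validation loss starts to increase.

```python
def early_stopping_iteration(validation_losses, patience=2):
    """Return the index at which training would stop, or None if it never does."""
    best = float("inf")
    checks_without_improvement = 0
    for i, loss in enumerate(validation_losses):
        if loss < best:
            best = loss
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            # Stop once validation loss has failed to improve `patience` times in a row.
            if checks_without_improvement >= patience:
                return i
    return None


losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]  # hypothetical validation losses
stop_at = early_stopping_iteration(losses)
print(stop_at)  # 5
```

Tolerating a couple of non-improving checks avoids stopping on a single noisy measurement.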

embedding layer

#language
#fundamentals

A special hidden layer that trains on a high-dimensional categorical feature to gradually learn a lower dimension embedding vector. An embedding layer enables a neural network to train far more efficiently than training just on the high-dimensional categorical feature.

For example, Earth currently supports about 73,000 tree species. Suppose tree species is a feature in your model, so your model's input layer includes a one-hot vector 73,000 elements long. For example, perhaps baobab would be represented something like this:

An array of 73,000 elements. The first 6,232 elements hold the value 0. The next element holds the value 1. The final 66,767 elements hold the value 0.

A 73,000-element array is very long. If you don't add an embedding layer to the model, training is going to be very time consuming due to multiplying 72,999 zeros. Perhaps you pick the embedding layer to consist of 12 dimensions. Consequently, the embedding layer will gradually learn a new embedding vector for each tree species.

In certain situations, hashing is a reasonable alternative to an embedding layer.

See Embeddings in Machine Learning Crash Course for more information.

epoch

#fundamentals

A full training pass over the entire training set such that each example has been processed once.

An epoch represents N / batch size training iterations, where N is the total number of examples.

For instance, suppose the following:

  • The dataset consists of 1,000 examples.
  • The batch size is 50 examples.

Therefore, a single epoch requires 20 iterations:

1 epoch = (N/batch size) = (1,000 / 50) = 20 iterations
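
The same arithmetic as a short Python sketch. Rounding up with `math.ceil` is an assumption for datasets that don't divide evenly by the batch size; the entry only covers the evenly divisible case.

```python
import math


def iterations_per_epoch(num_examples: int, batch_size: int) -> int:
    # Assumption: a final, smaller batch still counts as one iteration.
    return math.ceil(num_examples / batch_size)


print(iterations_per_epoch(1_000, 50))  # 20
```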

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

example

#fundamentals

The values of one row of features and possibly a label. Examples in supervised learning fall into two general categories:

  • A labeled example consists of one or more features and a label. Labeled examples are used during training.
  • An unlabeled example consists of one or more features but no label. Unlabeled examples are used during inference.

For instance, suppose you are training a model to determine the influence of weather conditions on student test scores. Here are three labeled examples:

Features                            | Label
Temperature | Humidity | Pressure   | Test score
15          | 47       | 998        | Good
19          | 34       | 1020       | Excellent
18          | 92       | 1012       | Poor

Here are three unlabeled examples:

Temperature | Humidity | Pressure
12          | 62       | 1014
21          | 47       | 1017
19          | 41       | 1021

The row of a dataset is typically the raw source for an example. That is, an example typically consists of a subset of the columns in the dataset. Furthermore, the features in an example can also include synthetic features, such as feature crosses.

See Supervised Learning in the Introduction to Machine Learning course for more information.

F

false negative (FN)

#fundamentals
#Metric

An example in which the model mistakenly predicts the negative class. For example, the model predicts that a particular email message is not spam (the negative class), but that email message actually is spam.

false positive (FP)

#fundamentals
#Metric

An example in which the model mistakenly predicts the positive class. For example, the model predicts that a particular email message is spam (the positive class), but that email message is actually not spam.

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive rate (FPR)

#fundamentals
#Metric

The proportion of actual negative examples for which the model mistakenly predicted the positive class. The following formula calculates the false positive rate:

$$\text{false positive rate} = \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}$$
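
As an illustration, the formula can be applied to the tumor confusion matrix from the confusion matrix entry (6 false positives, 452 true negatives). The function name is hypothetical.

```python
def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Fraction of actual negatives that the model wrongly flagged as positive."""
    return false_positives / (false_positives + true_negatives)


fpr = false_positive_rate(6, 452)  # 6 / 458
print(f"{fpr:.4f}")
```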

The false positive rate is the x-axis in an ROC curve.

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

feature

#fundamentals

An input variable to a machine learning model. An example consists of one or more features. For instance, suppose you are training a model to determine the influence of weather conditions on student test scores. The following table shows three examples, each of which contains three features and one label:

Features                            | Label
Temperature | Humidity | Pressure   | Test score
15          | 47       | 998        | 92
19          | 34       | 1020       | 84
18          | 92       | 1012       | 87

Contrast with label.

See Supervised Learning in the Introduction to Machine Learning course for more information.

feature cross

#fundamentals

A synthetic feature formed by "crossing" categorical or bucketed features.

For example, consider a "mood forecasting" model that represents temperature in one of the following four buckets:

  • freezing
  • chilly
  • temperate
  • warm

And represents wind speed in one of the following three buckets:

  • still
  • light
  • windy

Without feature crosses, the linear model trains independently on each of the preceding seven buckets. So, the model trains on, for example, freezing independently of the training on, for example, windy.

Alternatively, you could create a feature cross of temperature and wind speed. This synthetic feature would have the following 12 possible values:

  • freezing-still
  • freezing-light
  • freezing-windy
  • chilly-still
  • chilly-light
  • chilly-windy
  • temperate-still
  • temperate-light
  • temperate-windy
  • warm-still
  • warm-light
  • warm-windy

Thanks to feature crosses, the model can learn mood differences between a freezing-windy day and a freezing-still day.
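
Because a cross is a Cartesian product, the 12 values listed above can be generated with `itertools.product`. This is a sketch; the hyphenated string format simply mirrors the list above.

```python
from itertools import product

temperature_buckets = ["freezing", "chilly", "temperate", "warm"]
wind_buckets = ["still", "light", "windy"]

# Every (temperature, wind) pair becomes one value of the crossed feature.
crossed = [f"{t}-{w}" for t, w in product(temperature_buckets, wind_buckets)]

print(len(crossed))  # 12
print(crossed[:3])   # ['freezing-still', 'freezing-light', 'freezing-windy']
```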

If you create a synthetic feature from two features that each have a lot of different buckets, the resulting feature cross will have a huge number of possible combinations. For example, if one feature has 1,000 buckets and the other feature has 2,000 buckets, the resulting feature cross has 2,000,000 buckets.

Formally, a cross is a Cartesian product.

Feature crosses are mostly used with linear models and are rarely used with neural networks.

See Categorical data: Feature crosses in Machine Learning Crash Course for more information.

feature engineering

#fundamentals
#TensorFlow

A process that involves the following steps:

  1. Determining which features might be useful in training a model.
  2. Converting raw data from the dataset into efficient versions of those features.

For example, you might determine that temperature might be a useful feature. Then, you might experiment with bucketing to optimize what the model can learn from different temperature ranges.

Feature engineering is sometimes called feature extraction or featurization .

See Numerical data: How a model ingests data using feature vectors in Machine Learning Crash Course for more information.

feature set

#fundamentals

The group of features your machine learning model trains on. For example, a simple feature set for a model that predicts housing prices might consist of postal code, property size, and property condition.

feature vector

#fundamentals

The array of feature values comprising an example. The feature vector is input during training and during inference. For example, the feature vector for a model with two discrete features might be:

[0.92, 0.56]

Four layers: an input layer, two hidden layers, and one output layer.
          The input layer contains two nodes, one containing the value
          0.92 and the other containing the value 0.56.

Each example supplies different values for the feature vector, so the feature vector for the next example could be something like:

[0.73, 0.49]

Feature engineering determines how to represent features in the feature vector. For example, a categorical feature with five possible values might be represented with one-hot encoding. In this case, the portion of the feature vector for a particular example would consist of four zeroes and a single 1.0 in the third position, as follows:

[0.0, 0.0, 1.0, 0.0, 0.0]

As another example, suppose your model consists of three features:

  • a categorical feature with five possible values represented with one-hot encoding; for example: [0.0, 1.0, 0.0, 0.0, 0.0]
  • another categorical feature with three possible values represented with one-hot encoding; for example: [0.0, 0.0, 1.0]
  • a floating-point feature; for example: 8.3.

In this case, the feature vector for each example would be represented by nine values. Given the example values in the preceding list, the feature vector would be:

0.0
1.0
0.0
0.0
0.0
0.0
0.0
1.0
8.3
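
The nine-value vector above is the concatenation of the three encoded features. A minimal sketch, in which the helper `one_hot` is a hypothetical name:

```python
def one_hot(index: int, size: int) -> list:
    """Return a list of `size` floats with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Five-value categorical + three-value categorical + one floating-point feature.
feature_vector = one_hot(1, 5) + one_hot(2, 3) + [8.3]
print(feature_vector)
# [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 8.3]
```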

See Numerical data: How a model ingests data using feature vectors in Machine Learning Crash Course for more information.

feedback loop

#fundamentals

In machine learning, a situation in which a model's predictions influence the training data for the same model or another model. For example, a model that recommends movies will influence the movies that people see, which will then influence subsequent movie recommendation models.

See Production ML systems: Questions to ask in Machine Learning Crash Course for more information.

G

generalization

#fundamentals

A model's ability to make correct predictions on new, previously unseen data. A model that can generalize is the opposite of a model that is overfitting.

See Generalization in Machine Learning Crash Course for more information.

generalization curve

#fundamentals

A plot of both training loss and validation loss as a function of the number of iterations.

A generalization curve can help you detect possible overfitting. For example, the following generalization curve suggests overfitting because validation loss ultimately becomes significantly higher than training loss.

A Cartesian graph in which the y-axis is labeled loss and the x-axis is labeled iterations. Two plots appear. One plot shows the training loss and the other shows the validation loss. The two plots start off similarly, but the training loss eventually dips far lower than the validation loss.

See Generalization in Machine Learning Crash Course for more information.

gradient descent

#fundamentals

A mathematical technique to minimize loss. Gradient descent iteratively adjusts weights and biases, gradually finding the best combination to minimize loss.
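
To make the iterative adjustment concrete, here is a minimal sketch that fits one weight and one bias by gradient descent on mean squared error. The toy data (generated from y = 2x + 1), the learning rate of 0.05, and the iteration count are all assumptions chosen for illustration.

```python
def gradient_descent(xs, ys, learning_rate=0.05, iterations=2000):
    """Fit y ≈ weight * x + bias by repeatedly stepping against the loss gradient."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Prediction error for every example under the current parameters.
        errors = [(weight * x + bias) - y for x, y in zip(xs, ys)]
        # Gradients of mean squared error with respect to weight and bias.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias


weight, bias = gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(weight, 3), round(bias, 3))
```

On this data the parameters settle very close to the weight 2 and bias 1 that generated it.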

Gradient descent is older—much, much older—than machine learning.

See Linear regression: Gradient descent in Machine Learning Crash Course for more information.

ground truth

#fundamentals

Reality.

The thing that actually happened.

For example, consider a binary classification model that predicts whether a student in their first year of university will graduate within six years. Ground truth for this model is whether or not that student actually graduated within six years.

H

hidden layer

#fundamentals

A layer in a neural network between the input layer (the features) and the output layer (the prediction). Each hidden layer consists of one or more neurons. For example, the following neural network contains two hidden layers, the first with three neurons and the second with two neurons:

Four layers. The first layer is an input layer containing two features. The second layer is a hidden layer containing three neurons. The third layer is a hidden layer containing two neurons. The fourth layer is an output layer. Each feature contains three edges, each of which points to a different neuron in the second layer. Each of the neurons in the second layer contains two edges, each of which points to a different neuron in the third layer. Each of the neurons in the third layer contains one edge, each pointing to the output layer.

A deep neural network contains more than one hidden layer. For example, the preceding illustration is a deep neural network because the model contains two hidden layers.

See Neural networks: Nodes and hidden layers in Machine Learning Crash Course for more information.

hyperparameter

#fundamentals

The variables that you or a hyperparameter tuning service adjust during successive runs of training a model. For example, learning rate is a hyperparameter. You could set the learning rate to 0.01 before one training session. If you determine that 0.01 is too high, you could perhaps set the learning rate to 0.003 for the next training session.

In contrast, parameters are the various weights and bias that the model learns during training.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

I

independently and identically distributed (iid)

#fundamentals

Data drawn from a distribution that doesn't change, and where each value drawn doesn't depend on values that have been drawn previously. IID data is the ideal gas of machine learning: a useful mathematical construct but almost never exactly found in the real world. For example, the distribution of visitors to a web page may be iid over a brief window of time; that is, the distribution doesn't change during that brief window and one person's visit is generally independent of another's visit. However, if you expand that window of time, seasonal differences in the web page's visitors may appear.

See also nonstationarity.

inference

#fundamentals

In machine learning, the process of making predictions by applying a trained model to unlabeled examples.

Inference has a somewhat different meaning in statistics. See the Wikipedia article on statistical inference for details.

See Supervised Learning in the Intro to ML course to see inference's role in a supervised learning system.

input layer

#fundamentals

The layer of a neural network that holds the feature vector. That is, the input layer provides examples for training or inference. For example, the input layer in the following neural network consists of two features:

Four layers: an input layer, two hidden layers, and an output layer.

interpretability

#fundamentals

The ability to explain or to present an ML model's reasoning in understandable terms to a human.

Most linear regression models, for example, are highly interpretable. (You merely need to look at the trained weights for each feature.) Decision forests are also highly interpretable. Some models, however, require sophisticated visualization to become interpretable.

You can use the Learning Interpretability Tool (LIT) to interpret ML models.

iteration

#fundamentals

A single update of a model's parameters (the model's weights and biases) during training. The batch size determines how many examples the model processes in a single iteration. For instance, if the batch size is 20, then the model processes 20 examples before adjusting the parameters.

When training a neural network, a single iteration involves the following two passes:

  1. A forward pass to evaluate loss on a single batch.
  2. A backward pass (backpropagation) to adjust the model's parameters based on the loss and the learning rate.

See Gradient descent in Machine Learning Crash Course for more information.

L

L0 regularization

#fundamentals

A type of regularization that penalizes the total number of nonzero weights in a model. For example, a model having 11 nonzero weights would be penalized more than a similar model having 10 nonzero weights.

L0 regularization is sometimes called L0-norm regularization.

L1 loss

#fundamentals
#Metric

A loss function that calculates the absolute value of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L1 loss for a batch of five examples:

Actual value of example | Model's predicted value | Absolute value of delta
7                       | 6                       | 1
5                       | 4                       | 1
8                       | 11                      | 3
4                       | 6                       | 2
9                       | 8                       | 1
                        |                         | 8 = L1 loss

L1 loss is less sensitive to outliers than L2 loss.

The Mean Absolute Error is the average L1 loss per example.
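
The calculation for that batch of five examples, together with its Mean Absolute Error, can be reproduced as follows (a sketch; the variable names are illustrative):

```python
actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]

# L1 loss: sum of absolute differences over the batch.
l1_loss = sum(abs(a - p) for a, p in zip(actual, predicted))
mean_absolute_error = l1_loss / len(actual)

print(l1_loss)              # 8
print(mean_absolute_error)  # 1.6
```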

See Linear regression: Loss in Machine Learning Crash Course for more information.

L1 regularization

#fundamentals

A type of regularization that penalizes weights in proportion to the sum of the absolute value of the weights. L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0. A feature with a weight of 0 is effectively removed from the model.

Contrast with L2 regularization.

L2 loss

#fundamentals
#Metric

A loss function that calculates the square of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L2 loss for a batch of five examples:

Actual value of example | Model's predicted value | Square of delta
7                       | 6                       | 1
5                       | 4                       | 1
8                       | 11                      | 9
4                       | 6                       | 4
9                       | 8                       | 1
                        |                         | 16 = L2 loss

Due to squaring, L2 loss amplifies the influence of outliers. That is, L2 loss reacts more strongly to bad predictions than L1 loss. For example, the L1 loss for the preceding batch would be 8 rather than 16. Notice that a single outlier accounts for 9 of the 16.

Regression models typically use L2 loss as the loss function.

The Mean Squared Error is the average L2 loss per example. Squared loss is another name for L2 loss.
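
A short sketch of the same batch shows both the total and how much of it comes from the single worst prediction (the variable names are illustrative):

```python
actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]

# Squared error for each example in the batch.
squared_deltas = [(a - p) ** 2 for a, p in zip(actual, predicted)]
l2_loss = sum(squared_deltas)
mean_squared_error = l2_loss / len(actual)

print(squared_deltas)       # [1, 1, 9, 4, 1]
print(l2_loss)              # 16
print(mean_squared_error)   # 3.2
print(max(squared_deltas))  # 9, the share contributed by the outlier
```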

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

L2 regularization

#fundamentals

A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0. Features with weights very close to 0 remain in the model but don't influence the model's prediction very much.
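
The following sketch compares the L1 and L2 penalty terms for a hypothetical set of weights. The weights are made up, and the regularization rate (lambda) that normally scales the penalty is omitted for simplicity.

```python
weights = [0.2, -0.5, 5.0, 0.0]  # hypothetical weights; 5.0 is the outlier

l1_penalty = sum(abs(w) for w in weights)  # sum of absolute values
l2_penalty = sum(w ** 2 for w in weights)  # sum of squares

# Share of the L2 penalty contributed by the single large weight.
outlier_share = 5.0 ** 2 / l2_penalty

print(round(l1_penalty, 2), round(l2_penalty, 2), round(outlier_share, 3))
```

Because the large weight dominates the squared penalty, minimizing that penalty pushes hardest on outlier weights, while weights that are already small contribute almost nothing.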

L2 regularization often improves generalization in linear models.

Contrast with L1 regularization.

See Overfitting: L2 regularization in Machine Learning Crash Course for more information.

label

#fundamentals

In supervised machine learning, the "answer" or "result" portion of an example.

Each labeled example consists of one or more features and a label. For example, in a spam detection dataset, the label would probably be either "spam" or "not spam." In a rainfall dataset, the label might be the amount of rain that fell during a certain period.

See Supervised Learning in Introduction to Machine Learning for more information.

labeled example

#fundamentals

An example that contains one or more features and a label. For example, the following table shows three labeled examples from a house valuation model, each with three features and one label:

Number of bedrooms | Number of bathrooms | House age | House price (label)
3                  | 2                   | 15        | $345,000
2                  | 1                   | 72        | $179,000
4                  | 2                   | 34        | $392,000

In supervised machine learning, models train on labeled examples and make predictions on unlabeled examples.

Contrast labeled example with unlabeled examples.

See Supervised Learning in Introduction to Machine Learning for more information.

lambda

#fundamentals

Synonym for regularization rate.

Lambda is an overloaded term. Here we're focusing on the term's definition within regularization.

layer

#fundamentals

A set of neurons in a neural network. Three common types of layers are as follows:

  • the input layer
  • the hidden layers
  • the output layer

For example, the following illustration shows a neural network with one input layer, two hidden layers, and one output layer:

A neural network with one input layer, two hidden layers, and one output layer. The input layer consists of two features. The first hidden layer consists of three neurons and the second hidden layer consists of two neurons. The output layer consists of a single node.

In TensorFlow, layers are also Python functions that take Tensors and configuration options as input and produce other tensors as output.

learning rate

#fundamentals

A floating-point number that tells the gradient descent algorithm how strongly to adjust weights and biases on each iteration. For example, a learning rate of 0.3 would adjust weights and biases three times more powerfully than a learning rate of 0.1.

Learning rate is a key hyperparameter. If you set the learning rate too low, training will take too long. If you set the learning rate too high, gradient descent often has trouble reaching convergence.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

linear

#fundamentals

A relationship between two or more variables that can be represented solely through addition and multiplication.

The plot of a linear relationship is a line.

Contrast with nonlinear.

linear model

#fundamentals

A model that assigns one weight per feature to make predictions. (Linear models also incorporate a bias.) In contrast, the relationship of features to predictions in deep models is generally nonlinear.

Linear models are usually easier to train and more interpretable than deep models. However, deep models can learn complex relationships between features.

Linear regression and logistic regression are two types of linear models.

linear regression

#fundamentals

A type of machine learning model in which both of the following are true:

  • The model is a linear model.
  • The prediction is a floating-point value. (This is the regression part of linear regression.)

Contrast linear regression with logistic regression. Also, contrast regression with classification.

See Linear regression in Machine Learning Crash Course for more information.

logistic regression

#fundamentals

A type of regression model that predicts a probability. Logistic regression models have the following characteristics:

  • The label is categorical. The term logistic regression usually refers to binary logistic regression, that is, to a model that calculates probabilities for labels with two possible values. A less common variant, multinomial logistic regression, calculates probabilities for labels with more than two possible values.
  • The loss function during training is Log Loss. (Multiple Log Loss units can be placed in parallel for labels with more than two possible values.)
  • The model has a linear architecture, not a deep neural network. However, the remainder of this definition also applies to deep models that predict probabilities for categorical labels.

For example, consider a logistic regression model that calculates the probability of an input email being either spam or not spam. During inference, suppose the model predicts 0.72. Therefore, the model is estimating:

  • A 72% chance of the email being spam.
  • A 28% chance of the email not being spam.

A logistic regression model uses the following two-step architecture:

  1. The model generates a raw prediction (y') by applying a linear function of input features.
  2. The model uses that raw prediction as input to a sigmoid function, which converts the raw prediction to a value between 0 and 1, exclusive.
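
The two steps can be sketched as follows. The feature values, weights, and bias are made-up numbers for illustration, not values from a trained model.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    # Step 1: raw prediction from a linear function of the input features.
    raw = sum(w * x for w, x in zip(weights, features)) + bias
    # Step 2: convert the raw prediction to a probability.
    return sigmoid(raw)


probability = predict_probability([2.0, 1.0], [0.5, -0.25], 0.1)
print(round(probability, 3))
```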

Like any regression model, a logistic regression model predicts a number. However, this number typically becomes part of a binary classification model as follows:

  • If the predicted number is greater than the classification threshold, the binary classification model predicts the positive class.
  • If the predicted number is less than the classification threshold, the binary classification model predicts the negative class.

See Logistic regression in Machine Learning Crash Course for more information.

Log Loss

#fundamentals

The loss function used in binary logistic regression.

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

log-odds

#fundamentals

The logarithm of the odds of some event.
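
As a small sketch, assuming the event has probability p (so its odds are p / (1 - p)) and using the natural logarithm:

```python
import math


def log_odds(p: float) -> float:
    """Natural log of the odds p / (1 - p), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


print(log_odds(0.5))            # 0.0, because the odds are even
print(round(log_odds(0.9), 3))  # positive, because the event is more likely than not
```

With this choice of logarithm, log-odds is the inverse of the sigmoid function: applying sigmoid to the log-odds of p returns p.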

loss

#fundamentals
#Metric

During the training of a supervised model, a measure of how far a model's prediction is from its label.

A loss function calculates the loss.

See Linear regression: Loss in Machine Learning Crash Course for more information.

loss curve

#fundamentals

A plot of loss as a function of the number of training iterations. The following plot shows a typical loss curve:

A Cartesian graph of loss versus training iterations, showing a
          rapid drop in loss for the initial iterations, followed by a gradual
          drop, and then a flat slope during the final iterations.

Loss curves can help you determine when your model is converging or overfitting.

Loss curves can plot all of the following types of loss:

  • training loss
  • validation loss
  • test loss

See also generalization curve.

See Overfitting: Interpreting loss curves in Machine Learning Crash Course for more information.

loss function

#fundamentals
#Metric

During training or testing, a mathematical function that calculates the loss on a batch of examples. A loss function returns a lower loss for models that make good predictions than for models that make bad predictions.

The goal of training is typically to minimize the loss that a loss function returns.

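Many different kinds of loss functions exist. Pick the appropriate loss function for the kind of model you are building. For example:

  • L2 loss (or Mean Squared Error) is a typical loss function for linear regression.
  • Log Loss is the loss function for logistic regression.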

M

machine learning

#fundamentals

A program or system that trains a model from input data. The trained model can make useful predictions from new (never-before-seen) data drawn from the same distribution as the one used to train the model.

Machine learning also refers to the field of study concerned with these programs or systems.

See the Introduction to Machine Learning course for more information.

majority class

#fundamentals

The more common label in a class-imbalanced dataset. For example, given a dataset containing 99% negative labels and 1% positive labels, the negative labels are the majority class.

Contrast with minority class.

See Datasets: Imbalanced datasets in Machine Learning Crash Course for more information.

mini-batch

#fundamentals

A small, randomly selected subset of a batch processed in one iteration. The batch size of a mini-batch is usually between 10 and 1,000 examples.

For example, suppose the entire training set (the full batch) consists of 1,000 examples. Further suppose that you set the batch size of each mini-batch to 20. Therefore, each iteration determines the loss on a random 20 of the 1,000 examples and then adjusts the weights and biases accordingly.

It is much more efficient to calculate the loss on a mini-batch than the loss on all the examples in the full batch.
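
As a minimal sketch of the example above, the following Python code draws one random mini-batch of 20 examples from a full batch of 1,000. The names `training_set` and `sample_mini_batch` are illustrative assumptions, and the examples are placeholder integers rather than real data.

```python
import random

def sample_mini_batch(training_set, batch_size, rng):
    """Return `batch_size` examples drawn at random, without replacement."""
    return rng.sample(training_set, batch_size)

# Hypothetical full batch: 1,000 placeholder examples (plain integers here).
training_set = list(range(1000))
rng = random.Random(0)  # fixed seed so the sketch is repeatable

mini_batch = sample_mini_batch(training_set, batch_size=20, rng=rng)
print(len(mini_batch))  # 20
```

In a real training loop, a new mini-batch would be drawn on every iteration and the loss computed on those 20 examples only.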

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

minority class

#fundamentals

The less common label in a class-imbalanced dataset . For example, given a dataset containing 99% negative labels and 1% positive labels, the positive labels are the minority class.

Contrast with majority class.

See Datasets: Imbalanced datasets in Machine Learning Crash Course for more information.

model

#fundamentals

In general, any mathematical construct that processes input data and returns output. Phrased differently, a model is the set of parameters and structure needed for a system to make predictions. In supervised machine learning, a model takes an example as input and infers a prediction as output. Within supervised machine learning, models differ somewhat. For example:

  • A linear regression model consists of a set of weights and a bias .
  • A neural network model consists of:
    • A set of hidden layers , each containing one or more neurons .
    • The weights and bias associated with each neuron.
  • A decision tree model consists of:
    • The shape of the tree; that is, the pattern in which the conditions and leaves are connected.
    • The conditions and leaves.

You can save, restore, or make copies of a model.

Unsupervised machine learning also generates models, typically a function that can map an input example to the most appropriate cluster .

multi-class classification

#fundamentals

In supervised learning, a classification problem in which the dataset contains more than two classes of labels. For example, the labels in the Iris dataset must be one of the following three classes:

  • Iris setosa
  • Iris virginica
  • Iris versicolor

A model trained on the Iris dataset that predicts Iris type on new examples is performing multi-class classification.

In contrast, classification problems that distinguish between exactly two classes are binary classification models . For example, an email model that predicts either spam or not spam is a binary classification model.

In clustering problems, multi-class classification refers to more than two clusters.

See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.

N

negative class

#fundamentals
#Metric

In binary classification, one class is termed positive and the other is termed negative. The positive class is the thing or event that the model is testing for and the negative class is the other possibility. For example:

  • The negative class in a medical test might be "not tumor."
  • The negative class in an email classification model might be "not spam."

Contrast with positive class.

neural network

#fundamentals

A model containing at least one hidden layer . A deep neural network is a type of neural network containing more than one hidden layer. For example, the following diagram shows a deep neural network containing two hidden layers.

A neural network with an input layer, two hidden layers, and an
          output layer.

Each neuron in a neural network connects to all of the nodes in the next layer. For example, in the preceding diagram, notice that each of the three neurons in the first hidden layer separately connects to both of the two neurons in the second hidden layer.

Neural networks implemented on computers are sometimes called artificial neural networks to differentiate them from neural networks found in brains and other nervous systems.

Some neural networks can mimic extremely complex nonlinear relationships between different features and the label.

See also convolutional neural network and recurrent neural network.

See Neural networks in Machine Learning Crash Course for more information.

neuron

#fundamentals

In machine learning, a distinct unit within a hidden layer of a neural network . Each neuron performs the following two-step action:

  1. Calculates the weighted sum of input values multiplied by their corresponding weights.
  2. Passes the weighted sum as input to an activation function .
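
The two-step action above can be sketched in a few lines of Python. This is an illustrative sketch, not a framework implementation: the function names are hypothetical, ReLU is just one possible activation function, and the optional `bias` term is an assumption based on the bias associated with each neuron.

```python
def relu(x):
    """ReLU activation: 0 for negative or zero input, otherwise the input itself."""
    return max(0.0, x)

def neuron_output(inputs, weights, activation, bias=0.0):
    # Step 1: weighted sum of the inputs and their corresponding weights.
    weighted_sum = bias + sum(x * w for x, w in zip(inputs, weights))
    # Step 2: pass the weighted sum to the activation function.
    return activation(weighted_sum)

print(neuron_output([1.0, 2.0], [0.5, 0.25], relu))             # 1.0
print(neuron_output([2.0, -1.0, 3.0], [-1.3, 0.6, 0.4], relu))  # 0.0
```

The second call has a negative weighted sum (about -2.0), so ReLU returns 0.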

A neuron in the first hidden layer accepts inputs from the feature values in the input layer . A neuron in any hidden layer beyond the first accepts inputs from the neurons in the preceding hidden layer. For example, a neuron in the second hidden layer accepts inputs from the neurons in the first hidden layer.

The following illustration highlights two neurons and their inputs.

A neural network with an input layer, two hidden layers, and an output layer. Two neurons are highlighted: one in the first hidden layer and one in the second hidden layer. The highlighted neuron in the first hidden layer receives inputs from both features in the input layer. The highlighted neuron in the second hidden layer receives inputs from each of the three neurons in the first hidden layer.

A neuron in a neural network mimics the behavior of neurons in brains and other parts of nervous systems.

node (neural network)

#fundamentals

A neuron in a hidden layer.

See Neural Networks in Machine Learning Crash Course for more information.

nonlinear

#fundamentals

A relationship between two or more variables that can't be represented solely through addition and multiplication. A linear relationship can be represented as a line; a nonlinear relationship can't be represented as a line. For example, consider two models that each relate a single feature to a single label. The model on the left is linear and the model on the right is nonlinear:

Two plots. One plot is a line, so this is a linear relationship.
          The other plot is a curve, so this is a nonlinear relationship.

See Neural networks: Nodes and hidden layers in Machine Learning Crash Course to experiment with different kinds of nonlinear functions.

nonstationarity

#fundamentals

A feature whose values change across one or more dimensions, usually time. For example, consider the following examples of nonstationarity:

  • The number of swimsuits sold at a particular store varies with the season.
  • The quantity of a particular fruit harvested in a particular region is zero for much of the year but large for a brief period.
  • Due to climate change, annual mean temperatures are shifting.

Contrast with stationarity.

normalization

#fundamentals

Broadly speaking, the process of converting a variable's actual range of values into a standard range of values, such as:

  • -1 to +1
  • 0 to 1
  • Z-scores (roughly, -3 to +3)

For example, suppose the actual range of values of a certain feature is 800 to 2,400. As part of feature engineering , you could normalize the actual values down to a standard range, such as -1 to +1.

Normalization is a common task in feature engineering . Models usually train faster (and produce better predictions) when every numerical feature in the feature vector has roughly the same range.
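
As a rough illustration of the 800 to 2,400 example, the sketch below uses linear (min-max) scaling, which is only one of several normalization techniques. The function name `scale_to_range` and its parameters are hypothetical.

```python
def scale_to_range(value, actual_min, actual_max, new_min=-1.0, new_max=1.0):
    """Linearly map `value` from [actual_min, actual_max] to [new_min, new_max]."""
    fraction = (value - actual_min) / (actual_max - actual_min)
    return new_min + fraction * (new_max - new_min)

print(scale_to_range(800, 800, 2400))   # -1.0
print(scale_to_range(1600, 800, 2400))  # 0.0
print(scale_to_range(2400, 800, 2400))  # 1.0
```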

See also Z-score normalization.

See Numerical Data: Normalization in Machine Learning Crash Course for more information.

numerical data

#fundamentals

Features represented as integers or real-valued numbers. For example, a house valuation model would probably represent the size of a house (in square feet or square meters) as numerical data. Representing a feature as numerical data indicates that the feature's values have a mathematical relationship to the label. That is, the number of square meters in a house probably has some mathematical relationship to the value of the house.

Not all integer data should be represented as numerical data. For example, postal codes in some parts of the world are integers; however, integer postal codes shouldn't be represented as numerical data in models. That's because a postal code of 20000 is not twice (or half) as potent as a postal code of 10000. Furthermore, although different postal codes do correlate to different real estate values, we can't assume that real estate values at postal code 20000 are twice as valuable as real estate values at postal code 10000. Postal codes should be represented as categorical data instead.

Numerical features are sometimes called continuous features.

See Working with numerical data in Machine Learning Crash Course for more information.

O

offline

#fundamentals

Synonym for static.

offline inference

#fundamentals

The process of a model generating a batch of predictions and then caching (saving) those predictions. Apps can then access the inferred prediction from the cache rather than rerunning the model.

For example, consider a model that generates local weather forecasts (predictions) once every four hours. After each model run, the system caches all the local weather forecasts. Weather apps retrieve the forecasts from the cache.
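
The caching pattern in the weather example might look roughly like the following sketch. Everything here is hypothetical: `toy_forecast_model` stands in for a real model, and an in-memory dictionary stands in for whatever cache a production system would use.

```python
def toy_forecast_model(location):
    """Hypothetical stand-in for a real weather model."""
    return f"forecast for {location}"

def run_batch_inference(model, locations):
    """Run the model once per location and return a cache of the predictions."""
    return {location: model(location) for location in locations}

# The batch job runs periodically (every four hours in the example above).
cache = run_batch_inference(toy_forecast_model, ["Oslo", "Helsinki"])

def get_forecast(location):
    """Apps read the cached prediction instead of rerunning the model."""
    return cache[location]

print(get_forecast("Oslo"))  # forecast for Oslo
```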

Offline inference is also called static inference.

Contrast with online inference.

See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

one-hot encoding

#fundamentals

Representing categorical data as a vector in which:

  • One element is set to 1.
  • All other elements are set to 0.

One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For example, suppose a certain categorical feature named Scandinavia has five possible values:

  • "Denmark"
  • "Sweden"
  • "Norway"
  • "Finland"
  • "Iceland"

One-hot encoding could represent each of the five values as follows:

country       Vector
"Denmark"     1 0 0 0 0
"Sweden"      0 1 0 0 0
"Norway"      0 0 1 0 0
"Finland"     0 0 0 1 0
"Iceland"     0 0 0 0 1

Thanks to one-hot encoding, a model can learn different connections based on each of the five countries.
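
The encoding shown above can be reproduced with a short Python sketch. The names `COUNTRIES` and `one_hot` are illustrative; real projects typically rely on a library encoder rather than a hand-written function.

```python
COUNTRIES = ["Denmark", "Sweden", "Norway", "Finland", "Iceland"]

def one_hot(value, vocabulary):
    """Return a vector with a 1 at the position of `value` and 0 elsewhere."""
    if value not in vocabulary:
        raise ValueError(f"unknown value: {value!r}")
    return [1 if item == value else 0 for item in vocabulary]

print(one_hot("Norway", COUNTRIES))  # [0, 0, 1, 0, 0]
```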

Representing a feature as numerical data is an alternative to one-hot encoding. Unfortunately, representing the Scandinavian countries numerically is not a good choice. For example, consider the following numeric representation:

  • "Denmark" is 0
  • "Sweden" is 1
  • "Norway" is 2
  • "Finland" is 3
  • "Iceland" is 4

With numeric encoding, a model would interpret the raw numbers mathematically and would try to train on those numbers. However, Iceland isn't actually twice as much (or half as much) of something as Norway, so the model would come to some strange conclusions.

See Categorical data: Vocabulary and one-hot encoding in Machine Learning Crash Course for more information.

one-vs.-all

#fundamentals

Given a classification problem with N classes, a solution consisting of N separate binary classifiers, one binary classifier for each possible outcome. For example, given a model that classifies examples as animal, vegetable, or mineral, a one-vs.-all solution would provide the following three separate binary classifiers:

  • animal versus not animal
  • vegetable versus not vegetable
  • mineral versus not mineral
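
One way to picture the three classifiers above is through the labels each one would train on. The sketch below only converts multi-class labels into N sets of binary labels; it does not train anything. The function name and the sample labels are hypothetical.

```python
def one_vs_all_labels(labels, classes):
    """For each class, build binary labels: 1 for that class, 0 for all others."""
    return {c: [1 if label == c else 0 for label in labels] for c in classes}

classes = ["animal", "vegetable", "mineral"]
labels = ["animal", "mineral", "vegetable", "animal"]

binary_labels = one_vs_all_labels(labels, classes)
print(binary_labels["animal"])     # [1, 0, 0, 1]
print(binary_labels["vegetable"])  # [0, 0, 1, 0]
print(binary_labels["mineral"])    # [0, 1, 0, 0]
```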

online

#fundamentals

Synonym for dynamic.

online inference

#fundamentals

Generating predictions on demand. For example, suppose an app passes input to a model and issues a request for a prediction. A system using online inference responds to the request by running the model (and returning the prediction to the app).

Contrast with offline inference.

See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

output layer

#fundamentals

The "final" layer of a neural network. The output layer contains the prediction.

The following illustration shows a small deep neural network with an input layer, two hidden layers, and an output layer:

A neural network with one input layer, two hidden layers, and one output layer. The input layer consists of two features. The first hidden layer consists of three neurons and the second hidden layer consists of two neurons. The output layer consists of a single node.

overfitting

#fundamentals

Creating a model that matches the training data so closely that the model fails to make correct predictions on new data.

Regularization can reduce overfitting. Training on a large and diverse training set can also reduce overfitting.

See Overfitting in Machine Learning Crash Course for more information.

P

pandas

#fundamentals

A column-oriented data analysis API built on top of numpy. Many machine learning frameworks, including TensorFlow, support pandas data structures as inputs. See the pandas documentation for details.

parameter

#fundamentals

The weights and biases that a model learns during training . For example, in a linear regression model, the parameters consist of the bias ( b ) and all the weights ( w 1 , w 2 , and so on) in the following formula:

$$y' = b + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$

In contrast, hyperparameters are the values that you (or a hyperparameter tuning service) supply to the model. For example, learning rate is a hyperparameter.

positive class

#fundamentals
#Metric

The class you are testing for.

For example, the positive class in a cancer model might be "tumor." The positive class in an email classification model might be "spam."

Contrast with negative class.

post-processing

#responsible
#fundamentals

Adjusting the output of a model after the model has been run. Post-processing can be used to enforce fairness constraints without modifying models themselves.

For example, one might apply post-processing to a binary classifier by setting a classification threshold such that equality of opportunity is maintained for some attribute by checking that the true positive rate is the same for all values of that attribute.

prediction

#fundamentals

A model's output. For example:

  • The prediction of a binary classification model is either the positive class or the negative class.
  • The prediction of a multi-class classification model is one class.
  • The prediction of a linear regression model is a number.

proxy labels

#fundamentals

Data used to approximate labels not directly available in a dataset.

For example, suppose you must train a model to predict employee stress level. Your dataset contains a lot of predictive features but doesn't contain a label named stress level. Undaunted, you pick "workplace accidents" as a proxy label for stress level. After all, employees under high stress get into more accidents than calm employees. Or do they? Maybe workplace accidents actually rise and fall for multiple reasons.

As a second example, suppose you want is it raining? to be a Boolean label for your dataset, but your dataset doesn't contain rain data. If photographs are available, you might establish pictures of people carrying umbrellas as a proxy label for is it raining? Is that a good proxy label? Possibly, but people in some cultures may be more likely to carry umbrellas to protect against sun than the rain.

Proxy labels are often imperfect. When possible, choose actual labels over proxy labels. That said, when an actual label is absent, pick the proxy label very carefully, choosing the least horrible proxy label candidate.

See Datasets: Labels in Machine Learning Crash Course for more information.

R

RAG

#fundamentals

Abbreviation for retrieval-augmented generation.

rater

#fundamentals

A human who provides labels for examples. "Annotator" is another name for rater.

See Categorical data: Common issues in Machine Learning Crash Course for more information.

Rectified Linear Unit (ReLU)

#fundamentals

An activation function with the following behavior:

  • If input is negative or zero, then the output is 0.
  • If input is positive, then the output is equal to the input.

For example:

  • If the input is -3, then the output is 0.
  • If the input is +3, then the output is 3.0.
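
The behavior described above fits in a one-line Python function. This is a plain-Python sketch for single numbers; libraries usually provide a vectorized version.

```python
def relu(x):
    """Return 0.0 for negative or zero input, otherwise the input itself."""
    return max(0.0, float(x))

print(relu(-3))  # 0.0
print(relu(3))   # 3.0
```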

Here is a plot of ReLU:

A Cartesian plot of two lines. The first line has a constant y value of 0, running along the x-axis from (-infinity, 0) to (0, 0). The second line starts at (0, 0). This line has a slope of +1, so it runs from (0, 0) to (+infinity, +infinity).

ReLU is a very popular activation function. Despite its simple behavior, ReLU still enables a neural network to learn nonlinear relationships between features and the label .

regression model

#fundamentals

Informally, a model that generates a numerical prediction. (In contrast, a classification model generates a class prediction.) For example, the following are all regression models:

  • A model that predicts a certain house's value in Euros, such as 423,000.
  • A model that predicts a certain tree's life expectancy in years, such as 23.2.
  • A model that predicts the amount of rain in inches that will fall in a certain city over the next six hours, such as 0.18.

Two common types of regression models are:

  • Linear regression , which finds the line that best fits label values to features.
  • Logistic regression , which generates a probability between 0.0 and 1.0 that a system typically then maps to a class prediction.

Not every model that outputs numerical predictions is a regression model. In some cases, a numeric prediction is really just a classification model that happens to have numeric class names. For example, a model that predicts a numeric postal code is a classification model, not a regression model.

regularization

#fundamentals

Any mechanism that reduces overfitting . Popular types of regularization include:

Regularization can also be defined as the penalty on a model's complexity.

See Overfitting: Model complexity in Machine Learning Crash Course for more information.

regularization rate

#fundamentals

A number that specifies the relative importance of regularization during training. Raising the regularization rate reduces overfitting but may reduce the model's predictive power. Conversely, reducing or omitting the regularization rate increases overfitting.

See Overfitting: L2 regularization in Machine Learning Crash Course for more information.

ReLU

#fundamentals

Abbreviation for Rectified Linear Unit.

retrieval-augmented generation (RAG)

#fundamentals

A technique for improving the quality of large language model (LLM) output by grounding it with sources of knowledge retrieved after the model was trained. RAG improves the accuracy of LLM responses by providing the trained LLM with access to information retrieved from trusted knowledge bases or documents.

Common motivations to use retrieval-augmented generation include:

  • Increasing the factual accuracy of a model's generated responses.
  • Giving the model access to knowledge it was not trained on.
  • Changing the knowledge that the model uses.
  • Enabling the model to cite sources.

For example, suppose that a chemistry app uses the PaLM API to generate summaries related to user queries. When the app's backend receives a query, the backend:

  1. Searches for ("retrieves") data that's relevant to the user's query.
  2. Appends ("augments") the relevant chemistry data to the user's query.
  3. Instructs the LLM to create a summary based on the appended data.
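
The first two steps above can be sketched with a toy example. This is a deliberately simplified illustration under several assumptions: `retrieve` ranks documents by shared words (real systems typically use a search index or embeddings), `augment` uses a made-up prompt format, and no LLM API is called. Step 3 would send `prompt` to whichever LLM the app uses.

```python
def retrieve(query, documents, top_k=1):
    """Toy retriever: rank documents by how many words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def augment(query, retrieved):
    """Append the retrieved text to the user's query (hypothetical prompt format)."""
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nSummarize using only the context."

# Hypothetical knowledge base and user query.
documents = [
    "Sodium chloride dissolves readily in water.",
    "Noble gases rarely form compounds.",
]
query = "Does sodium chloride dissolve in water?"

prompt = augment(query, retrieve(query, documents))
print(prompt)
```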

ROC (receiver operating characteristic) Curve

#fundamentals
#Metric

A graph of true positive rate versus false positive rate for different classification thresholds in binary classification.

The shape of an ROC curve suggests a binary classification model's ability to separate positive classes from negative classes. Suppose, for example, that a binary classification model perfectly separates all the negative classes from all the positive classes:

A number line with 8 positive examples on the right side and
          7 negative examples on the left.

The ROC curve for the preceding model looks as follows:

An ROC curve. The x-axis is False Positive Rate and the y-axis
          is True Positive Rate. The curve has an inverted L shape. The curve
          starts at (0.0,0.0) and goes straight up to (0.0,1.0). Then the curve
          goes from (0.0,1.0) to (1.0,1.0).

In contrast, the following illustration graphs the raw logistic regression values for a terrible model that can't separate negative classes from positive classes at all:

A number line with positive examples and negative examples completely intermixed.

The ROC curve for this model looks as follows:

An ROC curve, which is actually a straight line from (0.0,0.0)
          to (1.0,1.0).

Meanwhile, back in the real world, most binary classification models separate positive and negative classes to some degree, but usually not perfectly. So, a typical ROC curve falls somewhere between the two extremes:

An ROC curve. The x-axis is False Positive Rate and the y-axis
          is True Positive Rate. The ROC curve approximates a shaky arc
          traversing the compass points from West to North.

The point on an ROC curve closest to (0.0,1.0) theoretically identifies the ideal classification threshold. However, several other real-world issues influence the selection of the ideal classification threshold. For example, perhaps false negatives cause far more pain than false positives.

A numerical metric called AUC summarizes the ROC curve into a single floating-point value.
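
To make the perfectly separating case concrete, the following sketch computes (false positive rate, true positive rate) points at a few classification thresholds. The scores, labels, and thresholds are made-up illustrative values, and `roc_points` is a hypothetical helper rather than a library function.

```python
def roc_points(scores, labels, thresholds):
    """Return (false positive rate, true positive rate) for each threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        # An example is predicted positive when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical model output that separates the two classes perfectly.
scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]

points = roc_points(scores, labels, thresholds=[1.0, 0.5, 0.0])
print(points)  # [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
```

The three points trace the inverted L shape: straight up from (0.0, 0.0) to (0.0, 1.0), then across to (1.0, 1.0).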

Root Mean Squared Error (RMSE)

#fundamentals
#Metric

The square root of the Mean Squared Error.
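
A minimal Python sketch of this definition, assuming the labels and predictions arrive as two equal-length lists (the values below are made up):

```python
import math

def rmse(labels, predictions):
    """Square root of the mean of the squared differences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    mean_squared_error = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mean_squared_error)

# Errors are -3 and +3, so the squared errors are 9 and 9, and their mean is 9.
print(rmse([2.0, 4.0], [5.0, 1.0]))  # 3.0
```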

S

sigmoid function

#fundamentals

A mathematical function that "squishes" an input value into a constrained range, typically 0 to 1 or -1 to +1. That is, you can pass any number (two, a million, negative billion, whatever) to a sigmoid and the output will still be in the constrained range. A plot of the sigmoid activation function looks as follows:

A two-dimensional curved plot with x values spanning the domain -infinity to +infinity, while y values span the range almost 0 to almost 1. When x is 0, y is 0.5. The slope of the curve is always positive, with the highest slope at (0, 0.5) and gradually decreasing slopes as the absolute value of x increases.
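
The curve described above matches the logistic sigmoid, 1 / (1 + e^(-x)), which maps any input into the range 0 to 1. The sketch below is one way to write it in plain Python. The branch on the sign of x is an implementation choice made here to avoid overflow for very large negative inputs.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to stay numerically safe for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x is negative here, so this cannot overflow
    return z / (1.0 + z)

print(sigmoid(0))  # 0.5
```

In floating-point arithmetic, extreme inputs such as a billion or negative billion round to exactly 1.0 or 0.0, so the output still stays within the constrained range.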

The sigmoid function has several uses in machine learning, including:

softmax

#fundamentals

A function that determines probabilities for each possible class in a multi-class classification model . The probabilities add up to exactly 1.0. For example, the following table shows how softmax distributes various probabilities:

Image is a...   Probability
dog             .85
cat             .13
horse           .02
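
A short sketch of how softmax turns raw model scores into probabilities that sum to 1.0. The input scores below are made-up values and are not the ones behind the table above. Subtracting the maximum score before exponentiating is a common numerical-stability step that does not change the result.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1.0."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the classes dog, cat, and horse.
probabilities = softmax([2.0, 0.1, -1.8])
print(probabilities)
print(sum(probabilities))
```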

Softmax is also called full softmax.

Contrast with candidate sampling.

See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.

sparse feature

#language
#fundamentals

A feature whose values are predominantly zero or empty. For example, a feature containing a single 1 value and a million 0 values is sparse. In contrast, a dense feature has values that are predominantly not zero or empty.

In machine learning, a surprising number of features are sparse features. Categorical features are usually sparse features. For example, of the 300 possible tree species in a forest, a single example might identify just a maple tree . Or, of the millions of possible videos in a video library, a single example might identify just "Casablanca."

In a model, you typically represent sparse features with one-hot encoding . If the one-hot encoding is big, you might put an embedding layer on top of the one-hot encoding for greater efficiency.

sparse representation

#language
#fundamentals

Storing only the position(s) of nonzero elements in a sparse feature.

For example, suppose a categorical feature named species identifies the 36 tree species in a particular forest. Further assume that each example identifies only a single species.

You could use a one-hot vector to represent the tree species in each example. A one-hot vector would contain a single 1 (to represent the particular tree species in that example) and 35 0s (to represent the 35 tree species not in that example). So, the one-hot representation of maple might look something like the following:

A vector in which positions 0 through 23 hold the value 0, position
          24 holds the value 1, and positions 25 through 35 hold the value 0.

Alternatively, sparse representation would simply identify the position of the particular species. If maple is at position 24, then the sparse representation of maple would simply be:

24

Notice that the sparse representation is much more compact than the one-hot representation.
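
The maple example can be sketched in Python as follows. The helper names are hypothetical, and the position of maple (24) is taken from the example above.

```python
def to_sparse(vector):
    """Return the positions of the nonzero elements."""
    return [i for i, value in enumerate(vector) if value != 0]

def to_one_hot(position, size):
    """Rebuild the one-hot vector from a single position."""
    return [1 if i == position else 0 for i in range(size)]

# One-hot vector for maple: 36 species, maple at position 24.
maple_one_hot = to_one_hot(24, 36)

print(len(maple_one_hot))        # 36
print(to_sparse(maple_one_hot))  # [24]
```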

See Working with categorical data in Machine Learning Crash Course for more information.

sparse vector

#fundamentals

A vector whose values are mostly zeroes. See also sparse feature and sparsity.

squared loss

#fundamentals
#Metric

Synonym for L2 loss.

static

#fundamentals

Something done once rather than continuously. The terms static and offline are synonyms. The following are common uses of static and offline in machine learning:

  • A static model (or offline model) is a model trained once and then used for a while.
  • Static training (or offline training) is the process of training a static model.
  • Static inference (or offline inference) is a process in which a model generates a batch of predictions at a time.

Contrast with dynamic.

static inference

#fundamentals

Synonym for offline inference.

stationarity

#fundamentals

A feature whose values don't change across one or more dimensions, usually time. For example, a feature whose values look about the same in 2021 and 2023 exhibits stationarity.

In the real world, very few features exhibit stationarity. Even features synonymous with stability (like sea level) change over time.

Contrast with nonstationarity.

stochastic gradient descent (SGD)

#fundamentals

A gradient descent algorithm in which the batch size is one. In other words, SGD trains on a single example chosen uniformly at random from a training set .
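
The sketch below illustrates the idea on a toy problem: a one-feature linear model trained with squared loss, updating on one randomly chosen example per step. The data, learning rate, step count, and function name are all assumptions made for illustration.

```python
import random

def sgd_step(w, b, example, learning_rate):
    """Update w and b using the squared-loss gradient from a single example."""
    x, y = example
    error = (w * x + b) - y                # prediction minus label
    w -= learning_rate * 2 * error * x     # d(error**2)/dw
    b -= learning_rate * 2 * error         # d(error**2)/db
    return w, b

# Hypothetical noise-free training set generated from y = 2x + 1.
examples = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0]]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
w, b = 0.0, 0.0
for _ in range(2000):
    # Batch size of one: a single example chosen uniformly at random.
    w, b = sgd_step(w, b, rng.choice(examples), learning_rate=0.05)

print(w, b)  # values close to 2.0 and 1.0
```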

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

supervised machine learning

#fundamentals

Training a model from features and their corresponding labels . Supervised machine learning is analogous to learning a subject by studying a set of questions and their corresponding answers. After mastering the mapping between questions and answers, a student can then provide answers to new (never-before-seen) questions on the same topic.

Compare with unsupervised machine learning.

See Supervised Learning in the Introduction to ML course for more information.

synthetic feature

#fundamentals

A feature not present among the input features, but assembled from one or more of them. Methods for creating synthetic features include the following:

  • Bucketing a continuous feature into range bins.
  • Creating a feature cross .
  • Multiplying (or dividing) one feature value by other feature value(s) or by itself. For example, if a and b are input features, then the following are examples of synthetic features:
    • ab
    • a²
  • Applying a transcendental function to a feature value. For example, if c is an input feature, then the following are examples of synthetic features:
    • sin(c)
    • ln(c)

Features created by normalizing or scaling alone are not considered synthetic features.
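
The multiplication and transcendental-function examples above could be computed as in the following sketch. The function name and the input values are hypothetical, and `c` must be positive for the natural logarithm to be defined.

```python
import math

def synthetic_features(a, b, c):
    """Build example synthetic features from input features a, b, and c (c > 0)."""
    return {
        "ab": a * b,          # one feature multiplied by another
        "a^2": a ** 2,        # one feature multiplied by itself
        "sin(c)": math.sin(c),
        "ln(c)": math.log(c),
    }

features = synthetic_features(a=2.0, b=3.0, c=1.0)
print(features["ab"])     # 6.0
print(features["a^2"])    # 4.0
print(features["ln(c)"])  # 0.0
```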

T

test loss

#fundamentals
#Metric

A metric representing a model's loss against the test set . When building a model , you typically try to minimize test loss. That's because a low test loss is a stronger quality signal than a low training loss or low validation loss .

A large gap between test loss and training loss or validation loss sometimes suggests that you need to increase the regularization rate .

training

#fundamentals

The process of determining the ideal parameters (weights and biases) comprising a model . During training, a system reads in examples and gradually adjusts parameters. Training uses each example anywhere from a few times to billions of times.

See Supervised Learning in the Introduction to ML course for more information.

training loss

#fundamentals
#Metric

A metric representing a model's loss during a particular training iteration. For example, suppose the loss function is Mean Squared Error . Perhaps the training loss (the Mean Squared Error) for the 10th iteration is 2.2, and the training loss for the 100th iteration is 1.9.

A loss curve plots training loss versus the number of iterations. A loss curve provides the following hints about training:

  • A downward slope implies that the model is improving.
  • An upward slope implies that the model is getting worse.
  • A flat slope implies that the model has reached convergence .

For example, the following somewhat idealized loss curve shows:

  • A steep downward slope during the initial iterations, which implies rapid model improvement.
  • A gradually flattening (but still downward) slope until close to the end of training, which implies continued model improvement at a somewhat slower pace than during the initial iterations.
  • A flat slope towards the end of training, which suggests convergence.

The plot of training loss versus iterations. This loss curve starts
     with a steep downward slope. The slope gradually flattens until the
     slope becomes zero.

Although training loss is important, see also generalization .

training-serving skew

#fundamentals

The difference between a model's performance during training and that same model's performance during serving .

training set

#fundamentals

The subset of the dataset used to train a model.

Traditionally, examples in the dataset are divided into the following three distinct subsets:

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

true negative (TN)

#fundamentals
#Metric

An example in which the model correctly predicts the negative class . For example, the model infers that a particular email message is not spam , and that email message really is not spam .

true positive (TP)

#fundamentals
#Metric

An example in which the model correctly predicts the positive class . For example, the model infers that a particular email message is spam, and that email message really is spam.

true positive rate (TPR)

#fundamentals
#Metric

Synonym for recall. That is:

$$\text{true positive rate} = \frac {\text{true positives}} {\text{true positives} + \text{false negatives}}$$

True positive rate is the y-axis in an ROC curve.
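
The formula above translates directly into Python. The counts used in the example (8 true positives, 2 false negatives) are made up for illustration.

```python
def true_positive_rate(true_positives, false_negatives):
    """Fraction of actual positives that the model correctly identified."""
    return true_positives / (true_positives + false_negatives)

print(true_positive_rate(8, 2))  # 0.8
```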

U

underfitting

#fundamentals

Producing a model with poor predictive ability because the model hasn't fully captured the complexity of the training data. Many problems can cause underfitting, including:

See Overfitting in Machine Learning Crash Course for more information.

unlabeled example

#fundamentals

An example that contains features but no label . For example, the following table shows three unlabeled examples from a house valuation model, each with three features but no house value:

Number of bedrooms   Number of bathrooms   House age
3                    2                     15
2                    1                     72
4                    2                     34

In supervised machine learning , models train on labeled examples and make predictions on unlabeled examples .

In semi-supervised and unsupervised learning, unlabeled examples are used during training.

Contrast unlabeled example with labeled example.

unsupervised machine learning

#clustering
#fundamentals

Training a model to find patterns in a dataset, typically an unlabeled dataset.

The most common use of unsupervised machine learning is to cluster data into groups of similar examples. For example, an unsupervised machine learning algorithm can cluster songs based on various properties of the music. The resulting clusters can become an input to other machine learning algorithms (for example, to a music recommendation service). Clustering can help when useful labels are scarce or absent. For example, in domains such as anti-abuse and fraud, clusters can help humans better understand the data.

Contrast with supervised machine learning.

See What is Machine Learning? in the Introduction to ML course for more information.

V

validation

#fundamentals

The initial evaluation of a model's quality. Validation checks the quality of a model's predictions against the validation set.

Because the validation set differs from the training set, validation helps guard against overfitting.

You might think of evaluating the model against the validation set as the first round of testing and evaluating the model against the test set as the second round of testing.

validation loss

#fundamentals
#Metric

A metric representing a model's loss on the validation set during a particular iteration of training.

See also generalization curve.

validation set

#fundamentals

The subset of the dataset used for the initial evaluation of a trained model. Typically, you evaluate the trained model against the validation set several times before evaluating the model against the test set.

Traditionally, you divide the examples in the dataset into three distinct subsets: a training set, a validation set, and a test set.

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.
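One way to guarantee that no example lands in more than one subset is to shuffle the dataset once and then slice it. The sketch below assumes the three subsets are the training, validation, and test sets. The function name `split_dataset`, the 70/15/15 proportions, and the fixed seed are illustrative assumptions, not recommendations from the glossary.

```python
import random


def split_dataset(examples, validation_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then slice into three non-overlapping subsets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible

    n_validation = int(len(shuffled) * validation_fraction)
    n_test = int(len(shuffled) * test_fraction)

    validation_set = shuffled[:n_validation]
    test_set = shuffled[n_validation:n_validation + n_test]
    training_set = shuffled[n_validation + n_test:]  # everything that is left
    return training_set, validation_set, test_set


# Hypothetical dataset of 100 examples, represented by their indices.
training_set, validation_set, test_set = split_dataset(range(100))
print(len(training_set), len(validation_set), len(test_set))  # 70 15 15
```

Because each slice covers a different index range of the same shuffled list, every example belongs to exactly one subset.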

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

W

weight

#fundamentals

A value that a model multiplies by another value. Training is the process of determining a model's ideal weights; inference is the process of using those learned weights to make predictions.

See Linear regression in Machine Learning Crash Course for more information.

weighted sum

#fundamentals

The sum of all the relevant input values multiplied by their corresponding weights. For example, suppose the relevant inputs consist of the following:

| input value | input weight |
|---|---|
| 2 | -1.3 |
| -1 | 0.6 |
| 3 | 0.4 |

The weighted sum is therefore:

weighted sum = (2)(-1.3) + (-1)(0.6) + (3)(0.4) = -2.0

A weighted sum is the input argument to an activation function.
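The same arithmetic can be reproduced in Python as a quick check. This is a minimal sketch: the helper names `weighted_sum` and `relu` are hypothetical, and ReLU is used only as one example of an activation function that could receive the result.

```python
import math


def weighted_sum(inputs, weights):
    """Multiply each input by its corresponding weight and add the products."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights))


def relu(value):
    """ReLU activation: negative values become 0, others pass through."""
    return max(0.0, value)


inputs = [2, -1, 3]
weights = [-1.3, 0.6, 0.4]

z = weighted_sum(inputs, weights)
print(round(z, 6))  # -2.0 (rounded to hide floating-point noise)
print(relu(z))      # 0.0, because the weighted sum is negative
```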

Z

Z-score normalization

#fundamentals

A scaling technique that replaces a raw feature value with a floating-point value representing the number of standard deviations from that feature's mean. For example, consider a feature whose mean is 800 and whose standard deviation is 100. The following table shows how Z-score normalization would map the raw value to its Z-score:

| Raw value | Z-score |
|---|---|
| 800 | 0 |
| 950 | +1.5 |
| 575 | -2.25 |

The machine learning model then trains on the Z-scores for that feature instead of on the raw values.
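The mapping in the table follows from subtracting the mean and dividing by the standard deviation. The sketch below reproduces the three rows. The function name `z_score` is a hypothetical helper, and the mean and standard deviation are passed in directly rather than estimated from data.

```python
def z_score(raw_value, mean, standard_deviation):
    """Return how many standard deviations raw_value lies from the mean."""
    if standard_deviation <= 0:
        raise ValueError("standard_deviation must be positive")
    return (raw_value - mean) / standard_deviation


# Feature statistics from the example above: mean 800, standard deviation 100.
for raw_value in (800, 950, 575):
    print(raw_value, z_score(raw_value, mean=800, standard_deviation=100))
# 800 0.0
# 950 1.5
# 575 -2.25
```

In practice the mean and standard deviation would typically be computed from the training set and then reused for any other data the model sees.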

See Numerical data: Normalization in Machine Learning Crash Course for more information.