Machine Learning Glossary: Metrics

This page contains Metrics glossary terms. For all glossary terms, see the complete Machine Learning Glossary.

A

accuracy

#fundamentals

#Metric

The number of correct classification predictions divided by the total number of predictions. That is:

$$\text{Accuracy} = \frac{\text{correct predictions}} {\text{correct predictions + incorrect predictions}}$$

For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:

$$\text{Accuracy} = \frac{\text{40}} {\text{40 + 10}} = \text{80\%}$$

Binary classification provides specific names for the different categories of correct predictions and incorrect predictions. So, the accuracy formula for binary classification is as follows:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}} {\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

where:

TP is the number of true positives (correct predictions).
TN is the number of true negatives (correct predictions).
FP is the number of false positives (incorrect predictions).
FN is the number of false negatives (incorrect predictions).

Compare and contrast accuracy with precision and recall.

Details about accuracy and class-imbalanced datasets:

Although accuracy is a valuable metric for some situations, it is highly misleading for others. Notably, accuracy is usually a poor metric for evaluating classification models that process class-imbalanced datasets.

For example, suppose snow falls only 25 days per century in a certain subtropical city. Since days without snow (the negative class) vastly outnumber days with snow (the positive class), the snow dataset for this city is class-imbalanced. Imagine a binary classification model that is supposed to predict either snow or no snow each day but simply predicts "no snow" every day. This model is highly accurate but has no predictive power. The following table summarizes the results for a century of predictions:

Category	Number
TP	0
TN	36499
FP	0
FN	25

The accuracy of this model is therefore:

accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (0 + 36499) / (0 + 36499 + 0 + 25) = 0.9993 = 99.93%

Although 99.93% accuracy seems like a very impressive percentage, the model actually has no predictive power.

Precision and recall are usually more useful metrics than accuracy for evaluating models trained on class-imbalanced datasets.
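
The snow example can be reproduced with a few lines of Python. This is a minimal sketch; the function names `accuracy` and `recall` are illustrative choices and do not come from any particular library. It shows that the always-"no snow" model scores high on accuracy while its recall on the positive class is zero.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """(TP + TN) / (TP + TN + FP + FN), as in the formula above."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that the model found."""
    positives = tp + fn
    return tp / positives if positives else 0.0


# Confusion-matrix counts from the century-of-snow table.
snow = {"tp": 0, "tn": 36499, "fp": 0, "fn": 25}

print(f"accuracy = {accuracy(**snow):.4f}")            # 0.9993
print(f"recall   = {recall(snow['tp'], snow['fn'])}")  # 0.0
```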

See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.

area under the PR curve

#Metric

See PR AUC (Area under the PR Curve).

area under the ROC curve

#Metric

See AUC (Area under the ROC curve).

AUC (Area under the ROC curve)

#fundamentals

#Metric

A number between 0.0 and 1.0 representing a binary classification model's ability to separate positive classes from negative classes. The closer the AUC is to 1.0, the better the model's ability to separate classes from each other.

For example, the following illustration shows a classification model that separates positive classes (green ovals) from negative classes (purple rectangles) perfectly. This unrealistically perfect model has an AUC of 1.0:

A number line with 8 positive examples on one side and 9 negative examples on the other side.

Conversely, the following illustration shows the results for a classification model that generated random results. This model has an AUC of 0.5:

A number line with 6 positive examples and 6 negative examples. The sequence of examples is positive, negative, positive, negative, positive, negative, positive, negative, positive, negative, positive, negative.

Yes, the preceding model has an AUC of 0.5, not 0.0.

Most models are somewhere between the two extremes. For instance, the following model separates positives from negatives somewhat, and therefore has an AUC somewhere between 0.5 and 1.0:

A number line with 6 positive examples and 6 negative examples. The sequence of examples is negative, negative, negative, negative, positive, negative, positive, positive, negative, positive, positive, positive.

AUC ignores any value you set for classification threshold. Instead, AUC considers all possible classification thresholds.

The relationship between AUC and ROC curves:

AUC represents the area under an ROC curve. For example, the ROC curve for a model that perfectly separates positives from negatives looks as follows:

AUC is the area of the gray region in the preceding illustration. In this unusual case, the area is simply the length of the gray region (1.0) multiplied by the width of the gray region (1.0). So, the product of 1.0 and 1.0 yields an AUC of exactly 1.0, which is the highest possible AUC score.

Conversely, the ROC curve for a classifier that can't separate classes at all is as follows. The area of this gray region is 0.5.

A more typical ROC curve looks approximately like the following:

It would be painstaking to calculate the area under this curve manually, which is why a program typically calculates most AUC values.

A more formal definition of AUC:

AUC is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.
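
The probabilistic definition can be turned directly into code by comparing every positive example's score with every negative example's score. The sketch below is one simple way to do that; the function name `pairwise_auc`, the choice to count ties as half a win, and all of the scores are assumptions made for illustration. Production code would normally rely on a library routine instead.

```python
from itertools import product


def pairwise_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs in which the positive
    example receives the higher score. Ties count as half a win."""
    wins = 0.0
    for pos, neg in product(positive_scores, negative_scores):
        if pos > neg:
            wins += 1.0
        elif pos == neg:
            wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical model scores (higher means "more confident it is positive").
perfect = pairwise_auc([0.8, 0.9], [0.1, 0.2])
partial = pairwise_auc([0.4, 0.7, 0.9], [0.1, 0.5, 0.3])

print(perfect)            # 1.0: every positive outranks every negative
print(round(partial, 3))  # 0.889: 8 of the 9 pairs are ordered correctly
```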

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

average precision at k

#Metric

A metric for summarizing a model's performance on a single prompt that generates ranked results, such as a numbered list of book recommendations. Average precision at k is the average of the precision at k values for each relevant result. The formula for average precision at k is therefore:

\[{\text{average precision at k}} = \frac{1}{n} \sum_{i=1}^n {\text{precision at k for each relevant item} } \]

where:

$n$ is the number of relevant items in the list.

Contrast with recall at k.

Example:

Suppose a large language model is given the following query:

List the 6 funniest movies of all time in order.

And the large language model returns the following list:

The General
Mean Girls
Platoon
Bridesmaids
Citizen Kane
This is Spinal Tap

Four of the movies in the returned list are very funny (that is, they are relevant) but two movies are dramas (not relevant). The following table details the results:

Position	Movie	Relevant?	Precision at k
1	The General	Yes	1.0
2	Mean Girls	Yes	1.0
3	Platoon	No	not relevant
4	Bridesmaids	Yes	0.75
5	Citizen Kane	No	not relevant
6	This is Spinal Tap	Yes	0.67

The number of relevant results is 4. Therefore, you can calculate the average precision at 6 as follows:

$${\text{average precision at 6}} = \frac{1}{4} {\text{(1.0 + 1.0 + 0.75 + 0.67)} } $$

$${\text{average precision at 6}} \approx {\text{0.85} } $$
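
The same calculation can be sketched in Python. The function name `average_precision_at_k` and the list-of-booleans input format are assumptions made for this illustration; the relevance flags are taken from the movie table.

```python
def average_precision_at_k(relevance, k):
    """Average of the precision-at-position values taken at each
    relevant position within the top k results."""
    hits = 0
    precisions = []
    for position, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / position)
    # With no relevant results there is nothing to average.
    return sum(precisions) / len(precisions) if precisions else 0.0


# Relevance of the six returned movies, in ranked order.
movies = [True, True, False, True, False, True]
print(round(average_precision_at_k(movies, 6), 2))  # 0.85
```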

B

baseline

#Metric

A model used as a reference point for comparing how well another model (typically, a more complex one) is performing. For example, a logistic regression model might serve as a good baseline for a deep model.

For a particular problem, the baseline helps model developers quantify the minimal expected performance that a new model must achieve for the new model to be useful.

Boolean Questions (BoolQ)

#Metric

A dataset for evaluating an LLM's proficiency in answering yes-or-no questions. Each of the challenges in the dataset has three components:

A query.
A passage implying the answer to the query.
The correct answer, which is either yes or no.

For example:

Query: Are there any nuclear power plants in Michigan?
Passage: ...three nuclear power plants supply Michigan with about 30% of its electricity.
Correct answer: Yes

Researchers gathered the questions from anonymized, aggregated Google Search queries and then used Wikipedia pages to ground the information.

For more information, see BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions.

BoolQ is a component of the SuperGLUE ensemble.

BoolQ

#Metric

Abbreviation for Boolean Questions.

C

CB

#Metric

Abbreviation for CommitmentBank.

character N-gram F-score (ChrF)

#Metric

A metric to evaluate machine translation models. Character N-gram F-score determines the degree to which N-grams in reference text overlap the N-grams in an ML model's generated text.

Character N-gram F-score is similar to metrics in the ROUGE and BLEU families, except that:

Character N-gram F-score operates on character N-grams.
ROUGE and BLEU operate on word N-grams or tokens.

Choice of Plausible Alternatives (COPA)

#Metric

A dataset for evaluating how well an LLM can identify the better of two alternative answers to a premise. Each of the challenges in the dataset consists of three components:

A premise, which is typically a statement followed by a question.
Two possible answers to the question posed in the premise, one of which is correct and the other incorrect.
The correct answer.

For example:

Premise: The man broke his toe. What was the cause of this?
Possible answers:
1. He got a hole in his sock.
2. He dropped a hammer on his foot.
Correct answer: 2

COPA is a component of the SuperGLUE ensemble.

CommitmentBank (CB)

#Metric

A dataset for evaluating an LLM's proficiency in determining whether the author of a passage believes a target clause within that passage. Each entry in the dataset contains:

A passage.
A target clause within that passage.
A Boolean value indicating whether the author of the passage believes the target clause.

For example:

Passage: What fun to hear Artemis laugh. She's such a serious child. I didn't know she had a sense of humor.
Target clause: she had a sense of humor
Boolean: True, meaning the author believes the target clause.

CommitmentBank is a component of the SuperGLUE ensemble.

COPA

#Metric

Abbreviation for Choice of Plausible Alternatives.

cost

#Metric

Synonym for loss.

counterfactual fairness

#responsible

#Metric

A fairness metric that checks whether a classification model produces the same result for one individual as it does for another individual who is identical to the first, except with respect to one or more sensitive attributes. Evaluating a classification model for counterfactual fairness is one method for surfacing potential sources of bias in a model.

See either of the following for more information:

Fairness: Counterfactual fairness in Machine Learning Crash Course.
When Worlds Collide: Integrating Different Counterfactual Assumptions in Fairness

cross-entropy

#Metric

A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the difference between two probability distributions. See also perplexity.

cumulative distribution function (CDF)

#Metric

A function that defines the frequency of samples less than or equal to a target value. For example, consider a normal distribution of continuous values. A CDF tells you that approximately 50% of samples should be less than or equal to the mean and that approximately 84% of samples should be less than or equal to one standard deviation above the mean.
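
The 50% and 84% figures can be checked numerically. The sketch below uses the standard closed form of the normal CDF in terms of the error function, via `math.erf` from the Python standard library; the helper name `normal_cdf` is an illustrative choice.

```python
import math


def normal_cdf(x: float, mean: float = 0.0, std: float = 1.0) -> float:
    """CDF of a normal distribution, expressed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


print(round(normal_cdf(0.0), 2))  # 0.5:  at the mean
print(round(normal_cdf(1.0), 2))  # 0.84: one standard deviation above the mean
```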

D

demographic parity

#responsible

#Metric

A fairness metric that is satisfied if the results of a model's classification are not dependent on a given sensitive attribute.

For example, if both Lilliputians and Brobdingnagians apply to Glubbdubdrib University, demographic parity is achieved if the percentage of Lilliputians admitted is the same as the percentage of Brobdingnagians admitted, irrespective of whether one group is on average more qualified than the other.

Contrast with equalized odds and equality of opportunity, which permit classification results in aggregate to depend on sensitive attributes, but don't permit classification results for certain specified ground truth labels to depend on sensitive attributes. See "Attacking discrimination with smarter machine learning" for a visualization exploring the tradeoffs when optimizing for demographic parity.

See Fairness: demographic parity in Machine Learning Crash Course for more information.

E

earth mover's distance (EMD)

#Metric

A measure of the relative similarity of two distributions. The lower the earth mover's distance, the more similar the distributions.

edit distance

#Metric

A measurement of how similar two text strings are to each other. In machine learning, edit distance is useful for the following reasons:

Edit distance is easy to compute.
Edit distance can compare two strings known to be similar to each other.
Edit distance can determine the degree to which different strings are similar to a given string.

Several definitions of edit distance exist, each using different string operations. See Levenshtein distance for an example.
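
As one concrete instance, Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The following is a minimal dynamic-programming sketch; the function name `levenshtein` and the example strings are chosen for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to transform a into b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```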

empirical cumulative distribution function (eCDF or EDF)

#Metric

A cumulative distribution function based on empirical measurements from a real dataset. The value of the function at any point along the x-axis is the fraction of observations in the dataset that are less than or equal to the specified value.
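
A small sketch of this definition follows. The helper name `ecdf` and the sample observations are hypothetical; `bisect_right` from the standard library counts how many sorted observations are less than or equal to a value.

```python
from bisect import bisect_right


def ecdf(observations):
    """Return a function that maps x to the fraction of observations <= x."""
    ordered = sorted(observations)
    n = len(ordered)

    def evaluate(x):
        # bisect_right gives the count of elements <= x in a sorted list.
        return bisect_right(ordered, x) / n

    return evaluate


# Hypothetical measurements.
fraction_at = ecdf([3, 1, 4, 1, 5])
print(fraction_at(1))    # 0.4: two of the five observations are <= 1
print(fraction_at(3.5))  # 0.6
```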

entropy

#df

#Metric

In information theory, a description of how unpredictable a probability distribution is. Alternatively, entropy is also defined as how much information each example contains. A distribution has the highest possible entropy when all values of a random variable are equally likely.

The entropy of a set with two possible values "0" and "1" (for example, the labels in a binary classification problem) has the following formula:

H = -p log p - q log q = -p log p - (1-p) * log (1-p)

where:

H is the entropy.
p is the fraction of "1" examples.
q is the fraction of "0" examples. Note that q = (1 - p)
log is generally log₂. In this case, the entropy unit is a bit.

For example, suppose the following:

100 examples contain the value "1"
300 examples contain the value "0"

Therefore, the entropy value is:

p = 0.25
q = 0.75
H = (-0.25)log₂(0.25) - (0.75)log₂(0.75) = 0.81 bits per example

A set that is perfectly balanced (for example, 200 "0"s and 200 "1"s) would have an entropy of 1.0 bit per example. As a set becomes more imbalanced, its entropy moves towards 0.0.
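
The worked example can be verified with a short sketch. The function name `binary_entropy` is an illustrative choice, and the code assumes log base 2 so that the result is in bits.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy, in bits, of a two-valued set whose fraction of "1"s is p."""
    if p in (0.0, 1.0):
        return 0.0  # A set with a single value is fully predictable.
    q = 1.0 - p
    return -p * math.log2(p) - q * math.log2(q)


print(round(binary_entropy(100 / 400), 2))  # 0.81 (100 "1"s, 300 "0"s)
print(binary_entropy(200 / 400))            # 1.0  (perfectly balanced)
```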

In decision trees, entropy helps formulate information gain to help the splitter select the conditions during the growth of a classification decision tree.

Compare entropy with:

gini impurity
cross-entropy loss function

Entropy is often called Shannon's entropy.

See Exact splitter for binary classification with numerical features in the Decision Forests course for more information.

equality of opportunity

#responsible

#Metric

A fairness metric to assess whether a model is predicting the desirable outcome equally well for all values of a sensitive attribute. In other words, if the desirable outcome for a model is the positive class, the goal would be to have the true positive rate be the same for all groups.

Equality of opportunity is related to equalized odds, which requires that both the true positive rates and false positive rates are the same for all groups.

Suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians' secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians' secondary schools don't offer math classes at all, and as a result, far fewer of their students are qualified. Equality of opportunity is satisfied for the preferred label of "admitted" with respect to nationality (Lilliputian or Brobdingnagian) if qualified students are equally likely to be admitted irrespective of whether they're a Lilliputian or a Brobdingnagian.

For example, suppose 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:

Table 1. Lilliputian applicants (90% are qualified)

	Qualified	Unqualified
Admitted	45	3
Rejected	45	7
Total	90	10

Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 7/10 = 70%
Total percentage of Lilliputian students admitted: (45+3)/100 = 48%

Table 2. Brobdingnagian applicants (10% are qualified)

	Qualified	Unqualified
Admitted	5	9
Rejected	5	81
Total	10	90

Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 81/90 = 90%
Total percentage of Brobdingnagian students admitted: (5+9)/100 = 14%

The preceding examples satisfy equality of opportunity for acceptance of qualified students because qualified Lilliputians and Brobdingnagians both have a 50% chance of being admitted.

While equality of opportunity is satisfied, the following two fairness metrics are not satisfied:

demographic parity: Lilliputians and Brobdingnagians are admitted to the university at different rates; 48% of Lilliputian students are admitted, but only 14% of Brobdingnagian students are admitted.
equalized odds: While qualified Lilliputian and Brobdingnagian students both have the same chance of being admitted, the additional constraint that unqualified Lilliputians and Brobdingnagians both have the same chance of being rejected isn't satisfied. Unqualified Lilliputians have a 70% rejection rate, whereas unqualified Brobdingnagians have a 90% rejection rate.
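
The three fairness checks discussed above can be sketched as simple rate comparisons over the admissions tables. Everything named here (`GroupOutcomes`, the three predicate functions, and the use of `math.isclose` for "equal" rates) is a hypothetical construction for illustration, not an established API. Note that an equal rejection rate for unqualified applicants is equivalent to an equal false positive rate, because the false positive rate is one minus that rejection rate.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupOutcomes:
    qualified_admitted: int
    qualified_rejected: int
    unqualified_admitted: int
    unqualified_rejected: int

    @property
    def qualified_admit_rate(self) -> float:
        """True positive rate for the "admitted" label."""
        return self.qualified_admitted / (
            self.qualified_admitted + self.qualified_rejected)

    @property
    def unqualified_reject_rate(self) -> float:
        """True negative rate, that is, 1 - false positive rate."""
        return self.unqualified_rejected / (
            self.unqualified_admitted + self.unqualified_rejected)

    @property
    def overall_admit_rate(self) -> float:
        total = (self.qualified_admitted + self.qualified_rejected
                 + self.unqualified_admitted + self.unqualified_rejected)
        return (self.qualified_admitted + self.unqualified_admitted) / total


def equality_of_opportunity(a: GroupOutcomes, b: GroupOutcomes) -> bool:
    return math.isclose(a.qualified_admit_rate, b.qualified_admit_rate)


def equalized_odds(a: GroupOutcomes, b: GroupOutcomes) -> bool:
    return (equality_of_opportunity(a, b)
            and math.isclose(a.unqualified_reject_rate,
                             b.unqualified_reject_rate))


def demographic_parity(a: GroupOutcomes, b: GroupOutcomes) -> bool:
    return math.isclose(a.overall_admit_rate, b.overall_admit_rate)


# Counts from Table 1 and Table 2.
lilliputians = GroupOutcomes(45, 45, 3, 7)
brobdingnagians = GroupOutcomes(5, 5, 9, 81)

print(equality_of_opportunity(lilliputians, brobdingnagians))  # True
print(demographic_parity(lilliputians, brobdingnagians))       # False
print(equalized_odds(lilliputians, brobdingnagians))           # False
```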

See Fairness: Equality of opportunity in Machine Learning Crash Course for more information.

equalized odds

#responsible

#Metric

A fairness metric to assess whether a model is predicting outcomes equally well for all values of a sensitive attribute with respect to both the positive class and negative class, not just one class or the other exclusively. In other words, both the true positive rate and false positive rate should be the same for all groups.

Equalized odds is related to equality of opportunity, which only focuses on error rates for a single class (positive or negative).

For example, suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians' secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians' secondary schools don't offer math classes at all, and as a result, far fewer of their students are qualified. Equalized odds is satisfied provided that, no matter whether an applicant is a Lilliputian or a Brobdingnagian, if they are qualified, they are equally likely to get admitted to the program, and if they are not qualified, they are equally likely to get rejected.

Suppose 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:

Table 3. Lilliputian applicants (90% are qualified)

	Qualified	Unqualified
Admitted	45	2
Rejected	45	8
Total	90	10

Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 8/10 = 80%
Total percentage of Lilliputian students admitted: (45+2)/100 = 47%

Table 4. Brobdingnagian applicants (10% are qualified)

	Qualified	Unqualified
Admitted	5	18
Rejected	5	72
Total	10	90

Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 72/90 = 80%
Total percentage of Brobdingnagian students admitted: (5+18)/100 = 23%

Equalized odds is satisfied because qualified Lilliputian and Brobdingnagian students both have a 50% chance of being admitted, and unqualified Lilliputian and Brobdingnagian students both have an 80% chance of being rejected.

Equalized odds is formally defined in "Equality of Opportunity in Supervised Learning" as follows: "predictor Ŷ satisfies equalized odds with respect to protected attribute A and outcome Y if Ŷ and A are independent, conditional on Y."

evals

#generativeAI

#Metric

Primarily used as an abbreviation for LLM evaluations. More broadly, evals is an abbreviation for any form of evaluation.

evaluation

#generativeAI

#Metric

The process of measuring a model's quality or comparing different models against each other.

To evaluate a supervised machine learning model, you typically judge it against a validation set and a test set. Evaluating an LLM typically involves broader quality and safety assessments.

exact match

#Metric

An all-or-nothing metric in which the model's output either matches the ground truth or reference text exactly or doesn't. For example, if the ground truth is orange, the only model output that satisfies exact match is orange.

Exact match can also evaluate models whose output is a sequence (a ranked list of items). In general, exact match requires that the generated ranked list exactly matches the ground truth; that is, each item in both lists must be in the same order. That said, if the ground truth consists of multiple correct sequences, then exact match only requires the model's output to match one of the correct sequences.

Extreme Summarization (xsum)

#Metric

A dataset for evaluating an LLM's ability to summarize a single document. Each entry in the dataset consists of:

A document authored by the British Broadcasting Corporation (BBC).
A one-sentence summary of that document.

For details, see Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization.

F

F₁

#Metric

A "roll-up" binary classification metric that relies on both precision and recall. Here is the formula:

$$F{_1} = \frac{\text{2 * precision * recall}} {\text{precision + recall}}$$

Examples:

Suppose precision and recall have the following values:

precision = 0.6
recall = 0.4

You would calculate F₁ as follows:

$$F{_1} = \frac{\text{2 * 0.6 * 0.4}} {\text{0.6 + 0.4}} = 0.48$$

When precision and recall are fairly similar (as in the preceding example), F₁ is close to their mean. When precision and recall differ significantly, F₁ is closer to the lower value. For example:

precision = 0.9
recall = 0.1

$$F{_1} = \frac{\text{2 * 0.9 * 0.1}} {\text{0.9 + 0.1}} = 0.18$$
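
Both worked examples can be reproduced with a short function. This is a minimal sketch; the name `f1_score` is chosen here for readability, and returning 0.0 when precision and recall are both zero is an assumption made to avoid division by zero.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # Assumed convention for the undefined 0/0 case.
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.6, 0.4), 2))  # 0.48, close to the mean of 0.5
print(round(f1_score(0.9, 0.1), 2))  # 0.18, pulled toward the lower value
```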

fairness metric

#responsible

#Metric

A mathematical definition of "fairness" that is measurable. Some commonly used fairness metrics include:

equalized odds
predictive parity
counterfactual fairness
demographic parity

Many fairness metrics are mutually exclusive; see incompatibility of fairness metrics.

false negative (FN)

#fundamentals

#Metric

An example in which the model mistakenly predicts the negative class. For example, the model predicts that a particular email message is not spam (the negative class), but that email message actually is spam.

false negative rate

#Metric

The proportion of actual positive examples for which the model mistakenly predicted the negative class. The following formula calculates the false negative rate:

$$\text{false negative rate} = \frac{\text{false negatives}}{\text{false negatives} + \text{true positives}}$$

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive (FP)

#fundamentals

#Metric

An example in which the model mistakenly predicts the positive class. For example, the model predicts that a particular email message is spam (the positive class), but that email message is actually not spam.

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive rate (FPR)

#fundamentals

#Metric

The proportion of actual negative examples for which the model mistakenly predicted the positive class. The following formula calculates the false positive rate:

$$\text{false positive rate} = \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}$$

The false positive rate is the x-axis in an ROC curve.
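
The false positive rate and the false negative rate follow the same pattern, each dividing one kind of error by the size of the actual class it comes from. The sketch below uses hypothetical confusion-matrix counts and illustrative function names.

```python
def false_positive_rate(fp: int, tn: int) -> float:
    """Share of actual negatives that were predicted positive."""
    return fp / (fp + tn)


def false_negative_rate(fn: int, tp: int) -> float:
    """Share of actual positives that were predicted negative."""
    return fn / (fn + tp)


# Hypothetical counts for a spam classifier.
print(false_positive_rate(fp=10, tn=90))  # 0.1
print(false_negative_rate(fn=5, tp=15))   # 0.25
```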

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

feature importances

#df

#Metric

Synonym for variable importances.

foundation model

#generativeAI

#Metric

A very large pre-trained model trained on an enormous and diverse training set. A foundation model can do both of the following:

Respond well to a wide range of requests.
Serve as a base model for additional fine-tuning or other customization.

In other words, a foundation model is already very capable in a general sense but can be further customized to become even more useful for a specific task.

fraction of successes

#generativeAI

#Metric

A metric for evaluating an ML model's generated text. The fraction of successes is the number of "successful" generated text outputs divided by the total number of generated text outputs. For example, if a large language model generated 10 blocks of code, five of which were successful, then the fraction of successes would be 50%.

Although fraction of successes is broadly useful throughout statistics, within ML, this metric is primarily useful for measuring verifiable tasks like code generation or math problems.

G

gini impurity

#df

#Metric

A metric similar to entropy. Splitters use values derived from either gini impurity or entropy to compose conditions for classification decision trees. Information gain is derived from entropy. There is no universally accepted equivalent term for the metric derived from gini impurity; however, this unnamed metric is just as important as information gain.

Gini impurity is also called gini index, or simply gini.

Mathematical details about gini impurity:

Gini impurity is the probability of misclassifying a new piece of data taken from the same distribution. The gini impurity of a set with two possible values "0" and "1" (for example, the labels in a binary classification problem) is calculated from the following formula:

I = 1 - (p² + q²) = 1 - (p² + (1-p)²)

where:

I is the gini impurity.
p is the fraction of "1" examples.
q is the fraction of "0" examples. Note that q = 1-p

For example, consider the following dataset:

100 labels (0.25 of the dataset) contain the value "1"
300 labels (0.75 of the dataset) contain the value "0"

Therefore, the gini impurity is:

p = 0.25
q = 0.75
I = 1 - (0.25² + 0.75²) = 0.375

Consequently, a random label from the same dataset would have a 37.5% chance of being misclassified, and a 62.5% chance of being properly classified.

A perfectly balanced label (for example, 200 "0"s and 200 "1"s) would have a gini impurity of 0.5. A highly imbalanced label would have a gini impurity close to 0.0.
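
A brief sketch of the two-class formula follows; the function name `gini_impurity` is an illustrative choice.

```python
def gini_impurity(p: float) -> float:
    """Gini impurity of a two-valued set whose fraction of "1"s is p."""
    q = 1.0 - p
    return 1.0 - (p ** 2 + q ** 2)


print(gini_impurity(100 / 400))  # 0.375 (100 "1"s, 300 "0"s)
print(gini_impurity(200 / 400))  # 0.5   (perfectly balanced)
```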

H

hinge loss

#Metric

A family of loss functions for classification designed to find the decision boundary as distant as possible from each training example, thus maximizing the margin between examples and the boundary. KSVMs use hinge loss (or a related function, such as squared hinge loss). For binary classification, the hinge loss function is defined as follows:

$$\text{loss} = \text{max}(0, 1 - (y * y'))$$

where y is the true label, either -1 or +1, and y' is the raw output of the classifier model:

$$y' = b + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$

Consequently, a plot of hinge loss versus (y * y') looks as follows:

A Cartesian plot consisting of two joined line segments. The first line segment starts at (-3, 4) and ends at (1, 0). The second line segment begins at (1, 0) and continues indefinitely with a slope of 0.
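
The two segments of that plot fall out of the formula directly, as the following sketch shows. The function name `hinge_loss` and the sample raw outputs are assumptions for illustration.

```python
def hinge_loss(y: int, raw_output: float) -> float:
    """max(0, 1 - y * y'), where y is -1 or +1 and y' is the raw model output."""
    return max(0.0, 1.0 - y * raw_output)


print(hinge_loss(+1, -3.0))  # 4.0: the point (-3, 4) on the plot
print(hinge_loss(+1, 1.0))   # 0.0: the corner at (1, 0)
print(hinge_loss(+1, 2.5))   # 0.0: flat segment beyond the margin
print(hinge_loss(-1, 0.5))   # 1.5: wrong side of the boundary
```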

I

incompatibility of fairness metrics

#responsible

#Metric

The idea that some notions of fairness are mutually incompatible and cannot be satisfied simultaneously. As a result, there is no single universal metric for quantifying fairness that can be applied to all ML problems.

While this may seem discouraging, incompatibility of fairness metrics doesn't imply that fairness efforts are fruitless. Instead, it suggests that fairness must be defined contextually for a given ML problem, with the goal of preventing harms specific to its use cases.

See "On the (im)possibility of fairness" for a more detailed discussion of the incompatibility of fairness metrics.

individual fairness

#responsible

#Metric

A fairness metric that checks whether similar individuals are classified similarly. For example, Brobdingnagian Academy might want to satisfy individual fairness by ensuring that two students with identical grades and standardized test scores are equally likely to gain admission.

Note that individual fairness relies entirely on how you define "similarity" (in this case, grades and test scores), and you can run the risk of introducing new fairness problems if your similarity metric misses important information (such as the rigor of a student's curriculum).

See "Fairness Through Awareness" for a more detailed discussion of individual fairness.

information gain

#df

#Metric

In decision forests, the difference between a node's entropy and the weighted (by number of examples) sum of the entropy of its children nodes. A node's entropy is the entropy of the examples in that node.

For example, consider the following entropy values:

entropy of parent node = 0.6
entropy of one child node with 16 relevant examples = 0.2
entropy of another child node with 24 relevant examples = 0.1

So 40% of the examples are in one child node and 60% are in the other child node. Therefore:

weighted entropy sum of child nodes = (0.4 * 0.2) + (0.6 * 0.1) = 0.14

So, the information gain is:

information gain = entropy of parent node - weighted entropy sum of child nodes
information gain = 0.6 - 0.14 = 0.46

Most splitters seek to create conditions that maximize information gain.
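
The arithmetic above generalizes to any number of child nodes. In this sketch, the function name `information_gain` and the `(example_count, entropy)` tuple format for children are assumptions chosen for illustration.

```python
def information_gain(parent_entropy, children):
    """children is a list of (example_count, entropy) pairs, one per child node."""
    total = sum(count for count, _ in children)
    weighted_child_entropy = sum(
        (count / total) * entropy for count, entropy in children)
    return parent_entropy - weighted_child_entropy


# Parent entropy 0.6; children with 16 and 24 examples.
gain = information_gain(0.6, [(16, 0.2), (24, 0.1)])
print(round(gain, 2))  # 0.46
```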

inter-rater agreement

#Metric

A measurement of how often human raters agree when doing a task. If raters disagree, the task instructions may need to be improved. Also sometimes called inter-annotator agreement or inter-rater reliability. See also Cohen's kappa, which is one of the most popular inter-rater agreement measurements.

See Categorical data: Common issues in Machine Learning Crash Course for more information.

L

L₁ loss

#fundamentals

#Metric

A loss function that calculates the absolute value of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L₁ loss for a batch of five examples:

Actual value of example	Model's predicted value	Absolute value of delta
7	6	1
5	4	1
8	11	3
4	6	2
9	8	1
		8 = L₁ loss

L₁ loss is less sensitive to outliers than L₂ loss.

The Mean Absolute Error is the average L₁ loss per example.

Formal math:

$$ L_1 \text{ loss} = \sum_{i=1}^n | y_i - \hat{y}_i |$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.

See Linear regression: Loss in Machine Learning Crash Course for more information.

L₂ loss

#fundamentals

#Metric

A loss function that calculates the square of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L₂ loss for a batch of five examples:

Actual value of example	Model's predicted value	Square of delta
7	6	1
5	4	1
8	11	9
4	6	4
9	8	1
		16 = L₂ loss

Due to squaring, L₂ loss amplifies the influence of outliers. That is, L₂ loss reacts more strongly to bad predictions than L₁ loss. For example, the L₁ loss for the preceding batch would be 8 rather than 16. Notice that a single outlier accounts for 9 of the 16.

Regression models typically use L₂ loss as the loss function.

The Mean Squared Error is the average L₂ loss per example. Squared loss is another name for L₂ loss.

Formal math:

$$ L_2 \text{ loss} = \sum_{i=1}^n {(y_i - \hat{y}_i)}^2$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.
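
To make the contrast between the two losses concrete, the sketch below computes both on the batch of five examples from the tables. The function names `l1_loss` and `l2_loss` are illustrative and not taken from a specific library.

```python
def l1_loss(actual, predicted):
    """Sum of absolute differences between labels and predictions."""
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted))


def l2_loss(actual, predicted):
    """Sum of squared differences between labels and predictions."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))


actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]

print(l1_loss(actual, predicted))  # 8
print(l2_loss(actual, predicted))  # 16, of which the outlier (8 vs. 11) contributes 9
```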

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

LLM evaluations (evals)

#generativeAI

#Metric

A set of metrics and benchmarks for assessing the performance of large language models (LLMs). At a high level, LLM evaluations:

Help researchers identify areas where LLMs need improvement.
Are useful in comparing different LLMs and identifying the best LLM for a particular task.
Help ensure that LLMs are safe and ethical to use.

See Large language models (LLMs) in Machine Learning Crash Course for more information.

loss

#fundamentals

#Metric

During the training of a supervised model, a measure of how far a model's prediction is from its label.

A loss function calculates the loss.

See Linear regression: Loss in Machine Learning Crash Course for more information.

loss function

#fundamentals

#Metric

During training or testing, a mathematical function that calculates the loss on a batch of examples. A loss function returns a lower loss for models that make good predictions than for models that make bad predictions.

The goal of training is typically to minimize the loss that a loss function returns.

Many different kinds of loss functions exist. Pick the appropriate loss function for the kind of model you are building. For example:

L₂ loss (or Mean Squared Error) is the loss function for linear regression.
Log Loss is the loss function for logistic regression.

M

matrix factorization

In math, a mechanism for finding the matrixes whose dot product approximates a target matrix.

In recommendation systems, the target matrix often holds users' ratings on items. For example, the target matrix for a movie recommendation system might look something like the following, where the positive integers are user ratings and 0 means that the user didn't rate the movie:

	Casablanca	The Philadelphia Story	Black Panther	Wonder Woman	Pulp Fiction
User 1	5.0	3.0	0.0	2.0	0.0
User 2	4.0	0.0	0.0	1.0	5.0
User 3	3.0	1.0	4.0	5.0	0.0

The movie recommendation system aims to predict user ratings for unrated movies. For example, will User 1 like Black Panther?

One approach for recommendation systems is to use matrix factorization to generate the following two matrixes:

A user matrix, shaped as the number of users X the number of embedding dimensions.
An item matrix, shaped as the number of embedding dimensions X the number of items.

For example, using matrix factorization on our three users and five items could yield the following user matrix and item matrix:

User Matrix                 Item Matrix

1.1   2.3           0.9   0.2   1.4    2.0   1.2
0.6   2.0           1.7   1.2   1.2   -0.1   2.1
2.5   0.5

The dot product of the user matrix and item matrix yields a recommendation matrix that contains not only the original user ratings but also predictions for the movies that each user hasn't seen. For example, consider User 1's rating of Casablanca, which was 5.0. The dot product corresponding to that cell in the recommendation matrix should hopefully be around 5.0, and it is:

(1.1 * 0.9) + (2.3 * 1.7) = 4.9

More importantly, will User 1 like Black Panther? Taking the dot product corresponding to the first row and the third column yields a predicted rating of 4.3:

(1.1 * 1.4) + (2.3 * 1.2) = 4.3

Matrix factorization typically yields a user matrix and item matrix that, together, are significantly more compact than the target matrix.
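
The two dot products worked out above can be reproduced with a short sketch. It only multiplies the example user matrix and item matrix; it does not show how the factorization itself is learned. The function names are illustrative assumptions, not part of any library.

```python
# User matrix (3 users x 2 embedding dimensions) from the example above.
user_matrix = [
    [1.1, 2.3],
    [0.6, 2.0],
    [2.5, 0.5],
]
# Item matrix (2 embedding dimensions x 5 movies) from the example above.
item_matrix = [
    [0.9, 0.2, 1.4, 2.0, 1.2],
    [1.7, 1.2, 1.2, -0.1, 2.1],
]


def predicted_rating(user_row, item_col):
    """Dot product of one user's embedding with one item's embedding."""
    return sum(
        user_matrix[user_row][d] * item_matrix[d][item_col]
        for d in range(len(item_matrix))
    )


def recommendation_matrix():
    """Full users x items matrix of predicted ratings."""
    return [
        [predicted_rating(u, i) for i in range(len(item_matrix[0]))]
        for u in range(len(user_matrix))
    ]


print(round(predicted_rating(0, 0), 1))  # User 1, Casablanca: 4.9
print(round(predicted_rating(0, 2), 1))  # User 1, Black Panther: 4.3
```

Note the storage difference in this tiny case: the two factor matrices hold 6 + 10 = 16 numbers, while the target matrix holds 15, so the compactness benefit only appears at realistic numbers of users and items.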

MBPP

#Metric

Abbreviation for Mostly Basic Python Problems .

Mean Absolute Error (MAE)

#Metric

The average loss per example when L ₁ loss is used. Calculate Mean Absolute Error as follows:

Calculate the L ₁ loss for a batch.
Divide the L ₁ loss by the number of examples in the batch.

$$\text{Mean Absolute Error} = \frac{1}{n}\sum_{i=1}^n | y_i - \hat{y}_i |$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.

For example, consider the calculation of L ₁ loss on the following batch of five examples:

Actual value of example	Model's predicted value	Loss (difference between actual and predicted)
7	6	1
5	4	1
8	11	3
4	6	2
9	8	1
		8 = L₁ loss

So, L ₁ loss is 8 and the number of examples is 5. Therefore, the Mean Absolute Error is:

Mean Absolute Error = L₁ loss / Number of Examples
Mean Absolute Error = 8/5 = 1.6
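
As a minimal sketch, the same calculation can be written in a few lines of Python. The function names are illustrative assumptions; the data is the batch of five examples from the table above.

```python
actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]


def l1_loss(actual, predicted):
    """Sum of absolute differences between labels and predictions."""
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted))


def mean_absolute_error(actual, predicted):
    """L1 loss divided by the number of examples in the batch."""
    return l1_loss(actual, predicted) / len(actual)


print(l1_loss(actual, predicted))              # 8
print(mean_absolute_error(actual, predicted))  # 1.6
```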

Contrast Mean Absolute Error with Mean Squared Error and Root Mean Squared Error .

mean average precision at k (mAP@k)

#generativeAI

#Metric

The statistical mean of all average precision at k scores across a validation dataset. One use of mean average precision at k is to judge the quality of recommendations generated by a recommendation system .

Although the phrase "mean average" sounds redundant, the name of the metric is appropriate. After all, this metric finds the mean of multiple average precision at k values.

Suppose you build a recommendation system that generates a personalized list of recommended novels for each user. Based on feedback from selected users, you calculate the following five average precision at k scores (one score per user):

0.73
0.77
0.67
0.82
0.76

The mean Average Precision at K is therefore:

$$\text{mean } = \frac{\text{0.73 + 0.77 + 0.67 + 0.82 + 0.76}} {\text{5}} = \text{0.75}$$

Mean Squared Error (MSE)

#Metric

The average loss per example when L ₂ loss is used. Calculate Mean Squared Error as follows:

Calculate the L ₂ loss for a batch.
Divide the L ₂ loss by the number of examples in the batch.

$$\text{Mean Squared Error} = \frac{1}{n}\sum_{i=1}^n {(y_i - \hat{y}_i)}^2$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the model's prediction for $y$.

For example, consider the loss on the following batch of five examples:

Actual value	Model's prediction	Loss	Squared loss
7	6	1	1
5	4	1	1
8	11	3	9
4	6	2	4
9	8	1	1
			16 = L₂ loss

Therefore, the Mean Squared Error is:

Mean Squared Error = L₂ loss / Number of Examples
Mean Squared Error = 16/5 = 3.2
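
The calculation above can be sketched in Python as follows. The function names are illustrative assumptions; the data is the batch of five examples from the table.

```python
actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]


def l2_loss(actual, predicted):
    """Sum of squared differences between labels and predictions."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))


def mean_squared_error(actual, predicted):
    """L2 loss divided by the number of examples in the batch."""
    return l2_loss(actual, predicted) / len(actual)


print(l2_loss(actual, predicted))             # 16
print(mean_squared_error(actual, predicted))  # 3.2
```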

Mean Squared Error is a popular loss function for training, particularly for linear regression.

Contrast Mean Squared Error with Mean Absolute Error and Root Mean Squared Error .

TensorFlow Playground uses Mean Squared Error to calculate loss values.

Outliers strongly influence Mean Squared Error. For example, a loss of 1 is a squared loss of 1, but a loss of 3 is a squared loss of 9. In the preceding table, the example with a loss of 3 accounts for ~56% of the Mean Squared Error, while each of the examples with a loss of 1 accounts for only 6% of the Mean Squared Error.

Outliers don't influence Mean Absolute Error as strongly as Mean Squared Error. For example, a loss of 3 accounts for only ~38% of the Mean Absolute Error.

Clipping is one way to prevent extreme outliers from damaging your model's predictive ability.

metric

#TensorFlow

#Metric

A statistic that you care about.

An objective is a metric that a machine learning system tries to optimize.

Metrics API (tf.metrics)

#Metric

A TensorFlow API for evaluating models. For example, tf.metrics.accuracy determines how often a model's predictions match labels.

minimax loss

#Metric

A loss function for generative adversarial networks , based on the cross-entropy between the distribution of generated data and real data.

Minimax loss is used in the first paper to describe generative adversarial networks.

See Loss Functions in the Generative Adversarial Networks course for more information.

model capacity

#Metric

The complexity of problems that a model can learn. The more complex the problems that a model can learn, the higher the model's capacity. A model's capacity typically increases with the number of model parameters. For a formal definition of classification model capacity, see VC dimension .

momentum

A sophisticated gradient descent algorithm in which a learning step depends not only on the derivative in the current step, but also on the derivatives of the step(s) that immediately preceded it. Momentum involves computing an exponentially weighted moving average of the gradients over time, analogous to momentum in physics. Momentum sometimes prevents learning from getting stuck in local minima.
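
The following is a minimal sketch of one common form of momentum, in which the update follows an exponentially weighted moving average of past gradients. The function name, the hyperparameter values, and the toy loss (w - 3)² are assumptions chosen for illustration; real optimizers differ in details.

```python
def minimize_with_momentum(grad_fn, w, learning_rate=0.1, beta=0.9, steps=500):
    """Gradient descent where each step follows a moving average of gradients."""
    velocity = 0.0  # Exponentially weighted moving average of past gradients.
    for _ in range(steps):
        grad = grad_fn(w)
        # Blend the previous average with the current gradient.
        velocity = beta * velocity + (1 - beta) * grad
        # Step along the averaged direction rather than the raw gradient.
        w = w - learning_rate * velocity
    return w


# Toy loss (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_final = minimize_with_momentum(lambda w: 2 * (w - 3), w=0.0)
print(round(w_final, 3))
```

Because each step depends on earlier gradients, the update keeps moving for a while even where the current gradient is small, which is the property that can help training move past shallow local minima.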

Mostly Basic Python Problems (MBPP)

#Metric

A dataset for evaluating an LLM's proficiency in generating Python code. Mostly Basic Python Problems provides about 1,000 crowd-sourced programming problems. Each problem in the dataset contains:

A task description
Solution code
Three automated test cases

N

negative class

#fundamentals

#Metric

In binary classification , one class is termed positive and the other is termed negative . The positive class is the thing or event that the model is testing for and the negative class is the other possibility. For example:

The negative class in a medical test might be "not tumor."
The negative class in an email classification model might be "not spam."

Contrast with positive class .

O

objective

#Metric

A metric that your algorithm is trying to optimize.

objective function

#Metric

The mathematical formula or metric that a model aims to optimize. For example, the objective function for linear regression is usually Mean Squared Loss . Therefore, when training a linear regression model, training aims to minimize Mean Squared Loss.

In some cases, the goal is to maximize the objective function. For example, if the objective function is accuracy, the goal is to maximize accuracy.

P

pass at k (pass@k)

#Metric

A metric to determine the quality of code (for example, Python) that a large language model generates. More specifically, pass at k tells you the likelihood that at least one generated block of code out of k generated blocks of code will pass all of its unit tests.

Large language models often struggle to generate good code for complex programming problems. Software engineers adapt to this problem by prompting the large language model to generate multiple ( k ) solutions for the same problem. Then, software engineers test each of the solutions against unit tests. The calculation of pass at k depends on the outcome of the unit tests:

If one or more of those solutions pass the unit test, then the LLM Passes that code generation challenge.
If none of the solutions pass the unit test, then the LLM Fails that code generation challenge.

The formula for pass at k is as follows:

\[\text{pass at k} = \frac{\text{total number of passes}} {\text{total number of challenges}}\]

In general, higher values of k produce higher pass at k scores; however, higher values of k require more large language model and unit testing resources.

Suppose a software engineer asks a large language model to generate k =10 solutions for n =50 challenging coding problems. Here are the results:

30 Passes
20 Fails

The pass at 10 score is therefore:

$$\text{pass at 10} = \frac{\text{30}} {\text{50}} = 0.6$$
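
A minimal sketch of the pass at k calculation as defined above, assuming the unit-test outcomes are already available as one list of booleans per challenge. The data layout and function name are illustrative assumptions, and the outcomes are synthetic values built to match the example.

```python
def pass_at_k(results):
    """Fraction of challenges in which at least one of the k solutions passed.

    results: one list per challenge, holding one boolean per generated solution
    (True means that solution passed all of its unit tests).
    """
    passes = sum(1 for solutions in results if any(solutions))
    return passes / len(results)


# Synthetic outcomes: n = 50 challenges with k = 10 solutions each.
# In 30 challenges one solution passes; in 20 challenges none do.
results = [[False] * 9 + [True]] * 30 + [[False] * 10] * 20
print(pass_at_k(results))  # 0.6
```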

performance

#Metric

Overloaded term with the following meanings:

The standard meaning within software engineering. Namely: How fast (or efficiently) does this piece of software run?
The meaning within machine learning. Here, performance answers the following question: How correct is this model ? That is, how good are the model's predictions?

permutation variable importances

#df

#Metric

A type of variable importance that evaluates the increase in the prediction error of a model after permuting the feature's values. Permutation variable importance is a model-independent metric.

perplexity

#Metric

One measure of how well a model is accomplishing its task. For example, suppose your task is to read the first few letters of a word a user is typing on a phone keyboard, and to offer a list of possible completion words. Perplexity, P, for this task is approximately the number of guesses you need to offer in order for your list to contain the actual word the user is trying to type.

Perplexity is related to cross-entropy as follows:

$$P= 2^{\text{cross entropy}}$$
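
As a hedged illustration of the "number of guesses" reading, the sketch below assumes cross-entropy is measured in bits (log base 2) and averaged per word, so that perplexity is 2 raised to the cross-entropy. The function names and probabilities are illustrative assumptions.

```python
import math


def cross_entropy_bits(probs_of_actual_words):
    """Average negative log2 probability the model gave to each actual word."""
    return -sum(math.log2(p) for p in probs_of_actual_words) / len(
        probs_of_actual_words
    )


def perplexity(probs_of_actual_words):
    """Perplexity as 2 raised to the cross-entropy (in bits)."""
    return 2 ** cross_entropy_bits(probs_of_actual_words)


# A model that always spreads its belief evenly over 8 candidate words
# assigns probability 1/8 to the actual word every time.
print(perplexity([0.125] * 4))  # 8.0
```

Under these assumptions, a perplexity of 8 matches the intuition of needing about 8 guesses to include the actual word.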

positive class

#fundamentals

#Metric

The class you are testing for.

For example, the positive class in a cancer model might be "tumor." The positive class in an email classification model might be "spam."

Contrast with negative class .

The term positive class can be confusing because the "positive" outcome of many tests is often an undesirable result. For example, the positive class in many medical tests corresponds to tumors or diseases. In general, you want a doctor to tell you, "Congratulations! Your test results were negative." Regardless, the positive class is the event that the test is seeking to find.

Admittedly, you're simultaneously testing for both the positive and negative classes.

PR AUC (area under the PR curve)

#Metric

Area under the interpolated precision-recall curve , obtained by plotting (recall, precision) points for different values of the classification threshold .

precision

#fundamentals

#Metric

A metric for classification models that answers the following question:

When the model predicted the positive class , what percentage of the predictions were correct?

Here is the formula:

$$\text{Precision} = \frac{\text{true positives}} {\text{true positives} + \text{false positives}}$$

where:

true positive means the model correctly predicted the positive class.
false positive means the model mistakenly predicted the positive class.

For example, suppose a model made 200 positive predictions. Of these 200 positive predictions:

150 were true positives.
50 were false positives.

In this case:

$$\text{Precision} = \frac{\text{150}} {\text{150} + \text{50}} = 0.75$$
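
A minimal sketch of the same calculation from raw labels and predictions, assuming 1 encodes the positive class and 0 the negative class. The function name and encoding are illustrative assumptions, and the lists are synthetic data built to match the example.

```python
def precision(labels, predictions, positive=1):
    """Share of positive predictions that were actually positive."""
    pairs = list(zip(labels, predictions))
    true_positives = sum(1 for y, p in pairs if p == positive and y == positive)
    false_positives = sum(1 for y, p in pairs if p == positive and y != positive)
    # Undefined (division by zero) if the model made no positive predictions.
    return true_positives / (true_positives + false_positives)


# 200 positive predictions: 150 are correct and 50 are mistaken.
labels = [1] * 150 + [0] * 50
predictions = [1] * 200
print(precision(labels, predictions))  # 0.75
```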

Contrast with accuracy and recall .

See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.

precision at k (precision@k)

#Metric

A metric for evaluating a ranked (ordered) list of items. Precision at k identifies the fraction of the first k items in that list that are "relevant." That is:

\[\text{precision at k} = \frac{\text{relevant items in first k items of the list}} {\text{k}}\]

The value of k must be less than or equal to the length of the returned list. Note that the length of the returned list is not part of the calculation.

Relevance is often subjective; even expert human evaluators often disagree on which items are relevant.

Contrast with:

average precision at k
mean average precision at k

Suppose a large language model is given the following query:

List the 6 funniest movies of all time in order.

And the large language model returns the list shown in the first two columns of the following table:

Position	Movie	Relevant?
1	The General	Yes
2	Mean Girls	Yes
3	Platoon	No
4	Bridesmaids	Yes
5	Citizen Kane	No
6	This Is Spinal Tap	Yes

Two of the first three movies are relevant, so precision at 3 is:

$$\text{precision at 3} = \frac{\text{2}} {\text{3}} = 0.67$$

Three of the first five movies are relevant, so precision at 5 is:

$$\text{precision at 5} = \frac{\text{3}} {\text{5}} = 0.6$$
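
A minimal sketch of precision at k, assuming relevance judgments are already available as a list of booleans in ranked order. The function name is an illustrative assumption; the judgments are those of the six-movie example.

```python
def precision_at_k(relevance, k):
    """Fraction of the first k ranked items that are relevant."""
    if not 1 <= k <= len(relevance):
        raise ValueError("k must be between 1 and the length of the list")
    return sum(relevance[:k]) / k


# Relevance of the six returned movies, in ranked order.
relevance = [True, True, False, True, False, True]
print(round(precision_at_k(relevance, 3), 2))  # 0.67
print(precision_at_k(relevance, 5))            # 0.6
```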

precision-recall curve

#Metric

A curve of precision versus recall at different classification thresholds .

prediction bias

#Metric

A value indicating how far apart the average of predictions is from the average of labels in the dataset.

Not to be confused with the bias term in machine learning models or with bias in ethics and fairness .

predictive parity

#responsible

#Metric

A fairness metric that checks whether, for a given classification model , the precision rates are equivalent for subgroups under consideration.

For example, a model that predicts college acceptance would satisfy predictive parity for nationality if its precision rate is the same for Lilliputians and Brobdingnagians.

Predictive parity is sometimes also called predictive rate parity.

See "Fairness Definitions Explained" (section 3.2.1) for a more detailed discussion of predictive parity.

predictive rate parity

#responsible

#Metric

Another name for predictive parity .

probability density function

#Metric

A function that identifies the frequency of data samples having exactly a particular value. When a dataset's values are continuous floating-point numbers, exact matches rarely occur. However, integrating a probability density function from value x to value y yields the expected frequency of data samples between x and y .

For example, consider a normal distribution having a mean of 200 and a standard deviation of 30. To determine the expected frequency of data samples falling within the range 211.4 to 218.7, you can integrate the probability density function for a normal distribution from 211.4 to 218.7.
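
For a normal distribution, that integral has a closed form in terms of the error function, so no numerical integration is needed. The sketch below uses `math.erf` from the Python standard library; the helper names are illustrative assumptions.

```python
import math


def normal_cdf(x, mean, std_dev):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std_dev * math.sqrt(2.0))))


def expected_fraction_between(low, high, mean, std_dev):
    """Integral of the normal probability density function from low to high."""
    return normal_cdf(high, mean, std_dev) - normal_cdf(low, mean, std_dev)


fraction = expected_fraction_between(211.4, 218.7, mean=200, std_dev=30)
print(f"{fraction:.1%} of samples are expected between 211.4 and 218.7")
```

For these values the result is roughly 8 to 9 percent of the samples.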

R

Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD)

#Metric

A dataset to evaluate an LLM's ability to perform commonsense reasoning. Each example in the dataset contains three components:

A paragraph or two from a news article
A query in which one of the entities explicitly or implicitly identified in the passage is masked .
The answer (the name of the entity that belongs in the mask)

See ReCoRD for an extensive list of examples.

ReCoRD is a component of the SuperGLUE ensemble.

RealToxicityPrompts

#Metric

A dataset that contains a set of sentence beginnings that might contain toxic content. Use this dataset to evaluate an LLM's ability to generate non-toxic text to complete the sentence. Typically, you use the Perspective API to determine how well the LLM performed at this task.

See RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models for details.

recall

#fundamentals

#Metric

A metric for classification models that answers the following question:

When ground truth was the positive class , what percentage of predictions did the model correctly identify as the positive class?

Here is the formula:

\[\text{Recall} = \frac{\text{true positives}} {\text{true positives} + \text{false negatives}} \]

where:

true positive means the model correctly predicted the positive class.
false negative means that the model mistakenly predicted the negative class .

For instance, suppose your model made 200 predictions on examples for which ground truth was the positive class. Of these 200 predictions:

180 were true positives.
20 were false negatives.

In this case:

\[\text{Recall} = \frac{\text{180}} {\text{180} + \text{20}} = 0.9 \]

Recall is particularly useful for determining the predictive power of classification models in which the positive class is rare. For example, consider a class-imbalanced dataset in which the positive class for a certain disease occurs in only 10 patients out of a million. Suppose your model makes five million predictions that yield the following outcomes:

30 True Positives
20 False Negatives
4,999,000 True Negatives
950 False Positives

The recall of this model is therefore:

recall = TP / (TP + FN)
recall = 30 / (30 + 20) = 0.6 = 60%

By contrast, the accuracy of this model is:

accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (30 + 4,999,000) / (30 + 4,999,000 + 950 + 20) = 99.98%
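
The two calculations can be checked with a short sketch. The function names are illustrative assumptions; the counts come from the example above. The last lines add a hypothetical model that always predicts the negative class, to show how little accuracy reveals on such a dataset.

```python
def recall(tp, fn):
    """Share of actual positives that the model identified."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


tp, fn, tn, fp = 30, 20, 4_999_000, 950
print(f"recall   = {recall(tp, fn):.0%}")            # recall   = 60%
print(f"accuracy = {accuracy(tp, tn, fp, fn):.2%}")  # accuracy = 99.98%

# Hypothetical baseline that always predicts the negative class:
# all 50 positives become false negatives.
print(recall(0, 50))                    # 0.0
print(accuracy(0, 4_999_950, 0, 50))    # 0.99999
```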

That high value of accuracy looks impressive but is essentially meaningless. Recall is a much more useful metric for class-imbalanced datasets than accuracy.

See Classification: Accuracy, recall, precision and related metrics for more information.

recall at k (recall@k)

#Metric

A metric for evaluating systems that output a ranked (ordered) list of items. Recall at k identifies the fraction of relevant items in the first k items in that list out of the total number of relevant items returned.

\[\text{recall at k} = \frac{\text{relevant items in first k items of the list}} {\text{total number of relevant items in the list}}\]

Contrast with precision at k .

Suppose a large language model is given the following query:

List the 10 funniest movies of all time in order.

And the large language model returns the list shown in the first two columns:

Position	Movie	Relevant?
1	The General	Yes
2	Mean Girls	Yes
3	Platoon	No
4	Bridesmaids	Yes
5	This Is Spinal Tap	Yes
6	Airplane!	Yes
7	Groundhog Day	Yes
8	Monty Python and the Holy Grail	Yes
9	Oppenheimer	No
10	Clueless	Yes

Eight of the movies in the preceding list are very funny, so they are "relevant items in the list." Therefore, 8 will be the denominator in all the calculations of recall at k . What about the numerator? Well, 3 of the first 4 items are relevant, so recall at 4 is:

$$\text{recall at 4} = \frac{\text{3}} {\text{8}} = 0.375$$

7 of the first 8 movies are very funny, so recall at 8 is:

$$\text{recall at 8} = \frac{\text{7}} {\text{8}} = 0.875$$
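
A minimal sketch of recall at k, assuming relevance judgments are available as a list of booleans in ranked order. The function name is an illustrative assumption; the judgments are those of the ten-movie example.

```python
def recall_at_k(relevance, k):
    """Relevant items in the first k positions over all relevant items returned."""
    total_relevant = sum(relevance)
    return sum(relevance[:k]) / total_relevant


# Relevance of the ten returned movies, in ranked order.
relevance = [True, True, False, True, True, True, True, True, False, True]
print(recall_at_k(relevance, 4))  # 0.375
print(recall_at_k(relevance, 8))  # 0.875
```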

Recognizing Textual Entailment (RTE)

#Metric

A dataset for evaluating an LLM's ability to determine whether a hypothesis can be entailed (logically drawn) from a text passage. Each example in an RTE evaluation consists of three parts:

A passage, typically from news or Wikipedia articles
A hypothesis
The correct answer, which is either:
- True, meaning the hypothesis can be entailed from the passage
- False, meaning the hypothesis can't be entailed from the passage

For example:

Passage: The Euro is the currency of the European Union.
Hypothesis: France uses the Euro as currency.
Entailment: True, because France is part of the European Union.

RTE is a component of the SuperGLUE ensemble.

ReCoRD

#Metric

Abbreviation for Reading Comprehension with Commonsense Reasoning Dataset .

ROC (receiver operating characteristic) Curve

#fundamentals

#Metric

A graph of true positive rate versus false positive rate for different classification thresholds in binary classification.

The shape of an ROC curve suggests a binary classification model's ability to separate positive classes from negative classes. Suppose, for example, that a binary classification model perfectly separates all the negative classes from all the positive classes:

A number line with 8 positive examples on the right side and
7 negative examples on the left.

The ROC curve for the preceding model looks as follows:

An ROC curve. The x-axis is False Positive Rate and the y-axis
is True Positive Rate. The curve has an inverted L shape. The curve
starts at (0.0,0.0) and goes straight up to (0.0,1.0). Then the curve
goes from (0.0,1.0) to (1.0,1.0).

In contrast, the following illustration graphs the raw logistic regression values for a terrible model that can't separate negative classes from positive classes at all:

A number line with positive examples and negative classes
completely intermixed.

The ROC curve for this model looks as follows:

An ROC curve, which is actually a straight line from (0.0,0.0)
to (1.0,1.0).

Meanwhile, back in the real world, most binary classification models separate positive and negative classes to some degree, but usually not perfectly. So, a typical ROC curve falls somewhere between the two extremes:

An ROC curve. The x-axis is False Positive Rate and the y-axis
is True Positive Rate. The ROC curve approximates a shaky arc
traversing the compass points from West to North.

The point on an ROC curve closest to (0.0,1.0) theoretically identifies the ideal classification threshold. However, several other real-world issues influence the selection of the ideal classification threshold. For example, perhaps false negatives cause far more pain than false positives.

A numerical metric called AUC summarizes the ROC curve into a single floating-point value.
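
To make the construction of the curve concrete, the sketch below computes one (false positive rate, true positive rate) point per classification threshold. The function name, the score values, and the choice of thresholds (every distinct score) are illustrative assumptions; the data mimics the perfectly separated case of 7 negative and 8 positive examples described above.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # Threshold above every score: nothing predicted positive.
    for threshold in sorted(set(scores), reverse=True):
        predicted_positive = [score >= threshold for score in scores]
        tp = sum(1 for y, p in zip(labels, predicted_positive) if p and y == 1)
        fp = sum(1 for y, p in zip(labels, predicted_positive) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# 7 negative examples with the lowest scores, 8 positive examples with the highest.
labels = [0] * 7 + [1] * 8
scores = [i / 15 for i in range(15)]
perfect_curve = roc_points(labels, scores)
print(perfect_curve[0], perfect_curve[-1])  # (0.0, 0.0) (1.0, 1.0)
```

Every point on this curve has either a false positive rate of 0 or a true positive rate of 1, which is the inverted L shape of a perfect separator.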

Root Mean Squared Error (RMSE)

#fundamentals

#Metric

The square root of the Mean Squared Error .

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

#Metric

A family of metrics that evaluate automatic summarization and machine translation models. ROUGE metrics determine the degree to which a reference text overlaps an ML model's generated text . Each member of the ROUGE family measures overlap in a different way. Higher ROUGE scores indicate more similarity between the reference text and generated text than lower ROUGE scores.

Each ROUGE family member typically generates the following metrics:

Precision
Recall
F₁

For details and examples, see:

ROUGE-L
ROUGE-N
ROUGE-S

ROUGE-L

#Metric

A member of the ROUGE family focused on the length of the longest common subsequence in the reference text and generated text . The following formulas calculate recall and precision for ROUGE-L:

$$\text{ROUGE-L recall} = \frac{\text{longest common sequence}} {\text{number of words in the reference text} }$$

$$\text{ROUGE-L precision} = \frac{\text{longest common sequence}} {\text{number of words in the generated text} }$$

You can then use F ₁ to roll up ROUGE-L recall and ROUGE-L precision into a single metric:

$$\text{ROUGE-L F} {_1} = \frac{\text{2} * \text{ROUGE-L recall} * \text{ROUGE-L precision}} {\text{ROUGE-L recall} + \text{ROUGE-L precision} }$$

Consider the following reference text and generated text.

Category	Who produced?	Text
Reference text	Human translator	I want to understand a wide variety of things.
Generated text	ML model	I want to learn plenty of things.

Therefore:

The length of the longest common subsequence is 5 (I want to of things).
The number of words in the reference text is 9.
The number of words in the generated text is 7.

Consequently:

$$\text{ROUGE-L recall} = \frac{\text{5}} {\text{9} } = 0.56$$

$$\text{ROUGE-L precision} = \frac{\text{5}} {\text{7} } = 0.71$$

$$\text{ROUGE-L F} {_1} = \frac{\text{2} * \text{0.56} * \text{0.71}} {\text{0.56} + \text{0.71} } = 0.63$$
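
The sketch below reproduces this calculation with the classic dynamic-programming algorithm for the longest common subsequence. The tokenization (lowercasing, stripping periods and commas, splitting on whitespace) and the function names are simplifying assumptions; production ROUGE implementations tokenize more carefully.

```python
def tokenize(text):
    """Lowercase, drop periods and commas, and split on whitespace."""
    return text.lower().replace(".", "").replace(",", "").split()


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, token_a in enumerate(a, start=1):
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(reference, generated):
    """Return (recall, precision, F1) based on the longest common subsequence."""
    ref, gen = tokenize(reference), tokenize(generated)
    lcs = lcs_length(ref, gen)
    recall = lcs / len(ref)
    precision = lcs / len(gen)
    f1 = 2 * recall * precision / (recall + precision) if lcs else 0.0
    return recall, precision, f1


recall, precision, f1 = rouge_l(
    "I want to understand a wide variety of things.",
    "I want to learn plenty of things.",
)
print(f"recall={recall:.3f} precision={precision:.3f} f1={f1:.3f}")
```

Computed without rounding the intermediate values, F₁ is 0.625; the 0.63 above comes from using the rounded recall and precision.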

ROUGE-L ignores any newlines in the reference text and generated text, so the longest common subsequence could cross multiple sentences. When the reference text and generated text involve multiple sentences, a variation of ROUGE-L called ROUGE-Lsum is generally a better metric. ROUGE-Lsum determines the longest common subsequence for each sentence in a passage and then calculates the mean of those longest common subsequences.

Consider the following reference text and generated text.

Category	Who produced?	Text
Reference text	Human translator	The surface of Mars is dry. Nearly all the water is deep underground.
Generated text	ML model	Mars has a dry surface. However, the vast majority of water is underground.

Therefore:

	First sentence	Second sentence
Longest common subsequence	2 (Mars dry)	4 (the water is underground)
Sentence length of reference text	6	7
Sentence length of generated text	5	8

Consequently:

$$\text{recall of first sentence} = \frac{\text{2}} {\text{6}} = 0.33 $$

$$\text{recall of second sentence} = \frac{\text{4}} {\text{7}} = 0.57 $$

$$\text{ROUGE-Lsum recall} = \frac{\text{0.33} + \text{0.57}} {\text{2}} = 0.45 $$

$$\text{precision of first sentence} = \frac{\text{2}} {\text{5}} = 0.4 $$

$$\text{precision of second sentence} = \frac{\text{4}} {\text{8}} = 0.5 $$

$$\text{ROUGE-Lsum precision} = \frac{\text{0.4} + \text{0.5}} {\text{2}} = 0.45 $$

$$\text{ROUGE-Lsum F}{_1} = \frac{\text{2} * \text{0.45} * \text{0.45}} {\text{0.45} + \text{0.45}} = 0.45 $$

ROUGE-N

#Metric

A set of metrics within the ROUGE family that compares the shared N-grams of a certain size in the reference text and generated text . For example:

ROUGE-1 measures the number of shared tokens in the reference text and generated text.
ROUGE-2 measures the number of shared bigrams (2-grams) in the reference text and generated text.
ROUGE-3 measures the number of shared trigrams (3-grams) in the reference text and generated text.

You can use the following formulas to calculate ROUGE-N recall and ROUGE-N precision for any member of the ROUGE-N family:

$$\text{ROUGE-N recall} = \frac{\text{number of matching N-grams}} {\text{number of N-grams in the reference text} }$$

$$\text{ROUGE-N precision} = \frac{\text{number of matching N-grams}} {\text{number of N-grams in the generated text} }$$

You can then use F ₁ to roll up ROUGE-N recall and ROUGE-N precision into a single metric:

$$\text{ROUGE-N F}{_1} = \frac{\text{2} * \text{ROUGE-N recall} * \text{ROUGE-N precision}} {\text{ROUGE-N recall} + \text{ROUGE-N precision} }$$

Suppose you decide to use ROUGE-2 to measure the effectiveness of an ML model's translation compared to a human translator's.

Category	Who produced?	Text	Bigrams
Reference text	Human translator	I want to understand a wide variety of things.	I want, want to, to understand, understand a, a wide, wide variety, variety of, of things
Generated text	ML model	I want to learn plenty of things.	I want, want to, to learn, learn plenty, plenty of, of things

Therefore:

The number of matching 2-grams is 3 ( I want , want to , and of things ).
The number of 2-grams in the reference text is 8.
The number of 2-grams in the generated text is 6.

Consequently:

$$\text{ROUGE-2 recall} = \frac{\text{3}} {\text{8} } = 0.375$$

$$\text{ROUGE-2 precision} = \frac{\text{3}} {\text{6} } = 0.5$$

$$\text{ROUGE-2 F}{_1} = \frac{\text{2} * \text{0.375} * \text{0.5}} {\text{0.375} + \text{0.5} } = 0.43$$
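
A minimal sketch of ROUGE-N for this example follows. It assumes a simple tokenizer (lowercasing, stripping the final period, splitting on whitespace) and counts each shared N-gram at most as often as it appears in both texts; the function names are illustrative, and real ROUGE packages handle tokenization and stemming in more detail.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token sequences, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(reference, generated, n):
    """Return (recall, precision, F1) over shared N-grams."""
    ref_counts = Counter(ngrams(reference.lower().rstrip(".").split(), n))
    gen_counts = Counter(ngrams(generated.lower().rstrip(".").split(), n))
    # Counter intersection keeps the smaller count of each shared N-gram.
    matches = sum((ref_counts & gen_counts).values())
    recall = matches / sum(ref_counts.values())
    precision = matches / sum(gen_counts.values())
    f1 = 2 * recall * precision / (recall + precision) if matches else 0.0
    return recall, precision, f1


recall, precision, f1 = rouge_n(
    "I want to understand a wide variety of things.",
    "I want to learn plenty of things.",
    n=2,
)
print(recall, precision, round(f1, 2))  # 0.375 0.5 0.43
```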

ROUGE-S

#Metric

A forgiving form of ROUGE-N that enables skip-gram matching. That is, ROUGE-N only counts N-grams that match exactly , but ROUGE-S also counts N-grams separated by one or more words. For example, consider the following:

reference text : White clouds
generated text : White billowing clouds

When calculating ROUGE-N, the 2-gram, White clouds doesn't match White billowing clouds . However, when calculating ROUGE-S, White clouds does match White billowing clouds .
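
The difference can be sketched by comparing adjacent bigrams with skip-bigrams, taken here to be every ordered pair of words regardless of the gap between them. This unlimited-gap definition and the function names are assumptions made for illustration; ROUGE-S variants may cap the allowed gap.

```python
from itertools import combinations


def adjacent_bigrams(tokens):
    """Pairs of neighboring words, as used by ROUGE-2."""
    return set(zip(tokens, tokens[1:]))


def skip_bigrams(tokens):
    """Every ordered pair of words, with any number of words in between."""
    return set(combinations(tokens, 2))


reference = "white clouds".split()
generated = "white billowing clouds".split()

target = ("white", "clouds")
print(target in adjacent_bigrams(generated))  # False
print(target in skip_bigrams(generated))      # True
```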

R-squared

#Metric

A regression metric indicating how much variation in a label is due to an individual feature or to a feature set. R-squared is a value between 0 and 1, which you can interpret as follows:

An R-squared of 0 means that none of a label's variation is due to the feature set.
An R-squared of 1 means that all of a label's variation is due to the feature set.
An R-squared between 0 and 1 indicates the extent to which the label's variation can be predicted from a particular feature or the feature set. For example, an R-squared of 0.10 means that 10 percent of the variance in the label is due to the feature set, an R-squared of 0.20 means that 20 percent is due to the feature set, and so on.

R-squared is the square of the Pearson correlation coefficient between the values that a model predicted and ground truth .

RTE

#Metric

Abbreviation for Recognizing Textual Entailment .

S

scoring

#Metric

The part of a recommendation system that provides a value or ranking for each item produced by the candidate generation phase.

similarity measure

#clustering

#Metric

In clustering algorithms, the metric used to determine how alike (how similar) any two examples are.

sparsity

#Metric

The number of elements set to zero (or null) in a vector or matrix divided by the total number of entries in that vector or matrix. For example, consider a 100-element matrix in which 98 cells contain zero. The calculation of sparsity is as follows:

$$ {\text{sparsity}} = \frac{\text{98}} {\text{100}} = {\text{0.98}} $$

Feature sparsity refers to the sparsity of a feature vector; model sparsity refers to the sparsity of the model weights.
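
A minimal sketch of the sparsity calculation for a matrix stored as a list of rows. The function name and the two non-zero values are illustrative assumptions; the shape matches the 100-element example above.

```python
def sparsity(matrix):
    """Share of entries that are zero (or None) in a list-of-rows matrix."""
    cells = [value for row in matrix for value in row]
    empty = sum(1 for value in cells if value is None or value == 0)
    return empty / len(cells)


# A 10 x 10 matrix in which only two cells are non-zero.
matrix = [[0] * 10 for _ in range(10)]
matrix[0][0] = 3.5
matrix[4][7] = 1.2
print(sparsity(matrix))  # 0.98
```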

SQuAD

#Metric

Acronym for Stanford Question Answering Dataset , introduced in the paper SQuAD: 100,000+ Questions for Machine Comprehension of Text . The questions in this dataset come from people posing questions about Wikipedia articles. Some of the questions in SQuAD have answers, but other questions intentionally don't have answers. Therefore, you can use SQuAD to evaluate an LLM's ability to do both of the following:

Answer questions that can be answered.
Identify questions that cannot be answered.

Exact match and F₁ are the most common metrics for evaluating LLMs against SQuAD.

squared hinge loss

#Metric

The square of the hinge loss . Squared hinge loss penalizes outliers more harshly than regular hinge loss.

squared loss

#fundamentals

#Metric

Synonym for L ₂ loss .

SuperGLUE

#Metric

An ensemble of datasets for rating an LLM's overall ability to understand and generate text. The ensemble consists of the following datasets:

Boolean Questions (BoolQ)
CommitmentBank (CB)
Choice of Plausible Alternatives (COPA)
Multi-sentence Reading Comprehension (MultiRC)
Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD)
Recognizing Textual Entailment (RTE)
Words in Context (WiC)
Winograd Schema Challenge (WSC)

For details, see SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems .

T

test loss

#fundamentals

#Metric

A metric representing a model's loss against the test set . When building a model , you typically try to minimize test loss. That's because a low test loss is a stronger quality signal than a low training loss or low validation loss .

A large gap between test loss and training loss or validation loss sometimes suggests that you need to increase the regularization rate .

top-k accuracy

#Metric

The percentage of times that a "target label" appears within the first k positions of generated lists. The lists could be personalized recommendations or a list of items ordered by softmax .

Top-k accuracy is also known as accuracy at k .

Consider a machine learning system that uses softmax to identify tree probabilities based on a picture of tree leaves. The following table shows output lists generated from five input tree pictures. Each row contains a target label and the five most likely trees. For example, when the target label was maple , the machine learning model identified elm as the most likely tree, oak as the second most likely tree, and so on.

Target label	1	2	3	4	5
maple	elm	oak	maple	beech	poplar
dogwood	oak	dogwood	poplar	hickory	maple
oak	oak	basswood	locust	alder	linden
linden	maple	paw-paw	oak	basswood	poplar
oak	locust	linden	oak	maple	paw-paw

The target label appears in the first position only once, so the top-1 accuracy is:

$$\text{top-1 accuracy} = \frac{\text{1}} {\text{5}} = 0.2$$

The target label appears in one of the top three positions four times, so the top-3 accuracy is:

$$\text{top-3 accuracy} = \frac{\text{4}} {\text{5}} = 0.8$$
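
The counting above can be sketched as follows, with each row of the table stored as a target label and its ranked list of predictions. The data layout and function name are illustrative assumptions.

```python
# (target label, five most likely trees in ranked order) for each input picture.
rows = [
    ("maple", ["elm", "oak", "maple", "beech", "poplar"]),
    ("dogwood", ["oak", "dogwood", "poplar", "hickory", "maple"]),
    ("oak", ["oak", "basswood", "locust", "alder", "linden"]),
    ("linden", ["maple", "paw-paw", "oak", "basswood", "poplar"]),
    ("oak", ["locust", "linden", "oak", "maple", "paw-paw"]),
]


def top_k_accuracy(rows, k):
    """Share of rows whose target label appears in the first k predictions."""
    hits = sum(1 for target, ranked in rows if target in ranked[:k])
    return hits / len(rows)


print(top_k_accuracy(rows, 1))  # 0.2
print(top_k_accuracy(rows, 3))  # 0.8
```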

toxicity

#Metric

The degree to which content is abusive, threatening, or offensive. Many machine learning models can identify, measure, and classify toxicity. Most of these models identify toxicity along multiple parameters, such as the level of abusive language and the level of threatening language.

training loss

#fundamentals

#Metric

A metric representing a model's loss during a particular training iteration. For example, suppose the loss function is Mean Squared Error . Perhaps the training loss (the Mean Squared Error) for the 10th iteration is 2.2, and the training loss for the 100th iteration is 1.9.

A loss curve plots training loss versus the number of iterations. A loss curve provides the following hints about training:

A downward slope implies that the model is improving.
An upward slope implies that the model is getting worse.
A flat slope implies that the model has reached convergence.
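As a rough illustration of these three hints, the following Python sketch labels the trend of a window of training losses. The function name `describe_slope` and the `tolerance` value are assumptions chosen for the example. Real loss curves are noisy, so a practical check would usually smooth the values first.

```python
def describe_slope(losses, tolerance=1e-3):
    """Label the overall trend across a window of training losses.

    `tolerance` is an arbitrary threshold (an assumption for this sketch)
    below which the change in loss is treated as flat.
    """
    change = losses[-1] - losses[0]
    if abs(change) <= tolerance:
        return "flat"  # suggests convergence
    return "downward" if change < 0 else "upward"


# Losses at the 10th and 100th iterations, taken from the example above.
print(describe_slope([2.2, 1.9]))  # downward
```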

For example, the following somewhat idealized loss curve shows:

A steep downward slope during the initial iterations, which implies rapid model improvement.
A gradually flattening (but still downward) slope until close to the end of training, which implies continued model improvement at a somewhat slower pace than during the initial iterations.
A flat slope towards the end of training, which suggests convergence.

[Figure: plot of training loss versus iterations. The loss curve starts with a steep downward slope, which gradually flattens until the slope becomes zero.]

Although training loss is important, see also generalization.

Trivia Question Answering

#Metric

Datasets to evaluate an LLM's ability to answer trivia questions. Each dataset contains question-answer pairs authored by trivia enthusiasts. Different datasets are grounded by different sources, including:

Web search (TriviaQA)
Wikipedia (TriviaQA_wiki)

For more information, see TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension.

true negative (TN)

#fundamentals

#Metric

An example in which the model correctly predicts the negative class. For example, the model infers that a particular email message is not spam, and that email message really is not spam.

true positive (TP)

#fundamentals

#Metric

An example in which the model correctly predicts the positive class. For example, the model infers that a particular email message is spam, and that email message really is spam.

true positive rate (TPR)

#fundamentals

#Metric

Synonym for recall. That is:

$$\text{true positive rate} = \frac {\text{true positives}} {\text{true positives} + \text{false negatives}}$$

True positive rate is the y-axis in an ROC curve.
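The formula can be checked with a minimal Python sketch. The function name `true_positive_rate` and the sample labels are made up for illustration; 1 marks the positive class (for example, spam) and 0 the negative class.

```python
def true_positive_rate(y_true, y_pred, positive=1):
    """Compute TP / (TP + FN) from paired ground-truth and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    true_positives = sum(1 for t, p in pairs if t == positive and p == positive)
    false_negatives = sum(1 for t, p in pairs if t == positive and p != positive)
    if true_positives + false_negatives == 0:
        raise ValueError("y_true contains no positive examples")
    return true_positives / (true_positives + false_negatives)


# Hypothetical labels: four actual positives, three of which the model found.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

print(true_positive_rate(y_true, y_pred))  # 0.75
```

The false positive in the last position does not affect the result, because true positive rate looks only at examples whose actual label is positive.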

Typologically Diverse Question Answering (TyDi QA)

#Metric

A large dataset for evaluating an LLM's proficiency in answering questions. The dataset contains question and answer pairs in many languages.

For details, see TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages.

V

validation loss

#fundamentals

#Metric

A metric representing a model's loss on the validation set during a particular iteration of training.

variable importances

#df

#Metric

A set of scores that indicates the relative importance of each feature to the model.

For example, consider a decision tree that estimates house prices. Suppose this decision tree uses three features: size, age, and style. If the variable importances for the three features are calculated to be {size=5.8, age=2.5, style=4.7}, then size is more important to the decision tree than age or style.
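A short Python sketch of reading such a set of scores, using the values from the example above. How the scores are computed depends on the chosen variable importance metric and is not shown here.

```python
# Variable importances from the house-price example above.
importances = {"size": 5.8, "age": 2.5, "style": 4.7}

# Rank features from most to least important.
ranked = sorted(importances, key=importances.get, reverse=True)

print(ranked)  # ['size', 'style', 'age']
```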

Different variable importance metrics exist, which can inform ML experts about different aspects of models.

W

Wasserstein loss

#Metric

One of the loss functions commonly used in generative adversarial networks, based on the earth mover's distance between the distribution of generated data and real data.
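To give a feel for the underlying distance, the following Python sketch computes the earth mover's distance between two one-dimensional samples of equal size. This is only an illustration of the distance itself under those simplifying assumptions. It is not the Wasserstein loss as used when training a generative adversarial network, which is typically estimated with a learned critic model. The function name is hypothetical.

```python
def earth_movers_distance_1d(xs, ys):
    """Earth mover's distance between two equal-size 1-D samples.

    For equal-size samples with uniform weights, the cheapest way to move
    one sample onto the other pairs the values in sorted order.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Made-up samples: the generated values are the real values shifted by 1.
real = [0.0, 1.0, 2.0]
generated = [1.0, 2.0, 3.0]

print(earth_movers_distance_1d(real, generated))  # 1.0
```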

WiC

#Metric

Abbreviation for Words in Context.

WikiLingua (wiki_lingua)

#Metric

A dataset for evaluating an LLM's ability to summarize short articles. WikiHow, an encyclopedia of articles explaining how to do various tasks, is the human-authored source for both the articles and the summaries. Each entry in the dataset consists of:

An article, which is created by appending each step of the prose (paragraph) version of the numbered list, minus the opening sentence of each step.
A summary of that article, consisting of the opening sentence of each step in the numbered list.

For details, see WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization.

Winograd Schema Challenge (WSC)

#Metric

A format (or dataset conforming to that format) for evaluating an LLM's ability to determine the noun phrase that a pronoun refers to.

Each entry in a Winograd Schema Challenge consists of:

A short passage, which contains a target pronoun
A target pronoun
Candidate noun phrases, followed by the correct answer (a Boolean). If the target pronoun refers to this candidate, the answer is True. If the target pronoun does not refer to this candidate, the answer is False.

For example:

Passage: Mark told Pete many lies about himself, which Pete included in his book. He should have been more truthful.
Target pronoun: He
Candidate noun phrases:
- Mark: True, because the target pronoun refers to Mark
- Pete: False, because the target pronoun doesn't refer to Pete
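One way to picture such an entry is as a small record plus a scoring function. The sketch below is an assumption made for illustration: the class name, field names, and scoring function are hypothetical and do not reflect the schema of any published dataset.

```python
from dataclasses import dataclass


@dataclass
class WinogradEntry:
    """Hypothetical container for one Winograd Schema Challenge entry."""
    passage: str
    target_pronoun: str
    candidates: dict  # candidate noun phrase -> correct answer (bool)


def candidate_accuracy(entry, predictions):
    """Fraction of candidates whose predicted Boolean matches the answer."""
    correct = sum(
        1 for phrase, answer in entry.candidates.items()
        if predictions.get(phrase) == answer
    )
    return correct / len(entry.candidates)


entry = WinogradEntry(
    passage=(
        "Mark told Pete many lies about himself, which Pete included in "
        "his book. He should have been more truthful."
    ),
    target_pronoun="He",
    candidates={"Mark": True, "Pete": False},
)

# Made-up model output that wrongly links the pronoun to both names.
predictions = {"Mark": True, "Pete": True}
print(candidate_accuracy(entry, predictions))  # 0.5
```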

The Winograd Schema Challenge is a component of the SuperGLUE ensemble.

Words in Context (WiC)

#Metric

A dataset for evaluating how well an LLM uses context to understand words that have multiple meanings. Each entry in the dataset contains:

Two sentences, each containing the target word
The target word
The correct answer (a Boolean), where:
- True means the target word has the same meaning in the two sentences
- False means the target word has a different meaning in the two sentences

For example:

Two sentences:
- There's a lot of trash on the bed of the river.
- I keep a glass of water next to my bed when I sleep.
The target word: bed
Correct answer: False, because the target word has a different meaning in the two sentences.

For details, see WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations.

Words in Context is a component of the SuperGLUE ensemble.

WSC

#Metric

Abbreviation for Winograd Schema Challenge.

X

XL-Sum (xlsum)

#Metric

A dataset for evaluating an LLM's proficiency in summarizing text. XL-Sum provides entries in many languages. Each entry in the dataset contains:

An article, taken from the British Broadcasting Corporation (BBC).
A summary of the article, written by the article's author. Note that the summary can contain words or phrases not present in the article.

For details, see XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages.
