Categorical data: Common issues
Numerical data is often recorded by scientific instruments or automated measurements. Categorical data, on the other hand, is often categorized by human beings or by machine learning (ML) models. Who decides on categories and labels, and how they make those decisions, affects the reliability and usefulness of that data.
Human raters
Data manually labeled by human beings is often referred to as "gold labels," and is considered more desirable than machine-labeled data for training models, due to its relatively better quality.
This doesn't necessarily mean that any set of human-labeled data is of high quality. Human errors, bias, and malice can be introduced at the point of data collection or during data cleaning and processing. Check for them before training.
Any two human beings may label the same example differently. The difference between human raters' decisions is called inter-rater agreement. You can get a sense of the variance in raters' opinions by using multiple raters per example and measuring inter-rater agreement.
Click to learn about inter-rater agreement metrics
The following are ways to measure inter-rater agreement:
- Cohen's kappa and variants
- Intra-class correlation (ICC)
- Krippendorff's alpha
For details on Cohen's kappa and intra-class correlation, see Hallgren 2012. For details on Krippendorff's alpha, see Krippendorff 2011.
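As a rough illustration of the first metric above, Cohen's kappa for two raters can be computed directly from its definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The rater labels below are made-up examples, not real data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of examples both raters labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two raters label the same six examples; they agree on four.
rater_1 = ["cat", "dog", "cat", "bird", "dog", "cat"]
rater_2 = ["cat", "dog", "dog", "bird", "dog", "dog"]
print(round(cohens_kappa(rater_1, rater_2), 3))  # → 0.5
```

A kappa of 1.0 means perfect agreement, 0.0 means agreement no better than chance; for production work, a library implementation with significance testing is preferable to this sketch.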
Machine raters
Machine-labeled data, where categories are automatically determined by one or more classification models, is often referred to as "silver labels." Machine-labeled data can vary widely in quality. Check it not only for accuracy and biases, but also for violations of common sense, reality, and intention. For example, if a computer-vision model mislabels a photo of a chihuahua as a muffin, or a photo of a muffin as a chihuahua, models trained on that labeled data will be of lower quality.
Similarly, a sentiment analyzer that scores neutral words as -0.25, when 0.0 is the neutral value, might be scoring all words with an additional negative bias that is not actually present in the data. An oversensitive toxicity detector may falsely flag many neutral statements as toxic. Try to get a sense of the quality and biases of machine labels and annotations in your data before training on it.
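One simple sanity check for the sentiment-bias problem described above, sketched here with made-up scores and a hypothetical list of known-neutral words, is to average the machine labeler's scores over terms that should score near 0.0; a mean far from zero hints at a systematic offset:

```python
# Hypothetical sentiment scores produced by a machine labeler for words
# that a human would consider neutral (0.0 is the neutral value).
machine_scores = {
    "table": -0.22, "window": -0.31, "walk": -0.18,
    "paper": -0.27, "chair": -0.29,
}

# Average the scores; a well-calibrated labeler should land near 0.0.
mean_score = sum(machine_scores.values()) / len(machine_scores)
print(f"Mean score on neutral words: {mean_score:.2f}")  # → -0.25

# An arbitrary tolerance of 0.1 is assumed here for illustration.
if abs(mean_score) > 0.1:
    print("Possible systematic bias; audit the labels before training.")
```

The same pattern applies to a toxicity detector: run it over a set of statements you know to be benign and inspect how many are flagged.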
High dimensionality
Categorical data tends to produce high-dimensional feature vectors; that is, feature vectors having a large number of elements. High dimensionality increases training costs and makes training more difficult. For these reasons, ML experts often seek ways to reduce the number of dimensions prior to training.
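To see where the dimensionality comes from, consider a one-hot encoding, in which each categorical value becomes a vector with one element per possible value. The tiny vocabulary below is purely illustrative:

```python
# A categorical feature's one-hot vector has one element per distinct value.
vocabulary = ["red", "green", "blue", "yellow", "purple"]

def one_hot(value, vocab):
    """Encode a single categorical value as a one-hot feature vector."""
    vec = [0] * len(vocab)
    vec[vocab.index(value)] = 1
    return vec

print(one_hot("blue", vocabulary))  # → [0, 0, 1, 0, 0]
```

With a realistic natural-language vocabulary of tens of thousands of words, each such vector would have tens of thousands of elements, nearly all of them zero, which is the high dimensionality discussed above.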
For natural-language data, the main method of reducing dimensionality is to convert feature vectors to embedding vectors. This is discussed in the Embeddings module later in this course.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2024-11-06 (UTC).