Datasets: Labels
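This section focuses on labels.
Direct versus proxy labels
Consider two different kinds of labels:
- Direct labels: labels identical to the prediction your model is trying to make. That is, the prediction your model is trying to make is present as a column in your dataset. For example, a column named "bicycle owner" would be a direct label for a binary classification model that predicts whether a person owns a bicycle.
- Proxy labels: labels that are similar, but not identical, to the prediction your model is trying to make. For example, a person subscribing to Bicycle Bizarre magazine probably, but not definitely, owns a bicycle.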
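Direct labels are generally better than proxy labels. If your dataset provides a possible direct label, you should probably use it. Often, though, direct labels aren't available.
Proxy labels are always a compromise, an imperfect approximation of a direct label. However, some proxy labels are close enough approximations to be useful. Models that use proxy labels are only as useful as the connection between the proxy label and the prediction.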
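One way to gauge the connection between a proxy label and the prediction is to compare the two on a small sample where both are known. The following is a minimal sketch under that assumption. The sample rows, the column meanings, and the function name proxy_quality are hypothetical and do not come from any real dataset.

```python
# Hypothetical audited sample: each row pairs a proxy value with the true
# (direct) label, collected for a small subset of people.
sample = [
    # (subscribes_to_magazine, owns_bicycle)
    (True, True),
    (True, True),
    (True, False),
    (False, True),
    (False, False),
    (False, False),
    (True, True),
    (False, False),
]


def proxy_quality(pairs):
    """Return (precision, recall) of the proxy relative to the direct label."""
    true_pos = sum(1 for proxy, direct in pairs if proxy and direct)
    proxy_pos = sum(1 for proxy, _ in pairs if proxy)
    direct_pos = sum(1 for _, direct in pairs if direct)
    # Guard against empty denominators in tiny samples.
    precision = true_pos / proxy_pos if proxy_pos else 0.0
    recall = true_pos / direct_pos if direct_pos else 0.0
    return precision, recall


precision, recall = proxy_quality(sample)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision here answers "of the people the proxy flags, how many really own a bicycle?", and recall answers "of the real owners, how many does the proxy flag?". Low values on either suggest the proxy is a weak stand-in for the direct label.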
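Recall that every label must be represented as a floating-point number in the feature vector (because machine learning is fundamentally just a huge amalgam of mathematical operations). Sometimes, a direct label exists but can't be easily represented as a floating-point number in the feature vector. In this case, use a proxy label.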
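For a binary label, the floating-point representation can be as simple as mapping the two raw values to 1.0 and 0.0. The sketch below assumes the "bicycle owner" column stores the strings "yes" and "no". That storage format, the mapping, and the function name encode_label are illustrative assumptions.

```python
# Hypothetical mapping from raw column values to floating-point labels.
LABEL_TO_FLOAT = {"yes": 1.0, "no": 0.0}


def encode_label(raw_value):
    """Convert a raw 'bicycle owner' value to a float, or raise if unknown."""
    key = raw_value.strip().lower()
    if key not in LABEL_TO_FLOAT:
        # Fail loudly instead of silently guessing a label.
        raise ValueError(f"unrecognized label value: {raw_value!r}")
    return LABEL_TO_FLOAT[key]


raw_labels = ["Yes", "no", " YES ", "No"]
encoded = [encode_label(value) for value in raw_labels]
print(encoded)
```

Raising an error on unexpected values keeps malformed rows from being quietly turned into labels.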
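Exercise: Check your understanding
Your company wants to do the following:
Mail coupons ("Trade in your old bicycle for 15% off a new bicycle") to bicycle owners.
So, your model must do the following:
Predict which people own a bicycle.
Unfortunately, the dataset doesn't contain a column named "bike owner". However, the dataset does contain a column named "recently bought a bicycle".
Would "recently bought a bicycle" be a good proxy label or a poor proxy label for this model?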
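Good proxy label
The column "recently bought a bicycle" is a relatively good proxy label. After all, most people who buy bicycles now own bicycles. Nevertheless, like all proxy labels, even very good ones, "recently bought a bicycle" is imperfect. The person buying an item isn't always the person using (or owning) that item. For example, people sometimes buy bicycles as gifts.
Poor proxy label
Like all proxy labels, "recently bought a bicycle" is imperfect (some bicycles are bought as gifts and given to others). However, "recently bought a bicycle" is still a relatively good indicator that someone owns a bicycle.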
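Human-generated data
Some data is human-generated; that is, one or more humans examine some information and provide a value, usually for the label. For example, one or more meteorologists could examine pictures of the sky and identify cloud types.
Alternatively, some data is automatically generated. That is, software (possibly another machine learning model) determines the value. For example, a machine learning model could examine sky pictures and automatically identify cloud types.
This section explores the advantages and disadvantages of human-generated data.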
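Advantages
- Human raters can perform a wide range of tasks that even sophisticated machine learning models may find difficult.
- The process forces the owner of the dataset to develop clear and consistent criteria.
Disadvantages
- You typically pay human raters, so human-generated data can be expensive.
- To err is human. Therefore, multiple human raters might have to evaluate the same data.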
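When several raters evaluate the same example, their ratings have to be combined somehow. A simple majority vote is one common option. The sketch below illustrates it under assumed data: the image IDs, cloud-type ratings, and the function name majority_label are hypothetical, and the text above does not prescribe a particular aggregation rule.

```python
from collections import Counter


def majority_label(ratings):
    """Return the most common rating, or None when there is no clear winner."""
    if not ratings:
        return None
    counts = Counter(ratings).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # Tie: flag the example for further review.
    return counts[0][0]


# Hypothetical cloud-type ratings from three raters per image.
ratings_by_image = {
    "img_001": ["cumulus", "cumulus", "stratus"],
    "img_002": ["cirrus", "cirrus", "cirrus"],
    "img_003": ["stratus", "cumulus", "cirrus"],
}

consensus = {image: majority_label(r) for image, r in ratings_by_image.items()}
print(consensus)
```

Returning None on ties, rather than picking arbitrarily, leaves room to send contested examples to an additional rater.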
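Think through these questions to determine your needs:
- How skilled must your raters be? (For example, must the raters know a specific language? Do you need linguists for dialogue or NLP applications?)
- How many labeled examples do you need? How soon do you need them?
- What's your budget?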
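Always double-check your human raters. For example, label 1000 examples yourself, and see how your results match other raters' results. If discrepancies surface, don't assume your ratings are the correct ones, especially if a value judgment is involved. If human raters have introduced errors, consider adding instructions to help them and try again.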
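Comparing your labels with another rater's labels can be sketched as follows. The raw agreement rate is the direct reading of "how your results match". Cohen's kappa, which corrects for agreement expected by chance, is an additional metric brought in here and is not named in the text above. The label lists and function names are hypothetical.

```python
from collections import Counter


def agreement_rate(labels_a, labels_b):
    """Fraction of examples on which the two raters gave the same label."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(labels_a)
    observed = agreement_rate(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both raters pick the same class at random.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical class throughout.
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for the same ten examples (1.0 = owns a bicycle).
my_labels = [1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0]
rater_labels = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0]

agreement = agreement_rate(my_labels, rater_labels)
kappa = cohens_kappa(my_labels, rater_labels)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

A high raw agreement with a low kappa can indicate that one class dominates the data, so the raters agree mostly by default. Neither number says which rater is right; disagreements still need to be inspected by hand.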
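Looking at your data by hand is a good exercise regardless of how you obtained your data. Andrej Karpathy did this on ImageNet and wrote about the experience (http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet).
Models can train on a mix of automated and human-generated labels. However, for most models, an extra set of human-generated labels (which can become stale) is generally not worth the extra complexity and maintenance. That said, human-generated labels can sometimes provide extra information not available in the automated labels.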
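Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-02-26 UTC.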