Categorical data: Common issues
Numerical data is often recorded by scientific instruments or automated measurements. Categorical data, on the other hand, is often categorized by human beings or by machine learning (ML) models. Who decides on categories and labels, and how they make those decisions, affects the reliability and usefulness of that data.
Human raters
Data manually labeled by human beings is often referred to as gold labels, and is considered more desirable than machine-labeled data for training models, due to relatively better data quality.
This doesn't necessarily mean that any set of human-labeled data is high quality. Human errors, bias, and malice can be introduced at the point of data collection or during data cleaning and processing. Check for these problems before training.
Any two human beings may label the same example differently. The difference between human raters' decisions is called inter-rater agreement. You can get a sense of the variance in raters' opinions by using multiple raters per example and measuring inter-rater agreement.
Click to learn about inter-rater agreement metrics
The following are ways to measure inter-rater agreement:
- Cohen's kappa and variants
- Intra-class correlation (ICC)
- Krippendorff's alpha
For details on Cohen's kappa and intra-class correlation, see Hallgren 2012. For details on Krippendorff's alpha, see Krippendorff 2011.
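For two raters assigning the same set of labels, Cohen's kappa corrects raw agreement for the agreement expected by chance. A minimal sketch in pure Python, using hypothetical spam/ham labels from two raters:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    # Observed agreement: fraction of examples given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two raters labeling the same 10 examples; they disagree on two of them.
rater_1 = ["spam", "ham", "spam", "spam", "ham",
           "ham", "spam", "ham", "ham", "spam"]
rater_2 = ["spam", "ham", "ham", "spam", "ham",
           "ham", "spam", "spam", "ham", "spam"]
print(round(cohens_kappa(rater_1, rater_2), 2))  # 0.6
```

A kappa of 1.0 means perfect agreement and 0.0 means agreement no better than chance, so the 0.6 here indicates moderate-to-substantial agreement despite the 80% raw match rate.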
Machine raters
Machine-labeled data, where categories are automatically determined by one or more classification models, is often referred to as silver labels. Machine-labeled data can vary widely in quality. Check it not only for accuracy and biases but also for violations of common sense, reality, and intention. For example, if a computer-vision model mislabels a photo of a chihuahua as a muffin, or a photo of a muffin as a chihuahua, models trained on that labeled data will be of lower quality.
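One practical way to check silver-label quality, sketched below with hypothetical labels: compare the model's labels against a small human-audited (gold) sample, and look at both overall accuracy and which class pairs account for the errors.

```python
from collections import Counter

# Hypothetical audit: gold labels from human review vs. silver labels
# produced by the classification model, on the same six photos.
gold   = ["muffin", "chihuahua", "muffin", "chihuahua", "muffin", "chihuahua"]
silver = ["muffin", "muffin",    "muffin", "chihuahua", "chihuahua", "chihuahua"]

# Overall accuracy on the audited sample.
accuracy = sum(g == s for g, s in zip(gold, silver)) / len(gold)

# Which (gold, silver) pairs account for the mistakes?
confusions = Counter((g, s) for g, s in zip(gold, silver) if g != s)

print(f"Silver-label accuracy on audit set: {accuracy:.2f}")
for (g, s), count in confusions.items():
    print(f"  {g} mislabeled as {s}: {count}")
```

The confusion counts matter as much as the accuracy number: a model that mislabels classes symmetrically causes different downstream problems than one that is systematically biased toward a single class.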
Similarly, a sentiment analyzer that scores neutral words as -0.25, when 0.0 is the neutral value, might be scoring all words with an additional negative bias that is not actually present in the data. An oversensitive toxicity detector may falsely flag many neutral statements as toxic. Get a sense of the quality and biases of machine labels and annotations in your data before training on it.
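One way to detect this kind of systematic offset, sketched below with a hypothetical `score_sentiment` function standing in for a real model: score a probe set of words you already know to be neutral and inspect the mean deviation from 0.0.

```python
# Hypothetical sentiment scorer standing in for a real model; for
# illustration it carries the constant negative offset described above.
def score_sentiment(word):
    return 0.0 - 0.25  # should return 0.0 for neutral words

# Probe words chosen because their true sentiment is neutral (0.0).
neutral_probes = ["table", "window", "walk", "tuesday"]
scores = [score_sentiment(word) for word in neutral_probes]

# A mean far from 0.0 suggests the rater injects bias into every score.
mean_bias = sum(scores) / len(scores)
print(f"Mean score on neutral probes: {mean_bias:+.2f}")  # -0.25
```

If the probe set reveals a stable offset, you can either correct the scores before training or treat the offset as a known limitation of the machine rater.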
High dimensionality
Categorical data tends to produce high-dimensional feature vectors; that is, feature vectors having a large number of elements. High dimensionality increases training costs and makes training more difficult. For these reasons, ML experts often seek ways to reduce the number of dimensions prior to training.
For natural-language data, the main method of reducing dimensionality is to convert feature vectors to embedding vectors. This is discussed in the Embeddings module later in this course.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2024-11-06 UTC.