Data traps

Estimated time: 1.5 hours
Learning objectives

In this module, you will learn to:
- Investigate potential issues underlying raw or processed datasets, including collection and quality issues.
- Identify biases, invalid inferences, and rationalizations.
- Find common issues in data analysis, including correlation, relatedness, and irrelevance.
- Examine a chart for common problems, misperceptions, and misleading display and design choices.
ML motivation

While not as glamorous as model architectures and other downstream model work, data exploration, documentation, and preprocessing are critical to ML work. ML practitioners can fall into what Nithya Sambasivan et al. called [data cascades](https://research.google/blog/data-cascades-in-machine-learning/) in their [2021 ACM paper](https://dl.acm.org/doi/10.1145/3411764.3445518) if they do not deeply understand:

- the conditions under which their data is collected
- the quality, characteristics, and limitations of the data
- what the data can and can't show
It's very expensive to train models on bad data, only to find out at the point of low-quality outputs that there were problems with the data. Likewise, a failure to grasp the limitations of data, human biases in collecting data, or mistaking correlation for causation can result in over-promising and under-delivering, which can lead to a loss of trust.
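To make the correlation-versus-causation point concrete, here is a minimal sketch (not from the course itself) showing that two unrelated quantities can correlate strongly simply because both trend over time. The variable names (`ice_cream`, `sunscreen`) are hypothetical placeholders:

```python
import random

random.seed(0)  # deterministic for illustration

# Two unrelated quantities that both happen to rise over time
# (hypothetical example: ice-cream sales and sunscreen sales).
years = range(10)
ice_cream = [100 + 10 * t + random.gauss(0, 3) for t in years]
sunscreen = [50 + 8 * t + random.gauss(0, 3) for t in years]

def pearson(x, y):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(ice_cream, sunscreen)
print(f"correlation: {r:.2f}")  # very high, yet neither series causes the other
```

The shared upward trend (a confounder, here "time") produces a near-perfect correlation even though the two series were generated independently.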
This course walks through common but subtle data traps that ML and data practitioners may encounter in their work.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2024-07-26 UTC.