Summary

This course has walked through many common data traps, from dataset quality and thinking to visualization and statistical analysis.

ML practitioners should ask:

  • How well do I understand the characteristics of my datasets and the conditions under which that data was collected?
  • What quality or bias issues exist in my data? Are confounding factors present?
  • What potential downstream issues could arise from using these particular datasets?
  • When training a model that makes predictions or classifications: does the dataset that the model is trained on contain all relevant variables?
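
Some of these questions can be partly checked in code. The following is a minimal sketch, not a complete audit: it counts missing values, exact duplicate rows, and label balance for a dataset held as a list of dictionaries. The function name `audit_rows`, the column names, and the toy rows are all hypothetical and chosen only for illustration. Checks like these cannot reveal confounding factors or collection conditions, which still require knowledge of how the data was gathered.

```python
from collections import Counter


def audit_rows(rows, label_key):
    """Summarize missing values, exact duplicate rows, and label balance."""
    columns = sorted({key for row in rows for key in row})

    # Treat a value as missing if the key is absent, None, or an empty string.
    missing = {
        col: sum(1 for row in rows if row.get(col) in (None, ""))
        for col in columns
    }

    # Exact duplicates: rows whose values match an earlier row in every column.
    seen = Counter(tuple((col, row.get(col)) for col in columns) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())

    # Label distribution, to spot heavy class imbalance.
    labels = Counter(row.get(label_key) for row in rows)

    return {
        "n_rows": len(rows),
        "missing": missing,
        "duplicates": duplicates,
        "labels": dict(labels),
    }


# Hypothetical toy data, for illustration only.
rows = [
    {"age": 34, "region": "north", "label": "yes"},
    {"age": None, "region": "north", "label": "no"},
    {"age": 34, "region": "north", "label": "yes"},
    {"age": 51, "region": "", "label": "no"},
]

report = audit_rows(rows, label_key="label")
print(report)
```

A report like this is only a starting point: a column with no missing values can still be biased, and a balanced label count says nothing about whether all relevant variables were collected.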

Whatever their findings, ML practitioners should always examine themselves for confirmation bias, check their findings against their intuition and common sense, and investigate wherever the data conflicts with these.

Additional reading

Cairo, Alberto. How Charts Lie: Getting Smarter about Visual Information. NY: W.W. Norton, 2019.

Huff, Darrell. How to Lie with Statistics. NY: W.W. Norton, 1954.

Monmonier, Mark. How to Lie with Maps, 3rd ed. Chicago: U of Chicago P, 2018.

Jones, Ben. Avoiding Data Pitfalls. Hoboken, NJ: Wiley, 2020.

Wheelan, Charles. Naked Statistics: Stripping the Dread from the Data. NY: W.W. Norton, 2013.