A new and improved version of Machine Learning Crash Course is coming in August 2024. Stay tuned!

California Housing Data Set Description

Many of the Machine Learning Crash Course Programming Exercises use the California housing data set, which contains data drawn from the 1990 U.S. Census. The following table provides descriptions, data ranges, and data types for each feature in the data set.

Column title	Description	Range*	Datatype
`longitude`	A measure of how far west a house is; a more negative value is farther west	Longitude values range from -180 to +180 Data set min: -124.3 Data set max: -114.3	float64
`latitude`	A measure of how far north a house is; a higher value is farther north	Latitude values range from -90 to +90 Data set min: 32.5 Data set max: 42.5	float64
`housingMedianAge`	Median age of a house within a block; a lower number is a newer building	Data set min: 1.0 Data set max: 52.0	float64
`totalRooms`	Total number of rooms within a block	Data set min: 2.0 Data set max: 37937.0	float64
`totalBedrooms`	Total number of bedrooms within a block	Data set min: 1.0 Data set max: 6445.0	float64
`population`	Total number of people residing within a block	Data set min: 3.0 Data set max: 35682.0	float64
`households`	Total number of households, a group of people residing within a home unit, for a block	Data set min: 1.0 Data set max: 6082.0	float64
`medianIncome`	Median income for households within a block of houses (measured in tens of thousands of US Dollars)	Data set min: 0.5 Data set max: 15.0	float64
`medianHouseValue`	Median house value for households within a block (measured in US Dollars)	Data set min: 14999.0 Data set max: 500001.0	float64

* Min and max values in the table below were obtained from the Exercise notebooks using pandas.DataFrame.describe() on the California Housing data set

Reference

Pace, R. Kelley, and Ronald Barry, "Sparse Spatial Autoregressions," Statistics and Probability Letters, Volume 33, Number 3, May 5 1997, p. 291-297.

The following is the data methodology described in the paper:

We collected information on the variables using all the block groups in California from the 1990 Census. In this sample a block group on average includes 1425.5 individuals living in a geographically compact area. Naturally, the geographical area included varies inversely with the population density. We computed distances among the centroids of each block group as measured in latitude and longitude. We excluded all the block groups reporting zero entries for the independent and dependent variables. The final data contained 20,640 observations on 9 characteristics.