Many of the Machine Learning Crash Course Programming Exercises use the California housing data set, which contains data drawn from the 1990 U.S. Census. The following table provides descriptions, data ranges, and data types for each feature in the data set.
Column title | Description | Range* | Datatype |
---|---|---|---|
longitude | A measure of how far west a house is; a more negative value is farther west | | float64 |
latitude | A measure of how far north a house is; a higher value is farther north | | float64 |
housingMedianAge | Median age of a house within a block; a lower number is a newer building | | float64 |
totalRooms | Total number of rooms within a block | | float64 |
totalBedrooms | Total number of bedrooms within a block | | float64 |
population | Total number of people residing within a block | | float64 |
households | Total number of households within a block, where a household is a group of people residing within a home unit | | float64 |
medianIncome | Median income for households within a block of houses (measured in tens of thousands of US Dollars) | | float64 |
medianHouseValue | Median house value for households within a block (measured in US Dollars) | | float64 |
* Min and max values in the table above were obtained from the Exercise notebooks using pandas.DataFrame.describe() on the California Housing data set.
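As a rough illustration of the footnote, the sketch below pulls the min and max rows out of pandas.DataFrame.describe() and pairs them with each column's data type. The helper name feature_ranges and the three-row sample frame are hypothetical placeholders, not values from the census data; in practice the frame would be loaded from the data set used in the Exercise notebooks, whose file column names may differ from the titles in the table.

```python
import pandas as pd


def feature_ranges(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per numeric column with its min, max, and dtype."""
    # describe() reports count, mean, std, min, quartiles, and max;
    # keep only the two rows needed for a range, then put features on rows.
    summary = df.describe().loc[["min", "max"]].T
    summary["dtype"] = df.dtypes[summary.index].astype(str)
    return summary


# Hypothetical toy rows for illustration only (NOT real census values).
sample = pd.DataFrame(
    {
        "longitude": [-122.0, -118.5, -120.25],
        "latitude": [37.5, 34.0, 36.75],
        "medianIncome": [3.5, 2.0, 5.25],
    }
)

ranges = feature_ranges(sample)
print(ranges)
```

Calling the same helper on the full data set should give the min and max for each of the nine features listed in the table.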
Reference
Pace, R. Kelley, and Ronald Barry, "Sparse Spatial Autoregressions," Statistics and Probability Letters, Volume 33, Number 3, May 5, 1997, pp. 291-297.
The following is the data methodology described in the paper:
We collected information on the variables using all the block groups in California from the 1990 Census. In this sample a block group on average includes 1425.5 individuals living in a geographically compact area. Naturally, the geographical area included varies inversely with the population density. We computed distances among the centroids of each block group as measured in latitude and longitude. We excluded all the block groups reporting zero entries for the independent and dependent variables. The final data contained 20,640 observations on 9 characteristics.
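The exclusion step in the quoted methodology can be sketched in pandas as follows. This is one possible reading, assuming that a block group is dropped when any checked variable equals zero; the paper's exact rule is not given here. The helper name drop_zero_entries and the sample rows are hypothetical.

```python
import pandas as pd


def drop_zero_entries(df: pd.DataFrame, columns=None) -> pd.DataFrame:
    """Drop rows that contain a zero in any of the checked columns.

    Assumption: "zero entries" means a value of exactly 0 in any variable.
    """
    cols = list(df.columns) if columns is None else list(columns)
    keep = (df[cols] != 0).all(axis=1)  # True where every checked value is nonzero
    return df.loc[keep].reset_index(drop=True)


# Hypothetical block groups for illustration only (NOT real census values).
blocks = pd.DataFrame(
    {
        "population": [1200.0, 0.0, 850.0],
        "households": [400.0, 0.0, 300.0],
        "medianHouseValue": [150000.0, 90000.0, 0.0],
    }
)

filtered = drop_zero_entries(blocks)
print(filtered)
```

Passing a columns list restricts the check to specific variables, which may be useful if some features can legitimately be zero.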