The following exercise walks you through the process of manually creating a similarity measure.
Imagine you have a simple dataset on houses as follows:
Feature | Type |
---|---|
Price | Positive integer |
Size | Positive floating-point value in units of square meters |
Postal code | Integer |
Number of bedrooms | Integer |
Type of house | A text value from “single_family," “multi-family," “apartment,” “condo” |
Garage | 0/1 for no/yes |
Colors | Multivalent categorical: one or more values from standard colors “white,” ”yellow,” ”green,” etc. |
Preprocessing
The first step is preprocessing the numerical features: price, size, number of bedrooms, and postal code. For each of these features you will have to perform a different operation. For example, in this case, assume that pricing data follows a bimodal distribution. What should you do next?
In the field below, try explaining how you would process size data.
In the field below, try explaining what how you would process data on the number of bedrooms.
How should you represent postal codes? Convert postal codes to longitude and latitude. Then process those values as you would process other numeric values.
Calculating Similarity per Feature
Now it is time to calculate the similarity per feature. For numeric features, you simply find the difference. For binary features, such as if a house has a garage, you can also find the difference to get 0 or 1. But what about categorical features? Answer the questions below to find out.
Calculating Overall Similarity
You have numerically calculated the similarity for every feature. But the clustering algorithm requires the overall similarity to cluster houses. Calculate the overall similarity between a pair of houses by combining the per- feature similarity using root mean squared error (RMSE). That is, where \(s_1,s_2,\ldots,s_N\) represent the similarities for \(N\) features:
\[\text{RMSE} = \sqrt{\frac{s_1^2+s_2^2+\ldots+s_N^2}{N}}\]
Limitations of Manual Similarity Measure
As this exercise demonstrated, when data gets complex, it is increasingly hard to process and combine the data to accurately measure similarity in a semantically meaningful way. Consider the color data. Should color really be categorical? Or should we assign colors like red and maroon to have higher similarity than black and white? And regarding combining data, we just weighted the garage feature equally with house price. However, house price is far more important than having a garage. Does it really make sense to weigh them equally?
If you create a similarity measure that doesn’t truly reflect the similarity between examples, your derived clusters will not be meaningful. This is often the case with categorical data and brings us to a supervised measure.