I just found out that I made a mistake in my previous post (it is fixed now). There, I argued that to calculate the distance between a point and a set (which has unequal variances along different dimensions), we can use a formula that divides the distance by the variance.
The problem with this formula is that we cannot divide a distance by a variance. Why? Because they have different scales, or units: variance has the squared units of the distance (if distance is in meters, variance is in meters squared). Therefore, we should either divide the squared distance by the variance, or divide the distance by the square root of the variance (which is usually referred to as the standard deviation, so we have $\sigma = \sqrt{\sigma^2}$). I know it sounds a bit complicated, so let’s explore it more. The above formula can be written either as:
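$$d_{\text{var}} = \sqrt{\frac{d_x^2}{\sigma_x^2} + \frac{d_y^2}{\sigma_y^2}}$$

which we call the variance version, or as:

$$d_{\text{std}} = \frac{|d_x|}{\sigma_x} + \frac{|d_y|}{\sigma_y}$$

which we call the std version. Here, writing $d_x$ for the distance between the point and the set B along the X-axis, $\sigma_x$ is the standard deviation of the set B along the X-axis (and similarly for the Y-axis). But which one is better?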
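Before comparing them, here is a minimal Python sketch of the two versions. It assumes the set is represented by its mean along each dimension and uses the population variance (NumPy's default); the function names, the toy set `B`, and the point `p` are all made up for illustration.

```python
import numpy as np

def variance_version(point, data):
    """Square root of the sum of (squared distance / variance) over dimensions."""
    center = data.mean(axis=0)    # assumption: the set is summarized by its mean
    variance = data.var(axis=0)   # population variance along each dimension
    return float(np.sqrt(np.sum((point - center) ** 2 / variance)))

def std_version(point, data):
    """Sum of (absolute distance / standard deviation) over dimensions."""
    center = data.mean(axis=0)    # same assumption as above
    std = data.std(axis=0)        # standard deviation along each dimension
    return float(np.sum(np.abs(point - center) / std))

# Hypothetical set B: mean (0, 0), std 2 along x and std 1 along y.
B = np.array([[2.0, 1.0], [2.0, -1.0], [-2.0, 1.0], [-2.0, -1.0]])
p = np.array([6.0, 4.0])

d_var = variance_version(p, B)   # sqrt(36/4 + 16/1) = 5.0
d_std = std_version(p, B)        # 6/2 + 4/1 = 7.0
print(d_var, d_std)
```

Both versions are free of units, yet they give different numbers for the same point and set.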
If we look at the ML literature, we observe that the variance version is more common than the std version. So there should be a reason to prefer the former.
If we calculate the two formulas for a dataset, we observe that they do not give the same results. They are not equal because, in general, for two numbers $a$ and $b$ we do not have:

$$\sqrt{a^2 + b^2} = |a| + |b|$$
They are not equal because each one corresponds to a different way of calculating the distance between two points in space. To understand this better, assume we have a point in 2-D space with coordinates x and y, and our goal is to calculate the distance between the point and the origin.
One approach is to take the length of the line segment between the point and the origin (this is usually referred to as the Euclidean distance). Another is to sum the absolute values of its x and y coordinates (this is usually referred to as the Manhattan distance).
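As a small illustration of these two distance functions, the sketch below computes both for a point in 2-D space. The helper names and the sample point are hypothetical.

```python
import math

def euclidean_from_origin(x, y):
    """Length of the line segment from the origin to (x, y)."""
    return math.hypot(x, y)

def manhattan_from_origin(x, y):
    """Sum of the absolute x and y coordinates of the point."""
    return abs(x) + abs(y)

x, y = 3.0, 4.0
euclid = euclidean_from_origin(x, y)      # 5.0
manhattan = manhattan_from_origin(x, y)   # 7.0

# The two only agree when the point lies on one of the axes.
on_axis_euclid = euclidean_from_origin(0.0, 4.0)      # 4.0
on_axis_manhattan = manhattan_from_origin(0.0, 4.0)   # 4.0
```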
In our initial post, we argued that the Euclidean distance is the most intuitive distance function between any two points. The variance version corresponds to the Euclidean distance, so here we prefer it over the std version.
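One way to see this correspondence is to rescale each axis by its standard deviation first and then apply the two distance functions. The sketch below does that, assuming (as an illustration only) that the set is summarized by its mean and per-axis standard deviation; the toy set `B` and point `p` are made up.

```python
import numpy as np

# Hypothetical set B: mean (0, 0), std 2 along x and std 1 along y.
B = np.array([[2.0, 1.0], [2.0, -1.0], [-2.0, 1.0], [-2.0, -1.0]])
p = np.array([6.0, 4.0])

center = B.mean(axis=0)   # assumption: the set is represented by its mean
std = B.std(axis=0)       # population standard deviation per axis

# Rescale the offset so every axis has unit standard deviation.
z = (p - center) / std    # array([3., 4.])

euclid_z = float(np.linalg.norm(z))             # Euclidean norm of z
manhattan_z = float(np.linalg.norm(z, ord=1))   # Manhattan norm of z

# Variance version computed directly, for comparison.
direct = float(np.sqrt(np.sum((p - center) ** 2 / B.var(axis=0))))
print(euclid_z, manhattan_z, direct)
```

Under these assumptions, the Euclidean distance on the rescaled axes matches the variance version, while the Manhattan distance on the same axes matches the std version.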