So here we are again, with some new proposals for the distance function between a point and a set. In the previous post, we became familiar with the variance measure and two methods to map points into a new space so that the distance function in the new space behaves the same way as what intuitively seems right!
With no further delay, let’s move to our next problem and see what we can do to solve our problem.
Same as before, a looks more similar to the underlying data, while a’ has a smaller distance with the mean. The variance along both axes is similar, rotation does not solve the problem and if there is a function which mapping the points with that function into a new space would solve the problem, it is not clear what that function could be.
To solve this problem, we observe that if we take the distance between the test points and their closest neighbors in set B, the respective distance for a is smaller than that of a’.
This is actually the first approach we took here. By taking this approach we get prone to the problem presented here. To solve this problem, let’s redraw the original figure here and see what we can do to solve the problem.
To solve this problem, notice that, the closet point to a in set () is itself far from other points in the set. In other words, itself is not close to its closest neighbor in the set (). So it seems the best method is to measure two distances. The first one the distance between the test point and its closest neighbor (), second the distance between the closest neighbor and its corresponding closest neighbor in the rest of the set (). Finally, we need to combine the two measurements to get the distance of the test point from the whole set. This seems a reasonable approach, but it has two problems. First, assuming that we have calculated the two mentioned distances, how should we combine these distances? We can take min, max, average, etc. but it is not clear which one is the best 1.
For now, assume that we somehow solved this problem, and combined the distances in an optimal fashion. This does not completely solve our problem. For the second problem, consider the figure below:
Here point a is close to its closest neighbor and also is close to its corresponding closest neighbor in the rest of the data , but they are all far away from other points in the set. So in this case, combining the two mentioned distances () will give us a small value, which means that point a is close to the set B, while this is not true.
We cannot continue our original strategy anymore. For example, in the above figure, the closest neighbor of is . So we are trapped in a loop and we should consider a different strategy.
One possible solution is the following: instead of calculating the distance only to the closet neighbor we can calculate the distance to the two closest neighbors, and for each of the neighbors follows the same approach. As shown below this strategy also has its own worst cases.
Here test point a and its two closest neighbors are used ( and ), instead of only the closest neighbor. For each one of them also, we showed the two closest neighbors. In this example, all the calculated distances are small, and therefore, the combination of them will also produce a small value, while point a is far from a major part of the dataset.
So for a general solution, we can calculate the distance between any combination of two points out of all the points we have (including the test point). This means to calculate the distance between the test point and all the points in the dataset, and also to calculate the distance between each point in the dataset and all other points in the set. Next we need to summarize all these values to get a final answer for our original problem d(a,B). Notice that this way, we are actually extracting all the information about the distances we have present in our dataset.
Now the problem is how to summarize these values. I could go over a solution right now, but I prefer to keep it for later, as it needs some material we haven’t covered yet, and therefore the answer wouldn’t be Simple anymore.
1. Do you see a similarity between our current problem and the original problem? There also, we had a number of distances but we didn’t know how to combine them to get a good estimate of the overall distance. Although one might argue that these two are not exactly the same. Meaning that the distance between the test point and its closest neighbor () is more important than the distance between the closest neighbor and its own closest neighbor (). I am not sure that this argument is true, and if true, how much one is more important than the other. Anyway, we should find a solution for our original problem, to be able to solve this problem.