I’ll start my posts with the distance function, as I believe a complete understanding of it, provides the basic idea behind most of the ML algorithms. Some of the approaches which come next are closely related to some known algorithms in ML, but I am not going to name any of them, as I believe that a complete understanding of the intuition behind any of them is more important than their names.
Assume we have two points ( a and b) in a 2-d space, and our goal is to measure the distance between these two points. So we are looking for a function in the form d(a,b) which gives a smaller value if points are closer and gives a large value if they are far apart. There are different possibilities for the distance function we can use, but the simplest and most widely used is the Euclidean distance. In other words the length of the straight line segment connecting the two points:
Looks very simple, so now let’s make the problem a bit more challenging. Assume that instead of a single point b, we have several points B, and we want to calculate the distance between a single point a and a set of points B.
There are several possible approaches we can take:
One possibility is to use the distance between point a and its closest neighbor in set B.
This is often a good strategy, but as shown below, sometimes the result is not a good estimate for the distance between point a and the whole set B.
The opposite approach is to use the distance between point a and its farthest neighbor in set B:
This also could be a good estimation, but it suffers from the shortcoming similar to that of the closest neighbor approach.
The approach which makes more sense is to use the average distance between a and all points in B. In other words we take the Euclidean distance between point a and all points in set B and then take the average of the calculated distances. For example, for the points in the below figure we have:
This process is computationally expensive, as calculating the distance between set B and any new given point a’ needs calculation of the distance between point a’ and all the points in the set. A more computationally efficient approach is to calculate the average point of set B beforehand, and then calculate the distance between point a and the average point. This way for any new given point a’ distance calculation would take constant time.
Well that was it for my first post, it was simple (wasn’t it?). That is the main idea anyway (remember the simplicity in the title of this page). But I promise that it will be more interesting for the next part.