The Mighty Distance Function – Part 1

I’ll start my posts with the distance function, as I believe a complete understanding of it, provides the basic idea behind most of the ML algorithms. Some of the approaches which come next are closely related to some known algorithms in ML, but I am not going to name any of them, as I believe that a complete understanding of the intuition behind any of them is more important than their names.

Assume we have two points ( a and b) in a 2-d space, and our goal is to measure the distance between these two points. So we are looking for a function in the form d(a,b) which gives a smaller value if points are closer and gives a large value if they are far apart. There are different possibilities for the distance function we can use, but the simplest and most widely used is the Euclidean distance. In other words the length of the straight line segment connecting the two points:

Assuming we have two points a and b, the length of the straight line segment connecting the two points can be used as a measure of distance between them.

Assuming we have two points (a and b), the length of the straight line segment connecting the two points can be used as a measure of distance between them.

Looks very simple, so now let’s make the problem a bit more challenging. Assume that instead of a single point b, we have several points B, and we want to calculate the distance between a single point a and a set of points B.

Unlike the distance between two points, the distance between a single point (a) and a set of points (B) does not have a single solution.

Unlike the distance between two points, the distance between a single point (a) and a set of points (B) does not have a single solution.

There are several possible approaches we can take:

One possibility is to use the distance between point a and its closest neighbor in set B.

To measure the distance between point a and the set $B$, the closest point ($b_c$) in the set can be used as a representative of the set.

To measure the distance between point (a) and the set (B), the closest neighbor to (a) in the set (b_c) can be used as a representative of the set.

This is often a good strategy, but as shown below, sometimes the result is not a good estimate for the distance between point a and the whole set B.

Sometimes the closest neighbor of a point  is not a good representative of the whole set.

Sometimes the closest neighbor is not a good representative of the whole set.

The opposite approach is to use the distance between point a and its farthest neighbor in set B:

To measure the distance between point $a$ and the set $B$, the farthest point ($b_f$) in the set can be used as a representative of the set.

To measure the distance between point (a) and the set (B), the farthest neighbor (b_f) in the set can be used as a representative of the set.

This also could be a good estimation, but it suffers from the shortcoming similar to that of the closest neighbor approach.

Farthest neighbor is not a good representative of the whole set.

The approach which makes more sense is to use the average distance between a and all points in B. In other words we take the Euclidean distance between point a and all points in set B and then take the average of the calculated distances. For example, for the points in the below figure we have:

d(a,B) = \frac{d(a,b_1)+d(a,b_2)+d(a,b_3)}{3}

or we can calculate the average point of the set ($\mu_B$) and then calculate the distance between points $a$ and $\mu_B$

To measure the distance between point a and set B, we can measure the distance between point a and each point in set B and then take the average.

This process is computationally expensive, as calculating the distance between set B and any new given point a’ needs calculation of the distance between point a’ and all the points in the set. A more computationally efficient approach is to calculate the average point \mu(B) of set B beforehand, and then calculate the distance between point a and the average point. This way for any new given point a’ distance calculation would take constant time.

we can calculate the average point of the set \mu(B) and then calculate the distance between point a and \mu(B)

Well that was it for my first post, it was simple (wasn’t it?). That is the main idea anyway (remember the simplicity in the title of this page). But I promise that it will be more interesting for the next part.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s