Without further delay, let’s look at the problem we are going to discuss today. The initial problem is very simple. Assume we have two points (*a* and *b*) in 2-d space, and our goal is to find the equation of the line that passes through both points. It is not hard to show that if the two points we are considering have the coordinates $(x_1, y_1)$ and $(x_2, y_2)$, the equation of the line would be:

$$y - y_1 = \frac{y_2 - y_1}{x_2 - x_1}\,(x - x_1).$$

For our convenience we write it as $y = mx + b$, where $m$ is usually referred to as the slope of the line and $b$ is the y-intercept of the line. We can calculate each of these two parameters from the above formula.
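A quick sanity check in code (a minimal Python sketch; the function name is mine):

```python
def line_through(p1, p2):
    """Slope and y-intercept of the line through two points (requires x1 != x2)."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)  # slope
    b = y1 - m * x1            # y-intercept
    return m, b

m, b = line_through((0, 1), (2, 5))
print(m, b)  # 2.0 1.0
```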

Looks simple, so let’s make it more challenging. Now assume that instead of two points we have three points. If we are lucky, the third point will also lie on the line passing through the first and second points, but unfortunately that is not the case most of the time. So we end up with three points (and often even more), and we would like to find a straight line that passes through these points “as much as possible”.

What can we do? One solution which comes to mind is to pass a line through every pair of points:

and then calculate the final line to be the average of these lines. In other words we have:

$$m = \frac{m_{ab} + m_{ac} + m_{bc}}{3}, \qquad b = \frac{b_{ab} + b_{ac} + b_{bc}}{3},$$

where for each pair of points $i$ and $j$:

$$m_{ij} = \frac{y_j - y_i}{x_j - x_i}, \qquad b_{ij} = y_i - m_{ij}\, x_i,$$

and the corresponding line would look like:

which seems like a fair estimation. Unfortunately (as always), there is a problem with this method. To see why, consider the figure below:

Here, the line passing through *b* and *c* is parallel to the **Y**-axis, so we have $m_{bc} = \infty$, and therefore the final line equation would be $y = \infty\, x + \infty$ (mathematically this is not a correct formula, but I wrote it to give you an idea). Obviously this is not a good estimation for our data points, meaning that an estimation of the form $y = \infty\, x + \infty$ is not a good fit for the three points we have. (Intuitively, we expect the line to pass over point *a* and then somewhere midway between *b* and *c*.)

As mentioned before, the problem comes from the points *b* and *c*. But what is special about these points that makes them troublesome? They both have equal X values. So one solution is to change our initial rule and take the average of the slopes and intercepts of the calculated lines, excluding any pair of points with equal X values.

It is clear that we haven’t really solved the problem. Two points might have unequal X values, but the difference between the X values can be very small, so that the slope of the connecting line is very large and again dominates our final slope (the same argument holds for the intercept).

So a better approach is to use a weighted average (instead of a regular average) to calculate the slope and intercept of the estimated line. In this approach the weight for each line is the square of the difference between the X values of the two points, and intuitively it makes sense: the closer two points are on the X axis, the less they contribute to the final slope, up to the extreme case where two points have exactly equal X values and their contribution to the final result is zero. More precisely, for the slope of the estimated line we have:

$$m = \frac{\sum_{i<j} (x_i - x_j)^2\, m_{ij}}{\sum_{i<j} (x_i - x_j)^2},$$

and a similar formula would be correct for the intercept. The final line using this equation looks like:

and if we compare it with our initial average estimation we can observe that the new algorithm provides a better estimate of the three mentioned points. Actually, there are some theoretical proofs that under certain assumptions the mentioned approach produces the optimal result.
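The weighted-average scheme can be sketched in a few lines of Python (the function name and example points are mine):

```python
from itertools import combinations

def fit_line(points):
    """Estimate slope and intercept as a weighted average over all pairs,
    weighting the line through each pair by (xi - xj)**2."""
    m_sum = b_sum = w_sum = 0.0
    for (x1, y1), (x2, y2) in combinations(points, 2):
        w = (x1 - x2) ** 2
        if w == 0.0:
            continue  # vertical pair: its weight is zero anyway
        m = (y2 - y1) / (x2 - x1)
        m_sum += w * m
        b_sum += w * (y1 - m * x1)
        w_sum += w
    return m_sum / w_sum, b_sum / w_sum

# b and c share an X value, so their (vertical) line gets zero weight:
print(fit_line([(0, 0), (2, 1), (2, 3)]))  # (1.0, 0.0)
```

With these particular weights, the estimated slope in fact coincides with the ordinary least-squares slope, which is presumably the optimality result alluded to above.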


In other words, we aim to calculate the similarity, and we do so indirectly through the distance function. The inverse relationship between these two functions is clear: as one increases, the other should decrease, and vice versa.

So we can simply use the inverse of the distance function and name it the similarity function:

$$s(a,b) = \frac{1}{d(a,b)},$$

where $s(a,b)$ is the similarity between points *a* and *b*. Let’s look at the behavior of this function and see what we have achieved:

The problem with this approach is the integral of this function: it goes to infinity if calculated over all the values the distance function can take. This might seem a bit out of scope now, but later we will see that a bounded integral is essential for having a well-formed probability density function.

A function which is usually used for that purpose is the exponential function, defined as:

$$s(a,b) = e^{-d(a,b)},$$

which does not have the problem of the inverse similarity function. Notice that the mentioned function is not the only one that can be used for this purpose. For example, another possibility is:

$$s(a,b) = e^{-d(a,b)^2}.$$

Actually these two functions look very much alike, but there are many more possibilities for the similarity function which do not necessarily result in the same shape. We will see later that different choices of similarity function result in different probability density functions.
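To see the difference numerically, here is a crude Riemann-sum comparison of the two similarity functions, assuming the forms $1/d$ and $e^{-d}$ discussed above (the variable names are mine):

```python
import math

def sim_inverse(d):
    return 1.0 / d       # unbounded near zero; its integral diverges

def sim_exp(d):
    return math.exp(-d)  # bounded; integrates to 1 over [0, inf)

# crude Riemann sums over (0, 10]
step = 0.001
grid = [i * step for i in range(1, 10001)]
area_inverse = sum(sim_inverse(d) * step for d in grid)
area_exp = sum(sim_exp(d) * step for d in grid)
print(area_inverse)  # keeps growing as the grid is refined near zero
print(area_exp)      # close to 1
```

Refining the grid near zero makes the first sum grow without bound, while the second stays pinned near 1, which is exactly the bounded-integral property we need later.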


Venn diagrams are the method usually used to teach probability concepts in schools (and to some extent universities) these days. To me, though, they have a fundamental problem. To see why, assume you have two random variables *P* and *Q*. When the following figure is given:

the first thing which came to my mind was that *P* and *Q* are independent. Obviously that is not true: if we know that *P* has happened, we are sure that *Q* will not happen. That is a dependency right there, as information about one variable gives you some information about the other. Two variables are independent if having some information about one of them does not change our knowledge about the other. So for two variables to be independent, the Venn diagram should have the following form:

The problem with this diagram is that the diagram itself does not tell us whether *P* and *Q* are dependent or independent. For that we would need all the information about the areas of *P*, *Q*, and $P \cap Q$.

So I looked around to see if I could find a better diagram, and I came across eikosograms, proposed by Wayne Oldford. In this method, shaded areas represent probabilities of different events inside a unit square. For example, for a binary random variable *P* which can take the values 0 and 1, the eikosogram looks like this:

where the height of the shaded area equals $P(P{=}1)$, and so we have $P(P{=}0) = 1 - P(P{=}1)$. Sounds reasonable. Now assume we have another binary random variable *Q*, which can also take two values (0 and 1), and our goal is to measure the conditional probability $P(P \mid Q)$. We can represent this concept using the following diagram:

In this diagram, the width of each column equals $P(Q{=}0)$ or $P(Q{=}1)$, and the height of the shaded area in each column equals $P(P{=}1 \mid Q{=}0)$ or $P(P{=}1 \mid Q{=}1)$. The interesting metaphor is that you can imagine the square as a container and the shaded area as water in the container. The vertical line, the conditioning, is a barrier separating the water into two parts. If we remove this barrier the water settles down, and the result is the marginal probability of the variable *P* (go ahead and do the calculation).

Finally, for probabilistic independence: if we draw the eikosogram, we can clearly observe the independence, as the “water level” on both sides of the “barrier” is equal. In other words, knowledge about the status of *Q* does not change our knowledge about the status of *P*.
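The “remove the barrier” calculation is just the law of total probability; a tiny numeric sketch (the probability values below are made up for illustration):

```python
# Removing the barrier: the water settles at the weighted average level,
# which is exactly the law of total probability.
p_q1 = 0.3           # P(Q = 1): width of the right column
p_p1_given_q0 = 0.8  # water level in the left column
p_p1_given_q1 = 0.2  # water level in the right column

p_p1 = p_p1_given_q0 * (1 - p_q1) + p_p1_given_q1 * p_q1
print(p_p1)  # marginal P(P = 1): the settled water level
```

When the two levels are equal, the weighted average equals either level, which is the visual signature of independence.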

I strongly recommend everyone read his paper to get a better intuition of Bayes’ Theorem (I was going to write a whole post about that, but I think Oldford did a complete job in section 4.1.3).

Finally looking at his work, I came across a video on YouTube, where he presents the following puzzle:

We have three gas stations, and they always have different prices for their gas. On any given day, any one of the three might have the cheapest gas, with equal probability. All three are on your route, but once you pass a gas station there is no turning back. You must buy gas at one and only one of the gas stations. As you reach each station, you can see its advertised price but not the price of the next. We want the cheapest gas with high probability. Does it matter what you do?

For the answer look at his video clip.
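Without giving away the answer in the video, a Monte Carlo simulation makes it easy to compare candidate strategies; the two strategies below are my own illustrative examples, not necessarily the ones he discusses:

```python
import random

def simulate(strategy, trials=100_000):
    """Fraction of trials in which `strategy` buys the cheapest gas."""
    wins = 0
    for _ in range(trials):
        # i.i.d. prices, so every ordering of the three is equally likely
        prices = [random.random() for _ in range(3)]
        if strategy(prices) == prices.index(min(prices)):
            wins += 1
    return wins / trials

def buy_first(prices):
    # always buy at the first station
    return 0

def skip_then_beat(prices):
    # pass the first station; buy at the second only if it beats the first
    return 1 if prices[1] < prices[0] else 2

print(simulate(buy_first))       # about 1/3
print(simulate(skip_then_beat))  # noticeably better, so yes, it matters
```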


$$d(a,B) = \frac{d_x}{\sigma_x^2} + \frac{d_y}{\sigma_y^2}.$$

The problem with this formula is that we cannot divide the distance by the variance. Why? Because they have different scales, or units: variance has the squared units of the distance (if distance is in meters, variance is in meters squared). Therefore, we should either divide the squared distance by the variance, or divide the distance by the square root of the variance (which is usually referred to as the standard deviation, so we have $\sigma = \sqrt{\sigma^2}$). I know it sounds a bit complicated, so let’s explore it more. The above formula can be written either as:

$$d(a,B) = \frac{d_x^2}{\sigma_x^2} + \frac{d_y^2}{\sigma_y^2},$$

which we call the variance version, or as:

$$d(a,B) = \frac{d_x}{\sigma_x} + \frac{d_y}{\sigma_y},$$

which we call the std version. Here $\sigma_x$ is the standard deviation of the set *B* along the *X*-axis. But which one is better?

If we look at the ML literature, we observe that the variance version is more common than the std version. So there should be a reason to prefer the former.

They are not equal, because each one refers to a different way of calculating the distance between two points in space. To understand this better, assume we have a point in 2-d space with dimensions *x* and *y*, and our goal is to calculate the distance between the point and the origin:

One approach is to take the length of the line segment between the point and the origin (this is usually referred to as the Euclidean distance); alternatively, we can sum up the absolute *x* and *y* values of its position (this is usually referred to as the Manhattan distance).

In our initial post, we argued that the Euclidean distance is the most intuitive distance function between any two points, and therefore here we prefer the variance version over the std version.
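The two versions can be written down side by side (a sketch; the function names and sample numbers are mine). The variance version sums squared per-axis distances, Euclidean-style, while the std version sums absolute per-axis distances, Manhattan-style:

```python
def dist_variance_version(a, mean, var):
    """Squared distance along each axis, divided by the variance (Euclidean-style)."""
    return sum((ai - mi) ** 2 / v for ai, mi, v in zip(a, mean, var))

def dist_std_version(a, mean, var):
    """Distance along each axis, divided by the standard deviation (Manhattan-style)."""
    return sum(abs(ai - mi) / v ** 0.5 for ai, mi, v in zip(a, mean, var))

mean, var = (0.0, 0.0), (4.0, 1.0)  # set B spreads twice as far along X
print(dist_variance_version((2.0, 1.0), mean, var))  # 1.0 + 1.0 = 2.0
print(dist_std_version((2.0, 1.0), mean, var))       # 1.0 + 1.0 = 2.0
```

Both versions are unit-consistent; they differ only in which notion of distance (Euclidean or Manhattan) they generalize.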

Without further delay, let’s move to our next problem and see what we can do to solve it.

As before, *a* looks more similar to the underlying data, while *a’* has a smaller distance to the mean. The variance along both axes is similar, rotation does not solve the problem, and if there is a function such that mapping the points into a new space with it would solve the problem, it is not clear what that function could be.

To solve this problem, we observe that if we take the distance between the test points and their closest neighbors in set *B*, the respective distance for *a* is smaller than that of *a’*.

This is actually the first approach we took here. By taking this approach we become prone to the problem presented here. To avoid it, let’s redraw the original figure and see what we can do.

To solve this problem, notice that the closest point to *a* in the set (call it $b_1$) is itself far from the other points in the set. In other words, $b_1$ is not close to its own closest neighbor in the set ($b_2$). So it seems the best method is to measure two distances: first, the distance between the test point and its closest neighbor ($d(a, b_1)$); second, the distance between that closest neighbor and its corresponding closest neighbor in the rest of the set ($d(b_1, b_2)$). Finally, we need to combine the two measurements to get the distance of the test point from the whole set. This seems a reasonable approach, but it has two problems. First, assuming that we have calculated the two mentioned distances, how should we combine them? We can take the min, max, average, etc., but it is not clear which one is best ^{1}.

For now, assume that we somehow solved this problem, and combined the distances in an optimal fashion. This does not completely solve our problem. For the second problem, consider the figure below:

Here point *a* is close to its closest neighbor ($b_1$), and $b_1$ is close to its corresponding closest neighbor in the rest of the data ($b_2$), but they are all far away from the other points in the set. So in this case, combining the two mentioned distances ($d(a, b_1)$ and $d(b_1, b_2)$) will give us a small value, which suggests that point *a* is close to the set *B*, while this is not true.

We cannot continue our original strategy any further. For example, in the above figure, the closest neighbor of $b_2$ is $b_1$ itself. So we are trapped in a loop, and we should consider a different strategy.

One possible solution is the following: instead of calculating the distance only to the closest neighbor, we can calculate the distance to the two closest neighbors, and for each of those neighbors follow the same approach. As shown below, this strategy also has its own worst cases.

Here test point *a* and its two closest neighbors ($b_1$ and $b_2$) are used, instead of only the closest neighbor. For each one of them, we also show the two closest neighbors. In this example, all the calculated distances are small, and therefore their combination will also produce a small value, while point *a* is far from a major part of the dataset.

So for a general solution, we can calculate the distance between every combination of two points out of all the points we have (including the test point). This means calculating the distance between the test point and all the points in the dataset, and also the distance between each point in the dataset and every other point in the set. Next, we need to summarize all these values to get a final answer to our original problem, *d(a,B)*. Notice that this way we are actually extracting all the information about distances present in our dataset.
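Collecting every pairwise distance is straightforward (a Python sketch; the function name and sample points are mine):

```python
import math

def all_pairwise_distances(a, B):
    """Euclidean distance between every pair among the points of B plus the test point a."""
    pts = [a] + list(B)
    return {(i, j): math.dist(pts[i], pts[j])
            for i in range(len(pts))
            for j in range(i + 1, len(pts))}

B = [(0, 0), (1, 0), (0, 1)]
table = all_pairwise_distances((2, 2), B)
print(len(table))  # 6 distances for 4 points
```

For $n$ points plus the test point this produces $\binom{n+1}{2}$ values; summarizing that table into a single number is the remaining question.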

Now the problem is how to summarize these values. I could go over a solution right now, but I prefer to keep it for later, as it needs some material we haven’t covered yet, and therefore the answer wouldn’t be Simple anymore.

1. Do you see a similarity between our current problem and the original problem? There, too, we had a number of distances but didn’t know how to combine them to get a good estimate of the overall distance. One might argue that the two are not exactly the same, meaning that the distance between the test point and its closest neighbor ($d(a, b_1)$) is more important than the distance between the closest neighbor and its own closest neighbor ($d(b_1, b_2)$). I am not sure that this argument is true, and if it is, how much more important one is than the other. Anyway, we should find a solution to our original problem to be able to solve this one.

To see the reason, consider the figure below.

Here we have the usual set *B*, but this time we have two test points, *a* and *a’*. Both test points have equal distances from the mean point of set *B*, so we have *d(a,B) = d(a’,B)*, but it seems that point *a* is closer to the set than the other point. In other words, *a* is more similar to *B*, or *a* has a higher probability of being a member of *B*. Therefore *a* should have a smaller distance to *B* compared to *a’*, but that is not the case. Why?

The reason can be explained with a measurement called the variance, which represents the spread of the data along a given axis.

In the example given, the data has larger variance over the *X*-axis than the *Y*-axis. To solve our problem, we can divide the distance along each axis by the variance of the set along that axis:

$$d(a,B) = \frac{d_x}{\sigma_x^2} + \frac{d_y}{\sigma_y^2},$$

where $d_x$ is the distance between *a* and the mean of *B* along the *X*-axis, and $d_y$ is the distance along the *Y*-axis. Division by the variance moves our points into a new space in which the set has equal variances along both the *X* and *Y* axes.

As can be seen, in this new space we have $d(a,B) < d(a’,B)$. We can imagine this new space in two ways: either by squeezing the points along the *X*-axis, or by stretching them along the *Y*-axis, until the points in the set have equal variances along both axes.
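With some made-up numbers, dividing the per-axis distance by the per-axis variance behaves as described (a sketch; the helper name and values are mine):

```python
def variance_scaled_dist(p, mean, var):
    # distance along each axis, divided by the variance of the set along that axis
    return sum(abs(pi - mi) / v for pi, mi, v in zip(p, mean, var))

mean = (0.0, 0.0)
var = (9.0, 1.0)                     # the set spreads much further along the X-axis
a, a_prime = (3.0, 0.0), (0.0, 3.0)  # same raw distance from the mean

print(variance_scaled_dist(a, mean, var))        # small: a lies along the spread
print(variance_scaled_dist(a_prime, mean, var))  # large: a' goes against the spread
```

Both test points sit at raw distance 3 from the mean, yet the scaled distance correctly ranks *a* as the closer one.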

Now, let’s move on to our next problem, given below:

Here, *a’* is closer to the mean of *B* than *a*, but it seems that *a* better follows the pattern of *B* and therefore should have a smaller distance. The problem cannot be related to the variance, as *B* has equal variance along the *X* and *Y* axes. So what is the problem?

This problem is similar to the one we saw before, with only a subtle difference. If we rotate the *X* and *Y* axes, we observe the same phenomenon as in the previous problem. As shown below, if we rotate the dimensions, we can use the same equation along the new dimensions:

$$d(a,B) = \frac{d_{x'}}{\sigma_{x'}^2} + \frac{d_{y'}}{\sigma_{y'}^2},$$

where $d_{x'}$ is the distance between *a* and the mean of *B* along the new *X*-axis, and $d_{y'}$ is the distance along the new *Y*-axis.

For the next problem, consider the example given below:

In this figure too, point *a’* is closer to the mean, while point *a* better follows the pattern of the set *B* and therefore should have a smaller distance, while it does not (using the formulas discussed so far). Here the variance along the two axes is equal, and no matter how we rotate the axes the problem remains unchanged.

To solve this problem we change the axes in a different way. This time, the value of each point along the new *X*-axis equals its distance from the origin, $\sqrt{x^2 + y^2}$, and its new *Y* value equals its angle, $\arctan(y/x)$ ^{1}.

The new space is the same as the one presented before, and we can define the distance function in the new space by rotating the axes and calculating the variance in the rotated space.

It’s enough for this post. I just wanted to remind you that we are not done yet, so stay tuned for the next post on the mighty distance function.

1. Notice that the data in the dataset looks similar to a circle, and that is how we chose our mapping function.

Assume we have two points (*a* and *b*) in a 2-d space, and our goal is to measure the distance between these two points. So we are looking for a function of the form *d(a,b)* which gives a smaller value if the points are closer and a larger value if they are far apart. There are different possibilities for the distance function, but the simplest and most widely used is the Euclidean distance, in other words the length of the straight line segment connecting the two points:

$$d(a,b) = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}.$$

Looks very simple, so now let’s make the problem a bit more challenging. Assume that instead of a single point *b* we have a set of points *B*, and we want to calculate the distance between a single point *a* and the set *B*.

There are several possible approaches we can take:

One possibility is to use the distance between point *a* and its closest neighbor in set *B*.

This is often a good strategy, but as shown below, sometimes the result is not a good estimate for the distance between point *a* and the whole set *B*.

The opposite approach is to use the distance between point *a* and its farthest neighbor in set *B*:

This could also be a good estimation, but it suffers from a shortcoming similar to that of the closest-neighbor approach.

The approach which makes more sense is to use the average distance between *a* and all points in *B*. In other words, we take the Euclidean distance between point *a* and each point in set *B* and then take the average of the calculated distances. For example, for the points in the figure below we have:

This process is computationally expensive: calculating the distance between set *B* and any new point *a’* requires calculating the distance between *a’* and all the points in the set. A more efficient approach is to compute the average (mean) point of set *B* beforehand, and then calculate the distance between point *a* and that mean point. This way, the distance calculation for any new point *a’* takes constant time.
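The whole post fits in a few lines of Python (a sketch; the sample points are made up):

```python
import math

B = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
a = (4.0, 1.0)

dists = [math.dist(a, b) for b in B]
print(min(dists))               # closest-neighbor distance
print(max(dists))               # farthest-neighbor distance
print(sum(dists) / len(dists))  # average distance over the whole set

# cheaper alternative: precompute the mean of B once,
# then each new query costs a single distance computation
mean = tuple(sum(coord) / len(B) for coord in zip(*B))
print(math.dist(a, mean))
```

Note that the distance to the mean is not numerically identical to the average distance; it is a cheap stand-in that ranks points similarly in simple cases.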

Well, that was it for my first post. It was simple (wasn’t it?). That is the main idea anyway (remember the simplicity in the title of this page). But I promise it will get more interesting in the next part.
