K-Nearest Neighbors

If you were to design an ML algorithm knowing nothing about ML, you'd probably come up with this! And surprisingly, it works quite well.

Intuition

Similar inputs should give similar outputs

Seems obvious, and it is. Someone who listens to Drake, J Cole, and Lil Wayne will probably like the same music as someone who listens to Drake, J Cole, and Kendrick Lamar, and probably won't hand over the aux cord to someone who listens to Florida Georgia Line and Kacey Musgraves.

Algorithm

I feel like this is best explained with an example, so for this one we'll be predicting tomorrow's weather. Here's the formulation of the problem:

Data

D = x_1, x_2, ..., x_N; y_1, y_2, ..., y_N

In English: we have N training examples. Each training example represents one day and is composed of m features. Some features we could use are:

  • Temperature that day

  • Humidity

  • Whether or not it rained

Can you think of any other features that would be useful?

Labels

Each training example x_i is accompanied by a label, y_i, indicating what the weather that day actually was. In our case, we're restricting it to one of "cloudy", "clear", or "partly cloudy". I'll use "label" interchangeably with "class".

Training

There's actually no training needed for this model! We just store the data.
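To make that concrete, here's roughly what the stored data might look like in plain Python, using the features listed above (the specific numbers and labels here are made up for illustration):

import random  # not needed yet, but handy if you later want to shuffle the data

# Each feature vector is [temperature (F), humidity (%), whether it rained (1 or 0)]
# for one day, and each label is what that day's weather actually was.
X = [
    [72, 40, 0],
    [65, 85, 1],
    [70, 60, 0],
    [58, 90, 1],
]
y = ["clear", "cloudy", "partly cloudy", "cloudy"]

# "Training" is just storing the (features, label) pairs.
D = list(zip(X, y))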

Prediction

Given an input X, we want to predict the weather. We can do this with the following pseudocode:

def predict(X, D):
    k = 5
    # find the k training examples closest to X
    closest = getKClosest(X, D, k)
    # majority vote over those neighbors' labels
    mostLikelyClass = getMajorityClass(closest)
    return mostLikelyClass

getKClosest iterates over the data and chooses the k data points closest to X (in this case, we chose k to be 5). Any distance metric can be used, though Euclidean distance is the most common choice. Euclidean distance requires strictly numerical inputs, so features like "whether or not it rained" or "day of week" need to be converted to numbers in some way.

After getting the k closest examples, we take a majority vote over their labels and output the winning class as our prediction.
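Here's one way those helpers could look in plain Python. This is just a sketch: it assumes the features are already numeric, reuses the toy D from above, and makes k a parameter of predict (instead of hardcoding 5) so we can play with it later.

import math
from collections import Counter

def getKClosest(X, D, k):
    # Sort the stored (features, label) pairs by Euclidean distance to X
    # and keep the k nearest ones.
    by_distance = sorted(D, key=lambda pair: math.dist(X, pair[0]))
    return by_distance[:k]

def getMajorityClass(closest):
    # Majority vote: count the neighbors' labels and return the most common one.
    labels = [label for _, label in closest]
    return Counter(labels).most_common(1)[0][0]

def predict(X, D, k=5):
    closest = getKClosest(X, D, k)
    return getMajorityClass(closest)

print(predict([60, 80, 1], D, k=3))  # "cloudy" for the toy data above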

How to choose k

Notice that we can set k to whatever we want; the algorithm won't determine it for us. This makes k what is known as a hyperparameter--it clearly affects our algorithm, but can't really be learned from the data. We just have to try different values and pick the one that works best. This can be done systematically in a process called hyperparameter tuning.

Notice that using a larger k gives a smoother boundary line between the classes.
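One simple way to tune k (a sketch, reusing the predict function from above): hold out some of the labeled days as a validation set, try a handful of candidate values, and keep whichever one predicts the held-out days best.

# Hold out part of the labeled data as a validation set (a naive split here;
# in practice you'd shuffle the data first or use cross-validation).
split = int(0.8 * len(D))
D_train, D_val = D[:split], D[split:]

def accuracy(k):
    # Fraction of held-out days whose label k-NN gets right.
    correct = sum(1 for features, label in D_val
                  if predict(features, D_train, k) == label)
    return correct / len(D_val)

# Try a handful of candidate values and keep the best one.
best_k = max([1, 3, 5, 7, 9], key=accuracy)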

And that's it!

K-nearest neighbors is simple to understand and implement, but it works quite well. Can you think of some potential issues with KNN? Here's one: all attributes are weighted equally. Why would that be an issue?

We did KNN classification in this example. How would you apply this to a regression problem?
