# K-Nearest Neighbors

## Intuition

> Data with similar inputs should give similar outputs

Seems obvious, and it is. If person A listens to Drake, J. Cole, and Lil Wayne, they'll probably like the same music as someone who listens to Drake, J. Cole, and Kendrick Lamar. They probably won't hand the aux cord to someone who likes listening to Florida Georgia Line and Kacey Musgraves.

## Algorithm

I feel like this is best explained using an example. For this example, we'll be predicting tomorrow's weather. Here's the formulation of the problem:

#### Data

$$
D = x_1, x_2, \dots, x_N;\; y_1, y_2, \dots, y_N
$$

In English, we have N inputs (training examples) of data. Each training example is composed of m features, and represents one day. Some features we could use are:

* Temperature that day
* Humidity
* Whether or not it rained

Can you think of any other features that would be useful?

#### Labels

Each training example x\_i is accompanied by a label, y\_i, indicating what that day actually was. In our case, we're restricting it to one of three values: "cloudy", "clear", or "partly cloudy". I'll use "label" interchangeably with "class".
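To make the notation concrete, here is a minimal sketch of how such a dataset might be laid out in Python. The numbers, the units, and the names `X_train` and `y_train` are made up for illustration; they are not part of the original example.

```python
# Hypothetical toy data: each row is one day with m = 3 features
# (temperature in degrees F, humidity in percent, rained: 1 = yes, 0 = no).
X_train = [
    (72.0, 40.0, 0),
    (65.0, 85.0, 1),
    (70.0, 60.0, 0),
    (58.0, 90.0, 1),
]

# y_train[i] is the label (class) that goes with X_train[i].
y_train = ["clear", "cloudy", "partly cloudy", "cloudy"]

N = len(X_train)      # number of training examples
m = len(X_train[0])   # number of features per example

# The dataset D, stored as a list of (x_i, y_i) pairs.
D = list(zip(X_train, y_train))
```

Pairing each x\_i with its y\_i keeps an example and its label together, which is convenient for the prediction step later on.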

#### Training

There's actually no training needed for this model! We just store the data.
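As a rough sketch of what "just store the data" could look like, the hypothetical `KNNModel` class below has a `fit` method that does nothing except remember the examples and labels. The class and method names are assumptions chosen for illustration.

```python
class KNNModel:
    """Hypothetical container: 'training' only memorizes the data."""

    def __init__(self, k=5):
        self.k = k    # number of neighbors to consult at prediction time
        self.D = []   # stored (x_i, y_i) pairs

    def fit(self, examples, labels):
        # No parameters are learned here; we simply keep every example.
        self.D = list(zip(examples, labels))
        return self


model = KNNModel(k=5).fit([(72.0, 40.0, 0), (65.0, 85.0, 1)], ["clear", "cloudy"])
```

All of the real work is deferred to prediction time, when the stored examples are searched.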

#### Prediction

Given an input X, we want to predict the weather. We can do this with the following pseudo-code:

```python
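def predict(X, D):
    k = 5  # number of neighbors to consult
    # Find the k training examples in D that are closest to X
    closest = getKClosest(X, D, k)
    # Majority vote over the labels of those k neighbors
    mostLikelyClass = getMajorityClass(closest)
    return mostLikelyClass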
```

getKClosest iterates over the data and chooses the k data points closest to X (in this case, we chose k to be 5). Any distance metric can be used, although Euclidean distance is the most common choice. Euclidean distance requires strictly numerical input, so features like "whether or not it rained" or "day of week" need to be converted to numbers in some way.
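Here is one way getKClosest might look, as a minimal sketch rather than a definitive implementation. It assumes D is a list of (x\_i, y\_i) pairs and uses Euclidean distance via the standard library's `math.dist`. The `encodeDay` helper and the toy data are hypothetical; the helper shows one simple way to turn "whether or not it rained" into a number.

```python
import math


def getKClosest(X, D, k):
    """Return the k (x_i, y_i) pairs in D whose x_i is nearest to X."""
    # math.dist computes the Euclidean distance between two points
    # of the same length; sorting by it puts the nearest pairs first.
    return sorted(D, key=lambda pair: math.dist(X, pair[0]))[:k]


def encodeDay(temperature, humidity, rained):
    """Turn raw features into numbers; the boolean becomes 0.0 or 1.0."""
    return (float(temperature), float(humidity), 1.0 if rained else 0.0)


# Hypothetical toy dataset of (features, label) pairs.
D = [
    (encodeDay(72, 40, False), "clear"),
    (encodeDay(65, 85, True), "cloudy"),
    (encodeDay(70, 60, False), "partly cloudy"),
    (encodeDay(58, 90, True), "cloudy"),
]

closest = getKClosest(encodeDay(71, 45, False), D, k=2)
```

Sorting the whole dataset is the simplest approach and is fine for small data; it is not the only way to find the nearest points.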

After getting the k closest examples, we take a majority vote over their labels and output the class that appears most often.
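The vote itself can be sketched with `collections.Counter`. This assumes, as an illustration, that the neighbors arrive as a list of (x\_i, y\_i) pairs; the sample neighbors below are made up.

```python
from collections import Counter


def getMajorityClass(closest):
    """closest is a list of (x_i, y_i) pairs; return the most common y_i."""
    labels = [label for _, label in closest]
    # most_common(1) returns [(label, count)] for the top label.
    # If two labels tie, the one seen first in `labels` wins.
    return Counter(labels).most_common(1)[0][0]


# Hypothetical neighbors returned by the distance search.
neighbors = [
    ((72.0, 40.0, 0.0), "clear"),
    ((70.0, 60.0, 0.0), "partly cloudy"),
    ((71.0, 50.0, 0.0), "clear"),
]
prediction = getMajorityClass(neighbors)
```

How ties are broken is a design choice. Here it falls out of the order of the neighbors, and using an odd k makes ties impossible when there are only two classes.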

#### How to choose k

Notice that we can set k to whatever we want; the algorithm won't determine it for us. This makes k what is known as a **hyperparameter**--it clearly affects our algorithm, but it isn't learned by the algorithm itself. We have to try several values and choose the one that works best. This can be done algorithmically in a process called **hyperparameter tuning**.

![Classification over two classes with different k](/files/-Lb2BAq_XmXjPuypvH_E)

Notice that using a larger k gives a smoother boundary line between the classes.
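One simple way to tune k, sketched here under assumptions, is to hold out some labeled examples as a validation set, measure accuracy for each candidate k, and keep the best one. The function names `accuracy` and `chooseK`, the split, and the one-feature toy data are all hypothetical.

```python
import math
from collections import Counter


def predict(X, D, k):
    """KNN classification: majority label among the k nearest pairs in D."""
    closest = sorted(D, key=lambda pair: math.dist(X, pair[0]))[:k]
    return Counter(label for _, label in closest).most_common(1)[0][0]


def accuracy(D_train, D_val, k):
    """Fraction of validation examples that are predicted correctly."""
    correct = sum(predict(x, D_train, k) == y for x, y in D_val)
    return correct / len(D_val)


def chooseK(D_train, D_val, candidates):
    """Return the candidate k with the highest validation accuracy."""
    # max keeps the first candidate on ties, so list smaller k first.
    return max(candidates, key=lambda k: accuracy(D_train, D_val, k))


# Toy one-feature data with one noisy point, (3.2,) labeled "b".
D_train = [((1.0,), "a"), ((2.0,), "a"), ((3.0,), "a"), ((3.2,), "b"),
           ((8.0,), "b"), ((9.0,), "b"), ((10.0,), "b")]
D_val = [((3.15,), "a"), ((1.5,), "a"), ((9.5,), "b")]

bestK = chooseK(D_train, D_val, [1, 3, 7])
```

In this toy data, k = 1 is fooled by the noisy point and k = 7 always predicts the larger class, so the middle value scores best on the validation set.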

## And that's it!

K-nearest neighbors is simple to understand and implement, but it works quite well. Can you think of some potential issues with KNN? Here's one: all attributes are weighted equally. Why would that be an issue?
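If you want to check your answer, here is a small sketch of one way equal weighting can bite: a feature measured on a large scale (temperature) can drown out one measured on a small scale (rained, 0 or 1). The data and the `minMaxScaler` and `nearest` helpers are hypothetical, and min-max rescaling is just one possible remedy.

```python
import math


def minMaxScaler(points):
    """Build a function that rescales each feature to [0, 1],
    using the minimum and maximum of that feature in `points`."""
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]

    def scale(p):
        return tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                     for v, lo, hi in zip(p, lows, highs))

    return scale


def nearest(X, points):
    """Return the point with the smallest Euclidean distance to X."""
    return min(points, key=lambda p: math.dist(X, p))


# Features: (temperature, rained as 0.0 or 1.0).
days = [(50.0, 0.0), (71.0, 0.0), (75.0, 1.0), (90.0, 1.0)]
query = (70.0, 1.0)

# Raw features: temperature differences dominate the distance.
rawNearest = nearest(query, days)

# Rescaled features: "rained" now counts as much as temperature.
scale = minMaxScaler(days)
scaledDays = [scale(p) for p in days]
scaledNearest = days[scaledDays.index(nearest(scale(query), scaledDays))]
```

With raw features the nearest day is the dry day at 71 degrees; after rescaling it is the rainy day at 75 degrees, which matches the query on "rained".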

We did KNN classification in this example. How would you apply this to a regression problem?
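One possible answer, as a minimal sketch: replace the majority vote with an average of the neighbors' numeric labels. The name `predictRegression` and the toy data (today's temperature as the feature, tomorrow's temperature as the label) are assumptions for illustration.

```python
import math


def predictRegression(X, D, k):
    """Average the numeric labels of the k nearest (x_i, y_i) pairs."""
    closest = sorted(D, key=lambda pair: math.dist(X, pair[0]))[:k]
    return sum(y for _, y in closest) / len(closest)


# Hypothetical data: (today's temperature,) -> tomorrow's temperature.
D = [((60.0,), 58.0), ((70.0,), 69.0), ((72.0,), 73.0), ((90.0,), 88.0)]

estimate = predictRegression((71.0,), D, k=2)
```

Here the two nearest days have labels 69.0 and 73.0, so the estimate is their mean, 71.0.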

