# Natural Language Processing (NLP)

## What is NLP?

NLP is the study of using computers to process and understand natural language (e.g. English or Chinese). Since this is how humans communicate, NLP is one of the most widely applicable domains of Artificial Intelligence. Some places you might see NLP:

* Search engines
* Translation services
* Speech-to-text (Siri, Google Home, Bixby)
* Spam detection
* QA services (like Stack Overflow, Quora)

## Preprocessing

Every NLP problem starts with the same question: how do you represent natural language? To do any real data analysis, computers need to be given numbers, and it's not obvious how to turn text into them.

#### Representation

One way would be to take all the words that appear in your documents (the vocabulary V), assign each a number, and then when doing analysis, treat each word as a one-hot vector of length |V| (everything is 0 except for a 1 at the word's index). Adding up the one-hot vectors of every word in a document gives a single vector of word counts that ignores word order. This is called the **Bag of Words** approach (which is why there was/is a bag of words on Gates 8).

![Bag of words representation of a sentence](/files/-Lb5Lk4IEmh6HDx-CiZC)
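As a rough illustration of the representation described above, here is a minimal sketch in plain Python. The helper names (`build_vocabulary`, `one_hot`, `bag_of_words`) and the two example sentences are made up for this example, and splitting on whitespace stands in for a real tokenizer.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign each distinct word an index, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():  # naive whitespace split (assumption)
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def one_hot(word, vocab):
    """Vector of length |V| that is all 0s except a 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def bag_of_words(document, vocab):
    """Word-count vector for a document; word order is discarded."""
    counts = Counter(document.lower().split())
    vec = [0] * len(vocab)
    for word, count in counts.items():
        if word in vocab:  # words outside the vocabulary are skipped
            vec[vocab[word]] = count
    return vec

docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)
cat_vector = one_hot("cat", vocab)
doc_vectors = [bag_of_words(doc, vocab) for doc in docs]
```

Here `vocab` has six entries, so every vector has length 6, and the first document's vector has a 2 at the index for "the" because that word appears twice.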

#### Tokenization

Next, consider how to determine things like part of speech and dependency structure. Luckily, linguists have studied these problems for many years, and tokenization and part-of-speech tagging are mostly solved in practice by libraries like [spaCy](https://spacy.io/) and [NLTK](https://www.nltk.org/), which help to **tokenize** a document (separate it into words) and assign each word a part of speech.
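To get a feel for what tokenization involves, here is a toy regex-based tokenizer using only Python's standard library. It is not how spaCy or NLTK work internally; the pattern and the `tokenize` name are assumptions chosen for this sketch, and real libraries handle many more cases (abbreviations, URLs, hyphenation, other languages).

```python
import re

# A token is either a run of word characters (optionally followed by an
# apostrophe and more word characters, as in "Don't"), or a single
# punctuation mark. This pattern is an illustrative assumption.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("Don't panic: NLP is fun!")
print(tokens)
```

Note that punctuation becomes its own token instead of staying attached to the neighboring word, which is what a plain `str.split()` would leave you with.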

If you want to do something like part of speech tagging yourself, [this article](https://medium.freecodecamp.org/an-introduction-to-part-of-speech-tagging-and-the-hidden-markov-model-953d45338f24) is a great introduction! Stanford also has a variety of articles and tools to help, so check them out!

#### The Fun Stuff

Once preprocessing is done, you can do more interesting stuff with text. For example, you could build a [Naive-Bayes](/machine-learning/machine-learning-common-algorithms/naive-bayes.md) classifier to determine whether a text is spam or not. Or, you could build a similar classifier for speech transcripts to determine whether a Democrat or Republican gave the speech, or determine the sentiment of a given text ([Sentiment Analysis](https://monkeylearn.com/sentiment-analysis/)).
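As a sketch of the spam example, here is a tiny multinomial Naive Bayes classifier over word counts, written in plain Python. The four training messages, the "spam"/"ham" labels, and the function names are invented for illustration, and add-one (Laplace) smoothing is one common choice among several.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Count documents per label and words per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # log P(label) + sum of log P(word | label)
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # add-one smoothing so unseen (word, label) pairs are not zero
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up toy training data
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting notes for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train_naive_bayes(training)
print(predict("free money", model))
print(predict("monday meeting", model))
```

Working in log-probabilities avoids multiplying many small numbers together, which would underflow on longer texts.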

Many NLP algorithms rely on what's called a Hidden Markov Model (HMM). Basically, text is modeled as a hidden sequence of POS tags t\_1 ... t\_n, where each tag depends on the tag before it, and given a tag t\_i, each word w\_i has a probability of being "emitted". Read more about HMMs [here](https://web.stanford.edu/~jurafsky/slp3/A.pdf).

![An example of a HMM. t\_i is a tag, and w\_i is a word](/files/-Lb5T08rE-pk0m5uWPBv)
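One standard way to use such a model for tagging is Viterbi decoding, which finds the most probable tag sequence for a sentence. The sketch below is a minimal version in plain Python; the two tags and all of the start, transition, and emission probabilities are made-up numbers for illustration, not values estimated from real data.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for a non-empty list of words."""
    # best[i][t] = (probability of the best path ending in tag t at word i,
    #               the previous tag on that path)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path

# Toy model (all numbers are assumptions for this example)
tags = ["NOUN", "VERB"]
start_p = {"NOUN": 0.8, "VERB": 0.2}
trans_p = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.6, "VERB": 0.4},
}
emit_p = {
    "NOUN": {"dogs": 0.5, "bark": 0.1, "cats": 0.4},
    "VERB": {"bark": 0.7, "chase": 0.3},
}

print(viterbi(["dogs", "bark"], tags, start_p, trans_p, emit_p))
print(viterbi(["dogs", "chase", "cats"], tags, start_p, trans_p, emit_p))
```

In this toy model "bark" can be emitted by either tag, and the transition probabilities are what tip it toward VERB after "dogs". A practical implementation would work in log space to avoid underflow on long sentences.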

