Natural Language Processing (NLP)


What is NLP?

NLP is the study of using computers to process and understand natural language (e.g. English or Chinese). Since natural language is how humans communicate, NLP is one of the most applicable domains of Artificial Intelligence. Some places you might see NLP:

  • Search engines

  • Translation services

  • Speech-to-text (Siri, Google Home, Bixby)

  • Spam detection

  • QA services (like Stack Overflow, Quora)

Preprocessing

Every NLP problem faces the same first task: how do you represent natural language? To do any real data analysis, computers need to be given numbers, and it's not obvious how to turn text into them.

Representation

One way is to take all the words that appear in a document (the vocabulary V), assign each a number, and represent each word as a one-hot vector of length |V| (everything is 0 except for a 1 at that word's index). A document is then the sum of its words' one-hot vectors, i.e. a vector of word counts. This is called the Bag of Words approach (which is why there was/is a bag of words on Gates 8).

Bag of words representation of a sentence
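As a minimal sketch, here is the bag-of-words representation in plain Python (the vocabulary and sentence are made up for illustration):

```python
from collections import Counter

def bag_of_words(doc, vocab):
    """Count how often each vocabulary word appears in the document."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
vec = bag_of_words("The cat sat on the mat", vocab)
print(vec)  # [2, 1, 1, 1, 1, 0]
```

Note the representation throws away word order entirely, hence "bag": "the cat sat on the mat" and "the mat sat on the cat" map to the same vector.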

Tokenization

Next, consider how to determine things like part of speech and dependency structure. Luckily, linguistics has studied these questions for many years, and the problems of tokenization (separating a document into words) and part-of-speech tagging are mostly solved through libraries like spaCy and NLTK, which can tokenize a document and assign each word a part of speech.

If you want to do something like part of speech tagging yourself, this article is a great introduction! Stanford also has a variety of articles and tools to help, so check them out!
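To make "tokenize" concrete, here is a minimal regex-based tokenizer sketch. It is deliberately simplistic; libraries like spaCy and NLTK handle contractions, abbreviations, URLs, and many other edge cases far more robustly:

```python
import re

# Match a word (optionally with an apostrophe part, as in "don't"),
# or any single non-word, non-space character (punctuation).
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("Don't panic, it's fine!"))
# ["Don't", 'panic', ',', "it's", 'fine', '!']
```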

The Fun Stuff

Once preprocessing has been done, you can do more interesting things with the text. For example, you could build a Naive Bayes classifier to determine whether a message is spam. Similarly, you could classify a speech transcript by whether a Democrat or a Republican gave it, or determine the sentiment of a piece of text (sentiment analysis).
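A toy Naive Bayes spam classifier can be sketched in a few lines. The training sentences and word probabilities below are invented for illustration; a real system would train on a large corpus:

```python
import math
from collections import Counter

# Tiny made-up training set.
spam = ["win money now", "free money offer", "win a free prize"]
ham = ["meeting at noon", "lunch with the team", "notes from the meeting"]

def train(docs):
    counts = Counter(w for d in docs for w in d.split())
    return counts, sum(counts.values())

spam_counts, spam_total = train(spam)
ham_counts, ham_total = train(ham)
vocab = set(spam_counts) | set(ham_counts)

def log_prob(words, counts, total):
    # Laplace (add-one) smoothing so unseen words don't zero out the score.
    return sum(math.log((counts[w] + 1) / (total + len(vocab))) for w in words)

def classify(text):
    words = text.lower().split()
    # Equal class priors here (3 examples each), so they cancel out.
    spam_score = log_prob(words, spam_counts, spam_total)
    ham_score = log_prob(words, ham_counts, ham_total)
    return "spam" if spam_score > ham_score else "ham"

print(classify("free money"))    # spam
print(classify("team meeting"))  # ham
```

The "naive" part is the assumption that words are independent given the class, which is why the per-word log probabilities simply sum.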

Many NLP algorithms rely on what's called a hidden Markov model (HMM). Basically, text is modeled as a sequence of POS tags t_1 ... t_n, and given a tag t_i, each word w_i has some probability of being "emitted". Read more about HMMs here.

An example of a HMM. t_i is a tag, and w_i is a word
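The standard way to find the most likely tag sequence under an HMM is the Viterbi algorithm. Below is a minimal sketch with a two-tag model; the start, transition, and emission probabilities are invented for illustration, not learned from data:

```python
# Toy HMM: two tags, three words, made-up probabilities.
states = ["NOUN", "VERB"]
start = {"NOUN": 0.6, "VERB": 0.4}
trans = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
         "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit = {"NOUN": {"dogs": 0.6, "bark": 0.1, "cats": 0.3},
        "VERB": {"dogs": 0.1, "bark": 0.8, "cats": 0.1}}

def viterbi(words):
    """Return the most likely tag sequence for the given words."""
    # best[t][s] = (probability, backpointer) of the best path ending in tag s at step t.
    best = [{s: (start[s] * emit[s].get(words[0], 1e-6), None) for s in states}]
    for w in words[1:]:
        best.append({s: max(
            (best[-1][p][0] * trans[p][s] * emit[s].get(w, 1e-6), p)
            for p in states) for s in states})
    # Follow backpointers from the best final state.
    tag = max(states, key=lambda s: best[-1][s][0])
    path = [tag]
    for t in range(len(words) - 1, 0, -1):
        tag = best[t][tag][1]
        path.append(tag)
    return path[::-1]

print(viterbi(["dogs", "bark"]))  # ['NOUN', 'VERB']
```

Viterbi is just dynamic programming: at each step it keeps, for every tag, only the single best path ending in that tag, so the search is linear in sentence length rather than exponential.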