Natural Language Processing (NLP)

What is NLP?

NLP is the study of using computers to process and understand natural language (e.g., English or Chinese). Since natural language is how humans communicate, NLP is one of the most widely applied domains of Artificial Intelligence. Some places you might see NLP:

  • Search engines

  • Translation services

  • Speech-to-text (Siri, Google Home, Bixby)

  • Spam detection

  • QA services (like Stack Overflow, Quora)

Preprocessing

All NLP problems face the same first task: how do you represent natural language? To do any real data analysis, computers need to be given numbers, and it's not obvious how to turn text into them.

Representation

One way would be to take all the words that appear in a document (the vocabulary V), assign each a number, and then, when doing analysis, treat each word as a one-hot vector of length |V| (everything is 0 except for a 1 at that word's index); a document is then just the collection of its words' vectors, ignoring order. This is called the Bag of Words approach (which is why there was/is a bag of words on Gates 8).
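
As a concrete illustration, here is a minimal sketch of the Bag of Words idea in Python, assuming a toy two-document corpus and simple whitespace tokenization (both invented for this example):

```python
from collections import Counter

# Toy corpus (invented for illustration)
documents = [
    "the cat sat on the mat",
    "the dog ate my homework",
]

# Build the vocabulary V: every distinct word gets an index
vocab = sorted({word for doc in documents for word in doc.split()})
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """One-hot vector of length |V|: all zeros except a 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

def bag_of_words(doc):
    """A document is the element-wise sum of its words' one-hot vectors (i.e. word counts)."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocab]

print(vocab)
print(bag_of_words("the cat sat on the mat"))  # 'the' appears twice, word order is lost
```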

Tokenization

Next, consider how to determine things like part of speech and dependency. Luckily, linguists have studied this for many years, and the problems of tokenization and part of speech tagging are mostly solved through libraries like spaCy and NLTK, which help to tokenize, or separate a document into words, and assign each word a part of speech.

If you want to do something like part of speech tagging yourself, this article is a great introduction! Stanford also has a variety of articles and tools to help, so check them out!
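
If you just want to use an existing library, here is a minimal sketch with NLTK (the exact model names to download may vary by NLTK version):

```python
import nltk

# One-time downloads of the tokenizer and POS tagger models
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "Google translates documents into many languages."

# Tokenization: separate the document into words
tokens = nltk.word_tokenize(sentence)
print(tokens)
# ['Google', 'translates', 'documents', 'into', 'many', 'languages', '.']

# Part of speech tagging: assign each token a POS tag
tagged = nltk.pos_tag(tokens)
print(tagged)
# [('Google', 'NNP'), ('translates', 'VBZ'), ('documents', 'NNS'), ...]
```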

The Fun Stuff

Once preprocessing has been done, you can do more interesting things with text. For example, you could build a Naive-Bayes classifier to determine whether a text is spam or not. Or, you could build a similar classifier over the text of political speeches to determine whether a Democrat or a Republican gave the speech, or, given some text, determine its sentiment (Sentiment Analysis).
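
To make the spam example concrete, here is a minimal sketch using scikit-learn's CountVectorizer (a bag-of-words featurizer) and MultinomialNB (a Naive-Bayes classifier); the tiny training set is invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented training set: 1 = spam, 0 = not spam
texts = [
    "win a free prize now",
    "limited offer click here",
    "lunch at noon tomorrow?",
    "notes from today's lecture attached",
]
labels = [1, 1, 0, 0]

# Bag-of-words features, then a Naive-Bayes classifier on top
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

clf = MultinomialNB()
clf.fit(X, labels)

# Classify a new message
new_message = ["click here for a free prize"]
print(clf.predict(vectorizer.transform(new_message)))  # [1] -> predicted spam
```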

Many NLP algorithms rely on what's called a Hidden Markov Model (HMM). Basically, text is represented by a sequence of POS tags t_1 ... t_n, and, given a tag t_i, each word w_i has some probability of being "emitted". Read more about HMMs here.
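
As a toy illustration of the emission idea (all probabilities invented), here is a sketch that scores a tagged sentence under a tiny HMM: the joint probability is the product of tag-to-tag transition probabilities and tag-to-word emission probabilities.

```python
# Toy HMM with invented probabilities: tags transition between each other, and each tag emits words.
transition = {          # P(next_tag | current_tag), with "START" as the initial state
    ("START", "DET"): 0.8, ("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.7,
}
emission = {            # P(word | tag)
    ("DET", "the"): 0.6, ("NOUN", "dog"): 0.1, ("VERB", "barks"): 0.05,
}

def joint_probability(words, tags):
    """P(words, tags) = product over i of P(t_i | t_{i-1}) * P(w_i | t_i)."""
    prob = 1.0
    prev = "START"
    for word, tag in zip(words, tags):
        prob *= transition.get((prev, tag), 0.0) * emission.get((tag, word), 0.0)
        prev = tag
    return prob

print(joint_probability(["the", "dog", "barks"], ["DET", "NOUN", "VERB"]))
# 0.8*0.6 * 0.9*0.1 * 0.7*0.05 = 0.001512
```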

[Figure: Bag of words representation of a sentence]
[Figure: An example of an HMM. t_i is a tag, and w_i is a word]