Natural Language Processing with N-Gram Model

So far we have seen two natural language processing models, Bag of Words and TF-IDF. What we are going to discuss now is quite different from both of them, and the applications of the N-Gram model also differ from those of the previously discussed models.

To understand N-gram, it is necessary to know the concept of Markov Chains.

What is a Markov Chain?

According to Wikipedia, “A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.”

This definition is a little hard to grasp on its own, so let’s try to understand it with an example.

Suppose there are various states, such as state A, state B, state C, state D, and so on up to Z. We can move from state (A to B), (B to C), (C to E), (E to Z), like a ride. So you ride through them such that the probability of the future state depends only on the present state (a conditional probability), not on the sequence of events that preceded it, and in this way you get a chain of different states.

From the figure above you can see that we build the sentence “He is eating” based on the probability of the present state, cancelling all the other options which have comparatively lower probability.
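To make this concrete, here is a minimal Python sketch of the same idea. The word states and their transition probabilities are made up for illustration; greedily following the most probable transition from each state produces “He is eating”.

```python
# Hypothetical transition probabilities: each state maps to its possible
# next states and the probability of moving to each of them.
transitions = {
    "He":     {"is": 0.8, "was": 0.2},
    "is":     {"eating": 0.6, "sleeping": 0.3, "running": 0.1},
    "eating": {"<end>": 1.0},
}

def build_sentence(start, max_steps=5):
    """Follow the chain greedily: at each state, keep only the most
    probable next state and drop the lower-probability options."""
    sentence = [start]
    state = start
    for _ in range(max_steps):
        options = transitions.get(state)
        if not options:
            break
        # The next state depends only on the current state (Markov property).
        state = max(options, key=options.get)
        if state == "<end>":
            break
        sentence.append(state)
    return " ".join(sentence)

print(build_sentence("He"))  # -> "He is eating"
```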

N-Gram Model

An N-Gram is a contiguous sequence of n items from a given sample of text. These n items can be characters or words.
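For instance, here is a short Python sketch of what these sequences look like; the sentence and the helper function are just for illustration.

```python
def ngrams(items, n):
    """Return every contiguous sequence of n items."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

sentence = "He is eating"

# Word n-grams (here, bigrams)
print(ngrams(sentence.split(), 2))   # [('He', 'is'), ('is', 'eating')]

# Character n-grams (here, trigrams of the word "eating")
print(ngrams(list("eating"), 3))     # [('e', 'a', 't'), ('a', 't', 'i'), ...]
```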

An N-gram model is used to predict the next word/character in a sentence/word from the previous words/characters, i.e. P(word | history) or P(character | history). This is a conditional probability.

For example, in the sentence “He is eating”, the word “eating” is predicted given “He is”,

i.e. P(eating | He is)

Here,
word = eating
history = He is

So, one way to estimate the above probability is the relative frequency count approach: go through the entire data and check how many times the word “eating” comes after “He is”. Suppose 70% of the time “eating” comes after “He is”; then the model learns that there is a 0.7 probability that “eating” comes after “He is”. But this process is lengthy, because you have to go through the entire data, check every word, and then calculate the probability.

Instead of this approach, we can follow the Markov chain approach: rather than computing the probability using the entire history, we approximate it with just a few previous words.
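As a rough illustration, the sketch below uses a tiny made-up corpus to compare the two: the relative frequency count with the full history “he is”, and the Markov-style approximation that conditions only on the single previous word.

```python
from collections import Counter

# A made-up toy corpus; in practice this would be a large collection of text.
corpus = [
    "he is eating lunch",
    "he is eating quietly",
    "he is sleeping",
    "she is eating",
]
tokenized = [s.split() for s in corpus]

# Relative frequency count with the full history "he is":
history, history_then_eating = 0, 0
for tokens in tokenized:
    for i in range(len(tokens) - 2):
        if tokens[i:i + 2] == ["he", "is"]:
            history += 1
            if tokens[i + 2] == "eating":
                history_then_eating += 1
print(history_then_eating / history)               # P(eating | he is) ≈ 0.67

# Markov (bigram) approximation: condition only on the previous word "is".
unigrams, bigrams = Counter(), Counter()
for tokens in tokenized:
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))
print(bigrams[("is", "eating")] / unigrams["is"])  # P(eating | is) = 0.75
```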

Bigram Model

If N = 2 in the N-Gram model, it is called a Bigram model. A Bigram model predicts the occurrence of a word based on the occurrence of its 2 − 1 = 1 previous word.

i.e. P(eating | is)

Trigram model

If N = 3, it is a Trigram model, and so on. A Trigram model predicts the occurrence of a word based on the occurrence of its 3 − 1 = 2 previous words.

i.e. P(eating | He is)

Generally, the bigram model works well, and it may not be necessary to use trigram or higher-order N-gram models.
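To see the difference in conditioning, here is a small sketch on a made-up token list: the bigram model conditions on one previous word, while the trigram model conditions on two.

```python
from collections import defaultdict, Counter

tokens = ["he", "is", "eating", "she", "is", "sleeping", "he", "is", "eating"]

# Bigram model: the context is the single previous word.
bigram_counts = defaultdict(Counter)
for prev, cur in zip(tokens, tokens[1:]):
    bigram_counts[(prev,)][cur] += 1

# Trigram model: the context is the two previous words.
trigram_counts = defaultdict(Counter)
for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
    trigram_counts[(w1, w2)][w3] += 1

def prob(counts, context, word):
    """Relative frequency of `word` after the given context."""
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0

print(prob(bigram_counts, ("is",), "eating"))        # P(eating | is)    ≈ 0.67
print(prob(trigram_counts, ("he", "is"), "eating"))  # P(eating | he is) = 1.0
```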


Let’s take a small dataset of 3 sentences and try to train our bigram model.

  • Good job brother
  • It was a difficult job
  • Thank you for this good job

So, the probability that the word “job” is preceded by the word “good” is:

P(good before job) = count(“good job”) / count(“job”) = 2/3 ≈ 0.67

So, from the above data, the model learns that there is a 0.67 probability of getting the word “good” before “job”, and a 0.33 probability of getting the word “difficult” before “job”.

In this way, the model learns from one previous word in the bigram model. Similarly, a trigram model considers two previous words instead of one.
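The same calculation can be written as a few lines of Python. The sketch below lowercases the three sentences and reproduces the 0.67 and 0.33 figures by counting how often “job” is preceded by “good” or “difficult”.

```python
from collections import Counter

sentences = [
    "good job brother",
    "it was a difficult job",
    "thank you for this good job",
]

word_counts, bigram_counts = Counter(), Counter()
for s in sentences:
    tokens = s.split()
    word_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

# Out of the 3 occurrences of "job", how often is it preceded by each word?
print(round(bigram_counts[("good", "job")] / word_counts["job"], 2))       # 0.67
print(round(bigram_counts[("difficult", "job")] / word_counts["job"], 2))  # 0.33
```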

Applications of the N-gram model
  • When you type something on your mobile and your device suggests the next word, that is an N-gram model at work (see the sketch after this list).
  • Extracting features for clustering large sets of satellite earth images and then determining what part of the Earth a particular image came from.
  • Speech recognition.
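As a rough sketch of the first application, the snippet below trains bigram counts on a tiny made-up “typing history” and suggests the most frequent next words; a real keyboard model is trained on far more text.

```python
from collections import defaultdict, Counter

# Tiny made-up "typing history" used only for illustration.
corpus = [
    "good job brother",
    "good morning everyone",
    "good job team",
]

next_word = defaultdict(Counter)
for s in corpus:
    tokens = s.split()
    for prev, cur in zip(tokens, tokens[1:]):
        next_word[prev][cur] += 1

def suggest(word, k=2):
    """Suggest the k words seen most often after `word`."""
    return [w for w, _ in next_word[word].most_common(k)]

print(suggest("good"))  # ['job', 'morning']
```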

This was a basic introduction to N-grams. For further reading, you can check out the reference:
https://ieeexplore.ieee.org/abstract/document/4470313
