What is Bag of Words and How to Code it in Python

Machines work with numbers and mathematical calculations; they cannot read text the way we do. Bag of Words converts textual data into a numerical form that machine learning algorithms can work with.

Creating Bag of Words

Let’s make the bag-of-words model concrete with the following example.

paragraph = he read books as one would breath air. books are most loyal friend. I read books as I breath air

  • Sentence 1 – “he read books as one would breath air”
  • Sentence 2 – “books are most loyal friend”
  • Sentence 3 – “I read books as I breath air”
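import nltk
import re
import heapq
import numpy as np  # used in the last step to build the matrix
# nltk.word_tokenize needs NLTK's tokenizer data; run nltk.download("punkt") once if it is missing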

paragraph = """ he read books as one would breath air
                books are most loyal friend 
                I read books as I breath air
            """
Steps

1. Tokenize sentences

Split the paragraph into sentences, then list all the unique words they contain:

  • “he”
  • “read”
  • “books”
  • “as”
  • “one”
  • “would”
  • “breath”
  • “air”
  • “I”
  • “are”
  • “most”
  • “loyal”
  • “friend”
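# One sentence per line and no end punctuation here, so split on line breaks (nltk.sent_tokenize suits punctuated text)
data = [s.strip() for s in paragraph.splitlines() if s.strip()]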
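The unique-word list above can be reproduced with a short sketch. It assumes plain whitespace splitting as a stand-in for `nltk.word_tokenize`, which is enough for this punctuation-free sample; the names `sentences` and `vocabulary` are illustrative only.

```python
# Stand-in tokenizer: str.split() instead of nltk.word_tokenize (an assumption
# that only holds because the sample sentences contain no punctuation).
sentences = [
    "he read books as one would breath air",
    "books are most loyal friend",
    "I read books as I breath air",
]

vocabulary = []  # unique words, in order of first appearance
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

print(len(vocabulary))  # 13
print(vocabulary)
```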

2. Creating word histogram

The next step is to score the words. To create the histogram, we iterate through each sentence and count how many times each word occurs. Let’s put the most frequent words at the top.

  • “books” = 3
  • “read” = 2
  • “as” = 2
  • “breath” = 2
  • “air” = 2
  • “I” = 2
  • “he” = 1
  • “loyal” = 1
  • “friend” = 1
  • “most” = 1
  • “are” = 1
  • “one” = 1
  • “would” = 1
up_count = {}
for d in data:
    words = nltk.word_tokenize(d)
    for word in words:
        # dict.get returns 0 the first time a word is seen
        up_count[word] = up_count.get(word, 0) + 1
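For comparison, the same histogram can be built with `collections.Counter` from the standard library. This is a minimal sketch, not the article's own code: it again assumes whitespace splitting in place of `nltk.word_tokenize`, and `word_counts` is an illustrative name.

```python
from collections import Counter

sentences = [
    "he read books as one would breath air",
    "books are most loyal friend",
    "I read books as I breath air",
]

# Count every word across all sentences (whitespace tokenization is an assumption).
word_counts = Counter(word for sentence in sentences for word in sentence.split())

# most_common() lists the words from highest to lowest count.
for word, count in word_counts.most_common():
    print(word, count)
```

`Counter` behaves like a dictionary, so `word_counts["books"]` gives 3 here, matching the list above.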

3. Selecting best features

Sorting by count puts the most frequent words first, which lets us keep the words that matter most and drop the rest. Here we are working with 3 sentences containing only 13 unique words, but a corpus of 50 million words would be very hard to analyze without this step. Furthermore, unwanted words can distort the result.

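Seven words are tied at a count of 1, so only three of them fit in the top nine; heapq.nlargest keeps the ones seen first. The best features are:

“books”, “read”, “as”, “breath”, “air”, “I”, “he”, “one”, “would”.

top_features = heapq.nlargest(9, up_count, key=up_count.get)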
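How ties are resolved matters here, because seven words share a count of 1 and only three of them fit in the top nine. The sketch below hard-codes the histogram in first-appearance order (an assumption about how `up_count` was filled) and shows that `heapq.nlargest` behaves like a stable descending sort, so tied words seen earlier are kept.

```python
import heapq

# Histogram from step 2, keys in order of first appearance (assumed insertion order).
up_count = {
    "he": 1, "read": 2, "books": 3, "as": 2, "one": 1, "would": 1,
    "breath": 2, "air": 2, "are": 1, "most": 1, "loyal": 1, "friend": 1, "I": 2,
}

top_features = heapq.nlargest(9, up_count, key=up_count.get)
print(top_features)
# ['books', 'read', 'as', 'breath', 'air', 'I', 'he', 'one', 'would']
```

If a different set of low-count words should be kept, the selection has to be made explicitly, since the count alone cannot separate tied words.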

4. Converting sentences to vectors

This is done by building a matrix with one row per sentence and one column per selected word. For each word, we mark its occurrence in each of the three sentences above: ‘1’ if the word is present and ‘0’ if it is absent.

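X = []
for d in data:
    # Tokenize each sentence once, not once per feature
    words = nltk.word_tokenize(d)
    vector = []
    for word in top_features:
        if word in words:
            vector.append(1)
        else:
            vector.append(0)
    X.append(vector)

# One row per sentence, one column per selected word
X = np.asarray(X)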
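To see what the resulting matrix looks like, here is a self-contained sketch of the same step. It assumes whitespace splitting instead of `nltk.word_tokenize` and a hard-coded feature list in the order `heapq.nlargest` would return for this sample; plain lists are used in place of a NumPy array.

```python
sentences = [
    "he read books as one would breath air",
    "books are most loyal friend",
    "I read books as I breath air",
]

# Assumed feature list: the nine selected words, most frequent first.
top_features = ["books", "read", "as", "breath", "air", "I", "he", "one", "would"]

X = []
for sentence in sentences:
    words = set(sentence.split())  # whitespace tokenization (assumption)
    X.append([1 if feature in words else 0 for feature in top_features])

for row in X:
    print(row)
# [1, 1, 1, 1, 1, 0, 1, 1, 1]
# [1, 0, 0, 0, 0, 0, 0, 0, 0]
# [1, 1, 1, 1, 1, 1, 0, 0, 0]
```

This is the binary variant described above: “I” occurs twice in the third sentence but is still marked with 1. Replacing the membership test with a count of occurrences would record frequencies instead.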

Now we have data that a machine learning algorithm can understand. And that’s the whole point of building a Bag of Words model.

Written By –
Sarang Deshmukh

Co-Founder and Developer @capablemachine.com | Performance Engineer @AMDOCS
