Build your own Movie Recommendation Engine using Word Embedding

You may have noticed that services like Netflix, Amazon Prime, and YouTube recommend videos similar to the ones you have watched earlier. How does this happen? Behind it is a machine-learning-based recommendation engine that estimates how similar a video is to other things you like and serves up recommendations on that basis. In short, a recommendation engine is a system that predicts user preferences.

Before jumping to the Word2vec model, there are some basic concepts you must know.

Collaborative filtering and content-based filtering are the two main approaches used by recommendation systems.

Collaborative Filtering

It works on the principle of correlation: it considers the common interests shared by two or more people. For example, if Alice likes items A, B, and C and Bob likes items B, C, and D, then there is a chance that Alice might like D and Bob might like A.

Collaborative Filtering for movie recommendation

As shown in the figure above, the predicted rating of a movie (Casino) for Dushyant is computed as a weighted sum of the ratings given to that movie by similar users, such as Sachin. So, for prediction we need the similarity between two users. Users with a higher correlation tend to be more similar.
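A minimal sketch of this weighted-sum prediction, assuming Pearson correlation as the similarity measure and made-up ratings on a 1 to 5 scale. Only the names Dushyant, Sachin, and Casino come from the figure; the third user, every rating value, and the function names are hypothetical.

```python
from math import sqrt

# Hypothetical ratings (1-5). Dushyant has not rated "Casino" yet.
ratings = {
    "Dushyant": {"Goodfellas": 5, "Heat": 4, "Titanic": 1},
    "Sachin":   {"Goodfellas": 5, "Heat": 5, "Titanic": 2, "Casino": 5},
    "Rahul":    {"Goodfellas": 1, "Heat": 2, "Titanic": 5, "Casino": 2},
}

def pearson(a, b):
    """Pearson correlation over the movies both users have rated."""
    common = sorted(set(ratings[a]) & set(ratings[b]))
    if len(common) < 2:
        return 0.0
    xs = [ratings[a][m] for m in common]
    ys = [ratings[b][m] for m in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def predict(user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or movie not in their:
            continue
        sim = pearson(user, other)
        if sim > 0:  # only use positively correlated users
            num += sim * their[movie]
            den += sim
    return num / den if den else None

print(predict("Dushyant", "Casino"))
```

In this toy data only Sachin correlates positively with Dushyant, so the prediction follows Sachin's rating of Casino. Ignoring negatively correlated users is one common simplification, not the only option.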

The main advantage of collaborative filtering is that it does not depend on a machine analyzing content, which means it does not require any understanding of the item itself.

For a new user, no history is available, so collaborative filtering loses accuracy because of insufficient data.

Content-based Filtering

Content-based filtering works on the principle that a user will like items similar to the ones they liked in the past; that is, it recommends products based on the user’s previous preferences and the attributes of the items. For example, if Alice and Bob both like the movie Goodfellas, then the system will recommend each of them other movies that fall under the same genre.

Content-based Filtering for movie recommendation

Content-based filtering methods rely on the content of the product and on information about the user. They can also include opinion-based recommender systems. In some cases, users are allowed to leave text reviews or feedback on items. These user-generated texts are valuable data for the recommender system because they are a potentially rich source of both the features/aspects of an item and the users’ evaluation of, or sentiment toward, it.
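As a rough illustration of the genre-based idea, the sketch below ranks movies by how many genres they share with a movie the user liked. Jaccard similarity is just one simple choice of measure, and the genre tags and function names are assumptions made for this example.

```python
# Hypothetical genre tags for a handful of movies.
genres = {
    "Goodfellas": {"Crime", "Drama"},
    "Casino": {"Crime", "Drama"},
    "Heat": {"Action", "Crime", "Thriller"},
    "Toy Story": {"Animation", "Children", "Comedy"},
}

def jaccard(a, b):
    """Share of genres two movies have in common (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b)

def recommend(liked, n=2):
    """Rank every other movie by genre overlap with the liked movie."""
    scores = [
        (title, jaccard(genres[liked], tags))
        for title, tags in genres.items()
        if title != liked
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]

print(recommend("Goodfellas"))
```

Note that no other user's ratings are needed here; only the attributes of the items matter.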

What is Word2vec?

Machine learning algorithms work on numerical data and cannot deal with raw text directly, so text has to be converted into vectors before it is passed to a learning algorithm. The Word2vec model takes a word as input, converts it into a vector, and operates on that vector. It is a three-layer neural network with a single hidden layer that uses a linear activation function. It was introduced in 2013 and changed the way people approach natural language processing.

A Word2vec model can learn from corpora of millions or even billions of words to produce high-quality word vectors.

In simpler representations such as bag-of-words, the semantic information of the words is not stored. Even in the TF-IDF model, we only give more importance to uncommon words. There is also a chance of overfitting the model. Overfitting is a scenario in which a model performs very well on your dataset but fails miserably when applied to any new dataset, because it has learned the noise and outliers in the training data.

In the Word2vec model, each word is represented as a vector of 32 or more dimensions instead of a single number, and the relations between different words are preserved.

Model Architecture
Word Embedding neural network

Earlier in natural language processing, many different models were introduced, such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). However, researchers at Google found that they are not as effective as neural networks for learning word vectors.

Let’s see some neural network models.


CBOW is the continuous bag-of-words model. It is similar to the feedforward Neural Net Language Model (NNLM), except that the non-linear hidden layer is removed and the projection layer is shared for all words (not just the projection matrix), as shown in the figure below; thus, all words get projected into the same position (their vectors are averaged). CBOW tries to predict a word from its context. It is called a bag-of-words model because the order of the words in the history does not influence the projection. Unlike the standard bag-of-words model, however, it uses a continuous distributed representation of the context.
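To make the averaging step concrete, here is a small sketch of the projection only (not a full CBOW model). The 3-dimensional vectors and the `project` helper are invented for illustration.

```python
# Hypothetical 3-dimensional input vectors for a tiny vocabulary.
vectors = {
    "watch":    [0.2, 0.1, 0.7],
    "the":      [0.5, 0.5, 0.0],
    "premiere": [0.9, 0.3, 0.3],
    "tonight":  [0.4, 0.1, 0.2],
}

def project(context):
    """Average the context word vectors, as CBOW's projection layer does."""
    dims = len(next(iter(vectors.values())))
    return [
        sum(vectors[w][d] for w in context) / len(context)
        for d in range(dims)
    ]

a = project(["watch", "the", "premiere", "tonight"])
b = project(["tonight", "premiere", "the", "watch"])
print(a)
print(b)
```

Both calls give the same projection up to floating-point rounding, which is exactly the "bag" property: the order of the context words does not matter.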


The second architecture, Skip-gram, is similar to CBOW, but instead of predicting the current word based on the context, it tries to maximize classification of a word based on another word in the same sentence; in other words, it uses the current word to predict the words around it. We will discuss this in the implementation section. Word2vec can be trained with either of these two architectures, CBOW or Skip-gram.

Word Embedding, skip gram model and cbow model

Here, I have used the Skip-gram model.

Step 1 – Data Preparation

Let’s try to prepare data using the following sentence.

“I am going to watch the season premiere”

I will mark the input word with square brackets; the output words are its neighbours within a window of two words on either side.

1. “[I] am going to watch the season premiere”

So, the training samples with respect to this input word will be as follows:

  • I | am
  • I | going

2. “I [am] going to watch the season premiere”

So, the training samples with respect to this input word will be as follows, appended to the previous ones:

  • I | am
  • I | going
  • am | I
  • am | going
  • am | to

This continues until the last word of the sentence. In this way, we can extract a large amount of training data from only one sentence, so you can imagine how much data we can get from 1 million sentences.
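The pair extraction described above can be sketched in a few lines. This assumes a window of two words on each side, which is what the sample pairs imply; `skipgram_pairs` is a hypothetical helper name.

```python
def skipgram_pairs(sentence, window=2):
    """Return (input, output) pairs for every word and its neighbours."""
    words = sentence.split()
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

pairs = skipgram_pairs("I am going to watch the season premiere")
for inp, out in pairs[:5]:
    print(inp, "|", out)
print(len(pairs), "pairs in total")
```

The first five pairs printed are the same ones listed above, and the eight-word sentence alone yields 26 training samples.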

Step 2 – Model Building

The Skip-gram model consists of three layers: an input layer, one hidden layer, and an output layer. The hidden-layer neurons pass the weighted sum of the inputs on to the output layer, and the output neurons use softmax.

Now, suppose we have extracted the data points from a million sentences.

Suppose we have 20,000 unique words (V) in our data and we want to create a word vector of size 100 (N) for each word.

  • V = 20,000
  • N = 100

Each word in the input layer is a one-hot encoded vector, and the output layer gives, for every word in the vocabulary, the probability of it being a nearby word.
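A toy forward pass may help here. The sketch below uses V = 5 and N = 3 in place of 20,000 and 100 so the numbers stay readable; the weights are random and untrained, and the names `W`, `W_out`, and `forward` are chosen for this example only.

```python
import math
import random

random.seed(0)
V, N = 5, 3  # tiny stand-ins for V = 20,000 and N = 100

# Randomly initialised weights: W is V x N (input -> hidden),
# W_out is N x V (hidden -> output).
W = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(V)]
W_out = [[random.uniform(-1, 1) for _ in range(V)] for _ in range(N)]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def forward(index):
    x = one_hot(index, V)
    # Hidden layer: plain weighted sum, no non-linearity.
    hidden = [sum(x[i] * W[i][k] for i in range(V)) for k in range(N)]
    # Output layer: one score per vocabulary word, turned into probabilities.
    scores = [sum(hidden[k] * W_out[k][j] for k in range(N)) for j in range(V)]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return hidden, [e / total for e in exps]

hidden, probs = forward(2)
print(hidden == W[2])  # the hidden layer is just row 2 of W
print(sum(probs))      # softmax outputs sum to (approximately) 1
```

Because the input is one-hot, multiplying it by W simply selects one row of W. That row is the word's vector, which is why the trained input weight matrix can be used directly as the embedding table.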

Once our model is trained, we can easily extract the learned weight matrix W, of size V × N, and use it to look up the word vectors:

Word Embedding, skip gram model
Word Embedding

From the above figure we can see that fixed-size vectors are obtained. Similar words in this dataset will have similar vectors, i.e. vectors pointing in the same direction. For example, the terms “King” and “Queen” would have similar vectors, as shown below.

Word Embedding, skip gram model
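"Pointing in the same direction" is usually measured with cosine similarity. The sketch below uses invented 3-dimensional vectors purely for illustration; real Word2vec vectors would come from the trained model and have many more dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings; real ones would have 100 dimensions.
embeddings = {
    "king":  [0.80, 0.65, 0.10],
    "queen": [0.75, 0.70, 0.15],
    "apple": [0.10, 0.05, 0.90],
}

def cosine(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(embeddings["king"], embeddings["queen"]))
print(cosine(embeddings["king"], embeddings["apple"]))
```

With these made-up numbers, the king/queen pair scores close to 1 while the king/apple pair scores much lower.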

In this way, we have built our model.

Step 3 – Implementation

I know some of you may be wondering how we can apply this knowledge to recommend movies, as movie names are distinct items and not sentences. But what if we treat a sequence of related movies as a sentence and predict the next movie?

The strategy is to represent each movie as a vector and find similar movies using that vector; as I already mentioned, similar words have similar vectors. Just take the watch history of a customer as a sentence and the movies in it as its words:

Netflix movies screenshot , Word Embedding

In this way, we take each word (movie) from the sentence -> convert it into a vector -> predict the next movie.

Coding Implementation

I am going to use the Python programming language for the implementation. Let’s see…

Step 1 – Import python libraries

Here we are using the NumPy library for high-level mathematical operations on arrays, Pandas for data analysis and manipulation, Matplotlib for data visualization, and gensim for the Word2Vec model.

import pandas as pd
import numpy as np
from gensim.models import Word2Vec 
import random
from tqdm import tqdm
import matplotlib.pyplot as plt
%matplotlib inline
import warnings

Step 2 – Get Data

We are using the MovieLens 20M Dataset curated by the MovieLens research team. It contains 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. For more details, you can visit the official website. You can download the dataset via this link.

df_movies = pd.read_csv('movies.csv')
df_ratings = pd.read_csv('ratings.csv')

Movie recommendation using Word Embedding dataset

Step 3 – Data Merging

Data merging means combining two datasets in such a way that the rows of both datasets are aligned on common attributes or columns. Here, we will merge the movie and rating datasets to get the movie ID, user ID, and movie title in one data frame.

# join the two frames on their shared movieId column
df = pd.merge(df_movies, df_ratings, on='movieId')

Word Embedding output capable machine

So, here we have everything in a single data frame. Note that each user has a unique user ID.

Since we have sufficient data, we will drop all the rows with missing values.
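The code for this step is not shown, so the following is only a sketch of how it could be done with pandas `dropna`. The small frame is a stand-in for the merged `df`, with missing values planted on purpose.

```python
import pandas as pd

# Small stand-in for the merged movie/rating frame.
df = pd.DataFrame({
    "movieId": [1, 1, 2, 3],
    "title": ["Toy Story (1995)", "Toy Story (1995)", "Jumanji (1995)", None],
    "userId": [10, 11, 10, 12],
    "rating": [4.0, None, 3.5, 5.0],
})

print(df.isnull().sum())  # missing values per column

# Drop every row that has at least one missing value.
df = df.dropna().reset_index(drop=True)
print(len(df))
```

Checking `isnull().sum()` first shows how many rows would be lost before anything is dropped.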

Step 4 – Data Pre-processing

First, we will convert the movie ID into a string, and then we will check the number of unique users.

df['movieId']= df['movieId'].astype(str)
users = df["userId"].unique().tolist()

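So, there are a total of 162,541 users in our dataset. For each of these users, we will extract their watch history. In other words, we will have 162,541 sequences of watched movies. (This count matches the MovieLens 25M release; the 20M release has about 138,000 users, so your number may differ depending on the version you downloaded.)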

Step 5 – Data Splitting

In order to check the performance of our model, we have to split our data into a training set and a testing set. Here, I have taken the data of 90% of the users for training and that of the remaining 10% for testing.

# take the first 90% of the user IDs for training
users_train = users[:round(0.9 * len(users))]
# split data into train and validation set
train_df = df[df['userId'].isin(users_train)]
validation_df = df[~df['userId'].isin(users_train)]
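The split above takes the first 90% of the user IDs in the order they appear. If you would rather draw the training users at random, a sketch like the following could be used instead; `split_users` and its parameters are hypothetical names, not part of the original code.

```python
import random

def split_users(users, train_fraction=0.9, seed=14):
    """Shuffle a copy of the user IDs, then cut it into train/validation."""
    shuffled = list(users)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

users = list(range(1, 101))  # stand-in for df["userId"].unique().tolist()
users_train, users_valid = split_users(users)
print(len(users_train), len(users_valid))
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without touching the global random state.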

Step 6 – Strategy

As I mentioned earlier, I am going to use the watch history of each user and recommend movies to him/her based on it. To do that, we first create an empty list and then, for each user, append the list of movie IDs that the user has rated. From this we can tell which movies a user has watched and what type of movies he/she likes.

# list to capture the watch history of the users
watch_train = []
# populate the list with each user's movie IDs
for i in tqdm(users_train):
    temp = train_df[train_df["userId"] == i]["movieId"].tolist()
    watch_train.append(temp)

Note – It took me 3 hours to capture the watch history of every user. It may take longer on your PC, depending on its configuration.
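Most of that time goes into filtering the whole data frame once per user. As an alternative sketch (not the code used in this article), a single pandas `groupby` should build the same list of watch histories in one pass. The tiny frame below is a stand-in for `train_df`.

```python
import pandas as pd

# Small stand-in for train_df; movie IDs are already strings.
train_df = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2, 3],
    "movieId": ["10", "20", "10", "30", "40", "20"],
})

# One pass over the frame: collect each user's movie IDs into a list.
watch_train = (
    train_df.groupby("userId", sort=False)["movieId"]
    .apply(list)
    .tolist()
)
print(watch_train)
```

Each inner list keeps the movies in the order they appear in the frame, which is the same order the per-user loop would produce.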

Step 7 – Training the Model

To train the model, we are going to use the gensim module. This module implements the word2vec family of algorithms, using highly optimized C routines, data streaming, and Pythonic interfaces. I recommend reading the original documentation of the gensim module.

model = Word2Vec(window = 10, sg = 1, hs = 0,
                 negative = 10, 
                 alpha=0.03, min_alpha=0.0007,
                 seed = 14)
model.build_vocab(watch_train, progress_per=200)
model.train(watch_train, total_examples = model.corpus_count, 
            epochs=10, report_delay=1)

Now, let’s print our model.

print(model)

Our model has a vocabulary of 31,673 unique words (movie IDs), each with a vector of size 100. Next, we will extract the vectors of all the words in our vocabulary and store them in one place for easy access.

# matrix of all learned vectors, one row per movie ID
X = model.wv.vectors

Step 8 – Result

Let’s create a dictionary of movie IDs and movie titles to easily map a movie’s ID to its title.

# one row per movie: keep the ID and title columns and remove duplicates
watch = train_df[["movieId", "title"]].drop_duplicates(subset='movieId', keep="last")
# create the movie ID -> [title] dictionary
watch_dict = watch.groupby('movieId')['title'].apply(list).to_dict()

I have defined the function below. It takes a movie’s vector as input and returns the top 6 most similar movies:

def similar_watch(v, n = 6):
    # extract the most similar movies for the input vector,
    # skipping the first hit (the input movie itself)
    ms = model.wv.similar_by_vector(v, topn= n+1)[1:]
    # extract name and similarity score of the similar movies
    new_ms = []
    for j in ms:
        pair = (watch_dict[j[0]][0], j[1])
        new_ms.append(pair)
    return new_ms

Let’s recommend now!

From the above, we can see that the user gives a movie ID as input, the movie’s name is printed, and our model returns the top 6 recommendations along with their similarity scores.

There is another method to do this.

The code for both methods is available here. Also, follow me on GitHub.



Written by –

Sarang Deshmukh

Co – Founder and Developer | Performance Engineer @AMDOCS
