Content Based recommendations


So far we have looked at making recommendations based solely on how the entire population feels about items. While these recommendations can be useful, they aren’t personalized. In this post, we will move to more targeted models by recommending items based on their similarities to items a user has liked in the past.

For example, if a user likes book A, and we calculate that book A and book B are similar, we believe the user will like book B.

We will address how to calculate what items are similar and which ones are not. We can do so by comparing the attributes of our items. The recommendations made by finding items with similar attributes are called content-based recommendations.

For example, if we were looking at a dataset describing books, the attributes could be the author of the book, its publishing date, its length, or its genre. Really, any descriptive information will do. A big advantage of using an item’s attributes over user feedback is that you can make recommendations for any item you have attribute data on, including brand new items that users have not seen yet.

Content-based models require us to use the available attributes to build a profile of each item in a form that lets us compare items mathematically. This allows us, for example, to find the items most similar to a given one and recommend them.

Vectorizing the attributes

This is best done by encoding each item as a vector, with the vector for each item stored as a row and each feature as a column.
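As a rough illustration of that layout, here is a minimal sketch with made-up data. The book names and genre columns are hypothetical and only show the row-per-item, column-per-feature shape.

```python
import pandas as pd

# Hypothetical items and genre flags, purely for illustration
item_vectors = pd.DataFrame(
    {
        "Adventure": [1, 0, 1],
        "Fantasy": [1, 0, 0],
        "Romance": [0, 1, 0],
    },
    index=["Book A", "Book B", "Book C"],
)

# Each row is the feature vector of one item
book_a_vector = item_vectors.loc["Book A"].to_numpy()

print(item_vectors)
print(book_a_vector)
```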

Why this shape, you might ask? With the data in this format, the distances and similarities between items can be calculated easily, which is vital for generating recommendations. We’ll discuss how to calculate distances and similarities between vectors later in the post. First, we will cover how to convert the most common data format for attributes to this shape.

One-to-many relationships

A book_genre table contains a one-to-many mapping of books to their genres: each row pairs a book with one of its genres, so a book with several genres spans several rows. This type of one-to-many lookup is very common in relational databases. From this table, we want to create a new table that contains a single row per item, with one column per genre encoding whether or not the item has that attribute.

Using pandas’ cross-tabulation function

To transform this data we can use pandas’ crosstab function. The crosstab function generates the cross-tabulation of two (or more) factors, and here we want to use it to find the cross-tabulation of the book titles and the genres they have been labeled with. The first argument becomes the rows, and the second becomes the columns. The result has one row per book and one column per genre, with a 1 wherever the book carries that genre and a 0 otherwise.
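The transformation can be sketched on a tiny one-to-many table. The book_genre_df name, its column names, and its contents are assumptions made up for this example; only pd.crosstab itself comes from pandas.

```python
import pandas as pd

# Hypothetical one-to-many table: one row per (book, genre) pair
book_genre_df = pd.DataFrame(
    {
        "book": ["Book A", "Book A", "Book B", "Book C", "Book C"],
        "genre": ["Adventure", "Fantasy", "Romance", "Adventure", "Romance"],
    }
)

# The first argument becomes the rows, the second becomes the columns
book_genre_table = pd.crosstab(book_genre_df["book"], book_genre_df["genre"])

print(book_genre_table)
```

Each cell counts how often a (book, genre) pair occurs, so with one row per pair in the input the table only holds 0s and 1s.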


Imagine you are working for a large retailer that has a constantly changing product line, with new items being added every day. Why might content-based models be a good choice to make recommendations on your data?

Ans: As the recommendations are based on the item attributes rather than user feedback, recommendations can be made on never-before-purchased products. Content-based models are ideal for creating recommendations for products that have no user feedback data such as reviews or purchases.

Making content-based recommendations

With our data formatted, we can begin making comparisons and recommendations, but to do so, we will need a way of calculating similarity between rows.

Jaccard Similarity

The metric we will use to measure similarity between items in our newly encoded dataset is called the Jaccard similarity. The Jaccard similarity is the number of attributes that two items have in common, divided by the total number of distinct attributes the two items have between them. In set terms, for attribute sets A and B it is |A ∩ B| / |A ∪ B|: the size of the intersection over the size of the union.

It will always be between 0 and 1 and the more attributes the two items have in common, the higher the score.
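To make the definition concrete, here is a minimal sketch that computes the score directly from two attribute lists. The function name jaccard_similarity is a hypothetical helper, and returning 0 when both items have no attributes is an assumption of this sketch. The genre lists are those of Toy Story and Jumanji from the movies data loaded below.

```python
def jaccard_similarity(attributes_a, attributes_b):
    """Shared attributes divided by all distinct attributes of both items."""
    set_a, set_b = set(attributes_a), set(attributes_b)
    union = set_a | set_b
    if not union:
        # Assumption: two items without any attributes get a score of 0
        return 0.0
    return len(set_a & set_b) / len(union)


toy_story = ["Adventure", "Animation", "Children", "Comedy", "Fantasy"]
jumanji = ["Adventure", "Children", "Fantasy"]

# 3 shared genres out of 5 distinct genres
print(jaccard_similarity(toy_story, jumanji))
```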

import pandas as pd
import numpy as np

movies_df = pd.read_csv('data/movies.csv')
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
# Copy the four-digit year out of the title into a separate column
movies_df['year'] = movies_df.title.str.extract(r'\((\d{4})\)', expand=False)
# Remove the "(year)" part from the title, then trim leftover whitespace
movies_df['title'] = movies_df.title.str.replace(r'\(\d{4}\)', '', regex=True)
movies_df['title'] = movies_df.title.str.strip()
movieId title genres year
0 1 Toy Story Adventure|Animation|Children|Comedy|Fantasy 1995
1 2 Jumanji Adventure|Children|Fantasy 1995
2 3 Grumpier Old Men Comedy|Romance 1995
3 4 Waiting to Exhale Comedy|Drama|Romance 1995
4 5 Father of the Bride Part II Comedy 1995
# convert the genres column to a list
movies_df['genres'] = movies_df['genres'].str.split('|')
movieId title genres year
0 1 Toy Story [Adventure, Animation, Children, Comedy, Fantasy] 1995
1 2 Jumanji [Adventure, Children, Fantasy] 1995
2 3 Grumpier Old Men [Comedy, Romance] 1995
3 4 Waiting to Exhale [Comedy, Drama, Romance] 1995
4 5 Father of the Bride Part II [Comedy] 1995
# explode the list
movie_genre_df = movies_df[['title', 'genres']]
movie_genre_df = movie_genre_df.explode('genres').reset_index(drop=True)
title genres
0 Toy Story Adventure
1 Toy Story Animation
2 Toy Story Children
3 Toy Story Comedy
4 Toy Story Fantasy
5 Jumanji Adventure
6 Jumanji Children
7 Jumanji Fantasy

Creating content-based data: As much as you might want to jump right to finding similar items and making recommendations, you first need to get your data in a usable format. In the next few cells, you will explore your base data and work through how to format that data to be used for content-based recommendations.

As a reminder, the desired outcome is a row per movie with each column indicating whether a genre applies to the movie. You will be looking at movie_genre_df, which contains these columns:

  • title - Name of movie
  • genres - Genre that the movie has been labeled as.

A movie may have multiple genres, and therefore multiple rows.

# Select only the rows whose title column equals Toy Story
toy_story_genres = movie_genre_df[movie_genre_df['title'] == 'Toy Story']

# Create a cross-tabulated DataFrame from the title and genres columns
movie_cross_table = pd.crosstab(movie_genre_df['title'], movie_genre_df['genres'])

# Select only the rows with Toy Story as the index
toy_story_genres_ct = movie_cross_table[movie_cross_table.index == 'Toy Story']
genres (no genres listed) Action Adventure Animation Children Comedy Crime Documentary Drama Fantasy Film-Noir Horror IMAX Musical Mystery Romance Sci-Fi Thriller War Western
Toy Story 0 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0

Comparing individual movies with Jaccard similarity:

In the last cell, you built a DataFrame of movies, where each column represents a different genre. You can now use this DataFrame to compare movies by measuring the Jaccard similarity between rows.

The higher the Jaccard similarity score, the more similar the two items are.

As an example, we will compare the movie GoldenEye with the movie Toy Story, then GoldenEye with Skyfall, and see how the two scores differ.

The DataFrame movie_cross_table contains all the movies as rows and the genres as 0/1 indicator columns.

(9461, 20)
genres (no genres listed) Action Adventure Animation Children Comedy Crime Documentary Drama Fantasy Film-Noir Horror IMAX Musical Mystery Romance Sci-Fi Thriller War Western
'71 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0
'Hellboy': The Seeds of Creation 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0
'Round Midnight 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
'Salem's Lot 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0
'Til There Was You 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0
movie_cross_table.loc['Toy Story'].values
array([0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
# Import numpy and the distance metric
import numpy as np
from sklearn.metrics import jaccard_score

# Extract just the rows containing GoldenEye and Toy Story
goldeneye_values = movie_cross_table.loc['GoldenEye'].values
toy_story_values = movie_cross_table.loc['Toy Story'].values

# Find the similarity between GoldenEye and Toy Story
print(jaccard_score(goldeneye_values, toy_story_values))

# Repeat for GoldenEye and Skyfall
skyfall_values = movie_cross_table.loc['Skyfall'].values
print(jaccard_score(goldeneye_values, skyfall_values))

As you can see, based on Jaccard similarity, GoldenEye and Skyfall (both James Bond movies) are more similar than GoldenEye and Toy Story (a spy movie and an animated kids movie).
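To see what jaccard_score computes on 0/1 vectors, here is a minimal sketch using made-up genre vectors. The vectors and their names are hypothetical and are not the actual rows of movie_cross_table.

```python
import numpy as np
from sklearn.metrics import jaccard_score

# Made-up 0/1 genre vectors over five hypothetical genres
spy_movie_a = np.array([1, 1, 0, 0, 1])
spy_movie_b = np.array([1, 1, 0, 0, 0])
kids_movie = np.array([0, 1, 1, 1, 0])

# 2 genres shared, 3 genres present in at least one of the two vectors
score_spy_pair = jaccard_score(spy_movie_a, spy_movie_b)

# 1 genre shared, 5 genres present in at least one of the two vectors
score_mixed_pair = jaccard_score(spy_movie_a, kids_movie)

print(score_spy_pair, score_mixed_pair)
```

For binary input, jaccard_score treats the 1s as set membership, so the score is the count of positions where both vectors are 1 divided by the count of positions where at least one is 1.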

Comparing all your movies at once using pdist() and squareform() instead of jaccard_score

Finding the Jaccard similarity between two individual movies works well for small-scale analyses, but computing it pair by pair is too slow for making recommendations on larger datasets.

So now, you will find the similarities between all movies and store them in a DataFrame for quick and easy lookup.

When finding the similarities between the rows in a DataFrame, you could run through all pairs and calculate them individually, but it’s more efficient to use the pdist() (pairwise distance) function from scipy.

pdist() returns the distances as a condensed 1-D array, which can be reshaped into the desired square matrix using squareform() from the same library. Since you want similarity values as opposed to distances, you should subtract the values from 1.
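The whole pipeline can be sketched on three made-up items before running it on the full table. The toy_table contents and names are hypothetical; casting to bool is a precaution taken in this sketch so that the Jaccard metric treats the values strictly as presence or absence.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical item-by-genre table with 0/1 flags
toy_table = pd.DataFrame(
    [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]],
    index=["Item A", "Item B", "Item C"],
    columns=["g1", "g2", "g3", "g4"],
)

# Condensed distances: one value per unordered pair (A-B, A-C, B-C)
condensed = pdist(toy_table.values.astype(bool), metric="jaccard")

# Expand to a square matrix and turn distances into similarities
similarity_df = pd.DataFrame(
    1 - squareform(condensed),
    index=toy_table.index,
    columns=toy_table.index,
)

print(condensed)
print(similarity_df)
```

With n items the condensed array has n * (n - 1) / 2 entries, while the square matrix has n * n, with 1s on the diagonal after the conversion to similarity.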

# Import functions from scipy
from scipy.spatial.distance import pdist, squareform

# Calculate all pairwise distances
jaccard_distances = pdist(movie_cross_table.values, metric='jaccard')

array([0.875     , 0.8       , 0.66666667, ..., 1.        , 1.        ,

jaccard_distances is returned as a condensed 1-D array, so we now have to use squareform() to transform it into the desired square shape of 9461 rows by 9461 columns.

(9461, 9461)
# Convert the distances to a square matrix
jaccard_similarity_array = 1 - squareform(jaccard_distances)

# Wrap the array in a pandas DataFrame
jaccard_similarity_df = pd.DataFrame(jaccard_similarity_array, index=movie_cross_table.index, columns=movie_cross_table.index)

# Show the top 5 rows of the DataFrame
jaccard_similarity_df.head()
title '71 'Hellboy': The Seeds of Creation 'Round Midnight 'Salem's Lot 'Til There Was You 'Tis the Season for Love 'burbs, The 'night Mother (500) Days of Summer *batteries not included ... Zulu [REC] [REC]² [REC]³ 3 Génesis anohana: The Flower We Saw That Day - The Movie eXistenZ xXx xXx: State of the Union ¡Three Amigos! À nous la liberté (Freedom for Us)
'71 1.000000 0.125 0.200000 0.333333 0.200000 0.0 0.0 0.25 0.166667 0.000000 ... 0.600000 0.40 0.2 0.2 0.200000 0.400000 0.400000 0.400000 0.000000 0.000000
'Hellboy': The Seeds of Creation 0.125000 1.000 0.000000 0.000000 0.000000 0.0 0.2 0.00 0.142857 0.285714 ... 0.111111 0.00 0.0 0.0 0.000000 0.142857 0.142857 0.142857 0.166667 0.166667
'Round Midnight 0.200000 0.000 1.000000 0.200000 0.333333 0.0 0.0 0.50 0.250000 0.000000 ... 0.000000 0.25 0.0 0.0 0.333333 0.000000 0.000000 0.000000 0.000000 0.333333
'Salem's Lot 0.333333 0.000 0.200000 1.000000 0.200000 0.0 0.0 0.25 0.166667 0.000000 ... 0.142857 0.75 0.5 0.5 0.200000 0.166667 0.166667 0.166667 0.000000 0.000000
'Til There Was You 0.200000 0.000 0.333333 0.200000 1.000000 0.5 0.0 0.50 0.666667 0.000000 ... 0.000000 0.25 0.0 0.0 0.333333 0.000000 0.000000 0.000000 0.000000 0.000000

5 rows × 9461 columns

As you can see, the table has the movies as both rows and columns, allowing you to quickly look up the similarity of any pair of movies.

Making recommendations based on movie genres: Now that you have your data in a usable format and know how to compare two movies, the next step is to use this to generate recommendations. Here we will see how to generate recommendations for any movie in your dataset.

# Find the values for the movie Gladiator
jaccard_similarity_series = jaccard_similarity_df.loc['Gladiator']

# Sort these values from highest to lowest
ordered_similarities = jaccard_similarity_series.sort_values(ascending=False)

# Print the five highest-scoring titles
print(ordered_similarities.head())
Gladiator             1.00
Planet of the Apes    0.75
Getaway, The          0.50
3:10 to Yuma          0.50
Oliver Twist          0.50
Name: Gladiator, dtype: float64

Planet of the Apes has the highest similarity value to Gladiator! This suggests that viewers who liked Gladiator are likely to enjoy Planet of the Apes as well, although I find that dubious at best. Remember, we only used genres as our attributes to make these recommendations, which is why they are so far off, IMO.
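The lookup above can be wrapped in a small helper so that the movie itself, which always scores 1.0, is left out of its own recommendations. This is a minimal sketch: recommend_similar and demo_similarity_df are hypothetical names, and the demo values are made up. It assumes each title appears only once in the index.

```python
import pandas as pd


def recommend_similar(title, similarity_df, n=5):
    """Return the n titles most similar to `title`, excluding the title itself."""
    if title not in similarity_df.index:
        raise KeyError(f"{title!r} is not in the similarity table")
    # Drop the item's own entry, then sort from most to least similar
    scores = similarity_df.loc[title].drop(labels=title)
    return scores.sort_values(ascending=False).head(n)


# Made-up similarity table, shaped like jaccard_similarity_df
demo_similarity_df = pd.DataFrame(
    [[1.0, 0.5, 0.0], [0.5, 1.0, 0.25], [0.0, 0.25, 1.0]],
    index=["Item A", "Item B", "Item C"],
    columns=["Item A", "Item B", "Item C"],
)

top_matches = recommend_similar("Item B", demo_similarity_df, n=2)
print(top_matches)
```

With the real data, the same call would look like recommend_similar('Gladiator', jaccard_similarity_df).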