Making non-personalized recommendations


Non-personalized recommendations

Non-personalized recommendations are one of the most basic ways to make recommendations. They rely on the wisdom of the crowd and suggest what is already the most popular. They are called non-personalized because the same recommendations are made to all users, without taking individual preferences into account.

One example is recommending the items most frequently seen together, as you can see on Amazon. This might not select the ‘best’ items or the items best suited to you, but there is a good chance you will not hate them as they are so common.

In this post, we will count how often each movie in the dataset has been watched and find the most frequently watched movies. We use the DataFrame user_ratings_df, which is a subset of the MovieLens dataset.

This table contains identifiers for each movie (movieId) and for the user who watched it (userId), along with the rating they gave it.

import pandas as pd
user_ratings_df = pd.read_csv('data/user_ratings.csv')
print(user_ratings_df.shape)
user_ratings_df.head()
(100836, 6)
   userId  movieId  rating   timestamp             title                                       genres
0       1        1     4.0   964982703  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1       5        1     4.0   847434962  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
2       7        1     4.5  1106635946  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
3      15        1     2.5  1510577970  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
4      17        1     4.5  1305696483  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
# Get the counts of occurrences of each movie title
movie_popularity = user_ratings_df["title"].value_counts()

# Inspect the most common values
print(movie_popularity.head().index)
Index(['Forrest Gump (1994)', 'Shawshank Redemption, The (1994)',
       'Pulp Fiction (1994)', 'Silence of the Lambs, The (1991)',
       'Matrix, The (1999)'],
      dtype='object')

Improved non-personalized recommendations: Just because a movie has been watched by a lot of people doesn’t necessarily mean viewers enjoyed it. To understand how a viewer actually felt about a movie, more explicit data is useful. Thankfully, the MovieLens dataset also includes the rating each viewer gave.

Next, you will find the average rating of each movie in the dataset, and then find the movie with the highest average rating.

# Find the mean of the ratings given to each title
average_rating_df = user_ratings_df[["title", "rating"]].groupby('title').mean()

# Order the entries by highest average rating to lowest
sorted_average_ratings = average_rating_df.sort_values(by='rating', ascending=False)

# Inspect the top movies
print(sorted_average_ratings.head())
                                     rating
title                                      
Gena the Crocodile (1969)               5.0
True Stories (1986)                     5.0
Cosmic Scrat-tastrophe (2015)           5.0
Love and Pigeons (1985)                 5.0
Red Sorghum (Hong gao liang) (1987)     5.0

Even though this is a real-world dataset, the highest-rated movies are not ones that most people have heard of. This is because movies with very few ratings skew the results: a single 5.0 rating is enough to give a movie a perfect average. We will address this issue next.
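One way to see this skew is to look at how many ratings sit behind each average. The sketch below uses a tiny made-up DataFrame (the titles and ratings are invented for illustration, not taken from the dataset) with the same title and rating columns:

```python
import pandas as pd

# Toy stand-in for user_ratings_df; these rows are invented for illustration
toy_ratings = pd.DataFrame({
    "title": ["Obscure Film", "Popular Film", "Popular Film", "Popular Film"],
    "rating": [5.0, 4.5, 4.0, 3.5],
})

# Compute the mean rating and the number of ratings behind it for each title
rating_summary = (
    toy_ratings.groupby("title")["rating"]
    .agg(["mean", "count"])
    .sort_values(by="mean", ascending=False)
)
print(rating_summary)
```

The title rated only once lands on top, even though its average rests on a single opinion.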

Combining popularity and reviews: Until now, you have used the two most common non-personalized recommendation methods to find movies to suggest. As you may have noticed, they both have their weaknesses.

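Finding the most frequently watched movies will show you what has been watched, but not how people explicitly feel about it. Finding the average of reviews has the opposite problem: we have viewers’ explicit feedback, but for rarely rated movies a handful of individual opinions skews the average. Combining the two approaches addresses both weaknesses: keep only the movies that appear more than 50 times in the dataset, then rank those by average rating.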

# Create a list of only movies appearing > 50 times in the dataset
movie_popularity = user_ratings_df["title"].value_counts()
popular_movies = movie_popularity[movie_popularity > 50].index

# Use this popular_movies list to filter the original DataFrame
popular_movies_rankings = user_ratings_df[user_ratings_df["title"].isin(popular_movies)]

# Find the average rating given to these frequently watched films
popular_movies_average_rankings = popular_movies_rankings[["title", "rating"]].groupby('title').mean()
print(popular_movies_average_rankings.sort_values(by="rating", ascending=False).head())
                                                      rating
title                                                       
Shawshank Redemption, The (1994)                    4.429022
Godfather, The (1972)                               4.289062
Fight Club (1999)                                   4.272936
Cool Hand Luke (1967)                               4.271930
Dr. Strangelove or: How I Learned to Stop Worry...  4.268041

You are now able to make intelligent non-personalized recommendations that combine both the ratings of an item and how frequently it has been interacted with.
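If you want to experiment with different popularity cut-offs, the filter-then-average steps can be wrapped in a small function. This is only a sketch: the function name top_rated_popular and its parameters are hypothetical, and the toy data is invented for illustration.

```python
import pandas as pd

def top_rated_popular(ratings_df, min_ratings=50, n=5):
    """Average rating per title, keeping only titles rated more than min_ratings times."""
    counts = ratings_df["title"].value_counts()
    popular_titles = counts[counts > min_ratings].index
    popular = ratings_df[ratings_df["title"].isin(popular_titles)]
    return (
        popular.groupby("title")["rating"]
        .mean()
        .sort_values(ascending=False)
        .head(n)
    )

# Toy data, invented for illustration: "C" has a perfect score but only one rating
toy_ratings = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B", "B", "C"],
    "rating": [4.0, 5.0, 3.0, 2.0, 3.0, 4.0, 5.0],
})
top = top_rated_popular(toy_ratings, min_ratings=2, n=2)
print(top)
```

On the real data, top_rated_popular(user_ratings_df) would apply the same threshold of 50 used above.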

Non-personalized suggestions

While suggesting the highest-ranked items will generally return items that most people do not object to, this approach lacks any understanding of user tastes or of which items are liked by the same people.

Now, we will work through a third and final type of non-personalized recommendation: making suggestions by finding the items most commonly seen together.

For example, suppose we have a dataset with users and the books they read. We will record every time two books were read by the same person, and then count how often these pairings of books occur. We can then use this lookup table to suggest books that are often read by the same people, implying that if you like one, you are likely to enjoy the other.

Permutations of pairs: We will be looking for all permutations of pairs, or in other words, counting item_a paired with item_b and item_b paired with item_a separately. This will allow us to independently look up the items commonly seen with item_a and the items commonly seen with item_b.
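To make the distinction concrete, here is a small sketch comparing ordered pairs (permutations) with unordered pairs (combinations) from Python's itertools module. The book names are placeholders.

```python
from itertools import combinations, permutations

books = ["book_a", "book_b", "book_c"]  # hypothetical reading list

# Ordered pairs: (book_a, book_b) and (book_b, book_a) are both included
ordered_pairs = list(permutations(books, 2))

# Unordered pairs: each pairing appears only once
unordered_pairs = list(combinations(books, 2))

print(ordered_pairs)
print(unordered_pairs)
```

With permutations, every book appears in the first position of some pair, which is what makes a lookup by the first item possible.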

We first need to create a function that finds all permutations of pairs of items in the list it is applied to, and then apply that function to the set of books each user has read.
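As a rough sketch of the idea in plain Python, the lookup table for the book example could be built with a Counter. The reading histories and the helper read_with are invented for illustration.

```python
from collections import Counter
from itertools import permutations

# Hypothetical reading histories, invented for illustration
books_read = {
    "user_1": ["Dune", "Emma", "Ulysses"],
    "user_2": ["Dune", "Emma"],
    "user_3": ["Emma", "Ulysses"],
}

# Count every ordered pair of books read by the same user
pair_counts = Counter()
for books in books_read.values():
    pair_counts.update(permutations(books, 2))

def read_with(book):
    """Return (other_book, count) tuples for books paired with `book`, most frequent first."""
    paired = [(b, n) for (a, b), n in pair_counts.items() if a == book]
    return sorted(paired, key=lambda item: item[1], reverse=True)

print(read_with("Dune"))
```

Here two readers of Dune also read Emma and one also read Ulysses, so Emma would be the first suggestion.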

Finding all pairs of movies

Goal: We will work through how to find all permutations of pairs of movies that have been watched by the same person.

You will need to first create a function that finds all possible pairs of items in the list it is applied to. For ease of use, you will output the values as a DataFrame. Since you only want to find movies that have been seen by the same person and not all possible permutations, you will group by userId when applying the function.

user_ratings_df.groupby('userId')['title'].size()
userId
1       232
2        29
3        39
4       216
5        44
       ...
606    1115
607     187
608     831
609      37
610    1302
Name: title, Length: 610, dtype: int64
from itertools import permutations

# Create the function to find all permutations
def find_movie_pairs(x):
    pairs = pd.DataFrame(list(permutations(x.values, 2)),
                       columns=['movie_a', 'movie_b'])
    return pairs

# Apply the function to the title column and reset the index
movie_combinations = user_ratings_df.groupby('userId')['title'].apply(find_movie_pairs).reset_index(drop=True)

movie_combinations
                   movie_a                           movie_b
0         Toy Story (1995)           Grumpier Old Men (1995)
1         Toy Story (1995)                       Heat (1995)
2         Toy Story (1995)       Seven (a.k.a. Se7en) (1995)
3         Toy Story (1995)        Usual Suspects, The (1995)
4         Toy Story (1995)        From Dusk Till Dawn (1996)
...                    ...                               ...
60793295         31 (2016)                 Gen-X Cops (1999)
60793296         31 (2016)                  Bloodmoon (1997)
60793297         31 (2016)  Sympathy for the Underdog (1971)
60793298         31 (2016)                     Hazard (2005)
60793299         31 (2016)                Blair Witch (2016)

60793300 rows × 2 columns

  • permutations(list, length_of_permutations) generates an iterable object containing all permutations of the given length.
  • list() converts this object to a usable list.
  • pd.DataFrame() converts the list to a DataFrame containing the columns movie_a and movie_b.

NOTE: DataFrame groupby objects have some built-in aggregation methods, for example .size(), which you used previously. Custom functions, however, are applied using the apply method. Here the groupby is called on the DataFrame and our custom function is applied with apply. This returns the correct data, but because of the groupby, the result has a nested index that is a little difficult to read. You can drop this multi-index with reset_index(drop=True), as done above.

You now have a clean table of all pairs of movies that were watched by the same user, which can be used to find the most commonly paired movies.
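To see what the nested index looks like before it is dropped, here is a minimal sketch on toy data. The userId and title values are invented, and find_pairs mirrors the function defined above.

```python
import pandas as pd
from itertools import permutations

# Toy data, invented for illustration: two users with three titles each
toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 2],
    "title": ["A", "B", "C", "A", "B", "D"],
})

def find_pairs(x):
    # Same idea as find_movie_pairs: all ordered pairs within one user's titles
    return pd.DataFrame(list(permutations(x.values, 2)),
                        columns=["movie_a", "movie_b"])

# Without reset_index, rows are indexed by userId plus a row number within each user
nested = toy.groupby("userId")["title"].apply(find_pairs)
print(nested.head())

# After reset_index(drop=True), the index is a single running row number
flat = nested.reset_index(drop=True)
print(flat.head())
```

Printing both versions side by side shows the extra userId level disappearing after the reset.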

Counting up the pairs: Our next task is to find which movies are most commonly paired. To get this, we will generate a new DataFrame containing the counts of occurrences of each of the pairs.

# Calculate how often each item in movie_a occurs with the items in movie_b
combination_counts = movie_combinations.groupby(['movie_a', 'movie_b']).size()

# Convert the results to a DataFrame and reset the index
combination_counts_df = combination_counts.to_frame(name='size').reset_index()
print(combination_counts_df.head())
      movie_a                                     movie_b  size
0  '71 (2014)                 (500) Days of Summer (2009)     1
1  '71 (2014)                  10 Cloverfield Lane (2016)     1
2  '71 (2014)                            127 Hours (2010)     1
3  '71 (2014)  13 Assassins (Jûsan-nin no shikaku) (2010)     1
4  '71 (2014)                             13 Hours (2016)     1

Awesome, now you will use this aggregated DataFrame to generate recommendations for any movie in the dataset.
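The lookup itself can be sketched as a small helper. The function name most_paired_with is hypothetical, and the toy counts below are invented; they only mimic the movie_a, movie_b and size columns of combination_counts_df.

```python
import pandas as pd

def most_paired_with(counts_df, title, n=10):
    """Return the n rows whose movie_a equals `title`, with the largest counts first."""
    matches = counts_df[counts_df["movie_a"] == title]
    return matches.sort_values("size", ascending=False).head(n)

# Toy counts in the same shape as combination_counts_df; values are invented
toy_counts = pd.DataFrame({
    "movie_a": ["X", "X", "X", "Y"],
    "movie_b": ["Y", "Z", "W", "X"],
    "size": [3, 7, 1, 3],
})
result = most_paired_with(toy_counts, "X", n=2)
print(result)
```

On the real data, the same call with combination_counts_df and a title from the dataset would return that movie's most common pairings.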

Generate recommendations for Gladiator

Making your first movie recommendations: Now that you have found the most commonly paired movies, you can make your first recommendations!

While you are not taking in any information about the person watching, and do not even know any details about the movie, valuable recommendations can still be made by examining what groups of movies are watched by the same people.

Finally, you will examine the movies often watched by the same people who watched Gladiator, and then use this data to give a recommendation to someone who just watched the movie.

import matplotlib.pyplot as plt

# Sort the counts from highest to lowest
combination_counts_df.sort_values('size', ascending=False, inplace=True)

# Find the movies most frequently watched by people who watched Gladiator
gladiator_df = combination_counts_df[combination_counts_df['movie_a'] == 'Gladiator (2000)']

# get top 10
gladiator_df_top_10 = gladiator_df.head(10)

# Plot the results
gladiator_df_top_10.plot.bar(x="movie_b")
plt.show()

[Figure: bar chart of the 10 movies most frequently watched by people who watched Gladiator]

You can see that The Matrix was the movie most commonly watched by those who watched Gladiator. This makes it a good movie to recommend to Gladiator watchers, as it shows the two movies have similar fans - that includes me!