Non-personalized recommendation is one of the most basic ways to make recommendations: it goes with the wisdom of the crowd and recommends whatever is already most popular. These recommendations are called non-personalized because the same ones are made to every user, without taking individual preferences into account.
One example is recommending the items most frequently seen together, as Amazon does. This might not surface the 'best' items, or the items best suited to you, but because they are so widely liked there is a good chance you will not hate them.
In this post, we will calculate how often each movie in the dataset has been watched and find the most frequently watched movies. The DataFrame `user_ratings_df`, a subset of the MovieLens dataset, is used here. This table contains identifiers for each movie (`movieId`) and the user who watched it (`userId`), along with the `rating` they gave it.
```python
import pandas as pd

user_ratings_df = pd.read_csv('data/user_ratings.csv')

print(user_ratings_df.shape)
user_ratings_df.head()
```
| | userId | movieId | rating | timestamp | title | genres |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 5 | 1 | 4.0 | 847434962 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 2 | 7 | 1 | 4.5 | 1106635946 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 3 | 15 | 1 | 2.5 | 1510577970 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 4 | 17 | 1 | 4.5 | 1305696483 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
```python
# Get the counts of occurrences of each movie title
movie_popularity = user_ratings_df["title"].value_counts()

# Inspect the most common values
print(movie_popularity.head().index)
```
```
Index(['Forrest Gump (1994)', 'Shawshank Redemption, The (1994)',
       'Pulp Fiction (1994)', 'Silence of the Lambs, The (1991)',
       'Matrix, The (1999)'],
      dtype='object')
```
Improved non-personalized recommendations: Just because a movie has been watched by a lot of people doesn't necessarily mean viewers enjoyed it. To understand how a viewer actually felt about a movie, more explicit data is useful. Thankfully, you also have ratings from each of the viewers in the MovieLens dataset.
Next, you will find the average rating of each movie in the dataset, and then find the movie with the highest average rating.
```python
# Find the mean of the ratings given to each title
average_rating_df = user_ratings_df[["title", "rating"]].groupby('title').mean()

# Order the entries from highest average rating to lowest
sorted_average_ratings = average_rating_df.sort_values(by='rating', ascending=False)

# Inspect the top movies
print(sorted_average_ratings.head())
```
```
                                     rating
title
Gena the Crocodile (1969)               5.0
True Stories (1986)                     5.0
Cosmic Scrat-tastrophe (2015)           5.0
Love and Pigeons (1985)                 5.0
Red Sorghum (Hong gao liang) (1987)     5.0
```
Despite this being a real-world dataset, you might be surprised that the highest-rated movies are not ones most people have heard of. This is because very infrequently viewed movies are skewing the results: a movie rated 5.0 by a single viewer outranks a widely watched film with a 4.4 average. We will address this issue next.
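A tiny synthetic example makes this skew concrete. The titles and ratings below are made up for illustration, not taken from the MovieLens data:

```python
import pandas as pd

# Hypothetical ratings: one obscure film with a single 5.0 rating,
# and one popular film with five ratings averaging 4.4
toy_ratings = pd.DataFrame({
    "title": ["Obscure Film"] + ["Popular Film"] * 5,
    "rating": [5.0, 4.5, 4.5, 4.0, 4.5, 4.5],
})

# Averaging alone puts the single-rating film on top
toy_averages = toy_ratings.groupby("title")["rating"].mean().sort_values(ascending=False)
print(toy_averages)
# Obscure Film (one rating, mean 5.0) ranks above Popular Film (mean 4.4)
```

This is exactly the failure mode visible in the real output above, where obscure titles with a handful of perfect ratings fill the top of the list.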
Combining popularity and reviews: Until now, you have used the two most common non-personalized recommendation methods to find movies to suggest. As you may have noticed, they both have their weaknesses.
Finding the most frequently watched movies shows you what has been watched, but not how people explicitly felt about it. Finding the average rating has the opposite problem: you have customers' explicit feedback, but a handful of ratings on rarely watched movies can skew the results.
```python
# Create a list of only movies appearing > 50 times in the dataset
movie_popularity = user_ratings_df["title"].value_counts()
popular_movies = movie_popularity[movie_popularity > 50].index

# Use this popular_movies list to filter the original DataFrame
popular_movies_rankings = user_ratings_df[user_ratings_df["title"].isin(popular_movies)]

# Find the average rating given to these frequently watched films
popular_movies_average_rankings = popular_movies_rankings[["title", "rating"]].groupby('title').mean()
print(popular_movies_average_rankings.sort_values(by="rating", ascending=False).head())
```
```
                                                      rating
title
Shawshank Redemption, The (1994)                    4.429022
Godfather, The (1972)                               4.289062
Fight Club (1999)                                   4.272936
Cool Hand Luke (1967)                               4.271930
Dr. Strangelove or: How I Learned to Stop Worry...  4.268041
```
You are now able to make intelligent non-personalized recommendations that combine both the ratings of an item and how frequently it has been interacted with.
While suggesting the highest-ranked items will generally return items that most people do not object to, it lacks any understanding of user tastes, or what items are liked by the same people.
Now, we will work through a third and final type of non-personalized recommendation: making suggestions by finding the items most commonly seen together.
For example, suppose we have a dataset with users and the books they read. We will record every time two books were read by the same person, and then count how often these pairings of books occur. We can then use this lookup table to suggest books that are often read by the same people, implying that if you like one, you are likely to enjoy the other.
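The book example above can be sketched in a few lines. The users and book titles here are purely illustrative:

```python
from itertools import permutations
from collections import Counter

# Hypothetical reading lists: which books each user has read
reading_lists = {
    "alice": ["Dune", "Neuromancer", "Foundation"],
    "bob":   ["Dune", "Neuromancer"],
    "carol": ["Dune", "Foundation"],
}

# Record every ordered pair of books read by the same person,
# then count how often each pairing occurs
pair_counts = Counter()
for books in reading_lists.values():
    pair_counts.update(permutations(books, 2))

# Books most often read by the same people who read "Dune"
dune_pairs = [pair for pair, n in pair_counts.most_common() if pair[0] == "Dune"]
print(dune_pairs)
```

The resulting counts act as the lookup table described above: given one book, the highest-count partners are the recommendations.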
Permutations of pairs: We will be looking for all permutations of pairs, or in other words, counting both item_a paired with item_b and item_b paired with item_a separately. This will allow us to independently look up items commonly seen with item_a, or items commonly seen with item_b.
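To see the difference on a small list, `itertools.permutations` yields both orderings of each pair, whereas `itertools.combinations` would give each pair only once:

```python
from itertools import permutations, combinations

items = ["item_a", "item_b", "item_c"]

# Both orderings of every pair: 3 * 2 = 6 results
print(list(permutations(items, 2)))

# Each unordered pair once: 3 results
print(list(combinations(items, 2)))
```

Keeping both orderings doubles the table size, but it means a single filter on the first column retrieves all partners for any given item.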
We first need to create a function that finds all permutations of pairs of items in the list it is applied to, and then apply that function to the set of books each user has read.
Finding all pairs of movies
Goal: We will work through how to find all pairs of movies (all permutations of pairs) that have been watched by the same person. You will first create a function that finds all possible pairs of items in the list it is applied to. For ease of use, you will output the result as a DataFrame. Since you only want pairs of movies seen by the same person, not pairs across the whole dataset, you will group by `userId` when applying the function.
```
userId
1       232
2        29
3        39
4       216
5        44
       ...
606    1115
607     187
608     831
609      37
610    1302
Name: title, Length: 610, dtype: int64
```
```python
from itertools import permutations

# Create the function to find all permutations of pairs
def find_movie_pairs(x):
    pairs = pd.DataFrame(list(permutations(x.values, 2)),
                         columns=['movie_a', 'movie_b'])
    return pairs

# Apply the function to the title column, per user, and reset the index
movie_combinations = user_ratings_df.groupby('userId')['title'].apply(find_movie_pairs).reset_index(drop=True)
movie_combinations
```
| | movie_a | movie_b |
|---|---|---|
| 0 | Toy Story (1995) | Grumpier Old Men (1995) |
| 1 | Toy Story (1995) | Heat (1995) |
| 2 | Toy Story (1995) | Seven (a.k.a. Se7en) (1995) |
| 3 | Toy Story (1995) | Usual Suspects, The (1995) |
| 4 | Toy Story (1995) | From Dusk Till Dawn (1996) |
| ... | ... | ... |
| 60793295 | 31 (2016) | Gen-X Cops (1999) |
| 60793296 | 31 (2016) | Bloodmoon (1997) |
| 60793297 | 31 (2016) | Sympathy for the Underdog (1971) |
| 60793298 | 31 (2016) | Hazard (2005) |
| 60793299 | 31 (2016) | Blair Witch (2016) |

60793300 rows × 2 columns
- `permutations(iterable, 2)` generates an iterable object containing all permutations of length 2.
- `list()` converts this object to a usable list.
- `pd.DataFrame()` converts the list to a DataFrame containing the columns `movie_a` and `movie_b`.
NOTE: When you use `groupby`, the groupby objects have some built-in aggregation functions, for example `.size()`, which you have used previously. Custom functions, however, are applied using the `apply` method. Here you can see `groupby` being called on the DataFrame with our custom function applied via `apply`. This returns the correct data, but due to the `groupby`, the nested index is a little difficult to read, so you can drop the multi-index with `reset_index(drop=True)`.
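A minimal illustration of that nested index, using toy data and hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({"user": [1, 1, 2], "title": ["A", "B", "C"]})

def pair_with_self(x):
    # Toy stand-in for a custom function that returns a DataFrame per group
    return pd.DataFrame({"t": x.values})

# apply() concatenates the per-group DataFrames, producing a MultiIndex
nested = df.groupby("user")["title"].apply(pair_with_self)
print(nested.index)   # MultiIndex of (user, row-within-group)

# Dropping the multi-index gives a clean 0..n-1 RangeIndex
flat = nested.reset_index(drop=True)
print(flat.index)
```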
You now have a clean table of all of the movies that were watched by the same user, which can be used to find the most commonly paired movies.
Counting up the pairs: Our next task is to find which movies are most commonly paired. To get this, we will generate a new DataFrame containing the counts of occurrences of each of the pairs.
```python
# Calculate how often each item in movie_a occurs with the items in movie_b
combination_counts = movie_combinations.groupby(['movie_a', 'movie_b']).size()

# Convert the results to a DataFrame and reset the index
combination_counts_df = combination_counts.to_frame(name='size').reset_index()
print(combination_counts_df.head())
```
```
      movie_a                                     movie_b  size
0  '71 (2014)                (500) Days of Summer (2009)      1
1  '71 (2014)                  10 Cloverfield Lane (2016)      1
2  '71 (2014)                            127 Hours (2010)      1
3  '71 (2014)  13 Assassins (Jûsan-nin no shikaku) (2010)      1
4  '71 (2014)                             13 Hours (2016)      1
```
Awesome, now you will use this aggregated DataFrame to generate recommendations for any movie in the dataset.
Making your first movie recommendations: Now that you have found the most commonly paired movies, you can make your first recommendations!
While you are not taking in any information about the person watching, and do not even know any details about the movie, valuable recommendations can still be made by examining what groups of movies are watched by the same people.
Finally, you will examine the movies often watched by the same people who watched Gladiator, and then use this data to recommend a movie to someone who has just watched it.
```python
import matplotlib.pyplot as plt

# Sort the counts from highest to lowest
combination_counts_df.sort_values('size', ascending=False, inplace=True)

# Find the movies most frequently watched by people who watched Gladiator
gladiator_df = combination_counts_df[combination_counts_df['movie_a'] == 'Gladiator (2000)']

# Get the top 10
gladiator_df_top_10 = gladiator_df.head(10)

# Plot the results
gladiator_df_top_10.plot.bar(x="movie_b")
plt.show()
```
You can see that The Matrix was the movie most commonly watched by those who watched Gladiator. This means it would be a good movie to recommend to Gladiator watchers, as the two films share similar fans (that includes me!).