Intro to Recommendation Systems


What are recommendation engines?

Whether you realize it or not, recommendations drive many of our daily decisions: Netflix promotes shows you are likely to enjoy, and Amazon proposes other purchases that go well with what you are buying. These are examples of data-driven recommendations, which are the focus of this post.

What kind of data do I need?

Recommendation engines use feedback from users to find new, relevant items for them or for others, on the assumption that users who had similar preferences in the past are likely to have similar preferences in the future.

Recommendation engines benefit from a many-to-many relationship between the users giving the feedback and the items receiving it.

In other words, better recommendations can be made for an item that has received a lot of feedback, and more personalized recommendations can be made for a user who has given a lot of feedback.
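One quick way to check how much feedback each user and each item has is to count the records on both sides. The sketch below uses only the Python standard library; the `feedback` records and the names in them are invented for illustration and are not from a real dataset.

```python
from collections import Counter

# Hypothetical (user, item, rating) feedback records, invented for illustration.
feedback = [
    ("user_a", "item_1", 5),
    ("user_a", "item_2", 3),
    ("user_a", "item_3", 4),
    ("user_b", "item_1", 4),
    ("user_b", "item_3", 2),
    ("user_c", "item_1", 1),
]

# How many pieces of feedback has each user given?
ratings_per_user = Counter(user for user, _, _ in feedback)

# How many pieces of feedback has each item received?
ratings_per_item = Counter(item for _, item, _ in feedback)

print(ratings_per_user.most_common())
print(ratings_per_item.most_common())
```

Users and items with higher counts are the ones the engine has the most information about; those with only one or two records are harder to recommend for.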

What does the structure of the data look like? In a typical ratings table, users have rated multiple items, and each item has been rated by multiple users. This allows us to find users with similar preferences, which is valuable because users who had similar tastes in the past are likely to have similar tastes in the future.
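To make the idea of "users with similar preferences" concrete, here is a minimal sketch that compares two users on the items they have both rated. The data, the `mean_rating_gap` helper, and the choice of an average rating gap as the similarity measure are all assumptions made for this example; real recommendation engines typically use more robust measures.

```python
from collections import defaultdict

# Hypothetical (user, item, rating) records, invented for illustration.
feedback = [
    ("user_a", "item_1", 5),
    ("user_a", "item_2", 3),
    ("user_a", "item_3", 4),
    ("user_b", "item_1", 5),
    ("user_b", "item_3", 4),
    ("user_b", "item_4", 5),
    ("user_c", "item_1", 1),
    ("user_c", "item_3", 2),
]

# Arrange the records as user -> {item: rating}.
ratings = defaultdict(dict)
for user, item, rating in feedback:
    ratings[user][item] = rating


def mean_rating_gap(user_x, user_y):
    """Average absolute rating difference over items both users rated.

    Smaller means more similar. Returns None if they share no items.
    """
    shared = ratings[user_x].keys() & ratings[user_y].keys()
    if not shared:
        return None
    total = sum(abs(ratings[user_x][i] - ratings[user_y][i]) for i in shared)
    return total / len(shared)


gap_ab = mean_rating_gap("user_a", "user_b")
gap_ac = mean_rating_gap("user_a", "user_c")

# Items the more similar user has rated that user_a has not seen yet.
candidates = sorted(set(ratings["user_b"]) - set(ratings["user_a"]))

print(gap_ab, gap_ac, candidates)
```

In this toy data user_a agrees with user_b far more than with user_c, so the items user_b has rated that user_a has not are natural candidates to consider recommending.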

When should you use a recommendation engine?

Machine learning can be used for many different kinds of predictions, from whether a stock price will increase to detecting criminals laundering money. Recommendation engines target a specific kind of machine learning problem: they are designed to suggest a product, service, or entity to a user based on that user's own feedback and the feedback of other users.

Let's take some examples. Suggesting a movie a user would like, based on the genres of movies they have rated highly in the past, is well suited to a recommendation engine. Predicting whether that movie will do well at the box office, on the other hand, is better suited to a different kind of statistical model. Predicting whether a user would enjoy a restaurant, based on the restaurants they have enjoyed in the past, could be done with a recommendation engine. Predicting how much a house in the same area as the restaurant would cost, based on its size and historic house prices, could not.

Implicit vs Explicit data

Recommendation engines rely on data that records the preferences of users. The ways these preferences are measured fall into two main groups: implicit and explicit.

  • Explicit Data: Explicit data contains direct feedback from a user about how they feel about an item, such as a numerical rating, an upvote, or a downvote. Take a dataset like Yelp's, where users rate a restaurant out of 5 stars: the feedback from the user is explicitly recorded.

  • Implicit Data: Implicit data relies not on a user's direct rating but on the user's actions to summarize their preferences, such as choosing to watch certain programs or having a specific purchase history. A user's historic choice of music on Spotify is a good example: based on the songs someone has listened to, you can infer what kind of music they enjoy.

Here is another example of explicit and implicit feedback. The dataset listening_history_df contains columns identifying the users and the songs they listen to, along with:

  • Skipped Track: A Boolean column recording whether the user skipped the song or listened to it to the end.
  • Rating: The score out of 10 the user gave the song.
       User            Song Title  Skipped Track  Rating
0  User_001  Like a Rolling Stone           True       6
1  User_001               Imagine          False       2
2  User_001       Whats Going On          False       9
3  User_002               Respect          False       6
4  User_003       Good Vibrations           True       0

Here, Skipped Track is considered implicit data, whereas Rating is considered explicit data.
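The split between the two kinds of feedback can be sketched in a few lines. To keep the example free of extra libraries, the five rows shown above are written out as plain Python dictionaries rather than as a DataFrame. The names `listening_history`, `explicit`, and `implicit`, and the encoding of a fully played track as 1 and a skipped track as 0, are assumptions made for this illustration.

```python
# The five rows shown above, written as plain dictionaries.
listening_history = [
    {"User": "User_001", "Song Title": "Like a Rolling Stone", "Skipped Track": True, "Rating": 6},
    {"User": "User_001", "Song Title": "Imagine", "Skipped Track": False, "Rating": 2},
    {"User": "User_001", "Song Title": "Whats Going On", "Skipped Track": False, "Rating": 9},
    {"User": "User_002", "Song Title": "Respect", "Skipped Track": False, "Rating": 6},
    {"User": "User_003", "Song Title": "Good Vibrations", "Skipped Track": True, "Rating": 0},
]

# Explicit feedback: the score the user gave directly.
explicit = {
    (row["User"], row["Song Title"]): row["Rating"]
    for row in listening_history
}

# Implicit feedback: inferred from behaviour.
# Assumed encoding: 1 if the song was played to the end, 0 if it was skipped.
implicit = {
    (row["User"], row["Song Title"]): 0 if row["Skipped Track"] else 1
    for row in listening_history
}

for key in explicit:
    print(key, "rating:", explicit[key], "listened to end:", implicit[key])
```

Note that the two signals do not always agree in these rows: User_001 listened to Imagine to the end but rated it 2, and skipped Like a Rolling Stone but rated it 6. This is one reason implicit data is usually treated as a noisier indication of preference than an explicit rating.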

Pick the dataset that is more suitable for recommendations

In this case, restaurant_data_2 is more suitable for building recommendation engines.