Scrape IMDB movie reviews

Author: Shravan Kuchkula

What is this post about?

It is often necessary to construct a dataset by scraping a website and extracting the relevant information. I will be using the IMDB website to pull user reviews for the top 250 Thriller movies and construct a dataset that will later be used for NLP tasks such as shallow parsing, clustering and sentiment analysis. In this post, the focus is on how to create the dataset and how to do shallow parsing by breaking down each user review into noun chunks.

The goal of this post is to show:

  • How to query the IMDB API?
  • How to use BeautifulSoup to scrape the API response and extract individual movie reviews?
  • How to construct a dataset from a website?
  • How to use spaCy and pattern to do shallow parsing of the user reviews?
  • How to pickle dataframes so that you can share your dataset with others?

Introduction:

The IMDB search tool (https://www.imdb.com/search/title/) allows us to filter movies based on certain criteria. For instance, suppose we want to select titles which are:

  • feature films
  • rated at least 4.0
  • having at least 50,000 votes
  • in the Thriller genre
  • sorted by user rating
  • limited to 250 movies

This translates to: url = "https://www.imdb.com/search/title/?title_type=feature&user_rating=4.0,10.0&num_votes=50000,&genres=thriller&view=simple&sort=user_rating,desc&count=250"
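
As a rough sketch, the same URL can be assembled from a dictionary of filters with the standard library's urlencode, which makes each criterion easier to change. The names SEARCH_ENDPOINT, search_params and search_url are illustrative and do not appear in the original code.

```python
from urllib.parse import urlencode

# Base endpoint of the IMDB title search tool mentioned above.
SEARCH_ENDPOINT = "https://www.imdb.com/search/title/"

# Each filter from the list above, expressed as a query parameter.
search_params = {
    "title_type": "feature",       # feature films
    "user_rating": "4.0,10.0",     # rated at least 4.0
    "num_votes": "50000,",         # at least 50,000 votes, no upper bound
    "genres": "thriller",          # Thriller genre
    "view": "simple",
    "sort": "user_rating,desc",    # sorted by user rating
    "count": 250,                  # limited to 250 movies
}

# safe="," keeps the commas readable instead of encoding them as %2C.
search_url = SEARCH_ENDPOINT + "?" + urlencode(search_params, safe=",")
print(search_url)
```

Because dictionaries preserve insertion order, the parameters appear in the same order as in the hand-written URL.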

We can then use this URL to make a request and get back the response. However, the response is not in JSON or XML format; it is an HTML page with 250 titles. We make use of BeautifulSoup to extract the link to the title page of each movie.

The following strategy was employed to collect a list of static links to individual user movie reviews:

  • Step 1: Make an API request to get 250 movie titles.
  • Step 2: Scrape the result to extract the links to individual movies.
  • Step 3: For each movie, extract the movie user reviews link.
  • Step 4: For each movie reviews link, get one positive and one negative user review link.

Thus, we should have 500 movie user review links (one positive, one negative for each of the 250 movies).

The following sections show the Python code used to do this. Several utility functions were developed to help with scraping and cleaning the data. All of these utility functions are grouped together in the imdbUtils module, whose code can be found in the Appendix section of this post.

Analysis

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
import itertools
from imdbUtils import *

pd.options.display.max_colwidth=500

Step 1: Make an API request to get 250 movie titles

The getSoup(url) function takes a URL and returns a BeautifulSoup object.

# API call to select:
## feature films
## which are rated at least 4.0
## having at least 50,000 votes
## in the Thriller genre
## sorted by user rating
## limit to 250 movies
# adjacent string literals are concatenated, so the URL contains no newline
url = ('https://www.imdb.com/search/title/?title_type=feature&user_rating=4.0,10.0'
       '&num_votes=50000,&genres=thriller&view=simple&sort=user_rating,desc&count=250')

# get the soup object for main api url
movies_soup = getSoup(url)

Inspecting the page shows that the individual movie links are a tags without a class attribute (class: None). In the section below, these tags are extracted from the soup object.

The individual movie links also have a specific pattern. They all start with /title and end with /. Making use of this pattern, we can extract just the titles we are interested in and filter out the remaining tags.

Due to the way in which this site is constructed, there are duplicates of the movie tags. We can quickly fix this problem by removing the duplicates from the list.

# find all a-tags with class:None
movie_tags = movies_soup.find_all('a', attrs={'class': None})

# filter the a-tags to get just the titles
# (logical `and` short-circuits; bitwise `&` is not meant for boolean tests)
movie_tags = [tag.attrs['href'] for tag in movie_tags
              if tag.attrs['href'].startswith('/title') and tag.attrs['href'].endswith('/')]

# remove duplicate links
movie_tags = list(dict.fromkeys(movie_tags))

print("There are a total of " + str(len(movie_tags)) + " movie titles")
print("Displaying 10 titles")
movie_tags[:10]
There are a total of 250 movie titles
Displaying 10 titles

['/title/tt0468569/',
 '/title/tt1375666/',
 '/title/tt0114814/',
 '/title/tt0114369/',
 '/title/tt0110413/',
 '/title/tt0102926/',
 '/title/tt0482571/',
 '/title/tt0407887/',
 '/title/tt0209144/',
 '/title/tt0054215/']
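
The filter-then-deduplicate step can be tried in isolation on a handful of strings. The sketch below uses made-up hrefs rather than a live page, and swaps the startswith/endswith test for a stricter regular expression; the pattern /title/tt<digits>/ is an assumption based on the links shown above.

```python
import re

# Hypothetical hrefs standing in for what the page's anchor tags might contain.
sample_hrefs = [
    "/title/tt0468569/",
    "/title/tt0468569/",           # duplicate of the first link
    "/title/tt1375666/",
    "/title/tt1375666/reviews",    # does not end with '/'
    "/name/nm0000288/",            # not a title link
]

# The whole href must look like /title/tt<digits>/ to be kept.
TITLE_LINK = re.compile(r"/title/tt\d+/")

title_links = [href for href in sample_hrefs if TITLE_LINK.fullmatch(href)]

# dict.fromkeys keeps the first occurrence of each key, so order is preserved.
unique_links = list(dict.fromkeys(title_links))
print(unique_links)
```

Using dict.fromkeys instead of set() matters here: the search results are sorted by user rating, and a set would discard that order.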

The movie user reviews link can easily be constructed from the title as shown below:

# movie links
base_url = "https://www.imdb.com"
movie_links = [base_url + tag + 'reviews' for tag in movie_tags]
print("There are a total of " + str(len(movie_links)) + " movie user reviews")
print("Displaying 10 user reviews links")
movie_links[:10]
There are a total of 250 movie user reviews
Displaying 10 user reviews links

['https://www.imdb.com/title/tt0468569/reviews',
 'https://www.imdb.com/title/tt1375666/reviews',
 'https://www.imdb.com/title/tt0114814/reviews',
 'https://www.imdb.com/title/tt0114369/reviews',
 'https://www.imdb.com/title/tt0110413/reviews',
 'https://www.imdb.com/title/tt0102926/reviews',
 'https://www.imdb.com/title/tt0482571/reviews',
 'https://www.imdb.com/title/tt0407887/reviews',
 'https://www.imdb.com/title/tt0209144/reviews',
 'https://www.imdb.com/title/tt0054215/reviews']
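
String concatenation works here because every tag starts and ends with a slash. As an alternative sketch, urljoin from the standard library resolves the path against the base URL; the variable names movie_tag and reviews_link are illustrative.

```python
from urllib.parse import urljoin

base_url = "https://www.imdb.com"
movie_tag = "/title/tt0468569/"   # first title from the list above

# urljoin resolves the absolute path against the base URL.
reviews_link = urljoin(base_url, movie_tag + "reviews")
print(reviews_link)
```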

Now that we have the user reviews link for each of the 250 movies, our next task is to get one positive and one negative user review link per movie.

The function getReviews() returns a tuple of negative and positive user review links for each movie, in that order.

# get a list of soup objects
movie_soups = [getSoup(link) for link in movie_links]

# get all 500 movie review links
movie_review_list = [getReviews(movie_soup) for movie_soup in movie_soups]

movie_review_list = list(itertools.chain(*movie_review_list))
print(len(movie_review_list))

print("There are a total of " + str(len(movie_review_list)) + " individual movie reviews")
print("Displaying 10 reviews")
movie_review_list[:10]
500
There are a total of 500 individual movie reviews
Displaying 10 reviews

['https://www.imdb.com/review/rw1921967/',
 'https://www.imdb.com/review/rw1908115/',
 'https://www.imdb.com/review/rw2286063/',
 'https://www.imdb.com/review/rw2276780/',
 'https://www.imdb.com/review/rw0374462/',
 'https://www.imdb.com/review/rw0374686/',
 'https://www.imdb.com/review/rw1746146/',
 'https://www.imdb.com/review/rw0370669/',
 'https://www.imdb.com/review/rw0342827/',
 'https://www.imdb.com/review/rw0342948/']
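
Fetching 250 pages in a tight loop can fail part-way through or put unnecessary load on the site. The helper below is a minimal sketch of one way to pause between requests and retry failures. The name fetch_all, the one-second delay and the three attempts are assumptions for illustration, not part of the original code.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0, max_attempts=3):
    """Call fetch(url) for each URL, pausing between requests and retrying failures.

    fetch is any callable that returns a result or raises an exception,
    for example the getSoup function from imdbUtils.
    """
    results = []
    for url in urls:
        for attempt in range(1, max_attempts + 1):
            try:
                results.append(fetch(url))
                break
            except Exception:
                if attempt == max_attempts:
                    raise                    # give up after the last attempt
                time.sleep(delay_seconds)    # wait before retrying
        time.sleep(delay_seconds)            # pause between successive pages
    return results
```

With this helper, the list of soup objects could be built as fetch_all(movie_links, getSoup) instead of the list comprehension.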

In summary, we have queried the IMDB API and scraped the response to obtain links to 500 individual movie reviews, with an equal mix of positive and negative reviews.

For each of these individual user movie review links found inside movie_review_list, our task is to extract the main review text.

  • The getReviewText() function extracts the user review text from the review link.
  • The getMovieTitle() function extracts the movie title from the review link.
  • review_sentiment indicates whether the user review is positive or negative. The lowest-rated review is labelled negative and the highest-rated review is labelled positive.
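
The labelling relies on the links alternating negative, positive, negative, positive. One way to make that pairing explicit is to attach the label while flattening the tuples, as in the sketch below. The variable names are illustrative; the four links are the first four entries of movie_review_list shown above.

```python
# (negative_link, positive_link) tuples, in the order getReviews returns them.
review_pairs = [
    ("https://www.imdb.com/review/rw1921967/", "https://www.imdb.com/review/rw1908115/"),
    ("https://www.imdb.com/review/rw2286063/", "https://www.imdb.com/review/rw2276780/"),
]

labels = ("negative", "positive")

# Pair every link with its label while flattening, so the two never drift apart.
labelled_reviews = [
    (link, label)
    for pair in review_pairs
    for link, label in zip(pair, labels)
]

for link, label in labelled_reviews:
    print(label, link)
```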

Construct a dataframe

Finally, a dataframe is constructed using these results.

# get review text from the review link
review_texts = [getReviewText(url) for url in movie_review_list]

# get movie name from the review link
movie_titles = [getMovieTitle(url) for url in movie_review_list]

# label each review with negative or positive
review_sentiment = np.array(['negative', 'positive'] * (len(movie_review_list)//2))

# construct a dataframe
df = pd.DataFrame({'movie': movie_titles, 'user_review_permalink': movie_review_list,
             'user_review': review_texts, 'sentiment': review_sentiment})
df.head()
movie user_review_permalink user_review sentiment
0 The Dark Knight https://www.imdb.com/review/rw1921967/ Many commenters said they were "blown away," so it probably has succeeded in blowing away the box office. I waited until the second week, and had high expectations from the 9 and 10 ratings it was receiving. But, fellow movie/film viewers (and especially great film lovers) ... really!? There's no doubt about the action and action and action in this one, and thus, the special effects. That's the main reason I enjoy such fantasy flicks -- the comic book genre. So, this one does more of it and ... negative
1 The Dark Knight https://www.imdb.com/review/rw1908115/ We've been subjected to enormous amounts of hype and marketing for the Dark Knight. We've seen Joker scavenger hunts and one of the largest viral campaigns in advertising history and it culminates with the actual release of the movie.Everything that's been said is pretty much spot on. This is the first time I can remember where a summer blockbuster film far surpasses the hype.For as much action as there is in this movie, it's the acting that makes it a great piece of work. Between all the pu... positive
2 Inception https://www.imdb.com/review/rw2286063/ I have to say to make such an impressive trailer and such an uninteresting film, takes some doing.Here you have most of the elements that would make a very good film. You have great special effects, a sci-fi conundrum, beautiful visuals and good sound. Yet the most important part of the film is missing. There is no plot, character or soul to this film. It's like having a beautiful building on the outside with no paint or decoration on the inside.It's an empty shell of a film. There is no ten... negative
3 Inception https://www.imdb.com/review/rw2276780/ What is the most resilient parasite? An Idea! Yes, Nolan has created something with his unbelievably, incredibly and god- gifted mind which will blow the minds of the audience away. The world premiere of the movie, directed by Hollywood's most inventive dreamers, was shown in London and has already got top notch reviews worldwide and has scored maximum points! Now the question arises what the movie has that it deserve all this?Dom Cobb(Di Caprio) is an extractor who is paid to invade the dre... positive
4 The Usual Suspects https://www.imdb.com/review/rw0374462/ After a gun fight on the docks leaves only one survivor with the majority dead, NYC agent Dave Kujan flies in to ensure that ex-cop Dean Keaton is really dead. During the questioning the survivor, Verbal Kint, tells of how events came to happen. Five criminals are brought together in a line up and decide to use the events to plan a job. However another survivor tells an extra story – one involving master criminal Kyser Soze. Kint reveals how the gang were forced into the fateful job by S... negative

Pickle the dataframe

# save the dataframe to a csv file.
df.to_csv('userReviews.csv', index=False)

# pickle the dataframe
df.to_pickle('userReviews.pkl')

# to validate
#temp = pd.read_csv('userReviews.csv')
#temp = pd.read_pickle('userReviews.pkl')
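
The commented-out validation can be turned into an explicit round-trip check. The sketch below uses a tiny stand-in dataframe with placeholder review text and a temporary directory, so it does not touch the real userReviews files.

```python
import os
import tempfile

import pandas as pd

# A tiny stand-in dataframe with the same columns as the one built above.
sample_df = pd.DataFrame({
    "movie": ["The Dark Knight", "The Dark Knight"],
    "user_review_permalink": [
        "https://www.imdb.com/review/rw1921967/",
        "https://www.imdb.com/review/rw1908115/",
    ],
    "user_review": ["placeholder negative text", "placeholder positive text"],
    "sentiment": ["negative", "positive"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "sample.pkl")
    csv_path = os.path.join(tmp_dir, "sample.csv")

    sample_df.to_pickle(pkl_path)
    sample_df.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# The pickle restores the frame exactly; the CSV is compared by cell values.
print(from_pickle.equals(sample_df))
print(from_csv.values.tolist() == sample_df.values.tolist())
```

A pickle preserves Python objects such as lists inside cells, whereas a CSV stores everything as text, which is worth keeping in mind when choosing a format for sharing.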

Shallow Parsing using spaCy

Shallow parsing, also known as light parsing or chunking, is a technique of analyzing the structure of a sentence to break it down into its smallest constituents (which are tokens such as words) and group them together into higher-level phrases.

In shallow parsing, there is more focus on identifying these phrases or chunks than on the internal syntax and relations inside each chunk, which is what grammar-based parse trees obtained from deep parsing provide. The main objective of shallow parsing is to obtain semantically meaningful phrases and observe relations among them.
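
To make the idea of chunking concrete, here is a deliberately simplified toy chunker. It is not how spaCy or pattern work internally: the tokens are tagged by hand, and the single rule (a run of determiners, adjectives and nouns containing at least one noun) is an assumption chosen only for illustration.

```python
# Hand-tagged tokens (word, coarse POS tag); a real pipeline would predict these tags.
tagged_sentence = [
    ("the", "DET"), ("special", "ADJ"), ("effects", "NOUN"),
    ("blew", "VERB"), ("away", "ADV"),
    ("the", "DET"), ("box", "NOUN"), ("office", "NOUN"),
]

def toy_noun_chunks(tagged_tokens):
    """Group runs of determiner/adjective/noun tokens that contain a noun."""
    chunks, current = [], []
    # The sentinel token at the end flushes the final run.
    for word, tag in tagged_tokens + [("", "END")]:
        if tag in {"DET", "ADJ", "NOUN"}:
            current.append((word, tag))
            continue
        # A run only counts as a noun chunk if it has at least one noun.
        if any(t == "NOUN" for _, t in current):
            chunks.append(" ".join(w for w, _ in current))
        current = []
    return chunks

print(toy_noun_chunks(tagged_sentence))
```

The chunker groups tokens into phrases without describing how the words inside each phrase relate to one another, which is the distinction between shallow and deep parsing drawn above.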

spaCy provides a convenient way to obtain Noun-chunks. spaCy inherently employs a default pipeline which first tokenizes the text and creates a Doc object. The Doc object is then passed through a series of components that are responsible for setting various Document level and Token level attributes. One such attribute is the Doc.noun_chunks attribute. Shown below is the default pipeline with POS-tagger, dependency parser and entity recognizer.

import spacy
nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names)
nlp.pipeline
['tagger', 'parser', 'ner']

[('tagger', <spacy.pipeline.pipes.Tagger at 0x147af7710>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x147a15e88>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x147a15ee8>)]

For each user review, calculate the noun chunks as follows:

  • The getNounChunks() function takes a user review text, tokenizes it and creates a Doc object. This Doc object is then passed through the pipeline. The function returns the noun chunks as a list of strings, which is then stored in the noun_chunks column of the dataframe.

# create the noun_chunks
df['noun_chunks'] = df['user_review'].apply(getNounChunks)
df.head()
movie user_review_permalink user_review sentiment noun_chunks
0 The Dark Knight https://www.imdb.com/review/rw1921967/ Many commenters said they were "blown away," so it probably has succeeded in blowing away the box office. I waited until the second week, and had high expectations from the 9 and 10 ratings it was receiving. But, fellow movie/film viewers (and especially great film lovers) ... really!? There's no doubt about the action and action and action in this one, and thus, the special effects. That's the main reason I enjoy such fantasy flicks -- the comic book genre. So, this one does more of it and ... negative [Many commenters, they, it, the box office, I, the second week, high expectations, the 9 and 10 ratings, it, fellow movie/film viewers, especially great film lovers, no doubt, the action, action, action, this one, thus, the special effects, the main reason, I, such fantasy flicks, the comic book genre, this one, it, the last Batman, the last Spiderman, the last super heroes, I, it, Iron Man, special effects, makeup, it, awards, I, all the action, special effects, movies, I, a movie, 45 minut...
1 The Dark Knight https://www.imdb.com/review/rw1908115/ We've been subjected to enormous amounts of hype and marketing for the Dark Knight. We've seen Joker scavenger hunts and one of the largest viral campaigns in advertising history and it culminates with the actual release of the movie.Everything that's been said is pretty much spot on. This is the first time I can remember where a summer blockbuster film far surpasses the hype.For as much action as there is in this movie, it's the acting that makes it a great piece of work. Between all the pu... positive [We, enormous amounts, hype, marketing, the Dark Knight, We, Joker, the largest viral campaigns, advertising history, it, the actual release, the movie, Everything, pretty much spot, the first time, I, a summer blockbuster film, the hype, as much action, this movie, it, the acting, it, work, all the punches, explosions, stunt-work, some great dialog work, All the actors, their moments, Bale's Batman, the definitive Batman, we, everything, this character, film, Martial arts skills, cunning, g...
2 Inception https://www.imdb.com/review/rw2286063/ I have to say to make such an impressive trailer and such an uninteresting film, takes some doing.Here you have most of the elements that would make a very good film. You have great special effects, a sci-fi conundrum, beautiful visuals and good sound. Yet the most important part of the film is missing. There is no plot, character or soul to this film. It's like having a beautiful building on the outside with no paint or decoration on the inside.It's an empty shell of a film. There is no ten... negative [I, such an impressive trailer, such an uninteresting film, some doing, you, the elements, a very good film, You, great special effects, a sci-fi conundrum, beautiful visuals, good sound, the most important part, the film, no plot, character, soul, this film, It, a beautiful building, the outside, no paint, decoration, the inside, It, an empty shell, a film, no tension, you, the characters, they, what, they, a corporation, another corporation, the human race, you, a dream environment, you, i...
3 Inception https://www.imdb.com/review/rw2276780/ What is the most resilient parasite? An Idea! Yes, Nolan has created something with his unbelievably, incredibly and god- gifted mind which will blow the minds of the audience away. The world premiere of the movie, directed by Hollywood's most inventive dreamers, was shown in London and has already got top notch reviews worldwide and has scored maximum points! Now the question arises what the movie has that it deserve all this?Dom Cobb(Di Caprio) is an extractor who is paid to invade the dre... positive [What, the most resilient parasite, An Idea, Nolan, something, unbelievably, incredibly and god- gifted mind, the minds, the audience, The world premiere, the movie, Hollywood's most inventive dreamers, London, top notch reviews, maximum points, the question, what, the movie, it, all this?Dom, Cobb(Di Caprio, an extractor, who, the dreams, various business tycoons, their top secret ideas, Cobb, the psyche, practiced skill, he, the memory, his late wife, Mal, (Marion Cotillard, who, a nasty h...
4 The Usual Suspects https://www.imdb.com/review/rw0374462/ After a gun fight on the docks leaves only one survivor with the majority dead, NYC agent Dave Kujan flies in to ensure that ex-cop Dean Keaton is really dead. During the questioning the survivor, Verbal Kint, tells of how events came to happen. Five criminals are brought together in a line up and decide to use the events to plan a job. However another survivor tells an extra story – one involving master criminal Kyser Soze. Kint reveals how the gang were forced into the fateful job by S... negative [a gun fight, the docks, only one survivor, the majority, NYC agent Dave Kujan, ex-cop Dean Keaton, the questioning, the survivor, Verbal Kint, events, Five criminals, a line, the events, a job, another survivor, an extra story, master criminal Kyser Soze, Kint, the gang, the fateful job, Soze, who, Soze, the men, what, a ship load, drugs, I, I, it, the cinema, I, it, The plot, a cliff hanger, the mystery, what, it, more questions, every answer, it, It, you, the issue, fact, you, the end, yo...

# Pickle this dataframe, which now contains the noun_chunks column
df.to_pickle('userReviewNounChunks.pkl')

Conclusion

500 user reviews were extracted from the IMDB website. These user reviews contain raw text which can be used to perform common NLP tasks such as parsing, clustering and sentiment analysis. Rather than hand-labelling each user review, we have programmatically pulled the lowest and highest rated user reviews for each movie and then constructed a dataset. Lastly, we have also pickled the dataframe so that anyone can start using this dataset instead of going through the entire process of pulling the data from the website.

Appendix

Code

Several utility functions were developed and grouped together inside the imdbUtils.py module. The contents of this module are shown below:

##############################
#  Module: imdbUtils.py
#  Author: Shravan Kuchkula
#  Date: 07/13/2019
##############################

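import requests
import spacy
from bs4 import BeautifulSoup

# spaCy model used by getNounChunks(); loaded once when the module is imported
nlp = spacy.load('en_core_web_sm')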

def getSoup(url):
    """
    Utility function which takes a url and returns a Soup object.
    """
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    return soup

def minMax(a):
    '''Returns the index of negative and positive review.'''
    
    # get the index of least rated user review
    minpos = a.index(min(a))
    
    # get the index of highest rated user review
    maxpos = a.index(max(a))
    
    return minpos, maxpos

def getReviews(soup):
    '''Function returns a negative and positive review for each movie.'''
    
    # get a list of user ratings
    user_review_ratings = [tag.previous_element for tag in 
                           soup.find_all('span', attrs={'class': 'point-scale'})]
    
    
    # find the index of negative and positive review
    n_index, p_index = minMax(list(map(int, user_review_ratings)))
    
    
    # get the review tags
    user_review_list = soup.find_all('a', attrs={'class':'title'})
    
    
    # get the negative and positive review tags
    n_review_tag = user_review_list[n_index]
    p_review_tag = user_review_list[p_index]
    
    # return the negative and positive review link
    n_review_link = "https://www.imdb.com" + n_review_tag['href']
    p_review_link = "https://www.imdb.com" + p_review_tag['href']
    
    return n_review_link, p_review_link

def getReviewText(review_url):
    '''Returns the user review text given the review url.'''
    
    # get the review_url's soup
    soup = getSoup(review_url)
    
    # find div tags with class text show-more__control
    tag = soup.find('div', attrs={'class': 'text show-more__control'})
    
    return tag.getText()

def getMovieTitle(review_url):
    '''Returns the movie title from the review url.'''
    
    # get the review_url's soup
    soup = getSoup(review_url)
    
    # find h1 tag
    tag = soup.find('h1')
    
    return list(tag.children)[1].getText()

def getNounChunks(user_review):
    '''Returns the noun chunks of a user review as a list of strings.'''
    
    # create the doc object
    doc = nlp(user_review)
    
    # get a list of noun_chunks
    noun_chunks = list(doc.noun_chunks)
    
    # convert noun_chunks from span objects to strings, otherwise it won't pickle
    noun_chunks_strlist = [chunk.text for chunk in noun_chunks]
    
    return noun_chunks_strlist
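
One property of the minMax helper above is worth checking: list.index returns the first matching position, so when every rating on a page is the same, both indices point at the same review. The sketch below reproduces that logic on made-up ratings; min_max_indices and has_distinct_extremes are hypothetical names introduced here.

```python
def min_max_indices(ratings):
    """Same logic as minMax: index of the lowest and of the highest rating."""
    return ratings.index(min(ratings)), ratings.index(max(ratings))

def has_distinct_extremes(ratings):
    """True when the lowest and highest ratings on the page actually differ."""
    return min(ratings) < max(ratings)

# Hypothetical ratings for the reviews listed on one page.
mixed = [8, 10, 3, 10, 3]
uniform = [10, 10, 10]

print(min_max_indices(mixed))          # first 3 is at index 2, first 10 at index 1
print(min_max_indices(uniform))        # both indices are 0: the same review twice
print(has_distinct_extremes(uniform))
```

A check like has_distinct_extremes could be used to skip or flag movies for which the negative and positive labels would end up on the same review.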

Alternative

Using pattern library to obtain noun-chunks

We can also extract the noun chunks by making use of the pattern package. Here, we invoke the parsetree function to get a tree representation of our document. We can then loop over the sentences and extract the noun chunks as shown here:

from pattern.en import parsetree
tree = parsetree(df.user_review[0])
for sentence_tree in tree:
    for chunk in sentence_tree.chunks:
        if chunk.type == 'NP':
            print(chunk)
Many commenters
they
it
the box office
I
the second week
high expectations
the 9 and 10 ratings it
fellow movie/film viewers
especially great film lovers
no doubt
the action
action
action
this one
the special effects
the main reason I
such fantasy flicks
the comic book genre
this one
it
the last Batman
the last Spiderman
the last super heroes
I
it
Iron Man
special effects
makeup
it awards.Maybe I
all the action
special effects
movies
I
a movie
45 minutes
I
it
It
way
Some reviewers
the acting
they
two-minute interludes
the fast and furious segments
action and destruction
I
the actors
their roles or parts
any parts
substance
Heath Ledger
a very good Joker
Nicholson
et al
those roles
Batman
Superman
plot
anyone
the genre
a story
Batman
the Joker
the Joker
the plot
the variety
mixture
change
unexpecteds
the action
I
the top
I
the list
this movie
plot
action
many cases
the likes
the Lord of the Rings
the Star Wars films
many great classics
real suspense
I
Casablanca
Vertigo
North
Northwest
me
I
the overall trends
the IMDb reviews
rate this film
the top.It
entertaining
It
I
the youth
today
I
a few grandchildren
I
them
this observation
I
this high acclaim
this film
very young audience
most members
a great deal
quality and varied films
their viewing history
those young people
I
a look
the top
250 list
other major review ratings
you rent
borrow
the other great films
I
you
you
an understanding
appreciation
some past customs
life styles
dress
behaviors
you
great entertainment
