Web Scraping In Python

15 minute read

Web scraping using Python

The goal of this post is to demonstrate web scraping in Python and to introduce basic NLP tasks, such as computing word frequencies.

The urllib and requests packages are used to scrape data from websites. Scraping means retrieving the HTML content of a particular website as text. urllib is the older way of scraping; requests is the newer, more high-level way, in which you don’t have to worry about the low-level details of making a web request.

Secondly, as an example, we will scrape the book “Moby Dick” from Project Gutenberg’s website and find the most frequent words used in the book. The following packages are used in this notebook:

  • urllib
  • requests
  • bs4 (BeautifulSoup)
  • nltk
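All of these except urllib, which ships with the Python standard library, can be installed from PyPI with pip install requests beautifulsoup4 nltk (the bs4 module is provided by the beautifulsoup4 distribution).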

Performing HTTP requests in Python using urllib

# Import packages
from urllib.request import urlopen, Request

# Specify the url
url = "http://www.datacamp.com/teach/documentation"

# This packages the request: request
request = Request(url)

# Sends the request and catches the response: response
response = urlopen(request)

# Print the datatype of response
print(type(response))

# Extract the response: html
html = response.read()

# Print the html
print(html[:300])

# Be polite and close the response!
response.close()

<class 'http.client.HTTPResponse'>
b'<!doctype html>\n<html lang="en" data-direction="ltr">\n  <head>\n    <link href="https://fonts.intercomcdn.com" rel="preconnect" crossorigin>\n      <script src="https://www.googletagmanager.com/gtag/js?id=UA-39297847-9" async="async" nonce="i4J7+2OrFDkkGYXjF0p+Mok+0NGAg9N4nLY3DoRLywg="></script>\n     '

You have just packaged and sent a GET request to “http://www.datacamp.com/teach/documentation” and then caught the response. You saw that such a response is an http.client.HTTPResponse object. The question remains: what can you do with this response?

Well, as it came from an HTML page, you could read it to extract the HTML; in fact, such an http.client.HTTPResponse object has an associated read() method.
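Note that read() returns bytes rather than a string, which is why the output above starts with b'. To work with it as text, you have to decode it first. A minimal sketch, assuming the page is UTF-8 encoded:

# read() returned bytes; decode them to get a regular string
html_text = html.decode('utf-8')
print(html_text[:300])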

Performing HTTP requests in Python using requests

Now that you’ve got your head and hands around making HTTP requests using the urllib package, you’re going to figure out how to do the same using the higher-level requests library. You’ll once again be pinging DataCamp servers for their “http://www.datacamp.com/teach/documentation” page.

Note that unlike in the previous exercises using urllib, you don’t have to close the connection when using requests!

# Import package
import requests

# Specify the url: url
url = "http://www.datacamp.com/teach/documentation"

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response: text
text = r.text

# Print the html
print(text[:300])
<!doctype html>
<html lang="en" data-direction="ltr">
  <head>
    <link href="https://fonts.intercomcdn.com" rel="preconnect" crossorigin>
      <script src="https://www.googletagmanager.com/gtag/js?id=UA-39297847-9" async="async" nonce="06YwVXsOmgVhbLgAu2tHeQGoO5lZFwilEi5wxEmBC88="></script>
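One more feature of requests worth knowing: the response object exposes the HTTP status code, and raise_for_status() turns failed requests into exceptions, which is handy when scraping many pages. A minimal sketch, reusing the response r from above:

# Check that the request succeeded (200 means OK)
print(r.status_code)

# Or raise an HTTPError automatically for 4xx/5xx responses
r.raise_for_status()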

Scraping the web in Python

We have just scraped HTML data from the web using two different packages: urllib and requests. You also saw that requests provides a higher-level interface, i.e., you need to write fewer lines of code to retrieve the relevant HTML as a string.

HTML is a mix of unstructured and structured data.

In general, to turn the HTML we got from the website into useful data, you will need to parse it and extract structured data from it. You can perform this task using the Python package BeautifulSoup.

The main object created and used with this package is called BeautifulSoup. It has a very useful associated method called prettify(). Let’s see how we can use BeautifulSoup. The first step is to scrape the HTML using the requests package.

Remember: The goal of using BeautifulSoup is to extract data from HTML.

Parsing HTML with BeautifulSoup

Use the BeautifulSoup package to parse, prettify and extract information from HTML.

# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Print the response
print(pretty_soup[:1000])
<html>
 <head>
  <title>
   Guido's Personal Home Page
  </title>
 </head>
 <body bgcolor="#FFFFFF" text="#000000">
  <h1>
   <a href="pics.html">
    <img border="0" src="images/IMG_2192.jpg"/>
   </a>
   Guido van Rossum - Personal Home Page
  </h1>
  <p>
   <a href="http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm">
    <i>
     "Gawky and proud of it."
    </i>
   </a>
  </p>
  <h3>
   <a href="http://metalab.unc.edu/Dave/Dr-Fun/df200004/df20000406.jpg">
    Who
I Am
   </a>
  </h3>
  <p>
   Read
my
   <a href="http://neopythonic.blogspot.com/2016/04/kings-day-speech.html">
    "King's
Day Speech"
   </a>
   for some inspiration.
  </p>
  <p>
   I am the author of the
   <a href="http://www.python.org">
    Python
   </a>
   programming language.  See also my
   <a href="Resume.html">
    resume
   </a>
   and my
   <a href="Publications.html">
    publications list
   </a>
   , a
   <a href="bio.html">
    brief bio
   </a>
   , assor

Turning a webpage into data using BeautifulSoup: getting the text

Next, you’ll learn the basics of extracting information from HTML soup. In this exercise, you’ll figure out how to extract the text from the BDFL’s webpage, along with printing the webpage’s title.

# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')

# Get the title of Guido's webpage: guido_title
guido_title = soup.title

# Print the title of Guido's webpage to the shell
print(guido_title)

# Get Guido's text: guido_text
guido_text = soup.get_text()

# Print Guido's text to the shell
print(guido_text)
<title>Guido's Personal Home Page</title>


Guido's Personal Home Page




Guido van Rossum - Personal Home Page
"Gawky and proud of it."
Who
I Am
Read
my "King's
Day Speech" for some inspiration.

I am the author of the Python
programming language.  See also my resume
and my publications list, a brief bio, assorted writings, presentations and interviews (all about Python), some
pictures of me,
my new blog, and
my old
blog on Artima.com.  I am
@gvanrossum on Twitter.  I
also have
a G+
profile.

In January 2013 I joined
Dropbox.  I work on various Dropbox
products and have 50% for my Python work, no strings attached.
Previously, I have worked for Google, Elemental Security, Zope
Corporation, BeOpen.com, CNRI, CWI, and SARA.  (See
my resume.)  I created Python while at CWI.

How to Reach Me
You can send email for me to guido (at) python.org.
I read everything sent there, but if you ask
me a question about using Python, it's likely that I won't have time
to answer it, and will instead refer you to
help (at) python.org,
comp.lang.python or
StackOverflow.  If you need to
talk to me on the phone or send me something by snail mail, send me an
email and I'll gladly email you instructions on how to reach me.

My Name
My name often poses difficulties for Americans.

Pronunciation: in Dutch, the "G" in Guido is a hard G,
pronounced roughly like the "ch" in Scottish "loch".  (Listen to the
sound clip.)  However, if you're
American, you may also pronounce it as the Italian "Guido".  I'm not
too worried about the associations with mob assassins that some people
have. :-)

Spelling: my last name is two words, and I'd like to keep it
that way, the spelling on some of my credit cards notwithstanding.
Dutch spelling rules dictate that when used in combination with my
first name, "van" is not capitalized: "Guido van Rossum".  But when my
last name is used alone to refer to me, it is capitalized, for
example: "As usual, Van Rossum was right."

Alphabetization: in America, I show up in the alphabet under
"V".  But in Europe, I show up under "R".  And some of my friends put
me under "G" in their address book...


More Hyperlinks

Here's a collection of essays relating to Python
that I've written, including the foreword I wrote for Mark Lutz' book
"Programming Python".
I own the official 
Python license.

The Audio File Formats FAQ
I was the original creator and maintainer of the Audio File Formats
FAQ.  It is now maintained by Chris Bagwell
at http://www.cnpbagwell.com/audio-faq.  And here is a link to
SOX, to which I contributed
some early code.




"On the Internet, nobody knows you're
a dog."

Turning a webpage into data using BeautifulSoup: getting the hyperlinks

In this exercise, you’ll figure out how to extract the URLs of the hyperlinks from the BDFL’s webpage. In the process, you’ll become close friends with the soup method find_all().

# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the title of Guido's webpage
print(soup.title)

# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))
<title>Guido's Personal Home Page</title>
pics.html
http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm
http://metalab.unc.edu/Dave/Dr-Fun/df200004/df20000406.jpg
http://neopythonic.blogspot.com/2016/04/kings-day-speech.html
http://www.python.org
Resume.html
Publications.html
bio.html
http://legacy.python.org/doc/essays/
http://legacy.python.org/doc/essays/ppt/
interviews.html
pics.html
http://neopythonic.blogspot.com
http://www.artima.com/weblogs/index.jsp?blogger=12088
https://twitter.com/gvanrossum
https://plus.google.com/u/0/115212051037621986145/posts
http://www.dropbox.com
Resume.html
http://groups.google.com/groups?q=comp.lang.python
http://stackoverflow.com
guido.au
http://legacy.python.org/doc/essays/
images/license.jpg
http://www.cnpbagwell.com/audio-faq
http://sox.sourceforge.net/
images/internetdog.gif
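Notice that some of these URLs are relative (pics.html, Resume.html, etc.). If you wanted to follow them, you would first resolve each one against the page’s URL; the standard library’s urllib.parse.urljoin handles both relative and absolute hrefs:

# Resolve each href against the base URL before following it
from urllib.parse import urljoin

for link in a_tags:
    href = link.get('href')
    if href is not None:
        print(urljoin(url, href))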

Example: Word frequency in Moby Dick

What are the most frequent words in Herman Melville’s novel, Moby Dick, and how often do they occur? In this notebook, we’ll scrape the novel Moby Dick from the website Project Gutenberg (which contains a large corpus of books) using the Python package requests. Then we’ll extract words from this web data using BeautifulSoup. Finally, we’ll dive into analyzing the distribution of words using the Natural Language Toolkit (nltk). The data science pipeline we’ll build in this notebook can be used to visualize the word frequency distribution of any novel you can find on Project Gutenberg. The natural language processing tools used here apply to much of the data that data scientists encounter, as a vast proportion of the world’s data is unstructured and includes a great deal of text.

Let’s start by loading in the three main Python packages we are going to use.

# Importing requests, BeautifulSoup and nltk
import requests
from bs4 import BeautifulSoup
import nltk

To analyze Moby Dick, we need to get the contents of Moby Dick from somewhere. Luckily, the text is freely available online at Project Gutenberg as an HTML file: https://www.gutenberg.org/files/2701/2701-h/2701-h.htm .

To fetch the HTML file with Moby Dick, we’re going to use the requests package to make a GET request for the website, which means we’re getting data from it. This is what you’re doing through a browser when visiting a webpage, but now we’re getting the requested page directly into Python instead.

# Getting the Moby Dick HTML 
r = requests.get('https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm')

# Setting the correct text encoding of the HTML page
r.encoding = 'utf-8'

# Extracting the HTML from the request object
html = r.text

# Printing the first 2000 characters in html
print(html[:2000])
<?xml version="1.0" encoding="utf-8"?>

<!DOCTYPE html
   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" >

<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
  <head>
    <title>
      Moby Dick; Or the Whale, by Herman Melville
    </title>
    <style type="text/css" xml:space="preserve">

    body { background:#faebd0; color:black; margin-left:15%; margin-right:15%; text-align:justify }
    P { text-indent: 1em; margin-top: .25em; margin-bottom: .25em; }
    H1,H2,H3,H4,H5,H6 { text-align: center; margin-left: 15%; margin-right: 15%; }
    hr  { width: 50%; text-align: center;}
    .foot { margin-left: 20%; margin-right: 20%; text-align: justify; text-indent: -3em; font-size: 90%; }
    blockquote {font-size: 100%; margin-left: 0%; margin-right: 0%;}
    .mynote    {background-color: #DDE; color: #000; padding: .5em; margin-left: 10%; margin-right: 10%; font-family: sans-serif; font-size: 95%;}
    .toc       { margin-left: 10%; margin-bottom: .75em;}
    .toc2      { margin-left: 20%;}
    div.fig    { display:block; margin:0 auto; text-align:center; }
    div.middle { margin-left: 20%; margin-right: 20%; text-align: justify; }
    .figleft   {float: left; margin-left: 0%; margin-right: 1%;}
    .figright  {float: right; margin-right: 0%; margin-left: 1%;}
    .pagenum   {display:inline; font-size: 70%; font-style:normal;
               margin: 0; padding: 0; position: absolute; right: 1%;
               text-align: right;}
    pre        { font-family: times new roman; font-size: 100%; margin-left: 10%;}

    table      {margin-left: 10%;}

a:link {color:blue;
		text-decoration:none}
link {color:blue;
		text-decoration:none}
a:visited {color:blue;
		text-decoration:none}
a:hover {color:red}

</style>
  </head>
  <body>
<pre xml:space="preserve">

The Project Gutenberg EBook of Moby Dick; or The Whale, by Herman Melville

This eBook is for the use of anyone anywh

Get text from HTML

This HTML is not quite what we want. However, it does contain what we want: the text of Moby Dick. What we need to do now is wrangle this HTML to extract the text of the novel. For this we’ll use the package BeautifulSoup.

Firstly, a word on the name of the package: Beautiful Soup? In web development, the term “tag soup” refers to structurally or syntactically incorrect HTML code written for a web page. What Beautiful Soup does best is to make tag soup beautiful again and to extract information from it with ease! In fact, the main object created and queried when using this package is called BeautifulSoup. After creating the soup, we can use its .get_text() method to extract the text.

# Creating a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, 'html.parser')

# Getting the text out of the soup
text = soup.get_text()

# Printing out text between characters 32000 and 34000
print(text[32000:34000])
which the beech tree
        extended its branches.” —Darwin’s Voyage of a Naturalist.
      

        “‘Stern all!’ exclaimed the mate, as upon turning his head, he saw the
        distended jaws of a large Sperm Whale close to the head of the boat,
        threatening it with instant destruction;—‘Stern all, for your
        lives!’” —Wharton the Whale Killer.
      

        “So be cheery, my lads, let your hearts never fail, While the bold
        harpooneer is striking the whale!” —Nantucket Song.
      

     “Oh, the rare old Whale, mid storm and gale
     In his ocean home will be
     A giant in might, where might is right,
     And King of the boundless sea.”
      —Whale Song.





 





      CHAPTER 1. Loomings.
    

      Call me Ishmael. Some years ago—never mind how long precisely—having
      little or no money in my purse, and nothing particular to interest me on
      shore, I thought I would sail about a little and see the watery part of
      the world. It is a way I have of driving off the spleen and regulating the
      circulation. Whenever I find myself growing grim about the mouth; whenever
      it is a damp, drizzly November in my soul; whenever I find myself
      involuntarily pausing before coffin warehouses, and bringing up the rear
      of every funeral I meet; and especially whenever my hypos get such an
      upper hand of me, that it requires a strong moral principle to prevent me
      from deliberately stepping into the street, and methodically knocking
      people’s hats off—then, I account it high time to get to sea as soon
      as I can. This is my substitute for pistol and ball. With a philosophical
      flourish Cato throws himself upon his sword; I quietly take to the ship.
      There is nothing surprising in this. If they but knew it, almost all men
      in their degree, some time or other, cherish very nearly the same feelings
      towards the ocean with me.
    

      There 

We now have the text of the novel! There is some unwanted content at the start and some at the end. We could remove it, but this content is so small compared to the text of Moby Dick that, to a first approximation, it is okay to leave it in.
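If you did want to trim it, one simple approach is to slice the text between landmark strings. A rough sketch; the two landmarks used here ('CHAPTER 1. Loomings' and 'End of Project Gutenberg') are assumptions you should verify against the actual file:

# Slice out just the novel using landmark strings as boundaries.
# rfind() for the start skips the earlier table-of-contents entry;
# both landmark strings are assumptions -- check them in your copy.
start = text.rfind('CHAPTER 1. Loomings')
end = text.find('End of Project Gutenberg')
if start != -1 and end != -1:
    text = text[start:end]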

Now that we have the text of interest, it’s time to count how many times each word appears, and for this we’ll use nltk, the Natural Language Toolkit. We’ll start by tokenizing the text, that is, removing everything that isn’t a word (whitespace, punctuation, etc.) and then splitting the text into a list of words.

# Creating a tokenizer
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

# Tokenizing the text
tokens = tokenizer.tokenize(text)

# Printing out the first 8 words / tokens 
print(tokens[:8])
['Moby', 'Dick', 'Or', 'the', 'Whale', 'by', 'Herman', 'Melville']

OK! We’re nearly there. Note that ‘Or’ above has a capital ‘O’, while in other places it may not; both ‘Or’ and ‘or’ should be counted as the same word. For this reason, we will build a list of all words in Moby Dick in which all capital letters have been made lowercase.

# A new list to hold the lowercased words
words = []

# Looping through the tokens and making them lowercase
for word in tokens:
    words.append(word.lower())
    

# Printing out the first 8 words / tokens 
print(words[:8])
['moby', 'dick', 'or', 'the', 'whale', 'by', 'herman', 'melville']
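The same lowercasing step can also be written as a one-line list comprehension, which is the more idiomatic Python:

# Equivalent, more idiomatic version of the loop above
words = [token.lower() for token in tokens]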

Load in the stop words

It is common practice to remove words that appear a lot in the English language such as ‘the’, ‘of’ and ‘a’ because they’re not so interesting. Such words are known as stop words. The package nltk includes a good list of stop words in English that we can use.

nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Shravan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

True
# Getting the English stop words from nltk
from nltk.corpus import stopwords
sw = stopwords.words('english')

# Printing out the first ten stop words
print(sw[:10])
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Remove stop words in Moby Dick

We now want to create a new list with all words in Moby Dick, except those that are stop words (that is, those words listed in sw). One way to get this list is to loop over all elements of words and add each word to a new list if it is not in sw.

# A new list to hold Moby Dick with no stop words
words_ns = []

# Appending to words_ns all words that are in words but not in sw
for word in words:
    if word not in sw:
        words_ns.append(word)
    

# Printing the first 5 words_ns to check that stop words are gone
print(words_ns[:5])
['moby', 'dick', 'whale', 'herman', 'melville']
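A small performance aside: sw is a plain list, so every word not in sw test scans the list from the front. Converting the stop words to a set makes each membership test roughly constant-time, which adds up over a novel’s worth of words:

# A set makes the membership test O(1) on average, versus O(n) for a list
sw_set = set(sw)
words_ns = [word for word in words if word not in sw_set]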

Our original question was:

What are the most frequent words in Herman Melville’s novel Moby Dick and how often do they occur?

We are now ready to answer that! Let’s create a word frequency distribution plot using nltk.

import matplotlib.pyplot as plt

To create the plot:

  • Create a frequency distribution object using the function nltk.FreqDist() and assign it to freqdist.
  • Use the plot method of freqdist to plot the 25 most frequent words.

The plot method of a FreqDist object takes the number of items to plot as its first argument. Make sure to set this argument; otherwise plot will try to plot all the words, which in the case of Moby Dick would take far too long.

# This command displays figures inline in the notebook
%matplotlib inline

# Creating the word frequency distribution
freqdist = nltk.FreqDist(words_ns)

# Plotting the word frequency distribution
freqdist.plot(25)

[Plot: frequency distribution of the 25 most frequent words in Moby Dick]

The word ‘whale’ is the most frequent in the book Moby Dick. No surprise there :)
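If you want the exact counts rather than a plot, a FreqDist behaves like a collections.Counter, so you can ask it directly for the most common words:

# Print the 10 most frequent words along with their counts
print(freqdist.most_common(10))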
