Web Scraping with Python and BeautifulSoup

19 December 2013

In this short and to-the-point tutorial, we will use the well-known Python library BeautifulSoup to write a basic web scraping script that pulls some useful information from Kickstarter.com project pages.

Motivation

Knowing how to scrape the web is an excellent skill to have. The web is huge; thousands of new websites are created every day, so there is always new content to work with. By extension, there will always be plenty of web scraping projects that need doing (which translates into money for us).

And if you already know Python, you really have no excuse not to put the language to work in this area.

Technical Requirements

All you need is Python 2.x (urllib2 ships with it) and BeautifulSoup 4, which you can install with pip install beautifulsoup4.

Tutorial

Introduction

Simply put, Kickstarter is a crowdfunding site where people can get funding for their projects through the power of the internet.

I’ll be using the fairly interesting Spark Core project as a reference (its URL appears in the code below), but the techniques we use here will apply to any Kickstarter project page, so choose any one that interests you.

First, import the required modules.

# import required modules.
import urllib2
from bs4 import BeautifulSoup

Getting the HTML

We obviously need to get some HTML before we start using BeautifulSoup. How you go about getting the page’s HTML is something BeautifulSoup leaves up to you. In this tutorial we use urllib2 as it is a standard Python module.

We can retrieve HTML from a page by passing a URL into urllib2.urlopen() and calling the read() method on the HTTP response object it returns.

url  = 'http://www.kickstarter.com/projects/sparkdevices/spark-core-wi-fi-for-everything-arduino-compatible'
res  = urllib2.urlopen(url)  # response object
html = res.read()
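
If the plain urlopen() call gets rejected (some servers are picky about urllib2’s default user agent), one hedged fallback is to build a urllib2.Request with a browser-like User-Agent header and add some basic error handling. This is just a sketch and may not be necessary for your page; the User-Agent string below is an illustrative value.

# optional: send a browser-like User-Agent and handle HTTP errors
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # illustrative UA string
try:
    html = urllib2.urlopen(req).read()
except urllib2.HTTPError as e:
    print 'request failed with status', e.code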

The Soup

In order to start using BeautifulSoup, we need to initialize a BeautifulSoup object by passing our HTML string into the BeautifulSoup() constructor.

soup = BeautifulSoup(html)
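
Depending on your BeautifulSoup version, you may see a warning about not naming a parser. If so, you can pass one explicitly; html.parser ships with Python, and lxml is a faster alternative if you have it installed.

# explicitly pick a parser to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')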

Once we have this object initialized, we have access to various pieces of general page information right off the bat.

print soup.title.text  # page title
# Spark Core: Wi-Fi for Everything (Arduino Compatible) by Spark Devices — Kickstarter
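
The soup object gives us more than the title. As a rough sketch (exactly which tags exist depends on the page), here are a couple of other things you can grab right away:

# meta description, if the page has one
description = soup.find('meta', attrs={'name': 'description'})
if description is not None:
    print description.get('content')

# the first five links on the page
for link in soup.find_all('a')[:5]:
    print link.get('href')  # None for anchors without an href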

Getting More Information

This is where your browser’s developer tools come in very handy. I’m using Firefox (Nightly), but Chrome/Chromium and Safari also ship with a nice set of developer tools.

In any case, by right clicking on a page element and selecting Inspect Element or similar, we can jump to where the element is in the page’s source HTML.

Let’s say we wanted our scraper to know how many backers the Kickstarter project has.

By inspecting one of the elements that contain this information (there are two on each project page), we find that it has an HTML attribute called itemprop with a value of "Project[backers_count]".

We can use this attribute/value mapping to tell BeautifulSoup how to find the element.

We can see the number of backers, but now let’s find this element programmatically with our soup object by calling its find method. We’ll pass the type of tag (data here) as the first argument, followed by the attribute/value mappings we want to match. These mappings are passed in as keyword arguments.

num_backers_element = soup.find('data', itemprop='Project[backers_count]')
print num_backers_element
# <data class="Project373368980" data-format="number" data-value="4201" itemprop="Project[backers_count]" value="4201">4,201</data>
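
The same keyword-argument style works for other attributes too. The one gotcha is class, which is a reserved word in Python, so BeautifulSoup uses class_ instead. The class name below is a made-up example, not one taken from the Kickstarter page.

# match on any attribute; use class_ because class is a Python keyword
# 'some-class-name' is a hypothetical example
example = soup.find('div', class_='some-class-name')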

This returns a single BeautifulSoup Tag object. We can do tons of things with it, but most relevant right now is that we can access all of its attribute/value mappings by treating it like a normal Python dictionary, with attribute names as keys.

Let’s get the number of backers as we intended by getting the element’s value attribute.

num_backers = num_backers_element['value']
# '4201' <- yes, this is a string. cast it to an integer by calling int(num_backers)
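
If you want to see every attribute the element carries, its attrs property is a plain dictionary, and get() works the same way it does on any dictionary, returning None instead of raising a KeyError when an attribute is missing.

print num_backers_element.attrs           # dict of all attribute/value mappings
print num_backers_element.get('value')    # '4201'
print num_backers_element.get('missing')  # None, no KeyError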

So what if we wanted to find multiple elements that match our criteria? We use find_all. Remember how the number of backers appears twice on the project page? Let’s prove that here.

print soup.find_all('data', itemprop='Project[backers_count]')
# should return a list of all elements with an itemprop of 'Project[backers_count]'
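
find_all returns a list-like ResultSet, so you can loop over it and pull the same value attribute out of each match:

# each match should report the same backer count
for element in soup.find_all('data', itemprop='Project[backers_count]'):
    print element['value']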

Conclusion

If you enjoyed this tutorial, please note that I’ll be posting good stuff like this all the fucking time. Follow me on Twitter to know when there’s something new.