BeautifulSoup for Python is a great tool for scraping but that doesn't necessarily mean everything is easy. Scraping is all about the unknown. And one of the golden rules is that you can't trust the page you're trying to scrape. You can't trust that it's telling you the truth.
For example, the site I'm trying to scrape for my 2nd major project at Metis is metacritic.com. As I was trying to scrape the list of critics, I went to http://www.metacritic.com/browse/movies/critic/score?num_items=100 and attempted to scrape 100 critics per page. I looped over the 8 pages there, expecting between 700 and 800 critics. But for some reason I kept getting around 600, with no errors anywhere to be found! I tried over and over again with little changes, hoping that I could solve the mystery. After about an hour, I finally realized that maybe my code was right and the page was wrong. Crazy, right?? It turned out my scraper was pulling between 73 and 88 critics per page, not 100. So I manually counted one of the pages and found out that it didn't actually have 100. D'OH!
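The lesson: count what you actually got instead of trusting what the page claims. Here's a minimal sketch of that sanity check with BeautifulSoup; the HTML and class name are made up for illustration, since Metacritic's real markup will differ.

```python
from bs4 import BeautifulSoup

# A toy page standing in for one listing page; the real class
# names on Metacritic will differ -- the point is to count,
# not to assume each page holds the advertised 100 items.
html = """
<ol>
  <li class="product">Critic A</li>
  <li class="product">Critic B</li>
  <li class="product">Critic C</li>
</ol>
"""

soup = BeautifulSoup(html, 'html.parser')
critics = soup.find_all('li', class_='product')
print('Scraped %d critics from this page' % len(critics))
```

Logging that per-page count as you loop would have surfaced the 73-to-88 discrepancy immediately.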
Anyways, that is only one of many examples of something stupid I have done. Now, aside from this, I had many problems that I've since learned from. I'm going to share some really simple solutions to some pretty elementary problems that one might encounter when first learning to scrape. If you're already an expert, you can probably skip all this!
Problem 1: Rescraping Multiple Times
The first problem was having to rescrape the same data each time I wanted to rerun a function. This was easily solved with Pickle, which I wrote about in this blog entry.
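A minimal sketch of that pattern looks like this; `scrape_critic_list` is a hypothetical stand-in for whatever slow scraping call you'd rather not repeat.

```python
import os
import pickle

CACHE = 'critic_list.pkl'

def scrape_critic_list():
    # Stand-in for the real, slow scraper.
    return ['critic-a', 'critic-b']

def get_critic_list():
    # If we've already scraped once, load the pickled result instead.
    if os.path.exists(CACHE):
        with open(CACHE, 'rb') as f:
            return pickle.load(f)
    # Otherwise do the slow scrape and cache it for next time.
    critics = scrape_critic_list()
    with open(CACHE, 'wb') as f:
        pickle.dump(critics, f)
    return critics
```

The first call does the real work; every rerun after that just reads the file.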
Problem 2: Waiting For a Scraper for Hours
Sometimes this is necessary. But what's not cool is when you scrape some data from a site for 3+ hours only to have an error at the end stop you from saving the data with Pickle. So what I've done since, to prevent this from happening, is to pickle multiple times.
```python
import pickle

count = 1
for critic in critic_list:
    review = metacritic.get_all_reviews_by_critic(critic)
    # Save each critic's reviews to its own pickle file as we go.
    with open('nytimes_review_list' + str(count) + '.pkl', 'wb') as f:
        pickle.dump(review, f)
    count += 1
```
In my sample code here, instead of waiting to Pickle after all the scraping is done, I have my loop here create a new Pickle file for each critic. This way, even if it craps out at the end, I still have a lot of data that I can use.
Of course, doing a good job with Python's try and except statements may also prevent this!
Problem 3: Losing Data Because of an Error
This can happen for a multitude of reasons. One way to prevent a function from stopping because of an error is to use try and except in Python. The logic goes something like this: if an error is raised inside the try block, execution moves on to the except block. It operates much like an if/else statement.
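A tiny illustration of that flow, with a hypothetical helper that tries a "last page" lookup and falls back to a default when the lookup fails:

```python
def last_page_count(page_data):
    # Try the happy path: read the pagination value.
    try:
        return int(page_data['last_page'])  # KeyError if the key is missing
    # On failure, fall back to assuming a single page.
    except (KeyError, ValueError):
        return 1

print(last_page_count({'last_page': '8'}))  # lookup succeeds -> 8
print(last_page_count({}))                  # lookup fails -> fallback of 1
```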
```python
reviews = []
try:
    pagination = soup.find(class_='last_page').a.text
    for num in range(0, int(pagination)):
        reviews += get_reviews_by_critic(slug, num)
except:
    try:
        reviews += get_reviews_by_critic(slug, 0)
    except:
        pass
```
In this sample code, the program searches for the class 'last_page'. If this weren't wrapped in a try block, an error would be raised, the program would stop, and you would lose everything. But with the try statement, if there is an error, it moves on to the except block. In the except block, I put another try statement that follows the same logic, and if that one fails too, it will just pass. Now, the problem with this is that we don't know why it passed! So to get a little bit more info, we're going to set up logging.
I think the easiest way to set up logging is to write it into a separate log file. It really isn't very difficult. All you need is this:
```python
import logging

logging.basicConfig(filename='debug.log', level=logging.DEBUG)
```
This code imports the logging module and then sets the log file to 'debug.log' in the same directory as your program, though of course you can change that to whatever path you want. Then we have to call it in the actual code.
```python
reviews = []
try:
    pagination = soup.find(class_='last_page').a.text
    for num in range(0, int(pagination)):
        reviews += get_reviews_by_critic(slug, num)
except:
    try:
        reviews += get_reviews_by_critic(slug, 0)
    except Exception as e:
        # Record the full traceback instead of silently passing.
        logging.exception(e)
```
Problem 4: My Program Has Been Scraping for Hours, and I Need To Go Home...
Now, logging is not only for debugging; it can also give you a little more information about the process. One of the things I found most annoying about scraping is that you often don't know how far along in the process you are. So I started logging multiple times during a program so I would know how far along it was and how much time was left.
```python
import datetime

reviews = []
try:
    pagination = soup.find(class_='last_page').a.text
    logging.debug('[' + datetime.datetime.now().strftime("%a-%m-%d %I:%M:%S") +
                  '] Found ' + pagination + ' pages to scrape')
    for num in range(0, int(pagination)):
        reviews += get_reviews_by_critic(slug, num)
except:
    try:
        logging.debug('[' + datetime.datetime.now().strftime("%a-%m-%d %I:%M:%S") +
                      '] Only found 1 page to scrape')
        reviews += get_reviews_by_critic(slug, 0)
    except Exception as e:
        logging.exception(e)
```
Adding in the timestamp with debugging now gives me not just more info, but also some idea of how long certain calls took!
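You can take this one step further and estimate how much time is left. Here's a rough sketch of a hypothetical helper that extrapolates from the average pace so far; you'd log its return value inside the scraping loop.

```python
import time

def progress_message(done, total, start_time):
    # Estimate remaining time from the average seconds-per-item so far.
    elapsed = time.time() - start_time
    rate = elapsed / done                # seconds per completed item
    remaining = rate * (total - done)    # extrapolate to the rest
    return '%d/%d done, ~%.0f seconds left' % (done, total, remaining)

start = time.time() - 60  # pretend the scrape started 60 seconds ago
print(progress_message(30, 120, start))  # e.g. '30/120 done, ~180 seconds left'
```

It's a crude estimate, since pages rarely take uniform time, but it's enough to decide whether you can go home while the scraper runs.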
What I Learned Today:
Python makes it incredibly easy to log errors.