Scrapers are an Advertiser's Best Friend?

Posted on November 15, 2010

I’m going to file this one under “stuff you shouldn’t have to know but which is pretty useful in this imperfect world in which we live”. The truth of the matter is that sometimes the quickest way to get the information you need is to scrape a client’s site. In this post I will give a quick overview of how to do this.

I recently bought a sleeping bag from www.outdoorkit.co.uk. I found their site through a PPC advert but they were not advertising on the product I bought (even though they had it in stock and everything). I spy a missed opportunity in the long tail for whoever is managing their PPC spend.

I think that when adding a long list of product-specific keywords and ad groups to an account, spreadsheet munging is a perfectly acceptable practice. I know that in an ideal world every product would have a carefully customized and handcrafted ad group with bespoke ad texts, but this is rarely worth the effort for a keyword that will get <20 searches a month. Creating keyword combinations and ad texts in a spreadsheet is fairly simple as long as you have access to all the data you need. Getting that data is not normally very easy. The following conversation is typical:

Me: Hi, can you get me a list of all your products with prices and a brief description of the product category, please?

Marketing Manager: Sure, no problem. I’ll get onto the IT team for you. They will be able to make you a spreadsheet.

As soon as I hear that my question is headed to the IT department, I know it will almost certainly be quicker for me to get the information I need by scraping.

Let’s think about what information I need in order to include all products in their paid search account:

  1. Landing page URL
  2. Product name
  3. The type of product (sleeping bag, stove, hiking boot etc.)
  4. Maybe the price

The first thing I want is a list of landing page URLs. It would be possible to collect these by writing a program to spider the site or by using something like Xenu Link Sleuth, but in this case there is an easier way: Outdoor Kit have a sitemap.

Google Spreadsheet has a pretty cool function called “importXML” that we can use to retrieve the sitemap and extract the information we want. You can see my spreadsheet for the Outdoor Kit sitemap here. Excel also has the ability to import XML from the web, but I prefer Google Spreadsheet for this.

The importXML function takes two arguments; the first is the URL of the XML file (in this case “http://www.outdoorkit.co.uk/sitemap.xml”), and the second is an XPath expression to extract the important data. You don’t need to learn much XPath for this; all you need to remember is that “//loc” will extract all the URLs from the sitemap.
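In a cell, that looks like this (note that both arguments go inside quotes):

=importXML("http://www.outdoorkit.co.uk/sitemap.xml", "//loc")

Google Spreadsheet then fills the rows below that cell with one URL from the sitemap per row.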

The next two columns in the spreadsheet are filters on the list of URLs. I have one column that contains all the product pages, and one that contains all the category pages.
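The right filter depends entirely on the site’s URL structure, so treat this as a sketch: if product URLs happened to contain a fragment like “prod” (a made-up pattern, not necessarily what Outdoor Kit actually uses), a formula like this in the next column would keep the matching URLs and leave the rest blank:

=IF(ISNUMBER(SEARCH("prod", A2)), A2, "")

Copy it down the column and you have your product page list; swap in a different fragment to build the category page column the same way.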

The next step is to write a program that takes our list of web pages and extracts the information we want from each one. I recommend Python’s Beautiful Soup library for this, but there are many other options. Unless all your clients use the same template, you will have to rewrite the parsing part of your program for each client.

Parsing using Beautiful Soup is quite easy. Here is code that can be used to extract product landing page URLs, product names and product prices from a category page of Outdoor Kit:

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch the category page and hand it to the parser
page = urllib2.urlopen('INSERT PAGE URL HERE')
soup = BeautifulSoup(page)

# On Outdoor Kit's template, each product sits in a <div class="txt">
products = soup.findAll("div", "txt")

for product in products:
  # The link in the product's <h2> holds the landing page URL and the name;
  # the price is in a nested <div class="price">
  print str(product.h2.a.get('href')) + ", " + str(product.h2.a.contents[0]) + ", " + str(product.find('div', 'price').contents[0])

This will print out the required information in CSV format. Of course, this output should be written to a file rather than stdout if you actually want to use the data.
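Here is a minimal sketch of that, using Python’s csv module so the quoting is handled for you; the filename is a placeholder, and the products list comes from the snippet above:

import csv

# 'wb' because Python 2's csv module expects a binary-mode file
out = open('products.csv', 'wb')
writer = csv.writer(out)
for product in products:
  # One row per product: landing page URL, name, price
  writer.writerow([product.h2.a.get('href'),
                   product.h2.a.contents[0],
                   product.find('div', 'price').contents[0]])
out.close()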

Information about the product category is also available. Working this information into the program is left as an exercise for the reader.

So there you have it; this is why I think a little bit of knowledge about scraping and parsing can be useful for a Search Engine Marketer.