Easy Webpage Scraping with Python

To produce the tube station usage mashup I obtained the data from the TfL website. Unfortunately the data is not in an immediately usable format – rather than there being a CSV file to download, or a large HTML table, the data is presented as a separate webpage for each station and each year.

Luckily, Python makes it easy to get the data as a CSV file – although you do need to know a little Regex too, to extract the data you want. To construct the regular expressions needed, I used an excellent online tool, RegExr.
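As a small illustration of the idea, here is a minimal sketch of pulling values out of HTML with `re.findall`. The HTML fragment is made up for the example and is only assumed to resemble the station pages; the real TfL markup may differ.

```python
import re

# Hypothetical HTML fragment; the real TfL pages may be laid out differently.
sample_html = (
    '<option selected>Example Station</option>'
    '<tr><td align=right>1234</td><td align=right>567</td></tr>'
    '<strong>Total (in millions) = 1.80</strong>'
)

# Each pattern has exactly one capture group, so findall returns
# a list of the captured strings rather than the whole match.
name_re = r'selected>(.*?)</option>'
count_re = r'\salign=right>([0-9]{1,9}?)</td>'
total_re = r'\smillions\)\s=\s([0-9.]{1,9}?)</strong>'

names = re.findall(name_re, sample_html)
counts = re.findall(count_re, sample_html)
totals = re.findall(total_re, sample_html)

print(names, counts, totals)
```

Note that everything comes back as a string, so the numbers need converting with `int()` or `float()` if you want to do arithmetic on them.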

Once you have your regular expressions ready, you just use Python’s urllib2, re and csv libraries, and some loops, to download the webpages, extract the data, and write it into a CSV file.

Here’s the script I used. Note that a backslash at the end of a line indicates line continuation:

# Python 2 script (urllib2 became urllib.request in Python 3)
import urllib2, re, csv

stationnums = {2003:4, 2004:4, 2005:4, 2006:4, 
2007:4, 2008:6}

addressPre = ""  # base URL of the TfL station usage page goes here

# \s matches a whitespace character; \) matches a literal closing bracket
indRE = r'.*?\salign=right>([0-9]{1,9}?)</td>.*?'
totalRE = r'.*?\smillions\)\s=\s([0-9.]{1,9}?)</strong>.*?'
nameRE = r'.*?selected>(.*?)</option>'

resFile = open('results.csv', 'wb')  # binary mode, as the Python 2 csv module expects
resWriter = csv.writer(resFile, quoting=csv.QUOTE_MINIMAL)

for i in range(2003, 2009):
	for j in range(1, stationnums[i]+1):
		address = addressPre + "?id=" + str(j) \
		+ "&agekey=" + str(i)
		html = urllib2.urlopen(address).read()
		indRes = re.findall(indRE, html)
		totalRes = re.findall(totalRE, html)
		nameRes = re.findall(nameRE, html)
		if len(nameRes) > 0:
			# one row per page: year, station id, name, total, then the individual counts
			resWriter.writerow([i, j,
			nameRes[0], totalRes[0]]
			+ indRes)

resFile.close()

The stationnums values above fetch only the first few stations for each year, which keeps a test run short. To get all the data, change the value for each year to 304 (306 for 2008).
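The script above is written for Python 2, and urllib2 does not exist in Python 3, where its functionality lives in `urllib.request`. What follows is a rough Python 3 sketch of the same approach, not a tested replacement: the function names (`fetch`, `parse_page`, `scrape`), the UTF-8 decoding and the `example.invalid` URL are all assumptions made for illustration, and the demonstration at the end uses a made-up page so nothing is downloaded.

```python
import csv
import io
import re
import urllib.request

# Same patterns as the script above, without the redundant leading/trailing .*?
IND_RE = r'\salign=right>([0-9]{1,9}?)</td>'
TOTAL_RE = r'\smillions\)\s=\s([0-9.]{1,9}?)</strong>'
NAME_RE = r'selected>(.*?)</option>'


def fetch(url):
    """Download a page and decode it to text (UTF-8 is an assumption)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def parse_page(html):
    """Return [name, total, count, ...], or None if no station name is found."""
    names = re.findall(NAME_RE, html)
    if not names:
        return None
    totals = re.findall(TOTAL_RE, html)
    total = totals[0] if totals else ''  # leave the cell empty if no total is found
    return [names[0], total] + re.findall(IND_RE, html)


def scrape(base_url, station_counts, out_file, fetch_page=fetch):
    """Write one CSV row per (year, station id) page that has a station name."""
    writer = csv.writer(out_file, quoting=csv.QUOTE_MINIMAL)
    for year in sorted(station_counts):
        for station_id in range(1, station_counts[year] + 1):
            url = '{}?id={}&agekey={}'.format(base_url, station_id, year)
            row = parse_page(fetch_page(url))
            if row is not None:
                writer.writerow([year, station_id] + row)


# Offline demonstration with a made-up page and a placeholder URL.
fake_page = ('<option selected>Example Station</option>'
             '<td align=right>1234</td>'
             '<strong>Total (in millions) = 1.80</strong>')
buffer = io.StringIO()
scrape('http://example.invalid/page', {2003: 2}, buffer,
       fetch_page=lambda url: fake_page)
print(buffer.getvalue())
```

To write a real file, you would pass a file opened with `open('results.csv', 'w', newline='')` instead of the `StringIO` buffer, and leave `fetch_page` at its default so the pages are actually downloaded.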
