Easy Webpage Scraping with Python

To produce the tube station usage mashup I obtained the data from the TfL website. Unfortunately the data is not in an immediately usable format – rather than there being a CSV file to download, or a large HTML table, the data is presented as a separate webpage for each station and each year.

Luckily, Python makes it easy to get the data as a CSV file – although you do need to know a little Regex too, to extract the data you want. To construct the regular expressions needed, I used an excellent online tool, RegExr.

Once you have your regular expressions ready, you just use Python’s Urllib, RE and CSV libraries, and some loops, to download the webpages, get the data, and write it into a CSV file.

Here’s the script I used – note I’m using the back-slash character at the end of some lines below to indicate line continuation:

import urllib2, re, csv

stationnums = {2003:4, 2004:4, 2005:4, 2006:4, 
2007:4, 2008:6}

addressPre = "

indRE = '.*?salign=right>([0-9]{1,9}?)</td>.*?'
totalRE = '.*?smillions)s=s([0-9.]{1,9}?)</strong>.*?'
nameRE = '.*?selected>(.*?)</option>'

resFile = open('results.csv', 'w')
resWriter = csv.writer(resFile, quoting=csv.QUOTE_MINIMAL)

for i in range(2003, 2009):
	for j in range(1, stationnums[i]+1):
		address = addressPre + "?id=" + str(j) 
		+ "&agekey=" + str(i)
		html = urllib2.urlopen(address).read()
		indRes = re.findall(indRE, html)
		totalRes = re.findall(totalRE, html)
		nameRes = re.findall(nameRE, html)
		if len(nameRes) > 0:
			resWriter.writerow([i, j, 
			nameRes[0], totalRes[0]] 
			+ [e for e in indRes])

Change the stationnums values for each year to 304 (except 2008, to 306) to get all the data.

Data Graphics Technical

Forests of Great Britain

From the OS OpenData’s Meridian 2 dataset, which was released under a free licence today, here is the extent of forest cover across Great Britain – there is a dedicated polygon shapefile within the distribution showing just this:

As a general rule for orienteering, areas with good forest cover have the best orienteering maps. Scotland and Wales beat England hands down for cover, although Surrey’s doing not too badly at all. The rule doesn’t always hold though – there’s a big patch near Cambridge – more so, it appears than the Lake District, but the latter is considerably finer for orienteering in.

The map contains Ordnance Survey data © Crown copyright and database right 2010.

Mashups Technical

OS OpenData is here

It’s the first of April – but it’s not an April Fool – lots of Ordnance Survey medium-scale data has been released today, under a licence compatible with Creative Commons’ Attribution, i.e. you can do what you like with it as long as you attribute and don’t misrepresent the data source.

The best mirror for the data I’ve found is at MySociety – the OS’s own servers have been apparently overloaded since the release went live.

The first use I’ve made of the data is taking the “CodePoint Open” set of postcodes and locations, I’m now using this data as the postcode lookup for OpenOrienteeringMap. If you type in a UK postcode there, it should now take you to exactly the right place. Before, the lookup was using NPEmap data, which was pretty good in general, but someone did spot some glaring errors when they were using it, coincidently, yesterday.

The attribution statement, by the way, can be seen by mousing over the “Jump to Postcode” text.

OpenStreetMap Technical

Map Adornments with Cairo

For my OpenOrienteeringMap service, I have a added PDF creation facility, which produces a map, ready to print, with a title, north arrow, club logo and link back to the website.

The map itself is rendered in Mapnik, which uses Cairo, using Python and the pycairo bindings. To add the adornments, I’ve also made use of these bindings, and use them at the same time.

The adornments are shown above highlighted in yellow, and the purple control circles and numbers are also added directly using Cairo, rather than being rendered as geospatial objects in Mapnik. This blog post concentrates on the ones at the top of the sheet.

Here’s part of the Python script used to produce the PDF. S2P is a constant used to convert from metres (i.e. map units) to points (i.e. screen/paper units) and its value is 72/0.0254 (72 points per inch, 1/2.54 inches per cm).

I’ve generally use capitals for the names for the various positioning values, with a sort of naming convention – “W” is width, “WM” is west (i.e. left) margin. For example, ADORN_L_W is the width of the logo adornment, in metres. The “surface” sheet is made up of various “contexts” – in effect content boxes, which are positioned and scaled onto the surface, and filled with the content.

We need to import some modules:

import tempfile
import mapnik
import cairo
import urllib

Firstly, set up the PDF sheet, or “surface”, specifying its size:

file = tempfile.NamedTemporaryFile()
surface = cairo.PDFSurface(, PAPER_W*S2P, PAPER_H*S2P)

It’s important to specify the sizes accurately so that printers will print without trying to reduce the PDF, so rendering the scale inaccurate. For example, A4 landscape needs to have a PAPER_W of exactly 0.2970 and PAPER_H of 0.2100.

Then set up the map element – note Mapnik requires integers for the widths and heights:

map = mapnik.Map(int(MAP_W*S2P), int(MAP_H*S2P))
mapnik.load_map(map, styleFile)

Create a context on the surface to draw the map onto, and shift it to allow for margins on the page.

ctx = cairo.Context(surface)
ctx.translate(MAP_WM*S2P, MAP_NM*S2P)
mapnik.render(map, ctx)

Then, just add each adornment in the right place. First the title – the text has been supplied in the URL so is decoded first:

text = urllib.unquote(title)

Then, write the title onto the surface using show_text. Obviously, the server needs to have the fonts installed – I was using Deja Vu Sans initially, but Arial is used for the “regular” Street-O maps that I’m mimicing, so I installed the Microsoft Truetype Core Fonts for Linux:

ctx = cairo.Context(surface)
ctx.select_font_face("Arial Black", cairo.FONT_SLANT_NORMAL, cairo.FONT_WEIGHT_NORMAL)
ctx.translate(MAP_WM*S2P, (MAP_NM-ADORN_T_SM)*S2P)

The image above – and the ones below – are shown at 150% of their actual size on screen.

Now to add a scale caption:

ctx = cairo.Context(surface)
ctx.select_font_face("Arial", cairo.FONT_SLANT_NORMAL, cairo.FONT_WEIGHT_NORMAL)
text = "Scale 1:" + str(scale)
width = ctx.text_extents(text)[4]
ctx.translate((MAP_WM+MAP_W)*S2P-width-(ADORN_A_W+ADORN_L_W)*S2P, (MAP_NM-ADORN_S_SM)*S2P)

..and below it, a scalebar and indicator:

ctx = cairo.Context(surface)
ctx.select_font_face("Arial", cairo.FONT_SLANT_NORMAL, cairo.FONT_WEIGHT_NORMAL)
scaleBarMetres = 500
if scale < 10000:
    scaleBarMetres = 200
text = str(scaleBarMetres) + "m"
width = ctx.text_extents(text)[4]
barCaptionX = (MAP_WM+MAP_W-(ADORN_A_W+ADORN_L_W))*S2P-width
ctx.translate(barCaptionX, (MAP_NM-ADORN_S_SM)*S2P)

..the scalebar itself, making use of stroke:

scaleBarW = scaleBarMetres/float(scale)
ctx.move_to((-scaleBarW-ADORN_S_PADDING)*S2P, 0)
ctx.rel_line_to(0, -ADORN_S_LARGETICK*S2P)
ctx.rel_line_to(0, ADORN_S_LARGETICK*S2P)
ctx.rel_line_to(scaleBarW*S2P/2, 0)
ctx.rel_line_to(0, -ADORN_S_SMALLTICK*S2P)
ctx.rel_line_to(0, ADORN_S_SMALLTICK*S2P)
ctx.rel_line_to(scaleBarW*S2P/2, 0)
ctx.rel_line_to(0, -ADORN_S_LARGETICK*S2P)

The north-arrow is done in a similar way. close_path is used to create a proper triangle, with no “ends” that might result in ugly capping effects. fill makes it solid.

ctx = cairo.Context(surface)
ctx.translate((MAP_WM+MAP_W-ADORN_L_W)*S2P-width, CONTENT_NM*S2P)
ctx.move_to(0, 0)
ctx.line_to(0.001*S2P, 0.002*S2P)
ctx.line_to(-0.001*S2P, 0.002*S2P)

The “N” below the arrow is also drawn with lines, but using round line joins and line caps to produce a smooth letter.

ctx.move_to(0, 0.001*S2P)
ctx.line_to(0, 0.008*S2P)
ctx.move_to(-0.001*S2P, 0.005*S2P)
ctx.rel_line_to(0, -0.002*S2P)
ctx.rel_line_to(0.002*S2P, 0.002*S2P)
ctx.rel_line_to(0, -0.002*S2P)

Finally, a logo is added. This is a bit trickier – the logo is a PNG, but the surface is a PDF. The way I got around this is to create a temporary ImageSurface and then switch the surface that the context is on – the graphic also gets appropriately scaled:

logoSf = cairo.ImageSurface.create_from_png(home+"/logo.png")
ctx = cairo.Context(surface)
width = logoSf.get_width()*ADORN_L_SCALE
ctx.translate((MAP_WM+MAP_W)*S2P-width, CONTENT_NM*S2P)
ctx.set_source_surface(logoSf , 0, 0)

Finally, putting it all together:

return file


A Google Streetview History

Inspired by Alex – all the places I’ve ever lived:

Google’s UK streetview coverage isn’t quite ubiquitous – a couple of the images are snapped looking across fields from the nearest main road, and one of the images is pointing in the correct direction, but the house is completely hidden by all the trees in that general area.

The New York one should be pretty easy to spot, as well as my move to London and its terraced housing. The last picture is of my parent’s new house – the Google camera seems to have fallen onto its side when driving down this particular steep and twisty road.


Prettier Maps with Mapnik 0.7.0

[Updated] The Mapnik 0.7.0 release last month has added a number of features, including one that makes choropleth maps, and others with numerous adjacent polygons, more pretty. The PolygonSymbolizer has a new parameter, gamma, which, when set to a value between 0 and 1, causes the polygon edges to bleed into each other slightly, removing distinctive hairlines that allow the background colour to “shine through” the joins.

Here’s a before-and-after of a choropleth of England & Wales. We don’t care much for the boundaries, as they are Middle-Level Super Output Areas (MSOAs), which are administrative boundaries that people are not familiar with, unlike say county boundaries. So, using the gamma parameter, we can hide them.

Mapnik 0.6

Mapnik 0.7

The new parameter is also useful for my OpenOrienteeringMap project, particularly to make large estuaries, which are typically made up of multiple polygons, appear contiguous. However, applying too low a value appears to affect the anti-aliasing of line features, even ones away from the affected polygons. Here, the gamma is just applied to the “water” style. Note the thin white line crossing the water, near the black dotted line that also crosses it, disappears, but the paths and roads start to distort too:

Mapnik 0.6 – thin white line running across the river.

Mapnik 0.7+patch (Gamma 0.8) – the thin white line has gone. Other features are unchanged.

The map tile, by the way, is showing the Greenwich Foot Tunnel crossing the River Thames in east London.

[Update – The crossed-out section highlights a bug that has now been fixed by the Mapnik development team. I’ve patched my Mapnik build and the gamma changes now only appear where expected.]

OpenStreetMap Technical

Converting OS Eastings/Northings to Grid References in Python

[Updated] I needed to programatically convert a series of Ordnance Survey easting and northings (e.g. 325940, 673060) to “six-figure” grid references (e.g. NT259731) for a project I’m currently working on. It’s a pretty straightforward conversion – no reprojection, just a different way of expressing the same position on the British National Grid.

The specific use is that the 1km x 1km tiles from NPEMap are coding according to eastings & northings, but the calibration files required by TimSC‘s warp-gbos require grid references for the corners and calibration points.

Such a procedure is already well catered for in Perl, PHP and Javascript. Here is a straightforward conversion of some Javascript I found here into Python:

# Derived from # def getOSGridReference(e, n): import math # Note no I gridChars = "ABCDEFGHJKLMNOPQRSTUVWXYZ" # get the 100km-grid indices e100k = math.floor(e/100000) n100k = math.floor(n/100000) if e100k6 or n100k12: return '' # translate those into numeric equivalents # of the grid letters l1 = (19-n100k)-(19-n100k)%5+math.floor((e100k+10)/5) l2 = (19-n100k)*5%25 + e100k%5 letPair = gridChars[int(l1)] + gridChars[int(l2)] # strip 100km-grid indices from easting & northing, # round to 100m e100m = math.trunc(round(float(e)/100)) egr = str(e100m).rjust(4, "0")[1:] if n >= 1000000: n = n - 1000000 # Fix Shetland northings n100m = math.trunc(round(float(n)/100)) ngr = str(n100m).rjust(4, "0")[1:] return letPair + egr + ngr # test print getOSGridReference(96000,906000) # NA960060 print getOSGridReference(465149, 1214051) # HP651141  

[Update – fixed a bug that gave the wrong grid references for the Shetland Islands – their northings have seven figures, and similarly places in the far west or south that only have five figure easting or northings respectively :oops:]


Simple Choropleth Maps in Quantum GIS

It’s straightforward to create attractive choropleth maps in Quantum GIS, but there are a few things that can trip you up in the process.

The choropleth map I want to show is the distribution of deliberate fire-starting across London. A more advanced analysis would weight by ward area or population size, but I’m just showing the raw results, on the probably flawed assumption that wards in London have approximately equal populations and don’t vary in size hugely. Here’s how to do it, in Quantum GIS 1.4 “Enceladus” which was released a few days ago.

1. My boundary spatial data is a shapefile of the wards of London, which Quantum GIS can load without a problem, but my statistical data is in the form of a CSV file, showing the numbers of deliberate fires in each ward. It’s supplied by the new London Datastore, and CSV should not be a problem, but unfortunately the “Join attributes” functionality in Quantum GIS needs the data in DBF format.

Annoyingly, the most recent versions of Microsoft Office applications do not allow you to save data files as DBFs. However, OpenOffice will allow the conversion.

The big gotcha is that Quantum GIS is quite happy to load the CSV file as a layer, and will not complain if you select it as a vector layer to join in the “Join attributes” dialog box. It will simply go ahead and produce a result shapefile containing no data. This was the monitor-throwing bit. You have to instead select “Join dbf table”, and as the caption suggests, it needs to be a DBF.

There is also a bug in the “Join attributes” dialog box – available table columns are not removed when you select different layers in the drop downs, they just get appended to the list, so be careful to select the correct one.

I’ve used the “Continuous Color” option for the symbology setting as this allows me to quickly change the colours and remove the outlines – using “Graduated Symbol” would be a more authentic choropleth as the map would show discrete colours for each grouping.

Here is the resulting choropleth map, with the default adornments. Looks rather nice!

Data Graphics Technical

Spatial Interaction Modelling for Access to Higher Education

This is the first in a series detailing the projects I have worked on at UCL in the last academic year.

My main project through the last year has been to test a hypothesis, developed by Professor AG Wilson, that the flows of students moving from school to university can be approximately by spatial interaction modelling (SIM). Put simply, SIM is a variant of the 300-odd year old Newton’s Law of Universal Gravitation, i.e. the attraction between two masses is related by each of their masses and the distance between them. Replace the masses by the numbers of final-year pupils a school, and a university’s capacity, and make the distance decay exponential instead of inverse-square, and that’s the basics of the model. A similar theory has been applied to great effect by Joel Dearden of CASA, in his retail SIM, which has shown a “tipping point” explaining how supermarkets and out-of-town retail developments have become attractive to shoppers over the last forty years.

Of course, it’s a little more complicated than that, and even with the more complex model I’ve tested, a large number of simplifying assumptions have to be made.

The two main extra parameters that are added to the model are (1) that universities have an “attractiveness factor” above and beyond their size. I have used one of the common university league tables to provide values for this factor. And (2) the distance-decay is not uniform across all types of school students, but varies by their background. By splitting up the final-year school students by demographic, the variation in the distance-decay can be seen, and this is used to calibrate the model.


The seven OAC demographic supergroups are shown here – the horizontal scale is distance and is the same in each graph. (Only English-based school students going to English universities are considered in the study.) The vertical scale is the proportion of students, of that OAC supergroup, in each distance bucket. The actual number of students in each supergroup varies dramatically and this is not shown in the graphs.

The graphs show there is indeed considerable variation between supergroups in the “beta value” of the drop-off if approximated as exponential, and also in the “R-squared” fit to true exponential decay.

  1. Blue collar.
  2. City living – this group strongly favours London, Birmingham and Manchester, i.e. the same or other “big cities” in England, hence characteristic peaks appear at these distances – accentuated by the relatively small school-age population in this group.
  3. Countryside – this group rises before falling, as there is a minimum distance they need to travel to get to even their nearest university.
  4. Prospering suburbs – the lowest beta-value, in other words this group attaches the least importance to school-university distance.
  5. Constained by circumstance – similar to the first group.
  6. Typical traits – the “average” group which encouragingly also has an average looking graph.
  7. Multi-cultural – more distance-sensitive than the others – hence the very steep drop-off. This shows that people living in areas classified as multi-cultural will more strongly desire going to a university that is very local to their home.

Prof Wilson’s theory also factors in the subject that the student is studying (not all universities offer all subjects, and some are most are strong in certain subjects and weak in others), and their attainment at school (i.e. they might really want to study Maths at Oxford, and be at a school very near by, but if they get a D in Maths at A-Level, they aren’t going to be able to do that.)
Universities also come in two types – “recruiting”, where there are more places than students genuinely intending studying there, and “selective”, where there are more prospective students than places. One interesting effect of the recent economic downturn is the massive increase in people applying for university in 2009-10 – UCL saw a 12% increase for undergraduate courses, for example. This has had the effect of making more universities selective.

In order to consider two types in the same model, it was necessary to develop what is known as a “partially constrained” SIM. The details are for a future article, but, put simply, an iterative approach, assigning students to a university and then reassigning the weakest for over-capacity universities, is taken.

I built a GUI in Java – it’s the language I’m most comfortable with for “proper” programming – to quickly visualise the results and compare them with real-life flows. Here’s a bit of it:


This shows the perhaps not very surprising prediction that BIRM7s (multi-cultural school students living in Birmingham) are pretty likely to also go to university in Birmingham (AST = Aston, BCU = Birmingham City University, BIR = University of Birmingham), rather than elsewhere in the country.

When compared with the actual flows:
…the model under-predicts the flow to Birmingham City University, possibly because BCU’s desirability amongst this demographic group is mis-calibrated. Further-education students are also not present in the predicted model, but are included in the actual flows, so the two are not, as presented, normalised.

The model needs to be developed further before it can be presented formally. In particular, attainment is almost certainly a necessary component.


Quantum GIS 1.3

A new version of Quantum GIS, the free, open-source and user-friendly GIS, has been released today.

See the official blog for all the details, but the most exciting addition for me is the OpenStreetMap integration. Now, you can download data directly from the OSM servers, into the application. OSM-like stylings are applied to the data to make it look a bit more like a map, and you can easily can view all the tags and relations on each object. You can also edit the data directly in QGIS, as if it was normal GIS data, and then save it straight back to the server. This could potentially make it a good alternative to the Potlatch and JOSM editors that are currently used for the bulk of additions to OpenStreetMap. The integration isn’t perfect – I got a server-side bounding box error on my first attempt out downloading data which should have been caught locally – but it’s pretty impressive nonetheless.

With QGIS’s excellent python integration, it should be possible to write other plugins, to, for example, create well-shaped building outlines with perfect right angles. I think you can do this in JOSM too, but I’ve always found JOSM a rather unfriendly application to use.

Here’s some OpenStreetMap data of my local area, in Quantum GIS, with a road I added highlighted in red: