Categories
Technical

CityDashboard – the API

Here is the API documentation for CityDashboard. It’s really not a very advanced API, and it’s not delivered in a “proper” format (e.g. XML or JSON); instead it’s available as a number of CSV/TXT-formatted files. It ain’t pretty but it works!

I’ve put together this documentation as a number of people have asked. However, it should still be considered to be a “private” API, so could change or break at any time, for one of three likely reasons:

  • I make a change to the API structure. If it’s a big change, I will attempt to preserve the old structure, possibly by using an extra parameter (e.g. &v=2) to indicate the new one.
  • Our server goes down. Certainly not inconceivable!
  • One of the upstream data providers changes their feed format in such a way that it causes the CityDashboard data to freeze up. Again, quite likely, particularly as generally I don’t have a formal agreement with the upstream organisations.

1. Finding the cities available

The list of cities available can be found at:
http://citydashboard.org/cities.txt

Notes:

  • Comma-separated.
  • The city_id is the first field of each line.
  • Ignore lines starting with #.
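Putting those notes together, here’s a minimal sketch in Python (using urllib2, to match the other code on this blog) that fetches the list and prints the available city_ids. It assumes nothing beyond the notes above:

import urllib2

# Fetch the master city list
cities_txt = urllib2.urlopen("http://citydashboard.org/cities.txt", timeout=30).read()

city_ids = []
for line in cities_txt.splitlines():
    line = line.strip()
    if not line or line.startswith("#"):   # skip comments and blank lines
        continue
    city_ids.append(line.split(",")[0])    # the city_id is the first field

print city_ids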

2. Finding the modules available for a city

The list of modules available for a city can be found at:
http://citydashboard.org/cities/[city_id].txt

Notes:

  • Comma-separated.
  • The module_id is the first field of each line.
  • Ignore lines starting with #.

3. Getting the data for each module

The data for each module can be found at:
http://citydashboard.org/modules/[module_id].php?city=[city_id]&format=[csv|html|blob]

Example:
http://citydashboard.org/modules/weather_cr.php?city=london&format=csv

Notes:

  • Comma-separated or HTML, depending on the format parameter you use.
  • Ignore lines starting with #.
  • The CSV format will be most useful, as the HTML and “blob” formats are specifically designed for the CityDashboard website. However, many of the modules don’t (yet) have a CSV format feed available – a blank page will instead be returned.
  • The first line in each CSV file contains a number in the second field. This is the number of seconds between each update – i.e. if this is 5, then the file won’t update more than once every 5 seconds.
  • Modules which have a CSV feed have an “m” included in the sixth field of the appropriate row in the [city_id].txt file (typical values: d, db, dbm, etc.) – the sketch below uses this flag.
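As an illustration of steps 2 and 3 together, here’s a similarly hedged Python sketch: it lists the modules for a city, keeps those that advertise a CSV feed via the “m” flag, fetches the data for the first one, and reads the update interval from the first line. The exact make-up of the flags field, and of the first CSV line, is assumed from the notes above:

import urllib2

city_id = "london"

# Step 2: list the modules for the city, keeping those with a CSV feed ('m' in the sixth field)
modules_txt = urllib2.urlopen("http://citydashboard.org/cities/%s.txt" % city_id, timeout=30).read()
csv_modules = []
for line in modules_txt.splitlines():
    line = line.strip()
    if not line or line.startswith("#"):
        continue
    fields = line.split(",")
    if len(fields) > 5 and "m" in fields[5]:   # e.g. "dbm"
        csv_modules.append(fields[0])          # the module_id is the first field

# Step 3: fetch the CSV data for one of those modules
module_id = csv_modules[0]                     # e.g. weather_cr
url = "http://citydashboard.org/modules/%s.php?city=%s&format=csv" % (module_id, city_id)
data = urllib2.urlopen(url, timeout=30).read()

# The update interval (in seconds) is the second field of the first line
first_line = data.splitlines()[0].lstrip("# ")
print module_id, "updates every", first_line.split(",")[1], "seconds"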

By the way, the module list will most likely be changing very soon to add a couple of important fields that I overlooked – first of all, the source URL will be in a field of its own, and secondly I will add in a proper attribution statement for each source.

Categories
Technical

The MySQL Groupwise Maximum Problem

There is a surprisingly difficult task to solve with MySQL queries, which I’ve been spending some time trying to do – the Groupwise Maximum problem.

This is the name for the type of query that I was trying, although in fact I am trying to find a set of minimum (rather than maximum) values.

The question: At what time each day do we see the minimum number of available bikes? (A research question – finding the answer will tell us something about the commuting habits of the city.)

The source data table:

timestamp bikes_available
2012-05-29 17:12:00 4265
2012-05-29 17:14:00 4251
2012-05-29 17:16:00 4251
2012-05-29 17:18:00 4253
2012-05-29 17:20:00 4259
etc…

My initial thoughts were:

select date(timestamp), time(timestamp), min(bikes) from bike_agg_london group by date(timestamp)

date time bikes_available
2012-05-22 00:00:01 4662
2012-05-23 00:00:02 4600
2012-05-24 00:00:02 4594
2012-05-25 00:00:01 4805
2012-05-26 00:00:01 4144
2012-05-27 00:00:02 3710

This produces the minimum bikes number for each day, which is great, but the timestamp included is just the first one of each day (in fact it could be a randomly chosen timestamp from within the day, but MySQL’s internal logic happens to pick the first one). This is because time(timestamp) is not part of the “group by” (aggregate) clause, and any field in the query that isn’t aggregated should be included in the group by. I don’t want to aggregate the time(timestamp) though – I want the time associated with the row containing the minimum bikes value, not the maximum, minimum or average (etc.) of the times.

Here are 10 ways to solve the problem, although I tried a few of them and they didn’t work for me.

Here’s the technique that did work for me (the second solution from that link).

Here’s the SQL, which ran quite quickly (~18 seconds for around 166,000 rows representing 600 days):

select date(b1.timestamp) theday1, b1.timestamp, b1.bikes
from bike_agg_london b1
inner join (
    select date(timestamp) as theday2, min(bikes) as min_bikes
    from bike_agg_london
    group by date(timestamp)
) b2
on (date(b1.timestamp) = b2.theday2 and b1.bikes = b2.min_bikes)

date time bikes_available
2012-05-22 2012-05-22 18:22:01 4662
2012-05-23 2012-05-23 18:12:02 4600
2012-05-23 2012-05-23 18:16:01 4600
2012-05-24 2012-05-24 18:18:01 4594
2012-05-24 2012-05-24 18:20:02 4594
2012-05-25 2012-05-25 17:54:02 4805
2012-05-26 2012-05-26 15:56:01 4144
2012-05-27 2012-05-27 17:24:01 3710

There is one problem: if multiple rows within a day share the same min(bikes) value, they each appear. Using distinct won’t get rid of these, because the time(timestamp) does vary. The fix is to use an additional wrapper (the subquery aliased b3) to eliminate these duplicate rows:

select theday1, time(min(timestamp)), bikes
from (
    select date(b1.timestamp) theday1, b1.timestamp, b1.bikes
    from bike_agg_london b1
    inner join (
        select date(timestamp) as theday2, min(bikes) as min_bikes
        from bike_agg_london
        group by date(timestamp)
    ) b2
    on (date(b1.timestamp) = b2.theday2 and b1.bikes = b2.min_bikes)
) b3
group by theday1, bikes

date time bikes_available
2012-05-22 18:22:01 4662
2012-05-23 18:12:02 4600
2012-05-24 18:18:01 4594
2012-05-25 17:54:02 4805
2012-05-26 15:56:01 4144
2012-05-27 17:24:01 3710
Categories
Mashups Technical

WMS and SVG Input with Mapnik and Cairo

A rather techy post from me, just before the Final Project Post for GEMMA (my current project at CASA) is submitted, summing up the project and applications. Why? I wanted to share two rather specialist bits of GEMMA that required quite a bit of head-scratching and trial-and-error, in the hope that they will be of use to some developer out there in their own work. I’m cross-posting this from the GEMMA project blog.

One of the core features of GEMMA that I have always desired is the ability to output the created map as a PDF. Not just any PDF, but a vector-based one – the idea that it will be razor-sharp when you print it out, rather than just looking like a screen grab. I had written a basic PDF creator, using Mapnik and Cairo, for OpenOrienteeringMap (OOM), an earlier side-project, and because the GEMMA project is about embracing and extending our existing technologies and knowledge at CASA, rather than reinventing the wheel, I was keen to utilise this code in GEMMA.

Most of the layers were quite straightforward – the two OpenStreetMap layers (background and feature) are very similar indeed to OOM, while the Markers and Data Collector layers were also reasonably easy to do – once I had imported (for the former) and hand-crafted (for the latter) suitable SVG images, so that they would stay looking nice when on the PDF. The trickiest layer was the MapTube layer. For the terms of this project, the MapTube imagery is not a vector layer, i.e. we are not using WFS. However I was still keen to include this layer in the PDF, so I turned to Richard Milton (the creator of MapTube) and discovered there is a WMS service that will stitch together the tiled images and serve them across the net. I could then combine requesting the WMS images on the GEMMA server (not the client!), converting them to temporary files, and using a RasterSymbolizer in Mapnik 2 with an associated GDAL datasource.

The trickiest part was setting georeferencing information for the WMS image. Georeferencing is used to request the image, but it is also needed to position the image above or below the other Mapnik layers. Initially it looked like I would have to manually create a “worldfile”, but eventually I found a possibly undocumented Mapnik feature which allows manual specification of the bounding box.

I’ve not seen this done anywhere else before, although I presume people have just done it and not written it down on the web, so here’s my take, in Python.
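One note before the snippets: they assume a Mapnik Map object (map) and a Cairo PDF surface (surface) have already been created, roughly as in the sketch below. The sheet dimensions, scale factor and extent here are illustrative values of mine, not GEMMA’s actual configuration:

import mapnik
import cairo
import urllib2     # used by the snippets that follow
import tempfile

# Illustrative sheet dimensions (metres of paper) and a metres-to-PDF-points scale factor
MAP_W, MAP_H = 0.2, 0.2            # a 20cm x 20cm map area
S2P = 72 / 0.0254                  # 72 points per inch, expressed per metre

# The usual Proj4 string for EPSG:900913 (spherical web Mercator)
epsg900913 = "+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0.0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs +over"

map = mapnik.Map(int(MAP_W * S2P), int(MAP_H * S2P), epsg900913)
map.zoom_to_box(mapnik.Box2d(-14000, 6707000, -10000, 6711000))   # an illustrative extent, in EPSG:900913

surface = cairo.PDFSurface("map.pdf", MAP_W * S2P, MAP_H * S2P)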

First we get our WMS image. MAP_W and MAP_H are the size of the map area on the “sheet” in metres. We request it with a resolution of 5000 pixels per metre, which should produce a crisp looking image without stressing the server too much.

mb = map.envelope()
url = maptube_wms_path + "/?request=GetMap&service=WMS&version=1.3.0"
url = url + "&format=image/png&crs=EPSG:900913"
url = url + "&width=" + str(int(MAP_W*5000)) 
url = url + "&height=" + str(int(MAP_H*5000))
url = url + "&bbox=" + str(mb.minx) + "," + str(mb.miny) + "," 
url = url + str(mb.maxx) + "," + str(mb.maxy)
url = url + "&layers=MAPID" + str(maptubeid)

furl = urllib2.urlopen(url, timeout=30)

Mapnik doesn’t work directly with images, but files, so we create a temporary file:

ftmp = tempfile.NamedTemporaryFile(suffix = '.png')
filename = ftmp.name
ftmp.write(furl.read())
ftmp.flush()  # make sure the image is fully on disk before Mapnik/GDAL reads it

Next we set up the layer and style. It’s nice that we can pass the opacity, set on the GEMMA website, straight into the layer in Mapnik.

style = mapnik.Style()
rule = mapnik.Rule()
rs = mapnik.RasterSymbolizer()
rs.opacity = opacity
rule.symbols.append(rs)
style.rules.append(rule)
map.append_style(filename,style)
lyr = mapnik.Layer(filename)

Here’s the key step, where we manually provide georeferencing information. epsg900913 is the usual Proj4 string for this coordinate reference system.

lyr.datasource = mapnik.Gdal(base='',file=filename, bbox=(mb.minx, mb.miny, mb.maxx, mb.maxy)) #Override GDAL
lyr.srs = epsg900913

Finally:

lyr.styles.append(filename)
map.layers.append(lyr)

I’m excited about one other piece of code in the PDF generation process, as again it involves jumping through some hoops that are only lightly documented – adding an SVG “logo”, the SVG in this case being the GEMMA gerbil logo that Steve (co-developer) created in Illustrator. Cairo does not allow native SVG import (only export) but you can use the RSVG Python package to pull this in. I’m being a bit lazy in hard-coding widths and scales here, because the logo never changes. There are more sophisticated calls, e.g. svg.props.width, that could be useful.

svg = rsvg.Handle(gemma_path + "/images/logo.svg")
ctx = cairo.Context(surface)
ctx.translate((MAP_WM+MAP_W)*S2P-ADORN_LOGO_SIZE, CONTENT_NM*S2P)
ctx.scale(0.062, 0.062)
svg.render_cairo(ctx)

Note that we are calling render_cairo, a function in RSVG, rather than the native Cairo rendering that we use for all the other layers in the PDF.
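For completeness, here’s a hedged sketch of how the final assembly might look (assuming Mapnik has been built with Cairo support, which is what lets it draw straight onto the PDF surface): render the Mapnik layers first, add the SVG logo on top as above, then close the surface so the PDF is written out.

mapnik.render(map, surface)   # draw all the Mapnik layers (OSM, markers, the WMS raster) onto the PDF surface

# ... render the SVG logo with rsvg, as shown above ...

surface.finish()              # finalise the PDF file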

The screenshot above contains data from the OpenStreetMap project.

Categories
Technical

The Ease of Monitoring with Munin

I’m currently using Munin to keep an eye on the various services running on my main server at UCL CASA. Munin is a monitoring package. It is simple to install (on Ubuntu, sudo apt-get install munin on your primary server, sudo apt-get install munin-node on all servers you want to monitor), and brilliantly simple to set up and – most importantly – extend.

Out of the box, Munin will start monitoring and graphing various aspects of your server, such as CPU usage, memory usage, disk space and uptime. The key is that everything is graphed, so that trends can be spotted and action taken before it’s too late. Munin always collects its data every five minutes, and always presents the graphs in four timescales: the last 24 hours, the last 7 days, the last month and the last year.

Extending Munin to measure your own service or process is quite straightforward. All you need is a shell-executable script which returns key/value pairs representing the metric you want to measure. You also need to add a special response for when Munin wants to configure your plugin. This response sets graph titles, information on what you are measuring, and optionally thresholds for tripping alerts.

Here’s a typical response from a Munin plugin script, “uptime”; this is what Munin uses to construct the graph:

uptime.value 9.29

Here’s what you get when you call it with the config parameter:

graph_title Uptime
graph_args --base 1000 -l 0 
graph_scale no
graph_vlabel uptime in days
graph_category system
uptime.label uptime
uptime.draw AREA

Munin takes this and draws (or reconfigures) the trend graph, with the data and a historical record. Here’s the daily graph for that:

I have two custom processes being monitored with Munin. The first is my minute-by-minute synchronisation of my database (used by OpenOrienteeringMap and, soon hopefully*, GEMMA) with the UK portion of the master OpenStreetMap database. The metric being measured is the time lag. Normally this is around a minute or less, but if there are problems with the feed at either OSM’s end or (more likely) my end, the graph spikes up and the problem can be spotted and dealt with. Also, the subsequent graphing trend, after such an issue, is useful for predicting how quickly things will be back to normal. I’m using an OSM plugin (part of the Osmosis system, which is also doing the synchronisation) rather than writing my own.

The other process is for monitoring the various feeds I have to around 50 cities around the world, to get their current bikes/spaces information for my Bike Share Map. Munin graphs are useful for spotting feeds that temporarily fail, and then hopefully fix themselves, resulting in distinctive “shark fin” trendlines. If one feed doesn’t fix itself, its shark fin will get huge and I will then probably go and have a look. Above is what the daily graph looks like.

I wrote this plugin myself, in two stages. First, my scripts themselves “touch” a “heartbeat” file (one for each script) upon successful execution. When the feed’s file is touched, its modified timestamp updates; this can then be used to determine how long ago the last successful operation was.

Secondly, my Munin plugin, every five minutes, scans through the folder of heartbeat files (new ones may occasionally appear – they go automatically onto the graph, which is nice), extracts the name and modified timestamp of each file, and reports this back to Munin, which then updates the graphs.

Because Munin allows any shell-executable script, I was able to use my language du-jour, Python, to write the plugin.

Here it is – latest_dir is an absolute path to the parent of the heartbeats directory, scheme is the name of the city concerned:

#!/usr/bin/python

import os
import time
import sys

# latest_dir is the absolute path to the parent of the heartbeats directory (set elsewhere)
filedir = latest_dir + '/heartbeats/'

files = os.listdir(filedir)
files.sort()

# Munin calls the plugin with 'config' when it wants the graph definition
if len(sys.argv) > 1:
    if sys.argv[1] == 'config':
        print "graph_title Bike Data Monitoring"
        print "graph_vlabel seconds"
        for f in files:
            fp = f[:-6]  # strip the '.touch' extension to get the city name
            print fp + ".label " + fp
        sys.exit()

# Normal five-minute call: report the age of each heartbeat file, in seconds
for f in files:
    tdiff = time.time() - os.path.getmtime(filedir + f)
    fp = f[:-6]
    print fp + ".value " + str(tdiff)

The middle section handles the config call; the bit that is used every five minutes is just the gathering of the list of files at the top, and the simple measurement at the bottom.

That is really it – there is no other custom code that is producing the graphs like the one at the top of this post – all the colours, historical result storing and HTML production is handled internally in Munin.

My code that updates the heartbeat file is even simpler:


def heartbeat(scheme):
	open(latest_dir + '/heartbeats/' + scheme + '.touch', 'w').close()

Fingers crossed things won’t go wrong, but if they do, I now have a pleasant, graphical interface to spotting them.

* GEMMA will use OpenStreetMap data for the whole world, but currently it takes longer than a minute to process a minute’s worth of OpenStreetMap database updates, such is the level of activity in the project at the moment, so my minutely-updating database only covers the UK. So, for GEMMA, I am just using a “static” copy of the whole world. OpenOrienteeringMap has two modes, UK and Global – only the UK one uses the updating database.

Categories
Technical

Notes on a Migration with Images from Blogger to WordPress

I’ve recently ported a very large blog from Blogger to WordPress. The blog has been around for many years, with around a thousand posts – most of which contain at least one embedded image.

The WordPress theme to be used was a gallery-style one (Arras) which displays the post image thumbnails properly, so there was a requirement to have these included in the import. However, it is trickier than you might think to get these in from Blogger on a bulk-import basis. Individually associating all the images is not an option due to the number of posts. I’ve not used a solution requiring programming, but writing a Python script to do this would be pretty straightforward. Instead, I’ve used Excel to extract the URLs and build up SQL queries with the URLs in them, to insert into WordPress. These notes will hopefully prove useful to someone trying to do the same thing.
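For anyone who does want to script it, here’s a rough, hedged sketch of what such a Python script might look like – it mirrors the Excel-based steps below (take the ID/post_content export from step 3, find the first cached HLIC image in each post, and print the two SQL insert statements from step 8). The file name, starting IDs and regular expression are illustrative assumptions, not tested code:

import re

# Hypothetical export file name and starting IDs (mirroring the 7xxxxx/8xxxxx ranges used below)
EXPORT_FILE = 'wp_posts_export.csv'
post_id = 700000
meta_id = 800000

hlic = re.compile(r'HLIC/\w+\.jpg')   # the filenames the HLIC plugin writes

for line in open(EXPORT_FILE):
    if '|' not in line:
        continue
    parent_id, content = line.split('|', 1)   # the export is ID|post_content
    match = hlic.search(content)
    if not match:
        continue                               # this post has no cached image
    path = match.group(0)
    post_id += 1
    meta_id += 1
    print 'insert into wp_posts values(%d,1,NOW(),NOW(),"","","","inherit","open","open","","","","",NOW(),NOW(),"",%s,"/wp-content/uploads/%s",0,"attachment","image/jpeg",0);' % (post_id, parent_id.strip().strip('"'), path)
    print 'insert into wp_postmeta values(%d,%d,"_wp_attached_file","%s");' % (meta_id, post_id, path)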

Assumptions (a lot of these can be worked around if you have the know-how): You have Excel, EditPad (a text editor), a WordPress.org 3.1.X install which allows you to install arbitrary plugins (i.e. if using WordPress Multisite, you are a super-admin), and that you can access the MySQL database backing your blog using phpMyAdmin.

1. Install the Blogger Import plugin in WordPress and use it to import your blog in.

The result should be a successfully imported blog. When viewing a post, the images even appear. However – these are just links to images on the Blogger server. The images themselves haven’t been pulled across.

2. Install the Hot Linked Image Cacher (HLIC) plugin and use it to pull in local copies of your images.

Now the images are locally stored, and linked to locally in the content of your posts – the plugin updates the URLs as well as copying the images across. However, the images are still not formally associated with the posts that link to them.

3. Use phpMyAdmin to export just the ID and post_content columns of the wp_posts table to a CSV file. Use the suppress CR/LF option and the pipe “|” delimiter.

4. Open the CSV file in EditPad (not Excel, as Excel will automatically assume that the commas in your posts are the field delimiters) and change all commas in the document to spaces.

5. Copy the contents and paste them into a blank Excel document. Use the Text-to-Columns Import wizard to import in the data with the pipe delimiter.

6. Extract the (first) image URL from the post_content column. You can do this by adding another column and using a formula like this:

=MID($B823,FIND("HLIC/", $B823),41)

It’s OK to use 41 characters because the HLIC plugin always saves images with names this long.

7. Use a Filter (AutoFilter) to remove rows for which there is no image URL.

8. You need to create SQL statements, two for each image, structuring them using Excel columns. Post-associated images are themselves considered by WordPress to be posts, with a parent ID referencing the corresponding post ID – and the images also have entries in the post metadata table. Here are two examples of the pair of statements you need for each image; I’ve used IDs from the 7xxxxx and 8xxxxx ranges on the assumption there are no existing posts with IDs this high.

insert into wp_posts values(700002, 1, NOW(), NOW(), "", "", "", "inherit", "open", "open", "", "", "", "", NOW(), NOW(), "", 1654, "/wp-content/uploads/HLIC/2af1d48d3251d953b106a0bbf8f2f810.jpg", 0, "attachment", "image/jpeg", 0);
insert into wp_postmeta values(800002, 700002, "_wp_attached_file", "HLIC/2af1d48d3251d953b106a0bbf8f2f810.jpg");

insert into wp_posts values(700003, 1, NOW(), NOW(), "", "", "", "inherit", "open", "open", "", "", "", "", NOW(), NOW(), "", 1683, "/wp-content/uploads/HLIC/dd376f1a86ba3e833e866e7f03127712.jpg", 0, "attachment", "image/jpeg", 0);
insert into wp_postmeta values(800003, 700003, "_wp_attached_file", "HLIC/dd376f1a86ba3e833e866e7f03127712.jpg");

The post IDs (1654, 1683) and image filenames are taken from the original database export; the 7xxxxx and 8xxxxx IDs are just sequential numbers.

I’ve assumed all the images are JPEGs – you’ll need to change the “image/jpeg” part of the SQL statement if this is not the case.

If using WordPress Multisite, you’ll need to use the appropriate wp_NN_ table prefix and also set the author number (1 above) appropriately.

9. Paste the statements into EditPad, remove all tabs, and then paste all the statements into phpMyAdmin and execute.

10. Finally, you need to build the attributes (thumbnails etc.) for each image before they appear properly. You can do this with another plugin called Regenerate Thumbnails. Note that this plugin depends on the URL that was specified in the post metadata table.

Manual tidying will still be needed – particularly for embedded YouTube videos and other content.

One more gotcha with the import was that the Blogger post tags were coming in as categories. As there were several hundred of these, that wasn’t very practical. So I ran the following bit of SQL to change them to be WordPress tags:

UPDATE wp_term_taxonomy SET taxonomy = replace(taxonomy,"category","post_tag")

Categories
Technical

Should the UK move to Double Summer Time?

Every autumn, as predictably as the clocks going back from British Summer Time (BST, i.e. GMT+1) to Greenwich Mean Time (GMT), there are newspaper articles with people suggesting that we scrap the change and just stay on BST all year round. Lighter evenings in the winter would make days feel longer for most people, and reduce the gloom of the season. Indeed this was done for a few years from 1968, but politics intervened and we went back to the annual GMT/BST cycle. Well, things might be changing. More specifically, we could be moving to the timezone that much of the rest of Europe adopts – GMT+1 in the winter, GMT+2 in the summer (so Double Summer Time is a bit of a misnomer).

What do you think? Here’s a 10-second survey:

Categories
London Technical

Hodder Geography Nest

During November, I am the guest blogger for the Hodder Geography Nest, along with James, a Ph.D student at UCL Geography. We will be blogging about the research we are doing, focusing particularly on maps.

Categories
Technical

Rennes, a Model City for Transport Data

Having had some issues with obtaining the bike share data for some cities, it was refreshing to receive an email from some developers in Rennes, NW France, detailing the public API for transport data that the city has made available, under a Creative Commons-style licence for reuse. You have to sign up for an API key, through their data website, and then all the data you need is available, quickly and with documentation, through XML or other popular machine-readable formats. As well as the bike-share data, metro line information, including alerts, is also available.

Why can’t all cities be like this?

Picture by Eun Byeol Lee on Flickr

Categories
Data Graphics Mashups Technical

Fewer Cities, More Cities

Some bad news and good news about the Bike Share visualisation.

The bad news – the operator behind the schemes in Paris, Seville, Vienna, Dublin, Brussels, Valencia and Toyama asked me to stop getting the current bike share data from their websites. Although I was just loading their webpages, “in practice you are extracting data from [the operator’s] databases and re-utilising it” and “[the] databases are protected under the harmonised sui generis database right, as provided under Directive 96/9/EC: chapter III article 7 (1) and (2).”

For these seven cities, you can still see a historical snapshot from last Monday, when the feeds were switched off, but not the live status, historical animation or trend graphs.

This is despite a quick search on the web revealing a six-month collection of data for one of the schemes (at four minute intervals), the resulting trends being shown at a conference; a better-service campaign website, again for one of the schemes, with regularly updated performance tables; and an iPhone app pulling in the data from numerous schemes run by the operator, amongst others.

Digital Urban also mentioned this in the context of Bike-o-Meter, which uses the aggregated data from my Bike Share maps.

Now for the good news – I’ve added in five more cities – Rennes, Bordeaux, Zaragoza, Mexico City and Rio de Janeiro. Yay! The inclusion of Mexico City and Rio should hopefully counter some claims of a European/English-speaking bias! Mexico City’s scheme appears to be concentrated in one very affluent district of the metropolis, while Rio’s is based on the seafront south of the city, rather than in the main urban area.

Rennes is a particularly interesting example, more about that shortly.

[Update – turns out I’m not the first.]

Categories
Technical

English Counties and UAs in One

Great Britain’s administrative geography is rather complicated, particularly for England – some English areas are “three tier”, made up of counties which are subdivided into districts, and others are “two tier”, consisting of unitary authorities. Then there are London’s boroughs, which are in a special category of their own as part of the Greater London Authority.

The Ordnance Survey Open Data release (easy download page here) includes BoundaryLine, which contains the geography data file for the counties, and a separate one for the districts, UAs and boroughs. The latter is complete (and also includes the Scottish and Welsh regions), but the former looks rather strange on a map, with “islands” of counties separated by a “sea”.

I received a request from someone who was interested in having a unified file, at county level for the non-GLA counties, but including the UAs and London boroughs to “fill in” the map. I’ve made such a file by doing a dissolve in Quantum GIS (the districts having the county name as an attribute), and it can be downloaded here (15MB zipped shapefile). The data is derived from, and therefore covered by, the OS Open Data licence, which requires simply that the original source must be attributed when using it – that is, the data contains Ordnance Survey data © Crown copyright and database right 2010.
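For anyone wanting to script the same merge rather than use the Quantum GIS dissolve dialogue, here’s a hedged sketch using the geopandas library – a tool of my choosing for illustration, not what was used to produce the published file, and the file and column names are placeholders:

import geopandas as gpd

# The BoundaryLine district/borough/unitary shapefile (placeholder file name)
districts = gpd.read_file("district_borough_unitary_region.shp")

# Dissolve the polygons by the county-name attribute (placeholder column name);
# this assumes every row carries a "parent" name – the county for districts,
# its own name for UAs and London boroughs, so those pass through unchanged.
merged = districts.dissolve(by="COUNTY_NAME")

merged.to_file("england_counties_and_uas.shp")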

The image above shows the merged data, with the unmerged district data (dotted lines) superimposed.