Categories
London Technical

Me, Geolocated on Twitter

[Screenshot: my geolocated tweets across London]

The excellent Twitter Tongues map – for which geolocated tweets in London (including mine, and those from hundreds of thousands of others) were mined by Ed Manley over the summer and then mapped by James Cheshire – prompted me to see where I had left my own Twitter footprint.

Many people would probably be quite alarmed to learn that data on the exact locations they have tweeted from – if they’ve allowed geolocation – is freely accessible to anyone, not just themselves, through the Twitter API.

[Screenshot: geolocated tweets around Chancery Lane]

It’s a bit of a faff to get the data – Twitter is starting to roll out a “download my Tweets” option which may make the first few steps here easier – but here’s how I did it.

  1. I used the user_timeline call on the Twitter API, repeatedly, to pull in my last 3200 tweets (the maximum) in batches (“pages”) of 200. The current Twitter API (1.1) requires OAuth authentication – not of the person whose tweets you are mining, but simply of yourself, so that rate limits can be correctly applied. Registering a dummy application on the Twitter developer site gives access to OAuth credentials, and the OAuth tool then generates a curl string that can be run – I put the result in a file (> pageX.json) and did this 16 times to get all 3200 tweets, using the count, page and include_rts parameters. (A scripted version of these steps is sketched below the list.) For this particular case, I’m interested in the locations of my own account but – to stress again – you can do this for anyone else’s account, unless their account is protected and you are not a follower.
  2. The output is a set of JSON files. Lacking a JSON parser, or indeed the skill, I did a bit of manual text processing instead – those with a flexible JSON parser can skip a few steps. I merged the files together (cat *.json > combined.txt) and then, in a text editor, put a line break in the middle of each },{"crea sequence (one tweet per line) and replaced ," with ,^", the caret being an otherwise unused character.
  3. I opened the file as a text file (not CSV!) in Excel and did a text-to-columns on the caret. I then extracted three columns – the date/time, the tweet text, and the first coordinates column that occurred: the 1st (A), 4th (D) and 28th (AB) columns respectively. Further find/replace and text-to-columns operations removed the keys and quotes, and split the coordinates column into two columns – lat and long.
  4. I removed all the rows that didn’t have a lat/long location. Out of 3186 tweets (14 fewer than 3200, due to deleted tweets) I had 268 geolocated ones. I also added a header row.
  5. I created a new Google Fusion Table on the Google Drive website, importing in the Excel file from the above step, and assigning the latter two columns to be a two-column location field.
  6. I marked the table as public (viewable with a link). This is necessary as Google doesn’t allow the creation of a map from a private file, except through a paid (business) account. The flip side, of course, is that this gives Google themselves the right of access to the file contents, although I can’t imagine they are particularly interested in this one.
  7. Finally, I added a map tab to the Google Fusion Table, then zoomed in and around and took the screenshots shown here. The map is zoomable and the points clickable as normal. It should be possible to colour-code the dots by year, if the categories are set up for it and the relevant part of the datetime field is reformatted in Step 3.
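For completeness, here’s roughly what steps 1–4 look like as a single Python script, if you have the requests and requests_oauthlib packages to hand – the credential values and “your_account” are placeholders, and the parameters simply mirror the count/page/include_rts ones described above.

import csv
import json
import requests
from requests_oauthlib import OAuth1

# Placeholders - use the credentials from your registered dummy application
CONSUMER_KEY = "xxx"
CONSUMER_SECRET = "xxx"
ACCESS_TOKEN = "xxx"
ACCESS_SECRET = "xxx"

auth = OAuth1(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
url = "https://api.twitter.com/1.1/statuses/user_timeline.json"

rows = []
for page in range(1, 17):  # 16 pages of 200 = the 3200-tweet maximum
    r = requests.get(url, auth=auth, params={"screen_name": "your_account",
        "count": 200, "page": page, "include_rts": "true"})
    for tweet in json.loads(r.text):
        coords = tweet.get("coordinates")  # null unless the tweet was geolocated
        if coords:
            lon, lat = coords["coordinates"]  # GeoJSON order: longitude first
            rows.append([tweet["created_at"], tweet["text"].encode("utf-8"), lat, lon])

with open("geolocated_tweets.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerow(["created_at", "text", "lat", "lon"])
    writer.writerows(rows)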

The whole process, including some trial-and-error, took a little over an hour – not so bad.

In the images above and below, you can see the results – 268 geolocated tweets over the course of two and a half years from my account – many of them precisely and accurately located.

[Screenshot: geolocated tweets across north-west Europe]

All screenshots from Google Maps.

Categories
Data Graphics Technical

CityDashboard Weather Forecasts – Yahoo! to the Rescue

After Google abruptly turned off their XML weather feed this week, I’ve switched to using Yahoo! Weather (an RSS feed) for the CityDashboard weather forecast module. Yahoo! uses WOEIDs (Where On Earth IDs) rather than city names, which take a bit longer to configure but are unambiguous – Google just used the city name, so required careful specification to get Birmingham (UK) weather rather than Birmingham (Alabama, US) weather, for example. Google’s feed was undocumented (so, strictly, private) but was widely used on other websites.
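In case it’s useful, here’s roughly what the fetch looks like in Python – the feed URL, XML namespace and attribute names below are from memory of the current Yahoo! feed rather than checked against the documentation, so treat them as assumptions; 44418 is the WOEID for London.

import urllib2
import xml.etree.ElementTree as ET

WOEID = "44418"  # London
url = "http://weather.yahooapis.com/forecastrss?w=" + WOEID + "&u=c"
tree = ET.parse(urllib2.urlopen(url, timeout=10))

YW = "{http://xml.weather.yahoo.com/ns/rss/1.0}"
condition = tree.find("channel/item/" + YW + "condition")
forecast = tree.find("channel/item/" + YW + "forecast")  # first of several forecast days

print "Now:", condition.get("text") + ",", condition.get("temp") + "C"
print "Next:", forecast.get("text") + ",", forecast.get("low") + "-" + forecast.get("high") + "C"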

I’m using the weather icons (which link to the codes supplied by Yahoo) from the WeatherIcon project.

It works well. Thank you Yahoo!

Maybe Yahoo! is about to become the new Google?

Categories
Technical

CityDashboard – the API

Here is the API documentation for CityDashboard. It’s really not a very advanced API, and it’s not delivered in a “proper” format (e.g. XML or JSON); instead, it’s available as a number of CSV/TXT-formatted files. It ain’t pretty but it works!

I’ve put together this documentation as a number of people have asked. However, it should still be considered to be a “private” API, so could change or break at any time, for one of three likely reasons:

  • I make a change to the API structure. If it’s a big change, I will attempt to preserve the old structure, possibly by using an extra parameter (e.g. &v=2) to indicate the new one.
  • Our server goes down. Certainly not inconceivable!
  • One of the upstream data providers changes their feed format in such a way that it causes the CityDashboard data to freeze up. Again, quite likely, particularly as generally I don’t have a formal agreement with the upstream organisations.

1. Finding the cities available

The list of cities available can be found at:
http://citydashboard.org/cities.txt

Notes:

  • Comma-separated.
  • The city_id is the first field of each line.
  • Ignore lines starting with #.

2. Finding the modules available for a city

The list of modules available for a city (london, for example) can be found at:
http://citydashboard.org/cities/[city_id].txt

Notes:

  • Comma-separated.
  • The module_id is the first field of each line.
  • Ignore lines starting with #.

3. Getting the data for each module

The data for each module can be found at:
http://citydashboard.org/modules/[module_id].php?city=[city_id]&format=[csv|html|blob]

Example:
http://citydashboard.org/modules/weather_cr.php?city=london&format=csv

Notes:

  • Comma-separated or HTML, depending on the format parameter you use.
  • Ignore lines starting with #.
  • The CSV format will be most useful, as the HTML and “blob” formats are specifically designed for the CityDashboard website. However, many of the modules don’t (yet) have a CSV format feed available – a blank page will instead be returned.
  • The first line in each CSV file contains a number in the second field. This is the number of seconds between each update, i.e. if this is 5, then the file won’t update more than once every 5 seconds.
  • Modules that have a CSV feed have an “m” in the sixth field of the appropriate row in the london.txt file (typical values: d, db, dbm, etc.).
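To illustrate the three calls together, here’s a minimal Python sketch of a client – weather_cr and london are just the example IDs from above, and the assumption that the update-interval line is the first non-# line is mine:

import urllib2

BASE = "http://citydashboard.org"

def feed_lines(url):
    # Fetch a feed, dropping blank lines and comment lines starting with #
    raw = urllib2.urlopen(url, timeout=10).read()
    return [l for l in raw.splitlines() if l.strip() and not l.startswith("#")]

# 1. Cities: city_id is the first comma-separated field of each line
cities = [l.split(",")[0] for l in feed_lines(BASE + "/cities.txt")]
print cities

# 2. Modules for a city: module_id is the first field of each line
modules = [l.split(",")[0] for l in feed_lines(BASE + "/cities/london.txt")]
print modules

# 3. Data for one module, in CSV format (may come back blank if no CSV feed yet)
data = feed_lines(BASE + "/modules/weather_cr.php?city=london&format=csv")
if data:
    print "Updates every", data[0].split(",")[1], "seconds"  # see note above
    for line in data[1:]:
        print line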

By the way, the module list will most likely be changing very soon to add a couple of important fields that I overlooked – first of all, the source URL will be in a field of its own, and secondly I will add in a proper attribution statement for each source.

Categories
Technical

The MySQL Groupwise Maximum Problem

There is a surprisingly difficult task to solve with MySQL queries, one which I’ve been spending some time on – the Groupwise Maximum problem.

This is the name for the type of query that I was trying, although in fact I am trying to find a set of minimum (rather than maximum) values.

The question: what is the time each day at which we see the minimum number of available bikes? (A research question – the answer will tell us something about the commuting habits of the city.)

The source data table:

timestamp bikes_available
2012-05-29 17:12:00 4265
2012-05-29 17:14:00 4251
2012-05-29 17:16:00 4251
2012-05-29 17:18:00 4253
2012-05-29 17:20:00 4259
etc…

My initial thoughts were:

select date(timestamp), time(timestamp), min(bikes) from bike_agg_london group by date(timestamp)

date time bikes_available
2012-05-22 00:00:01 4662
2012-05-23 00:00:02 4600
2012-05-24 00:00:02 4594
2012-05-25 00:00:01 4805
2012-05-26 00:00:01 4144
2012-05-27 00:00:02 3710

This produces the minimum bikes number for each day, which is great, but the timestamp included is just the first one of each day (in fact it could be a randomly chosen timestamp from within the day, but MySQL’s internal logic happens to pick the first one). This is because time(timestamp) is not part of the “group by” (aggregate) clause, and all fields in a query must either be in the group by or be aggregated. I don’t want to aggregate time(timestamp), though – I want the time associated with the minimum bikes value, rather than the maximum, minimum or average (etc.) of the times themselves.

Here are 10 ways to solve the problem, although I tried a few and they didn’t work for me.

Here’s a technique that worked for me (the second solution)

Here’s the SQL that worked for me, quite quickly (~18 seconds for around 166000 rows representing 600 days):

select date(b1.timestamp) theday1, b1.timestamp, b1.bikes
from bike_agg_london b1
inner join
  (select date(timestamp) as theday2, min(bikes) as min_bikes
   from bike_agg_london
   group by date(timestamp)) b2
on (date(b1.timestamp) = b2.theday2 and b1.bikes = b2.min_bikes)

date timestamp bikes_available
2012-05-22 2012-05-22 18:22:01 4662
2012-05-23 2012-05-23 18:12:02 4600
2012-05-23 2012-05-23 18:16:01 4600
2012-05-24 2012-05-24 18:18:01 4594
2012-05-24 2012-05-24 18:20:02 4594
2012-05-25 2012-05-25 17:54:02 4805
2012-05-26 2012-05-26 15:56:01 4144
2012-05-27 2012-05-27 17:24:01 3710

It’s the second solution from the above link. There is one problem: if multiple rows in a day share the same min(bikes) value, they each appear. Using distinct won’t get rid of these, because the time(timestamp) does vary. The fix is to use an additional wrapper (table b3) to eliminate these duplicate rows:

select theday1, time(min(timestamp)), bikes
from
  (select date(b1.timestamp) theday1, b1.timestamp, b1.bikes
   from bike_agg_london b1
   inner join
     (select date(timestamp) as theday2, min(bikes) as min_bikes
      from bike_agg_london
      group by date(timestamp)) b2
   on (date(b1.timestamp) = b2.theday2 and b1.bikes = b2.min_bikes)) b3
group by theday1, bikes

date time bikes_available
2012-05-22 18:22:01 4662
2012-05-23 18:12:02 4600
2012-05-24 18:18:01 4594
2012-05-25 17:54:02 4805
2012-05-26 15:56:01 4144
2012-05-27 17:24:01 3710
Categories
Mashups Technical

WMS and SVG Input with Mapnik and Cairo

A rather techy post from me, just before the Final Project Post for GEMMA (my current project at CASA) is submitted, summing up the project and its applications. Why? I wanted to share two rather specialist bits of GEMMA that required quite a bit of head-scratching and trial-and-error, in the hope that they will be of use to some developer out there in their own work. I’m cross-posting this from the GEMMA project blog.

One of the core features of GEMMA that I have always desired is the ability to output the created map as a PDF. Not just any PDF, but a vector-based one – the idea that it will be razor-sharp when you print it out, rather than just looking like a screen grab. I had written a basic PDF creator, using Mapnik and Cairo, for OpenOrienteeringMap (OOM), an earlier side-project, and because the GEMMA project is about embracing and extending our existing technologies and knowledge at CASA, rather than reinventing the wheel, I was keen to utilise this code in GEMMA.

Most of the layers were quite straightforward – the two OpenStreetMap layers (background and feature) are very similar indeed to OOM, while the Markers and Data Collector layers were also reasonably easy to do – once I had imported (for the former) and hand-crafted (for the latter) suitable SVG images, so that they would stay looking nice on the PDF. The trickiest layer was the MapTube layer. For the purposes of this project, the MapTube imagery is not a vector layer, i.e. we are not using WFS. However I was still keen to include this layer in the PDF, so I turned to Richard Milton (the creator of MapTube) and discovered there is a WMS service that will stitch together the tiled images and serve them across the net. The solution combines requesting the WMS images on the GEMMA server (not the client!), writing them to temporary files, and then using a RasterSymbolizer in Mapnik 2 with an associated GDAL datasource.

The trickiest part was setting georeferencing information for the WMS image. Georeferencing is used to request the image, but it is also needed to position the image above or below the other Mapnik layers. Initially it looked like I would have to manually create a “worldfile”, but eventually I found a possibly undocumented Mapnik feature which allows manual specification of the bounding box.

I’ve not seen this done anywhere else before, although I presume people have just done it and not written it down on the web, so here’s my take, in Python.

First we get our WMS image. MAP_W and MAP_H are the size of the map area on the “sheet” in metres. We request it with a resolution of 5000 pixels per metre, which should produce a crisp looking image without stressing the server too much.

import urllib2
import tempfile
import mapnik

mb = map.envelope()
url = maptube_wms_path + "/?request=GetMap&service=WMS&version=1.3.0"
url = url + "&format=image/png&crs=EPSG:900913"
url = url + "&width=" + str(int(MAP_W*5000)) 
url = url + "&height=" + str(int(MAP_H*5000))
url = url + "&bbox=" + str(mb.minx) + "," + str(mb.miny) + "," 
url = url + str(mb.maxx) + "," + str(mb.maxy)
url = url + "&layers=MAPID" + str(maptubeid)

furl = urllib2.urlopen(url, timeout=30)

Mapnik doesn’t work directly with images, but files, so we create a temporary file:

ftmp = tempfile.NamedTemporaryFile(suffix = '.png')
filename = ftmp.name
ftmp.write(furl.read())
ftmp.flush()  # make sure the image is fully on disk before Mapnik/GDAL reads it

Next we set up the layer and style. It’s nice that we can pass the opacity, set on the GEMMA website, straight into the layer in Mapnik.

style = mapnik.Style()
rule = mapnik.Rule()
rs = mapnik.RasterSymbolizer()
rs.opacity = opacity
rule.symbols.append(rs)
style.rules.append(rule)
map.append_style(filename,style)
lyr = mapnik.Layer(filename)

Here’s the key step, where we manually provide georeferencing information. epsg900913 is the usual Proj4 string for this coordinate reference system.

lyr.datasource = mapnik.Gdal(base='',file=filename, bbox=(mb.minx, mb.miny, mb.maxx, mb.maxy)) #Override GDAL
lyr.srs = epsg900913

Finally:

lyr.styles.append(filename)
map.layers.append(lyr)

I’m excited about one other piece of code in the PDF generation process, as again it involves jumping through some hoops that are only lightly documented – adding an SVG “logo”, the SVG in this case being the GEMMA gerbil logo that Steve (co-developer) created in Illustrator. Cairo does not allow native SVG import (only export), but you can use the RSVG Python package to pull this in. I’m being a bit lazy in hard-coding widths and scales here, because the logo never changes. There are more sophisticated calls, e.g. svg.props.width, that could be useful.

import cairo
import rsvg

svg = rsvg.Handle(gemma_path + "/images/logo.svg")
ctx = cairo.Context(surface)
ctx.translate((MAP_WM+MAP_W)*S2P-ADORN_LOGO_SIZE, CONTENT_NM*S2P)
ctx.scale(0.062, 0.062)
svg.render_cairo(ctx)

Note that we are calling render_cairo, a function in RSVG, rather than a native Cairo function as we do for all the other layers in the PDF.

The screenshot above contains data from the OpenStreetMap project.

Categories
Technical

The Ease of Monitoring with Munin

I’m currently using Munin to keep an eye on the various services running on my main server at UCL CASA. Munin is a monitoring package. It is simple to install (on Ubuntu, sudo apt-get install munin on your primary server, sudo apt-get install munin-node on all servers you want to monitor), and brilliantly simple to set up and – most importantly – extend.

Out of the box, Munin will start monitoring and graphing various aspects of your server, such as CPU usage, memory usage, disk space and uptime. The key is that everything is graphed, so that trends can be spotted and action taken before it’s too late. Munin always collects its data every five minutes, and always presents the graphs in four timescales: the last 24 hours, the last 7 days, the last month and the last year.

Extending Munin to measure your own service or process is quite straightforward. All you need is a shell-executable script which returns key/value pairs representing the metric you want to measure. You also need to add a special response for when Munin wants to configure your plugin. This response sets graph titles, information on what you are measuring, and optionally thresholds for tripping alerts.

Here’s a typical response from a Munin script, “uptime”; this is what Munin uses to construct the graph:

uptime.value 9.29

Here’s what you get when you call it with the config parameter:

graph_title Uptime
graph_args --base 1000 -l 0 
graph_scale no
graph_vlabel uptime in days
graph_category system
uptime.label uptime
uptime.draw AREA

Munin takes this and draws (or reconfigures) the trend graph, with the data and a historical record. Here’s the daily graph for that:

I have two custom processes being monitored with Munin. The first is my minute-by-minute synchronisation of my database (used by OpenOrienteeringMap and, soon hopefully*, GEMMA) with the UK portion of the master OpenStreetMap database. The metric being measured is the time lag. Normally this is around a minute or less, but if there are problems with the feed at either OSM’s end or (more likely) my end, the graph spikes up and the problem can be spotted and dealt with. Also, the subsequent graphing trend, after such an issue, is useful for predicting how quickly things will be back to normal. I’m using an OSM plugin (part of the Osmosis system, which is also doing the synchronisation) rather than writing my own.

The other process is for monitoring the various feeds I have to around 50 cities around the world, to get their current bikes/spaces information for my Bike Share Map. Munin graphs are useful for spotting feeds that temporarily fail, and then hopefully fix themselves, resulting in distinctive “shark fin” trendlines. If one feed doesn’t fix itself, its shark fin will get huge and I will then probably go and have a look. Above is what the daily graph looks like.

I wrote this plugin myself, in two stages. First, my scripts themselves “touch” a “heartbeat” file (one for each script) upon successful execution. When the feed’s file is touched, its modified timestamp updates; this can then be used to determine how long ago the last successful operation was.

Secondly, my Munin plugin, every five minutes, scans through the folder of heartbeat files (new ones may occasionally appear – they go automatically onto the graph which is nice) and extracts the name and modified timestamp for each file, and reports this back to Munin, which then updates the graphs.

Because Munin allows any shell-executable script, I was able to use my language du-jour, Python, to write the plugin.

Here it is – latest_dir is an absolute path to the parent of the heartbeats directory, scheme is the name of the city concerned:

#!/usr/bin/python

import os
import time
import sys

filedir = latest_dir + '/heartbeats/'

files = os.listdir(filedir)
files.sort()

if len(sys.argv) > 1: 
	if sys.argv[1] == 'config':
		print "graph_title Bike Data Monitoring"
		print "graph_vlabel seconds"
		for f in files:
			fp = f[:-6]
			print fp + ".label " + fp
		exit()

for f in files:
	tdiff = time.time() - os.path.getmtime(filedir + f)
	fp = f[:-6]
	print fp + ".value " + str(tdiff)

The middle section handles the config; the part that runs every five minutes is just the gathering of the list of files at the top and the simple measurement at the bottom.

That is really it – there is no other custom code producing the graphs like the one at the top of this post – all the colours, historical result storage and HTML production are handled internally by Munin.

My code that updates the heartbeat file is even simpler:


def heartbeat(scheme):
	open(latest_dir + '/heartbeats/' + scheme + '.touch', 'w').close()
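For context, a feed script only touches its heartbeat once a fetch has actually succeeded, so the lag climbs whenever a feed breaks – something like this, where get_bike_feed and store are hypothetical stand-ins for the real fetching code:

# Hypothetical fetch script for one city
data = get_bike_feed("london")
if data:
    store(data)
    heartbeat("london")  # only touched on success, so the Munin lag resets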

Fingers crossed things won’t go wrong, but if they do, I now have a pleasant, graphical interface to spotting them.

* GEMMA will use OpenStreetMap data for the whole world, but currently it takes longer than a minute to process a minute’s worth of OpenStreetMap database updates, such is the level of activity in the project at the moment, so my minutely-updating database only covers the UK. So, for GEMMA, I am just using a “static” copy of the whole world. OpenOrienteeringMap has two modes, UK and Global – only the UK one uses the updating database.

Categories
Technical

Notes on a Migration with Images from Blogger to WordPress

I’ve recently migrated a very large blog from Blogger to WordPress. The blog has been around for many years, with around a thousand posts – most of which contain at least one embedded image.

The WordPress theme to be used was a gallery-style one (Arras) which displays the post image thumbnails properly, so there was a requirement to have these included in the import. However, it is trickier than you might think to get these in from Blogger on a bulk-import basis, and individually associating all the images is not an option due to the number of posts. I’ve not used a solution requiring programming, but writing a Python script to do this would be pretty straightforward (a sketch of one follows the steps below). Instead I’ve used Excel to extract the URLs and build up SQL queries with the URLs in them, to insert into WordPress. These notes will hopefully prove useful to someone trying to do the same thing.

Assumptions (a lot of these can be worked around if you have the know-how): You have Excel, EditPad (a text editor), a WordPress.org 3.1.X install which allows you to install arbitrary plugins (i.e. if using WordPress Multisite, you are a super-admin), and that you can access the MySQL database backing your blog using phpMyAdmin.

1. Install the Blogger Import plugin in WordPress and use it to import your blog in.

The result should be a successfully imported blog. When viewing a post, the images even appear. However – these are just links to images on the Blogger server. The images themselves haven’t been pulled across.

2. Install the Hot Linked Image Cacher (HLIC) plugin and use it to pull in local copies of your images.

Now the images are locally stored, and linked to locally in the content of your posts – the plugin updates the URLs as well as copying the images across. However, the images are still not formally associated with the posts that link to them.

3. Use phpMyAdmin to export just the ID and post_content columns of the wp_posts table to a CSV file. Use the suppress CR/LF option and the pipe “|” delimiter.

4. Open the CSV file in EditPad (not Excel, as Excel will automatically assume that the commas in your posts are the field delimiters) and change all commas in the document to spaces.

5. Copy the contents and paste them into a blank Excel document. Use the Text-to-Columns Import wizard to import in the data with the pipe delimiter.

6. Extract the (first) image URL from the post_content column. You can do this by adding another column and using a formula like this:

=MID($B823,FIND("HLIC/", $B823),41)

It’s OK to use 41 characters because the HLIC plugin always saves images with names this long.

7. Use a Filter (AutoFilter) to remove rows for which there is no image URL.

8. You need to create SQL statements, two for each image, structuring them using Excel columns. Post-associated images are considered by WordPress to be posts themselves, with a parent ID referencing the corresponding post ID – and the images also have entries in the post metadata table. Here are examples of the two statements you need for each image; I’ve used IDs from the 7xxxxx and 8xxxxx ranges on the assumption that there are no existing posts with IDs this high.

insert into wp_posts values(700002, 1, NOW(), NOW(), "", "", "", "inherit", "open", "open", "", "", "", "", NOW(), NOW(), "", 1654, "/wp-content/uploads/HLIC/2af1d48d3251d953b106a0bbf8f2f810.jpg", 0, "attachment", "image/jpeg", 0);
insert into wp_postmeta values(800002, 700002, "_wp_attached_file", "HLIC/2af1d48d3251d953b106a0bbf8f2f810.jpg");

insert into wp_posts values(700003, 1, NOW(), NOW(), "", "", "", "inherit", "open", "open", "", "", "", "", NOW(), NOW(), "", 1683, "/wp-content/uploads/HLIC/dd376f1a86ba3e833e866e7f03127712.jpg", 0, "attachment", "image/jpeg", 0);
insert into wp_postmeta values(800003, 700003, "_wp_attached_file", "HLIC/dd376f1a86ba3e833e866e7f03127712.jpg");

The post IDs (1654 and 1683) and the HLIC image filenames come from the original database export; the 7xxxxx and 8xxxxx values are sequential numbers that you assign.

I’ve assumed all the images are JPEGs – you’ll need to change the “image/jpeg” part of the SQL statement if this is not the case.

If using WordPress Multisite, you’ll need to use the appropriate wp_NN_ table prefix and also set the author number (1 above) appropriately.

9. Paste the statements into EditPad, remove all tabs, and then paste all the statements into phpMyAdmin and execute.

10. Finally you need to build in the attributes for each image, before they are seen. You can do this with another plugin called Regenerate Thumbnails. Note that this plugin depends on the URL that was specified in the post metadata table.
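As mentioned at the start, a short Python script could stand in for steps 3–9 by generating the step-8 SQL directly from the pipe-delimited id|post_content export (one post per line, CR/LFs suppressed). Here’s a rough sketch – the filename, the starting IDs and the JPEG-only assumption mirror the manual steps above, so adjust to taste:

import re

post_id_base = 700000   # attachment post IDs, assumed clear of existing posts
meta_id_base = 800000   # postmeta IDs
n = 0
statements = []

with open("wp_posts_export.csv") as f:   # the step-3 export, pipe-delimited
    for line in f:
        post_id, content = line.split("|", 1)  # only split on the first pipe
        m = re.search(r"HLIC/[0-9a-f]{32}\.jpg", content)  # first cached image
        if not m:
            continue
        n += 1
        img = m.group(0)
        statements.append(
            'insert into wp_posts values(%d, 1, NOW(), NOW(), "", "", "", "inherit", '
            '"open", "open", "", "", "", "", NOW(), NOW(), "", %s, '
            '"/wp-content/uploads/%s", 0, "attachment", "image/jpeg", 0);'
            % (post_id_base + n, post_id.strip(), img))
        statements.append(
            'insert into wp_postmeta values(%d, %d, "_wp_attached_file", "%s");'
            % (meta_id_base + n, post_id_base + n, img))

print "\n".join(statements)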

Manual tidying will still be needed – particularly for embedded YouTube videos and other content.

One more gotcha with the import was that the Blogger post tags were coming in as categories. As there were several hundred of these, that wasn’t very practical. So I ran the following bit of SQL to change them to be WordPress tags:

UPDATE wp_term_taxonomy SET taxonomy = replace(taxonomy,"category","post_tag")

Categories
Technical

Should the UK move to Double Summer Time?

Every autumn, as predictably as the clocks going back from British Summer Time (BST, i.e. GMT+1) to Greenwich Mean Time (GMT), there are newspaper articles with people suggesting that we scrap the change and just stay on BST all year round. Lighter evenings in the winter would make days feel longer for most people, and reduce the gloom of the season. Indeed this was done for a few years from 1968, but politics intervened and we went back to the annual GMT/BST cycle. Well, things might be changing. More specifically, we could be moving to the timezone that much of the rest of Europe adopts – GMT+1 in the winter, GMT+2 in the summer (so “Double Summer Time” is a bit of a misnomer).

What do you think? Here’s a 10-second survey:

Categories
London Technical

Hodder Geography Nest

During November, I am the guest blogger for the Hodder Geography Nest, along with James, a Ph.D student at UCL Geography. We will be blogging about the research we are doing, focusing particularly on maps.

Categories
Technical

Rennes, a Model City for Transport Data

Having had some issues with obtaining the bike share data for some cities, it was refreshing to receive an email from some developers in Rennes, NW France, detailing the public API for transport data that the city has made available under a Creative Commons-style licence for reuse. You have to sign up for an API key through their data website, and then all the data you need is available, quickly and with documentation, in XML or other popular machine-readable formats. As well as the bike-share data, metro line information, including alerts, is also available.

Why can’t all cities be like this?

Picture by Eun Byeol Lee on Flickr