The MySQL Groupwise Maximum Problem

One task that is surprisingly difficult to solve with MySQL queries, and that I’ve been spending some time on, is the Groupwise Maximum problem.

That is the usual name for this type of query, although in my case I am trying to find a set of minimum (rather than maximum) values.

The question: At what time each day do we see the minimum number of available bikes? (This is a research question, as the answer will tell us something about the commuting habits of the city.)

The source data table (bike_agg_london in the queries below, which refer to the count column as bikes):

timestamp bikes_available
2012-05-29 17:12:00 4265
2012-05-29 17:14:00 4251
2012-05-29 17:16:00 4251
2012-05-29 17:18:00 4253
2012-05-29 17:20:00 4259
etc…

My initial thoughts were:

select date(timestamp), time(timestamp), min(bikes)
from bike_agg_london
group by date(timestamp)

date time bikes_available
2012-05-22 00:00:01 4662
2012-05-23 00:00:02 4600
2012-05-24 00:00:02 4594
2012-05-25 00:00:01 4805
2012-05-26 00:00:01 4144
2012-05-27 00:00:02 3710

This produces the minimum number of bikes for each day, which is great, but the time shown is just the first one of each day. In fact it could be any timestamp from within the day; MySQL’s internal logic just happens to pick the first one. This is because time(timestamp) is neither part of the “group by” clause nor wrapped in an aggregate function. Standard SQL requires every selected field to be one or the other, and MySQL only accepts this query when its ONLY_FULL_GROUP_BY mode is off, in which case it is free to return any value from the group. I don’t want to aggregate time(timestamp) though: I want the time from the row that holds the minimum bikes, rather than the maximum, minimum or average (etc) of the times.
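To pin down what the query is supposed to return, here is a minimal sketch of the groupwise minimum in plain Python rather than SQL. The sample rows are illustrative values and not the real dataset, and the function name groupwise_minimum is hypothetical. Breaking ties by the earliest timestamp is an assumption of this sketch.

```python
from collections import defaultdict

# Illustrative sample rows (timestamp, bikes); not the real dataset.
ROWS = [
    ("2012-05-22 00:00:01", 5100),
    ("2012-05-22 18:22:01", 4662),
    ("2012-05-23 00:00:02", 5050),
    ("2012-05-23 18:12:02", 4600),
    ("2012-05-23 18:16:01", 4600),
]


def groupwise_minimum(rows):
    """Return {day: (timestamp, bikes)} for the row with the fewest bikes each day."""
    by_day = defaultdict(list)
    for timestamp, bikes in rows:
        # The first 10 characters of 'YYYY-MM-DD HH:MM:SS' are the date.
        by_day[timestamp[:10]].append((timestamp, bikes))
    # Key (bikes, timestamp): lowest count wins, earliest timestamp breaks ties.
    return {
        day: min(group, key=lambda row: (row[1], row[0]))
        for day, group in by_day.items()
    }


result = groupwise_minimum(ROWS)
for day in sorted(result):
    print(day, result[day])
```

The important point is that the whole row is selected by its bikes value, so the timestamp travels with the minimum instead of being picked independently.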

Here are 10 ways to solve the problem, although some of the ones I tried didn’t work for me.

The technique that did work for me is the second solution on that list.

Here’s the SQL, which ran quite quickly (~18 seconds for around 166,000 rows representing 600 days):

select date(b1.timestamp) theday1, b1.timestamp, b1.bikes
from bike_agg_london b1
inner join (
    select date(timestamp) as theday2, min(bikes) as min_bikes
    from bike_agg_london
    group by date(timestamp)
) b2 on (date(b1.timestamp) = b2.theday2 and b1.bikes = b2.min_bikes)

date timestamp bikes_available
2012-05-22 2012-05-22 18:22:01 4662
2012-05-23 2012-05-23 18:12:02 4600
2012-05-23 2012-05-23 18:16:01 4600
2012-05-24 2012-05-24 18:18:01 4594
2012-05-24 2012-05-24 18:20:02 4594
2012-05-25 2012-05-25 17:54:02 4805
2012-05-26 2012-05-26 15:56:01 4144
2012-05-27 2012-05-27 17:24:01 3710

There is one problem: if multiple rows in a day share the same min(bikes) value, they each appear. Using distinct won’t get rid of these, because time(timestamp) differs between them. The fix is to use an additional wrapper (derived table b3) that groups by day and takes min(timestamp), so only the earliest of the tied rows is kept:

select theday1, time(min(timestamp)), bikes
from (
    select date(b1.timestamp) theday1, b1.timestamp, b1.bikes
    from bike_agg_london b1
    inner join (
        select date(timestamp) as theday2, min(bikes) as min_bikes
        from bike_agg_london
        group by date(timestamp)
    ) b2 on (date(b1.timestamp) = b2.theday2 and b1.bikes = b2.min_bikes)
) b3
group by theday1, bikes

date time bikes_available
2012-05-22 18:22:01 4662
2012-05-23 18:12:02 4600
2012-05-24 18:18:01 4594
2012-05-25 17:54:02 4805
2012-05-26 15:56:01 4144
2012-05-27 17:24:01 3710
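The wrapped query can be checked on a small sample in the same way. This is again a sketch that assumes Python's built-in sqlite3 module as a stand-in for MySQL. The rows are illustrative and the helper name run_dedup is hypothetical. The explicit column aliases inside b3, the thetime alias, and the order by clause are additions for portability and predictable output.

```python
import sqlite3

# Illustrative sample rows (timestamp, bikes); not the real dataset.
ROWS = [
    ("2012-05-22 00:00:01", 5100),
    ("2012-05-22 18:22:01", 4662),
    ("2012-05-23 00:00:02", 5050),
    ("2012-05-23 18:12:02", 4600),
    ("2012-05-23 18:16:01", 4600),
]

# The outer query collapses tied rows to the earliest timestamp per day.
DEDUP_SQL = """
select theday1, time(min(timestamp)) as thetime, bikes
from (
    select date(b1.timestamp) as theday1,
           b1.timestamp as timestamp,
           b1.bikes as bikes
    from bike_agg_london b1
    inner join (
        select date(timestamp) as theday2, min(bikes) as min_bikes
        from bike_agg_london
        group by date(timestamp)
    ) b2 on date(b1.timestamp) = b2.theday2 and b1.bikes = b2.min_bikes
) b3
group by theday1, bikes
order by theday1
"""


def run_dedup(rows):
    """Load rows into an in-memory SQLite table and run the wrapped query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table bike_agg_london (timestamp text, bikes integer)")
    conn.executemany("insert into bike_agg_london values (?, ?)", rows)
    result = conn.execute(DEDUP_SQL).fetchall()
    conn.close()
    return result


for row in run_dedup(ROWS):
    print(row)
```

Each day now appears once, with the earlier of the two tied times kept for 2012-05-23.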
