Weather hacking updates
Published: November 23, 2008
Tags: weather data hacking getauweather
My move went smoothly enough. Total downtime for all the maurits.id.au services was a lot higher than I had anticipated: slightly over 24 hours. It turns out that I stupidly still had some IPNAT rules active from a very long time ago which weren't having any effect on my old home network (as the IP address ranges didn't match) but which did have an effect on the (pre-existing) network at my new home. It took me a while to figure this out, and until I did no incoming connections got through, as my router was forwarding packets to a non-existent host. Oops.
Anyway, it's time for an update on my weather hacking project. For about the last two months I have been slowly tweaking the bugs and inefficiencies out of some new code which runs as part of the hourly cron job that updates the .csv, .xml and .yml files which are currently about the full extent of the project. This new code dumps the data into a PostgreSQL database, so that it persists beyond the lifetime of the particular issue of the BoM page it was scraped from. In this way I have managed to accumulate an impressive 892,294 records thus far! The data comes from 678 stations, so that's an average of 1,316 records per station, or one record per hour for about the last 55 days.
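The database step is conceptually simple: for each scraped observation, insert it only if it isn't already there, so that re-processing an unchanged BoM page doesn't create duplicate rows. A minimal sketch of the idea, assuming psycopg2 as the PostgreSQL adapter and a hypothetical observation table (the real schema differs in the details):

```python
import psycopg2

# Hypothetical schema: observation(station, time, temperature, ...).
# The real table and column names may differ.
conn = psycopg2.connect("dbname=weather")
cur = conn.cursor()

def store_observation(station, obs_time, temperature):
    """Insert a scraped observation, skipping duplicates so the hourly
    cron job can safely re-process an unchanged BoM page."""
    cur.execute(
        "SELECT 1 FROM observation WHERE station = %s AND time = %s",
        (station, obs_time))
    if cur.fetchone() is None:
        cur.execute(
            "INSERT INTO observation (station, time, temperature) "
            "VALUES (%s, %s, %s)",
            (station, obs_time, temperature))
        conn.commit()
```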
The ultimate long-term plan for this database (which I'll probably start making dumps of available on a weekly basis or something) is to create a JSON interface to it, which will enable people with much better Javascript-fu than I to build nifty web applications without having to download the entire huge database. I have a prototype for this using CherryPy in the works, but it will probably be quite a while before I have come up with a complete API that I am happy with and which runs quickly and reliably. Perhaps I should polish up the code for the database cron job and release it so someone else can beat me to it?
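To give a flavour of what the prototype is aiming at, here is a minimal sketch of a CherryPy endpoint serving JSON; the URL layout, table and column names here are hypothetical, not the final API:

```python
import json
import cherrypy
import psycopg2

class WeatherAPI(object):
    @cherrypy.expose
    def station(self, name):
        """Serve the last 24 observations for a station as JSON,
        e.g. GET /station?name=Adelaide."""
        conn = psycopg2.connect("dbname=weather")  # hypothetical DSN
        try:
            cur = conn.cursor()
            cur.execute(
                "SELECT time, temperature FROM observation "
                "WHERE station = %s ORDER BY time DESC LIMIT 24",
                (name,))
            rows = [{"time": str(obs_time), "temperature": temp}
                    for obs_time, temp in cur.fetchall()]
        finally:
            conn.close()
        cherrypy.response.headers["Content-Type"] = "application/json"
        return json.dumps(rows)

if __name__ == "__main__":
    cherrypy.quickstart(WeatherAPI())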
A much more immediate goal is to use the database to get dynamic generation of RSS and Atom feeds for weather data working. This will, of course, be done using feedformatter (sketched below), which is turning out to be a very useful module for my projects indeed. I have feeds being produced by the hourly cron job already, but have not published them just yet on account of one detail. In order for RSS or Atom feeds to be valid (and hence to be recognised by the less-lenient readers out there), each item needs a URL associated with it. It seems wrong to have the items link anywhere other than to the BoM (to a station-relevant page such as this one for Adelaide). This is easy enough to do, but it requires putting a URL into each row of the station table of the database.

This is the kind of thing one definitely wants to automate rather than do by hand, but that is not entirely straightforward. The stations are presently identified in the database by an abbreviated name as used in the scraped page, e.g. "HindmarshI" for "Hindmarsh Island". The only pages I can find that are convenient for scraping these station-relevant URLs identify stations by their full names. Thus, automating the process of associating URLs with stations requires the ability to automatically map between the full and abbreviated names. I've written a heuristic function for doing this which makes expansions like "Is" to "Island" and "Pt" to "Port" and looks for matches in the list of full names, but so far it only has about a 50% success rate. Logically there exists a point where it makes more sense to just finish the job by hand than to spend time trying to finish off the automation, but I am the kind of person who will keep banging my head against incomplete code well beyond that point. Hopefully it won't take too long.
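For the curious, generating a feed with feedformatter looks roughly like the following; the station, values and URLs are made up for illustration, and the item fields are simplified:

```python
import time
from feedformatter import Feed

feed = Feed()
feed.feed["title"] = "Adelaide weather observations"      # illustrative
feed.feed["link"] = "http://maurits.id.au/"               # placeholder
feed.feed["description"] = "Hourly observations scraped from the BoM"

item = {}
item["title"] = "Adelaide: 24.1 C at 3pm"                 # made-up values
item["link"] = "http://www.bom.gov.au/..."                # the per-station BoM URL
item["description"] = "Temperature 24.1 C, wind SW at 15 km/h"
item["pubDate"] = time.localtime()
feed.items.append(item)

rss = feed.format_rss2_string()
atom = feed.format_atom_string()
```

And here is a sketch of the shape of the heuristic matcher; the capitalisation-based word splitting, and treating a lone trailing "I" as "Island" (as in "HindmarshI"), are assumptions about the abbreviation scheme rather than the actual code:

```python
import re

# "Is" -> "Island" and "Pt" -> "Port" are the expansions mentioned above;
# "I" -> "Island" is an assumed rule for names like "HindmarshI".
EXPANSIONS = {"I": "Island", "Is": "Island", "Pt": "Port"}

def expand(abbrev):
    """Turn an abbreviated station name into a candidate full name,
    e.g. "HindmarshI" -> "Hindmarsh Island"."""
    words = re.findall(r"[A-Z][a-z]*", abbrev)  # split on capital letters
    return " ".join(EXPANSIONS.get(w, w) for w in words)

def match_station(abbrev, full_names):
    """Return the full name matching an abbreviated name, or None."""
    candidate = expand(abbrev).lower()
    for full in full_names:
        if full.lower() == candidate:
            return full
    return None

# e.g. match_station("HindmarshI", ["Adelaide", "Hindmarsh Island"])
#      -> "Hindmarsh Island"
```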
Anyway, the project is progressing well. I still haven't contacted the Bureau regarding copyright licensing. I definitely will before publishing the RSS feeds, or perhaps immediately after, so that whoever at the Bureau has to consider the matter can actually see them.
It's kind of a shame that it looks like the more interesting parts of this project will be ready just in time for the soul-withering monotony of the heart of an Australian summer...