This blog has, IMO, some great resources. Unfortunately, some of those resources are becoming less relevant. I'm still blogging, learning tech and helping others...please find me at my new home on http://www.jameschambers.com/.

Thursday, April 16, 2009

Data Prep, Part 1: GPS Co-ordinates

I wish that there was a package out there that took any kind of data and was able to interpret it, validate it and correct any mistakes, regardless of the different symbols, notation or ordering used in the data as it was originally entered.

Actually, I suppose that package already exists.  It’s called a human. 

Unfortunately, they’re terribly inefficient and tend to bore when presented with mundane tasks, so we have to re-invent (a very small portion of) the wheel whenever we come across data that would give most DBAs nightmares.

I have to get this out there…I hate free-form text.

I just churned through an exercise in regular expressions that had me standardizing a set of 11,000 records containing GPS information.  In the end, there are only 100 lines that do not match the required pattern, which is less than 1% of the data, that will need to be corrected by hand.

For the most part, data was entered in some variant of DD MM SS.ss for both latitude and longitude. Sometimes those co-ords were inversed, which added a few headaches.  The range of characters used to express the text was as varied as the locations contained in the database.  When you see 9947126 in the database, are you able to tell if that means 99.4 71 26, or 99 47.1 26, or – the correct answer – of -99.0 47.1 2.6?

Not all of it could be done with regex alone;  we had section-township-range information which can be converted to a rectangular approximation.  We also had notes on some of the records, and some customers had work orders with even more hints as to how to solve the puzzle. 

Thankfully, the large majority of data was using the above stated pattern and I just had to deal with the user’s choice of delimiter (- 99*47.1’2.6”, for instance).

The other hurdle today was working through our dying pair of productions servers that are trying to serve up data from a database that lacks indexes, foreign keys, stored procs or a db version that caches execution paths.  Everytime you drill into a customer record or work order it pegs the servers.

We had three crashes today, eating up approximately 1.5 hours * 15 employees, so around three man days of time.  That’s real money there, so it was quite the load on my shoulders in my second full week on the job.

No comments:

Post a Comment