
Friday, October 8, 2010

Unravelling the Data – Ill-Formatted Data


Read the background to this post.

When Bad Data Is Required

Fixing the data in the legacy system was not something that could be done in place.  What I would refer to as ‘bad’ data was in some cases the glue that held reports together and made things like billing work.

This was one of the first things I had to resolve.  My original approach was that I was going to try to “self-heal” the data through a combination of regular expressions, string replacements and templated hints and helpers.  With the sheer number of discrepancies, this approach was DOA, and manual intervention was required.
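To give a flavour of what that "self-heal" pass looked like, here's a minimal sketch of one rule. The field and formats are made up for illustration; the real legacy schema had far more cases than this:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical example of the "self-heal" idea: normalize a free-form
// phone field with regexes and string replacements, and give up (return
// null) when no template matches, so a person can look at it instead.
static string TryHealPhone(string raw)
{
    // Strip everything but digits, then re-apply one canonical template.
    var digits = Regex.Replace(raw ?? "", @"\D", "");
    if (digits.Length == 11 && digits.StartsWith("1"))
        digits = digits.Substring(1); // drop a leading country code

    return digits.Length == 10
        ? $"({digits.Substring(0, 3)}) {digits.Substring(3, 3)}-{digits.Substring(6)}"
        : null; // anything else goes to the manual queue
}

Console.WriteLine(TryHealPhone("555.123.4567"));   // (555) 123-4567
Console.WriteLine(TryHealPhone("call after 5pm")); // empty -> manual queue
```

Rules like this handled the easy cases, but every customer-specific convention needed its own rule, which is exactly why the approach collapsed under the sheer number of discrepancies.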

A Side Order of Data Fixin’

I took a snapshot of the database and added additional columns to the tables where combined data was present.  To understand ‘combined data’ a little background will help.

At various points in the application lifecycle, management had decided that they weren't going to use certain fields for their original purpose and started using them for a new one.  In other scenarios, they decided to use the fields in one context for some customers and in a different context for other customers.

Depending on the customer and how long it took employees to shake old habits, these fields were used in differing ways over extended periods of time.  Furthermore, even if there had been a clear cut-over point, none of the records in the database had a last-modified date or any kind of audit log that revealed when a customer record was meaningfully changed.

Thus, my side order approach faced another problem: there was no clean cut-off in the data, and the existing applications needed to keep running.  A snapshot of today's data wouldn't help in the transition 6 months down the road.

The Birth of the Transition Platform

The solution was to create an ASP.NET MVC application, hosted only on the intranet, that used much of my original approach to identifying bad data but left the "healing" to an end user.

Where possible, I used jQuery to look up context-based fixes through controller actions and mashed-up some save functionality by POSTing to the legacy ASP pages of the original application.  Where it wasn’t possible (where functionality would be affected by changes to data) I created proxy tables to house the ‘corrected’ version of the data and wrote some monitors to periodically check to make sure that data was up-to-date.
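A rough sketch of the monitor idea for the proxy tables, assuming each corrected row also stores the legacy value it was based on (the names and shapes here are hypothetical). Since the legacy tables had no last-modified date, comparing raw values was the only way to detect drift:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical monitor: flag proxy-table corrections whose legacy source
// row has changed (or vanished) since the correction was recorded.
static List<int> FindStaleCorrections(
    IReadOnlyDictionary<int, string> legacyRows,                      // id -> current legacy value
    IReadOnlyDictionary<int, (string Seen, string Corrected)> proxy)  // id -> stored correction
{
    return proxy
        .Where(p => !legacyRows.TryGetValue(p.Key, out var current)
                    || current != p.Value.Seen)
        .Select(p => p.Key)
        .OrderBy(id => id)
        .ToList();
}

var legacy = new Dictionary<int, string>
{
    [1] = "ACME CO",
    [2] = "Foo, Inc. / bill to PO 7", // someone edited this in the old app
};
var proxy = new Dictionary<int, (string, string)>
{
    [1] = ("ACME CO", "Acme Co."),   // still in sync with the legacy row
    [2] = ("Foo, Inc.", "Foo, Inc."), // legacy row changed underneath us
};
Console.WriteLine(string.Join(",", FindStaleCorrections(legacy, proxy))); // 2
```

Stale entries like record 2 went back onto the worklist so the corrected version could be reconciled with whatever the old application had done in the meantime.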

I grouped the functionality of the fixes into distinct controllers.  For instance, anything related to a billing address lived in the BillingAddressController, with actions supporting the corrections required for that piece of the data.  The models were view-model representations of the "bad data", and I used repositories not only to connect to the legacy system, but also to maintain a worklog of outstanding and completed tasks.

This worked great, as I was also able to say, at any given point, what percentage of any given set of data had been corrected.
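The percentage reporting fell out of the worklog naturally. A minimal sketch of the idea, with hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical worklog: each record needing a fix is registered, completions
// are marked, and the percent-complete figure is just a ratio of the counts.
static double PercentComplete(IReadOnlyDictionary<int, bool> worklog) // id -> completed?
    => worklog.Count == 0
        ? 100.0
        : 100.0 * worklog.Values.Count(done => done) / worklog.Count;

var billingAddressWorklog = new Dictionary<int, bool>
{
    [101] = true, [102] = true, [103] = false, [104] = false, [105] = true,
};
Console.WriteLine($"{PercentComplete(billingAddressWorklog):F0}% complete"); // 60% complete
```

One of these per controller's slice of the data made it easy to report progress on, say, billing addresses versus shipping contacts.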

This process continues today, and time is devoted to cleaning data each week.  All three of the legacy systems continue to get (mis)used, though accuracy has been greatly improved.  As users became aware of the expected formats, they also became more conscious of how they were entering data into the older software.

This first win made the next steps more plausible.

Next up: Data that Could be Auto-Corrected


  1. What happened to #2 of this series?

  2. This is number 2!

    Is the title confusing? That's ill as in 'bad' not III as in 'the third'. ;o)