A Side Effect of Optimization – A Spring Cleaning for Your Data

When executives and managers start to plan new optimization projects within their operations, they usually consider the basics like hardware requirements, time to design and implement the model, and the need for user testing and acceptance. But data cleanup is an underestimated, and often underappreciated, necessity when implementing a new model, or even when just revamping a current one. Essential data may not be correct, may not be in a usable form, or may not even exist.

Bad or missing data can have a lethal effect on an optimization model. If data is missing completely, the model won’t be able to run. If data is incorrect, the model may give results that are valid based on the inputs but make no sense to the users. Model output problems caused by incorrect data can be particularly difficult to diagnose, especially if the initial assumption is that the data is complete, clean, and correct. This leads the OR team down a path of debugging issues in the model that don’t exist. Often the team will only check the data after exhausting all possibilities within the actual model.
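One way to avoid that detour is to run a few cheap checks on the inputs before the model ever sees them. The following is only a minimal sketch in Python, not the checks from any particular project: the `Segment` record, its fields, and the `validate_segments` helper are all hypothetical names chosen for illustration. It flags both kinds of trouble described above, values that are missing outright and values that are present but implausible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """One hypothetical input record: a route segment with a distance."""
    origin: str
    destination: str
    distance: Optional[float]  # None means the field was never filled in


def validate_segments(segments):
    """Return a list of human-readable problems found in the input data."""
    problems = []
    for i, seg in enumerate(segments):
        label = f"record {i} ({seg.origin} -> {seg.destination})"
        if seg.distance is None:
            # Missing data: the model could not run with this record.
            problems.append(f"{label}: distance is missing")
        elif seg.distance <= 0:
            # Incorrect data: the model would run, but its answer would be nonsense.
            problems.append(f"{label}: distance {seg.distance} is not positive")
    return problems


# Made-up sample records, purely for illustration.
segments = [
    Segment("A", "B", 120.0),
    Segment("B", "C", None),   # missing value
    Segment("C", "D", 0.0),    # present but implausible
]

issues = validate_segments(segments)
for issue in issues:
    print(issue)
```

Checks like these will not catch every flaw, but a non-empty list of issues is a strong hint to look at the data before suspecting the model's logic.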

This happened to me recently while working on a new MIP model to replace an older model currently in production. Since the previous model had been in use for years by the client, the team assumed the data was fine. After all, any data issues would have been found during the implementation and continued use of the previous model. However, the previous model had some logic flaws that caused it to sometimes return infeasible results. The users got accustomed to blaming the model for solution problems without investigating the actual cause of a particular issue.

When we started user testing on our new model, we took a structured approach to validating its logic. When the users reported an issue with the model’s output, we tracked the issue to the exact spot in the model’s logic so that we could fix it. Sometimes the issue wasn’t in the model’s logic but was instead in the data. One day we were tracking down an issue reported by the users when we realized that a key piece of input data for the model was not always accurate. In fact, when it was correct it was due to luck and not good data.

Since this was pretty fundamental data for the model, there was the obvious question of how the problem had not been identified in all the years that the previous model was in production. The answer is simply that the previous model’s logic issues masked the data issue. The users saw issues in the model’s output often enough that they became desensitized to them and just assumed any issues in a solution were the result of bad logic rather than bad data. Only when we began to rigorously identify the cause of reported issues did we start to see the data problems that had existed all along.

The moral of the story? There are two: 1) Never assume the data is complete, clean, and correct; always verify. 2) When starting a new OR project, always leave time for data cleanup. Even data that has been used for years can have flaws.


2 Responses to A Side Effect of Optimization – A Spring Cleaning for Your Data

  1. Paul Rubin says:

    Nice to see a post on an aspect of OR that often goes unnoticed (and comes as a shock to new practitioners the first time they trip over it). In my case, that was an implementation of Dijkstra’s shortest path algorithm to route rail shipments for a client that shall remain nameless, minimizing either distance or time. On my code’s maiden voyage, it routed a shipment from the vicinity of Atlanta to a destination in Massachusetts via Cincinnati > Cleveland > Salt Lake City > Seattle > Milpitas (CA) > Van Nuys (CA) > Oklahoma City > outskirts of Boston. (I may have missed a stop there; my memory fades.) We used a database containing a record for every segment of every rail shipment they’d ever made. One field was distance. The (underpaid? unmotivated? loosely supervised?) employees who filled in the records sometimes skipped the distance field, which the computer then recorded as zero.

    I saw an article a few years ago in which a statistical consultant asserted that 60% or so of the time spent on a typical project went to data cleaning. Which leads us to Stamp’s Law.

  2. Thiago Serra says:

    That reminds me of a story I heard from a nearly retired OR analyst about his early days. He was on a project in which the data was not only redundant but also self-conflicting. They realized that it was not possible to go any further and advised the client to create a data model instead of trying to optimize anything.

    I think that we are luckier today for having many companies using databases instead of unstructured data. However, some spreadsheets still provoke nightmares in OR projects.
