When executives and managers plan new optimization projects within their operations, they usually consider the basics: hardware requirements, time to design and implement the model, and the need for user testing and acceptance. But data cleanup is an underestimated, and often underappreciated, necessity when implementing a new model, or even when revamping a current one. Essential data may be incorrect, may not be in a usable form, or may not exist at all.
Bad or missing data can be lethal to an optimization model. If data is missing entirely, the model simply won't run. If data is incorrect, the model may return results that are valid given the inputs but make no sense to the users. Output problems caused by incorrect data can be particularly difficult to diagnose, especially if the initial assumption is that the data is complete, clean, and correct. That assumption leads the OR team down a path of debugging issues in the model that don't exist; often the team checks the data only after exhausting every possibility within the model itself.
This happened to me recently while working on a new MIP model to replace an older one in production. Since the previous model had been in use by the client for years, the team assumed the data was fine; after all, any data issues would have surfaced during the implementation and continued use of that model. However, the previous model had logic flaws that caused it to sometimes return infeasible results, and the users had grown accustomed to blaming the model for solution problems without investigating the actual cause of any particular issue.
When we started user testing on the new model, we took a structured approach to validating its logic. When users reported an issue with the model's output, we traced it to the exact spot in the model's logic so that we could fix it. Sometimes the issue wasn't in the logic at all but in the data. One day, while tracking down a reported issue, we realized that a key piece of the model's input data was not always accurate. In fact, when it was correct, it was by luck rather than by good data.
Since this was fundamental data for the model, the obvious question was how the problem had gone unnoticed in all the years the previous model was in production. The answer is simple: the previous model's logic issues masked the data issue. The users saw problems in the model's output often enough that they became immune to them and assumed any issue in a solution was the result of bad logic rather than bad data. Only when we began rigorously identifying the cause of each reported issue did we start to see the data problems that had existed all along.
The moral of the story? There are two: 1) never assume the data is complete, clean, and correct – always verify; and 2) when starting a new OR project, always leave time for data cleanup. Even data that has been used for years can have flaws.
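One practical way to act on "always verify" is to run a cheap validation pass over the input data before every solve, rather than trusting it. Here is a minimal sketch of that idea; the record fields (`item_id`, `demand`, `capacity`) are hypothetical stand-ins for whatever inputs a particular model actually consumes, not fields from the project described above.

```python
# Minimal pre-solve data validation sketch. The fields checked here
# (item_id, demand, capacity) are hypothetical examples -- substitute
# the inputs your own model depends on.

def validate_records(records):
    """Return a list of human-readable problems found in the input data.

    An empty list means the data passed these basic checks; it does NOT
    prove the data is correct -- only that it is present and plausible.
    """
    problems = []
    seen_ids = set()
    for i, rec in enumerate(records):
        # Completeness: the model cannot run with missing fields.
        for field in ("item_id", "demand", "capacity"):
            if rec.get(field) is None:
                problems.append(f"record {i}: missing {field}")
        # Sanity: values that are present but impossible produce
        # solutions that are "valid" yet make no sense to users.
        if rec.get("demand") is not None and rec["demand"] < 0:
            problems.append(f"record {i}: negative demand {rec['demand']}")
        if rec.get("capacity") is not None and rec["capacity"] <= 0:
            problems.append(
                f"record {i}: non-positive capacity {rec['capacity']}")
        # Duplicates often signal an upstream join or export gone wrong.
        if rec.get("item_id") in seen_ids:
            problems.append(f"record {i}: duplicate item_id {rec['item_id']}")
        seen_ids.add(rec.get("item_id"))
    return problems


if __name__ == "__main__":
    data = [
        {"item_id": "A", "demand": 10, "capacity": 100},
        {"item_id": "B", "demand": -5, "capacity": 100},  # impossible value
        {"item_id": "A", "demand": 3, "capacity": 100},   # duplicate id
        {"item_id": "C", "demand": 7, "capacity": None},  # missing value
    ]
    for problem in validate_records(data):
        print(problem)
```

Refusing to solve until the problem list is empty surfaces data issues as data issues, instead of letting them masquerade as model-logic bugs in the output.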