The latest Numbers Guy column in the Wall Street Journal investigates why early predictions seem to have significantly overestimated the magnitude of the H1N1 epidemic. Luckily, the swine flu did not turn into the massive epidemic people feared. While the column focuses on why the predictions were so far from the actual counts, I found this quote from Jimmy Efird, a biostatistician at the University of North Carolina, Greensboro, more interesting:
Sometimes you’re better off not trying to estimate something when it’s difficult to model.
But after reading the article, it seems to me that the problem was a lack of data rather than an incorrect model. The author notes that only small data sets were being used, and the column’s title in fact points to flawed data, not necessarily a flawed model. It is easy to blame the model when things don’t go the way we plan, but sometimes the data (or the lack of it) is to blame.
Not having enough data, or not having the right data, is a common problem when practicing operations research (OR) in business. A few years ago, a client wanted to analyze the viability of schedules created days in advance. These schedules were optimal for the data available when the model was run; however, between the time a schedule was created and the time it was used, it often had to be modified to accommodate new or updated data. The company wanted to know whether the model could be improved to produce schedules that were less sensitive to these data changes.
The problem was that the company was not tracking how the data changed, why changes had to be made to the schedule, or even what changes had been made. Once a schedule had been changed, the user could no longer even look at the previous version to compare the two. How could the model be improved to produce better solutions if we didn’t even know what was wrong with the original solution? We simply lacked the data needed to make these adjustments to the model.
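To make the missing record-keeping concrete, here is a minimal Python sketch of what tracking schedule changes could look like. Everything in it is hypothetical: the class names, the task-to-time-slot dictionary, and the sample jobs are my own illustration, not the client’s actual system. The idea is simply to keep every version of the schedule, along with the reason it was changed, so that two versions can be compared later.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ScheduleVersion:
    """One saved snapshot of a schedule (hypothetical representation)."""
    assignments: dict   # task name -> time slot
    reason: str         # why this version was created
    saved_at: datetime  # when it was saved (UTC)


@dataclass
class ScheduleHistory:
    """Keeps every version of a schedule instead of overwriting it."""
    versions: list = field(default_factory=list)

    def save(self, assignments, reason):
        # Copy the dict so later edits cannot alter the stored history.
        snapshot = ScheduleVersion(dict(assignments), reason,
                                   datetime.now(timezone.utc))
        self.versions.append(snapshot)

    def diff(self, old_index, new_index):
        """Return {task: (old_slot, new_slot)} for every task that changed."""
        old = self.versions[old_index].assignments
        new = self.versions[new_index].assignments
        changes = {}
        for task in sorted(old.keys() | new.keys()):
            before, after = old.get(task), new.get(task)
            if before != after:
                changes[task] = (before, after)  # None means "not scheduled"
        return changes


history = ScheduleHistory()
history.save({"job_a": "Mon 08:00", "job_b": "Mon 10:00"},
             reason="initial optimal schedule")
history.save({"job_a": "Mon 08:00", "job_b": "Tue 09:00", "job_c": "Mon 10:00"},
             reason="job_c added; job_b delayed by a late input")

print(history.diff(0, 1))
# {'job_b': ('Mon 10:00', 'Tue 09:00'), 'job_c': (None, 'Mon 10:00')}
```

With a log like this, one could at least ask which tasks move most often and for what reasons, which is the kind of evidence needed before deciding how to adjust the model.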
Rather than immediately questioning the model when the solution doesn’t come out as we expect, we should first consider whether we have enough data and the right data. Just as a computer can only do what the user tells it to do, a model can only use the data we give it. After spending hours or days finding the right mathematical formulation, no one wants to hear that there isn’t enough data to run the model. But collecting sufficient and appropriate data is just as important a task as creating the right model.