November 30, 2011
(riffing on this course announcement)
There are oceans of data out there. The human ability to think in a million dimensions is limited, and that’s where models come in. All of statistics could, if you wished, be reframed in terms of models. An average is just the result of a model where every individual is assigned the same value. This is reductionist, of course, and we strive for a useful combination of simplicity and accuracy.
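To make the "average as a model" idea concrete, here is a minimal sketch. The data values and the function names (`fit_constant_model`, `predict`) are invented for illustration; the point is only that the mean can be read as a model that predicts the same number for every individual.

```python
from statistics import mean

def fit_constant_model(values):
    """Fit the simplest possible model: a single number for everyone."""
    return mean(values)

def predict(model_value, n):
    """The model assigns the same value to each of n individuals."""
    return [model_value] * n

# Hypothetical data, made up for this example.
heights = [150.0, 160.0, 170.0, 180.0]

model = fit_constant_model(heights)
predictions = predict(model, len(heights))

print("model value:", model)
print("predictions:", predictions)
```

The model is reductionist in exactly the sense described above: four different individuals, one predicted value.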
But all models are wrong (otherwise they’re not models), and we need to know how wrong they are. This is where statistics comes into its own: answering the question “how wrong”. When we’re modelling a particular data set, we can state exactly how wrong by calculating residuals. Usually, however, we also want to summarize how wrong. So we have measures like the standard deviation and the root mean squared error. It can also be useful to examine how accurately the RMS error represents the residuals, but quantifying this can easily lead to a sinkhole.
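A rough sketch of "exactly how wrong" versus "summarized how wrong", using a constant (mean) model on made-up numbers. The helper names `residuals` and `rmse` are my own, not from any particular library.

```python
from math import sqrt
from statistics import mean, pstdev

def residuals(observed, predicted):
    """Exactly how wrong: one error per observation."""
    return [o - p for o, p in zip(observed, predicted)]

def rmse(res):
    """Summarized how wrong: root mean squared error of the residuals."""
    return sqrt(mean([r * r for r in res]))

# Hypothetical observations, invented for this example.
observed = [2.0, 4.0, 6.0, 8.0]

# The model here is just the average, applied to every individual.
predicted = [mean(observed)] * len(observed)

res = residuals(observed, predicted)
summary = rmse(res)

print("residuals:", res)
print("RMSE:", summary)
# For the mean model, the RMSE coincides with the population standard deviation.
print("population SD:", pstdev(observed))
```

The residuals keep all the detail; the RMSE collapses them into one number, which is itself a model of the errors.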
Assessing the accuracy of predictions made by a model is a different matter. Consider the case where you have data from a nice stationary process. The most reliable way of dealing with this is splitting data into training and test sets, though there are shortcuts that may work well in the right circumstances.
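Here is a minimal sketch of the training/test idea under friendly assumptions: the data are simulated (a straight line plus independent Gaussian noise, my choice for illustration), the split is a simple random 75/25, and the model is an ordinary least-squares line fitted by hand. None of these choices come from the course announcement.

```python
import random
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def rmse(xs, ys, slope, intercept):
    """Root mean squared prediction error of the fitted line."""
    sq = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return sqrt(sum(sq) / len(sq))

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Simulated "nice" data (assumption): y = 2x + 1 plus noise with SD 1.
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

# Random split into training and test sets.
indices = list(range(len(xs)))
rng.shuffle(indices)
train_idx, test_idx = indices[:150], indices[150:]

train_x = [xs[i] for i in train_idx]
train_y = [ys[i] for i in train_idx]
test_x = [xs[i] for i in test_idx]
test_y = [ys[i] for i in test_idx]

# Fit on the training set only, then assess on data the model never saw.
slope, intercept = fit_line(train_x, train_y)
train_rmse = rmse(train_x, train_y, slope, intercept)
test_rmse = rmse(test_x, test_y, slope, intercept)

print("training RMSE:", train_rmse)
print("test RMSE:", test_rmse)
```

The test RMSE is the honest estimate of prediction error; the training RMSE tends to flatter the model, more so as the model gets more flexible.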
In many cases, the data aren’t so nice. This is where you need to be very careful about how much you trust not just your models, but also the models underlying your error assessments. It’s not enough, for instance, to select between a null and an alternative model if neither is particularly close to the truth. Subject matter knowledge is crucial here. Statisticians should do a better job of helping subject matter experts out.
Conclusion: The value of models seems self-evident. If I were teaching a course on models, I would be tempted for it to consist entirely of repetitions of “all models are wrong”. Would have to think hard about how to make it more constructive than that.