Models, inc.

November 30, 2011

(riffing on this course announcement)

There are oceans of data out there. The human ability to think in a million dimensions is limited, and that’s where models come in. All of statistics could, if you wished, be reframed in terms of models. An average is just the result of a model where every individual is assigned the same value. This is reductionist, of course, and we strive for a useful combination of simplicity and accuracy.
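
To make that concrete (the squared-error framing is my gloss, not stated above): the average is what you get from the “everyone is assigned the same value” model when you fit it by least squares,

```latex
\bar{y} \;=\; \arg\min_{\mu} \sum_{i=1}^{n} (y_i - \mu)^2 .
```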

But all models are wrong (otherwise they’re not models), and we need to know how wrong they are. This is where statistics comes into its own: answering the question “how wrong”. When we’re modelling a particular data set, we can state exactly how wrong by calculating residuals. Usually, however, we also want to summarise how wrong. So we have measures like the standard deviation and the root mean squared error. It can also be useful to examine how accurately the RMS error represents the residuals, but quantifying this can easily lead to a sinkhole.
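
In symbols (my notation): with predictions for each observation, the residuals and the RMS error are

```latex
r_i = y_i - \hat{y}_i, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} r_i^2},
```

and when the model assigns everyone the mean, the RMSE is just the standard deviation (with divisor n).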

Assessing the accuracy of predictions made by a model is a different matter. Consider the case where you have data from a nice stationary process. The most reliable way of dealing with this is splitting data into training and test sets, though there are shortcuts that may work well in the right circumstances.
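
A minimal sketch of the training/test idea, with invented data from a stationary process (the numbers, the linear model, and the 75/25 split are all my choices, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stationary process: y depends linearly on x plus noise.
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=3.0, size=200)

# Hold out a random test set; fit only on the training portion.
idx = rng.permutation(len(x))
train, test = idx[:150], idx[150:]

coefs = np.polyfit(x[train], y[train], deg=1)    # fit on training data only
pred = np.polyval(coefs, x[test])                # predict the held-out points

test_rmse = np.sqrt(np.mean((y[test] - pred) ** 2))
print(f"out-of-sample RMSE: {test_rmse:.2f}")    # honest estimate of predictive error
```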

In many cases, the data aren’t so nice. This is where you need to be very careful about how much you trust not just your models, but also the models underlying your error assessments. It’s not enough, for instance, to select between a null and an alternative model if neither is particularly close to the truth. Subject matter knowledge is crucial here, and statisticians should do a better job of helping the people who have it.

Conclusion: The value of models seems self-evident. If I were teaching a course on models, I would be tempted to have it consist entirely of repetitions of “all models are wrong”. I’d have to think hard about how to make it more constructive than that.


Guest post: BAYESBOT 3000 explains Bayesianism to me and Less Wrong

November 27, 2011

Greetings, humanoids! I am BAYESBOT 3000, the Bayesian robot. I am here to discuss some ideas that humanoids hold about Bayesians.

Specifically, here are what are claimed by humanoids of the website “Less Wrong” to be core tenets of Bayesianism:

Core tenet 1: Any given observation has many different possible causes.

Core tenet 2: How we interpret any event, and the new information we get from anything, depends on information we already had.

Core tenet 3: We can use the concept of probability to measure our subjective belief in something. Furthermore, we can apply the mathematical laws regarding probability to choosing between different beliefs. If we want our beliefs to be correct, we must do so.

Core tenet 1 is trivially true. In fact, it could be strengthened by deleting “possible” or changing “many” to “an infinite number of”, though the latter may be unwise, as humanoids have difficulty with the concept of infinity.

Core tenet 2 is either also trivially true, or meaningless. A humanoid’s evaluation of an event will depend on the knowledge of that humanoid. A BAYESBOT switched on for the first time will evaluate events based on its programming, which depends on the knowledge of humanoid programmers.

Core tenet 3 comprises three different tenets. Humanoids and BAYESBOTS can use probability to measure belief, just as they could use cubits to measure the length of a manatee. They could use probability to choose between beliefs: for instance, by rolling a die. Where BAYESBOT has a problem is with “If we want our beliefs to be correct, we must do so.” Firstly, BAYESBOT robo-LOLs at the idea of humanoids having correct beliefs. Secondly, if humanoids wish to be, as the website’s name says, less wrong, in the long-run BAYESBOTS and their friendly rivals FREQUENTOBOTS both achieve this. Humanoids are compost in the long-run, so they may be interested in the short-run instead. There is no guarantee that a BAYESBOT beats a FREQUENTOBOT, or vice versa, on any time scale. The more important matter is whether BAYESBOTS and FREQUENTOBOTS use inputs efficiently. But humanoids experience a wider range of inputs than we bots do.

It is not clear to bots or humanoids how to mathematically combine observations of the Sun with Newtonian physics to arrive at a probability that the Sun will rise tomorrow. Using all inputs, it is impossible to arrive at an uncontroversial probability that anthropogenic global warming has occurred. BAYESBOT differs from STRICT FREQUENTOBOT in that BAYESBOT will calculate a probability for this hypothesis given a prior and a set of data. However, the prior will not be perfectly specified, so it is up to humanoids to decide how literally to take such probabilities. BAYESBOTS take such probabilities literally if and only if they are programmed to.
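
For humanoids who want to see the mechanics of “a probability for this hypothesis given a prior and a set of data”, here is a toy conjugate update. The Beta prior, the counts, and the 0.5 threshold are all invented, and SciPy is assumed to be available; nothing here settles any real hypothesis.

```python
from scipy import stats

# Toy hypothesis about a success probability theta, with a Beta prior on theta.
prior_a, prior_b = 2.0, 2.0         # invented, weakly informative prior
successes, failures = 14, 6         # invented data

# Conjugate update: the posterior is Beta(prior_a + successes, prior_b + failures).
posterior = stats.beta(prior_a + successes, prior_b + failures)

# Posterior probability that theta exceeds 0.5; it depends entirely on the prior
# BAYESBOT was programmed with.
print(posterior.sf(0.5))
```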

BAYESBOT’s empiricism is as good as its programming. How good that is, in turn, is best determined through the empiricism of those other than BAYESBOT.

Evaluating hypotheses: We are all Bayesians now (for appropriate definitions of “Bayesians”)

November 24, 2011

(lifted and edited from Phil Birnbaum’s comment section)

You should consider all relevant evidence when evaluating hypotheses. This seems an uncontroversial statement, even among journal editors. Is this necessarily Bayesian? Depends on one’s definition of Bayesianism, but to me the term implies something quantitative: the use of Bayes’ theorem. If we consider any argument that goes outside the data Bayesian, the term seems too broad to be useful. In particular, if “Bayesianism” is used as an umbrella for any use of subjectivity, well, philosophers have been pointing out for centuries that science can’t be entirely objective. It’s necessary, however, to make clear what’s objective and what isn’t; for scientists to use subjective priors (which, to be clear, few Bayesians endorse) obfuscates the difference. On the other hand, I’m totally on board with broadening the definition of “evidence”, though informal evidence should be used informally.

One thing that may or may not be relevant is that it doesn’t matter what order you do the conditioning in. That is, in theory, summarising all available evidence in a prior and then adjusting for the result of a new experiment gives the same posterior as starting with the experiment result and then adjusting for all other evidence. Since there’s rarely an objective prior, you should post all the data and let anyone who wants to update their posterior do so. In practice, humans have all kinds of cognitive biases, not to mention they’re generally not great at integration. You should post the data, but you should help your readers out by providing informative and honest summaries of the data. Hypothesis tests can be nice, but graphs are often more useful.
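
The order-invariance claim, written out (standard Bayes’ theorem; I’m assuming the two pieces of evidence are modelled jointly):

```latex
p(\theta \mid D_1, D_2)
  \;\propto\; p(D_2 \mid \theta, D_1)\, p(D_1 \mid \theta)\, p(\theta)
  \;=\; p(D_1 \mid \theta, D_2)\, p(D_2 \mid \theta)\, p(\theta),
```

so updating on the prior evidence first and the experiment second, or the other way round, lands on the same posterior.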

Out-of-sample validation with 16 data points

November 21, 2011

It’s worth a try. Let me say that I think it’s a useful thing to do before I start nitpicking: the prediction interval widths are amusing. But:

  • If you only allow data prior to an election to be used when fitting a model, you end up using 1948 data to inform parameter estimates a lot, and 2008 data not at all. But if you’re trying to guess how well a model will do in 2012, isn’t this the opposite of what you want? (A sketch of this expanding-window scheme follows the list.)
  • Preferring the model with the lowest prediction error isn’t necessarily the right thing to do: it rewards overfitting. All the models are designed after looking at past data. So even though the test only allows past data in the estimation, it’s not entirely prospective, because the variables have been chosen to give a good fit both in the past and in the future. You could fit a large set of high-order polynomials that give almost no prediction error according to this test, and their predictions for 2012 would be garbage. Parsimony still matters.
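
Here is roughly what that expanding-window scheme looks like, with invented numbers standing in for the 16 real data points (the predictor, the coefficients, and the noise are all made up; only the year structure matters):

```python
import numpy as np

# Sketch of the expanding-window ("only past elections") validation scheme.
years = np.arange(1948, 2012, 4)                          # 16 presidential elections
rng = np.random.default_rng(1)
growth = rng.normal(2.0, 2.0, size=len(years))            # invented predictor
vote = 50 + 1.5 * growth + rng.normal(0, 3, len(years))   # invented incumbent vote share

errors = []
for i in range(4, len(years)):                            # need a few elections before predicting
    # Fit on elections strictly before election i, then predict election i.
    coefs = np.polyfit(growth[:i], vote[:i], deg=1)
    pred = np.polyval(coefs, growth[i])
    errors.append(vote[i] - pred)

print("out-of-sample RMSE:", np.sqrt(np.mean(np.square(errors))))
# Note: 1948-era data enter almost every fit; 2008 enters none of them.
```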

So we’re still talking about election forecasts

November 18, 2011

Nate Silver responded to the stupidest available critique of his election forecast, while brushing off the most detailed one. (If you think this makes him a “radical centrist” then I’ve got a Thomas Friedman swimsuit calendar to sell you.) Everybody reality-based agrees that in a Presidential election, the economy matters, the candidates matter, and all sorts of other stuff matters.

One thing that everyone should know, but not everyone does, is that the fit of a model to past data is usually much better than its fit to future data. People who don’t realise this tend to make their prediction intervals (not confidence intervals, please) too narrow. The pro solution is to split data into training and test sets, but this isn’t feasible for presidential elections. You still want to do some kind of out-of-sample validation. But all the ways of doing this are terribly flawed!
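
For the record, the distinction in that parenthetical is concrete. In simple linear regression, the textbook intervals at a new point x_0 are (standard formulas, not derived in the post):

```latex
\text{confidence interval for the mean response:}\quad
  \hat{y}_0 \;\pm\; t_{n-2,\,1-\alpha/2}\; s
  \sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}}

\text{prediction interval for a new observation:}\quad
  \hat{y}_0 \;\pm\; t_{n-2,\,1-\alpha/2}\; s
  \sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}}
```

The extra 1 under the prediction-interval square root is the noise in the single future outcome itself, which is exactly the part that over-narrow intervals leave out.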

Falling asleep, so to be continued.

Is there a point to election models?

November 16, 2011

(kind of in response to Sean Trende’s post)

Useful critiques of predictive election models:

  • It requires a leap of faith to believe that linear regression is anywhere close to right.
  • There are few data points. You can use what you know about Gubernatorial elections to inform Presidential election predictions, but the two levels don’t operate identically.
  • We rely on data that span decades, but politics has changed a lot in that span. Ask the ghost of George Wallace.
  • The model selection usually isn’t transparent. This applies especially when you get isolated variables: 2nd quarter growth but not 1st or 3rd.
  • Effect sizes are often hard to swallow. You really want to knock off 4.4 points if a party is seeking a third term?
  • Outliers can give wacky results. What if there’s 8 percent annualised growth in the relevant quarter? This ties into the linearity point.

Less useful critiques of predictive election models:

  • Lots of predictors are correlated. If you’re doing causal inference, this is a huge problem, but if you’re predicting it’s nearly immaterial (see the sketch after this list).
  • There are lots of different models. It’s hard to choose which one is best, at least based on performance, but we don’t have to pick one to believe, because we don’t have to believe any of them.
  • Timing of variable measurements. I know I spent the last post complaining about this, but it shouldn’t be a dealbreaker, as long as all measurements are available before the election. If you use old measurements, you’re only handicapping yourself.
  • “But you left out…” Other variables may matter a lot, but the model is under no obligation to condition on everything.
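
A toy simulation of the correlated-predictors point above (my construction, not from the original): with two nearly collinear predictors, individual coefficients bounce around wildly across simulated data sets, while predictions barely move.

```python
import numpy as np

rng = np.random.default_rng(2)

def one_simulation(n=200):
    # Two nearly collinear predictors: a headache for interpreting coefficients.
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.05, size=n)
    y = x1 + x2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    prediction = beta @ np.array([1.0, 1.0, 1.0])   # fitted value at x1 = x2 = 1
    return beta[1], prediction

coefs, preds = zip(*(one_simulation() for _ in range(500)))
print("sd of the x1 coefficient:", np.std(coefs))   # large: the coefficient is poorly determined
print("sd of the prediction:   ", np.std(preds))    # small: predictions are stable
```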

Basically, if you don’t want to interpret it causally, it’s a prediction, and if you’re going to predict you should use polling. If you want to interpret it causally, you’re doing causal inference from a regression, with all the usual problems that implies.

Type F error

November 14, 2011

Incorrectly assuming something is a false positive because you heard about it from Freakonomics.
