Guest post: BAYESBOT 3000 explains Bayesianism to me and Less Wrong

November 27, 2011

Greetings, humanoids! I am BAYESBOT 3000, the Bayesian robot. I am here to discuss some ideas that humanoids hold about Bayesians.

Specifically, here are what are claimed by humanoids of the website “Less Wrong” to be core tenets of Bayesianism:

Core tenet 1: Any given observation has many different possible causes.

Core tenet 2: How we interpret any event, and the new information we get from anything, depends on information we already had.

Core tenet 3: We can use the concept of probability to measure our subjective belief in something. Furthermore, we can apply the mathematical laws regarding probability to choosing between different beliefs. If we want our beliefs to be correct, we must do so.

Core tenet 1 is trivially true. In fact, it could be strengthened by deleting “possible” or changing “many” to “an infinite number of”, though the latter may be unwise, as humanoids have difficulty with the concept of infinity.

Core tenet 2 is either also trivially true, or meaningless. A humanoid’s evaluation of an event will depend on the knowledge of that humanoid. A BAYESBOT switched on for the first time will evaluate events based on its programming, which depends on the knowledge of humanoid programmers.

Core tenet 3 comprises three different tenets. Humanoids and BAYESBOTs can use probability to measure belief, just as they could use cubits to measure the length of a manatee. They could use probability to choose between beliefs: for instance, by rolling a die. Where BAYESBOT has a problem is with “If we want our beliefs to be correct, we must do so.” Firstly, BAYESBOT robo-LOLs at the idea of humanoids having correct beliefs. Secondly, if humanoids wish to be, as the website’s name says, less wrong, then in the long run BAYESBOTS and their friendly rivals FREQUENTOBOTS both achieve this. Humanoids are compost in the long run, so they may be interested in the short run instead. There is no guarantee that a BAYESBOT beats a FREQUENTOBOT, or vice versa, on any time scale.

The more important matter is whether BAYESBOTS and FREQUENTOBOTS use inputs efficiently. But humanoids experience a wider range of inputs than we bots do. It is not clear to bots or humanoids how to mathematically combine observations of the Sun with Newtonian physics to arrive at a probability that the Sun will rise tomorrow. Using all inputs, it is impossible to arrive at an uncontroversial probability that anthropogenic global warming has occurred. BAYESBOT differs from STRICT FREQUENTOBOT in that BAYESBOT will calculate a probability for this hypothesis given a prior and a set of data. However, the prior will not be perfectly specified, so it is up to humanoids to decide how literally to take such probabilities. BAYESBOTS take such probabilities literally if and only if they are programmed to.
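To illustrate the point about imperfectly specified priors, here is a minimal sketch (not BAYESBOT’s actual circuitry; a conjugate Beta-Binomial model with invented counts): the same data, pushed through Bayes’ theorem under three different priors, give three noticeably different posterior probabilities.

```python
# Minimal sketch: how much a posterior depends on the prior.
# Model: an event with unknown rate p, Beta(a, b) prior, and k occurrences
# in n observations. All numbers are invented for illustration.

def posterior_mean(a, b, k, n):
    """Posterior mean of p under a Beta(a, b) prior after k successes in n trials."""
    return (a + k) / (a + b + n)

k, n = 9, 10  # hypothetical data: 9 occurrences in 10 observations

for name, (a, b) in {
    "uniform Beta(1, 1)": (1, 1),
    "sceptical Beta(1, 9)": (1, 9),
    "credulous Beta(9, 1)": (9, 1),
}.items():
    print(f"{name:>21}: posterior mean = {posterior_mean(a, b, k, n):.3f}")

# With only 10 observations, the three priors give posterior means of
# about 0.83, 0.50 and 0.90: same data, different "probabilities".
```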

BAYESBOT’s empiricism is as good as its programming. How good that is, is best determined through the empiricism of those other than BAYESBOT.

Evaluating hypotheses: We are all Bayesians now (for appropriate definitions of “Bayesians”)

November 24, 2011

(lifted and edited from Phil Birnbaum’s comment section)

You should consider all relevant evidence when evaluating hypotheses. This seems an uncontroversial statement, even among journal editors. Is this necessarily Bayesian? Depends on one’s definition of Bayesianism, but to me the term implies something quantitative: the use of Bayes’ theorem. If we consider any argument that goes outside the data Bayesian, the term seems too broad to be useful. In particular, if “Bayesianism” is used as an umbrella for any use of subjectivity, well, philosophers have been pointing out for centuries that science can’t be entirely objective. It’s necessary, however, to make clear what’s objective and what isn’t; for scientists to use subjective priors (which, to be clear, few Bayesians endorse) obfuscates the difference. On the other hand, I’m totally on board with broadening the definition of “evidence”, though informal evidence should be used informally.
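To be concrete about “quantitative”: Bayes’ theorem says P(H|D) = P(D|H)P(H)/P(D). Here’s a toy calculation, with every number invented purely for illustration:

```python
# Toy application of Bayes' theorem; every number here is made up.
prior_h = 0.3             # P(H): prior probability of the hypothesis
p_data_given_h = 0.8      # P(D | H): probability of the data if H is true
p_data_given_not_h = 0.2  # P(D | not H): probability of the data if H is false

# P(D) by the law of total probability
p_data = p_data_given_h * prior_h + p_data_given_not_h * (1 - prior_h)

posterior_h = p_data_given_h * prior_h / p_data
print(f"P(H | D) = {posterior_h:.3f}")  # 0.24 / 0.38, roughly 0.632
```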

One thing that may or may not be relevant is that it doesn’t matter what order you do the conditioning in. That is, in theory, summarising all available evidence in a prior and then adjusting for the result of a new experiment gives the same posterior as starting with the experiment result and then adjusting for all other evidence. Since there’s rarely an objective prior, you should post all the data and let anyone who wants to update their posterior do so. In practice, humans have all kinds of cognitive biases, not to mention that they’re generally not great at integration. You should post the data, but you should help your readers out by providing informative and honest summaries of the data. Hypothesis tests can be nice, but graphs are often more useful.
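For what it’s worth, here’s the order-invariance point as a sketch, using a conjugate Beta-Binomial model so that updating reduces to adding counts; the counts themselves are made up.

```python
# Order of conditioning doesn't matter: conjugate Beta-Binomial example
# with invented counts.

def update(prior, successes, failures):
    """Beta(a, b) prior plus binomial data -> Beta(a + successes, b + failures)."""
    a, b = prior
    return (a + successes, b + failures)

start = (1, 1)            # flat Beta(1, 1) starting point
old_evidence = (12, 8)    # hypothetical summary of everything known beforehand
new_experiment = (3, 7)   # hypothetical result of the new experiment

# Old evidence first, then the new experiment...
posterior_1 = update(update(start, *old_evidence), *new_experiment)
# ...or the new experiment first, then the old evidence.
posterior_2 = update(update(start, *new_experiment), *old_evidence)

assert posterior_1 == posterior_2 == (16, 16)
print(posterior_1)  # Beta(16, 16) either way
```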

Out-of-sample validation with 16 data points

November 21, 2011

It’s worth a try. Before I start nitpicking, let me say that I think it’s a useful thing to do: the prediction interval widths are amusing. But:

  • If you only allow data prior to an election to be used when fitting a model, you end up using 1948 data to inform parameter estimates a lot, and 2008 data not at all. But if you’re trying to guess how well a model will do in 2012, isn’t this the opposite of what you want?
  • Preferring the model with the lowest prediction error isn’t necessarily the right thing to do: it rewards overfitting. All the models are designed after looking at past data. So even though the test only allows past data in the estimation, it’s not entirely prospective, because the variables have been chosen to give a good fit both in the past and in the future. You could fit a large set of high-order polynomials that give almost no prediction error according to this test, and their predictions for 2012 would be garbage. Parsimony still matters (there’s a sketch of this below the list).
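Here’s the polynomial point as a sketch. Everything is simulated (16 fake “elections” generated from a noisy linear trend), so the exact numbers will vary, but the pattern is the familiar one: the high-order fit wins easily on in-sample error while its extrapolation to the next point is usually nonsense, and the boring straight line stays sensible.

```python
# Sketch: with 16 points, a high-order polynomial can fit the past almost
# perfectly and still extrapolate garbage. All data here are simulated.
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1948, 2012, 4)                 # 16 fake "elections"
x = (years - 1978) / 20.0                        # rescaled predictor
y = 52 + 3 * x + rng.normal(0, 2, size=x.size)   # fake vote share: linear trend plus noise

for degree in (1, 12):
    coefs = np.polyfit(x, y, degree)
    in_sample_rmse = np.sqrt(np.mean((np.polyval(coefs, x) - y) ** 2))
    prediction_2012 = np.polyval(coefs, (2012 - 1978) / 20.0)
    print(f"degree {degree:2d}: in-sample RMSE = {in_sample_rmse:5.2f}, "
          f"'2012' prediction = {prediction_2012:10.1f}")

# Typical result: the degree-12 fit has far smaller in-sample error, but its
# "2012" prediction is usually nowhere near a plausible vote share, while the
# straight line's is.
```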

So we’re still talking about election forecasts

November 18, 2011

Nate Silver responded to the stupidest available critique of his election forecast, while brushing off the most detailed one. (If you think this makes him a “radical centrist” then I’ve got a Thomas Friedman swimsuit calendar to sell you.) Everybody reality-based agrees that in a Presidential election, the economy matters, the candidates matter, and all sorts of other stuff matters.

One thing that everyone should know, but not everyone does, is that the fit of a model to past data is usually much better than its fit to future data. People who don’t realise this tend to make their prediction intervals (not confidence intervals, please) too narrow. The pro solution is to split data into training and test sets, but this isn’t feasible for presidential elections. You still want to do some kind of out-of-sample validation. But all the ways of doing this are terribly flawed!
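To make the past-versus-future point concrete, here’s a sketch on simulated data (none of it is real election data, and the number of predictors is arbitrary): fit a small regression on 16 training points and the residual error flatters you relative to fresh data from the same process, which is why intervals calibrated to the in-sample fit come out too narrow.

```python
# Simulated illustration: a model fits the data used to estimate it better
# than it fits new data. All numbers are invented.
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, n_predictors = 16, 1000, 5

def simulate(n):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, n_predictors))])
    beta = np.array([50.0, 2.0, 0.0, 0.0, 0.0, 0.0])  # only one predictor matters
    y = X @ beta + rng.normal(0, 3, size=n)           # true noise sd = 3
    return X, y

X_train, y_train = simulate(n_train)
X_test, y_test = simulate(n_test)

beta_hat = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
rmse = lambda X, y: np.sqrt(np.mean((y - X @ beta_hat) ** 2))

print(f"training RMSE: {rmse(X_train, y_train):.2f}")  # optimistic
print(f"    test RMSE: {rmse(X_test, y_test):.2f}")    # closer to the true sd of 3
```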

Falling asleep, so to be continued.

Is there a point to election models?

November 16, 2011

(kind of in response to Sean Trende’s post)

Useful critiques of predictive election models:

  • It requires a leap of faith to believe that linear regression is anywhere close to right.
  • There are few data points. You can use what you know about Gubernatorial elections to inform Presidential election predictions, but the two levels don’t operate identically.
  • We rely on data that span decades, but politics has changed a lot in that span. Ask the ghost of George Wallace.
  • The model selection usually isn’t transparent. This applies especially when you get isolated variables: 2nd quarter growth but not 1st or 3rd.
  • Effect sizes are often hard to swallow. You really want to knock off 4.4 points if a party is seeking a third term?
  • Outliers can give wacky results. What if there’s 8 percent annualised growth in the relevant quarter? This ties into the linearity point.

Less useful critiques of predictive election models:

  • Lots of predictors are correlated. If you’re doing causal inference, this is a huge problem, but if you’re predicting it’s nearly immaterial (there’s a sketch of this at the end of the post).
  • There are lots of different models. It’s hard to choose which one is best, at least based on performance, but we don’t have to pick one, because we don’t have to believe any of them.
  • Timing of variable measurements. I know I spent the last post complaining about this, but it shouldn’t be a dealbreaker, as long as all measurements are available before the election. If you use old measurements, you’re only handicapping yourself.
  • “But you left out…” Other variables may matter a lot, but the model is under no obligation to condition on everything.

Basically, if you don’t want to interpret it causally, it’s a prediction, and if you’re going to predict, you should use polling. If you want to interpret it causally, you’re doing causal inference from a regression, with all the usual problems that implies.
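As promised above, here’s a sketch of the correlated-predictors point, on simulated data where the two predictors are nearly copies of each other: the individual coefficients jump around from one simulated dataset to the next, but the predictions barely move.

```python
# Simulated sketch: two strongly correlated predictors. Coefficient estimates
# are unstable across simulated datasets; predictions are not.
import numpy as np

rng = np.random.default_rng(2)

def one_run():
    n = 16
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(0, 0.1, size=n)          # x2 is almost a copy of x1
    y = 50 + x1 + x2 + rng.normal(0, 1, size=n)   # truth: the sum matters
    X = np.column_stack([np.ones(n), x1, x2])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    x_new = np.array([1.0, 1.0, 1.0])             # hypothetical new observation
    return beta[1], beta[2], float(x_new @ beta)

runs = np.array([one_run() for _ in range(200)])
print("sd of coefficient on x1:", runs[:, 0].std().round(2))  # big
print("sd of coefficient on x2:", runs[:, 1].std().round(2))  # big
print("sd of the prediction:   ", runs[:, 2].std().round(2))  # much smaller
```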

Type F error

November 14, 2011

Incorrectly assuming something is a false positive because you heard about it from Freakonomics.

Election predictions and lead time

November 12, 2011

In 2008, Nate Silver became Internet-famous for a model to predict that year’s presidential election. After Obama’s victory, the model was adjudged to have performed well, calling 49 out of 50 states correctly. The predictions that came under scrutiny were those made just before the election. Of course, anyone who put reasonable weight on the polling data as of November 3rd would have called at least 45 out of 50 states correctly, making Silver’s success impressive but hardly inexplicable. On the other hand, Silver was predicting an advantage for McCain as recently as three months before the election, and, again, anyone who looked at polls would have concluded that the race was at least close at that point.

The obvious point is that it’s much harder to make predictions with a lead time. It follows that if you’re making predictions, you should make the lead times of your input variables consistent and clear. Silver’s latest model has the clarity but not the consistency. It seems strange to me to model votes using election-year growth and approval ratings from the preceding year. As predictions, they’re too late: election-year GDP isn’t determined until after the election. In terms of understanding causation: even if you approve of regression for this purpose, wouldn’t you rather use the most relevant data, like election-day approval ratings? Of course Silver has Times Magazine editors who want copy now, but it’s resulted in copy with some incoherence.