## Causal inference is hard: Baseball, salary, and marriage

(Meta-note: Now that classes are over for the semester I’ll try to post something every weekend. Annoy me if I don’t.)

Here’s a study by Cornaglia and Feldman of the relationship between baseball salaries and marital status. They overinterpret their results, but I’m not worried about that aspect here. What I’ll focus on is their regression to find the “direct effect of marriage on earnings by initial ability”. It’s a regression for causal inference, so you know I’ll find something to object to. What might be more constructive is a proposal for how I would study the issue.

This is what I’m thinking:

1. Is there a difference between the distribution of salaries of single players and the distribution of salaries of married players? What is the difference? Is it just a shift on a log scale, or does the distribution change shape? (Histograms would be informative.) Is the difference significant for some reasonable definition of significant?
2. By far the most obvious common cause is age: clearly this affects both marital status and salary, whereas experience, for example, will be much less directly related to marital status. Can any differences between single and married players be explained by differences in age? To answer this, compare the salaries for 23-year-old single players to 23-year-old married players. Then compare 24-year-old single players to 24-year-old married players, and so on. Comparisons of centres are important, but we also care about comparisons between each pair of distributions: if there are differences, are they for all players, or only for parts of the salary distribution?
3. If part 2 suggests there really are differences, let’s try to quantify them. Build a model for “deserved” salary based on age, experience, and output. The output part would be hard to build from scratch; fortunately, the good people at FanGraphs have done the work for us. Pair single and married players who are very close in predicted salary. Are there systematic differences in actual salary within the pairs? Do they hold throughout the distribution of predicted salary, or only for one end or the other?
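
To make steps 2 and 3 a bit more concrete, here is a rough Python sketch of what the age-by-age comparison and the pairing might look like. Everything in it is an assumption for illustration: the records are made-up numbers rather than real player data, the names `log_salary_gap_by_age` and `match_pairs` are hypothetical, and the greedy matching with a caliper is just one simple way to pair players. The predicted salaries are taken as given; the model that produces them is not shown.

```python
import math
from collections import defaultdict
from statistics import median

# Hypothetical records: (age, married, salary in dollars).
# These are invented numbers for illustration, not real player data.
players = [
    (23, False, 400_000), (23, False, 450_000),
    (23, True, 500_000), (23, True, 420_000),
    (24, False, 600_000),
    (24, True, 900_000), (24, True, 650_000),
    (25, False, 1_200_000),
]


def log_salary_gap_by_age(records):
    """Step 2: for each age with both groups present, return
    median log salary (married) minus median log salary (single)."""
    groups = defaultdict(lambda: {True: [], False: []})
    for age, married, salary in records:
        groups[age][married].append(math.log(salary))
    gaps = {}
    for age in sorted(groups):
        single, married = groups[age][False], groups[age][True]
        if single and married:  # skip ages with no comparison group
            gaps[age] = median(married) - median(single)
    return gaps


def match_pairs(single, married, caliper):
    """Step 3: greedy one-to-one matching on predicted log salary.

    Each input row is (player_id, predicted_log_salary, actual_log_salary).
    A single player is paired with the closest unused married player,
    provided their predictions differ by at most `caliper`.
    Returns (single_id, married_id, married_actual - single_actual).
    """
    unused = list(married)
    pairs = []
    for s_id, s_pred, s_actual in sorted(single, key=lambda r: r[1]):
        if not unused:
            break
        best = min(unused, key=lambda r: abs(r[1] - s_pred))
        if abs(best[1] - s_pred) <= caliper:
            unused.remove(best)
            pairs.append((s_id, best[0], best[2] - s_actual))
    return pairs


gaps = log_salary_gap_by_age(players)

# Hypothetical predicted and actual log salaries for the matching step.
single_rows = [("s1", 13.0, 13.1), ("s2", 14.0, 13.9)]
married_rows = [("m1", 13.05, 13.3), ("m2", 15.0, 15.2)]
pairs = match_pairs(single_rows, married_rows, caliper=0.1)
```

The medians only cover the "comparison of centres" part of step 2; looking at the whole distribution at each age would still need histograms or quantile plots. Likewise, the within-pair differences from `match_pairs` are only as trustworthy as the predicted-salary model behind them, and players without a close match are simply dropped.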

What would you do differently?

(h/t: The Book Blog)