When Correlation Is Not Causation, But Something Much More Screwy
Sampling error? Omitted variable bias? Bah, that's for first-year grad students.
Guest post by Gabriel Rossman -- Sociologist at UCLA. His work applies economic sociology to media industries. He blogs at Code and Culture and is the author of Climbing the Charts.
[This post incorporates parts of posts from my own blog and lecture notes I circulate to my graduate students. I figured it was worth revising and posting here as a) basically none of you are my grad students or read my blog and b) I want to get Jim Manzi's opinion on it as long as I have him as a co-blogger.]
Sampling error? Omitted variable bias? Bah, that's for first-year grad students. What I find really interesting is that there are some fairly basic principles for how analysis can get really screwy but which can't be fixed by adding more control variables, increasing your sample size, or fiddling with assumptions about the distribution of the dependent variable. I'm thinking about really scary sources of model specification problems. Or actually, not model specification in and of itself, but data collection. Your typical social science graduate curriculum talks a lot about getting standard errors right, but on a day to day basis most of our work goes into getting the data into the proper form, and this is also where most problems come from.
But before talking math, let's contemplate a recent overheard confession that, "Turns out those funny looking toe shoes are pretty comfortable." As someone who feels naked without footwear that involves both socks and laces, I had never given much thought to this, and to the extent that I had, I assumed wearing these things was a costly signal of geekiness. But on reflection it makes perfect sense. After all, if something as ridiculous looking as toe shoes were not comfortable then nobody would wear them. Conversely, four inch heels are very uncomfortable (or so I am given to understand) but many women wear them because they're attractive. So we can imagine a negative association between how attractive shoes are and how good they feel. Indeed, this describes my own collection of incredibly comfortable but informal Chucks, fairly comfortable and decent-looking dress shoes, and a second pair of dress shoes that are uncomfortable but fancy.
One interpretation of this (and bear with me as I briefly sound like a critical studies type person) would be something along the lines of a sadistic gaze wherein the perceived attractiveness of a shoe is directly derived from the discomfort we imagine it imposing on its wearer. I don't doubt that people have made this argument, but I don't buy it as a general argument because I can imagine shoes that are both hideous and uncomfortable --- say Crocs made of gravel and epoxy. There is no ontological reason why we can't have shoes that are both hideous and uncomfortable; rather, there is a practical reason: nobody wears shoes that are terrible in every way, and so such shoes don't make it onto the market. That is, there is a big difference between the covariance of traits for all conceivable shoes versus the covariance of traits among those shoes that actually get bought and worn.
Now here's where we get to the math. The logician, computer scientist, and fellow UCLA faculty member Judea Pearl uses a graph theoretic approach to logic that emphasizes using counter-factual understandings to get at the underlying structure of causation. (His magnum opus is Causality. For an introduction relevant to the social sciences see Morgan and Winship.) One of Pearl's most interesting deductions is the idea of conditioning on a collider. If whether a case gets observed at all is a function of two variables, each of which makes observation more likely, then looking only at the observed cases will induce an artifactual negative correlation between the variables. This is true even if in the broader population there is no correlation (or even a mild positive correlation) between the variables.
For instance, suppose that in a population of aspiring Hollywood actors there is no correlation between acting ability and physical attractiveness. However, assume that we generally pay a lot more attention to celebrities than to some kid who is waiting tables while going on auditions. That is, we cannot readily observe people who aspire to be actors, but only those who actually are actors. This implies that we need to understand the selection process by which people get cast into films. In the computer simulation displayed below I generated a population of aspiring actors characterized by "body" and "mind," each of which follows a normal distribution and with these two traits being completely orthogonal to one another. Then imagine that casting directors jointly maximize talent and looks so only the aspiring actors with the highest sum for these two traits actually get work in Hollywood. I have drawn the working actors as triangles and the failed aspirants as hollow circles. Among the actors we can readily observe, there will then be a negative correlation between looks and talent, even though there is no such correlation in the grand population. If we see only the working actors without understanding the censorship process, we might think that there is some stupefaction of being ridiculously good-looking.
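The original simulation code and figure are not reproduced here. For readers who want to try the idea, the following is a minimal sketch of that kind of simulation, not the code behind the post. The population size, the 10% casting rate, the random seed, and the names `simulate_actors` and `pearson` are all assumptions made for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_actors(n=10_000, share_cast=0.10, seed=42):
    """Return (correlation among all aspirants, correlation among the cast)."""
    rng = random.Random(seed)
    # Assumption: "body" and "mind" are independent standard normal draws.
    aspirants = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    # Assumption: casting directors hire the top share ranked by body + mind.
    ranked = sorted(aspirants, key=lambda a: a[0] + a[1], reverse=True)
    working = ranked[: int(n * share_cast)]
    r_all = pearson([a[0] for a in aspirants], [a[1] for a in aspirants])
    r_working = pearson([a[0] for a in working], [a[1] for a in working])
    return r_all, r_working


r_all, r_working = simulate_actors()
print(f"all aspirants:  r = {r_all:+.2f}")
print(f"working actors: r = {r_working:+.2f}")
```

Under these assumptions the correlation among all aspirants should sit near zero while the correlation among the working actors comes out clearly negative. A stricter casting cutoff (a smaller `share_cast`) should make the artifact stronger.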
This also applies when one or both of the variables is categorical. Many prestigious colleges have policies of preferring legacy applicants. Since a legacy can be admitted with a lower score than a non-legacy would need, this implies that the SAT scores of legacies are lower in the freshman class even though they are higher in the applicant pool.
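The legacy example can be sketched the same way. This is an illustration under made-up parameters, not admissions data: scores are in standardized units rather than SAT points, and the legacy share, the small legacy edge in the applicant pool, the size of the admissions bonus, and the admit rate are all assumptions.

```python
import random
from statistics import mean


def legacy_gap(group):
    """Mean score of legacies minus mean score of non-legacies."""
    legacies = mean(score for is_legacy, score in group if is_legacy)
    others = mean(score for is_legacy, score in group if not is_legacy)
    return legacies - others


def simulate_admissions(n=20_000, legacy_share=0.10, legacy_edge=0.2,
                        legacy_bonus=1.0, admit_share=0.10, seed=7):
    """Return the legacy score gap in (applicant pool, admitted class)."""
    rng = random.Random(seed)
    applicants = []
    for _ in range(n):
        is_legacy = rng.random() < legacy_share
        # Assumption: legacies score slightly higher in the applicant pool.
        score = rng.gauss(legacy_edge if is_legacy else 0.0, 1.0)
        applicants.append((is_legacy, score))

    # Assumption: admission ranks on score plus a fixed bonus for legacies.
    def admissions_index(applicant):
        is_legacy, score = applicant
        return score + (legacy_bonus if is_legacy else 0.0)

    ranked = sorted(applicants, key=admissions_index, reverse=True)
    freshmen = ranked[: int(n * admit_share)]
    return legacy_gap(applicants), legacy_gap(freshmen)


pool_gap, class_gap = simulate_admissions()
print(f"legacy minus non-legacy, applicant pool: {pool_gap:+.2f}")
print(f"legacy minus non-legacy, freshman class: {class_gap:+.2f}")
```

With these settings the gap should be positive among applicants and negative among the admitted, which is the sign flip described above. Setting `legacy_bonus=0.0` removes the preference, and the gap in the class should then no longer flip.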
In these examples the censorship bias implied by conditioning on a collider is fairly easy to see because we have started from the latent population (aspiring actors, college applicants) and worked our way to the observed population (working actors, college freshmen). However the insidious thing about conditioning on a collider is that we almost always only see the observed population. This makes it easy to confuse what is actually a causal process of truncation with a more direct structure of causation, such as an idea that being attractive or a legacy somehow causes someone to be untalented or unintelligent.
Conditioning on a collider can occur any time that there is an underlying selection regime that involves either variables in the dataset or correlates of variables in the dataset. This is almost inevitable if you have built a composite dataset out of multiple constituent datasets. That is, a case appears in the sample if it meets one or more sampling criteria. This is actually a fairly common sample design, usually premised on the idea of not wanting to "miss anything" and/or wanting to increase the sample size.
Once you start looking for it you see it in a lot of studies. For instance, suppose a researcher were interested in which firms had donated to a particular PAC. The researcher might start with a basic sample like the Fortune 500 but then notice only 5 firms had donated to the PAC. Because statistical power in analysis of a binary variable is a function of both the number of cases (higher is better) and the proportion (close to .5 is better), the analysis would have minimal statistical power. The researcher might then add to the data all firms that donated to the PAC, regardless of whether or not they were in the 500. If the researcher were then to do a logistic regression of donating to the PAC as a function of annual revenues, the result would almost inevitably be a strong negative effect. The reason is that inclusion in the sample is defined by high revenues (which is the inclusion criterion for the Fortune 500) OR donating to the PAC. There are firms with low revenues that didn't donate to the PAC, lots of them in fact, but they don't appear in the dataset.
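A rough sketch of the composite-sample problem follows. The firm universe is hypothetical: 20,000 firms, a 1% donation rate that is independent of revenue by construction, and a "top 500 by revenue" list standing in for the Fortune 500. Instead of a logistic regression it simply compares donation rates inside and outside the top group, which is enough to show where a negative revenue effect would come from.

```python
import random


def donation_rates(firms, top_ids):
    """Donation rate among top-revenue firms and among all other firms."""
    top = [donated for firm_id, _, donated in firms if firm_id in top_ids]
    rest = [donated for firm_id, _, donated in firms if firm_id not in top_ids]
    return sum(top) / len(top), sum(rest) / len(rest)


def simulate_pac(n_firms=20_000, top_n=500, p_donate=0.01, seed=1):
    rng = random.Random(seed)
    # Hypothetical universe of (id, log revenue, donated) records.
    # Assumption: donating is drawn independently of revenue.
    firms = [(i, rng.gauss(0, 1), int(rng.random() < p_donate))
             for i in range(n_firms)]
    ranked = sorted(firms, key=lambda f: f[1], reverse=True)
    top_ids = {firm_id for firm_id, _, _ in ranked[:top_n]}
    # Composite sample: a firm is included if it is in the top group OR donated.
    sample = [f for f in firms if f[0] in top_ids or f[2] == 1]
    return (donation_rates(firms, top_ids),
            donation_rates(sample, top_ids),
            len(sample))


universe_rates, sample_rates, sample_size = simulate_pac()
print("universe (top, rest):", universe_rates)
print("sample   (top, rest):", sample_rates)
print("sample size:", sample_size)
```

In the full universe the two rates should be about equal. In the composite sample every firm outside the top group is a donor by construction, so low revenue looks like a near-perfect predictor of donating.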
We can see this at work in survey data. I took the 2010 wave of the General Social Survey and pulled all 395 Republicans and GOP-leaning independents (PARTYID==4/6). For these people I compared their attitudes on marijuana (GRASS) and government redistribution of wealth (EQWLTH, which I cut to a binary with responses 1/4). Among Republicans who oppose wealth redistribution, 37% favor legalizing marijuana, as opposed to 38% among those who favor wealth redistribution. This difference of one percentage point is not even remotely statistically significant (chi2 0.08, 1 df).
"OK, now wait a minute," you may be saying, "he promised us negative relationships but this is no trend at all." True, but let's contrast it with the same analysis for the whole sample, regardless of party. In general, 42% of those who oppose redistribution favor legalized marijuana against 53% of those who favor redistribution. This relationship is strongly statistically significant (chi2 14.50, 1 df). So among the general population there is a positive association between marijuana legalization and wealth redistribution. Among Republicans this effect is perfectly counterbalanced by conditioning on a collider. People presumably join the GOP because they agree with it on at least some issues. Republicans who oppose both weed and redistribution we can call movement conservatives, those who oppose weed but favor redistribution we can call social conservative populists, those who favor weed but oppose redistribution we can call libertarians, and those who favor both we can call people who should probably change their party registration. This case illustrates how conditioning on a collider doesn't necessarily result in a net negative relationship but rather can partially or completely suppress an underlying general trend.
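The arithmetic of that cancellation can be sketched with a 2x2 table. The numbers below are invented for illustration and are not the GSS figures. They assume a positive association between the two attitudes in the population, and a party that attracts people who agree with it on at least one of the two issues at a higher rate than people who agree with it on neither. The join probabilities were picked so that the cancellation is exact.

```python
def odds_ratio(cells):
    """Odds ratio of a 2x2 table keyed by (favors_weed, favors_redistribution)."""
    return (cells[(1, 1)] * cells[(0, 0)]) / (cells[(1, 0)] * cells[(0, 1)])


# Hypothetical population shares (NOT the GSS figures): the two attitudes
# are positively associated, so the concordant cells are larger.
population = {(1, 1): 0.30, (1, 0): 0.20, (0, 1): 0.20, (0, 0): 0.30}

# Hypothetical probability of identifying with the party. Anyone who opposes
# at least one of the two (agrees with the party on something) joins at the
# same rate; those who favor both join at a lower rate.
p_join = {(1, 1): 0.20, (1, 0): 0.45, (0, 1): 0.45, (0, 0): 0.45}

# Share of the whole population that is in the party AND in each cell.
party = {cell: population[cell] * p_join[cell] for cell in population}

or_population = odds_ratio(population)
or_party = odds_ratio(party)
print(f"odds ratio, whole population: {or_population:.2f}")
print(f"odds ratio, party members:    {or_party:.2f}")
```

Selection multiplies the population odds ratio by (0.20 x 0.45) / (0.45 x 0.45) = 0.20 / 0.45, which here exactly offsets the population value of 2.25 and leaves an odds ratio of 1 inside the party. A join rate for the "favor both" group a bit above 0.20 would only partially suppress the association, and one below it would flip the sign.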
Conversely, if you understand how this process works you can exploit it both analytically and practically. Although he doesn't express it in the language of counterfactual causality using directed acyclic graphs (and I'm not really sure why not), several of Tyler Cowen's "Six Rules for Dining Out" in this magazine (and the related book) follow this logic. Start from the assumption that many restaurants go out of business, meaning that failed ones are censored from the remaining pool of available restaurants. Now assume that the two main things that let restaurants succeed are food quality and various other things that we can collectively call atmosphere. The logic of conditioning on a collider implies that among surviving restaurants there should be a negative correlation between atmosphere and food. This implies that if you are monomaniacally focused on good food you should follow the heuristic of avoiding fashionistas and seeking out unpopular ethnic groups, as the only way such places could possibly stay in business is if they offer good food. Conversely, if you don't have an especially refined palate and really like to be around pretty girls, you should probably follow the heuristic of "if you're going to dinner with Tyler Cowen don't let him choose the restaurant."