What Is Causality?

Guest post by Jim Manzi, founder and Chairman of Applied Predictive Technologies, and the author of Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics and Society.

Gabriel, your very deep post that, in passing, requested my comment was fascinating.  My family thanks you for the weekend I just spent staring off into space.

You open with this:

Sampling error? Omitted variable bias? Bah, that's for first-year grad students. What I find really interesting is there are some fairly basic principles for how analysis can get really screwy but which can't be fixed by adding more control variables, increasing your sample size, or fiddling with assumptions about the distribution of the dependent variable.

I spend an enormous amount of time in my book arguing that this problem is pervasive and significant, and that exactly this triptych of remedies will fail to enable us to build models that make useful, reliable and non-obvious predictions for the effects of our interventions in human social systems.  In it, I take apart some celebrated social science models for failing in this respect.  But in the spirit of what's sauce for the goose is sauce for the gander, I then take apart a model that I built to estimate the effect of changing the name of a convenience store, to show how all three together can't put Humpty Dumpty back together again.

Start at the most foundational level: What is causality?  I have an engineer's perspective on this.  What I care about is my ability to predict the effect of my interventions better than I can without the model.

Consider two questions:

1. Does A cause B?

2. If I take action A, will it cause outcome B?

I don't care about the first, or more precisely, I might care about it, but only as scaffolding that might ultimately help me to answer the second.

For example, in your shoes story, I don't care whether the characteristic of discomfort causes shoes to be considered attractive.  I care about whether, for example, narrowing the toes of an existing type of shoe will cause it to get more coverage in fashion magazines, sell more units or whatever.

In general, the best way to determine this is to take some comfortable shoes, narrow the toes, and then see what happens to sales.  That is, to run an experiment.

There are big problems with this approach.  One obvious one is that it is often impossible or impractical to run the experiment.  But even if we assume that I have done exactly this experiment, I still have the problem of measuring the causal effect of the intervention.  In a complicated system, like shoe stores, I have to answer the question of how many pairs I would have sold in the, say, three months after changing my design to narrow toes - I can't just assume that I would have sold the same number of wide-toed shoes that I did in the prior three months.  For reasons well-known to you, and that I go through at length in the book, the best way to measure this in a complicated system is a randomized field trial (RFT) in which I randomly assign some stores to get the new shoes and others to keep selling the old shoes.  In essence, random assignment allows me to roughly hold constant all of the "screwy" effects that you reference between the test and control group.
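The logic of that last step can be sketched in a few lines of code. This is a hypothetical illustration with invented numbers (store counts, baseline sales, the size of the effect and of a market-wide shock are all made up): the naive before/after comparison is confounded by anything that hits every store at once, while the randomized test-versus-control comparison nets it out.

```python
import random

random.seed(0)

# Hypothetical sketch (all numbers invented): each store has its own
# baseline, and a market-wide shock hits everyone in the post period.
TRUE_EFFECT = 50     # extra pairs per store from the narrow-toed design
MARKET_SHOCK = -120  # e.g. a recession; hits all stores, treated or not

n = 1000
baselines = [random.gauss(1000, 200) for _ in range(n)]
treated = [random.random() < 0.5 for _ in range(n)]  # coin-flip per store

pre = [b + random.gauss(0, 40) for b in baselines]
post = [b + MARKET_SHOCK + (TRUE_EFFECT if t else 0) + random.gauss(0, 40)
        for b, t in zip(baselines, treated)]

# Naive before/after on the stores that switched: confounded by the shock.
t_pre = [x for x, t in zip(pre, treated) if t]
t_post = [x for x, t in zip(post, treated) if t]
naive = sum(t_post) / len(t_post) - sum(t_pre) / len(t_pre)

# Randomized comparison: test vs. control stores in the same period.
c_post = [x for x, t in zip(post, treated) if not t]
rft = sum(t_post) / len(t_post) - sum(c_post) / len(c_post)

print(f"naive before/after estimate: {naive:.0f}")  # near -70: shock swamps effect
print(f"randomized test-vs-control: {rft:.0f}")     # near +50: shock cancels out
```

The before/after comparison reports a large *negative* effect of a change that in fact helps, because the recession lands on both periods' difference; random assignment spreads the shock (and every other unmeasured "screwy" effect) evenly across test and control, so it drops out.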

But what many cheerleaders for randomized experiments gloss over is that even if I have executed a competent experiment, it is not obvious how I turn this result into a prediction rule for the future (the problem of generalization or external validity).  Here's how I put this in an article a couple of years ago:

In medicine, for example, what we really know from a given clinical trial is that this particular list of patients who received this exact treatment delivered in these specific clinics on these dates by these doctors had these outcomes, as compared with a specific control group. But when we want to use the trial's results to guide future action, we must generalize them into a reliable predictive rule for as-yet-unseen situations. Even if the experiment was correctly executed, how do we know that our generalization is correct?
A physicist generally answers that question by assuming that predictive rules like the law of gravity apply everywhere, even in regions of the universe that have not been subject to experiments, and that gravity will not suddenly stop operating one second from now. No matter how many experiments we run, we can never escape the need for such assumptions. Even in classical therapeutic experiments, the assumption of uniform biological response is often a tolerable approximation that permits researchers to assert, say, that the polio vaccine that worked for a test population will also work for human beings beyond the test population.

But as we climb a ladder of phenomenological complexity from physics to biology to sociology, this problem of generalization becomes more severe.  As I put it in Uncontrolled:

We can run a clinical trial in Norfolk, Virginia, and conclude with tolerable reliability that "Vaccine X prevents disease Y." We can't conclude that if literacy program X works in Norfolk, then it will work everywhere. The real predictive rule is usually closer to something like "Literacy program X is effective for children in urban areas, and who have the following range of incomes and prior test scores, when the following alternatives are not available in the school district, and the teachers have the following qualifications, and overall economic conditions in the district are within the following range." And by the way, even this predictive rule stops working ten years from now, when different background conditions obtain in the society.

We must have some model that generalizes.  What we really need to do is to build a distribution of results of "experiments + model" in predicting the results of future experiments.  An example of what I mean applied to criminology is the following from the article I referenced above:

One of the most widely publicized of these [criminology RFTs] tried to determine the best way for police officers to handle domestic violence. In 1981 and 1982, Lawrence Sherman, a respected criminology professor at the University of Cambridge, randomly assigned one of three responses to Minneapolis cops responding to misdemeanor domestic-violence incidents: they were required to arrest the assailant, to provide advice to both parties, or to send the assailant away for eight hours. The experiment showed a statistically significant lower rate of repeat calls for domestic violence for the mandatory-arrest group. The media and many politicians seized upon what seemed like a triumph for scientific knowledge, and mandatory arrest for domestic violence rapidly became a widespread practice in many large jurisdictions in the United States.
But sophisticated experimentalists understood that because of the issue's high causal density, there would be hidden conditionals to the simple rule that "mandatory-arrest policies will reduce domestic violence." The only way to unearth these conditionals was to conduct replications of the original experiment under a variety of conditions. Indeed, Sherman's own analysis of the Minneapolis study called for such replications. So researchers replicated the RFT six times in cities across the country. In three of those studies, the test groups exposed to the mandatory-arrest policy again experienced a lower rate of rearrest than the control groups did. But in the other three, the test groups had a higher rearrest rate.
Why? In 1992, Sherman surveyed the replications and concluded that in stable communities with high rates of employment, arrest shamed the perpetrators, who then became less likely to reoffend; in less stable communities with low rates of employment, arrest tended to anger the perpetrators, who would therefore be likely to become more violent. The problem with this kind of conclusion, though, is that because it is not itself the outcome of an experiment, it is subject to the same uncertainty that Aristotle's observations were. How do we know if it is right? By running an experiment to test it--that is, by conducting still more RFTs in both kinds of communities and seeing if they bear it out. Only if they do can we stop this seemingly endless cycle of tests begetting more tests. Even then, the very high causal densities that characterize human society guarantee that no matter how refined our predictive rules become, there will always be conditionals lurking undiscovered. The relevant questions then become whether the rules as they now exist can improve practices and whether further refinements can be achieved at a cost less than the benefits that they would create.

We can then compare the accuracy of such a theory to analogous distributions of predictions made by non-experimental methods (which can vary from sophisticated regression models to newer machine learning techniques to prediction markets to the judgments of experts, and so on) for predicting the results of future experiments.  As I put this in the book:

The job of experimentation in business is to put rounds on target. Abstract discussion of causality is a means to the end of using prior experimental results to more accurately predict the shareholder value impacts of various alternative potential courses of action.
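Mechanically, "build a distribution of results" amounts to a simple scoring harness. The sketch below is hypothetical throughout (the lift distribution, the two stand-in methods, and their error sizes are all invented): each candidate method issues a prediction for what a future experiment will measure, and we accumulate its absolute errors over many such experiments, then compare the distributions.

```python
import random
import statistics

random.seed(2)

# Hypothetical scoring harness (all numbers invented): each "method" predicts
# the lift a future experiment will measure; we accumulate its error
# distribution across many such experiments.
true_lifts = [random.gauss(5.0, 2.0) for _ in range(200)]  # what the RFTs reveal

def expert_guess(_lift):
    return 5.0                          # e.g. a flat rule of thumb

def model_prediction(lift):
    return lift + random.gauss(0, 1.0)  # stand-in for a model that tracks
                                        # the signal, imperfectly

expert_errs = [abs(expert_guess(x) - x) for x in true_lifts]
model_errs = [abs(model_prediction(x) - x) for x in true_lifts]

print("expert mean abs error:", round(statistics.mean(expert_errs), 2))
print("model mean abs error:", round(statistics.mean(model_errs), 2))
```

The point is not the particular numbers but the discipline: a method's claim to have "found" a causal relationship is cashed out entirely in the shape of its error distribution against future experiments, which is exactly the operational test proposed at the end of this post.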

As I go into in the book, there is no absolutely secure philosophical resting place.  That is, even if I have such a distribution of results for the predictions made by various methods, I can't ever be absolutely certain that this distribution won't suddenly change.  (I expend a lot of effort trying to unify the problem of induction and the reference class problem to show that this is always a risk, no matter what.)  But I think this is as close as you can get.

What this demands, of course, is a lot of experiments.  This is why lowering the cost per test is so critical.  Not just as an efficiency measure, but because in practice it enables me to get much more reliable predictions of the effects of my proposed interventions.

To come back to where we started, I think this is the way to evaluate whether some model, tool, guru or whatever has "really" discovered a causal relationship.  A statement about causality only has operational meaning as a predictor of future results of rigorous tests of the causal theory for the outcome of an intervention.