The Problem of Measurement

After years of fighting, we're finally seeing some fledgling efforts to measure teacher performance using test scores and other metrics.  Predictably, this is causing some problems, as a New York Times story chronicles: bad data that needs to be fixed, and weak correlations between a teacher's performance and the performance of their class in a given year:

Yet a promising correlation for groups of teachers on the average may be of little help to the individual teacher, who faces, at least for the near future, a notable chance of being misjudged by the ranking system, particularly when it is based on only a few years of scores. One national study published in July by Mathematica Policy Research, conducted for the Department of Education, found that with one year of data, a teacher was likely to be misclassified 35 percent of the time. With three years of data, the error rate was 25 percent. With 10 years of data, the error rate dropped to 12 percent. The city has four years of data.

The most extensive independent study of New York's teacher rankings found similar variability. In math, about a quarter of the lowest-ranking teachers in 2007 ended up among the highest-ranking teachers in 2008. In English, most low performers in 2007 did not remain low performers the next year, said Sean P. Corcoran, the author of the study for the Annenberg Institute for School Reform, who is an assistant professor of educational economics at New York University.
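
To get a feel for why more years of data shrink the error rate, here is a toy simulation of my own.  It is not the method used by Mathematica or by Corcoran, and every number in it is an assumption: each hypothetical teacher has a fixed "true" effect drawn from a normal distribution, each year's score adds independent noise with a made-up standard deviation (noise_sd), and a teacher counts as misclassified if they truly belong in the bottom quarter but the averaged scores fail to put them there.

```python
import random


def misclassification_rate(years, noise_sd=2.0, n_teachers=2000, seed=0):
    """Share of truly bottom-quartile teachers that averaged scores fail to flag.

    All parameters are illustrative assumptions, not estimates from any study.
    """
    rng = random.Random(seed)
    # Hypothetical fixed "true" effect for each teacher.
    true_effect = [rng.gauss(0.0, 1.0) for _ in range(n_teachers)]
    # Observed score: true effect plus fresh noise each year, averaged over years.
    observed = [
        sum(t + rng.gauss(0.0, noise_sd) for _ in range(years)) / years
        for t in true_effect
    ]
    k = n_teachers // 4
    by_truth = sorted(range(n_teachers), key=lambda i: true_effect[i])
    by_score = sorted(range(n_teachers), key=lambda i: observed[i])
    true_bottom = set(by_truth[:k])
    flagged = set(by_score[:k])
    # Teachers who really are in the bottom quarter but were not flagged.
    return len(true_bottom - flagged) / k


if __name__ == "__main__":
    for years in (1, 3, 10):
        print(years, "year(s):", round(misclassification_rate(years), 3))
```

The exact rates depend entirely on the assumed noise level, so they should not be read against the study's 35, 25 and 12 percent figures.  The only point is the direction: averaging over more years cancels some of the noise, so fewer teachers land in the wrong bin.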

Michael O'Hare heaps scorn on the whole project:

The story, and the management it describes, is all about trying to get quality by incentives and personnel, in particular, sorting teachers into good, bad, and indifferent bins by observing student test score changes over a year. Presumably, those in the good bin get more pay, or something, and we fire the bad ones. The implicit model is that teaching is an irremediable trait among bad teachers, and purely a matter of incentives and motivation among all the others, and a positional arms race is just the ticket to make teachers really want to do a good job, right?

There's so much wrong with this it's hard to know where to start, even though test score increases is among the useful information that should be collected and analyzed, mostly to direct attention to teacher practice that seems to work and would reward analysis and discussion. But attributing outcomes to the teacher is nuts, as Deming demonstrated years ago. First, the instrument is very noisy; the article begins with some anecdotes of teachers omitted from the system, teachers scored in years when they didn't teach, and teachers given the wrong scores. Second, stuff happens; every teacher knows that a class takes on a personality early in the year on the basis of unobserved or random events, or just the luck of the draw in student selection, that seems to be refractory to what a teacher does. Some years have lots of snow days, some years there's a shooting in the school, and on and on. So the teacher effect is going to pick up correlations, spurious and real, with all the variables that are not observed.

Finally other stuff happens: Deming - brilliant, tough-minded, and humane - demonstrated that if you reward individual workers for performance, you are going to be rewarding random variation a lot of the time, with poisonous effects. Right away, when the top salesman among twenty gets a trip to Hawaii with his wife, the response of the other nineteen is not to emulate him (and how could they, if they don't see what he does, which is the case for teachers in spades), but to be pissed off and jealous, which is, like, really great for collaborative enterprise. Next year, regression toward the mean sets in and he is only number five, or ten, so he looks like a slacker, coasting on his laurels. Even his wife starts giving him the fisheye; don't be surprised if his lunch martini count starts to go up.

It is a universal, desperate, desire of lazy or badly trained managers to find a mechanistic device you can wind up like a clockwork, loose upon the organization, and go play golf. Like testing and firing to get people to do good work. Please, Lord, show me the way to manage without any actual heavy lifting!

In my experience, managers aren't quite the dull-witted slackards that O'Hare makes out.  They're quite well aware of the problems inherent in using performance metrics to judge people.  However, they're also quite well aware of the problems inherent in not using performance metrics to judge people.

As a manager, you basically have two options.  You can use subjective evaluations, which avoid the false positive issues raised in the New York Times article--but which then open you up to charges of favoritism and abuse.  These systems are heavily biased towards observed effort/effectiveness, rather than productivity, which makes them ripe for gaming by the office suck-up.  They are much more likely to be influenced by factors like race, religion, political beliefs, and the degree to which the boss sees a little bit of themselves in you--even if the boss is under the impression that they are trying to be scrupulously fair.  In a large organization with a lot of employees, this can cause a lot of headaches for higher-level managers who suddenly find themselves in the position of Appeals Court.

The alternative is to use a more objective standard, which runs into problems of its own; as Jim Manzi described to me a while back:

"When people's compensation is on the line," says Manzi, "they suddenly turn into Aristotle." Many a company has tried to design its sales-force commissions around arcane data analyses that try to control for complex factors, like whether one salesperson's territory has more customers than another's. Suddenly, they find the salespeople are all crack statisticians who can explain where the model has gone wrong in the case of their territory. Many of the companies that press forward find themselves, after six or eight months, with a sales force in full revolt. "I've seen it many times. Very few data-mining systems survive first contact with reality."

There's no Option C where you develop, through sheer managerial prowess, a performance evaluation system that avoids all the potential injustices and doesn't leave some portion of your workforce unhappy.  Any company with more than a few employees is going to let some of them fall through the cracks.

The answer to the inherent difficulty of judging performance is not to simply throw up one's hands and pay anyone who manages to show up for work more often than not.  Yet this is what the teachers' unions seem to want.  The unions, who seem to be the primary source for the New York Times piece, are (like many employees) very adept at picking out the problems in an evaluation system.  Unfortunately, their solution is what we might call Option D:  make no attempt to evaluate performance at all.

If you propose a subjective system of evaluation by principals and peer evaluators, the spokesmen for the teachers' union complain that subjective evaluations are open to abuse of power.  If you propose a more objective set of statistical metrics, they note the measurement problems.  Until we can develop that never-never system which unerringly identifies teacher performance, they want us to stick with the current system of rewarding seniority and advanced degrees in education, even though this is the one thing we know for sure doesn't work.  It's now fairly clear that after 4-5 years, additional experience does not improve teacher performance, and neither do the useless education credentials that most teachers now get in order to bump their pay.

There is a lot to criticize about these beginning efforts to evaluate teacher performance, but at least they aren't rewarding things we know have nothing to do with educational quality.  I'm not sure why O'Hare finds it so ludicrous to think of sacking the worst performers, the ones who clearly aren't cut out for teaching, and offering bonus pay to the top-performing teachers who currently leave needy school districts for the suburbs, or exit teaching altogether.  Can it really be stupider than sacking only the teachers who actually commit felonies in the classroom, while rewarding people for doing things that don't make them better teachers?

Myself, I'd be happy with a more subjective, manager-intensive effort to evaluate teacher performance, weeding out the very bad teachers, and using pay to lure the very best ones into the places where they are most needed: schools that teach disadvantaged kids.  But it's worth noting that such a system is pretty much totally incompatible with a unionized workplace--certainly one with a union as adversarial as the teachers' union is in many cities.  The whole union system is set up to deal with standardized processes.  The sort of hands-on, intensive managerial monitoring and coaching that O'Hare imagines is almost impossible to picture in the context of a collective bargaining process where rewards and punishments--and the conditions for each--must be clearly spelled out in advance.

Given that our urban education workforce is unionized, and that a unionized workplace needs a set of standard metrics for any evaluation to be effective, the only recourse education reformers have is something like the statistical analytics that New York is trying to implement.  The system is far from perfect, and it will certainly need a lot of tweaking to make it work.  But at least there's some chance that it will be better than a system where the rewards are totally uncorrelated with teacher performance, which is what we now have.