The Problem of Measurement
After years of fighting, we're finally seeing some fledgling efforts to measure teacher performance using test scores and other metrics. Predictably, this is causing problems, as a New York Times story chronicles: bad data that needs to be fixed, and weak correlations between a teacher's performance and the performance of their class in a given year:
Yet a promising correlation for groups of teachers on the average may be of little help to the individual teacher, who faces, at least for the near future, a notable chance of being misjudged by the ranking system, particularly when it is based on only a few years of scores. One national study published in July by Mathematica Policy Research, conducted for the Department of Education, found that with one year of data, a teacher was likely to be misclassified 35 percent of the time. With three years of data, the error rate was 25 percent. With 10 years of data, the error rate dropped to 12 percent. The city has four years of data.
The most extensive independent study of New York's teacher rankings found similar variability. In math, about a quarter of the lowest-ranking teachers in 2007 ended up among the highest-ranking teachers in 2008. In English, most low performers in 2007 did not remain low performers the next year, said Sean P. Corcoran, the author of the study for the Annenberg Institute for School Reform, who is an assistant professor of educational economics at New York University.
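To get a feel for why a few years of data misclassify so many teachers, here's a back-of-the-envelope simulation. It is not the Mathematica study's model, and the noise level is made up purely for illustration; it just shows how a yearly measure that mixes a stable teacher effect with a lot of classroom-level noise sorts people onto the wrong side of the median, and how slowly that error shrinks as you average more years.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 100_000

# Hypothetical setup: each teacher has a fixed "true" effect, but a single
# year's value-added estimate adds classroom-level noise twice as large.
true_effect = rng.normal(0.0, 1.0, n_teachers)
noise_sd = 2.0  # assumed signal-to-noise ratio, not estimated from any real data

def misclassification_rate(years: int) -> float:
    """Share of teachers whose multi-year average lands on the wrong side
    of the median relative to their true effect."""
    yearly = true_effect[:, None] + rng.normal(0.0, noise_sd, (n_teachers, years))
    estimate = yearly.mean(axis=1)
    truly_above = true_effect > np.median(true_effect)
    rated_above = estimate > np.median(estimate)
    return float((truly_above != rated_above).mean())

for years in (1, 3, 10):
    print(f"{years:>2} year(s) of data: ~{misclassification_rate(years):.0%} misclassified")
```

With these invented parameters the error rate falls as you add years of data, roughly the pattern the study found, but it never gets anywhere near zero.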
Michael O'Hare heaps scorn on the whole project:
The story, and the management it describes, is all about trying to get quality by incentives and personnel, in particular, sorting teachers into good, bad, and indifferent bins by observing student test score changes over a year. Presumably, those in the good bin get more pay, or something, and we fire the bad ones. The implicit model is that teaching is an irremediable trait among bad teachers, and purely a matter of incentives and motivation among all the others, and a positional arms race is just the ticket to make teachers really want to do a good job, right?

There's so much wrong with this it's hard to know where to start, even though test score increases is among the useful information that should be collected and analyzed, mostly to direct attention to teacher practice that seems to work and would reward analysis and discussion. But attributing outcomes to the teacher is nuts, as Deming demonstrated years ago. First, the instrument is very noisy; the article begins with some anecdotes of teachers omitted from the system, teachers scored in years when they didn't teach, and teachers given the wrong scores. Second, stuff happens; every teacher knows that a class takes on a personality early in the year on the basis of unobserved or random events, or just the luck of the draw in student selection, that seems to be refractory to what a teacher does. Some years have lots of snow days, some years there's a shooting in the school, and on and on. So the teacher effect is going to pick up correlations, spurious and real, with all the variables that are not observed.
Finally other stuff happens: Deming - brilliant, tough-minded, and humane - demonstrated that if you reward individual workers for performance, you are going to be rewarding random variation a lot of the time, with poisonous effects. Right away, when the top salesman among twenty gets a trip to Hawaii with his wife, the response of the other nineteen is not to emulate him (and how could they, if they don't see what he does, which is the case for teachers in spades), but to be pissed off and jealous, which is, like, really great for collaborative enterprise. Next year, regression toward the mean sets in and he is only number five, or ten, so he looks like a slacker, coasting on his laurels. Even his wife starts giving him the fisheye; don't be surprised if his lunch martini count starts to go up.
It is a universal, desperate, desire of lazy or badly trained managers to find a mechanistic device you can wind up like a clockwork, loose upon the organization, and go play golf. Like testing and firing to get people to do good work. Please, Lord, show me the way to manage without any actual heavy lifting!
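O'Hare's Deming point is, at bottom, a statistical one, and it's easy to see in a few lines of simulation. This is an illustration, not Deming's own example: assume a sales force of twenty with identical skill, so the yearly ranking is pure luck, and watch what happens to last year's star.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sales, n_trials = 20, 10_000

# Hypothetical sales force: everyone has identical skill, so yearly results
# are pure noise. Whoever tops the chart in year one is simply lucky.
year1 = rng.normal(0, 1, (n_trials, n_sales))
year2 = rng.normal(0, 1, (n_trials, n_sales))

winner = year1.argmax(axis=1)  # last year's "top salesman" in each trial
winner_year2 = year2[np.arange(n_trials), winner]
winner_rank_next = (year2 > winner_year2[:, None]).sum(axis=1) + 1

print("Median rank of last year's winner, one year later:",
      int(np.median(winner_rank_next)))           # about 10th out of 20
print("Share who stay in the top five:",
      round(float((winner_rank_next <= 5).mean()), 2))  # about 0.25
```

In the pure-luck case the trip to Hawaii goes to random variation every time, and the winner's expected rank the following year is right back in the middle of the pack.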
In my experience, managers aren't quite the dull-witted slackers that O'Hare makes out. They're quite well aware of the problems inherent in using performance metrics to judge people. However, they're also quite well aware of the problems inherent in not using performance metrics to judge people.
"When people's compensation is on the line," says Manzi, "they suddenly turn into Aristotle." Many a company has tried to design its sales-force commissions around arcane data analyses that try to control for complex factors, like whether one salesperson's territory has more customers than another's. Suddenly, they find the salespeople are all crack statisticians who can explain where the model has gone wrong in the case of their territory. Many of the companies that press forward find themselves, after six or eight months, with a sales force in full revolt. "I've seen it many times. Very few data-mining systems survive first contact with reality."
There's no Option C where you develop, through sheer managerial prowess, a performance evaluation system that avoids all the potential injustices and doesn't leave some portion of your workforce unhappy. Any company with more than a few employees is going to let some of them fall through the cracks.