Educational Reform: Low-Quality Testing Fails

by Mark Kleiman

Now lemmesee....

-Using high-stakes tests to reward and punish schools and their staffs encourages cheating.

-The relatively cheap (per test) tests that have to be used if we test all students - using census rather than sample testing, in violation of all the principles of statistical quality assurance - can measure only a subset of what we want the students to know and to be able to do. That is likely to distort curricular decisions by encouraging "teaching to the test" at every possible level: from deliberately teaching specific answers to known test questions at the lowest, to cutting back classroom time on non-tested subjects such as science and art in favor of more math and reading drill at the highest.

-Even accepting what the tests test for as a valid reflection of educational performance, sheer measurement error makes it hard to distinguish signal from noise in year-to-year variations. (Doing sample rather than census testing may improve validity, but it increases sampling error.)

-The percentage of students falling below some arbitrary cutoff is a bad statistic to use for management purposes. (Again, this is Statistical QA 101 stuff.) It throws away information. Worse, it creates perverse incentives, encouraging schools to concentrate on the students performing at around the cutoff level and to neglect both those too far below the threshold to have a chance of catching up and those comfortably above it.

-Outcome studies of the actual results of high-stakes testing programs suggest that they boost scores on the tests used, but actually reduce performance on nearly every externally validated measure.
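To make the cutoff point concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the scores, the passing mark of 60, and the two hypothetical schools are assumptions, not data from any real testing program. It only shows how a percent-below-cutoff statistic can reward a school for nudging a couple of borderline students over the line while registering nothing at all for a school that produces much larger gains elsewhere in the distribution.

```python
from statistics import mean

CUTOFF = 60  # hypothetical passing score, chosen only for illustration


def pct_below(scores, cutoff=CUTOFF):
    """Percentage of students scoring strictly below the cutoff."""
    return 100 * sum(s < cutoff for s in scores) / len(scores)


# Invented baseline scores for six students.
before = [30, 45, 58, 59, 70, 90]

# Hypothetical School A: moves only the two students sitting just under
# the cutoff (58 -> 61, 59 -> 60) and leaves everyone else where they were.
school_a = [30, 45, 61, 60, 70, 90]

# Hypothetical School B: large gains for the weakest and strongest students
# (30 -> 45, 45 -> 58, 90 -> 98), but nobody happens to cross the cutoff.
school_b = [45, 58, 58, 59, 70, 98]

for name, scores in [("before", before), ("A", school_a), ("B", school_b)]:
    print(f"{name:>6}: mean={mean(scores):.1f}  "
          f"pct_below={pct_below(scores):.1f}")
```

In this made-up example School B raises the average score far more than School A does, yet only School A shows any improvement on the cutoff statistic. A manager looking at that one number would reward exactly the behavior described above.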

So what we have here is a policy that won't work in theory and fails in practice. Why, exactly, are we supposed to be for it? (The fact that the dominant newspaper in the nation's capital is owned by a test-prep company doesn't count.)

These results put the proponents of high-stakes testing in what ought to be an inescapable logical box: a dilemma in the proper sense of that term. If trying something out, measuring its results, and acting accordingly is the right thing to do, then having tried out high-stakes testing, measured its results, and found them to be bad, we ought to dump it, or at least fundamentally redesign it. If trying, measuring, and responding is not the right thing to do, then what's the argument for high-stakes testing in the first place?

I have a very strong prejudice in favor of managing by the numbers, especially in an area such as education where the non-quantitative theorizing is so wooly and our knowledge of the underlying processes so inadequate. (How to produce high-quality research in a field where the relevant university units engage mostly in training for a poorly paid, low-status profession is a different problem.) Some teachers are much, much more effective than others, and we need to reward those who teach well and improve or replace those who don't. Some schools, and school districts, work much, much better than others, and we need to make the under-achievers act more like the high-achievers.

So I'd be inclined to strengthen the testing regime by broadening the base of knowledge and skill tested for and by making aggressive use of sampling, rather than just dumping the whole thing and letting the education establishment vapor on about how every child is different, every teacher is a skilled professional, and therefore nothing can be measured.

But there is now no case whatever for continuing to combine high stakes with low measurement quality. The job of measuring school (and teacher) performance is well worth doing, but that doesn't mean it's worth doing badly. High-stakes, low-quality testing? Been there, done that, got the T-shirt. B-o-o-o-o-ring! Next case, please.