February 1980
by James Fallows
ON a bright weekday morning last October, with the school year just begun, several dozen fresh-faced boys sat crouched over little desks. The boys were fifth-formers -- high school juniors -- at St. Albans, a private Episcopal school in Washington, D.C.; the object of their attention was the Preliminary Scholastic Aptitude Test, the first stage in the examinations known to millions as "the College Boards."
Like other test-takers, they looked tense and worried; as the examination books were collected and the No. 2 pencils put away, they gathered in clusters to discuss their answers. "How many intersections are there between three parallel lines and three non-parallel lines? You said twelve? God, I could only find eleven." But their worry was within bounds; St. Albans boys always did well, on tests and in college admissions. The year before, half of the St. Albans senior class had scored in the mid-600s or above on the verbal portion of the Scholastic Aptitude Test. On a scale running from 200 to 800, this placed them above 97 percent of their contemporaries. Their scores on the mathematical portion were even better. The richness of their education gave them SAT-style skills -- close reading, puzzle-solving, facility with the nuance of language -- almost as an incidental benefit. In one room, seniors would discuss the mathematical formulae of Leibniz while, next door, juniors analyzed Voltaire's parody of Leibnizian optimism in Candide. By comparison, antonym questions and reading-comprehension passages were a piece of cake.
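The puzzle the boys debated has a clean answer: each of the three non-parallel lines crosses all three parallel lines (nine points), and the three non-parallel lines can cross one another in up to three more, for a maximum of twelve. A brute-force check, with coordinates chosen arbitrarily for illustration:

```python
from fractions import Fraction

# Lines written as (a, b, c), meaning a*x + b*y = c, in exact arithmetic.
parallel = [(1, 0, c) for c in (0, 1, 2)]        # three vertical lines x = 0, 1, 2
nonparallel = [(0, 1, 0), (1, 1, 5), (1, -1, 7)]  # three mutually non-parallel lines

def intersect(l1, l2):
    """Return the intersection point of two lines, or None if they are parallel."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    return (Fraction(c1 * b2 - c2 * b1, det), Fraction(a1 * c2 - a2 * c1, det))

lines = parallel + nonparallel
points = set()
for i in range(len(lines)):
    for j in range(i + 1, len(lines)):
        p = intersect(lines[i], lines[j])
        if p is not None:
            points.add(p)

print(len(points))  # 12, when no two crossings happen to coincide
```

Eleven is what you get if two of those crossings fall on the same point, which is why the boys could disagree.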
Shortly after the fifth-formers finished their examination, a calculus teacher joked with one of the sixth-form (senior) boys. "Three touchdowns in the football game, straight A in English class, 750 on the SAT -- Jones, you're Harvard's dream." "Seven-eighty," the boy replied, smiling modestly.
A few weeks later, elsewhere in the District of Columbia, students from a large public high school took the SAT. The city's high schools have been overwhelmingly black and lower-class since the court decisions ordering desegregation, busing, and an end to the "tracking" system; many affluent families, black and white, have moved to the suburbs or send their children to private schools. Median SAT scores in the District -- when the school system chooses to reveal them -- are 200 to 300 points lower than those at St. Albans. Several years ago, there was a minor scandal when the valedictorian of a local high school, a straight-A student admired by all he met, was refused admission nearly everywhere he applied to college, so miserable were his SAT scores.
A number of students left this test center early, ignoring the test booklet's admonition to use extra time to recheck answers. Large portions of their answer sheets were blank. One black teenager pulled on his Washington Redskins jacket, lit a cigarette, and joined a group of friends. "Shit, man," he said, "what's the point?"
Each year, more than 2.5 million students sit hunched in testing centers, taking the PSAT, the SAT, "Achievement" and "Advanced Placement" tests that give advanced standing in college, admissions tests for law and medical schools, the Graduate Record Exam (GRE), and dozens of smaller examinations. Most of these are written, scored, and controlled by the Educational Testing Service of Princeton, New Jersey -- which, contrary to popular understanding, is not synonymous with The College Board. The Board is an organization of more than 2500 colleges, schools, school systems, and education associations, whose members hire the ETS to write their tests. For its work, the nonprofit ETS takes in about $80 million a year.
The SAT has been part of American education for more than fifty years; the ETS, for more than thirty. But never have the testers been more under siege than in the past twelve months. Last November, a few representatives of the ETS and The College Board, standing incognito in the back of the crowd, heard one speaker after another lambaste them at the National Conference on Testing in Washington, D.C. "The dishonesty of the ETS reminds me of the arms manufacturers," said Terry Herndon, executive director of the National Education Association. "They say guns don't kill people, people kill people. The ETS says it's not their fault if the tests are abused -- someone else is to blame."
"We've always been told that the tests might have problems, but we had to live with them, because they're the best we have," said Anna Kahn, national secretary of the PTA. "Now we're beginning to question that."
"Standardized tests are used from the cradle to the grave, to select, reject, stratify, classify, and sort people," said Gerda Steele of the NAACP, "and they are used in ways that keep certain segments of the population from realizing their aspirations. Most of all they limit the access of blacks and other minorities to higher education."
"Ninety million lives have been affected by data collected by the ETS," concluded Herndon, building toward a crescendo that brought a roar from the crowd. "They're unaccountable to the political community, unaccountable to the educational community, unaccountable to the legal community. Something must be done!"
The something in most minds that day was a national version of the "truth in testing" law (a nickname the testing industry abhors) passed in New York State last year. Under that law, students who take the SAT and similar tests in New York this spring will, for the first time, receive corrected answer sheets and copies of the test booklet a few months after the test date. This conference, one of several organized by a group called Project DE-TEST (for DE-mystify The Established Standardized Tests), was designed to build support for the efforts of Democratic Representatives Shirley Chisholm, Theodore Weiss, and George Miller to extend similar disclosure provisions nationwide.
Even more, the speakers at that conference hoped for a basic challenge to the role of standardized tests in American education. They are saying, in effect, that the SAT -- like the IQ tests that precede it in elementary school, and the graduate school admissions tests that follow later on -- is fundamentally unfair; far from serving as agents of diversity and social mobility, such tests reinforce and legitimize every inequality that now exists. As James Loewen, a sociologist from Catholic University's Center for National Policy Review, concluded, "standardized tests are the greatest single barrier to equal opportunity, at least in the sphere of education." Such accusations, and the vitriol behind them, have wounded and mystified those at the ETS. Several days after the Washington conference, the ETS observers were back at headquarters, stunned by what they had seen. They knew that such critics existed, but to hear their work described that way was a heavy blow. In a side office, a secretary transcribed a tape of Ralph Nader's latest blast at the ETS. From the tone of his criticism, there could be no doubt: this was war.
To the ETS, it is a particularly galling war, because it is an attack on the company's point of greatest pride. Those who have shaped the testing industry do not think of themselves as guardians of privilege; quite the reverse. They are proud of the role they have played in opening up opportunities; secure in that pride, they can understand the current criticism only as one more manifestation of the mindless leveling impulses of the day. What else could explain such hostility to the tool that seeks out talent wherever it may be found? Why else does everyone hate us so?
From the other side, Alan Nairn defines the challenge differently. For the last five years, Nairn has been working for Ralph Nader on a study of the ETS due to be published this spring; at twenty-four, he resembles in appearance and intensity the Nader of twenty years past. "Social class is viewed as a sad fact of life, but not an issue," he says. "The controversy over testing makes class an issue."
ON a 400-acre tract of pleasant wooded land in Lawrenceville Township, a few miles from Princeton, the professionals of the ETS develop the tests, study their effects, and brood about challenges to their authority.
The people at the ETS are not naive about testing, nor blind to the dark spots in its past. Their profession is still a young one; the first IQ test was developed less than a century ago. Modern "psychometricians" (which is what those who test mental ability call themselves) know that, like other still-emerging sciences, theirs had rough moments in its clumsy, infant days. They know perfectly well that the first crude IQ tests were used mainly for racial and ethnic exclusion. In 1912, on the basis of tests run at Ellis Island, Henry Goddard scientifically proved that 83 percent of Jews were "feebleminded," along with 90 percent of Hungarians, 79 percent of Italians, and 81 percent of Russians (most of them Russian Jews). Modern ETS researchers recall with sad smiles the miraculous finding, some years after the Ellis Island tests, that Jews and Italians improved dramatically in intelligence after they had lived in this country for a while, and that their children, raised as English-speakers, seemed somehow to have been spared their parents' feebleminded genes. At the ETS, they have seen drawings such as Figure I, which is used in the current version of the Stanford Binet Intelligence Test for elementary school children and which has now become practically a logo for Project DE-TEST. They have heard the sample question from the Lorge-Thorndike IQ test that John Weiss of Project DE-TEST uses to illustrate racial stereotyping and bias in standardized tests:
When a dove begins to associate with crows, its feathers remain ______ but its heart grows black.

For each outsider's story of a defective test, the insiders have a story of their own. Winton Manning, a bearded, reflective man in his late forties who is an ETS vice president and philosopher on matters of social equity, frequently tells journalists the tale of the breech block test. During World War II, the Navy evaluated breech block assemblers not by how quickly they could put a breech block together but by their score on a written test about how it should be done. We have seen it all, is the unspoken message of stories like this; we have worked hard on our tests.
They take most pride in the oldest and most famous test, the SAT, first administered in 1926. Before that time, all College Board exams were essays; the Board was established in 1900 to standardize the entrance exams that Harvard, Princeton, Yale, and a few other elite universities had been administering separately. The essay test and the multiple-choice SAT coexisted until World War II, when the essay was suspended for the duration and never resumed (except as a component in certain English proficiency exams). Hunter Breland, an ETS research psychologist, explained that to get statistically reliable results from an essay exam, students had to write five separate essays, with five readers each. "We found we could do as well with fifty multiple choice questions in a thirty-minute test," Breland said. "We got the same people in the same order." When those fifty questions are expanded to a full three-hour SAT, he said, you end up with a far better system than any feasible essay test. This principle has been extended even to demonstrations of writing skill. The old way of testing writing was to have the student write. For the last five years, as part of the SAT, students have spent thirty minutes on the Test of Standard Written English. It consists of multiple-choice questions like the following, in which the student is asked to choose the best phrasing for the underlined portion:
UNLESS THEY TAKE ACTION SOON, the world may face a power struggle for the resources of the sea.

(The "right" choice is B, nicely illustrating the ETS concept of the "best" answer. None of the options is very good, and B is a lame passive; but bright students are expected to pick it out as the "best" of a bad lot.)
Another breakthrough toward test "efficiency" -- finding the people you want with fewer questions -- came with the "quantitative comparison" section that has replaced much of the mathematics portions of the SAT, the law examination, the GRE, and other tests. Instead of being asked to compute answers, the students are told to compare two items and see which is the larger. An example:
Number of minutes in 1 week          Number of seconds in 7 hours
The student is asked to mark A on the answer sheet if the quantity on the left is larger, B if the quantity on the right is larger, C if they are equal, and D if there is not enough information to tell. (The complexity of the answer system makes some at the ETS feel that this section is vulnerable to coaching; if you've seen this sort of question before, you are less likely to panic on the test.)
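The arithmetic behind the sample item takes one line each way:

```python
minutes_in_week = 7 * 24 * 60       # 10,080 minutes
seconds_in_7_hours = 7 * 60 * 60    # 25,200 seconds

# The quantity on the right is larger, so the keyed answer is B.
answer = "B" if seconds_in_7_hours > minutes_in_week else "A"
print(minutes_in_week, seconds_in_7_hours, answer)  # 10080 25200 B
```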
With "objective" tests seemingly necessary in practice and defensible in theory, the ETS develops new ones all the time. The first step is to establish the "construct validity" of the test -- that is, whether what it measures has, at face value, some connection to the purposes of the test. If it's a screening device for potential lawyers, there should be some test of logical reasoning; if it's screening future doctors, a test of the physical sciences. Unless it is an achievement test, the questions are not supposed to measure outside "knowledge," only how the students handle the information given them. The most obvious exceptions to this are the word analogy and antonym sections of the SAT and the GRE, which are essentially vocabulary tests. On the sample SAT passed out by the company, the verbal section requires knowing words such as "emanation," "aesthete," and "espouse"; the GRE includes "concatenation," "anodyne," and "preternatural."
Questions are written by the ETS's own staff or by a sizable free-lance pool. To get a better bargain for its money, the ETS has begun holding "item-writing workshops" to teach its contributors the rules of the game. In early November, I sat in on one of these workshops, conducted by Richard Evert, a young ETS staffer with a professorial air. There were half-a-dozen free-lancers in the room, all but one of them women. They were working on "Logical Reasoning" questions from the LSAT. "It is important to remember that those taking the test are a very bright population," Evert said as the session began, "whose ablest members, though younger and less wise" (this with a smile) "are just as good as any of us. It will tax your abilities to write difficult items -- partly because, as they work their way up through the review process, a lot of the difficulty will be ground away."
Among questions he discussed, some were thrown out for being too confusing in their answers, others for lurking cultural or sexist bias ("Sir! You just spent $3000 on your daughter's wedding and you are still drinking an ordinary vodka? Now's the time for Tovarich"). The free-lancers were counseled not to be upset if their questions were rejected, handed back to them with elaborate corrective suggestions, or eviscerated of their trickiest parts as they made their way through the in-house review. They were told that a special "minority" representative would screen the questions for bias; they saw examples of nit-picking comments from the endless reviewers and editors who look at each question individually and then look once more at the overall mix of questions on each new test. By the time a question makes its way to the final test form, the ETS says, it will have been inspected nearly thirty times; the great majority of questions will perish along the way. This is why, they say, it takes eighteen months to prepare a new test, and why a full GRE or LSAT may cost $100,000 or more to produce.
For items that pass through this screening process, the most demanding standard is still to come: their statistical performance on the "pretest." This step is the heart of the ETS's claim for consistency in its tests; it is also the root of one of the most basic complaints.
EVERY SAT is divided into six sections, five of which count for the student's score. The sixth -- which the student can't distinguish from the others -- is made up of new questions that have passed the in-house reviews and are ready for use. As soon as the score sheets get back to Princeton and are sent through the computerized scanners and scorers, statisticians work up response charts for each of the new questions. These charts are long horizontal strips, divided into six sections for the six possible answers to the question (A through E plus "omit"). The chart shows how many students chose each possible answer, and what the median score of those students was on the rest of the test. What the ETS hopes to find, of course, is that those who chose the right answer (known as the "key") had higher scores than those who did not; if so, the question "works." If the question does not work -- like the first one in Figure 2 -- it must be changed. Sometimes the whole item must be junked; sometimes it can be saved, by adjustments and rewording to make the key more obvious or the "distracters" less tempting. Then it must be pretested again, until the statistics come out right. The second question in Figure 2 shows an adjustment that made it "work."
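The logic of those response charts can be sketched in a few lines: for each answer choice, count how many examinees picked it, and find the median rest-of-test score of that group. The data and function names here are illustrative, not the ETS's own.

```python
from collections import defaultdict
from statistics import median

def response_chart(responses, rest_scores, key):
    """For one pretest item: per-choice counts and the median score (on the
    rest of the test) of the examinees who chose each option.
    responses: list of 'A'..'E' or 'omit'; rest_scores: parallel list."""
    groups = defaultdict(list)
    for choice, score in zip(responses, rest_scores):
        groups[choice].append(score)
    chart = {c: (len(s), median(s)) for c, s in groups.items()}
    # The item "works" if the group that chose the key outscores every other group.
    works = all(chart[key][1] > m for c, (n, m) in chart.items() if c != key)
    return chart, works

# Toy data: the key is C, and those who chose it scored higher elsewhere.
responses   = ["C", "C", "B", "C", "A", "omit", "C", "B"]
rest_scores = [650, 700, 480, 620, 450, 400,   680, 510]
chart, works = response_chart(responses, rest_scores, key="C")
print(works)  # True
```

If `works` came out False, the question would be rewritten or junked, just as the article describes.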
The importance of this step can hardly be overstated. From the ETS's point of view, the statistical pretest is essential; how else can it guarantee, year after year, that the test will measure just the same things, in just the same way? But this consistency also means that the ETS must be very sure that its tests are measuring the right qualities, because their focus will never change. If talents are diverse, if different groups display their abilities in different ways, this process will never reveal it, because the standard set in the beginning is the standard it retains.
Even then the ETS's work is not done. There is the awkward but necessary business of "ethnic studies," which means a comparison of black and white scores, and correlations with different social and economic groups. There are the endless instructions to colleges and graduate schools about how the tests should be used. "A GRE test should be used only if its limitations are known," says the GRE handbook, in a typical warning. "A GRE test score should be used as only one of several criteria, and should not be given undue weight solely because it is convenient."
By and large, these warnings are more necessary for graduate schools than for colleges. One of the minor ironies of the testing controversy is that it occurs during a relative lull in pressure for college admissions. Because there are fewer college-age students than there were ten or fifteen years ago, and because a smaller percentage of them apply to college, today's students have an easier time of it. Lois D. Rice, a vice president of The College Board, is fond of pointing out that 90 percent of all college-bound students are accepted by the college of their choice; the argument about the SAT, she says, is really an argument about who gets into a few schools such as Stanford and Yale. This glosses over the fact that admission to the Stanfords and Yales still makes a difference; more generally, her figures may be only a sign that the tracking system has grown more refined.
These days, the SAT's significance for college admissions is mainly to confirm the judgments made by IQ and placement tests over the previous dozen years. There is a close connection between scores on elementary school IQ tests and on the SAT; indeed, recent research shows that tests in the fourth grade indicate which students will go to college almost as reliably as the SAT does. By the age of nine or ten, students are getting the picture about which of them will be the lawyers and which the plumbers.
GRADUATE schools still feel the pressure of numbers, especially those that control entry to lucrative professions such as law and medicine. Although applications have swelled in the last decade, admissions staffs remain small. As a result, many professional schools do end up using cutoff scores, and the ETS experts consider this an abuse of their work. Winton Manning has put out a policy paper explaining the difference between using test scores to determine admissibility, which is good, and using them for selection, which is bad. Yale Law School, for example, might decide that no one who scored below 500 on the LSAT could handle the work; that, in Manning's view, would be fine. But many big law and medical schools have more applicants than they can handle who score over 700, and they end up choosing among students on the basis of scores alone, accepting one at 740 and rejecting another at 710, even though the standard measurement error of the test means that the difference between those scores might be due entirely to chance.
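The point about measurement error can be made concrete. Assuming, for illustration, a standard error of measurement of about 30 points per score, the difference between two applicants' observed scores carries a standard error of 30 times the square root of 2, roughly 42 points -- so a 30-point gap between 740 and 710 is well within chance variation:

```python
import math

sem = 30.0                     # assumed standard error of measurement, per score
score_a, score_b = 740, 710
diff = score_a - score_b

# Standard error of the difference between two independent measurements.
sem_diff = sem * math.sqrt(2)  # about 42.4 points

# A gap smaller than about two standard errors of the difference is not
# statistically distinguishable from zero: the applicants may be equals.
z = diff / sem_diff
print(round(sem_diff, 1), round(z, 2))
```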
Manning and the others know these things happen; but that does not finally undermine their faith in what they do, for their ideology and their life experience combine to convince them of the value of their work. One critic of the tests said he was first drawn to the subject because his grandparents, Eastern European immigrants, arrived at Ellis Island just when many of their countrymen were being pronounced feebleminded. William Angoff, an ETS vice president who is the spiritual father of the SAT, has taken the opposite lesson from life.
Angoff is an urbane, silver-haired figure; when we met, he was wearing a blue blazer, gray slacks, and a striped silk tie. He grew up in the tenements of Boston, with his whole family chipping in to pay his way through school. He went to Boston Latin, from there to Harvard, and on to eminence in the world of psychometrics. "I consider that the tests have been a friend of American society at all levels," Angoff says. "It certainly has been a boon to people like me. It picks out people because of their individual likelihood to succeed. The person who is going to do well on that test at Boston Latin is going to do well on it anywhere else."
With this assertion, the question is joined: Are the tests fair? William Angoff is serene in his faith: "I believe that standardized tests have benefited American education, and have benefited all the classes, as a very important part of the American meritocratic philosophy." Among his colleagues, that confidence is shared; people speak without irony of "our mission," and have no doubt that the mission is the smoother, more just working of the "meritocracy."
Yet there are reasons to think their faith misplaced, their efforts, however sincere, far more destructive than they imagine. The reasons begin with the premises of the "meritocracy" whose cause they uphold.
It is interesting to remember that when Michael Young, the British sociologist, invented the term "meritocracy" twenty years ago, he did so with satirical intent. His point was that a system of rewards based on "ability" and "merit" would not necessarily be any fairer or more pleasant than other systems of stratification the world has known. It would, he said, be a dull and dangerous society, run by single-minded technicians. So deep has been the American hunger for a "fair" system of classification, one based on ability rather than accident of birth, that Young's term has been appropriated without its irony.
The unspoken premises of our meritocracy are these:
-- That there is such a thing as "intelligence" or "ability," and that it can be measured.

Whether or not they admit it, those who defend the tests are defending these premises. Whether or not they know it, those who complain about the tests are challenging the premises, beginning with the concept of measurable intelligence. Like most modern psychometricians, ETS officials are careful to say, when speaking for the record, that native intelligence is not what their tests are designed to detect. Indeed, the main SAT fact sheet says that it is a "test of developed ability, not of innate intelligence; a test of abilities that are developed slowly over time both through in-school and out-of-school experience." They say it with feeling, and on one level I am sure they believe it. But they clearly believe something contrary, too, if they are sincere -- and they are -- in their claims that the tests serve as agents of the meritocracy. They believe that the tests measure something fundamental. One does not have to call it intelligence. Call it smarts, or the right stuff. By whatever name, it is a notion of intrinsic worth. How else could the tests be presumed to identify the promising lad from the Boston slums and propel him upward through Boston Latin and Harvard? How else could they provide a means of comparing a sixth-former at St. Paul's with a senior at Muncie High? How else could they be good for "all the classes" in America? The test-makers seemingly want it both ways: they want to speak with scientific decorum about the limits of their work, and they want to say, We have the best tool for judging people objectively. Alan Nairn puts the paradox this way: "With all the disclaimers, they are in effect saying that this is so important a piece of information, so terribly revealing about the student, that it must be handled with great care."
I have yet to meet a high school student who did not take the tests as a measure of how "smart" he was. Students were not allowed to see their own SAT scores until 1958. Frank Bowles, president of The College Board when that decision was made, was prescient about its effects. He said in 1960: "There was great fear that students would have their values warped by learning their own scores, but I have learned from hearing my own children's conversations that SAT scores have now become one of the peer group measuring devices. One unfortunate may be dismissed with the phrase, 'That jerk -- he only made 420!' The bright, steady student is appreciated with his high 600, and the unsuspected genius with his 700 is held in awe."
"I have spent time among people at the pinnacle of the meritocracy," says one test expert, "with people who can dissect very rationally all the shortcomings of the tests. But they'll say, 'That guy's really smart -- he got 800s on his SAT's.'" Of the sixty-odd people I spoke with while preparing this article, exactly one volunteered his SAT score -- a friend who has made a reputation as an analyst and writer. He scored in the mid-300s, felt crushed by the experience, and finally found his way into college on his swimming skills.
Nearly everyone else said something like, "Of course, I did very well on the tests myself," in the same tone a millionaire might use in saying, "Of course, my family had some money." To say just how much would be bragging; but to leave any doubt about the general picture would be worse.
And this is only the effect on those who succeed: the real damage is to those who are taught to expect to fail. Record amounts of mail poured in to the National Education Association, the teachers' organization, after two articles by Arlene Silberman appeared, in Reader's Digest and McCall's, about the way children were taught to think of themselves as "ordinary" or "slow average" because of elementary school IQ tests. This wounding effect is compounded by a statistical quirk of the test. Common sense would suggest that, if a scale runs from 200 to 800, 500 would be the "average" score. Indeed it was, forty years ago, when the statistical norms for the SAT were established and 500 was set as the median score. But as times have changed and the sample of students taking the test has grown less select, the median has dropped nearly 100 points; these days, only about one quarter of all students score above 500 on the test. The other three quarters think they are "below average." At any one time, 75 percent of all teenagers probably also think that they are below average in popularity, resistance to acne, or development of secondary sex traits. These things pass, or most of them do; the verdict on intelligence remains.
THE ETS's long denial that coaching could affect scores illustrates its conviction that the tests plumb innate capabilities. The test-coaching dispute has been the subject of numerous articles (the best of them, although hotly contested by the ETS, is Stephen Levy's "ETS and the Coaching Cover-up," in the March 1979 issue of New Jersey Monthly) and promises to occupy lawyers and researchers for years, but its central point is simple. Any fool can look at one of the tests and see that preparation has to make a difference. The verbal sections of the tests are not only loaded with vocabulary but also contain obscure kinds of items likely to confuse or panic those who have not seen them before. One example is the "data evaluation" section of the LSAT, which consists of a long passage describing a business decision followed by up to twenty separate elements, each of which the student must identify as "Major Objective," "Major Factor," "Minor Factor," "Major Assumption," or "Unimportant Issue." The mathematical portions penalize those who have not brushed up on high school algebra and geometry. During the early 1970s, several versions of the LSAT included questions based on a "triangle chart" like the ones in Figure 3. From the information given, the student would be asked questions such as, "What was the average number of men per city who preferred home improvements?" Once the chart has been explained, it is a snap. Without that explanation, most people are left to guesswork and blind luck.
The ETS more or less admits all this now. Its newly released handbook for the law school exam says that "vocabulary cards and exercises, in addition to extensive reading with the help of a dictionary, will be useful" for certain parts of the test. Its publications now instruct students in test-taking strategy -- for example, that it always makes sense to guess if you can narrow the answer down to two or three possibilities. But until very recently, the ETS could not bring itself to admit such a thing. It defined coaching as last-minute cramming, said such hasty efforts would do no good, and extended the judgment to a blanket assertion that special preparation for the tests "is likely to yield insignificant increases in scores."
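The guessing advice follows from simple expected value. Under the formula scoring of the era -- a quarter-point deducted for each wrong answer on a five-choice item -- a blind guess breaks even, and eliminating even one choice makes guessing pay. A sketch:

```python
from fractions import Fraction

def expected_score(remaining, penalty=Fraction(1, 4)):
    """Expected points from guessing uniformly among `remaining` answer
    choices, with `penalty` deducted for a wrong answer."""
    p_right = Fraction(1, remaining)
    return p_right * 1 - (1 - p_right) * penalty

for k in (5, 4, 3, 2):
    print(k, expected_score(k))
# With all 5 choices in play the expectation is exactly zero;
# with any choice eliminated it turns positive, so the guess is worth making.
```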
Why such resistance to such a self-evident truth? Most likely it is because coaching, if effective, threatens to upset the whole applecart, by suggesting that what the tests measure can be fairly quickly changed. If the whole subject were less heated, this finding would not be surprising. One year of high school, with its courses in algebra and English, is expected to increase test scores, so why shouldn't a six-month coaching course have a similar effect? Stanley Kaplan, founder of one of the most popular chains of coaching schools, often says that parents pay him several hundred dollars to do what the public schools should be doing -- and that schools such as Exeter and Andover are just grander versions of the same idea.
Coaching also undermines the foundations of the tests from another direction. If courses can be designed for the specific purpose of increasing scores on the tests, does that not suggest that the tests reveal, rather than "aptitude" or "achievement," only mastery of an unusual and specialized system of thought? Since long before the current testing controversy, the ETS and its allies have fought a running battle against those who claim test scores are poor indicators of "merit" because of the limits inherent in both the format and the content of the tests.
Nearly twenty years ago, a mathematics professor at Queens College named Banesh Hoffmann published a book entitled The Tyranny of Testing; his argument was that multiple-choice tests reveal nothing about the student's reasoning and penalize those with complex or creative styles of thought. One example, taken not from Hoffmann but from an official testing manual, is shown in Figure 4. The manual explains why D is the answer to the first question and H is the answer to the second. But what about the student who sees, in the first question, that the first three figures are alike in being four-sided figures with no right angles, and chooses E as the only appropriate answer? Or who sees, in the second question, that the first three figures are all isosceles triangles, and that F is the only isosceles triangle among the answers? That student will get the question wrong, and have a lower IQ, unless he is enough of an old hand at multiple-choice tests to know the kind of obvious thinking the test-writers are usually looking for. "Sometimes you hear of very bright students who do poorly on these tests," says David Riesman of Harvard. "They don't exactly fail them, but their scores are not as spectacular as they should be. All you have to tell them is that the questions are designed for l'homme moyen sensuel, that they should take it at face value. Then they do fine."
While visiting the ETS, I spent evenings in my hotel room with a stopwatch on the table and a No. 2 pencil in hand, taking sample versions of the SAT, LSAT, and GRE. Six or eight times in the verbal portion of each test, I found questions with the same problem as the one in Figure 4: there were two or three plausible answers, depending on the logical course one chose. In most cases, it was easy enough to guess the "right" answer -- not by means of superior logic, but by knowing the way the ETS thinks. Of all the forms of test bias, this may be the most insidious and deepest rooted: the shared assumptions about which logic is "compelling" and which merely "superficial," assumptions that derive from the social experiences that shape judgment and taste. ETS officials winced at the suggestion that there was a system to their thought, but any veteran test-taker recognizes it. Recently a young law school graduate, his life story an unbroken series of successes on ETS exams, began boning up for the bar exam while finishing his clerkship at the Supreme Court. "My self-confidence went up 50 percent," he said, "when I discovered that the Multi-State Bar is an ETS exam."
The complaint against multiple-choice tests is based less on their neglect of individual genius than on their bias against larger groups. This is the heart of the case against testing. For all its other claims, the testing industry finally rests its defense on "equity." The tests were created to broaden the pool of talent open to the colleges, and that is what the testers say they still do. When they've finished with the disclaimers, the stipulations about "developed ability" and "limited predictive validity," when they lean back and talk about the meaning of it all, most of them speak as William Angoff does. The tests have been a friend to all classes. The system works. If their case were true, it would excuse most other defects in the test. But it is impossible to ignore the evidence that, in most instances, the tests simply ratify earlier advantage -- that, as engines of mobility, they have sputtered and died.
THROUGH all the controversies of the last decade about inherent racial differences in intelligence, many black groups have concluded that standardized tests are the latest attempt to deny them what they deserve. Black psychologists' associations have called for a flat moratorium on standardized tests; a journal called Measuring Cup, published in Savannah, reports on tests from the black point of view. James Loewen of Catholic University, who is white, has devised questions like the following (modified from Robert Williams's Black Intelligence Test of Cultural Homogeneity, or BITCH), which test "verbal aptitude and reasoning," but are based on black vocabulary.

Saturday Ajax got an LD:

Once you know that an LD is a Cadillac Eldorado, the reasoning can begin; without that, it's hard to make a start. Loewen spent several years in the 1970s teaching at Tougaloo, a primarily black college in Mississippi, and concluded that the SAT and the Graduate Record Exam were major practical and psychological barriers to his students. Students who performed well in his classes scored in the mid-400s on the GRE; his very top student got a 565. "We learned to help some students develop reasons why they had not taken the GRE," Loewen told a congressional committee considering the "truth in testing" bill last fall, because "some schools found it possible to waive the GRE requirement for a believable excuse, while they would not have overlooked a 400 score."
Loewen went on to make a fundamental point: that the tests are not fair national standards of comparison. Without one nationwide test, the test designers say, how can you compare the boy from Boston with the girl from Grand Forks? "A GRE score of 500," the GRE booklet claims, "has the same meaning whether earned by a student from a small private liberal arts college or a student at a large public university." This is precisely what Loewen denies, calling it a "statement of arrogance." "If you have two kids who get 500," he says, "one from Harvard and one from Tougaloo, you know that one of them is pretty dumb, and it's the one from Harvard."
The bias is not racial so much as economic, and the overall point is so bald that it can hardly be ignored. College Board data showed this relationship between economic standing and test performance in 1974:
[Table: Student's Score vs. Student's Mean Family Income]

Loewen has compared the median incomes of the fifty states and the District of Columbia with the National Merit Scholarship "cutoff score" for each state. (Scholarships are awarded on the basis of scores on the Preliminary Scholastic Aptitude Test, which is usually given in the junior year. Because National Merit does not want all its scholarships going to Connecticut and New Jersey, it sets a different cutoff score for each state.) "Connecticut has the highest cutoff, and Mississippi the lowest," Loewen says. "It is intriguing to note that Connecticut is the richest state in the nation, and Mississippi the poorest." All in all, he says, his list and the College Board's match almost perfectly; there is a .83 correlation between them -- about as close a relationship as statisticians ever expect to find.
These patterns strongly suggest that what the tests measure is exposure to upper middle class culture -- perhaps even the culture of the professional class of the east coast. (Is it entirely a coincidence, Loewen asks, that scores on the American College Testing Exam, written in Iowa City, are consistently better predictors of performance for students from all backgrounds than the ETS tests written on the outskirts of Princeton?) Whatever the reason, the findings simply blow apart the original precepts of the tests. One can of course argue that intelligence is hereditary, as in part it is, and that intelligence earns money, as to some extent it does. But these general tendencies do not explain the lockstep correlation between parental income and student scores. Unless one is willing to set aside the evidence of daily life and conclude that all smart people are rich, these results can mean only one thing: that standardized tests, created to offset one kind of privilege, have merely enshrined a different kind. The tests measure something, probably something of value -- but whatever it is, it's clearly a symptom of social advantage.
"Tests may measure aptitude of achievement within populations that share backgrounds," Loewen says, "but they do not measure accurately across backgrounds." To illustrate, he prepared the "Loewen Low Aptitude Test," which is "designed to show my urbane white students some of the forms of test bias and to give them the experience of 'flunking' an aptitude test." Here is one of its questions:
Spline is to mitre as _____ is to _____.

The question is biased toward the working class, toward students whose "exposure" would have taught them that a spline is a small piece of wood inserted to keep a mitre joint tight. From there, reasoning takes over -- spline adds strength to a mitre as straw adds strength to mud. (If you chose "love ... marriage," you have not grasped the concept of the "best" answer. "Sometimes one answer will be 65 percent right, and the 'distracter' will be 60 percent right," one test designer explained. "The first one's the 'best' answer -- but the distracter would be best if the other were removed." Here, any SAT veteran should recognize "straw ... mud" as the "best" because, like "spline ... mitre," it is tangible, not abstract.)
THERE is an annoying trick element to questions like this, but the argument behind them is sound. If the meritocracy's aim is to reveal talent, shouldn't we strive for ways to transcend boundaries of upbringing and taste? Is there not some way to avoid falling into the mushiness of saying that Black English is "just as good" as standard English, or that "all forms of learning are equally valid," yet still discover able people who grew up knowing more about splines than about concatenations? As a first step, shouldn't the ETS system of selecting questions be turned on its head? Instead of choosing new questions because they will yield exactly the same results as the old, the testers should deliberately look for questions that break the pattern -- not because they want everybody to get the same score, but because people from different backgrounds should have a chance to display their reasoning skills. It has been done before. The first widely used American version of the IQ test showed that, in native intelligence, women were significantly weaker than men. When the test was restandardized in the 1930s, one of the specifications was that men and women should get equally high scores. This involved, among other things, removing sports-knowledge questions from the test, and inserting others on which women did particularly well. Is it so outrageous to contemplate additional test items on which able poor or black children would excel?
Many professors look with horror on such a dilution of the tests. It is the levelers at work, they say: if these standards fall, nothing else will stand. "If people are underprepared for tests, they will generally be underprepared for courses," says David Riesman; he makes the logical point that a university cannot do its business without students qualified to meet its standards. "Good God," said one of my former professors when I mentioned the subject to him. "Students have little enough asked of them as it is. Few enough real standards to meet. It can only be worse if the everybody-is-equal view takes over."
True, many in the anti-testing movement abhor standards of all kinds -- any distinction, any measurement, any judgment that might make somebody feel bad. But one need not be among their number to feel uneasy about the tests and the meritocratic philosophy they represent. It is self-evident that not everyone is suited for freshman work at Harvard or at any other college. Not everyone should be a doctor. Some were born without the potential; others were ill prepared along the way. The universities do have standards of scholarship to protect.
In a way, the universities have been unfairly trapped by a myth -- the myth of themselves as the forces for social change. The ultimate meritocratic justification for selective education is that this privilege will go to the most deserving, who will be better trained for their future responsibilities, and who will provide surer, wiser leadership for all mankind. But most sociological studies show that education does not work this way at all. The amount of education you get makes a difference in success in later life, but performance in school doesn't (except by entitling you to spend more years in school). Instead of selecting and training leaders, education certifies them to hold positions of privilege. This confusion between the academic role of education and its role as a granter of credentials may be the biggest threat to academic standards of all. It leads to grade-grubbing, demands for "fair admissions," and a view of liberal education as nothing more than a ticket to business school.
The testers are aware of these complaints. But what, they say, do you expect us to do? "It would be tempting to say that the schools should be looking about to find the people who will make the greatest contributions to society in their occupations and professions," says William Turnbull. But the schools have asked the ETS to help with a more limited task, selecting "those who are ready for the next step."
The proper challenge to testing is to accept a higher modesty, one that begins with the understanding that this "merit" system pretends to more fairness than it delivers. All systems of selection are unfair; all are tempted to claim too much justice for their results. We recognize this about previous systems but are reluctant to face it in our own. Once that fact is recognized, it may be possible to think again about the proper building blocks of a meritocracy -- measures that do not seal fate at an early age, that emphasize performance in specific areas, that expand the pool of talent in more than a hit-or-miss way, and whose limits are always visible to us, so that we are not again deluded into thinking we have found a scientific basis for the order of lords, vassals, and serfs.
Copyright © 1980 by James Fallows. All rights reserved.
The Atlantic Monthly; February, 1980; The Tests and the "Brightest"; Volume 245, No. 2; pages 37-48.