When the telephone rang in the kitchen of his townhouse in Lambertville, New Jersey, one morning last April, Roy O. Freedle had just taken his blood-pressure medicine. That was good, he thought later, because it turned out to be a stressful conversation. The caller was Drew Gitomer, the senior vice-president for research and development at the Educational Testing Service, where Freedle had worked for more than thirty years as a research psychologist, before retiring in 1998. The two men were of different generations—Gitomer was forty-six and Freedle was sixty-nine. They had not known each other well when Freedle was still at ETS, but the older man had a good idea why the younger man was calling.
Just a few days before, a long article by Freedle had appeared in the Harvard Educational Review arguing that the most important test in America, the SAT, was racially biased. Previous work on bias in the SAT, he wrote, had failed to point out that African-Americans were doing better on harder questions of the test than non-Hispanic whites with the same SAT scores. Minority students, he argued, along with culturally deprived whites whose abilities were similarly hidden, should have an assessment of this undiscovered talent sent to colleges so that admissions committees could make fairer decisions. Because ETS and its New York-based client the College Board had given birth to the SAT, still depended on it for much of their income, and spent considerable time and energy trying to keep biased questions out of the exam, Gitomer was none too pleased by Freedle's argument.
The SAT I, called the Reasoning Test by the College Board, is a three-hour, mostly multiple-choice test of verbal and mathematical knowledge and skills. In its three quarters of a century as a college-entrance examination it has become a giant; more than 2.2 million students took the test in the 2001-2002 school year, some more than once. Nearly as many people last year took the ACT, the SAT's Iowa-based rival, but the SAT gets more attention because it is the prevalent college-admissions test in the major government, financial, and media centers of the East and West Coasts. A revised version of the SAT, to be introduced in March of 2005, will add grammar questions and a written essay, replace quantitative comparisons with second-year algebra questions, and replace analogies with more reading questions.
Plenty of Americans, particularly those familiar with the subtlest forms of ethnic prejudice, think there is something wrong with the SAT, and with other standardized tests. For the high school class of 2002 the average score for a non-Hispanic white student on the 1600-point test was 1060. The average score for a black student was 857, or 203 points lower. (For Asians the average was 1070, and for Hispanics it was slightly over 900.) The gap between blacks and whites on the test is sixteen points greater today than it was in 1992.
If minority students are at a disadvantage in taking the SAT, their choice of colleges will be significantly limited, with important implications for their financial, professional, and social futures. In other words, the SAT is interfering with the pursuit of happiness—a problem that has long absorbed the efforts of education researchers and civil-rights lawyers, with not nearly as much progress as anyone would like.
Freedle's accusation of racial bias in the SAT is striking because it is one of the few ever to come from an experienced ETS professional. Perhaps more important, it has caught the attention of the University of California (a powerful malcontent in the College Board family), which has ordered its own detailed analysis of the issue, due to be completed in 2004. Even if Freedle is ultimately proved wrong, his success at raising doubts about the SAT shows how loose a grip the test has on the political and scientific handholds that keep it upright.
In his book The Big Test: The Secret History of the American Meritocracy (1999), Nicholas Lemann described how the existence of the racial test-score gap and the difficulty of closing it began to dawn on American policymakers in the mid-twentieth century, just as the SAT was becoming the arbiter of which young people would get ahead, at least academically, and which would not. One of the first warnings had to do with socioeconomic bias rather than specifically racial bias. Lemann found an item in the diary of Henry Chauncey, the first president of ETS, showing that he had read an article in the April 1948 issue of The Scientific Monthly arguing that tests like the SAT could be biased against low-income students. But Chauncey dismissed it as a "radical point of view." The first prominent case involving the issue of racial bias in testing arose later, and dealt not with college-entrance tests but with employment exams. In 1963 the Illinois Fair Employment Practices Commission ordered a Motorola television factory in Chicago to hire a young African-American who had been denied a job because of his score on an IQ test. ETS and others had been promoting the use of tests in hiring decisions, Lemann wrote, but it never occurred to them that the results would be used to systematically exclude members of minorities. The Motorola case had a profound effect, especially on the debate over the Civil Rights Act of 1964. Senator John Tower, of Texas, inserted an amendment into the act specifically permitting the use of ability tests in employment. "Thus did standardized testing become a part of a landmark law in American history," Lemann wrote.
At about the same time, civil-rights activists, legislators, judges, and educators began using the term "affirmative action" to justify offering opportunities to minority members who otherwise would not have seemed qualified. In the 1960s and 1970s colleges and universities that had more applicants than spaces began to give preference to some minority students who had lower test scores than whites but whose high school grades and personal qualities suggested that they would benefit from a demanding academic environment.
This form of affirmative action was buttressed by the 1978 Supreme Court case Regents of the University of California v. Bakke. The Court ruled 5-4 against the quota system used by the university's medical school, whereby sixteen places out of a hundred in each entering class were reserved for minority students. However, Justice Lewis F. Powell wrote in his opinion that race could be considered in admissions decisions.
Shortly thereafter the College Board set up a fairness-review process that subjected every potential SAT question to close examination for racial stereotypes, loaded words, inappropriate assumptions, or anything else that might put minority students at a disadvantage. Questions dealing with subjects beyond the experience of a typical inner-city student, such as yachting or debutante balls, were thrown out. The College Board, together with ETS, also produced studies showing that the SAT was doing its job and did not make minority academic skills look any worse than they were. The data demonstrated that the SAT predicted about as well as high school grades how a student would do in his or her first year of college. And since selective schools did not want to admit applicants who could not meet their standards, and since they had to have some defensible way to sort applicants whose relative academic merits were sometimes hard to quantify, the SAT—and the ACT—not only survived the occasional assaults on their methodology but continued to grow.
Inside ETS, in 1990, the senior scholar Winton Manning thought he had found a way of partly addressing the SAT's racial gap with his invention of the MAT, the Measure of Academic Talent. It was designed, in part, to give minority students a boost by identifying those whose SAT scores were higher than would be expected from their families' incomes and educational backgrounds. The idea died because of technical deficiencies and what Lemann described as fear among ETS executives about the commercial consequences of the corollary effect: adjusting downward the scores of affluent students who had not done as well as their backgrounds predicted. A group within ETS later proposed a version of the MAT called "Strivers," but this also received little support. In recent years the College Board has supported research, by the Yale psychologist Robert Sternberg and others, into alternative tests, of creative and practical skills. Sternberg said that his initial results produced smaller ethnic test-score gaps than the SAT, but he acknowledged that his test still needed years of work.
In the late 1990s SAT critics hailed the experiments of the Stanford psychologist Claude M. Steele, who had found a factor he called "stereotype threat" that seemed to explain lower scores by even successful black students. (See "Thin Ice: 'Stereotype Threat' and Black College Students," in the August 1999 Atlantic.) Steele tested black Stanford undergraduates, in some cases telling them that the results would be used to assess their verbal ability, and in other cases stressing that ability was not what was being tested. Scores were significantly lower when test takers thought their verbal ability—and therefore their intellectual ability—was being assessed. Steele argued that the students were reacting to the pressure of feeling they had to disprove negative stereotypes of African-Americans' intellectual ability, and that this at least partly explained the achievement gap.
The continuing debate over the worthiness of the SAT did have an effect on higher-education admissions policies. Some colleges, such as Muhlenberg, in Allentown, Pennsylvania, and Bowdoin, in Brunswick, Maine, began to market themselves as places where students did not need the SAT to apply. In the 1990s the National Center for Fair & Open Testing (FairTest), based in Cambridge, Massachusetts, began to compile a list—now nearing the 400 mark—of schools that do not require either the SAT or the ACT, or require them only from applicants who do not meet other criteria. Most selective colleges, however, remained committed to the tests.
Standardized testing was recently scrutinized in a suit against the University of Michigan, whose undergraduate admissions process used a point system that awarded significant points for being a member of a minority group. In 1997 two applicants sought to overturn Michigan's system after being rejected despite having higher ACT scores than some minority students who were accepted. FairTest and other groups argued in amicus curiae briefs that any decision relying heavily on the ACT or the SAT as a valid admissions factor would be wrong. Indeed, according to the FairTest brief, which focused on the SAT, the test was not even a good predictor of first-year college grades; high school records, FairTest said, were a significantly more accurate gauge of college performance. (The College Board and ETS have maintained that either SAT scores or grades are better predictors than any alternatives, and recommended that colleges continue using them in combination.) Last summer the Supreme Court ruled that although diversity was a legitimate goal, the point system was not an acceptable means of achieving it. Under this ruling schools may still use affirmative action, as long as the admissions process evaluates each applicant individually and not by a point formula. In practical terms this means that the college-admissions process, at least as it relates to affirmative action, can go on much as it has for the past forty years or so: selective colleges will continue to accept minority students with test scores below those of some rejected white applicants in order to maintain campus diversity, but most will also continue to use the SAT and the ACT, despite their imperfections, to help in evaluating applicants.
Freedle's attack on the SAT came at a time when ETS and the College Board thought the scholarly debate over the fairness of the test had been settled. The gap between blacks' and whites' performance on the SAT was clear. The blame, they thought, should be placed not on the test but on differences in family income and culture—and on K-12 school policies infected by lingering racism. Many middle-class black and Hispanic families were new to affluence and higher education, and on average, some researchers argued, were not quite as middle-class as their white neighbors. That meant that their children were still at a disadvantage in an academic setting, and only more time and better schooling would close the gap. Furthermore, high schools appeared to be putting minority students in less challenging classes out of a misplaced concern, fed by old stereotypes, that the students would not be able to handle the demands of honors and Advanced Placement or International Baccalaureate courses.
"It is not bias in the tests," Wayne Camara, the vice-president of research and development for the College Board, said in an interview with me this spring. "It is the differences in the opportunities the students have to get a quality education, the kinds of support they have in school and in the community and in the home." The College Board's president, Gaston Caperton, agrees. "If this is a bad test, I wouldn't have taken this job. Wayne wouldn't have come to work here. No college president would have used these tests, which they have for years and years, if it were a bad test. None of us would be part of it."
Freedle more or less accepted the view that the SAT was a useful measuring tool, but he believed the test had a flaw that could be corrected. He first became interested in the SAT for the same reason many people do. He saw himself as one of those whose lives were changed by higher education. His father was a tool-and-die maker. His mother was a waitress. He grew up in the Chicago area and majored in psychology and biology at Roosevelt University while working in the mailrooms of Marshall Field's and the Universal Atlas Cement Company. At Columbia, where he earned his doctorate in experimental psychology in 1964, he supported himself with typing jobs and work in the architecture-department office. He was the first person in his family to go to college, much less graduate school.
At Columbia he became interested in how the structure of language in passages heard or read influenced thought and perception. He worked briefly at a Washington research firm after getting his doctorate. Then, in 1967, the eminent cognitive psychologist John B. Carroll lured him to ETS. Freedle enjoyed years of pure research on short-term memory, but when grants for such work became hard to get, he was happy to try his hand at more-practical projects.
He began to analyze questions on the Test of English as a Foreign Language, an exam for students from abroad who want to qualify for places at American universities. He found that by analyzing various linguistic aspects of the questions—word order, word placement, interrogative style—he could predict which ones test takers in Seoul or Shanghai or Sarajevo would find easy and which would make them chew their pencils and look at the clock.
Freedle identified seven factors that seemed to affect the difficulty of a test question. One of the most common was simple word repetition. If an answer had a word in it that also appeared in the question, more test takers chose it; thus a question was easier if it contained a word that recurred in the correct answer, and harder if it contained a word that recurred in an incorrect answer. In reading-comprehension questions another factor was where in a reading passage the key part appeared. If it was at the beginning or the end of the passage, the question was easier than if it was in the middle.
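The word-repetition factor lends itself to a simple sketch. The heuristic below is a loose reconstruction for illustration only, not Freedle's published model: an item is flagged as easier when a word from the question stem recurs in the correct answer, and harder when it recurs only in a distractor.

```python
# Toy version of the word-repetition difficulty factor described above.
# All inputs and the "easier"/"harder" labels are illustrative assumptions.

def repetition_signal(stem, correct, distractors):
    """Return a rough difficulty signal based on word repetition."""
    stem_words = set(stem.lower().split())
    if stem_words & set(correct.lower().split()):
        return "easier"   # a repeated word points test takers at the right answer
    for d in distractors:
        if stem_words & set(d.lower().split()):
            return "harder"   # a repeated word lures test takers to a wrong answer
    return "neutral"

print(repetition_signal(
    "The canoe glided down the river",
    "a small canoe",
    ["a large truck", "a tall tree"],
))   # easier: "canoe" recurs in the correct answer
```

A real model would of course weigh many linguistic features at once; this isolates just the one factor for clarity.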
Then he decided he would apply his analysis to ETS's biggest test. He realized that if he looked at how different ethnic groups reacted to the seven factors, he might help to improve the SAT and impress his supervisors, who indicated that rooting out bias in the test was a priority.
Freedle was excited by a new technique called differential item functioning, or DIF (pronounced diff). To see which questions on the verbal section of the SAT produced different results by race, test takers were first divided into groups by score: those who had scored 200, the lowest possible; those who had scored 210; those who had scored 220; and on up to 800, the top score. Each of those scoring groups was then examined to see how people of different ethnicities had done on each item of the test.
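The binning-and-comparison procedure can be sketched in a few lines. Everything here, from the data to the band width, is invented for illustration; it is not ETS's actual DIF methodology, only the basic idea of comparing per-item success rates within matched-score groups.

```python
# Sketch of the DIF comparison: bin test takers by total score, then
# compare each group's proportion correct on every item within a bin.

from collections import defaultdict

def dif_by_score_band(records, item_count, band_width=10):
    """records: list of (total_score, group, answers), where answers is a
    list of 0/1 correctness flags, one per item."""
    sums = defaultdict(lambda: defaultdict(lambda: [0] * item_count))
    counts = defaultdict(lambda: defaultdict(int))
    for total, group, answers in records:
        band = (total // band_width) * band_width   # e.g. 200, 210, ... 800
        counts[band][group] += 1
        for i, correct in enumerate(answers):
            sums[band][group][i] += correct
    # Per band and group: the proportion correct on each item.
    return {
        band: {
            group: [c / counts[band][group] for c in item_sums]
            for group, item_sums in groups.items()
        }
        for band, groups in sums.items()
    }

# Four hypothetical takers, all landing in the same 500 score band:
data = [
    (500, "A", [1, 0]),
    (505, "A", [1, 1]),
    (502, "B", [0, 1]),
    (508, "B", [1, 1]),
]
props = dif_by_score_band(data, item_count=2)
# In this toy data, group A does better on item 0 and group B on item 1,
# even though both groups sit in the same ability band.
```

The point of matching on total score first is that any remaining per-item gap cannot be explained by overall ability, which is what makes the pattern Freedle found noteworthy.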
Using DIF, Freedle began to notice some intriguing results, which turned out to have more to do with word choice than with the seven difficulty factors. At each level of ability, but particularly in the lower-scoring groups, white students on average did better than blacks on the easier items, whereas blacks on average did better than whites on the harder ones. (Whites, though, did better overall.)
That was not a result that Freedle's supervisors expected. In 1987 he handed in a draft of a report on the subject that he had done with Irene Kostin, an ETS colleague. The research-division chiefs asked for a revision. He handed in a second draft. They were still not satisfied. Some of the questions were legitimate, Freedle thought. His conclusions contradicted other research. The chiefs wanted him to look at the data from other angles. But each re-examination confirmed the initial results. By the time he was ordered to do an eleventh revision, Freedle had begun to wonder if ETS, in its scholarly way, was trying to discourage him from pursuing his rogue conclusion.
The report was eventually accepted, but his requests to do follow-up research were politely rejected. (A similar topic was assigned to other researchers, who did not get far with it.) He could have made a ruckus. Other ETS researchers had when they thought their best efforts were being buried. But Freedle just wanted to keep working, so he concentrated on his techniques for predicting the difficulty of items, which he knew interested the company. Over the next few years he wrote several reports on the subject. But by the late 1990s all his research proposals were being turned down. Others at ETS were also having a hard time getting projects approved, he says, but he thinks his supervisors had a particular problem with his work.
So in October of 1998 Freedle retired, taking with him much of his old data. He wanted to pursue on his own something that had popped out of his ETS work. On average, black students were performing only slightly above matched-ability whites on the hard questions—but averages did not submit applications to colleges; individual students did. When he broke the data down to specific cases, he found that many minority students got a boost of a hundred points or more on the SAT if the score was weighted toward the hard items.
Working from his townhouse, Freedle could no longer dip whenever he wanted to into the deep well of College Board SAT results. The provocative paper he had in mind could not be published in any peer-reviewed journal, because he did not have the statistical backup. But still he thought the idea was sound. He sent a thick sheaf of pages to the Harvard Educational Review. The editors whittled it down in time for the spring 2003 issue and titled it "Correcting the SAT's Ethnic and Social-Class Bias: A Method for Reestimating SAT Scores." (Freedle's analysis focused on comparing the test results of non-Hispanic whites with those of African-Americans, though he argued that it could be more widely applied, to the disadvantaged of all ethnic groups.)
In the article Freedle proposed a supplement to SAT scores, called the Revised-SAT, or R-SAT, which would be calculated based only on the hard items. By putting more emphasis on the results for the harder test questions, Freedle argued, the supplement would "greatly increase the number of high-scoring minority individuals." His late-1980s research with Kostin had revealed, using the DIF method, "evidence of an unintended but persistent cultural and statistical bias in the verbal section of the SAT that adversely affects African Americans," he wrote.
It should be noted from the outset that virtually all these DIF item effects are typically small. For example, White students may get 84 percent correct on some easy items, while African Americans get a slightly lower number, say 82 percent, correct for the same item. Conversely, for some particular hard items, White students may get 30 percent correct whereas African Americans might get a slightly higher score, say 31 percent correct. What is unusual about these effects is their highly patterned nature; that is, many easy items show a small but persistent effect of African Americans' underperformance, while many hard items show their overperformance ...
A culturally based interpretation helps explain why African American examinees (and other minorities) often do better on many hard verbal items but do worse than matched-ability Whites on many easy items. To begin with, easy analogy items tend to contain high-frequency vocabulary words while hard analogy items tend to contain low-frequency vocabulary words ... For example, words such as "horse," "snake," "canoe," and "golf" have appeared in several easy analogy items. These are words used frequently in everyday conversations. By contrast, words such as "vehemence," "anathema," "sycophant," and "intractable" are words that have appeared in hard analogy items, and do not appear in everyday conversation ... However, they are likely to occur in school-related contexts or in textbooks.
Common words, Freedle explained, "often have many more semantic (dictionary) senses than rare words," so there's more of a chance that people's cultural and socio-economic backgrounds will affect their interpretations of those words. (In a 1990 study Freedle and Kostin reported that "fifteen high-frequency analogy words ... had an average of 5.2 dictionary entries, whereas rare analogy words ... had an average of only 2.0 dictionary entries.") Thus words that are frequently used in the middle-class neighborhoods of the SAT makers may have a different meaning in underprivileged minority neighborhoods. This, Freedle continued, could help explain why African-American students do worse on questions containing those common words than on questions that depend on the harder (but less ambiguous) words they study at school. He found that this effect was most pronounced on those questions—sentence completions, analogies—that provided little or no context.
Although Freedle's analysis concentrated on the verbal section of the SAT, he argued that the difficulty-bias effect extended to the math section, and perhaps even to essay questions such as those found on AP exams. He cited research showing minority students doing better than non-Hispanic whites on harder math items, which he attributed to the fact that those items used more textbooklike language and "more abstract concepts learned strictly in the classroom." Minority students scored worse on the easier math items, just as they did on easier verbal items, because commoner words were used in those questions. Freedle said that an examination of essay test results showed a somewhat similar effect, whereby minority students scored better on harder topics than they did on easier ones.
He believed that a supplemental SAT score was the best solution. He emphasized that he was not out to destroy the most important product of ETS and the College Board. He was not proposing a completely new SAT. He liked the adjustments the College Board was making to the test, including the thirty-minute essay section that would be added in 2005. He wanted his R-SAT scores to be sent to colleges as a bonus, to help them identify students, mostly lower-income students of all races, whose SAT scores suffered because of the distance between the language of their families and neighborhoods and that of middle-class America.
He did not try to estimate how many students would benefit from this additional information, but he thought it would be enough to make the supplement worthwhile. Freedle found one African-American student (Freedle's data gave no names) whose verbal R-SAT score was 600 although his or her original verbal SAT score was only 290. "This student's gain score is 310 points—an astonishingly large reassessment of his/her scholastic skills," Freedle wrote. Potentially thousands of students would score 100 to 200 points higher on the R-SAT than on the SAT; that higher score could mean the difference between getting into a selective college and not. Such increases in their scores might also make them eligible for thousands of dollars in scholarships.
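The rescoring idea behind the R-SAT can be sketched as follows. Freedle's paper describes the principle of scoring from the hard items only, not a published formula, so the difficulty split, the linear rescaling onto the 200-800 range, and the sample data below are all assumptions made for illustration.

```python
# Minimal sketch of a hard-items-only supplemental score in the spirit
# of Freedle's R-SAT proposal. The scaling is a hypothetical choice.

def rescore_hard_only(answers, hard_items):
    """answers: dict of item_id -> 1 (correct) / 0 (wrong).
    hard_items: the item ids counted toward the supplemental score."""
    hard = [answers[i] for i in hard_items]
    fraction = sum(hard) / len(hard)
    # Linear rescale of the hard-item fraction onto 200-800 (an assumption),
    # rounded to the nearest 10 to mimic SAT-style reporting.
    return round(200 + 600 * fraction, -1)

# A taker who misses the easy items but answers every hard one:
answers = {i: (1 if i >= 5 else 0) for i in range(10)}   # items 5-9 correct
r_sat = rescore_hard_only(answers, hard_items={5, 6, 7, 8, 9})
# Despite getting only half the items right overall, this taker scores at
# the top of the hard-items-only scale -- the kind of reassessment Freedle
# had in mind for students like the one who jumped from 290 to 600.
```

The design question such a supplement raises, and the one the College Board seized on, is whether performance on the hard items alone carries enough signal to support this kind of rescaling.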
Freedle confessed that he did not have enough data to analyze the current form of the SAT, which differs in some ways from the SAT he had analyzed while still working at ETS. But he saw no reason why an R-SAT score wouldn't still benefit some minority students. Nor did he think that the SAT changes to take effect in 2005, including the removal of the analogies, would be enough to eliminate the need for an R-SAT, since the gap between easy words and hard words—or the gap between what people learn at home and what they learn at school—affected other parts of the test. He hoped that ETS would give his analysis the follow-up it deserved: a rigorous testing of its validity and predictive value. "The expense is truly minimal, the moral obligation maximal," he concluded.
When Drew Gitomer read the Review article, he felt not only that Freedle's conclusion was wrong but that his analysis was nonsense. It was based on snippets of old data and seemed to put great weight on correct answers that could be explained as random guesses. He asked some of his staff members to look at it, and then thought about what he should do. He helped the College Board to post a quick Web-site response, which criticized Freedle's paper as "flawed" and "misleading," and organized his staff to produce something longer and more complete, but still he worried about Freedle's use of SAT data, which he thought might be College Board property.
Gitomer, like Freedle, was a cognitive psychologist committed to the values of science. He had arrived at ETS in 1985 with a keen interest in finding better ways of training people for complex tasks and in creating alternatives to fill-in-the-box assessment tools like the SAT. Five years later, not yet thirty-five, he won the company's annual Scientist Award. But in 1999 Gitomer agreed to be the top administrator of the research division. So when Freedle's article appeared, the job of damage control fell to him. The ETS executive had promised himself that he would not get angry on the phone. He would not discuss the many large holes he saw in the article. He just wanted to be able to answer the questions he anticipated from the College Board: Had Freedle used some of its data that he was not authorized to have? Were more such articles on the way?
The pleasantries did not take long. Gitomer got to the point. "I have read your paper in the Harvard Educational Review," he said. "Where did you get the data?"
"That's the wrong question," Freedle replied. He wanted Gitomer to see the benefits of taking his work seriously. "You should view this, and the College Board should view this, as a positive development, and I really mean it. I have solved a significant problem for you."
Gitomer could see that he was not going to get an answer, but he tried again. "Well, let me ask you, then," he said. "How else did you get the data?"
"Again, that is the wrong question," Freedle said. "I think you must use this as a positive thing." He was not going to talk about people who might have shared information with him, and Gitomer was not interested in pushing him further. He asked if Freedle had any more articles in the works. Freedle said no. Relieved, Gitomer said good-bye and hung up.
Gitomer wrote a note to himself on the top of the first page of Freedle's article: "Call Atkinson." This was a reference to Richard C. Atkinson, the president of the University of California system. Atkinson had become the fulcrum of the SAT debate. Like Freedle and Gitomer, he was a cognitive psychologist who understood how the SAT was constructed. Parts of the test he did not like. It had too many psychometric tricks (in his view, the analogies were the worst) that forced college applicants to take expensive SAT-prep courses in order to decipher what Atkinson thought were irrelevancies—for example, is "entomology to insects" more like "agriculture to cows" or "pedagogy to education"? So, wielding his power, the U.C. president had persuaded the College Board to junk the analogies, add more-advanced math, and create the writing section for the 2005 version. He did this as a way of making the test fairer for all students. And Gitomer knew that he was likely to be intrigued by any suggestion that certain questions put minority students at a disadvantage.
Freedle realized the same thing, and decided to call Atkinson himself. His call was returned by Patrick S. Hayashi, the associate president of the U.C. system. After a friendly chat Hayashi suggested that Freedle talk with Saul Geiser, the director of research in Atkinson's office. Geiser told Freedle that U.C. planned to do just the kind of serious analysis of the latest data that Freedle had hoped ETS and the College Board would do.
As it happened, Gitomer did not call Atkinson, but a few days later he spoke to Mark Wilson, a Berkeley psychometrician who had been asked by U.C. to analyze Freedle's report. Gitomer told Wilson that he welcomed the U.C. study, because Freedle had raised serious issues—"no matter how scientifically bankrupt I believe they are." Wayne Camara, of the College Board, said that he, too, was happy to cooperate. However, the College Board's Web-site response to Freedle's article seemed to imply that any study of Freedle's results would be largely a waste of time.
Let us look briefly at the data for the so-called SAT-R Section that Freedle recommends. On the difficult items that are included in the SAT-R, African-American candidates receive an average score of 22 percent out of a perfect score of 100 percent. Since there are five answer options for each question, 22 percent is only slightly above what would be expected from random guessing, namely 20 percent. White candidates do somewhat better, achieving an average score of 31 percent. The results indicate that this test is too hard for either group and would be a frustrating experience for most students. There are simply too many questions that are geared to those with a much higher level of knowledge and skill than is required of college freshmen. Extending Freedle's argument, we could substantially reduce all group differences if the test were made significantly more difficult so that all examinees would have to guess the answers to nearly all of the questions. We could then predict that each subgroup would have to have an average of 20 percent of their answers correct, based on chance.
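The chance baseline in that critique is easy to verify: with five answer options per question, pure guessing yields one correct answer in five on average, the 20 percent floor the College Board cites. A quick simulation (purely illustrative) confirms it.

```python
# Simulate random guessing on five-option multiple-choice items to show
# the 20 percent chance floor the College Board's response invokes.

import random

random.seed(0)
trials = 100_000
options = 5
# Treat option 0 as the correct answer on every simulated item.
correct = sum(1 for _ in range(trials) if random.randrange(options) == 0)
guess_rate = correct / trials
# guess_rate hovers near 0.20; the quoted 22 percent on hard items sits
# only slightly above this floor, which is the heart of the objection.
```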
Freedle's response to this, in a draft memo that he never sent to the College Board, was "Shame on you." He said he had done a statistical analysis of the five choices in the questions studied and found that the students' picks did not seem to be random at all; taking his study further would not be adding error to error. Some independent experts dismissed the Freedle piece. "I was unimpressed," says Robert Linn, a University of Colorado professor and a former president of the American Educational Research Association. "I don't think there is much there." But some found it valuable. Robert Calfee, a retired Stanford cognitive psychologist who now serves as the dean of the education school at U.C. Riverside, said he found Freedle's article "very convincing and to me very understandable." His analysis of the effect of different language cultures on test results, Calfee said, seemed to mirror other research on the powerful differences between formal language learned at school and informal (often called natural) language learned at home. Michael T. Brown, a professor of education at U.C. Santa Barbara, called the article "a competently performed work, thought-provoking, and sensitive with respect to the issues of equity."
So while the University of California pursues its study, Freedle hopes that he is finally getting someplace after having his most provocative work stuffed into ETS file drawers. Nearly all those involved—ETS and College Board officials, University of California researchers, high school guidance counselors and admissions officers from those schools that would be affected by a change in the SAT—are, like Freedle, practical people with a seemingly distant but still compelling goal. They want to remove barriers that limit young people's choices in life. All of them, Freedle included, acknowledge that many other things, more difficult than devising a scoring supplement to a multiple-choice test, will have to be done to make that happen.