Freedle identified seven factors that seemed to affect the difficulty of a test question. One of the most common was simple word repetition. If an answer had a word in it that also appeared in the question, more test takers chose it; thus a question was easier if it contained a word that recurred in the correct answer, and harder if it contained a word that recurred in an incorrect answer. In reading-comprehension questions another factor was where in a reading passage the key part appeared. If it was at the beginning or the end of the passage, the question was easier than if it was in the middle.
Then he decided he would apply his analysis to ETS's biggest test. He realized that if he looked at how different ethnic groups reacted to the seven factors, he might help to improve the SAT and impress his supervisors, who indicated that rooting out bias in the test was a priority.
Freedle was excited by a new technique called differential item functioning, or DIF (pronounced "diff"). To see which questions on the verbal section of the SAT produced different results by race, test takers were first divided into groups by score: those who had scored 200, the lowest possible; those who had scored 210; those who had scored 220; and on up to 800, the top score. Each of those scoring groups was examined to see how people of different ethnicities had done on each item of the test.
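The procedure described above can be sketched in a few lines of code. This is only an illustration of the matched-score comparison at the heart of DIF, not ETS's actual implementation; the function name, data layout, and group labels are all assumptions made for the example.

```python
# Illustrative sketch of the DIF idea: bucket examinees into 10-point
# total-score bands (200, 210, ..., 800), then compare each item's
# percent-correct between two groups *within* each band, so that only
# people of matched overall ability are compared.
from collections import defaultdict

def dif_by_score_band(test_takers, item_ids):
    """test_takers: list of dicts with 'score', 'group', and 'answers'
    (a mapping item_id -> True/False for correct).
    Returns, per item and per score band, the gap in percent correct
    between the two groups (positive = 'white' subgroup did better)."""
    bands = defaultdict(list)
    for t in test_takers:
        bands[t["score"]].append(t)

    gaps = defaultdict(dict)
    for band, people in bands.items():
        for item in item_ids:
            rates = {}
            for grp in ("white", "black"):
                members = [p for p in people if p["group"] == grp]
                if members:
                    rates[grp] = (sum(p["answers"][item] for p in members)
                                  / len(members))
            if len(rates) == 2:
                gaps[item][band] = rates["white"] - rates["black"]
    return gaps
```

Freedle's pattern would show up here as small positive gaps on the easy items and small negative gaps on the hard ones, repeated across most score bands.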
Using DIF, Freedle began to notice some intriguing results, which turned out to have more to do with word choice than with the seven difficulty factors. At each level of ability, but particularly in the lower-scoring groups, white students on average did better than blacks on the easier items, whereas blacks on average did better than whites on the harder ones. (Whites, though, did better overall.)
That was not a result that Freedle's supervisors expected. In 1987 he handed in a draft of a report on the subject that he had done with Irene Kostin, an ETS colleague. The research-division chiefs asked for a revision. He handed in a second draft. They were still not satisfied. Some of the questions were legitimate, Freedle thought; his conclusions contradicted other research. The chiefs wanted him to look at the data from other angles. But each re-examination confirmed the initial results. By the time he was ordered to do an eleventh revision, Freedle had begun to wonder if ETS, in its scholarly way, was trying to discourage him from pursuing his rogue conclusion.
The report was eventually accepted, but his requests to do follow-up research were politely rejected. (A similar topic was assigned to other researchers, who did not get far with it.) He could have made a ruckus. Other ETS researchers had done so when they thought their best efforts were being buried. But Freedle just wanted to keep working, so he concentrated on his techniques for predicting the difficulty of items, which he knew interested the company. Over the next few years he wrote several reports on the subject. But by the late 1990s all his research proposals were being turned down. Others at ETS were also having a hard time getting projects approved, he says, but he thinks his supervisors had a particular problem with his work.
So in October of 1998 Freedle retired, taking with him much of his old data. He wanted to pursue on his own something that had popped out of his ETS work. On average, black students were performing only slightly above matched-ability whites on the hard questions—but averages did not submit applications to colleges; individual students did. When he broke the data down to specific cases, he found that many minority students got a boost of a hundred points or more on the SAT if the score was weighted toward the hard items.
Working from his townhouse, Freedle could no longer dip into the deep well of College Board SAT results whenever he wanted. The provocative paper he had in mind could not be published in any peer-reviewed journal, because he did not have the statistical backup. But still he thought the idea was sound. He sent a thick sheaf of pages to the Harvard Educational Review. The editors whittled it down in time for the spring 2003 issue and titled it "Correcting the SAT's Ethnic and Social-Class Bias: A Method for Reestimating SAT Scores." (Freedle's analysis focused on comparing the test results of non-Hispanic whites with those of African-Americans, though he argued that it could be more widely applied, to the disadvantaged of all ethnic groups.)
In the article Freedle proposed a supplement to SAT scores, called the Revised-SAT, or R-SAT, which would be calculated based only on the hard items. By putting more emphasis on the results for the harder test questions, Freedle argued, the supplement would "greatly increase the number of high-scoring minority individuals." His late-1980s research with Kostin had revealed, using the DIF method, "evidence of an unintended but persistent cultural and statistical bias in the verbal section of the SAT that adversely affects African Americans," he wrote.
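The core of the R-SAT proposal, scoring from the hard items only, can be sketched as follows. The difficulty cutoff and the rescaling to the 200-800 range are illustrative assumptions for the example, not the estimation method from Freedle's Harvard Educational Review article.

```python
# A minimal sketch of the rescoring idea behind the proposed R-SAT:
# report a supplementary score computed only from the hard items.
# 'hard_cutoff' and the 200-800 rescaling are hypothetical choices.

def r_sat_score(answers, item_difficulty, hard_cutoff=0.5):
    """answers: item_id -> True/False (correct or not).
    item_difficulty: item_id -> fraction of all examinees who answered
    the item incorrectly (higher = harder).
    Returns percent correct on the hard items, rescaled to 200-800."""
    hard = [i for i in answers if item_difficulty[i] >= hard_cutoff]
    if not hard:
        return None  # no hard items attempted; no supplement to report
    frac_correct = sum(answers[i] for i in hard) / len(hard)
    return round(200 + 600 * frac_correct)
```

A student who does relatively well on the hard items, as Freedle found many minority students did, ends up with a supplement higher than a conventional score that weights easy and hard items alike.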
It should be noted from the outset that virtually all these DIF item effects are typically small. For example, White students may get 84 percent correct on some easy items, while African Americans get a slightly lower number, say 82 percent, correct for the same item. Conversely, for some particular hard items, White students may get 30 percent correct whereas African Americans might get a slightly higher score, say 31 percent correct. What is unusual about these effects is their highly patterned nature; that is, many easy items show a small but persistent effect of African Americans' underperformance, while many hard items show their overperformance ...
A culturally based interpretation helps explain why African American examinees (and other minorities) often do better on many hard verbal items but do worse than matched-ability Whites on many easy items. To begin with, easy analogy items tend to contain high-frequency vocabulary words while hard analogy items tend to contain low-frequency vocabulary words ... For example, words such as "horse," "snake," "canoe," and "golf" have appeared in several easy analogy items. These are words used frequently in everyday conversations. By contrast, words such as "vehemence," "anathema," "sycophant," and "intractable" are words that have appeared in hard analogy items, and do not appear in everyday conversation ... However, they are likely to occur in school-related contexts or in textbooks.
Common words, Freedle explained, "often have many more semantic (dictionary) senses than rare words," so there's more of a chance that people's cultural and socio-economic backgrounds will affect their interpretations of those words. (In a 1990 study Freedle and Kostin reported that "fifteen high-frequency analogy words ... had an average of 5.2 dictionary entries, whereas rare analogy words ... had an average of only 2.0 dictionary entries.") Thus words that are frequently used in the middle-class neighborhoods of the SAT makers may have a different meaning in underprivileged minority neighborhoods. This, Freedle continued, could help explain why African-American students do worse on questions containing those common words than on questions that depend on the harder (but less ambiguous) words they study at school. He found that this effect was most pronounced on those questions—sentence completions, analogies—that provided little or no context.