The report was eventually accepted, but his requests to do follow-up research were politely rejected. (A similar topic was assigned to other researchers, who did not get far with it.) He could have made a ruckus. Other ETS researchers had when they thought their best efforts were being buried. But Freedle just wanted to keep working, so he concentrated on his techniques for predicting the difficulty of items, which he knew interested the company. Over the next few years he wrote several reports on the subject. But by the late 1990s all his research proposals were being turned down. Others at ETS were also having a hard time getting projects approved, he says, but he thinks his supervisors had a particular problem with his work.
So in October of 1998 Freedle retired, taking with him much of his old data. He wanted to pursue on his own something that had popped out of his ETS work. On average, black students were performing only slightly above matched-ability whites on the hard questions—but averages did not submit applications to colleges; individual students did. When he broke the data down to specific cases, he found that many minority students got a boost of a hundred points or more on the SAT if the score was weighted toward the hard items.
Working from his townhouse, Freedle could no longer dip into the deep well of College Board SAT results whenever he wanted. The provocative paper he had in mind could not be published in any peer-reviewed journal, because he did not have the statistical backup. But still he thought the idea was sound. He sent a thick sheaf of pages to the Harvard Educational Review. The editors whittled it down in time for the spring 2003 issue and titled it "Correcting the SAT's Ethnic and Social-Class Bias: A Method for Reestimating SAT Scores." (Freedle's analysis focused on comparing the test results of non-Hispanic whites with those of African-Americans, though he argued that it could be more widely applied, to the disadvantaged of all ethnic groups.)
In the article Freedle proposed a supplement to SAT scores, called the Revised-SAT, or R-SAT, which would be calculated based only on the hard items. By putting more emphasis on the results for the harder test questions, Freedle argued, the supplement would "greatly increase the number of high-scoring minority individuals." His late-1980s research with Kostin had revealed, using the DIF method, "evidence of an unintended but persistent cultural and statistical bias in the verbal section of the SAT that adversely affects African Americans," he wrote.
It should be noted from the outset that virtually all these DIF item effects are typically small. For example, White students may get 84 percent correct on some easy items, while African Americans get a slightly lower number, say 82 percent, correct for the same item. Conversely, for some particular hard items, White students may get 30 percent correct whereas African Americans might get a slightly higher score, say 31 percent correct. What is unusual about these effects is their highly patterned nature; that is, many easy items show a small but persistent effect of African Americans' underperformance, while many hard items show their overperformance ...
A culturally based interpretation helps explain why African American examinees (and other minorities) often do better on many hard verbal items but do worse than matched-ability Whites on many easy items. To begin with, easy analogy items tend to contain high-frequency vocabulary words while hard analogy items tend to contain low-frequency vocabulary words ... For example, words such as "horse," "snake," "canoe," and "golf" have appeared in several easy analogy items. These are words used frequently in everyday conversations. By contrast, words such as "vehemence," "anathema," "sycophant," and "intractable" are words that have appeared in hard analogy items, and do not appear in everyday conversation ... However, they are likely to occur in school-related contexts or in textbooks.
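The sign flip Freedle describes is easy to miss in prose, so here is a minimal sketch of the arithmetic, using only the illustrative percentages from the quoted passage (84 versus 82 on an easy item, 30 versus 31 on a hard one). The item labels, field names, and the `percent_gap` helper are hypothetical. This is not the DIF procedure itself, which compares examinees matched on overall ability; it only shows how small per-item gaps can point in opposite directions.

```python
# Illustrative percent-correct figures from the quoted passage.
# Item labels and field names are hypothetical, chosen for this sketch.
items = [
    {"item": "easy example", "white_pct": 84, "african_american_pct": 82},
    {"item": "hard example", "white_pct": 30, "african_american_pct": 31},
]


def percent_gap(row):
    """African American percent correct minus White percent correct."""
    return row["african_american_pct"] - row["white_pct"]


def describe(gap):
    """Label the direction of a gap from the African American group's side."""
    if gap > 0:
        return "overperformance"
    if gap < 0:
        return "underperformance"
    return "no difference"


gaps = {row["item"]: percent_gap(row) for row in items}

for name, gap in gaps.items():
    print(f"{name}: {gap:+d} points ({describe(gap)})")
```

Each gap is only a point or two; Freedle's claim rests on the same direction recurring across many easy items and the opposite direction across many hard ones.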
Common words, Freedle explained, "often have many more semantic (dictionary) senses than rare words," so there's more of a chance that people's cultural and socio-economic backgrounds will affect their interpretations of those words. (In a 1990 study Freedle and Kostin reported that "fifteen high-frequency analogy words ... had an average of 5.2 dictionary entries, whereas rare analogy words ... had an average of only 2.0 dictionary entries.") Thus words that are frequently used in the middle-class neighborhoods of the SAT makers may have a different meaning in underprivileged minority neighborhoods. This, Freedle continued, could help explain why African-American students do worse on questions containing those common words than on questions that depend on the harder (but less ambiguous) words they study at school. He found that this effect was most pronounced on those questions—sentence completions, analogies—that provided little or no context.
Although Freedle's analysis concentrated on the verbal section of the SAT, he argued that the difficulty-bias effect extended to the math section, and perhaps even to essay questions such as those found on AP exams. He cited research showing minority students doing better than non-Hispanic whites on harder math items, which he attributed to those items' use of more textbooklike language and "more abstract concepts learned strictly in the classroom." Minority students scored worse on the easier math items, just as they did on easier verbal items, because more common words were used in those questions. Freedle said that an examination of essay test results showed a somewhat similar effect, whereby minority students scored better on harder topics than they did on easier ones.