Over the past couple of days I've had a chance to see the Oxford English Corpus in action, and I'm really impressed. Covetous. The thing contains 2 billion words of text (and counting), making it by far the largest linguistic corpus in existence. All of the sources are 21st-century, and every passage is meticulously tagged as to whether it's British, American, Canadian, Australian, etc., and whether it's from news, fiction, blogs, online chat rooms, medical journals ... Naturally, the tags make it possible to pick apart usage in the different realms. If you want to see sentences containing the word "balloon" in British fiction or in American medical literature (where it's not as scarce as you might suppose, owing to "balloon angioplasty"), no problem. Click, click, hit "Enter," and the passages line up neatly on the screen.
The developers of the corpus have tried to make the text as representative a sample of contemporary English as possible. Which of course gets me thinking: What does that mean? Certainly, the developers have given a lot more thought to this question than I have. They're obviously smart, experienced, and passionate about their work; I'm not at all skeptical of them. I would love to get my hands on the corpus. But I can't help being skeptical that anything anyone could come up with could be "representative" of contemporary English. Have I zeroed in on a fundamental design problem, a fundamental problem with the nonspecialist's relationship to technology, or a fundamental problem with my state of mind?

Barbara Wallraff, a contributing editor and columnist for The Atlantic, has worked for the magazine for 25 years. She is also a weekly syndicated newspaper columnist for King Features and the author of Word Fugitives (2006), Your Own Words (2004), and the national best-seller Word Court (2000). Her writing about language has appeared in The Washington Post, The Boston Globe, The Wilson Quarterly, The American Scholar, and The New York Times Magazine. 

