Over the past couple of days I've had a chance to see the Oxford English Corpus in action, and I'm really impressed. Covetous. The thing contains 2 billion words of text (and counting), making it by far the largest linguistic corpus in existence. All of the sources are 21st-century, and every passage is meticulously tagged as to whether it's British, American, Canadian, Australian, etc., and whether it's from news, fiction, blogs, online chat rooms, medical journals ... Naturally, the tags make it possible to pick apart usage in the different realms. If you want to see sentences containing the word "balloon" in British fiction or in American medical literature (where it's not as scarce as you might suppose, owing to "balloon angioplasty"), no problem. Click, click, hit "Enter," and the passages line up neatly on the screen.
The developers of the corpus have tried to make the text as representative a sample of contemporary English as possible. Which of course gets me thinking: What does that mean? Certainly, the developers have given a lot more thought to this question than I have. They're obviously smart, experienced, and passionate about their work, and I'm not at all skeptical of them. I would love to get my hands on the corpus. But I can't help being skeptical that any collection of text, however carefully assembled, could be "representative" of contemporary English. Have I zeroed in on a fundamental design problem, a fundamental problem with the nonspecialist's relationship to technology, or a fundamental problem with my state of mind?