Women say “um” more often than men do, and men favor “uh.” When people from the middle of the country begin a tweet with one of those words, it’s usually uh, while much of the rest of America opts for um. That, at least, is the finding of linguists who analyzed 14,000 phone conversations and 600 million tweets.

In the era of Big Data, language researchers can produce insights like these in a flash. Ten years ago, when I was writing a book about the cultural and linguistic significance of uh, um, and other pause fillers, I’d have been thrilled to have such statistics at my disposal.

Spoken uhs and ums have long intrigued psychologists and linguists, because people tend to say them without deliberation. Sigmund Freud pioneered the examination of speech errors for insight into a person’s unconscious self; George Mahl, a psychiatrist at Yale, continued this tradition in the 1950s, tying the emotional states of patients to their uhs, ums, and other so-called disfluencies. Later, in the 1980s, psycholinguists began using disfluencies to study how the brain produces language.

Now come data showing that uses of uh and um seem to be patterned along social lines. In August, the University of Pennsylvania linguist Mark Liberman reported that, according to his analysis of transcribed phone conversations, women say “um” 22 percent more frequently than men do, while men say “uh” more than twice as often as women do. Various Web sites jumped on the news that, as one headline put it, “Dudes Say ‘Uh’; Ladies Say ‘Um.’ ” A few days later, Jack Grieve, a dialectologist at Aston University in the U.K., made more headlines. By mapping 600 million tweets, he found a preference for starting sentences with um in regions including the Southeast, New England, and the upper Midwest; in an area stretching west to Arizona, north to Iowa, and east to Ohio, uh was dominant.

Tweeting is not talking, of course, so Grieve’s results don’t necessarily establish anything about spoken uhs and ums. When uh and um are written, they’re deliberate—and, says Grieve, “they have all sorts of other functions, such as signaling confusion, surprise, sarcasm, or cuteness.” He speculates that um may be somewhat more polite than uh. If so, perhaps the northeastern preference for um is simply an expression of a regional formality. Grieve says that in his other research, he has found the Northeast to be the most linguistically formal area of the U.S.—for example, northeasterners use contractions less frequently than do other Americans. Grieve notes that other regions that favor um—such as the Bay Area and the Southeast—also “tend to be a bit more formal.” But he admits this is a hunch.

Herb Clark, a Stanford University cognitive psychologist who has studied uh, um, and other phenomena that he calls “departures from the ideal,” cautions that the stories Big Data tells about language are provisional. Data can show us what people are saying and writing, and can do so quickly (using a data set he’d already prepared, Grieve produced his uh/um map in about a minute), but they cannot show us why. As data proliferate and computing becomes ever faster, we can expect more and more headlines about smaller and smaller grains of lexical sand. But only controlled experiments will explain what we’re hearing (and seeing), and keep Big Data from ultimately becoming big noise.