When one person asks another a question, the answer arrives after an average of 200 milliseconds. This is so fast that we can’t even hear the pause. In fact, it’s faster than the brain can prepare a reply from scratch. It takes the brain about half a second to retrieve the words to say something, which means that in conversation, one person is gearing up to speak before the other is even finished. By listening to the tone, grammar, and content of another’s speech, we can predict when they’ll be done.
This precise clockwork dance that happens when people speak to each other is what N.J. Enfield, a professor of linguistics at the University of Sydney, calls the “conversation machine.” In his book How We Talk, he examines how conversational minutiae (filler words like “um” and “mm-hmm,” and pauses that are longer than 200 milliseconds) grease the wheels of this machine. In fact, he argues, these little “traffic signals” to some degree define human communication. What all human languages have in common, and what sets our communication apart from that of animals, is our ability to use language to coordinate how we use language.
I hopped into the conversation machine with Enfield for a very meta chat about the big impacts of tiny words and pauses on human interaction. An edited and condensed transcript of our interview is below.
Julie Beck: Can you explain what the “conversation machine” is and why it’s unique among animals?
N.J. Enfield: When we’re having a conversation, because of the entirely cooperative nature of language, we form a single unit. Certain social cognitions that humans have—the capacity to read other people’s intentions and the capacity to enter into true joint action—allow us to connect up to each other in interaction and ride along in this machine.
Obviously, animals communicate in a range of interesting and complex ways. But where I draw the line is the moral accountability that humans have in interaction. If one person doesn’t do the appropriate thing, for example not answering a question when it’s being asked, we can be held to account for that. We don’t see that in animals. [In humans], one individual can say: “Why did you say that?” Or “Please repeat that.” You don’t see animals calling others out for their failures, asking why did they say that, or could they repeat that. [What’s unique in humans] is the capacity for language to refer back to itself.
Beck: It seems like conversation is always operating on two levels. One is we’re talking about whatever it is we’re talking about, and at the same time, on this more meta level, we’re monitoring the conversation itself and steering it in the direction we want it to go.
Enfield: Exactly. In the book, I mention a psychologist by the name of Herb Clark, at Stanford. He’s made the point for years that language is a tool for coordinating joint action. Let’s say you and I are moving house. All day we’re going to be using language to coordinate our activity. When we lift up a table, we’ll say, like, “One, two, three, lift.” We’re using language to coordinate our physical activity. Herb Clark points this out, and then he says language is used in exactly that way to coordinate the very activity of using language. We might be talking about a subject like what are we going to do on the weekend or whatever, but at the same time we’re using all these traffic signals to coordinate the activity of talking itself. We’re sending little signals like, “Wait, I’m not ready to finish my turn yet,” “What was that? I didn’t catch what you said,” “Yes, I’m still paying attention to you.” Language regulates itself.
Beck: One of the ways we do that is how quickly people respond to each other, right? You write that people usually respond to each other in a conversation within 200 milliseconds. What if you take longer to respond? What signal does that send?
Enfield: It could mean a few things. The fact that this is average, 200 milliseconds, suggests people are aiming for that. So if you are late, it suggests you were not able to hit that target because of some trouble in finding the words you wanted. Or maybe you didn’t hear what was said, or maybe you were distracted in some way. That delay is caused directly by some kind of processing problem. And if you ask people difficult questions, their answers will tend to be delayed.
One of the big traffic signals that manages that is these hesitation markers like “um” and “uh,” because they can be used as early as you like. Of course, they don’t have any content, they don’t tell you anything about what I’m about to say, but they do say, “Wait please, because I know time’s ticking and I don’t want to leave silence but I’m not ready to produce what I want to say.”
There’s another important reason for delay, and that is because you are trying to buffer what we call a “dis-preferred response.” A clear example would be: I say “How about we go and grab coffee later?” and you’re not free. If you’re free and you say, “Yeah, sure, sounds good,” that response will tend to come out very fast. But if you say “Ah, actually no, I’m not really free this afternoon, sorry,” that kind of response is definitely going to come out later. It may have nothing to do with a processing problem as such, but it’s putting a buffer there because you’re aware saying “No” is not the thing the questioner was going for. We tend to deliver those dis-preferred responses a bit later. If you say “no” very quickly, that often comes across as blunt or abrupt or rude.
The way we play with those little delays, others are very sensitive to what that means. A full second is about the limit of our tolerance for silence. Then we will either assume the other person’s not going to respond at all, and we just keep speaking, or we might pursue a response.
Beck: Maybe I shouldn’t tell you this, but one of the things that they tell you to do if you’re doing an interview is to just wait. If they’re not responding, just sit there quietly, because people get uncomfortable and then they just keep talking.
Enfield: Exactly. The interesting thing about it is you as an interviewer have to suppress quite a strong tendency to jump into that space. It’s a skill you’ve got to learn to do. I think people naturally don’t feel comfortable with that silence. Once you’ve got that one second going by, somebody’s got to do something. Unless it’s a situation where you’re with your loved ones in your house or you’re on a long car drive or something like that. Obviously, we can lapse into silence and that’s not a problem, but if we’re in the middle of a to-and-fro conversation, we’re generally not going to let that happen.
Beck: So I’m going to transcribe this Q&A later, and I’m going to edit all of those filler words like “um” and “uh” and “well” out of this interview, as I always do. But you write that these words are actually extremely important to conversation. What am I going to lose by cutting all of that out of this transcript?
Enfield: I think it’s the right thing to do, to edit it out when you write things down. You’re not going to lose anything too significant, and the reason is you’ve changed the context completely in which people are going to consume those words. At the moment, the words I’m producing are being interpreted by you in real time. Things never come out perfectly, and we have to edit on the fly. That’s what these words do. What they’re doing is telling you, “No, that word is not what I meant, I’ve doubled back and I’m now going to replace that word with this word.” Or, “Wait a second, I’m about to get the word I’m looking for.” But as soon as you transcribe those, people are not consuming the words at the same time and place as I’ve created them. Those “ums” and “uhs” just become superfluous.
Beck: So you don’t need the words that you use to edit yourself anymore because I’m literally editing you?
Enfield: Exactly. The thing about my book is that as a reader, you don’t know how many times I’ve rephrased a sentence. But you can’t hide that from someone in interaction because you’ve got the time pressure of turn taking. What we’re doing, it’s messy, there’s no getting around that. And that is completely hidden from view when you write something down and publish it because no one’s going to get access to all the drafts. But conversation is all draft.
Beck: Another thing you mentioned that I thought was super interesting was the way people use “um” as a way to claim more conversational space for themselves. Can you talk about that?
Enfield: In any form of interaction, we don’t have access to each other’s minds. It’s the classic problem of human life in a way. Things like “ums” and “uhs” signal there’s some delay in processing. But as a speaker, what I can do is exploit those kinds of signals. I can use them dishonestly. I can use something like “um” to give the overt signal that I’m having some sort of trouble with processing, but in reality, all I’m doing is trying to claim more ground and get you to keep waiting for me to finish.
All words can be used to lie. Whether they’re nouns and verbs, or whether they’re traffic signals, we can exploit them in dishonest ways. If you want to game the system, and all you want to do is hold the floor, then words like “um” can be exploited in that way. Obviously, there are limits to it. People are sensitive to these things, and after a while if you’re trying to dominate the floor, people will either wise up and grab it back or they will just get sick of you.
Beck: Another thing I do in interviews all the time, that I’m doing right now and I’m also going to cut out of the transcript, is I say “mm-hmm” a lot while the person is talking. It makes sense; it’s me just signaling that I’m still listening. But how important is that to our experience of conversation? If I wasn’t “mm-hmm”-ing, would that make a difference to how our conversation goes?
Enfield: Yeah, it would make a big difference. When you’re saying “Mm-hmm, uh-huh,” you’re really playing an important role in the smooth operation of this conversation machine. In the book, I talk about a study done by Janet Bavelas in Canada, with her colleagues. They brought people into the lab, they asked them to get into pairs, and they’d just randomly nominate one of them and say, “Think of a near-miss scenario you had and tell that to the other person.” The listener will look at them, they’ll nod, say “Uh-huh, mm-hmm,” and when the person gets to the punchline, they’ll say things like “Wow.”
Then they had a special condition where they tried to distract the listener. They said, “You have to press this button underneath the desk every time the person who’s telling their near-miss scenario uses a word that begins with the letter T.” It completely distracted the listeners from actually following along the content of the story. They produced many fewer of those “uh-huhs” and “mm-hmms.” It also meant the timing of them was kind of out of whack, and they didn’t really recognize when the speaker had reached the climax of the story—the moment when they’re supposed to say “Oh, wow.” They showed that when you distract the listener, then the storyteller tends to circle back and repeat themselves. They essentially become a less proficient and less fluent storyteller. It was a powerful demonstration of precisely the importance of those types of feedback markers for the performance of the person who is telling the story itself.
Beck: You talk a lot in the book about the “moral architecture” of conversation. Explain what that means in the context of these little traffic signals. What does using words like “um” have to do with morality?
Enfield: Morality’s a strong word. When you use that word, people think, “Oh you’re talking about is it okay to have sex with animals” or whatever. Thinking about grand moral questions. I’m talking about a much simpler code. In general, morals tell us how we should live. In the moral architecture of language, they tell us how we should talk. What the moral code does is it licenses us to hold other people accountable to that code. Like, “Hey I asked you a question,” that would be an example. I might not be saying it explicitly but I’m implying, “That’s bad. You shouldn’t be silent when I’m asking you a question, you should respond.”
When it comes to little words, I produce “um” and “uh” as a signal to you that I know I should be speaking right now. The right thing to do is to be speaking fluently, moving the conversation forward. The whole motivation for my producing those little traffic signals is to make clear that, despite current appearances, I am aware of and I’m following the basic stipulations of what it takes to produce an appropriate conversation. It’s that whole moral architecture that human beings have, it’s the root of so much of our cultural life and our social life: the defining of what’s appropriate, what’s inappropriate, and policing those things and judging others on the basis of those things. And in these extremely subtle ways it’s right there in every conversation that we have.