Part of what I needed to figure out was how to exploit the Loebner Prize’s unusual “live typing” medium. The protocol being used was unlike e-mails, text messages, and standard instant-messaging systems in one crucial way: it transmitted our typing keystroke by keystroke. The judge and I were watching each other type, typos and backspacing and all. I remember some Internet chat programs back in the ’90s trying out this character-at-a-time approach, but people for the most part rejected it. It was too invasive, was the feeling: what people like about writing is the time and space to compose and edit a message before sharing it with the other person. The advantage of the character-at-a-time transmission, though, is that it approaches much more closely the condition of speech, with its fluidity of turn-taking and its choppy grammar: what’s lost in eloquence is made up for in agility.
It also, then, lets us see typing’s “negative space”: hesitation. In a chat conversation where text is transmitted with every carriage return, only egregiously long pauses are taken to be part of the interaction. With more fluid and immediate feedback, silence acquires meaning. Failing to quickly answer a question in a face-to-face conversation, for instance, is tantamount in many cases to answering it. I recall asking a friend of mine how things were going with the woman he’d started seeing; the um and the tiny pause in which he searched for the right words made it clear they were having problems.
So what did the Loebner Prize’s unusual (and recently implemented) protocols enable and disable, compared with the standard, walkie-talkie, turn-taking style? Long missives weren’t going to work, as they had in previous years, when programs were able to steamroll the judges by eating up the clock and delivering ridiculously prolix answers. If two parties are taking strict turns under time pressure, they are putting themselves at the mercy of the length of the other’s turns. Here, for instance, is one program’s conversation transcript from 1996:
Computer: How about telling me a joke?
Computer: Here’s my favorite joke … A dog limps into a bar with a bloody bandage on his foot. He’s wearing a 10-gallon hat, six-shooters in his holsters and chaps. He growled, “I’m a-lookin’ for the man that shot my paw.”
“The joke’s not funny …” the judge writes, giving the program an opening to tell another one—which it does (“A knotty, worn-out old string walks into a bar …”). Meanwhile, three full minutes have elapsed. If a computer (or confederate) started rambling on too long under the new, live-typing protocols, the judge could and would just cut it off.
And so another piece of my confederate strategy fell into place. I would treat the Turing Test’s strange and unfamiliar textual medium more like spoken English, and less like the written language. I would attempt to disrupt the turn-taking “wait and parse” pattern that computers understand, and create a single, flowing duet of verbal behavior, emphasizing timing. If computers understand little about verbal “harmony,” they understand even less about rhythm.
If nothing was happening on my screen, whether or not it was my turn, I’d elaborate a little on my answer, or add a parenthetical, or throw a question back at the judge—just as we offer and/or fill audible silence when we talk out loud. If the judge took too long considering the next question, I’d keep talking. I would be the one (unlike the bots) with something to prove. If I knew what the judge was about to write, I’d spare him the keystrokes and jump in.
There’s a trade-off, of course, between the number of opportunities for serve and volley, and the sophistication of the responses themselves. The former thrives with brevity, the latter with length. It seemed to me, though, that so much of the nuance (or difficulty) in conversation comes from understanding (or misunderstanding) a question and offering an appropriate (or inappropriate) response—thus, it made sense to maximize the number of interchanges.
Some judges, I discovered, would be startled or confused at this jumping of the gun, and I saw them pause, hesitate, yield, even start backspacing what they had half-written. Other judges cottoned on immediately, and leapt right in after me.
In the first round of the 2009 contest, judge Shalom Lappin—a computational linguist at King’s College London—spoke with a computer program called Cleverbot, and then with me. My strategy of verbosity was clearly in evidence: I made 1,089 keystrokes in five minutes (3.6 keystrokes a second) to Cleverbot’s 356 (1.2/sec), and Lappin made 548 keystrokes (1.8/sec) in my conversation, compared with 397 (1.3/sec) with Cleverbot. Not only did I say three times as much as my silicon adversary, but I engaged the judge more, to the tune of 38 percent more typing from Lappin.
Reviewing the logs later, though, I looked for a way to quantify the fluidity of the human-to-human interactions against the rigidity of the human-to-machine ones. It occurred to me that you could count the changes—let’s call them “swaps”—of the party who typed the latest keystroke. I did manage to type three times as much as Cleverbot, but the real story, as it turns out, is in the swaps. Lappin’s conversation with Cleverbot had 33; his conversation with me had 492, almost 15 times as many.
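The swap count described above is simple to compute: walk the keystroke log in order and count each point where the identity of the most recent typist changes. A minimal sketch, in Python, under the assumption that the log is just an ordered sequence of typist labels; the sample sequences are invented for illustration, and only the published totals (1,089 vs. 356 keystrokes; 492 vs. 33 swaps) come from the transcripts themselves:

```python
def count_swaps(typists):
    """Count changes of the party who typed the latest keystroke."""
    return sum(1 for prev, cur in zip(typists, typists[1:]) if prev != cur)

def keystrokes_per_second(n_keystrokes, seconds):
    """Average typing rate over the whole conversation."""
    return n_keystrokes / seconds

# A rigid, turn-taking exchange: long unbroken runs of one typist, few swaps.
rigid = ["judge"] * 20 + ["bot"] * 15 + ["judge"] * 20 + ["bot"] * 15
# A fluid exchange: the two parties interleave constantly, many swaps.
fluid = ["judge", "human"] * 35

print(count_swaps(rigid))   # 3
print(count_swaps(fluid))   # 69
# The published rate: 1,089 keystrokes over a five-minute round.
print(round(keystrokes_per_second(1089, 5 * 60), 1))  # 3.6
```

Note that the two sample logs contain the same order of magnitude of keystrokes; the swap count, not the volume, is what separates them, which is exactly the point of the metric.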
You don’t converse with Google, or with most computer systems; you depose them. There’s a crucial difference. Interestingly, many Loebner Prize judges approach the Turing Test as a kind of cross-examination; strangely, a number of confederates also seem to approach it with that dynamic in mind. One of the human-to-human conversations in 2008 never manages to get out of that stiff question-and-response mode:
Judge: Do you have a long drive?
Confederate: fairly long
Judge: so do I :( ah well, do you think you could have used public transport?
Confederate: i could have
Judge: and why not?
Confederate: i chose not to
Judge: that’s fair. Do you think we have too many cars on the road or not enough today?
Confederate: it’s not for me to say
Yawn! But the computer in this pair is playful with the judge from the get-go:
Computer: Amen to that.
Judge: quite the evangelist
Computer: Our Father, who art in cyberspace, give us today our daily bandwidth.
Judge: evangelist / nerd lol. So how are things with you today?
And with that, the program has practically sealed up the judge’s confidence in its humanity with its second sentence. Note that the confederate’s stiff answers prompt more grilling and forced conversation—what’s your opinion on such-and-such political topic? But with the computer, the judge, misled by the opening wisecracks into assuming it’s the real person, remains utterly casual: how are things? This makes the contest easier for the computer and harder for the confederate.
It surprised me to see some confederates being coy with their judges. Asked what kind of engineer he is, Dave, to my left, answered, “A good one. :)” And Doug, to my right, responded to a question about what brought him to Brighton with “if I tell you, you’ll know immediately that I’m human ;-)” For my money, wit is very successful, but coyness is a double-edged sword. You show a sense of humor, but you jam the cogs of the conversation. Probably the most dangerous thing a confederate can do in a Turing Test is stall. It’s suspect—as the guilty party would tend to be the one running out the clock—and it squanders your most precious resource: time.
The humans in a Turing Test are strangers, working under time pressure, in a medium that is slow and carries no vocal tonality. A five-second Turing Test would be an easy win for the machines: the judges, barely able to even say “hello,” simply wouldn’t be able to get enough data from their respondents to make any kind of judgment. A five-hour test would be an easy win for the humans. The Loebner Prize organizers have tried different time limits since the contest’s inception, but in recent years they’ve mostly adhered to Turing’s original prescription of five minutes: around the point when conversation starts to get interesting.
A big part of what I needed to do as a confederate was simply to make as much engagement happen in those minutes as I physically and mentally could. Rather than adopt the terseness of a deponent, I offered the prolixity of a writer. In other words, I talked a lot. I stopped typing only when to keep going would have seemed blatantly impolite or blatantly suspicious. The rest of the time, my fingers were moving. I went out of my way to embody that maxim of “A bore is a man who, being asked ‘How are you?’ starts telling you how he is.”
Judge: Hi, how’s things?
Confederate: hey there
Confederate: things are good
Confederate: a lot of waiting, but …
Confederate: good to be back now and going along
Confederate: how are you?
When we’d finished, and my judge was engaged in conversation with one of my computer counterparts, I strolled around the table, seeing what my comrades were up to. Looking over at my fellow confederate Dave’s screen, I noticed his conversation began as if he were on the receiving end of an interrogation, and he was answering in a kind of minimal staccato:
Judge: Are you from Brighton?
Confederate: No, from the US
Judge: What are you doing in Brighton?
Confederate: On business
Judge: How did you get involved with the competition?
Confederate: I answered an e-mail.
Like a good deponent, he let the questioner do all the work. When I saw how stiff Dave was being, I confess I felt a certain confidence—I, in my role as the world’s worst deponent, was perhaps in fairly good shape as far as the Most Human Human award was concerned.
This confidence lasted approximately 60 seconds, or enough time for me to continue around the table and see what another fellow confederate, Doug, and his judge had been saying.
Judge: Hey Bro, I’m from TO.
Confederate: leafs suck
Judge: I am just back from a sabbatical in the CS Dept. at U of T.
Judge: I remember when they were a great team.
Judge: That carbon date me, eh?
Confederate: well, the habs were a great team once, too …
Judge: YEH, THEY SUCK TOO.
Confederate: (I’m from Montreal, if you didn’t guess)
Doug and his judge had just discovered that they were both Canadian. They let rip with abbreviations and nicknames and slang and local references. And then they started to talk about hockey.
I was in trouble.
Six months after the 2009 contest, a video appeared on YouTube of a man having a shockingly cogent conversation with a bot about Shakespeare’s Hamlet. Some suspected it might herald a new age for chatbots, and for AI. Others, including myself, were unimpressed. Sophisticated behavior doesn’t necessarily indicate a mind. It might indicate just a memory. As Dalí so famously put it, “The first man to compare the cheeks of a young woman to a rose was obviously a poet; the first to repeat it was possibly an idiot.”
For instance, Richard Wallace, the three-time Most Human Computer winner, recounts an “AI urban legend” in which
a famous natural language researcher was embarrassed … when it became apparent to his audience of Texas bankers that the robot was consistently responding to the next question he was about to ask … [His] demonstration of natural language understanding … was in reality nothing but a simple script.
The moral of the story: no demonstration is ever sufficient. Only interaction will do. In the 1997 contest, one judge gets taken for a ride by Catherine, waxing political and really engaging in the topical conversation “she” has been programmed to lead about the Clintons and Whitewater. In fact, everything is going swimmingly until the very end, when the judge signs off:
Judge: it looks like i’ve got to go, Catherine. its been nice talking with you
Judge: do you need water or something before the 5 minute rounds?
At which point the program erupts into gibberish.
Computer: OK, yes on balance … Time to get off this one I think and onto something more down to earth!
Judge: like what?
Computer: Almost certainly! let’s move on
Computer: I suppose it depends on where you’re coming from, but as the song goes in My Fair Lady, “Why can’t a woman be more like a man?”
Defeat from the jaws of victory.
We so often think of intelligence, of AI, in terms of sophistication, or complexity of behavior. But in so many cases, it’s impossible to say much with certainty about the program itself, because any number of different pieces of software—of wildly varying levels of “intelligence”—could have produced that behavior.
No, I think sophistication, complexity of behavior, is not it at all. For instance, you can’t judge the intelligence of an orator by the eloquence of his prepared remarks; you must wait until the Q&A and see how he fields questions. The computation theorist Hava Siegelmann once described intelligence as “a kind of sensitivity to things.” These Turing Test programs that hold forth may produce interesting output, but they’re rigid and inflexible. They are, in other words, insensitive—occasionally fascinating talkers that cannot listen.
As computing technology in the 21st century moves increasingly toward mobile devices, we’ve seen the 1990s’ explosive growth in processor speed taper off, and product development become less about raw computing horsepower than about the overall design of the product and its fluidity, reactivity, and ease of use. This fascinating shift in computing emphasis may be the cause, effect, or correlative of a healthier view of human intelligence—an understanding, not so much that it is complex and powerful, per se, as that it is reactive, responsive, sensitive, nimble. Our computers, flawed mirrors that they are, have helped us see that about ourselves.