One night in 1984, a man broke into 22-year-old Jennifer Thompson’s apartment, threatened her at knifepoint, and raped her. While it was happening she tried to memorize everything about him: his face, hair, clothes, body type. Later that day, she recounted those details to a police sketch artist.

Two days later, a detective showed Thompson a photo lineup of six men. She ruled out four of them right away, and stared at the other two pictures for four or five minutes. Finally she chose one. “Yeah. This is the one,” she said, as recounted in the book Picking Cotton. “I think this is the guy.”

“You ‘think’ that’s the guy?” one of the detectives asked her.

“It’s him,” she said.

“You’re sure?” asked another detective.


She wrote her initials and the date on the back of the photo, then asked them, “Did I do OK?”

“You did great, Ms. Thompson.”

The man she identified, Ronald Cotton, was convicted and sentenced to life in prison. More than 10 years later, a DNA test revealed that Thompson had pointed to the wrong guy. Cotton was innocent.

Eyewitness testimony is hugely influential in criminal cases. And yet, brain research has shown again and again that human memory is unreliable: Every time a memory is recalled it becomes vulnerable to change. Confirming feedback, such as a detective telling a witness she “did great,” seems to distort memories, making them feel more accurate with each recollection. Since the start of the Innocence Project, 318 cases have been overturned thanks to DNA testing. Eyewitness mistakes played a part in nearly three-quarters of them.

For three decades psychology researchers have been searching for ways to make eyewitness identifications more reliable. Many studies have shown, for example, the value of “double-blind” lineups, meaning that neither the cop administering the lineup nor the witness knows which of the photos, if any, is the suspect.

But injecting science into the justice system is tricky. For one thing, most criminal investigations happen at a local level. The U.S. has roughly 16,000 law enforcement agencies and few nationally mandated standards. The other big problem is the nature of science itself: Evidence for a given idea builds gradually, as scientists try to replicate others’ work. It can take years or even decades for a clear picture to emerge, and in the meantime scientists may vigorously disagree. While they argue, cases are opened and closed, and people, sometimes the wrong people, go to prison.

Some helpful guidance came today from the National Academy of Sciences. Last year the Academy asked a panel of top scientists to review technical reports and expert testimony about eyewitness identifications and make some solid recommendations. The resulting 160-page report offers many concrete suggestions for carrying out eyewitness identifications. For example, the Academy recommends using double-blind lineups and standardized witness instructions, and training law enforcement officials on the fallibility of eyewitness memory.

On one question, though, the Academy offers no clear answer: What’s the best way to present a photo lineup to a witness? This fuzziness reflects a hot debate bubbling in the scientific literature.

* * *

In 1984, psychology researcher Gary Wells published a study of eyewitness memory in which he set up a mock crime in his laboratory. College-student volunteers were told they were about to participate in an experiment about video games. While they were waiting for the experiment to begin, they witnessed someone steal the game. Then Wells’s team asked them to pick out the thief from a photo lineup. Some of the lineups contained the actual thief but others did not.

The study found that when volunteers made mistaken identifications, it was usually in lineups that did not contain a photo of the real culprit. Wells has a hunch as to why that might be. “Witnesses have a natural propensity to identify the person in the lineup who looks most like the perpetrator relative to the others,” he says. “The problem with that is that if the real perp’s not there, there’s still somebody who looks more like the perpetrator than others.”

Based on those findings, Wells thought that some false IDs could be avoided by using what’s called a ‘sequential lineup’, in which witnesses see photos one at a time and make a decision, yes or no, after each. Over the next few years he carried out more mock-crime experiments and reported exactly that—when images were shown one by one people were less likely to falsely accuse someone than if the images were shown all together.

In fact, Wells’s studies found that sequential lineups slashed the rate of false positives considerably. His first study showed a drop in incorrect accusations from 43 to 17 percent. Sequential lineups also slightly increased the number of missed identifications, in which the perpetrator is in the lineup but not fingered by the witness. But because the ratio of true positives to false positives (the so-called ‘diagnosticity ratio’) was much higher for sequential procedures, Wells argued they were superior.
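To make the diagnosticity ratio concrete, here is a minimal Python sketch. The false-identification rates (43 and 17 percent) come from the study described above. The correct-identification rates are not given in this article, so the two values below are purely hypothetical placeholders chosen for illustration, as are the function and variable names.

```python
def diagnosticity_ratio(correct_id_rate: float, false_id_rate: float) -> float:
    """Ratio of true positives to false positives for one lineup procedure."""
    if false_id_rate <= 0:
        raise ValueError("false_id_rate must be positive")
    return correct_id_rate / false_id_rate


# False-ID rates reported in the article for Wells's first study.
FALSE_ID_SIMULTANEOUS = 0.43
FALSE_ID_SEQUENTIAL = 0.17

# ASSUMPTION: hypothetical correct-ID rates, NOT taken from the study.
# Sequential is set slightly lower to reflect "slightly more missed IDs".
ASSUMED_CORRECT_ID_SIMULTANEOUS = 0.60
ASSUMED_CORRECT_ID_SEQUENTIAL = 0.50

simultaneous_ratio = diagnosticity_ratio(
    ASSUMED_CORRECT_ID_SIMULTANEOUS, FALSE_ID_SIMULTANEOUS
)
sequential_ratio = diagnosticity_ratio(
    ASSUMED_CORRECT_ID_SEQUENTIAL, FALSE_ID_SEQUENTIAL
)

print(f"simultaneous: {simultaneous_ratio:.2f}")
print(f"sequential:   {sequential_ratio:.2f}")
```

Under these assumed hit rates the sequential ratio comes out higher, even though sequential lineups catch fewer culprits. A big drop in false positives can outweigh a small drop in true positives when the two are compared as a single ratio.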

At the same time, Wells’s group was also publishing on other ways to improve eyewitness identifications, such as instructing the witness in an unbiased way, using double-blind administrators, and picking appropriate “filler” photos: photos that all share whatever physical characteristics were noted by the witness, such as race and hair color. These findings gradually seeped into official policies. In 2001, the New Jersey attorney general mandated each of these lineup reforms, and a dozen other states have implemented similar changes. Many local agencies adapted their policies voluntarily. A 2011 survey found that roughly one-third of police departments were using the sequential lineup method.

While these reforms swept the country, though, other research groups were finding that sequential lineups weren’t necessarily the best option. These studies showed, just as Wells’s had, that sequential lineups lower the number of false identifications. But they also suggested that sequential presentations decreased correct identifications far more than Wells had previously reported.

“People are really focused on the people who are innocent and wrongly convicted. It is an awful error and that does horrible things to those people,” says Laura Mickes, a psychology researcher at Royal Holloway, University of London. “But there’s the other error, and that’s missing the person in the lineup. It happened to Ted Bundy.”

In 1974, police detectives in King County, Washington, suspected that Bundy was behind a string of murders in the Pacific Northwest. A witness to one of the abductions looked at a photo lineup that included Bundy’s face but failed to pick him out. “And he went on to murder more women,” Mickes says.

So the choice between sequential and simultaneous lineups reflects a trade-off: Is it better to choose the procedure that minimizes false positives (sequential), or the one that maximizes true positives (simultaneous)?

That decision is, ultimately, an ethical one that policy makers will have to grapple with. Luckily, statistics can help.

* * *

The problem with all of the aforementioned lineup studies, some say, is that they don’t account for the fact that the diagnosticity ratio is not just a matter of the witness’s memory skills. It also depends on witness confidence: How sure are they that their identification is correct? You could imagine two people with near-perfect memories, but one is sure she’s right and the other is racked by doubt.

That means that a given lineup procedure does not have just one rate of false identification and one rate of correct identification. Rather, each procedure has a collection of possible outcomes, depending on the witness’s memory and confidence. A more confident witness with a worse memory has one diagnosticity ratio. A timid witness with a photographic memory has another. “There is a whole family of rates,” says John Wixted, a memory researcher at the University of California, San Diego.

That family of numbers can be plotted on something known in the statistics world as a ‘receiver operating characteristic’ (ROC) curve. Each point on the curve represents a different diagnosticity ratio, with the rate of false positives along the X axis and the rate of true positives on the Y axis.
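One way such a curve could be assembled from a lineup experiment is sketched below in Python. It assumes each identification comes with a confidence rating, and that the points are built by starting with only the most confident identifications and then progressively admitting less confident ones. The counts, the three confidence bins, and the function name are hypothetical illustrations, not data or code from any study mentioned here.

```python
from typing import List, Tuple


def roc_points(
    correct_ids_by_conf: List[int],
    false_ids_by_conf: List[int],
    n_culprit_present: int,
    n_culprit_absent: int,
) -> List[Tuple[float, float]]:
    """Build (false-positive rate, true-positive rate) points.

    Both count lists are ordered from HIGHEST to LOWEST confidence.
    Each successive point also admits the next, less confident bin,
    so the rates accumulate as the confidence threshold is lowered.
    """
    points = []
    cumulative_correct = 0
    cumulative_false = 0
    for correct, false in zip(correct_ids_by_conf, false_ids_by_conf):
        cumulative_correct += correct
        cumulative_false += false
        points.append(
            (cumulative_false / n_culprit_absent,
             cumulative_correct / n_culprit_present)
        )
    return points


# HYPOTHETICAL counts for bins [high, medium, low] confidence,
# from 100 culprit-present and 100 culprit-absent lineups.
example_points = roc_points(
    correct_ids_by_conf=[40, 25, 10],
    false_ids_by_conf=[2, 8, 15],
    n_culprit_present=100,
    n_culprit_absent=100,
)

for fpr, tpr in example_points:
    # Each point has its own diagnosticity ratio (tpr / fpr).
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}  ratio={tpr / fpr:.1f}")
```

Each printed row is one member of the "family of rates": the same procedure and the same witnesses, but a different cutoff for how confident an identification must be before it counts.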

Wixted and Mickes have shown in laboratory experiments that for a given lineup procedure, witnesses who choose only when they are highly confident tend to have high diagnosticity ratios (they’re less likely to finger anybody, but the person they do finger is more likely to be guilty). In contrast, witnesses who are willing to choose at low levels of confidence have lower ratios (they’re more likely to point the blame at the wrong guy). This family of ratios creates a curve for each lineup method.

When the researchers compared these two curves with what the data suggest chance performance would look like, they found that the simultaneous lineup was more accurate than the sequential one. “It was a real shock,” says Mickes, because researchers had been pushing for sequential procedures for so long. Since Wixted and Mickes first published that data, in 2012, three other research groups have confirmed their findings using ROC analysis of lineup experiments.
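The article does not spell out which statistic was used to compare the curves. One common choice in ROC work is the area under the curve over a shared range of false-positive rates, and the Python sketch below illustrates that idea. The ROC points are invented to mirror the direction of the reported finding and are not data from any study. Treating chance performance as the diagonal line (true-positive rate equal to false-positive rate) is also an assumption made for this illustration.

```python
from typing import List, Tuple


def partial_auc(points: List[Tuple[float, float]], max_fpr: float) -> float:
    """Trapezoidal area under an ROC curve from FPR = 0 up to max_fpr.

    The curve is anchored at (0, 0). If the last point lies below
    max_fpr, the area only covers the range the data actually span.
    """
    pts = [(0.0, 0.0)] + sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Linearly interpolate the curve at the cutoff.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# HYPOTHETICAL (FPR, TPR) points for each procedure.
simultaneous_points = [(0.02, 0.40), (0.10, 0.65), (0.25, 0.75)]
sequential_points = [(0.02, 0.30), (0.08, 0.50), (0.15, 0.58)]

# Compare over the FPR range that both curves cover.
cutoff = min(simultaneous_points[-1][0], sequential_points[-1][0])

sim_area = partial_auc(simultaneous_points, cutoff)
seq_area = partial_auc(sequential_points, cutoff)
chance_area = cutoff ** 2 / 2  # area under the assumed diagonal chance line

print(f"simultaneous: {sim_area:.4f}")
print(f"sequential:   {seq_area:.4f}")
print(f"chance:       {chance_area:.4f}")
```

A larger area means that, at every tolerated level of false identifications within the range, the procedure tends to yield more correct ones. That is a different question from which procedure has the higher diagnosticity ratio at a single point.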

ROC analysis is not a newfangled method. It has been a staple of medical diagnostics, weather forecasting, and epidemiology for many decades. “Literally every medical test you’ve ever taken has been vetted against some alternative test using ROC analysis,” Wixted says.

Still, some scientists aren’t ready to accept that ROC analysis is relevant for lineup data. Only a few laboratory studies have been done so far, after all, and all in the last couple of years. “This idea has been pushed by a few people very quickly,” Wells says. “The closer we look the less convinced we become that what they’re observing is what they think they observe. But it’s going to take a while to shake that out. And that’s how science works.”

So here we have an ongoing debate, with one camp arguing for sequential lineups and the other for simultaneous ones. Today’s NAS report didn’t resolve the debate, but it did move the needle a bit.

The report acknowledges, for example, that a single diagnosticity ratio is not enough to judge the effectiveness of a given lineup procedure. It also notes that ROC analysis “is a positive and promising step, with numerous advantages.” But the report also describes some disadvantages of ROC analysis. For example, this approach relies on how confident the witness thinks they are, which may vary from one witness to the next.

Overall, the Academy decided that it’s just too early to pick sides in the simultaneous vs. sequential debate. “There is, as yet, not enough evidence for the advantage of one procedure over another,” the report reads. “The committee thus recommends that caution and care be used when considering changes to any existing lineup procedure, until such time as there is clear evidence for the advantages of doing so.”

That ambivalent finding might frustrate the average police detective, but for researchers it’s the healthy and appropriate response to such a thorny scientific issue.

“With so much going on in the science right now, this is not a good time to advocate for one procedure over the other,” says Steven Clark, director of the Presley Center for Crime and Justice Studies at the University of California, Riverside, who testified at one of the Academy hearings. “More data have been collected on that particular comparison than on any other, and curiously that’s the one that seems to be the most difficult to resolve.”

But it’s not ambivalence all the way down. One thing all of these scientists agree on, and one that the NAS report underscored repeatedly, is the importance of recording the witness’s level of confidence immediately after making a photo identification. As many studies have shown, witnesses’ confidence in their memories tends to inflate over time, which is obviously problematic if they’re testifying in court long after the event took place. As Wixted points out, most of the people who were wrongly convicted and then exonerated with DNA were initially identified with low-confidence witness ratings. Making sure to record confidence immediately “is a fantastic recommendation that will do far more to protect innocent suspects” than switching from one lineup type to another, Wixted says.

That could have saved Ronald Cotton from Jennifer Thompson’s false ID. “She’s always used as the classic example of how terrible eyewitness memory is,” Wixted says. “I use her as the perfect example of how good eyewitness memory is.”

After all, Thompson had initially expressed uncertainty at the lineup. And she seemed to get more confident after getting reassurances from the investigators. “It’s not the eyewitnesses that are making a mistake,” Wixted says, “it’s the legal system that is making a mistake.”