Most People of European Ancestry Can Be Identified From a Relative’s DNA


In April, the world learned that police had tracked down the alleged Golden State Killer by using a genealogy site to match DNA from crime scenes to that of his distant relatives. The next arrest that resulted from the same technique—for a double murder in Washington State—came less than a month later. And then another and another and another.

As the wave of reports went on, Yaniv Erlich, a computational biologist, was working to understand the reach of such police searches. Were they lucky breaks? Or could nearly every American be found through a third cousin’s DNA? With every identification that made the news, Erlich had to update the paper he was working on. “It was like, every time, it’s a new case,” he says. By his count, the number of murderers, rapists, or unidentified persons found through genetic genealogy is up to 19—the latest announced just on Monday.

These cases are not exceptional, according to his analysis, now published in Science. Golden State Killer investigators found their suspect through third- and fourth-cousin matches in a database called GEDmatch, which includes information from about 1 million people. In a database of that size, Erlich and his co-authors show, nearly 60 percent of people have a relative in it who is a third cousin or closer.

With the growing popularity of DNA tests, such databases are only getting bigger and bigger. It’s not hard to imagine being able to identify nearly every American through a relative’s DNA.

This is a boon for people taking DNA tests precisely to look for family. Years before police realized that genetic genealogy—the combination of family trees and DNA—could be used to ID criminals and unidentified victims, people were using DNA databases to track down birth parents, sperm donors, and long-lost family. “It wasn’t a surprise to us at all. None of this has been a surprise to us. We have been using it for years and years,” says the genealogist Debbie Kennett. The Golden State Killer suspect’s arrest just woke everyone else up to the power of genetic genealogy.

GEDmatch, the database investigators used in the Golden State Killer case and subsequent others, does not offer DNA tests itself. But it allows users to upload raw data files from genetic-testing companies such as AncestryDNA, 23andMe, and MyHeritage. (Erlich is the chief scientific officer at MyHeritage.) To upload crime-scene DNA, investigators had to make their own DNA data file mimicking one that might come from a genetic-testing company. GEDmatch then offers an array of tools to sort DNA matches. Genealogists can connect those matches to family trees using census records, newspaper obituaries, and other public records. The closer the match, the more quickly they can zero in on the right branch of the family tree.

“A second-cousin match is the sweet spot where it’s easy,” says Kennett, whereas a fourth-cousin match might take “thousands and thousands of hours’ work.” Identifying someone through a single third-cousin match is somewhere in the middle: It’s not trivial, but it’s very much possible.

To find out exactly how easy it is for genealogists and law enforcement to find genetic matches, Erlich and his team first analyzed MyHeritage’s 1.28 million–person DNA database. Nearly 60 percent of the people in it match enough DNA with at least one other person to be third cousins or closer. Then the researchers built a model that predicted that a database needs to include only 2 percent of a population for 90 percent of the people to have a third-cousin match or closer in it. In other words, a database of just a few million people could be sufficient to track down nearly everyone in the United States. That stat underscores the consequences of consumer genetic testing: Whenever a DNA-test taker reveals his results, he is giving up not just his own privacy, but potentially that of hundreds of relatives.
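The arithmetic behind a claim like "2 percent of a population covers 90 percent of people" can be illustrated with a rough sketch. This is not the model from Erlich's paper. It assumes, purely for illustration, that each of a person's relatives (third cousin or closer) lands in the database independently, with probability equal to the fraction of the population the database covers, and that every such relative would be detected as a match. The function names and the relative counts are hypothetical.

```python
import math


def match_probability(db_fraction: float, n_relatives: int) -> float:
    """Chance that at least one of n_relatives is in the database.

    Assumption (illustrative only): each relative is in the database
    independently with probability db_fraction.
    """
    return 1.0 - (1.0 - db_fraction) ** n_relatives


def relatives_needed(db_fraction: float, target: float) -> int:
    """Smallest number of relatives giving at least `target` chance of a match."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - db_fraction))


if __name__ == "__main__":
    # Hypothetical counts of relatives who are third cousins or closer.
    for k in (50, 100, 200):
        print(k, round(match_probability(0.02, k), 3))
    # Relatives needed for a 90 percent chance when the database
    # covers 2 percent of the population, under the assumption above.
    print(relatives_needed(0.02, 0.90))
```

Under these simplifying assumptions, a person would need a little over a hundred relatives at third-cousin distance or closer for a 2 percent database to reach a 90 percent chance of a match. Real family sizes and the imperfect detection of distant cousins would change that figure.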

Another pattern jumped out in the MyHeritage data: People of primarily northern European ancestry were more likely to have matches than people of primarily sub-Saharan ancestry. This reflects the predominantly white customer base for MyHeritage and most other direct-to-consumer DNA tests. It also means that genetic-genealogy searches by law enforcement are, for now, more likely to succeed with people of European descent.

Back in May, Graham Coop and Doc Edge, geneticists at UC Davis, wrote a blog post asking, “How lucky was the genetic investigation in the Golden State Killer case?” Their back-of-the-envelope calculations suggested that the investigators’ luck was just about average. Erlich and his team used real data and came to a similar conclusion. “It was interesting to see all of the ideas being demonstrated in a very strong empirical way,” Coop says of the new study.

This summer, genetic-testing companies including 23andMe, AncestryDNA, and MyHeritage banded together with the Future of Privacy Forum, a think tank and advocacy group, to publish a “best practices” guide for the industry. That report said companies can hand over data when legally forced to. So far, they haven’t had to, because investigators in the recently publicized cases, including that of the alleged Golden State Killer, could simply use GEDmatch. The site has since updated its terms of service to note that law enforcement is searching through it.

AncestryDNA and 23andMe, the two leading genetic-testing companies, both say they have never handed over a customer’s genetic information to law enforcement. But it’s worth noting that both have databases bigger than GEDmatch’s: 10 million people for AncestryDNA and 5 million for 23andMe. Each is probably big enough already to identify most Americans through a relative’s DNA.

At the end of his paper, Erlich, who worked as a white-hat hacker before turning to genetics, also sketched out how companies like MyHeritage could use cryptographic signatures to prevent the misuse of data on third-party sites like GEDmatch. John Verdi, the vice president of policy at the Future of Privacy Forum, told me technical strategies like cryptography could play a role, but policy was the important lever. States, for example, could pass laws limiting the use of sites like GEDmatch for less serious crimes, though it doesn’t appear that they’re currently eager to: Verdi hadn’t heard of any states introducing such legislation yet.

I asked Verdi why he thinks the focus should be on the privacy of genetic data. DNA profiles alone would not have solved these cases; they also required looking up public records and often social media profiles. Why not think about privacy for that data as well? “This is a question of cultural norms,” Verdi said. “I think it’s probably a heavy lift to think about modifying those norms.” But the norms around DNA are all still very new. And we have an opportunity to shape them.