With the growing popularity of DNA tests, such databases are only getting bigger and bigger. It’s not hard to imagine being able to identify nearly every American through a relative’s DNA.
This is a boon for people taking DNA tests precisely to look for family. Years before police realized that genetic genealogy—the combination of family trees and DNA—could be used to ID criminals and unidentified victims, people were using DNA databases to track down birth parents, sperm donors, and long-lost family. “It wasn’t a surprise to us at all. None of this has been a surprise to us. We have been using it for years and years,” says the genealogist Debbie Kennett. The Golden State Killer suspect’s arrest just woke everyone else up to the power of genetic genealogy.
GEDmatch, the database investigators used in the Golden State Killer case and in later cases, does not offer DNA tests itself. But it allows users to upload raw data files from genetic-testing companies such as AncestryDNA, 23andMe, and MyHeritage. (Erlich is the chief scientific officer at MyHeritage.) To upload crime-scene DNA, investigators had to make their own DNA data file mimicking one that might come from a genetic-testing company. GEDmatch then offers an array of tools to sort DNA matches. Genealogists can connect those matches to family trees using census records, newspaper obituaries, and other public records. The closer the match, the more quickly they can zero in on the right branch of the family tree.
“A second-cousin match is the sweet spot where it’s easy,” says Kennett, whereas a fourth-cousin match might take “thousands and thousands of hours’ work.” Identifying someone through a single third-cousin match is somewhere in the middle: It’s not trivial, but it’s very much possible.
To find out exactly how easy it is for genealogists and law enforcement to find genetic matches, Erlich and his team first analyzed MyHeritage’s 1.28 million–person DNA database. Nearly 60 percent of the people in it match enough DNA with at least one other person to be third cousins or closer. Then the researchers built a model that predicted that a database needs to include only 2 percent of a population for 90 percent of the people to have a third-cousin match or closer in it. In other words, a database of just a few million people could be sufficient to track down nearly everyone in the United States. That stat underscores the consequences of consumer genetic testing: Whenever a DNA-test taker reveals his results, he is giving up not just his own privacy, but potentially that of hundreds of relatives.
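The arithmetic behind the 2 percent figure can be sketched in a few lines. This is a back-of-the-envelope illustration, not the model Erlich's team built: the U.S. population figure is an assumption not taken from the article, and the toy calculation assumes each relative ends up in the database independently with the same probability, which real families do not obey.

```python
import math

# Assumption: a rough U.S. population figure, not taken from the article.
US_POPULATION = 325_000_000


def database_size(population: int, coverage: float = 0.02) -> int:
    """Number of profiles needed to cover `coverage` of a population."""
    return round(population * coverage)


def match_probability(coverage: float, relatives: int) -> float:
    """Chance that at least one of `relatives` is in the database.

    Toy assumption: each relative is included independently with
    probability `coverage`.
    """
    return 1.0 - (1.0 - coverage) ** relatives


def relatives_needed(coverage: float, target: float) -> int:
    """Smallest number of relatives for which match_probability >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - coverage))


if __name__ == "__main__":
    print(database_size(US_POPULATION))
    print(relatives_needed(0.02, 0.90))
    print(match_probability(0.02, relatives_needed(0.02, 0.90)))
```

Under these assumptions, 2 percent of the population works out to about 6.5 million profiles, and a person would need roughly 114 relatives at the third-cousin level or closer for the chance of at least one match to reach 90 percent. That is in line with the "hundreds of relatives" a single test taker can expose.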
Another pattern jumped out in the MyHeritage data: People of primarily northern European ancestry were more likely to have matches than people of primarily sub-Saharan ancestry. This reflects the predominantly white customer base for MyHeritage and most other direct-to-consumer DNA tests. It also means that genetic-genealogy searches by law enforcement are, for now, more likely to succeed with people of European descent.