300 Million Letters of DNA Are Missing From the Human Genome

The world’s most famous genetic tool has a major diversity problem.

In March 1997, a man answered an ad in The Buffalo News. He agreed to give 50 milliliters of blood. He did not leave a name, as the researchers who took his blood wanted to keep things anonymous. They called him RP11.

The ad RP11 answered turned out to be for the Human Genome Project. And it happened that, as the race to complete the first human genome heated up, it was RP11’s sample that was ready to be sequenced first. Ultimately, 70 percent of the Human Genome Project’s DNA came from RP11. The rest came from about 50 other volunteers.

Since then, that first human genome has been continually refined. It’s become a “reference genome,” the standard against which practically every human whose DNA has been sequenced is compared. But most of it, still, comes from RP11. Every person’s genetic code is unique, so using just one reference genome—most of it from one person—to stand in for all of humanity has introduced subtle biases into genetics research.

A new study of DNA from people of African descent shows just how much the reference genome is missing: Across the 910 people in their study, scientists found 300 million letters of DNA that are not in the reference genome. Some of these newfound segments of DNA could represent new genes that were previously overlooked.

Africans collectively are far more genetically diverse than people in other parts of the world, so their DNA is especially likely to differ from the reference genome. Currently, DNA sequencers work by chopping up a genome into segments that are individually “read” and then assembled like a jigsaw puzzle. Algorithms use the reference genome to tell them where to put each segment. If a new segment pops up—one that doesn’t match anything in the reference genome—the algorithms don’t know what to do. Usually, scientists ignore it.
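To make that concrete, here is a minimal sketch of the idea in Python. It is a toy under stated assumptions, not how any real sequencing software is written: the function name `place_reads` and the short sequences are invented for illustration, and it requires an exact match, whereas real alignment tools tolerate small differences between a segment and the reference.

```python
def place_reads(reference, reads):
    """Toy reference-guided placement using exact substring matching.

    Hypothetical helper for illustration only; real aligners use
    indexed, mismatch-tolerant search rather than str.find.
    """
    placed = {}    # segment -> position where it sits in the reference
    unplaced = []  # segments that match nothing in the reference
    for read in reads:
        position = reference.find(read)
        if position >= 0:
            placed[read] = position
        else:
            # No match anywhere: this is the kind of segment that
            # usually gets set aside and ignored.
            unplaced.append(read)
    return placed, unplaced


# Made-up sequences, far shorter than anything in a real genome.
reference = "ACGTTGCAAGGCTTAACG"
reads = ["GTTGCA", "GGCTTA", "TTTTCCCC"]

placed, unplaced = place_reads(reference, reads)
print("placed:", placed)
print("set aside:", unplaced)
```

The first two segments find a home in the reference. The third matches nothing, so it lands in the pile that is usually discarded.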

“We wanted to look at what’s in the pieces we’re ignoring,” says Rachel Sherman, a computational biologist at Johns Hopkins University who is the lead author of the study. Sherman’s adviser, Steven Salzberg, collaborates with researchers studying the genetics of asthma in people of African descent. That study includes people living in West Africa, North America, South America, and the Caribbean. “It occurred to me we had this very genetically diverse group,” says Salzberg. In other words, their research presented an opportunity to study the diversity missing from the reference genome.

Smaller studies had turned up enough novel DNA sequences to suggest that as many as 40 million letters can appear in human genomes but are absent from the current reference genome. Sherman and Salzberg found nearly 300 million missing letters in 125,715 separate DNA segments, much more than they expected. “I thought I must have done something wrong,” says Sherman. But when she went back and combed through all of her code, she couldn’t find any errors.

Now the question is what those 300 million previously overlooked letters of DNA contain. Perhaps they harbor novel, interesting mutations related to disease that studies using the reference genome overlooked. If you’re only comparing against the reference genome, “you’re never going to find that missing piece,” says Tina Graves-Lindsay, a geneticist at Washington University in St. Louis. This is especially true in people of African descent, as their high genetic diversity means their DNA is less likely to match the reference genome. RP11, scientists later surmised, was probably African American himself, but the problem of using one reference genome to represent the whole human population still holds true.

Graves-Lindsay says she wasn’t surprised by the number of new DNA sequences Sherman found, as she and her collaborators are also working to address the gaps in the reference genome. They’re assembling reference genomes from a diverse group of people so that scientists can compare against more than just the single current version. The team has sequenced 15 samples so far, including five from Africa. “It was a very good paper for us to read about,” Graves-Lindsay says of Sherman’s study. “It proves what we’re doing is really needed.” But her work is still constrained by what exists in DNA repositories. For example, she doesn’t have a sample to sequence from people indigenous to Australia.

Since the Human Genome Project began, scientists have slowly realized they underestimated human genetic diversity. At the time, they focused on single-letter mutations. But it’s becoming clear now that big structural variations—thousands of letters being inserted or deleted or flipped around—are common, too, says Deanna Church, a founding member of the Genome Reference Consortium and a scientist at the company 10x Genomics. It’s like comparing two copies of a book to look for typos and realizing whole pages are missing from one.

The big chunks of DNA that Sherman and Salzberg found missing in the reference genome are likely the product of thousands of insertions and deletions. RP11 has his own unique pattern of them, as do all the other original volunteers for the Human Genome Project. And so does everyone else in the world.