Africans collectively are far more genetically diverse than people in other parts of the world, so their DNA is especially likely to differ from the reference genome. Currently, DNA sequencers work by chopping up a genome into segments that are individually “read” and then assembled like a jigsaw puzzle. Algorithms use the reference genome to tell them where to put each segment. If a new segment pops up—one that doesn’t match anything in the reference genome—the algorithms don’t know what to do. Usually, scientists ignore it.
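The placement step described above can be illustrated with a toy sketch. It assumes exact string matching against a short made-up reference, which is far simpler than what real alignment software does (real tools tolerate mismatches and use indexed search). The function name `place_reads` and the sequences are hypothetical, chosen only for illustration.

```python
def place_reads(reference, reads):
    """Toy reference-guided placement: exact matches only (an assumption)."""
    placed = {}    # read -> position of its first exact match in the reference
    unplaced = []  # reads that match nothing in the reference
    for read in reads:
        position = reference.find(read)
        if position == -1:
            # No match anywhere: this is the kind of segment that
            # usually gets set aside and ignored.
            unplaced.append(read)
        else:
            placed[read] = position
    return placed, unplaced


# Made-up sequences for illustration only.
reference = "ACGTACGTTAGCCGATAGGCTA"
reads = ["ACGTTAGC", "GATAGG", "TTTTCCCC"]

placed, unplaced = place_reads(reference, reads)
print(placed)    # reads that found a home in the reference
print(unplaced)  # reads with no match
```

In this sketch the third read matches nothing in the reference, so it ends up in `unplaced`. Segments like that are the ones Sherman's study went back to examine.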
“We wanted to look at what’s in the pieces we’re ignoring,” says Rachel Sherman, a computational biologist at Johns Hopkins University who is the lead author of the study. Sherman’s adviser, Steven Salzberg, collaborates with researchers studying the genetics of asthma in people of African descent. That study includes people living in West Africa, North America, South America, and the Caribbean. “It occurred to me we had this very genetically diverse group,” says Salzberg. In other words, their research presented an opportunity to study the diversity missing from the reference genome.
Smaller studies have found enough novel DNA sequences to estimate that as many as 40 million letters can appear in the human genome but are absent from the current reference genome. Sherman and Salzberg found nearly 300 million missing letters in 125,715 separate DNA segments, much more than they expected. “I thought I must have done something wrong,” says Sherman. But when she went back and combed through all of her code, she couldn’t find any errors.
Now the question is what those 300 million previously overlooked letters of DNA contain. Perhaps they code for novel, interesting mutations related to disease that studies using the reference genome overlooked. If you’re only comparing against the reference genome, “you’re never going to find that missing piece,” says Tina Graves-Lindsay, a geneticist at Washington University in St. Louis. This is especially true in people of African descent, as their high genetic diversity means their DNA is less likely to match the reference genome. RP11, scientists later surmised, was probably African American himself, but the problem of using one reference genome to represent the whole human population still holds true.
Graves-Lindsay says she wasn’t surprised by the number of new DNA sequences Sherman found, as she and her collaborators are also working to address the gaps in the reference genome. They’re assembling reference genomes from a diverse group of people so that scientists can compare against more than just the single current version. The team has sequenced 15 samples so far, including five from Africa. “It was a very good paper for us to read about,” Graves-Lindsay says of Sherman’s study. “It proves what we’re doing is really needed.” But her work is still constrained by what exists in DNA repositories. For example, she doesn’t have a sample to sequence from people indigenous to Australia.