Erlich has a track record of similar projects: Two years ago, he assembled what is probably the world’s largest ever family tree by pooling 43 million profiles from a publicly available genealogy site.
Getting people to actively participate in DNA.LAND will be harder, but Daniel MacArthur from Massachusetts General Hospital is optimistic. “Efforts like this are all about building momentum, and getting them going is hard. But at least in this case they have the advantage of working with an existing community that’s actively excited about sharing their own genetic data.”
Erlich was encouraged that two sites, openSNP and GedMatch, have successfully crowdsourced genetic data from hundreds and thousands of people. But neither of these sites includes much in the way of privacy protection. DNA.LAND, by contrast, promises not to release any information without explicit permission. The team also designed a consent form that can be read in five minutes, that contains minimal scientific or legal jargon, and that links to Erlich’s and Pickrell’s own uploaded genomes. “We have a skin-in-the-game philosophy,” says Erlich.
To encourage users to participate, the site offers several free services, including an ancestry report and a relative finder (which Erlich himself used to find a fourth cousin). Other companies provide these services, but within their own corrals. “What if you get tested by 23andMe, and you have relatives in Ancestry.com?” asks Erlich. “We allow you to find relatives outside the silo of the company where you’ve been tested.”
These companies also analyze just hundreds of thousands of markers across a person’s genome, leaving huge tracts uncharted; DNA.LAND fills some of these gaps through a process called imputation. Erlich explains this by pulling up the sentence: Ba_ _ _ _ O_ _ ma i_ t_ _ Pr_ _ _ _ _ _ t. You can probably quickly read that as “Barack Obama is the President,” because you’re familiar with English and there are only so many options for the letters in the spaces. The genome is similar. DNA is inherited in chunks, so based on what’s there, you can make educated guesses about what’s not. “If they test 700,000 markers, we can get to million,” says Erlich.
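For readers who want to see the fill-in-the-blanks idea in concrete form, here is a toy sketch in Python. It is not DNA.LAND's method: real imputation relies on statistical models and large reference panels, whereas this example simply copies the missing letters from whichever reference sequence best agrees with the tested markers. The names `impute`, `match_score`, and `panel`, and the short made-up sequences, are all assumptions invented for illustration.

```python
# Toy illustration only. Real imputation uses statistical models over
# large reference panels; the sequences and names here are hypothetical.
MISSING = "_"


def match_score(observed, reference):
    """Count the tested (non-missing) positions where the two sequences agree."""
    return sum(o == r for o, r in zip(observed, reference) if o != MISSING)


def impute(observed, panel):
    """Fill untested markers by copying from the best-matching reference sequence."""
    best = max(panel, key=lambda ref: match_score(observed, ref))
    return "".join(r if o == MISSING else o for o, r in zip(observed, best))


# A made-up reference panel of fully known sequences.
panel = ["ACGTTAGC", "ACCTTGGC", "TCGTAAGA"]

# A made-up test result in which only four of eight markers were read.
observed = "A_C__G_C"

print(impute(observed, panel))  # prints ACCTTGGC
```

The second reference sequence agrees with all four tested markers, so its letters are borrowed to fill the gaps, much as familiarity with English lets you complete the sentence above.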
Next, he wants to infuse DNA.LAND with data from other sources. A person’s tweets might reveal their sleeping patterns, whether they’re sick and what symptoms they have, and the ebb and flow of their moods. Fitbit data could say even more about their health and fitness. If such sources could be linked to DNA.LAND, it would be an effortless way to connect people’s genotype (their genes) with their phenotype (the physical traits that those genes affect).
When I spoke to Erlich last Thursday, he was nervous. The site was due to launch on Friday morning and 24 hours later, he would present the site to his colleagues at the American Society of Human Genetics 2015 conference. “I thought maybe we’d have 30 genomes, and I’d have to wave my hands and talk about how awesome it is,” he said.
In fact, users uploaded 1,250 genomes within DNA.LAND’s first day. “We’re getting a genome a minute right now,” he told the crowd. “My programmer was awake all night.” That impressive pace has naturally slowed, but at the time of writing, there are 5,485 genomes on the site. The team are hoping for thousands more.