When Mark DePristo and Ryan Poplin began their work, Google’s artificial intelligence did not know anything about genetics. In fact, it was a neural network created for image recognition—as in the neural network that identifies cats and dogs in photos uploaded to Google. It had a lot to learn.
But just eight months later, the neural network received top marks at an FDA contest for accurately identifying mutations in DNA sequences. And in just a year, the AI was outperforming a standard human-coded algorithm called GATK. DePristo and Poplin would know; they were on the team that originally created GATK.
It had taken that team of 10 scientists five years to create GATK. It took Google’s AI just one to best it.
“It wasn’t even clear it was possible to do better,” says DePristo. They had thrown every possible idea at GATK. “We built tons of different models. Nothing really moved the needle at all,” he says. Then artificial intelligence came along.
This week, Google is releasing the latest version of the technology as DeepVariant. Outside researchers can use DeepVariant and even tinker with its code, which the company has published as open-source software.
DeepVariant, like GATK before it, solves a technical but important problem called “variant calling.” When modern sequencers analyze DNA, they don’t return one long strand. Rather, they return short snippets maybe 100 letters long that overlap with each other. These snippets are aligned and compared against a reference genome whose sequence is already known. Where the snippets differ with the reference genome, you probably have a real mutation. Where the snippets differ with the reference genome and with each other, you have a problem.