Date: 2019-02-04

In 2015, scientists discovered a pig in China that would set off a frantic, worldwide search. The pig carried bacteria resistant to colistin, a drug used to cure infections when almost all other drugs have failed. Colistin is an old antibiotic with sometimes severe side effects in humans. Chinese doctors didn’t even prescribe it for human patients; instead, farmers were relying on literal tons of it, used in low doses, as a growth promoter in pigs.
Bacteria are constantly crossing continents in people, animals, and food, though. In England, where colistin is reserved for patients in rare and dire circumstances, public health officials worried. Could colistin-resistant bacteria also be lurking in England?
The answer was hidden somewhere in Public Health England’s archives. The agency routinely collects and sequences bacteria from food and humans, and its researchers just needed to search those sequences for the DNA segment that confers bacterial resistance to colistin. In theory, this shouldn’t have been much harder than a Google search. To a computer, a DNA sequence looks like a very long word, which just happens to be made up of only four different letters: A, T, C, and G.
Yet, the search took 256 computers working together for an entire weekend, says Zamin Iqbal, a computational genomicist at the European Bioinformatics Institute, who collaborates with Public Health England. The researchers there did find colistin resistance among their 24,000 samples, and eventually, countries all over the world found it, too.
Why did this process take so long? The computers at Public Health England had to open up and search the sequencing files of 24,000 genomes one by one. If Google had to search every page on the internet for the word “pie” every time you searched for “pie,” that search would also take forever. Instead, Google is constantly indexing pages. If a blog post is written about “pie,” Google files that post under the “pie” entry in its index. So when someone comes along looking for pie recipes, it just has to serve up the pages under the “pie” entry. That’s part of the reason why a Google search takes less than a second.
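The difference between scanning every file and consulting an index can be sketched in a few lines of Python. This is a toy illustration only: the sample names, the sequences, and the choice of indexing four-letter fragments (k-mers) are assumptions made for the example, not details of Public Health England's pipeline or of Google's index.

```python
from collections import defaultdict

K = 4  # fragment (k-mer) length; an arbitrary choice for this toy example

# Hypothetical genomes; real ones are millions of letters long.
GENOMES = {
    "sample_a": "ATGCGTACGTTAGC",
    "sample_b": "TTGACCGTACGTAA",
    "sample_c": "GGGCCCAAATTTGG",
}


def kmers(seq, k=K):
    """Return the set of all length-k fragments of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def scan(genomes, query):
    """The slow way: open every genome and search it from start to finish."""
    return sorted(name for name, seq in genomes.items() if query in seq)


def build_index(genomes, k=K):
    """Done once, ahead of time: file each genome under every k-mer it contains."""
    index = defaultdict(set)
    for name, seq in genomes.items():
        for fragment in kmers(seq, k):
            index[fragment].add(name)
    return index


def lookup(index, genomes, query, k=K):
    """The fast way: intersect the index entries for the query's k-mers,
    then confirm the few remaining candidates with an exact search."""
    fragments = kmers(query, k)
    if not fragments:  # query shorter than k: fall back to scanning
        return scan(genomes, query)
    candidates = set.intersection(*(index.get(f, set()) for f in fragments))
    return sorted(name for name in candidates if query in genomes[name])
```

With this toy data, `scan(GENOMES, "CGTACGT")` and `lookup(build_index(GENOMES), GENOMES, "CGTACGT")` return the same samples, but only the first has to read every genome at query time.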
So Iqbal decided to build a Google of sorts for bacterial and viral genomes. He and his colleagues downloaded all available genomes, nearly 500,000 at the time, from a public database called the European Nucleotide Archive. The 170,000-gigabyte dataset took six whole weeks to download. Then, the team indexed the data. The resulting tool is called BIGSI, for BItsliced Genomic Signature Index.
Searching for colistin resistance through nearly 500,000 sequences now takes just a few seconds.
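The article does not describe how BIGSI works internally, so the following is only a guess at what a "bitsliced signature" index could look like, based on the name. In this sketch, each genome gets a fixed-length bit signature (a Bloom-filter-style fingerprint of its k-mers), and the signatures are stored sliced by bit position, so a query only touches a handful of rows. The constants, function names, and hashing scheme are all assumptions for illustration, not the authors' implementation.

```python
import hashlib

NUM_BITS = 1024   # signature length (assumed for this sketch)
NUM_HASHES = 3    # bits set per k-mer (assumed)
K = 4             # k-mer length (assumed)

# Hypothetical toy genomes.
GENOMES = {
    "sample_a": "ATGCGTACGTTAGC",
    "sample_b": "TTGACCGTACGTAA",
    "sample_c": "GGGCCCAAATTTGG",
}


def kmers(seq, k=K):
    """Return the set of all length-k fragments of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bit_positions(kmer):
    """Map a k-mer to NUM_HASHES positions in the signature."""
    positions = []
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
        positions.append(int.from_bytes(digest[:8], "big") % NUM_BITS)
    return positions


def build_rows(genomes):
    """Build the bit-sliced index: rows[i] is an integer whose bit j is set
    when genome j has bit i set in its signature."""
    names = sorted(genomes)
    rows = [0] * NUM_BITS
    for col, name in enumerate(names):
        for fragment in kmers(genomes[name]):
            for pos in bit_positions(fragment):
                rows[pos] |= 1 << col
    return names, rows


def query(names, rows, seq):
    """AND together the rows for every k-mer of the query. Genomes whose bit
    survives *may* contain the query; false positives are possible, but a
    genome that truly contains the query is never dropped."""
    hits = (1 << len(names)) - 1  # start with every genome as a candidate
    for fragment in kmers(seq):
        for pos in bit_positions(fragment):
            hits &= rows[pos]
    return [name for col, name in enumerate(names) if (hits >> col) & 1]
```

The appeal of this layout is that query cost depends on the length of the query rather than on how much sequence is stored: each k-mer costs a few row lookups and bitwise ANDs, however many genomes sit in the columns.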
Suppose a patient has an unusual brain infection, says Jennifer Gardy, a genomic epidemiologist who until recently was at the University of British Columbia and who was not involved with the project. Suppose it’s a pathogen the doctor doesn’t recognize. Before, the pathogen’s particular sequence might have been hiding in one of those 500,000 genomes. But a mountain of data is only as good as your ability to search it. “We can now go back and look through all of the DNA, through all of the other experiments that had done sequencing. Loads and loads of DNA,” Gardy says. For the first time, it’s possible to easily answer a question as simple as: “Have we seen this thing before?”
Since Iqbal and his co-authors started sharing their project—making a demo version of BIGSI available online, posting a non-peer-reviewed paper on the website bioRxiv, giving talks—they’ve been hearing from researchers who’ve started to use it. After Andrew Page, a bioinformatics researcher now at the Quadram Institute, learned about the tool, he walked back to his office and fired it up. Page was interested in a particular plasmid, a round loop of DNA, that helps make typhoid fever bacteria drug resistant. This plasmid seemed to have popped up out of the blue in Pakistan.
“Within two seconds I got a list of twenty other samples where they were seen,” says Page. The plasmid wasn’t just in other typhoid bacteria. It was in soil bacteria, animal bacteria, and E. coli, painting a much more complex picture of how resistance plasmids emerge and get swapped between different bacterial species.
Iqbal’s paper is being published today in Nature Biotechnology, after making its way through the sometimes slow process of peer review. But published papers have already cited the bioRxiv preprint, and another scientist wrote a program to more easily search for mutations of a gene in BIGSI. Tara Smith, an epidemiologist at Kent State University, says BIGSI is a fantastic idea, although the tool is only as good as the data that goes in. “The genomes we choose to sequence are very biased,” she says, often toward serious clinical infections, from patients in research-intensive hospitals, in big urban centers.
The team is updating BIGSI with new data that have been made public since Iqbal made the first version, and the total number of sequences available at one quick click is now up to 1.2 million. As sequencing is becoming more common, the number of publicly available bacterial and viral genomes has doubled, and at the rate this work is going, within a few years there will be multiple millions of searchable pathogen genomes—a library of DNA and disease, spread the world over.
