Yet, the search took 256 computers working together for an entire weekend, says Zamin Iqbal, a computational genomicist at the European Bioinformatics Institute who collaborates with Public Health England. The researchers there did find colistin resistance among their 24,000 samples, and eventually, countries all over the world found it, too.
Why did this process take so long? The computers at Public Health England had to open up and search the sequencing files of 24,000 genomes one by one. If Google had to search every page on the internet for the word pie every time you search for pie, that search would also take forever. Instead, Google is constantly indexing pages. If a blog post is written about pie, Google files that post under the pie entry in its index. So when someone comes along looking for pie recipes, it just has to serve up the pages under the pie entry. That’s part of the reason a Google search takes less than a second.
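The difference between scanning every page and consulting an index can be sketched in a few lines of Python. This is only an illustration of the general idea of an inverted index, not a description of how Google works internally; the page names and text below are made up for the example.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page names that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

def scan(pages, word):
    """Slow path: reread every page for every query."""
    return {name for name, text in pages.items()
            if word in text.lower().split()}

def lookup(index, word):
    """Fast path: a single dictionary lookup, no pages reopened."""
    return index.get(word, set())

# Hypothetical pages, invented for illustration.
pages = {
    "blog-1": "my favorite apple pie recipe",
    "blog-2": "notes on bacterial genomes",
    "blog-3": "pumpkin pie for beginners",
}

index = build_index(pages)      # done once, ahead of time
print(lookup(index, "pie"))     # same answer as scan(pages, "pie")
```

The work of reading every page is paid once, when the index is built. Each later query costs a single lookup, however many pages there are.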
So Iqbal decided to build a Google of sorts for bacterial and viral genomes. He and his colleagues downloaded all available genomes—nearly 500,000 at the time—from a public database called the European Nucleotide Archive. The 170,000-gigabyte data set took six whole weeks to download. Then, the team indexed the data. The resulting tool is called BIGSI, for BItsliced Genomic Signature Index.
Searching for colistin resistance through nearly 500,000 sequences now takes just a few seconds.
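For readers curious how an index over DNA might look, here is a toy sketch in Python. It is an assumption-laden simplification, not BIGSI's actual implementation: it cuts each genome into short overlapping substrings (k-mers) and keeps, for each k-mer, a row of bits with one bit per genome. The sample names, sequences, and the choice of k = 4 are all invented for the example.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_kmer_index(genomes, k):
    """Map each k-mer to an integer whose bit j is set if genome j has it."""
    names = list(genomes)
    rows = {}
    for j, name in enumerate(names):
        for kmer in kmers(genomes[name], k):
            rows[kmer] = rows.get(kmer, 0) | (1 << j)
    return names, rows

def search(names, rows, query, k):
    """Return the genomes that contain every k-mer of the query."""
    query_kmers = kmers(query, k)
    if not query_kmers:          # query shorter than k: nothing to match
        return []
    hits = (1 << len(names)) - 1  # start with every genome as a candidate
    for kmer in query_kmers:
        hits &= rows.get(kmer, 0)  # keep genomes that also have this k-mer
    return [name for j, name in enumerate(names) if (hits >> j) & 1]

# Hypothetical samples, invented for illustration.
genomes = {
    "sample_A": "ACGTACGTGA",
    "sample_B": "TTGACCAGTA",
    "sample_C": "GGACGTGATT",
}

names, rows = build_kmer_index(genomes, k=4)
print(search(names, rows, "ACGTGA", k=4))  # ['sample_A', 'sample_C']
```

A query never reopens the genomes themselves; it combines a handful of precomputed bit rows. In this sketch a hit means only that a genome contains all of the query's k-mers, so it is a list of candidates rather than proof of an exact match.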
Suppose a patient has an unusual brain infection, says Jennifer Gardy, a genomic epidemiologist who until recently was at the University of British Columbia and who was not involved with the project. Suppose it’s a pathogen that the doctor doesn’t recognize. Before BIGSI, the pathogen’s particular sequence might have been hiding in one of those 500,000 genomes. But a mountain of data is only as good as your ability to search it. “We can now go back and look through all of the DNA, through all of the other experiments that had done sequencing. Loads and loads of DNA,” Gardy says. For the first time, it’s possible to easily answer a question as simple as: “Have we seen this thing before?”
Since Iqbal and his colleagues started sharing their project—making a demo version of BIGSI available online, posting a non-peer-reviewed paper on the website bioRxiv, giving talks—they’ve been hearing from researchers who’ve started to use it. After Andrew Page, a bioinformatics researcher now at the Quadram Institute, learned about the tool, he walked back to his office and fired it up. Page was interested in a particular plasmid, a round loop of DNA, that helps make typhoid-fever bacteria drug resistant. These plasmids seemed to have popped up out of the blue in Pakistan.
“Within two seconds, I got a list of 20 other samples where they were seen,” Page says. The plasmid wasn’t just in other typhoid bacteria. It was in soil bacteria, animal bacteria, E. coli—painting a much more complex picture of how resistance plasmids emerge and get swapped between different bacterial species.