
If someone were to ask you what industry or field of study was likely to generate the most raw data in the next ten years, what would you say? It might be tempting to look up at the stars, marvel at the vastness of the universe, and put your chips down on the cosmos. But if you guessed astronomy, you would be wrong, at least according to Michael Schatz, a computational biologist at Cold Spring Harbor Laboratory in New York.

This summer, in a paper published in the journal PLoS Biology, Schatz projected that the volume of genomic data being uploaded onto computers around the world will blow every other big data producer out of the water. That includes not only astronomical data but also Twitter, YouTube, and every other content aggregator. In terms of storage demands, all the cat videos in the world may be nothing compared to human and microbial genetic sequences, which are piling up faster every day.

The enormous increase in data collection has been made possible by steady improvements in sequencing technology. Over the last decade, the time it takes to read the nitrogenous base pairs that make up our genetic code, along with the cost of doing so, has dropped dramatically.

“The first human genome cost around a billion dollars to sequence,” says Schatz. “Now we routinely sequence human genomes for about a thousand dollars or a couple thousand dollars.”

As a result, sequencers are popping up everywhere. Food inspectors have them. Big agricultural firms like Monsanto use them. There’s even a sequencer on the International Space Station. And with these tools, public and private institutions are now sequencing new genomes at a furious rate.

Schatz estimates that new genomic data is being created at a rate of 35 petabytes a year. To make sense of that number, says Schatz, imagine you were to take just one petabyte of data and record it onto DVDs. The resulting stack of discs would climb as high as the Empire State Building. Multiply that by 35 and you have this year’s output of genomic data. And the rate at which it’s being collected is accelerating fast: every few months, it doubles.
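As a rough sanity check on that image (a back-of-envelope sketch using assumed disc figures, not numbers from Schatz’s paper), a single-layer DVD holds about 4.7 GB and is roughly 1.2 mm thick, which puts a one-petabyte stack in the same few-hundred-meter range as a skyscraper:

```python
# Back-of-envelope check of the DVD-stack image (assumed figures, not Schatz's exact numbers).
PETABYTE_BYTES = 1e15          # 1 PB, using decimal (SI) units
DVD_CAPACITY_BYTES = 4.7e9     # single-layer DVD, about 4.7 GB
DVD_THICKNESS_M = 0.0012       # about 1.2 mm per disc

discs_per_pb = PETABYTE_BYTES / DVD_CAPACITY_BYTES
stack_height_m = discs_per_pb * DVD_THICKNESS_M

print(f"Discs per petabyte: {discs_per_pb:,.0f}")       # ~213,000 discs
print(f"Stack height: {stack_height_m:,.0f} m")         # ~255 m, vs. ~380 m to the Empire State Building's roof
print(f"Yearly stack (35 PB): {35 * stack_height_m / 1000:,.1f} km")  # roughly 9 km of discs per year
```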

“That’s when the numbers start to become really huge, when we look ten years in the future,” says Schatz. “If these trends continue, we’re going to move beyond petabytes into what’s called exabytes and beyond that perhaps even zettabytes. And sooner or later, this stack of DVDs, rather than being as tall as the Empire State Building, is going to grow into space.”
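How fast does the stack get that tall? Here is a minimal sketch of the compounding, assuming for illustration a starting output of 35 petabytes a year and a doubling period of about seven months (one reading of “every few months”); the specific figures are assumptions, not values reported in the article:

```python
# Illustrative projection: compound 35 PB/year of new data over ten years,
# assuming it doubles roughly every seven months (an assumed figure).
start_pb_per_year = 35.0
doubling_period_months = 7.0
months_ahead = 10 * 12

doublings = months_ahead / doubling_period_months            # ~17 doublings in a decade
projected_pb_per_year = start_pb_per_year * 2 ** doublings

print(f"Doublings in ten years: {doublings:.1f}")
print(f"Projected output: {projected_pb_per_year:,.0f} PB/year")   # ~5 million PB
print(f"That is about {projected_pb_per_year / 1e3:,.0f} EB, or "
      f"{projected_pb_per_year / 1e6:.1f} ZB, per year")
```

Under those assumptions, roughly seventeen doublings turn tens of petabytes per year into zettabytes, which is the leap Schatz is describing.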

In this tower of genomic data lie the answers to all sorts of questions. For example: How do individual genes determine physiological and behavioral traits? How do they interact with other genes? What variation do we find across the genomes of our own species? And when do these variations signal disease and therefore become candidates for drug intervention?

But with so much data, it is no longer possible to find meaningful patterns using the human eye alone. To connect the dots, we will also have to apply our most powerful computer brains, including cognitive computing platforms like IBM’s Watson: machines capable of ingesting and analyzing massive amounts of data at lightning speed and, beyond that, of “learning” from each task how to modify themselves to do a better job the next time.

In the field of genomic analysis, these tools are already being deployed.

Recently, IBM partnered with more than a dozen cancer institutes to train Watson to provide potential personalized treatment options for patients who have cycled through all of the traditional treatments. This is part of a promising new approach to treatment called personalized medicine, which can be especially effective when dealing with a complex medical challenge like cancer.

We now know, for instance, that a type of cancer that exhibits the same set of symptoms in many patients can actually be caused by tumors with different genetic mutations.

“Cancer, for the most part, has been categorized by where it occurs in the body,” says Stephen Harvey, vice president of IBM’s Watson health program. “And what we’re finding today through whole genome sequencing is that we’re starting to be able to identify the specific mutations that are actually driving the cancer.”

To identify those mutations is to identify targeted treatments, says Harvey: “If you can identify the specific mutation and find drugs to treat it, then you can usher in a new era of personalized medicine.”

Still, deciding which drugs to try is tricky. The data linking a specific tumor type to a drug on the shelf may exist, but often a physician would have to find it in studies buried under a mountain of medical literature.

“The task of a cancer researcher and a practicing oncologist these days is to take all available medical knowledge, synthesize all that in a meaningful manner and come up with a treatment option that’s viable to this patient at this point in time,” says Ajay K. Royyuru, director of computational biology research at IBM. “Oftentimes, that’s a task beyond the capability of a practicing oncologist to do in a comprehensive and timely manner.”

A study published last year by the Dana-Farber Cancer Institute found that nearly a quarter of physicians feel that they lack the genomic knowledge necessary for the task.

That’s another problem cognitive computing could help tackle. IBM’s Watson has the ability to ingest all of the cancer-related biomedical literature that’s ever been published. And it is now working with the foremost oncologists in the U.S. to learn how to make specific treatment recommendations in the most difficult cases.

Watson will “take patients who are in dire straits, go to a pharmaceutical shelf for already approved drugs and to big pharma themselves and biotech companies, and find out from them what sort of hidden things they have on the shelf,” says Robert Darnell, scientific director of the New York Genome Center. It will then “take all of that information that we can gain access to and apply it to the analysis of this patient’s tumor in a clinically relevant time frame.”

Personalized medicine is one of the most promising applications on the horizon, but there are many more out there. There are secrets yet to be learned from the many genomes we are collecting from plants, animals, and microbes around the globe, and much of that data might be useless without cognitive computing to make sense of it.

With IBM’s algorithms, for example, scientists have pinpointed which genes determine the variety of colors found in the cacao seedpod. They have investigated how clusters of genes work together to produce complex traits in humans. And they are tracing the paths of ancient human migrations by analyzing the genetic variation of today’s populations.

Any research that requires scientists to make sense of large data sets spread across a diverse range of sources can benefit from cognitive computing. As our stack of genomic sequences climbs further into the atmosphere, we will rely more and more on computers to draw meaning from it.