First, the data have to be available. Relevant information is currently stored in what are essentially thousands of different silos. Sometimes, the data are considered off-limits due to privacy regulations; in other instances, they are considered proprietary, hence unsharable; in still other cases, the data are considered open and liberated, and there's a lot of effort among research groups and patient organizations to move more data into this category.
The privacy concerns make intuitive sense: while some may be eager to share their personal data with the world, most of us would demand and expect robust guarantees of privacy, though it's not clear to what extent this is even possible, as the deanonymization of the Netflix challenge data suggests.
The reluctance to share proprietary data also makes sense, although there's a bit of a prisoner's dilemma twist: data from any one domain are of only limited value by themselves; the real opportunity comes when you are able to combine data from multiple domains. Most likely, strategic coupling (data-sharing) between organizations representing different domains will precede the development of the grand, unified dataset.
This highlights the second key challenge: even if you're able to gain access to data, how do you organize and assemble it once you have it? As Schadt says, "getting to a data model that represents the great diversity of data, across a great many different domains, and interrelates it so that it can all be efficiently queried and mined has never really been achieved because it is a really hard problem."
Schadt adds, "I think those who have tried to make one master data model, make it fully self-consistent and representing all data across a diversity of knowledge domains will fail. The grand data model of everything in biology has been tried over and over again and ultimately it comes crashing down because once you lock in a model, you constrain the types of questions you can ask of the data. The information half life is short, and we know and understand so little about what the data can actually tell us that we don't really understand the questions that ultimately will be the most useful."
So where, exactly, does this leave us?
It seems clear that if there ever is a single, useful dataset, we're going to arrive there in an incredibly messy way, likely through the combination of a number of disparate datasets built to serve a range of different needs.
Schadt anticipates that in the end, these data will not be organized using a highly structured formal data model (although some components will likely benefit from such an approach), but in a less structured way that enables broader engagement with the data via simple, natural query interfaces, much as Google today manages the digital universe of information and makes it broadly accessible and usable via a very simple natural-language interface.
It's also not clear whether the smartest move is to build broadly, focusing on collecting information across as many domains as possible, or to build deeply, by relating a small number of comprehensive datasets (and if so, which datasets do you choose?). For reasons of convenience, more companies and academics seem to be pursuing the second strategy, though it's still unclear which will produce the greatest payoff.
What I suspect healthcare big data needs above all is a dramatic win: an immediately actionable, compelling, non-intuitive recommendation that by virtue of its successful implementation will announce the arrival of the discipline.
Until then, physicians, patients, and all the other stakeholders in our healthcare system must muddle forward as best we can. It's clear from the scope of the problem just how much muddling is involved, and how unreasonable it is to expect any one individual to integrate and evaluate all the available information.
I don't think this means physicians are destined to become obsolete (as some Silicon Valley technologists seem to believe), nor do I envision a future where we rely solely upon machines and algorithms for our healthcare. I've seen that medicine at its best is as much about connection and understanding as it is about arriving at a diagnosis and formulating a treatment recommendation.
At the same time, I suspect that a kind word plus an individualized, optimized therapeutic plan based on the comprehensive integration of all existing health data would go a lot further than a kind word alone.