A Database of All Medical Knowledge: Why Not?
Physicians won't become obsolete any time soon, but the comprehensive integration of everything we know about well-being could revolutionize medical care.

The progress of modern applied science has been defined by a series of outrageously ambitious projects, from the effort to build the first atomic bomb to the race to sequence the human genome.
For scientists and engineers today, perhaps the greatest challenge is the design and assembly of a unified health database, a "big data" project that would collect in one searchable repository all of the parameters that measure or could conceivably reflect human well-being. This database would be "coherent," meaning that the association between individuals and their data is preserved and maintained. A recent Institute of Medicine (IOM) report described the goal as a "Knowledge Network of Disease," a "unifying framework within which basic biology, clinical research, and patient care could co-evolve."
The information contained in this database -- expected to get denser and richer over time -- would encompass every conceivable domain, covering patients (DNA, microbiome, demographics, clinical history, treatments including therapies prescribed and estimated adherence, lab tests including molecular pathology and biomarkers, information from mobile devices, even app use), providers (prescribing patterns, treatment recommendations, referral patterns, influence maps, resource utilization), medical product companies (clinical trial data), payors (claims data), diagnostics companies, electronic medical record companies, academic researchers, citizen scientists, quantified selfers, patient communities -- and this just starts to scratch the surface.
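To make the notion of a "coherent" database concrete, here is a minimal sketch, in Python, of how records from different domains might stay linked to the same individual. All of the names, fields, and example values are hypothetical, and a real system would rest on proper identity management rather than an in-memory dictionary.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class PatientRecord:
    """Everything known about one individual, keyed by a persistent identifier."""
    patient_id: str
    genomics: list[dict[str, Any]] = field(default_factory=list)   # e.g. variant calls
    clinical: list[dict[str, Any]] = field(default_factory=list)   # encounters, labs, diagnoses
    claims:   list[dict[str, Any]] = field(default_factory=list)   # payor claims data
    devices:  list[dict[str, Any]] = field(default_factory=list)   # mobile / wearable streams

# "Coherent" means each contributing silo appends to the same record rather
# than creating a new, unlinked copy -- the association between the person
# and their data is preserved.
registry: dict[str, PatientRecord] = {}

def ingest(patient_id: str, domain: str, payload: dict) -> None:
    record = registry.setdefault(patient_id, PatientRecord(patient_id))
    getattr(record, domain).append(payload)

ingest("patient-001", "clinical", {"date": "2012-07-01", "dx": "heart failure"})
ingest("patient-001", "claims",   {"code": "I50.9", "paid": 1240.00})
ingest("patient-001", "genomics", {"gene": "TTN", "variant": "truncating"})
```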
The underlying assumption here is that this information, appropriately analyzed, should improve both our potential and attained health, pointing us towards future medical insights while enabling us to immediately improve care by optimizing the use of existing resources and technologies.
As the IOM report concluded, "realizing the full promise of precision medicine, whose goal is to provide the best available care for each individual, requires that researchers and health-care providers have access to very large sets of health and disease-related data linked to individual patients."
As daunting as this task obviously is, companies and academic researchers are bravely taking up the challenge, generally by focusing on some subset of the problem, typically at an intersection of two or more domains (clinical information plus biomarkers, say, or provider data plus claims). The selection may reflect what they bring to the table (clinical data in the case of medical centers, claims data in the case of payors) or where they think the greatest value can be found.
In addition, both established information companies (e.g., Google) and emerging companies (such as Palantir) are key players; their fundamental business is built around their ability to approach problems of this dimension and complexity. Google famously asserts that its mission is "to organize the world's information and make it universally accessible and useful."
One industry that seems underrepresented at the table is big pharma; evidently, many large drug companies have decided that big data informatics is not a core competency, and have elected to outsource it as a service. Perhaps this represents a savvy assessment of the current state of the art. However, it may also be a miscalculation on the order of IBM's failure to appreciate the value of the operating system it licensed from Bill Gates, or Xerox's failure to appreciate the value of the personal computer its PARC engineers had created -- and which Steve Jobs immediately recognized and leveraged. Arguably, if you really want to build a company that is going to deliver the health solutions of the future, your first and most important investment might well be in recruiting a Palantir-level analytics group.
At the same time, you can understand big pharma's hesitation. Despite all the promise of big data in health, the results to date have been surprisingly skimpy; putting existing data in a vat and stirring has yielded a slew of academic publications, and a number of pretty pictures, but few truly impactful changes in health, at least so far.
Critics contend, "it's faddish, way overhyped, and not ready for primetime." Consider a recent big data project that concluded that an easily observed clinical indicator, jugular venous distention (enlarged neck veins), is a bad prognostic sign for heart failure patients. That's something a third-year med student could just as easily have told you.
Advocates, meanwhile, point to early successes and plead for patience and resources; they point out that the scale of data required to build good predictive models has only recently become available, and has already led to promising advances.
For instance, systems biologists such as Mount Sinai School of Medicine's Eric Schadt (a friend and previous collaborator) and Stanford's Atul Butte have already used big data analytics to identify and prioritize drug targets -- though it remains to be seen whether these yield clinically useful products, and whether the associated approaches are truly generalizable.
Meanwhile, the latest issue of Cell reports the first computational model of a whole cell, a model of a bacterium that "includes all its molecular components and their interactions" and, according to the authors, "provides insights into previously unobserved cellular behaviors" while leading to new predictions that were subsequently experimentally validated.
There are at least two profound challenges that big data advocates will need to overcome en route to analytical nirvana.
First, the data have to be available. Relevant information is currently stored in what are essentially thousands of different silos. Sometimes, the data are considered off-limits due to privacy regulations; in other instances, they are considered proprietary, hence unsharable; in still other cases, the data are considered open and liberated, and there's a lot of effort among research groups and patient organizations to move more data into this category.
The privacy concerns make intuitive sense: while some may be eager to share their personal data with the world, most of us would demand and expect robust guarantees of privacy -- though it's not clear to what extent this is even possible, as the deanonymization of the Netflix challenge data suggests.
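The Netflix episode illustrates a general mechanism worth making concrete. Below is a toy Python sketch, on entirely made-up data, of a linkage attack: records stripped of names are joined against a public roster on quasi-identifiers (here, ZIP code, birth year, and sex; in the Netflix case, the identifying fingerprint was movie ratings), and a unique match re-identifies the "anonymous" patient.

```python
# Entirely synthetic data; the point is the join, not the specifics.
deidentified_records = [
    {"zip": "02139", "birth_year": 1954, "sex": "F", "dx": "type 2 diabetes"},
    {"zip": "94305", "birth_year": 1978, "sex": "M", "dx": "atrial fibrillation"},
]

public_roster = [  # e.g. a voter file or social-media profiles
    {"name": "J. Doe", "zip": "02139", "birth_year": 1954, "sex": "F"},
    {"name": "R. Roe", "zip": "94110", "birth_year": 1990, "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

for record in deidentified_records:
    matches = [person for person in public_roster
               if all(person[k] == record[k] for k in QUASI_IDENTIFIERS)]
    if len(matches) == 1:  # a unique match re-identifies the "anonymous" record
        print(f'{matches[0]["name"]} -> {record["dx"]}')
```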
The reluctance to share proprietary data also makes sense, although there's a bit of a prisoner's dilemma twist: data from any one domain is of only limited value by itself; the real opportunity comes when you are able to combine data from multiple domains. Most likely, strategic coupling (data-sharing) between organizations representing different domains will precede the development of the grand, unified data set.
This highlights the second key challenge: even if you're able to gain access to data, how do you organize and assemble it once you have it? As Schadt says, "getting to a data model that represents the great diversity of data, across a great many different domains, and interrelates it so that it can all be efficiently queried and mined has never really been achieved because it is a really hard problem."
Schadt adds, "I think those who have tried to make one master data model, make it fully self-consistent and representing all data across a diversity of knowledge domains will fail. The grand data model of everything in biology has been tried over and over again and ultimately it comes crashing down because once you lock in a model, you constrain the types of questions you can ask of the data. The information half life is short, and we know and understand so little about what the data can actually tell us that we don't really understand the questions that ultimately will be the most useful."
So where, exactly, does this leave us?
It seems clear that if there ever is a single, useful dataset, we're going to arrive there in an incredibly messy way, likely through the combination of a number of disparate datasets built to serve a range of different needs.
Schadt anticipates that in the end, these data will not be organized using a highly structured formal data model (although some components will likely benefit from such an approach), but in a less structured way that enables broader engagement with the data via simple, natural query interfaces -- much as Google today manages the digital universe of information and makes it broadly accessible and usable through a very simple natural-language interface.
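To give a rough sense of what that looser approach might look like -- again as a hypothetical Python sketch rather than anyone's actual architecture -- records from different domains can be kept as free-form documents, with a generic query function matching on whatever fields a question happens to need, so that new questions don't require revising a master schema.

```python
# Hypothetical, loosely structured store: each document carries whatever
# fields its source domain provides, plus the patient it refers to.
documents = [
    {"patient_id": "patient-001", "domain": "clinical",
     "dx": "heart failure", "finding": "jugular venous distention"},
    {"patient_id": "patient-001", "domain": "genomics", "gene": "TTN"},
    {"patient_id": "patient-002", "domain": "claims", "code": "I50.9"},
]

def query(**criteria):
    """Return every document matching the given field/value pairs,
    regardless of which domain it came from or what else it contains."""
    return [doc for doc in documents
            if all(doc.get(k) == v for k, v in criteria.items())]

# New kinds of questions need new criteria, not a new schema.
print(query(patient_id="patient-001"))
print(query(domain="clinical", finding="jugular venous distention"))
```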
It's also not clear whether the smartest move is to build broadly -- focusing on collecting information across as many domains as possible -- or to build deeply, by relating a small number of comprehensive datasets (and if so, which datasets do you choose?). For reasons of convenience, more companies and academics seem to be pursuing the second strategy, though it's still unclear which approach will produce the greater payoff.
What I suspect healthcare big data needs above all is a dramatic win - an immediately actionable, compelling, non-intuitive recommendation that by virtue of its successful implementation will announce the arrival of the discipline.
Until then, we -- physicians, patients, and all the other stakeholders in our healthcare system -- must muddle forward as best we can. It's clear from the scope of the problem just how much muddling is involved -- and how unreasonable it is to expect any one individual to integrate and evaluate all the available information.
I don't think this means physicians are destined to become obsolete (as some Silicon Valley technologists seem to believe), nor do I envision a future where we rely solely upon machines and algorithms for our healthcare. I've seen that medicine at its best is as much about connection and understanding as it is about arriving at a diagnosis and formulating a treatment recommendation.
At the same time, I suspect that a kind word plus an individualized, optimized therapeutic plan based on the comprehensive integration of all existing health data would go a lot further than a kind word alone.