Maximizing the potential of our increasingly vast base of scientific knowledge


We aren't yet at the stage where we can loose computers upon the stores of human knowledge only to return a week later with discoveries that would supplant those of Einstein or Newton in our scientific pantheon. But computational methods are helpful. Working in concert with people -- we are still needed to sort the wheat from the chaff -- computational programs and automated techniques can connect scientific areas that ought to be speaking to each other yet haven't, stitching together different fields until the interconnectivity between the different areas becomes clear.

In the fall of 2010, a team of scientists in the Netherlands published the first results of a project called CoPub Discovery. Their previous work had involved the creation of a massive database based on the co-occurrence of words in articles. If two papers both have the terms "p53" and "oncogenesis," for example, they would be linked more strongly than papers with no key terms in common. CoPub Discovery involved creating a new program that mines this database for unknown relationships between genes and diseases.
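To make the co-occurrence idea concrete, here is a minimal sketch in Python. It is not CoPub Discovery's actual algorithm, which the article does not describe in detail. The toy abstracts, the term names (geneA, diseaseX, and so on), and the helper functions are all invented for illustration. The sketch counts how often pairs of terms appear in the same document, then looks for a gene and a disease that never appear together but share an intermediate term.

```python
from collections import defaultdict
from itertools import combinations

# Toy "abstracts", each reduced to its set of key terms.
# These terms are made up for illustration; they are not real data.
abstracts = [
    {"geneA", "apoptosis"},
    {"geneA", "apoptosis", "cell cycle"},
    {"apoptosis", "diseaseX"},
    {"geneB", "diseaseY"},
]

def cooccurrence_counts(docs):
    """Count how many documents mention each unordered pair of terms."""
    counts = defaultdict(int)
    for terms in docs:
        for a, b in combinations(sorted(terms), 2):
            counts[(a, b)] += 1
    return counts

def linked(counts, a, b):
    """Number of documents in which terms a and b appear together."""
    return counts.get(tuple(sorted((a, b))), 0)

def indirect_links(counts, source, target):
    """Intermediate terms that co-occur with both source and target."""
    terms = {t for pair in counts for t in pair}
    return sorted(
        t for t in terms
        if t not in (source, target)
        and linked(counts, source, t) > 0
        and linked(counts, t, target) > 0
    )

counts = cooccurrence_counts(abstracts)
print(linked(counts, "geneA", "apoptosis"))         # 2
print(linked(counts, "geneA", "diseaseX"))          # 0
print(indirect_links(counts, "geneA", "diseaseX"))  # ['apoptosis']
```

In this toy example, geneA and diseaseX are never mentioned in the same abstract, yet both are tied to apoptosis. That indirect tie is the kind of candidate relationship a person would then need to vet.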

Essentially, CoPub Discovery automates the detection of relationships between thousands of genes and thousands of diseases, gene pathways, and even the effectiveness of different drugs. Doing this automatically allows many possible discoveries to be detected. CoPub Discovery also has a careful system of checks designed to sift out false positives -- instances where the program might say there is an association when there really isn't.

And it works! The program was able to find a number of exciting new associations between genes and the diseases that they may cause, ones that had never before been written about in the literature.

For example, there is a condition known as Graves' disease that normally causes hyperthyroidism, a condition in which the thyroid produces too much hormone. Symptoms include heat intolerance and eyes that stick out more prominently, yielding a somewhat bug-eyed appearance for sufferers. CoPub Discovery, when automatically plowing through the large database, found a number of genes that had never before been implicated in Graves' that might be involved in causing the disease. Specifically, it found a large cluster of genes related to something known as programmed cell death.

Programmed cell death is not nearly as scary as it sounds. Our bodies often require the death of individual cells in order to perform correctly, and there is a set of genes in our cells tailored for this purpose. For example, during embryonic development, our hands initially have webbing between the fingers. But prior to birth the cells in the webbing are given the signal to die, causing us to not have webbed hands. Webbed hands and feet only occur when the signal is given incorrectly, or when these genes don't work properly.

What CoPub Discovery computationally hypothesized is that when these programmed cell death genes don't work properly in other ways, a cascade of effects might follow, eventually leading to the condition known as Graves' disease. CoPub Discovery has also found relationships between drugs and diseases and determined other previously unknown effects of currently used drugs. For example, while a medicine might be used to help treat a specific condition, not all of its effects might be known. Using the CoPub Discovery engine and the concept of undiscovered public knowledge, it becomes possible to actually see what the other effects of such a drug might be.

The researchers behind CoPub Discovery did something even more impressive. Rather than simply put forth a tool and a number of computationally generated hypotheses -- although this is impressive by itself -- they actually tested some of the discoveries in the laboratory. They wanted to see if these pieces of newly revealed knowledge were actually true. Specifically, CoPub Discovery predicted that two drugs, dephostatin and damnacanthal, could be used to slow the reproduction and proliferation of a group of cells. They found that the drugs actually worked: the larger the dose of these drugs, the more the cells' growth was inhibited. This concept is known as drug repurposing, where hidden knowledge is used to determine that medicines are useful in treating conditions or diseases entirely different from their original purposes.

One of the most celebrated examples of drug repurposing is Viagra, which was originally designed to treat hypertension. While Viagra proved effective for that condition, many of the participants in the clinical trials reported a certain intriguing side effect, also making Viagra one of the only cases where the pills left over at the end of the study were not returned by the participants.

There are many other examples of computational discovery that combine multiple pieces of knowledge to reach novel conclusions. From software designed to find undiscovered patterns in the patent literature to the numerous computerized systems devoted to drug repurposing, this approach is growing rapidly. In fact, within mathematics, there is even a whole field of automated theorem proving. Armed with nothing but various axioms and theorems well known to the mathematics community, as well as a set of rules for how to logically infer one thing from another, a computer simply goes about combining axioms and other theorems in order to prove new ones.
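The "combine axioms with inference rules" loop can be sketched in a few lines of Python. This is a toy forward-chaining procedure over plain-text statements, not how any particular theorem prover works; real systems use far richer logics and search strategies. The statements and rules below are assumptions chosen only to show the mechanics.

```python
def forward_chain(axioms, rules):
    """Apply rules of the form (premises, conclusion) until nothing new is derived."""
    known = set(axioms)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # Fire a rule only when every premise is already known.
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known

# Hypothetical starting facts and inference rules, for illustration only.
axioms = {"n is even", "n > 2"}
rules = [
    (("n is even",), "n is divisible by 2"),
    (("n is divisible by 2", "n > 2"), "n is not prime"),
    (("n is odd",), "n + 1 is even"),  # never fires: its premise is not known
]

derived = forward_chain(axioms, rules) - axioms
print(sorted(derived))  # ['n is divisible by 2', 'n is not prime']
```

The second conclusion only becomes reachable after the first has been derived, which is the sense in which such a program builds new statements out of earlier ones.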

Given enough computational power, these systems can yield quite novel results. Of course, most of the output is rather simple and pedestrian, but they can generate new and interesting provably true mathematical statements as well. One of the earliest examples of these is the Automated Mathematician, created by Doug Lenat in the 1970s. This program constructs regularities and equalities, with Lenat even claiming that the Automated Mathematician rediscovered a fundamental unsolved problem (though, sadly, did not solve it) in abstract number theory known as Goldbach's Conjecture. Goldbach's Conjecture is the elegant hypothesis that every even number greater than two can be expressed as the sum of two prime numbers. For example, 8 is 5 + 3 and 18 is 7 + 11. This type of program has provided a foundation for other automated proof systems, such as TheoryMine, which names a novel computationally created and proved theorem after the reader, for a small price.
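Goldbach's Conjecture is easy to check for small numbers, even though nobody has proved it in general. The short Python sketch below lists every way to write an even number as a sum of two primes; the function names are illustrative. Checking a finite range like this is evidence, not a proof.

```python
def is_prime(n):
    """Trial division; fine for the small numbers used here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def goldbach_pairs(n):
    """All pairs (p, q) of primes with p <= q and p + q == n."""
    return [(p, n - p) for p in range(2, n // 2 + 1)
            if is_prime(p) and is_prime(n - p)]

print(goldbach_pairs(8))   # [(3, 5)]
print(goldbach_pairs(18))  # [(5, 13), (7, 11)]

# Every even number from 4 to 1000 has at least one such pair.
print(all(goldbach_pairs(n) for n in range(4, 1001, 2)))  # True
```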

Ultimately, our computer systems allow for the uncovering of what I term hidden knowledge: knowledge that lies within the literature, or the total sum of what we know, but due to the complexity of our knowledge, is far beyond the reach of a single individual.

As computers aid humans in these endeavors, uncovering hidden knowledge in everything from medicine to mathematics is now possible.

This post is adapted from Samuel Arbesman's The Half-Life of Facts: Why Everything We Know Has an Expiration Date.
