In an edited excerpt from his new book, Too Big to Know, David Weinberger explains how the massive amounts of data necessary to deal with complex phenomena exceed any single brain's ability to grasp them, yet networked science rolls on.
Thomas Jefferson and George Washington recorded daily weather observations, but they didn't record them hourly or by the minute. Not only did they have other things to do; such data didn't seem useful. Even after the invention of the telegraph enabled the centralization of weather data, the 150 volunteers who received weather instruments from the Smithsonian Institution in 1849 still reported only once a day. Now there is a literally immeasurable, continuous stream of climate data from satellites circling the earth, buoys bobbing in the ocean, and Wi-Fi-enabled sensors in the rain forest. We are measuring temperatures, rainfall, wind speeds, CO2 levels, and pressure pulses of solar wind. All this data and much, much more became worth recording once we could record it, once we could process it with computers, and once we could connect the data streams and the data processors with a network.
How will we ever make sense of scientific topics that are too big to know? The short answer: by transforming what it means to know something scientifically.
This would not be the first time. For example, when Sir Francis Bacon said that knowledge of the world should be grounded in carefully verified facts about the world, he wasn't just giving us a new method to achieve old-fashioned knowledge. He was redefining knowledge as theories that are grounded in facts. The Age of the Net is bringing about a redefinition at the same scale. Scientific knowledge is taking on properties of its new medium, becoming like the network in which it lives.
In this excerpt from my new book, Too Big to Know, we'll look at a key property of the networking of knowledge: hugeness.
In 1963, Bernard K. Forscher of the Mayo Clinic complained in a now-famous letter printed in the prestigious journal Science that scientists were generating too many facts. Titled "Chaos in the Brickyard," the letter warned that the new generation of scientists was too busy churning out bricks -- facts -- without regard to how they fit together. Brickmaking, Forscher feared, had become an end in itself. "And so it happened that the land became flooded with bricks. ... It became difficult to find the proper bricks for a task because one had to hunt among so many. ... It became difficult to complete a useful edifice because, as soon as the foundations were discernible, they were buried under an avalanche of random bricks."
If science looked like a chaotic brickyard in 1963, Dr. Forscher would have sat down and wailed if he were shown the Global Biodiversity Information Facility at GBIF.org. Over the past few years, GBIF has collected thousands of collections of fact-bricks about the distribution of life over our planet, from the bacteria collection of the Polish National Institute of Public Health to the Weddell Seal Census of the Vestfold Hills of Antarctica. GBIF.org is designed to be just the sort of brickyard Dr. Forscher deplored -- information presented without hypothesis, theory, or edifice -- except far larger because the good doctor could not have foreseen the networking of brickyards.
Indeed, networked fact-based brickyards are a growth industry. For example, at ProteomeCommons.org you'll find information about the proteins specific to various organisms. An independent project by a grad student, Proteome Commons makes available almost 13 million data files, for a total of 12.6 terabytes of information. The data come from scientists from around the world, and are made available to everyone, for free. The Sloan Digital Sky Survey -- under the modest tag line "Mapping the Universe" -- has been gathering and releasing maps of the skies collected from 25 institutions around the world. Its initial survey, completed in 2008 after eight years of work, published information about 230 million celestial objects, including 930,000 galaxies; each galaxy contains millions of stars, so this brickyard may grow to a size where we have trouble naming the number. The best known of the new data brickyards, the Human Genome Project, in 2001 completed mapping the entire genetic blueprint of the human species; it has been surpassed in terms of quantity by the International Nucleotide Sequence Database Collaboration, which as of May 2009 had gathered 250 billion pieces of genetic data.
There are three basic reasons scientific data has increased to the point that the brickyard metaphor now looks 19th century. First, the economics of deletion have changed. We used to throw out most of the photos we took with our pathetic old film cameras because, even though they were far more expensive to create than today's digital images, photo albums were expensive, took up space, and required us to invest considerable time in deciding which photos would make the cut. Now, it's often less expensive to store them all on our hard drive (or at some website) than it is to weed through them.
Second, the economics of sharing have changed. The Library of Congress has tens of millions of items in storage because physics makes it hard to display and preserve, much less to share, physical objects. The Internet makes it far easier to share what's in our digital basements. When the datasets are so large that they become unwieldy even for the Internet, innovators are spurred to invent new forms of sharing. For example, Tranche, the system behind ProteomeCommons, created its own technical protocol for sharing terabytes of data over the Net, so that a single source isn't responsible for pumping out all the information; the process of sharing is itself shared across the network. And the new Linked Data format makes it easier than ever to package data into small chunks that can be found and reused. The ability to access and share over the Net further enhances the new economics of deletion; data that otherwise would not have been worth storing have new potential value because people can find and share them.
Third, computers have become exponentially smarter. John Wilbanks, vice president for Science at Creative Commons (formerly called Science Commons), notes that "[i]t used to take a year to map a gene. Now you can do thirty thousand on your desktop computer in a day. A $2,000 machine -- a microarray -- now lets you look at the human genome reacting over time." Within days of the first human being diagnosed with the H1N1 swine flu virus, the H1 sequence of 1,699 bases had been analyzed and submitted to a global repository. The processing power available even on desktops adds yet more potential value to the data being stored and shared.
The brickyard has grown to galactic size, but the news gets even worse for Dr. Forscher. It's not simply that there are too many brick-facts and not enough edifice-theories. Rather, the creation of data galaxies has led us to science that sometimes is too rich and complex for reduction into theories. As science has gotten too big to know, we've adopted different ideas about what it means to know at all.