Public Health's New Tool Is Wikipedia

All that online symptom-searching and self-diagnosing has paid off: Research has shown that traffic to the site can be used to predict disease outbreaks well in advance.

As with anything in this life, the Internet-as-doctor approach has its upsides and its downsides.

The downsides—as anyone who’s ever Googled “sore throat” and emerged several hours later from the virtual rabbit hole convinced of their imminent death can attest—are fairly clear; a simple search can quickly become an experience that haunts the dreams of many a hypochondriac.

Even so, plenty of people regularly turn to the World Wide Web as a first-line medical resource. As Julie Beck has previously noted on this site, 72 percent of Americans have looked online for health information sometime in the past year, while one in three Americans has self-diagnosed a health problem with the help of the Internet. And doctors are following in their patients’ footsteps, too: A full 50 percent of physicians turn to Wikipedia for health information, and some are active editors of the site as well.

And here’s where the upside comes into play: The public-health world has gotten in on the action, too, using online activity to monitor and map diseases as they unfold. The app Sickweather, for example, scans social-networking sites for illness-related words and provides localized heat maps based on a user’s location, while a website created by a team at Boston Children’s Hospital in 2006 pulls from a variety of sources to create a real-time global map of infectious diseases. Google Flu Trends analyzes search terms to estimate flu rates around the world (with a reporting lag of approximately one day, according to a letter Google published in the journal Nature in 2009).

Even more useful, though, would be the ability to predict disease, rather than simply observe and broadcast its path. In a study published yesterday in the journal PLOS Computational Biology, researchers from Los Alamos National Laboratory—a government facility in New Mexico that focuses on science with national-security implications—found that trends in Wikipedia pageviews can be used to predict flu outbreaks up to four weeks in advance.

To create their computer model, the researchers made a list of all the pages a person could click to directly from the Wikipedia entry for “influenza” (the entry for “flu” also redirects there) and then compared the traffic for each of those pages to flu reports provided by the Centers for Disease Control and Prevention.

“Those that correlated really well, we kept, and those that didn’t correlate, we dropped,” explained Los Alamos researcher Sara del Valle, one of the study authors. Ultimately, they were left with a list of 10 flu-related Wikipedia pages, including the entries for “antivirals,” “H1N1,” and “fever,” whose traffic they used to build their predictive algorithms. “So basically, just by looking at how many people are looking at the Wikipedia flu article, we can see how many cases are going to be showing up.”
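The keep-or-drop step del Valle describes can be illustrated with a rough sketch. This is not the researchers' code: the Pearson correlation measure, the 0.8 cutoff, the function names, and every number below are assumptions made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no signal
    return cov / sqrt(var_x * var_y)


def select_pages(page_views, case_counts, threshold=0.8):
    """Keep pages whose traffic tracks reported cases (threshold is assumed)."""
    kept = {}
    for page, views in page_views.items():
        r = pearson(views, case_counts)
        if r >= threshold:
            kept[page] = r
    return kept


# Hypothetical weekly figures, invented for this example.
case_counts = [10, 20, 40, 80, 60, 30]
page_views = {
    "Fever": [100, 210, 390, 820, 610, 290],
    "Unrelated page": [500, 480, 510, 495, 505, 490],
}

selected = select_pages(page_views, case_counts)
print(sorted(selected))
```

In this toy data, the "Fever" traffic rises and falls with the case counts and is kept, while the flat "Unrelated page" traffic is dropped.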

Using online activity to predict outbreaks isn’t exactly new—but, del Valle said, the study offers “proof of concept” that Wikipedia could provide scientists a means of circumventing several existing obstacles in disease forecasting. While recent research has shown Twitter to be an effective resource for predicting outbreaks, for example, the cost of the raw data can be prohibitive to many. Gnip, a subsidiary of Twitter and one of only a few data-delivery companies with access to its “firehose” (the feed of every single Tweet ever sent), charges users a monthly fee of $2,000 for the data, plus an additional 10 cents for every 1,000 Tweets delivered. (Earlier this year, Twitter announced that it would provide a select number of “data grants” enabling researchers to access the data for free.) And other companies keep their data closed off entirely; Google doesn’t publicize the search terms it uses to build Flu Trends out of concern that the program might be manipulated by hackers trying to create the appearance of an outbreak.
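To get a feel for how those Gnip fees add up, here is a back-of-the-envelope sketch using only the two prices quoted above. The function name, the 10 million Tweet example volume, and the assumption that partial blocks of 1,000 Tweets are billed proportionally are all hypothetical.

```python
BASE_FEE_CENTS = 2000 * 100      # $2,000 monthly fee, in cents
CENTS_PER_1000_TWEETS = 10       # 10 cents per 1,000 Tweets delivered


def monthly_cost_dollars(tweets_delivered):
    """Estimated monthly bill in dollars (integer cents avoid float drift)."""
    usage_cents = (tweets_delivered * CENTS_PER_1000_TWEETS) // 1000
    return (BASE_FEE_CENTS + usage_cents) / 100


# Hypothetical volume: 10 million Tweets in one month.
example_cost = monthly_cost_dollars(10_000_000)
print(example_cost)
```

At that assumed volume, the per-Tweet charges come to $1,000 on top of the $2,000 base fee.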

Wikipedia, by contrast, offers public access to hourly traffic data for each of its pages. “We don’t do ‘data grants’ for selected individuals or research institutions, since our mandate is to make data openly available to anyone,” Dario Taraborelli, head of research and data at the Wikimedia Foundation, explained in an email, noting that his team fields “several requests a week” from researchers looking for data.

Another part of the appeal, del Valle said, is that Wikipedia’s open access could allow researchers to bypass the bureaucracy that currently comes with large-scale disease tracking. FluView, the CDC’s weekly influenza-surveillance report that compiles data from hospitals, healthcare providers, state health departments, and government public-health labs, has a lag time of about two weeks.

In addition to the flu, the researchers applied their model to dengue (which Google also tracks), tuberculosis, and HIV, with varying levels of success: The model was able to predict outbreaks up to four weeks in advance for dengue and one week for tuberculosis, but was unable to forecast anything for HIV, a fact the researchers attributed to the virus’s lack of seasonality (both dengue and tuberculosis follow seasonal patterns) and its ability to live inside a host without causing symptoms for extended periods of time.

“For diseases that are very long and don’t really have a seasonal effect, Wikipedia may not necessarily work very well,” del Valle said. Down the road, she and her colleagues hope to apply the model to other diseases that fit within the established criteria, including chikungunya, a mosquito-borne disease currently sweeping its way through Latin America and the Caribbean.

But as they expand into other types of outbreaks, del Valle and her colleagues have a few tweaks to make to the existing system, including the ability to account for spikes in traffic that may be attributable to other factors. “You also need to look at how many news organizations are running information about something,” she said, “because otherwise your model will say that there’s Ebola everywhere, even though it’s because people are searching for it.” As of now, the model can’t determine how many page hits come from things like pure public curiosity.

Nor, for now, can it sort out views by geographic origin. “We only know when the flu article was accessed in English or Chinese or German,” del Valle explained, “but because we don’t really have the geographic resolution, we just assume that everyone who accessed the article in Chinese is in China, and everyone who accessed the article in German is in Germany.”
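The language-as-location shortcut del Valle describes amounts to a simple lookup. The sketch below is illustrative only: the mapping covers just the two languages she pairs with a country, the language codes follow Wikipedia's edition prefixes, and the function name and view counts are made up.

```python
# Only the two pairings named in the quote; everything else stays unassigned.
ASSUMED_COUNTRY_BY_LANGUAGE = {
    "zh": "China",
    "de": "Germany",
}


def attribute_views(views_by_language):
    """Sum page views per assumed country, based on article language alone."""
    totals = {}
    for language, views in views_by_language.items():
        country = ASSUMED_COUNTRY_BY_LANGUAGE.get(language, "unknown")
        totals[country] = totals.get(country, 0) + views
    return totals


# Hypothetical view counts, invented for this example.
example = attribute_views({"zh": 1200, "de": 800, "en": 5000})
print(example)
```

The weakness is visible in the lookup itself: a German speaker reading from Vienna or Zurich would be counted in Germany, and languages spoken across many countries cannot be placed at all.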

Even so, she said, the model’s success within its existing limitations offers the public—and the institutions that guard its health—a new, more easily accessible resource: “Hopefully, this [forecasting] information will be used by public-health departments so they can prepare resources or prepare to treat people.”

That’s not to say, of course, that the public won’t also take matters into its own hands. Full speed ahead, sore-throated, sniffly-nosed, feverish surfers of the web. One man’s hypochondria-induced Internet search, after all, is another’s insight into the outbreaks of tomorrow.