How Much of the Web Is Archived? Truth Is, We Don't Really Know

Somewhere between 35 and 90 percent of the web has at least one archived copy. That's a pretty big range.

Yosemite James/Flickr

Here's the challenge: new Internet is being made all the time. Oftentimes, these new pages are added to existing networks on Tumblr or Facebook or Twitter or Livejournal. But other times, someone fires up a web server that's off the standard map, and the web's crawlers, try as they might, may not find that page for a while, if ever.

That means some percentage of the web is not being archived by anyone (or anything, really), not even the Internet Archive's invaluable Wayback Machine.

And certainly, few sites are being archived with any kind of regularity, even those (like TheAtlantic.com) that are changing constantly. So, how much of the web is humanity missing?

Researchers took a step towards answering that question in a paper submitted to the arXiv repository late last month. They found two things for sure: 1) the Internet has a memory problem and 2) we don't know how big it is.

"The results from our sample sets indicate that range from 35%-90% of the Web has at least one archived copy," they write. Think about how different those two numbers are. Either we're capturing almost all the web or we're capturing barely more than a third of it.

I can tell you one thing: the archiving of the public web can and should be better. And there's basically one way that's going to happen: the Internet Archive gets more money.

You missed their big fundraising push at the end of last year, but that's no reason not to donate now. If you have any doubts about the people or their commitment to public service, just check out this profile of Brewster Kahle. This is a serious civilizational endeavor, and I hope it gets funded that way.

Alexis C. Madrigal

Alexis Madrigal is the deputy editor of TheAtlantic.com. He's the author of Powering the Dream: The History and Promise of Green Technology.

The New York Observer has called Madrigal "for all intents and purposes, the perfect modern reporter." He co-founded Longshot magazine, a high-speed media experiment that garnered attention from The New York Times, The Wall Street Journal, and the BBC. While at Wired.com, he built Wired Science into one of the most popular blogs in the world. The site was nominated for best magazine blog by the MPA and best science website in the 2009 Webby Awards. He also co-founded Haiti ReWired, a groundbreaking community dedicated to the discussion of technology, infrastructure, and the future of Haiti.

He's spoken at Stanford, CalTech, Berkeley, SXSW, E3, and the National Renewable Energy Laboratory, and his writing was anthologized in Best Technology Writing 2010 (Yale University Press).

Madrigal is a visiting scholar at the University of California at Berkeley's Office for the History of Science and Technology. Born in Mexico City, he grew up in the exurbs north of Portland, Oregon, and now lives in Oakland.
