How Much of the Web Is Archived? Truth Is, We Don't Really Know
Somewhere between 35 and 90 percent of the web has at least one archived copy. That's a pretty big range.
Here's the challenge: new Internet is being made all the time. Oftentimes, these new pages are added to existing networks on Tumblr or Facebook or Twitter or LiveJournal. But other times, someone fires up a web server that's off the standard map, and the web's crawlers, try as they might, may not find that page for a while, if ever.
That means some percentage of the web is not being archived by anyone (or anything, really), not even the Internet Archive's invaluable Wayback Machine.
And certainly, few sites are being archived with any kind of regularity, even those (like TheAtlantic.com) that are changing constantly. So, how much of the web is humanity missing?
Researchers took a step towards answering that question in a paper submitted to the arXiv repository late last month. They found two things for sure: 1) the Internet has a memory problem and 2) we don't know how big it is.
"The results from our sample sets indicate that range from 35%-90% of the Web has at least one archived copy," they write. Think about how different those two numbers are. Either we're capturing almost all the web or we're capturing barely more than a third of it.
I can tell you one thing: the archiving of the public web can and should be better. And there's basically one way that's going to happen: the Internet Archive gets more money.
You missed their big fundraising push at the end of last year, but that's no reason not to donate now. If you have any doubts about the people or their commitment to public service, just check out this profile of Brewster Kahle. This is a serious civilizational endeavor, and I hope it gets funded that way.