Future Historians Probably Won't Understand Our Internet, and That's Okay

Archivists are working to document our chaotic, opaque, algorithmically complex world—and in many cases, they simply can’t.

What’s happening?

This has always been an easier question to pose—as Twitter does to all its users—than to answer. And how well we answer the question of what is happening in our present moment has implications for how this current period will be remembered. Historians, economists, and regular old people at the corner store all have their methods and heuristics for figuring out how the world around them came to be. The best theories require humility; nearly everything that has happened to anyone produced no documentation, no artifacts, nothing to study.

The rise of social media in the ’00s seemed to offer a new avenue for exploring what was happening with unprecedented breadth. After all, people were committing ever larger amounts of information about themselves, their friends, and the world to the servers of social-networking companies. Optimism about this development peaked in 2010, when Twitter gave its archive and ongoing access to public tweets to the Library of Congress. Tweets in the record of America! “It boggles my mind to think what we might be able to learn about ourselves and the world around us from this wealth of data,” a library spokesperson exclaimed in a blog post. “And I’m certain we’ll learn things that none of us now can even possibly conceive.”

Unfortunately, one of the things the library learned was that the Twitter data overwhelmed the technical resources and capacities of the institution. By 2013, the library had to admit that a single search of just the Twitter data from 2006 to 2010 could take 24 hours. Four years later, the archive still is not available to researchers.

Across the board, the reality began to sink in that these proprietary services hold volumes of data that no public institution can process. And that’s just the data itself.

What about the actual functioning of the application: What tweets are displayed to whom in what order? Every major social-networking service uses opaque algorithms to shape what data people see. Why does Facebook show you this story and not that one? No one knows, possibly not even the company’s engineers. Outsiders know basically nothing about the specific choices these algorithms make. Journalists and scholars have built up some inferences about the general features of these systems, but our understanding is severely limited. So even if the Library of Congress has the database of tweets, it still doesn’t have Twitter.

In a new paper, “Stewardship in the ‘Age of Algorithms,’” Clifford Lynch, the director of the Coalition for Networked Information, argues that the paradigm for preserving digital artifacts is not up to the challenge of preserving what happens on social networks.

Over the last 40 years, archivists have begun to gather more digital objects: web pages, PDFs, databases, kinds of software. There is more data about more people than ever before. Yet the cultural institutions dedicated to preserving the memory of what it was to be alive in our time, including our hours on the internet, may actually be capturing less usable information than in previous eras.

“We always used to think for historians working 100 years from now: We need to preserve the bits (the files) and emulate the computing environment to show what people saw a hundred years ago,” said Dan Cohen, a professor at Northeastern University and the former head of the Digital Public Library of America. “Save the HTML and save what a browser was and what Windows 98 was and what an Intel chip was. That was the model for preservation for a decade or more.”

Which makes sense: If you want to understand how WordPerfect, an old word processor, functioned, then you just need that software and some way of running it.

But if you want to document the experience of using Facebook five years ago or even two weeks ago ... how do you do it?

The truth is, right now, you can’t. No one (outside Facebook, at least) has preserved the functioning of the application. And worse, there is no thing that can be squirreled away for future historians to figure out. “The existing models and conceptual frameworks of preserving some kind of ‘canonical’ digital artifacts are increasingly inapplicable in a world of pervasive, unique, personalized, non-repeatable performances,” Lynch writes.

Nick Seaver of Tufts University, a researcher in the emerging field of “algorithm studies,” wrote a broader summary of the issues with trying to figure out what is happening on the internet. He ticks off the problems of trying to pin down—or in our case, archive—how these web services work. One, they’re always testing out new versions. So there isn’t one Google or one Bing, but “10 million different permutations of Bing.” Two, as a result of that testing and their own internal decision-making, “You can’t log into the same Facebook twice.” It’s constantly changing in big and small ways. Three, the number of inputs and complex interactions between them simply makes these large-scale systems very difficult to understand, even if we have access to outputs and some knowledge of inputs.

“What we recognize or ‘discover’ when critically approaching algorithms from the outside is often partial, temporary, and contingent,” Seaver concludes.

The world as we experience it seems to be growing more opaque. More of life now takes place on digital platforms that are different for everyone, closed to inspection, and massively technically complex. What we don’t know now about our current experience will echo through time: The historians of the future will know less, too. Maybe this era will be a new dark age, as resistant to analysis then as it has become now.

If we do want our era to be legible to future generations, our “memory organizations,” as Lynch calls them, must take radical steps to probe and document social networks like Facebook. Lynch suggests creating persistent, socially embedded bots that exist to capture a realistic and demographically broad set of experiences on these platforms. Or, alternatively, archivists could go out and recruit actual humans to opt in to having their experiences recorded, as ProPublica has done with political advertising on Facebook.

Lynch’s suggestion is radical for the archival community. Archivists generally allow other people to document the world, and then they preserve, index, and make these records available. Lynch contends that when it comes to today’s social media, that just doesn’t work. If they want to accurately capture what it was like to live online today, archivists and other memory organizations will have to actively build technical tools and cultural infrastructure to understand the “performances” of these algorithmic systems. But, at least right now, this is not going to happen.

“I loved this paper. It laid out a need that is real, but as part of the paper, it also said, ‘Oh, by the way, this is impossible and intractable,’” said Leslie Johnston, director of digital preservation at the U.S. National Archives. “It was realistic in understanding that this is a very hard thing to accomplish with our current professional and technical constructs.”

Archivists are encountering the same difficulties that journalists and scholars have run up against studying these technologies. In an influential paper from last year, Jenna Burrell of the University of California, Berkeley’s School of Information highlighted the opacity that frustrates outsiders looking at corporate algorithms. Obviously, companies want to protect their own proprietary software. And the code and systems built around the code are complex. But more fundamentally, there is a mismatch between how the machines function and how humans think. “When a computer learns and consequently builds its own representation of a classification decision, it does so without regard for human comprehension,” Burrell writes. “Machine optimizations based on training data do not naturally accord with human semantic explanations.”

This is the most novel part of what makes archiving our internet difficult. There are pieces of the internet that simply don’t function on human or human-generated or human-parse-able principles.

While Seaver of Tufts University considered Lynch’s proposals to create an archival bot or human army to record the experience of being on an internet service plausible, he cautioned that “it’s really hard to go from a user experience to what is going on under the hood.”

Still, Seaver sees these technical systems not as totally divorced from humans, but as complex arrangements of people doing different things.

“Algorithms aren’t artifacts, they are collections of human practices that are in interaction with each other,” he told me. And that’s something that people in the social sciences have been trying to deal with since the birth of their fields. They have learned at least one thing: It’s really difficult. “One thing you can do is replace the word ‘algorithm’ with the word ‘society,’” Seaver said. “It has always been hard to document the present [functioning of a society] for the future.”

The archivist, Johnston, expressed a similar sentiment about the (lack of) novelty of the current challenge. She noted that people working in “collection-development theory”—the people who choose what to archive—have always had to make do with limited coverage of an era, doing their best to try to capture the salient features of a society. “Social media is not unlike a personal diary,” she said. “It’s more expansive. It is a public diary that has a graph of relationships built into it. But there is a continuity of archival practice.”

So, maybe our times are not so different from previous eras. Lynch himself points out that “the rise of the telephone meant that there were a vast number of person-to-person calls that were never part of the record and that nobody expected to be.” Perhaps Facebook communications should fall into a similar bucket. For a while it seemed exciting and smart to archive everything that happened online because it seemed possible. But now that it might not actually be possible, maybe that’s okay.

“Is it terrible that not everything that happens right now will be remembered forever?” Seaver said. “Yeah, that’s crappy, but it’s historically quite the norm.”