Every Trump tweet in a big, searchable database.
The library has been handed a Gordian knot, an engineering, cyber, and policy challenge that grows bigger and more complicated every day—about 500 million tweets a day more complicated. Will the library finally untie it—or give in and cut the thing off?
“This is a warning as we start dealing with big data—we have to be careful what we sign up for,” said Michael Zimmer, a professor at the University of Wisconsin-Milwaukee who has written on the library’s efforts. “When libraries didn’t have the resources to digitize books, only a company the size of Google was able to put the money and the bodies into it. And that might be where the Library of Congress is stuck.”
Things looked easier in 2010, when the library launched the Twitter partnership with a jaunty press release, “How Tweet It Is”:
Have you ever sent out a “tweet” on the popular Twitter social media service? Congratulations: Your 140 characters or less will now be housed in the Library of Congress.
Back then, Twitter users posted around 55 million tweets a day. That’s a lot, but it’s peanuts compared with the traffic Twitter sees today. And tweets were less complicated back then. They didn’t have embedded media, like photos or videos, and sharing tweets was mostly still a copy-and-paste affair—though some early adopters were giving this new “retweet” button a try.
That April, Twitter and the Library of Congress signed a short agreement—just two pages. In it, Twitter promised to hand over all the tweets posted since the company’s launch in 2006, as well as a regular feed of new submissions. In return, the library agreed to embargo the data for six months and ensure that private and deleted tweets were not exposed.
As the library explained later that year:
Private account information and deleted tweets will not be part of the archive. Linked information such as pictures and websites is not part of the archive, and the Library has no plans to collect the linked sites. There will be at least a six-month window between the original date of a tweet and its date of availability for research use.
This turned out to be a tougher challenge than anyone expected. For one thing, the flood of tweets kept swelling, from 55 million a day in 2010 to 140 million in 2011 and nearly 500 million in 2012. And the tweets themselves got bigger, too. Individual tweets could be linked into conversation threads, and users embedded photos, then video, then live video. All that extra metadata weighed down the Library of Congress’s daily downloads and forced staff to consider building an archival system that would have to change as often as Twitter did.
In 2013, with academics clamoring for access to the archive, the library admitted things weren’t going so well:
It is clear that technology to allow for scholarship access to large data sets is lagging behind technology for creating and distributing such data. Even the private sector has not yet implemented cost-effective commercial solutions because of the complexity and resource requirements of such a task…
The Library has not yet provided researchers access to the archive. Currently, executing a single search of just the fixed 2006-2010 archive on the Library’s systems could take 24 hours. This is an inadequate situation in which to begin offering access to researchers, as it so severely limits the number of possible searches.
At the same time, with the library’s efforts stalled, Twitter ramped up its own push to expose and sell its massive archive. In 2010, it partnered with the data firm Gnip to offer feeds of raw tweets at prices that ran into the hundreds of thousands of dollars. Twitter eventually cut out the middleman and bought Gnip in 2014, consolidating distribution of its valuable data.