As you probably know, "the cloud" in Internet parlance isn't an actual cloud. The Internet's cloud refers to remote storage of information and the network that connects to it. What tech companies pitch as a nebulous intangibility is really just stacks and stacks of servers with direct connections to the rest of the world. Things that take up physical space, in other words.
For the National Security Agency to do its spying, it needs servers. It needs buildings, perhaps ones clad in black, patrolled by guards, in remote places across the country, and those buildings have to be big enough to hold the data infrastructure. Indeed, the NSA is building a massive facility in Utah. But just how big, physically, is the NSA's privacy invasion? We decided to try and figure that out.
To answer that question, we first needed to answer three other questions. What information is being collected in the surveillance operations? How much of that information is the NSA housing? And how much space would saving that much information actually take up? What we learned from talking to a variety of experts is that the calculus is not simple, and any answers are largely estimates. But we got answers.
What information is being stored?
Early last month, even while he was finalizing his discussions with Edward Snowden, The Guardian's Glenn Greenwald reported on a conversation between Tim Clemente, a former FBI agent, and CNN host Carol Costello. In the interview about the Boston Marathon investigation, as seen at right, Clemente makes the claim that "all digital communications are — there's a way to look at digital communications in the past." Costello refers to a previous appearance in which Clemente claimed the government could access phone calls, even "exactly what was said in that conversation."
This is an important claim for two reasons. The first is that Clemente, who also served on the FBI's Joint Terrorism Task Force, suggests a massive breadth of information collection. The second is that he doesn't say who is actually collecting the data, which we'll come back to.
Clemente indicates that entire phone calls are being recorded and stored, which is a far stronger claim than that Verizon is sharing metadata with the government, both from a privacy standpoint and for our calculations. Metadata on a call (the number from which it originated, the number it was placed to, duration, location information) is tiny. Perhaps a few hundred bytes could contain all of it. But a call is much larger, and as the call goes on, the amount of storage space it takes up increases dramatically, running into multiple megabytes. Same thing with email: a text email message is small; embed a photo, and it gets much bigger; embed a video, and it gets much, much, much bigger.
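To put rough numbers on that gap, here is a minimal sketch. Both inputs are assumptions for illustration only: 300 bytes stands in for "a few hundred bytes" of metadata, and 64 kilobits per second is an assumed telephone-grade audio bitrate, not a figure from any of the reporting.

```python
# All figures here are illustrative assumptions, not measurements.
METADATA_BYTES = 300             # stand-in for "a few hundred bytes" per call
AUDIO_BITS_PER_SECOND = 64_000   # assumed telephone-grade audio bitrate


def call_audio_bytes(minutes: float) -> float:
    """Storage needed for the audio of one call at the assumed bitrate."""
    return minutes * 60 * AUDIO_BITS_PER_SECOND / 8


for minutes in (1, 5, 30):
    audio = call_audio_bytes(minutes)
    print(f"{minutes:>2} min call: {audio / 1e6:.2f} MB of audio, "
          f"{audio / METADATA_BYTES:,.0f}x the metadata")
```

Under those assumptions a five-minute call is already thousands of times larger than its metadata, which is why the distinction matters so much for the storage math.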
So if Clemente is right, and the government has access to "all digital communications" — videos, calls, audio recordings, emails, photos — that's taking up a lot of physical space somewhere. Which brings us to the second reason Clemente's claim is important, and to our second question.
How much of that information is the NSA housing?
According to Cisco, North Americans move 13.1 exabytes around the Internet each month. You're familiar with kilobytes, megabytes, gigabytes. You're maybe familiar with terabytes, the next largest unit of electronic storage. Each unit is about 1,000 times larger than the last. Next come petabytes, 1,000 times a terabyte. Then we get to exabytes, 1,000 times a petabyte. Put another way, 13.1 exabytes is the equivalent of 4,367,000,000,000 song files. That's what we move monthly. It's not the same figure as what is stored, of course, but it gives some sense of scale. A healthy percentage of that 13.1 exabytes must necessarily exist on servers around the world.
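The unit ladder and the song comparison can be checked with a few lines. This is only a sketch: it assumes decimal units (each exactly 1,000 times the last), and the per-song size is derived from the two figures quoted, not stated anywhere in the source.

```python
# Decimal storage units, each 1,000 times the previous one (an assumption;
# some tools use 1,024 instead).
UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte", "petabyte", "exabyte"]
BYTES_PER = {name: 1000 ** (i + 1) for i, name in enumerate(UNITS)}

monthly_traffic_bytes = 13.1 * BYTES_PER["exabyte"]  # Cisco figure quoted above
song_files = 4_367_000_000_000                       # song count quoted above

# Implied size of one song file; derived here, not given in the text.
implied_song_bytes = monthly_traffic_bytes / song_files
print(f"{implied_song_bytes / BYTES_PER['megabyte']:.2f} MB per song")
```

The comparison works out to roughly three megabytes per song.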
At the time of Greenwald's report on Clemente, it seemed incredible, impossible even, that the government would be storing anything close to a repository of that size. Since The Guardian's Snowden reports came out over the last week, the world has realized that it might be entirely possible. The government's PRISM system, we have learned, apparently gives the NSA a way to see data on private company servers. In other words, the NSA may not need to store the petabytes of content that people create. It just needs, in Clemente's words, "a way to look" at it. Which makes sense. For the NSA to keep copies of everything Google and Facebook and Apple and Microsoft store for their customers in the cloud (on their servers), the NSA would need to at least match that storage capacity. For every gigabyte Facebook stores, the NSA would need to store it, too. Having Facebook store it for them makes much more sense, if Facebook keeps things around long enough.
Let's assume, then, that the amount of information the NSA stores is limited to the phone call metadata, emails, and documents related to ongoing investigations, and so on. We actually have some sense of what that scale might be: the leaked Boundless Informant system suggested that the NSA collected 97 billion pieces of intelligence around the world in March of 2013. Since that system only manages metadata, we can assume they're fairly small, perhaps snippets of text of around 100 kilobytes. If so, Boundless Informant stored 9.7 petabytes that month alone. That data has to go somewhere.
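The arithmetic behind that figure is short enough to spell out. A sketch, with the caveat that the 100-kilobyte record size is the guess made above and that decimal units are assumed:

```python
records_per_month = 97_000_000_000   # Boundless Informant count, March 2013
bytes_per_record = 100 * 1000        # assumed 100 KB per record (a guess)
PETABYTE = 1000 ** 5                 # decimal petabyte

monthly_pb = records_per_month * bytes_per_record / PETABYTE
yearly_pb = monthly_pb * 12          # if every month looked like March

print(f"{monthly_pb:.1f} PB per month, {yearly_pb:.1f} PB per year")
```

Halve or double the assumed record size and the totals move in lockstep, which is worth keeping in mind for everything that follows.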
On its website, the NSA presents its vision of how it wants to manage information: the Global Information Grid. More of a concept than an actual technical plan, the GIG is a "net-centric system operating in a global context to provide processing, storage, management, and transport of information." In other words, it's a broadly distributed system of servers throughout the world: the NSA cloud. Once such a system exists, Facebook, Apple, and the other companies involved in PRISM can become nodes on that network. If the system exists, they probably already are; The Week's Marc Ambinder explains how it likely works. But the tech companies continue to deny it, obliquely.
How much space would saving that much information take up?
We reached out to two of those companies, given that these are also organizations that run massive datacenters. One didn't respond; one didn't want to talk on the record. Few companies share specifics about the datacenters they operate. But we can get a general sense of scale from what is publicly available.
Google, for example, operates 13 datacenters around the world. In January 2012, James Hearn, CTO for The Local, estimated that Google would be running around 2.3 million servers by early this year. He based that estimate on several factors, including the size of Google's facilities and internal images of the server space.
Hearn acknowledges that the estimate is almost certainly incorrect. Nearly everyone we spoke with, including that tech company that wouldn't go on the record, assured us that the variability in technology, operating system, and physical structure made any attempt to estimate the relationship between storage capacity and facility size very difficult.
Yevgeniy Sverdlik, North America editor of Datacenter Dynamics Focus, a trade magazine, made that point repeatedly in a phone interview with The Atlantic Wire on Monday afternoon. "Based on the size of the building? No way," he said. "You can't tell even if you're inside the building." Companies like Facebook and Google "design their own servers, they design their own network switches." (Facebook, in fact, is part of a group, the Open Compute Project, aimed at providing an open source system to build datacenters that use less energy and increase speed.)
The NSA doesn't design its own software or hardware, apparently. In an FAQ released last June, the agency explained its push to use off-the-rack parts.
Although NSA’s strategy for protecting classified information continues to employ both commercially-based and traditional Government-Off-The-Shelf (GOTS) solutions, IAD will look first to commercial technology and commercial solutions in helping customers meet their needs for protecting classified information while continuing to support customers with existing GOTS solutions or needs that can only be met via GOTS.
So what's the highest-density system you can make with commercial products?
The HP ProLiant DL580 has eight hard drive slots. HP makes 2.4-terabyte solid-state drives for those slots. We'll round up to 2.6 terabytes given the Kryder's Law expectation that storage density increases extremely quickly. (Thanks to the Atlantic Media Company's CTO for his advice on this. He also warned about the imprecision of linking storage with footprint.)
By that calculation, a 27.6-by-19-inch box, seven inches high, can hold about 21 terabytes. For 9.7 petabytes per month, our estimate of the Boundless Informant collection, that requires only 466 of these boxes. For a year of 9.7-petabyte data collection, we're at 5,600 boxes. If you stack those boxes 5 feet high (eight boxes), we're talking about a total physical footprint of 2,410 square feet. This is both a high and a low estimate. It is high because that's a lot of power packed into a very small space, and heat can be a serious problem. It is low because you must include space for power cords, cooling systems, pathways, and likely more.
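For readers who want to rerun the box math, here is a rough sketch using only the inputs named above: eight 2.6-terabyte drives per box, 9.7 petabytes a month, stacks of eight, and a 27.6-by-19-inch footprint per stack. Decimal units are an assumption, as is rounding the stack count up.

```python
import math

TERABYTE = 1000 ** 4
PETABYTE = 1000 ** 5

drives_per_box = 8
tb_per_drive = 2.6                   # 2.4 TB rounded up, as in the text
box_capacity = drives_per_box * tb_per_drive * TERABYTE   # about 21 TB

monthly_data = 9.7 * PETABYTE
boxes_per_month = monthly_data / box_capacity
boxes_per_year = boxes_per_month * 12

boxes_per_stack = 8                  # eight 7-inch boxes, roughly 5 feet
stacks = math.ceil(boxes_per_year / boxes_per_stack)
box_area_sqft = (27.6 * 19) / 144    # square inches to square feet
footprint_sqft = stacks * box_area_sqft

print(f"{boxes_per_month:.0f} boxes per month, {boxes_per_year:.0f} per year")
print(f"{stacks} stacks covering about {footprint_sqft:,.0f} square feet")
```

Run as written, this lands at about 2,550 square feet, a little above the 2,410 quoted. The gap presumably comes from rounding at intermediate steps that the calculation above does not spell out; either way the answer is a footprint of a few thousand square feet.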
It is also low because storage of 116 petabytes a year is a conservative estimate. In 2010, Google's Eric Schmidt indicated that humanity was creating five exabytes of data every two days. The NSA doesn't capture or store anywhere near all of that, but the amount it does collect will almost necessarily only increase.
We know that the NSA will soon have far, far more than 2,410 square feet of space. It is nearing completion on a facility in Utah that will offer 100,000 square feet of its 1,000,000-square-foot area for servers. Just last month, the agency began building a new data facility at its home in Fort Meade. That facility, when complete in 2016, will add another 70,000 square feet of space. Using our calculation of server space above (116 petabytes per 2,410 square feet), even if we assume an equivalent amount of space needed for support infrastructure, those 170,000 square feet could host about 4.09 exabytes. And that's only new capacity. The NSA also has existing storage facilities of unknown scale. The New York Times reported that those include facilities of some sort in Virginia, New Jersey, Georgia, and San Francisco.
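That scaling step can be reproduced directly. The sketch below takes the article's own ratio of 116 petabytes per 2,410 square feet at face value and treats "an equivalent amount of space" for support infrastructure as meaning half the floor holds servers; the variable names are illustrative.

```python
new_floor_sqft = 100_000 + 70_000    # Utah plus Fort Meade server space
usable_fraction = 0.5                # other half assumed to be support space
pb_per_block = 116                   # petabytes per block, from the text
block_sqft = 2_410                   # square feet per block, from the text

capacity_pb = new_floor_sqft * usable_fraction / block_sqft * pb_per_block
capacity_eb = capacity_pb / 1000     # decimal exabytes

print(f"about {capacity_eb:.2f} exabytes of new capacity")
```

Because the result is linear in every input, any error in the density estimate carries straight through to the total.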
The vagaries of this dataserver calculation, about which we were warned over and over, mean that it should be considered a rough sketch at best. But the core point stands: The NSA likely has access to a massive amount of information, much of it residing on remote servers hosted by acquiescent technology companies. What the NSA needs to house itself, it probably can today and certainly can when its Fort Meade and Utah facilities come online.
Are you being watched? Ask Agent Clemente. Could the NSA be tracking unimaginably large amounts of data? Yes.
Update: James Offer created some clever visualizations to suggest just how much storage space we're talking about.
Images, from top:
An empty server rack at a non-governmental facility in New York, via AP.
Boundless Informant screenshot, via The Guardian.
Global Information Grid image, via NSA.
Construction of the NSA's Utah facility, via AP.