In their approach, the researchers feed the original Census data, which is kept confidential, into a complex statistical model that generates a simulated population with the same general features as the original data. If you have a confidential dataset of 100 individuals' ages and incomes, for example, a corresponding synthetic dataset composed of 100 imaginary individuals would have the same mean age and mean income as the original. One of the major challenges is to create synthetic data that are statistically identical to, but not an exact replica of, the original data.
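The 100-person example can be sketched in a few lines of Python. This is only a toy illustration, not the researchers' model: the function name `make_synthetic`, the made-up "confidential" records, and the choice to draw each column from an independent normal distribution are all assumptions made for the sketch. The real statistical model is described only as complex and would preserve far more than two means.

```python
import random
import statistics

def make_synthetic(records, seed=0):
    """Draw imaginary (age, income) records from a toy model fitted to the originals.

    Hypothetical helper for illustration; not the Census Bureau's method.
    """
    rng = random.Random(seed)
    ages = [age for age, _ in records]
    incomes = [income for _, income in records]
    # Fit the toy model: one mean and one standard deviation per column.
    age_mu, age_sd = statistics.mean(ages), statistics.stdev(ages)
    inc_mu, inc_sd = statistics.mean(incomes), statistics.stdev(incomes)
    # Sample one imaginary individual per original record.
    # Columns are drawn independently, so any age/income relationship is lost;
    # a realistic model would have to preserve it.
    draws = [(rng.gauss(age_mu, age_sd), rng.gauss(inc_mu, inc_sd)) for _ in records]
    # Shift each column so the synthetic means match the original means.
    d_age = age_mu - statistics.mean(a for a, _ in draws)
    d_inc = inc_mu - statistics.mean(i for _, i in draws)
    return [(a + d_age, i + d_inc) for a, i in draws]

# Made-up stand-in for a confidential dataset of 100 individuals.
_rng = random.Random(42)
confidential = [(_rng.randint(18, 90), _rng.randint(10_000, 150_000)) for _ in range(100)]
synthetic = make_synthetic(confidential)
```

Here `synthetic` holds 100 imaginary individuals whose mean age and mean income match those of `confidential` up to floating-point rounding, while the individual records themselves are fresh random draws rather than copies.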
“Any query that can be asked of the confidential data can also be asked of the synthetic data,” Abowd said. Because the synthetic data represent imaginary individuals, there is low risk in making synthetic data public.
The synthetic data are often used to develop and test computer code for analyses. But ultimately, any analysis on synthetic data needs to be verified on the original dataset. So, the researchers developed a "verification server"—an intermediary computer—to perform the same analysis on the original confidential data. The verification step determines whether the results of the analysis on synthetic data are also true for the original data. “The validation is a way of making sure that the assumptions that were built into the synthetic data are not driving the results, as opposed to the thing that the person is trying to study,” said Levenstein.
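The verification step can be pictured as running one analysis twice and comparing the answers. The sketch below assumes a lot: the names `mean_income` and `verify`, the 5 percent tolerance, the tiny made-up datasets, and the idea of returning only an agree-or-disagree flag are all hypothetical, since the article does not describe what the real server reports back.

```python
import statistics

def mean_income(records):
    """The analysis under study: here, simply the mean of the income column."""
    return statistics.mean(income for _, income in records)

def verify(analysis, synthetic, confidential, rel_tol=0.05):
    """Run the same analysis on both datasets and say whether the results agree.

    Hypothetical sketch: in practice only the intermediary server would
    ever touch the confidential records.
    """
    synthetic_result = analysis(synthetic)
    confidential_result = analysis(confidential)
    gap = abs(synthetic_result - confidential_result)
    # Agreement means the gap is within rel_tol of the confidential result.
    return gap <= rel_tol * abs(confidential_result)

# Made-up (age, income) records for illustration only.
confidential_rows = [(34, 52000), (51, 61000), (29, 48000), (62, 75000)]
synthetic_rows = [(36, 50500), (49, 63000), (31, 47000), (60, 74500)]
# A deliberately poor synthetic dataset with every income doubled.
distorted_rows = [(age, 2 * income) for age, income in synthetic_rows]

agrees = verify(mean_income, synthetic_rows, confidential_rows)
agrees_distorted = verify(mean_income, distorted_rows, confidential_rows)
```

With these numbers the first check passes and the second fails, which is the kind of signal a researcher would want before trusting a result obtained from synthetic data alone.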
The Synthetic Longitudinal Business Database (SynLBD), released in 2011, is the result of their work, and the first ever record-level database on business establishments released by the Census Bureau. The Census Bureau collects information about businesses—the value of their output, how many employees they have, how much they spend on research and development, and so on. For businesses, privacy is important mainly because of strategic concerns. They might not want their competitors, customers, or suppliers to know exactly what is going on with their business, Levenstein explained.
The identity of businesses is hard to disguise by simply adding noise to a dataset. “Businesses are very different from one another,” Levenstein said. “You cannot hide General Motors or Walmart in a dataset. It’s too hard to anonymize the data in a way that would still make them useful. If you did enough masking, you’d be masking what’s important about employment and economic output in America. So you can’t do that.” Instead, the research team created the SynLBD, a database of synthetic data about businesses, which allows researchers to develop a better understanding of entrepreneurship, and to study the dynamics of the American economy—and what is causing it to grow or not—without revealing confidential information about individual businesses.
The team also created a database called the Survey of Income and Program Participation (SIPP) Synthetic Beta Data Product, which allows researchers to do important analyses about food security, poverty, income inequality, and other issues, Levenstein explained. The (nonsynthetic) Survey of Income and Program Participation has been going on for about 40 years, she said. “If you have that kind of information over a long period of time for a person, it increases the probability that the person could be re-identified. So we have created a synthetic version of SIPP.” The synthetic database allows any researcher in the community to study important questions that can have implications for government programs such as food stamps. Without the synthetic data, much of this research would be logistically difficult or impossible. “The realistic alternative to publishing the SIPP synthetic data is suppression (no publication of any form of the linked administrative data) with individual researchers proposing projects on a one-by-one basis for access to the confidential data,” said Abowd. “Those projects would have to be approved by the Census, Internal Revenue Service and Social Security Administration.”