Every day, researchers add hundreds of new papers to ArXiv, the massive public database of scientific papers.
And with each new work, a special detection system goes hunting through ArXiv for bits of text it has seen before. How it works: An algorithm takes what ArXiv founder Paul Ginsparg calls a "textual fingerprint" of each incoming document, then compares that fingerprint to all the other fingerprints in the database. "The algorithm is such that it can compare over 500 new articles per day to the roughly 1 million already in the database in a matter of seconds," Ginsparg told me in an email.
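Ginsparg didn't spell out the system's internals to me, but textual fingerprinting schemes of this kind typically hash overlapping runs of words, called shingles, so that each paper boils down to a compact set of numbers. Here's a minimal sketch in Python; the seven-word shingle length, the choice of hash, and the function names are my illustrative assumptions, not ArXiv's actual recipe:

```python
import hashlib


def fingerprint(text: str, n: int = 7) -> set:
    """Hash every overlapping n-word shingle of a document.

    The shingle length (n=7) and the use of MD5 are illustrative
    assumptions, not details of ArXiv's actual system.
    """
    words = text.lower().split()
    shingles = set()
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i : i + n])
        # Keep 64 bits of the digest as one compact fingerprint entry.
        shingles.add(int(hashlib.md5(gram.encode()).hexdigest()[:16], 16))
    return shingles


def overlap(new_doc: set, old_doc: set) -> float:
    """Fraction of the new document's shingles already seen in the old one."""
    if not new_doc:
        return 0.0
    return len(new_doc & old_doc) / len(new_doc)
```

Because each paper reduces to a set of integers, checking a new submission is mostly cheap set-intersection work; in practice, a system like this would likely keep an inverted index mapping each hash to the papers containing it, so a new paper's shingles can be looked up once rather than compared against a million documents one by one. That is what makes the "matter of seconds" claim plausible.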
And matches pop up routinely. About 3 percent of monthly submissions—around 250 papers—are flagged for re-used text. That adds up to thousands of papers per year. Ginsparg wanted to learn more about what, exactly, was going on. So he and a colleague recently examined text re-use among hundreds of thousands of papers posted to ArXiv over a period of two decades. (Naturally, their findings are themselves available via the database.)
Ginsparg and co-author Daniel Citron started with some basic questions. Where in the world did researchers most often copy the work of others? And how often were people straight-up plagiarizing, versus quoting heavily but still citing someone else's work? What they found surprised them. For one thing, many of the researchers who re-used a significant amount of text from others seemed to form little sub-communities of authors who all frequently cited one another. It's difficult to know what to make of these networks, Ginsparg said.