Every day, researchers add hundreds of new papers to ArXiv, the massive public database of scientific writing and research.

And with each new work, a special detection system goes hunting through ArXiv for bits of text it has seen before. How it works: An algorithm takes what ArXiv founder Paul Ginsparg calls a "textual fingerprint" of each incoming document, then compares that fingerprint to all the other fingerprints in the database. "The algorithm is such that it can compare over 500 new articles per day to the roughly 1 million already in the database in a matter of seconds," Ginsparg told me in an email.
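
Ginsparg's description does not spell out the algorithm, but the general idea of fingerprinting a document and checking it against an index can be sketched in a few lines of Python. The sketch below assumes one common approach: hash every run of seven consecutive words and look those hashes up in a table of everything seen so far. The names `fingerprint` and `FingerprintIndex`, the seven-word window, and the use of SHA-1 are illustrative choices, not details of ArXiv's actual system.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text, n=7):
    """Return the set of hashes of every run of n consecutive words.

    Assumption: word n-grams stand in for the "textual fingerprint";
    the real system may use a different scheme.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {hashlib.sha1(g.encode("utf-8")).hexdigest()[:16] for g in grams}


class FingerprintIndex:
    """Hypothetical inverted index: n-gram hash -> documents containing it."""

    def __init__(self, n=7):
        self.n = n
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Record every n-gram hash of a document already in the database."""
        for h in fingerprint(text, self.n):
            self.postings[h].add(doc_id)

    def overlaps(self, text):
        """Map each indexed document to the share of the new text's n-grams it contains."""
        fp = fingerprint(text, self.n)
        if not fp:
            return {}
        counts = defaultdict(int)
        for h in fp:
            for doc_id in self.postings.get(h, ()):
                counts[doc_id] += 1
        return {doc_id: c / len(fp) for doc_id, c in counts.items()}


if __name__ == "__main__":
    index = FingerprintIndex(n=3)
    index.add("paper-1", "the quick brown fox jumps over the lazy dog")
    index.add("paper-2", "completely unrelated words appear in this sentence here")
    print(index.overlaps("the quick brown fox jumps over the lazy dog"))
```

Because each lookup touches only the documents that share a particular hash, a new paper is never compared word by word against every paper in the database, which is one plausible reason a check like this can run in seconds.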

And matches pop up routinely. About 3 percent of monthly submissions—around 250 papers—are flagged for re-used text. That adds up to thousands of papers per year. Ginsparg wanted to learn more about what, exactly, was going on. So he and a colleague recently examined text re-use among hundreds of thousands of papers posted to ArXiv over a period of two decades. (Naturally, their findings are themselves available via the database.)

Ginsparg and co-author Daniel Citron started with some basic questions. Where in the world did researchers most often copy the work of others? And how often were people straight-up plagiarizing versus quoting heavily—but still citing—someone else's stuff? What they found surprised them. For one thing, many of the researchers who re-used a significant amount of text from others seemed to make up little sub-communities of authors who all frequently cited one another. It's difficult to know what to make of these networks, Ginsparg said.

"It's not high-quality research, usually 'under the radar,'" he told me. "Sometimes it's out-of-the-mainstream researchers (developing world, etc.) doing the best that they can. Other times, it's so extreme, the impression is they're being consciously duplicitous to pad publication records."

There's perhaps some reassurance in the finding that the papers that re-use text the most tend to be cited by others the least. In other words, the works that re-use the most text aren't the most influential, Ginsparg found. But this inverse relationship may also signal that some serial text re-users are getting away with intellectual theft.

Of course, not all text re-use is wrong. (Consider the form of text re-use in this article: I'm quoting from Ginsparg's paper.) But Ginsparg reminded me that the threshold ArXiv uses for flagging work is "incredibly lenient," allowing for as much as 20 percent self-copying from previous articles, or "multiple sentences verbatim before being flagged" for re-using the text from another person's work, he said. A paper that was flagged this week, for instance, included "multiple paragraphs verbatim each from at least 10 different sources by other authors," Ginsparg said. "It's all cited, but still sloppy to copy paragraphs verbatim from other sources."

Text re-use also seems to happen more in some countries than others, a finding that reflects differences in academic cultures and the likelihood that non-native English speakers may rely more heavily on quoting others when writing in English. But there are several overlapping factors at play. From Ginsparg's paper: "Many students from non-Western cultures had never before heard the word 'plagiarism,' and in some cultures it is considered disrespectful to rewrite another author’s words." (Work originating from the following countries had the highest percentages of flagged submissions: Bangladesh, Belarus, Bulgaria, Colombia, Cyprus, Egypt, Iran, Jordan, Kazakhstan, Kyrgyzstan, Latvia, Luxembourg, Micronesia, Moldova, Pakistan, Saudi Arabia, and Uzbekistan.)

There are also cultural differences from one discipline to the next. "For example, in math it is perfectly acceptable to restate a few paragraphs of a theorem without putting it in direct quotes," Ginsparg told me, "whereas there is no direct analog in physics."

In other cases, researchers justify re-using the work of others based on an arbitrary framework of their own. Again, Ginsparg: "I remember once emailing a researcher out of curiosity as to why his intro was taken verbatim from Wikipedia without attribution. His response is still interesting to parse: 'Well if it had been from an article, I certainly would have cited it, but that is not necessary for collectively produced material.'"

One of the biggest surprises of all was the kind of material people were copying. Text re-use wasn't limited to research. Borrowed phrases pop up all over acknowledgment sections. "How can one not be original enough to figure out how to thank people?" Ginsparg asked.

Consider some of the examples he found, like this acknowledgment:

I cannot describe how indebted I am to my wonderful girlfriend, Amanda, whose love and encouragement will always motivate me to achieve all that I can. I could not have written this thesis without her support; in particular, my peculiar working hours and erratic behaviour towards the end could not have been easy to deal with!

And this one:

I cannot describe how indebted I am to my wonderful wife, Renata, whose love and encouragement will always motivate me to achieve all that I can. I could not have written this thesis without her support; in particular, my peculiar working hours and erratic behaviour towards the end could not have been easy to deal with!

(Using someone else's "thank you" language could lead to problems beyond plagiarism, Ginsparg points out, if a researcher picks up someone else's text but fails to swap in the name of his or her partner.)

In general, the most prominent writers and researchers are not the ones re-using text—their own or the work of others. "We suspect that such researchers have little interest in retreading the same intellectual territory," Ginsparg wrote, "much less reusing their own or others' material verbatim."