An 'Economic Burden' Google Can No Longer Bear?

This is insider-tech talk, but I think it is very interesting in its implications -- about language, "big data," Google's strategies, and the never-ending recalibration of goods vs bads, "signal to noise," on the internet.

[Brief summary of what follows: Google is dropping an automatic-translation tool, because overuse by spam-bloggers is flooding the internet with sloppily translated text, which in turn is making computerized translation even sloppier.]

There has been a rumble in the tech world about Google's announcement last month that it was "deprecating," and phasing out, its "Translate API." In simplest terms that means that website developers will no longer be able to use code that makes Google's translation algorithms automatically provide material for other sites. The standalone Google Translate site, which allows you to enter text or URLs for translation, will remain (along with some other features that apply Google translations to others' sites). But as an announcement on the Translate API site said:

For a very, very detailed explication of what this "economic burden" might mean for Google, check this analysis from the eMpTy Pages site on translation technology and related topics. Here is the part of the explanation that, for me, had the marvelous quality of being obvious -- once it's pointed out -- and interesting too:

The intriguing problem is the way that over-use of automatic translation can make it harder for automatic translation ever to improve, and may even be making it worse. As people in the business understand, computerized translation relies heavily on sheer statistical correlation. You take a huge chunk of text in one language; you compare it with a counterpart text in a different language; and you see which words and phrases match up. The computer doesn't have to "understand" either language for this to work. It just notices that the English words "good" or "goods" show up as "bon" in French in certain uses (i.e., as in "opposite of bad"), but as a variety of other French words depending on the context in English -- "dry goods," "I've got the goods," "good grief," etc.
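
To make the "sheer statistical correlation" idea concrete, here is a minimal sketch in Python. It is a toy, not a description of how Google's system works: the four sentence pairs, the function name best_matches, and the choice of the Dice coefficient as the matching score are all assumptions made for illustration. The program counts how often each English word shares a sentence pair with each French word, and picks the strongest match without "understanding" either language.

```python
from collections import Counter
from itertools import product

# Hypothetical toy corpus of side-by-side sentences (an assumption, not real training data).
PAIRS = [
    ("the good book", "le bon livre"),
    ("the good wine", "le bon vin"),
    ("the red wine", "le vin rouge"),
    ("the red book", "le livre rouge"),
]


def best_matches(pairs):
    """Map each source word to the target word it co-occurs with most strongly."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)   # sentences containing each source word
        tgt_count.update(tgt_words)   # sentences containing each target word
        joint.update(product(src_words, tgt_words))  # sentences containing both

    best = {}
    for (s, t), both in joint.items():
        # Dice coefficient: 1.0 when the two words always appear together.
        score = 2 * both / (src_count[s] + tgt_count[t])
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return {s: t for s, (t, _) in best.items()}


print(sorted(best_matches(PAIRS).items()))
```

On this tiny corpus the sketch pairs "good" with "bon" and "the" with "le". Raw co-occurrence counts alone would not separate them, since "le" appears alongside "good" just as often as "bon" does; the score works because "le" also appears in sentences where "good" does not.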

Crucially, this process depends on "big data" for its improvement. The more Rosetta Stone-like side-by-side passages the system can compare, the more refined and reliable the correlations will become. Day by day and comparison by comparison, the translation will only get better. Someday, in principle, we could understand anything written in any language, without knowing that language ourselves.

UNLESS ... the side-by-side texts used to "train" the system aren't any more accurate and nuanced than what the computer already knows. That is the problem with a rapidly increasing volume of machine-translated material. These computerized translations are better than nothing, but at best they are pretty rough. Try it for yourself: Go to the People's Daily Chinese-language home site; plug any story's URL into the Google Translate site; and see how closely the result resembles real English. You will get the point of the story, barely. Moreover, since these side-by-side versions reflect the computerized system's current level of skill, by definition they offer no opportunity for improvement.

That's the problem. The more of this auto-translated material floods onto the world's websites, the smaller the proportion of good translations the computers can learn from. In engineering terms, the signal-to-noise ratio is getting worse. It's getting worse faster in part because of the popularity of Google's Translate API, which allows spam-bloggers and SEO operations to slap up the auto-translated material in large quantities. This is the computer-world equivalent of sloppy overuse of antibiotics creating new strains of drug-resistant bacteria. (Or GIGO -- Garbage In, Garbage Out -- as reader Rick Jones mentioned.) As the eMpTy Pages analysis describes the problem, using another analogy (emphasis added):

>>Polluting Its Own Drinking Water
...An increasing amount of the website data that Google has been gathering has been translated from one language to another using Google's own Translate API. Often, this data has been published online with no human editing or quality checking, and is then represented as high-quality local language content....

It is not easy to determine if local language content has been translated by machine or by humans or perhaps whether it is in its original authored language. By crawling and processing local language web content that has been published without any human proof reading after being translated using the Google Translate API, Google is in reality "polluting its own drinking water."...

The increasing amount of "polluted drinking water" is becoming more statistically relevant. Over time, instead of improving each time more machine learning data is added, the opposite can occur. Errors in the original translation of web content can result in good statistical patterns becoming less relevant, and bad patterns becoming more statistically relevant. Poor translations are feeding back into the learning system, creating software that repeats previous mistakes and can even exaggerate them.<<
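
The feedback loop described in that passage can be illustrated with a deliberately crude numerical sketch. Everything in it is an assumption chosen for illustration, not a measurement: the starting corpus size, the quality figures, the number of machine-translated pages added per round, and the idea that a system learns slightly less than the average quality of what it reads (the learn_efficiency parameter). The function name simulate is likewise hypothetical.

```python
def simulate(rounds, human_docs=1000, human_quality=0.95,
             machine_docs_per_round=500, learn_efficiency=0.9):
    """Track a toy translator's accuracy as its own output joins the corpus.

    All parameters are illustrative assumptions, not real-world figures.
    """
    total_docs = human_docs
    total_quality = human_docs * human_quality  # sum of per-document quality
    history = []
    for _ in range(rounds):
        corpus_quality = total_quality / total_docs
        # The system learns a lossy copy of whatever its corpus contains.
        accuracy = learn_efficiency * corpus_quality
        history.append(accuracy)
        # Unedited machine output is published and crawled back into the corpus.
        total_docs += machine_docs_per_round
        total_quality += machine_docs_per_round * accuracy
    return history


for round_number, accuracy in enumerate(simulate(5)):
    print(round_number, round(accuracy, 3))
```

Under these assumptions each round's accuracy comes out lower than the last, because every batch of machine output is slightly worse than the corpus it was learned from and drags the average down. Setting machine_docs_per_round to 0 keeps accuracy flat, which is the "clean drinking water" case.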

That's all I have about this story, which I offer because it reveals a problem I hadn't thought of -- and illustrates one more under-anticipated turn in the evolution of the info age. The very tools that were supposed to melt away language barriers may, because of the realities of human nature (i.e., blog spam) and the intricacies of language, actually be re-erecting some of those barriers. For the foreseeable future, it's still worth learning other languages.

James Fallows is a national correspondent for The Atlantic and has written for the magazine since the late 1970s. He has reported extensively from outside the United States and once worked as President Carter's chief speechwriter. His latest book is China Airborne.

