An 'Economic Burden' Google Can No Longer Bear?

By James Fallows

This is insider-tech talk, but I think it is very interesting in its implications -- about language, "big data," Google's strategies, and the never-ending recalibration of goods vs bads, "signal to noise," on the internet.

[Brief summary of what follows: Google is dropping an automatic-translation tool, because overuse by spam-bloggers is flooding the internet with sloppily translated text, which in turn is making computerized translation even sloppier.]

There has been a rumble in the tech world about Google's announcement last month that it was "deprecating," and phasing out, its "Translate API." In simplest terms, that means website developers will no longer be able to use code that makes Google's translation algorithms automatically provide material for other sites. The standalone Google Translate site, which allows you to enter text or URLs for translation, will remain (along with some other features that apply Google translations to others' sites). But as an announcement on the Translate API site said:
 
[Screenshot of the deprecation notice posted on the Google Translate API site]

For a very, very detailed explication of what this "economic burden" might mean for Google, check this analysis from the eMpTy Pages site on translation technology and related topics. Here is the part of the explanation that, for me, had the marvelous quality of being obvious -- once it's pointed out -- and interesting too:

The intriguing problem is that overuse of automatic translation can make it harder for automatic translation ever to improve, and may even be making it worse. As people in the business understand, computerized translation relies heavily on sheer statistical correlation. You take a huge chunk of text in one language; you compare it with a counterpart text in a different language; and you see which words and phrases match up. The computer doesn't have to "understand" either language for this to work. It just notices that the English words "good" or "goods" show up as bon in French in certain uses (i.e., as in "opposite of bad"), but as a variety of other French words depending on the English context -- "dry goods," "I've got the goods," "good grief," etc.
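To make that concrete, here is a toy sketch of the statistical idea -- nothing like Google's actual system, and built on a made-up three-sentence "corpus" -- showing how simple co-occurrence counts across aligned English-French sentence pairs can pick out bon as the likeliest match for "good" without the program understanding either language:

    from collections import defaultdict

    # A toy illustration of the statistical idea (nothing like Google's real
    # pipeline): count how often each English word appears alongside each
    # French word in aligned sentence pairs, then guess the most frequent match.

    parallel_corpus = [
        ("the wine is good", "le vin est bon"),
        ("a good book", "un bon livre"),
        ("the store sells dry goods", "le magasin vend de la mercerie"),
    ]

    cooccurrence = defaultdict(lambda: defaultdict(int))
    for english, french in parallel_corpus:
        for en_word in english.split():
            for fr_word in french.split():
                cooccurrence[en_word][fr_word] += 1

    def best_guess(en_word):
        """Return the French word seen most often alongside en_word."""
        candidates = cooccurrence.get(en_word)
        if not candidates:
            return None
        return max(candidates, key=candidates.get)

    print(best_guess("good"))   # -> 'bon', the word it co-occurs with most often

With only three sentence pairs the guesses are shaky; the whole approach lives or dies on how much good parallel text it can count over, which is where the next point comes in.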

Crucially, this process depends on "big data" for its improvement. The more Rosetta Stone-like side-by-side passages the system can compare, the more refined and reliable the correlations become. Day by day and comparison by comparison, the translation should only get better, so that some day, in principle, we could understand anything written in any language without knowing that language ourselves.


UNLESS ... the side-by-side texts used to "train" the system aren't any more accurate and nuanced than what the computer already knows. That is the problem with a rapidly increasing volume of machine-translated material. These computerized translations are better than nothing, but at best they are pretty rough. Try it for yourself: Go to the People's Daily Chinese-language home site; plug any story's URL (for instance, this one) into the Google Translate site; and see how closely the result resembles real English. You will get the point of the story, but barely. Moreover, since these side-by-side versions reflect the computerized system's current level of skill, by definition they offer no opportunity for improvement.

That's the problem. The more of this auto-translated material floods onto the world's websites, the smaller the proportion of good translations the computers can learn from. In engineering terms, the signal-to-noise ratio is getting worse. It's getting worse faster in part because of the popularity of Google's Translate API, which allows spam-bloggers and SEO operations to slap up the auto-translated material in large quantities. This is the computer-world equivalent of sloppy overuse of antibiotics creating new strains of drug-resistant bacteria. (Or GIGO -- Garbage In, Garbage Out -- as reader Rick Jones mentioned.) As the eMpTy Pages analysis describes the problem, using another analogy (emphasis added):
>>Polluting Its Own Drinking Water
...An increasing amount of the website data that Google has been gathering has been translated from one language to another using Google's own Translate API. Often, this data has been published online with no human editing or quality checking, and is then represented as high-quality local language content....

It is not easy to determine if local language content has been translated by machine or by humans or perhaps whether it is in its original authored language. By crawling and processing local language web content that has been published without any human proof reading after being translated using the Google Translate API, Google is in reality "polluting its own drinking water."...

The increasing amount of "polluted drinking water" is becoming more statistically relevant. Over time, instead of improving each time more machine learning data is added, the opposite can occur. Errors in the original translation of web content can result in good statistical patterns becoming less relevant, and bad patterns becoming more statistically relevant. Poor translations are feeding back into the learning system, creating software that repeats previous mistakes and can even exaggerate them.<<
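To see why that feedback loop matters, here is a back-of-the-envelope simulation -- the decay model and every number in it are invented purely for illustration -- in which a translation system is retrained each round on a mix of human translations and its own unedited output. As the machine-translated share of the training mix grows, improvement stalls and then reverses:

    # A back-of-the-envelope simulation of the feedback loop described above.
    # The decay model and every number here are invented for illustration only.

    def retrain(current_error, machine_share, human_error=0.05):
        """One round of retraining on a mix of human and machine translations.

        Human translations pull the system toward human_error; unedited
        machine-translated pages merely echo (and slightly worsen) the
        system's current error rate.
        """
        machine_error = current_error * 1.2   # raw MT output is worse than edited text
        training_mix = machine_share * machine_error + (1 - machine_share) * human_error
        return 0.5 * current_error + 0.5 * training_mix   # move halfway toward the mix

    error = 0.30
    for round_num, machine_share in enumerate([0.1, 0.3, 0.5, 0.7, 0.9], start=1):
        error = retrain(error, machine_share)
        print(f"round {round_num}: {machine_share:.0%} machine text -> error rate {error:.3f}")

    # The error rate falls at first, stalls, then starts climbing once
    # machine-translated text dominates the training data -- the
    # "polluted drinking water" effect.

The specific numbers mean nothing; the shape of the curve is the point.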
That's all I have about this story, which I offer because it reveals a problem I hadn't thought of -- and illustrates one more under-anticipated turn in the evolution of the info age. The very tools that were supposed to melt away language barriers may, because of the realities of human nature (i.e., blog spam) and the intricacies of language, actually be re-erecting some of those barriers. For the foreseeable future, it's still worth learning other languages.
