The intriguing problem is the way that over-use of automatic translation can make it harder for automatic translation ever to improve, and may even be making it worse. As people in the business understand, computerized translation relies heavily on sheer statistical correlation. You take a huge chunk of text in one language; you compare it with a counterpart text in a different language; and you see which words and phrases match up. The computer doesn't have to "understand" either language for this to work. It just notices that the English words "good" or "goods" show up as "bon" in French in certain uses (i.e., in the sense of "opposite of bad"), but as a variety of other French words depending on the context in English -- "dry goods," "I've got the goods," "good grief," etc.
Crucially, this process depends on "big data" for its improvement. The more Rosetta Stone-like side-by-side passages the system can compare, the more refined and reliable the correlations will become. Day by day and comparison by comparison, the translation will only get better. Some day, in principle, we could understand anything written in any language, without knowing that language ourselves.
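For readers who like to see the idea in miniature, here is a rough sketch of the "see which words match up" step. It is only an illustration under loud assumptions: the four sentence pairs are invented, the function names are hypothetical, and the Dice coefficient is one simple association score chosen for clarity, not a description of how Google Translate or any real system is built.

```python
from collections import Counter
from itertools import product

# Toy parallel corpus, invented for illustration: (English, French) pairs.
PAIRS = [
    ("good wine", "bon vin"),
    ("good bread", "bon pain"),
    ("red wine", "vin rouge"),
    ("the goods", "les marchandises"),
]


def cooccurrence_scores(pairs):
    """Score every (English word, French word) pair with the Dice coefficient.

    The score is high when two words tend to appear in the same sentence
    pairs and rarely appear without each other.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }


def best_match(scores, word):
    """Return the French word most strongly associated with `word`."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get)


scores = cooccurrence_scores(PAIRS)
for english in ("good", "wine", "bread"):
    print(english, "->", best_match(scores, english))
```

Nothing in the sketch "understands" either language; it only counts. Note that "good" and "goods" are separate tokens with separate statistics, and that every extra sentence pair shifts the scores, which is why the quantity and quality of the side-by-side text matter so much.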
... the side-by-side texts used to "train" the system aren't any more accurate and nuanced than what the computer already knows. That is the problem with a rapidly increasing volume of machine-translated material. These computerized translations are better than nothing, but at best they are pretty rough. Try it for yourself: Go to the People's Daily Chinese-language home site; plug any story's URL (for instance, this one) into the Google Translate site; and see how closely the result resembles real English. You will get the point of the story, barely. Moreover, since these side-by-side versions reflect the computerized system's current level of skill, by definition they offer no opportunity for improvement.
That's the problem. The more of this auto-translated material floods onto the world's websites, the smaller the proportion of good translations the computers can learn from. In engineering terms, the signal-to-noise ratio is getting worse. It's getting worse faster in part because of the popularity of Google's Translate API, which allows spam-bloggers and SEO operations to slap up the auto-translated material in large quantities. This is the computer-world equivalent of sloppy overuse of antibiotics creating new strains of drug-resistant bacteria. (Or GIGO -- Garbage In, Garbage Out -- as reader Rick Jones mentioned.) As the eMpTy Pages analysis describes the problem, using another analogy (emphasis added):
>>Polluting Its Own Drinking Water
...An increasing amount of the website data that Google has been gathering has been translated from one language to another using Google's own Translate API. Often, this data has been published online with no human editing or quality checking, and is then represented as high-quality local language content.... [It] is not easy to determine if local language content has been translated by machine or by humans or perhaps whether it is in its original authored language. By crawling and processing local language web content that has been published without any human proofreading after being translated using the Google Translate API, Google is in reality "polluting its own drinking water."...
The increasing amount of "polluted drinking water" is becoming more statistically relevant. Over time, instead of improving each time more machine learning data is added, the opposite can occur. Errors in the original translation of web content can result in good statistical patterns becoming less relevant, and bad patterns becoming more statistically relevant. Poor translations are feeding back into the learning system, creating software that repeats previous mistakes and can even exaggerate them.<<
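The feedback loop in that passage can be put into a back-of-the-envelope model. What follows is a toy recurrence, not a measurement: the function name, the starting accuracy, the 0.95 "human accuracy," and the 0.90 "relearn fidelity" are all invented assumptions, chosen only to show the shape of the effect when a system retrains on a crawl that mixes human translations with its own earlier output.

```python
def simulate(human_share, rounds=20, start=0.70,
             human_accuracy=0.95, relearn_fidelity=0.90):
    """Toy model of retraining on a polluted crawl (all numbers invented).

    Each round, `human_share` of the training text is human-translated
    (accuracy `human_accuracy`); the rest is the system's own previous
    output, relearned with some loss (`relearn_fidelity` < 1).
    Returns the accuracy before training and after every round.
    """
    accuracy = start
    history = [accuracy]
    for _ in range(rounds):
        machine_share = 1.0 - human_share
        accuracy = (human_share * human_accuracy
                    + machine_share * relearn_fidelity * accuracy)
        history.append(accuracy)
    return history


for share in (0.8, 0.5, 0.2):
    final = simulate(share)[-1]
    print(f"human share {share:.0%}: accuracy settles near {final:.2f}")
```

Under these made-up numbers, a crawl that is mostly human-translated pulls the system above its starting point, while a crawl that is mostly machine output drags it below where it began. The less clean water in the mix, the lower the level the system settles at, which is the "signal-to-noise" point in a single line of arithmetic.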
That's all I have about this story, which I offer because it reveals a problem I hadn't thought of -- and illustrates one more under-anticipated turn in the evolution of the info age. The very tools that were supposed to melt away language barriers may, because of the realities of human nature (i.e., blog spam) and the intricacies of language, actually be re-erecting some of those barriers. For the foreseeable future, it's still worth learning other languages.