Sciam.Mar06.Translation

The Elusive Goal of Machine Translation

Statistical methods hold the promise of moving computerized translation out of the doldrums

By Gary Stix

Scientific American, March 2006

atrium Nepal Asia legend: The lion, the sorceress, the evil spirit wardrobe “already lack” the evil spirit abstains the trilogy “rich in poetic and artistic flavor, also has not let” the Harley baud “the series novel have the infinite pleasure the undercurrent to be turbulent.

The preceding gibberish was brought to you by a Chinese-to-English translation carried out by Altavista’s Babelfish, the popular Internet-based translator. In coherent English, from a bilingual page on the Web site of Taiwan’s China Post, it reads: “The Chronicles of Narnia” doesn’t come near the poetic vision of “The Lord of the Rings” trilogy, and it doesn’t have the dark undercurrents that makes the “Harry Potter” series endlessly fascinating.

This passage illustrates that machine translation, or MT, as it is known, remains one of the more challenged subdisciplines of the blighted field of artificial intelligence. A proper name or a few well-crafted phrases suffice to throw the software off track. In the past few years, though, a new research approach has fueled a revival for machine translation: brute-force computing methods, which gauge the probability that a word or phrase in one language matches that in another, are at last bringing MT closer to human performance, in the estimation of developers of this software.

Tougher Than Chess

THE EVER INCREASING POWER of hardware and software algorithms today has propelled the computer past the chess grandmaster. (Recall that IBM’s Deep Blue supercomputer triumphed over Garry Kasparov in 1997.) But on the whole, machine translation has experienced only halting progress in achieving humanlike capabilities in its more than 50-year history, and some critics would classify even that characterization as overly generous.

In 1954 IBM and Georgetown University demonstrated the translation of more than 60 sentences from Russian into English. The IBM press release, dated January 8, 1954, glowed: “Russian was translated into English by an electronic ‘brain’ today for the first time.” The military defense community and computer scientists expected routine machine translation within five years, but it never materialized.

In 1966 the U.S. government-sponsored Automatic Language Processing Advisory Committee reported that humans could perform faster, more accurate translation at half the cost. “There is no immediate or predictable prospect of useful machine translation,” its study concluded.

Funding dried up, and only modest advances came in subsequent decades. In the late 1960s the U.S. Air Force supplied support to a small company that created the machine translator called Systran (the Internet version of which provided the first paragraph of this article) to cope initially with voluminous demands to translate Russian documents into English.

Systran is based on rules about the source and target languages, as was IBM’s original “brain” system, which relied on six rudimentary rules that govern syntax, semantics and the like. For example, the word “o” in Russian could be translated by an IBM 701 computer as either “about” or “of.” If “o” followed the word “nauka” (science), it looked for the appropriate rule that told it to translate “o” as “of”; in other words, the “science of,” not the “science about.”

The Paris-based Systran company ranks as the biggest machine translation company in the world. Even with customers that include Google, Yahoo and Time Warner’s AOL, its annual revenues were just $13 million for 2004, in an overall market for translations of all varieties that is estimated worldwide to total nearly $10 billion. “We’re so small, and we’re the largest,” says Dimitris Sabatakakis, Systran’s chairman and chief executive officer.

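The context rule for the Russian preposition “o” described earlier can be pictured as a small lookup. What follows is only a minimal sketch in Python, not a reconstruction of the IBM 701 program: the function name `translate_o`, the rule table and the fallback to “about” are assumptions made for illustration, and only the “nauka” rule itself comes from the article.

```python
# Toy context rule for the Russian preposition "o".
# Hypothetical sketch: the names and the default choice are assumptions,
# not the actual 1954 IBM/Georgetown rule set.

# Words that, when they come immediately before "o", select the rendering "of".
PRECEDING_WORDS_SELECTING_OF = {"nauka"}  # "nauka" = science (from the article)


def translate_o(previous_word: str) -> str:
    """Return "of" if the preceding word triggers the rule, else "about"."""
    if previous_word.lower() in PRECEDING_WORDS_SELECTING_OF:
        return "of"
    return "about"


if __name__ == "__main__":
    print(translate_o("nauka"))  # the rule fires
    print(translate_o("kniga"))  # no rule matches, so the default is used
```

Each further context that should produce “of” would need another hand-written entry in the table.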
No More Rules

FOR RULE-BASED SYSTEMS, language experts and linguists in specific languages have to painstakingly craft large lexicons and rules related to grammar, syntax and semantics to generate text in a target language. Commercial systems contain tens of thousands of grammar rules for a corpus that is made up of hundreds of thousands of words.

Beginning in the late 1980s, IBM created a system for translating French into English called Candide that required knowledge of neither grammar nor syntax. It eschewed rules in favor of taking substantial bodies of already translated text, matching words between the two languages (more recent systems use whole phrases) and finally deriving probabilities, based on Bayes’s theorem, to estimate whether an English word was a correct translation from the French. Another analysis that relied solely on large English texts assessed whether the word translated into English fit in grammatically with surrounding words.

The word or phrase in the target language accorded the highest probability could then be used to “decode” future texts, and multiple words could be linked to build entire documents. If the statistics showed that the word “poudrerie” usually equated to “blowing snow,” that, in principle, was all that was needed.

IBM eventually dropped its effort. At the end of the 1990s it could take an entire day for a machine translation of a single page. But then things began to stir. The Internet produced a rapid growth in the number of large, bilingual bodies of text. The Web also created demand for translation that could never be met by humans.

In 1999 the National Science Foundation held a workshop at Johns Hopkins University to construct a software tool kit that could be readily disseminated to the scientific community, an action that drew attention and spurred new activity.

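The statistical approach attributed to Candide earlier in this section combines two estimates: how likely the French word is given a candidate English rendering (learned from translated text), and how plausible that candidate is as English (learned from English-only text). Under Bayes's theorem, the decoder can pick the candidate that maximizes the product of the two. The Python sketch below is a toy illustration only: the probability values, the candidate list and the function name `decode` are invented for the example and are not Candide's actual parameters or code.

```python
# Toy noisy-channel decoder: choose the English candidate e that maximizes
# P(f | e) * P(e) for a French word f. Every number here is made up for
# illustration; a real system would estimate them from large corpora.
import math

# Translation model: P(french | english), estimated from already-translated text.
TRANSLATION_MODEL = {
    ("poudrerie", "blowing snow"): 0.70,
    ("poudrerie", "powder"): 0.20,
    ("poudrerie", "powder mill"): 0.10,
}

# Language model: P(english), estimated from English-only text.
LANGUAGE_MODEL = {
    "blowing snow": 0.02,
    "powder": 0.05,
    "powder mill": 0.001,
}


def decode(french: str) -> str:
    """Return the English candidate with the highest combined log-score."""
    best_candidate, best_score = None, -math.inf
    for (source, english), p_f_given_e in TRANSLATION_MODEL.items():
        if source != french:
            continue
        p_e = LANGUAGE_MODEL.get(english, 1e-9)  # small floor for unseen phrases
        # Sum of logs is equivalent to the product of the two probabilities.
        score = math.log(p_f_given_e) + math.log(p_e)
        if score > best_score:
            best_candidate, best_score = english, score
    if best_candidate is None:
        raise KeyError(f"no candidate translations for {french!r}")
    return best_candidate


if __name__ == "__main__":
    print(decode("poudrerie"))
```

With these invented numbers, “powder” is the more common English phrase, but the translation model favors “blowing snow” strongly enough for it to win on the combined score.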
In 2002 one of the workshop organizers, Kevin Knight of the University of Southern California, and Daniel Marcu, also at U.S.C., founded Language Weaver, the only statistical machine-translation company. It now claims to be capable of translating at least 5,000 words a minute back and forth between English and Arabic, Farsi, French, Chinese and Spanish.

STATISTICAL MACHINE TRANSLATION

Statistical methods have proved to be more effective than other types of automated machine translations based on rules crafted by human translators. The new methods take advantage of the brute-force calculating power of machines to crunch through existing translated texts to determine the probability that a word or a phrase in one language matches that in another.

INPUTTING ALREADY-TRANSLATED TEXTS
Existing translated texts from various sources form the foundation of the automated translations.

PREPROCESSING
The texts are scanned, aligned and formatted.
Example: “Que hambre tengo yo.” passes through the preprocessor and becomes “que hambre tengo yo”.

PHRASE MATCHING IN TRANSLATED TEXTS
A translation model picks out two- or three-word phrases from the source language (in this case, Spanish) that match the target language (English).
Source language (Spanish): Este guiso tradicional se ennoblece con el bogavante, la vieira y el rodaballo.
Target language (English): This traditional stew is refined with scallops, lobster and turbot.

TRANSLATION MODEL
Using statistics to measure how often and where words occur in a given phrase in both languages, the model derives a template for word reordering. It also takes advantage of other techniques, such as reducing multiple Spanish words to a single translated word (not shown).
Example: “Que hambre tengo yo” is matched against “hungry”.

LANGUAGE MODEL
Working from its own statistical analyses of English-only texts, a language model attempts to predict the most likely word and phrase ordering for the already-translated text. Greater frequency of a phrase's occurrence increases the probability that it is correct.
Sample English-only texts:
I am so pleased to meet you.
I am so baffled by Modern and Postmodern art.
The boy is so thirsty and the mother so sad.
“I am so hungry to see everything, and to know everything,” she said to herself.
He tried. “I am so hungry! Will you give me
I have so many people to thank.
What strength have I that I should endure?
Resulting preferences:
I am so > Have I that
I am so > I have so
So thirsty > Thirsty
Am so hungry > What hunger have

DECODER
When a new sentence gets inputted, one that can differ slightly or substantively from the text already processed (only sed substitutes for hambre here), the decoder develops several hypothetical translations and picks the one with the highest probability.
Input text: Que sed tengo yo.
Hypothetical translations: I am so thirsty / What thirst have I / Have I what thirst / Thirsty I am so
Output: I am so thirsty

Illustration: Lucy Reading-Ikkanda; sources: Kevin Knight, Language Weaver, and Philipp Koehn, University of Edinburgh

Google Is a Winner

ANOTHER ALUMNUS of both the workshop and U.S.C., Franz Och, was hired by Google. Last summer the still experimental Google system engineered by Och bested competitors such as IBM to win every category in a competition organized by the National Institute of Standards and Technology to translate 100 newswire documents from Arabic or Chinese into English. Och has mentioned that feeding the machine-translation software with text that equated to one million books was key to performance improvements. He contrasted Google’s current Chinese-to-English MT system (Systran) with the experimental statistical one crafted by him and his co-workers:
Google/Systran: “Doctor indicates, the bright kernel prearranges recuperates the about one month.”
Google Research: “Doctors said Akihito is scheduled to rest for about a month.”

The buzz about statistical machine translation has put Systran on the defensive. “You need rules when learning a foreign language,” Sabatakakis comments. “You don’t learn a language with statistical methods.” Systran uses statistical techniques when creating systems in very narrow domains, such as translating patent documents. But the current embrace of statistical methods is somewhat of a marketing technique, he says. The company still employs 50 people in research and development, among them linguists. “The major difference between Systran and Google is that Google claims that it doesn’t need native Chinese people to develop Chinese [applications] because of the magic and beauty of this stuff,” Sabatakakis says, adding, “If we don’t have some Chinese guys, our system may contain enormous mistakes.”

The distinction between the two camps has begun to blur a little as statistical MT researchers have started to incorporate techniques that account for the syntactical structure of a sentence. These methods forgo the intervention of a human linguist: a syntactic model might estimate the chance that an English adjective-noun phrase gets reordered after translation into French. Knight of Language Weaver says that relying on phrases instead of single words allows the statistics to deal with semantics as well, avoiding, for instance, having his surname translated as “Caballero.”

Microsoft Research has a substantial natural-language group, which for the past six years has also worked on MT. The group first focused on rule-based systems. But it is increasingly incorporating statistical techniques.
Recently Microsoft used primarily statistical approaches when translating its online customer-support Web sites into 12 new languages, including Russian, Arabic and Chinese. The text does not get edited afterward. “Some of it is admittedly pretty rough; other parts of it are quite good,” notes Steve Richardson, a senior researcher in the natural-language processing unit. “The quality of the more statistical approaches is comparable to or beginning to exceed that of the rule-based systems that we used before.”

Getting the Gist

ALL THESE TECHNIQUES, however, raise the question of whether the machine-translation equivalent of a Deep Blue, the IBM chess computer, will ever beat humans at their own game. Can a machine provide more than mere “gisting,” a rough idea of the contents of a foreign-language text? Kevin Hendzel, a spokesman for the American Translators Association, says that the current optimism only promulgates decades’ worth of overhyped claims: FAHQT, the idea of “fully automatic high-quality translation,” for instance. Gisting can help sort through massive amounts of foreign-language texts as long as it is understood to be inherently unreliable, he notes. Even a rough translation has its perils. He cites one Arabic-to-English translation that mentioned two sides “going at” each other, a fragment that caught the attention of security officials. The reference turned out to be for a soccer game, not a terrorist attack or imminent battle.

Keith Devlin, executive director of Stanford University’s Center for the Study of Language and Information, remarks that machine-based systems will never equal the human linguist.
“The use of statistical techniques, coupled with fast processors and large, fast memory, will certainly mean we will see better and better translation systems that work tolerably well in many situations,” Devlin says, “but fluent translation, as a human expert can do, is, in my view, not achievable.”

Knight, the pioneer in statistical translation, disagrees and points to the progress achieved during this decade. He foresees no limit to the technology, which will ultimately achieve human-level translations for everything except possibly poetry. He has shown blind examples of human translations alongside those from a machine, and audiences have confused the two. “Let’s not kid ourselves: there are lots of mistakes in human-level translations. The bar is not as high as you would imagine,” he says. To prove that this round of translation tools is more than the perennial sales pitch, the statistics jocks who now lead the field must demonstrate that this time FAHQT is real. Only then will the technology go beyond, as Microsoft’s Richardson puts it, mere “MT promises.”

MORE TO EXPLORE

The History of Machine Translation in a Nutshell. Online at John Hutchins's Web site: http://ourworld.compuserve.com/homepages/WJHutchins/nutshell.htm
A Statistical MT Tutorial Workbook. Kevin Knight. Online at www.isi.edu/natural-language/mt/wkbk.rtf
The Candide System for Machine Translation. Adam L. Berger et al. Online at http://acl.ldc.upenn.edu/H/H94/H94-1023.pdf