“Pidgins and Creoles” or “Why Machine-only Translation Will Always Fall Short”

Several weeks ago I was speaking at a localization conference when an audience member asked me about machine translation, and whether it would ever completely take the place of human linguists in the industry. I answered “No,” although I did concede that machine translation is consistently making strides and does have a place in the localization community. He then mentioned that a scientific group in Europe recently had success with a robot performing a live human appendectomy. If something that delicate could be automated, he asked, what put something as “simple” as language beyond the scope of machines and artificial intelligence? I thought about his question and then simply said, “Because there are no pidgins or creoles for appendectomies.”

First, let me state that I believe automated translation tools have a place in the localization industry. For the right organization, you will see significant improvement in cost, quality, consistency and time-to-market by using some degree of machine translation.  However, language is a constantly changing creature, and it is this constant transformation that makes it the sole possession of humans.

Some may be asking, “What are pidgins and creoles?” In short, they are languages. A pidgin occurs when two separate languages are pressed together for the sake of singular communication. Typically, the pidgin is very rudimentary, since the two sides will still communicate in their “mother tongues” when separated. A creole is when the pidgin becomes a new language, growing in complexity at an exponential rate until it becomes its own “mother tongue.”

For example, Brazilian Portuguese is a combination of many different languages. While the source language for context is Iberian Portuguese, there are distinct differences brought on by Spanish and indigenous languages. Brazilian Portuguese is a creole, formed by the need for multiple cultures to communicate. As the European cultures arrived in South America, a pidgin was created to make simple verbal communication possible amongst all groups. As the years passed, this pidgin became the common language, and more terminology and complexity were added. Children were taught this language by their parents, instead of the “mother tongues.”

Spanish offers another example of pidgins and creoles, especially in Latin and South America, where each country’s Spanish has its own distinctive feel. While the differences are not as vast as those between Iberian and Brazilian Portuguese, the dialects of Peru differ from those of Argentina, based on the indigenous influences each area had on the original Spanish settlers, e.g., the Incan culture in Peru.

In his book “Bastard Tongues: A Trailblazing Linguist Finds Clues to Our Common Humanity in the World’s Lowliest Languages,” Derek Bickerton asserts that original pidgins were created over a 5-7 year period of cross-communication between two separate languages, with the transition to creole occurring at the start of the next generation. However, as communication tools increase in scope and availability, the time this takes has been cut by almost 75%. This means that completely new languages are being created every 5-6 years!

For this, you need to look no further than American English. Each year words are added to Merriam-Webster’s American Dictionary. Words that did not exist five years ago are now commonly used in media. While my mother has no clue what “fo-shizzle” means, it is a commonly used term amongst my younger sister’s generation (incidentally, it means “for sure”). This term is ingrained in her culture, and will be taught to her children as though it always existed. While this is just one word, understand that the scale is exponentially greater. Machines, no matter how advanced, cannot keep up with the evolution of local languages. Even if cultures stopped communicating with each other tomorrow, the intra-cultural exchanges would still bring about changes in language. It might be at a slower rate, but it would still occur.

This is why machines can never fully replace humans in language and translation. Verbal communication between people is too complicated, random and “human” for machines to completely grasp. As an aside, this is also why in-country translators are necessary for localization to be effective. No matter how fluent they are, even a native speaker loses part of the pidgin when they are no longer local.

I would appreciate any comments or thoughts on machine translation, pidgins or creoles from anyone reading this blog. Thanks, Jonathan

5 comments to “Pidgins and Creoles” or “Why Machine-only Translation Will Always Fall Short”

  • Beware of saying “never” in forecasting technological progress. Many such forecasts have looked unimaginative a few decades later.

    You might be blurring the issues of verbal and written communication here. I’d guess that most though not all translation is for written text (while interpretation is about verbal communication). “fo-shizzle” may not be the best example, because it seems a verbal code to reinforce in-group membership. An example from written text might be better (“tweet”, “blog”). That said, it’s certainly true that written communication is complicated, random, and “human”, just as verbal communication is.

    I think there are three issues with machine translation adapting to pidgins, creoles, and language change: language fluency, motivation, and data access. I’m cautious about saying that machines will “never” have the fluency to master any given pidgin or creole. In particular, statistical approaches to translation — what Google Translate is really good at — might eventually prove uncanny in their ability. However, motivation (or its proxy, money) is a major constraint. Machines may not get good at some creoles simply because those with the machines won’t find the effort lucrative enough. There will always be humans fluent in a thriving pidgin or creole, of course; they are a human activity.

    Data access may prove to be the biggest rate limiter to machines gaining fluency in particular languages. A corpus of parallel texts is a prerequisite for statistical translation. Computers have an easy time extracting text from web documents; a harder time from printed documents; and a very hard time from verbal communication. I’d expect to see slower progress in machine translation for pidgins and creoles which are verbal, off the net, and with thin wallets. But I expect that would be primarily a matter of motivation and data access, not the ability of the machine to grasp a pidgin per se.

    (I was intrigued enough to post a slightly elaborated reply, “Will Machine-only Translation Always Fall Short?”, on my blog.)

  • 1) Machine Translation and related technologies have already been developed for creole languages. I spent several years on Creole linguistics (http://linguistlist.org/pubs/diss/browse-diss-action.cfm?DissID=12256) and then worked on language processing technologies for creoles and other languages:

    See:

    2003. Computing in Creole Languages. In Multilingual Computing and Technology magazine. Number 53, Vol. 14, Issue 1. January/February 2003. Pp. 24-32.
    http://www.multilingual.com/articleDetail.php?id=625

    2001. Closing the Digital Divide: Issues in expanding localization efforts to minority languages. In LISA Newsletter (Globalization Insider), Volume X, No. 2, April 2001 (pp. 23-24, 32).
    http://www.lisa.org/globalizationinsider/2001/04/closing_the_dig.html

    2002. Intra-textual Inconsistency: Risks of Implementing Orthographies for Less-Prevalent Languages. In Localization Industry Standards Association (LISA) Newsletter: Globalization Insider, Volume XI, No. 1.3, February 15, 2002, pp 1-5.
    http://www.lisa.org/globalizationinsider/2002/02/intratextual_in.html

    2001. Standardized Spelling as a Localization Issue. In Multilingual Computing and Technology magazine. Number 41, Vol. 12, Issue 5. July/August 2001. Pp. 37-40.
    http://www.multilingual.com/articleDetail.php?id=589

    1998. Lexical variation in Haitian Creole and orthographic issues for Machine Translation (MT) and Optical Character Recognition (OCR) applications. First Workshop on Embedded Machine Translation systems of the Association for Machine Translation in the Americas (AMTA) conference, Philadelphia, 28 October 1998.
    http://www.geocities.com/jeffallenpubs/amta98-allen-final.htm

    1998. Rapid-Deployment Text-to-Speech in the DIPLOMAT System (including Haitian Creole). Paper presented at the 6th International Conference on Spoken Language Processing (ICSLP98). 30 November – 4 December 1998, Sydney, Australia.
    http://www.shlrc.mq.edu.au/proceedings/icslp98/PDF/AUTHOR/SL980868.PDF

    2) As for written vs non-written phenomena which can be captured in machine translation and speech technologies, see the following article:

    2004. Issues in Multilingual Speech Technologies: The differences between written and spoken language must be taken into account when developing a speech system. In Multilingual Computing and Technology magazine, Number 63, Vol 15, Issue 3, April/May 2004.
    http://www.multilingual.com/articleDetail.php?id=677

    3) On the question of regional and local varieties of languages versus a universal version, and how to handle these in language processing, this was discussed in:

    2001. Is there a Universal Creole for localization efforts? In LISA Newsletter (Globalization Insider), Volume X, No. 3, August 2001 (pp. 39-42).
    http://www.lisa.org/globalizationinsider/2001/07/is_there_a_univ.html

    4) Handling new words in languages. In our work on creating huge amounts of human translated content for several less-prevalent languages, we measured the rate at which it was possible to create translated terminology in each less-prevalent language. See:

    1998. Expanding Lexical Coverage of Parallel Corpora for the Example Based MT Approach. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC ‘98), Granada, Spain, May 1998, pp. 747-754.
    http://www.cs.cmu.edu/afs/cs.cmu.edu/user/chogan/Web/lre98.zip

    5) Can MT and human translators work together? Yes, many years of doing it successfully have provided published case studies with real results.
    Go to http://www.geocities.com/mtpostediting/

    Hope that helps.

    Jeff

  • admin

    Moderator comment.
    That is certainly a lot of useful information on MT research projects – I will absolutely be using all these links for my ongoing research. It was news to me that Creole was a viable language for MT research as early as 1998, when MT was commercially unreliable for more mainstream languages. Overall though, I am not sure the original post intended to imply MT wasn’t useful – simply that it is unlikely to replace human translation across the board any time soon (or ever – as the author put it). I am sure Jonathan will clarify if appropriate.
    ~mod

  • jslaughter

    Jeff -

    This information is very good reading, thank you for it.

    As the moderator mentioned, I agree that automated translation is an extremely useful tool in the right situations. That said, I firmly believe that human translators and/or post-editors will always be an essential part of the localization process. Regardless of how advanced MT gets, it cannot replace the cultural perspective of someone living in that local market.

    Even the most advanced MT requires someone with an understanding of the language and the context it is used in to ensure the translations are correct.

    Nevertheless, I thoroughly enjoyed reading the information. It definitely gave me a new perspective on the MT capabilities that are growing exponentially.

    Thank you,

    Jonathan

  • Hi,
    MT was commercially reliable in 1998 for the more mainstream languages, but only in very well-defined and well-conducted projects. Systems at that time were very expensive, and MT was hardly available as desktop software. Babelfish made its entry in Dec 1997 as the first online MT portal, and made visible the base-level MT systems, but did not demonstrate the capability of customizable enterprise-level systems. A lot of improvement has been made to desktop MT software in the past 10 years to bring the features of the enterprise-level systems to desktop and laptop workstations, and even pocket PCs and mobile telephones (yes, there is MT software for such devices, but usually very basic-level software).
    All such systems/software have improved much over the past 10 years, and all of the presentations/articles/papers I have written and made available on MT projects have indicated how human intervention for dictionary creation and postediting are important and necessary tasks for providing publishable translated documents.

    Jeff
