Thursday, October 14, 2010

Universal Translator

In the Star Trek universe, a Universal Translator is a device that completely obliterates language barriers by instantaneously translating all speech to the user's native language.  In essence, it makes it so that everybody appears to be speaking one language.  In the Star Trek: Enterprise series, which takes place in the early years of humans' exploration of inhabited space, the translator is an imperfect device that merely produces a text translation of speech and often mistranslates, especially in the case of unfamiliar languages. The choice of the writers of Star Trek: Enterprise to portray the ancestor of the universal translator as a speech-to-text device is logical given that languages can be written down and text is easier for computers to work with than audio, right?

Earlier today I came across a TechCrunch article that got me thinking about speech-to-speech translation in the modern day.  One commenter mentioned an Android app that does speech-to-speech translation, so I took his tip and tried out Talk To Me Cloud.  The app's algorithm essentially includes three steps: 1) Listen to the Language A speech and transcribe it to Language A text; 2) Send the Language A text to a server in the cloud to translate it to Language B text; 3) Convert the Language B text to Language B speech.  The interface took a minute to figure out, but true to its intention, it listened to my elementary Spanish and said it back to me in English.  Awesome!  But not quite a Universal Translator, or really all that useful.  Yet.
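To make the three steps concrete, here is a minimal sketch of that pipeline.  I don't know how Talk To Me Cloud is actually written, so every name here (speech_to_speech, fake_recognize, fake_translate, fake_synthesize, PHRASES) is a hypothetical stand-in, and the "audio" is just text encoded as bytes so the example can run without any speech service.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stage signatures; a real app would call speech and
# translation services here instead of the toy functions below.
Recognizer = Callable[[bytes, str], str]      # (audio, language) -> text
Translator = Callable[[str, str, str], str]   # (text, source, target) -> text
Synthesizer = Callable[[str, str], bytes]     # (text, language) -> audio


def speech_to_speech(audio: bytes, source: str, target: str,
                     recognize: Recognizer, translate: Translator,
                     synthesize: Synthesizer) -> bytes:
    """Chain the three steps: speech -> text -> translated text -> speech."""
    source_text = recognize(audio, source)                 # step 1
    target_text = translate(source_text, source, target)   # step 2
    return synthesize(target_text, target)                 # step 3


# Toy stand-ins (assumptions for illustration only).
PHRASES: Dict[Tuple[str, str], Dict[str, str]] = {
    ("es", "en"): {"hola": "hello", "gracias": "thank you"},
}


def fake_recognize(audio: bytes, language: str) -> str:
    # Pretend the audio is already text encoded as bytes.
    return audio.decode("utf-8")


def fake_translate(text: str, source: str, target: str) -> str:
    # Look the phrase up; fall back to the untranslated text if unknown.
    table = PHRASES.get((source, target), {})
    return table.get(text.lower(), text)


def fake_synthesize(text: str, language: str) -> bytes:
    # Pretend that encoding the text produces audio.
    return text.encode("utf-8")


result = speech_to_speech(b"hola", "es", "en",
                          fake_recognize, fake_translate, fake_synthesize)
```

Notice that each step only ever sees the output of the step before it, so a mistake in step 1 flows straight through to the end.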

The app is plagued by several drawbacks, some within the developer's control and some without:

  1. The visual interface is clunky.  It looks like the developer took the technology behind the app, translated its steps into UI components, and just tossed those onto the screen in linear fashion.  "Let's see... The user has to select the language that needs to be translated, so let's put a drop down box in there.  The user has to see the text of what was said so they know if the speech-to-text worked right, so let's toss a big text box in there.  The user has to tell the app when to start listening, so let's put a microphone button in there, but let's make it REALLY BIG so it's easy to find.  Etc."  The net effect of this is the app feels like a techie tool, not something that opens doors to otherwise impossible conversations.  Make it fun, developer!  I can use this app to ask someone to point me to the nearest restroom in Italian!
  2. The hardware doesn't match up.  The app relies on the phone's speakerphone mic to hear the speech it should translate.  The problem is, the speakerphones on current smartphones (at least those that I've dealt with) are pretty poor at picking up meaningful audio from more than a few feet away, and even that only in a quiet setting.  Now imagine using this app for one of its intended purposes: Suppose you are traveling in Germany and you are trying to describe your hotel (you can't recall the name) to a taxi driver.  You have this great app that will let you talk to this driver, so you bring out your phone and, naturally, set it between you two.  The app happily translates about 1 in 3 words, plus whatever is on the radio and the road noise outside.  The only solution available to you is to bring the phone close to you, speak, quickly hand it to the driver who speaks and quickly hands it back to you, and so on.  That's a bit of an awkward interaction.
  3. You have to tell the app which language is being spoken and which language it should be translated into.  That's fine if only you are speaking, but in most cases you're having a conversation in which more than one person speaks, and more than one language needs to be translated.  Luckily, the developer put in a "Swap" button that allows you to swap the input and output languages.  But consider the taxi driver interaction:  Now, not only do you have to pass the phone back and forth, but each time you do, you have to press the Swap button.  And since you have to keep swapping languages, you have to break up the audio recordings, and thus each swap you make also requires a press of the microphone button.  The awkwardness of the conversation just went up a few points.
  4. Speech-to-text technology isn't very accurate.  As you can see from my screenshot, the STT translator got a word wrong ("Tus" became "Chris").  If the first step in the translation process fails, what hope is there that anything intelligible will come out the other end?  There is a reason STT technology hasn't been enormously successful commercially: It's a really tough problem to solve.
  5. Text does not include emphasis or accents.  The use of text as an intermediary makes sense given current technology (text-to-text translation is getting pretty good), but does away with the benefits of verbal communication, namely emphasis and accent.  The former can alter the meaning of the speech entirely, and the latter can provide insight into the speaker's origins as well as (more practically) change the meaning of the speech due to differences in dialects.  For sure, creating speech recognition technology that could recognize dialects would be very difficult, but it is simply an impossible task for text.

#1 and #2 will likely be addressed within a few years as better phones and better UI interactions are developed, but #3, #4, and #5 are interesting to think about.  I believe that there is a simple solution to #3: Rather than the user telling the app which language to use on each round of speech, the app ought to be able to figure out the language itself after one round.  Put another way, the app ought to be able to figure out which person is speaking after hearing each one once.  Assuming that speakers do not change languages, the identity of the speaker can tell the app which language to translate.  One way to accomplish this would be to use the assumption that the speakers in most conversations have differing pitches of voice.  Another way would be to use the position of the speakers relative to the phone.  This would require multiple microphones on the phone, of course, but many phones, such as my Droid X, already have more than one.
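
Here is a rough sketch of the pitch idea, just to show how little bookkeeping it would take.  It assumes each speaker is heard once (with the language known for that first round) and that the two voices differ clearly in pitch.  The names estimate_pitch_hz, SpeakerLanguageTracker, and tone are hypothetical, the zero-crossing pitch estimate is deliberately crude, and the pure sine tones stand in for real recordings, which would need something far more robust.

```python
import math


def estimate_pitch_hz(samples, sample_rate):
    """Crude pitch estimate from zero crossings.

    Only reasonable for clean, steady audio; real speech would need a
    proper pitch tracker.
    """
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration = len(samples) / sample_rate
    # A periodic signal crosses zero twice per cycle.
    return crossings / (2 * duration)


class SpeakerLanguageTracker:
    """Remember each speaker's pitch after one round, then guess by nearest pitch."""

    def __init__(self):
        self._enrolled = []  # list of (pitch_hz, language) pairs

    def enroll(self, pitch_hz, language):
        self._enrolled.append((pitch_hz, language))

    def guess_language(self, pitch_hz):
        if not self._enrolled:
            raise ValueError("no speakers enrolled yet")
        nearest = min(self._enrolled, key=lambda entry: abs(entry[0] - pitch_hz))
        return nearest[1]


def tone(freq_hz, sample_rate=8000, seconds=1.0):
    """Generate a pure sine tone as a stand-in for a recorded voice."""
    count = int(sample_rate * seconds)
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(count)]


tracker = SpeakerLanguageTracker()
# First round: each speaker talks once and the language is known.
tracker.enroll(estimate_pitch_hz(tone(120), 8000), "en")  # lower voice
tracker.enroll(estimate_pitch_hz(tone(220), 8000), "de")  # higher voice

# Later rounds: no Swap button, the pitch picks the language.
guess = tracker.guess_language(estimate_pitch_hz(tone(205), 8000))
```

With these made-up numbers the 205 Hz utterance lands nearest the higher voice, so the app would treat it as German without anyone pressing Swap.  Two speakers with similar voices would defeat this, which is where the multiple-microphone approach might help.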

#5 is a basic problem with text, so some other medium must be used to address it.  Video, scent, and touch would be pretty interesting, but audio seems to be the most practical.  In this case, audio also brings the benefit of reducing the number of translations from three (speech-text-text-speech) to one (speech-speech).  Assuming that any translation is imperfect, the multiplier effect would say that one translation has a much greater potential for high quality than three translations.  Further, direct speech-speech translation would bring the benefits of emphasis and accents, which are vital to verbal communication.  If we truly want a Universal Translator, it needs to be a direct speech-speech translation device.  That would take care of #4 as well.
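
The multiplier effect is easy to put numbers on.  This is a back-of-the-envelope sketch, not a measurement: the 90% per-step figure is invented, chain_accuracy is a hypothetical helper, and it assumes each step's errors are independent of the others.  It also assumes a direct speech-speech step could be as accurate as any single step of the chain, which is far from guaranteed.

```python
import math


def chain_accuracy(stage_accuracies):
    """Overall accuracy of a chain, assuming each stage errs independently."""
    return math.prod(stage_accuracies)


# Invented figure: every step gets 90% of its input right.
three_stage = chain_accuracy([0.9, 0.9, 0.9])  # speech-text, text-text, text-speech
one_stage = chain_accuracy([0.9])              # direct speech-speech
```

Under those assumptions the three-step chain comes out at roughly 0.73 while the single step stays at 0.9, so each extra hop costs accuracy even when every hop is individually decent.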

Finally, it seems to me that speech-text, text-text, and speech-speech are all different problems.  They may have some underlying similarities such as matching algorithms, but progress in the text-text space may be unhelpful to the progress of the speech-speech space.  Thus I feel that the writers of Star Trek: Enterprise were misguided: The speech-text translation device is not a stepping-stone to the Universal Translator.  If computer scientists and developers focus their energies on direct speech-speech translation, we just might have the UT before Star Trek's predicted late-22nd-century arrival.
