We were recently contacted by a software publisher asking us to consider using Machine Translation (MT) to translate their knowledge base. Given the volumes involved, they were looking for a way to lower their costs.
Their hypothesis for using MT was based on the following:
- Knowledge bases, unlike software GUI, documentation or help, do not need to have a high level of quality
- Since they contain massive amounts of information, it is impossible for humans to translate them fast enough to keep up with their rapid expansion
- Some entries are never or rarely used by users
- The entries themselves are authored by many support people and not by professional tech pub writers, so the grammatical quality of the source content is already at an inferior level.
We had experimented with MT long ago and concluded that it does not save our professional translators any time. Reworking MT output is more time-consuming than translating from scratch. But given the recent hype about new methods and technologies, we decided to put their hypothesis to the test.
We randomly selected sentences from their knowledge base and gave off-the-shelf MT solutions a try. We found many problems, mainly inaccurate translations and incorrect terminology, particularly where the source was not in perfect shape (see the last bullet above). We will limit the discussion here to a simple example to illustrate our point. The source English sentence that we will use is: The operation of saving the assembly as a multi-body part was a point in time event.
With so much press about the new Statistical Machine Translation (SMT) technology from the University of Southern California, and its proclaimed higher fidelity than rule-based translation output, we decided to give it a try. SMT depends on vast existing translation databases (many millions of words), so we opted to go to the leader in serving content, Google. After all, they are the best at indexing the World Wide Web, and if anyone can benefit from the vast body of existing translations on the internet, it will be them.
Google's translation of our test sentence into French was the following: Le fonctionnement de l'économie d'un assemblage de plusieurs partie du corps a été un moment manifestation.
Other problems aside, the key term that I want you to focus on is saving. Google translated it as economizing, giving it a financial tone.
With Systran, the engine used by many free online translation services like AltaVista's Babel Fish, the translation into French was: L'opération de sauver l'assemblée comme pièce de multi-corps était un point dans l'événement de temps.
The translation from Microsoft's beta translation site was very similar to Systran's: L'opération de sauver l'assemblée comme pièce de multi-corps était un événement de moment.
But both Systran and Microsoft interpreted saving as rescuing!
It took a human being to realize that the text is about a software application and to correctly infer that saving here means recording the assembly to a file (enregistrer), not rescuing or economizing it!
This was not a surprise to us. When you deal with translations every day, hour and minute, you know that there is no real substitute today for human translation.
Some say that despite all this, the gist of the meaning is still maintained and the international user can benefit from MT. It is better than not having any translation at all. Perhaps. But when you are a successful and reputable professional company and your brand and image are on the line, are you willing to risk it all without looking at better options?
Your goal should be to seek quality and accuracy in everything that you publish, whether it is product, website, support, PR, training, legal, financial or knowledge base related. So how can you balance brand, image and cost trade-offs when it comes to translating bulk content? Simple: Divide, Prioritize and Conquer! Stay tuned for the next blog.
3 comments:
Good point, but let's agree that the English sentence is not very understandable in the original either. "The operation of saving the assembly" is the same as saying "Saving the assembly". Just by editing that, the translation became "Enregistrement de l'assemblée comme un organe pluridisciplinaire partie a été un point dans le temps." The "multi-body part" is also a tricky construction, and since I don't know the context, I didn't try to say it with other words. In short, if you improve the original - a sound investment if you are going to machine translate the text into 15 languages - you can save a lot of time. If you integrate tools like acrotext into the process, you can improve the output of MT by improving the quality of the source text. This is the famous GIGO issue.
Appreciate your clear examples regarding AT. It is important for LSPs and translators to continually stay aware so they don't miss the tipping point where it becomes a benefit, but we aren't there yet. Thanks for your helpful analysis.
There are a few points to add as testimony to the fact that machine translation is not a very good option even for projects that don’t require high-quality translation:
I have recently managed a project for translating a fairly big KB into two languages in a short timeline. Many teams were assigned and a lot of synchronization took place to make sure the terms are consistent.
One side of this project was managing the numerous questions from the different teams. If we ignore the questions aimed at finding the best translation for a certain term (since quality is not the most important factor in these projects), we are still left with a large number of questions seeking clarification of the contextual meaning of a certain term, the technical definition of other terms and, very often, the meaning of acronyms. Since we have to either use an acronym in the target language or keep the English one with an explanation in brackets in the target language, I can't see how automatic translation engines would handle this issue (keeping in mind that such terms occur very frequently).
Another constraint of the project was that the fields for questions and answers had a limited length (character count), for example 255 characters for the question. As we know, for some target languages the word and character counts grow considerably compared with the English source. Just before delivery, we had a QA phase that identified all the segments that exceeded the size limit, and the teams started rephrasing those sentences to fit. There were so many comments from the different teams that a lot of effort was needed to keep the shortened translations meaningful. I am just wondering what an automatic translation would offer in this case. Even if we put a size limit constraint and enforce it on the engine, my guess is that it would crop the text when it reaches the size limit with no attention to how meaningful the resulting text would be to the end user.
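As a rough illustration of the length check described above, here is a minimal Python sketch. It is only an assumption of how such a QA pass might look, not the tooling used on the project: the function names, segment IDs and sample strings are hypothetical, and only the 255-character limit comes from the example above. It also shows why blind cropping is a poor substitute for human rephrasing.

```python
# Hypothetical sketch of a length QA pass; names and sample data are illustrative.
MAX_QUESTION_CHARS = 255  # the field limit mentioned above


def find_oversized_segments(segments, limit=MAX_QUESTION_CHARS):
    """Return (segment_id, length) pairs for translations longer than the limit."""
    return [
        (seg_id, len(text))
        for seg_id, text in segments.items()
        if len(text) > limit
    ]


def naive_crop(text, limit=MAX_QUESTION_CHARS):
    """Cut the text at the limit, with no regard for word or sentence boundaries."""
    return text[:limit]


# Made-up translated segments keyed by a hypothetical segment ID.
segments = {
    "q1": "Comment enregistrer l'assemblage ?",
    "q2": "Comment enregistrer l'assemblage en tant que pièce ? " * 6,
}

for seg_id, length in find_oversized_segments(segments):
    print(f"{seg_id}: {length} characters, needs rephrasing by a translator")
```

The check only flags the segments that are too long. Deciding how to shorten them while keeping them meaningful is still left to a person, whereas something like naive_crop would simply cut the sentence off wherever the limit happens to fall.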