Future Tense

Death by Machine Translation?

Imagine you are in a foreign country where you don’t speak the language and your small child unexpectedly starts to have a fever seizure. You take them to the hospital, and the doctors use an online translator to let you know that your kid is going to be OK. But “your child is having a seizure” accidentally comes out in your mother tongue as “your child is dead.”

This specific example is a very real possibility, according to a 2014 study published in the British Medical Journal about the limited usefulness of AI-powered machine translation in communications between patients and doctors. (Because it’s a British publication, the actual hypothetical quote was “your child is fitting.” Sometimes we need American-British translation, too.)

Machine translation tools like Google Translate can be super handy, and Big Tech often promotes them as accurate and accessible tools that’ll break down many cross-linguistic barriers in the modern world. But the truth is that things can go awfully wrong. Misplaced trust in these MT tools’ abilities is already leading to their misuse by authorities in high-stakes situations, according to experts. Ordering a coffee in a foreign country or translating lyrics can only do so much harm, but think about emergency situations involving firefighters, police, border patrol, or immigration. And without proper regulation and clear guidelines, it could get worse.

Machine translation systems such as Google Translate, Microsoft Translator, and those embedded in platforms like Skype and Twitter tackle one of the most challenging tasks in data processing. Training a big model can produce as much CO2 as a trans-Atlantic flight. During training, an algorithm or a combination of algorithms is fed a specific dataset of translations. The algorithms store words and their relative positions as probabilities that they will occur together, creating a statistical estimate of how similar sentences might be translated. The system, therefore, doesn’t interpret the meaning, context, and intention of words the way a human translator would. It takes an educated guess, and that guess isn’t necessarily accurate.
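To make that statistical picture concrete, here is a minimal, hypothetical sketch in Python: it estimates word-level translation probabilities purely from co-occurrence counts over a tiny invented parallel corpus. This is a drastically simplified cousin of classic word-alignment models, not how any production system actually works, but it shows how a system can “translate” by probability rather than by meaning.

```python
# Toy sketch (hypothetical data): estimate P(target word | source word)
# from raw co-occurrence counts in a tiny English-Spanish parallel corpus.
from collections import defaultdict

corpus = [
    ("the house", "la casa"),
    ("the green house", "la casa verde"),
    ("green house", "casa verde"),
]

pair_counts = defaultdict(float)  # how often source word s co-occurs with target word t
src_counts = defaultdict(float)   # total co-occurrences for each source word
for src_sent, tgt_sent in corpus:
    for s in src_sent.split():
        for t in tgt_sent.split():
            pair_counts[(s, t)] += 1.0
            src_counts[s] += 1.0

# Normalize counts into conditional probabilities P(t | s).
prob = {(s, t): count / src_counts[s] for (s, t), count in pair_counts.items()}

def best_guess(src_word: str) -> str:
    """Pick the statistically likeliest target word; no meaning involved."""
    candidates = [(p, t) for (s, t), p in prob.items() if s == src_word]
    return max(candidates)[1] if candidates else "<unknown>"

print(best_guess("house"))  # "casa", purely because the numbers line up
```

Nothing in that guess involves understanding; with a skewed corpus, the same arithmetic would confidently produce the wrong word.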

In South Korea, a young man used a Chinese-to-Korean translation app to tell his female co-worker’s Korean husband that they should all hang out together again soon. A mistranslation erroneously referred to the woman as a nightlife establishment worker, sparking a violent fistfight between the two men in which the husband was killed, the Korea Herald reported in May. In Israel, a young man posted a photo of himself leaning on a bulldozer with the Arabic caption “يصبحهم,” or “good morning,” but the social media platform’s AI translation rendered it as “hurt them” in English and “attack them” in Hebrew. This led to the man, a construction worker, being arrested and questioned by police, the Guardian reported in October 2017. Something similar happened in Denmark, where, the Copenhagen Post Online reported in September 2012, police erroneously confronted a Kurdish man for financing terrorism because of a mistranslated text message. In 2017, a cop in Kansas used Google Translate to ask a Spanish-speaking driver whether he could search the car for drugs. But the translation was inaccurate, and the driver did not fully understand what he had agreed to. The case was thrown out of court, according to state legal documents.

These examples are no surprise. Translation accuracy can vary widely within a single language, depending on complexity factors such as syntax, sentence length, and technical domain, as well as between languages and language pairs, depending on how well the models have been developed and trained. A 2019 study showed that, in medical settings, hospital discharge instructions translated with Google Translate into Spanish and Chinese have been getting better over the years, reaching between 81 percent and 92 percent overall accuracy. But the study also found that up to 8 percent of mistranslations had the potential to cause significant harm. A 2021 pragmatic assessment of Google Translate for emergency department instructions showed that the overall meaning was retained for 82.5 percent of 400 translations across Spanish, Armenian, Chinese, Tagalog, Korean, and Farsi. But while translations into Spanish and Tagalog were accurate more than 90 percent of the time, there was a 45 percent chance of error for languages like Armenian. Not all machine translation errors are equally severe, but quality evaluations always turn up some critical accuracy errors, according to a paper published in June.

The good news is that Big Tech companies are fully aware of this, and their algorithms are constantly improving. Year after year, their BLEU scores, which measure how similar machine-translated text is to a set of high-quality human translations, get consistently better. Just recently, Microsoft replaced some of its translation systems with a more efficient class of AI model. Software is also being updated to include more languages, even those often described as “low-resource languages” because they are less common or harder to work with; that category spans most non-European languages, from widely used ones like Chinese, Japanese, and Arabic to small community languages like Sardinian and Pitkern. For example, Google has been building a practical machine translation system for more than 1,000 languages. Meta has just released the No Language Left Behind project, which attempts to deliver high-quality translations directly between 200 languages, including Asturian, Luganda, and Urdu, accompanied by data on how much the translations improved overall.
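For the curious, here is a minimal sketch of a BLEU-style comparison using the open-source NLTK library (assuming it is installed; the sentences are invented for illustration, and production evaluations typically use corpus-level tools such as sacreBLEU):

```python
# Compare one machine translation against human references with sentence-level
# BLEU. Higher scores (closer to 1.0) mean more n-gram overlap with the references.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "your child is having a seizure".split(),
    "your child is suffering a seizure".split(),
]
mt_output = "your child is having a fit".split()

# Smoothing keeps short sentences from scoring zero when a 4-gram is missing.
score = sentence_bleu(references, mt_output,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")
```

Note that BLEU only measures surface overlap with the references; a translation can score reasonably well and still contain exactly the kind of critical error described above.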

However, the errors that lead to consequential mistakes, like the one the construction worker experienced, tend to be random, subjective, and different for each platform and each language. So cataloging them is only superficially helpful in figuring out how to improve MT, says Félix Do Carmo, a senior lecturer at the Centre for Translation Studies at the University of Surrey. What we need to talk about instead, he says, is “How are these tools integrated into society?” Most critically, we have to be realistic about what MT can and cannot do for people right now. That involves understanding the role machine translation can play in everyday life, when and where it can be used, and how it is perceived by the people using it. “We have seen discussions about errors in every generation of machine translation. There is always this expectation that it will get better,” says Do Carmo. “We have to find human-scale solutions for human problems.”

And that means understanding the role human translators still need to play. Even as medications have gotten massively better over the decades, there still is a need for a doctor to prescribe them. Similarly, in many translation use cases, there is no need to totally cut out the human mediator, says Sabine Braun, director of the Centre for Translation Studies at the University of Surrey. One way to take advantage of increasingly sophisticated technology while guarding against errors is something called machine translation followed by post-editing, or MT+PE, in which a human reviews and refines the translation.

One of the oldest examples of a company using MT+PE successfully is detailed in a 2012 study of Autodesk, a software company that provides imaging services for architects and engineers, which used machine translation with post-editing to translate its user interface into 12 languages. Similar solutions have been reported by a branch of the consulting company EY, for example, and by the Swiss bank MigrosBank, which found that post-editing boosted translation productivity by up to 60 percent, according to Slator. Already, some machine translation companies have stopped selling their technologies for direct use by clients and now always require some sort of post-editing, Do Carmo says. For example, Unbabel and Kantan are platform plugins that businesses add to their customer support and marketing workflows to reach clients all over the world. When they detect poor quality in the translated texts, the texts are automatically routed to professional editors. Although these systems aren’t perfect, learning from them could be a start.
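In rough terms, that routing step might look like the following sketch. The quality score, threshold, and names here are hypothetical stand-ins for illustration, not the actual APIs of Unbabel or Kantan:

```python
# Hypothetical MT+PE routing: ship the machine translation only if an
# automatic quality estimate clears a threshold; otherwise queue it for
# a human post-editor. Score, threshold, and names are all invented.
from dataclasses import dataclass

@dataclass
class Translation:
    source: str
    mt_output: str
    quality_score: float  # e.g., 0.0-1.0 from a quality-estimation model

QUALITY_THRESHOLD = 0.85  # assumed cutoff; real systems tune this per language pair

def route(t: Translation) -> str:
    """Decide whether MT output ships as-is or goes to human post-editing."""
    if t.quality_score >= QUALITY_THRESHOLD:
        return "deliver"        # confident enough to send to the customer
    return "human_post_edit"    # low confidence: a professional editor reviews it

print(route(Translation("يصبحهم", "attack them", quality_score=0.42)))
# -> "human_post_edit"
```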

Ultimately, Braun and Do Carmo think it’s necessary to develop holistic frameworks that go far beyond the metrics currently used to evaluate translation quality, like BLEU. They would like to see the field work on an evaluation system that encompasses the “why” behind the use of translation, too. One approach might be an independent, international regulatory body to oversee the use and development of MT in the real world, with plenty of social scientists on board. There are already many standards in the translation industry, as well as technological standardization bodies like the W3C, so experts believe it can be done, as long as there is some more organization in the industry.

Governments and private companies alike also need clear policies about exactly when officials should and should not use machine translation tools, whether free consumer products or others. Neil Coulson is the founder of CommSOFT, a communication and language software technology company trying to make machine translation safer. “Police forces, border-control agencies, and many other official organizations aren’t being told that machine translation isn’t real translation and so they give these consumer gadgets a go,” he says. In March 2020, his organization sent a Freedom of Information request to 68 large U.K. public-sector organizations asking for their policies on the use of consumer translation technologies. The result: None of the organizations had any policy on the use of MT, nor did they monitor their staff’s ad hoc use of it. That can lead to an unregulated, free-for-all landscape in which anyone can publish a translation app and claim that it works, says Coulson. “It’s a ‘let a thousand flowers bloom’ approach … but eventually someone eats a flower that turns out to be poisonous and dies,” says Coulson.

Education about the pros and cons of MT, of course, is paramount, among researchers, companies, and organizations that want to start using the tool, but most importantly, among everyday users. That’s why Lynne Bowker, a professor of translation and interpretation at the University of Ottawa, started the Machine Translation Literacy project. Its goal is to spread awareness of how MT systems process information and to teach researchers and scholars how to use MT more effectively. Including information about machine translation in the broader digital and information literacy training given to schoolchildren would also be welcome. “Being machine translation literate means understanding the essentials of how this technology works in order to be able to evaluate its strengths and weaknesses for a particular task or use,” says Bowker. Language, in a social context, is communication. “One of the real challenges we are facing is how to reach the wider public with this message,” says Bowker.

Being able to differentiate between low-stakes tasks and high-stakes tasks remains one of the key points, Bowker says. Thankfully, in the meantime, most mistranslations still just lead to a laugh: According to a 2016 study in the International Journal of Communication, there’s a Chinese restaurant called Translate Server Error. The MT system returned an error message instead of a translation, but the restaurant owners didn’t know English well enough to realize something was off.

Future Tense is a partnership of Slate, New America, and Arizona State University that examines emerging technologies, public policy, and society.