Google says it has made progress in improving the quality of language translation. In a forthcoming blog post, the company details new techniques that enhance the user experience across the 108 languages supported by Google Translate, particularly the data-poor Yoruba and Malayalam. The service translates an average of 150 billion words a day.
In the 13 years since Google Translate first launched, techniques such as neural machine translation, rewriting-based paradigms, and on-device processing have produced quantifiable leaps in the platform's translation accuracy. But until recently, even the latest translation algorithms lagged behind human performance. Efforts outside Google illustrate the difficulty of the problem: the Masakhane project, which aims to enable automatic translation of thousands of languages spoken across the African continent, has not yet moved beyond its data collection and transcription phase. And Common Voice, the crowdsourcing project Mozilla launched in June 2017 to build an open source collection of transcribed voice data for speech recognition software, has vetted only 40 voices since its launch.
Google says its breakthrough in translation quality is not driven by a single technology, but rather by a combination of technologies targeting low-resource languages, high-resource languages, overall quality, latency, and inference speed. Between May 2019 and May 2020, as measured by human evaluation and BLEU (a metric based on the similarity between a system's translation and human reference translations), Google Translate improved by an average of 5 or more points across all languages, and by an average of 7 or more points across the 50 lowest-resource languages. In addition, Google says Translate has become more robust against machine translation "hallucination," a phenomenon in which an AI model produces strange translations when given nonsense input; for example, when fed the Telugu characters "షషషషషష," the model produced the bizarre translation "Shenzhen Shenzhen Shaw International Airport (SSH)."
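The BLEU metric mentioned above is straightforward to sketch. The toy implementation below is a simplified sentence-level variant without smoothing (not necessarily the exact corpus-level formulation Google used): it scores a candidate translation against one reference by combining clipped n-gram precisions with a brevity penalty.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions, multiplied by a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # Clip each n-gram's count by its count in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # no smoothing: any zero precision zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * geo_mean

ref = "the cat sat on the mat".split()
print(bleu("the cat sat on the mat".split(), ref))  # identical text scores 1.0
```

A 5-point BLEU gain, on a 0-100 scale, is a substantial year-over-year improvement by the standards of machine translation research.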
Hybrid models and data miners
The first of these technologies is a new translation model architecture: a hybrid that combines a Transformer encoder with a recurrent neural network (RNN) decoder, implemented in Lingvo (a TensorFlow framework for sequence modeling).
In machine translation, the encoder typically encodes words and phrases as internal representations, which the decoder then uses to generate text in the desired language. Transformer-based models, first proposed by Google researchers in 2017, are more effective than RNNs in this regard, but Google says its work shows that most of the quality gain comes from just one component of the Transformer: the encoder. That is likely because, while both RNNs and Transformers are designed to handle ordered data sequences, Transformers do not need to process a sequence in order. In other words, when the data in question is natural language, a Transformer does not need to process the beginning of a sentence before processing the end.
Nevertheless, at inference time, the RNN decoder is "much faster" than the Transformer's decoder. Recognizing this, the Google Translate team optimized the RNN decoder before combining it with the Transformer encoder, creating a low-latency hybrid model that is higher-quality and more stable than the four-year-old RNN-based neural machine translation models it replaces.
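The structural asymmetry between the two halves of the hybrid can be seen in a toy NumPy sketch. This is not Google's Lingvo implementation; all the weights and dimensions are made up for illustration. The point is that the attention-based encoder contextualizes the whole source sentence in one matrix operation, while the RNN decoder must loop step by step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                      # model width (illustrative)
src_len, tgt_len = 5, 4    # source / target sequence lengths

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Transformer-style encoder layer: self-attention over the whole source at once.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def encode(src_embeddings):
    q, k, v = src_embeddings @ Wq, src_embeddings @ Wk, src_embeddings @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))  # (src_len, src_len), no sequential loop
    return attn @ v                        # contextualized source states

# RNN decoder: generates target states one step at a time.
Wh, Wx, Wc = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

def decode(enc_states, steps):
    h = np.zeros(d)
    x = np.zeros(d)                        # previous output (starts at a <bos> stand-in)
    context = enc_states.mean(axis=0)      # simplistic fixed context; real models attend
    outputs = []
    for _ in range(steps):                 # inherently sequential
        h = np.tanh(h @ Wh + x @ Wx + context @ Wc)
        outputs.append(h)
        x = h
    return np.stack(outputs)

src = rng.normal(size=(src_len, d))
enc = encode(src)
out = decode(enc, tgt_len)
print(enc.shape, out.shape)  # (5, 8) (4, 8)
```

The sequential loop in `decode` is cheap per step, which is one intuition for why an optimized RNN decoder can beat a Transformer decoder on latency.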
Figure: BLEU scores of Google Translate models since its inception in 2006. (Image: Google)
In addition to the novel hybrid model architecture, Google upgraded the decades-old crawler it uses to compile training data from millions of example translations (drawn from articles, books, documents, and web search results). The new miner is embedding-based rather than dictionary-based for 14 major language pairs, meaning it uses vectors of real numbers to represent words and phrases, and it places more emphasis on precision (the proportion of relevant data among the data retrieved) than on recall (the proportion of all relevant data that is actually retrieved). Google says this increased the number of sentences the miner extracted by an average of 29 percent.
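The precision/recall trade-off mentioned above can be made concrete with a small example. The numbers here are invented for illustration, not Google's.

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_pos = retrieved & relevant
    precision = len(true_pos) / len(retrieved) if retrieved else 0.0
    recall = len(true_pos) / len(relevant) if relevant else 0.0
    return precision, recall

# A miner retrieves 4 candidate sentence pairs; 3 are genuine translations,
# out of 6 genuine pairs present in the corpus.
p, r = precision_recall(retrieved=[1, 2, 3, 9], relevant=[1, 2, 3, 4, 5, 6])
print(p, r)  # 0.75 0.5
```

Favoring precision means the miner would rather miss some genuine sentence pairs than pollute the training set with false ones.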
Noisy data and transfer learning
Another translation performance gain comes from a modeling approach that better handles noise in training data. Observing that noisy data (data containing a large amount of information that cannot be correctly understood or interpreted) can impair translation, the Google Translate team deployed a system that uses trained models to assign scores to training examples, downweighting noisy data and upweighting "clean" data. In effect, these models begin by training on all the data and then gradually train on smaller, cleaner subsets, an approach the AI research community calls curriculum learning.
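The pattern can be sketched in a few lines. The scoring heuristic below (a length-ratio check on sentence pairs) is a stand-in invented for illustration; in practice the score would come from a trained model's confidence or loss on each example.

```python
def score(example):
    """Hypothetical cleanliness heuristic: penalize sentence pairs
    whose source and target lengths are badly mismatched."""
    src, tgt = example
    return min(len(src), len(tgt)) / max(len(src), len(tgt))

def curriculum(examples, fractions=(1.0, 0.5, 0.25)):
    """Yield training subsets: all data first, then progressively
    smaller, cleaner (higher-scoring) slices."""
    ranked = sorted(examples, key=score, reverse=True)
    for f in fractions:
        k = max(1, int(len(ranked) * f))
        yield ranked[:k]

data = [("bonjour le monde", "hello world"),
        ("oui", "yes definitely absolutely for sure"),   # noisy pair
        ("merci beaucoup", "thanks a lot")]
for stage, subset in enumerate(curriculum(data)):
    print(f"stage {stage}: {len(subset)} examples")
```

The noisy `("oui", ...)` pair is seen early, when all data is used, but is filtered out of the later, cleaner stages.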
For resource-poor languages, Google implemented a back-translation scheme in Translate to augment parallel training data, in which each sentence in one language is paired with its translation. In this scheme, the training data is automatically combined with synthetic parallel data, such that the target text is natural language but the source is generated by a neural translation model. The result is that Google Translate can exploit the more abundant monolingual text data for training models, which Google says is particularly helpful for improving fluency.
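A minimal sketch of back-translation, assuming a toy word-for-word reverse model in place of the trained target-to-source neural model a real system would use:

```python
def reverse_translate(target_sentence):
    """Stand-in for a trained target->source neural model:
    a toy English->French word lookup, invented for illustration."""
    toy_lexicon = {"hello": "bonjour", "world": "monde", "friends": "amis"}
    return " ".join(toy_lexicon.get(w, w) for w in target_sentence.split())

def back_translate(monolingual_target):
    """Turn monolingual target-language text into synthetic parallel pairs:
    (synthetic source, natural target)."""
    return [(reverse_translate(t), t) for t in monolingual_target]

real_pairs = [("bonjour le monde", "hello world")]          # scarce genuine data
synthetic = back_translate(["hello friends", "hello world"])  # plentiful monolingual data
training_data = real_pairs + synthetic
print(training_data)
```

Because the target side of every synthetic pair is genuine human text, training on it tends to improve the fluency of the model's output even when the synthetic source side is imperfect.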
Pictured: Google Maps with translation.
Google Translate now also uses M4, a massively multilingual model that translates between many languages and English. (M4 was first proposed in a paper last year that demonstrated improved translation quality for more than 30 low-resource languages after training on 25 billion sentence pairs from more than 100 languages.) M4 made transfer learning possible in Google Translate, so that insights gleaned from training on high-resource languages including French, German, and Spanish (which have billions of parallel examples) can be applied to the translation of low-resource languages such as Yoruba, Sindhi, and Hawaiian (which have only tens of thousands of examples).
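One way multilingual models of this family steer a single shared network toward a chosen output language is by prepending a target-language token to the source sentence, a convention used in Google's multilingual NMT papers. The sketch below shows only that data-preparation convention; the token names and function are illustrative, not Google's actual code.

```python
def prepare_example(src_sentence, tgt_lang):
    """Prepend a target-language token so one shared model knows
    which language to produce (illustrative token format)."""
    return f"<2{tgt_lang}> {src_sentence}"

# High- and low-resource pairs pass through the same model, so representations
# learned from billions of French examples can transfer to Yoruba.
batch = [
    prepare_example("How are you?", "fr"),  # high-resource target
    prepare_example("How are you?", "yo"),  # low-resource target, same encoder
]
print(batch)
```

Sharing all parameters across languages is what lets the tens-of-thousands-of-examples languages borrow statistical strength from the billions-of-examples ones.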
Looking to the future
According to Google, Translate has improved by at least 1 BLEU point per year since 2010, but automatic machine translation is by no means a solved problem. Google concedes that even its enhanced models are prone to errors, including confusing different dialects of a language, producing overly literal translations, and performing poorly on particular genres of subject matter and on informal or spoken language.
Google is trying to address this in various ways, including through its Google Translate Community program, which recruits volunteers to help improve translation quality for low-resource languages by translating words and phrases or checking whether existing translations are correct. In February alone, the program, combined with emerging machine learning techniques, enabled Translate to add five languages spoken by a combined 75 million people: Kinyarwanda, Odia (Oriya), Tatar, Turkmen, and Uyghur.
It's not just Google pursuing a truly universal translator. In August 2018, Facebook unveiled an AI model that combines word-by-word translation, language models, and back-translation to perform better on language pairs. More recently, researchers at MIT's Computer Science and Artificial Intelligence Laboratory proposed an unsupervised model, one that learns from data that has never been explicitly labeled or categorized, which can translate between texts in two languages without direct translational data between them.
In a statement, Google diplomatically said it is grateful for the machine translation research "in academia and industry," some of which informed its own work. "We accomplished this by synthesizing and expanding the latest developments," the company said. "With this update, we are proud to provide automatic translations that are relatively coherent, even for the lowest-resource of the 108 supported languages."