Automatic Multilingual Recognition of Named Entities in Various Domains, for the Purposes of Machine Translation Anonymization
DOI:
https://doi.org/10.26334/2183-9077/rapln9ano2022a12
Keywords:
Machine-Translation, Named Entities, Annotation, Gold Standards, Aligners
Abstract
This article describes research developed at Unbabel, a Portuguese Machine Translation start-up that combines Machine Translation (MT) with human post-editing, with a focus on customer service content. Drawing on work carried out in a real multilingual, AI-powered, human-refined MT industry setting, we aim to contribute to improving MT quality and good practices by showing the importance of robust, continuously developed Named Entity Recognition (NER) systems for General Data Protection Regulation (GDPR) compliance. We report three experiments, the result of a year of joint work between Unbabel's linguists and Unbabel's Artificial Intelligence (AI) engineering team. The first experiment focused on developing a methodology for identifying and annotating domain-specific Named Entities (NEs) for the food industry. The methodology supports the construction of gold standards for building domain-specific NER systems and can be applied to many other domains. By implementing it, we identified the following set of domain-specific NEs: Restaurant Names; Restaurant Chains; Dishes; Beverage; Ingredients. The second and third experiments explored the semi-automatic construction of multilingual NER gold standards for different domains and language pairs, using aligners that project Named Entities across a parallel corpus. Both experiments made it possible to benchmark four open-source aligners (SimAlign; Fastalign; AwesomeAlign; Eflomal), identifying the best-performing one and, at the same time, validating this approach. This work should be taken as a statement of multidisciplinarity, demonstrating the much-needed articulation between the different scientific fields that make up the area of Natural Language Processing (NLP).
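To make the idea of projecting Named Entities across a parallel corpus concrete, here is a minimal sketch of annotation projection through word alignments. It is an illustration under stated assumptions, not the method used in the article: it assumes alignments are given as `i-j` source-target token index pairs (the format that aligners such as fast_align commonly output), and it uses a simple heuristic that maps each source entity to the smallest target span covering all of its aligned tokens. The function names, the example sentences, and the labels `DISH` and `RESTAURANT_NAME` are hypothetical.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, entity label)
Span = Tuple[int, int, str]


def parse_alignment(line: str) -> List[Tuple[int, int]]:
    """Parse a line of 'i-j' pairs into (source_index, target_index) tuples."""
    return [tuple(map(int, pair.split("-"))) for pair in line.split()]


def project_entities(entities: List[Span],
                     alignment: List[Tuple[int, int]]) -> List[Span]:
    """Project source-side entity spans onto the target sentence.

    Assumed heuristic: the projected span is the smallest contiguous
    target span covering every target token aligned to the entity.
    Entities with no aligned token are dropped.
    """
    projected = []
    for start, end, label in entities:
        targets = [t for s, t in alignment if start <= s < end]
        if not targets:
            continue  # nothing to project; leave for manual review
        projected.append((min(targets), max(targets) + 1, label))
    return projected


# Hypothetical English-Portuguese sentence pair.
source_tokens = ["I", "ordered", "pizza", "at", "Luigi's"]
target_tokens = ["Pedi", "pizza", "no", "Luigi's"]

# Hypothetical aligner output for the pair above.
alignment = parse_alignment("0-0 1-0 2-1 3-2 4-3")

# Hypothetical gold annotations on the source side.
source_entities = [(2, 3, "DISH"), (4, 5, "RESTAURANT_NAME")]

target_entities = project_entities(source_entities, alignment)
for start, end, label in target_entities:
    print(label, target_tokens[start:end])
```

In a gold-standard workflow of this kind, projected spans would typically still be reviewed by annotators, since alignment errors propagate directly into the target-side annotations.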
License
Copyright (c) 2022 Miguel Menezes, Vera Cabarrão, Helena Moniz, Pedro Mota

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Authors retain copyright and concede to the journal the right of first publication. The articles are simultaneously licensed under the Creative Commons Attribution License, which allows sharing of the work with an acknowledgement of authorship and initial publication in this journal.
The authors have permission to make the version of the text published in RAPL available in institutional repositories or other platforms for the distribution of academic papers (e.g., ResearchGate).


