Multilingual Automatic Recognition of Named Entities in Several Domains, for the Purpose of Machine Translation Anonymization
DOI:
https://doi.org/10.26334/2183-9077/rapln9ano2022a12
Keywords:
Machine Translation, Named Entities, Annotation, Gold Standards, Aligners
Abstract
This article describes research carried out at Unbabel, a Portuguese Machine Translation start-up that combines Machine Translation (MT) with human post-editing, with a focus on customer service content. Drawing on work carried out within a real multilingual, AI-powered, human-refined MT industry setting, we aim to contribute to improving MT quality and good practices by showing the importance of robust, continuously developed Named Entity Recognition (NER) systems for General Data Protection Regulation (GDPR) compliance. We report three experiments, the result of a year of joint work with Unbabel's linguists and Unbabel's Artificial Intelligence (AI) engineering team. The first experiment focused on developing a methodology for the identification and annotation of domain-specific Named Entities (NEs) for the food industry. The methodology allows the construction of gold standards for building domain-specific NER systems and can be applied to many other domains. By applying it, we identified the following set of domain-specific NEs: Restaurant Names; Restaurant Chains; Dishes; Beverages; Ingredients. The second and third experiments explored the possibility of semi-automatically constructing multilingual NER gold standards for different domains and language pairs, using aligners that project Named Entities across a parallel corpus. Both experiments made it possible to benchmark four open-source aligners (SimAlign; Fastalign; AwesomeAlign; Eflomal), allowing us to identify the best-performing one and, at the same time, to validate this approach. This work should be taken as a statement of multidisciplinarity, demonstrating and validating the much-needed articulation between the different scientific fields that make up the area of Natural Language Processing (NLP).
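The idea of projecting Named Entities across a parallel corpus can be illustrated with a minimal sketch. It is not the article's implementation: it assumes the aligner output is given as `i-j` token-index pairs (source index, target index), a format word aligners commonly emit, and the function names, label strings, sentences, and alignment are all hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, label)


def parse_alignment(line: str) -> Dict[int, List[int]]:
    """Parse 'i-j' pairs (source index - target index) into a source-to-target map."""
    links: Dict[int, List[int]] = {}
    for pair in line.split():
        src, tgt = pair.split("-")
        links.setdefault(int(src), []).append(int(tgt))
    return links


def project_entities(spans: List[Span], links: Dict[int, List[int]]) -> List[Span]:
    """Project each source span onto the smallest target span covering its aligned tokens."""
    projected = []
    for start, end, label in spans:
        targets = [t for s in range(start, end) for t in links.get(s, [])]
        if targets:  # entities with no aligned token are skipped (assumption: reviewed manually)
            projected.append((min(targets), max(targets) + 1, label))
    return projected


# Hypothetical example, not taken from the article's data.
source = "I ordered a Francesinha at Cervejaria Brasão".split()
target = "Pedi uma Francesinha na Cervejaria Brasão".split()
alignment = "0-0 1-0 2-1 3-2 4-3 5-4 6-5"
source_spans = [(3, 4, "DISH"), (5, 7, "RESTAURANT_NAME")]

target_spans = project_entities(source_spans, parse_alignment(alignment))
for start, end, label in target_spans:
    print(label, " ".join(target[start:end]))
```

In a sketch like this, the quality of the projected annotations depends entirely on the alignment, which is why comparing aligners matters before treating the projected side as a gold standard.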
License
Authors retain copyright and grant the journal the right of first publication. Articles are simultaneously licensed under the Creative Commons Attribution License, which permits sharing of the work with acknowledgement of its authorship and of its initial publication in this journal.
Authors are permitted to make the version of the text published in RAPL available in institutional repositories or on other platforms for distributing academic work (e.g., ResearchGate).