Multilingual Automatic Recognition of Named Entities Across Domains, for the Purpose of Anonymizing Machine Translation

Authors

  • Miguel Menezes, Universidade de Lisboa, Faculdade de Letras, Lisboa / INESC-ID, Lisboa
  • Vera Cabarrão, Unbabel, Lisboa
  • Helena Moniz, Universidade de Lisboa, Faculdade de Letras, Lisboa / INESC-ID, Lisboa
  • Pedro Mota, Unbabel, Lisboa

DOI:

https://doi.org/10.26334/2183-9077/rapln9ano2022a12

Keywords:

Machine Translation, Named Entities, Annotation, Gold Standards, Aligners

Abstract

This article describes research carried out at Unbabel, a Portuguese Machine Translation start-up that combines Machine Translation (MT) with human post-editing, with a focus on customer service content. Working within a real multilingual, AI-powered, human-refined MT industry setting, we aim to contribute to better MT quality and good practices by showing the importance of robust, continuously developed Named Entity Recognition (NER) systems for General Data Protection Regulation (GDPR) compliance. We report three experiments, the result of a year of joint work between Unbabel's linguists and Unbabel's Artificial Intelligence (AI) engineering team. The first experiment focused on developing a methodology for identifying and annotating domain-specific Named Entities (NEs) for the Food Industry. The methodology supports the construction of gold standards for building domain-specific NER systems and can be applied to many other domains. By applying it, we identified the following set of domain-specific NEs: Restaurant Names; Restaurant Chains; Dishes; Beverages; Ingredients. The second and third experiments explored the semi-automatic construction of multilingual NER gold standards for different domains and language pairs, using aligners that project Named Entities across a parallel corpus. Both experiments made it possible to benchmark four open-source aligners (SimAlign; Fastalign; AwesomeAlign; Eflomal), allowing us to identify the best-performing one and, at the same time, to validate the approach. This work should be read as a statement of multidisciplinarity, demonstrating and validating the much-needed articulation between the different scientific fields that make up Natural Language Processing (NLP).
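
As an illustration of the projection step mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes word alignments in the `i-j` (source index, target index) text format that aligners such as fast_align emit, and entity spans given as token offsets with an exclusive end. The function names, the example sentence pair, its alignment, and the label string are all hypothetical.

```python
from collections import defaultdict


def parse_alignment(pharaoh):
    """Parse 'i-j' pairs (source index - target index) into a dict of sets."""
    links = defaultdict(set)
    for pair in pharaoh.split():
        src, tgt = pair.split("-")
        links[int(src)].add(int(tgt))
    return links


def project_entities(entities, links):
    """Project (start, end, label) source spans onto target token offsets.

    Spans use an exclusive end. The projected span covers the smallest
    contiguous target range containing every aligned target token.
    """
    projected = []
    for start, end, label in entities:
        targets = sorted(t for s in range(start, end) for t in links.get(s, ()))
        if not targets:
            continue  # no aligned target tokens: leave for manual annotation
        projected.append((targets[0], targets[-1] + 1, label))
    return projected


# Hypothetical English-Portuguese sentence pair and alignment.
source = "I ate at Pizza Hut yesterday".split()
target = "Ontem comi no Pizza Hut".split()
links = parse_alignment("1-1 2-2 3-3 4-4 5-0")

# Hypothetical source annotation: tokens 3..4 form a restaurant chain.
source_entities = [(3, 5, "RESTAURANT_CHAIN")]

for start, end, label in project_entities(source_entities, links):
    print(label, " ".join(target[start:end]))
```

Taking the smallest covering range is one simple design choice; it can over-extend a span when the alignment is noisy, which is one reason a projected annotation is typically reviewed before being treated as a gold standard.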

Published

2022-10-25

How to Cite

Menezes, M., Cabarrão, V., Moniz, H., & Mota, P. (2022). Reconhecimento Automático Multilingue de Entidades Mencionadas em Diversos Domínios, para Efeitos de Anonimização de Tradução Automática. Revista Da Associação Portuguesa De Linguística, (9), 169–185. https://doi.org/10.26334/2183-9077/rapln9ano2022a12