Contextual Reinforcement, Entity Delimitation and Generative Data Augmentation for Entity Recognition and Relation Extraction in Official Documents

Fabiano Muniz Belém; Cláudio Valiense; Celso França; Marcos Carvalho; Marcelo Ganem; Gabriel Teixeira; Gabriel Jallais; Alberto H. F. Laender; Marcos A. Gonçalves

doi:10.5753/jidm.2023.3180

Authors

Fabiano Muniz Belém Universidade Federal de Minas Gerais https://orcid.org/0000-0003-1076-2052
Cláudio Valiense Universidade Federal de Minas Gerais https://orcid.org/0000-0002-7366-2633
Celso França Universidade Federal de Minas Gerais https://orcid.org/0000-0002-0251-7172
Marcos Carvalho Universidade Federal de Minas Gerais https://orcid.org/0000-0001-6474-6467
Marcelo Ganem Universidade Federal de Minas Gerais https://orcid.org/0000-0003-0842-4732
Gabriel Teixeira Universidade Federal de Minas Gerais https://orcid.org/0009-0003-2951-2982
Gabriel Jallais Universidade Federal de Minas Gerais https://orcid.org/0009-0009-6559-1546
Alberto H. F. Laender Universidade Federal de Minas Gerais https://orcid.org/0000-0001-5032-2233
Marcos A. Gonçalves Universidade Federal de Minas Gerais https://orcid.org/0000-0002-2075-3363

DOI:

https://doi.org/10.5753/jidm.2023.3180

Keywords:

Named Entity Recognition, Relation Extraction, Contextual Embeddings, Contextual Reinforcement, Entity Delimitation, Training Data Augmentation, GPT, Public Administration

Abstract

Transformer architectures have become the main component of various state-of-the-art methods for natural language processing tasks, such as Named Entity Recognition and Relation Extraction (NER+RE). As these architectures rely on semantic (contextual) aspects of word sequences, they may fail to accurately identify and delimit entity spans when there is little semantic context surrounding the named entities. This is the case for entities composed only of digits and punctuation, such as IDs and phone numbers, as well as long compound names. In this article, we propose new contextual reinforcement and entity delimitation techniques, based on pre- and post-processing, that provide a richer semantic context and improve SpERT, a state-of-the-art Span-based Entity and Relation Transformer. To provide further context to the training process of NER+RE, we propose a data augmentation technique based on Generative Pretrained Transformers (GPT). We evaluate our strategies using real data from public administration documents (official gazettes and biddings) and court lawsuits. Our results show that our pre- and post-processing strategies, when used jointly, allow significant improvements in NER+RE effectiveness, and they also show the benefits of using GPT for training data augmentation.
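The abstract does not spell out how the pre- and post-processing steps work, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that digit-and-punctuation entities can be located with regular expressions and that reinforcement means wrapping each match in marker tokens. The patterns, the marker format, and the function names `reinforce` and `trim_span` are all hypothetical.

```python
import re

# Hypothetical patterns: the abstract names IDs and phone numbers only as examples.
PATTERNS = {
    "PHONE": re.compile(r"\(\d{2}\)\s?\d{4,5}-\d{4}"),
    "ID": re.compile(r"\d{3}\.\d{3}\.\d{3}-\d{2}"),
}


def reinforce(text):
    """Pre-processing: wrap each regex match in marker tokens (assumed format)."""
    matches = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            matches.append((m.start(), m.end(), label))
    # Rewrite from right to left so that earlier character offsets stay valid.
    for start, end, label in sorted(matches, reverse=True):
        marked = f"[{label}] {text[start:end]} [/{label}]"
        text = text[:start] + marked + text[end:]
    return text


def trim_span(span):
    """Post-processing: drop stray whitespace and separators around a predicted span."""
    return span.strip(" \t\n,;:")


print(reinforce("CPF 123.456.789-00, tel. (31) 3409-5000"))
print(trim_span(" João da Silva, "))
```

In a sketch like this, the markers give the model explicit lexical context around tokens that otherwise carry little semantic signal, and the trimming step stands in for whatever span-boundary correction is applied to the model's predictions.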


