PromptNER: An Automatic Prompt-Learning Data Labeling Approach for Named Entity Recognition on Sensitive Data

Claudio M. V. de Andrade; Fabiano Muniz Belém; Celso França; Marcos Carvalho; Marcelo Ganem; Gabriel Texeira; Gabriel Jallais; Alberto H. F. Laender; Marcos A. Gonçalves

doi:10.5753/jidm.2025.4298

Authors

Claudio M. V. de Andrade Universidade Federal de Minas Gerais https://orcid.org/0000-0002-7366-2633
Fabiano Muniz Belém Universidade Federal de Minas Gerais https://orcid.org/0000-0003-1076-2052
Celso França Universidade Federal de Minas Gerais https://orcid.org/0000-0002-0251-7172
Marcos Carvalho Universidade Federal de Minas Gerais https://orcid.org/0000-0001-6474-6467
Marcelo Ganem Universidade Federal de Minas Gerais https://orcid.org/0000-0003-0842-4732
Gabriel Texeira Universidade Federal de Minas Gerais https://orcid.org/0009-0003-2951-2982
Gabriel Jallais Universidade Federal de Minas Gerais https://orcid.org/0009-0009-6559-1546
Alberto H. F. Laender Universidade Federal de Minas Gerais https://orcid.org/0000-0001-5032-2233
Marcos A. Gonçalves Universidade Federal de Minas Gerais https://orcid.org/0000-0002-2075-3363

DOI:

https://doi.org/10.5753/jidm.2025.4298

Keywords:

Automatic Labeling, Named Entity Recognition, Prompt Learning, Large Language Models, Sensitive Data

Abstract

We address the task of Named Entity Recognition (NER) for entities of the types Organization and Product/Service found in textual complaints recorded on Web platforms. Due to the high inference power of Large Language Models (LLMs), there is growing interest in applying them to a variety of problems. However, they incur high infrastructure costs and raise privacy concerns when accessed through external APIs. Accordingly, in this article we propose PromptNER, an approach that uses LLMs to recognize entities in consumers' complaints and then uses the resulting automatically labeled data to locally train simpler models, such as SpERT (Span-based Entity and Relation Extraction Transformer), for the task of entity and relation extraction, achieving scalability and privacy. Our PromptNER-enhanced model achieves significant gains in F-score, between 41% and 129% over the SpERT model trained with manually labeled data and between 30% and 268% over a recent zero-shot LLM (Llama 3.1).
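The labeling step described in the abstract can be illustrated with a minimal sketch. It assumes the LLM has already returned a list of entity mentions with types for a complaint, and it only shows how such output might be aligned to token spans to form a training record for a locally trained span-based model. The record layout (`tokens`, `entities`, `relations`), the function names, the whitespace tokenization, and the example complaint are all assumptions made for illustration. They are not taken from the article, which does not specify its prompts or alignment procedure here.

```python
from typing import Dict, List, Optional, Tuple


def find_span(tokens: List[str], mention_tokens: List[str]) -> Optional[Tuple[int, int]]:
    """Return (start, end) token offsets, end exclusive, of the first
    case-insensitive occurrence of mention_tokens in tokens, or None."""
    n = len(mention_tokens)
    if n == 0:
        return None
    lowered = [t.lower() for t in tokens]
    target = [t.lower() for t in mention_tokens]
    for start in range(len(tokens) - n + 1):
        if lowered[start:start + n] == target:
            return start, start + n
    return None


def to_training_record(text: str, llm_entities: List[Dict[str, str]]) -> Dict:
    """Convert LLM-produced mentions into a span-annotated record.

    Hypothetical record layout: whitespace tokens plus entity spans.
    Mentions that cannot be located in the text are discarded, which
    filters out entities the LLM produced but the complaint never states.
    """
    tokens = text.split()
    entities = []
    for ent in llm_entities:
        span = find_span(tokens, ent["mention"].split())
        if span is None:
            continue  # mention not present in the text: skip it
        entities.append({"type": ent["type"], "start": span[0], "end": span[1]})
    # Relations are left empty: this sketch covers entity labeling only.
    return {"tokens": tokens, "entities": entities, "relations": []}


# Invented example complaint and invented LLM output (illustration only).
complaint = "I bought a Galaxy phone from Acme Store and it arrived broken"
llm_output = [
    {"mention": "Acme Store", "type": "Organization"},
    {"mention": "Galaxy phone", "type": "Product/Service"},
    {"mention": "Globex", "type": "Organization"},  # not in the text
]
record = to_training_record(complaint, llm_output)
print(record["entities"])
```

Records built this way could be accumulated into a training file for a local model, so that the complaints themselves never need to be sent to an external service after the labeling stage. Exact-match alignment is the simplest possible choice here, and a real pipeline would likely need more tolerant matching for punctuation and spelling variants.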


