Natural Language Processing in Software Engineering: A Systematic Literature Review
DOI:
https://doi.org/10.5753/jserd.2025.5097
Keywords:
Natural Language Processing, Software Engineering, Machine Learning, Literature Review
Abstract
Context: Software engineering (SE) artifacts and documents, such as requirements specifications, user stories, test cases, and concepts of operations (ConOps), are typically written in natural language, making their manipulation challenging. Natural Language Processing (NLP) is a viable solution for managing these tasks.
Objective: To conduct a systematic literature review to explore the current use of NLP in SE artifacts and tasks, supplemented by a tertiary study focusing on the emerging role of Large Language Models (LLMs) in software engineering research.
Method: We searched digital libraries for relevant papers and applied inclusion and exclusion criteria to filter the primary studies. We then analyzed NLP techniques applied to SE documents and examined their usage in this context. Our research methodology followed Kitchenham and Charters' guidelines. Additionally, we conducted a tertiary study to synthesize findings from existing systematic literature reviews and surveys specifically addressing LLMs in software engineering.
Results: We selected 60 primary studies to identify the most common methods for NLP pipelines, feature extraction, language models, and machine learning algorithms used in SE. We also assessed the purposes of these methods, their benefits for SE, their difficulty, and their contribution to SE advancement. The tertiary study revealed a rapid proliferation of LLM-focused research, with comprehensive reviews documenting exponential growth in publications and widespread adoption across diverse SE tasks.
Conclusion: Requirements are the most frequently addressed artifacts using NLP techniques, with preprocessing and part-of-speech (POS) tagging being widely used. There is a notable increase in the use of large language models for various SE tasks, such as requirements elicitation, source code generation, bug fixing, and software testing. The tertiary study confirms that LLMs represent a pivotal shift in the research landscape, warranting dedicated investigation to understand their transformative impact on NLP applications in software engineering.
References
AlDhafer, O., Ahmad, I., and Mahmood, S. (2022). An end-to-end deep learning system for requirements classification using recurrent neural networks. Information and Software Technology, 147:106877.
Alrashedy, K., Dharmaretnam, D., German, D. M., Srinivasan, V., and Aaron Gulliver, T. (2020). Scc++: Predicting the programming language of questions and snippets of stack overflow. Journal of Systems and Software, 162:110505.
Alsukhni, B. (2021). Multi-label arabic text classification based on deep learning. In 2021 12th International Conference on Information and Communication Systems (ICICS), pages 475–477.
Arora, C., Sabetzadeh, M., Briand, L., and Zimmer, F. (2015). Automated checking of conformance to requirements templates using natural language processing. IEEE Transactions on Software Engineering, 41(10):944–968.
Arthur, M. P. (2020). Automatic source code documentation using code summarization technique of nlp. Procedia Computer Science, 171:2522–2531. Third International Conference on Computing and Network Communications (CoCoNet’19).
Arunthavanathan, A., Shanmugathasan, S., Ratnavel, S., Thiyagarajah, V., Perera, I., Meedeniya, D., and Balasubramaniam, D. (2016). Support for traceability management of software artefacts using natural language processing. In 2016 Moratuwa Engineering Research Conference (MERCon), pages 18–23.
Asadabadi, M. R., Saberi, M., Zwikael, O., and Chang, E. (2020). Ambiguous requirements: A semi-automated approach to identify and clarify ambiguity in large-scale projects. Computers & Industrial Engineering, 149:106828.
Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563.
Bhatia, K., Mishra, S., and Sharma, A. (2020). Clustering glossary terms extracted from large-sized software requirements using fasttext. In Proceedings of the 13th Innovations in Software Engineering Conference on Formerly Known as India Software Engineering Conference, ISEC 2020, New York, NY, USA. Association for Computing Machinery.
Bidulya, Y. (2018). An approach to the development of software for effective search of scientific articles. In 2018 3rd Russian-Pacific Conference on Computer Technology and Applications (RPC), pages 1–4.
Blasi, A., Gorla, A., Ernst, M. D., and Pezzè, M. (2023). Call me maybe: Using nlp to automatically generate unit test cases respecting temporal constraints. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ASE ’22, New York, NY, USA. Association for Computing Machinery.
Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
Casamayor, A., Godoy, D., and Campo, M. (2011). Mining textual requirements to assist architectural software design: a state of the art review. Springer Science+Business Media B.V.
Casillo, F., Deufemia, V., and Gravino, C. (2022). Detecting privacy requirements from user stories with nlp transfer learning models. Information and Software Technology, 146:106853.
Cheema, S. M., Tariq, S., and Pires, I. M. (2023). A natural language interface for automatic generation of data flow diagram using web extraction techniques. Journal of King Saud University - Computer and Information Sciences, 35(2):626–640.
Cho, H., Lee, S., and Kang, S. (2022). Classifying issue reports according to feature descriptions in a user manual based on a deep learning model. Information and Software Technology, 142:106743.
Cruz, B. D., Jayaraman, B., Dwarakanath, A., and McMillan, C. (2017). Detecting vague words & phrases in requirements documents in a multilingual environment. In 2017 IEEE 25th International Requirements Engineering Conference (RE), pages 233–242.
Dalpiaz, F., Ferrari, A., Franch, X., and Palomares, C. (2018). Natural language processing for requirements engineering: The best is yet to come.
Dalpiaz, F., van der Schalk, I., Brinkkemper, S., Aydemir, F. B., and Lucassen, G. (2019). Detecting terminological ambiguity in user stories: Tool and experimentation. Information and Software Technology, 110:3–16.
De Bortoli Fávero, E. M., Casanova, D., and Pimentel, A. R. (2022). Se3m: A model for software effort estimation using pre-trained embedding models. Information and Software Technology, 147:106886.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding.
Elallaoui, M., Nafil, K., and Touahni, R. (2018). Automatic transformation of user stories into uml use case diagrams using nlp techniques. Procedia Computer Science, 130:42–49. The 9th International Conference on Ambient Systems, Networks and Technologies (ANT 2018) / The 8th International Conference on Sustainable Energy Information Technology (SEIT-2018) / Affiliated Workshops.
Ezzini, S., Abualhaija, S., Arora, C., and Sabetzadeh, M. (2022). Automated handling of anaphoric ambiguity in requirements: A multi-solution study. In Proceedings of the 44th International Conference on Software Engineering, ICSE ’22, page 187–199, New York, NY, USA. Association for Computing Machinery.
Ezzini, S., Abualhaija, S., Arora, C., Sabetzadeh, M., and Briand, L. C. (2021). Using domain-specific corpora for improved handling of ambiguity in requirements. In Proceedings of the 43rd International Conference on Software Engineering, ICSE ’21, page 1485–1497. IEEE Press.
Fan, A., Gokkaya, B., Harman, M., Lyubarskiy, M., Sengupta, S., Yoo, S., and Zhang, J. M. (2023). Large language models for software engineering: Survey and open problems.
Fantechi, A., Gnesi, S., and Semini, L. (2023). Vibe: Looking for variability in ambiguous requirements. Journal of Systems and Software, 195:111540.
Fattahi, J. and Mejri, M. (2021). Spaml: a bimodal ensemble learning spam detector based on nlp techniques. In 2021 IEEE 5th International Conference on Cryptography, Security and Privacy (CSP), pages 107–112.
Fellbaum, C., editor (1998). WordNet: An Electronic Lexical Database. Language, Speech, and Communication. MIT Press, Cambridge, MA.
Fischbach, J., Frattini, J., Vogelsang, A., Mendez, D., Unterkalmsteiner, M., Wehrle, A., Henao, P. R., Yousefi, P., Juricic, T., Radduenz, J., and Wiecher, C. (2023). Automatic creation of acceptance tests by extracting conditionals from requirements: Nlp approach and case study. Journal of Systems and Software, 197:111549.
Ghaisas, S., Motwani, M., and Anish, P. R. (2013). Detecting system use cases and validations from documents. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 568–573.
Gilson, F. and Weyns, D. (2019). When natural language processing jumps into collaborative software engineering. In 2019 IEEE International Conference on Software Architecture Companion (ICSA-C), pages 238–241.
Gomes, L., da Silva Torres, R., and Côrtes, M. L. (2023). Bert- and tf-idf-based feature extraction for long-lived bug prediction in floss: A comparative study. Information and Software Technology, 160:107217.
Goodfellow, I. J., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press, Cambridge, MA, USA.
Greghi, J. G., Martins, E., and Carvalho, A. M. B. R. (2015). Semi-automatic generation of extended finite state machines from natural language standard documents. In 2015 IEEE International Conference on Dependable Systems and Networks Workshops, pages 45–50.
Gupta, S. and Gupta, S. K. (2019). Natural language processing in mining unstructured data from software repositories: a review. Indian Academy of Sciences.
Gupta, S., Malik, S., Pollock, L., and Vijay-Shanker, K. (2013). Part-of-speech tagging of program identifiers for improved text-based software engineering tools. In 2013 21st International Conference on Program Comprehension (ICPC), pages 3–12.
Halim, F. and Siahaan, D. (2019). Detecting non-atomic requirements in software requirements specifications using classification methods. In 2019 1st International Conference on Cybernetics and Intelligent System (ICORIS), volume 1, pages 269–273.
Hamza, M. and Walker, R. J. (2015). Recommending features and feature relationships from requirements documents for software product lines. In 2015 IEEE/ACM 4th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering, pages 25–31.
Hou, X., Zhao, Y., Liu, Y., Yang, Z., Wang, K., Li, L., Luo, X., Lo, D., Grundy, J., and Wang, H. (2024). Large language models for software engineering: A systematic literature review.
Hu, D., Chen, M., Wang, T., Chang, J., Yin, G., Yu, Y., and Zhang, Y. (2018). Recommending similar bug reports: A novel approach using document embedding model. In 2018 25th Asia-Pacific Software Engineering Conference (APSEC), pages 725–726.
Jaiwai, M. and Sammapun, U. (2017). Extracting uml class diagrams from software requirements in thai using nlp. In 2017 14th International Joint Conference on Computer Science and Software Engineering (JCSSE), pages 1–5.
Kadebu, P., Sikka, S., Tyagi, R. K., and Chiurunge, P. (2023). A classification approach for software requirements towards maintainable security. Scientific African, 19:e01496.
Kaur, K. and Kaur, P. (2023). Bert-cnn: Improving bert for requirements classification using cnn. Procedia Computer Science, 218:2604–2611. International Conference on Machine Learning and Data Engineering.
Kitchenham, B. A. and Charters, S. (2007). Guidelines for performing systematic literature reviews in software engineering. Technical Report EBSE 2007-001, Keele University and Durham University Joint Report.
Kolahdouz-Rahimi, S., Lano, K., and Lin, C. (2023). Requirement formalisation using natural language processing and machine learning: A systematic review.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, page 1097–1105, Red Hook, NY, USA. Curran Associates Inc.
Lapeña, R., Font, J., Pastor, O., and Cetina, C. (2017). Analyzing the impact of natural language processing over feature location in models. SIGPLAN Not., 52(12):63–76.
Li, B. and Nong, X. (2022). Automatically classifying non-functional requirements using deep neural network. Pattern Recognition, 132:108948.
Li, L., Li, Z., Zhang, W., Zhou, J., Wang, P., Wu, J., He, G., Zeng, X., Deng, Y., and Xie, T. (2020). Clustering test steps in natural language toward automating test automation. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2020, page 1285–1295, New York, NY, USA. Association for Computing Machinery.
Li, Y., Guzman, E., Tsiamoura, K., Schneider, F., and Bruegge, B. (2015). Automated requirements extraction for scientific software. Procedia Computer Science, 51:582–591. International Conference On Computational Science, ICCS 2015.
Liu, K., Reddivari, S., and Reddivari, K. (2022). Artificial intelligence in software requirements engineering: State-of-the-art. IEEE 23rd International Conference on Information Reuse and Integration for Data Science (IRI).
Liu, S., Sun, J., Liu, Y., Zhang, Y., Wadhwa, B., Dong, J. S., and Wang, X. (2014). Automatic early defects detection in use case documents. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, ASE ’14, page 785–790, New York, NY, USA. Association for Computing Machinery.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach.
Ma, S., Wang, H., Ma, L., Wang, L., Wang, W., Huang, S., Dong, L., Wang, R., Xue, J., and Wei, F. (2024). The era of 1-bit llms: All large language models are in 1.58 bits.
Malhotra, R., Chug, A., Hayrapetian, A., and Raje, R. (2016). Analyzing and evaluating security features in software requirements. In 2016 International Conference on Innovation and Challenges in Cyber Security (ICICCS-INBUSH), pages 26–30.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Mastropaolo, A., Scalabrino, S., Cooper, N., Palacio, D. N., Poshyvanyk, D., Oliveto, R., and Bavota, G. (2021). Studying the usage of text-to-text transfer transformer to support code-related tasks. In Proceedings of the 43rd International Conference on Software Engineering, ICSE ’21, page 336–347. IEEE Press.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space.
Nasiri, S., Rhazali, Y., Lahmer, M., and Chenfour, N. (2020). Towards a generation of class diagram from user stories in agile methods. Procedia Computer Science, 170:831–837. The 11th International Conference on Ambient Systems, Networks and Technologies (ANT) / The 3rd International Conference on Emerging Data and Industry 4.0 (EDI40) / Affiliated Workshops.
Omran, F. N. A. A. and Treude, C. (2017). Choosing an nlp library for analyzing software documentation: A systematic literature review and a series of experiments. IEEE/ACM 14th International Conference on Mining Software Repositories (MSR).
Ozkaya, I. (2023). Application of large language models to software engineering tasks: Opportunities, risks, and implications. IEEE Software, 40(3):4–8.
Pennington, J., Socher, R., and Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
Pérez, F., Lapeña, R., Marcén, A. C., and Cetina, C. (2021). Topic modeling for feature location in software models: Studying both code generation and interpreted models. Information and Software Technology, 140:106676.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2023). Exploring the limits of transfer learning with a unified text-to-text transformer.
Raharjana, I. K., Siahaan, D., and Fatichah, C. (2021). User stories and natural language processing: A systematic literature review. IEEE Access.
Rani, P., Panichella, S., Leuenberger, M., Di Sorbo, A., and Nierstrasz, O. (2021). How to identify class comment types? a multi-language approach for class comment classification. Journal of Systems and Software, 181:111047.
Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks.
Sajid, A., Jan, S., and Shah, I. A. (2017). Automatic topic modeling for single document short texts. In 2017 International Conference on Frontiers of Information Technology (FIT), pages 70–75.
Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2020). Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter.
Sawant, K. P., Roy, S., Parachuri, D., Plesse, F., and Bhattacharya, P. (2014). Enforcing structure on textual use cases via annotation models. In Proceedings of the 7th India Software Engineering Conference, ISEC ’14, New York, NY, USA. Association for Computing Machinery.
Sawant, N. and Sengamedu, S. H. (2022). Learning-based identification of coding best practices from software documentation. In 2022 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 533–542.
Shakeri Hossein Abad, Z., Gervasi, V., Zowghi, D., and H. Far, B. (2019). Supporting analysts by dynamic extraction and classification of requirements-related knowledge. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pages 442–453.
Shehadeh, K., Arman, N., and Khamayseh, F. (2021). Semi-automated classification of arabic user requirements into functional and non-functional requirements using nlp tools. In 2021 International Conference on Information Technology (ICIT), pages 527–532.
Shreda, Q. A. and Hanani, A. A. (2021). Identifying non-functional requirements from unconstrained documents using natural language processing and machine learning approaches. IEEE Access, pages 1–1.
Shu, Y., Lun, Y. H., Run, Y. X., and Ye, W. (2016). An automated method for constructing ontology. In 2016 7th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 538–541.
Siahaan, D., Raharjana, I. K., and Fatichah, C. (2023). User story extraction from natural language for requirements elicitation: Identify software-related information from online news. Information and Software Technology, 158:107195.
Singh, M. (2019). Using natural language processing and graph mining to explore inter-related requirements in software artefacts. SIGSOFT Softw. Eng. Notes, 44(1):37–42.
Sonbol, R., Rebdawi, G., and Ghneim, N. (2022). The use of nlp-based text representation techniques to support requirement engineering tasks: A systematic mapping review. IEEE Access, 10:62811–62830.
Stöckle, P., Wasserer, T., Grobauer, B., and Pretschner, A. (2023). Automated identification of security-relevant configuration settings using nlp. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ASE ’22, New York, NY, USA. Association for Computing Machinery.
Tahvili, S., Hatvani, L., Ramentol, E., Pimentel, R., Afzal, W., and Herrera, F. (2020). A novel methodology to classify test cases using natural language processing and imbalanced learning. Engineering Applications of Artificial Intelligence, 95:103878.
Treude, C., Prolo, C. A., and Filho, F. F. (2015). Challenges in analyzing software documentation in portuguese. In 2015 29th Brazilian Symposium on Software Engineering, pages 179–184.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need.
Viggiato, M., Paas, D., Buzon, C., and Bezemer, C.-P. (2022). Using natural language processing techniques to improve manual test case descriptions. In Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP ’22, page 311–320, New York, NY, USA. Association for Computing Machinery.
Wang, H., Ma, S., Dong, L., Huang, S., Wang, H., Ma, L., Yang, F., Wang, R., Wu, Y., and Wei, F. (2023a). Bitnet: Scaling 1-bit transformers for large language models.
Wang, S., Huang, L., Gao, A., Ge, J., Zhang, T., Feng, H., Satyarth, I., Li, M., Zhang, H., and Ng, V. (2023b). Machine/deep learning for software engineering: A systematic literature review. IEEE Transactions on Software Engineering, Vol. 49, No. 3.
Wein, S. and Briggs, P. (2021). A fully automated approach to requirement extraction from design documents. In 2021 IEEE Aerospace Conference (50100), pages 1–7.
Yang, S. and Sahraoui, H. (2022). Towards automatically extracting uml class diagrams from natural language specifications. In Proceedings of the 25th International Conference on Model Driven Engineering Languages and Systems: Companion Proceedings, MODELS ’22, page 396–403, New York, NY, USA. Association for Computing Machinery.
Zamani, K. (2021). A prediction model for software requirements change impact. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1028–1032.
Zhai, J., Shi, Y., Pan, M., Zhou, G., Liu, Y., Fang, C., Ma, S., Tan, L., and Zhang, X. (2020). C2s: Translating natural language comments to formal program specifications. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2020, page 25–37, New York, NY, USA. Association for Computing Machinery.
License
Copyright (c) 2025 Gabriel Nogueira Pacheco, Luiz Eduardo Galvão Martins, Ana Estela Antunes da Silva, Niklas Lavesson, Tony Gorschek

This work is licensed under a Creative Commons Attribution 4.0 International License.

