NES: Neural Embedding Squared
DOI: https://doi.org/10.5753/jisa.2025.4890

Keywords: textual data augmentation, sentence embeddings, natural language processing

Abstract
In natural language processing (NLP) and machine learning, the quality and quantity of training data play a pivotal role in model performance. Textual data augmentation, which artificially expands the training dataset by generating diverse yet semantically equivalent samples, has become a crucial tool for overcoming data scarcity and improving the robustness of NLP models. However, the available solutions that achieve state-of-the-art performance demand considerable computing power, because they invoke a resource-hungry machine learning model for each synthetic sentence they generate. This paper introduces an approach to textual data augmentation that leverages semantic representations to produce augmented data that not only expands the dataset but also respects the spatial distribution of the data points. The approach requires less computing power because it relies on fast prediction and spatial exploration in the embedding representation. In our experiments, it doubled model performance while correcting class imbalance.
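The abstract describes augmentation by spatial exploration in embedding space but does not reproduce the algorithm on this page. As a minimal sketch, assuming a SMOTE-style interpolation between same-class sentence embeddings and a generic encoder from the sentence-transformers library (the model name, sentences, and all variable names below are illustrative assumptions, not the authors' implementation), embedding-space augmentation might look like this:

```python
# Hedged sketch, NOT the paper's published method: synthesize new
# minority-class points by interpolating between the sentence embeddings
# of existing same-class samples (a SMOTE-style strategy in embedding
# space). Encoder choice and sample data are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder would do

minority_sentences = [
    "the service was unavailable for two hours",
    "requests keep timing out under load",
    "the API returns errors during peak traffic",
]
emb = model.encode(minority_sentences)  # shape: (n_sentences, embedding_dim)

rng = np.random.default_rng(0)
synthetic = []
for _ in range(4):  # number of synthetic embeddings to generate
    i, j = rng.choice(len(emb), size=2, replace=False)  # pick two distinct samples
    lam = rng.uniform(0.2, 0.8)  # interpolation coefficient
    synthetic.append(lam * emb[i] + (1.0 - lam) * emb[j])
synthetic = np.stack(synthetic)

# The synthetic vectors can be fed directly to a classifier trained on
# embeddings, rebalancing the minority class without invoking a large
# generative model once per synthetic sentence.
```

Because each synthetic point costs only a vector interpolation rather than a full generative-model forward pass, a scheme along these lines is consistent with the abstract's claim of lower computing requirements than generation-based augmentation.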
License
Copyright (c) 2025 Journal of Internet Services and Applications

This work is licensed under a Creative Commons Attribution 4.0 International License.

