"Call My Big Sibling (CMBS)" – A Confidence-Based Strategy Leveraging Instance Selection to Combine Small and Large Language Models for Cost-Effective Text Classification

Authors

C. M. V. de Andrade, W. Cunha, D. Reis, C. França, W. Ávila Apolinário, L. de C. Santos, A. S. Pagano, L. C. D. da Rocha, and M. A. Gonçalves

DOI:

https://doi.org/10.5753/jbcs.2026.6153

Keywords:

Text Classification, Large Language Models, Cost-Effectiveness

Abstract

Transformers have achieved state-of-the-art results in NLP, with Large Language Models (LLMs) leading many tasks. However, it remains unclear whether LLMs always outperform first-generation Transformers (also known as Small Language Models, SLMs) across different text classification tasks and scenarios (e.g., movie reviews, topic classification). This study compares four SLMs (BERT, RoBERTa, Qwen, BART) with four open LLMs (LLaMA 3.1, Mistral, Falcon, DeepSeek) across nine sentiment and four topic classification datasets, totaling over 1,000 results. The results show that fine-tuned open LLMs only moderately outperform or tie with SLMs, and at a very high computational cost. To address this trade-off, we propose "Call My Big Sibling" (CMBS), a novel confidence-based framework that integrates calibrated SLMs and open LLMs through advanced instance selection techniques. CMBS assigns high-confidence instances to the cheaper SLM and routes low-confidence instances to an LLM in zero-shot, in-context, or partially tuned modes, optimizing cost-effectiveness. Experiments show that CMBS significantly outperforms SLMs and delivers LLM-level performance at a fraction of the cost, offering a cost-sensitive solution for NLP applications.
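
To make the routing idea concrete, below is a minimal Python sketch of CMBS-style confidence routing. The names (cmbs_route, slm_predict_proba, llm_predict) and the 0.9 threshold are illustrative assumptions, not the paper's implementation; the full framework additionally involves calibrating the SLM's probabilities and instance selection, which are omitted here.

```python
# Minimal sketch (not the paper's code): route each instance to the cheap,
# calibrated SLM when its confidence is high; otherwise "call the big
# sibling" (an LLM in zero-shot, in-context, or partially tuned mode).
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

@dataclass
class Routed:
    text: str
    label: str
    handled_by: str  # "slm" or "llm"

def cmbs_route(
    texts: Sequence[str],
    slm_predict_proba: Callable[[str], Dict[str, float]],  # label -> calibrated prob.
    llm_predict: Callable[[str], str],                     # expensive LLM fallback
    threshold: float = 0.9,                                # illustrative cutoff
) -> List[Routed]:
    results: List[Routed] = []
    for text in texts:
        probs = slm_predict_proba(text)
        # The SLM's top calibrated probability acts as the confidence score.
        label, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            results.append(Routed(text, label, "slm"))               # cheap path
        else:
            results.append(Routed(text, llm_predict(text), "llm"))  # costly path
    return results

# Toy usage with stand-in models:
if __name__ == "__main__":
    for p in cmbs_route(
        ["great movie!", "hmm, not sure..."],
        slm_predict_proba=lambda t: {"pos": 0.95, "neg": 0.05}
        if "great" in t else {"pos": 0.55, "neg": 0.45},
        llm_predict=lambda t: "neg",
    ):
        print(p)
```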

Published

2026-05-05

How to Cite

Andrade, C. M. V. de, Cunha, W., Reis, D., França, C., Apolinário, W. Ávila, Santos, L. de C., Pagano, A. S., Rocha, L. C. D. da, & Gonçalves, M. A. (2026). "Call My Big Sibling (CMBS)" – A Confidence-Based Strategy Leveraging Instance Selection to Combine Small and Large Language Models for Cost-Effective Text Classification. Journal of the Brazilian Computer Society, 32(1), 1233–1249. https://doi.org/10.5753/jbcs.2026.6153

Issue

Volume 32, Number 1 (2026)

Section

Regular Issue