RagPharma: A RAG-Based Chatbot for Medicine Leaflets with a Dual-Dataset Evaluation Framework

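Authors

Navarro, L. C., Mutz, F., Paixão, T. M., Zanetti, G. G., Badue, C., De Souza, A. F., and Oliveira-Santos, T.
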
DOI:

https://doi.org/10.5753/jbcs.2025.5767

Keywords:

Retrieval-Augmented Generation, Large Language Models, Healthcare Chatbots, Perplexity Accuracy, Medicine Package Leaflets

Abstract

Despite being essential sources of information, Brazilian medicine package leaflets remain underutilized because of their complexity and the lack of user-friendly tools for information retrieval. Currently, there are no chat-based systems in Portuguese designed to help patients access and understand leaflet content. To address this gap, we present RagPharma, a novel Retrieval-Augmented Generation (RAG) system that integrates professional medicine leaflets into a chat interface to answer patient queries. During RagPharma's development, we observed that evaluation performance was significantly higher when the evaluation questions were derived from the same dataset used to build the system. This led us to identify a critical evaluation bias that is often overlooked in RAG applications. In response, we propose a novel dual-dataset evaluation framework, which separates the knowledge base and the evaluation source into distinct but related datasets. Experimental results confirmed the presence of bias when using overlapping datasets and demonstrated the reliability of our dual-dataset methodology. Under this new evaluation scheme, RagPharma achieved 81% accuracy using the Mistral 7B model, a 60% improvement over standalone LLMs. These findings validate both the effectiveness of RagPharma and the importance of unbiased evaluation strategies in domain-specific RAG systems.
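
The dual-dataset idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the bag-of-words retriever, the overlap-based `answer` function (a stand-in for the LLM), the multiple-choice question format, and all example texts are assumptions made here for clarity. The toy data is written by hand, so the numbers it prints say nothing about RagPharma itself.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, knowledge_base: list[str], k: int = 1) -> list[str]:
    """Return the k passages most similar to the query (hypothetical retriever)."""
    ranked = sorted(knowledge_base, key=lambda p: bow_cosine(query, p), reverse=True)
    return ranked[:k]


def answer(question: str, options: list[str], knowledge_base: list[str]) -> int:
    """Toy stand-in for the LLM: choose the option closest to the retrieved context."""
    context = " ".join(retrieve(question, knowledge_base))
    return max(range(len(options)), key=lambda i: bow_cosine(options[i], context))


def accuracy(questions: list[dict], knowledge_base: list[str]) -> float:
    """Fraction of multiple-choice questions answered correctly."""
    correct = sum(
        answer(q["question"], q["options"], knowledge_base) == q["answer"]
        for q in questions
    )
    return correct / len(questions)


# Dataset A (assumed): passages that form the system's knowledge base.
knowledge_base = [
    "paracetamol is indicated for fever and mild pain in adults",
    "ibuprofen should be taken with food to reduce stomach irritation",
]

# Questions written from dataset A itself: wording overlaps with the passages.
same_source_questions = [
    {"question": "what is paracetamol indicated for",
     "options": ["fever and mild pain", "stomach irritation"], "answer": 0},
    {"question": "what should ibuprofen be taken with",
     "options": ["food", "alcohol"], "answer": 0},
]

# Questions written from a distinct but related dataset B (assumed):
# same facts, different wording, so lexical overlap is not guaranteed.
separate_source_questions = [
    {"question": "what does paracetamol treat in adults",
     "options": ["fever and mild pain", "stomach irritation"], "answer": 0},
    {"question": "which medicine can upset my belly if i skip a meal",
     "options": ["paracetamol", "ibuprofen"], "answer": 1},
]

print("same-source accuracy:    ", accuracy(same_source_questions, knowledge_base))
print("separate-source accuracy:", accuracy(separate_source_questions, knowledge_base))
```

The point of the sketch is structural: the knowledge base and the evaluation questions come from different sources, so a system cannot score well simply because the questions reuse the wording of the indexed passages.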

Published

2025-10-23

How to Cite

Navarro, L. C., Mutz, F., Paixão, T. M., Zanetti, G. G., Badue, C., De Souza, A. F., & Oliveira-Santos, T. (2025). RagPharma: A RAG-Based Chatbot for Medicine Leaflets with a Dual-Dataset Evaluation Framework. Journal of the Brazilian Computer Society, 31(1), 1137–1149. https://doi.org/10.5753/jbcs.2025.5767

Section

Articles