STELLAR: A Structured, Trustworthy, and Explainable LLM-Led Architecture for Reliable Customer Support
DOI: https://doi.org/10.5753/jbcs.2026.6044

Keywords: Large Language Models (LLMs), Intelligent Customer Support Systems, Structured LLM Architectures, Reliable and Trustworthy AI, Retrieval-Augmented Generation (RAG)

Abstract
While Large Language Models (LLMs) offer transformative potential for automating customer support, significant hurdles remain concerning their reliability, explainability, and consistent performance in complex, sensitive interactions. This paper introduces STELLAR (Structured, Trustworthy, and Explainable LLM-Led Architecture for Reliable Customer Support), a novel architectural blueprint designed to address these issues. STELLAR uses a Directed Acyclic Graph (DAG) composed of nine specialized modules and eleven predefined workflows to orchestrate support interactions in a structured and predictable manner. This design promotes enhanced traceability, reliability, and control compared to less constrained systems. The architecture integrates components for few-shot classification, Retrieval-Augmented Generation (RAG), urgency-aware human escalation, compliance verification, user interaction validation, and knowledge base refinement through a semi-automated loop. This modular design deliberately balances LLM-driven innovation with operational requirements such as human-in-the-loop integration and ethical safeguards through embedded checks. We evaluated STELLAR's core modules on three key tasks (classification, retrieval, and compliance), demonstrating strong performance and reliability. Together, these features position STELLAR as a robust and transparent foundation for the next generation of intelligent, reliable customer support systems.
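To make the architectural idea concrete, the sketch below shows a minimal DAG-style orchestration in which specialized modules process a support ticket in a fixed, traceable order and record a per-module trace for explainability. This is an illustrative sketch only, not the paper's implementation: the module names, keyword heuristics, and canned knowledge base are hypothetical stand-ins for the few-shot classifier, RAG, compliance, and escalation components described in the abstract, and the DAG here degenerates to a single chain where a real system would branch across the eleven workflows.

```python
# Hypothetical stand-ins for STELLAR-style modules; names and logic
# are illustrative assumptions, not the paper's implementation.

def classify(ticket):
    # Few-shot intent classification stand-in: a keyword heuristic.
    text = ticket["text"].lower()
    ticket["intent"] = "billing" if "charge" in text else "general"
    return ticket

def retrieve(ticket):
    # RAG stand-in: look up a canned draft answer for the intent.
    kb = {
        "billing": "Refunds are processed within 5 business days.",
        "general": "Please see our help center for common questions.",
    }
    ticket["draft"] = kb[ticket["intent"]]
    return ticket

def check_compliance(ticket):
    # Compliance-verification stand-in: flag drafts with banned phrasing.
    ticket["compliant"] = "guarantee" not in ticket["draft"].lower()
    return ticket

def escalate_if_needed(ticket):
    # Urgency-aware human-escalation stand-in.
    ticket["escalate"] = (
        "urgent" in ticket["text"].lower() or not ticket["compliant"]
    )
    return ticket

# A single linear workflow; a full DAG would route tickets through
# different module sequences depending on the classified intent.
PIPELINE = [classify, retrieve, check_compliance, escalate_if_needed]

def run(ticket):
    trace = []
    for module in PIPELINE:
        ticket = module(ticket)
        trace.append(module.__name__)  # audit trail for explainability
    ticket["trace"] = trace
    return ticket
```

The fixed module order is what yields the predictability and traceability the abstract emphasizes: every response can be attributed to the exact sequence of modules that produced it, and escalation to a human is an explicit, inspectable decision rather than an emergent model behavior.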
License
Copyright (c) 2026 Matheus Ferracciú Scatolin, Helio Pedrini

This work is licensed under a Creative Commons Attribution 4.0 International License.

