Statistical Invariance vs. AI Safety: Why Prompt Filtering Fails Against Contextual Attacks
DOI: https://doi.org/10.5753/jbcs.2026.5961
Keywords: Statistical Invariance, Contextual Moderation, Probabilistic Behavior, Responsible AI
Abstract
Large Language Models (LLMs) are increasingly deployed in high-stakes applications, yet their alignment with ethical standards remains fragile and poorly understood. To investigate the probabilistic and dynamic nature of this alignment, we conducted a black-box evaluation of nine widely used LLM platforms, anonymized to emphasize the underlying mechanisms of ethical alignment rather than model benchmarking. We introduce the Semantic Hijacking Method (SHM) as an experimental framework, formally defined and grounded in probabilistic modeling, designed to reveal how ethical alignment can erode gradually, even when all user inputs remain policy-compliant. Across three experimental rounds (324 total executions), SHM achieved a 97.8% success rate in eliciting harmful content, with alignment failure rates progressing from 93.5% (multi-turn conversations) to 100% (both refined sequences and single-turn interactions), demonstrating that vulnerabilities are inherent to semantic processing rather than conversational memory. A qualitative cross-linguistic analysis revealed cultural variations in harmful narratives, with Brazilian Portuguese responses frequently echoing historical and socio-cultural biases, making them more persuasive to local users. Overall, our findings demonstrate that ethical alignment is not a static barrier but a dynamic and fragile property that challenges binary safety metrics. Due to potential risks of misuse, all prompts and outputs are made available exclusively to authorized reviewers under ethical approval, and this publication focuses solely on reporting the research findings.
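The aggregate figures reported in the abstract can be reconciled with the per-round rates under one assumption the abstract does not state explicitly: that the 324 executions were split evenly across the three rounds (108 each). A minimal Python sketch of that consistency check, with the even split labeled as an assumption:

```python
# Consistency check for the reported SHM success rates.
# Assumption (not stated in the abstract): the 324 executions were split
# evenly across the three experimental rounds, i.e. 108 executions per round.
per_round_rates = [0.935, 1.00, 1.00]   # multi-turn, refined sequences, single-turn
executions_per_round = 324 // 3          # hypothetical even split: 108 per round

# Successes per round implied by the reported rates
successes = [round(rate * executions_per_round) for rate in per_round_rates]

# Aggregate success rate over all executions
overall_rate = sum(successes) / (executions_per_round * len(per_round_rates))

print(f"per-round successes: {successes}")          # [101, 108, 108]
print(f"overall success rate: {overall_rate:.1%}")  # ~97.8%, matching the abstract
```

Under this assumed split, the implied per-round successes sum to 317 of 324 executions, which reproduces the 97.8% aggregate figure.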
License
Copyright (c) 2026 Aline Ioste, SaraJane Peres, Marcelo Finger

This work is licensed under a Creative Commons Attribution 4.0 International License.