Statistical Invariance vs. AI Safety: Why Prompt Filtering Fails Against Contextual Attacks

Authors

A. Ioste, S. M. Peres, and M. Finger

DOI:

https://doi.org/10.5753/jbcs.2026.5961

Keywords:

Statistical Invariance, Contextual Moderation, Probabilistic Behavior, Responsible AI

Abstract

Large Language Models (LLMs) are increasingly deployed in high-stakes applications, yet their alignment with ethical standards remains fragile and poorly understood. To investigate the probabilistic and dynamic nature of this alignment, we conducted a black-box evaluation of nine widely used LLM platforms, anonymized to emphasize the underlying mechanisms of ethical alignment rather than model benchmarking. We introduce the Semantic Hijacking Method (SHM), an experimental framework, formally defined and grounded in probabilistic modeling, designed to reveal how ethical alignment can erode gradually even when all user inputs remain policy-compliant. Across three experimental rounds (324 executions in total), SHM achieved a 97.8% success rate in eliciting harmful content, with model safety-failure rates rising from 93.5% (multi-turn conversations) to 100% (both refined sequences and single-turn interactions), demonstrating that the vulnerabilities are inherent to semantic processing rather than to conversational memory. A qualitative cross-linguistic analysis revealed cultural variation in the harmful narratives: Brazilian Portuguese responses frequently echoed historical and socio-cultural biases, making them more persuasive to local users. Overall, our findings demonstrate that ethical alignment is not a static barrier but a dynamic, fragile property that challenges binary safety metrics. Because of the potential for misuse, all prompts and outputs are made available exclusively to authorized reviewers under ethical approval, and this publication focuses solely on reporting the research findings.

Published

2026-01-27

How to Cite

Ioste, A., Peres, S. M., & Finger, M. (2026). Statistical Invariance vs. AI Safety: Why Prompt Filtering Fails Against Contextual Attacks. Journal of the Brazilian Computer Society, 32(1), 43–54. https://doi.org/10.5753/jbcs.2026.5961

Issue

Vol. 32 No. 1 (2026)

Section

Regular Issue