Trait and Consistency Evaluation: Measuring Behavioral Stability and the Adversarial Compensation Effect
DOI: https://doi.org/10.5753/jbcs.2026.6343

Keywords: Latent Trait Modeling, Behavioral Stability, Compressed Reasoning, Rubric-Based Evaluation, Prompt Stratification

Abstract
The stochastic nature of Large Language Models (LLMs) challenges traditional evaluation paradigms, which rely on single-response metrics and often mask complex behavioral patterns. This paper introduces Trait and Consistency Evaluation for LLMs (TraCE-LLM), an evaluation protocol that quantifies latent behavioral traits and model consistency within a black-box paradigm. Through a factorial design combining five LLMs, three benchmarks, and systematic stratification by prompt style (Naive, Chain-of-Thought, and Adversarial), the framework employs a multidimensional rubric to measure the Depth of Reasoning (DoR) and Originality (ORI) of model responses. The primary empirical contribution of this study is the identification and formalization of the Adversarial Compensation Effect (ACE), a phenomenon wherein smaller-capacity models under adversarial stress exhibit a paradoxical gain in accuracy metrics while suffering severe degradation in behavioral stability. Our results also demonstrate asymmetric stability, with DoR being a significantly more stable trait than ORI, and the prevalence of compressed reasoning: 17.8% of correct answers lack adequate justification. By decoupling response correctness from process quality, TraCE-LLM provides a blueprint for more granular and reliable evaluation, arguing that LLM auditing must be multidimensional, context-sensitive, and psychometrically informed to ensure the development of safer and more interpretable AI.
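The core idea of the protocol — scoring each response on multiple rubric dimensions and then measuring how stable each trait remains across prompt styles — can be sketched as follows. This is an illustrative assumption, not the authors' implementation: the scores, the trait names' groupings, and the stability proxy (inverse of the standard deviation of per-style means) are all hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical rubric scores (1-5) for one model: each trait is scored on the
# same items under three prompt styles. Numbers are illustrative, not paper data.
scores = {
    "DoR": {"naive": [4, 4, 5, 4], "cot": [5, 4, 5, 5], "adversarial": [4, 3, 4, 4]},
    "ORI": {"naive": [3, 2, 4, 3], "cot": [4, 2, 5, 3], "adversarial": [1, 3, 2, 2]},
}

def stability(trait_scores: dict[str, list[int]]) -> float:
    """A simple stability proxy: 1 / (1 + population std of per-style means).
    Higher values mean the trait's mean score varies less across prompt styles."""
    style_means = [mean(v) for v in trait_scores.values()]
    return 1.0 / (1.0 + pstdev(style_means))

for trait, per_style in scores.items():
    print(trait, round(stability(per_style), 3))
```

With these toy numbers, DoR comes out more stable than ORI, mirroring the asymmetric-stability finding reported in the abstract; any real replication would instead use the paper's published rubric and pipeline.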
License
Copyright (c) 2026 Pedro Carvalho Brom, Vinícius Di Oliveira, Li Weigang

This work is licensed under a Creative Commons Attribution 4.0 International License.

