Assessing the Capability of LLMs in Solving POSCOMP Questions

Authors

Viegas, C., Gheyi, R., and Ribeiro, M.

DOI:

https://doi.org/10.5753/jbcs.2025.4493

Keywords:

LLMs, POSCOMP, SBC

Abstract

Recent advancements in Large Language Models (LLMs) have significantly expanded the capabilities of artificial intelligence in natural language processing tasks. Despite this progress, their performance in specialized domains such as computer science remains relatively unexplored. Understanding the proficiency of LLMs in these domains is critical for evaluating their practical utility and guiding future developments. The POSCOMP, a prestigious Brazilian examination promoted by the Brazilian Computer Society (SBC) and used for graduate admissions in computer science, provides a challenging benchmark. This study investigates whether LLMs can match or surpass human performance on the POSCOMP exam. Four LLMs – ChatGPT-4, Gemini 1.0 Advanced, Claude 3 Sonnet, and Le Chat Mistral Large – were evaluated on the 2022 and 2023 POSCOMP exams. The assessments measured the models' proficiency in handling complex questions typical of the exam. LLM performance was notably better on text-based questions than on image interpretation tasks. In the 2022 exam, ChatGPT-4 led with 57 correct answers out of 70 questions, followed by Gemini 1.0 Advanced (49), Le Chat Mistral (48), and Claude 3 Sonnet (44). Similar trends were observed in the 2023 exam. ChatGPT-4 achieved the highest performance, surpassing all students who took the POSCOMP 2023 exam. LLMs, particularly ChatGPT-4, show promise in text-based tasks on the POSCOMP exam, although image interpretation remains a challenge. Given the rapid evolution of LLMs, we expanded our analysis to include more recent models – o1, Gemini 2.5 Pro, Claude 3.7 Sonnet, and o3-mini-high – evaluated on the 2022–2024 POSCOMP exams. These newer models demonstrate further improvements and consistently surpass both the average and top-performing human participants across all three years.


References

Anthropic (2024). Claude. Available at: [link].

Applis, L., Panichella, A., and Marang, R. (2023). Searching for quality: Genetic algorithms and metamorphic testing for software engineering ML. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 1490-1498. ACM. DOI: 10.1145/3583131.3590379.

Bommarito, J., Bommarito II, M. J., Katz, D. M., and Katz, J. (2023). GPT as knowledge worker: A zero-shot evaluation of (AI)CPA capabilities. abs/2301.04408. DOI: 10.48550/ARXIV.2301.04408.

Chen, T. Y., Kuo, F., Liu, H., Poon, P., Towey, D., Tse, T. H., and Zhou, Z. Q. (2018). Metamorphic testing: A review of challenges and opportunities. ACM Computing Surveys, 51(1):4:1-4:27. DOI: 10.1145/3143561.

DAIR.AI (2025). Prompt engineering guide. Available at: [link].

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.

Google (2024). Gemini. Available at: [link].

Guillen-Grima, F., Guillen-Aguinaga, S., Guillen-Aguinaga, L., Alas-Brun, R., Onambele, L., Ortega, W., Montejo, R., Aguinaga-Ontoso, E., Barach, P., and Aguinaga-Ontoso, I. (2023). Evaluating the efficacy of ChatGPT in navigating the Spanish medical residency entrance examination (MIR): Promising horizons for AI in clinical medicine. Clinics and Practice, 13(6):1460-1487. DOI: 10.3390/clinpract13060130.

Bommarito II, M. J. and Katz, D. M. (2022). GPT takes the bar exam. abs/2212.14402. DOI: 10.48550/ARXIV.2212.14402.

Intrator, Y., Halfon, M., Goldenberg, R., Tsarfaty, R., Eyal, M., Rivlin, E., Matias, Y., and Aizenberg, N. (2024). Breaking the language barrier: Can direct inference outperform pre-translation in multilingual LLM applications? abs/2403.04792. DOI: 10.48550/ARXIV.2403.04792.

Joshi, I., Budhiraja, R., Dev, H., Kadia, J., Ataullah, M. O., Mitra, S., Akolekar, H. D., and Kumar, D. (2024). ChatGPT in the classroom: An analysis of its strengths and weaknesses for solving undergraduate computer science questions. In Proceedings of the 55th ACM Technical Symposium on Computer Science Education, pages 625-631. ACM. DOI: 10.48550/arXiv.2304.14993.

Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig, G. (2023). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys (CSUR), 55(9):1-35. DOI: 10.48550/arXiv.2107.13586.

López Espejel, J., Ettifouri, E. H., Yahaya Alassan, M. S., Chouham, E. M., and Dahhane, W. (2023). GPT-3.5, GPT-4, or Bard? Evaluating LLMs reasoning ability in zero-shot setting and performance boosting through prompts. Natural Language Processing Journal, 5:100032. DOI: 10.1016/j.nlp.2023.100032.

Mendonça, N. C. (2024). Evaluating ChatGPT-4 Vision on Brazil's national undergraduate computer science exam. ACM Transactions on Computing Education, 24(3):1-56. DOI: 10.48550/arXiv.2406.09671.

Mistral (2024). Le Chat Mistral. Available at: [link].

Nunes, D., Primi, R., Pires, R., de Alencar Lotufo, R., and Nogueira, R. F. (2023). Evaluating GPT-3.5 and GPT-4 models on Brazilian university admission exams. abs/2303.17003. DOI: 10.48550/ARXIV.2303.17003.

OpenAI (2024). ChatGPT. Available at: [link].

Pires, R., Almeida, T. S., Abonizio, H. Q., and Nogueira, R. F. (2023). Evaluating GPT-4's vision capabilities on Brazilian university admission exams. abs/2311.14169. DOI: 10.48550/ARXIV.2311.14169.

Sallou, J., Durieux, T., and Panichella, A. (2024). Breaking the silence: the threats of using LLMs in software engineering. In ACM/IEEE 46th International Conference on Software Engineering - New Ideas and Emerging Results. ACM/IEEE. DOI: 10.48550/arXiv.2312.08055.

Sociedade Brasileira de Computação (2025). POSCOMP. Available at: [link].

Toyama, Y., Harigai, A., Abe, M., Nagano, M., Kawabata, M., Seki, Y., and Takase, K. (2024). Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society. Japanese Journal of Radiology, 42:201-207. DOI: 10.1007/s11604-023-01491-2.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008. DOI: 10.48550/arXiv.1706.03762.

Viegas, C., Gheyi, R., and Ribeiro, M. (2025). Assessing the capability of LLMs in solving POSCOMP questions. DOI: 10.48550/arXiv.2505.20338.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., and Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems. DOI: 10.48550/arXiv.2201.11903.

Zhang, X., Li, C., Zong, Y., Ying, Z., He, L., and Qiu, X. (2023). Evaluating the performance of large language models on Gaokao benchmark. abs/2305.12474. DOI: 10.48550/ARXIV.2305.12474.

Published

2025-10-09

How to Cite

Viegas, C., Gheyi, R., & Ribeiro, M. (2025). Assessing the Capability of LLMs in Solving POSCOMP Questions. Journal of the Brazilian Computer Society, 31(1), 991–1004. https://doi.org/10.5753/jbcs.2025.4493

Issue

Vol. 31 No. 1 (2025)

Section

Articles