Evaluating ChatGPT’s ability to perform Natural Deduction proofs in Propositional and Predicate Logic

Authors

DOI:

https://doi.org/10.5753/rbie.2025.4500

Keywords:

Natural Deduction, Propositional Logic, Predicate Logic, Logical Reasoning, Large Language Models, ChatGPT

Abstract

Due to the recent popularization of Large Language Models, such as ChatGPT, many studies have explored the use of these tools to assist in the teaching and learning process. These systems can understand and process large amounts of data, allowing them, for example, to provide individualized support to students in solving exercises. However, it is important to consider that these systems mostly rely on a reasoning process based on statistical methods, such as machine learning algorithms, which can hallucinate, resulting in incorrect responses on tasks such as logical reasoning. This paper aims to evaluate ChatGPT's ability to solve Natural Deduction exercises in Propositional and Predicate Logic. To this end, we conducted experiments using a database of natural deduction exercises, and the results confirm that the statistical reasoning process is still not adequate for performing logical reasoning, making ChatGPT, for now, unsuitable for this task.


References

Ando, R., Morishita, T., Abe, H., Mineshima, K., & Okada, M. (2023, June). Evaluating Large Language Models with NeuBAROCO: Syllogistic Reasoning Ability and Human-like Biases. In S. Chatzikyriakidis & V. de Paiva (Eds.), Proceedings of the 4th Natural Logic Meets Machine Learning Workshop (pp. 1–11). Association for Computational Linguistics. [link] [GS Search]

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 33, 1877–1901. [link] [GS Search]

Carl, M. (2023). Using large language models for (de-) formalization and natural argumentation exercises for beginner’s students. arXiv preprint arXiv:2304.06186. https://doi.org/10.48550/arXiv.2304.06186 [GS Search]

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., Fiedel, N. (2023). PaLM: Scaling Language Modeling with Pathways. Journal of Machine Learning Research, 24(240), 1–113. [link] [GS Search]

Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S. S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Castro-Ros, A., Pellat, M., Robinson, K., Wei, J. (2024). Scaling Instruction-Finetuned Language Models. Journal of Machine Learning Research, 25(70), 1–53. [link] [GS Search]

Coq, P. (1996). The Coq Proof Assistant Reference Manual. INRIA Rocquencourt and ENS Lyon, version 5. [GS Search]

De Moura, L., Kong, S., Avigad, J., Van Doorn, F., & von Raumer, J. (2015). The Lean theorem prover (system description). Automated Deduction-CADE-25: 25th International Conference on Automated Deduction, Berlin, Germany, August 1-7, 2015, Proceedings 25, 378–388. https://doi.org/10.1007/978-3-319-21401-6_26 [GS Search]

Enderton, H. B. (1972). A mathematical introduction to logic. Academic Press. https://doi.org/10.2307/2272104 [GS Search]

Ferreirós, J. (2001). The road to modern logic—an interpretation. Bulletin of Symbolic Logic, 7(4), 441–484. https://doi.org/10.2307/268779 [GS Search]

First, E., Rabe, M., Ringer, T., & Brun, Y. (2023). Baldur: Whole-proof generation and repair with large language models. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 1229–1241. https://doi.org/10.1145/3611643.3616243 [GS Search]

Hammer, E. M. (1998). Semantics for existential graphs. Journal of Philosophical Logic, 27, 489–503. https://doi.org/10.1023/A:1017908108789 [GS Search]

Han, J. M., Rute, J., Wu, Y., Ayers, E. W., & Polu, S. (2022). Proof Artifact Co-training for Theorem Proving with Language Models. International Conference on Learning Representations. https://doi.org/10.48550/arXiv.2102.06203 [GS Search]

Huth, M., & Ryan, M. (2004). Logic in Computer Science: Modelling and Reasoning about Systems (2nd Ed.). Cambridge University Press. https://doi.org/10.1017/CBO9780511810275 [GS Search]

Jiang, A. Q., Li, W., Han, J. M., & Wu, Y. (2021). LISA: Language models of ISAbelle proofs. 6th Conference on Artificial Intelligence and Theorem Proving (AITP), 378–392. [link] [GS Search]

Jiang, A. Q., Li, W., Tworkowski, S., Czechowski, K., Odrzygóźdź, T., Miłoś, P., Wu, Y., & Jamnik, M. (2022). Thor: Wielding hammers to integrate language models and automated theorem provers. 36th Conference on Neural Information Processing Systems (NeurIPS), 35, 8360–8373. https://doi.org/10.48550/arXiv.2205.10893 [GS Search]

Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., et al. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274. https://doi.org/10.1016/j.lindif.2023.102274 [GS Search]

Lample, G., Lacroix, T., Lachaux, M.-A., Rodriguez, A., Hayat, A., Lavril, T., Ebner, G., & Martinet, X. (2022). Hypertree proof search for neural theorem proving. Advances in Neural Information Processing Systems (NeurIPS), 35, 26337–26349. https://doi.org/10.48550/arXiv.2205.11491 [GS Search]

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2020, July). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In D. Jurafsky, J. Chai, N. Schluter & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7871–7880). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.703 [GS Search]

Liu, H., Ning, R., Teng, Z., Liu, J., Zhou, Q., & Zhang, Y. (2023). Evaluating the logical reasoning ability of ChatGPT and GPT-4. arXiv preprint arXiv:2304.03439. https://doi.org/10.48550/arXiv.2304.03439 [GS Search]

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. https://doi.org/10.48550/arXiv.1907.11692 [GS Search]

Martins, F., Oliveira, A., Vasconcelos, D., & Menezes, M. (2023). Avaliando a habilidade do ChatGPT de realizar provas de Dedução Natural em Lógica Proposicional [Evaluating ChatGPT's ability to perform Natural Deduction proofs in Propositional Logic]. Anais do XXXIV Simpósio Brasileiro de Informática na Educação, 1282–1292. https://doi.org/10.5753/sbie.2023.234658 [GS Search]

Nipkow, T., Wenzel, M., & Paulson, L. C. (2002). Isabelle/HOL: a proof assistant for higher-order logic. Springer. https://doi.org/10.1007/3-540-45949-9_5 [GS Search]

OpenAI. (2021). ChatGPT. Accessed: June 13, 2023. Available at [link].

Pelletier, F. J. (1999). A brief history of natural deduction. History and Philosophy of Logic, 20(1), 1–31. https://doi.org/10.1080/014453499298165 [GS Search]

Pelletier, F. J. (2000). A history of natural deduction and elementary logic textbooks. Logical consequence: Rival approaches, 1, 105–138. [link] [GS Search]

Polu, S., Han, J. M., Zheng, K., Baksys, M., Babuschkin, I., & Sutskever, I. (2022). Formal mathematics statement curriculum learning. arXiv preprint arXiv:2202.01344. https://doi.org/10.48550/arXiv.2202.01344 [GS Search]

Polu, S., & Sutskever, I. (2020). Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393. https://doi.org/10.48550/arXiv.2009.03393 [GS Search]

Russell, S. J., & Norvig, P. (2010). Artificial intelligence: A modern approach (3rd ed.). Pearson Education, Inc. [GS Search]

Saparov, A., Pang, R. Y., Padmakumar, V., Joshi, N., Kazemi, M., Kim, N., & He, H. (2023). Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt & S. Levine (Eds.), Advances in Neural Information Processing Systems (NeurIPS), 36, pp. 3083–3105. Curran Associates, Inc. [link] [GS Search]

Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems (NeurIPS), 36. https://doi.org/10.48550/arXiv.2302.04761 [GS Search]

Shikishima, C., Hiraishi, K., Yamagata, S., Sugimoto, Y., Takemura, R., Ozaki, K., Okada, M., Toda, T., & Ando, J. (2009). Is g an entity? A Japanese twin study using syllogisms and intelligence tests. Intelligence, 37(3), 256–267. https://doi.org/10.1016/j.intell.2008.10.010 [GS Search]

Tlili, A., Shehata, B., Adarkwah, M. A., Bozkurt, A., Hickey, D. T., Huang, R., & Agyemang, B. (2023). What if the devil is my guardian angel: ChatGPT as a case study of using chatbots in education. Smart Learning Environments, 10(1), 15. https://doi.org/10.1186/s40561-023-00237-x [GS Search]

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). LLaMA: Open and Efficient Foundation Language Models. https://doi.org/10.48550/arXiv.2302.13971 [GS Search]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems (NIPS), 30. https://doi.org/10.48550/arXiv.1706.03762 [GS Search]

Wang, H., Yuan, Y., Liu, Z., Shen, J., Yin, Y., Xiong, J., Xie, E., Shi, H., Li, Y., Li, L., et al. (2023). DT-solver: Automated theorem proving with dynamic-tree sampling guided by proof-level value function. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 12632–12646. https://doi.org/10.18653/v1/2023.acl-long.706 [GS Search]

Weber, F., Wambsganss, T., Rüttimann, D., & Söllner, M. (2021). Pedagogical agents for interactive learning: A taxonomy of conversational agents in education. Forty-Second International Conference on Information Systems. Austin, Texas, 1–17. [link] [GS Search]

Yang, K., & Deng, J. (2019). Learning to prove theorems via interacting with proof assistants. International Conference on Machine Learning, 97, 6984–6994. [link] [GS Search]

Yang, K., Swope, A., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R. J., & Anandkumar, A. (2023). LeanDojo: Theorem proving with retrieval-augmented language models. Advances in Neural Information Processing Systems (NeurIPS), 36. https://doi.org/10.48550/arXiv.2306.15626 [GS Search]

Published

2025-05-02

How to Cite

MARTINS, F. L. B.; OLIVEIRA, A. C. A. de; VASCONCELOS, D. R. de; MENEZES, M. V. de. Evaluating ChatGPT’s ability to perform Natural Deduction proofs in Propositional and Predicate Logic. Brazilian Journal of Computers in Education, [S. l.], v. 33, p. 244–278, 2025. DOI: 10.5753/rbie.2025.4500. Available at: https://journals-sol.sbc.org.br/index.php/rbie/article/view/4500. Accessed: January 30, 2026.

Issue

Section

Awarded Papers :: CBIE