How Effectively Do LLMs Automate Data Analysis? A Comparative Study with ChatGPT's Data Analyst, Grok, and Qwen
DOI:
https://doi.org/10.5753/jidm.2026.5963
Keywords:
Large Language Models, ChatGPT, Data Analysis, Grok, Qwen, Automation, Predictive Analysis, Prescriptive Analysis
Abstract
Artificial Intelligence (AI) tools are becoming integral to analytical processes. This paper evaluates the potential of Large Language Models (LLMs) in data analysis, specifically OpenAI's ChatGPT Data Analyst, Grok 3, and Qwen2.5-Max. We conducted a structured experiment applying these tools to 108 questions spanning descriptive, diagnostic, predictive, and prescriptive analyses to assess their effectiveness. The study revealed an overall efficiency rate of 72.22% for ChatGPT's Data Analyst, outperforming Grok 3 at 45.37% and Qwen2.5-Max at 8.33%. By discussing the strengths and limitations of state-of-the-art LLM-based tools in aiding data scientists, this study aims to mark a critical milestone for future developments in the field, particularly as a reference for the open-source community.
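As a rough reading aid, the reported rates can be mapped back to approximate question counts. The sketch below assumes that each of the 108 questions was scored as a simple success or failure and that every tool was evaluated on the full set; neither assumption is stated in the abstract, and the function name `implied_count` is purely illustrative.

```python
TOTAL_QUESTIONS = 108  # stated in the abstract

# Overall efficiency rates (percent) as reported in the abstract.
reported_rates = {
    "ChatGPT Data Analyst": 72.22,
    "Grok 3": 45.37,
    "Qwen2.5-Max": 8.33,
}


def implied_count(rate_percent, total=TOTAL_QUESTIONS):
    """Nearest whole number of questions consistent with a reported rate.

    Assumption (not from the source): binary scoring per question.
    """
    return round(rate_percent / 100 * total)


implied = {name: implied_count(rate) for name, rate in reported_rates.items()}

for name, count in implied.items():
    # Recompute the percentage from the implied count as a consistency check.
    recomputed = round(100 * count / TOTAL_QUESTIONS, 2)
    print(f"{name}: about {count} of {TOTAL_QUESTIONS} questions ({recomputed}%)")
```

Under these assumptions, each recomputed percentage matches the reported figure to two decimal places, which suggests the rates are plain ratios over the 108 questions; the paper itself remains the authority on how scoring was done.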

