Automatic classification of educational videos supported by comment-based machine learning techniques: an experimental analysis using Youtube




Text Mining, Machine Learning, Classification, Comments, Videos, Youtube


Technological advances allow new content to be created and made available on the Web every minute, driving progress in several areas. However, this availability also brings drawbacks in the educational field. The excess of material makes the teaching-learning process difficult, because users spend considerable time searching for content that meets their needs. New methods for identifying educational content, for example in videos, therefore need to be developed. The comments that users leave on educational videos differ significantly from those on non-educational videos, which indicates their potential for use in selecting this type of video. In this context, the present work collects and analyzes comments from 500 YouTube videos, 250 educational and 250 non-educational, and uses Text Mining and Machine Learning techniques to develop a classification model that, based on the most frequent words in the comments on a video, categorizes it as educational or non-educational. Thus, we provide a mechanism that filters videos according to their class and returns to the user only videos with educational content. Results demonstrate that educational and non-educational videos can be classified with an accuracy of 91.30% when using the Random Forest classifier. Furthermore, given these promising results, we developed SysVidEduc, an API that uses the comments on YouTube videos to classify them automatically as educational or non-educational.
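
The feature-extraction idea described above, representing each video by the most frequent words in its comments, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy comments, and the simple ASCII tokenizer are assumptions made for illustration, and preprocessing steps such as stop-word removal are omitted. The resulting vectors are the kind of input that could then be passed to a classifier such as Random Forest.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a comment and split it into ASCII alphabetic tokens (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def top_words(videos, k):
    """Return the k most frequent words across the comments of all videos."""
    counts = Counter()
    for comments in videos:
        for comment in comments:
            counts.update(tokenize(comment))
    # Sort by descending frequency, then alphabetically, so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def to_vector(comments, vocabulary):
    """Count how often each vocabulary word occurs in one video's comments."""
    counts = Counter()
    for comment in comments:
        counts.update(tokenize(comment))
    return [counts[word] for word in vocabulary]


# Hypothetical toy data: each inner list holds the comments of one video.
videos = [
    ["Great explanation, thanks professor", "This lesson helped me pass the exam"],
    ["lol so funny", "best song ever, love this song"],
]
labels = ["educational", "non-educational"]

vocabulary = top_words(videos, k=5)
features = [to_vector(comments, vocabulary) for comments in videos]

print(vocabulary)  # ['song', 'this', 'best', 'ever', 'exam']
print(features)    # [[0, 1, 0, 0, 1], [2, 1, 1, 1, 0]]
```

Each row of `features`, paired with the corresponding entry in `labels`, forms one training example. With a realistic vocabulary size and the full set of 500 videos, such rows would be the input to the classifiers compared in the study.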







How to Cite

CARVALHO, H. C. F. B.; DORÇA, F. A.; PITANGUI, C. G.; ASSIS, L. P. de; ANDRADE, A. V.; TRINDADE, E. A. C. Automatic classification of educational videos supported by comment-based machine learning techniques: an experimental analysis using Youtube. Brazilian Journal of Computers in Education, [S. l.], v. 30, p. 419–448, 2022. DOI: 10.5753/rbie.2022.2455. Accessed on: 4 Jul. 2024.



