Authorship attribution in school works through stylometry and natural language processing
DOI:
https://doi.org/10.5753/rbie.2022.2588
Keywords:
Stylometry, Authorship Attribution, Scholar Document Classification, Stylometric Feature Extraction, Decision Trees Ensembles
Abstract
The growth of digital documents, together with their use in several knowledge areas, requires computational resources for their comprehension and analysis. The literature proposes distinguishing authors by their writing style and keywords. However, these studies mainly involve journalistic and literary contexts written in English. This research is unique because it explores authorship analysis within a dataset composed of school activities written in Portuguese by undergraduate students. Such a scenario is challenging because it offers fewer documents per author, a homogeneous group of authors, and less prior research and fewer tools for Portuguese. Because of the insufficient number of samples, we used robust journalistic datasets as a reference. The experiments verified that stylometric representations are superior to textual representations in restricted domains, which suffer from the topic’s broader corpora. Furthermore, we found that the ensemble of extremely randomized decision trees combined with the proposed stylometric features outperformed every other model tested on all the datasets, reaching an average accuracy of 0.71 and an AUC of 0.81.
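To make the idea of a stylometric representation concrete, the following is a minimal sketch of how a text could be reduced to a few style measurements that do not depend on its topic. The function name `stylometric_features` and the five features chosen here are illustrative assumptions; they are not the feature set proposed in the paper, which the abstract does not enumerate.

```python
import re
from collections import Counter


def stylometric_features(text):
    """Return a few simple style measurements for one document.

    Hypothetical illustration only: these are generic stylometric
    measures, not the features proposed by the authors.
    """
    # Words are runs of letters; the pattern is Unicode-aware, so
    # accented Portuguese letters stay inside their words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    # Sentences are split on terminal punctuation; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    counts = Counter(words)
    n_words = len(words) or 1          # guard against empty input
    n_sentences = len(sentences) or 1

    hapax = sum(1 for c in counts.values() if c == 1)  # words used once
    punctuation = sum(1 for ch in text if ch in ",;:.!?")

    return {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": len(words) / n_sentences,
        "type_token_ratio": len(counts) / n_words,
        "hapax_ratio": hapax / n_words,
        "punctuation_per_word": punctuation / n_words,
    }


if __name__ == "__main__":
    sample = "O gato viu o rato. O rato fugiu!"
    for name, value in stylometric_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Vectors like these, one per document, could then be passed to a tree-ensemble classifier such as an extremely randomized trees implementation; that step is omitted here because the abstract does not specify the configuration used.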
License
Copyright (c) 2022 Daniel Cirne Vilas-Boas Dos Santos, Cleber Zanchettin
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.