Wiki Evolution dataset applicability: English Wikipedia revision articles represented by quality attributes

Authors

DOI:

https://doi.org/10.5753/jidm.2024.3568

Keywords:

Wikipedia, Dataset, Information Quality

Abstract

This paper presents the creation of the Wiki Evolution dataset, a set of English Wikipedia article revisions in which each revision is represented by quality attributes and a quality classification. The dataset can support studies of automatic quality classification that take an article's revision history into account, as well as studies of how the content and quality of articles evolve over time on this collaborative platform. To illustrate a potential application, the paper provides a practical example that uses a Machine Learning model trained on the constructed dataset.
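As a rough illustration of the kind of record the abstract describes, the sketch below models a revision as a set of quality attributes plus a quality class, and then orders each article's revisions in time to follow how its class changes. It is not the dataset's actual schema: the `Revision` fields, the `word_count` attribute, the sample values, and the use of Wikipedia's Stub-to-FA content assessment scale as the label set are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass

# Wikipedia's content assessment scale, lowest to highest.
# Assumption: the dataset's quality classes follow this scale.
QUALITY_SCALE = ["Stub", "Start", "C", "B", "GA", "FA"]


@dataclass
class Revision:
    article: str      # hypothetical article identifier
    timestamp: str    # ISO 8601 date, so string order matches time order
    attributes: dict  # hypothetical quality attributes, e.g. {"word_count": 120}
    quality: str      # one of QUALITY_SCALE


def quality_trajectories(revisions):
    """Group revisions by article and list each article's quality classes in time order."""
    by_article = defaultdict(list)
    for rev in revisions:
        by_article[rev.article].append(rev)
    return {
        article: [r.quality for r in sorted(revs, key=lambda r: r.timestamp)]
        for article, revs in by_article.items()
    }


def net_change(trajectory):
    """Scale steps between the first and last revision (positive means improvement)."""
    return QUALITY_SCALE.index(trajectory[-1]) - QUALITY_SCALE.index(trajectory[0])


# Made-up sample rows, deliberately out of chronological order.
revisions = [
    Revision("Example", "2020-03-01", {"word_count": 900}, "C"),
    Revision("Example", "2019-01-15", {"word_count": 120}, "Stub"),
    Revision("Example", "2021-07-30", {"word_count": 2400}, "B"),
]
trajectories = quality_trajectories(revisions)
print(trajectories, net_change(trajectories["Example"]))
```

Keeping the attributes in a plain dictionary leaves the feature set open, so the same structure could feed a classifier that consumes per-revision attribute vectors.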

Published

2024-04-05

How to Cite

Sanches, A. L., de Deus Vieira Júnior, S., Hasan Dalip, D., & C. O. Lopes, B. G. (2024). Wiki Evolution dataset applicability: English Wikipedia revision articles represented by quality attributes. Journal of Information and Data Management, 15(1), 216–223. https://doi.org/10.5753/jidm.2024.3568

Section

Dataset Showcase Workshop 2022 - Extended Papers