Cross-collection Dataset of Public Domain Portuguese-language Works
DOI:
https://doi.org/10.5753/jidm.2022.2349
Keywords:
Dataset, Portuguese Literature, Public Domain Works, Feature Engineering
Abstract
Many datasets are published in English to gain more engagement, popularity and reach within a research community. Indeed, most sciences are language-agnostic and thrive on publicly available data. However, this does not always hold for the Arts, where Literature and Music are two examples of fields that rely heavily on the language of the work. In Literature especially, combining human expertise with book consumers' data may provide what is needed to keep pace with the constant changes in the book publishing market. Therefore, we introduce PPORTAL, the first public domain Portuguese-language literature dataset, which comprises a wide variety of book-related metadata. After introducing its building process and content, we present an exploratory data analysis with a quantitative description of its main features. We also show its usability as a resource in different research domains through examples of real-world applications, and point out other potential applications.