Cross-collection Dataset of Public Domain Portuguese-language Works


  • Mariana O. Silva, Universidade Federal de Minas Gerais
  • Clarisse Scofield, Universidade Federal de Minas Gerais
  • Luiza de Melo-Gomes, Universidade Federal de Minas Gerais
  • Mirella M. Moro, Universidade Federal de Minas Gerais



Dataset, Portuguese Literature, Public Domain Works, Feature Engineering


Many datasets are published in English to gain more engagement, popularity, and reach within a research community. Indeed, most sciences are language-agnostic and thrive on publicly available data. However, this does not always hold for the Arts, where fields such as Literature and Music rely heavily on the language of the work. In Literature especially, combining human expertise with book consumers’ data may provide what is needed to keep up with the constant changes in the book publishing market. Therefore, we introduce PPORTAL, the first public domain Portuguese-language literature dataset, which comprises a wide variety of book-related metadata. After introducing its building process and content, we present an exploratory data analysis with a quantitative description of its main features. We also show its usability as a resource in different research domains through examples of real-world applications, and point out other potential applications.




Dataset Showcase Workshop 2021 - Extended Papers