Journal of Information and Data Management (JIDM)
https://journals-sol.sbc.org.br/index.php/jidm/issue/feed
Contact: Carina Dorneles (carina.dorneles@ufsc.br)

JIDM is an electronic journal published three times a year. Submissions are received continuously, and the first phase of the reviewing process usually takes 6 to 7 months. JIDM is sponsored by the Brazilian Computer Society and focuses on information and data management in large repositories and document collections. It relates to different areas of Computer Science, including databases, information retrieval, digital libraries, knowledge discovery, data mining, and geographical information systems.

Analysis of Expenses from Brazilian Federal Deputies between 2015 and 2018
https://journals-sol.sbc.org.br/index.php/jidm/article/view/3383
Authors: Felippe Pires Ferreira (felippe_pires@usp.br), Ilan S. G. de Figueiredo (ilan.figueiredo@usp.br), Larissa R. Teixeira (rteixeira.larissa@usp.br), William Zaniboni Silva (williamzaniboni@usp.br), Caetano Traina Junior (caetano@icmc.usp.br), Cristina Dutra de Aguiar (cdac@icmc.usp.br), Robson L. F. Cordeiro (robsonc@andrew.cmu.edu)
Published: 2025-01-14

The analysis of public expenses is fundamental to foster the correct use of public resources, guaranteeing the application of the principles of publicity and efficiency. Within the scope of the Brazilian parliament, Parliamentary Quotas are also public resources and therefore must be subject to the same control criteria. This research carries out analyses of parliamentary expenses related to Parliamentary Quotas, presenting the distribution of expenses in the 55th Legislature (2015-2018) of Brazil and identifying anomalies in such expenses. Through a clustering-based analysis, the expenses were compared with the goal of finding similarities in the spending behavior of the federal deputies. Using data mining, the study presents the results obtained from analyzing different parliamentary expenses under the party or regional aspect of each deputy. The results allowed us to answer questions about the characteristics of expenses involving Parliamentary Quotas, anomalous expenses, and the similarity between parliamentary expenses, including the identification of expenditure patterns that reveal regional variability and flag some expenditures as possibly anomalous.

Exploratory Analysis of Microdata from the National High School Exam - Enem: Performance and Specificities of the Participants
https://journals-sol.sbc.org.br/index.php/jidm/article/view/3657
Authors: João Augusto Fernandes Barbosa (joaoaugusto_fb@usp.br), Robson Leonardo Ferreira Cordeiro (robsonc@andrew.cmu.edu)
Published: 2025-11-01

The National High School Exam (ENEM) is a significant test in Brazil that measures high school teaching quality and performance. It has also been used for evaluating undergraduate course candidates since 2004. ENEM has had a transformative impact on the education market, with schools now prioritizing exam preparation. However, there is a lack of comprehensive studies on the performance and characteristics of ENEM participants, particularly those with disabilities such as attention deficit or autism spectrum disorder. This article examines the challenges faced by these subgroups of participants over the period between 2015 and 2019, using analytical tools such as clustering, heatmaps, and hypothesis testing to understand the main data patterns. The findings aim to support the development of more tailored and flexible study programs that meet the needs of participants. Our study reveals that individuals with certain disabilities, like attention deficit and dyslexia, tend to achieve higher scores, while those with mental disabilities and deafness perform below the national average. Additionally, the results suggest that the grade disparity between students with and without disabilities may be influenced by socioeconomic factors.
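As an illustration of the hypothesis-testing step mentioned in the ENEM abstract above, the sketch below compares score distributions between two participant subgroups with a non-parametric test. The file name and column names are assumptions for illustration, not the layout of the actual microdata.

```python
# Hedged sketch: a two-sample test of the kind described above, comparing ENEM score
# distributions between participant subgroups. File and column names are assumed.
import pandas as pd
from scipy.stats import mannwhitneyu

df = pd.read_csv("enem_microdata_sample.csv")              # hypothetical microdata extract
scores_dyslexia = df.loc[df["disability"] == "dyslexia", "score"]
scores_none = df.loc[df["disability"] == "none", "score"]

# Non-parametric test: do the two score distributions differ?
stat, p_value = mannwhitneyu(scores_dyslexia, scores_none, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
```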
Grid-Ordering for Outlier Detection in Massive Data Streams
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4116
Authors: Braulio V. Sánchez Vinces (braulio.sanchez@usp.br), Robson L. F. Cordeiro (robsonc@andrew.cmu.edu)
Published: 2025-01-14

Outlier detection is critical in data mining, encompassing the revelation of hidden insights or the identification of potentially disruptive anomalies. While numerous strategies have been proposed for serial-processing outlier detection, the ever-expanding realm of big data applications demands efficient distributed computing solutions. This paper addresses the challenge of real-time outlier detection in multidimensional data streams with high-frequency arrivals by presenting GOOST, a novel algorithm that performs neighborhood analysis by leveraging grid-based data sorting. GOOST efficiently detects distance-based outliers, ensuring accurate detection in distributed environments within a competitive and much more stable processing time than previous solutions. We perform experiments on 6 real and synthetic data sets with up to 1.2M events and up to 55 dimensions. We demonstrate that GOOST outperforms 3 state-of-the-art methods in quality of results (30% more accurate) within competitive (and 45% more stable) processing times for real-time analysis of multidimensional, high-frequency data streams, thus offering a promising solution for various scientific and commercial domains.

A Robust Measure for Evaluating Representativeness of Summarized Trajectories with Multiple Aspects
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4082
Authors: Vanessa Lago Machado (vanessalagomachado@gmail.com), Tarlis Tortelli Portela (tarlis@tarlis.com.br), Lucas Vanini (lucasvanini@ifsul.edu.br), Chiara Renso (chiara.renso@isti.cnr.it), Ronaldo dos Santos Mello (r.mello@ufsc.br)
Published: 2025-01-14

As trajectory datasets grow larger, summarization techniques become increasingly important. However, current methods often lack a suitable measure of representativeness, making evaluation a complex task. This is especially true in the context of multi-aspect trajectories, where evaluating summarization techniques is particularly challenging. To address this, we have developed a novel representativeness measure called RMMAT. The method combines similarity metrics and covered information, offering adaptability to diverse data and analysis needs. With RMMAT, evaluating summarization techniques is simplified, and deeper insights can be gained from extensive trajectory data. Our evaluation on real-world trajectory datasets demonstrates that RMMAT is a robust representativeness measure for summarized trajectories with multiple aspects. The measure can help researchers and analysts evaluate summarization results and make informed decisions about the quality and relevance of representative data for their analytical goals.
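The abstract above says RMMAT combines similarity metrics with covered information, but does not give the published formula. The sketch below therefore only illustrates one plausible way to combine the two components; the function names, the weighting scheme, and the parameter alpha are assumptions.

```python
# Illustrative sketch only: one way to combine an average-similarity term with a
# coverage ("covered information") term. Not the published RMMAT definition.
from typing import Callable, Sequence

def representativeness(representative,
                       trajectories: Sequence,
                       similarity: Callable[[object, object], float],
                       covered_aspects: int,
                       total_aspects: int,
                       alpha: float = 0.5) -> float:
    """Weighted combination of average similarity and information coverage (both in [0, 1])."""
    avg_sim = sum(similarity(representative, t) for t in trajectories) / len(trajectories)
    coverage = covered_aspects / total_aspects
    return alpha * avg_sim + (1 - alpha) * coverage
```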
Towards Data Summarization of Multi-Aspect Trajectories Based on Spatio-Temporal Segmentation
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4110
Authors: Vanessa Lago Machado (vanessalagomachado@gmail.com), Tarlis Tortelli Portela (tarlis@tarlis.com.br), Geomar André Schreiner (gschreiner@uffs.edu.br), Ronaldo dos Santos Mello (r.mello@ufsc.br)
Published: 2025-01-14

This paper presents a new method for summarizing multiple aspect trajectories (MATs). This kind of data poses several challenges for analysis and for extracting meaningful insights due to its spatial, temporal, and semantic dimensions. To address them, our method leverages a combination of spatial grid-based segmentation and temporal sequence analysis. It segments the trajectory data into spatial cells using a grid-based approach, which enables a finer-grained analysis of the trajectories within each cell. Next, we consider the temporal sequence of points within each cell to capture the temporal intervals of the trajectories. By combining the spatial and temporal perspectives, the method identifies representative trajectories that capture the main behavior of semantically enriched object movements. We evaluated the utility of our method with two distinct strategies: (i) the RMMAT measure, which assesses the quality of the representative MAT in terms of similarity and coverage of information, and (ii) the Average Recall (AR) metric, which measures the ability of our representative MAT to capture essential data characteristics. Our evaluation demonstrates the effectiveness of MAT-SGT in summarizing MATs. The proposed method holds potential applications across diverse domains, including transportation planning, urban analytics, and human mobility analysis, where the concise representation of trajectories is crucial for decision-making and knowledge discovery.

Rural Properties Supported by the Carbon Storage and Sequestration Model in the area under the Biome
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4170
Authors: Fabiana da Silva Soares (fabianasoares@estudante.ufscar.br), Bruna Henrique Sacramento (brunahsacramento@gmail.com), Hilton Luis Ferraz da Silveira (hilton.ferraz@embrapa.br), Roberta Averna Valente (roavalen@ufscar.br)
Published: 2025-02-06

Carbon storage and sequestration, an important ecosystem service responsible for climate regulation, are impacted by land use/land cover (LULC) changes. Using InVEST's Carbon Storage and Sequestration model, combined with LULC data and the areas declared in the Cadastro Ambiental Rural (CAR) of Rondônia State, we created a current scenario and two future scenarios of carbon pools with secondary forest of 5 and 80 years in the Amazon biome. The declared areas are predominantly forest formation and pasture, and the pools with the highest gains in tons of carbon were aboveground and belowground biomass, with total gains of 2% and 7%, respectively, relative to the current scenario. These results emphasize the importance of command-and-control tools and forest recovery incentives.
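For readers unfamiliar with the InVEST carbon model cited above, the sketch below shows the basic bookkeeping it performs: total storage is the sum, over LULC classes, of area times the per-pool carbon densities. The class names and density values are illustrative placeholders, not the values used for Rondônia.

```python
# Minimal sketch of an InVEST-style carbon storage estimate:
# total storage = sum over LULC classes of area * (sum of per-pool carbon densities).
# Densities (t C/ha) and areas (ha) below are placeholders, not study values.
pools = {  # (aboveground, belowground, soil, dead organic matter)
    "forest_formation": (120.0, 35.0, 60.0, 8.0),
    "pasture":          (6.0,   4.0,  45.0, 1.0),
}
area_ha = {"forest_formation": 1500.0, "pasture": 900.0}

total_t_c = sum(area_ha[lulc] * sum(densities) for lulc, densities in pools.items())
print(f"Estimated carbon storage: {total_t_c:,.0f} t C")
```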
Detailed Mapping of Irrigated Rice Fields Using Remote Sensing Data and Segmentation Techniques: A Case Study in Turvo, Santa Catarina, Brazil
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4181
Authors: Andre Dalla Bernardina Garcia (andre.garcia@inpe.br), Victor Hugo Rohden Prudente (victorrp@umich.edu), Darlan Teles da Silva (darlan.silva@inpe.br), Michel Eustáquio Dantas Chaves (michel.dantas@unesp.br), Kleber Trabaquini (klebertrabaquini@epagri.sc.gov.br), Ieda Del'Arco Sanches (ieda.sanches@inpe.br)
Published: 2025-01-27

In this study, we evaluated multiple methods and data sources for mapping irrigated rice fields in Turvo, Santa Catarina, using a detailed reference map that includes irrigation channels, roads, and boundaries within and between rice fields. We tested both per-pixel and segmentation approaches. In the per-pixel classification scenarios, we applied a Random Forest (RF) to China-Brazil Earth Resources Satellite Multispectral and Panchromatic Wide-Scan Camera (CBERS-4A/WPM) data and to Sentinel-2 (S2) imagery. For the segmentation approach, we combined S2 imagery with a Segment Anything Model geospatial (Samgeo) mask applied to high-resolution CBERS-4A/WPM data (S2+WPM/Samgeo). We qualitatively and quantitatively compared maps derived from an existing source (MapBiomas) with our scenarios. MapBiomas and the per-pixel S2 classification provided adequate general plot boundary identification but lacked finer details. CBERS-4A/WPM data captured some of these details, although they showed a high rate of false positives due to confusion with other vegetation types. We also examined how detailed rice field mapping affects time-series analysis. Our findings indicate that the S2+WPM/Samgeo approach most closely matched the reference map time series and offered superior detail, better distinguishing field heterogeneities. This method could support more detailed and accurate monitoring of rice fields. Overall, S2+WPM/Samgeo delivered the most precise and detailed mapping of irrigated rice in the region.

QQ-SPM: spatial keyword search based on qualitative and quantitative spatial patterns
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4182
Authors: Carlos Vinicius Alves Minervino Pontes (carlosminervino@copin.ufcg.edu.br), Cláudio E. C. Campelo (campelo@dsc.ufcg.edu.br), Maxwell Guimarães de Oliveira (maxwell@computacao.ufcg.edu.br), Salatiel Dantas Silva (salatiel@copin.ufcg.edu.br)
Published: 2025-01-27

The search for Points of Interest (POIs) based on keywords and user preferences is a daily need for many people. One way of representing this kind of search is the Spatial Pattern Matching (SPM) query, which retrieves geo-textual objects based on spatial patterns defined by keywords and distance criteria. However, SPM cannot represent qualitative requirements, such as connectivity relations between the searched objects. In this context, this work proposes the Qualitative and Quantitative Spatial Pattern Matching (QQ-SPM) query, which allows searches with qualitative connectivity constraints in addition to keywords and distance criteria. We also propose the QQESPM algorithm, which combines an efficient use of geo-textual indexing with a top-down search strategy inspired by the Efficient Spatial Pattern Matching (ESPM) algorithm. In a performance evaluation, the QQESPM algorithm proved to be, on average, more than 1000 times faster than baseline approaches for the QQ-SPM search. Furthermore, its execution time grew more slowly as the dataset size increased, showcasing its efficiency and efficacy for handling geo-textual searches in a combined quantitative and qualitative setting.
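To make the kind of constraint a QQ-SPM pattern edge imposes more concrete, the sketch below checks a pair of POIs against keywords, a distance interval, and an optional connectivity requirement. The data structures and field names are assumptions for illustration, not the paper's formal model or the QQESPM algorithm.

```python
# Sketch of a QQ-SPM-style pattern-edge check between two POIs: keyword match on both
# endpoints, a distance interval (quantitative), and a connectivity flag (qualitative).
# The POI structure below is an assumption for illustration.
from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt

@dataclass(frozen=True)
class POI:
    name: str
    lat: float
    lon: float
    keywords: frozenset
    connected_to: frozenset  # names of POIs directly connected to this one

def haversine_km(a: POI, b: POI) -> float:
    dlat, dlon = radians(b.lat - a.lat), radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

def satisfies_edge(a: POI, b: POI, kw_a: str, kw_b: str,
                   min_km: float, max_km: float, must_connect: bool) -> bool:
    return (kw_a in a.keywords and kw_b in b.keywords
            and min_km <= haversine_km(a, b) <= max_km
            and (not must_connect or b.name in a.connected_to))
```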
Enhancing the Performance of Machine Learning Classifiers through Data Cleaning with Ensemble Confident Learning
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4230
Authors: Renato Okabayashi Miyaji (re.miyaji@usp.br), Felipe Valencia de Almeida (felipe.valencia.almeida@usp.br), Pedro Luiz Pizzigatti Corrêa (pedro.correa@usp.br)
Published: 2025-01-20

Model-centric techniques, such as hyperparameter optimization and regularization, are commonly used in the literature to enhance the performance of machine learning classifiers. However, when dealing with noisy data, data-centric approaches show promising potential. Thus, this paper proposes a new method, Ensemble Confident Learning (ECL), which enhances the Confident Learning technique by using multiple learners to improve the selection of instances with biased labels. The method was applied to a case study of Species Distribution Modeling in the Amazon, using classifiers to estimate the probability of species occurrence based on environmental conditions. Compared to Confident Learning, ECL showed an improvement of 20% in recall and 3.5% in ROC-AUC for Logistic Regression.

Forecasting Business Process Remaining Time Through Deep Learning Approaches
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4241
Authors: Ronildo Oliveira da Silva (ronildo.oliveira@alu.ufc.br), Regis Pires Magalhães (regismagalhaes@ufc.br), Lívia Almada Cruz (livia.almada@ufc.br), Criston Pereira de Souza (criston@ufc.br), Davi Romero de Vasconcelos (daviromero@ufc.br), José Antônio Fernandes de Macedo (jose.macedo@dc.ufc.br)
Published: 2025-08-22

Business process analysis is part of process mining and includes predictive monitoring, which seeks to predict properties of individual process instances, such as the next step to execute based on past events or the remaining time until completion. Such predictions can help prevent waits, discover process bottlenecks, and feed alert systems. This paper evaluates deep learning architectures for predicting the time required to complete a business process instance. We evaluated the models on three real datasets, including two widely used public ones. The experimental results show that deep learning architectures combining dense layers with a self-attention mechanism outperformed the current state of the art, achieving a lower mean absolute error in most of the datasets analyzed.
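The sketch below shows one minimal way to combine dense layers with a self-attention block for remaining-time regression, matching the general idea described above. It is not the authors' architecture; the prefix length, layer sizes, and feature encoding are assumptions.

```python
# Minimal sketch (assumed sizes, not the paper's model): a remaining-time regressor that
# combines dense layers with self-attention over the event prefix of a running case.
import tensorflow as tf
from tensorflow.keras import layers

prefix_len, n_features = 20, 16          # events per prefix, features per event (assumed)

inputs = tf.keras.Input(shape=(prefix_len, n_features))
x = layers.Dense(64, activation="relu")(inputs)
# Self-attention over the event positions of the running case
attn = layers.MultiHeadAttention(num_heads=4, key_dim=16)(x, x)
x = layers.LayerNormalization()(x + attn)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1)(x)             # predicted remaining time (e.g., in days)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mae")   # MAE matches the evaluation metric above
model.summary()
```

In practice the prefix features would come from an encoded event log (activity one-hot, elapsed time, and so on); that encoding is outside this sketch.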
A Systematic Review of FAIR-compliant Big Data Software Reference Architectures
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4263
Authors: João Pedro de Carvalho Castro (jpcarvalhocastro@ufmg.br), Maria Júlia Soares De Grandi (maju.degrandi@usp.br), Cristina Dutra de Aguiar (cdac@icmc.usp.br)
Published: 2025-03-18

To meet the standards of the Open Science movement, the FAIR Principles emphasize the importance of making scientific data Findable, Accessible, Interoperable, and Reusable. Yet, creating a repository that adheres to these principles presents significant challenges. Managing large volumes of diverse research data and metadata, often generated rapidly, requires a precise approach. This necessity has led to the development of Software Reference Architectures (SRAs) to guide the implementation of FAIR-compliant repositories. This article conducts a systematic review of research efforts focused on architectural solutions for such repositories. We detail our methodology, covering all activities undertaken in the planning and execution phases of the review. We analyze 323 references from reputable sources and expert recommendations, identifying 7 studies on general-purpose big data SRAs, 13 pipelines implementing the FAIR Principles in specific contexts, and 3 FAIR-compliant big data SRAs. We provide a thorough description of their key features and assess whether the research questions posed in the planning phase were adequately addressed. Additionally, we discuss the limitations of the retrieved studies and identify tendencies and opportunities for further research.

Applying Graph Databases and Human Mobility Data to Track Infectious Disease Spread in Brazil
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4291
Authors: Mariama C. S. de Oliveira (mariama.serafimdeoliveira@abo.fi), Lucas Henrique G. de Sales (lucas.gonzaga@ufrpe.br), Andrêza Leite de Alencar (andreza.leite@ufrpe.br), Natalia T. S. de Oliveira (ntso@cin.ufpe.br), Antônio Ricardo K. Cunha (ricardo.khouri@fiocruz.br), Pablo Ivan P. Ramos (pablo.ramos@fiocruz.br)
Published: 2025-03-18

This work has been enriched by the invaluable contributions of the following institutions. The ÆSOP (Alert-Early System of Outbreaks with Pandemic Potential) project provided the primary idea and guidance for this research, laying the foundation for our study. CIn-UFPE, SiDi, and Samsung Brazil supported the "Data Engineering and Data Science" Residency program, where this study was developed as part of our coursework and carried through to completion. The collaborative efforts and supervision provided by UFRPE played a pivotal role in the successful execution and completion of this research project.
An Unsupervised Method for Fault Detection in Transmission Lines Using Denial Constraints
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4293
Authors: Nicolas Tamalu (nt18@inf.ufpr.br), Leandro Augusto Ensina (leandro.ensina@ufpr.br), Eduardo Cunha de Almeida (eduardo@inf.ufpr.br), Eduardo Henrique Monteiro Pena (eduardopena@utfpr.edu.br), Luiz Eduardo Soares de Oliveira (luiz.oliveira@ufpr.br)
Published: 2025-02-13

This paper presents a denial constraint (DC) discovery approach for detecting faults in utility companies' electric transmission lines. Transmission lines rely on a protection system that continually streams and stores waveform data with three-phase current and voltage information. Since those data are stored in a relational database, we use the high expressive power of DCs to capture the expected behavior of a transmission line, as they are ideal for representing rules in databases. Because defining DCs in our scenario requires expensive domain expertise and, worse, is an error-prone task, we use a state-of-the-art algorithm to discover reliable DCs. Unfortunately, the amount of data in the studied scenario makes state-of-the-art DC discovery algorithms impractical due to long execution times. In response, we propose a novel DC discovery approach that uses streaming windows to address this issue. Our hypothesis is that DCs discovered in pre-fault windows differ significantly from those in post-fault windows and can therefore be used for fault detection. We use this intuition to detect faults without human intervention (i.e., an unsupervised method). An extensive experimental evaluation on a dataset with diversified fault events shows that our approach can detect faults with 100% accuracy.

PromptNER: An Automatic Prompt-Learning Data Labeling Approach for Named Entity Recognition on Sensitive Data
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4298
Authors: Claudio M. V. de Andrade (claudio.valiense@dcc.ufmg.br), Fabiano Muniz Belém (fmuniz@dcc.ufmg.br), Celso França (celsofranca@dcc.ufmg.br), Marcos Carvalho (marcoscarvalho@dcc.ufmg.br), Marcelo Ganem (marceloganem@dcc.ufmg.br), Gabriel Texeira (gabrielmedeiros@dcc.ufmg.br), Gabriel Jallais (gabrieljallais@dcc.ufmg.br), Alberto H. F. Laender (laender@dcc.ufmg.br), Marcos A. Gonçalves (mgoncalv@dcc.ufmg.br)
Published: 2025-01-20

We address the task of Named Entity Recognition (NER) for entities of the types Organization and Product/Service found in textual complaints recorded on Web platforms. Due to the high inference power of Large Language Models (LLMs), there is growing interest in applying them to distinct problems. However, they face issues of high infrastructure cost and privacy concerns when external APIs are used. Accordingly, in this article we propose PromptNER, an approach that uses LLMs to recognize entities in consumers' complaints and uses their output to locally train simpler models, such as SpERT (Span-based Entity and Relation Extraction Transformer), for entity and relation extraction, achieving scalability and privacy. Our PromptNER-enhanced model achieves significant gains: between 41% and 129% in F-score compared to the SpERT model trained with manually labeled data, and between 30% and 268% over recent zero-shot Large Language Models (Llama 3.1).
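The sketch below illustrates the prompt-labeling idea behind an approach like PromptNER: ask an LLM to tag Organization and Product/Service mentions and keep the parsed spans as weak training labels for a local extractor such as SpERT. `call_llm` is a hypothetical client passed in by the caller, not a real API, and the prompt wording is an assumption.

```python
# Hedged sketch of LLM-based weak labeling for NER. `call_llm` is a hypothetical
# function (str -> str); no real provider API is assumed here.
import json

PROMPT = (
    "List every Organization and Product/Service entity mentioned in the complaint below.\n"
    'Answer only with JSON in the form {{"entities": [{{"text": "<span>", "type": "<type>"}}]}}.\n\n'
    "Complaint: {complaint}"
)

def label_complaint(complaint: str, call_llm) -> list[dict]:
    """Return weakly-labeled entity spans produced by the LLM (empty list on parse errors)."""
    raw = call_llm(PROMPT.format(complaint=complaint))   # hypothetical LLM call
    try:
        return json.loads(raw)["entities"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return []   # discard unparsable answers instead of propagating noise
```

The parsed spans would then be converted to the local model's training format, which is outside this sketch.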
Exploiting Machine Learning Algorithms in the Classification Step of Record Linkage
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4303
Authors: Milena Macedo Santos (milenasantosmcd@gmail.com), Dimas Cassimiro Nascimento (dimas.cassimiro@ufape.edu.br)
Published: 2025-08-22

Record linkage is a well-known task that aims to determine duplicate pairs of records in datasets. In this work, we evaluated several machine learning-based classification algorithms (AdaBoost, MLP, SVM, Random Forest, and XGBoost) in the context of record linkage. We conducted experiments to evaluate the influence of balanced and unbalanced training sets on the efficacy of the record linkage classification step. We also explore the usage of scatterplots to improve the qualitative discussion of the experimental results. According to the results, the Random Forest algorithm generated the highest F-measure on the evaluated datasets. In addition, the XGBoost model also presented competitive results, especially on the bibliographic and movie datasets.

Evaluating Preprocessing and Textual Representation on Brazilian Public Bidding Document Classification
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4344
Authors: Michele A. Brandão (michele.brandao@ifmg.edu.br), Mariana O. Silva (mariana.santos@dcc.ufmg.br), Gabriel P. Oliveira (gabrielpoliveira@dcc.ufmg.br), Anisio Lacerda (anisio@dcc.ufmg.br), Gisele L. Pappa (glpappa@dcc.ufmg.br)
Published: 2025-01-20

In this paper, we tackle the task of classifying public bidding documents, which holds significant importance for both public and private entities seeking precise insights into bidding processes. Our study evaluates the impact of various preprocessing techniques and textual representation models, particularly word embeddings, on the accuracy of document classification. Overall, our results reveal that while preprocessing techniques have minimal influence on classification outcomes, the choice of textual representation model significantly affects the representativeness of document classes. Moreover, we perform a qualitative analysis of misclassification cases, providing valuable insights into potential areas for improvement in document classification methodologies. Our findings underscore the importance of selecting appropriate textual representation models to enhance the accuracy and efficiency of document classification systems.

GDRF: An Innovative Graph-Based Rank Fusion Method for Enhancing Diversity in Image Metasearch
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4299
Authors: José Solenir Lima Figuerêdo (jslfigueredo@ecomp.uefs.br), Ana Lúcia Marreiros Maia (anamarreiros@gmail.com), Rodrigo Tripodi Calumby (rtcalumby@uefs.br)
Published: 2025-01-14

Metasearch combines sets of ranked images retrieved by different search engines into a unified ranking in order to improve relevance. For this purpose, rank aggregation methods have been widely used; they can also improve the results of ambiguous or underspecified queries through a process named diversification. However, current aggregation methods assume that the input rankings are built only according to the relevance of the items, disregarding the inter-relationship between images in each ranking. Consequently, these methods tend to be inadequate for diversity-oriented retrieval, and the aggregated ranking may not improve results, mainly when diversity optimization is considered. To address this problem, we propose a diversity-aware rank fusion method, validated in the context of diverse image metasearch. Our method was compared with several order-based and score-based aggregation methods. The experimental findings indicate that the proposed method significantly improves the overall diversity of metasearch results. This demonstrates the potential of the proposed method and paves the way for further research exploring new diversity-aware heuristics.
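GDRF itself is graph-based and its construction is not detailed in the abstract above; the sketch below only illustrates the general notion of diversity-aware fusion, using a greedy, MMR-style selection over a Borda-aggregated ranking. The scoring scheme, the similarity function, and the parameters are assumptions.

```python
# Illustration only: greedy diversity-aware fusion of several rankings.
# Not the graph-based GDRF method; `similarity` must map item pairs to [0, 1].
def fuse_with_diversity(rankings, similarity, k=10, lam=0.7):
    """rankings: list of ranked item-id lists; returns a fused, diversity-aware top-k."""
    # Borda-style aggregated relevance, normalized to [0, 1]
    score = {}
    for ranking in rankings:
        for pos, item in enumerate(ranking):
            score[item] = score.get(item, 0.0) + (len(ranking) - pos)
    top = max(score.values())
    score = {item: s / top for item, s in score.items()}

    selected, candidates = [], set(score)
    while candidates and len(selected) < k:
        def gain(item):
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return lam * score[item] - (1 - lam) * redundancy
        best = max(candidates, key=gain)
        selected.append(best)
        candidates.remove(best)
    return selected
```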
APProve: Facilitating Active Monitoring, Versioning and Provenance Tracking of Data Management Plans
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4358
Authors: Annatercia Gomes Pinheiro (annatercia@ufrj.br), Maria Luiza Machado Campos (mluiza@ufrj.br), Sérgio Manuel Serra da Cruz (serra@ppgi.ufrj.br)
Published: 2025-08-23

As modern societies become increasingly data-driven, reliance on research data generated by complex experimental setups and scientific systems has grown. However, applying this scientific knowledge to diverse contexts can be challenging. This paper introduces the Active Plans Provenance (APProve) framework, which offers solutions and mechanisms to monitor variations and trace the lineage of Data Management Plans (DMPs) in a simple and dynamic manner. APProve facilitates the synchronized evolution of DMPs with research projects, ensuring accessibility, shareability, and long-term maintainability for researchers. To assist researchers in tracking DMP versions, we developed a loosely coupled architecture that integrates seamlessly with traditional DMP generation tools. To evaluate the framework's effectiveness, we conducted two distinct experiments, both involving DMPs generated by the ARGOS system. The first experiment assessed APProve functionalities using a static DMP from the VODAN BR project. The second experiment analyzed an a-DMP from the OpenSoils platform project, a Brazilian soil data governance system. APProve empowers users to monitor changes, visualize version histories, and trace retrospective provenance related to DMP modifications. Additionally, it simplifies the DMP import process and provides comprehensive project visualization, including automated comparison of DMP versions across projects. The experiments demonstrated the framework's efficacy in tracking and visualizing DMP version histories, thereby enhancing the management and evolution of these plans. In the VODAN BR case, APProve captured provenance details and enabled comparisons across revisions. For OpenSoils, the framework facilitated dynamic updates, ensuring alignment with ongoing project changes and highlighting discrepancies. Both cases confirmed APProve's ability to improve accessibility and usability for researchers.
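As a small illustration of the version-comparison bookkeeping a DMP provenance tracker like the one above needs, the sketch below diffs two plan versions and records who changed what and when. Representing a DMP as a flat dictionary is an assumption for illustration, not APProve's data model.

```python
# Sketch: diff two DMP versions (assumed flat dicts) and build a provenance record.
from datetime import datetime, timezone

def diff_dmp(old: dict, new: dict) -> list[dict]:
    """Return one entry per field whose value changed between the two versions."""
    changes = []
    for field in sorted(set(old) | set(new)):
        if old.get(field) != new.get(field):
            changes.append({"field": field, "before": old.get(field), "after": new.get(field)})
    return changes

def provenance_entry(old: dict, new: dict, author: str) -> dict:
    """A retrospective-provenance record for one revision of the plan."""
    return {"author": author,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "changes": diff_dmp(old, new)}
```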
Exploring Evolutionary Patterns: A Jupyter Notebook for Discovering Frequent Subtrees in Phylogenetic Tree Databases
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4361
Authors: João Vitor Moraes (joaovitormoraes@id.uff.br), Camila Ferrari (camilaferrari@id.uff.br), Isabel Rosseti (rosseti@ic.uff.br), Daniel de Oliveira (danielcmo@ic.uff.br)
Published: 2025-08-23

The exploratory analysis of evolutionary information within a phylogenetic tree database is a crucial task in bioinformatics. Phylogenetic trees are constructed by exploring multiple evolutionary and tree construction methods. For instance, methods like Maximum Parsimony, Maximum Likelihood, and Neighbor-Joining may yield slightly different trees due to their distinct approaches to inferring phylogenies (e.g., distance-based and character-based methods). Therefore, analyzing evolutionary data often entails identifying frequent subtrees within a given set of phylogenetic trees. However, this identification process can be compute-intensive, depending on the size of the input tree database. In this manuscript, we introduce the NMFSt.P Notebook, which aims to simplify the comparison of multiple phylogenetic trees for identifying frequent subtrees in the database. Our experiments demonstrate that NMFSt.P produces results comparable to the baseline approach while giving the scientist the advantage of flexibility.

Explainable Clustering: A solution to interpret and describe clusters
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4663
Authors: Guilherme Sérgio de Oliveira (guilherme.sergio@ufv.br), Fabrício A. Silva (fabricio.asilva@ufv.br), Ricardo V. Ferreira (ricardo.ferreira@cinnecta.com)
Published: 2025-06-20

Unsupervised learning algorithms are a set of techniques for finding hidden patterns or characteristics in data without previously defined labels. One such technique is clustering, which groups data with similar characteristics together while placing data with different characteristics in other groups. Despite its many applications, understanding the output of clustering models is a complex task that requires extensive manual analysis, since the output carries little information about each cluster's characteristics. Therefore, this article proposes MAACLI (Model and Algorithm Agnostic CLustering Interpretability), a technique for generating user-friendly descriptions that help interpret the groups produced by unsupervised clustering algorithms. The solution consists of two components that generate friendly descriptions of the groups and was tested on two types of datasets, one of which was provided by a partner company. The solution was able to generate simple, user-friendly descriptions of the groups, extracting only the important attributes.
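The two components of MAACLI are not detailed in the abstract above, so the sketch below only illustrates the general idea of describing clusters by their most distinctive attributes: fit a clustering model, then report the attributes whose standardized cluster mean deviates most from the global mean. The dataset is a scikit-learn stand-in and the parameter values are assumptions.

```python
# Sketch of attribute-based cluster descriptions (not the MAACLI algorithm itself).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

data = load_wine()
X = StandardScaler().fit_transform(data.data)            # global mean becomes 0 per attribute
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

for c in np.unique(labels):
    deviation = X[labels == c].mean(axis=0)               # how far each attribute sits from the global mean
    top = np.argsort(-np.abs(deviation))[:3]              # three most distinctive attributes
    desc = ", ".join(f"{data.feature_names[i]} {'high' if deviation[i] > 0 else 'low'}" for i in top)
    print(f"Cluster {c}: {desc}")
```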
Evaluating Window Size Effects on Univariate Time Series Forecasting with Machine Learning
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4668
Authors: João David Freitas (joaodavidfreitasc@edu.unifor.br), Caio Ponte (caioponte@unifor.br), Rafael Bomfim (bomfim@unifor.br), Carlos Caminha (caminha@ufc.br)
Published: 2025-08-23

In time series prediction modeling, the window size (w) is a critical hyperparameter that determines the number of time units included in each example provided to a learning model. This hyperparameter is crucial because it allows the learning model to recognize both long-term and short-term trends, as well as seasonal patterns, while reducing sensitivity to random noise. This study elucidates the impact of window size on the performance of machine learning algorithms in univariate time series forecasting tasks, specifically addressing the more challenging scenario of larger forecast horizons. To achieve this, we employed 40 time series from two different domains, conducting experiments with varying window sizes and four types of machine learning algorithms: bagging (Random Forest), boosting (AdaBoost), stacking, and a Recurrent Neural Network (RNN) architecture, specifically the Long Short-Term Memory (LSTM). The results reveal that increasing the window size generally improves the evaluation metrics up to a stabilization point, beyond which further increases do not significantly improve predictive accuracy. This stabilization effect was observed in both domains when w exceeded 100 time steps. Moreover, the study found that LSTM architectures do not consistently outperform ensemble models across univariate time series forecasting scenarios.

Enhancing Contributions to Brazilian Social Media Analysis Based on Topic Modeling with Native BERT Models
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4670
Authors: Giordanno Brunno Bergamini Gomes (bergaminigomes@gmail.com), Romis Attux (attux@unicamp.br), Cristiano Cordeiro Cruz (cristianoccruz@yahoo.com.br)
Published: 2025-06-23

This study introduces a computational approach utilizing natural language processing for text analysis, particularly focusing on topic modeling from large-scale textual data. Given the increasing volume of information shared on social media platforms like X (Twitter), there is a pressing need for effective methods to extract and understand the underlying topics in these texts. Continuing previous work with LDA, BTM, NMF, and BERTopic, we conducted experiments using advanced BERT embedding models tailored for Brazilian Portuguese, namely BERTimbau and BERTweet.BR, alongside the standard multilingual BERTopic model. We also performed experiments with LLM embedding models inside the BERTopic structure, namely NV-Embed-v2 and gte-Qwen2-7B-instruct. Our findings reveal that the gte-Qwen2-7B-instruct model outperforms the others regarding topic coherence, followed by NV-Embed-v2, BERTimbau Large, BERTimbau Base, BERTweet.BR, and the standard multilingual BERTopic. For the BERT models, this demonstrates the superior capability of models trained specifically on Brazilian Portuguese data in capturing the nuances of the language. For the LLM models, whose multilingual capability includes Portuguese, it shows the advantage of gte-Qwen2-7B-instruct over NV-Embed-v2. The enhanced performance of the gte-Qwen2-7B-instruct model highlights the importance of larger model sizes in achieving higher accuracy and coherence in topic modeling tasks. These results contribute valuable insights for future research in social, political, and economic issue analysis through social media data.
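The comparison above swaps different embedding models into BERTopic. The sketch below shows the general shape of that setup; the model identifier (assumed here to be the BERTimbau Base checkpoint) and the corpus loader `load_tweets_pt()` are assumptions, not the study's exact configuration.

```python
# Hedged sketch: plugging a Portuguese embedding model into BERTopic.
# `load_tweets_pt` is a hypothetical corpus loader; the model id is an assumed checkpoint name.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

docs = load_tweets_pt()   # hypothetical list of Portuguese social media posts

embedder = SentenceTransformer("neuralmind/bert-base-portuguese-cased")  # assumed BERTimbau Base id
topic_model = BERTopic(embedding_model=embedder)

topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())    # topic sizes and top words per topic
```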
Frequent Genre Mining on Hit and Viral Songs
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4676
Authors: Gabriel P. Oliveira (gabrielpoliveira@dcc.ufmg.br), Mirella M. Moro (mirella@dcc.ufmg.br)
Published: 2025-03-18

Music is a dynamic cultural industry that has produced large volumes of data since the beginning of streaming services. Understanding such data provides valuable insights into music consumption and helps identify emerging trends, fostering creativity within the music industry. Nowadays, combining different genres has become a common practice to promote new music and reach new audiences. Given the diversity of combinations between genres, predictive and descriptive analyses are very challenging. This work explores the relationship between genre combinations and music popularity by mining frequent patterns in hit and viral songs across global and regional markets. We extend previous work by incorporating viral songs into the analysis, thus strengthening the comparative analysis of the interconnected facets of musical popularity. We use the Apriori algorithm to mine genre patterns and association rules that reveal how music genres combine with each other in each market. Our findings reveal significant differences in popular genres across regions and highlight the dynamic nature of genre-blending in modern music. In addition, we are able to use such patterns to identify and recommend promising genre combinations for these markets through the association rules.
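The Apriori step described above can be reproduced at small scale with mlxtend on a one-hot genre matrix, one row per song. The toy matrix, support and confidence thresholds below are illustrative assumptions, not the study's data or settings.

```python
# Sketch of frequent genre-pattern mining with Apriori (illustrative toy data).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

songs = pd.DataFrame([   # 1 = the song is tagged with the genre
    {"pop": 1, "dance pop": 1, "latin": 0, "funk carioca": 0},
    {"pop": 1, "dance pop": 1, "latin": 1, "funk carioca": 0},
    {"pop": 0, "dance pop": 0, "latin": 1, "funk carioca": 1},
    {"pop": 1, "dance pop": 0, "latin": 1, "funk carioca": 0},
]).astype(bool)

frequent = apriori(songs, min_support=0.25, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```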
Influence of data stratification criteria on fairer classifications
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4677
Authors: Diego Minatel (dminatel@usp.br), Nícolas Roque dos Santos (nrsantos@usp.br), Angelo Cesar Mendes da Silva (angelo.mendes@usp.br), Mariana Cúri (mcuri@icmc.usp.br), Ricardo Marcondes Marcacini (ricardo.marcacini@usp.br), Alneu de Andrade Lopes (alneu@icmc.usp.br)
Published: 2025-06-20

Data stratification by class is a prominent strategy to enhance the accuracy of model evaluation in unbalanced scenarios. Combined with other stratification criteria, this type of strategy can also help address a significant issue with machine learning systems: their potential to propagate discriminatory effects that harm specific groups of people. Therefore, it is crucial to assess whether these systems' decision-making processes are fair across the diversity present in society. This assessment requires stratifying the test set not only by class but also by sociodemographic group. Furthermore, applying stratification by class and group during the validation step can contribute to developing fairer models. Despite its importance, there is a lack of studies analyzing the influence of data stratification on fairness in machine learning. We address this gap and propose an experimental setup to analyze how different data stratification criteria influence the development of impartial classifiers. Our results suggest that stratifying data by class and group helps develop fairer classifiers, thereby minimizing the spread of discriminatory effects in decision-making processes.

Enhancing COVID-19 Prognosis Prediction with Machine Learning and LIME Explanation
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4678
Authors: José Solenir Lima Figuerêdo (jslfigueredo@ecomp.uefs.br), Renata Freitas Araújo-Calumby (farm.renata@hotmail.com), Rodrigo Tripodi Calumby (rtcalumby@uefs.br)
Published: 2025-06-20

This study evaluates machine learning methods for predicting patient prognosis in the COVID-19 context. For the best-performing algorithm, we applied LIME to assess feature contributions to each decision, providing insights that help experts understand the rationale behind the model's predictions. The results indicate that the developed model accurately predicted patient prognosis, achieving a ROC-AUC of 0.8524. The results also point to a higher risk of death among patients over 60 years of age, with comorbidities, and with symptoms such as dyspnea and oxygen saturation below 95%, confirming findings observed in other regions of the world. They also indicate a higher percentage of deaths among those with little or no education. The prediction explanations allowed us to understand how each feature contributes to the model's decisions, improving transparency. For instance, in an illustrative case, LIME showed that invasive ventilatory support and an age of 61 years contributed positively to the prediction of mortality, whereas hospitalization and the patient's race (white) were not significant predictors for that particular patient.
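The sketch below shows how a per-patient LIME explanation of the kind described above is produced for a tabular classifier. The dataset is a scikit-learn stand-in (breast cancer data), not the study's COVID-19 cohort, and the model and hyperparameters are assumptions.

```python
# Minimal LIME sketch for a single prediction of a tabular classifier (stand-in data).
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

explainer = LimeTabularExplainer(X_train, feature_names=data.feature_names,
                                 class_names=data.target_names, mode="classification")
explanation = explainer.explain_instance(X_test[0], model.predict_proba, num_features=5)
print(explanation.as_list())   # top features pushing this prediction up or down
```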