Journal of Information and Data Management
https://journals-sol.sbc.org.br/index.php/jidm
<p>JIDM is an electronic journal published three times a year. Submissions are received continuously, and the first phase of the reviewing process usually takes 4 to 6 months. JIDM is sponsored by the Brazilian Computer Society and focuses on information and data management in large repositories and document collections. It relates to different areas of Computer Science, including databases, information retrieval, digital libraries, knowledge discovery, data mining, and geographical information systems.</p>
danielcmo@ic.uff.br (Daniel de Oliveira)
Tue, 14 Jan 2025 21:12:19 +0000
Analysis of Expenses from Brazilian Federal Deputies between 2015 and 2018
https://journals-sol.sbc.org.br/index.php/jidm/article/view/3383
<p>The analysis of public expenses is fundamental to foster the correct use of public resources, guaranteeing the application of the principles of publicity and efficiency. Within the scope of the Brazilian parliament, Parliamentary Quotas are also public resources and therefore need to be subject to the same control criteria. This research analyzes parliamentary expenses related to Parliamentary Quotas, presenting the distribution of expenses for the 55th Legislature (2015-2018) of Brazil and identifying anomalies in such expenses. Through a clustering-based analysis, the expenses were compared with the goal of finding similarities between the spending behavior of the federal deputies. This study, through data mining, presents the results obtained from analyzing different parliamentary expenses under the party or regional aspect of each deputy. The results allowed us to answer questions about the characteristics of the expenses involving Parliamentary Quotas, anomalous expenses, and similarity between parliamentary expenses, such as the identification of expenditure patterns that reveal regional variability and the identification of some expenditures as possibly anomalous.</p>
Felippe Pires Ferreira, Ilan S. G. de Figueiredo, Larissa R. Teixeira, William Zaniboni Silva, Caetano Traina Junior, Cristina Dutra de Aguiar, Robson L. F. Cordeiro
Copyright (c) 2025 Journal of Information and Data Management
Tue, 14 Jan 2025 00:00:00 +0000
Grid-Ordering for Outlier Detection in Massive Data Streams
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4116
<p>Outlier detection is critical in data mining, encompassing the revelation of hidden insights or identification of potentially disruptive anomalies. While numerous strategies have been proposed for serial-processing outlier detection, the ever-expanding realm of big data applications demands efficient distributed computing solutions. This paper addresses the challenge of real-time outlier detection in multidimensional data streams with high-frequency arrivals, by presenting GOOST. This novel algorithm employs neighborhood analysis by leveraging grid-based data sorting. GOOST efficiently detects distance-based outliers, ensuring accurate detection in distributed environments within a competitive and much more stable processing time than previous solutions. We perform experiments on 6 real and synthetic data sets with up to 1.2M events, and up to 55 dimensions. We demonstrate that GOOST outperforms 3 state-of-the-art methods in terms of quality of results (30% more accurate) within competitive (and 45% more stable) processing times for real-time analysis of multidimensional data streams and high event frequency, thus offering a promising solution for various scientific and commercial domains.</p>
Braulio V. Sánchez Vinces, Robson L. F. Cordeiro
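The abstract does not describe GOOST's internals, so the following is only a minimal sketch of the general idea it names: distance-based outlier detection helped by a grid. It assumes the common definition in which a point is an outlier when it has fewer than `k` neighbours within radius `r`; the function name `grid_outliers` and the sample data are hypothetical, and the sketch is serial rather than distributed or streaming.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def grid_outliers(points, r, k):
    """Return indices of points with fewer than k neighbours within distance r.

    Illustrative only: a plain grid lookup, not the GOOST algorithm.
    """
    # Bucket every point into a grid cell whose side length equals r.
    cells = defaultdict(list)
    for i, p in enumerate(points):
        cells[tuple(floor(c / r) for c in p)].append(i)

    outliers = []
    for i, p in enumerate(points):
        home = tuple(floor(c / r) for c in p)
        count = 0
        # Any neighbour within r lies in the home cell or an adjacent cell,
        # so only 3**d cells are inspected instead of the whole data set.
        for offset in product((-1, 0, 1), repeat=len(p)):
            cell = tuple(h + o for h, o in zip(home, offset))
            for j in cells.get(cell, ()):
                if j != i and dist(p, points[j]) <= r:
                    count += 1
        if count < k:
            outliers.append(i)
    return outliers


# Hypothetical data: three clustered points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (5.0, 5.0)]
result = grid_outliers(points, r=0.5, k=2)
```

Note that the number of adjacent cells grows as 3^d, so this naive lookup becomes expensive well before the 55 dimensions reported in the experiments; a practical method needs a smarter ordering of the grid.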
Tue, 14 Jan 2025 00:00:00 +0000
A Robust Measure for Evaluating Representativeness of Summarized Trajectories with Multiple Aspects
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4082
<p>As trajectory datasets grow larger, summarization techniques become increasingly important. However, current methods often lack a suitable measure of representativeness, making evaluation a complex task. This is especially true in the context of multi-aspect trajectories, where evaluating summarization techniques is particularly challenging. To address this, we have developed a novel representativeness measure called RMMAT. This method combines similarity metrics and covered information, offering adaptability to diverse data and analysis needs. With RMMAT, evaluating summarization techniques is simplified, and deeper insights can be gained from extensive trajectory data. Our evaluation on real-world trajectory datasets demonstrates that RMMAT is a robust Representativeness Measure for Summarized Trajectories with Multiple Aspects. This measure could help researchers and analysts evaluate summarization techniques and make informed decisions about the quality and relevance of representative data for their analytical goals.</p>
Vanessa Lago Machado, Tarlis Tortelli Portela, Lucas Vanini, Chiara Renso, Ronaldo dos Santos Mello
Tue, 14 Jan 2025 00:00:00 +0000
Towards Data Summarization of Multi-Aspect Trajectories Based on Spatio-Temporal Segmentation
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4110
<p>This paper presents a new method for summarizing multiple aspect trajectories (MATs). This kind of data poses several challenges for analysis and extraction of meaningful insights due to its spatial, temporal, and semantic dimensions. To address them, our method combines spatial grid-based segmentation and temporal sequence analysis. It first segments the trajectory data into spatial cells using a grid-based approach, which enables a finer-grained analysis of the trajectories within each cell. Next, it considers the temporal sequence of points within each cell to capture the temporal intervals of the trajectories. By combining spatial and temporal perspectives, the method identifies representative trajectories that capture the main behavior of semantically enriched object movements. We evaluated the utility of our method by applying two distinct strategies: (i) the RMMAT measure, assessing the quality of representative MAT in terms of similarity and coverage of information, and (ii) the Average Recall (AR) metric, measuring the ability of our representative MAT to capture essential data characteristics. Our evaluation demonstrates the effectiveness of MAT-SGT in summarizing MATs. The proposed method holds potential applications across diverse domains, including transportation planning, urban analytics, and human mobility analysis, where the concise representation of trajectories is crucial for decision-making and knowledge discovery.</p>
Vanessa Lago Machado, Tarlis Tortelli Portela, Geomar André Schreiner, Ronaldo dos Santos Mello
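The two steps described above (grid-based spatial segmentation, then temporal intervals inside each cell) can be sketched roughly as follows. This is not the MAT-SGT implementation: the function `segment`, the `cell_size` and `max_gap` parameters, and the rule of splitting an interval when the time gap exceeds `max_gap` are all assumptions made for illustration, and semantic aspects are left out.

```python
from collections import defaultdict
from math import floor


def segment(points, cell_size, max_gap):
    """Group (x, y, t) points by grid cell, then split each cell into time intervals.

    Illustrative only: the gap-based splitting rule is an assumption.
    """
    # Spatial step: assign each point's timestamp to its grid cell.
    by_cell = defaultdict(list)
    for x, y, t in points:
        by_cell[(floor(x / cell_size), floor(y / cell_size))].append(t)

    # Temporal step: within a cell, merge timestamps into [start, end] runs.
    intervals = {}
    for cell, times in by_cell.items():
        times.sort()
        runs = [[times[0], times[0]]]
        for t in times[1:]:
            if t - runs[-1][1] <= max_gap:
                runs[-1][1] = t        # extend the current interval
            else:
                runs.append([t, t])    # gap too large: start a new interval
        intervals[cell] = [tuple(run) for run in runs]
    return intervals


# Hypothetical trajectory points: (x, y, timestamp).
points = [(0.2, 0.3, 0), (0.4, 0.1, 5), (0.9, 0.8, 60), (1.5, 0.2, 10)]
intervals = segment(points, cell_size=1.0, max_gap=10)
```

In this toy input, cell (0, 0) yields two intervals because the visit at time 60 is far from the earlier ones, while cell (1, 0) holds a single point.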
Tue, 14 Jan 2025 00:00:00 +0000
Rural Properties Supported by the Carbon Storage and Sequestration Model in the area under the Biome
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4170
<p>Carbon storage and sequestration, an important ecosystem service responsible for climate regulation, is affected by land use/land cover (LULC) changes. Using InVEST's Carbon Storage and Sequestration model, combined with LULC and the areas declared in the Cadastro Ambiental Rural (CAR) in Rondônia State, we created a current scenario and two future scenarios of carbon pools, with secondary forest of 5 and 80 years, in the Amazon biome. The declared areas are predominantly forest formation and pasture, and the pools with the highest gains in tons of carbon were aboveground and belowground biomass, with total gains of 2% and 7%, respectively, relative to the current scenario. These results emphasize the importance of command-and-control tools and forest recovery incentives.</p>
Fabiana da Silva Soares, Bruna Henrique Sacramento, Hilton Luis Ferraz da Silveira, Roberta Averna Valente
Thu, 06 Feb 2025 00:00:00 +0000
Detailed Mapping of Irrigated Rice Fields Using Remote Sensing data and Segmentation Techniques: A case of study in Turvo, Santa Catarina, Brazil
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4181
<p>In this study, we evaluated multiple methods and data sources for mapping irrigated rice fields in Turvo, Santa Catarina, using a detailed reference map that includes irrigation channels, roads, and boundaries within and between rice fields. We tested per-pixel and segmentation approaches. In the per-pixel classification scenarios, we applied a Random Forest (RF) to China-Brazil Earth-Resources Satellite Multispectral and Panchromatic Wide-Scan Camera (CBERS-4A/WPM) data and to Sentinel-2 (S2) imagery. For the segmentation approach, we combined S2 imagery with a Segment Anything Model geospatial (Samgeo) mask applied to high-resolution CBERS-4A/WPM data (S2+WPM/Samgeo). We qualitatively and quantitatively compared maps derived from an existing source (MapBiomas) with our scenarios. MapBiomas and the per-pixel S2 classification identified general plot boundaries adequately but lacked finer details. CBERS-4A/WPM data captured some of these details, although they showed a high rate of false positives due to confusion with other vegetation types. We also examined how detailed rice field mapping affected time-series analysis. Our findings indicate that the S2+WPM/Samgeo approach most closely matched the reference map time-series and offered superior detail, better distinguishing field heterogeneities. This method could support more detailed and accurate monitoring of rice fields. Overall, S2+WPM/Samgeo delivered the most precise and detailed mapping of irrigated rice in the region.</p>
Andre Dalla Bernardina Garcia, Victor Hugo Rohden Prudente, Darlan Teles da Silva, Michel Eustáquio Dantas Chaves, Kleber Trabaquini, Ieda Del'Arco Sanches
Mon, 27 Jan 2025 00:00:00 +0000
QQ-SPM: spatial keyword search based on qualitative and quantitative spatial patterns
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4182
<p>The search for Points of Interest (POIs) based on keywords and user preferences is a daily need for many people. One way of representing this kind of search is the Spatial Pattern Matching (SPM) query, which allows for the retrieval of geo-textual objects based on spatial patterns defined by keywords and distance criteria. However, SPM is not able to represent qualitative requirements, such as connectivity relations between the searched objects. In this context, this work proposes the Qualitative and Quantitative Spatial Pattern Matching (QQ-SPM) query, which allows searches with qualitative connectivity constraints in addition to keywords and distance criteria. We also propose the QQESPM algorithm, which combines an efficient use of geo-textual indexing with a top-down search strategy inspired by the Efficient Spatial Pattern Matching (ESPM) algorithm. In a performance evaluation, the QQESPM algorithm proved to be more than 1000 times faster in average execution time than simple approaches for the QQ-SPM search. Furthermore, its execution time grew more slowly as the dataset size increased, showcasing its efficiency and efficacy for handling geo-textual searches in a quantitative and qualitative setting.</p>
Carlos Vinicius Alves Minervino Pontes, Cláudio E. C. Campelo, Maxwell Guimarães de Oliveira, Salatiel Dantas Silva
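To make the query type concrete, here is a minimal brute-force sketch of matching a single pattern edge that carries a keyword pair, a distance range, and an optional connectivity relation. It corresponds to the kind of simple approach the abstract compares against, not to QQESPM. The bounding-box geometry, the two-valued relation ("intersects" or "disjoint"), the function names, and the sample objects are all assumptions for illustration.

```python
from math import hypot


def box_distance(a, b):
    """Minimum distance between two boxes given as (xmin, ymin, xmax, ymax)."""
    dx = max(0.0, b[0] - a[2], a[0] - b[2])
    dy = max(0.0, b[1] - a[3], a[1] - b[3])
    return hypot(dx, dy)


def match_edge(objects, kw_a, kw_b, min_d, max_d, relation=None):
    """Return ordered pairs of object names satisfying one pattern edge.

    Illustrative brute force: every pair is tested, with no index.
    """
    matches = []
    for name_a, (kws_a, box_a) in objects.items():
        if kw_a not in kws_a:
            continue
        for name_b, (kws_b, box_b) in objects.items():
            if name_a == name_b or kw_b not in kws_b:
                continue
            d = box_distance(box_a, box_b)
            # Simplified qualitative relation: touching boxes count as intersecting.
            rel = "intersects" if d == 0 else "disjoint"
            if min_d <= d <= max_d and relation in (None, rel):
                matches.append((name_a, name_b))
    return matches


# Hypothetical geo-textual objects: name -> (keywords, bounding box).
objects = {
    "school": ({"school"}, (0, 0, 2, 2)),
    "park": ({"park"}, (2, 1, 4, 3)),
    "cafe": ({"cafe", "bakery"}, (5, 0, 6, 1)),
}
adjacent = match_edge(objects, "school", "park", 0, 1, "intersects")
nearby = match_edge(objects, "park", "cafe", 0, 2, "disjoint")
```

A full pattern would combine several such edges into a graph and join the partial matches; the quadratic pair scan here is what an index-based, top-down strategy is meant to avoid.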
Mon, 27 Jan 2025 00:00:00 +0000
Enhancing the Performance of Machine Learning Classifiers through Data Cleaning with Ensemble Confident Learning
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4230
<p>Model-centric techniques, such as hyperparameter optimization and regularization, are commonly used in the literature to enhance the performance of Machine Learning Classifiers. However, when dealing with noisy data, Data-Centric approaches show promising potential. Thus, in this paper a new method is proposed: the Ensemble Confident Learning (ECL), which enhances the Confident Learning technique with the use of multiple learners to improve the selection of instances with biased labels. This method was applied to a case study of Species Distribution Modeling in the Amazon, using Classifiers to estimate the probability of species occurrence based on environmental conditions. Compared to Confident Learning, ECL showed an improvement of 20% in Recall and 3.5% in ROC-AUC for Logistic Regression.</p>
Renato Okabayashi Miyaji, Felipe Valencia de Almeida, Pedro Luiz Pizzigatti Corrêa
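The abstract does not say how ECL combines its learners, so the sketch below only illustrates the general idea under stated assumptions: each learner flags suspicious labels with a simplified confident-learning rule (per-class thresholds equal to the mean self-confidence of that class), and an instance is selected when at least `min_votes` learners flag it. The voting rule, the function names, and the probability values are hypothetical.

```python
def flag_noisy(labels, probs):
    """Flag possibly mislabeled examples for one learner.

    probs[i][c] is the out-of-sample predicted probability of class c for
    example i. This is a simplified confident-learning rule, for illustration.
    """
    classes = sorted(set(labels))
    # Per-class threshold: mean predicted probability of class c over the
    # examples that are labelled c.
    thresholds = {}
    for c in classes:
        own = [p[c] for y, p in zip(labels, probs) if y == c]
        thresholds[c] = sum(own) / len(own)

    flags = []
    for y, p in zip(labels, probs):
        # Another class is confidently predicted while the given label is not.
        confident_other = any(c != y and p[c] >= thresholds[c] for c in classes)
        flags.append(confident_other and p[y] < thresholds[y])
    return flags


def ensemble_flags(labels, probs_per_learner, min_votes):
    """Select examples flagged by at least min_votes learners (assumed rule)."""
    votes = [flag_noisy(labels, probs) for probs in probs_per_learner]
    return [sum(col) >= min_votes for col in zip(*votes)]


# Hypothetical binary labels and predicted probabilities from three learners.
labels = [0, 0, 0, 1, 1, 1]
learner_a = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
learner_b = [[0.7, 0.3], [0.4, 0.6], [0.1, 0.9], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
learner_c = [[0.6, 0.4], [0.2, 0.8], [0.6, 0.4], [0.4, 0.6], [0.2, 0.8], [0.3, 0.7]]
selected = ensemble_flags(labels, [learner_a, learner_b, learner_c], min_votes=2)
```

Here two learners agree that the third example looks mislabeled, while the second example is flagged by only one learner and is therefore kept.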
Mon, 20 Jan 2025 00:00:00 +0000
Applying Graph Databases and Human Mobility Data to Track Infectious Disease Spread in Brazil
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4291
<p>This work has been enriched by the invaluable contributions of the following institutions. The ÆSOP (Alert-Early System of Outbreaks with Pandemic Potential) project provided us with the primary idea and guidance for this research, laying the foundation for our study. CIn-UFPE, SiDi, and Samsung Brazil supported the "Data Engineering and Data Science" Residency program, where this study was carried out as part of our coursework through to its completion. The UFRPE collaborative efforts and orientation support played a pivotal role in the successful execution and completion of this research project.</p>
Mariama C. S. de Oliveira, Lucas Henrique G. de Sales, Andrêza Leite de Alencar, Natalia T. S. de Oliveira, Antônio Ricardo K. Cunha, Pablo Ivan P. Ramos
Tue, 18 Mar 2025 00:00:00 +0000
An Unsupervised Method for Fault Detection in Transmission Lines Using Denial Constraints
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4293
<p>This paper presents a denial constraint (DC) discovery approach for detecting faults in utility companies' electric transmission lines. Transmission lines rely on a protection system that continually streams and stores waveform data with three-phase current and voltage information. Considering that those data are stored in a relational database, we use the high expressive power of DCs to capture the expected behavior of a transmission line, as they are ideal for representing rules in databases. Since defining DCs in our scenario requires expensive domain expertise and, worse, is an error-prone task, we use a state-of-the-art algorithm to discover reliable DCs. Unfortunately, the amount of data in the studied scenario makes state-of-the-art DC discovery algorithms impractical due to the long execution times. In response, we propose a novel DC discovery approach using streaming windows to address this issue. Our hypothesis is that DCs discovered in pre-fault windows significantly differ from those in post-fault windows and can be used as a fault detection approach. We use this intuition to detect faults without human intervention (<em>i.e.</em>, an unsupervised method). The extensive experimental evaluation on a dataset with diversified fault events shows that our approach can detect faults with 100% accuracy.</p>
Nicolas Tamalu, Leandro Augusto Ensina, Eduardo Cunha de Almeida, Eduardo Henrique Monteiro Pena, Luiz Eduardo Soares de Oliveira
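The hypothesis above (DCs that hold before a fault differ from those that hold after it) can be illustrated with a toy sketch. It is not the paper's discovery algorithm: it only considers single-tuple DCs of the form "no row has A op B" over pairs of columns, and it compares windows with a Jaccard distance. The column names `ia` and `ib`, the sample readings, and both function names are assumptions.

```python
import operator
from itertools import combinations

# Comparison operators considered by this toy DC search space.
OPS = {"<": operator.lt, ">": operator.gt}


def holding_dcs(window, columns):
    """Return DCs of the form not(t.A op t.B) that no row in the window violates."""
    dcs = set()
    for a, b in combinations(columns, 2):
        for name, op in OPS.items():
            # The DC holds when no row satisfies the forbidden predicate.
            if not any(op(row[a], row[b]) for row in window):
                dcs.add((a, name, b))
    return dcs


def dc_change(prev_window, next_window, columns):
    """Jaccard distance between the DC sets of two windows (0.0 means identical)."""
    before = holding_dcs(prev_window, columns)
    after = holding_dcs(next_window, columns)
    union = before | after
    return 1 - len(before & after) / len(union) if union else 0.0


# Hypothetical readings for two phase currents in consecutive windows.
columns = ["ia", "ib"]
normal = [{"ia": 10, "ib": 12}, {"ia": 11, "ib": 12}]
faulty = [{"ia": 90, "ib": 12}, {"ia": 95, "ib": 11}]
change = dc_change(normal, faulty, columns)
```

In the normal window the constraint "no row has ia > ib" holds; in the faulty window it is violated and the opposite constraint holds instead, so the distance reaches its maximum. A detector could raise an alert when this distance crosses a chosen threshold.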
Thu, 13 Feb 2025 00:00:00 +0000
PromptNER: An Automatic Prompt-Learning Data Labeling Approach for Named Entity Recognition on Sensitive Data
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4298
<p>We address the task of Named Entity Recognition (NER) for entities of the types Organization and Product/Service found in textual complaints recorded on Web platforms. Due to the high inference power of Large Language Models (LLMs), there is a growing interest in applying them to distinct problems. However, they face issues of high infrastructure cost and privacy concerns when using external APIs. Accordingly, in this article we propose PromptNER, an approach that uses LLMs for the recognition of entities in consumers' complaints and uses them to locally train simpler models, such as SpERT (Span-based Entity and Relation Extraction Transformer), to address the task of entity and relation extraction, achieving scalability and privacy. Our PromptNER-enhanced model achieves significant gains, between 41% and 129% in F-score compared to the SpERT model trained with manually labeled data and between 30% and 268% over recent (zero-shot) Large Language Models (Llama 3.1).</p>
Claudio M. V. de Andrade, Fabiano Muniz Belém, Celso França, Marcos Carvalho, Marcelo Ganem, Gabriel Texeira, Gabriel Jallais, Alberto H. F. Laender, Marcos A. Gonçalves
Mon, 20 Jan 2025 00:00:00 +0000
Evaluating Preprocessing and Textual Representation on Brazilian Public Bidding Document Classification
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4344
<p>In this paper, we tackle the task of classifying public bidding documents, which holds significant importance for both public and private entities seeking precise insights into bidding processes. Our study evaluates the impact of various preprocessing techniques and textual representation models, particularly word embeddings, on the accuracy of document classification. Overall, our results reveal that, while preprocessing techniques have minimal influence on classification outcomes, the choice of textual representation model significantly affects the representativeness of document classes. Moreover, we perform a qualitative analysis of misclassification cases, providing valuable insights into potential areas for improvement in document classification methodologies. Our findings underscore the importance of selecting appropriate textual representation models to enhance the accuracy and efficiency of document classification systems.</p>
Michele A. Brandão, Mariana O. Silva, Gabriel P. Oliveira, Anisio Lacerda, Gisele L. Pappa
Mon, 20 Jan 2025 00:00:00 +0000
GDRF: An Innovative Graph-Based Rank Fusion Method for Enhancing Diversity in Image Metasearch
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4299
<p>The metasearch technique combines sets of ranked images retrieved with different search engines to build a unified ranking in order to improve relevance. For this purpose, rank aggregation methods have been widely used, which can also improve the results for ambiguous or underspecified queries through a process called diversification. However, current aggregation methods assume that the input rankings are built only according to the relevance of the items, disregarding the inter-relationship between images in each ranking. Consequently, these methods tend to be inadequate for diversity-oriented retrieval. The aggregated ranking may not improve results, especially when diversity is the optimization target. To address this problem, we propose a diversity-aware rank fusion method, which was validated in the context of diverse image metasearch. Our method was compared with several order-based and score-based aggregation methods. The experimental findings indicate that the proposed method significantly improves the overall diversity of metasearch results. This result demonstrates the potential of the proposed method and paves the way for further research on new methods implementing new diversity-aware heuristics.</p>
José Solenir Lima Figuerêdo, Ana Lúcia Marreiros Maia, Rodrigo Tripodi Calumby
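For readers unfamiliar with rank aggregation, the sketch below shows a Borda count, a classic order-based fusion method. It is not GDRF, and the abstract does not name the specific baselines it was compared against, so this only illustrates the relevance-only style of aggregation that the abstract argues is insufficient for diversity. The function name and the image identifiers are hypothetical.

```python
from collections import defaultdict


def borda_fuse(rankings):
    """Fuse ranked lists with a Borda count: higher positions earn more points.

    Illustrative order-based baseline; it ignores relationships between items.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            # The top item of a list of length n gets n points, the last gets 1.
            scores[item] += n - position
    # Sort by descending score, breaking ties by item name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical result lists from three image search engines.
rankings = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
fused = borda_fuse(rankings)
```

Because the score depends only on positions, two near-duplicate images that both rank highly will both stay near the top, which is the behaviour a diversity-aware fusion method tries to correct.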
Tue, 14 Jan 2025 00:00:00 +0000
Frequent Genre Mining on Hit and Viral Songs
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4676
<p>Music is a dynamic cultural industry that has produced large volumes of data since the beginning of streaming services. Understanding such data provides valuable insights into music consumption, and helps identify emerging trends and foster creativity within the music industry. Nowadays, combining different genres has become a common practice to promote new music and reach new audiences. Given the diversity of combinations between all genres, predictive and descriptive analyses are very challenging. This work aims to explore the relationship between genre combinations and music popularity by mining frequent patterns in hit and viral songs across global and regional markets. We extend previous work by incorporating viral songs into the analysis, thus strengthening the comparative analysis of musical popularity's interconnected facets. We use the Apriori algorithm to mine genre patterns and association rules that reveal how music genres combine with each other in each market. Our findings reveal significant differences in popular genres across regions and highlight the dynamic nature of genre-blending in modern music. In addition, we are able to use such patterns to identify and recommend promising genre combinations for such markets through the association rules.</p>
Gabriel P. Oliveira, Mirella M. Moro
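As a rough illustration of the mining step, the following is a compact, pure-Python Apriori over sets of genres, followed by the confidence of one association rule. The song data, the `min_support` value, and the function names are hypothetical and are not taken from the paper's datasets or code.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every single item seen in the data.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent from itemset supports."""
    both = frozenset(antecedent) | frozenset(consequent)
    return frequent[both] / frequent[frozenset(antecedent)]


# Hypothetical songs, each described by its set of genres.
songs = [
    {"pop", "dance"},
    {"pop", "rap"},
    {"pop", "dance", "latin"},
    {"rap", "trap"},
    {"pop", "dance"},
]
frequent = apriori(songs, min_support=0.4)
dance_to_pop = confidence(frequent, {"dance"}, {"pop"})
```

In this toy data every dance song is also a pop song, so the rule "dance implies pop" has confidence 1, while the reverse rule is weaker because one pop song is combined with rap instead.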
Tue, 18 Mar 2025 00:00:00 +0000