Journal of Information and Data Management
https://journals-sol.sbc.org.br/index.php/jidm
<p>JIDM is an electronic journal published three times a year. Submissions are received continuously, and the first phase of the reviewing process usually takes 4 to 6 months. JIDM is sponsored by the Brazilian Computer Society and focuses on information and data management in large repositories and document collections. It relates to different areas of Computer Science, including databases, information retrieval, digital libraries, knowledge discovery, data mining, and geographical information systems.</p>
danielcmo@ic.uff.br (Daniel de Oliveira)
Tue, 14 Jan 2025 21:12:19 +0000

Analysis of Expenses from Brazilian Federal Deputies between 2015 and 2018
https://journals-sol.sbc.org.br/index.php/jidm/article/view/3383
<p>The analysis of public expenses is fundamental to fostering the correct use of public resources and guaranteeing the application of the principles of publicity and efficiency. Within the scope of the Brazilian parliament, Parliamentary Quotas are also public resources and therefore need to be subject to the same control criteria. This research analyzes parliamentary expenses related to Parliamentary Quotas, presenting the distribution of expenses for the 55th Legislature (2015-2018) of Brazil and identifying anomalies in such expenses. Through a clustering-based analysis, the expenses were compared with the goal of finding similarities in the spending behavior of the federal deputies. Using data mining, this study presents the results obtained from analyzing different parliamentary expenses by the party or region of each deputy.
The results allowed us to answer questions about the characteristics of expenses involving Parliamentary Quotas, anomalous expenses, and similarity between parliamentary expenses. In particular, we identified expenditure patterns that reveal regional variability and flagged some expenditures as possibly anomalous.</p>
Felippe Pires Ferreira, Ilan S. G. de Figueiredo, Larissa R. Teixeira, William Zaniboni Silva, Caetano Traina Junior, Cristina Dutra de Aguiar, Robson L. F. Cordeiro
Copyright (c) 2025 Journal of Information and Data Management
Tue, 14 Jan 2025 00:00:00 +0000

Grid-Ordering for Outlier Detection in Massive Data Streams
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4116
<p>Outlier detection is critical in data mining, whether to reveal hidden insights or to identify potentially disruptive anomalies. While numerous strategies have been proposed for serial-processing outlier detection, the ever-expanding realm of big data applications demands efficient distributed computing solutions. This paper addresses the challenge of real-time outlier detection in multidimensional data streams with high-frequency arrivals by presenting GOOST, a novel algorithm that performs neighborhood analysis by leveraging grid-based data sorting. GOOST efficiently detects distance-based outliers, ensuring accurate detection in distributed environments with processing times that are competitive with, and much more stable than, those of previous solutions. We perform experiments on 6 real and synthetic data sets with up to 1.2M events and up to 55 dimensions.
We demonstrate that GOOST outperforms 3 state-of-the-art methods in terms of quality of results (30% more accurate) within competitive (and 45% more stable) processing times for real-time analysis of multidimensional data streams with high event frequency, thus offering a promising solution for various scientific and commercial domains.</p>
Braulio V. Sánchez Vinces, Robson L. F. Cordeiro
Copyright (c) 2025 Journal of Information and Data Management
Tue, 14 Jan 2025 00:00:00 +0000

A Robust Measure for Evaluating Representativeness of Summarized Trajectories with Multiple Aspects
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4082
<p>As trajectory datasets grow larger, summarization techniques become increasingly important. However, current methods often lack a suitable measure of representativeness, making evaluation a complex task. This is especially true for multi-aspect trajectories, where evaluating summarization techniques is particularly challenging. To address this, we have developed a novel representativeness measure called RMMAT. The measure combines similarity metrics and covered information, offering adaptability to diverse data and analysis needs. With RMMAT, evaluating summarization techniques is simplified, and deeper insights can be gained from extensive trajectory data. Our evaluation on real-world trajectory datasets demonstrates that RMMAT is a robust Representativeness Measure for Summarized Trajectories with Multiple Aspects.
This measure could help researchers and analysts evaluate summarization techniques and make informed decisions about the quality and relevance of representative data for their analytical goals.</p>
Vanessa Lago Machado, Tarlis Tortelli Portela, Lucas Vanini, Chiara Renso, Ronaldo dos Santos Mello
Copyright (c) 2025 Journal of Information and Data Management
Tue, 14 Jan 2025 00:00:00 +0000

Towards Data Summarization of Multi-Aspect Trajectories Based on Spatio-Temporal Segmentation
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4110
<p>This paper presents a new method for summarizing multiple aspect trajectories (MATs). This kind of data poses several challenges for analysis and the extraction of meaningful insights due to its spatial, temporal, and semantic dimensions. To address them, our method combines spatial grid-based segmentation with temporal sequence analysis. It first segments the trajectory data into spatial cells using a grid-based approach, which enables a finer-grained analysis of the trajectories within each cell. Next, it considers the temporal sequence of points within each cell to capture the temporal intervals of the trajectories. By combining spatial and temporal perspectives, the method identifies representative trajectories that capture the main behavior of semantically enriched object movements. We evaluated the utility of our method by applying two distinct strategies: (i) the RMMAT measure, assessing the quality of the representative MAT in terms of similarity and coverage of information, and (ii) the Average Recall (AR) metric, measuring the ability of our representative MAT to capture essential data characteristics. Our evaluation demonstrates the effectiveness of our method, MAT-SGT, in summarizing MATs.
The proposed method has potential applications across diverse domains, including transportation planning, urban analytics, and human mobility analysis, where the concise representation of trajectories is crucial for decision-making and knowledge discovery.</p>
Vanessa Lago Machado, Tarlis Tortelli Portela, Geomar André Schreiner, Ronaldo dos Santos Mello
Copyright (c) 2025 Journal of Information and Data Management
Tue, 14 Jan 2025 00:00:00 +0000

Enhancing the Performance of Machine Learning Classifiers through Data Cleaning with Ensemble Confident Learning
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4230
<p>Model-centric techniques, such as hyperparameter optimization and regularization, are commonly used in the literature to enhance the performance of Machine Learning classifiers. However, when dealing with noisy data, data-centric approaches show promising potential. This paper therefore proposes a new method, Ensemble Confident Learning (ECL), which enhances the Confident Learning technique by using multiple learners to improve the selection of instances with biased labels. The method was applied to a case study of Species Distribution Modeling in the Amazon, using classifiers to estimate the probability of species occurrence based on environmental conditions.
Compared to Confident Learning, ECL showed an improvement of 20% in Recall and 3.5% in ROC-AUC for Logistic Regression.</p>
Renato Okabayashi Miyaji, Felipe Valencia de Almeida, Pedro Luiz Pizzigatti Corrêa
Copyright (c) 2025 Journal of Information and Data Management
Mon, 20 Jan 2025 00:00:00 +0000

PromptNER: An Automatic Prompt-Learning Data Labeling Approach for Named Entity Recognition on Sensitive Data
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4298
<p>We address the task of Named Entity Recognition (NER) for entities of the types Organization and Product/Service found in textual complaints recorded on Web platforms. Due to the high inference power of Large Language Models (LLMs), there is a growing interest in applying them to distinct problems. However, they face issues of high infrastructure cost and privacy concerns when using external APIs. Accordingly, in this article we propose PromptNER, an approach that uses LLMs to recognize entities in consumers’ complaints and uses the resulting labels to locally train simpler models, such as SpERT (Span-based Entity and Relation Extraction Transformer), to address the task of entity and relation extraction, achieving scalability and privacy. Our PromptNER-enhanced model achieves significant gains in F-score: between 41% and 129% compared to the SpERT model trained with manually labeled data, and between 30% and 268% over a recent zero-shot Large Language Model (Llama 3.1).</p>
Claudio M. V. de Andrade, Fabiano Muniz Belém, Celso França, Marcos Carvalho, Marcelo Ganem, Gabriel Texeira, Gabriel Jallais, Alberto H. F. Laender, Marcos A.
Gonçalves
Copyright (c) 2025 Journal of Information and Data Management
Mon, 20 Jan 2025 00:00:00 +0000

Evaluating Preprocessing and Textual Representation on Brazilian Public Bidding Document Classification
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4344
<p>In this paper, we tackle the task of classifying public bidding documents, which holds significant importance for both public and private entities seeking precise insights into bidding processes. Our study evaluates the impact of various preprocessing techniques and textual representation models, particularly word embeddings, on the accuracy of document classification. Overall, our results reveal that, while preprocessing techniques have minimal influence on classification outcomes, the choice of textual representation model significantly affects the representativeness of document classes. Moreover, we perform a qualitative analysis of misclassification cases, providing valuable insights into potential areas for improvement in document classification methodologies. Our findings underscore the importance of selecting appropriate textual representation models to enhance the accuracy and efficiency of document classification systems.</p>
Michele A. Brandão, Mariana O. Silva, Gabriel P. Oliveira, Anisio Lacerda, Gisele L. Pappa
Copyright (c) 2025 Journal of Information and Data Management
Mon, 20 Jan 2025 00:00:00 +0000

GDRF: An Innovative Graph-Based Rank Fusion Method for Enhancing Diversity in Image Metasearch
https://journals-sol.sbc.org.br/index.php/jidm/article/view/4299
<p>Metasearch combines sets of ranked images retrieved with different search engines to build a unified ranking in order to improve relevance.
For this purpose, rank aggregation methods have been widely used; they can also improve the results for ambiguous or underspecified queries through a process named diversification. However, current aggregation methods assume that the input rankings are built only according to the relevance of the items, disregarding the inter-relationship between images in each ranking. Consequently, these methods tend to be inadequate for diversity-oriented retrieval: the aggregated ranking may not improve results, especially when diversity is the optimization target. To address this problem, we propose a diversity-aware rank fusion method, which was validated in the context of diverse image metasearch. Our method was compared with several order-based and score-based aggregation methods. The experimental findings indicate that the proposed method significantly improves the overall diversity of metasearch results. This result demonstrates the potential of the proposed method and paves the way for further research on new methods that implement new diversity-aware heuristics.</p>
José Solenir Lima Figuerêdo, Ana Lúcia Marreiros Maia, Rodrigo Tripodi Calumby
Copyright (c) 2025 Journal of Information and Data Management
Tue, 14 Jan 2025 00:00:00 +0000