https://journals-sol.sbc.org.br/index.php/jidm/issue/feed Journal of Information and Data Management 2024-06-17T12:53:07+00:00 Daniel de Oliveira danielcmo@ic.uff.br Open Journal Systems <p>JIDM is an electronic journal that is published three times a year. Submissions are continuously received, and the first phase of the reviewing process usually takes 4 to 6 months. JIDM is sponsored by the Brazilian Computer Society, focusing on information and data management in large repositories and document collections. It relates to different areas of Computer Science, including databases, information retrieval, digital libraries, knowledge discovery, data mining, and geographical information systems. </p> https://journals-sol.sbc.org.br/index.php/jidm/article/view/2607 Biophysical Chemistry of Macromolecules Research Group at the State University of Maringá 2022-07-21T12:10:22+00:00 Diego de Souza Lima diegodslima92@gmail.com Gisele Strieder Philippsen gistrieder@ufpr.br Elisangela Andrade Ângelo elisangela.angelo@ifpr.edu.br Maria Aparecida Fernandez mafernandez@uem.br Flavio Augusto Vicente Seixas favseixas@uem.br <p>The interdisciplinary field of Biophysical Chemistry, which applies concepts from Physical Chemistry to describe biological phenomena, is essential for modern molecular biology advancements. This approach enables the description of biological systems in terms of their constituent parts, such as atoms and molecules, facilitating a structural understanding of their characteristics. Nonetheless, to describe such large systems, computational methods are needed. The Biophysical Chemistry of Macromolecules research group at the State University of Maringá is dedicated to investigating such systems, mainly protein-ligand complexes, through bioinformatics approaches combined with experimental techniques to validate in silico results. 
The main purpose of the research projects is to develop applications for drug discovery in the context of antimicrobial, antiviral, antifungal, and antihyperglycemic agents, with the aim of advancing the field of bioinformatics in Brazil.</p> 2024-02-21T00:00:00+00:00 Copyright (c) 2023 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/2609 G3B3 to GP15: From early years to Health Informatics Research Group at Paulista University - UNIP 2022-07-21T12:24:47+00:00 Renato Massaharu Hassunuma renato.hassunuma@docente.unip.br Patricia Carvalho Garcia patricia.garcia@docente.unip.br Michele Janegitz Acorci-Valério michele.valerio@docente.unip.br Marjorie de Assis Golim marjorie.golim@unesp.br Sandra Heloisa Nunes Messias sandra.messias@docente.unip.br <p>This article summarizes the history and intellectual production from G3B3 (Grupo de Estudos em Bioinformática Estrutural) to GP15 (Grupo de Pesquisa em Informática em Saúde), conducted at Paulista University - UNIP, Bauru campus. In the early years, several activities were developed by G3B3 until the team was converted into GP15. This group has the computational simulation of biomolecules as one of its research lines. Together with the second line of research, entitled "Development of teaching material and research using computational resources", the production of this group is mainly related to the development of scripts for the visualization of biomolecules and the production of digital books. As of early 2022, the team consists of five university professors and four students from the Biomedicine Course at Universidade Paulista - UNIP, Bauru campus. Together, the team has published more than 50 books with different publishers, some of which are aimed at the development of scripts for computer simulation software. 
All these books are adopted by course teachers in different subjects.</p> 2024-02-16T00:00:00+00:00 Copyright (c) 2023 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/2621 The Barroso Research lab: biomolecular interactions, computing, and data-driven science to understand and engineer biological and pharmaceutical systems in a global academic partnership 2022-07-21T12:23:53+00:00 Fernando L. Barroso da Silva flbarroso@usp.br Catherine Etchebest catherine.etchebest@inserm.fr Erik E. Santiso eesantis@ncsu.edu Carolina Corrêa Giron carolinacorreagiron@gmail.com Ilyas Grandguillaume ilyas.grandguillaume@inserm.fr Rauni Borges Marques raunimarques@usp.br Sergio A. Poveda-Cuevas alejandropc@alumni.usp.br <p>Biomolecular interactions, high-throughput computing, and data-driven science have been the central research foundations of the Barroso Research laboratory. We have been developing and applying innovative computational technology, offering a rational computational-based approach to the investigation of protein systems, and discovering key disease-related protein mechanisms, therapeutic agents, biomarkers, and proteins for specific applications and their controlled release. Born in 2001 at the School of Pharmaceutical Sciences at Ribeirão Preto with transdisciplinarity and internationalization in its genes, the laboratory has always been well integrated with research groups in Europe, the US, and Latin America. Students from different fields and places have been forged in this environment at the crossroads of Structural Bioinformatics, Molecular Biophysics, Biological Physics, Physical Chemistry, Engineering, Medicine, Food, and Pharma. 
The more than 50 scientific papers published in high-impact journals, book chapters, and conference talks reflect our contributions to expanding knowledge and advancing Bioinformatics as an important tool to understand nature and guide innovations.</p> 2024-04-16T00:00:00+00:00 Copyright (c) 2023 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/2622 ACDBio: The Biological Data Computational Analysis group at ICMC/USP, IFSP, and Barretos Cancer Hospital 2022-07-21T12:16:00+00:00 Adenilso Simao adenilso@icmc.usp.br Adriane Feijó Evangelista afevangelista@alumni.usp.br Alfredo Guilherme Souza alfredo@usp.br Cynthia de Oliveira Lage Ferreira cynthia@icmc.usp.br Jorge Francisco Cutigi cutigi@ifsp.edu.br Paulo Henrique Ribeiro phribeiro@ifsp.edu.br Rodrigo Henrique Ramos ramos@ifsp.edu.br <p>Recent advances in biological and health technology have resulted in vast amounts of digital data. However, the major challenge is interpreting such data to find valuable knowledge. For this, computing is essential, since quick data processing and analysis, allied with knowledge extraction techniques, make it possible to work effectively with large biological datasets. In this context, the ACDBio group works with the computational analysis of biological data from different sources, aiming to find new information and knowledge in data or to answer questions that remain open. So far, the group has worked on several challenging topics, such as identifying significant genes for cancer and topological analysis of genes in interaction networks, among others. The group uses computational techniques such as complex networks and their algorithms, machine learning, and topological data analysis. This article aims to present the ACDBio group and the main research topics worked on by its members. 
We also present the main results and future work expected by the group.</p> 2024-02-17T00:00:00+00:00 Copyright (c) 2023 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/2624 Water quality in marine and freshwater environments: a metagenomics approach 2022-07-21T12:27:47+00:00 Carolina O. P. Gil carol.pgil@peb.ufrj.br Ana Carolina M. Pires anacarolinamp@peb.ufrj.br Fernanda O. F. Schmidt fernandaferreira@peb.ufrj.br Nathália S. C. Santos nathaliasantos@peb.ufrj.br Priscila C. Alves priscila.pharm@peb.ufrj.br Odara A. Oliveira odara.araujo96@gmail.com Dhara Avelino-Alves dharaavelino@gmail.com Flávio F. Nobre flavio@peb.ufrj.br Gizele Duarte Garcia gidugar@gmail.com Cristiane C. Thompson thompsoncristiane@gmail.com Graciela M. Dias gracielamd@biof.ufrj.br Fabiano F. Thompson fabianothompson1@gmail.com Diogo A. Tschoeke diogoat@peb.ufrj.br <p>In this article, we review the work carried out by the UFRJ microbiology laboratory related to water quality and microorganisms associated with different aquatic ecosystems. We place water at the center of the One Health concept, because water connects different living beings and the environment. We selected papers published between 2012 and 2022 by the UFRJ microbiology laboratory related to bioinformatic genomic and metagenomic analyses. We describe the main impacts on aquatic environments, the microorganisms involved in biogeochemical cycles, microorganisms as bioindicators, and their resistance genes. 
Finally, we identified the microorganisms that were most abundant in all the articles studied and pointed out some public policies that we consider important to maintain water quality and reduce anthropic impacts.</p> 2024-02-19T00:00:00+00:00 Copyright (c) 2023 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/2625 Bioinformatics of infectious and chronic diseases at the Center for Technological Development in Health of Fiocruz 2022-07-21T12:14:00+00:00 Nicolas Carels nicolas.carels@fiocruz.br Gilberto Ferreira da Silva gilberto.silva@fiocruz.br Carlyle Ribeiro Lima carlyle.lima@fiocruz.br Franklin Souza da Silva franklin.souza@fiocruz.br Milena Magalhães milena.magalhaes@fiocruz.br Ana Emília Goulart Lemos ana.goulart@fiocruz.br <p>One purpose of bioinformatics is data mining and integration to solve fundamental scientific challenges. We have been investigating biological systems including viruses, bacteria, fungi, protozoans, plants, insects, and animals with this purpose in mind. Gradually, we moved from basic questions on genome organization to applications in infectious and chronic diseases by integrating interactome and RNA-seq data with modeling techniques such as Flux Balance Analysis, structural modeling, Boolean modeling, system dynamics, and computational biology from a systems biology perspective. At the moment, we focus on the rational therapy of cancer assisted by RNA sequencing, network modeling, and structural modeling.</p> 2024-02-17T00:00:00+00:00 Copyright (c) 2023 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/2626 Let them eat cake: when the small aims at being LARGE or the empowering effects of bioinformatics in NGS wonderland 2022-07-21T12:09:12+00:00 Gabriel M. Yazbeck dna@ufsj.edu.br Raíssa C. D. Graciano raissadgraciano@gmail.com Rosiane P. 
Santos rosianeps2007@yahoo.com.br Rafael Sachetto Oliveira sachetto@ufsj.edu.br <p>This report summarizes the path (and pitfalls) followed by the Genetic Resources Laboratory (LARGE-UFSJ), with the aid of bioinformatics, in massive DNA data analysis and its application to biodiversity conservation, particularly of Neotropical migratory fish. We use the metaphor of DNA sequencing as the cake, both as a prized delicacy formerly inaccessible to the masses, as in the infamous "let them eat cake", scornfully exclaimed by Marie-Antoinette during a bread shortage in the French Revolution, but also as a means to achieve rapid growth for small research groups, as the plot device in Lewis Carroll's Alice in Wonderland. Next-Generation Sequencing (NGS) methods have been known to promote a true revolution in the Life Sciences, empowering groups with limited resources to explore the relatively new, still unknown and often surprising world of genetic sequences. Indeed, we argue for the inertia-breaking potential of NGS and give our group's trajectory as a testimony. It all began with the fortuitous union of providential fish DNA big data gathered by Genetics professor Dr. Yazbeck and Computer Science professor Dr. Sachetto's curiosity about biological research, along with the wit of some young researchers. Our initial NGS challenge was to provide the assembly and annotation of the first mitochondrial genome for the Anostomidae fish family. The LARGE's NGS research program was able to promote the characterization of what was then arguably the highest number of microsatellite DNA markers for the flagship species, <em>Salminus brasiliensis</em> (dourado) and <em>Brycon orbignyanus</em> (piracanjuba), useful in environmental applications for conservation (green biotechnology). 
We have also made these large raw datasets, as well as extensive processed results such as genomic assemblies and gene annotations for these fish, freely available to the scientific community in data repositories such as GenBank, SRA and FigShare. Technological spin-offs with application in the environmental protection and food production fields have also been devised as a direct consequence of the availability of such rich and diverse data.</p> 2024-02-17T00:00:00+00:00 Copyright (c) 2023 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/2661 Bioinformatics and Computational Biology Research at the Computer Science Department at UFMG 2022-07-21T12:06:55+00:00 Diego Mariano diego@dcc.ufmg.br Frederico Chaves Carvalho fredericochaves@dcc.ufmg.br Luana Luiza Bastos luizabastos.luana9@gmail.com Lucas Moraes dos Santos lucas.santos@dcc.ufmg.br Vivian Morais Paixão vivianmp95@ufmg.br Raquel C. de Melo-Minardi raquelcm@dcc.ufmg.br <p>Bioinformatics is an emerging research field that encompasses the use of computational methods, algorithms, and tools to solve life science problems. At the Laboratory of Bioinformatics and Systems (LBS), our research lines include the use of graph-based algorithms to improve the prediction of the structure and function of macromolecules, the detection of molecular recognition patterns, the application of mathematical models and artificial intelligence techniques to assist enzyme engineering, and the development of models, algorithms, and tools. Additionally, the group has played a role in scientific outreach and spreading bioinformatics in Brazil. 
In this article, we summarize the 20 years of Bioinformatics and Computational Biology research conducted by our group at LBS in the Department of Computer Science at the Universidade Federal de Minas Gerais (DCC-UFMG).</p> 2024-02-16T00:00:00+00:00 Copyright (c) 2023 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/2673 Computational Biology Laboratory - Combi-Lab 2022-07-21T12:06:07+00:00 Karina dos Santos Machado karinaecomp@gmail.com Adriano Velasque Werhli werhli@gmail.com <p>This article presents the Computational Biology - Combi-Lab research group at the Universidade Federal do Rio Grande (FURG), which started its activities in 2011. The main objective of the group is to bring together researchers and students who are interested in all aspects of Computational Biology. Specifically, the group aims to develop, improve and use sophisticated statistical, computational, and mathematical methods to contribute to the advancement of this research area. This article provides an overview of the Combi-Lab timeline from its founding to the present day, highlighting various articles and discussing the future of the group. More importantly, joint projects and collaborators are presented, and their contribution to the development of Bioinformatics is explained. 
In conclusion, as we look to the past and face the challenges of the future, we hold fast to our goal of becoming a solid and leading reference in Computational Biology at our university and community, and giving back to society as much as we can.</p> 2024-02-16T00:00:00+00:00 Copyright (c) 2023 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/2684 NBioinfo: Establishing a Bioinformatics Core in a University-based General Hospital in South Brazil 2022-07-21T12:15:22+00:00 Mariana Recamonde-Mendoza mmendoza@hcpa.edu.br Gerda Cristal Villalba Silva cristal.villalba@hotmail.com Thayne Woycinck Kowalski tkowalski@hcpa.edu.br Otávio von Ameln Lovison olovison@hcpa.edu.br Rafaela Ramalho Guerra rrguerra@hcpa.edu.br Andreza Francisco Martins andfmartins@hcpa.edu.br Ursula Matte umatte@hcpa.edu.br <p>Bioinformatics is an indispensable discipline for current research in life and medical sciences. The increasing volume and complexity of biological data and the growing tendency for open data and data reuse projects have made computer-based analytical tools central to these research fields. However, it is an intrinsically interdisciplinary field with a multitude of skill sets required for using bioinformatics tools or undertaking research toward developing new methods. There is still a lack of skilled human resources to meet the numerous and growing application possibilities, which represents a bottleneck in many research projects. This paper reports our efforts to create the Núcleo de Bioinformática (NBioinfo, or Bioinformatics Core) at the Hospital de Clínicas de Porto Alegre (HCPA), a major public university hospital in Brazil. NBioinfo aims to serve as a hub for research and interaction in Bioinformatics and Computational Biology at HCPA, institutionally developing these areas of knowledge and promoting scientific advances triggered by bioinformatics. 
We briefly present our research group's history and goals, and describe our activities toward providing HCPA with competencies in these fields. We also describe the scientific and methodological challenges recently faced by our group and the advances promoted by scientific collaborations and research projects developed at NBioinfo.</p> 2024-02-16T00:00:00+00:00 Copyright (c) 2023 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/3053 The Use of Data Mining Techniques in the Diagnosis and Prevention of Cerebrovascular Accident (CVA) 2024-04-22T20:13:06+00:00 Maria Adriana Ferreira da Silva maria.silva78326@alunos.ufersa.edu.br Angélica Félix de Castro angelica@ufersa.edu.br Isaac de Lima Oliveira Filho isaacoliveira@uern.br Marcelino Pereira dos Santos Silva prof.marcelino@gmail.com <p>Over the years, there has been a rise in the occurrence of Cerebrovascular Accident (CVA) cases, due to the increase in the elderly population. Current data indicate that stroke is one of the leading causes of death and disability worldwide, affecting millions of people and leaving survivors with numerous sequelae, whether they are physical or mental. Many factors such as diabetes, smoking, high blood pressure, and others, favor the onset of stroke, which increases mortality rates, making it necessary to know these factors in order to contribute to early preventive measures. In this sense, the purpose of this article is to use six data mining algorithms with the objective of helping to identify and diagnose people prone to having a stroke based on risk factors and indicative signs. The algorithms used were: Decision Tree, K-Nearest Neighbors (K-NN), Multilayer Perceptron Neural Network (MLP), Support Vector Machine (SVM), Naive Bayes, and the Apriori algorithm. 
The results showed that the MLP and decision tree algorithms obtained the best results, indicating their suitability for intelligent solutions in this area.</p> 2024-11-12T00:00:00+00:00 Copyright (c) 2024 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/3060 Usage of the Bag Distance Filtering with In-Memory Metric Trees 2023-09-28T00:03:12+00:00 Sergio Luis Sardi Mergen mergen@inf.ufsm.br <p>Metric trees are efficient indexing structures for multidimensional objects defined in terms of a metric space. One possible application is string similarity search, using the edit distance as the metric function. A previous work proposes clustering objects under leaf nodes and using the bag distance as a filtering step before the edit distance is computed. Cost predictions estimate that the filtering pays off in practical scenarios. The work has important implications when data resides on secondary storage, where nodes have a fixed size that aligns with disk pages. In this paper, we expand the discussion by using the bag distance filtering step for in-memory metric trees, where the clusters have no size constraints. We adjust existing metric trees to support leaf nodes with arbitrary cluster sizes and incorporate parameters based on size and density to decide when a leaf node should be subdivided. Experiments show that cluster size can have a substantial impact during both index construction and search. 
We report the gains achieved in terms of processing cost and the number of distance computations when using the most suited values for the cluster size and density parameters.</p> 2024-04-17T00:00:00+00:00 Copyright (c) 2023 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/3079 Built-up Integration: A New Terminology and Taxonomy for Managing Information On-the-fly 2023-05-02T18:17:29+00:00 Maria Helena Franciscatto mhfranciscatto@inf.ufpr.br Luis Carlos Erpen de Bona bona@inf.ufpr.br Celio Trois trois@inf.ufsm.br Marcos Didonet Del Fabro marcos.didonetdelfabro@cea.fr <p>Obtaining useful data to meet specific query requirements usually demands integrating data sources at query time, which is known as on-the-fly integration. Currently, many studies address this concept by discovering useful data sources in an ad-hoc manner and merging them to provide actionable information to the end user. This set of steps, however, lacks standardization in its identification, since the steps are described in the literature under many different names. Hence, without a unified nomenclature and knowledge organization, development in the area may be considerably impaired. This paper proposes a novel term called Built-up Integration, aiming at knowledge regulation, and a taxonomy for embracing a set of common tasks observed in studies that select and integrate sources on-the-fly. As a result of the taxonomy, we demonstrate how Built-up Integration features can be found in the literature, through an exemplification with related studies. 
We also highlight research opportunities regarding Built-up Integration, as a way to guide future development in a subdomain of Data Integration.</p> 2024-02-19T00:00:00+00:00 Copyright (c) 2023 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/3337 Machine Learning Model Explainability supported by Data Explainability: a Provenance-Based Approach 2023-10-24T15:46:40+00:00 Rosana Leandro de Oliveira rosanasleandro@ime.eb.br Julio Cesar Duarte duarte@ime.eb.br Kelli de Faria Cordeiro kelli@ime.eb.br <p>The task of explaining the result of Machine Learning (ML) predictive models has become critically important nowadays, given the necessity to improve the results' reliability. Several techniques have been used to explain the prediction of ML models, and some research works explore the use of data provenance in ML cycle phases. However, there is a gap in relating the provenance data with model explainability provided by Explainable Artificial Intelligence (XAI) techniques. To address this issue, this work presents an approach to capture provenance data, mainly in the pre-processing phase, and relate it to the results of explainability techniques. To support that, a relational data model was also proposed and is the basis for our concept of data explainability. Furthermore, a graphic visualization was developed to better present the improved technique. The experiments' results showed that the improvement of the ML explainability techniques was reached mainly by the understanding of the attributes' derivation, which built the model, enabled by data explainability.</p> 2024-02-21T00:00:00+00:00 Copyright (c) 2023 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/3209 Sentence-ITDL: Generating POI type Embeddings based on Variable Length Sentences 2023-08-25T17:53:28+00:00 Salatiel Dantas Silva salatiel@copin.ufcg.edu.br Claudio E. C. 
Campelo campelo@dsc.ufcg.edu.br Maxwell Guimarães de Oliveira maxwell@computacao.ufcg.edu.br <p>Point of Interest (POI) types are one of the most researched aspects of urban data. Developing new methods capable of capturing the semantics and similarity of POI types enables the creation of computational mechanisms that may assist in many tasks, such as POI recommendation and Urban Planning. Several works have successfully modeled POI types considering POI co-occurrences in different spatial regions along with statistical models based on the Word2Vec technique from Natural Language Processing (NLP). In the state of the art, binary relations between each POI in a region indicate the co-occurrences. The relations are used to generate a set of two-word sentences using the POI types. Such sentences feed a Word2Vec model that produces POI type embeddings. Although these works have presented good results, they do not consider the spatial distance among related POIs as a feature to represent POI types. In this context, we present the Sentence-ITDL, an approach based on Word2Vec variable-length sentences that include such a distance to generate POI type embeddings, providing an improved POI type representation. Our approach uses the distance to generate Word2Vec variable-length sentences. We define ranges of distances mapped to word positions in a sentence. From the mapping, nearby POIs will have their types mapped to close positions in the sentences. Word2Vec's architecture uses the word position in a sentence to adjust the training weights of each POI type. In this manner, POI type embeddings can incorporate the distance. 
Experiments based on similarity assessments between POI types revealed that our representation provides values close to human judgment.</p> 2024-07-17T00:00:00+00:00 Copyright (c) 2024 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/3327 Modeling the Spatiotemporal Evolution of Brazil 2024-04-12T13:22:09+00:00 Fernanda de Oliveira Ramalho fernanda.ramalho@dcc.ufmg.br Clodoveu A. Davis Jr. clodoveu@dcc.ufmg.br <p>This article presents the modeling of a geographic database that includes temporal data on the evolution of territorial boundaries. Data on the changes that occurred in the Brazilian territory between the years of 1872 and 2015 are organized in order to implement the necessary structures for manipulation of temporal data. The purpose of this work is to achieve a single structure for analysis, comparison and visualization of spatiotemporal data. The result of the modeling is presented, along with an interaction in which it is possible to retrieve geographic data from a range of years.</p> 2024-06-25T00:00:00+00:00 Copyright (c) 2024 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/3317 The Impact of Representation Learning on Unsupervised Graph Neural Networks for One-Class Recommendation 2023-08-22T12:51:41+00:00 Marcos Paulo Silva Gôlo marcosgolo@usp.br Leonardo Gonçalves de Moraes leonardo.g.moraes@usp.br Rudinei Goularte rudinei@icmc.usp.br Ricardo Marcondes Marcacini ricardo.marcacini@icmc.usp.br <p>We present a Graph Neural Network (GNN) using link prediction for One-class Recommendation. Traditional recommender systems require positive and negative interactions to recommend items to users, but negative interactions are scarce, making it challenging to cover the scope of non-recommendations. 
Our proposed approach explores One-Class Learning (OCL) to overcome this limitation by using only one class (positive interactions) to train and predict whether or not a new example belongs to the training class in enriched heterogeneous graphs. The paper also proposes an explainability model and performs a qualitative evaluation of the learned embeddings through the t-SNE algorithm. The methods' analysis in a two-dimensional projection showed our enriched graph neural network proposal was the only one that could separate the representations of users and items. Moreover, the proposed explainability method showed the user nodes connected with the predicted item are the most important to recommend this item to another user. Another conclusion from the experiments is that the nodes added to enrich the graph also impact the recommendation.</p> 2024-02-22T00:00:00+00:00 Copyright (c) 2024 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/3328 Using Non-Local Connections to Augment Knowledge and Efficiency in Multiagent Reinforcement Learning: an Application to Route Choice 2023-05-01T10:30:09+00:00 Ana L. C. Bazzan bazzan@inf.ufrgs.br H. U. Gobbi hugobbi@inf.ufrgs.br G. D. dos Santos gdsantos@inf.ufrgs.br <p>Providing timely information to drivers is proving valuable in urban mobility applications. There have been several attempts to tackle this question, from transportation engineering as well as from computer science points of view. In this paper we use reinforcement learning to let driver agents learn how to select a route. In previous works, vehicles and the road infrastructure exchange information to allow drivers to make better informed decisions. In the present paper, we provide extensions in two directions. First, we use non-local information to augment the knowledge that some elements of the infrastructure have. By non-local we mean information that is not in the immediate neighborhood. 
This is done by constructing a graph in which the elements of the infrastructure are connected according to a similarity measure regarding patterns. Patterns here relate to a set of different attributes: we consider not only travel time, but also include emission of gases. The second extension refers to the environment: the road network now contains signalized intersections. Our results show that using augmented information leads to more efficiency. In particular, we measure travel time and emission of CO along time, and show that the agents learn to use routes that reduce both these measures and, when non-local information is used, the learning task is accelerated.</p> 2024-02-29T00:00:00+00:00 Copyright (c) 2024 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/3365 Two Meta-learning approaches for noise filter algorithm recommendation 2023-08-22T12:52:20+00:00 Pedro B. Pio pedrobpio@gmail.com Adriano Rivolli rivolli@utfpr.edu.br André C. P. L. F. de Carvalho andre@icmc.usp.br Luís P. F. Garcia luis.garcia@unb.br <p>Preprocessing techniques can increase the predictive performance, or even allow the use, of Machine Learning (ML) algorithms. This occurs because many of these techniques can improve the quality of a dataset, such as noise removal or filtering. However, it is not simple to identify which preprocessing techniques to apply to a given dataset. This work presents two approaches to recommend a noise filtering technique using meta-learning. Meta-learning is an automated machine learning (AutoML) method that can, based on a set of features extracted from a dataset, induce a meta-model able to predict the most suitable technique to be applied to a new dataset. The first approach returns a ranking of the noise filter techniques using regression models. The second sequentially applies multiple meta-models, to decide the most suitable noise filter technique for a particular dataset. 
For both approaches we extract the meta-features from synthetic datasets and use as meta-label the F1-score value obtained by different ML algorithms when applied to these datasets. For the experiments, eight noise filtering techniques were used. The experimental results indicated that the ranking approach achieved a higher performance gain than the baseline, while the second obtained higher predictive performance. The ranking-based approach also ranked the best algorithm in the top-3 positions with high predictive accuracy.</p> 2024-02-23T00:00:00+00:00 Copyright (c) 2024 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/3368 Legal Document Segmentation and Labeling Through Named Entity Recognition Approaches 2023-08-14T18:01:55+00:00 Gabriel M. C. Guimarães gabriel.ciriatico@aluno.unb.br Felipe X. B. da Silva felipe.barbosa@aluno.unb.br Lucas A. B. Macedo almeida.bandeira@aluno.unb.br Victor H. F. Lisboa victor.lisboa@aluno.unb.br Ricardo M. Marcacini ricardo.marcacini@gmail.com Andrei L. Queiroz andreiqueiroz@unb.br Vinicius R. P. Borges viniciusrpb@unb.br Thiago P. Faleiros thiagodepaulo@unb.br Luis P. F. Garcia luis.garcia@unb.br <p>The document segmentation task allows us to divide documents into smaller parts, known as segments, which can then be labelled within different categories. This problem can be divided into two steps: the extraction and the labeling of these segments. We tackle the problem of document segmentation and segment labeling focusing on official gazettes or legal documents. They have a structure that can benefit from token classification approaches, especially Named Entity Recognition (NER), since they are divided into labelled segments. In this study, we use word-based and sentence-based CRF, CNN-CNN-LSTM and CNN-biLSTM-CRF models to bring together text segmentation and token classification. 
To validate our experiments, we propose a new annotated data set named PersoSEG composed of 127 documents in Portuguese from the Official Gazette of the Federal District, published between 2001 and 2015, with a Krippendorff's alpha agreement coefficient of 0.984. As a result, we observed a better performance for word-based models, especially with the CRF architecture, which achieved an average F1-Score of 75.65% for 12 different categories of segments.</p> 2024-02-23T00:00:00+00:00 Copyright (c) 2024 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/3369 Empirical Comparison of EEG Signal Classification Techniques through Genetic Programming-based AutoML: An Extended Study 2023-04-30T12:37:33+00:00 Icaro M. Miranda icaro.miranda@aluno.unb.br Claus de C. Aranha caranha@cs.tsukuba.ac.jp André C. P. L. F. de Carvalho andre@icmc.usp.br Luís P. F. Garcia luis.garcia@unb.br <p>Machine Learning (ML) applications using complex data often need multiple preprocessing techniques and predictive models to find a solution that meets their needs. In this context, Automated Machine Learning (AutoML) techniques help to provide automated data preparation and modeling and improve ML pipelines. AutoML can follow different strategies, among them Genetic Programming (GP). GP stands out for its ability to create pipelines of arbitrary format, with high interpretability and the ability to customize information from the data domain context. This paper presents a comparative study of two AutoML approaches optimized with GP for the time series classification problem and its characterization through four domain-based feature sets. We selected Electroencephalogram (EEG) signals as a case study due to their high complexity, spatial and temporal co-variance, and non-stationarity. Our data characterization shows that using only spectral or time-domain features is unsuitable for achieving high-performance pipelines. 
Our results reveal how AutoML can generate more accurate and interpretable solutions than the literature's complex or <em>ad hoc</em> models. The proposed approach facilitates the analysis of dimensional reduction through fitness convergence, tree depth, and generated features.</p> 2024-02-27T00:00:00+00:00 Copyright (c) 2024 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/3371 How to balance financial returns with metalearning for trend prediction 2023-10-02T11:19:36+00:00 Alvaro Valentim Pereira de Menezes Bandeira avalentim98@usp.br Gabriel Monteiro Ferracioli ferracioligabriel@usp.br Moisés Rocha dos Santos mmrsantos@usp.br André Carlos Ponce de Leon Ferreira de Carvalho andre@icmc.usp.br <p>The prediction of market price movement is an essential tool for decision-making in trading scenarios. However, there are several candidate methods for this task. Metalearning can be an important ally for the automatic selection of methods, which can be machine learning algorithms for classification tasks, named here classification algorithms. In this work, we present the use of metalearning for classification in market movement prediction and elaborate new analyses of its statistical implications. Different setups and metrics were evaluated for the meta-target selection. Cumulative return was the metric that achieved the best meta and base-level results. According to the experimental results, metalearning was a competitive selection strategy for predicting market price movement. This work is an extension of Bandeira et al. [2022].</p> 2024-02-27T00:00:00+00:00 Copyright (c) 2024 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/3463 Instance hardness measures for classification and regression problems 2023-09-01T13:28:12+00:00 Gustavo P. Torquette gustavo.torquette@unifesp.br Victor S. Nunes victor.nunes@ga.ita.br Pedro Y. A. Paiva paiva@ita.br Ana C. 
Lorena aclorena@ita.br <p>While the most common approach in Machine Learning (ML) studies is to analyze the performance achieved on a dataset through summary statistics, a fine-grained analysis at the level of its individual instances can provide valuable information for the ML practitioner. For instance, one can inspect whether the instances which are hardest to have their labels predicted might have any quality issues that should be addressed beforehand; or one may identify the need for more powerful learning methods for addressing the challenge imposed by one or a set of instances. This paper formalizes and presents a set of meta-features for characterizing which instances of a dataset are the hardest to have their label predicted accurately and why they are so, aka instance hardness measures. While there are already measures able to characterize instance hardness in classification problems, there is a lack of work devoted to regression problems. Here we present and analyze instance hardness measures for both classification and regression problems according to different perspectives, taking into account the particularities of each of these problems. For validating our results, synthetic datasets with different sources and levels of complexity are built and analyzed, indicating what kind of difficulty each measure is able to better quantify. A Python package containing all implementations is also provided.</p> 2024-02-27T00:00:00+00:00 Copyright (c) 2024 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/3460 LiPSet: A Comprehensive Dataset of Labeled Portuguese Public Bidding Documents 2023-10-09T13:09:25+00:00 Mariana O. Silva mariana.santos@dcc.ufmg.br Gabriel P. Oliveira gabrielpoliveira@dcc.ufmg.br Henrique Hott henriquehott@dcc.ufmg.br Larissa D. Gomide larissa.gomide@dcc.ufmg.br Bárbara M. A. Mendes barbaramit@ufmg.br Clara A. Bacha clarabacha@ufmg.br Lucas L. Costa lucas-lage@ufmg.br Michele A. 
Brandão michele.brandao@ifmg.edu.br Anisio Lacerda anisio@dcc.ufmg.br Gisele L. Pappa glpappa@dcc.ufmg.br <p>Collecting, processing, and organizing governmental public documents pose significant challenges due to their diverse sources and formats, complicating data analysis. In this context, this work introduces LiPSet, a comprehensive dataset of labeled documents from Brazilian public bidding processes in Minas Gerais state. We provide an overview of the data collection process and present a methodology for data labeling that includes a meta-classifier to assist in the manual labeling process. Next, we perform an exploratory data analysis to summarize the key features and contributions of the LiPSet dataset. We also showcase a practical application of LiPSet by employing it as input data for classifying bidding documents. The results of the classification task exhibit promising performance, demonstrating the potential of LiPSet for training neural network models. Finally, we discuss various applications of LiPSet and highlight the primary challenges associated with its utilization.</p> 2024-04-05T00:00:00+00:00 Copyright (c) 2024 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/3564 Datasets for Portuguese Legal Semantic Textual Similarity 2023-10-09T13:09:13+00:00 Daniel da Silva Junior danieljunior@id.uff.br Paulo Roberto dos Santos Corval paulocorval@id.uff.br Daniel de Oliveira danielcmo@ic.uff.br Aline Paes alinepaes@ic.uff.br <p>The Brazilian judiciary faces a significant workload, leading to prolonged durations for legal proceedings. In response, the Brazilian National Council of Justice introduced the Resolution 469/2022, which provides formal guidelines for document and process digitalization, thereby creating the opportunity to implement automatic techniques in the legal field. These techniques aim to assist with various tasks, especially managing the large volume of texts involved in law procedures. 
Notably, Artificial Intelligence (AI) techniques open room to process and extract valuable information from textual data, which could significantly expedite the process. However, one of the challenges lies in the scarcity of datasets specific to the legal domain required for various AI techniques. Obtaining such datasets is difficult as they require some expertise for labeling. To address this challenge, this article presents four datasets from the legal domain: two include unlabelled documents and metadata, while the other two are labeled using a heuristic approach designed for use in textual semantic similarity tasks. Additionally, the article presents a small ground truth dataset generated from domain expert annotations to evaluate the effectiveness of the proposed heuristic labeling process. The analysis of the ground truth labels highlights that conducting semantic analysis of domain-specific texts can be challenging, even for domain experts. Nonetheless, the comparison between the ground truth and heuristic labels demonstrates the utility and effectiveness of the heuristic labeling approach.</p> 2024-04-05T00:00:00+00:00 Copyright (c) 2024 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/3568 Wiki Evolution dataset applicability: English Wikipedia revision articles represented by quality attributes 2023-11-23T16:40:19+00:00 Ana Luiza Sanches analuizatrz@gmail.com Sinval de Deus Vieira Júnior sinvalvieirajunior@gmail.com Daniel Hasan Dalip hasan@cefetmg.br Bárbara Gabrielle C. O. Lopes barbaragcol@dcc.ufmg.br <p>This paper presents the creation of the Wikipedia article's evolution dataset. This dataset is a set of revisions of articles, represented by quality attributes and quality classification. 
This dataset can be used for studies regarding automatic quality classification that consider the article revision history, as well as for understanding how the content and quality of articles evolve over time in this collaborative platform. To illustrate a potential application, this study provides a practical example of utilizing a Machine Learning model trained on the constructed dataset.</p> 2024-04-05T00:00:00+00:00 Copyright (c) 2024 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/3570 Workflow for the acquisition, processing, and dissemination of Brazilian public data focused on education 2023-08-10T20:29:55+00:00 Abílio Nogueira Barros abilio.nogueira@ufrpe.br Aldéryck Félix de Albuquerque derycck@gmail.com Andrêza Leite de Alencar andreza.leite@ufrpe.br André Nascimento andre.nascimento@ufrpe.br Ibsen Mateus Bittencourt ibsen@feac.ufal.br Rafael Ferreira Mello rafael.mello@ufrpe.br <p>This article aims to demonstrate the process of creating public databases focused on the educational and population areas. It describes the process of obtaining data from official government sources such as INEP (National Institute for Educational Studies and Research) and IBGE (Brazilian Institute of Geography and Statistics), the procedures for data adaptation and optimization to create their historical series, as well as the best practices followed for their development and the generated metadata. It highlights the specificities of the education and population themes, reporting the challenges and peculiarities of each dataset. 
It also reports the results that can already be directly obtained from each dataset and how, when combined, they can track indicators of the National Education Plan, one of the largest Brazilian public policies focused on education.</p> 2024-04-05T00:00:00+00:00 Copyright (c) 2024 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/3571 Indicators and Municipal Data: A Database for Evaluating the Efficiency of Public Expenditures 2023-10-09T13:08:37+00:00 Paula Guelman Davis paula.davis@seguranca.mg.gov.br <div class="page" title="Page 1"> <div class="layoutArea"> <div class="column"> <p>This article describes the construction of a database with financial and operational data from Brazilian municipalities. Public data were collected regarding expenses by function (education, health, public security, among others), indicators and other data that reflect the municipal situation in the areas of education, health, public security, development, sanitation and finance. Data from various sources were integrated and transformed to allow studies on the correlation between performance indicators of the effectiveness of public governance and the corresponding expenditures, to follow up and assess the effects of public policies.</p> </div> </div> </div> 2024-04-06T00:00:00+00:00 Copyright (c) 2024 Journal of Information and Data Management https://journals-sol.sbc.org.br/index.php/jidm/article/view/4290 Inter-MOON: Enhanced Middleware for Interoperability between Relational and Blockchain-based Databases 2024-06-17T12:53:07+00:00 Rafael Avilar Sá rafael.sa@lsbd.ufc.br Leonardo O. Moreira leonardo.moreira@lsbd.ufc.br Javam C. Machado javam.machado@lsbd.ufc.br <p>Multi-model architectures enable the querying of data from different sources through a unified interface, providing interoperability among databases. However, support for blockchain-based databases is still scarce. 
Inter-MOON is a new approach that aims to promote the interoperability of blockchain-based and relational database systems through the virtualization of blockchain assets in a relational environment, allowing for the execution of all four basic SQL DML commands. Through experimentation, results indicate that Inter-MOON provides near total support for SQL SELECT query syntax and exhibits performance comparable to or better than similar tools. This work is an extension of the original work that introduces Inter-MOON.</p> 2024-11-18T00:00:00+00:00 Copyright (c) 2024 Journal of Information and Data Management