https://journals-sol.sbc.org.br/index.php/jidm/issue/feedJournal of Information and Data Management2024-04-12T13:22:09+00:00Daniel de Oliveiradanielcmo@ic.uff.brOpen Journal Systems<p>JIDM is an electronic journal that is published three times a year. Submissions are continuously received, and the first phase of the reviewing process usually takes 4 to 6 months. JIDM is sponsored by the Brazilian Computer Society, focusing on information and data management in large repositories and document collections. It relates to different areas of Computer Science, including databases, information retrieval, digital libraries, knowledge discovery, data mining, and geographical information systems. </p>https://journals-sol.sbc.org.br/index.php/jidm/article/view/2607Biophysical Chemistry of Macromolecules Research Group at the State University of Maringá2022-07-21T12:10:22+00:00Diego de Souza Limadiegodslima92@gmail.comGisele Strieder Philippsengistrieder@ufpr.brElisangela Andrade Ângeloelisangela.angelo@ifpr.edu.brMaria Aparecida Fernandezmafernandez@uem.brFlavio Augusto Vicente Seixasfavseixas@uem.br<p>The interdisciplinary field of Biophysical Chemistry, which applies concepts from Physical Chemistry to describe biological phenomena, is essential for modern molecular biology advancements. This approach enables the description of biological systems in terms of their constituent parts, such as atoms and molecules, facilitating a structural understanding of their characteristics. Nonetheless, to describe such large systems, computational methods are needed. The Biophysical Chemistry of Macromolecules research group at the State University of Maringá is dedicated to investigating such systems, mainly protein-ligand complexes, through bioinformatics approaches combined with experimental techniques to validate in silico results. 
The main purpose of the research projects is to develop applications for drug discovery in the context of antimicrobial, antiviral, antifungal, and antihyperglycemic agents, with the aim of advancing the field of bioinformatics in Brazil.</p>2024-02-21T00:00:00+00:00Copyright (c) 2023 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/2609G3B3 to GP15: From early years to Health Informatics Research Group at Paulista University - UNIP2022-07-21T12:24:47+00:00Renato Massaharu Hassunumarenato.hassunuma@docente.unip.brPatricia Carvalho Garciapatricia.garcia@docente.unip.brMichele Janegitz Acorci-Valériomichele.valerio@docente.unip.brMarjorie de Assis Golimmarjorie.golim@unesp.brSandra Heloisa Nunes Messiassandra.messias@docente.unip.br<p>This article summarizes the history and intellectual production from G3B3 (Grupo de Estudos em Bioinformática Estrutural) to GP15 (Grupo de Pesquisa em Informática em Saúde), conducted at Paulista University - UNIP, Bauru campus. In the early years, several activities were developed by G3B3 until the team's conversion to GP15. This group has the computational simulation of biomolecules as one of its research lines. Together with the second line of research, entitled "Development of teaching material and research using computational resources", the production of this group is mainly related to the development of scripts for the visualization of biomolecules and the production of digital books. As of early 2022, the team consists of five university professors and four students from the Biomedicine Course at Universidade Paulista - UNIP, Bauru campus. Together, the team has published more than 50 books with different publishers, some of which are aimed at the development of scripts for computer simulation software.
All these books are adopted by course teachers in different subjects.</p>2024-02-16T00:00:00+00:00Copyright (c) 2023 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/2621The Barroso Research lab: biomolecular interactions, computing, and data-driven science to understand and engineer biological and pharmaceutical systems in a global academic partnership2022-07-21T12:23:53+00:00Fernando L. Barroso da Silvaflbarroso@usp.brCatherine Etchebestcatherine.etchebest@inserm.frErik E. Santisoeesantis@ncsu.eduCarolina Corrêa Gironcarolinacorreagiron@gmail.comIlyas Grandguillaumeilyas.grandguillaume@inserm.frRauni Borges Marquesraunimarques@usp.brSergio A. Poveda-Cuevasalejandropc@alumni.usp.br<p>Biomolecular interactions, high throughput computing, and data-driven science have been the central research foundations of the Barroso Research laboratory. We have been developing and applying innovative computational technology, offering a rational computational-based approach to the investigation of protein systems, and discovering key disease-related protein mechanisms, therapeutic agents, biomarkers, and proteins for specific applications and their controlled release. Born in 2001 at the School of Pharmaceutical Sciences at Ribeirão Preto with the genes of transdisciplinarity and internationalization, the laboratory has always been well integrated with research groups in Europe, the US, and Latin America. Students from different fields and places have been forged in this environment at the crossroads of Structural Bioinformatics, Molecular Biophysics, Biological Physics, Physical Chemistry, Engineering, Medicine, Food, and Pharma.
The more than 50 scientific papers published in high-impact journals, book chapters, and conference talks reflect our contributions to expanding knowledge and advancing Bioinformatics as an important tool to understand nature and guide innovations.</p>2024-04-16T00:00:00+00:00Copyright (c) 2023 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/2622ACDBio: The Biological Data Computational Analysis group at ICMC/USP, IFSP, and Barretos Cancer Hospital2022-07-21T12:16:00+00:00Adenilso Simaoadenilso@icmc.usp.brAdriane Feijó Evangelistaafevangelista@alumni.usp.brAlfredo Guilherme Souzaalfredo@usp.brCynthia de Oliveira Lage Ferreiracynthia@icmc.usp.brJorge Francisco Cutigicutigi@ifsp.edu.brPaulo Henrique Ribeirophribeiro@ifsp.edu.brRodrigo Henrique Ramosramos@ifsp.edu.br<p>Recent advances in biological and health technology have resulted in vast digital data. However, the major challenge is interpreting such data to find valuable knowledge. For this, computing is essential, since quick data processing and analysis, allied with knowledge extraction techniques, enable working effectively with large biological datasets. In this context, the ACDBio group works with the computational analysis of biological data from different sources, aiming to find new information and knowledge in data or answer questions that have not yet been answered. So far, the group has worked on several challenging topics, such as identifying significant genes for cancer and topological analysis of genes in interaction networks, among others. The group uses computational techniques such as complex networks and their algorithms, machine learning, and topological data analysis. This article aims to present the ACDBio group and the main research topics worked on by its members.
We also present the main results and future work expected by the group.</p>2024-02-17T00:00:00+00:00Copyright (c) 2023 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/2624Water quality in marine and freshwater environments: a metagenomics approach2022-07-21T12:27:47+00:00Carolina O. P. Gilcarol.pgil@peb.ufrj.brAna Carolina M. Piresanacarolinamp@peb.ufrj.brFernanda O. F. Schmidtfernandaferreira@peb.ufrj.brNathália S. C. Santosnathaliasantos@peb.ufrj.brPriscila C. Alvespriscila.pharm@peb.ufrj.brOdara A. Oliveiraodara.araujo96@gmail.comDhara Avelino-Alvesdharaavelino@gmail.comFlávio F. Nobreflavio@peb.ufrj.brGizele Duarte Garciagidugar@gmail.comCristiane C. Thompsonthompsoncristiane@gmail.comGraciela M. Diasgracielamd@biof.ufrj.brFabiano F. Thompsonfabianothompson1@gmail.comDiogo A. Tschoekediogoat@peb.ufrj.br<p>In this article, we review the work carried out by the UFRJ microbiology laboratory related to water quality and microorganisms associated with different aquatic ecosystems. We place water at the center of the One Health concept because water connects different living beings and the environment. We selected papers published between 2012 and 2022 by the UFRJ microbiology laboratory related to bioinformatics, genomic, and metagenomic analysis. We describe the main impacts on aquatic environments, the microorganisms involved in biogeochemical cycles, microorganisms as bioindicators, and their resistance genes.
Finally, we identified the microorganisms that were most abundant in all the articles studied and pointed out some public policies that we consider important to maintain water quality and reduce anthropic impacts.</p>2024-02-19T00:00:00+00:00Copyright (c) 2023 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/2625Bioinformatics of infectious and chronic diseases at the Center for Technological Development in Health of Fiocruz2022-07-21T12:14:00+00:00Nicolas Carelsnicolas.carels@fiocruz.brGilberto Ferreira da Silvagilberto.silva@fiocruz.brCarlyle Ribeiro Limacarlyle.lima@fiocruz.brFranklin Souza da Silvafranklin.souza@fiocruz.brMilena Magalhãesmilena.magalhaes@fiocruz.brAna Emília Goulart Lemosana.goulart@fiocruz.br<p>One of the purposes of bioinformatics is data mining and integration to solve fundamental scientific challenges. We have been investigating biological systems including viruses, bacteria, fungi, protozoans, plants, insects, and animals with this purpose in mind. Gradually, we moved from basic questions on genome organization to applications in infectious and chronic diseases by integrating interactome and RNA-seq data with modeling techniques such as Flux Balance Analysis, structural modeling, Boolean modeling, system dynamics, and computational biology from a systems biology perspective. At the moment, we focus on the rational therapy of cancer assisted by RNA sequencing, network modeling, and structural modeling.</p>2024-02-17T00:00:00+00:00Copyright (c) 2023 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/2626Let them eat cake: when the small aims at being LARGE or the empowering effects of bioinformatics in NGS wonderland2022-07-21T12:09:12+00:00Gabriel M. Yazbeckdna@ufsj.edu.brRaíssa C. D. Gracianoraissadgraciano@gmail.comRosiane P.
Santosrosianeps2007@yahoo.com.brRafael Sachetto Oliveirasachetto@ufsj.edu.br<p>This report summarizes the path (and pitfalls) of the Genetic Resources Laboratory (LARGE-UFSJ), trailed with the aid of bioinformatics, in the field of massive DNA data analyses and its application to the conservation of biodiversity, particularly of Neotropical migratory fish. We use the metaphor of DNA sequencing as the cake, both as a prized delicacy formerly inaccessible to the masses, as in the infamous "let them eat cake", scornfully exclaimed by Marie-Antoinette during a bread shortage in the French Revolution, but also as a means to achieve rapid growth for small research groups, as the plot device in Lewis Carroll's Alice in Wonderland. Next-Generation Sequencing (NGS) methods have been known to promote a true revolution in the Life Sciences, empowering groups with limited resources to explore the relatively new, still unknown and often surprising world of genetic sequences. Indeed, we argue for the inertia-breaking potential of NGS and give our group's trajectory as a testimony. It all began with the fortuitous union of providential fish DNA big data gathered by Genetics professor Dr. Yazbeck and Computer Science professor Dr. Sachetto's curiosity about biological research, along with the wit of some young researchers. Our initial NGS challenge was to provide the assembly and annotation of the first mitochondrial genome for the Anostomidae fish family. The LARGE's NGS research program was able to promote the characterization of what was then arguably the highest number of microsatellite DNA markers for the flagship species <em>Salminus brasiliensis</em> (dourado) and <em>Brycon orbignyanus</em> (piracanjuba), useful in environmental applications for conservation (green biotechnology).
We have also made these large raw datasets, as well as extensive derived results such as genomic assemblies and gene annotations for these fish, freely available to the scientific community in data repositories such as GenBank, SRA and FigShare. Technological spin-offs with application in the environmental protection and food production fields have also been devised as a direct consequence of the availability of such rich and diverse data.</p>2024-02-17T00:00:00+00:00Copyright (c) 2023 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/2661Bioinformatics and Computational Biology Research at the Computer Science Department at UFMG2022-07-21T12:06:55+00:00Diego Marianodiego@dcc.ufmg.brFrederico Chaves Carvalhofredericochaves@dcc.ufmg.brLuana Luiza Bastosluizabastos.luana9@gmail.comLucas Moraes dos Santoslucas.santos@dcc.ufmg.brVivian Morais Paixãovivianmp95@ufmg.brRaquel C. de Melo-Minardiraquelcm@dcc.ufmg.br<p>Bioinformatics is an emerging research field that encompasses the use of computational methods, algorithms, and tools to solve life science problems. At the Laboratory of Bioinformatics and Systems (LBS), our research lines include the use of graph-based algorithms to improve the prediction of the structure and function of macromolecules, the detection of molecular recognition patterns, the application of mathematical models and artificial intelligence techniques to assist enzyme engineering, and the development of models, algorithms, and tools. Additionally, the group has played a role in scientific outreach and spreading bioinformatics in Brazil.
In this article, we summarize the 20 years of Bioinformatics and Computational Biology research conducted by our group at LBS in the Department of Computer Science at the Universidade Federal de Minas Gerais (DCC-UFMG).</p>2024-02-16T00:00:00+00:00Copyright (c) 2023 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/2673Computational Biology Laboratory - Combi-Lab2022-07-21T12:06:07+00:00Karina dos Santos Machadokarinaecomp@gmail.comAdriano Velasque Werhliwerhli@gmail.com<p>This article presents the Computational Biology - Combi-Lab research group at the Universidade Federal do Rio Grande (FURG), which started its activities in 2011. The main objective of the group is to bring together researchers and students who are interested in all aspects of Computational Biology. Specifically, the group aims to develop, improve and use sophisticated statistical, computational, and mathematical methods to contribute to the advancement of this research area. This article provides an overview of the Combi-Lab timeline from its founding to the present day, highlighting various articles and discussing the future of the group. More importantly, joint projects and collaborators are presented, and their contribution to the development of Bioinformatics is explained.
In conclusion, as we look to the past and face the challenges of the future, we hold fast to our goal of becoming a solid and leading reference in Computational Biology at our university and community, and giving back to society as much as we can.</p>2024-02-16T00:00:00+00:00Copyright (c) 2023 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/2684NBioinfo: Establishing a Bioinformatics Core in a University-based General Hospital in South Brazil2022-07-21T12:15:22+00:00Mariana Recamonde-Mendozammendoza@hcpa.edu.brGerda Cristal Villalba Silvacristal.villalba@hotmail.comThayne Woycinck Kowalskitkowalski@hcpa.edu.brOtávio von Ameln Lovisonolovison@hcpa.edu.brRafaela Ramalho Guerrarrguerra@hcpa.edu.brAndreza Francisco Martinsandfmartins@hcpa.edu.brUrsula Matteumatte@hcpa.edu.br<p>Bioinformatics is an indispensable discipline for current research in life and medical sciences. The increasing volume and complexity of biological data and the growing tendency for open data and data reuse projects have made computer-based analytical tools central to these research fields. However, it is an intrinsically interdisciplinary field with a multitude of skill sets required for using bioinformatics tools or undertaking research toward developing new methods. There is still a lack of skilled human resources to meet the numerous and growing application possibilities, which represents a bottleneck in many research projects. This paper reports our efforts to create the Núcleo de Bioinformática (NBioinfo, or Bioinformatics Core) at the Hospital de Clínicas de Porto Alegre (HCPA), a major public university hospital in Brazil. NBioinfo aims to serve as a hub for research and interaction in Bioinformatics and Computational Biology at HCPA, institutionally developing these areas of knowledge and promoting scientific advances triggered by bioinformatics.
We briefly present our research group's history and goals, and describe our activities toward providing HCPA with competencies in these fields. We also describe the scientific and methodological challenges recently faced by our group and the advances promoted by scientific collaborations and research projects developed at NBioinfo.</p>2024-02-16T00:00:00+00:00Copyright (c) 2023 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/3060Usage of the Bag Distance Filtering with In-Memory Metric Trees2023-09-28T00:03:12+00:00Sergio Luis Sardi Mergenmergen@inf.ufsm.br<p>Metric trees are efficient indexing structures for multidimensional objects defined in terms of a metric space. One possible application is string similarity search, using the edit distance as the metric function. Previous work proposes clustering objects under leaf nodes and using the bag distance as a filtering step before the edit distance is computed. Cost predictions estimate that the filtering pays off in practical scenarios. The work has important implications when data resides on secondary storage, where nodes have a fixed size that aligns with disk pages. In this paper, we expand the discussion by using the bag distance filtering step for in-memory metric trees, where the clusters have no size constraints. We adjust existing metric trees to support leaf nodes with arbitrary cluster sizes and incorporate parameters based on size and density to decide when a leaf node should be subdivided. Experiments show that cluster size can have a substantial impact during both index construction and search.
We report the gains achieved in terms of processing cost and the number of distance computations when using the most suited values for the cluster size and density parameters.</p>2024-04-17T00:00:00+00:00Copyright (c) 2023 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/3079Built-up Integration: A New Terminology and Taxonomy for Managing Information On-the-fly2023-05-02T18:17:29+00:00Maria Helena Franciscattomhfranciscatto@inf.ufpr.brLuis Carlos Erpen de Bonabona@inf.ufpr.brCelio Troistrois@inf.ufsm.brMarcos Didonet Del Fabromarcos.didonetdelfabro@cea.fr<p>Obtaining useful data to meet specific query requirements usually requires integrating data sources at query time, which is known as on-the-fly integration. Currently, many studies address this concept by discovering useful data sources in an ad-hoc manner and merging them to provide actionable information to the end user. This set of steps, however, lacks standardization in its identification, since the steps are described in the literature under many different names. Hence, without a unified nomenclature and knowledge organization, development in the area may be considerably impaired. This paper proposes a novel term called Built-up Integration aiming at knowledge regulation, and a taxonomy for embracing a set of common tasks observed in studies that select and integrate sources on-the-fly. As a result of the taxonomy, we demonstrate how Built-up Integration features can be found in the literature, through an exemplification with related studies.
We also highlight research opportunities regarding Built-up Integration, as a way to guide future development in a subdomain of Data Integration.</p>2024-02-19T00:00:00+00:00Copyright (c) 2023 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/3337Machine Learning Model Explainability supported by Data Explainability: a Provenance-Based Approach2023-10-24T15:46:40+00:00Rosana Leandro de Oliveirarosanasleandro@ime.eb.brJulio Cesar Duarteduarte@ime.eb.brKelli de Faria Cordeirokelli@ime.eb.br<p>The task of explaining the result of Machine Learning (ML) predictive models has become critically important nowadays, given the necessity to improve the results' reliability. Several techniques have been used to explain the prediction of ML models, and some research works explore the use of data provenance in ML cycle phases. However, there is a gap in relating the provenance data with model explainability provided by Explainable Artificial Intelligence (XAI) techniques. To address this issue, this work presents an approach to capture provenance data, mainly in the pre-processing phase, and relate it to the results of explainability techniques. To support that, a relational data model was also proposed and is the basis for our concept of data explainability. Furthermore, a graphic visualization was developed to better present the improved technique. The experiments' results showed that the improvement of the ML explainability techniques was reached mainly by the understanding of the attributes' derivation, which built the model, enabled by data explainability.</p>2024-02-21T00:00:00+00:00Copyright (c) 2023 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/3209Sentence-ITDL: Generating POI type Embeddings based on Variable Length Sentences2023-08-25T17:53:28+00:00Salatiel Dantas Silvasalatiel@copin.ufcg.edu.brClaudio E. C. 
Campelocampelo@dsc.ufcg.edu.brMaxwell Guimarães de Oliveiramaxwell@computacao.ufcg.edu.br<p>Point of Interest (POI) types are one of the most researched aspects of urban data. Developing new methods capable of capturing the semantics and similarity of POI types enables the creation of computational mechanisms that may assist in many tasks, such as POI recommendation and Urban Planning. Several works have successfully modeled POI types considering POI co-occurrences in different spatial regions along with statistical models based on the Word2Vec technique from Natural Language Processing (NLP). In the state of the art, binary relations between each POI in a region indicate the co-occurrences. The relations are used to generate a set of two-word sentences using the POI types. Such sentences feed a Word2Vec model that produces POI type embeddings. Although these works have presented good results, they do not consider the spatial distance among related POIs as a feature to represent POI types. In this context, we present Sentence-ITDL, an approach based on Word2Vec variable-length sentences that include such a distance to generate POI type embeddings, providing an improved POI type representation. Our approach uses the distance to generate Word2Vec variable-length sentences. We define ranges of distances mapped to word positions in a sentence. From the mapping, nearby POIs will have their types mapped to close positions in the sentences. Word2Vec's architecture uses the word position in a sentence to adjust the training weights of each POI type. In this manner, POI type embeddings can incorporate the distance.
Experiments based on similarity assessments between POI types revealed that our representation provides values close to human judgment.</p>2024-07-17T00:00:00+00:00Copyright (c) 2024 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/3327Modeling the Spatiotemporal Evolution of Brazil2024-04-12T13:22:09+00:00Fernanda de Oliveira Ramalhofernanda.ramalho@dcc.ufmg.brClodoveu A. Davis Jr.clodoveu@dcc.ufmg.br<p>This article presents the modeling of a geographic database that includes temporal data on the evolution of territorial boundaries. Data on the changes that occurred in the Brazilian territory between the years of 1872 and 2015 are organized in order to implement the necessary structures for manipulation of temporal data. The purpose of this work is to achieve a single structure for analysis, comparison and visualization of spatiotemporal data. The result of the modeling is presented, along with an interaction in which it is possible to retrieve geographic data from a range of years.</p>2024-06-25T00:00:00+00:00Copyright (c) 2024 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/3317The Impact of Representation Learning on Unsupervised Graph Neural Networks for One-Class Recommendation2023-08-22T12:51:41+00:00Marcos Paulo Silva Gôlomarcosgolo@usp.brLeonardo Gonçalves de Moraesleonardo.g.moraes@usp.brRudinei Goularterudinei@icmc.usp.brRicardo Marcondes Marcaciniricardo.marcacini@icmc.usp.br<p>We present a Graph Neural Network (GNN) using link prediction for One-class Recommendation. Traditional recommender systems require positive and negative interactions to recommend items to users, but negative interactions are scarce, making it challenging to cover the scope of non-recommendations. 
Our proposed approach explores One-Class Learning (OCL) to overcome this limitation by using only one class (positive interactions) to train and predict whether or not a new example belongs to the training class in enriched heterogeneous graphs. The paper also proposes an explainability model and performs a qualitative evaluation through the TSNE algorithm on the learned embeddings. The methods' analysis in a two-dimensional projection showed our enriched graph neural network proposal was the only one that could separate the representations of users and items. Moreover, the proposed explainability method showed the user nodes connected with the predicted item are the most important to recommend this item to another user. Another conclusion from the experiments is that the nodes added to enrich the graph also impact the recommendation.</p>2024-02-22T00:00:00+00:00Copyright (c) 2024 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/3328Using Non-Local Connections to Augment Knowledge and Efficiency in Multiagent Reinforcement Learning: an Application to Route Choice2023-05-01T10:30:09+00:00Ana L. C. Bazzanbazzan@inf.ufrgs.brH. U. Gobbihugobbi@inf.ufrgs.brG. D. dos Santosgdsantos@inf.ufrgs.br<p>Providing timely information to drivers is proving valuable in urban mobility applications. There have been several attempts to tackle this question, from transportation engineering as well as from computer science points of view. In this paper we use reinforcement learning to let driver agents learn how to select a route. In previous works, vehicles and the road infrastructure exchange information to allow drivers to make better informed decisions. In the present paper, we provide extensions in two directions. First, we use non-local information to augment the knowledge that some elements of the infrastructure have. By non-local we mean information that is not in the immediate neighborhood.
This is done by constructing a graph in which the elements of the infrastructure are connected according to a similarity measure regarding patterns. Patterns here relate to a set of different attributes: we consider not only travel time, but also include emission of gases. The second extension refers to the environment: the road network now contains signalized intersections. Our results show that using augmented information leads to more efficiency. In particular, we measure travel time and emission of CO over time, and show that the agents learn to use routes that reduce both these measures and, when non-local information is used, the learning task is accelerated.</p>2024-02-29T00:00:00+00:00Copyright (c) 2024 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/3365Two Meta-learning approaches for noise filter algorithm recommendation2023-08-22T12:52:20+00:00Pedro B. Piopedrobpio@gmail.comAdriano Rivollirivolli@utfpr.edu.brAndré C. P. L. F. de Carvalhoandre@icmc.usp.brLuís P. F. Garcialuis.garcia@unb.br<p>Preprocessing techniques can increase the predictive performance, or even allow the use, of Machine Learning (ML) algorithms. This occurs because many of these techniques can improve the quality of a dataset, such as noise removal or filtering. However, it is not simple to identify which preprocessing techniques to apply to a given dataset. This work presents two approaches to recommend a noise filtering technique using meta-learning. Meta-learning is an automated machine learning (AutoML) method that can, based on a set of features extracted from a dataset, induce a meta-model able to predict the most suitable technique to be applied to a new dataset. The first approach returns a ranking of the noise filter techniques using regression models. The second sequentially applies multiple meta-models to decide the most suitable noise filter technique for a particular dataset.
For both approaches we extract the meta-features from synthetic datasets and use as the meta-label the F1-score value obtained by different ML algorithms when applied to these datasets. For the experiments, eight noise filtering techniques were used. The experimental results indicated that the ranking approach acquired higher performance gain than the baseline, while the second obtained higher predictive performance. The ranking-based approach also ranked the best algorithm in the top-3 positions with high predictive accuracy.</p>2024-02-23T00:00:00+00:00Copyright (c) 2024 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/3368Legal Document Segmentation and Labeling Through Named Entity Recognition Approaches2023-08-14T18:01:55+00:00Gabriel M. C. Guimarãesgabriel.ciriatico@aluno.unb.brFelipe X. B. da Silvafelipe.barbosa@aluno.unb.brLucas A. B. Macedoalmeida.bandeira@aluno.unb.brVictor H. F. Lisboavictor.lisboa@aluno.unb.brRicardo M. Marcaciniricardo.marcacini@gmail.comAndrei L. Queirozandreiqueiroz@unb.brVinicius R. P. Borgesviniciusrpb@unb.brThiago P. Faleirosthiagodepaulo@unb.brLuis P. F. Garcialuis.garcia@unb.br<p>The document segmentation task allows us to divide documents into smaller parts, known as segments, which can then be labelled within different categories. This problem can be divided into two steps: the extraction and the labeling of these segments. We tackle the problem of document segmentation and segment labeling focusing on official gazettes or legal documents. They have a structure that can benefit from token classification approaches, especially Named Entity Recognition (NER), since they are divided into labelled segments. In this study, we use word-based and sentence-based CRF, CNN-CNN-LSTM and CNN-biLSTM-CRF models to bring together text segmentation and token classification.
To validate our experiments, we propose a new annotated data set named PersoSEG composed of 127 documents in Portuguese from the Official Gazette of the Federal District, published between 2001 and 2015, with a Krippendorff's alpha agreement coefficient of 0.984. As a result, we observed better performance for word-based models, especially with the CRF architecture, which achieved an average F1-Score of 75.65% for 12 different categories of segments.</p>2024-02-23T00:00:00+00:00Copyright (c) 2024 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/3369Empirical Comparison of EEG Signal Classification Techniques through Genetic Programming-based AutoML: An Extended Study2023-04-30T12:37:33+00:00Icaro M. Mirandaicaro.miranda@aluno.unb.brClaus de C. Aranhacaranha@cs.tsukuba.ac.jpAndré C. P. L. F. de Carvalhoandre@icmc.usp.brLuís P. F. Garcialuis.garcia@unb.br<p>Machine Learning (ML) applications using complex data often need multiple preprocessing techniques and predictive models to find a solution that meets their needs. In this context, Automated Machine Learning (AutoML) techniques help to provide automated data preparation and modeling and improve ML pipelines. AutoML can follow different strategies, among them Genetic Programming (GP). GP stands out for its ability to create pipelines of arbitrary format, with high interpretability and the ability to customize information from the data domain context. This paper presents a comparative study of two AutoML approaches optimized with GP for the time series classification problem and its characterization through four domain-based feature sets. We selected Electroencephalogram (EEG) signals as a case study due to their high complexity, spatial and temporal co-variance, and non-stationarity. Our data characterization shows that using only spectral or time-domain features is unsuitable for achieving high-performance pipelines.
Our results reveal how AutoML can generate more accurate and interpretable solutions than the literature's complex or <em>ad hoc</em> models. The proposed approach facilitates the analysis of dimensional reduction through fitness convergence, tree depth, and generated features.</p>2024-02-27T00:00:00+00:00Copyright (c) 2024 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/3371How to balance financial returns with metalearning for trend prediction2023-10-02T11:19:36+00:00Alvaro Valentim Pereira de Menezes Bandeiraavalentim98@usp.brGabriel Monteiro Ferracioliferracioligabriel@usp.brMoisés Rocha dos Santosmmrsantos@usp.brAndré Carlos Ponce de Leon Ferreira de Carvalhoandre@icmc.usp.br<p>The prediction of market price movement is an essential tool for decision-making in trading scenarios. However, there are several candidate methods for this task. Metalearning can be an important ally for the automatic selection of methods, which can be machine learning algorithms for classification tasks, named here classification algorithms. In this work, we present the use of metalearning for classification in market movement prediction and elaborate new analyses of its statistical implications. Different setups and metrics were evaluated for the meta-target selection. Cumulative return was the metric that achieved the best meta and base-level results. According to the experimental results, metalearning was a competitive selection strategy for predicting market price movement. This work is an extension of Bandeira et al. [2022].</p>2024-02-27T00:00:00+00:00Copyright (c) 2024 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/3463Instance hardness measures for classification and regression problems2023-09-01T13:28:12+00:00Gustavo P. Torquettegustavo.torquette@unifesp.brVictor S. Nunesvictor.nunes@ga.ita.brPedro Y. A. Paivapaiva@ita.brAna C. 
Lorenaaclorena@ita.br<p>While the most common approach in Machine Learning (ML) studies is to analyze the performance achieved on a dataset through summary statistics, a fine-grained analysis at the level of its individual instances can provide valuable information for the ML practitioner. For instance, one can inspect whether the instances which are hardest to have their labels predicted might have any quality issues that should be addressed beforehand; or one may identify the need for more powerful learning methods for addressing the challenge imposed by one or a set of instances. This paper formalizes and presents a set of meta-features for characterizing which instances of a dataset are the hardest to have their label predicted accurately and why they are so, aka instance hardness measures. While there are already measures able to characterize instance hardness in classification problems, there is a lack of work devoted to regression problems. Here we present and analyze instance hardness measures for both classification and regression problems according to different perspectives, taking into account the particularities of each of these problems. For validating our results, synthetic datasets with different sources and levels of complexity are built and analyzed, indicating what kind of difficulty each measure is able to better quantify. A Python package containing all implementations is also provided.</p>2024-02-27T00:00:00+00:00Copyright (c) 2024 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/3460LiPSet: A Comprehensive Dataset of Labeled Portuguese Public Bidding Documents2023-10-09T13:09:25+00:00Mariana O. Silvamariana.santos@dcc.ufmg.brGabriel P. Oliveiragabrielpoliveira@dcc.ufmg.brHenrique Hotthenriquehott@dcc.ufmg.brLarissa D. Gomidelarissa.gomide@dcc.ufmg.brBárbara M. A. Mendesbarbaramit@ufmg.brClara A. Bachaclarabacha@ufmg.brLucas L. Costalucas-lage@ufmg.brMichele A. 
Brandãomichele.brandao@ifmg.edu.brAnisio Lacerdaanisio@dcc.ufmg.brGisele L. Pappaglpappa@dcc.ufmg.br<p>Collecting, processing, and organizing governmental public documents pose significant challenges due to their diverse sources and formats, complicating data analysis. In this context, this work introduces LiPSet, a comprehensive dataset of labeled documents from Brazilian public bidding processes in Minas Gerais state. We provide an overview of the data collection process and present a methodology for data labeling that includes a meta-classifier to assist in the manual labeling process. Next, we perform an exploratory data analysis to summarize the key features and contributions of the LiPSet dataset. We also showcase a practical application of LiPSet by employing it as input data for classifying bidding documents. The results of the classification task exhibit promising performance, demonstrating the potential of LiPSet for training neural network models. Finally, we discuss various applications of LiPSet and highlight the primary challenges associated with its utilization.</p>2024-04-05T00:00:00+00:00Copyright (c) 2024 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/3564Datasets for Portuguese Legal Semantic Textual Similarity2023-10-09T13:09:13+00:00Daniel da Silva Juniordanieljunior@id.uff.brPaulo Roberto dos Santos Corvalpaulocorval@id.uff.brDaniel de Oliveiradanielcmo@ic.uff.brAline Paesalinepaes@ic.uff.br<p>The Brazilian judiciary faces a significant workload, leading to prolonged durations for legal proceedings. In response, the Brazilian National Council of Justice introduced the Resolution 469/2022, which provides formal guidelines for document and process digitalization, thereby creating the opportunity to implement automatic techniques in the legal field. These techniques aim to assist with various tasks, especially managing the large volume of texts involved in law procedures. 
Notably, Artificial Intelligence (AI) techniques open room to process and extract valuable information from textual data, which could significantly expedite the process. However, one of the challenges lies in the scarcity of datasets specific to the legal domain required for various AI techniques. Obtaining such datasets is difficult as they require some expertise for labeling. To address this challenge, this article presents four datasets from the legal domain: two include unlabelled documents and metadata, while the other two are labeled using a heuristic approach designed for use in textual semantic similarity tasks. Additionally, the article presents a small ground truth dataset generated from domain expert annotations to evaluate the effectiveness of the proposed heuristic labeling process. The analysis of the ground truth labels highlights that conducting semantic analysis of domain-specific texts can be challenging, even for domain experts. Nonetheless, the comparison between the ground truth and heuristic labels demonstrates the utility and effectiveness of the heuristic labeling approach.</p>2024-04-05T00:00:00+00:00Copyright (c) 2024 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/3568Wiki Evolution dataset applicability: English Wikipedia revision articles represented by quality attributes2023-11-23T16:40:19+00:00Ana Luiza Sanchesanaluizatrz@gmail.comSinval de Deus Vieira Júniorsinvalvieirajunior@gmail.comDaniel Hasan Daliphasan@cefetmg.brBárbara Gabrielle C. O. Lopesbarbaragcol@dcc.ufmg.br<p>This paper presents the creation of the Wikipedia article's evolution dataset. This dataset is a set of revisions of articles, represented by quality attributes and quality classification. 
This dataset can be used for studies regarding automatic quality classification that consider the article revision history as well as understanding how the content and quality of articles evolve over time in this collaborative platform. To illustrate a potential application, this study provides a practical example of utilizing a Machine Learning model trained on the constructed dataset.</p>2024-04-05T00:00:00+00:00Copyright (c) 2024 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/3570Workflow for the acquisition, processing, and dissemination of Brazilian public data focused on education2023-08-10T20:29:55+00:00Abílio Nogueira Barrosabilio.nogueira@ufrpe.brAldéryck Félix de Albuquerquederycck@gmail.comAndrêza Leite de Alencarandreza.leite@ufrpe.brAndré Nascimentoandre.nascimento@ufrpe.brIbsen Mateus Bittencourtibsen@feac.ufal.brRafael Ferreira Mellorafael.mello@ufrpe.br<p>This article aims to demonstrate the process of creating public databases focused on the educational and population areas. It describes the process of obtaining data from official government sources such as INEP (National Institute for Educational Studies and Research) and IBGE (Brazilian Institute of Geography and Statistics), the procedures for data adaptation and optimization to create their historical series, as well as the best practices followed for their development and the generated metadata. It highlights the specificities of the education and population themes and reports the challenges and peculiarities of each dataset. 
It also reports the results that can already be directly obtained from each dataset and how, when combined, they can track indicators of the National Education Plan, one of the largest Brazilian public policies focused on education.</p>2024-04-05T00:00:00+00:00Copyright (c) 2024 Journal of Information and Data Managementhttps://journals-sol.sbc.org.br/index.php/jidm/article/view/3571Indicators and Municipal Data: A Database for Evaluating the Efficiency of Public Expenditures2023-10-09T13:08:37+00:00Paula Guelman Davispaula.davis@seguranca.mg.gov.br<div class="page" title="Page 1"> <div class="layoutArea"> <div class="column"> <p>This article describes the construction of a database with financial and operational data from Brazilian municipalities. Public data were collected regarding expenses by function (education, health, public security, among others), indicators and other data that reflect the municipal situation in the areas of education, health, public security, development, sanitation and finance. Data from various sources were integrated and transformed to allow studies on the correlation between performance indicators of the effectiveness of public governance, and the corresponding expenditures, to follow up and assess the effects of public policies.</p> </div> </div> </div>2024-04-06T00:00:00+00:00Copyright (c) 2024 Journal of Information and Data Management