\documentclass[jidm,a4paper]{jidm} % NOTE: JIDM is published on A4 paper
\usepackage{graphicx,url}  % for using figures and url format
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{enumerate}
\usepackage{multirow}
%\usepackage{enumitem} %permite usar enumerate com letras do alfabeto 
%\usepackage{cite} % NOTE: do **not** include this package because it conflicts with jidm.bst

% Standard definitions
\newtheorem{theorem}{Theorem}[section]
\newtheorem{conjecture}[theorem]{Conjecture}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newdef{definition}[theorem]{Definition}
\newdef{remark}[theorem]{Remark}

% New environment definition
\newenvironment{latexcode}
{\ttfamily\vspace{0.1in}\setlength{\parindent}{18pt}}
{\vspace{0.1in}}

% ALL FIELDS UNTIL BEGIN{document} ARE MANDATORY

% The following data (volume, number and page) are given by the editors prior to publishing your article
\jidmVolume{5}
\jidmNumber{1}
\jidmYear{14}
\jidmMonth{June}
\setcounter{page}{1}


% Includes headers with simplified name of the authors and article title
\markboth{C. Ribeiro and L. Zárate}
{Data Preparation for Longitudinal Data Mining: a Case Study on Human Ageing}
%  -> \markboth{}{}
%         takes 2 arguments
%         ex: \markboth{M. M. Moro}{Any article title}


% Title of the article
\title{Data preparation for longitudinal data mining: \\a case study on human ageing}


% List of authors
%IF THERE ARE TWO or more institutions, please use:
%\author{Author1\inst{1}, Author2\inst{1}, Author3\inst{1}}
\author{Caio Eduardo Ribeiro\inst{1}, Luis Enrique Zárate\inst{2}}


%Affiliation and email
\institute{Pontifícia Universidade Católica de Minas Gerais, Brazil \\
\email{caioedurib@gmail.com}\\ \email{zarate@pucminas.br}
}


% Article abstract - it should be from 100 to 300 words
\begin{abstract}
An adequate preparation of a database is essential to extracting the useful knowledge contained in it. In longitudinal studies, which follow a fixed set of records through a period of time, the data preparation process must adapt to the features that the temporal aspect of the data adds to the database. This article presents the data preparation process for a real longitudinal database, from a human ageing study. The process addresses the conceptual feature selection of the attributes in the database and its preprocessing, generalizing the procedures performed.
\end{abstract}


% ACM Computing Classification System categories
\category{H.2.8}{Database Applications}{Data Mining} 
\category{H.2.m}{Miscellaneous}{}

% Categories and Descriptors are available at the 1998 ACM Computing Classification System
% http://www.acm.org/about/class/1998/
%  -> \category{}{}{}
%         takes 3 arguments for the Computing Reviews Classification Scheme.
%         ex: \category{D.3.3}{Programming Languages}{Language Constructs and Features}
%                   [data types and structures]
%                   the last argument, in square brackets, is optional.

% Article keywords
\keywords{data mining, knowledge discovery, preprocessing}
%  -> \keywords{} (in alphabetical order \keywords{document processing, sequences,
%                      string searching, subsequences, substrings})


% THE ARTICLE BEGINS
\begin{document}

% This is optional:
\begin{bottomstuff}
% similar to \thanks
% for authors' addresses; research/grant statements
\end{bottomstuff}

\maketitle


% ARTICLE NEW SECTION
\section{Introduction}
The process of knowledge discovery in databases (KDD) aims to discover useful non-trivial patterns through data mining algorithms. The data mining phase is preceded by a series of preparation steps, critical to making the discovered knowledge useful and correct \cite{fayyad}.

There are several database preparation steps suggested in the literature, such as feature selection, the detection and elimination of inconsistencies and redundant data, missing data analysis, and outlier detection, among other tasks. These are usually executed considering the characteristics of the database and the goals of the project, which directly influence how the data preparation techniques are applied. When applied appropriately, data preparation reduces distortions, inconsistencies, and polarization in the data, besides improving the performance of the data mining algorithms. All these actions contribute to more valuable and reliable results in the knowledge discovery process \cite{pyle}.

Meanwhile, in the current scenario it is possible to register data for extended periods of time, introducing a temporal aspect to them. Thus, we begin to handle historical data that must be explored from a temporal point of view for its best understanding. Mining these temporal data allows the identification of cause-effect relations and more accurate predictions, due to the capability of observing the history and progression of the individual states of the data \cite{roddick}. It is recommended that traditional KDD methodologies be restructured to attend to the characteristics of temporal data mining \cite{last}.

One class of temporal registers comes from longitudinal studies, in which the same sample of records is followed through time to characterize certain aspects of its evolution \cite{diggle}. The main difference between a longitudinal study and a regular temporal study is the analysis of the evolutionary behaviour of the sample, which is expressed not only in terms of seasonality, tendencies, and averages, but also, and mainly, through approaches that ease the identification and analysis of phenomena related to the passage of time. This can be achieved by comparing the data from a given period, or wave, with its versions at distinct instants or time periods, always referring to the same records.

As stated previously, the domain problem characteristics and the type of database being analysed impact on the best way to perform the database preparation. When working with data that presents the longitudinal aspect, the traditional data preparation techniques need to be assessed in a new light. Moreover, to ensure that the longitudinal information of the data is kept, we need to make adaptations in the preparation strategies.

Within the context of longitudinal studies, an area that is gaining attention from the scientific community and governmental agencies is the study of human ageing. This is due to the population ageing phenomenon expected for the coming decades. Studies indicate that the elderly population will surpass 21.5\% of the worldwide total by 2050, a substantial increase over the current 12.3\%; such growth will have great social and economic impacts \cite{undesa}. Aiming to minimize the impact of population ageing on the different social spheres, health researchers seek, through the observation of the evolution of distinct environmental factors in the lives of sets of individuals, to formulate and test hypotheses on how biological ageing is affected by individual choices and by the environment in which a person is inserted.

Despite the growing importance of the study of human ageing, our review to date found no work in which the KDD and data mining processes are applied to this global phenomenon. In order to encourage new longitudinal data mining (LDM) studies, in this work we discuss the different tasks of data preparation when applied to longitudinal databases (LDBs). As a case study, and as a foundation for our generalizations, we consider a real database used in human ageing studies.

This article is organized as follows: the second section presents concepts of human ageing and of longitudinal studies from a data mining perspective. The complete data preparation method of the case study is presented in the third section, which details the adopted procedures and generalizes them to similar projects. Finally, the fourth section presents our conclusions about this work.

\section{Background}
\subsection{Human ageing studies}
Because of the gradual ageing of the world population, there is currently a greater interest in understanding how genetics and environmental factors determine the way age affects people's cognitive functioning, health, and psychological state. Population ageing impacts the entire structure of society, especially regarding social security issues, because the ratio of working to retired people will decline, which will have severe social and economic implications \cite{lutz}.

The studies on human ageing that focus on environmental conditions (which are easier to handle than genetic conditions) aim to determine how personal choices, and the characteristics inherent to the environment in which the individual lives, affect one's biological ageing. In order to understand this phenomenon, an interdisciplinary evaluation of the several aspects that compose the environment is necessary. Efforts at the national level are being carried out, mainly in nations where population ageing is more aggravated, such as censuses to collect data on the ageing of their citizens. As mentioned, longitudinal studies follow a fixed sample of individuals through the years, generating databases with detailed descriptions of various aspects of their lives.

The most common form of longitudinal data analysis found in the literature is the use of classical statistical techniques. For instance, it is usual to find regression analysis studies, which aim to infer the value of a dependent variable by studying a set of independent variables \cite{cacioppo}. Hypothesis tests and investigations of correlations between variables can also often be found. As an example, in the study of \cite{kim}, the authors tried to find correlations between self-reported health and self-reported life satisfaction in a set of individuals.

Among the large-scale longitudinal studies, the English Longitudinal Study of Ageing, ELSA, is one of the most prominent \cite{marmot}. The ELSA study has, in each of its 6 waves, thousands of respondents from United Kingdom households, who are visited and interviewed every two years (the duration of a wave of the study). In its current form, the ELSA began officially in 2002, and the features discussed include demographic, economic, and social aspects; physical, mental, and psychological health; and cognitive functions.

Another longitudinal study of human ageing worth mentioning is the Survey of Health, Ageing and Retirement in Europe, SHARE, a continental-level study that unites the efforts of twenty European countries, in addition to Israel. As the most geographically distributed study in the world, SHARE has the largest sample of records, an important feature for some studies. However, the comprehensiveness of the ELSA was considered more suitable for our study, because it addresses several physical health aspects through test results (such as psycho-motor, cognitive, memory, and blood tests) performed by health professionals who participate in the visits to the respondents of the study. For that reason, the ELSA database was chosen for the case study in this work.

The ELSA is intended for persons 50 years of age or older, in order to follow the participants for years prior to their retirement and beyond, which allows a detailed analysis of the evolution of the observed aspects \cite{banks}. Most ELSA questions have predefined response options, making the generated database predominantly categorical. The ELSA database had some of its attributes and records altered throughout the study, due to updates in the samples and questionnaires, which made a preprocessing of the data necessary before the longitudinal analysis itself. For this study, we used only the records and attributes common to all ELSA waves, creating a longitudinal version of the database.


\subsection{Longitudinal Studies and Data Mining}
With the increasing interest in studying human ageing, large-scale longitudinal studies have been initiated in several countries, producing databases made available for use in scientific research. As stated previously, our reviews found no published works using data mining techniques on longitudinal study databases. However, we believe that a more comprehensive analysis of the frequently complex problems addressed in social studies of human ageing would be made easier and more efficient through data mining techniques.

Formally, a longitudinal database is a temporal database with the same identities across all time units. We define as longitudinal any database that can be described as a matrix $M$, composed of the Cartesian product of three vectors: $r$ (of records), $a$ (of attributes), and $t$ (of time units), as shown in Equation 1. The representativeness of a database is related to its dimensionality, which is affected by the size of each of these vectors.

\begin{equation}
 M_{rat} = [r]  \times  [a]   \times   [t]
\end{equation}

As the same sample of records is followed through time in a longitudinal study (the records have unique identifications), the attributes of the database have a temporal aspect, and their values present previous and posterior states, considering the different waves of the database. 
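As an illustration of Equation 1 and of this longitudinal view, the structure $M_{rat}$ and the trajectory of one attribute can be sketched in a few lines of Python; the records, attributes, and values below are hypothetical toy data, not drawn from the ELSA database:

```python
# Toy sketch of a longitudinal database M = [r] x [a] x [t] (Equation 1).
# M[record][attribute][wave] holds the value observed for that record,
# attribute, and time unit. All names and values are hypothetical.
M = {
    "r1": {"self_reported_health": {1: "good", 2: "good", 3: "fair"},
           "weekly_work_hours":    {1: 40,     2: 38,     3: 20}},
    "r2": {"self_reported_health": {1: "fair", 2: "poor", 3: "poor"},
           "weekly_work_hours":    {1: 35,     2: 35,     3: 0}},
}

def trajectory(M, record, attribute):
    """Longitudinal view: the ordered series of values of one attribute
    for one record, across all waves."""
    return [M[record][attribute][t] for t in sorted(M[record][attribute])]

print(trajectory(M, "r1", "weekly_work_hours"))  # [40, 38, 20]
```

Because the same sample is followed across waves, the trajectory of a record, rather than a single static value, becomes the unit of analysis.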

It is necessary to consider the temporal information in the data during the planning and execution of the data preparation tasks, while the data mining algorithms must be adapted to handle this characteristic. LDBs inherently have greater data volume and complexity, and the focus of LDM studies is usually to identify the causal relations behind an observed effect, and the evolution of the effects on a set of attributes of the database \cite{last}.

Data mining applied to longitudinal studies brings forth the possibility of discovering and interpreting patterns such as: a) cohort effects, based on previous causes (effects that are specific to a sample of records); b) seasonal longitudinal patterns (behaviours that repeat in determined intervals of time, or happen as a function of temporal events); or c) effects of the passage of time (evolution of the observed aspects). The traditional data mining algorithms for association rules, classification, and clustering must be adapted to achieve these goals.

\section{Data Preparation for Longitudinal Databases}

As stated in \cite{pyle}, many data miners neglect the data preparation tasks prior to the execution of mining algorithms, and that behaviour can cause the failure of a KDD project. The time spent on data preparation is a necessary investment to ensure that the discovered knowledge is relevant. In practice, experience shows that the time needed to fully prepare a database can reach 75\% of the total time spent in a KDD project.

The data preparation process aims to ensure that the data is as relevant as possible to the KDD project, that it represents reality, and that it is presented in a sufficient quantity of records, or instances, to ensure that the results found are applicable \cite{kotsiantis}.

When applying KDD to LDBs, we should not consider temporal data as a simple collection of unordered events, ignoring the fact that these data reflect values referring to the same set of records, in a chronological order that impacts the values of posterior waves \cite{shahnawaz}. In this section, we describe the actions performed during the data preparation process for the ELSA database, in an LDM project. These actions are detailed to highlight the differences that the characteristics of this kind of database introduce into the preparation process; moreover, the actions are generalized to allow their replication on similar databases. In this article, we consider as part of the data preparation process all the actions performed from the definition of the goals of the study up to the point of executing the data mining algorithms.

There are some intrinsic characteristics to LDBs, due to the way longitudinal studies are conducted:
\begin{enumerate}[(a)]
\item{As the same person has to be located in each wave to respond to the ELSA questionnaires, the amount of missing data and discontinued records can become large. Furthermore, the studies themselves evolve over time, with changes in their approach and goals that alter data structures, such as the emergence of new variables and the elimination of others, as a result of new guidelines and changes in the scope of the study;}
\item{A typical feature of longitudinal studies is their comprehensiveness, so their databases usually have large dimensionality (large amounts of records and attributes). That hampers the manipulation and processing of these databases by algorithms, hence the need for a rigorous feature selection and preparation process.}
\end{enumerate}

With these LDB aspects in mind, the data preparation process discussed in this article is grounded in two phases: 1) definition of the problem and conceptual feature selection; and 2) data preprocessing. Initially, in the first phase, we propose a previous study of the domain problem, in order to establish the most relevant attributes in the database, eliminating those of lesser relevance through a process we refer to as conceptual feature selection. The second phase of the process consists in guaranteeing that the database is in the correct format, identifying and eliminating inconsistent, imprecise, or missing data, which undermine the performance of mining algorithms. The goal of the preparation phase is to ensure that only relevant data remains represented, and that its representativeness is enough for the mining algorithms to discover the patterns that represent the knowledge existing in the database. Figure \ref{dimensions} exhibits the original dimensions (records $\times$ attributes) of the 6 waves of the considered study database, in addition to the dimensions of the longitudinal and final versions of that database, after the preparation process. The resulting database has 4689 records and 172 attributes in each of the 6 waves (referring to the 2002-2012 period) considered in this study.

\begin{figure}[!t]
\centering
\includegraphics[width=.75\textwidth]{Imagens/dimensoes.png}
\caption{Dimensions of the different versions of the ELSA database.}
\label{dimensions}
\end{figure}

\subsection{Previous Study and Conceptual Feature Selection}

\subsubsection{Problem Definition}
The first step of the first phase of data preparation consists of defining the domain problem that will be explored in the study. Through this definition, it is possible to establish the scope of the project, and which variables (attributes of the database) are relevant to the discovery and analysis of patterns. For example, in the case study, we research the influence of the environment on the process of human ageing. To prepare an adequate database, it was necessary to study the composition of the environment, and how the variables relate and were collected.

A previous study \cite{ribeiro} aimed to conceptually model the problem of human ageing, highlighting the environmental variables considered most relevant. This definition of the relevance of each attribute supports the feature selection process. Therefore, the selection of the environmental variables most frequently found in studies on human ageing, made in the aforementioned work, guided the conceptual feature selection step, described hereafter. Figure \ref{model} contains a selection of the most addressed environmental aspects, organized into economic, social, life-quality, physical health, and mental health categories.

\begin{figure}[!t]
\centering
\includegraphics[width=.8\textwidth]{Imagens/modelo.png}
\caption{Most addressed environmental aspects in published works.}
\label{model}
\end{figure}

The goal of the KDD project for which the database has been prepared is to discover effects of the passage of time on the responses to questions about the several environmental aspects considered in the study. Thus, with the scope and goals of the project defined, the data preparation process can be initiated.

\subsubsection{Database composition}
An LDB corresponds to a Cartesian product of three vectors (as shown in Equation 1), which implies that the same records and attributes repeat over all waves of the database, represented by the time vector $t$. It is expected that, as longitudinal studies evolve and previous results are found, the attributes of their databases are modified to comply with new demands of the study, besides the addition of new records (respondents included in the study) and the removal of others (respondents that do not participate in posterior waves). Such changes prevent the longitudinal analysis of the database, because it becomes impossible to track the changes of values in the database. Therefore, for this study, a record or attribute can only be kept in the database if it is present in all waves of the study, as defined in Equations 2 and 3.

\begin{equation}
r_{i} \in M_{rat} \mid r_{i} \in M_{rat_{j}}\ \forall j
\end{equation}
\begin{equation}
a_{i} \in M_{rat} \mid a_{i} \in M_{rat_{j}}\ \forall j
\end{equation}

As mentioned, the ELSA is a national-level study, addressing several environmental aspects of the lives of its respondents. The resulting database differs for each wave. Thus, to allow a longitudinal analysis of the database, we removed the attributes and records that were not present in each of the 6 considered waves (2002-2012). This restriction reduced the database to a total of 4796 records and 1693 attributes.
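The retention rule of Equations 2 and 3 can be sketched as a set intersection over the waves; the toy waves below are hypothetical and far smaller than the real ELSA data, but the logic is the one applied to the full database:

```python
# Sketch of the retention rule of Equations 2 and 3: a record (or attribute)
# is kept only if it is present in every wave. Each toy wave is a dict
# mapping record id -> {attribute: value}; all ids are hypothetical.
from functools import reduce

waves = [
    {"r1": {"a1": 1, "a2": 2}, "r2": {"a1": 3, "a2": 4}},
    {"r1": {"a1": 1, "a3": 5}, "r2": {"a1": 3, "a2": 4}, "r3": {"a1": 7}},
    {"r1": {"a1": 2, "a2": 9}, "r2": {"a1": 3, "a2": 1}},
]

# Records present in all waves (Equation 2).
kept_records = reduce(set.intersection, (set(w) for w in waves))

# Attributes present for every kept record in all waves (Equation 3).
kept_attributes = reduce(
    set.intersection,
    (set(w[r]) for w in waves for r in kept_records),
)

print(sorted(kept_records))     # ['r1', 'r2']
print(sorted(kept_attributes))  # ['a1']
```

Record r3 is discarded because it is absent from two waves, and attributes a2 and a3 are discarded because they are not observed for every kept record in every wave.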

One of the main features of LDBs is the large volume of data produced by the temporal axis, which increases the importance of, and the care required in, the conceptual feature selection and data filtering tasks. In a longitudinal study, ensuring that the knowledge is represented by a minimal quantity of data, while preserving representativeness and comprehensibility, is not only a matter of avoiding useless or redundant knowledge. These tasks are directly related to the viability of the algorithmic processing of the data, to the interpretability of the knowledge and, thus, to the success of the KDD project \cite{paes}.

\subsubsection{Conceptual Feature Selection}
The conceptual feature selection aims to keep in the database only the attributes that are most relevant to the execution of the study. It is worth noting that an attribute might be considered relevant for different reasons: for instance, being related to the objectives of the research, or improving the distribution of the records or the precision of a classification algorithm \cite{blum}.

Selecting attributes implies removing the less relevant ones from the database, reducing the complexity of executing mining algorithms on the data. Knowledge of the problem domain of the study is key to correctly defining relevance, because the attributes are judged according to the understanding of their relationship to the problem addressed.

Some attributes become more relevant when presented with a temporal aspect. For these attributes, a static version carries little significance, while a series of values adds important information. For example, a heart rate measurement carries more information if there is a history of measurements for that patient. Another example would be the psychological history of a patient, used to identify high levels of stress.

In the conceptual feature selection phase, to make the choice of relevant attributes as adherent as possible, it was first necessary to define what composes the environmental influence in human ageing studies, the object of study in this project (see Figure \ref{model} for the aspects considered). With the knowledge obtained in the previous study and the definitions of the conceptual modelling of the problem, an individual analysis of the questions in the questionnaire elaborated for the longitudinal study was performed. We discarded the attributes that were not related to the aspects identified in Figure \ref{model}. Of the 1963 attributes in the longitudinal version of the ELSA database, only 275 were kept after this task, considered the most relevant to the study. It is worth noting that it is highly recommended to apply this conceptual feature selection prior to more elaborate computational and statistical procedures, such as wrapper and filter algorithms. These techniques should be applied after the data preparation process, and before using mining algorithms, providing a formally based dimensionality reduction. In contrast, the conceptual feature selection is based on explicit and tacit knowledge of the domain problem, and can be easily applied even to bulky databases.

Databases are composed of facts and judgements, the latter being suppositions that aim to approximate an explanation of the problem. Therefore, after the conceptual feature selection is finished, we suggest a second analysis of the selected attributes. It is important to control the balance of facts and judgements among the attributes in a database. If a database is composed only of facts, it is impossible to discover unprecedented knowledge; as judgements are added, the exploration capacity increases, but so does the distance to the real characterization of the problem. Finding this balance can be the difference between discovering obvious or relevant knowledge at the end of a KDD process.

Thus, if a database has a low factual level, it is recommended that some attributes with a high judgement level be eliminated, reducing the dimensionality of the database while also enriching its capacity to generate useful, unprecedented knowledge. The decision to eliminate an attribute depends on knowledge of the information it represents. For example, in the case study, we eliminated an attribute that represented the question ``How many employees does the company you work for have?'', and kept the question ``How many hours a week do you stay at work, including extra and interval hours?''. The eliminated question adds information on the size of the company where the respondents work, but this information is not directly related to their economic situation. The kept question, however, represents important information on the routine of the respondents, and is thus more relevant to the study. This elimination by judgement level filters the database using previous knowledge of the relevance of the attributes.

The selection by judgement level considers the explicit knowledge obtained during the previous study phase to discard attributes that are poorly factual, thus reducing the volume of the database. It is important that, in such cases, a careful study of the domain problem is used to minimize the damage that the lack of tacit knowledge from a domain expert brings to the KDD project. Ideally, the whole feature selection process should be validated by a domain expert, who evaluates the relevance of each attribute available to the study.

The intervention of the domain expert has been identified as a key aspect of success in knowledge discovery processes. This paradigm is named D3M (Domain-Driven Data Mining), and recommends the generation of problem-oriented methodologies for KDD processes \cite{cao}. According to D3M concepts, each task in the KDD process must be accompanied and validated by the domain expert, which makes the process more adherent.

\subsection{Data Preprocessing}
In this article, we consider as part of the preprocessing phase all the procedures adopted to refine the database before applying the data mining algorithms. These procedures are described hereafter.

\subsubsection{Noisy Data Elimination}
Noisy data arise from errors in data collection or database generation, which are common in large data volumes. These data add false information that will be interpreted by the mining algorithms and can lead to invalid results. Thus, the following actions should be performed to identify and eliminate noisy data:

\begin{enumerate}[(a)]
\item{\textbf{Duplicate records analysis}: In longitudinal databases, such as the one in this case study, similar records in the same wave or in distinct waves may be valid. These records can represent individuals with the same responses to the questionnaire, or an individual that kept the same response through several waves.}
\item{\textbf{Inconsistencies analysis}: A longitudinal database can, especially if the data is collected manually, present inconsistencies, which are data registration errors that make the value of an attribute invalid. A conformity analysis of the values found in the database, checking whether they are in accordance with the formal definition of the expected values of an attribute, should be done during the preprocessing phase. It is possible to infer the correct values of inconsistent data, but we must consider that assuming a value for an attribute can actually add more noise to the database and harm the knowledge represented. A possible example of an inconsistency would be a negative value as a response to a question about the remuneration of a respondent.}
\item{\textbf{Attribute transformations}:
Some recoding and transformations of attributes can be necessary to make the database consistent. If an attribute has its possible values (answer options) modified through the study waves, the values from the version with higher cardinality should be remapped to be consistent with the version with lower cardinality. This way, it is possible to keep coherence and consistency in the attributes, which is necessary for longitudinal analysis, with minimal information loss.}
\end{enumerate}
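The remapping described in item (c) can be sketched as a simple value translation table; the two answer codings below are hypothetical examples, not the actual ELSA codings:

```python
# Sketch of the attribute transformation in item (c): remapping an attribute
# whose answer options changed between waves to the lower-cardinality coding,
# so that its values stay comparable across waves. Codings are hypothetical.

# One wave used a 5-point scale; another used a 3-point scale for the question.
FIVE_TO_THREE = {
    "excellent": "good", "very good": "good", "good": "good",
    "fair": "fair",
    "poor": "poor",
}

def remap(values, mapping):
    """Remap a list of responses; unknown codes are kept as-is, to be
    handled later by the inconsistency analysis."""
    return [mapping.get(v, v) for v in values]

wave_answers = ["excellent", "fair", "very good", "poor"]
print(remap(wave_answers, FIVE_TO_THREE))  # ['good', 'fair', 'good', 'poor']
```

Mapping from the higher-cardinality coding down to the lower one loses some granularity, but keeps a single coherent version of the attribute across all waves, as the text requires.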

Importantly, the ELSA database underwent a preliminary review by those responsible for the study, eliminating possible insertion errors, duplicate records, and inconsistencies. Still, some adaptations were necessary to the values of attributes referring to questions that had been somehow modified through the waves of the study. The attributes were recoded and their values remapped, so that the database had a unique and coherent version of each attribute, enabling its longitudinal analysis.

\subsubsection{Missing Data Analysis}
The missing data analysis seeks to adequately treat data with missing values, either by eliminating the record or the attribute, or by imputing the most likely value for the missing data. Different techniques to infer missing values can be applied, such as retrieving a likely value from other sources, in addition to the possibility of using calculations to determine a statistically approximate value for the missing data (averages, medians, modes, among others) \cite{rubin}\cite{silva}.
	
In a longitudinal database, however, it is possible to resort to values from different waves to infer missing data with more reliability. If the database has a value for the attribute in a previous or posterior wave, we are more likely to be able to correctly infer the value for the wave where it is missing. This characteristic of longitudinal databases makes imputing missing values more efficient, since it uses information directly related to the missing data to infer its value. This procedure requires tacit knowledge of the domain problem to evaluate its applicability.

In some ELSA questions, if the response is unchanged in the waves subsequent to the first wave in which the question was asked, a code is registered in the database to indicate that the initial response was kept. In these cases, the real value of the attribute was recovered by retrieving the initial response to that question. In addition, a weight value is calculated for each ELSA record, according to the amount of information the record adds to the database, meaning that individuals that responded to very few questions received a smaller weight. The records that received a weight of 0 in this evaluation were eliminated from the database.
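A minimal sketch of this wave-based imputation, assuming a simple carry-forward (and, for leading gaps, carry-backward) strategy over the ordered series of one record's values; whether such a rule is applicable to a given attribute still depends on tacit domain knowledge, as noted above:

```python
# Sketch of longitudinal missing-data imputation: a value missing in one wave
# is borrowed from the nearest previous (or, failing that, posterior) wave of
# the same record. The series below is a hypothetical toy example; None marks
# missing data.

def impute_from_adjacent_waves(series):
    """series: values of one attribute for one record, ordered by wave."""
    out = list(series)
    # Forward pass: carry the last observed value into missing waves.
    last = None
    for i, v in enumerate(out):
        if v is None and last is not None:
            out[i] = last
        elif v is not None:
            last = v
    # Backward pass: fill leading gaps that had no previous value.
    nxt = None
    for i in range(len(out) - 1, -1, -1):
        if out[i] is None and nxt is not None:
            out[i] = nxt
        elif out[i] is not None:
            nxt = out[i]
    return out

print(impute_from_adjacent_waves([None, "good", None, None, "fair", None]))
# ['good', 'good', 'good', 'good', 'fair', 'fair']
```

This is also essentially the recovery rule used for the ELSA "response unchanged" codes: the registered code is replaced by the initial response carried forward from the first wave the question was asked.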

\subsubsection{Feature Selection through the Amount of Information}
If an attribute carries excessively little information, including it in the study might confound the interpretation of results, making the extracted knowledge less understandable. In these cases, it may be more adequate to reduce the dimensionality of the database by removing such low-information attributes. By calculating the entropy ($S$) of an attribute, it is possible to ascertain the amount of information it adds to the database, according to its variability \cite{li}. The entropy calculation considers the probabilities of the $j$ possible values of an attribute, as described in Equation 4.

\begin{equation}
S = \sum _{ i = 1 }^{ j }{ P(i) \times  \log _{ 2 }{ \frac { 1 }{ P(i) }  }  } 
\end{equation}

The ELSA attributes whose entropy was too close to 0 (the threshold adopted was $S < 0.1$) were individually analysed instead of readily eliminated. For some questions, such as those registering the existence of health conditions in the respondent, a low entropy value was expected, because the majority of the respondents did not suffer from those conditions. In these cases, the low variability was anticipated, and the information contained in the attributes was considered important enough for them to be kept in the database. However, attributes whose low entropy was considered atypical behaviour for the information they represent were eliminated from the database.
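The entropy filter can be sketched as follows; a minimal illustration in plain Python, with the near-constant attribute values invented for the example:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of an attribute's observed values,
    following Equation 4: S = sum_i P(i) * log2(1 / P(i))."""
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())

# A near-constant attribute (e.g. a rare health condition) falls below the
# S < 0.1 threshold and is flagged for individual analysis, not removed.
rare = ["no"] * 995 + ["yes"] * 5
balanced = ["a", "b"] * 50
print(entropy(rare) < 0.1, entropy(balanced))  # True 1.0
```

Note that a rare health condition and a constant administrative field both score near zero: the threshold only selects candidates, and the decision to keep or eliminate remains a domain judgement.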

\subsubsection{Outlier Detection}
Some records have values that are inconsistent with the sample of the study, for being too discrepant. The elimination of these outliers can be done automatically through algorithms that identify discrepant records by examining the database. Approaches using neural networks \cite{williams}, filter algorithms \cite{liu}, and clustering analysis, among others, aim to increase the efficiency of outlier detection, an important task for increasing the precision of the KDD process.

In the literature, it is recommended to readily eliminate outliers to keep the representativeness of the data from being distorted. In longitudinal studies, however, the outlier analysis must consider the temporal information represented by these records, which might be relevant to predict future states of the database. Thus, the decision of whether or not to eliminate outliers becomes harder, since valuable information for prediction might be found in these records.

Figure \ref{outliers} shows the possible evolutions, over time, of a clustering scenario with outliers. The initial situation can develop in the three ways shown, according to the strength of the influence on the clusters and on the outliers. The first hypothesis is the adaptation of outliers (1), in which the records modify their characteristics over time to fit the established patterns of the existing clusters. In the second, adaptation of clusters (2), observed mainly in social behaviour changes, the influence of the outliers makes the cluster behaviour adapt gradually, modifying the characteristic values of that set of records (which can explain a change of patterns over time). In the third possibility, the outliers gather to form new clusters (3), modifying the study scenario. Therefore, in the temporal context, the outlier analysis can help explain and predict changes in clusterings over time. These records, often discarded as noisy data, thus become valuable to longitudinal studies.

\begin{figure}[!t]
\centering
\includegraphics[width=.60\textwidth]{Imagens/outliers.png}
\caption{Temporal study of outliers.}
\label{outliers}
\end{figure}

Regarding the outlier analysis on ELSA, we considered as outliers of the study only the records that did not characterize its target audience, namely individuals who were at least 50 years old. Younger respondents had been included in the study because they met certain prerequisites, such as the prospect of reaching the minimum age for the study. Records with values discrepant from the averages and modes of the database were kept, for future evaluations of clustering evolution.
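The two-step treatment adopted here (eliminate only out-of-scope records; flag, but keep, statistically discrepant ones) can be sketched as follows. The records, attribute names, and the one-standard-deviation flagging rule are all invented for illustration, not taken from ELSA:

```python
# Hypothetical records (not actual ELSA data).
records = [
    {"id": 1, "age": 48, "income": 1200},
    {"id": 2, "age": 55, "income": 1500},
    {"id": 3, "age": 63, "income": 99000},
    {"id": 4, "age": 71, "income": 1300},
]

# Only records outside the target audience (under 50) are eliminated.
kept = [r for r in records if r["age"] >= 50]

# Incomes far from the mean are flagged as statistically discrepant,
# but kept for future clustering-evolution analysis.
incomes = [r["income"] for r in kept]
mean = sum(incomes) / len(incomes)
std = (sum((x - mean) ** 2 for x in incomes) / len(incomes)) ** 0.5
flagged = [r["id"] for r in kept if abs(r["income"] - mean) > std]
print([r["id"] for r in kept], flagged)  # [2, 3, 4] [3]
```

The flagged identifiers can then be carried through the waves, supporting the evolution analysis of Figure \ref{outliers}.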

\subsubsection{Discretization}
The next task of the preprocessing phase consists of making the necessary transformations on attributes to enable the use of the chosen data mining tools. Some mining algorithms require categorical input data, creating the need to discretize the numerical attributes present in the database.

The discretization problem is not trivial, because of the number of different combinations that can be obtained by modifying the number and size of the value bands created to represent categories over a continuous interval. Therefore, heuristics suited to specific situations can be used to seek an approximation of the optimal discretization. Typically, the choice of the most suitable technique is made according to the distribution of values, the desired number of intervals, and the information represented by the attribute \cite{garcia}.

In the case study, the numerical attributes were individually assessed and discretized according to the guidelines of the EqualFrequency and EqualWidth discretization techniques \cite{li}, both widely used. The choice between the two techniques was based on the data distribution and on the information represented by the attribute. For continuous attributes (economic variables, such as income) and quantification attributes (for example, number of children, or number of rooms in the residence), the EqualFrequency discretization was adopted, because the values of these attributes were not evenly distributed. However, for questions referring to ages or dates (for example, age when diagnosed with some health condition) and percentage questions (such as the well-being questionnaire items, which should be answered with a number from 0 to 100), the EqualWidth discretization was used, because the data distribution was more uniform in those questions.
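A minimal sketch of the two techniques on invented values; for brevity, the bands are returned as integer indices rather than the labelled intervals used in the case study:

```python
def equal_width(values, k):
    """EqualWidth: k bands of identical size over the attribute's range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency(values, k):
    """EqualFrequency: k bands holding (roughly) the same number of records."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(values), k - 1)
    return bins

# Skewed incomes -> EqualFrequency; uniform 0-100 scores -> EqualWidth.
print(equal_frequency([800, 950, 1200, 2500, 3100, 40000], 3))  # [0, 0, 1, 1, 2, 2]
print(equal_width([5, 20, 35, 50, 65, 95], 4))                  # [0, 0, 1, 2, 2, 3]
```

On the skewed incomes, EqualWidth would place five of the six records in the first band, which is why the distribution of values drives the choice of technique.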

Understanding the problem and its attributes well enough to apply a discretization based on tacit knowledge can make the database more robust. The numerical attributes of ELSA were discretized using the described guidelines, following the precepts of both algorithms. Figure \ref{discretization} presents four discretizations made in the case study. The attributes discretized with the EqualWidth technique present values representing upper and lower limits, with bands of the same magnitude between them. On the other hand, the attributes discretized with the EqualFrequency technique do not show a visible pattern in their bands, because the bands were created considering the distribution of values in the database.

\begin{figure}[!t]
\centering
\includegraphics[width=.95\textwidth]{Imagens/discretization.png}
\caption{Discretization of ELSA attributes.}
\label{discretization}
\end{figure}

\subsubsection{Variable Merging}
In a categorical database, it is possible to take advantage of this characteristic by merging highly dependent attributes in order to reduce the volume of data. In the case study, some attributes could be represented by a single variable without great loss of information. Therefore, concluding the preprocessing phase and the data preparation of the ELSA database, 22 highly dependent attributes were merged in this procedure, creating 9 new attributes to replace them.

Consider an attribute $A$ with cardinality $|A|$ and an attribute $B$ with cardinality $|B|$. The Cartesian product of both attributes has the cardinality defined in Equation 5. To merge attributes, first the Cartesian product of the two categorical attributes is computed, generating a new question whose answer options correspond to all the possible combinations of answers to the merged attributes. Then, this new attribute is adapted through a discretization that unites answers with a similar meaning. This is done in order to keep the merged attributes with a reduced number of answer options, a desirable feature for clustering algorithms.

\begin{equation}
\left| A\times B \right| =\left| A \right| \times \left| B \right| 
\end{equation}

Figure \ref{juncao} shows an example of variable merging with two ELSA questions. The Cartesian product of the attributes has six answer options, but three of them could be represented by a single response.
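The merging procedure (Cartesian product followed by a remapping of answers with a similar meaning) can be sketched as follows; the two questions and the remapping are invented for illustration and are not the ELSA questions of Figure \ref{juncao}:

```python
# Two hypothetical dependent questions (values invented for illustration).
smokes      = ["yes", "no", "no"]
smoked_past = ["yes", "yes", "no"]

# Cartesian product: one combined answer per pair of responses (|A x B| = 4).
combined = [f"{a}/{b}" for a, b in zip(smokes, smoked_past)]

# Remap combinations with a similar meaning, reducing the answer set to 3.
remap = {"yes/yes": "smoker", "yes/no": "smoker",
         "no/yes": "ex-smoker", "no/no": "never smoked"}
merged = [remap[c] for c in combined]
print(merged)  # ['smoker', 'ex-smoker', 'never smoked']
```

Here two attributes of cardinality 2 yield a product of cardinality 4, which the remapping reduces to 3 answer options, illustrating both Equation 5 and the subsequent discretization step.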

\begin{figure}[!t]
\centering
\includegraphics[width=.95\textwidth]{Imagens/juncao.png}
\caption{Example of variable merging.}
\label{juncao}
\end{figure}

Table \ref{tabela} presents a synthesis of the recommended guidelines for each task of the data preparation process, according to the procedures adopted in the case study. These actions can be replicated in data preparation processes for longitudinal databases with characteristics similar to those of the ELSA database.

\begin{table}[!t]
\centering
\caption{Guidelines for the preparation process.}
\label{tabela}
\begin{tabular}{|p{3cm}|p{4cm}|p{7cm}|}
\hline
\multicolumn{1}{|c|}{\textbf{Task}} & \multicolumn{1}{c|}{\textbf{Objectives}} & \multicolumn{1}{c|}{\textbf{Guidelines}} \\ \hline
Problem Definition & Establish the scope of the project and its goals & Conceptually model the problem through an exploratory study and/or the help of a problem domain specialist \\ \hline
Database Composition & Ensure that the database is longitudinal & Eliminate records and attributes that do not exist in every wave of the database \\ \hline
Conceptual Feature Selection & Select the attributes in the database that are relevant to the study & Evaluate the relevance of each variable according to the previous study. Eliminate those that are irrelevant, and those that present very high judgement levels (are less factual) \\ \hline
Noisy Data Elimination & Identify and eliminate noisy data from the database &
a) Only records with the same ID are considered duplicates. In a longitudinal database, similar records can correspond to similar responses from different respondents or waves.
b) If the set of possible answer options of an attribute changes, set the version with fewer options as the default, and adapt the others to it, remapping the answers in the database.
  \\ \hline
Missing Data Analysis & Recover the missing values of attributes & The previous and subsequent values of an attribute are the most reliable way to infer its missing value correctly. \\ \hline
Feature Selection through the Amount of Information & Eliminate attributes that add little information to the database &
Attributes with low entropy should not be readily eliminated from the database. Establish a threshold value for the entropy (e.g. $S < 0.1$) and further examine the attributes with entropy below it. An attribute might be of value to the database even if it has low variability, according to the information it represents. \\ \hline
Outlier Detection & Identify records that differ from common behaviours found in the database &
Outliers should not be readily eliminated from a longitudinal database. They can be used in the longitudinal analysis to understand changes in the behaviour of clusters of records that occur over time. \\ \hline
Discretization & Categorize numerical attributes, for use in mining algorithms that demand categorical variables & Examine the data distribution of the attribute. If it is unevenly distributed, use EqualFrequency discretization; otherwise, use EqualWidth discretization. \\ \hline
Variable Merging & Reduce dimensionality, representing highly dependent variables through a single variable & Create a new variable through the Cartesian product of the dependent variables. Discretize this new variable, if possible, reducing its number of possible values. \\ \hline
\end{tabular}
\end{table}

\section{Conclusions}
In this article, we presented a case study on the preparation of a longitudinal database originated from the English Longitudinal Study of Ageing. The adopted procedures were generalized so that they can be replicated in similar projects. First, we defined the scope and objectives of the study, and these definitions guided the decisions made during the data preparation process. Next, through explicit knowledge obtained in an extensive study of the problem domain, we performed a conceptual feature selection on the database, aiming to choose the most relevant attributes. Finally, the database went through a preprocessing that eliminated noisy data and attributes with low information, recovered missing values, analysed outliers, discretized numerical attributes, and merged dependent variables.

It is worthwhile to note that the conceptual feature selection, guided by a previous study of the problem domain and the analysis of a domain expert, is essential to the success of the LDM process. The relevance of the attributes that compose the database to the explored problem enriches the discovered knowledge; in addition, this selection results in a reduction of dimensionality that facilitates both the computational processing of the data and the understandability of the extracted knowledge.

At the end of the data preparation process, the LDB underwent a reduction of about 90\% of its data volume, ending with the dimensions $r=4689$ records, $a=172$ attributes, and $t=6$ waves. All the records kept in the database represent respondents who characterize the target audience of the study (age equal to or greater than 50) and had a considerable weight value. Of the remaining attributes, after the eliminations by conceptual feature selection, the filtering by amount of information, and the variable merging procedure, the five categories of environmental aspects have the following representation: 30 economic variables, 47 social, 47 life quality, 33 physical health, 9 mental health, and 6 identification variables that do not fit any category (ID, gender, weight value, age, ethnicity, and country of birth).

The techniques adopted in the data preparation process follow traditional precepts from the literature; however, we discuss the special features of these techniques when used on longitudinal databases, which have a temporal aspect and serve a distinct purpose from common temporal databases. The difference in the database preparation is mainly due to the longitudinal information of the data, in other words, the fact that the values in the database have previous and subsequent versions. In the preparation of longitudinal databases, the recovery of missing values becomes more precise, the outlier study can facilitate the prediction of future tendencies in the database, and the feature selection has a larger impact on the success of the project. Regarding the KDD process as a whole, the evolutionary characteristic of the longitudinal database, of following a sample of records over time, brings new possibilities and challenges that need to be explored and studied more profoundly.

\section*{Acknowledgement}
The data were made available through the UK Data Archive. ELSA was developed by a team of researchers based at NatCen Social Research, University College London, and the Institute for Fiscal Studies. The data were collected by NatCen Social Research. The funding is provided by the National Institute on Aging in the United States, and a consortium of UK government departments co-ordinated by the Office for National Statistics. The developers and funders of ELSA and the Archive do not bear any responsibility for the analyses or interpretations presented here.

This work was conducted during a scholarship supported by the International Cooperation Program CAPES/COFECUB at the PUC-Minas University. Financed by CAPES – Brazilian Federal Agency for Support and Evaluation of Graduate Education within the Ministry of Education of Brazil.

\bibliographystyle{jidm}
\bibliography{jidmb}

\begin{received}
\end{received}

\end{document}
