%Sim , trabalhet na seção 4. Novamente, não omita nada. Escreva a seção seguindo a estrutura proposta na seção 1. Comece pelos 3 baselines , suga com as novas 4 heurísticas e depois os ranking method. Só aí já tem 3 subseções, pelo menos. Deixe claro quais das 3 dimensões defn na intro cada método explora e quais métricas de qualidade, dentre as apresentadas na seção 3. ou seja, tem que REESCREVER a seção com base no novo discurso, e não necessariamente manter a mesma estrutura do Wsdm. Isto e mto importante

% - 3 baselines
% - 4 novas heurísticas
% - ranking methods


\section{Tag Recommendation Strategies} \label{sec:strategies}

In this section, we present the tag recommendation strategies analyzed in this work. We start by introducing  in Section \ref{sec:metrics} several metrics used to assess the relevance of candidate tags. Next, in Section \ref{sec:unsupervised}, we present a state-of-the-art unsupervised method, considered here a  baseline for comparison, which uses a simple linear combination of two relevance metrics as tag ranking function. In Section \ref{sec:ranking}, we present the   L2R techniques considered here,  and describe how we apply them to automatically ``learn'' good ranking functions by  combining a larger number of relevance metrics.   


\subsection{Relevance Metrics for Tag Recommendation} \label{sec:metrics}

We here briefly present several metrics that can be used to estimate the relevance of a candidate term $c$ for tag recommendation. Most of them have been  previously applied for recommending tags \cite{belem_sigir2011,sigur2008ftr,menezes2010,heymann_sigir08}, and are used here as attributes by the L2R based methods.
Each metric fall into one of these categories:

\vspace{-0.25cm}

\begin{itemize}

\item \textit{Tag Co-occurrence Metrics}:  estimates how relevant  a candidate $c$ is  given a set of input tags that often co-occur with $t$ in the dataset.

\item \textit{Descriptive Power Metrics}: estimate how accurately candidate $c$ describes the object's content, which is important for information services that exploit object's semantics.

\item \textit{Discriminative Power Metrics}: estimate the capability of $c$ to distinguish the target object from others, which supports tasks such as separating the objects into semantic classes or into levels of relevance regarding a query. 

\item \textit{Term Predictability}: indicates the likelihood that  $c$ occurs as a tag. 

\end{itemize}

\vspace{-0.25cm}

Table \ref{tab:attrib} summarizes all considered metrics. More details can be found in \cite{belem_sigir2011,lipczak11}.

\input{table_attrib}


%\subsection{State-of-the-Art Baselines} \label{sec:baselines}

\subsection{Unsupervised Strategy} \label{sec:unsupervised}

We   consider as a baseline $LATRE$+$wTS$, a state-of-the-art unsupervised tag recommender that is the best heuristic approach proposed in \cite{belem_sigir2011}.  $LATRE$+$wTS$ generates candidate tags for a target object $o$ exploiting: (1) association rules (tags which co-occur with the tags already available in $o$),  and (2) terms extracted from other textual features. The scores assigned to these candidates are a linear combination of the values of metrics $Sum$ and $wTS$ (Section \ref{sec:metrics}):

\vspace{-0.4cm}

\begin{equation} \label{eq:latre_dp}
\begin{array}{l}
 LATRE+wTS(c, o, \ell, \alpha) = \alpha Sum(c, I_o, \ell) + (1 - \alpha) wTS(c, o)
\end{array} 
\end{equation}

Parameter $\alpha$ ($0 \le \alpha \le 1$) is used as a weighting factor. Note that $Sum$  is computed only over candidates generated from the association rules, whereas $wTS$ is computed for terms extracted from other textual features of target object $o$.


\subsection{Learning-to-Rank Based Strategies} \label{sec:ranking}


Our goal is to investigate the   benefits of applying L2R techniques to the tag recommendation problem. The basic idea
is to use such algorithms to ``learn'' a  {\it good ranking function} based on  a list $L_{m}$ of  relevance metrics. We here evaluate eight different techniques, namely, \textit{Genetic Programming} ($GP$), $RankSVM$, $RankBoost$, \textit{Random Forest} ($RF$), $MART$, $\lambda$-$MART$,  $ListNet$, $AdaRank$. Out of these, only the first three have been previously applied to the tag recommendation problem  \cite{cao_2009,wu_www09,belem_sigir2011}, whereas only the first two have been compared against each other \cite{belem_sigir2011}. Thus, the application  of the other five techniques to  tag recommendation as well as a thorough evaluation of all solutions are key contributions of this present work.
For all strategies, $L_{m}$ includes all metrics defined in Table \ref{tab:attrib} (Section \ref{sec:metrics}), which is a richer set of metrics compared to  those exploited in previous work \cite{cao_2009,wu_www09}. Particularly, metric $Sum$ is used with values $\ell$=$1$ and $\ell$=$3$, generating thus two attributes: $Sum(c, o, 1)$ and $Sum(c, o, 3)$.

%The following sections introduce our proposed strategies, which exploit the traditional RankSVM method (Section \ref{sec:svm}) and the Genetic %Programming (GP) framework (Section \ref{sec:gp}).

%All metrics are shown in Table \ref{tab_metrics}. In addition to the metrics defined in Equations \ref{eq:soma}-\ref{eq:h}, we also include other ranking functions proposed in \cite{sigur2008ftr}, called $Vote$, $Vote^+$, $Sum$ and $Sum^+$. $Vote(c, o)$ is the number of association rules whose antecedent is a term in $\mathcal{I}_o$ and whose consequent is the candidate term $c$. In other words, $Vote$ is the number of ``votes'' received by $c$ as relevant according to the considered association rules. $Vote^+$, similarly to $Sum^+$ (Equation \ref{eq:sum_plus}) is built from $Vote$ by weighting each vote by the Stability of both antecedent and consequent.


For all analyzed L2R strategies, the set of candidate tags  $C_o$ for each object $o$ is the same produced by our unsupervised strategy $LATRE$+$wTS$, including all terms generated by association rules 
and all terms extracted from other textual features. For each candidate $c \in C_o$  for   object $o$,  we compute all metrics in $L_{m}$ using the training set $\mathcal{D}$  and the  other textual features associated with $o$.
Each  candidate $c$ is then represented by a vector of metric values $M_c \in \mathbb{R}^a$, where $a$ is the number of considered attributes (metrics). 
We also assign a binary label $r_c$ to each candidate $c$ for each object $v$ in validation set $\mathcal{V}$, indicating
whether $c$ is a relevant recommendation for $v$ ($r_c$=$1$) or not ($r_c$=$0$), based on the contents of  $Y_v$, the gold standard for tag recommendation. 
Note that we use training set $\mathcal{D}$ only to extract the association rules and to compute the metrics, relying on validation
set $\mathcal{V}$ to learn the solutions (see discussion in Section \ref{sec:metodologia}).  Thus, we only
assign labels $r_c$ for objects in $\mathcal{V}$.  The learned model, i.e., the ranking function $f(M_c)$, is then applied 
to each candidate $c$ for each object $o$ in the test set. All 
reported results are based on the performance on the test sets. 

Next, we describe the analyzed L2R techniques, and how they were applied  to tag recommendation. %They can be categorized into three approaches: pointwise, pair-wise and list-wise. While the former two have the relevance measure built-in, the latter allows this function to be arbitrarily defined. In this paper, we evaluate the most competitive algorithms from each of the three classes above.


\subsubsection{RankSVM Based Strategy} \label{sec:svm}

 RankSVM is based on the state-of-the-art  Support Vector Machine (SVM) classification method \cite{svm}.  
%To avoid overfitting which could occur if we learned the model in the same set from which association rules and metrics were extracted, mainly because the metrics derived from the rules could be over-inflated, we use an additional validaC_otion set $V$ to learn the model \footnote{\small{We did observe this overffiting problem in some initial tests, with the learned models having lower generalization capabilities.}}. 
 We use the SVM-rank tool\footnote{\url{http://www.cs.cornell.edu/People/tj/svm_light/svm_rank.html}} to learn a function $f(M_c)$=$f(W,M_c)$, where
$W$ = $<w_1, w_2, ... w_m>$  is a vector of weights associated with the considered metrics (i.e., $W \in \mathbb{R}^m$). $W$ is learned by a maximum-margin optimization method that tries to find a hyperplane, defined by $W$, that best separates the ``closest" candidate tags (represented by their metric vectors in $\mathbb{R}^m$) belonging to two different \textit{levels of relevance} (i.e., relevant and irrelevant) assigned to each object-candidate pair in the training. They are employed to produce ranking statements (i.e., relevant tags must precede irrelevant ones), which in turn are used as input to the RankSVM learning process. 
 At recommendation time, $f(M_c)$ is used to rank all  candidates for target object $o$ according to their relative distances to the separating hyperplane. 
%when predicting tags for a test object $o$, $f(M_c)$ is used to estimate the relevance of a given candidate $c$, represented by metric vector $M_c$, generated using, among other things, the contents of $\mathcal{I}_o$. These estimates are used to  rank all candidates for $o$. Top ranked tags are selected as recommended tags. } 
RankSVM has 2 key  parameters:  the type of kernel function, which indicates the structure of the solution function, and 
cost $j$,  which controls the penalty  to classification errors in the training process. %RankSVM is still a very competitive performer when considering average results across all collections in the LETOR 3.0 L2R benchmark\footnote{\scriptsize{http://research.microsoft.com/en-us/um/beijing/projects/letor}}. 

\subsubsection{Random Forest Based Strategy} \label{sec:rf}

The Random Forest (RF)  algorithm \cite{Breiman01} is an ensemble method that combines a collection of decision trees. The learning of each decision tree in the ensemble happens in a recursive way: first, the most discriminative attribute (according to some measure, such as Information Gain) is selected as a decision node. The selected  candidate tags are split according to a split value (e.g., average attribute value), and the process repeats in a top-down fashion 
% $l-1$ times to form a tree with $l$ terminal nodes.  {\bf FABIANO: Isto está certo? Eu não vi como a profundidade da arvore eh igual ao numero de folhas - 1}
to form a tree with $l$ terminal nodes, where $l$ is a tuning parameter.
Once the decision tree is built, it can 
assign a real-valued score as output  for an unseen (test) candidate tag.

The $RF$ method exploits the \textit{bagging} ensemble technique, i.e., each tree within the forest is built with a different bootstrap sample drawn from the original set of pairs $(M_c, r_c)$ that represents each candidate tag $c$ for the considered objects in our dataset, where, as discussed before, $M_c\in \mathbb{R}^m$ and $r_c \in \{0,1\}$.
% vectors $M_c$ corresponding to all candidates $c$ for the considered objects. 
The attribute selection for each split in a tree is conducted on a randomly selected subset of attributes, instead of on the full attribute set, as usually done in traditional decision tree algorithms. Each leaf in each tree corresponds to an output score to be assigned to a candidate tag. Once the forest is built, for each tag candidate $c$ in a target object $o$, the scores given by each tree to $c$ are averaged and used to produce the final ranking of candidates.

Besides $l$, the number $T$ of trees to grow and the number of attributes $m$ to consider when splitting each node are tuning parameters in RF. We note that, 
although each decision tree  may suffer from overfitting, the aggregation of a larger number of low-correlated trees can mitigate this problem. The generalization error of a RF depends on both the correlation between trees in the forest and the strength of each individual tree. The more correlated each tree is, the higher the error rate becomes. The stronger each individual tree is (high accuracy), the lower the error rate becomes. By increasing $m$ or lowering $l$, both the correlation and the strength of each tree increases. By lowering $m$ or increasing $l$, each tree becomes more independent (less correlated), but also becomes weaker at the same time. Thus, there exists some optimal values of $m$ and $l$ that provide the optimal balance between the correlation and the strength to get the minimum generalization error. We set those parameters using the validation set, as described in Section \ref{sec:metodologia}. The implementation of RF and the next four approaches were provided by the RankLib learning to rank tool\footnote{\url{http://people.cs.umass.edu/~vdang/ranklib.html}}.

RF has been shown consistently effective and competitive in several real world benchmarks \cite{Mohan10}. Some of its strengths are  its insensitivity to parameter choices, resilience to overfitting, and  high degree of parallelization due the fact that single decision trees  are built independently from others, thus making RFs inherently parallel.


\subsubsection{MART Based Strategy} \label{sec:mart}

Similarly to $RF$, $MART$ (Multiple Additive Regression Trees) \cite{Friedman00} combines multiple decision trees. However, unlike $RF$, $MART$ exploits the \textit{boosting} ensemble technique, which is based on the idea that it is easier to find and average many weak learners, than to find a single, highly accurate predictor. 
%Related techniques -- including bagging, stacking and model averaging -- also build, then merge results from multiple models.  %but boosting is unique because it is sequential: it is a forward, stage-wise procedure. 
 Boosting is an iterative method that produces and combines different \textit{weak learners}. In each of $i$ iterations, it increases emphasis on training instances which were not well modeled in the previous iteration (i.e., candidate tags in the training set which were not correctly ranked by the model built in the previous iteration, in our case). Increasing the weight given to these ``harder'' training examples, it potentially improves the final model, which is a linear combination of the weak learners, with weights defined by the learning rate $lr$, a tuning parameter.
Each weak learner is a decision tree with  $l$ leaves. However, unlike $RF$, $MART$ does not build randomized trees, i.e., it considers the whole training set and the complete set of attributes.


\subsubsection{ $\lambda$-MART Based Strategy} \label{sec:lambdaMART}
$\lambda$-MART \cite{Wu10}, the winning approach at the Yahoo! Learning to Rank Challenge \cite{Chapelle11},  is  a combination of  MART and the  $\lambda$-Rank ranking model, which tries to directly optimize the value of an evaluation metric. 
The main difference between $\lambda$-MART and  MART  is that $\lambda$-MART learns which  candidate   in a pair of candidate tags must appear first in the ranking. In order to learn these pairwise preferences, $\lambda$-MART uses the gain in Normalized Cumulative Discounted Gain (NDCG) \cite{MIR} from swapping the rank positions of candidates in any given pair of candidate tags for the same object $o$.  Thus, unlike MART, the training is made by considering that each candidate tag is in the set of candidate tags $C_o$  for an object $o$.  Similarly to MART, the learning rate $lr$, the number of terminal nodes $l$ and the number of iterations $i$ must be specified.


\subsubsection{ListNet Based Strategy} \label{sec:listnet}
The goal of the learning in ListNet \cite{Cao07} is  minimizing ranking errors, rather than minimizing errors in classification of pairs of tags or building regression models.
ListNet is based on comparing the probability distribution of permutations of lists of candidate tags. Specifically, for a set of candidate tags to an object $o$, ListNet first  defines a permutation probability distribution based on the scores produced by a ranking function for each  candidate tag. It then defines another distribution based on the ground truth labels, and measures the  inconsistency between these two distributions.
 %the cross entropy between the two distributions is used to define the loss function that 
%a function is adjusted to measure the inconsistency between the output of the ranking function and the ground truth permutation. 
The ranking function is defined as a neural network model. Thus,  it is possible to iteratively adjust this ranking function according to the measured inconsistency. This adjustment is done  at a pre-defined learning rate $lr$ during $i$ iterations.  %Notice that ListNet is similar to another ranking strategy, called RankNet \cite{Burges05}. The only major difference lies in that the former uses tag candidate lists as instances while the latter uses pairs of candidate tags as instances. 

%{\bf JUSSARA: Nao entenddi. Ele vai ajustando ate convergir ou ate um limite de iteracoes i. Como esta acima parece que ele TEM que convergir depois de i iteracoes. Nao faltou adicionar OR UNTIL a number of iterations $i$???  Removi a referencia a RankNet pois nao ajuda e ficou confuso. %Documento???? o que e isto no nosso contexto.  }


\subsubsection{AdaRank Based Strategy} \label{sec:adarank}

%AdaRank  manages to directly optimize target evaluation metric. 
The basic idea of AdaRank \cite{Xu07} is to plug a selected evaluation metric into the {\it boosting}  framework and directly optimize this metric. Specifically, it repeatedly builds weak rankers on the basis of re-weighted training of sets of candidate tags, i.e.,  each set of candidate tags $C_o$ for an object $o$ receives a weight. The weak rankers are linearly combined to make ranking predictions. 
 %A  weak ranker is just one of the metrics described in \ref{tab_metrics}, which can rank the candidate tags  obtaining the optimal performance among all of the metrics in the weighted objects. 
Each attribute in Table \ref{tab:attrib}, in isolation, is used as a weak ranker. The selected weak ranker in each iteration of the algorithm corresponds to the attribute that leads to the best performance over the weighted objects, measured by a given evaluation metric (NDCG, in our case). 
AdaRank runs for $i$ iterations. At each iteration, AdaRank maintains a distribution of weights over the objects in the training data. Initially, AdaRank sets equal weights to the objects. At each iteration, it increases the weights of those objects whose tags are not ranked well by the model created so far. As a result, the learning at the next iteration will be focused on the creation of a weak ranker that can work on the tag ranking of those ``hard'' objects. 


\subsubsection{RankBoost Based Strategy} \label{sec:rankboost}
Similarly to AdaBoost, RankBoost \cite{Freund03} also adopts the boosting framework, i.e., it trains one weak learner at each  iteration, and
learns a linear combination of weak rankers as the final ranking function. Each weak ranker consists of a single attribute (one of the metrics in  Table \ref{tab:attrib}) and a threshold that best distinguishes between relevant and non-relevant candidate tags. 
However, similarly to RankSVM, RankBoost operates on pair of tags. After each iteration, the tag pairs are re-weighted: it decreases the weight of correctly ranked pairs and increases the weight of wrongly ranked pairs.  As a result, the learning at the next iteration will be focused on dealing with pairs which are more difficult to rank. The total number of iterations $i$ is a tuning parameter.

%{\bf Jussara: note que mudei nome do parametro \# de iteracoes. Era t passou pra i para consistencia com metodo anterior.}

\subsubsection{GP Based Strategy} \label{sec:gp}


Genetic Programming (GP) is a framework inspired by the biological mechanisms of  genetic inheritance  and evolution of individuals in a population \cite{banzhaf97}.   GP implements a global search mechanism by evolving a population of individuals over multiple generations. 
Each individual, representing a possible solution for the target problem (a tag ranking function), is modeled as  a tree composed of terminals (leaves) and operators (inner nodes), related to the target problem.
In our case, terminals are constants (uniformly distributed between 0 and 1) and attributes (metrics in  Table \ref{tab:attrib}), while the inner nodes are operators sum, subtraction, multiplication as well as  protected division and logarithm (so that they return the default value 0 if their inputs are out of their domains). 
In each generation, each individual is evaluated by a {\it fitness} function, defined based on quality  metrics related to the problem at hand.  We here use the NDCG metric as fitness function. 
Only individuals with the highest fitness values are selected, according to some selection method (we adopt the tournament selection, i.e., selecting the best individual among $k$ randomly chosen individuals), to evolve the population. 

% Only individuals with the highest fitness values are selected to transmit their characteristics to future generations through  {\it crossover} and {\it mutation} operations.  

%This selection is used to evolve the population. % as described next. 


An initial randomly generated  population is evolved in a number of generations, through {\it crossover} and {\it mutation} operations. 
The crossover operation, performed on two selected individuals with probability $p_c$, is implemented by randomly choosing one node of each tree representing a selected individual  and exchanging the subtrees below them.  It aims at combining good solutions towards a more promising one. The mutation operation, on the other hand, adds new individuals (solutions)  to the population, thus increasing the diversity in it. This is useful, for instance, to avoid being trapped in local optima. With probability $p_m$,  the mutation of a selected individual is done by first randomly choosing one node of its tree, and then replacing the subtree rooted at it by a new randomly generated subtree, without exceeding a maximum tree depth $d$.   Note that population size $n_p$ is kept fixed through all generations.
%In the crossover, the ``genetic" contents of two individuals, chosen according to the adopted selection method,  are exchanged to build a new individual. This exchange occurs with probability $p_c$. It is done by randomly choosing one node of each tree representing  a selected individual and exchanging the subtrees below them. The crossover operation combines good solutions towards  a more promising solution in the search space.  The mutation operation, on the other hand, adds new individuals (solutions), thus increasing the diversity in the population. This is useful, for instance, to avoid being trapped in local optima. With probability $p_m$,  the mutation of a selected individual is done by first randomly choosing one node of its tree, and then replacing the subtree rooted at it by a new randomly generated subtree, without exceeding a maximum tree depth $d$.   Note that the population size $n_p$ is kept fixed through all generations.
This process continues until a target fitness value $f^{t}$ or a maximum number of generations $n_g$ is reached. The individual with the best fitness value, usually part of the last generation, is chosen as the final solution for the problem.


GP is a non-linear method that has been applied to various IR tasks. We were the only  to use it for recommending tags \cite{belem_sigir2011,belem_ecir2013}, having obtained competitive (or superior) results over RankSVM. %GP directly optimizes a target (fitness) function (e.g., precision).% and allows for easy extensions to include  more problem-related attributes (terminals) and to address other aspects of the target problem (by, for instance, adapting the fitness function).


%The Fitness of an individual in this context represents the quality of the recommendations produced by the corresponding ranking function, which is here assessed in terms of the precision of the top-$k$ terms in the ranking of recommended terms\footnote{{\small We also experimented with other Fitness functions (e.g., Mean Average Precision - MAP) obtaining similar results.}}. Given a ranking function $f(M_c)$, let $\mathcal{Y}_o$ be the set of relevant tags for an object $o$ (identified by labels $Y_c$), $\mathcal{R}^f_o$ be the ranked recommendations produced by $f(M_c)$ for $o$, and $\mathcal{R}^f_{k, o}$ the top $k$ elements of $\mathcal{R}^f_o$. The quality of the recommendations produced by $f$  for  $o$ is measured as:

%\vspace{-0.2cm}
%\begin{equation} \label{eq:precision}
%\small
% P@k(\mathcal{R}^f_o, \mathcal{Y}_o, f) = \frac{|\mathcal{R}^f_{k, o} \cap \mathcal{Y}_o|}{min(k, |\mathcal{Y}_o|)}
%\end{equation}


%The $min$ operator guarantees that the denominator does not exceed $|\mathcal{Y}_o|$. 

%The Fitness (i.e., quality) of  ranking function $f(M_c)$ is then computed as  the average $P@k$ over all recommendations produced by $f(M_c)$ in a sample of objects of size $s$, extracted from validation set $\mathcal{V}$. % where $s$ is a parameter to be set.


%\subsubsection{Combining GP and LATRE}
%
%Our last recommendation strategy, called  $GP+LATRE$, exploits the best ranking method according to our experiments, GP, together with the %best co-occurrence metric, computed by $LATRE$. Thus, the set of candidate tags is composed by all candidate tags generated by $LATRE$, %$Sum^+$ and extracted from textual features. All the evolutionary process is the same as presented in the Section \ref{sec:gp}, while the exploited %metrics are all metrics in Table \ref{tab_metrics} plus the score produced by $LATRE$.


The  main  features of the above L2R techniques are summarized in Table \ref{tab:l2rmethods}.


\input{tab_methods}

%\begin{table}[ht]
%\label{tab:l2rmethods}
%%\resizebox{20cm}{!} {
%    \begin{tabular}{|l|l|p{4cm}|p{3cm}|p{3cm}|}
%    \hline
%    Approach    & Ensemble & Input data                                                 &  Directly optimize the evaluation metric & Generated %model                         \\\hline
%    RankSVM     & -        & Pairs or candidate tags for the same object $o$.           & no                                      & %Hiperplane                             \\\hline
%    RF          & bagging  & Set of candidate tags                                      & no                                      & Set of ``randomized'' %regression trees \\\hline
%    MART        & boosting & Set of candidate tags                                      & no                                      & Set of boosted %regression trees        \\\hline
%    $\lambda$-MART & boosting & Sets of candidate tags $C_o$ for  for the same object $o$. & yes                                     & Set of boosted %regression trees        \\\hline
%    Listnet     & -        & Sets of candidate tags $C_o$ for  for the same object $o$. & no                                      & Neural %network                         \\\hline
%    Adarank     & boosting & Sets of candidate tags $C_o$ for  for the same object $o$. & yes                                     & Sets of weighted weak %rankers          \\\hline
%    Rankboost   & boosting & Pairs or candidate tags for the same object $o$.           & no                                      & Sets of weighted weak %rankers          \\\hline
%    GP          & -        & Pairs or candidate tags for the same object $o$.           & yes                                     & %~                                      \\\hline
%    \end{tabular}
%%}
%\end{table}
%