\section{Representative Results} \label{sec:results}

Now we report the results of  experiments with objects in the {\it test sets}, comparing the results of all analyzed methods, i.e., the state-of-the-art unsupervised $LATRE$+$wTS$ heuristic and the eight L2R-based strategies $RF$, $MART$, $\lambda$-$MART$, $AdaRank$, $ListNet$, $RankSVM$, $GP$ and  $RankBoost$. Our comparison comprises both effectiveness (Section \ref{sec:effect}) and efficiency (Section \ref{sec:efficiency}) of the analyzed methods.  Results are averages of all 5 folds, along with 95\% confidence intervals.


\subsection{Recommendation Effectiveness} \label{sec:effect}

Tables \ref{tab:ndcg}-\ref{tab:recall} show results for NDCG, precision and recall, respectively, considering the top $k$=$5$ ranked tags. %Results for precision lead to the same conclusions and thus were omitted.
We start by noting that the best L2R-based strategies (i.e., $RF$ and $\lambda$-$MART$) outperform the state-of-the-art unsupervised $LATRE$+$wTS$ by up to 29\%, 23\% and 22\% in NDCG, precision and recall, respectively. These gains are even more expressive than the gains achieved by previously evaluated L2R methods \cite{belem_sigir2011}. They confirm  the benefits of exploiting supervised L2R methods for tag recommendation, %particularly $RF$, $MART$, and $\lambda$-$MART$,
allowing an automatic search for a solution that combines a larger number of attributes when compared to unsupervised heuristics such as  $LATRE$+$wTS$. %, which combines two of the main attributes for this problem, i.e., $Sum$ and $wTS$ \cite{belem_sigir2011}.


We now turn our attention to the comparison of the eight L2R-based strategies. Unlike  existing comparisons of different L2R techniques in other domains (e.g., document ranking) \cite{gomes_2013}, there is a clear winning group of methods ($RF$, $MART$ and $\lambda$-$MART$) in all 5 datasets, with a slight advantage of two of them ($RF$ and $\lambda$-$MART$). The gains  in NDCG  of the winner methods over the best of the remaining L2R techniques considered (i.e., either $GP$, $RankSVM$ or $RankBoost$) range from 4\% to 12\%. The corresponding gains in precision and recall reach 10\% and 11\%, respectively. %Thus, according to our results, recommendation functions based on an ensemble of decision trees are the most effective in the tag recommendation domain, among all strategies we analyzed.
  These results confirm the effectiveness of methods based on an ensemble of decision trees, which are non-linear L2R strategies that have been shown to be effective and competitive in other studies \cite{Mohan10,Friedman00}.     %This happens because (1) decision trees are a relatively effective learning approach, even when used in isolation, and (2) exploiting an ensemble of multiple and/or randomized trees avoids overfitting, increasing the generalization power of the generated recommendation functions.

The second group of methods is formed by the L2R strategies previously exploited in tag recommendation: $GP$, $RankSVM$ and $RankBoost$. $GP$ is the most flexible strategy, allowing a wider range of types of recommendation functions (any function formed by the considered operators and attributes). However, this can be also a disadvantage because the search space is larger when compared to the search space of other methods, making it more difficult to find the best function. On the other hand, the shape of functions produced by $RankSVM$ is pre-defined by the kernel function, which was set linear here (as this led to the best results), and thus all $RankSVM$-produced functions consist of linear combinations of the attributes. Although $RankBoost$ is composed by simpler weak rankers (defined by single attribute), it achieved  results similar to $RankSVM$, since its ensemble strategy also produces a linear combination of the attributes. %The difference is that $RankBoost$ focus on the ``hardest'' training samples when building each weak learner.
%However, we note that $RankBoost$ is the most parameter-sensitive strategy in this group  of methods, {\bf as discussed in Section \ref {subsec:param}}.


%Tentar explicar porque ListNet e AdaRank foram ruins?     


Comparing the general results across  different datasets, we note that the best results are for  YahooVideo, while MovieLens (which is considerably smaller than the other datasets) and Bibsonomy present the worst results. The observed differences are possibly due to the number of tags per object, which tends to be smaller in MovieLens and Bibsonomy than in YahooVideo objects. For example,  48\% and 73\% of our YahooVideo and YouTube objects have fewer than 10 tags, against 94\%, 88\% and 76\% of the objects in  Bibsonomy, LastFM,  and   MovieLens, respectively.   Moreover, these results may also be due to differences in tagging behavior:  on YahooVideo and YouTube, tags tend to appear in other textual features of the same object more often than on LastFM \cite{belem_sigir2011,IPM_flavio} and the other datasets, which facilitate the recommendation of relevant tags exploiting some of the considered tag relevance metrics (e.g., $wTS$).

\input{table_results}


%Im sum, we found that (1) %that, unlike  existing comparisons of different L2R techniques in other domains (e.g., document ranking) \cite{XX}, for tag recommendation, there is a clear winning group of methods ($RF$, $MART$ and $\lambda$-$MART$), with a slight advantage of two ($RF$ and $\lambda$-$MART$) over the other, with gains in precision ranging from 5\% to 11\% over the best of the remaining  L2R techniques considered

In sum, we found that (1) L2R based strategies can significantly outperform state-of-the-art unsupervised heuristics, and (2) $RF$, $\lambda$-$MART$ and $MART$ are the best L2R strategies out of the eight analyzed techniques, providing further gains over previously evaluated L2R-based strategies.


\subsection{Recommendation Efficiency} \label{sec:efficiency}

Now we show results for the efficiency of all analyzed methods in terms of their online recommendation time. 
Table \ref{tab:time} shows the average recommendation  time per object,  in \textit{milliseconds}, broken down into each of the 4 recommendation stages  (see Section \ref{sec:metodologia}).  In particular, we distinguish the metric computation (MC) stage for the L2R techniques, which involve the computation of all metrics used\footnote{The exception is $Sum(\ell=3)$, whose computation time is included in the lazy rule generation (LRG) step, for all (supervised and unsupervised) methods analyzed.}, and the MC stage for $LATRE$+$wTS$, which consists of the computation of only $wTS$. Table \ref{tab:total_time} shows the total recommendation time  (sum of all stages) per object, on average, in \textit{seconds}.

%, which is an important performance measure since the user must wait for recommendations to be provided, and may abandon the system if the waiting time is unacceptable. Training (learning the recommendation functions and best parameters) can be performed completely offline, and thus we will focus on online recommendation time.
%All experiments were performed in a 16-core 2.40GHz Intel(R) Xeon processor, with 50GB RAM.


%For all analyzed methods, we can divide the recommendation time in four stages: (1) on-demand association rules generation, (2) attribute values computation, (3) application of the (learned) model to score candidate tags, which depends on the complexity (e.g., number of nodes of decision trees) of the generated model, and (4) sorting of the candidate tags according to the generated scores. All stages can be somehow benefited by data storage in cache (e.g., pre-computed attribute values and association rules \cite{menezes2010}, or even the final ranking of tags for a given object). However, we here focus on the worst case (a superior limit of execution time) assuming that the cache will be always empty, and thus all metrics and necessary association rules will be computed in recommendation time. As such, the ranking (application of the recommendation function and sorting of candidate tags) will be also always performed. The evaluation of caching strategies is complex because it requires a real system running (or some model of it), and thus we leave as future work.

%We measured CPU time for the whole batch of executions for all objects in each fold. Thus, reported results are averages over the 5 folds. >> VAI PARA EVALUATION METHODOLOGY.

\input{table_time}


As expected, the simplest unsupervised method ($LATRE$+$wTS$) takes less time than the others, since it requires only the computation and sum of two metrics -- $Sum$ and $wTS$ -- besides the on-demand association rule generation and candidate sorting, which are common steps to all methods. Note however that, in terms of total recommendation time, the use of several of the analyzed L2R strategies only incur in a small  additional cost (under 3\%) in comparison to $LATRE$+$wTS$. %Tem alguns casos que chega a 300% a mais (mas a diferenca absuluta eh pequena), nao sei se devo reportar. 
 

%Despite  some variation in the ranking time among the different L2R methods, we found that feasibility of the L2R approach for the tag recommendation task

Comparing the recommendation time of the L2R strategies, we found that, despite some variation, the most efficient models are those generated by $RankBoost$-based strategies, followed by $RankSVM$, $ListNet$ and $GP$. The three winner methods in terms of effectiveness come next. In any case, the average  recommendation time of all methods, in a worst case scenario (without any pre-computed metric or association rule),  is under 1.3 seconds per object. %, while the model application stage requires, in average, less than 0.1 seconds per object, on average. 


In conclusion,  the most cost-effective solution is the $MART$-based method, as it produces results very close to $RF$ and $\lambda$-$MART$ (e.g., the difference is under 1.3\% in NDCG), with a much shorter recommendation time (differences  
 range  from 185\% to 473\%). Thus, despite some variation across different methods, the L2R approach is feasible for the tag recommendation task. Moreover, L2R methods are flexible, allowing the inclusion of new attributes and addressing other aspects of the problem (e.g., personalization).


 %Moreover, since association rule computation is one of the most time consuming stages, the total recommendation time could be significantly reduced by a different parameter choice, i.e., decreasing the complexity (length) of the generated rules or increasing their minimum support/confidence thresholds \cite{belem_sigir2011}.


%Among the three winner methods, the most simple ($MART$) requires less time.
%Thus, it is worthy to use the best methods in terms of effectiveness (i.e. $RF$ and $\lambda$-$MART$) to recommend tags.
%Thus, considering also the effectiveness analysis we made in Section \ref{sec:effect}, the best cost-effective strategy is the XX-based method. 
%We note that the  average recommendation time of the best cost-effective L2R-based method, in a worst case scenario (without any pre-computed metric or association rule) is under XX seconds.