\section{Experimental Setup} \label{sec:exp}

In this section, we describe the datasets used to evaluate the tag recommendation strategies (Section \ref{sec:dados}), 
our evaluation methodology (Section \ref{sec:metodologia}), and how we parameterized each strategy (Section \ref{subsec:param}).
%and discuss a set of representative results (Section \ref{subsec:res}).


\subsection{Data Collections} \label{sec:dados}


%YouTube:
%400,000 users were collected between 27/08/2009 and 02/09/2009. The metadata of the videos posted by these users was then collected, from 04/09/2009 to 13/09/2009, totaling more than 9 million videos

%Collection strategy: snowball sampling on the user graph. Which users were used as seeds: we still have to check; apparently they were the users who received the most subscriptions

%LastFM:
%87,960 users were collected between 05/09/2009 and 12/09/2009. Then, from 14/09/2009 to 18/09/2009, metadata was collected for the 2,758,992 artists listened to by the collected users.

%Collection strategy: same as for YouTube, following the friendship graph.


We evaluate the tag recommendation methods on five datasets, each containing the \textit{title}, \textit{tags} and \textit{description} associated with real objects
from  Bibsonomy, LastFM, MovieLens, YouTube and YahooVideo. %The Bibsonomy, LastFM and YouTube datasets also include the set of tag assignments ($\mathcal{P}$) \footnote{On YouTube, only the video owner can assign tags to it.} thus allowing the evaluation of object-centered and personalized tag recommendation methods. The YahooVideo dataset, in contrast, does not identify the user who assigned each tag, and thus is here used only  in the evaluation of object-centered methods.
The Bibsonomy dataset is a recent snapshot of the system comprising 543,872 objects (BibTeX records of publications), obtained on January 1, 2012. It is publicly available\footnote{\url{http://www.kde.cs.uni-kassel.de/bibsonomy/dumps}} and has been used in several previous studies \cite{lipczak11,pairwise2010,yin_wsdm2013}. The MovieLens dataset, also available online\footnote{\url{http://www.grouplens.org/taxonomy/term/14}}, contains 10,000 objects.
The LastFM and YouTube datasets were collected\footnote{Visit \url{http://vod.dcc.ufmg.br/recc/} for information on data availability.} in August 2009 using \textit{snowball sampling}. That is, starting
 from a set of seed users (the most popular ones), the crawler recursively collected the objects posted by them and followed their social links to other users, collecting the objects posted by those users as well. Our datasets include the textual features %and tag assignments
 associated with 2,758,992 LastFM artists and more than 9 million YouTube videos.
The YahooVideo dataset was also gathered by snowball sampling, but using the most popular objects as seeds and following links to related videos.
 %\footnote{We adopted a slightly different sampling strategy for LastFM and YouTube,  exploiting users and social links as opposed to videos and related video links, as we use the collected datasets to evaluate personalized recommendation strategies. As  YahooVideo does not publish per-user information on tag assignment, we chose not to  crawl that application again, thus relying on our previously gathered dataset and evaluating it only for object-centered recommendation.}.
It was gathered in October 2008, and contains the features of 160,228 objects.  
%The YahooVideo dataset was the same used in \cite{belem_cikm2010}, whereas the other two  were collected more recently and are significantly larger than the collections analyzed in that work\footnote{Visit \url{http://vod.dcc.ufmg.br/recc/} for information on data availability.}. 


We considered only objects with textual features in English, removed stopwords, and used the Porter stemming algorithm\footnote{\url{http://tartarus.org/~martin/PorterStemmer/}} to strip suffixes from the words in the collected features. Stemming was performed to avoid trivial recommendations (e.g., plurals and other simple variations of the same word). %We also removed stopwords, as well as terms that are either too frequent (with more than 100,000 occurrences in the dataset) or too rare (with fewer than 30 occurrences), as such terms are hardly good recommendations \cite{sigur2008ftr}.


\subsection{Evaluation Methodology} \label{sec:metodologia}

Recall that our goal is to compare alternative L2R techniques for tag recommendation in terms of both effectiveness, which relates to the quality of the recommended tags, and efficiency, measured by the time required to produce a recommendation.
We focus on relevance as the main criterion of quality, leaving other aspects, such as novelty and diversity, to future work \cite{belem_ecir2013}.


As in \cite{pairwise2010,menezes2010,yin_wsdm2013}, we adopt a fully automated methodology to evaluate the effectiveness of these techniques: we use a subset of the tags previously assigned to the target object $o$ as a \textit{gold standard}, i.e., as the relevant tags for $o$. As such, these tags are withheld from the recommenders and disregarded when computing the metrics of relevance. We adopted this methodology because the manual evaluation of tag relevance is a very expensive process that may be affected by the subjectivity of human judgments. We note that the results obtained with this methodology represent lower bounds, as some of the recommended tags, although not in the gold standard, might still be considered relevant for the given object or user.
Following this methodology, 
for each tuple $\langle I_o, F_o, Y_o\rangle$ in the test and validation sets, we randomly select half of its tags to be included in $I_o$. The other half are included in $Y_o$, the gold standard for $o$. We also use title and description as textual features in $F_o$, for each object $o$. %Similarly, for personalized recommendation, for each tuple $\langle I_o, F_o, Y_{o, u}\rangle$ in the test and validation sets, half of the tags assigned by the target user $u$ to the object $o$ are included in $I_o$ and the other half in $Y_{o, u}$. Tags assigned by other users to object $o$ (i.e., $I_{o,u'}$ for $u' \neq u$) are also used as input, being included in $I_o$. %Evaluation metrics are described in Section \ref{sec:evaluation_metrics}. 
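
To illustrate this split with a hypothetical example (the tags below are invented for illustration only), let $T_o$ denote the full set of tags originally assigned to $o$, a symbol introduced here just for this example. When $|T_o|$ is even, the procedure produces two sets such that
\[
I_o \cup Y_o = T_o, \qquad I_o \cap Y_o = \emptyset, \qquad |I_o| = |Y_o| = \frac{|T_o|}{2}.
\]
For instance, an object with $T_o = \{\textit{rock}, \textit{indie}, \textit{live}, \textit{guitar}\}$ might yield $I_o = \{\textit{rock}, \textit{live}\}$, which is given as input to the recommender, and $Y_o = \{\textit{indie}, \textit{guitar}\}$, against which its output is evaluated.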


We sampled $N$ tuples from each dataset, with $N$ being 150,000 for YouTube, LastFM and Bibsonomy, 120,000 for YahooVideo, and 6,500 for MovieLens. Each sample was divided into five equal-sized portions, which were used in a five-fold cross-validation. That is, three portions were treated as the training set ($\mathcal{D}$) and used for extracting association rules and computing all metrics. A fourth portion was used as the validation set ($\mathcal{V}$), which, in turn, was considered a part of the training set, being used to ``learn'' the solutions and to tune the parameters of all recommendation methods. The last portion was used for testing.
Note that our use of five-fold cross-validation for the L2R-based strategies differs slightly from the traditional one: we learn the ranking function and select parameter values on the validation set $\mathcal{V}$. We do so to avoid overfitting, which could occur if the solutions were learned on the same set from which association rules and metrics were extracted (i.e., the training set $\mathcal{D}$), as metrics derived from the rules could be inflated\footnote{The overfitting problem was observed in initial tests, with the learned solutions having lower generalization capabilities.}.
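
As a concrete illustration of the resulting set sizes, consider the datasets with $N = 150{,}000$ sampled tuples. Each of the five portions then holds
\[
\frac{N}{5} = \frac{150{,}000}{5} = 30{,}000
\]
tuples, so that, in each fold, $\mathcal{D}$ contains $3 \times 30{,}000 = 90{,}000$ tuples, while $\mathcal{V}$ and the test set contain $30{,}000$ tuples each. The same arithmetic gives portions of $24{,}000$ tuples for YahooVideo ($N = 120{,}000$) and of $1{,}300$ tuples for MovieLens ($N = 6{,}500$).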


We argue that our experimental design is fair because:
(1) we do not use any privileged information from the test set, on which the results of all methods are reported; (2) all parameters are tuned on the same validation set; (3) our collections are large enough for effective learning; %, even when we conduct cross-validation in $\mathcal{V}$ for parameterization (e.g., results in these preliminaries experiments are basically identical to the ones reported in the test with the discovered parameters);
and (4) the very tight confidence intervals reported in our results
 are evidence of low variation and thus of learning convergence. Moreover, having a large amount of training data to generate the
tag co-occurrence rules can help increase the coverage of the rules (i.e., more co-occurrences can potentially be found), thus generating
more candidates and more precise metric values. This benefits all methods.


%\subsubsection*{Evaluation Metrics} \label{sec:evaluation_metrics}

%We now present the metrics used to evaluate the quality of the recommendations produced by all considered tag recommendation methods. They are also used by the GP framework, whose search process tries to directly maximize the considered evaluation metric, as described in Section \ref{sec:gp}.

%We now present the metrics used to evaluate the quality of the recommendations produced by all considered tag recommendation methods.


Our primary metric to evaluate the effectiveness of each method is the Normalized Discounted Cumulative Gain (NDCG) \cite{MIR}, computed over the ranked list of candidate tags produced for each object in the test set. We also measure \textit{precision}, i.e., the fraction of recommended tags that are relevant, and \textit{recall}, i.e., the fraction of the relevant tags for an object that are indeed recommended. All three metrics are computed over the top $k$ positions of the ranking, with $k=5$.
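
For concreteness, these metrics can be sketched as follows, assuming binary relevance; the symbols $Y$, $C_k$ and $rel(i)$ are notation introduced here for illustration. Let $Y = Y_o$ be the gold standard of object $o$, let $C_k$ be the top $k$ tags of the ranking produced for $o$, and let $rel(i) = 1$ if the $i^{th}$ recommended tag is in $Y$ and $rel(i) = 0$ otherwise. Then
\[
Precision@k = \frac{|C_k \cap Y|}{|C_k|}, \qquad
Recall@k = \frac{|C_k \cap Y|}{|Y|},
\]
\[
DCG@k = \sum_{i=1}^{k} \frac{rel(i)}{\log_2(i+1)}, \qquad
NDCG@k = \frac{DCG@k}{IdealDCG@k},
\]
where $IdealDCG@k$ is the value of $DCG@k$ when all top-$k$ candidates are relevant. As a hypothetical example with $k=5$, suppose that $|Y| = 4$ and that only the tags at positions 1 and 3 of the ranking are relevant. Then $Precision@5 = 2/5 = 0.4$ and $Recall@5 = 2/4 = 0.5$. Moreover, $DCG@5 = 1/\log_2 2 + 1/\log_2 4 = 1.5$ and $IdealDCG@5 = \sum_{i=1}^{5} 1/\log_2(i+1) \approx 2.948$, so that $NDCG@5 \approx 0.509$.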

%which is directly maximized by some methods presented in Section \ref{sec:strategies}. We also measured the \textit{precision} and the \textit{recall}. Precision is the fraction of recommended tags which is relevant, while recall is the fraction of the set of relevant tags for an object that were indeed recommended. %, whereas $nDCG$ considers the order in which tags are recommended, emphasizing ranking relevant tags higher \cite{MIR}. Specifically, let $Y$ be the set of relevant tags for object $o$ ($Y$ = $Y_o$), or, in the case of personalized recommendations, for the pair object-user $\langle o, u \rangle$ ($Y$ = $Y_{o, u}$). Let $C$ be the sorted set of recommendations generated by the method being evaluated, and $C_{k}$ the top $k$ elements in $C$. Recall in the first $k$ positions of the ranking is defined as: All evaluation metrics are computed over the top $k$ positions of the ranking, with $k$=$5$.

%\begin{equation} \label{eq:recall}
% Recall@k(C, Y) = \frac{|C_k \cap Y|}{|Y|}
%\end{equation}


%Let $DCG@k$ be the discounted cumulative gain in the first $k$ recommendations, defined as:

%\begin{equation} \label{eq:dcg}
% DCG@k(C, Y) = \sum_{i=1}^k \frac{rel(i)}{\log_2(i+1)}
%\end{equation}

%\noindent where $rel(i) = 1$ if the  $i^{th}$ candidate returned in $C$ is relevant (i.e, it is in $Y$), and $rel(i) = 0$ otherwise.
%The normalized discounted cumulative gain in the first $k$ recommendations, $NDCG@k$, is defined as:

%\begin{equation} \label{eq:ndcg}
% NDCG@k(C, Y) = \frac{DCG@k(C, Y)}{IdealDCG@k},
%\end{equation}

%\noindent where $IdealDCG$ is the value obtained for $DCG@k$ when all top-k candidates are relevant.


To evaluate the efficiency of the methods, we measured their online recommendation time, which is an important performance measure: since tag recommendation is an interactive task, the user may abandon the system if the waiting time is unacceptable. Model training, i.e., learning the recommendation functions and the best parameters, can be performed completely offline and is thus less of a concern.
All experiments were performed on a 16-core 2.40\,GHz Intel(R) Xeon processor with 50\,GB of RAM. We measured CPU time (user and system time) for the batch of executions consisting of all objects in the test set. Reported results are averages per object, computed over all five test sets.


For all methods, recommendation time is divided into four stages: (1) lazy association rule generation, (2) metric computation, (3) application of the (learned) model to score candidate tags, whose cost depends on the complexity of the generated model (e.g., the number of nodes in the decision trees), and (4) sorting of the candidate tags according to their scores. All stages could benefit from caching (e.g., of pre-computed attribute values and association rules \cite{menezes2010}, or even of the final ranking of tags for a given object). However, we here focus on the worst case (an upper bound on execution time), assuming that the cache is always empty. Thus, all metrics and necessary association rules are computed at recommendation time, and the ranking (application of the recommendation function and sorting of candidate tags) is always performed as well.
We leave the evaluation of the benefits of alternative caching strategies for future work.


\input{tab_param}


\subsection{Parameterization}\label{subsec:param}

We ran a series of experiments with the validation set $\mathcal{V}$ to find the best parameters of each method. 
%In each experiment, the method's parameters were  adjusted using cross-validation \cite{10fold} inside the set $V$.
We found our $RF$-based tag recommender to be largely insensitive to parameterization. The results obtained in our cross-validation experiments using different numbers of trees ($T$=300, 500, 1000) are statistically tied (with 95\% confidence) for all datasets. We thus set $T$=$300$, due to its lower cost. We also fixed the number $m$ of attributes selected in each split of the tree according to the default value originally suggested in \cite{Breiman01}, i.e., $m=\lfloor \log_2 (M + 1) + 0.5 \rfloor$, where $M$ is the total number of attributes. Although this default value has been reported to work well in practice \cite{liaw02}, we verified that other values, ranging from $0.25M$ to $0.75M$, do not significantly impact our results. The only parameter that (slightly) impacts results is the number of terminal nodes $l$. We used cross-validation to determine the best $l$ among values from $10^2$ to $10^3$, finding that the best choice is $l$=1000 for Bibsonomy and LastFM, and $l=300$ for MovieLens, YahooVideo and YouTube.
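
As a worked illustration of the default value of $m$, assume, purely for the sake of example, a feature set with $M = 15$ attributes (a hypothetical value, not necessarily the number of attributes used in our experiments). The default would then be
\[
m = \lfloor \log_2 (15 + 1) + 0.5 \rfloor = \lfloor 4.5 \rfloor = 4,
\]
i.e., four attributes would be selected in each split.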

The number of leaves $l$ also impacts $MART$ and $\lambda$-$MART$, the other tree-based approaches. We experimented with $l$ between 2 and 20, finding $l$=$5$ to be the best choice for all datasets. Since the results obtained with different numbers of iterations ($i=1500, 3000, 6000$) are statistically tied in all $MART$ and $\lambda$-$MART$ experiments, we set $i$=1500 in all experiments due to the lower cost. We also varied the learning rate $lr$ between 0.0001 and 0.2, finding that the best choice was 0.1 for all datasets. Greater values of $l$ and $lr$ do not improve effectiveness and make these methods very inefficient.

For $ListNet$, our results were not very sensitive to the $lr$ parameter. We tested values ranging from $10^{-7}$ to $10^{-1}$, finding that $lr=10^{-5}$ always led to the best results. We also varied the number of iterations $i$ between 10 and $10^3$, finding, as best choices, $i=160$ for MovieLens, $i=10$ for YahooVideo, and $i=40$ for the other datasets. Similarly, we tested ten values of $i$ between $10^2$ and $10^3$ for $RankBoost$ and $AdaRank$. For $AdaRank$, $i=300$ was the best choice for all datasets except YouTube, where $i=100$ was the best value. For $RankBoost$, the best value varied with the dataset: $i$ was set to 500 for Bibsonomy, 700 for LastFM and 300 for the other datasets.


Regarding $LATRE$+$wTS$, $RankSVM$ and $GP$, we adopted the best parameter values reported in \cite{belem_sigir2011} for LastFM, YouTube and YahooVideo, since the datasets are the same. For MovieLens and Bibsonomy, which were not included in that work, we followed the same methodology. Using cross-validation in $\mathcal{V}$, we found $j$=$100$ to be the best cost for $RankSVM$, for which we used the linear kernel. For $GP$, we set $n$=$200$ and $g=200$ (as in \cite{belem_sigir2011}), $k=2$, $d=7$, $p_c=0.6$ and $p_m=0.1$, as usually done in the literature \cite{banzhaf97}. Finally, for $LATRE$+$wTS$, the best parameter values are $\alpha$=$0.9$ for Bibsonomy and $\alpha=0.95$ for MovieLens. We set $\ell$=$3$ for the metric $Sum$ (for all methods), as in \cite{belem_sigir2011}.


We summarize our parameterization of all methods in Table \ref{tbl:param}. Unless otherwise noted, the same parameter value was used for all datasets.


%  \begin{table}[htttt]
%  \vspace{-0.5cm}
%  \centering
%  \caption{Best parameterization for each approach.}
%    \label{tbl:param}
%    \begin{tabular}{|l|p{12cm}|}
%    \hline
%    \textbf{Approach}    & \textbf{Parameter}    \\\hline
%    $LATRE$+$wTS$	& $\alpha=0.95$ for LastFM and MovieLens, $\alpha=0.9$ for Bibsonomy, YahooVideo and YouTube\\ \hline
%    $RankSVM$         & linear kernel, $j=100$\\ \hline
%    $GP$          & $n=200$, $k=2$, $d=7$, $pc=0.6$ and $pm=0.1$\\\hline
%    $RankBoost$   & $i=500$ for Bibsonomy, $i=700$ for LastFM, i=$300$ for MovieLens, YahooVideo and YouTube\\\hline
%    $RF$          & $T=300$, $m=4$, $l=1000$ for Bibsonomy and LastFM, $l=300$ for the remaining datasets \\\hline
%    $MART$        & $l=5$, $lr=0.1$ and $i=1500$\\\hline
%    $\lambda$-$MART$ & $l=5$, $lr=0.1$ and  $i=1500$ \\\hline
%    $AdaRank$     & $i=300$ for Bibsonomy, LastFM, MovieLens and YahooVideo,  $i=100$ for YouTube\\\hline
%    $ListNet$     & $lr=0.00001$, $i=160$ for MovieLens, $i=10$ for YahooVideo, $i=40$ for Bibsonomy, LastFM and YouTube. \\\hline
%    \end{tabular}
% \end{table}


%The number of iterations $i$ is also a parameter of RankBoost and AdaRank, for which it was chosen among values from 100 to 2000. A discussion of the parameterization of the remaining methods, RankSVM and GP, on our datasets can be found in \cite{belem_sigir2011}.

%It might be worth including the following here:
%For the RankSVM based strategy, we tested two kernel functions, namely Linear and Radial Basis Function (RBF), choosing the former, as it led to better results more efficiently. Using cross-validation in $\mathcal{V}$, we also varied cost $j$ between $10^{-3}$ and $10^{3}$, finding that the best choice was {\bf $j$=$100$}, in most cases. We also tried different strategies to normalize our attribute vectors, including L2-norm, z-score and the LETOR normalization procedure \cite{letor}, with no improvements. Thus, the results reported here refer to non-normalized data.
 
%For the GP based strategy, we experimented, using $\mathcal{V}$, with population sizes $n$ equal to $50$, $100$, and $200$, selecting $n$=$200$, as the  larger population allows a greater coverage of the solution space, leading to better results. For this population size, the algorithm converges (i.e., Fitness values stop improving) before 200 generations, value assigned to $g$. We fixed $k$=$2$, and  set  $d$=$7$,  $p_c$=$0.6$ and $p_m$=$0.1$, as usually done in the literature \cite{banzhaf97}.   Computing the fitness during the evolutionary process over all validation objects can be infeasible. Thus, we used a sample of $s$=$500$ of those objects, as this was enough to learn functions that are more effective than $LATRE$+$wTS$, with no improvements when using larger samples.


%\subsection{Representative Results} -------> moved to results.tex


%  \begin{table}[h!]
%    \small
%    \begin{tabular}{l|lllll}
%    ~           & Bibsonomy          &  Lastfm             & Movielens         & Yahoo               & Youtube          \\\hline
%    LATRE+wTS   &$ 0.438 \pm 0.002  $&$ 0.401 \pm 0.002  $&$	0.314 \pm 0.008 $&$ 0.733 \pm 0.003	 $&$ 0.489 \pm 0.002$ \\
%    SVM         &$ 0.456 \pm 0.003  $&$ 0.419 \pm 0.002  $&$ 0.346 \pm 0.006 $&$ 0.754 \pm 0.002   $&$ 0.517 \pm 0.003$ \\ 
%    GP          &$ 0.441 \pm 0.009  $&$ 0.450 \pm 0.006  $&$ 0.363 \pm 0.004 $&$ 0.755 \pm 0.005   $&$ 0.520 \pm 0.002$ \\
%    RF          &$ 0.502 \pm 0.003  $&$ 0.494 \pm 0.001  $&$ 0.386 \pm 0.006 $&$ 0.797 \pm 0.002   $&$ 0.543 \pm 0.002$ \\
%    MART        &$ 0.496 \pm 0.002  $&$ 0.489 \pm 0.001  $&$ 0.385 \pm 0.002 $&$ 0.792 \pm 0.002   $&$ 0.541 \pm 0.001$ \\
%    $\lambda$-MART  &$ 0.501 \pm 0.002  $&$ 0.493 \pm 0.002  $&$ 0.385 \pm 0.003 $&$ 0.797 \pm 0.002   $&$ 0.546 \pm 0.001$ \\
%    Adarank     &$ 0.454 \pm 0.003  $&$ 0.134 \pm 0.063  $&$ 0.180 \pm 0.149 $&$ 0.712 \pm 0.010   $&$ 0.440 \pm 0.038$ \\
%    Listnet     &$ 0.437 \pm 0.006  $&$ 0.398 \pm 0.008  $&$ 0.316 \pm 0.010 $&$ 0.661 \pm 0.003   $&$ 0.499 \pm 0.003$ \\
%    Rankboost   &$ 0.451 \pm 0.003  $&$ 0.424 \pm 0.002  $&$ 0.366 \pm 0.002 $&$ 0.763 \pm 0.003   $&$ 0.517 \pm 0.002$ \\
%    \end{tabular}%RF & $0.495 \pm 0.003$ &$0.469 \pm 0.002$ &$0.322 \pm 0.010$ &$0.729 \pm 0.002$ &$0.513 \pm 0.002$ \\
%%MART& $0.489 \pm 0.002$ & $0.463 \pm 0.001$ & $0.321 \pm 0.006$ & $0.725 \pm 0.002$ & $0.511 \pm 0.001$\\
%%lambdaMART& $0.493 \pm 0.002$ & $0.468 \pm 0.002$ & $0.319 \pm 0.010$ & $0.729 \pm 0.003$ & $0.516 \pm 0.002$\\
%
%    \caption{precision@5}
%
%\end{table}
%
%
%  \begin{table}[h!]
%%  \small
%    \begin{tabular}{l|lllll}
%    ~           & Bibsonomy          & Lastfm             & Movielens         & Yahoo               & Youtube          \\\hline
%    
%LATRE+wTS&$	0.397\pm 0.001$&$0.388 \pm 0.003$&$0.321 \pm 0.009$&$0.744 \pm 0.002$&$0.489 \pm 0.001$\\
%SVM &$0.412\pm 0.002$&$ 0.407 \pm 0.002$ & $0.354 \pm 0.007$& $ 0.765 \pm 0.001$&$ 0.515 \pm 0.003$\\
%GP &$0.406 \pm 0.006$ &$0.440 \pm 0.008$ &$0.388 \pm 0.002$& $0.770 \pm 0.0044 $&$0.530 \pm 0.002$\\
%RF    & $0.495 \pm 0.003$ &$0.469 \pm 0.002$ &$0.413 \pm 0.004$ &$0.803 \pm 0.002$ &$0.549 \pm 0.002$ \\
%MART  &$0.489 \pm 0.002$ & $0.463 \pm 0.001$ &$0.411 \pm 0.006$ &$0.794 \pm 0.002$ & $0.547 \pm 0.001$\\
%$\lambda$-MART& $0.493 \pm 0.002$ & $0.468 \pm 0.002$ & $0.409 \pm 0.010$ & $0.802 \pm 0.003$ & $0.551 \pm 0.002$\\
%Adarank & $0.446 \pm 0.004$ & $0.160 \pm 0.128$ & $0.206 \pm 0.110$ & $0.653 \pm 0.010$ & $0.419 \pm 0.032$\\
%Listnet & $0.431 \pm 0.006$ & $0.380 \pm 0.007$ & $0.273 \pm 0.012$ & $0.607 \pm 0.003$ & $0.472 \pm 0.003$ \\
%Rankboost & $0.444 \pm 0.002$ & $0.402 \pm 0.002$ & $0.307 \pm 0.005$ & $0.696 \pm 0.003$ & $0.488 \pm 0.002$\\
%
%    \end{tabular}
%    \caption{ndcg@5}
%\end{table}
%

%  \begin{table}[h!]
%%  \small
%    \begin{tabular}{l|lllll}
%%    ~           & Bibsonomy          & Lastfm             & Movielens         & Yahoo               & Youtube          \\\hline
%    
%
%
%LATRE+wTS & $0.430 \pm	0.002$&$	0.377\pm	0.002$&$	0.250 \pm	0.012$&$	0.637\pm	0.002$&$	0.451 \pm	0.001$\\
%RankSVM &$0.446 \pm 0.003$ &$0.390 \pm 0.003$ &$ 0.275 \pm 0.008$ &$ 0.651 \pm 0.001$ &$ 0.474 \pm 0.003$\\
%GP &$0.431 \pm 0.009$ &$0.415 \pm 0.010$ & $0.280 \pm 0.005$ & $0.654 \pm 0.004$ & $0.478 \pm 0.002$\\
%RF &  $0.491 \pm 0.003$ & $0.460 \pm 0.002$ & $0.301 \pm 0.011$ & $0.689 \pm 0.001$ & $0.498 \pm 0.002$ \\
%MART & $0.485 \pm 0.002$ & $0.455 \pm 0.002$ & $0.300 \pm 0.007$ & $0.685 \pm 0.002$ & $0.496 \pm 0.001$ \\
%$\lambda$-MART & $0.490 \pm 0.002$ & $0.459 \pm 0.002$ & $0.298 \pm 0.011$ & $0.690 \pm 0.002$ & $0.502 \pm 0.001$\\
%Adarank & $0.443 \pm 0.004$ & $0.152 \pm 0.120$ &  $0.194 \pm 0.105$ & $0.618 \pm 0.009$ & $0.408 \pm 0.030$\\
%Listnet & $0.428 \pm 0.006$ & $0.374 \pm 0.006$ & $0.259 \pm 0.012$ & $0.575 \pm 0.003$ & $0.459 \pm 0.003$\\
%Rankboost & $0.441 \pm 0.002$ & $0.394 \pm 0.002$ & $0.287 \pm 0.007$ & $0.657 \pm 0.003$ & $0.474 \pm 0.002$ \\
%    \end{tabular}
%    \caption{recall@5}
%\end{table}
%  
%


%\subsection{Execution Time}


%  \begin{table}[h!]
%  \small
%    \begin{tabular}{l|lllll}

%    ~           & Bibsonomy          &  Lastfm             & Movielens         & Yahoo               & Youtube          \\\hline
%LRG & $3481.8 \pm 73.8$ &$ 35943.4 \pm 2189.0$&$ 309.4 \pm 15.7$&$ 23236.4 \pm 2034.5$&$ 17763.8 \pm 3021.3$\\
%MC &$15.9 \pm 0.3$&$ 28.9 \pm 1.2$ &$0.6 \pm 0.01$&$ 14.0 \pm 0.2$&$ 29.0 \pm 0.5$\\
%    \end{tabular}
%    \caption{Online recommendation stages (time in seconds) of Lazy Rule Generation (LRG) and Metric Computation (MC)}

%\end{table}


%  \begin{table}[h!]
%  \scriptsize
%    \begin{tabular}{l|lllll}
%    ~           & Bibsonomy        & Lastfm           & Movielens       & Yahoo             & Youtube          \\\hline
%    SVM         &$21.5 \pm 1.1$ &$ 45.6 \pm 2.0$ & $0.6 \pm 0.0$ & $15.7 \pm 1.9$ & $40.5 \pm 1.9$\\
%    GP          &$63.6 \pm 3.8$ & $160.4 \pm 15.8$ & $2.9 \pm 0.3$ & $46.9 \pm 2.8$ & $149.7 \pm 16.9$ \\
%    RF          &$1416.1 \pm 49.8$&$1632.9 \pm 9.2$&$1039.0 \pm 58.5$&$1335.2 \pm 41.6$&$1715.5 \pm 42.4$\\
%    MART        &$307.2 \pm 3.7$ & $573.4 \pm 5.8$ & $107.1 \pm 0.8$ & $245.5 \pm 4.4$ &$480.3 \pm 20.6$\\
%    LambdaMART  &$1522.8 \pm 294.9$&$1820.8 \pm 963.9$&$613.6 \pm 155.9$&$1254.9 \pm 231.8$&$2700.6 \pm 1069.3$\\
%    Adarank     &$44.3 \pm 6.8$&$136.0 \pm 20.6$&$5.1 \pm 1.1$&$32.0 \pm 2.1$&$123.4 \pm 38.1$\\
%    Listnet     &$42.9 \pm 7.3$&$83.8 \pm 35.0$&$3.1 \pm 1.4$&$25.2 \pm 4.6$&$87.4 \pm 10.3$\\
%    Rankboost   &$17.7 \pm 0.5$&$39.1 \pm 0.6$&$1.8 \pm 0.1$&$12.7 \pm 0.4$&$35.1 \pm 2.6$\\
%    \end{tabular}
%\end{table}

%\subsection{Results with Personalization}
%  \begin{table}[h!]
%  \small
%    \begin{tabular}{l|lllll}
%    ~           & bibsonomy          & lastfm             & movielens         & youtube          \\\hline
%    SVM         &$ 	0.553\pm 0.001 $&$ 0.541	\pm 0.004 $&$ 0.821 \pm	0.007$&$ 0.544 \pm	0.001 $ \\ 
%    GP          &$  0.537 \pm	0.007$&$ 0.809 \pm	0.005 $&$0.853 \pm	0.003 $&$ 0.534 \pm	0.007 $ \\
%    RF          &$ 0.604 \pm 0.002 $&$ 0.841 \pm 0.002 $&$ 0.878 \pm 0.003 $&$  $ \\
%    MART        &$ 0.593 \pm 0.001 $&$  $&$ 0.871 \pm 0.003 $&$  0.570 \pm 0.001$ \\
%    LambdaMART  &$ 0.604 \pm 0.001 $&$ 0.844 \pm 0.004 $&$ 0.878 \pm 0.002 $&$ 0.576 \pm 0.002 $ \\
%    adarank     &$ 0.532 \pm 0.017 $&$ 0.439 \pm 0.011 $&$ 0.193 \pm 0.361 $&$  $ \\
%    listnet     &$ 0.488 \pm 0.003 $&$ 0.455 \pm nan $&$ 0.582 \pm 0.011 $&$  $ \\
%    rankboost   &$ 0.573 \pm 0.009 $&$  $&$ 0.859 \pm 0.004 $&$ 0.542 \pm nan $ \\
%    \end{tabular}
%\end{table}

\input{results}