performance on the novel test data, including:
Classification accuracy (Acc): The percentage of examples correctly classified as positive or negative.
Recall (Rec): The percentage of positive examples classified as positive.
Precision (Pr): The percentage of examples classified as positive which are positive.
Precision at Top 3 (Pr3): The percentage of the 3 top-ranked examples which are positive.
Precision at Top 10 (Pr10): The percentage of the 10 top-ranked examples which are positive.
F-Measure (F): A weighted average of precision and recall frequently used in information retrieval:

F = (2 · Pr · Rec) / (Pr + Rec)
Rating of Top 3 (Rt3): The average user rating assigned to the 3 top-ranked examples.
Rating of Top 10 (Rt10): The average user rating assigned to the 10 top-ranked examples.
Rank Correlation (rs): Spearman's rank correlation coefficient between the system's ranking and that imposed by the user's ratings (−1 ≤ rs ≤ 1); ties are handled using the method recommended by [1].
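As a minimal sketch of how these metrics could be computed, assume a system that both labels items positive/negative and ranks them, with numeric user ratings available. The function names, item representation, and the positive/negative sets are illustrative; since the excerpt does not spell out the tie-handling method of [1], the standard averaged-ranks correction is assumed here:

```python
from statistics import mean

def binary_metrics(predicted_pos, actual_pos, all_items):
    """Accuracy, recall, precision, and F-measure from sets of item ids."""
    tp = len(predicted_pos & actual_pos)
    tn = len(all_items - predicted_pos - actual_pos)
    acc = (tp + tn) / len(all_items)
    rec = tp / len(actual_pos) if actual_pos else 0.0
    pr = tp / len(predicted_pos) if predicted_pos else 0.0
    f = 2 * pr * rec / (pr + rec) if pr + rec else 0.0
    return acc, rec, pr, f

def precision_at_k(ranked, actual_pos, k):
    """Fraction of the k top-ranked items that are positive (Pr3, Pr10)."""
    return sum(1 for item in ranked[:k] if item in actual_pos) / k

def rating_at_k(ranked, ratings, k):
    """Average user rating of the k top-ranked items (Rt3, Rt10)."""
    return mean(ratings[item] for item in ranked[:k])

def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rs: Pearson correlation of the (tie-averaged) ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0
```

For example, perfectly reversed rankings give rs = −1, and identical rankings give rs = 1, matching the −1 ≤ rs ≤ 1 range above.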
The top 3 and top 10 metrics are given since many users will be primarily interested in getting a few top-ranked recommendations. Rank correlation gives a good overall picture of how the system's continuous ranking of books agrees with the user's, without requiring that the system actually predict the numerical rating score assigned by the user. A correlation coefficient of 0.3 to 0.6 is generally considered "moderate" and above 0.6 is considered "strong."

3.1.3 Methodological Discussion
A number of other recent experimental evaluations of recommender systems have employed user-selected examples that were not randomly sampled from the overall distribution. In particular, data from the EachMovie system has been used by a number of researchers to evaluate recommenders [6, 4, 12]. Examples in such data sets were selected
by users with unknown strategies and biases and do not
constitute a representative sample of items. When a recommender is used in practice, it needs to rank or categorize
all of the items in the database, and therefore the test data
in an experimental evaluation should be a random sample
from the complete dataset in order to faithfully characterize
ultimate performance. Consequently, our experiments utilize randomly sampled examples. Unfortunately, naturally available data from users of existing systems does not provide random test examples. An ideal evaluation requires
the researcher to control the selection of test examples and
prevents the easy use of existing commercial data.
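The sampling constraint described above could be realized as a simple split helper, drawing test examples uniformly from the complete item pool rather than from user-selected items; the function and parameter names here are illustrative, not from the paper:

```python
import random

def make_eval_split(all_items, n_test, seed=0):
    """Draw test examples uniformly at random from the complete item pool,
    so the test set reflects the full distribution the recommender must
    rank in practice. Remaining items form the pool for training examples.
    (Names and the fixed seed are illustrative assumptions.)"""
    rng = random.Random(seed)
    items = list(all_items)
    test = rng.sample(items, n_test)  # uniform sample, no user selection bias
    test_set = set(test)
    train_pool = [item for item in items if item not in test_set]
    return train_pool, test
```

A fixed seed keeps the split reproducible across repeated trials.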
However, in practical use, the user will normally select the training examples. Therefore, randomly selected training examples, such as those employed in our experiments, are not particularly realistic. Unfortunately, employing user-selected training examples with randomly-sampled test examples prevents repeatedly partitioning a given set of user data and running multiple training/test trials, such as n-fold cross-validation, in order to obtain more statistically reliable results. Since presumab...