On the novel test data including classification

...nce on the novel test data, including:

Classification accuracy (Acc): The percentage of examples correctly classified as positive or negative.
Recall (Rec): The percentage of positive examples classified as positive.
Precision (Pr): The percentage of examples classified as positive which are positive.
Precision at Top 3 (Pr3): The percentage of the 3 top ranked examples which are positive.
Precision at Top 10 (Pr10): The percentage of the 10 top ranked examples which are positive.
F-Measure (F): A weighted average of precision and recall frequently used in information retrieval: $F = \frac{2 \cdot Pr \cdot Rec}{Pr + Rec}$
Rating of Top 3 (Rt3): The average user rating assigned to the 3 top ranked examples.
Rating of Top 10 (Rt10): The average user rating assigned to the 10 top ranked examples.
Rank Correlation ($r_s$): Spearman's rank correlation coefficient between the system's ranking and that imposed by the user's ratings ($-1 \le r_s \le 1$); ties are handled using the method recommended by [1].

The top 3 and top 10 metrics are given since many users will be primarily interested in getting a few top-ranked recommendations. Rank correlation gives a good overall picture of how the system's continuous ranking of books agrees with the user's, without requiring that the system actually predict the numerical rating score assigned by the user. A correlation coefficient of 0.3 to 0.6 is generally considered "moderate" and above 0.6 is considered "strong."

3.1.3 Methodological Discussion

A number of other recent experimental evaluations of recommender systems have employed user-selected examples that were not randomly sampled from the overall distribution. In particular, data from the EachMovie system has been used by a number of researchers to evaluate recommenders [6, 4, 12]. Examples in such data sets were selected by users with unknown strategies and biases and do not constitute a representative sample of items.
When a recommender is used in practice, it needs to rank or categorize all of the items in the database. The test data in an experimental evaluation should therefore be a random sample from the complete dataset in order to faithfully characterize ultimate performance. Consequently, our experiments utilize randomly sampled examples. Unfortunately, naturally available data from users of existing systems does not provide random test examples. An ideal evaluation requires the researcher to control the selection of test examples, which prevents the easy use of existing commercial data. However, in practical use, the user will normally select the training examples. Therefore, using randomly selected training examples, as in our experiments, is not particularly realistic. Unfortunately, employing user-selected training examples with randomly sampled test examples prevents repeatedly partitioning a given set of user data and running multiple training/test trials, such as n-fold cross-validation, in order to obtain more statistically reliable results. Since presumab...
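
The evaluation metrics defined earlier in this excerpt can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the boolean encoding of positive/negative labels, and the use of a numeric score to rank examples are all assumptions. In particular, the tie handling for Spearman's coefficient shown here (tied values share the mean of their ranks, followed by a Pearson correlation on the ranks) is one common convention and may differ from the method recommended by [1].

```python
from math import sqrt

def classification_metrics(predicted, actual):
    """Acc, Pr, Rec and F for equal-length lists of booleans (True = positive)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if a and not p)
    tn = len(pairs) - tp - fp - fn
    acc = (tp + tn) / len(pairs)
    pr = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * pr * rec / (pr + rec) if pr + rec else 0.0
    return {"Acc": acc, "Pr": pr, "Rec": rec, "F": f}

def top_k(scores, k):
    """Indices of the k highest-scoring examples (hypothetical helper)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def precision_at_k(scores, actual, k):
    """Fraction of the k top-ranked examples that are positive (Pr3, Pr10)."""
    top = top_k(scores, k)
    return sum(1 for i in top if actual[i]) / len(top)

def rating_at_k(scores, ratings, k):
    """Average user rating of the k top-ranked examples (Rt3, Rt10)."""
    top = top_k(scores, k)
    return sum(ratings[i] for i in top) / len(top)

def average_ranks(values):
    """Ranks 1..n; tied values receive the mean of the positions they span.
    Assumption: this tie rule may not match the one recommended by [1]."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's r_s as the Pearson correlation of the average ranks.
    Undefined (division by zero) if either list is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)
```

For example, `precision_at_k(scores, actual, 3)` and `rating_at_k(scores, ratings, 10)` would correspond to Pr3 and Rt10, and `spearman(scores, ratings)` to the rank correlation between the system's scores and the user's ratings, under the assumptions stated above.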
This document was uploaded on 09/12/2013.
