[...]o difficult and, in any case,
does not provide feedback for representative test examples.
Therefore, it is important to continue to explore a range of
alternative datasets and evaluation methods and to avoid
prematurely committing to a specific methodology or overinterpreting the results of individual studies.

3.2 Basic Results
The results are summarized in Table 6, where N represents
the number of training examples utilized and results are
shown for a number of representative points along the learning curve. Overall, the results are quite encouraging even
when the system is given relatively small training sets, and
performance generally improves quite rapidly as the number of training examples is increased. The SF data set
is clearly the most difficult since there are very few highly rated books. Although accuracy for SF is lower than that of always choosing the most common class (negative), the other metrics are
more informative.
The "top n" metrics are perhaps the most relevant to
many users. Consider precision at top 3, which is fairly
consistently in the 90% range after only 20 training examples
(the exceptions are Lit1 until 70 examples¹ and SF until
450 examples). Therefore, Libra's top recommendations
are highly likely to be viewed positively by the user. Note
that the "Positive" column in Table 4 gives the probability
that a randomly chosen example from a given data set will
be positively rated. Therefore, for every data set, the top 3
and top 10 recommendations are always substantially more
likely than random to be rated positively, even after only 5
training examples.
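
As an illustration of the comparison made above, the following is a minimal sketch of precision at top n set against the base rate of positive examples. It is not Libra's code: the function names, the sample data, and the rule that a rating of 6 or higher counts as positive are assumptions made for this example.

```python
def precision_at_top_n(predicted_scores, user_ratings, n, positive_threshold=6):
    """Fraction of the n highest-scored items that the user rated positively.

    positive_threshold is an assumption for this sketch, not a value taken
    from the text above.
    """
    if n <= 0 or not predicted_scores:
        raise ValueError("need n > 0 and at least one scored item")
    # Indices of the items, ordered from highest to lowest predicted score.
    ranked = sorted(range(len(predicted_scores)),
                    key=lambda i: predicted_scores[i],
                    reverse=True)
    top = ranked[:n]
    hits = sum(1 for i in top if user_ratings[i] >= positive_threshold)
    return hits / len(top)


def positive_base_rate(user_ratings, positive_threshold=6):
    """Probability that a randomly chosen item is rated positively."""
    if not user_ratings:
        raise ValueError("need at least one rating")
    positives = sum(1 for r in user_ratings if r >= positive_threshold)
    return positives / len(user_ratings)


# Hypothetical test set: system scores and the user's actual ratings.
scores = [0.9, 0.1, 0.8, 0.4, 0.7]
ratings = [9, 2, 7, 3, 4]

top3 = precision_at_top_n(scores, ratings, n=3)
baseline = positive_base_rate(ratings)
print(f"precision at top 3: {top3:.2f}, base rate: {baseline:.2f}")
```

A precision at top n well above the base rate is what the text means by recommendations being "substantially more likely than random to be rated positively."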
Considering the average rating of the top 3 recommendations, it is fairly consistently above 8 after only 20
training examples (the exceptions again are Lit1 until 100
examples and SF). For every data set, the top 3 and top
10 recommendations are always rated substantially higher
than a randomly selected example (cf. the average rating
from Table 4).
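
The average-rating metric can be sketched in the same way. This is an illustrative example only: the function name and the sample data are hypothetical, and the overall mean rating stands in for the rating of a randomly selected example.

```python
def average_rating_of_top_n(predicted_scores, user_ratings, n):
    """Mean user rating of the n items with the highest predicted scores."""
    if n <= 0 or not predicted_scores:
        raise ValueError("need n > 0 and at least one scored item")
    # Pair each score with its rating, then order by score, best first.
    ranked = sorted(zip(predicted_scores, user_ratings),
                    key=lambda pair: pair[0],
                    reverse=True)
    top_ratings = [rating for _, rating in ranked[:n]]
    return sum(top_ratings) / len(top_ratings)


# Hypothetical test set: system scores and the user's actual ratings.
scores = [0.9, 0.1, 0.8, 0.4, 0.7]
ratings = [9, 2, 7, 3, 4]

top3_mean = average_rating_of_top_n(scores, ratings, n=3)
overall_mean = sum(ratings) / len(ratings)
print(f"top 3 mean: {top3_mean:.2f}, overall mean: {overall_mean:.2f}")
```

The gap between the two printed values is the quantity the paragraph above compares against the average rating in Table 4.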
Looking at the rank correlation, except for SF, there
is at least a moderate correlation (r_s ≥ 0.3) after only 10
examples, and SF exhibits a moderate correlation after 40
examples. This becomes a strong correlation (r_s ≥ 0.6) for
Lit1 after only 20 examples, for Lit2 after 40 examples, for
Sci after 70 examples, for Myst after 300 examples, and for ...
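
To make the rank-correlation metric concrete, here is a minimal sketch that assumes r_s denotes Spearman's rank correlation coefficient, computed as the Pearson correlation of the two rank vectors with tied values sharing their mean rank. The function names and sample data are hypothetical and are not taken from Libra.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to len(values); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of items tied with the item at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rank_correlation(xs, ys):
    """Spearman's r_s: Pearson correlation computed on the ranks of xs and ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with 2+ items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical test set: system scores and the user's actual ratings.
scores = [0.2, 0.9, 0.5, 0.7, 0.1]
ratings = [3, 8, 9, 6, 2]
r_s = spearman_rank_correlation(scores, ratings)
print(f"r_s = {r_s:.2f}")
```

Under the thresholds quoted in the text, a value of at least 0.3 would count as a moderate correlation and a value of at least 0.6 as a strong one.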
¹ References to performance at 70 and 300 examples are based on
learning curve data not included in the summary in Table 6.

[Figure 1: Lit1 Rank Correlation. Correlation coefficient vs. training examples, for LIBRA and LIBRANR.]

[Figure 2: Myst. Plotted against training examples, for LIBRA and LIBRANR.]

[Figure (number not recoverable): SF Average Rating of Top 3. Rating of top 3 vs. training examples.]

... results shown in Figure 1, there is a consistent, statistically significant difference in performance from 20 examples onward. For the Myst results on precision at top 10 shown in
Figure 2, there is a consistent, stati...
This document was uploaded on 09/12/2013.