The BellKor Solution to the Netflix Grand Prize

Yehuda Koren

August 2009

I. INTRODUCTION

This article describes part of our contribution to the BellKor's Pragmatic Chaos final solution, which won the Netflix Grand Prize. The other portion of the contribution was created while working at AT&T with Robert Bell and Chris Volinsky, as reported in our 2008 Progress Prize report. The final solution includes all the predictors described there. In this article we describe only the newer predictors.

So what is new over last year's solution? First, we further improved the baseline predictors (Sec. III). This in turn improves our other models, which incorporate those predictors, like the matrix factorization model (Sec. IV). In addition, an extension of the neighborhood model that addresses temporal dynamics was introduced (Sec. V). On the Restricted Boltzmann Machines (RBM) front, we use a new RBM model with superior accuracy by conditioning the visible units (Sec. VI). The final addition is the introduction of a new blending algorithm, which is based on gradient boosted decision trees (GBDT) (Sec. VII).

II. PRELIMINARIES

The Netflix dataset contains more than 100 million date-stamped movie ratings performed by anonymous Netflix customers between Dec 31, 1999 and Dec 31, 2005. This dataset gives ratings about m = 480,189 users and n = 17,770 movies (aka, items).

The contest was designed in a training-test set format. A Hold-out set of about 4.2 million ratings was created consisting of the last nine movies rated by each user (or fewer if a user had not rated at least 18 movies over the entire period). The remaining data made up the training set. The Hold-out set was randomly split three ways, into subsets called Probe, Quiz, and Test. The Probe set was attached to the training set, and labels (the rating that the user gave the movie) were attached.
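The Hold-out construction described above can be sketched roughly as follows. This is only an illustration, not the procedure Netflix actually ran: the function name `split_holdout`, the tuple layout, the rule of holding out half of a short history, and the equal-probability assignment to the three subsets are all assumptions, since the text says only "fewer" and "randomly split three ways".

```python
import random
from collections import defaultdict


def split_holdout(ratings, per_user=9, min_history=18, seed=0):
    """Split (user, movie, day, rating) tuples into a training list and
    a dict of Probe/Quiz/Test hold-out subsets (hypothetical helper)."""
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    train, holdout = [], []
    for rows in by_user.values():
        rows.sort(key=lambda r: r[2])  # oldest rating first
        if len(rows) >= min_history:
            k = per_user  # the last nine ratings of the user
        else:
            # ASSUMPTION: the text only says "fewer"; holding out half
            # of a short history is a guess made for this sketch.
            k = len(rows) // 2
        cut = len(rows) - k
        train.extend(rows[:cut])
        holdout.extend(rows[cut:])

    # ASSUMPTION: each held-out rating goes to one of the three subsets
    # with equal probability.
    rng = random.Random(seed)
    subsets = {"probe": [], "quiz": [], "test": []}
    for row in holdout:
        subsets[rng.choice(("probe", "quiz", "test"))].append(row)
    return train, subsets
```

In this sketch the Probe rows keep their rating labels and would be appended to the training data, while the Quiz and Test rows would have their labels withheld.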
The Quiz and Test sets made up an evaluation set, which is known as the Qualifying set, that competitors were required to predict ratings for. Once a competitor submits predictions, the prizemaster returns the root mean squared error (RMSE) achieved on the Quiz set, which is posted on a public leaderboard (www.netflixprize.com/leaderboard). RMSE values mentioned in this article correspond to the Quiz set. Ultimately, the winner of the prize is the one that scores best on the Test set, and those scores were never disclosed by Netflix. This precludes clever systems which might game the competition by learning about the Quiz set through repeated submissions.

Compared with the training data, the Hold-out set contains many more ratings by users that do not rate much and are therefore harder to predict. In a way, this represents real requirements for a collaborative filtering (CF) system, which needs to predict new ratings from older ones, and to equally address all users, not just the heavy raters....

Y. Koren is with Yahoo! Research, Haifa, ISRAEL. Email: firstname.lastname@example.org
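For reference, the RMSE score mentioned above is the square root of the mean squared difference between predicted and actual ratings. A minimal sketch, assuming predictions and true ratings arrive as two equal-length sequences (the function name `rmse` is illustrative, not taken from the contest tooling):

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(sum(squared_errors) / len(actual))
```

Under the contest setup described in the text, a competitor would see this value only for the Quiz portion of a submission, while the same computation on the Test portion stayed hidden.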