23. Bayesian Ying Yang Learning (II): A New Mechanism for Model Selection and Regularization

Lei Xu
Chinese University of Hong Kong, Hong Kong

Abstract. Efforts toward a key challenge of statistical learning, namely learning on a finite set of samples with model selection ability, have been pursued in two typical streams. Bayesian Ying Yang (BYY) harmony learning provides a promising tool for solving this key challenge, with new mechanisms for model selection and regularization. Moreover, not only is BYY harmony learning further justified from both an information-theoretic perspective and a generalized projection geometry, but comparative discussions are also made on its relations to, and differences from, studies of minimum description length (MDL), bits-back based MDL, the Bayesian approach, maximum likelihood, information geometry, Helmholtz machines, and variational approximation. In addition, bibliographic remarks are made on the advances of BYY harmony learning studies.

23.1 Introduction: A Key Challenge and Existing Solutions

A key challenge to all learning tasks is that learning is made on a finite set $X$ of samples from the world $\mathbb{X}$, while our ambition is to obtain the underlying distribution so that we can apply it to as many new samples from $\mathbb{X}$ as possible. Helped by certain pre-knowledge about $\mathbb{X}$, a learner $M$ is usually designed via a parametric family $p(x|\theta)$, with its density function form covering, or being as close as possible to, the function form of the true density $p^*(x)$. Then, we obtain an estimator $\hat\theta(X)$ with a specific value for $\theta$ such that $p(x|\hat\theta(X))$ is as close as possible to the true density $p^*(x|\theta_o)$, with the true value $\theta_o$.
This is usually obtained by determining a specific value of $\hat\theta(X)$ that minimizes a cost functional
$$F(p(x|\theta), X) \quad \text{or} \quad F(p(x|\theta), q_X(x)), \tag{23.1}$$
where $q_X$ is an estimated density of $x$ from $X$, e.g., given by the empirical density
$$p_0(x) = \frac{1}{N}\sum_{t=1}^{N}\delta(x - x_t), \qquad \delta(x) = \begin{cases} \lim_{\delta \to 0} \dfrac{1}{\delta^{d}}, & x = 0,\\[4pt] 0, & x \neq 0, \end{cases} \tag{23.2}$$
where $d$ is the dimension of $x$ and $\delta > 0$ is a small number. With a given smoothing parameter $h > 0$, $q_X$ can also be the following nonparametric Parzen window density estimate [23.21]:
$$p_h(x) = \frac{1}{N}\sum_{t=1}^{N} G(x \mid x_t, h^2 I). \tag{23.3}$$
When $q_X(x) = p_0(x)$, given by Eq. (23.2), a typical example of Eq. (23.1) is
$$\min_\theta F(p(x|\theta), X) = -\int p_0(x) \ln p(x|\theta)\, \mu(dx), \tag{23.4}$$
where $\mu(\cdot)$ is a given measure. It leads to the maximum likelihood (ML) estimator $\hat\theta(X)$. For a fixed $N$, we usually have $\hat\theta(X) \neq \theta_o$ and $p(x|\hat\theta(X)) \neq p^*(x|\theta_o)$. Thus, though $p(x|\hat\theta(X))$ best matches the sample set $X$ in the sense of Eq. (23.1) or Eq. (23.4), it may not apply well to new samples from the same world $\mathbb{X}$.
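As a concrete illustration (not part of the chapter itself), the sketch below implements the Parzen window estimate of Eq. (23.3) with isotropic Gaussian kernels, and the ML estimator of Eq. (23.4) for the simple case where $p(x|\theta)$ is a one-dimensional Gaussian family. The function names and the NumPy formulation are my own assumptions, chosen for clarity.

```python
import numpy as np

def parzen_density(x, samples, h):
    """Parzen window estimate p_h(x) of Eq. (23.3): an average of
    isotropic Gaussian kernels G(x | x_t, h^2 I) over the samples."""
    samples = np.atleast_2d(samples)          # shape (N, d)
    N, d = samples.shape
    sq_dist = np.sum((x - samples) ** 2, axis=1)
    norm = (2.0 * np.pi * h ** 2) ** (d / 2.0)
    return np.mean(np.exp(-sq_dist / (2.0 * h ** 2)) / norm)

def ml_gaussian_fit(samples):
    """ML estimator for a 1-D Gaussian family, i.e. the minimizer of
    Eq. (23.4) with q_X = p_0 and p(x|theta) = N(mu, sigma^2)."""
    mu = np.mean(samples)
    sigma2 = np.var(samples)   # ML variance uses 1/N, not 1/(N-1)
    return mu, sigma2
```

For a finite $N$ the returned $(\hat\mu, \hat\sigma^2)$ will deviate from the true $\theta_o$, which is exactly the finite-sample gap $\hat\theta(X) \neq \theta_o$ discussed above.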