23. Bayesian Ying Yang Learning (II): A New Mechanism for Model Selection and Regularization

Lei Xu
Chinese University of Hong Kong, Hong Kong

Abstract. Efforts toward a key challenge of statistical learning, namely learning on a finite set of samples with model selection ability, have developed in two typical streams. Bayesian Ying Yang (BYY) harmony learning provides a promising tool for tackling this key challenge, with new mechanisms for model selection and regularization. Moreover, not only is BYY harmony learning further justified from both an information-theoretic perspective and a generalized projection geometry, but comparative discussions are also made on its relations to, and differences from, studies of minimum description length (MDL), bits-back based MDL, the Bayesian approach, maximum likelihood, information geometry, Helmholtz machines, and variational approximation. In addition, bibliographic remarks are made on advances in BYY harmony learning studies.

23.1 Introduction: A Key Challenge and Existing Solutions

A key challenge to all learning tasks is that learning is performed on a finite set X of samples from the world X, while our ambition is to obtain the underlying distribution so that it applies to as many new samples from X as possible. Helped by certain pre-knowledge about X, a learner M is usually designed via a parametric family p(x|θ), with its density function form covering, or being as close as possible to, the function form of the true density p*(x|·). We then obtain an estimator θ̂(X) with a specific value for θ such that p(x|θ̂(X)) is as close as possible to the true density p*(x|θ_o), where θ_o is the true parameter value.
This estimator is usually obtained by determining a specific value θ̂(X) that minimizes a cost functional

$$F(p(x \mid \theta), X) \quad \text{or} \quad F(p(x \mid \theta), q_X(x)), \tag{23.1}$$

where q_X is a density of x estimated from X, e.g., given by the empirical density

$$p_0(x) = \frac{1}{N}\sum_{t=1}^{N} \delta(x - x_t), \qquad \delta(x) = \begin{cases} \lim_{\delta \to 0} \dfrac{1}{\delta^{d}}, & x = 0, \\ 0, & x \neq 0, \end{cases} \tag{23.2}$$

where d is the dimension of x and δ > 0 is a small number. With a given smoothing parameter h > 0, q_X can also be the following non-parametric Parzen window density estimate [23.21]:

$$p_h(x) = \frac{1}{N}\sum_{t=1}^{N} G(x \mid x_t, h^{2} I). \tag{23.3}$$

When q_X(x) is the empirical density p_0(x) given by Eq. (23.2), a typical example of Eq. (23.1) is

$$\min_{\theta}\; -F(p(x \mid \theta), X) = -\int p_0(x)\, \ln p(x \mid \theta)\, \mu(dx), \tag{23.4}$$

where μ(·) is a given measure. This leads to the maximum likelihood (ML) estimator θ̂(X). For a fixed N, we usually have θ̂(X) ≠ θ_o and p(x|θ̂(X)) ≠ p*(x|θ_o). Thus, although p(x|θ̂(X)) best matches the sample set X in the sense of Eq. (23.1) or Eq. (23.4), it may not apply well to new samples from the same world X.
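The two estimates above can be illustrated numerically. The following is a minimal sketch (not from the chapter; the function and variable names such as parzen_density, theta_hat, and the choices N = 50 and h = 0.3 are illustrative assumptions): it builds the Parzen window estimate of Eq. (23.3) with a Gaussian kernel G(x | x_t, h²I) in one dimension, and computes the ML estimator of Eq. (23.4) for a Gaussian family p(x|θ), where the sample mean and standard deviation are the closed-form ML solution. For finite N, θ̂(X) generally differs from θ_o, as the chapter emphasizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# True parameters theta_o of a 1-D Gaussian world, and a finite sample set X.
theta_true = (0.0, 1.0)                    # (mean, std)
X = rng.normal(*theta_true, size=50)       # N = 50 samples

def parzen_density(x, samples, h):
    """Parzen window estimate p_h(x) = (1/N) sum_t G(x | x_t, h^2),
    Eq. (23.3) specialized to scalar x with a Gaussian kernel."""
    z = (x - samples) / h
    return np.mean(np.exp(-0.5 * z**2) / (h * np.sqrt(2.0 * np.pi)))

# ML estimator theta_hat(X) for the Gaussian family: minimizing
# -integral p_0(x) ln p(x|theta) mu(dx) over theta (Eq. 23.4) gives
# the sample mean and sample standard deviation in closed form.
theta_hat = (X.mean(), X.std())

# For finite N, theta_hat != theta_o in general.
print("theta_hat:", theta_hat, " theta_true:", theta_true)
```

Note that p_h integrates to one by construction, while θ̂(X) only approaches θ_o as N grows; this gap for fixed N is exactly the generalization problem the section raises.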