A Brief Introduction to Deterministic Annealing

Justin Muncaster
Department of Computer Science
University of California, Santa Barbara
[email protected]

Abstract

This paper provides a short description of Deterministic Annealing and its information-theoretic derivation. The technique is presented in the context of clustering and vector quantization, where its application is most obvious, although the technique's applications range much further. This paper is based on  and is meant to provide an approachable introduction to deterministic annealing.

1. Introduction

Deterministic annealing is an optimization technique that attempts to find a global minimum of a cost function. The technique is designed to explore a large portion of the cost surface using randomness, while still performing optimization using local information. The procedure starts by changing the cost function to introduce a notion of randomness, allowing a large area to be explored. At each iteration, the amount of randomness (measured by Shannon entropy) is constrained and a local optimization is performed. Gradually, the amount of imposed randomness is lowered so that upon termination the algorithm optimizes over the original cost function, yielding a solution to the original problem.

2. Clustering and compression

In clustering we wish to represent a space of data points by a smaller set of codevectors. That is, we wish to partition the space into subsets such that the elements of each subset are as similar as possible. This problem has applications in many fields, ranging from pattern recognition to compression. In the following, we define the problem and present the classic k-means solution, which will reappear in a different form when we discuss the deterministic annealing approach.

2.1 Problem definition

Mathematically, the problem is defined as follows. Suppose we are given a source vector x ∈ X that we wish to transmit across a noiseless channel.
We will encode x by a codevector y from a codebook Y. We will always encode x using an index to the "best" reproduction codevector, denoted y(x). The best reproduction codevector is defined with respect to a distortion function d(·, ·), which we wish to minimize. The distortion function quantifies the difference between a source vector x and a reproduction vector y(x). In most cases distortion is measured by squared distance; however, this need not be the case in general. Lossy compression is achieved by a many-to-one mapping of source vectors to codevectors. This effectively partitions the space X into "clusters" surrounding the codevectors, where the number of clusters (and hence the compression rate) is determined by the size of the codebook that we wish to use....
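
The encoding rule above can be illustrated with a minimal sketch. It assumes squared Euclidean distance as the distortion d and a small hand-picked codebook. The function names (squared_distance, encode, average_distortion) and the sample data are hypothetical and chosen only for illustration; they do not come from the paper.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def squared_distance(x: Vector, y: Vector) -> float:
    """Squared Euclidean distortion d(x, y) (an assumed choice of d)."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def encode(
    x: Vector,
    codebook: Sequence[Vector],
    d: Callable[[Vector, Vector], float] = squared_distance,
) -> int:
    """Return the index of the codevector y(x) that minimizes d(x, y)."""
    return min(range(len(codebook)), key=lambda j: d(x, codebook[j]))


def average_distortion(
    data: Sequence[Vector],
    codebook: Sequence[Vector],
    d: Callable[[Vector, Vector], float] = squared_distance,
) -> float:
    """Mean distortion when each source vector is replaced by y(x)."""
    total = sum(d(x, codebook[encode(x, codebook, d)]) for x in data)
    return total / len(data)


# Hypothetical example: four source vectors, a codebook of size two.
codebook = [(0.0, 0.0), (10.0, 10.0)]
data = [(1.0, 0.0), (0.0, 2.0), (9.0, 10.0), (10.0, 12.0)]

# Many-to-one mapping: each source vector is sent as a codebook index.
indices = [encode(x, codebook) for x in data]
print(indices)                             # [0, 0, 1, 1]
print(average_distortion(data, codebook))  # 2.5
```

Here the two indices partition the four source vectors into two clusters, one around each codevector. A larger codebook would lower the average distortion at the cost of more bits per index.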
This note was uploaded on 12/27/2011 for the course CMPSC 225 taught by Professor Vandam during the Fall '09 term at UCSB.