Information Theory

Information Theory
Sargur N. Srihari
Machine Learning
Topics
1. Entropy as an Information Measure
   1. Discrete variable definition; relationship to code length
   2. Continuous variable: differential entropy
2. Maximum Entropy
3. Conditional Entropy
4. Kullback-Leibler Divergence (Relative Entropy)
5. Mutual Information
Information Measure
How much information is received when we observe a specific value of a discrete random variable x?
• The amount of information is the degree of surprise:
  – A certain event conveys no information.
  – An unlikely event conveys more information.
• It depends on the probability distribution p(x), through a quantity h(x).
• If there are two unrelated events x and y, we want h(x,y) = h(x) + h(y).
• Thus we choose h(x) = -log2 p(x); the negative sign ensures the information measure is non-negative.
• The average amount of information transmitted is the expectation with respect to p(x), referred to as the entropy:
  H(x) = -Σ_x p(x) log2 p(x)
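A minimal sketch of this entropy definition in Python (the helper name entropy_bits and the use of base-2 logarithms here are illustrative assumptions, not part of the slides):

import math

def entropy_bits(probs):
    # H(x) = -sum_x p(x) log2 p(x), in bits.
    # States with p(x) = 0 contribute nothing (0 log 0 is taken as 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)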
Usefulness of Entropy
• Uniform distribution: random variable x has 8 possible states, each equally likely.
  – We would need 3 bits to transmit its value.
  – Also, H(x) = -8 × (1/8) log2(1/8) = 3 bits.
• Non-uniform distribution: x has 8 states with probabilities (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64).
  – H(x) = 2 bits.
• The non-uniform distribution has smaller entropy than the uniform one.
• Entropy also has an interpretation in terms of disorder.
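Using the entropy_bits sketch above, the two distributions on this slide give the stated values (a usage illustration under the same assumptions):

# Uniform distribution over 8 equally likely states: 3 bits
print(entropy_bits([1/8] * 8))                                      # 3.0

# Non-uniform distribution from the slide: 2 bits
print(entropy_bits([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]))  # 2.0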
Relationship of Entropy to Code Length
• We can take advantage of a non-uniform distribution by using shorter codes for more probable events.
• If x has 8 states (a, b, c, d, e, f, g, h) with probabilities (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64), we can use the codes 0, 10, 110, 1110, 111100, 111101, 111110, 111111.
• Average code length = (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/16)·4 + 4·(1/64)·6 = 2 bits, the same as the entropy of the random variable.
• A shorter code string is not possible, because of the need to disambiguate the string into its component parts: 11001110 is uniquely decoded as the sequence cad.
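A small sketch checking the average code length and the decoding example (the dictionary layout and the decode helper are illustrative; the slide only gives the codewords and probabilities):

# Prefix code from the slide: no codeword is a prefix of another
code = {'a': '0', 'b': '10', 'c': '110', 'd': '1110',
        'e': '111100', 'f': '111101', 'g': '111110', 'h': '111111'}
probs = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/16,
         'e': 1/64, 'f': 1/64, 'g': 1/64, 'h': 1/64}

# Average code length = sum_x p(x) * len(code(x)) = 2 bits, equal to H(x)
print(sum(probs[s] * len(code[s]) for s in code))  # 2.0

def decode(bits, code):
    # Greedy prefix decoding: accumulate bits until the buffer matches a codeword
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ''
    return ''.join(out)

print(decode('11001110', code))  # 'cad'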
Relationship between Entropy and Shortest Coding Length
• Noiseless coding theorem of Shannon:
  – Entropy is a lower bound on the number of bits needed to transmit the state of a random variable.
• When natural logarithms are used (to relate entropy to other topics), it is measured in nats instead of bits.
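A one-line illustration of the unit change (assuming the entropy_bits helper sketched earlier; multiplying by ln 2 converts a base-2 entropy to natural logarithms):

import math

H_bits = entropy_bits([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
H_nats = H_bits * math.log(2)
print(H_bits, H_nats)  # 2.0 bits, about 1.386 nats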
History of Entropy: From Thermodynamics to Information Theory
• Entropy is the average amount of information needed to specify the state of a random variable.
• The concept had a much earlier origin in physics, in the context of equilibrium thermodynamics.
• It was later given a deeper interpretation as a measure of disorder, with developments in statistical mechanics.