Suppose we are given an initial probability distribution π^(0), where we denote Pr[X_0 = i] by π_i^(0). Then

  π_j^(1) = Pr[X_1 = j] = Σ_i Pr[X_1 = j | X_0 = i] Pr[X_0 = i] = Σ_{i=1}^n p_{ij} π_i^(0),
i.e., π^(1) = π^(0) P. Repeating this argument, we obtain π^(s) = π^(0) P^s. Here, π^(s) represents the distribution after s steps. Therefore, the matrix P^s is the so-called "s-step transition matrix". Its entries are p_{ij}^(s) = Pr[X_{t+s} = j | X_t = i].

Definition 1  A Markov chain is said to be "ergodic" if lim_{s→∞} p_{ij}^(s) = π_j > 0 for all j
and is independent of i. In this case,

  P^∞ = lim_{s→∞} P^s = the n×n matrix each of whose rows is (π_1, ..., π_j, ..., π_n).

Hence, π is independent of the starting distribution π^(0): π = π^(0) P^∞. Any vector π which satisfies π P = π and Σ_i π_i = 1 is called a stationary distribution.

Proposition 7  For an ergodic MC, π is a stationary distribution, and moreover it
is the unique stationary distribution.

Proof: We have already shown that lim_{s→∞} P^s = P^∞, which implies

  P^∞ = lim_{s→∞} P^{s+1} = (lim_{s→∞} P^s) P = P^∞ P.
Since π^(0) P^∞ = π for any probability distribution π^(0), we have π^(0) P^∞ = π^(0) P^∞ P, which implies π P = π. Since P 1 = 1, where 1 is the vector of all 1's, we derive that P^∞ 1 = 1, which says that Σ_i π_i = 1. The stationary distribution is unique simply because, by starting from another stationary distribution, say π̃, we always remain in this distribution, implying that π̃ = π.

Proposition 7 gives an "easy" way to calculate the limiting distribution π of an ergodic Markov chain from the transition matrix P: we just solve the linear system of equations π P = π, Σ_i π_i = 1.
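As an illustration of how Proposition 7 is used in practice, the sketch below (Python with NumPy; nothing here is from the notes themselves) solves the linear system π P = π, Σ_i π_i = 1 for a two-state chain. The transition matrix assumes the Figure 5 chain has p_11 = 1/2, p_12 = 1/2, p_21 = 1/4, p_22 = 3/4, which is consistent with the stationary distribution (1/3, 2/3) given later in the text.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi P = pi with sum(pi) = 1 for an ergodic chain.

    pi (P - I) = 0 is a singular system, so we replace one of its
    equations with the normalization constraint sum_i pi_i = 1.
    """
    n = P.shape[0]
    A = (P - np.eye(n)).T
    A[-1, :] = 1.0          # normalization row: sum_i pi_i = 1
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

# Two-state chain (entries assumed from Figure 5's edge labels).
P = np.array([[0.5, 0.5],
              [0.25, 0.75]])
pi = stationary_distribution(P)
print(pi)  # -> approximately [1/3, 2/3]

# Sanity check: pi really is stationary.
assert np.allclose(pi @ P, pi)
```

The trick of overwriting one row of the singular system with the normalization constraint is a standard way to pin down the unique solution.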
4 Ergodicity and time reversibility

Theorem 8  An MC is ergodic if and only if both of the following are true:

1. it is irreducible. That is, the underlying graph, consisting of states and transitions with positive probabilities on them, is strongly connected. Formally, for all i and j there is an s such that p_{ij}^(s) > 0.

2. the chain is aperiodic. That is, you cannot divide the states into subgroups such that you must go from one subgroup to another in succession. Formally, gcd{s : p_{ij}^(s) > 0} = 1 for all i and j.

Definition 2  An ergodic MC is called time reversible (TR) if the chain remains a Markov chain when you go "backwards". More formally, if π is the stationary distribution, then π_i p_{ij} = π_j p_{ji} for all pairs of states i and j; or, in words, the expected (or ergodic) flow from i to j equals the expected flow from j to i.

Proposition 9  Consider an ergodic MC. Suppose there exists π such that the balance conditions are satisfied, π_i p_{ij} = π_j p_{ji} for all i, j, and also Σ_i π_i = 1. Then π is the stationary distribution and, clearly, the MC is TR.

Clearly the MC in Figure 5 is ergodic (strong connectivity, i.e. irreducibility, and aperiodicity are obvious). It is clear that there exists a stationary distribution, and we can easily guess one: consider π_1 = 1/3 and π_2 = 2/3. Since one can easily verify that π satisfies the balance conditions, π must be the stationary distribution and the MC is time-reversible.

Consider an ergodic MC which is also symmetric (p_{ij} = p_{ji}), as in Figure 6. Then the stationary distribution is π_i = 1/N, where N is the number of states. In these notes, we shall consider MCs that are ergodic and symmetric, and therefore have a uniform stationary distribution over the states.

Figure 5: An example of an MC with a stationary distribution. (Two states; self-loop probabilities 1/2 and 3/4, transition probabilities p_12 = 1/2 and p_21 = 1/4.)
Figure 6: A symmetric ergodic MC.

5 Counting problems
Suppose you would like to count the number of blue-eyed people on a planet. One approach is to take some sort of census, checking everyone on the planet for blue eyes. Usually this is not feasible. If you know the total number of people on the planet, then you could instead take a sample in such a way that every person has the same probability of being selected. This would be a uniformly selected sample of size n out of a universe of size N. Let Y be the random variable representing the number of individuals in this sample that have the property (blue eyes, in our example). Then we can infer that the total number of individuals with this property is approximately (Y/n) N.

If the actual number of individuals with the property is pN, then we have a Bernoulli process with n trials and success probability p. The random variable Y has a binomial distribution with parameters (n, p). We are interested in finding out how close Y/n is to p, and to do this we can use Chernoff bounds, which decrease exponentially on the tail of the distribution. Since Chernoff bounds are quite useful, we will digress for a while and derive them in some generality.

Lemma 10  Let X_i be independent Bernoulli random variables with probability of success p_i. Then, for all λ > 0 and all t > 0, we have

  Pr[Σ_{i=1}^n X_i ≥ t] ≤ e^{-λt} E[e^{λ Σ_i X_i}] = e^{-λt} Π_{i=1}^n (p_i e^λ + 1 − p_i).

Proof:
  Pr[Σ_{i=1}^n X_i ≥ t] = Pr[e^{λ Σ_{i=1}^n X_i} ≥ e^{λt}]

for any λ > 0. Moreover, this can be written as Pr[Y ≥ a] with Y ≥ 0. From Markov's inequality we have Pr[Y ≥ a] ≤ E[Y]/a for any nonnegative random variable Y. Thus

  Pr[Σ_{i=1}^n X_i ≥ t] ≤ e^{-λt} E[e^{λ Σ_i X_i}] = e^{-λt} Π_{i=1}^n E[e^{λ X_i}]

because of independence. The equality then follows from the definition of expectation.

Setting t = (1 + δ) E[Σ_i X_i] for some δ > 0 and λ = ln(1 + δ), we obtain:

Corollary 11  Let X_i be independent Bernoulli random variables with probability of success p_i, and let np = E[Σ_{i=1}^n X_i] = Σ_{i=1}^n p_i. Then, for all δ > 0, we have

  Pr[Σ_{i=1}^n X_i ≥ (1 + δ) np] ≤ (1 + δ)^{−(1+δ)np} Π_{i=1}^n E[(1 + δ)^{X_i}] ≤ ( e^δ / (1 + δ)^{1+δ} )^{np}.

The second inequality of the corollary follows from the fact that

  E[(1 + δ)^{X_i}] = p_i (1 + δ) + (1 − p_i) = 1 + δ p_i ≤ e^{δ p_i}.

For δ in the range (0, 1], we can simplify the above expression and write the following more classical form of the Chernoff bound:

Theorem 12 (Chernoff bound)  Let X_i be independent Bernoulli random variables with probability of success p_i, let Y = Σ_{i=1}^n X_i, and let np = Σ_{i=1}^n p_i. Then, for 0 < δ ≤ 1, we have

  Pr[Y − np ≥ δ np] ≤ e^{−δ² np / 3}.

For other ranges of δ, we simply have to change the constant 1/3 appropriately in the exponent. Similarly, we can write a Chernoff bound for the probability that Y is below the mean.

Theorem 13 (Chernoff bound)  Let X_i be independent Bernoulli random variables with probability of success p_i, let Y = Σ_{i=1}^n X_i, and let np = Σ_{i=1}^n p_i. Then, for 0 < δ ≤ 1, we have

  Pr[Y − np ≤ −δ np] ≤ ( e^{−δ} / (1 − δ)^{1−δ} )^{np} ≤ e^{−δ² np / 2}.

The last upper bound of e^{−δ² np / 2} can be derived by a series expansion.

Let us go back to our counting problem. We can use Theorems 12 and 13 to see what sample size n we need to ensure that the relative error in our estimate of p is arbitrarily small. Suppose we wish to impose the bound Pr[|Y/n − p| ≥ εp] ≤ δ. Imposing 2 e^{−ε² np / 3} ≤ δ, we derive that we can let the number of samples be

  n = (3 / (ε² p)) log(2/δ).

Notice that n is polynomial in 1/ε, log(1/δ), and 1/p. If p is exponentially small, then this may be a bad approach.
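The sample-size bound can be turned into a tiny estimator sketch (Python; the values ε = 0.1, δ = 0.05, p = 0.2 are made-up illustrative numbers, not from the notes):

```python
import math
import random

def sample_size(eps, delta, p):
    """Samples needed so that Pr[|Y/n - p| >= eps*p] <= delta,
    from the bound 2*exp(-eps^2 * n * p / 3) <= delta."""
    return math.ceil(3.0 / (eps**2 * p) * math.log(2.0 / delta))

# Estimate a fraction p = 0.2 within 10% relative error,
# with failure probability at most 5%.
p, eps, delta = 0.2, 0.1, 0.05
n = sample_size(eps, delta, p)

random.seed(0)
Y = sum(random.random() < p for _ in range(n))  # simulated uniform sample
estimate = Y / n
print(n, estimate)  # estimate should be within eps*p of p (w.h.p.)
```

Note how n grows like 1/p: this is exactly why the approach breaks down when the property being counted is exponentially rare.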
For example, if we were trying to count the number of American citizens who have dodged the draft, have become President of the country, and are being sued for sexual harassment, we would need an exponential number of trials.

These notions can be formalized as follows. Suppose we would like to compute an integral number f(x) (x represents the input).

Definition 3  An fpras (fully polynomial randomized approximation scheme) for f(x) is a randomized algorithm which, given x and ε, outputs an integer g(x) such that

  Pr[ |g(x) − f(x)| ≤ ε f(x) ] ≥ 3/4

and runs in time polynomial in the size of the input x and in 1/ε.

Thus repeated sampling can be used to obtain an fpras if we can view f(x) as the number of elements with some property in a universe which is only polynomially bigger, and if we can sample uniformly from this universe. Notice that the running time is assumed to be polynomial in 1/ε. If we were to impose the stronger condition that the running time be polynomial in ln(1/ε), then we would be able to compute f(x) exactly in randomized polynomial time whenever the size of the universe is exponential in the input size: simply run the fpras with ε equal to the inverse of the size of the universe.

Going back to our original counting problem, and assuming that p is not too small, the question now is how to draw a uniformly selected individual on the planet or, more generally, a uniformly generated element in the universe under consideration. One possible approach is to use a Markov chain where there is one state for each individual. Assuming that each individual has at most 1000 friends, and that "friendship" is symmetric, we set the transition probability from an individual to each of his friends to be 1/2000. Then, if an individual has k friends, the transition probability to himself will be 1 − k/2000 ≥ 1/2, implying that the chain is aperiodic.
If the graph of friendship is strongly connected (everyone knows everyone else through some sequence of friends of friends), then this MC is ergodic, and the stationary distribution is the uniform distribution on the states. Recall that

  lim_{s→∞} P^s = P^∞, the matrix each of whose rows is (π_1, ..., π_j, ..., π_n).

If this limit converges quickly, we can simulate the MC for a finite number of steps and get "close" to the stationary distribution. Therefore, it would be useful for us to know the rate of convergence to the stationary distribution if we want to use Markov chains to approximately sample from a distribution. It turns out that the rate of convergence is related to the eigenvalues of the transition matrix P. Given a stochastic matrix P (recall that P is stochastic if it is nonnegative and all row sums are 1) with eigenvalues λ_1, ..., λ_N, we have the following.

1. All |λ_i| ≤ 1. Indeed, if P e_i = λ_i e_i then P^s e_i = λ_i^s e_i, and the fact that the LHS is bounded implies that the RHS must also be.

2. Since P is stochastic, λ_1 = 1 (P 1 = 1).

3. The MC is ergodic iff |λ_i| < 1 for all i ≠ 1.

4. If P is symmetric then all eigenvalues are real.

5. If P is symmetric and stochastic and p_ii ≥ 1/2 for all i, then λ_i ≥ 0 for all i. Indeed, if we let Q = 2P − I, then Q is stochastic. Hence the ith eigenvalue of Q is λ_i^Q = 2λ_i − 1, and |2λ_i − 1| ≤ 1 implies that 0 ≤ λ_i ≤ 1.

6. Speed of convergence. The speed of convergence is dictated by the second largest eigenvalue. For simplicity, suppose P has N linearly independent eigenvectors. Then P can be expressed as A^{−1} D A, where D is the diagonal matrix of eigenvalues (D_ii = λ_i). And P² = A^{−1} D A A^{−1} D A = A^{−1} D² A, or, in general,

  P^s = A^{−1} D^s A = Σ_{i=1}^N λ_i^s M_i
      = M_1 + Σ_{i=2}^N λ_i^s M_i,

where M_i is the matrix formed by regrouping corresponding terms from the matrix multiplication. If the MC is ergodic, then |λ_i| < 1 for i ≠ 1, so lim_{s→∞} λ_i^s = 0, implying M_1 = P^∞. Then P^s − P^∞ = Σ_{i=2}^N λ_i^s M_i is dominated by the term corresponding to λ_max = max_{i≠1} |λ_i|. More generally,

Theorem 14  Consider an ergodic time-reversible MC with stationary distribution π.
Then the relative error after t steps is

  Δ(t) = max_{i,j} |p_{ij}^(t) − π_j| / π_j ≤ λ_max^t / min_j π_j.

In particular, for an ergodic symmetric chain with p_ii ≥ 1/2, we have 1 = λ_1 > λ_2 ≥ ... ≥ λ_N ≥ 0, so λ_max = λ_2.

Corollary 15  For an ergodic symmetric MC with p_ii ≥ 1/2, the relative error Δ(t) ≤ ε if t ≥ log(N/ε) / log(1/λ_2).
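To make the role of λ_2 concrete, here is a small numerical sketch (Python/NumPy; the 3-state matrix is made up for illustration, not from the notes) showing that the deviation of P^t from the limiting matrix P^∞ decays at rate roughly λ_2^t for a symmetric chain with p_ii ≥ 1/2:

```python
import numpy as np

# A symmetric, stochastic 3-state chain with p_ii >= 1/2
# (the specific numbers are illustrative).
P = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])

eigvals = np.linalg.eigvalsh(P)[::-1]  # descending order
lam2 = eigvals[1]
assert np.isclose(eigvals[0], 1.0) and (eigvals >= -1e-12).all()

N = P.shape[0]
P_inf = np.full((N, N), 1.0 / N)   # uniform stationary distribution

# The entrywise error |P^t - P_inf| is bounded by ~ lam2^t.
for t in [1, 5, 10]:
    err = np.abs(np.linalg.matrix_power(P, t) - P_inf).max()
    print(t, err, lam2**t)
```

For this particular matrix λ_2 works out to 0.25, so ten steps already bring the chain within about 10^-6 of uniform.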
Returning to our example: in order to calculate how many iterations are needed until we are arbitrarily close to the uniform distribution, we need to evaluate the second eigenvalue. For this purpose, Jerrum and Sinclair [11] have derived a relationship between the so-called conductance of the MC and λ_max. Their result can be viewed as an isoperimetric inequality, and its derivation is analogous to an isoperimetric inequality of Cheeger [4] for Riemannian manifolds, or to results on expanders by Alon [1] and Alon and Milman [2].

6 Conductance of Markov chains (Jerrum-Sinclair)
Given a set S of states, let C_S denote the capacity of S, which is the probability of being in some state of S when steady state is reached. Specifically,

  C_S = Σ_{i∈S} π_i.

Define F_S, the ergodic flow out of S (the expected flow out of S), by

  F_S = Σ_{i∈S, j∉S} π_i p_{ij},

summing over transitions from S to the complement S̄ of S. Clearly F_S ≤ C_S. Define Φ_S = F_S / C_S, which is the probability of leaving S given that we are already in S. Define the conductance of the MC:

  Φ := min_{S : C_S ≤ 1/2} Φ_S.

Intuitively, if we have an MC with small conductance, then once we are in a set S, we are "stuck" in S for a long time. This implies that it will take a long time to reach the stationary distribution, so the rate of convergence will be slow. We might therefore expect that λ_2 will be close to 1 if the conductance is small.

Theorem 16 (Jerrum-Sinclair [11])  For an ergodic MC that is TR, we have

  λ_2 ≤ 1 − Φ²/2.

Remark 1  There exist corresponding lower bounds expressing that Δ(t) ≥ λ_2^t and λ_2 ≥ 1 − 2Φ. This therefore shows that the conductance is an appropriate measure of the
speed of convergence.

7 Evaluation of conductance of Markov chains
Given a Markov chain, the task will be to evaluate the conductance Φ. In order to generate Markov chains with a uniform steady-state distribution and the rapid-mixing property, we restrict our attention to the following MCs: symmetric, with equal transition probability p between states joined by a transition, i.e., for all i ≠ j, either p_ij = p_ji = 0 or p_ij = p_ji = p. Such a chain has a uniform steady-state distribution π_i = 1/N, where N denotes the number of states. Instead of looking at the MC, we can look at the underlying graph G = (V, E), with E = {(i, j) : p_ij = p_ji = p}. For a set S of states, let δ(S) = {(i, j) ∈ E : i ∈ S, j ∉ S}. Then

  C_S = Σ_{i∈S} π_i = |S|/N,

  F_S = Σ_{i∈S, j∉S} π_i p_ij = p |δ(S)| / N,

  Φ_S = F_S / C_S = p |δ(S)| / |S|.
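For a small chain of this restricted form, the conductance can be evaluated by brute force directly from Φ_S = p |δ(S)|/|S| (a sketch; the 4-cycle graph below is an illustrative example, not one from the notes):

```python
from itertools import combinations

def conductance(n, edges, p):
    """Brute-force conductance of a symmetric MC with uniform
    transition probability p on each edge of its underlying graph.
    Uses Phi_S = p * |delta(S)| / |S|, minimized over C_S <= 1/2."""
    best = float("inf")
    for size in range(1, n // 2 + 1):        # C_S = |S|/n <= 1/2
        for S in combinations(range(n), size):
            S = set(S)
            cut = sum(1 for (i, j) in edges if (i in S) != (j in S))
            best = min(best, p * cut / len(S))
    return best

# Illustrative example: a 4-cycle with p = 1/4.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(conductance(4, edges, 0.25))  # -> 0.25 (adjacent pair is the bottleneck)
```

The exponential enumeration of subsets is of course only viable for tiny chains; the magnification-factor argument below is what makes large chains, such as the d-cube, tractable.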
Definition 4  The magnification factor of G is

  μ(G) = min_{0 < |S| ≤ |V|/2} |δ(S)| / |S|.

Therefore, Φ = min_S Φ_S = p μ(G).

In the rest of this section, we study the conductance of a simple MC, which will be useful in the problem of counting matchings. Take an MC whose states are all binary numbers with d bits, so that the underlying graph is a d-cube with 2^d states. Two nodes are adjacent if and only if they differ in exactly one bit. Refer to Figure 7 for a 3-cube.
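The d-cube walk described below is easy to simulate. This sketch (Python; the choice d = 8 and the fixed seed are assumptions for illustration) uses the equivalent formulation of the chain: with probability 1/2 stay put, otherwise flip a uniformly chosen bit, which gives each of the d neighbors probability 1/(2d):

```python
import random

def cube_walk_step(x, d, rng):
    """One step of the d-cube chain: each neighbor (one-bit flip) has
    probability 1/(2d); the remaining mass >= 1/2 is a self-loop."""
    if rng.random() < 0.5:
        return x                         # self-loop, probability 1/2
    return x ^ (1 << rng.randrange(d))   # flip a uniformly chosen bit

d = 8
rng = random.Random(0)
x = 0
for _ in range(10 * d):   # run long enough to mix
    x = cube_walk_step(x, d, rng)
print(format(x, f"0{d}b"))  # an (approximately) uniform random d-bit string
```

Simulating the walk for enough steps is exactly the "random d-bit number" generator the notes mention; the conductance analysis that follows quantifies how many steps "enough" is.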
Figure 7: A 3-cube Markov chain (vertices 000 through 111).

If we put probabilities of 1/(2d) on all out-transitions, then the self-loops get probabilities ≥ 1/2. The MC is symmetric and ergodic, so π_i = 1/2^d. If we simulate a random walk on this d-cube, we obtain a random d-bit number.

Claim 17  The magnification factor of the d-cube is μ(G) = 1.

Proof:
1. μ(G) ≤ 1. Let S_1 be the vertex set of a "half cube", e.g. all vertices whose state number starts with 0. Then clearly every vertex in S_1 has exactly one edge incident to it leaving S_1. Therefore |δ(S_1)| = |S_1| = |V|/2, and

  μ(G) = min_{0 < |S| ≤ |V|/2} |δ(S)|/|S| ≤ |δ(S_1)|/|S_1| = 1.

2. μ(G) ≥ 1. For each pair x_1, x_2, define a random path between x_1 and x_2, selected uniformly among all shortest paths between x_1 and x_2. For example, for

  x_1 = 1 0 1 1 0 1 0
  x_2 = 1 0 0 1 1 1 1

we first look at the bits in which they differ (the 3rd, 5th and 7th). The order in which you change these bits determines a (or all) shortest path(s) in the d-cube. Given e ∈ E, by symmetry,

  E[# of paths through e] = T / |E|,

where T is the total length of all shortest paths. We can compute T by first choosing x_1 (2^d choices), then summing over all paths from x_1 of length 1 up to d. Any path is counted twice: once as x_1 → x_2 and once as x_2 → x_1. Thus

  T = (1/2) 2^d Σ_{k=0}^d C(d,k) k.

This can also be written as T = (1/2) 2^d Σ_{k=0}^d C(d,k) (d − k). Taking the average of these two expressions, we deduce that

  T = (1/2) 2^d Σ_{k=0}^d C(d,k) (d/2) = d 2^{2d−2}.

On the other hand, |E| = (1/2) 2^d d. Hence,

  E[# of paths through e] = T / |E| = d 2^{2d−2} / (d 2^{d−1}) = 2^{d−1}.
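The counting identity just derived is easy to sanity-check numerically (a side check in Python, not part of the proof):

```python
from math import comb

for d in range(1, 10):
    # T = (1/2) * 2^d * sum_k C(d,k) * k: total length of all shortest paths.
    T = (2**d * sum(comb(d, k) * k for k in range(d + 1))) // 2
    assert T == d * 2**(2*d - 2)     # closed form from the averaging trick
    E = d * 2**(d - 1)               # number of edges of the d-cube
    assert T // E == 2**(d - 1)      # expected # of paths through an edge
print("identities hold for d = 1..9")
```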
Consider any set S. By linearity of expectation,

  E[# of paths intersecting δ(S)] ≤ Σ_{e∈δ(S)} E[# of paths through e] = 2^{d−1} |δ(S)|.

...