# Random features: approximation and estimation error


**Theorem 25**:

* Let
$$\mathcal{F} \stackrel{\mathrm{def}}{=} \left\{ x \mapsto \int \alpha(\omega)\,\phi_\omega(x)\,\mu(d\omega) \;:\; \forall \omega,\ |\alpha(\omega)| \le C \right\} \tag{416}$$
be the subset of functions in the RKHS $\mathcal{H}$ with bounded Fourier coefficients $\alpha(\omega)$.

* Let
$$\hat{\mathcal{F}} \stackrel{\mathrm{def}}{=} \left\{ x \mapsto \frac{1}{m} \sum_{i=1}^m \alpha(\omega_i)\,\phi_{\omega_i}(x) \;:\; \forall \omega,\ |\alpha(\omega)| \le C \right\} \tag{417}$$
be the subset that is spanned by the random feature functions, where $\omega_{1:m}$ are drawn i.i.d. from $\mu$.

* Let $p^*$ be any distribution over $\mathcal{X} = \mathbb{R}^b$.

* Define the inner product with respect to the data-generating distribution (this is not the RKHS inner product):
$$\langle f, g \rangle \stackrel{\mathrm{def}}{=} \mathbb{E}_{x \sim p^*}[f(x)\,g(x)]. \tag{418}$$

* Let $f^* \in \mathcal{F}$ be any true function.

* Then with probability at least $1 - \delta$, there exists $\hat{f} \in \hat{\mathcal{F}}$ such that
$$\|\hat{f} - f^*\| \le \frac{C}{\sqrt{m}} \left(1 + \sqrt{2 \log(1/\delta)}\right). \tag{419}$$

Proof of Theorem 25:

* This proof uses fairly standard tools: McDiarmid's inequality and Jensen's inequality. The function we're applying involves taking a norm of a function, but we just need the bounded differences condition to hold.

* Fix $f^* \in \mathcal{F}$ with coefficients $\alpha(\omega)$.

* Construct $\hat{f}$ with the same coefficients, and note that $\hat{f} \in \hat{\mathcal{F}}$ and $\mathbb{E}[\hat{f}] = f^*$.

* Define
$$D(\omega_{1:m}) = \|\hat{f} - f^*\|. \tag{420}$$
Note that $D$ satisfies the bounded differences condition: letting $\omega^i_{1:m} = \omega_{1:m}$ except on the $i$-th component, where it is $\omega_i'$:
$$D(\omega_{1:m}) - D(\omega^i_{1:m}) \le \|\hat{f} - f^*\| - \|\hat{f}^i - f^*\| \tag{421}$$
$$\le \|\hat{f} - \hat{f}^i\| \quad \text{[triangle inequality]} \tag{422}$$
$$= \frac{1}{m} \|\alpha(\omega_i)\phi_{\omega_i} - \alpha(\omega_i')\phi_{\omega_i'}\| \tag{423}$$
$$\le \frac{2C}{m}. \tag{424}$$
Note that the last line follows because $|\alpha(\omega_i)| \le C$ and $\phi_{\omega_i}(x) = e^{-i\langle \omega_i, x \rangle}$ and $|e^{-ia}| = 1$ for all $a$.
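The bounded differences property above can be checked numerically. The following sketch (not from the notes) makes illustrative choices for the ingredients left abstract here: $\mu$ is standard Gaussian, $\alpha(\omega) = C\tanh(\omega_1)$ is just some bounded coefficient function, and the norm is estimated by Monte Carlo over $x \sim p^*$. It resamples each $\omega_i$ in turn and confirms $\|\hat{f} - \hat{f}^i\| \le 2C/m$:

```python
import numpy as np

rng = np.random.default_rng(0)
b, C, m = 2, 1.0, 25                          # input dim, coefficient bound, #features

alpha = lambda W: C * np.tanh(W[:, 0])        # illustrative choice with |alpha(omega)| <= C
phi = lambda X, W: np.exp(-1j * X @ W.T)      # phi_omega(x) = e^{-i<omega, x>}

def fhat(X, W):
    """Random-feature function (1/m) sum_i alpha(omega_i) phi_{omega_i}(x)."""
    return (phi(X, W) * alpha(W)).mean(axis=1)

X = rng.standard_normal((2000, b))            # x ~ p* (assumed Gaussian), for the MC norm
W = rng.standard_normal((m, b))               # omega_{1:m} drawn i.i.d. from mu

# Resample the i-th frequency and measure ||fhat - fhat^i|| for each i.
gaps = []
for i in range(m):
    Wi = W.copy()
    Wi[i] = rng.standard_normal(b)            # the replacement omega_i'
    diff = fhat(X, W) - fhat(X, Wi)
    gaps.append(np.sqrt(np.mean(np.abs(diff) ** 2)))

print(f"max ||fhat - fhat^i|| = {max(gaps):.4f}   vs   2C/m = {2 * C / m:.4f}")
```

Every gap stays below $2C/m$, exactly as the pointwise bound $|\alpha(\omega_i)\phi_{\omega_i}(x) - \alpha(\omega_i')\phi_{\omega_i'}(x)| \le 2C$ predicts.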
* We can bound the mean by passing to the variance:
$$\mathbb{E}[D(\omega_{1:m})] \le \sqrt{\mathbb{E}[D(\omega_{1:m})^2]} \quad \text{[Jensen's inequality]} \tag{425}$$
$$= \sqrt{\mathbb{E}\left\|\frac{1}{m}\sum_{i=1}^m \left(\alpha(\omega_i)\phi_{\omega_i} - f^*\right)\right\|^2} \quad \text{[expand]} \tag{426}$$
$$= \sqrt{\frac{1}{m^2}\sum_{i=1}^m \mathbb{E}\|\alpha(\omega_i)\phi_{\omega_i} - f^*\|^2} \quad \text{[variance of i.i.d. sum]} \tag{427}$$
$$\le \frac{C}{\sqrt{m}} \quad \text{[use } |\alpha(\omega_i)| \le C\text{]}. \tag{428}$$

* Applying McDiarmid's inequality (Theorem 8), we get that
$$\mathbb{P}\left[D(\omega_{1:m}) \ge \frac{C}{\sqrt{m}} + \epsilon\right] \le \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^m (2C/m)^2}\right). \tag{429}$$
Rearranging yields the theorem.

Remark: the definition of $\alpha$ here differs from the Rahimi/Recht paper.

Corollary:

* Suppose we had a loss function $\ell(y, v)$ which is 1-Lipschitz in the second argument (e.g., the hinge loss). Define the expected risk in the usual way:
$$L(f) \stackrel{\mathrm{def}}{=} \mathbb{E}_{(x,y) \sim p^*}[\ell(y, f(x))]. \tag{430}$$
Then the approximation ratio is bounded:
$$L(\hat{f}) - L(f^*) \le \mathbb{E}[|\ell(y, \hat{f}(x)) - \ell(y, f^*(x))|] \quad \text{[definition, add } |\cdot|\text{]} \tag{431}$$
$$\le \mathbb{E}[|\hat{f}(x) - f^*(x)|] \quad \text{[fix } y\text{, } \ell \text{ is Lipschitz]} \tag{432}$$
$$\le \|\hat{f} - f^*\| \quad \text{[concavity of } \sqrt{\cdot}\text{]}. \tag{433}$$

So far, we have analyzed approximation error due to having a finite $m$, but assuming an infinite amount of data. Separately, there is the estimation error due to having $n$ data points:
$$L(\hat{f}_{\mathrm{ERM}}) - L(\hat{f}) \le O_p\left(\frac{C}{\sqrt{n}}\right), \tag{434}$$
where $\hat{f}_{\mathrm{ERM}}$ minimizes the empirical risk over the random hypothesis class $\hat{\mathcal{F}}$. So the total error, which includes both approximation error and estimation error, is
$$L(\hat{f}_{\mathrm{ERM}}) - L(f^*) = O_p\left(\frac{C}{\sqrt{n}} + \frac{C}{\sqrt{m}}\right). \tag{435}$$
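The high-probability bound (419) can also be checked end to end by simulation. The sketch below (my construction, with illustrative choices for $\mu$, $p^*$, and $\alpha$ that the notes leave abstract) approximates $f^*$ with a very large feature sample, draws many replicates of $\omega_{1:m}$, and compares the empirical $(1-\delta)$-quantile of $D(\omega_{1:m})$ against the theoretical bound:

```python
import numpy as np

rng = np.random.default_rng(1)
b, C, m, delta = 2, 1.0, 50, 0.1

alpha = lambda W: C * np.cos(W[:, 0])          # illustrative choice with |alpha| <= C
phi = lambda X, W: np.exp(-1j * X @ W.T)       # phi_omega(x) = e^{-i<omega, x>}

X = rng.standard_normal((300, b))              # x ~ p* (assumed Gaussian)
W_star = rng.standard_normal((10_000, b))      # large sample stands in for the integral f*
f_star = (phi(X, W_star) * alpha(W_star)).mean(axis=1)

def D():
    """One draw of D(omega_{1:m}) = ||fhat - f*||, estimated under p*."""
    W = rng.standard_normal((m, b))            # omega_{1:m} ~ mu
    f_hat = (phi(X, W) * alpha(W)).mean(axis=1)
    return np.sqrt(np.mean(np.abs(f_hat - f_star) ** 2))

draws = np.array([D() for _ in range(500)])
bound = C / np.sqrt(m) * (1 + np.sqrt(2 * np.log(1 / delta)))
print("empirical (1-delta)-quantile of D:", round(np.quantile(draws, 1 - delta), 3))
print("bound from (419):                 ", round(bound, 3))
```

As expected, the empirical quantile sits well below the bound; McDiarmid-style bounds are typically loose by a constant factor, since they only use the bounded-differences structure.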
This bound suggests that the approximation and estimation errors are balanced when $m$ and $n$ are on the same order. One takeaway is that we shouldn't over-optimize one without the other. But one might also strongly object and say that if $m \approx n$, then we aren't really getting any savings! This is indeed a valid complaint, and in order to get stronger results, we would need to impose more structure on the problem.
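The $m$-versus-$n$ tradeoff can be seen concretely in a small experiment (my construction, not from the notes): fix $n$ training points, fit ridge regression on top of the real-valued cosine form of random Fourier features, and watch the test error improve with $m$ until it plateaus, at which point the estimation error (driven by $n$) dominates:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_predict(Xtr, ytr, Xte, m, lam=1e-3):
    """Ridge regression on m random Fourier features z(x) = sqrt(2/m) cos(Wx + b)."""
    W = rng.standard_normal((m, Xtr.shape[1]))     # frequencies ~ N(0, I) (RBF kernel)
    b = rng.uniform(0.0, 2 * np.pi, m)             # random phases
    Ztr = np.sqrt(2.0 / m) * np.cos(Xtr @ W.T + b)
    Zte = np.sqrt(2.0 / m) * np.cos(Xte @ W.T + b)
    w = np.linalg.solve(Ztr.T @ Ztr + lam * np.eye(m), Ztr.T @ ytr)
    return Zte @ w

n, d = 400, 2
X = rng.standard_normal((n + 1000, d))
y = np.sin(X[:, 0]) * np.cos(X[:, 1])              # a smooth illustrative target
Xtr, ytr, Xte, yte = X[:n], y[:n], X[n:], y[n:]

for m in [10, 50, 200, 800]:
    rmse = np.sqrt(np.mean((fit_predict(Xtr, ytr, Xte, m) - yte) ** 2))
    print(f"m = {m:4d}   test RMSE = {rmse:.4f}")
```

Once $m$ is past the order of $n$, adding more random features buys little: consistent with (435), the $C/\sqrt{n}$ term takes over.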
