IE 598: Big Data Optimization                                                Fall 2016

Lecture 16: Smoothing Techniques I – October 18

Lecturer: Niao He                                              Scribers: Harsh Gupta

Overview. Over the past week we discussed the Subgradient Descent and Mirror Descent algorithms for non-smooth convex optimization, and observed that Subgradient Descent is a special case of Mirror Descent. Both algorithms, however, are stated for general problems and do not exploit the structure of the problem at hand. In practice, we usually know something about the structure of the optimization problem we intend to solve, and one can utilize this structure to design algorithms that are more efficient than Subgradient Descent and Mirror Descent.

16.1 Introduction

We intend to solve the following optimization problem:
\[
  \min_{x \in X} f(x) \qquad (16.1)
\]
where $f$ is a convex but non-smooth (i.e., non-differentiable) function and $X$ is a convex compact set. One intuitive way to approach this problem is to approximate the non-smooth function $f(x)$ by a smooth convex function $f_\mu(x)$, so that the standard techniques learnt so far in the course can be applied. Hence, we want to reduce the problem in (16.1) to
\[
  \min_{x \in X} f_\mu(x) \qquad (16.2)
\]
where $f_\mu(x)$ is a smooth convex approximation of $f(x)$ whose gradient is $L_\mu$-Lipschitz continuous. We can then use the techniques learnt earlier in this course, such as gradient descent, accelerated gradient descent, the Frank-Wolfe algorithm, coordinate descent, etc., to solve the above problem. Clearly, the objective is to construct a reasonably good approximation $f_\mu$ of $f$ so that solving (16.2) is as close to solving (16.1) as possible.

A motivating example: Consider the simplest non-smooth convex function, $f(x) = |x|$. The following function, known as the Huber function,
\[
  f_\mu(x) =
  \begin{cases}
    \dfrac{x^2}{2\mu}, & |x| \le \mu \\[4pt]
    |x| - \dfrac{\mu}{2}, & |x| > \mu
  \end{cases}
  \qquad (16.3)
\]
is a smooth approximation of the absolute value function. We plot the two functions (for $\mu = 1$) in Figure 1. We make the following observations:

1. $f_\mu(x)$ is continuous and differentiable everywhere; this follows directly from the formulation in (16.3).

2. $f_\mu(x) \le f(x)$. Also, $f_\mu(x) \ge |x| - \frac{\mu}{2}$, and therefore
\[
  f(x) - \frac{\mu}{2} \le f_\mu(x) \le f(x).
\]
Hence, as $\mu \to 0$, $f_\mu(x) \to f(x)$; the parameter $\mu$ characterizes the approximation accuracy.

3. $|f_\mu''(x)| \le \frac{1}{\mu}$, which implies that the gradient $f_\mu'$ is $\frac{1}{\mu}$-Lipschitz continuous, i.e., $f_\mu$ is $\frac{1}{\mu}$-smooth.

The Huber function approximation has been widely used in machine learning to smooth non-smooth loss functions, e.g., the absolute loss (robust regression) and the hinge loss (SVM).
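As a numerical illustration (not part of the original notes), the following minimal NumPy sketch implements the Huber function from (16.3) and its gradient, and checks the sandwich bound $f(x) - \mu/2 \le f_\mu(x) \le f(x)$ from observation 2 on a grid; the function names (huber, huber_grad) and the choices of $\mu$ and the grid are illustrative assumptions.

```python
import numpy as np

def huber(x, mu):
    """Huber smoothing of |x| as in (16.3): quadratic for |x| <= mu, linear beyond."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= mu, x**2 / (2 * mu), np.abs(x) - mu / 2)

def huber_grad(x, mu):
    """Derivative of the Huber function: x/mu near zero, sign(x) outside [-mu, mu].

    Clipping x/mu to [-1, 1] gives exactly this; the derivative changes by at most
    |x - y|/mu, reflecting that f_mu is (1/mu)-smooth (observation 3)."""
    x = np.asarray(x, dtype=float)
    return np.clip(x / mu, -1.0, 1.0)

if __name__ == "__main__":
    mu = 0.1
    xs = np.linspace(-2, 2, 1001)
    f = np.abs(xs)
    f_mu = huber(xs, mu)

    # Sandwich bound from observation 2: f(x) - mu/2 <= f_mu(x) <= f(x).
    assert np.all(f_mu <= f + 1e-12)
    assert np.all(f_mu >= f - mu / 2 - 1e-12)
    print("max gap f - f_mu:", np.max(f - f_mu))  # at most mu/2
```

The printed gap never exceeds $\mu/2$, matching the approximation-accuracy role of $\mu$ described above: a smaller $\mu$ gives a tighter approximation of $|x|$, at the price of a larger smoothness constant $1/\mu$.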