# Study Notes on 《非线性最优化基础》 (Fundamentals of Nonlinear Optimization)

August 2, 2015

## Main Contents

1. Convex functions and closed functions
2. Conjugate functions
3. Saddle-point problems
4. The Lagrange dual problem
5. Extensions of Lagrange duality
6. Fenchel duality


## Algorithms

### 1. Proximal Gradient Method

#### Proximal mapping

The proximal mapping (or proximal operator) of a convex function $h$ is

$${\bf prox}_h(x) = \mathop{\rm argmin}_{u} \left( h(u) + \frac{1}{2}\|u - x\|^2_2 \right)$$

Examples:

1. $h(x) = 0$: ${\bf prox}_h(x) = x$

2. $h(x) = I_C(x)$ (indicator function of $C$): ${\bf prox}_h$ is projection on $C$

3. $h(x) = t\|x\|_1$: ${\bf prox}_h$ is the shrinkage (soft-thresholding) operation
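These three prox maps can be written out directly; a minimal Python sketch (the function names are my own):

```python
def prox_zero(x):
    # Example 1: h = 0, so the prox map is the identity
    return x

def prox_box(x, lo, hi):
    # Example 2: h = indicator of the box C = [lo, hi]^n; prox is projection onto C
    return [min(max(xi, lo), hi) for xi in x]

def prox_l1(x, t):
    # Example 3: h(x) = t*||x||_1; prox is elementwise soft-thresholding
    return [(abs(xi) - t if abs(xi) > t else 0.0) * (1.0 if xi >= 0 else -1.0)
            for xi in x]
```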

The proximal gradient method solves an unconstrained problem with the cost function split into two components, minimize $g(x) + h(x)$, where

1. $g$ is convex and differentiable, with $\mathop{dom} g = \Re^n$

2. $h$ is closed and convex, possibly nondifferentiable, and ${\bf prox}_h$ is inexpensive

The iteration is

$$x^{(k)} = {\bf prox}_{t_kh}\left( x^{(k-1)} - t_k\nabla g(x^{(k-1)}) \right)$$

with step size $t_k$ constant or determined by line search.
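The proximal gradient iteration $x^{(k)} = {\bf prox}_{t_kh}(x^{(k-1)} - t_k\nabla g(x^{(k-1)}))$ can be sketched on a one-dimensional toy problem (all names and the test problem below are chosen for illustration):

```python
def soft_threshold(x, t):
    # prox of t*|.| in one dimension (soft-thresholding)
    return (abs(x) - t if abs(x) > t else 0.0) * (1.0 if x >= 0 else -1.0)

def proximal_gradient(x0, grad_g, prox_h, t, iters=100):
    # x_k = prox_{t h}(x_{k-1} - t * grad_g(x_{k-1})), fixed step size t
    x = x0
    for _ in range(iters):
        x = prox_h(x - t * grad_g(x), t)
    return x

# minimize g(x) + h(x) with g(x) = (x - 3)^2 / 2 and h(x) = |x|;
# the minimizer is x = 2, where g'(x) = -1 cancels the subgradient of |x|
x_star = proximal_gradient(0.0, lambda x: x - 3.0, soft_threshold, t=1.0)
```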

#### Interpretation

from the definition of the proximal operator:

$$x^+ = {\bf prox}_{th}\left( x - t\nabla g(x) \right) = \mathop{\rm argmin}_u \left( h(u) + g(x) + \nabla g(x)^T(u - x) + \frac{1}{2t}\|u - x\|^2_2 \right)$$

$x^+$ minimizes $h(u)$ plus a simple quadratic local model of $g(u)$ around $x$

#### Examples

gradient method: $h(x) = 0$, i.e., minimize $g(x)$

gradient projection method: $h(x) = I_C(x)$, i.e., minimize $g(x)$ over $C$

iterative soft-thresholding: $h(x) = \|x\|_1$, i.e., minimize $g(x) + \|x\|_1$
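As a small illustration of the gradient projection special case, a one-dimensional sketch (the problem and names are chosen for this example):

```python
def project_interval(x, lo=-1.0, hi=1.0):
    # prox of the indicator of C = [lo, hi] is Euclidean projection onto C
    return min(max(x, lo), hi)

def gradient_projection(x0, grad_g, project, t, iters=200):
    # x_k = P_C(x_{k-1} - t * grad_g(x_{k-1}))
    x = x0
    for _ in range(iters):
        x = project(x - t * grad_g(x))
    return x

# minimize g(x) = (x - 2)^2 over C = [-1, 1]; the constrained minimizer is x = 1
x_star = gradient_projection(0.0, lambda x: 2.0 * (x - 2.0), project_interval, t=0.4)
```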


### 2. Dual Proximal Gradient Methods

#### Composite structure in the Dual

the dual of the problem minimize $f(x) + g(Ax)$ has the right structure for the proximal gradient method if:

1. the prox-operator of $g$ (or $g^\ast$) is cheap (closed form or simple algorithm)

2. $f$ is strongly convex ($f(x) - \frac{\mu}{2}\|x\|^2_2$ is convex), which implies that $f^\ast\left(-A^Tz\right)$ has a Lipschitz continuous gradient with $L = \frac{\|A\|^2_2}{\mu}$, because $\nabla f^\ast$ is Lipschitz continuous with constant $\frac{1}{\mu}$

equivalent expression in terms of $f$:

$$\hat{x} = \mathop{\rm argmin}_x \left( f(x) + z^TAx \right), \qquad z^+ = {\bf prox}_{tg^\ast}\left( z + tA\hat{x} \right)$$

1. if $f$ is separable, the calculation of $\hat{x}$ decomposes into independent problems

2. step size $t$ constant or from backtracking line search

#### Alternating minimization interpretation

Moreau decomposition gives an alternate expression for the $z$-update:

$$z^+ = z + t\left( A\hat{x} - \hat{y} \right), \qquad \text{where} \quad \hat{y} = {\bf prox}_{g/t}\left( A\hat{x} + \frac{z}{t} \right)$$

so each iteration is an alternating minimization of:

1. the Lagrangian $f(x) + g(y) + z^T(Ax - y)$ over $x$

2. the augmented Lagrangian $f(x) + g(y) + z^T(Ax - y) + \frac{t}{2}\|Ax - y\|^2_2$ over $y$
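The equivalence asserted by the Moreau decomposition, $z^+ = {\bf prox}_{tg^\ast}(z + tA\hat{x}) = z + t(A\hat{x} - \hat{y})$ with $\hat{y} = {\bf prox}_{g/t}(A\hat{x} + z/t)$, can be checked numerically; here a scalar instance with $g = |\cdot|$, so that $g^\ast$ is the indicator of $[-1, 1]$ (the data values are arbitrary):

```python
def soft_threshold(v, s):
    # prox of s*|.|: soft-thresholding
    return (abs(v) - s if abs(v) > s else 0.0) * (1.0 if v >= 0 else -1.0)

def project(v, lo=-1.0, hi=1.0):
    # prox of the indicator of [lo, hi]: projection (independent of the step size)
    return min(max(v, lo), hi)

# g = |.|  =>  prox_{t g*} is projection onto [-1, 1] and prox_{g/t}(v) = soft_threshold(v, 1/t)
z, t, Ax = 0.3, 2.0, 1.5                     # arbitrary scalar data, Ax plays the role of A*xhat
z_direct = project(z + t * Ax)               # z+ = prox_{t g*}(z + t*A*xhat)
y_hat = soft_threshold(Ax + z / t, 1.0 / t)  # yhat = prox_{g/t}(A*xhat + z/t)
z_moreau = z + t * (Ax - y_hat)              # equivalent z-update via Moreau decomposition
```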

#### Regularized norm approximation

a special case with $g(y) = \|y - b\|$, i.e., minimize $f(x) + \|Ax - b\|$

$C$ is the unit norm ball for the dual norm $\|\cdot\|_\ast$

#### Example

$C_i$ is the unit Euclidean norm ball in $\Re^{m_i}$, where $B_i \in \Re^{m_i \times n}$

#### Minimization over intersection of convex sets

$f$ strongly convex; e.g., $f(x) = \|x - a\|^2_2$ for projecting $a$ onto the intersection

the sets $C_i$ are closed, convex, and easy to project onto
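A sketch of the dual proximal gradient method for this problem, projecting a scalar point onto the intersection of two intervals; the sets and data are chosen for illustration, and the $z$-updates use the Moreau-decomposition form:

```python
def project(v, lo, hi):
    # Euclidean projection onto the interval [lo, hi]
    return min(max(v, lo), hi)

# project a = 1.7 onto C1 ∩ C2 with C1 = [-1, 1], C2 = [0, 2] (intersection [0, 1]),
# i.e. minimize f(x) = (x - a)^2 subject to x ∈ C1 and x ∈ C2.
# f is strongly convex with mu = 2; with A = (1, 1)^T, L = ||A||^2 / mu = 1, so t = 1.
a, t = 1.7, 1.0
z1 = z2 = 0.0
for _ in range(100):
    x_hat = a - (z1 + z2) / 2.0                 # argmin_x f(x) + (z1 + z2) x
    y1 = project(x_hat + z1 / t, -1.0, 1.0)     # prox step for the indicator of C1
    y2 = project(x_hat + z2 / t, 0.0, 2.0)      # prox step for the indicator of C2
    z1 += t * (x_hat - y1)                      # dual updates, z+ = z + t(A xhat - yhat)
    z2 += t * (x_hat - y2)
x_hat = a - (z1 + z2) / 2.0                     # primal iterate converges to the projection
```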

#### Decomposition of separable problems

each $f_i$ is strongly convex; $g_i$ has an inexpensive prox-operator

### 3. Fast proximal gradient methods

#### FISTA (basic version)

$g$ convex, differentiable, with $\mathop{dom} g = \Re^n$

$h$ closed, convex, with inexpensive ${\bf prox}_{th}$ operator

algorithm: choose any $x^{(0)} = x^{(-1)}$; for $k \geq 1$, repeat the steps

$$y = x^{(k-1)} + \frac{k-2}{k+1}\left( x^{(k-1)} - x^{(k-2)} \right)$$

$$x^{(k)} = {\bf prox}_{t_kh}\left( y - t_k\nabla g(y) \right)$$

step size $t_k$ fixed or determined by line search

acronym stands for ‘Fast Iterative Shrinkage-Thresholding Algorithm’
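A compact sketch of the basic version (extrapolation $\frac{k-2}{k+1}$, then a prox-gradient step at $y$), reusing the one-dimensional toy problem $g(x) = (x-3)^2/2$, $h(x) = |x|$; names are mine:

```python
def soft_threshold(x, t):
    # prox of t*|.|: soft-thresholding
    return (abs(x) - t if abs(x) > t else 0.0) * (1.0 if x >= 0 else -1.0)

def fista(x0, grad_g, prox_h, t, iters=100):
    # basic FISTA with fixed step size t
    x_prev = x = x0
    for k in range(1, iters + 1):
        y = x + (k - 2.0) / (k + 1.0) * (x - x_prev)  # extrapolated point
        x_prev, x = x, prox_h(y - t * grad_g(y), t)   # prox-gradient step at y
    return x

# minimize g(x) + h(x) with g(x) = (x - 3)^2 / 2, h(x) = |x|; minimizer is x = 2
x_star = fista(0.0, lambda x: x - 3.0, soft_threshold, t=1.0)
```

Note that at $k = 1$ the extrapolation term vanishes (since $x^{(0)} = x^{(-1)}$), matching the interpretation below.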

#### Interpretation

first iteration ($k = 1$) is a proximal gradient step at $y = x^{(0)}$

next iterations are proximal gradient steps at extrapolated points $y$

note: $x^{(k)}$ is feasible (in $\mathop{dom} h$); $y$ may be outside $\mathop{dom} h$

#### Reformulation of FISTA

define $\theta_k = \frac{2}{k+1}$ and introduce an intermediate variable $v^{(k)}$

algorithm: choose $x^{(0)} = v^{(0)}$; for $k \geq 1$, repeat the steps

$$y = (1-\theta_k)x^{(k-1)} + \theta_kv^{(k-1)}$$

$$x^{(k)} = {\bf prox}_{t_kh}\left( y - t_k\nabla g(y) \right)$$

$$v^{(k)} = x^{(k-1)} + \frac{1}{\theta_k}\left( x^{(k)} - x^{(k-1)} \right)$$

#### Nesterov’s second method

algorithm: choose $x^{(0)} = v^{(0)}$; for $k \geq 1$, repeat the steps

$$y = (1-\theta_k)x^{(k-1)} + \theta_kv^{(k-1)}$$

$$v^{(k)} = {\bf prox}_{(t_k/\theta_k)h}\left( v^{(k-1)} - \frac{t_k}{\theta_k}\nabla g(y) \right)$$

$$x^{(k)} = (1-\theta_k)x^{(k-1)} + \theta_kv^{(k)}$$

use $\theta_k = \frac{2}{k+1}$ and $t_k = \frac{1}{L}$, or one of the line search methods

identical to FISTA if $h(x) = 0$

unlike in FISTA, $y$ is feasible (in $\mathop{dom} h$) if we take $x^{(0)} \in \mathop{dom} h$
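A sketch of this variant on the same one-dimensional toy problem, assuming $\theta_k = \frac{2}{k+1}$ and a fixed step $t = \frac{1}{L}$ (names are mine):

```python
def soft_threshold(x, t):
    # prox of t*|.|: soft-thresholding
    return (abs(x) - t if abs(x) > t else 0.0) * (1.0 if x >= 0 else -1.0)

def nesterov_second(x0, grad_g, prox_h, t, iters=50):
    # Nesterov's second method with theta_k = 2/(k+1) and fixed step size t
    x = v = x0
    for k in range(1, iters + 1):
        theta = 2.0 / (k + 1.0)
        y = (1.0 - theta) * x + theta * v                   # convex combination
        v = prox_h(v - (t / theta) * grad_g(y), t / theta)  # prox step with scaled step t/theta
        x = (1.0 - theta) * x + theta * v                   # x stays in dom h
    return x

# minimize g(x) + h(x) with g(x) = (x - 3)^2 / 2, h(x) = |x|; minimizer is x = 2
x_star = nesterov_second(0.0, lambda x: x - 3.0, soft_threshold, t=1.0)
```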

### 4. Fast dual proximal gradient methods


#### The Fast Dual-Based Proximal Gradient Method (FDPG)

Input: $L \geq \frac{\|A\|^2}{\sigma}$ (an upper bound on the Lipschitz constant of $\nabla F$)

Step $0$. Take $w_1 = y_0 \in \mathbb{V}$, $t_1 = 1$.

Step $k$. ($k \geq 1$) Compute

$$u_k = \mathop{\rm argmax}_u \left\{ \langle u, A^Tw_k \rangle - f(u) \right\}$$

$$y_k = w_k - \frac{1}{L}\left( Au_k - {\bf prox}_{Lg}\left( Au_k - Lw_k \right) \right)$$

$$t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$$

$$w_{k+1} = y_k + \frac{t_k - 1}{t_{k+1}}\left( y_k - y_{k-1} \right)$$

Study Notes on 《非线性最优化基础》 · August 2, 2015 · QU Xiaofeng