OPTIMIZATIONS for Machine Learning

The Handbook of Cluster Analysis

Sun, 14 Feb 2016 20:21:46 +0800

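I have recently been reading The Handbook of Cluster Analysis. It lives up to the word "handbook": its coverage of clustering methods is comprehensive, and the chapter authors are all practitioners in the field.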

Two of the chapters are available in versions made public by their authors:

$G = \sum^\infty_{i=1} w_i \delta_{\theta_i}$

K-means, K-SVD, LC-KSVD and DPL

Tue, 04 Aug 2015 06:46:53 +0800

From K-means to K-SVD¹ ² ³, then to LC-KSVD⁴, then to DPL⁵.

¹ M. Aharon, M. Elad, and A.M. Bruckstein, “The K-SVD: An Algorithm for Designing of Overcomplete Dictionaries for Sparse Representation”, IEEE Trans. on Signal Processing, Vol. 54, No. 11, pp. 4311-4322, November 2006. http://www.cs.technion.ac.il/~elad/publications/journals/2004/32_KSVD_IEEE_TSP.pdf

² O. Bryt and M. Elad, “Compression of Facial Images Using the K-SVD Algorithm”, Journal of Visual Communication and Image Representation, Vol. 19, No. 4, pp. 270-283, May 2008. http://www.cs.technion.ac.il/~elad/publications/journals/2007/FaceCompress_KSVD_JVCIR.pdf

³ R. Rubinstein, T. Peleg, and M. Elad, “Analysis K-SVD: A Dictionary-Learning Algorithm for the Analysis Sparse Model”, IEEE Trans. on Signal Processing, Vol. 61, No. 3, pp. 661-677, March 2013. http://www.cs.technion.ac.il/~elad/publications/journals/2011/Analysis-KSVD-IEEE-TSP.pdf

⁴ Zhuolin Jiang, Zhe Lin, and Larry S. Davis, “Label Consistent K-SVD: Learning A Discriminative Dictionary for Recognition”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(11): 2651-2664. http://www.umiacs.umd.edu/~zhuolin/projectlcksvd.html

⁵ Gu, S., Zhang, L., Zuo, W., & Feng, X. (2014). Projective dictionary pair learning for pattern classification. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 27 (pp. 793–801). Curran Associates, Inc. http://papers.nips.cc/paper/5600-spectral-clustering-of-graphs-with-the-bethe-hessian

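The K-means algorithm

Task: find the best codebook (i.e., the dictionary) for representing the data samples $\{\mathbf{y}_i\}^N_{i=1}$ by nearest neighbor, that is, solve the following problem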

$\min_{\mathbf{C, X}} \left\{ \|\mathbf{Y} - \mathbf{CX}\|^2_F \right\} \text{ subject to } \forall i \text{, } \mathbf{x}_i = \mathbf{e}_k \text{ for some } k$

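Initialization: set the codebook matrix $\mathbf{C}^{(0)} \in \Re^{n \times K}$ and set $J=1$ .

Repeat until convergence (according to a stopping rule):

1. Sparse coding stage: partition the training samples ${\bf Y}$ into the following $K$ sets.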

$\left( \mathbf{R}^{(J-1)}_1, \mathbf{R}^{(J-1)}_2, \cdots, \mathbf{R}^{(J-1)}_K \right)$

Each set holds the indices of the samples that are closest to column ${\bf c}^{(J-1)}_k$ .

$\mathbf{R}^{(J-1)}_k = \left\{ i \mid \forall_{l \neq k}, \|\mathbf{y}_i - \mathbf{c}^{(J-1)}_k\|_2 < \|\mathbf{y}_i - \mathbf{c}^{(J-1)}_l\|_2 \right\}$

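2. Codebook update stage: every column $k$ of ${\bf C}^{(J-1)}$ is updated by the following formula.

${\bf c}^{(J)}_k = \frac{1}{|{\bf R}^{(J-1)}_k|}\sum_{i \in {\bf R}^{(J-1)}_k}{\bf y}_i$

3. Set $J = J + 1$ .

K-means amounts to a sparse representation that uses only one column of the codebook matrix. Because only one column is used, its coefficient is 1. The index $j$ of that column is determined by the following rule.

$\forall_{k \neq j} \; \left\|{\bf y}_i - {\bf C}{\bf e}_j \right\|^2_2 \leqslant \left\|{\bf y}_i - {\bf C}{\bf e}_k \right\|^2_2$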
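The two stages above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `kmeans`, the explicit `init` argument, and the "centers unchanged" stopping rule are assumptions of this example, and ties are broken by taking the first nearest column rather than by the strict inequality used in the definition of the sets.

```python
def kmeans(samples, init, iters=100):
    """Lloyd-style K-means on a list of equal-length tuples.

    `init` is the initial codebook C^(0), given as a list of K tuples.
    """
    centers = list(init)
    k, dim = len(centers), len(samples[0])
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # Sparse coding stage: groups[j] collects the indices of the
        # samples whose nearest codebook column is centers[j].
        groups = [[] for _ in range(k)]
        for i, y in enumerate(samples):
            dists = [sum((a - b) ** 2 for a, b in zip(y, c)) for c in centers]
            groups[dists.index(min(dists))].append(i)
        # Codebook update stage: each column becomes the mean of its group.
        new_centers = []
        for j, idx in enumerate(groups):
            if not idx:  # an empty group keeps its old column
                new_centers.append(centers[j])
                continue
            new_centers.append(
                tuple(sum(samples[i][d] for i in idx) / len(idx) for d in range(dim))
            )
        if new_centers == centers:  # stopping rule: codebook unchanged
            break
        centers = new_centers
    return centers, groups


samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, groups = kmeans(samples, init=samples[:2])
```

Each sample is coded by exactly one column with coefficient 1, which is the constraint $\mathbf{x}_i = \mathbf{e}_k$ in the problem statement.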

Julia Code Highlight Test

include("diagadd!.jl")

function updateA!(A::Array{Any,1}, 
                  D::Array{Any,1}, 
                  DataMat::Array{Any,1}, 
                  P::Array{Any,1}, 
                  τ::Float64, 
                  DictSize::Int64)
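    # Update the coefficient matrices A[i] one class at a time, following Eq. (8).
    # `diagadd!` (from diagadd!.jl) and `fma!` are helpers defined elsewhere;
    # judging by how they are used here, they presumably add τ to the diagonal
    # of their first argument and add τ * C to it in place, respectively.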

    for i=1:length(A)
        @inbounds TempDict::Matrix{Float64} = D[i]
        @inbounds TempData::Matrix{Float64} = DataMat[i]
        tempDictCoef::Matrix{Float64}       = TempDict' * TempDict
        tempDictDataCoef::Matrix{Float64}   = TempDict' * TempData
        @inbounds C::Matrix{Float64}        = P[i] * TempData
        diagadd!(tempDictCoef, τ)
        fma!(tempDictDataCoef, C, τ)
        @inbounds A[i]                      = tempDictCoef \ tempDictDataCoef
    end
end

Study notes on 《非线性最优化基础》 (Fundamentals of Nonlinear Optimization)

Sun, 02 Aug 2015 22:33:54 +0800

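《非线性最优化基础》 was written by Masao Fukushima (福嶋雅夫)¹.

¹ 《非线性最优化基础》 (Douban link: http://book.douban.com/subject/6510671/). Masao Fukushima is a professor in the Department of Systems and Mathematical Science, Faculty of Science and Engineering, Nanzan University, Japan; professor emeritus of Kyoto University, Japan; and a visiting professor at the University of Waterloo (Canada), the University of Namur (Belgium), and the University of New South Wales (Australia). Homepage: http://www.seto.nanzan-u.ac.jp/~fuku/index.html

This post is a set of notes from lectures on nonlinear optimization given by Prof. Feng Xiangchu (冯象初)².

² Feng Xiangchu, professor, Department of Mathematics, Xidian University. Homepage: http://web.xidian.edu.cn/xcfeng/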

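Main contents

Theoretical foundations

Convex functions and closed functions
Conjugate functions
Saddle point problems
The Lagrange dual problem
Generalization of Lagrange duality
Fenchel duality

Algorithms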

Proximal gradient methods
Dual proximal gradient methods
Fast proximal gradient methods
Fast dual proximal gradient methods

Theoretical foundations

Convex functions and closed functions

Given a function $f : \Re^n \to [-\infty, +\infty]$ , the subset of $\Re^{n+1}$

$graph \; f = \left\{ ({\bf x}, \beta)^T \in \Re^{n+1} \mid \beta = f({\bf x}) \right\} ,$

is called the graph of $f$ , and the set of all points lying on or above the graph of $f$ ,

$epi \; f =\left\{ ({\bf x}, \beta)^T \in \Re^{n+1} \mid \beta \geqslant f({\bf x}) \right\}$

is called the epigraph of $f$ . If the epigraph $epi \; f$ is a convex set, then $f$ is called a convex function.

Theorem 2.27 Let $\mathcal{I}$ be an arbitrary nonempty index set and let $f_i : \Re^n \to [-\infty, +\infty] \; (i \in \mathcal{I})$ all be convex functions. Then the function $f : \Re^n \to [-\infty, +\infty]$ defined by

$f({\bf x}) = \sup \left\{ f_i({\bf x}) \mid i \in \mathcal{I} \right\}$

is convex. Furthermore, if $\mathcal{I}$ is a finite index set, every $f_i$ is a proper convex function, and $\cap_{i \in \mathcal{I}} \; dom \; f_i \neq \varnothing$ , then $f$ is a proper convex function.

If, for every sequence $\{ {\bf x}^k\} \subseteq \Re^n$ converging to ${\bf x}$ ,

$f({\bf x}) \geqslant \limsup_{k \to \infty}f({\bf x}^k)$

holds, then the function $f:\Re^n\to[-\infty,+\infty]$ is said to be upper semicontinuous at ${\bf x}$ ; likewise, if

$f({\bf x}) \leqslant \liminf_{k \to \infty}f({\bf x}^k)$

holds, then $f$ is said to be lower semicontinuous at ${\bf x}$ . If $f$ is both upper and lower semicontinuous at ${\bf x}$ , then $f$ is said to be continuous at ${\bf x}$ .

Conjugate functions

Given a proper convex function $f:\Re^n \to (-\infty,+\infty]$ , the function $f^\ast:\Re^n \to [-\infty,+\infty]$ defined by

$f^\ast({\bf\xi}) = \sup \left\{ \langle {\bf x},{\bf\xi} \rangle -f({\bf x}) \mid {\bf x}\in \Re^n \right\}$

is called the conjugate function of $f$ .

Theorem 2.36 The conjugate function $f^\ast$ of a proper convex function $f:\Re^n \to (-\infty,+\infty]$ is always a closed proper convex function.
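As a quick worked instance of the definition (the choice of $f$ is an illustration, not taken from the lecture notes), take $f({\bf x}) = \frac{1}{2}\|{\bf x}\|_2^2$. The supremum is attained where the gradient of the bracketed expression vanishes, i.e. at ${\bf x} = {\bf \xi}$, so this function is its own conjugate:

```latex
f^\ast(\boldsymbol{\xi})
  = \sup_{\mathbf{x} \in \Re^n}
    \left\{ \langle \mathbf{x}, \boldsymbol{\xi} \rangle - \tfrac{1}{2}\|\mathbf{x}\|_2^2 \right\}
  = \langle \boldsymbol{\xi}, \boldsymbol{\xi} \rangle - \tfrac{1}{2}\|\boldsymbol{\xi}\|_2^2
  = \tfrac{1}{2}\|\boldsymbol{\xi}\|_2^2 .
```

The result is again closed, proper and convex, in line with Theorem 2.36.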

Saddle point problems

Let $Y$ and $Z$ be nonempty subsets of $\Re^n$ and $\Re^m$ respectively. Given a function $K:Y\times Z\to[-\infty,+\infty]$ with domain $Y\times Z$ , define two functions $\eta:Y\to[-\infty,+\infty]$ and $\zeta:Z\to[-\infty,+\infty]$ as follows:

$\eta({\bf y})=\sup\left\{ K({\bf y},{\bf z}) \mid {\bf z} \in Z\right\}$

$\zeta({\bf z})=\inf\left\{ K({\bf y},{\bf z}) \mid {\bf y} \in Y\right\}$

and consider the two problems

$\min \; \; \eta({\bf y})\\s.t. \; \; \; {\bf y} \in Y$

$\max \; \; \zeta({\bf z})\\s.t. \; \; \; {\bf z} \in Z$

Lemma 4.1 For any ${\bf y}\in Y$ and ${\bf z}\in Z$ , $\zeta({\bf z}) \leqslant \eta({\bf y})$ holds. Furthermore, $\sup\left\{ \zeta({\bf z})\mid {\bf z}\in Z\right\} \leqslant \inf\left\{ \eta({\bf y})\mid {\bf y} \in Y\right\}$

Theorem 4.1 A point $(\bar{\bf y},\bar{\bf z})\in Y\times Z$ is a saddle point of the function $K:Y\times Z\to[-\infty,+\infty]$ if and only if $\bar{\bf y}\in Y$ and $\bar{\bf z}\in Z$ satisfy

$\eta(\bar{\bf y})=\inf\left\{ \eta({\bf y})\mid {\bf y}\in Y\right\} =\sup\left\{ \zeta({\bf z})\mid {\bf z}\in Z\right\} =\zeta(\bar{\bf z})$
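For finite $Y$ and $Z$ the function $K$ is just a table, and the condition of Theorem 4.1 can be checked by brute force. The sketch below is a toy illustration only; the helper names and the two example tables are assumptions of this example.

```python
def eta(K, y):
    """eta(y) = sup over z of K(y, z), for a finite table K[y][z]."""
    return max(K[y])


def zeta(K, z):
    """zeta(z) = inf over y of K(y, z)."""
    return min(row[z] for row in K)


def saddle_points(K):
    """Return all (y, z) with eta(y) = inf eta = sup zeta = zeta(z)."""
    ys, zs = range(len(K)), range(len(K[0]))
    inf_eta = min(eta(K, y) for y in ys)
    sup_zeta = max(zeta(K, z) for z in zs)
    assert sup_zeta <= inf_eta  # Lemma 4.1 (weak duality)
    if sup_zeta != inf_eta:
        return []
    return [(y, z) for y in ys for z in zs
            if eta(K, y) == inf_eta and zeta(K, z) == sup_zeta]


with_saddle = [[2, 1], [4, 3]]     # inf eta = sup zeta = 2
without_saddle = [[1, 3], [4, 2]]  # inf eta = 3 > sup zeta = 2
```

In the first table the entry $K(0,0)=2$ is the largest in its row and the smallest in its column, so it is a saddle point; in the second table the gap between $\inf \eta$ and $\sup \zeta$ rules one out.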

The Lagrange dual problem

Consider the following nonlinear programming problem:

$\min \; \; f({\bf x}) \\ s.t. \; \; g_i({\bf x}) \leqslant 0, \; \; i=1, \cdots, m$

where $f: \Re^n \to \Re$ and $g_i: \Re^n \to \Re \; (i=1, \cdots, m)$ . Its feasible set is

$S = \left\{ {\bf x} \in \Re^n \mid g_i({\bf x}) \leqslant 0 \text{, } \; \; i=1, \cdots, m\right \}$

The Lagrange function is

$L_0({\bf x}, {\bf \lambda}) = \begin{cases} f({\bf x}) + \sum^m_{i=1}\lambda_ig_i({\bf x})\;, & {\bf \lambda} \geqslant {\bf 0}\\ -\infty \; , & {\bf \lambda} \ngeqslant {\bf 0} \end{cases}$

With $\delta_S$ denoting the indicator function of $S$ , the objective with the constraints folded in is

$\theta({\bf x}) = f({\bf x}) + \delta_S({\bf x})$

which can also be written as

$\theta({\bf x}) = \sup \left\{ L_0({\bf x}, {\bf \lambda}) \mid {\bf \lambda} \in \Re^m \right\}$

and the dual function is

$\omega_0({\bf \lambda}) = \inf \left\{ L_0({\bf x}, {\bf \lambda}) \mid {\bf x} \in \Re^n \right\}$
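A one-dimensional worked example may help fix the notation. The problem below is chosen purely for illustration and is not from the lecture notes: minimize $x^2$ subject to $1 - x \leqslant 0$, whose optimal value is $1$ at $x = 1$.

```latex
\begin{aligned}
L_0(x, \lambda) &= x^2 + \lambda (1 - x), \qquad \lambda \geqslant 0, \\
\omega_0(\lambda) &= \inf_{x \in \Re} L_0(x, \lambda)
   = \lambda - \tfrac{\lambda^2}{4}
   \quad \text{(attained at } x = \lambda / 2 \text{)}, \\
\sup_{\lambda \geqslant 0} \omega_0(\lambda) &= \omega_0(2) = 1 = \inf_{x} \theta(x).
\end{aligned}
```

Here the dual optimal value equals the primal optimal value, so there is no duality gap for this convex instance.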

Constraint relaxation

$F_0({\bf x}, {\bf u}) = \begin{cases} f({\bf x}), & {\bf x} \in S({\bf u}) \\ +\infty, & {\bf x} \notin S({\bf u}) \end{cases}$

which corresponds to the perturbed problem

$\min \; \; f({\bf x}) \\ s.t. \; \; g_i({\bf x}) \leqslant u_i, \; \; i = 1, \cdots, m$

$S({\bf u}) = \left\{ {\bf x} \in \Re^n \mid g_i({\bf x}) \leqslant u_i, \; i=1, \cdots, m \right\}$

Lemma 4.5 The Lagrange function $L_0: \Re^{n+m} \to [-\infty, +\infty)$ and the function $F_0: \Re^{n+m} \to (-\infty,+\infty]$ satisfy the following relations:

$L_0({\bf x}, {\bf \lambda}) = \inf \left\{ F_0({\bf x}, {\bf u}) + \langle {\bf \lambda}, {\bf u} \rangle \mid {\bf u} \in \Re^m \right\}$

$F_0({\bf x}, {\bf u}) = \sup \left\{ L_0({\bf x}, {\bf \lambda}) - \langle {\bf \lambda}, {\bf u} \rangle \mid {\bf \lambda} \in \Re^m \right\}$

Generalization of Lagrange duality

For the primal problem $(P)$ , consider a function $F: \Re^{n+M} \to (-\infty, +\infty]$ such that, for every fixed ${\bf x} \in \Re^n$ , $F({\bf x}, \cdot): \Re^M \to (-\infty, +\infty]$ is a closed proper convex function and

$F({\bf x}, {\bf 0}) = \theta({\bf x}) \text{, } {\bf x} \in \Re^n$

Example 4.7 Let $M = m$ and consider the function $F_0: \Re^{n+m} \to (-\infty, +\infty]$ . Using a closed proper convex function $q: \Re^m \to (-\infty, +\infty]$ that satisfies $q({\bf 0}) = 0$ , define the function $F: \Re^{n+m} \to (-\infty, +\infty]$ as follows:

$F({\bf x}, {\bf u}) = F_0({\bf x}, {\bf u}) + q({\bf u})$

$\theta({\bf x}) = f({\bf x}) + \delta_S({\bf x})$ $\implies F({\bf x}, {\bf u}) \mid F({\bf x}, {\bf 0}) = \theta({\bf x})$ $\implies L({\bf x}, {\bf \lambda}) = \inf \left\{ F({\bf x}, {\bf u}) + \langle {\bf \lambda}, {\bf u} \rangle \mid {\bf u} \in \Re^M \right\}$ $\implies \omega({\bf \lambda}) = \inf \left\{ L({\bf x}, {\bf \lambda}) \mid {\bf x} \in \Re^n \right\}$

Fenchel duality

$\min_{\bf x} f({\bf x}) + g({\bf Ax})$

$\begin{cases} & F({\bf x}, {\bf 0}) = \theta({\bf x}), & {\bf x} \in \Re^n \\ & \theta({\bf x}) = f({\bf x}) + g({\bf Ax}) & \end{cases}$

$\implies F({\bf x}, {\bf u}) = f({\bf x}) + g({\bf Ax} + {\bf u})$ $\begin{eqnarray*} \implies L({\bf x}, {\bf \lambda}) & = & \inf \left\{ f({\bf x}) + g({\bf Ax} + {\bf u}) + \langle {\bf \lambda}, {\bf u} \rangle \mid {\bf u} \in \Re^m \right\} \\ & = & f({\bf x}) - g^\ast(-{\bf \lambda}) - \langle {\bf \lambda}, {\bf Ax} \rangle \end{eqnarray*}$ $\begin{eqnarray*} \implies \omega({\bf \lambda}) & = & \inf \left\{ f({\bf x}) - g^\ast(-{\bf \lambda}) - \langle {\bf \lambda}, {\bf Ax} \rangle \mid {\bf x} \in \Re^n \right\} \\ & = & -f^\ast({\bf A}^T{\bf \lambda}) - g^\ast(-{\bf \lambda}) \end{eqnarray*}$

$\min_{\bf \lambda} f^\ast\left( {\bf A}^T{\bf \lambda} \right) + g^\ast(-{\bf \lambda})$ $\max_{\bf \lambda} -f^\ast\left({\bf A}^T{\bf \lambda} \right) - g^\ast\left(-{\bf\lambda}\right)$

Algorithms

1. Proximal Gradient Method

Reference: Algorithms for large-scale convex optimization - DTU 2010³

³ Lecture notes from “02930 Algorithms for Large-Scale Convex Optimization”, taught by Per Christian Hansen (pch@imm.dtu.dk) and Professor Lieven Vandenberghe (http://www.seas.ucla.edu/~vandenbe/) at Danmarks Tekniske Universitet (http://www.kurser.dtu.dk/2010-2011/02930.aspx?menulanguage=en-GB). The download link was found on the page of “EE227BT: Convex Optimization - Fall 2013”, taught by Laurent El Ghaoui at Berkeley (http://www.eecs.berkeley.edu/~elghaoui/Teaching/EE227A/lecture18.pdf). Both sets of lectures mention the book “Convex Optimization” by Stephen Boyd and Lieven Vandenberghe (http://stanford.edu/~boyd/cvxbook/) and the software “CVX”, a MATLAB package for disciplined convex programming (http://cvxr.com/cvx/). A similar set of lecture notes on the proximal gradient method comes from “EE236C - Optimization Methods for Large-Scale Systems (Spring 2013-14)” (http://www.seas.ucla.edu/~vandenbe/ee236c.html) at UCLA.

Proximal mapping

The proximal mapping (or proximal operator) of a convex function $h$ is

${\bf prox}_h(x) = \mathop{argmin}_u \left( h(u) + \frac{1}{2} \|u - x\|^2_2 \right)$

examples

1. $h(x) = 0: {\bf prox}_h(x) = x$

2. $h(x) = I_C(x)$ (indicator function of $C$ ): ${\bf prox}_h$ is projection on $C$

${\bf prox}_h(x) = P_C(x) = \mathop{argmin}_{u \in C} \|u - x\|^2_2$

3. $h(x) = t \|x\|_1$ : ${\bf prox}_h$ is the shrinkage (soft threshold) operation

${\bf prox}_h(x)_i = \begin{cases} x_i - t & x_i \geqslant t \\ 0 & |x_i| \leqslant t \\ x_i + t & x_i \leqslant -t \end{cases}$
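The third example is the one used most often in practice, so here is a minimal sketch of it in plain Python. The function name `soft_threshold` is an assumption of this example; it applies the three-case rule to each component of a list.

```python
def soft_threshold(x, t):
    """Elementwise prox of h(u) = t * ||u||_1 evaluated at the list x."""
    out = []
    for xi in x:
        if xi >= t:
            out.append(xi - t)   # shrink positive entries toward zero
        elif xi <= -t:
            out.append(xi + t)   # shrink negative entries toward zero
        else:
            out.append(0.0)      # entries with |x_i| < t are set to zero
    return out


shrunk = soft_threshold([3.0, -0.5, -2.0], 1.0)
```

Entries with magnitude below the threshold are zeroed, which is why this operator produces sparse iterates.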

Proximal gradient method

unconstrained problem with cost function split in two components

$minimize \; \; f(x) = g(x) + h(x)$

$g$ convex, differentiable, with dom $g=\Re^n$

$h$ closed, convex, possibly nondifferentiable; ${\bf prox}_h$ is inexpensive

proximal gradient algorithm

$x^{(k)} = {\bf prox}_{t_kh} \left( x^{(k-1)} - t_k \nabla g \left( x^{(k-1)} \right) \right)$

$t_k > 0$ is the step size, constant or determined by line search

Interpretation

$x^+ = {\bf prox}_{th} \left( x - t\nabla g(x) \right)$

from definition of proximal operator:

$\begin{eqnarray*} x^+ & = & \mathop{argmin}_u \left( h(u) + \frac{1}{2t} \left\| u - x + t\nabla g(x) \right\|^2_2 \right) \\ & = & \mathop{argmin}_u \left( h(u) + g(x) + \nabla g(x)^T(u-x) + \frac{1}{2t} \left\| u - x \right\|^2_2 \right) \end{eqnarray*}$

$x^+$ minimizes $h(u)$ plus a simple quadratic local model of $g(u)$ around $x$

Examples

$minimize \; \; g(x) + h(x)$

gradient method: $h(x) = 0$ , i.e., minimize $g(x)$

$x^{(k)} = x^{(k-1)} - t_k\nabla g\left( x^{(k-1)} \right)$

gradient projection method: $h(x) = I_C(x)$ , i.e., minimize $g(x)$ over $C$

$x^{(k)} = P_C \left( x^{(k-1)} - t_k\nabla g \left(x^{(k-1)} \right) \right)$

iterative soft-thresholding: $h(x) = \|x\|_1$ , i.e., $minimize \; \; g(x)+ \| x \|_1$

$x^{(k)} = {\bf prox}_{t_kh} \left( x^{(k-1)} - t_k\nabla g\left( x^{(k-1)} \right) \right)$

and

${\bf prox}_{th}(u)_i = \begin{cases} u_i - t & & u_i \geq t \\ 0 & & -t \leq u_i \leq t \\ u_i + t & & u_i \leq -t \end{cases}$
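As a concrete sketch of iterative soft-thresholding, the code below takes $g(x) = \frac{1}{2}\|Ax - b\|_2^2$ and $h(x) = \lambda\|x\|_1$. The least-squares choice of $g$, the weight `lam`, the function names, and the small test problem are assumptions of this example; the step size `t` must be supplied by the caller and should not exceed $1/L$, where $L$ is the largest eigenvalue of $A^TA$.

```python
def soft_threshold(u, t):
    """Elementwise prox of t * ||.||_1."""
    return [ui - t if ui >= t else ui + t if ui <= -t else 0.0 for ui in u]


def ista(A, b, t, lam=1.0, iters=200):
    """Proximal gradient for 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # Gradient of g at x: A^T (A x - b).
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        grad = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Gradient step on g, then prox step on h.
        x = soft_threshold([x[j] - t * grad[j] for j in range(n)], t * lam)
    return x


# Diagonal A makes the problem separable; L = 4, so t = 1/4 is admissible.
x_star = ista([[2.0, 0.0], [0.0, 1.0]], [4.0, 0.5], t=0.25)
```

For this separable problem the minimizer can be found by hand: the first coordinate solves $2(2x_1 - 4) + 1 = 0$, giving $x_1 = 1.75$, and the second is thresholded to $0$.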

2. Dual Proximal Gradient Methods

Reference: L. Vandenberghe EE236C (Spring 2013-14)

Composite structure in the Dual

$\begin{eqnarray*} minimize & & f(x)+g(Ax) \\ maximize & & -f^\ast \left( -A^Tz \right) - g^\ast(z) \end{eqnarray*}$

dual has the right structure for the proximal gradient method if

prox-operator of $g$ (or $g^\ast$ ) is cheap (closed form or simple algorithm)

$f$ is strongly convex ( $f(x)-\frac{\mu}{2}x^Tx$ is convex), which implies that $f^\ast\left(-A^Tz\right)$ has a Lipschitz continuous gradient ( $L=\frac{\|A\|^2_2}{\mu}$ ):

$\left\| A\nabla f^\ast(-A^Tu)-A\nabla f^\ast(-A^Tv) \right\|_2 \leq \frac{\|A\|^2_2}{\mu}\|u-v\|_2$

because $\nabla f^\ast$ is Lipschitz continuous with constant $\frac{1}{\mu}$

Dual proximal gradient update

$z^+ = prox_{tg^\ast}\left( z+tA\nabla f^\ast\left( -A^Tz \right) \right)$

equivalent expression in terms of $f$ :

$z^+ = prox_{tg^\ast}(z+tA\hat{x}) \text{ where } \hat{x} = \mathop{argmin}_x \left( f(x) + z^TAx \right)$

1. if $f$ is separable, calculation of $\hat{x}$ decomposes into independent problems

2. step size $t$ constant or from backtracking line search

Alternating minimization interpretation

Moreau decomposition gives alternate expression for $z$ -update

$z^+ = z + t(A\hat{x} - \hat{y})$

where

$\begin{eqnarray*} \hat{x} & = & \mathop{argmin}_x \left( f(x) + z^TAx \right) \\ \hat{y} & = & prox_{t^{-1}g} \left( \frac{z}{t} + A\hat{x} \right) \\ & = & \mathop{argmin}_y \left(g(y) + z^T(A\hat{x} - y) + \frac{t}{2} \|A\hat{x} - y\|^2_2 \right) \end{eqnarray*}$

in each iteration, an alternating minimization of:

1. Lagrangian $f(x) + g(y) + z^T(Ax - y)$ over $x$

2. augmented Lagrangian $f(x) + g(y) + z^T(Ax - y) + \frac{t}{2} \|Ax - y\|^2_2$ over $y$

Regularized norm approximation

$minimize \; \; f(x) + \|Ax - b\| \text{ (with } f \text{ strongly convex) }$

a special case with $g(y) = \|y - b\|$

$g^\ast(z) = \begin{cases} b^Tz & & \|z\|_\ast \leq 1 \\ +\infty & & otherwise \end{cases}$

$prox_{tg^\ast}(z) = P_C(z - tb)$

C is unit norm ball for dual norm $\|\cdot\|_\ast$

dual gradient projection update

$\begin{eqnarray*} \hat{x} & = & \mathop{argmin}_x \left( f(x) + z^TAx \right) \\ z^+ & = & P_C(z + t(A\hat{x} - b)) \end{eqnarray*}$

Example

$minimize \; \; f(x) + \sum^p_{i=1}\|B_ix\|_2 \text{ (with } f \text{ strongly convex) }$

dual gradient projection update

$\begin{eqnarray*} \hat{x} & = & \mathop{argmin}_x \left( f(x) + \left(\sum^p_{i=1}B^T_iz_i\right)^Tx \right) \\ z^+_i & = & P_{C_i}(z_i + tB_i\hat{x}) \text{, } \; \; i=1, \cdots, p \end{eqnarray*}$

$C_i$ is unit Euclidean norm ball in $\Re^{m_i}$ , if $B_i \in \Re^{m_i \times n}$

Minimization over intersection of convex sets

$\begin{eqnarray*} minimize & & f(x) \\ subject \; to & & x \in C_1 \cap \cdots \cap C_m \end{eqnarray*}$

$f$ strongly convex; e.g., $f(x) = \|x - a\|^2_2$ for projecting $a$ on intersection

sets $C_i$ are closed, convex, and easy to project onto

dual proximal gradient update

$\begin{eqnarray*} \hat{x} & = & \mathop{argmin}_x \left( f(x) + (z_1 + \cdots + z_m)^Tx \right) \\ z^+_i & = & z_i + t\hat{x} - tP_{C_i}\left(\frac{z_i}{t} + \hat{x}\right) \text{, }\; \; i=1, \cdots, m \end{eqnarray*}$
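A one-dimensional toy version of this update, where every set is an interval, can be sketched as follows. The function names, the scaling $f(x) = \frac{1}{2}(x - a)^2$ (so that $\hat{x} = a - \sum_i z_i$ and $\mu = 1$), and the step size $t = 1/m$ are assumptions of this example; with $m$ stacked identity blocks $\|A\|_2^2 = m$, so $t = 1/m$ matches the bound $1/L$.

```python
def clip(x, lo, hi):
    """Projection of the scalar x onto the interval [lo, hi]."""
    return min(max(x, lo), hi)


def project_onto_intersection(a, intervals, iters=200):
    """Dual proximal gradient for min 0.5 * (x - a)^2 over an intersection of intervals."""
    m = len(intervals)
    t = 1.0 / m
    z = [0.0] * m                       # one dual variable per set
    for _ in range(iters):
        x_hat = a - sum(z)              # argmin_x f(x) + (z_1 + ... + z_m) x
        z = [zi + t * x_hat - t * clip(zi / t + x_hat, lo, hi)
             for zi, (lo, hi) in zip(z, intervals)]
    return a - sum(z)


p = project_onto_intersection(5.0, [(0.0, 3.0), (1.0, 2.0)])
```

Only projections onto the individual sets are ever computed, yet the primal iterate $\hat{x}$ approaches the projection onto the intersection $[1, 2]$.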

Decomposition of separable problems

$minimize \; \; \sum^n_{j=1}f_j(x_j) + \sum^m_{i=1}g_i(A_{i1}x_1 + \cdots + A_{in}x_n )$

each $f_j$ is strongly convex; $g_i$ has inexpensive prox-operator

dual proximal gradient update

$\begin{eqnarray*} \hat{x}_j & = & \mathop{argmin}_{x_j} \left( f_j(x_j) + \sum^m_{i=1}z^T_iA_{ij}x_j \right) \text{, } \; \; j=1, \cdots, n \\ z^+_i & = & prox_{tg^\ast_i}\left(z_i + t\sum^n_{j=1}A_{ij}\hat{x}_j \right) \text{, } \; \; i=1, \cdots, m \end{eqnarray*}$

3. Fast proximal gradient methods

Reference: L. Vandenberghe EE236C (Spring 2013-14)

FISTA (basic version)

$minimize \; \; f(x) = g(x) + h(x)$

$g$ convex, differentiable with $\mathop{dom} g=\Re^n$

$h$ closed, convex, with inexpensive $prox_{th}$ operator

algorithm: choose any $x^{(0)} = x^{(-1)}$ ; for $k \geq 1$ , repeat the steps

$\begin{eqnarray*} y & = & x^{(k-1)} + \frac{k-2}{k+1} \left( x^{(k-1)} - x^{(k-2)} \right) \\ x^{(k)} & = & prox_{t_kh} \left( y - t_k\nabla g(y) \right) \end{eqnarray*}$

step size $t_k$ fixed or determined by line search

acronym stands for ‘Fast Iterative Shrinkage-Thresholding Algorithm’

Interpretation

first iteration ( $k = 1$ ) is a proximal gradient step at $y = x^{(0)}$

next iterations are proximal gradient steps at extrapolated points $y$

note: $x^{(k)}$ is feasible (in $\mathop{dom} h$ ); $y$ may be outside $\mathop{dom} h$
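The basic FISTA iteration differs from the plain proximal gradient method only in the extrapolation step. The sketch below again assumes $g(x) = \frac{1}{2}\|Ax - b\|_2^2$ and $h(x) = \lambda\|x\|_1$ with a fixed step size; those choices, the names, and the test problem are assumptions of this example.

```python
def soft_threshold(u, t):
    """Elementwise prox of t * ||.||_1."""
    return [ui - t if ui >= t else ui + t if ui <= -t else 0.0 for ui in u]


def fista(A, b, t, lam=1.0, iters=200):
    """Basic FISTA for 0.5 * ||A x - b||^2 + lam * ||x||_1 with fixed step t."""
    m, n = len(A), len(A[0])
    x_prev = x = [0.0] * n                      # x^(0) = x^(-1)
    for k in range(1, iters + 1):
        w = (k - 2.0) / (k + 1.0)               # extrapolation weight
        y = [x[j] + w * (x[j] - x_prev[j]) for j in range(n)]
        r = [sum(A[i][j] * y[j] for j in range(n)) - b[i] for i in range(m)]
        grad = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Proximal gradient step taken at the extrapolated point y.
        x_prev, x = x, soft_threshold([y[j] - t * grad[j] for j in range(n)], t * lam)
    return x


x_fast = fista([[2.0, 0.0], [0.0, 1.0]], [4.0, 0.5], t=0.25)
```

On this small separable problem the iterates settle at $(1.75, 0)$ after the first step; the benefit of extrapolation shows up on ill-conditioned problems, where the plain method needs many more iterations.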

Reformulation of FISTA

define $\theta_k = \frac{2}{k+1}$ and introduce an intermediate variable $v^{(k)}$

algorithm: choose $x^{(0)} = v^{(0)}$ ; for $k \geq 1$ , repeat the steps

$\begin{eqnarray*} y & = & (1 - \theta_k)x^{(k-1)} + \theta_kv^{(k-1)} \\ x^{(k)} & = & prox_{t_kh}(y-t_k\nabla g(y))\\ v^{(k)} & = & x^{(k - 1)} + \frac{1}{\theta_k}\left( x^{(k)} - x^{(k-1)} \right) \end{eqnarray*}$

Nesterov’s second method

algorithm: choose $x^{(0)} = v^{(0)}$ ; for $k \geq 1$ , repeat the steps

$\begin{eqnarray*} y & = & (1 - \theta_k)x^{(k-1)} + \theta_kv^{(k-1)} \\ v^{(k)} & = & prox_{\left(\frac{t_k}{\theta_k}\right)h} \left( v^{(k-1)} - \frac{t_k}{\theta_k}\nabla g(y) \right)\\ x^{(k)} & = & (1 - \theta_k)x^{(k-1)} + \theta_kv^{(k)} \end{eqnarray*}$

Use $\theta_k = \frac{2}{k+1}$ and $t_k = \frac{1}{L}$ , or one of the line search methods

identical to FISTA if $h(x) = 0$

unlike in FISTA, $y$ is feasible (in $\mathop{dom} h$ ) if we take $x^{(0)} \in \mathop{dom} h$

4. Fast dual proximal gradient methods

Reference: A Fast Dual Proximal Gradient Algorithm for Convex Minimization and Applications, by Amir Beck and Marc Teboulle, October 10, 2013

$\begin{eqnarray*} (D) & & \max_y\left\lbrace q(y) \equiv -f^\ast\left(A^Ty\right)-g^\ast(-y)\right\rbrace,\\ (D') & & \min F(y) + G(y),\\ (P') & & \min \left\lbrace f(x) + g(z): Ax - z = 0 \right\rbrace. \end{eqnarray*}$

$F(y) := f^\ast\left( A^Ty \right), \; \; G(y) :=g^\ast(-y)$

Initialization: $L \geq \frac{\|A\|^2}{\sigma}$ , $w_1 = y_0 \in \mathbb{V}$ , $t_1 = 1$ .

General Step $(k \geq 1)$ :

$\begin{eqnarray*} y_k & = & prox_{\frac{1}{L}G}\left( w_k - \frac{1}{L} \nabla F(w_k) \right)\\ t_{k+1} & = & \frac{1 + \sqrt{1 + 4t^2_k}}{2} \\ w_{k+1} & = & y_k + \left( \frac{t_k - 1}{t_{k+1}} \right) (y_k - y_{k-1}). \end{eqnarray*}$

The Fast Dual-Based Proximal Gradient Method (FDPG)

Input: $L \geq \frac{\|A\|^2}{\sigma} - \text{ an upper bound on the Lipschitz constant of } \nabla F$

Step $0$ . Take $w_1 = y_0 \in \mathbb{V}$ , $t_1 = 1$ .

Step $k$ . ( $k \geq 1$ ) Compute

$\begin{eqnarray*} u_k & = & \mathop{argmax}_x \left\lbrace \langle x, A^Tw_k \rangle - f(x) \right\rbrace\\ v_k & = & prox_{Lg}(Au_k - Lw_k)\\ y_k & = & w_k - \frac{1}{L}(Au_k - v_k)\\ t_{k+1} & = & \frac{1 + \sqrt{1 + 4t^2_k}}{2}\\ w_{k+1} & = & y_k + \left( \frac{t_k - 1}{t_{k+1}} \right) (y_k - y_{k-1}). \tag*{$\blacksquare$} \end{eqnarray*}$

The Expectation Maximization Algorithm and Finite Mixture Models

Fri, 24 Jul 2015 03:52:47 +0800

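This topic came up in a discussion with a junior labmate. He mentioned that an early paper by Prof. Geoffrey John McLachlan, an expert on the Expectation Maximization algorithm (EM algorithm), already fully covered the ideas of several recent papers in top journals. I followed the trail to the professor's homepage and found these two books: Finite Mixture Models and The EM Algorithm and Extensions.

Recent related papers by Prof. Geoffrey John McLachlan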

McLachlan, G.J. and Ng, S.K. (2009). The EM Algorithm. In The Top-Ten Algorithms in Data Mining, X. Wu and V. Kumar (Eds.). Boca Raton, Florida: Chapman & Hall/CRC, pp. 93-115.
McLachlan, G.J., Ng, S.K., and Wang, K. (2010). Clustering of high-dimensional and correlated data. In Studies in Classification, Data Analysis, and Knowledge Organization: Data Analysis and Classification, C. Lauro, F. Palumbo, and M. Greenacre (Eds.). Berlin: Springer-Verlag, pp. 3-11.
McLachlan, G.J. and Baek, J. (2010). Clustering of high-dimensional data via finite mixture models. In Advances in Data Analysis, Data Handling and Business Intelligence, A. Fink, B. Lausen, W. Seidel, and A. Ultsch (Eds.). Berlin: Springer-Verlag, pp. 33-44.
Baek, J., McLachlan, G.J., and Flack, L. (2010). Mixtures of factor analyzers with common factor loadings: applications to the clustering and visualisation of high-dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 1298-1309.
Nikulin, V., Huang, T.-H., Ng, S.K., Rathnayake, S.I., and McLachlan, G.J. (2011). A very fast algorithm for matrix factorization. Statistics & Probability Letters 81, 773-782.
Lee, S.X. and McLachlan, G.J. (2011). On the fitting of mixtures of multivariate skew t-distributions via the EM algorithm. Preprint arXiv:1109.4706v2.
Lee, S. and McLachlan, G.J. (2014). Finite mixtures of multivariate skew t-distributions: some recent and new results. Statistics and Computing 24, 181-202. See also amended version with corrections.
Lin, T.-I., McLachlan, G.J., and Lee, S.X. (2016). Extending mixtures of factor models using the restricted multivariate skew-normal distribution. Journal of Multivariate Analysis. To appear. Preprint arXiv:1307.1748.
McLachlan, G.J. and Lee, S.X. (2014). Comment on “Comparing two formulations of skew distributions with special reference to model-based clustering” by A. Azzalini, R. Browne, M. Genton, and P. McNicholas. Preprint arXiv:1404.1733.
Lee, S.X., McLachlan, G.J., and Pyne, S. (2014). Supervised classification of flow cytometric samples via the joint clustering and matching (JCM) procedure. Preprint arXiv:1411.0685.

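Gaussian mixture models

In applications, the model mentioned most often is the Gaussian mixture model (GMM). The simplest case is the single Gaussian model, which assumes that a multivariate variable follows a Gaussian distribution. Viewed geometrically, the equal-density contours of a Gaussian are ellipses in two dimensions and ellipsoids in three dimensions. When a single Gaussian is not enough to describe the distribution of a high-dimensional random variable, a combination of several Gaussians is used to approximate it. By analogy with wavelet decomposition, a mixture with enough Gaussian components can approximate a very broad class of high-dimensional data distributions. Note also that, by the central limit theorem, the sum (or mean) of many independent identically distributed (i.i.d.) random variables is approximately Gaussian.

GMM resembles the K-means algorithm in many respects. K-means assigns each sample to the nearest of the K cluster centers and recomputes every center as the mean of the samples assigned to it. GMM estimates a mean and a variance for each group of data under a Gaussian assumption, and then computes for every sample the probability that it belongs to each of the K Gaussians. GMM can therefore be seen as a soft, probabilistic K-means. K-means can be viewed as an identification/classification method; correspondingly, GMM can be viewed as a verification method.

Given samples together with their class labels, the GMM parameters can be estimated by maximum likelihood (ML). When the class labels are unknown, the expectation maximization (EM) algorithm is used instead, based on the probability that each sample belongs to each component. When many tiny probabilities are multiplied together, taking logarithms improves numerical stability.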
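A minimal sketch of EM for a one-dimensional GMM, with the E-step carried out in the log domain, might look as follows. Everything specific here is an assumption of this example: the function names, the unit initial variances, the equal initial weights, the variance floor, and the toy data.

```python
import math


def log_gauss(x, mu, var):
    """Log density of a 1-D Gaussian N(mu, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)


def em_gmm_1d(data, init_means, iters=50, var_floor=1e-6):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances)."""
    k, n = len(init_means), len(data)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibilities via log-sum-exp for numerical stability.
        resp = []
        for x in data:
            logs = [math.log(weights[j]) + log_gauss(x, means[j], variances[j])
                    for j in range(k)]
            top = max(logs)
            lse = top + math.log(sum(math.exp(v - top) for v in logs))
            resp.append([math.exp(v - lse) for v in logs])
        # M-step: weighted maximum likelihood estimates.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                var_floor,
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
            )
    return weights, means, variances


data = [-2.1, -2.0, -1.9, 3.9, 4.0, 4.1]
weights, means, variances = em_gmm_1d(data, init_means=[min(data), max(data)])
```

Each responsibility is a soft assignment; replacing it with a hard 0/1 assignment to the most probable component recovers a K-means-like update of the means.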

Kernel and Kernel: Reproducing Kernel Hilbert Space and Kernel Method

Sun, 19 Jul 2015 17:22:53 +0800

Kernel

Definition of a reproducing kernel

Definition. $k: \mathcal{X} \times \mathcal{X} \rightarrow \Re$ is a kernel if

The kernel $k$ is symmetric: $k(x,y)=k(y,x)$ .
The kernel $k$ is positive semi-definite,

i.e., $\forall x_1, x_2, \ldots , x_n \in \mathcal{X}$ , the “Gram Matrix” $K$ defined by $K_{ij} = k(x_i,x_j)$ is positive semi-definite. (A matrix $M \in \Re^{n \times n}$ is positive semi-definite if $\forall a \in \Re^n$ , $a' M a \ge 0$ .)
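The two conditions can be checked numerically on a small example. The sketch below uses the Gaussian kernel on the real line, which is a standard positive semi-definite kernel; the kernel choice, the sample points, and the helper names are assumptions of this example, and a random test of the quadratic form is a sanity check rather than a proof.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-(x - y)^2 / (2 sigma^2)) for scalars x, y."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def gram_matrix(points, k):
    """Gram matrix K_ij = k(x_i, x_j)."""
    return [[k(xi, xj) for xj in points] for xi in points]


def quadratic_form(K, a):
    """Compute a' K a."""
    n = len(a)
    return sum(a[i] * K[i][j] * a[j] for i in range(n) for j in range(n))


points = [0.0, 1.0, 2.5, 4.0]
K = gram_matrix(points, gaussian_kernel)
rng = random.Random(0)
forms = [quadratic_form(K, [rng.uniform(-1.0, 1.0) for _ in points])
         for _ in range(100)]
```

Every sampled value of $a' K a$ should be nonnegative up to rounding, and $K$ should equal its transpose.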

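A reproducing kernel Hilbert space (RKHS) maps low-dimensional data to functions (functionals). The relevant result is the Fréchet-Riesz representation theorem for Hilbert spaces.

A Hilbert space is a complete space equipped with an inner product; correspondingly, a Banach space is a complete space equipped with a norm.

Reproducing Kernel Feature Map

The reproducing kernel feature map is $\Phi_x(\cdot) \triangleq k(\cdot,x)$

That is, for any continuous linear functional $\Phi(\cdot)$ there exists $x_\Phi \in H$ such that $\Phi(\cdot) = (\cdot, x_\Phi)$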

To be continued…

Alternating Direction Method of Multipliers (ADMM)

Sat, 18 Jul 2015 21:36:14 +0800

Consider minimizing $f({\bf x}) + g({\bf y})$ subject to affine constraints ${\bf Ax} + {\bf By} = {\bf c}$

The augmented Lagrangian

$\mathcal{L}_\rho({\bf x}, {\bf y}, {\bf \lambda}) = f({\bf x}) + g({\bf y}) + \langle {\bf \lambda}, {\bf Ax} + {\bf By} - {\bf c} \rangle + \frac{\rho}{2} \| {\bf Ax} + {\bf By} - {\bf c} \|^2_2$

Idea: perform block descent on ${\bf x}$ and ${\bf y}$ and then update multiplier vector ${\bf \lambda}$

$\begin{align} {\bf x}^{(t+1)} & \leftarrow \mathop{argmin}_{\bf x} f({\bf x}) + \langle {\bf \lambda}^{(t)}, {\bf Ax} + {\bf By}^{(t)} - {\bf c} \rangle + \frac{\rho}{2} \| {\bf Ax} + {\bf By}^{(t)} - {\bf c} \|^2_2 \\ {\bf y}^{(t+1)} & \leftarrow \mathop{argmin}_{\bf y} g({\bf y}) + \langle {\bf \lambda}^{(t)}, {\bf Ax}^{(t+1)} + {\bf By} - {\bf c} \rangle + \frac{\rho}{2} \| {\bf Ax}^{(t+1)} + {\bf By} - {\bf c} \|^2_2 \\ {\bf \lambda}^{(t+1)} & \leftarrow {\bf \lambda}^{(t)} + \rho ({\bf Ax}^{(t+1)} + {\bf By}^{(t+1)} - {\bf c}) \end{align}$
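To see the three updates in action, here is a small sketch on a toy splitting. The problem is an assumption of this example: $f(x) = \frac{1}{2}\|x - b\|_2^2$, $g(y) = \mu\|y\|_1$, with constraint $x - y = 0$ (that is, ${\bf A} = {\bf I}$, ${\bf B} = -{\bf I}$, ${\bf c} = {\bf 0}$), so that both block updates have closed forms and everything works componentwise.

```python
def soft(v, t):
    """Scalar soft threshold."""
    return v - t if v >= t else v + t if v <= -t else 0.0


def admm_l1(b, mu=1.0, rho=1.0, iters=200):
    """ADMM for min 0.5 * ||x - b||^2 + mu * ||y||_1 subject to x - y = 0."""
    n = len(b)
    y = [0.0] * n
    lam = [0.0] * n
    for _ in range(iters):
        # x-update: minimize 0.5 (x - b)^2 + lam (x - y) + rho/2 (x - y)^2.
        x = [(b[i] - lam[i] + rho * y[i]) / (1.0 + rho) for i in range(n)]
        # y-update: minimize mu |y| + rho/2 (y - (x + lam/rho))^2.
        y = [soft(x[i] + lam[i] / rho, mu / rho) for i in range(n)]
        # Multiplier update with the new residual x - y.
        lam = [lam[i] + rho * (x[i] - y[i]) for i in range(n)]
    return x, y, lam


x, y, lam = admm_l1([3.0, -0.5])
```

The limit should agree with the soft threshold of $b$ at level $\mu$, here $(2, 0)$, and the residual $x - y$ should vanish.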

Example: fused lasso

Fused lasso problem minimizes $\frac{1}{2} \| {\bf y - X\beta} \|^2_2 + \mu \sum^{p-1}_{j=1} |\beta_{j+1} - \beta_j |$

Define ${\bf \gamma = D\beta}$ , where

${\bf D} = \left(\begin{matrix} 1 & -1 & & & \\ & 1 & -1 & & \\ & & \ddots & \ddots & \\ & & & 1 & -1 \end{matrix} \right)$

Then we minimize $\frac{1}{2} \| {\bf y} - {\bf X\beta} \|^2_2 + \mu \| \gamma \|_1$ subject to ${\bf D\beta} = \gamma$

Augmented Lagrangian is

$\mathcal{L}_\rho({\bf \beta}, {\bf \gamma}, {\bf \lambda}) = \frac{1}{2} \| {\bf y} - {\bf X\beta} \|^2_2 + \mu \| {\bf \gamma} \|_1 + {\bf \lambda}^T({\bf D\beta} - {\bf \gamma}) + \frac{\rho}{2} \| {\bf D\beta} - {\bf \gamma} \|^2_2$

ADMM

Updating ${\bf \beta}$ is a smooth quadratic problem
Updating ${\bf \gamma}$ is a separable lasso problem (elementwise thresholding)
Update the multipliers

${\bf \lambda}^{(t+1)} \leftarrow {\bf \lambda}^{(t)} + \rho({\bf D\beta}^{(t+1)} - {\bf \gamma}^{(t+1)})$

Same algorithm applies to a general regularization matrix ${\bf D}$ (generalized lasso)

Remarks on ADMM

Related algorithms

Split Bregman iteration¹
Dykstra’s alternating projection algorithm²
Proximal point algorithm applied to the dual

Numerous applications in statistics and machine learning: lasso, generalized lasso, graphical lasso, (overlapping) group lasso, …
Embraces distributed computing for big data³

¹ Goldstein, T. and Osher, S. (2009). The split Bregman method for l1-regularized problems. SIAM J. Img. Sci., 2:323-343.
² Dykstra, R. L. (1983). An algorithm for restricted least squares regression. J. Amer. Statist. Assoc., 78(384):837-842.
³ Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1-122.

Augmented Lagrangian Method

Fri, 17 Jul 2015 19:04:01 +0800

Consider minimizing:

$f({\bf x})$ subject to equality constraints $g_i({\bf x}) = 0$ for $i=1, \ldots ,q$

Inequality constraints are ignored for simplicity

Assume $f$ and $g_i$ are smooth for simplicity

At a constrained minimum, the Lagrange multiplier condition

${\bf 0}=\nabla f({\bf x})+\sum^q_{i=1}\lambda_i\nabla g_i({\bf x})$

holds provided $\nabla g_i({\bf x})$ are linearly independent

Augmented Lagrangian¹

¹ Reference: http://www.stat.ncsu.edu/people/zhou/courses/st810/notes/lect24final.pdf

$\mathcal{L}_\rho ({\bf x},{\bf \lambda}) = f({\bf x}) + \sum^q_{i=1}\lambda_i g_i({\bf x}) + \frac{\rho}{2}\sum^q_{i=1}g_i({\bf x})^2$

The penalty term $\frac{\rho}{2}\sum^q_{i=1}g_i({\bf x})^2$ punishes violations of the equality constraints $g_i({\bf x}) = 0$

Optimize the augmented Lagrangian and adjust ${\bf \lambda}$ in the hope of matching the true Lagrange multipliers

For $\rho$ large enough (finite), the unconstrained minimizer of the augmented Lagrangian coincides with the constrained solution of the original problem

At convergence, the penalty gradient $\rho \sum^q_{i=1} g_i({\bf x})\nabla g_i({\bf x})$ vanishes and we recover the standard multiplier rule

Algorithm

Take $\rho$ initially large or gradually increase it; iterate

Find the unconstrained minimum ${\bf x}^{(t+1)}\leftarrow \mathop{argmin}_{\bf x}\mathcal{L}_\rho ({\bf x},{\bf \lambda}^{(t)})$

Update the multiplier vector ${\bf \lambda}$ : $\lambda^{(t+1)}_i \leftarrow \lambda^{(t)}_i + \rho g_i({\bf x}^{(t+1)}), \; i = 1, \ldots , q$

Intuition for updating ${\bf \lambda}$

If ${\bf x}^{(t+1)}$ is the unconstrained minimum of $\mathcal{L}_\rho({\bf x},{\bf \lambda}^{(t)})$ , then the stationarity condition says

$\begin{eqnarray*} {\bf 0} & = \nabla f({\bf x}^{(t+1)}) + \sum^q_{i=1} \lambda^{(t)}_i \nabla g_i({\bf x}^{(t+1)}) + \rho \sum^q_{i=1} g_i({\bf x}^{(t+1)}) \nabla g_i({\bf x}^{(t+1)}) \\ & = \nabla f({\bf x}^{(t+1)}) + \sum^q_{i=1} \left[ \lambda^{(t)}_i + \rho g_i({\bf x}^{(t+1)}) \right] \nabla g_i({\bf x}^{(t+1)}) \end{eqnarray*}$

Comparing with the Lagrange multiplier condition, the bracketed term plays the role of the multiplier, which is why it is taken as the new estimate $\lambda^{(t+1)}_i$

For non-smooth $f$ , replace gradient $\nabla f$ by sub-differential $\partial f$
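The iteration can be traced on a tiny problem where the inner minimization has a closed form. The problem is an assumption of this example: minimize $\frac{1}{2}\|{\bf x}\|_2^2$ subject to the single constraint ${\bf a}^T{\bf x} - b = 0$, whose solution is ${\bf x}^\ast = b\,{\bf a}/\|{\bf a}\|_2^2$ with multiplier $\lambda^\ast = -b/\|{\bf a}\|_2^2$.

```python
def augmented_lagrangian(a, b, rho=1.0, iters=50):
    """Method of multipliers for min 0.5 * ||x||^2 subject to a . x = b."""
    aa = sum(ai * ai for ai in a)
    lam = 0.0
    x = [0.0] * len(a)
    for _ in range(iters):
        # Inner minimization: x + lam * a + rho * a * (a . x - b) = 0
        # has the closed-form solution x = c * a.
        c = (rho * b - lam) / (1.0 + rho * aa)
        x = [c * ai for ai in a]
        # Multiplier update using the constraint residual at the new x.
        lam += rho * (sum(ai * xi for ai, xi in zip(a, x)) - b)
    return x, lam


x_opt, lam_opt = augmented_lagrangian([1.0, 1.0], 2.0)
```

For this problem the multiplier error shrinks by the factor $1/(1 + \rho\|{\bf a}\|_2^2)$ per iteration, so a moderate fixed $\rho$ already converges quickly and a larger $\rho$ converges faster.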

Example: basis pursuit

Basis pursuit problem seeks the sparsest solution subject to linear constraints $\begin{align} \text{minimize } & \|{\bf x}\|_1 \\ \text{subject to } & {\bf Ax} = {\bf b} \end{align}$

Take $\rho$ initially large or gradually increase it; iterate according to

$\begin{eqnarray*} {\bf x}^{(t+1)} & \leftarrow \mathop{argmin}_{\bf x} \|{\bf x}\|_1 + \langle {\bf \lambda}^{(t)}, {\bf Ax} - {\bf b} \rangle + \frac{\rho}{2} \|{\bf Ax} - {\bf b}\|^2_2 \text{ (lasso)} \\ {\bf \lambda}^{(t+1)} & \leftarrow {\bf \lambda}^{(t)} + \rho \left( {\bf Ax}^{(t+1)} - {\bf b} \right) \end{eqnarray*}$

Converges in a finite (small) number of steps²

² Yin, W., Osher, S., Goldfarb, D., and Darbon, J. (2008). Bregman iterative algorithms for l₁-minimization with applications to compressed sensing. SIAM J. Imaging Sci., 1(1):143-168. Online: http://www.caam.rice.edu/~wy1/paperfiles/Rice_CAAM_TR07-13.PDF

Remarks

The augmented Lagrangian method dates back to the late 1960s³
Monograph by Bertsekas⁴ provides a general treatment
Same as the Bregman iteration (Yin et al., 2008) proposed for basis pursuit (compressive sensing)
Equivalent to proximal point algorithm applied to the dual; can be accelerated (Nesterov)

³ Hestenes, M.R. (1969). Multiplier and gradient methods. J. Optimization Theory Appl., 4:303-320. Powell, M. J. D. (1969). A method for nonlinear constraints in minimization problems. In Optimization (Sympos., Univ. Keele, Keele, 1968), pages 283-298. Academic Press, London.
⁴ Bertsekas, D. P. (1982). Constrained Optimization and Lagrange Multiplier Methods. Computer Science and Applied Mathematics. Academic Press Inc. [Harcourt Brace Jovanovich Publishers], New York.

Introduction to Nonlinear Optimization

Thu, 16 Jul 2015 05:22:17 +0800

From last night through today I finished a book I had never managed to get through before, 《Introduction to Nonlinear Optimization》¹ by Amir Beck². It cleared up some concepts I had misunderstood in the past. The books in the MOS-SIAM Series on Optimization³ are all quite good. I borrowed three of them in a row and hope to read them all properly. For now I have only skimmed through this one on nonlinear optimization without doing any of the exercises, although the exercises look quite useful too.

¹ 《Introduction to Nonlinear Optimization》 on Douban: http://book.douban.com/subject/26551626/ and on Amazon: http://www.amazon.com/Introduction-Nonlinear-Optimization-Algorithms-Applications/dp/1611973643/
² Amir Beck is an associate professor at the Technion, Israel Institute of Technology: http://iew3.technion.ac.il/Home/Users/becka.html
³ MOS-SIAM Series on Optimization: http://bookstore.siam.org/mos-siam-series-on-optimization/

Table of Contents

1. Mathematical Preliminaries
2. Optimality Conditions for Unconstrained Optimization
3. Least Squares
4. The Gradient Method
5. Newton’s Method
6. Convex Sets
7. Convex Functions
8. Convex Optimization
9. Optimization over a Convex Set
10. Optimality Conditions for Linearly Constrained Problems
11. The KKT Conditions
12. Duality

Chapter 1: Mathematical Preliminaries

Definition 1.1 (inner product) An inner product on $\Re^n$ is a map $\langle\cdot,\cdot\rangle: \Re^n \times \Re^n \to \Re$ with the following properties.

1. Symmetry: $\langle{\bf x},{\bf y}\rangle=\langle{\bf y},{\bf x}\rangle$ for any ${\bf x},{\bf y} \in \Re^n$.

2. Additivity: $\langle{\bf x},{\bf y}+{\bf z}\rangle=\langle{\bf x},{\bf y}\rangle+\langle{\bf x},{\bf z}\rangle$ for any ${\bf x},{\bf y},{\bf z} \in \Re^n$.

3. Homogeneity: $\langle\lambda{\bf x},{\bf y}\rangle=\lambda\langle{\bf x},{\bf y}\rangle$ for any $\lambda\in \Re$ and ${\bf x},{\bf y} \in \Re^n$.

4. Positive definiteness: $\langle{\bf x},{\bf x}\rangle\ge0$ for any ${\bf x}\in \Re^n$, and $\langle{\bf x},{\bf x}\rangle=0$ if and only if ${\bf x}={\bf 0}$. $\tag*{$\blacksquare$}$

Example 1.2 The most common inner product is the dot product $\langle{\bf x},{\bf y}\rangle={\bf x}^T{\bf y}=\sum^n_{i=1}x_iy_i \text{ for any } {\bf x},{\bf y} \in \Re^n$. The dot product is the standard inner product; unless stated otherwise, "inner product" means the dot product. $\tag*{$\blacksquare$}$

Example 1.3 The weighted dot product is another example of an inner product on $\Re^n$, where the weight vector satisfies ${\bf w}\in \Re^n_{++}$: $\langle {\bf x},{\bf y} \rangle_{\bf w} = \sum^n_{i=1}w_ix_iy_i \; \tag*{$\blacksquare$}$
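As a quick numerical companion to Definition 1.1 and Examples 1.2 and 1.3, the sketch below spot-checks the four inner product axioms for the weighted dot product. The function names `dot` and `weighted_dot` and the sample vectors are chosen only for this illustration; a numerical check on a few vectors is of course not a proof.

```python
import numpy as np

def dot(x, y):
    """Standard dot product <x, y> = x^T y."""
    return float(np.dot(x, y))

def weighted_dot(x, y, w):
    """Weighted dot product <x, y>_w = sum_i w_i x_i y_i, with every w_i > 0."""
    w = np.asarray(w, dtype=float)
    if np.any(w <= 0):
        raise ValueError("weights must be strictly positive")
    return float(np.sum(w * x * y))

# Illustrative data (not from the book).
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -1.0, 0.5])
z = np.array([0.0, 2.0, -2.0])
w = np.array([1.0, 2.0, 3.0])
lam = 2.5

# 1. symmetry
sym_ok = np.isclose(weighted_dot(x, y, w), weighted_dot(y, x, w))
# 2. additivity
add_ok = np.isclose(weighted_dot(x, y + z, w),
                    weighted_dot(x, y, w) + weighted_dot(x, z, w))
# 3. homogeneity
hom_ok = np.isclose(weighted_dot(lam * x, y, w), lam * weighted_dot(x, y, w))
# 4. positive definiteness (spot check on x and on the zero vector)
zero = np.zeros(3)
pos_ok = weighted_dot(x, x, w) > 0 and weighted_dot(zero, zero, w) == 0

print(dot(x, y), weighted_dot(x, y, w), sym_ok, add_ok, hom_ok, pos_ok)
```

With all weights equal to one, `weighted_dot` reduces to the ordinary dot product.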

Definition 1.4 (norm) A norm $\|\cdot\|$ on $\Re^n$ is a function $\| \cdot \| : \Re^n \to \Re$ satisfying the following conditions.

1. Nonnegativity: $\|{\bf x}\| \ge 0 \text{ for any } {\bf x} \in \Re^n \text{ and } \|{\bf x}\| = 0 \text{ if and only if } {\bf x} = {\bf 0}$

2. Positive homogeneity: $\|\lambda {\bf x}\| = |\lambda| \|{\bf x}\| \text{ for any } {\bf x} \in \Re^n \text{ and } \lambda \in \Re$

3. Triangle inequality: $\|{\bf x} + {\bf y}\| \le \|{\bf x}\| + \|{\bf y}\| \text{ for any } {\bf x},{\bf y} \in \Re^n$

Accordingly, the $p$-norm is defined as $\ell_p \equiv \|{\bf x}\|_p \equiv \sqrt[p]{\sum^n_{i=1}|x_i|^p} \text{ , for } p \ge 1$

Similarly, the infinity norm ($\infty$-norm) is defined as $\|{\bf x}\|_{\infty} \equiv \max_{i=1,2,\cdots,n}|x_i|=\lim_{p \to \infty} \|{\bf x}\|_p \;$, i.e., the largest absolute value.

The 1-norm $\|{\bf x}\|_1$ is the sum of absolute values $\sum^n_{i=1}|x_i|$. $\tag*{$\blacksquare$}$
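The following minimal sketch evaluates the $p$-norm directly from its definition and shows numerically how it approaches the infinity norm as $p$ grows. The helper name `p_norm` and the sample vector are assumptions made for this example only.

```python
import numpy as np

def p_norm(x, p):
    """l_p norm (sum_i |x_i|^p)^(1/p), defined here for p >= 1."""
    if p < 1:
        raise ValueError("p must be >= 1")
    return float(np.sum(np.abs(x) ** p) ** (1.0 / p))

x = np.array([3.0, -4.0, 1.0])

norm_1 = p_norm(x, 1)                 # sum of absolute values
norm_2 = p_norm(x, 2)                 # Euclidean length
norm_inf = float(np.max(np.abs(x)))   # largest absolute value

# As p increases, the p-norm decreases toward the infinity norm.
for p in (1, 2, 4, 16, 64):
    print(p, p_norm(x, p))
print("inf", norm_inf)
```

In practice `np.linalg.norm(x, ord=p)` computes the same quantities; the explicit version is shown only to mirror the definition.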

Lemma 1.5 (Cauchy-Schwarz inequality) $\text{ For any } {\bf x}, {\bf y} \in \Re^n,$ $|{\bf x}^T{\bf y}| \le \|{\bf x}\|_2 \cdot \|{\bf y}\|_2$

Equality holds if and only if ${\bf x}$ and ${\bf y}$ are linearly dependent. $\tag*{$\blacksquare$}$
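A small numerical illustration of Lemma 1.5, assuming randomly drawn vectors (the seed and dimensions are arbitrary): the inequality holds for a generic pair, and the gap closes when one vector is a scalar multiple of the other.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
y = rng.standard_normal(5)

# Generic pair: |x^T y| <= ||x||_2 * ||y||_2
lhs = abs(x @ y)
rhs = np.linalg.norm(x) * np.linalg.norm(y)
holds = lhs <= rhs + 1e-12

# Linearly dependent pair: the two sides agree up to rounding error.
y_dep = -3.0 * x
gap_dep = np.linalg.norm(x) * np.linalg.norm(y_dep) - abs(x @ y_dep)

print(holds, gap_dep)
```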

Definition 1.6 (matrix norms) A norm $\|\cdot\|$ on $\Re^{m \times n}$ is a function $\|\cdot\|:\Re^{m \times n} \to \Re$ satisfying the following properties.

1. Nonnegativity: $\|{\bf A}\| \ge 0$ for any ${\bf A} \in \Re^{m \times n}$, and $\|{\bf A}\| = 0$ if and only if ${\bf A} = {\bf 0}$.

2. Positive homogeneity: $\|\lambda{\bf A}\| = |\lambda| \cdot \|{\bf A}\|$ for any ${\bf A} \in \Re^{m \times n}$ and $\lambda \in \Re$.

3. Triangle inequality: $\|{\bf A} + {\bf B}\| \le \|{\bf A}\| + \|{\bf B}\|$ for any ${\bf A},{\bf B} \in \Re^{m \times n}$.

The commonly used induced matrix norm $\| \cdot \|_{a,b}$ is defined as follows. Given a matrix ${\bf A} \in \Re^{m \times n}$ and two norms $\| \cdot \|_a$ and $\| \cdot \|_b$ on $\Re^n$ and $\Re^m$ respectively, $\| {\bf A} \|_{a,b}$ is

$\| {\bf A} \|_{a,b} = \max_{\bf x}\{ \| {\bf Ax} \|_b : \| {\bf x} \|_a \le 1 \}$

It follows from the definition that the following inequality holds for any ${\bf x} \in \Re^n$:

$\| {\bf Ax} \|_b \le \| {\bf A} \|_{a,b} \| {\bf x} \|_a \tag*{$\blacksquare$}$

Example 1.7 Spectral norm: the induced norm with $a = b = 2$, $\| {\bf A} \|_2 = \sqrt{\lambda_{\max}({\bf A}^T{\bf A})} = \sigma_{\max}({\bf A})$ $\tag*{$\blacksquare$}$

Example 1.8 1-norm, also called the maximum absolute column sum norm: $\| {\bf A} \|_1 = \max_{j} \sum^m_{i=1} |A_{ij}|$ $\tag*{$\blacksquare$}$

Example 1.9 $\infty$-norm, also called the maximum absolute row sum norm: $\| {\bf A} \|_\infty = \max_{i} \sum^n_{j=1} |A_{ij}|$ $\tag*{$\blacksquare$}$

The Frobenius norm is an example of a matrix norm that is not an induced norm. It is defined as follows.

$\| {\bf A} \|_F = \sqrt{\sum^m_{i=1} \sum^n_{j=1}A^2_{ij}}, \; {\bf A} \in \Re^{m \times n} \tag*{$\blacksquare$}$
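To make the three induced norms concrete, here is a small sketch that compares NumPy's built-in values with the column-sum, row-sum, and largest-singular-value characterizations, and then spot-checks the bound $\| {\bf Ax} \|_2 \le \| {\bf A} \|_2 \| {\bf x} \|_2$ on one random vector. The matrix and the variable names are illustrative assumptions.

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0],
              [0.0, -1.0]])

# Built-in induced norms.
spectral = np.linalg.norm(A, 2)        # largest singular value
one_norm = np.linalg.norm(A, 1)        # maximum absolute column sum
inf_norm = np.linalg.norm(A, np.inf)   # maximum absolute row sum

# The same quantities computed from their characterizations.
one_manual = np.abs(A).sum(axis=0).max()
inf_manual = np.abs(A).sum(axis=1).max()
spec_manual = np.sqrt(np.linalg.eigvalsh(A.T @ A).max())

# Spot check of ||A x||_2 <= ||A||_2 ||x||_2 for one random x.
rng = np.random.default_rng(1)
x = rng.standard_normal(2)
bound_ok = np.linalg.norm(A @ x) <= spectral * np.linalg.norm(x) + 1e-12

print(spectral, one_norm, inf_norm, bound_ok)
```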

Optimization Problems in Sparse Coding

Thu, 21 May 2015 08:33:45 +0800

The sparse coding problem: $\arg\min_x f(x), \quad f(x) = \left\|y-Ax\right\|_2^2+ \lambda\|x\|_1$

Alternating minimization (ADM)
Primal-dual algorithms
Soft thresholding
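As an illustration of how soft thresholding is used on the sparse coding objective $\|y-Ax\|_2^2+\lambda\|x\|_1$, the sketch below implements plain ISTA (proximal gradient with a fixed step). This is one standard choice and not necessarily the method the notes had in mind; the function names, the step size $1/L$ with $L = 2\sigma_{\max}(A)^2$, and the iteration count are assumptions for this example.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink each entry toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def objective(A, y, x, lam):
    """Sparse coding objective ||y - A x||_2^2 + lam * ||x||_1."""
    return float(np.sum((y - A @ x) ** 2) + lam * np.sum(np.abs(x)))

def ista(A, y, lam, n_iter=500):
    """ISTA sketch: gradient step on ||y - A x||_2^2, then soft thresholding."""
    # Lipschitz constant of the gradient 2 A^T (A x - y).
    L = 2.0 * np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - y)
        x = soft_threshold(x - grad / L, lam / L)
    return x

# With A = I the minimizer is soft_threshold(y, lam / 2) entrywise.
y_demo = np.array([3.0, -0.2, 1.0])
x_demo = ista(np.eye(3), y_demo, lam=1.0)
print(x_demo)
```

The small entry of `y_demo` is set exactly to zero, which is the sparsity-inducing effect of the $\ell_1$ term.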

Dictionary learning: K-SVD
Further reading after K-SVD. Image classification: Discriminative K-SVD for Dictionary Learning in Face Recognition (CVPR), Label Consistent K-SVD: Learning a Discriminative Dictionary for Recognition (TPAMI). Structured sparsity: Robust Classification using Structured Sparse Representation (CVPR). Low rank: Learning Structured Low-rank Representations for Image Classification (CVPR).
An analysis of existing sparse representation algorithms: A survey of sparse representation: algorithms and applications
See the talk by Prof. Lin Liang (林倞, Computer Science, Sun Yat-sen University) covering a series of works from NIPS, ACML and ML. It introduces essentially all of ADM and its extensions.
Iterative optimization
$\sum_{i=1}^p\left\|y^{(i)}-A\cdot x^{(i)}\right\|_2^2 + \sum_{i=1}^p S\left(x^{(i)}\right)$
For convex sparse models, use alternating minimization based on the proximal operator. See Boyd's paper.
An ADMM Solution to the Sparse Coding Problem: http://stanford.edu/class/ee364b/projects/2011projects/reports/bhaskar_zou.pdf
http://www.eecs.berkeley.edu/~yang/software/l1benchmark/
This paper (http://web.stanford.edu/~boyd/papers/prox_algs.html) explains the topic from the proximal operator perspective and goes somewhat deeper.
The course by Prof. He Bingsheng (何炳生) of Nanjing University (http://math.nju.edu.cn/~hebma/) goes into considerable depth.
For biconvex problems, essentially any algorithm with a convergence guarantee works reasonably well.
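For a single signal with a fixed dictionary, the proximal-operator-based splitting mentioned above can be sketched as scaled-form ADMM on $\|y-Ax\|_2^2+\lambda\|z\|_1$ subject to $x = z$. This is a minimal sketch under my own assumptions: the penalty parameter `rho`, the fixed iteration count, and the function name `admm_lasso` are illustrative, and the linked report may organize the updates differently.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def admm_lasso(A, y, lam, rho=1.0, n_iter=300):
    """Scaled-form ADMM sketch for min ||y - A x||_2^2 + lam * ||z||_1 s.t. x = z."""
    n = A.shape[1]
    z = np.zeros(n)
    u = np.zeros(n)                      # scaled dual variable
    M = 2.0 * A.T @ A + rho * np.eye(n)  # system matrix of the x-update
    Aty2 = 2.0 * A.T @ y
    for _ in range(n_iter):
        # x-update: minimize ||y - A x||^2 + (rho / 2) ||x - z + u||^2
        x = np.linalg.solve(M, Aty2 + rho * (z - u))
        # z-update: proximal step on the l1 term
        z = soft_threshold(x + u, lam / rho)
        # dual update
        u = u + x - z
    return z

y_demo = np.array([3.0, -0.2, 1.0])
z_demo = admm_lasso(np.eye(3), y_demo, lam=1.0)
print(z_demo)
```

Returning `z` rather than `x` gives an iterate that is exactly sparse. For repeated solves with the same `A`, factoring `M` once (for example with a Cholesky factorization) would avoid the repeated `solve` call.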

Several Basic Optimization Problems

Thu, 21 May 2015 06:53:20 +0800

Four basic optimization problems that can be solved with ALM (Augmented Lagrange Multiplier method), LP (Linear Programming), and IRLS (Iteratively Reweighted Least Squares)

Question 1

least entropy & error correction

$\arg \min \|x\|_1 + \|e\|_1$ subj. to $y=Ax+e$
A standard linear program.
Robust SRC (see http://research.microsoft.com/pubs/132810/PAMI-Face.pdf and Face Recognition via Sparse Representation). It uses the identity matrix as the occlusion dictionary and solves the problem in standard form.
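One way to see why Question 1 is a linear program is the usual variable-splitting argument, sketched here. The symbols ${\bf x}^{\pm}$, ${\bf e}^{\pm}$ and ${\bf z}$ are introduced only for this illustration, with $A \in \Re^{m \times n}$, $x \in \Re^n$ and $e \in \Re^m$. Write $x = {\bf x}^{+} - {\bf x}^{-}$ and $e = {\bf e}^{+} - {\bf e}^{-}$ with all four parts nonnegative, and stack them as ${\bf z} = ({\bf x}^{+}, {\bf x}^{-}, {\bf e}^{+}, {\bf e}^{-}) \in \Re^{2n+2m}$. The problem then becomes

$\min_{{\bf z}} \; {\bf 1}^T {\bf z} \text{ subj. to } \begin{bmatrix} A & -A & I & -I \end{bmatrix} {\bf z} = y, \; {\bf z} \ge {\bf 0}$

At an optimum, ${\bf x}^{+}$ and ${\bf x}^{-}$ are never both positive in the same coordinate (otherwise both could be reduced by the same amount without violating the constraint), and likewise for ${\bf e}^{\pm}$, so ${\bf 1}^T {\bf z} = \|x\|_1 + \|e\|_1$. The matrix $\begin{bmatrix} A & I \end{bmatrix}$ is exactly the dictionary extended by the identity occlusion dictionary, and the solution is recovered as $x = {\bf x}^{+} - {\bf x}^{-}$, $e = {\bf e}^{+} - {\bf e}^{-}$.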

Question 2

least energy & error correction

$\arg\min \|x\|_2 + \|e\|_1$ subj. to $y=Ax+e$
Robust CRC

Question 3

sparse regression with noise - lasso

$\arg\min \|x\|_1 + \|e\|_2$ subj. to $y=Ax+e$
The standard lasso problem
Standard SRC

Question 4

least energy with noise

$\arg\min \|x\|_2 + \|e\|_2$ subj. to $y=Ax+e$
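The minimum-norm least-squares solution; it can be obtained via the generalized inverse (pseudoinverse).
CRC: the representation coefficients are constrained with the 2-norm, and there is a closed-form solution.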
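The two closed forms mentioned for Question 4 can be sketched as follows: the minimum-norm least-squares solution through the pseudoinverse, and a ridge-regularized solve of the form $(A^TA + \lambda I)^{-1}A^Ty$, which is how CRC-style coefficients are commonly computed. The function names and the tiny example matrix are assumptions for illustration, and the regularization weight `lam` is a free parameter not specified in the notes.

```python
import numpy as np

def min_norm_least_squares(A, y):
    """Minimum-norm least-squares solution x = pinv(A) @ y."""
    return np.linalg.pinv(A) @ y

def ridge_coefficients(A, y, lam):
    """Closed-form ridge solution x = (A^T A + lam * I)^(-1) A^T y, lam > 0."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

# Small underdetermined system: two equations, three unknowns.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
y = np.array([1.0, 2.0])

x_pinv = min_norm_least_squares(A, y)        # exact fit with the smallest ||x||_2
x_ridge = ridge_coefficients(A, y, 1e-8)     # approaches x_pinv as lam -> 0
x_shrunk = ridge_coefficients(A, y, 1.0)     # larger lam trades fit for smaller ||x||_2

print(x_pinv, x_ridge, x_shrunk)
```

As `lam` tends to zero the ridge solution tends to the pseudoinverse solution, which is one way to connect the two remarks above.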

Frobenius norm

$\| \mathbf{A} \|_F = \sqrt{\sum^m_{i=1} \sum^n_{j=1} | a_{ij} |^2} = \sqrt{\operatorname{trace}(\mathbf{A}^* \mathbf{A})} = \sqrt{\sum^{\min\{m,n\}}_{i=1}\sigma^2_i}$
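The three expressions for the Frobenius norm can be checked against each other numerically. The matrix below is an arbitrary example chosen for this sketch.

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, -3.0, 4.0]])

# Entrywise definition: square root of the sum of squared magnitudes.
fro_entries = np.sqrt(np.sum(np.abs(A) ** 2))
# Trace form: sqrt(trace(A^* A)), with A^* the conjugate transpose.
fro_trace = np.sqrt(np.trace(A.conj().T @ A))
# Singular value form: sqrt of the sum of squared singular values.
fro_svd = np.sqrt(np.sum(np.linalg.svd(A, compute_uv=False) ** 2))

print(fro_entries, fro_trace, fro_svd, np.linalg.norm(A, "fro"))
```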
