Singular Value Decomposition (SVD)

In [1]:
using Pkg
Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
  Activating project at `~/Documents/github.com/ucla-biostat-216/2022fall/slides/12-svd`
Status `~/Documents/github.com/ucla-biostat-216/2022fall/slides/12-svd/Project.toml`
  [916415d5] Images v0.25.2
⌃ [91a5bcdd] Plots v1.35.5
  [ce6b1742] RDatasets v0.7.7
  [2913bbd2] StatsBase v0.33.21
  [3eaba693] StatsModels v0.6.33
  [f3b207a7] StatsPlots v0.15.4
  [37e2e46d] LinearAlgebra
Info Packages marked with ⌃ have new versions available
In [2]:
using Images, LinearAlgebra, Random, RDatasets, StatsBase, StatsModels, StatsPlots
gr() # GR backend
Out[2]:
Plots.GRBackend()
  • We saw in earlier lectures that a square matrix can have complex eigenvalues and eigenvectors. Symmetric matrices have the nice property that all eigenvalues and eigenvectors are real. The singular value decomposition (SVD) generalizes the spectral decomposition to general rectangular matrices.

  • Applied mathematicians and statisticians call SVD the singularly (most) valuable decomposition.

  • Let $\mathbf{A} \in \mathbb{R}^{m \times n}$ with $\text{rank}(\mathbf{A})=r$. We assume $m \ge n$. Instead of the eigen-equation, now we have \begin{eqnarray*} \mathbf{A} \mathbf{v}_i &=& \sigma_i \mathbf{u}_i, \quad i = 1,\ldots,r \\ \mathbf{A} \mathbf{v}_i &=& 0 \, \mathbf{u}_i, \quad i = r+1,\ldots,n, \end{eqnarray*} where the left singular vectors $\mathbf{u}_i \in \mathbb{R}^m$ are orthonormal, the right singular vectors $\mathbf{v}_i \in \mathbb{R}^n$ are orthonormal, and the singular values $$ \sigma_1 \ge \cdots \ge \sigma_r > \sigma_{r+1} = \cdots = \sigma_{n} = 0. $$

  • Collecting above equations into matrix multiplication format: $$ \mathbf{A} \begin{pmatrix} \mid & & \mid & & \mid \\ \mathbf{v}_1 & \cdots & \mathbf{v}_r & \cdots & \mathbf{v}_n \\ \mid & & \mid & & \mid \end{pmatrix} = \begin{pmatrix} \mid & & \mid & & \mid \\ \mathbf{u}_1 & \cdots & \mathbf{u}_r & \cdots & \mathbf{u}_n \\ \mid & & \mid & & \mid \end{pmatrix} \begin{pmatrix} \sigma_1 & & & \\ & \ddots & & \\ & & \sigma_r & \\ & & & \mathbf{O}_{n-r} \end{pmatrix}, $$ or $$ \mathbf{A} \mathbf{V} = \mathbf{U} \boldsymbol{\Sigma}. $$ Multiplying both sides by $\mathbf{V}'$, we have the famous singular value decomposition (SVD) $$ \mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}' = \sigma_1 \mathbf{u}_1 \mathbf{v}_1' + \cdots + \sigma_r \mathbf{u}_r \mathbf{v}_r', $$ where $\mathbf{U} \in \mathbb{R}^{m \times n}$ has orthogonal columns, $\boldsymbol{\Sigma} = \text{diag}(\sigma_1, \ldots, \sigma_r, \mathbf{0}_{n-r})$, and $\mathbf{V} \in \mathbb{R}^{n \times n}$ is orthogonal. $$ \mathbf{A} = \begin{pmatrix} \mid & & \mid \\ \mathbf{u}_1 & \cdots & \mathbf{u}_n \\ \mid & & \mid \end{pmatrix} \begin{pmatrix} \sigma_1 & & & \\ & \ddots & & \\ & & \sigma_r & \\ & & & \mathbf{O}_{n-r} \end{pmatrix} \begin{pmatrix} - & \mathbf{v}_1' & -\\ & \vdots & \\ - & \mathbf{v}_n' & - \end{pmatrix}. $$

In [3]:
# a square matrix
A = [3.0 0.0; 4.0 5.0]
Out[3]:
2×2 Matrix{Float64}:
 3.0  0.0
 4.0  5.0
In [4]:
# eigenvalues and eigenvectors
eigen(A)
Out[4]:
Eigen{Float64, Float64, Matrix{Float64}, Vector{Float64}}
values:
2-element Vector{Float64}:
 3.0
 5.0
vectors:
2×2 Matrix{Float64}:
  0.447214  0.0
 -0.894427  1.0
In [5]:
# singular values and singular vectors
# different from eigenvalues and eigenvectors
Asvd = svd(A)
Out[5]:
SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
2×2 Matrix{Float64}:
 -0.316228  -0.948683
 -0.948683   0.316228
singular values:
2-element Vector{Float64}:
 6.70820393249937
 2.2360679774997894
Vt factor:
2×2 Matrix{Float64}:
 -0.707107  -0.707107
 -0.707107   0.707107
In [6]:
A * Asvd.Vt[1, :]
Out[6]:
2-element Vector{Float64}:
 -2.1213203435596424
 -6.363961030678928
In [7]:
# σ₁ u₁, where u₁ = U[:, 1] is the first left singular vector
Asvd.S[1] * Asvd.U[:, 1]
Out[7]:
2-element Vector{Float64}:
 -2.1213203435596433
 -6.363961030678928
In [8]:
# orthogonality of U
Asvd.U'Asvd.U
Out[8]:
2×2 Matrix{Float64}:
 1.0          1.11022e-16
 1.11022e-16  1.0
In [9]:
# orthogonality of V
Asvd.V'Asvd.V
Out[9]:
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0
In [10]:
A ≈ Asvd.U * Diagonal(Asvd.S) * Asvd.V'
Out[10]:
true
  • Reduced form of the SVD. If we just keep the first $r$ singular triplets ($\sigma_i, \mathbf{u}_i, \mathbf{v}_i$), then $$ \mathbf{A} = \mathbf{U}_r \boldsymbol{\Sigma}_r \mathbf{V}_r', $$ where $\mathbf{U}_r \in \mathbb{R}^{m \times r}$, $\boldsymbol{\Sigma}_r = \text{diag}(\sigma_1, \ldots, \sigma_r)$, and $\mathbf{V}_r \in \mathbb{R}^{n \times r}$ $$ \mathbf{A} = \begin{pmatrix} \mid & & \mid \\ \mathbf{u}_1 & \cdots & \mathbf{u}_r \\ \mid & & \mid \end{pmatrix} \begin{pmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \end{pmatrix} \begin{pmatrix} - & \mathbf{v}_1' & -\\ & \vdots & \\ - & \mathbf{v}_r' & - \end{pmatrix}. $$

  • Full SVD. Opposite to the reduced form of SVD, we can also augment the $\mathbf{U}$ matrix to be a square orthogonal matrix to obtain the full SVD $$ \mathbf{A} = \begin{pmatrix} \mid & & \mid \\ \mathbf{u}_1 & \cdots & \mathbf{u}_m \\ \mid & & \mid \end{pmatrix} \begin{pmatrix} \sigma_1 & & & \\ & \ddots & & \\ & & \sigma_r & \\ & & & \mathbf{O}_{n-r} \\ \\ \\ & & \mathbf{O}_{m-n, n} \end{pmatrix} \begin{pmatrix} - & \mathbf{v}_1' & - \\ & \vdots & \\ - & \mathbf{v}_n' & - \end{pmatrix}. $$

In [11]:
Random.seed!(216)
m, n, r = 5, 3, 2
# a rank r matrix by rank factorization
A = randn(m, r) * randn(r, n)
Out[11]:
5×3 Matrix{Float64}:
  1.19194   -1.06587    -0.282852
 -0.782863   0.582771    0.49484
  1.35372   -0.920401   -1.08576
  0.629091  -0.529823   -0.235526
 -0.129245   0.0789172   0.127265
In [12]:
# svd: U is mxn, Σ is nxn, V is nxn
# SVD
svd(A)
Out[12]:
SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
5×3 Matrix{Float64}:
 -0.550507    0.73403   -0.129657
  0.382519    0.107063  -0.916621
 -0.677041   -0.623757  -0.355273
 -0.296395    0.223011  -0.129409
  0.0662395   0.104547   0.00539119
singular values:
3-element Vector{Float64}:
 2.8564495864018364
 0.5518082957047505
 7.29153183536249e-17
Vt factor:
3×3 Matrix{Float64}:
 -0.723686   0.558421  0.405518
  0.133192  -0.463537  0.87601
  0.677155   0.687968  0.261078
In [13]:
# full SVD
svd(A, full=true)
Out[13]:
SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
5×5 Matrix{Float64}:
 -0.550507    0.73403   -0.129657    -0.373998    -0.0381611
  0.382519    0.107063  -0.916621    -0.0319149   -0.0316878
 -0.677041   -0.623757  -0.355273    -0.116055     0.113379
 -0.296395    0.223011  -0.129409     0.919574    -0.00724408
  0.0662395   0.104547   0.00539119   0.00457152   0.992286
singular values:
3-element Vector{Float64}:
 2.8564495864018364
 0.5518082957047505
 7.29153183536249e-17
Vt factor:
3×3 Matrix{Float64}:
 -0.723686   0.558421  0.405518
  0.133192  -0.463537  0.87601
  0.677155   0.687968  0.261078
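As a quick check of the reduced form (a sketch reusing the rank-2 matrix A and r = 2 from the cells above), keeping only the first r singular triplets already reconstructs A:

# reduced (rank-r) SVD: the first r singular triplets reproduce A exactly
Asvd = svd(A)
Ur, Σr, Vr = Asvd.U[:, 1:r], Diagonal(Asvd.S[1:r]), Asvd.V[:, 1:r]
A ≈ Ur * Σr * Vr'   # true (up to roundoff)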

SVD tells us everything about a matrix

  • SVD and four fundamental subspaces. Given the full SVD \begin{eqnarray*} \mathbf{A} &=& \begin{pmatrix} \mid & & \mid & \mid & & \mid \\ \mathbf{u}_1 & \cdots & \mathbf{u}_r & \mathbf{u}_{r+1} & \cdots & \mathbf{u}_m \\ \mid & & \mid & \mid & & \mid \end{pmatrix} \begin{pmatrix} \sigma_1 & & & \\ & \ddots & & \\ & & \sigma_r & \\ & & & \mathbf{O}_{n-r} \\ \\ & & \mathbf{O}_{m-n, n} & \end{pmatrix} \begin{pmatrix} - & \mathbf{v}_1' & - \\ & \vdots & \\ - & \mathbf{v}_r' & - \\ - & \mathbf{v}_{r+1}' & - \\ & \vdots & \\ - & \mathbf{v}_n' & - \end{pmatrix} \\ &=& \begin{pmatrix} \mathbf{U}_r & \mathbf{U}_{m-r} \end{pmatrix} \begin{pmatrix} \boldsymbol{\Sigma}_r & & \\ & & \mathbf{O}_{n-r} \\ \\ & \mathbf{O}_{m-n,n} & \end{pmatrix} \begin{pmatrix} \mathbf{V}_r' \\ \mathbf{V}_{n-r}' \end{pmatrix}, \end{eqnarray*} Then

    1. $\mathcal{C}(\mathbf{A}) = \mathcal{C}(\mathbf{U}_r)$; $\quad \mathbf{U}_r \mathbf{U}_r'$ is the orthogonal projector onto $\mathcal{C}(\mathbf{A})$.
    2. $\mathcal{N}(\mathbf{A}') = \mathcal{C}(\mathbf{U}_{m-r})$; $\quad \mathbf{U}_{m-r} \mathbf{U}_{m-r}'$ is the orthogonal projector onto $\mathcal{N}(\mathbf{A}')$.
    3. $\mathcal{C}(\mathbf{A}') = \mathcal{C}(\mathbf{V}_{r})$; $\quad \mathbf{V}_{r} \mathbf{V}_{r}'$ is the orthogonal projector onto $\mathcal{C}(\mathbf{A}')$.
    4. $\mathcal{N}(\mathbf{A}) = \mathcal{C}(\mathbf{V}_{n-r})$; $\quad \mathbf{V}_{n-r} \mathbf{V}_{n-r}'$ is the orthogonal projector onto $\mathcal{N}(\mathbf{A})$.
  • $\text{rank}(\mathbf{A}) = r = \text{# positive singular values}$.

  • Frobenius norm $\|\mathbf{A}\|_{\text{F}}^2 = \sum_{i,j} a_{ij}^2 = \text{tr}(\mathbf{A}' \mathbf{A}) = \sum_i \sigma_i^2$.

  • Spectral norm or $\ell_2$ matrix norm: $\|\mathbf{A}\|_2 = \max_{\mathbf{x} \ne \mathbf{0}} \frac{\|\mathbf{A} \mathbf{x}\|}{\|\mathbf{x}\|} = \sigma_1$. HW7 Q3.

In [14]:
Random.seed!(216)
# generate a rank deficient matrix
m, n, r = 5, 3, 2
A = randn(m, r) * randn(r, n)
# full svd
Asvd = svd(A, full=true)
Out[14]:
SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
5×5 Matrix{Float64}:
 -0.550507    0.73403   -0.129657    -0.373998    -0.0381611
  0.382519    0.107063  -0.916621    -0.0319149   -0.0316878
 -0.677041   -0.623757  -0.355273    -0.116055     0.113379
 -0.296395    0.223011  -0.129409     0.919574    -0.00724408
  0.0662395   0.104547   0.00539119   0.00457152   0.992286
singular values:
3-element Vector{Float64}:
 2.8564495864018364
 0.5518082957047505
 7.29153183536249e-17
Vt factor:
3×3 Matrix{Float64}:
 -0.723686   0.558421  0.405518
  0.133192  -0.463537  0.87601
  0.677155   0.687968  0.261078
In [15]:
# Frobenius norm
norm(A) ≈ norm(Asvd.S)
Out[15]:
true
In [16]:
# orthogonal projector onto C(A)
P1 = A * pinv(A'A) * A' # WARNING: this is very inefficient code; take Biostat 257
Out[16]:
5×5 Matrix{Float64}:
  0.841858   -0.131992   -0.0851415   0.326864     0.0402755
 -0.131992    0.157783   -0.325762   -0.0895006    0.0365309
 -0.0851415  -0.325762    0.847457    0.0615673   -0.110059
  0.326864   -0.0895006   0.0615673   0.137584     0.00368202
  0.0402755   0.0365309  -0.110059    0.00368202   0.0153178
In [17]:
# orthogonal projector onto C(A)
P2 = Asvd.U[:, 1:2] * Asvd.U[:, 1:2]'
Out[17]:
5×5 Matrix{Float64}:
  0.841858   -0.131992   -0.0851415   0.326864     0.0402755
 -0.131992    0.157783   -0.325762   -0.0895006    0.0365309
 -0.0851415  -0.325762    0.847457    0.0615673   -0.110059
  0.326864   -0.0895006   0.0615673   0.137584     0.00368202
  0.0402755   0.0365309  -0.110059    0.00368202   0.0153178
In [18]:
# they should be equal by the uniqueness of orthogonal projector
P1 ≈ P2
Out[18]:
true
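Along the same lines, a sketch (using the full SVD Asvd computed above) checking that the rank equals the number of positive singular values and that the trailing right singular vector gives the orthogonal projector onto N(A):

# rank(A) equals the number of nonzero singular values
count(Asvd.S .> 1e-10)                            # 2, same as rank(A)
# V_{n-r} V_{n-r}' projects onto N(A)
v3 = Asvd.V[:, 3]                                 # trailing right singular vector
norm(A * v3)                                      # ≈ 0, so v3 ∈ N(A)
v3 * v3' ≈ I - Asvd.V[:, 1:2] * Asvd.Vt[1:2, :]   # complementary projectors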

Proof of existence of SVD using eigen-decomposition of Gram matrix

  • Relating SVD to eigen-decomposition. From SVD $\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}'$, we have \begin{eqnarray*} \mathbf{A}' \mathbf{A} &=& (\mathbf{V} \boldsymbol{\Sigma} \mathbf{U}') (\mathbf{U} \boldsymbol{\Sigma} \mathbf{V}') = \mathbf{V} \boldsymbol{\Sigma}^2 \mathbf{V}' \\ \mathbf{A} \mathbf{A}' &=& (\mathbf{U} \boldsymbol{\Sigma} \mathbf{V}') (\mathbf{V} \boldsymbol{\Sigma} \mathbf{U}') = \mathbf{U} \boldsymbol{\Sigma}^2 \mathbf{U}'. \end{eqnarray*} This says

    • $\mathbf{u}_i$ are simply the eigenvectors of the symmetric matrix $\mathbf{A} \mathbf{A}'$
    • $\mathbf{v}_i$ are simply the eigenvectors of the symmetric matrix $\mathbf{A}' \mathbf{A}$
    • $\sigma_i = \sqrt{\lambda_i}$, where $\lambda_i$ are the eigenvalues of $\mathbf{A}' \mathbf{A}$ sorted in decreasing order.

      Proof of existence of SVD: Start from the positive eigenvalues $\lambda_i > 0$ and corresponding (orthonormal) eigenvectors $\mathbf{v}_i$ of $\mathbf{A}' \mathbf{A}$. Set $\sigma_i = \sqrt{\lambda_i}$ and $$ \mathbf{u}_i = \frac{\mathbf{A} \mathbf{v}_i}{\sigma_i}, \quad i = 1,\ldots,r. $$ Verify that $\mathbf{u}_i$ are orthonormal. Lastly, augment $\mathbf{u}_i$s by an orthonormal basis of $\mathcal{N}(\mathbf{A}')$ and augment $\mathbf{v}_i$s by an orthonormal basis of $\mathcal{N}(\mathbf{A})$. A numerical check of this construction appears after the cells below.

In [19]:
# revisit our earlier 5-by-3, rank-2 matrix
A
Out[19]:
5×3 Matrix{Float64}:
  1.19194   -1.06587    -0.282852
 -0.782863   0.582771    0.49484
  1.35372   -0.920401   -1.08576
  0.629091  -0.529823   -0.235526
 -0.129245   0.0789172   0.127265
In [20]:
# full SVD
Asvd
Out[20]:
SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
5×5 Matrix{Float64}:
 -0.550507    0.73403   -0.129657    -0.373998    -0.0381611
  0.382519    0.107063  -0.916621    -0.0319149   -0.0316878
 -0.677041   -0.623757  -0.355273    -0.116055     0.113379
 -0.296395    0.223011  -0.129409     0.919574    -0.00724408
  0.0662395   0.104547   0.00539119   0.00457152   0.992286
singular values:
3-element Vector{Float64}:
 2.8564495864018364
 0.5518082957047505
 7.29153183536249e-17
Vt factor:
3×3 Matrix{Float64}:
 -0.723686   0.558421  0.405518
  0.133192  -0.463537  0.87601
  0.677155   0.687968  0.261078
In [21]:
eigen(A'A)
Out[21]:
Eigen{Float64, Float64, Matrix{Float64}, Vector{Float64}}
values:
3-element Vector{Float64}:
 -7.8540073965614e-17
  0.304492395208581
  8.159304239655212
vectors:
3×3 Matrix{Float64}:
 0.677155   0.133192   0.723686
 0.687968  -0.463537  -0.558421
 0.261078   0.87601   -0.405518
In [22]:
eigen(A * A')
Out[22]:
Eigen{Float64, Float64, Matrix{Float64}, Vector{Float64}}
values:
5-element Vector{Float64}:
 -3.5037484483407655e-16
 -3.3864375037012587e-308
  1.8384139114030316e-16
  0.3044923952085807
  8.159304239655214
vectors:
5×5 Matrix{Float64}:
  0.384351  -0.0405875   -0.0936423   0.73403   -0.550507
  0.117971  -0.036814    -0.909365    0.107063   0.382519
  0.149588   0.110912    -0.343314   -0.623757  -0.677041
 -0.903316  -0.00371055  -0.215458    0.223011  -0.296395
  0.0        0.992312     0.0         0.104547   0.0662395
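Following the existence proof, a sketch constructing the singular triplets of A from the eigen-decomposition of the Gram matrix A'A (singular vectors are determined only up to sign, so we compare absolute values):

# build the SVD of A from eigen(A'A): σᵢ = √λᵢ and uᵢ = A vᵢ / σᵢ
evals, evecs = eigen(Symmetric(A'A))
idx = sortperm(evals, rev = true)[1:2]   # indices of the two positive eigenvalues
σs  = sqrt.(evals[idx])                  # singular values
Us  = (A * evecs[:, idx]) ./ σs'         # uᵢ = A vᵢ / σᵢ
σs ≈ Asvd.S[1:2]                         # true
abs.(Us) ≈ abs.(Asvd.U[:, 1:2])          # true, up to column signs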

Another relation of SVD to eigen-decomposition

In the full SVD, partition the $\mathbf{U} \in \mathbb{R}^{m \times m}$ as $$ \mathbf{U} = \begin{pmatrix} \mathbf{U}_n & \mathbf{U}_{m-n} \end{pmatrix}, $$ where $\mathbf{U}_n \in \mathbb{R}^{m \times n}$ and $\mathbf{U}_{m-n} \in \mathbb{R}^{m \times (m - n)}$. Let $\boldsymbol{\Sigma} = \text{diag}(\sigma_1, \ldots, \sigma_n)$. Then $$ \begin{pmatrix} \mathbf{O}_n & \mathbf{A}' \\ \mathbf{A} & \mathbf{O}_m \end{pmatrix} = \frac{1}{\sqrt 2} \begin{pmatrix} \mathbf{V} & \mathbf{V} & \mathbf{O} \\ \mathbf{U}_n & - \mathbf{U}_n & \sqrt{2} \mathbf{U}_{m-n} \end{pmatrix} \begin{pmatrix} \boldsymbol{\Sigma} & & \\ & - \boldsymbol{\Sigma} & \\ & & \mathbf{O}_{m-n} \end{pmatrix} \frac{1}{\sqrt 2} \begin{pmatrix} \mathbf{V}' & \mathbf{U}_n' \\ \mathbf{V}' & - \mathbf{U}_n' \\ \mathbf{O} & \sqrt{2} \mathbf{U}_{m-n}' \end{pmatrix}. $$

In [23]:
A
Out[23]:
5×3 Matrix{Float64}:
  1.19194   -1.06587    -0.282852
 -0.782863   0.582771    0.49484
  1.35372   -0.920401   -1.08576
  0.629091  -0.529823   -0.235526
 -0.129245   0.0789172   0.127265
In [24]:
svd(A)
Out[24]:
SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
5×3 Matrix{Float64}:
 -0.550507    0.73403   -0.129657
  0.382519    0.107063  -0.916621
 -0.677041   -0.623757  -0.355273
 -0.296395    0.223011  -0.129409
  0.0662395   0.104547   0.00539119
singular values:
3-element Vector{Float64}:
 2.8564495864018364
 0.5518082957047505
 7.29153183536249e-17
Vt factor:
3×3 Matrix{Float64}:
 -0.723686   0.558421  0.405518
  0.133192  -0.463537  0.87601
  0.677155   0.687968  0.261078
In [25]:
B = [zeros(n, n) A';
    A zeros(m, m)]
Out[25]:
8×8 Matrix{Float64}:
  0.0        0.0         0.0       …   1.35372    0.629091  -0.129245
  0.0        0.0         0.0          -0.920401  -0.529823   0.0789172
  0.0        0.0         0.0          -1.08576   -0.235526   0.127265
  1.19194   -1.06587    -0.282852      0.0        0.0        0.0
 -0.782863   0.582771    0.49484       0.0        0.0        0.0
  1.35372   -0.920401   -1.08576   …   0.0        0.0        0.0
  0.629091  -0.529823   -0.235526      0.0        0.0        0.0
 -0.129245   0.0789172   0.127265      0.0        0.0        0.0
In [26]:
Beig = eigen(B)
Out[26]:
Eigen{Float64, Float64, Matrix{Float64}, Vector{Float64}}
values:
8-element Vector{Float64}:
 -2.8564495864018284
 -0.5518082957047472
 -1.2588628283183443e-16
 -6.81285049987535e-17
  1.5223696025047923e-17
  2.220446049250313e-15
  0.551808295704753
  2.856449586401836
vectors:
8×8 Matrix{Float64}:
  0.511723    0.0941807   0.170328   -0.58763   …   0.0941807   0.511723
 -0.394863   -0.32777     0.173048   -0.597013     -0.32777    -0.394863
 -0.286745    0.619432    0.0656704  -0.226562      0.619432   -0.286745
 -0.389267   -0.519038   -0.164695   -0.196513      0.519038    0.389267
  0.270481   -0.0757048   0.610223   -0.152636      0.0757048  -0.270481
 -0.47874     0.441063    0.16616    -0.108653  …  -0.441063    0.47874
 -0.209583   -0.157692    0.71388     0.416193      0.157692    0.209583
  0.0468384  -0.073926    0.0         0.0           0.073926   -0.0468384
In [27]:
# reorder eigenvalues
p = [8, 7, 6, 1, 2, 3, 5, 4]
Beigvec = Beig.vectors[:, p]
Out[27]:
8×8 Matrix{Float64}:
  0.511723    0.0941807  -2.25514e-17  …   0.170328   -0.290204  -0.58763
 -0.394863   -0.32777     1.35092e-16      0.173048   -0.294838  -0.597013
 -0.286745    0.619432   -2.88831e-16      0.0656704  -0.111889  -0.226562
  0.389267    0.519038   -0.0405875       -0.164695    0.301252  -0.196513
 -0.270481    0.0757048  -0.036814         0.610223    0.667227  -0.152636
  0.47874    -0.441063    0.110912     …   0.16616     0.317532  -0.108653
  0.209583    0.157692   -0.00371055       0.71388    -0.423747   0.416193
 -0.0468384   0.073926    0.992312         0.0         0.0        0.0
In [28]:
# this should be Vᵣ
Beigvec[1:n, 1:r] * sqrt(2)
Out[28]:
3×2 Matrix{Float64}:
  0.723686   0.133192
 -0.558421  -0.463537
 -0.405518   0.87601
In [29]:
# this should be Uᵣ
Beigvec[n+1:end, 1:r] * sqrt(2)
Out[29]:
5×2 Matrix{Float64}:
  0.550507    0.73403
 -0.382519    0.107063
  0.677041   -0.623757
  0.296395    0.223011
 -0.0662395   0.104547

Some exercises

  • Question: If $\mathbf{A} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}'$ is symmetric positive definite, what is its SVD?

    Answer: The SVD coincides with the eigen-decomposition: $\mathbf{U} \boldsymbol{\Sigma} \mathbf{V}' = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}'$, i.e., $\mathbf{U} = \mathbf{V} = \mathbf{Q}$ and $\boldsymbol{\Sigma} = \boldsymbol{\Lambda}$ (with the eigenvalues sorted in decreasing order).

In [30]:
# a pd matrix
A = [1 1; 1 3]
Out[30]:
2×2 Matrix{Int64}:
 1  1
 1  3
In [31]:
eigen(A)
Out[31]:
Eigen{Float64, Float64, Matrix{Float64}, Vector{Float64}}
values:
2-element Vector{Float64}:
 0.5857864376269051
 3.414213562373095
vectors:
2×2 Matrix{Float64}:
 -0.92388   0.382683
  0.382683  0.92388
In [32]:
svd(A)
Out[32]:
SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
2×2 Matrix{Float64}:
 -0.382683  -0.92388
 -0.92388    0.382683
singular values:
2-element Vector{Float64}:
 3.4142135623730945
 0.5857864376269051
Vt factor:
2×2 Matrix{Float64}:
 -0.382683  -0.92388
 -0.92388    0.382683
  • Question: If $\mathbf{A} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}'$ is symmetric and indefinite (has negative eigenvalues), then what is its SVD?

    Answer: Its SVD is $$ \mathbf{A} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}' = \mathbf{Q} \begin{pmatrix} |\lambda_1| & & \\ & \ddots & \\ & & |\lambda_n| \end{pmatrix} \begin{pmatrix} \text{sgn}(\lambda_1) & & \\ & \ddots & \\ & & \text{sgn}(\lambda_n) \end{pmatrix} \mathbf{Q}'. $$

In [33]:
# an indefinite matrix
A = [1 2; 2 3]
Out[33]:
2×2 Matrix{Int64}:
 1  2
 2  3
In [34]:
eigen(A)
Out[34]:
Eigen{Float64, Float64, Matrix{Float64}, Vector{Float64}}
values:
2-element Vector{Float64}:
 -0.2360679774997897
  4.23606797749979
vectors:
2×2 Matrix{Float64}:
 -0.850651  0.525731
  0.525731  0.850651
In [35]:
svd(A)
Out[35]:
SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
2×2 Matrix{Float64}:
 -0.525731  -0.850651
 -0.850651   0.525731
singular values:
2-element Vector{Float64}:
 4.236067977499789
 0.23606797749978947
Vt factor:
2×2 Matrix{Float64}:
 -0.525731  -0.850651
  0.850651  -0.525731
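To check the formula above numerically (a sketch reusing the indefinite A just defined), assemble U Σ V' from the eigen-decomposition with |λᵢ| and sign flips:

# SVD of a symmetric indefinite matrix from its eigen-decomposition
λ, Q = eigen(A)
svdvals(A) ≈ sort(abs.(λ), rev = true)                  # σᵢ = |λᵢ|
A ≈ Q * Diagonal(abs.(λ)) * (Diagonal(sign.(λ)) * Q')   # U Σ V' with Σ = |Λ|, V = Q sgn(Λ)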
  • Question: Why are the singular values of an orthogonal matrix $\mathbf{Q}$ all equal to 1?

    Answer: $\mathbf{Q} = \mathbf{Q} \cdot \mathbf{I}_n \cdot \mathbf{I}_n$ is an SVD of $\mathbf{Q}$ (orthogonal times diagonal times orthogonal), so all singular values equal 1.

In [36]:
# Hadamard matrix of order 2
H = (1 / sqrt(2)) * [1 1; 1 -1]
Out[36]:
2×2 Matrix{Float64}:
 0.707107   0.707107
 0.707107  -0.707107
In [37]:
# check orthogonality
H'H
Out[37]:
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0
In [38]:
# singular values are all 1
svd(H)
Out[38]:
SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
2×2 Matrix{Float64}:
 -0.707107  -0.707107
 -0.707107   0.707107
singular values:
2-element Vector{Float64}:
 1.0
 0.9999999999999999
Vt factor:
2×2 Matrix{Float64}:
 -1.0  -0.0
 -0.0  -1.0
  • Question: Why are all eigenvalues of a square matrix at most $\sigma_1$ in absolute value?

    Answer: Recall that orthogonal matrices preserve vector length. Using the full SVD $\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}'$, $$ \|\mathbf{A} \mathbf{x}\| = \|\mathbf{U} \boldsymbol{\Sigma} \mathbf{V}' \mathbf{x}\| = \|\boldsymbol{\Sigma} \mathbf{V}' \mathbf{x}\| \le \sigma_1 \|\mathbf{V}' \mathbf{x}\| = \sigma_1 \|\mathbf{x}\|. $$ Now an eigenvector $\mathbf{x}$ satisfies $$ \|\mathbf{A} \mathbf{x}\| = |\lambda| \|\mathbf{x}\|. $$ Thus we have $|\lambda| \le \sigma_1$.

In [39]:
Random.seed!(216)
A = randn(5, 5)
Out[39]:
5×5 Matrix{Float64}:
  1.27293    -1.24595   -0.00147378   0.359248    2.06304
 -0.0716618   0.881297  -0.270751    -1.1542      1.36629
 -0.445158   -1.5708    -0.37621      0.0790726   1.11764
  0.45854     0.485928   0.206471    -0.49577     1.41871
  0.100877   -0.949817  -1.11681     -0.282272   -0.675163
In [40]:
eigvals(A) .|> abs |> sort
Out[40]:
5-element Vector{Float64}:
 0.12739109954008143
 0.7194223279350407
 1.8026102333712462
 2.1147164087310544
 2.1147164087310544
In [41]:
svdvals(A) |> sort
Out[41]:
5-element Vector{Float64}:
 0.05567198633641073
 1.02673507675862
 1.5170242347123837
 2.525722122771602
 3.3733072355323444
  • Question: If $\mathbf{A} = \mathbf{x} \mathbf{y}'$, find the singular values and singular vectors. What does $|\lambda| \le \sigma_1$ imply?

    Answer: Write $\mathbf{A} = \mathbf{x} \mathbf{y}'$ with nonzero $\mathbf{x} \in \mathbb{R}^m$ and $\mathbf{y} \in \mathbb{R}^n$. Since $\mathbf{A} = \|\mathbf{x}\| \|\mathbf{y}\| \left(\frac{\mathbf{x}}{\|\mathbf{x}\|}\right) \left(\frac{\mathbf{y}}{\|\mathbf{y}\|}\right)'$, the only nonzero singular value is $\sigma_1 = \|\mathbf{x}\| \|\mathbf{y}\|$, with $\mathbf{u}_1 = \mathbf{x}/\|\mathbf{x}\|$ and $\mathbf{v}_1 = \mathbf{y}/\|\mathbf{y}\|$. When $m = n$, the only nonzero eigenvalue is $\lambda = \mathbf{y}' \mathbf{x}$ (with eigenvector $\mathbf{x}$), so $|\lambda| \le \sigma_1$ is exactly the Cauchy-Schwarz inequality $|\mathbf{y}' \mathbf{x}| \le \|\mathbf{x}\| \|\mathbf{y}\|$.
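A quick numerical check of this answer (a sketch with arbitrary random vectors):

# rank-1 matrix x y': σ₁ = ‖x‖‖y‖, and |λ| = |y'x| ≤ σ₁ is Cauchy-Schwarz
x, y = randn(4), randn(4)
A1 = x * y'
svdvals(A1)[1] ≈ norm(x) * norm(y)     # true
maximum(abs, eigvals(A1)) ≈ abs(y'x)   # true; the other eigenvalues are 0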

Geometry of SVD

Visualize how a $2 \times 2$ matrix acts on a vector via SVD: $$ \mathbf{A} \mathbf{x} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}' \mathbf{x}. $$ $\mathbf{V}'$ first rotates (or reflects) $\mathbf{x}$, $\boldsymbol{\Sigma}$ then stretches the coordinates by the singular values, and $\mathbf{U}$ rotates (or reflects) the result, so the unit circle is mapped to an ellipse with semi-axes $\sigma_1 \mathbf{u}_1$ and $\sigma_2 \mathbf{u}_2$.
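A minimal Plots sketch of this geometry (the 2 × 2 matrix below is a hypothetical example, reusing the GR backend loaded at the top):

# the unit circle is mapped by A to an ellipse with semi-axes σ₁u₁ and σ₂u₂
A2 = [3.0 0.0; 4.0 5.0]
F  = svd(A2)
θ  = range(0, 2π, length = 200)
circle  = [cos.(θ) sin.(θ)]'                 # 2×200 points on the unit circle
ellipse = A2 * circle                        # their images A x = U Σ V' x
plot(circle[1, :], circle[2, :], aspect_ratio = 1, label = "unit circle")
plot!(ellipse[1, :], ellipse[2, :], label = "image under A")
plot!([0, F.S[1] * F.U[1, 1]], [0, F.S[1] * F.U[2, 1]], arrow = true, label = "σ₁u₁")
plot!([0, F.S[2] * F.U[1, 2]], [0, F.S[2] * F.U[2, 2]], arrow = true, label = "σ₂u₂")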

SVD and generalized inverse

Using the SVD, the Moore-Penrose inverse is given by $$ \mathbf{A}^+ = \mathbf{V} \boldsymbol{\Sigma}^+ \mathbf{U}' = \mathbf{V}_r \boldsymbol{\Sigma}_r^{-1} \mathbf{U}_r' = \sum_{i=1}^r \sigma_i^{-1} \mathbf{v}_i \mathbf{u}_i', $$ where $\boldsymbol{\Sigma}^+ = \text{diag}(\sigma_1^{-1}, \ldots, \sigma_r^{-1}, 0, \ldots, 0)$ and $r = \text{rank}(\mathbf{A})$. In HW7, you verify that this definition satisfies all properties of the Moore-Penrose inverse. This is how the pinv function is implemented in Julia, Matlab, Python, and R.

In [42]:
Random.seed!(216)
m, n, r = 5, 3, 2
A = randn(m, r) * randn(r, n)
Out[42]:
5×3 Matrix{Float64}:
  1.19194   -1.06587    -0.282852
 -0.782863   0.582771    0.49484
  1.35372   -0.920401   -1.08576
  0.629091  -0.529823   -0.235526
 -0.129245   0.0789172   0.127265
In [43]:
Asvd = svd(A)
Asvd.Vt[1:2, :]' * Diagonal(inv.(Asvd.S[1:2])) * Asvd.U[:, 1:2]'
Out[43]:
3×5 Matrix{Float64}:
  0.316647  -0.0710696   0.020971   0.128921   0.00845298
 -0.724231  -0.015156    0.391619  -0.24528   -0.0748736
  1.08714    0.224269   -1.08635    0.311957   0.175375
In [44]:
pinv(A)
Out[44]:
3×5 Matrix{Float64}:
  0.316647  -0.0710696   0.020971   0.128921   0.00845298
 -0.724231  -0.015156    0.391619  -0.24528   -0.0748736
  1.08714    0.224269   -1.08635    0.311957   0.175375

Eckart-Young theorem: best approximation to a matrix

In [45]:
img = load("golub-561-838-rank-120.jpg")
Out[45]:
In [46]:
channels = float(channelview(img))
size(channels)
Out[46]:
(3, 838, 561)
In [47]:
channels[3, :, :]
Out[47]:
838×561 Matrix{Float32}:
 0.482353  0.482353  0.482353  0.439216  …  0.478431  0.458824  0.478431
 0.482353  0.470588  0.447059  0.431373     0.466667  0.482353  0.478431
 0.482353  0.447059  0.423529  0.54902      0.435294  0.435294  0.470588
 0.443137  0.431373  0.54902   0.984314     0.537255  0.470588  0.470588
 0.490196  0.435294  0.482353  0.968627     0.486275  0.427451  0.470588
 0.454902  0.427451  0.47451   0.968627  …  0.490196  0.466667  0.470588
 0.498039  0.458824  0.47451   0.952941     0.490196  0.458824  0.470588
 0.501961  0.447059  0.47451   0.968627     0.517647  0.462745  0.470588
 0.478431  0.427451  0.490196  0.968627     0.498039  0.458824  0.470588
 0.466667  0.423529  0.498039  0.980392     0.501961  0.466667  0.470588
 0.470588  0.423529  0.498039  0.976471  …  0.509804  0.470588  0.470588
 0.470588  0.427451  0.490196  0.968627     0.501961  0.470588  0.470588
 0.470588  0.427451  0.490196  0.964706     0.501961  0.454902  0.470588
 ⋮                                       ⋱                      ⋮
 0.478431  0.427451  0.396078  0.427451     0.4       0.435294  0.466667
 0.486275  0.427451  0.392157  0.423529     0.403922  0.443137  0.466667
 0.490196  0.431373  0.388235  0.415686     0.4       0.431373  0.466667
 0.494118  0.431373  0.388235  0.415686     0.407843  0.431373  0.466667
 0.486275  0.431373  0.392157  0.423529  …  0.411765  0.439216  0.466667
 0.47451   0.423529  0.396078  0.423529     0.4       0.431373  0.466667
 0.486275  0.411765  0.396078  0.407843     0.407843  0.419608  0.470588
 0.47451   0.427451  0.423529  0.45098      0.423529  0.435294  0.462745
 0.478431  0.45098   0.454902  0.458824     0.462745  0.466667  0.458824
 0.423529  0.423529  0.423529  0.419608  …  0.427451  0.431373  0.431373
 0.388235  0.380392  0.372549  0.364706     0.384314  0.388235  0.372549
 1.0       1.0       1.0       1.0          1.0       1.0       1.0
In [48]:
function rank_approx(F::SVD, k)
    U, S, Vt = F.U, F.S, F.Vt
    M = U[:, 1:k] * Diagonal(S[1:k]) * Vt[1:k, :]
    M .= clamp.(M, 0.0, 1.0)
end
svdfactors = (svd(channels[1, :, :]), svd(channels[2, :, :]), svd(channels[3, :, :]))
Out[48]:
(SVD{Float32, Float32, Matrix{Float32}, Vector{Float32}}(Float32[-0.0031989813 -0.002967855 … 0.0183309 -0.004316881; -0.004179895 -0.0039784014 … -0.076949075 0.014094186; … ; -0.0063422113 -0.0060919584 … 0.0005547097 -0.017454399; -0.046187274 -0.038557358 … 0.00029002503 0.001047425], Float32[468.8883, 54.572285, 39.715496, 28.825445, 22.580215, 19.784306, 18.398533, 16.746706, 15.208014, 13.955471  …  0.019828707, 0.019351872, 0.019001774, 0.018515224, 0.018378811, 0.018156813, 0.017283395, 0.017243234, 0.016644003, 0.01637554], Float32[-0.0038052236 -0.0049679875 … -0.005103427 -0.0043158513; -0.0008207955 -0.0009993941 … -0.0010606974 -0.0008770722; … ; 0.01440609 -0.018368334 … -0.019079357 0.0073162774; -0.018272433 -0.064345896 … 0.018100321 0.027789444]), SVD{Float32, Float32, Matrix{Float32}, Vector{Float32}}(Float32[-0.012613237 -0.01633393 … -0.06465561 -0.0003529974; -0.012318611 -0.016402498 … 0.07236642 0.05997443; … ; -0.011948539 -0.016418213 … -0.0007248239 0.041448317; -0.055155963 -0.072461374 … 0.00592323 -0.02678206], Float32[419.88687, 60.74678, 34.202217, 29.938581, 23.479506, 21.34288, 16.514172, 16.360628, 15.088315, 12.113871  …  0.018009061, 0.017747186, 0.017407168, 0.016813967, 0.016718494, 0.016438827, 0.016056975, 0.015792124, 0.01548744, 0.014807718], Float32[-0.015269644 -0.015003741 … -0.01495191 -0.015130209; -0.0031515153 -0.002999927 … -0.00310336 -0.0029591415; … ; 0.039632805 -0.06704547 … 0.00932245 -0.023942372; -0.008262174 0.034130894 … 0.0016027894 -0.057798564]), SVD{Float32, Float32, Matrix{Float32}, Vector{Float32}}(Float32[-0.028394103 -0.044606693 … 0.0013892207 0.034061868; -0.025803536 -0.040724188 … 0.0029198546 -0.032955687; … ; -0.022000728 -0.03501293 … -0.033706937 0.040161964; -0.059711784 -0.09410887 … 0.006191494 0.016586903], Float32[383.69214, 59.918865, 33.99363, 27.136225, 24.223873, 21.355232, 17.50871, 16.523273, 14.506897, 12.658147  …  0.021025755, 0.02060199, 0.020447029, 0.02032693, 0.0200175, 0.019096836, 0.018692506, 0.018081604, 0.017987508, 0.016505146], Float32[-0.034779165 -0.031716347 … -0.03301989 -0.034202103; -0.018703843 -0.015579854 … -0.01464163 -0.017756158; … ; -0.082657784 0.028032716 … 0.035916027 0.11056614; 0.045359723 0.0042624963 … 0.009738512 -0.065284625]))
In [49]:
k1, k2, k3 = 25, 25, 25
colorview(RGB,
    rank_approx(svdfactors[1], k1),
    rank_approx(svdfactors[2], k2),
    rank_approx(svdfactors[3], k3)
)
Out[49]:
  • SVD has some inherent optimality properties. It prescribes how to approximate a general matrix $\mathbf{A}$ by a low rank matrix.

  • Before talking about approximation, we need a metric that measures the quality of approximation. We discuss 3 matrix norms.

    • Spectral norm or $\ell_2$ norm: $$ \|\mathbf{A}\|_2 = \max \frac{\|\mathbf{A} \mathbf{x}\|}{\|\mathbf{x}\|} = \sigma_1. $$

    • Frobenius norm: $$ \|\mathbf{A}\|_{\text{F}} = \sqrt{\sum_{i,j} a_{ij}^2} = \sqrt{\text{tr}(\mathbf{A}' \mathbf{A})} = \sqrt{\sigma_1^2 + \cdots + \sigma_r^2}. $$

    • Nuclear norm: $$ \|\mathbf{A}\|_{\text{nuc}} = \sigma_1 + \cdots + \sigma_r. $$

  • These 3 norms already have different values for the identity matrix: \begin{eqnarray*} \|\mathbf{I}_n\|_2 &=& 1 \\ \|\mathbf{I}_n\|_{\text{F}} &=& \sqrt{n} \\ \|\mathbf{I}_n\|_{\text{nuc}} &=& n. \end{eqnarray*} Indeed for any orthogonal matrix $\mathbf{Q} \in \mathbb{R}^{n \times n}$, \begin{eqnarray*} \|\mathbf{Q}\|_2 &=& 1 \\ \|\mathbf{Q}\|_{\text{F}} &=& \sqrt{n} \\ \|\mathbf{Q}\|_{\text{nuc}} &=& n. \end{eqnarray*}

  • Invariance under orthogonal transform. For all three norms, $$ \|\mathbf{Q}_1 \mathbf{A} \mathbf{Q}_2'\| = \|\mathbf{A}\| \text{ for orthogonal } \mathbf{Q}_1 \text{ and } \mathbf{Q}_2. $$

    Proof: HW7.

  • Eckart-Young theorem. Assume the SVD of $\mathbf{A} \in \mathbb{R}^{m \times n}$ with rank $r$: $$ \mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}' = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i'. $$ Then the matrix $$ \mathbf{A}_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i' $$ is the best rank-$k$ approximation to the original matrix $\mathbf{A}$ in the 3 matrix norms ($\ell_2$, Frobenius, and nuclear). More precisely, $$ \|\mathbf{A} - \mathbf{B}\| $$ is minimized by $\mathbf{A}_k$ among all matrices $\mathbf{B}$ with rank $\le k$. A numerical check appears after the proofs below.

  • Proof of Eckart-Young theorem for the $\ell_2$ norm. (optional)

    First we note $$ \|\mathbf{A} - \mathbf{A}_k\|_2 = \left\|\sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i' - \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i' \right\|_2 = \left\|\sum_{i=k+1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i' \right\|_2 = \sigma_{k+1}. $$ We want to show that $$ \|\mathbf{A} - \mathbf{B}\|_2 = \max \frac{\|(\mathbf{A} - \mathbf{B}) \mathbf{x}\|}{\|\mathbf{x}\|} \ge \sigma_{k+1} $$ for any $\mathbf{B}$ with $\text{rank}(\mathbf{B}) \le k$. We show this by choosing a particular $\mathbf{x}$ such that $$ \frac{\|(\mathbf{A} - \mathbf{B}) \mathbf{x}\|}{\|\mathbf{x}\|} \ge \sigma_{k+1}. $$ Choose $\mathbf{x} \ne \mathbf{0}$ such that $$ \mathbf{B} \mathbf{x} = \mathbf{0} \text{ and } \mathbf{x} = \sum_{i=1}^{k+1} c_i \mathbf{v}_i. $$ There exists such $\mathbf{x}$ because (1) $\mathcal{N}(\mathbf{B})$ has dimension $\ge n-k$ because $\text{rank}(\mathbf{B}) \le k$ (rank-nullity theorem) and (2) $\mathbf{v}_1, \ldots, \mathbf{v}_{k+1}$ span a subspace of dimension $k+1$. Thus $\mathcal{N}(\mathbf{B})$ and $\text{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_{k+1}\}$ has a non-trivial intersection. For this $\mathbf{x}$, we have \begin{eqnarray*} & & \|(\mathbf{A} - \mathbf{B}) \mathbf{x}\|^2 \\ &=& \|\mathbf{A} \mathbf{x}\|^2 \\ &=& \left\|\left(\sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i'\right)\left(\sum_{i=1}^{k+1} c_i \mathbf{v}_i\right)\right\|^2 \\ &=& \left\| \sum_{i=1}^{k+1} c_i \sigma_i \mathbf{u}_i \right\|^2 \\ &=& \sum_{i=1}^{k+1} c_i^2 \sigma_i^2 \\ &\ge& \left(\sum_{i=1}^{k+1} c_i^2\right) \sigma_{k+1}^2 \\ &=& \|\mathbf{x}\|^2 \sigma_{k+1}^2. \end{eqnarray*} The proof is finished.

  • (Nathan Srebro's) Proof of Eckart-Young theorem for the Frobenius norm. (optional)

    Let $\mathbf{B}$ be a matrix of rank $k$. By the rank factorization, $\mathbf{B} = \mathbf{C} \mathbf{R}$, where $\mathbf{C} \in \mathbb{R}^{m \times k}$ and $\mathbf{R} \in \mathbb{R}^{k \times n}$. Using SVD of $\mathbf{B}$, we will assume that $\mathbf{C}$ has orthogonal columns (so $\mathbf{C}' \mathbf{C} = \mathbf{D}$) and $\mathbf{R}$ has orthonormal rows (so $\mathbf{R} \mathbf{R}' = \mathbf{I}_k$). We want to show $\mathbf{C} = \mathbf{U}_k \boldsymbol{\Sigma}_k$ and $\mathbf{R} = \mathbf{V}_k'$, where $(\boldsymbol{\Sigma}_k, \mathbf{U}_k, \mathbf{V}_k)$ are the top $k$ singular values/vectors of $\mathbf{A}$. In order for $\mathbf{B} = \mathbf{C} \mathbf{R}$ to minimize $$ f(\mathbf{C}, \mathbf{R}) = \|\mathbf{A} - \mathbf{C} \mathbf{R}\|_{\text{F}}^2, $$ the gradient (partial derivatives) must vanish \begin{eqnarray*} \frac{\partial f(\mathbf{C}, \mathbf{R})}{\partial \mathbf{C}} &=& - 2(\mathbf{A} - \mathbf{C} \mathbf{R}) \mathbf{R}' = \mathbf{O}_{m \times k} \\ \frac{\partial f(\mathbf{C}, \mathbf{R})}{\partial \mathbf{R}} &=& - 2 \mathbf{C}' (\mathbf{A} - \mathbf{C} \mathbf{R}) = \mathbf{O}_{k \times n}. \end{eqnarray*} The first equation shows $$ \mathbf{A} \mathbf{R}' = \mathbf{C} \mathbf{R} \mathbf{R}' = \mathbf{C}. $$ Then by the second equation $$ \mathbf{A}' \mathbf{A} \mathbf{R}' = \mathbf{A}' \mathbf{C} = \mathbf{R}' \mathbf{C}' \mathbf{C} = \mathbf{R}' \mathbf{D}. $$ This says the columns of $\mathbf{R}'$ (rows of $\mathbf{R}$) must be eigenvectors of $\mathbf{A}' \mathbf{A}$, or equivalently, right singular vectors $\mathbf{v}_i$ of $\mathbf{A}$. Similarly the columns of $\mathbf{C}$ must be eigenvectors of $\mathbf{A} \mathbf{A}'$, or equivalently, left singular vectors $\mathbf{u}_i$ of $\mathbf{A}$: $$ \mathbf{A} \mathbf{A}' \mathbf{C} = \mathbf{A} \mathbf{R}' \mathbf{D} = \mathbf{C} \mathbf{D}. $$ Which $\mathbf{u}_i$ and $\mathbf{v}_i$ shall we take to minimize $f$? Apparently we should choose those with largest singular values to achieve the minimum value $\sigma_{k+1}^2 + \cdots + \sigma_r^2$.
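A numerical sketch of the three norms and of the Eckart-Young error for a random matrix (in Julia's LinearAlgebra, opnorm is the spectral norm while norm on a matrix is the Frobenius norm; the nuclear norm is simply the sum of singular values):

# the three norms from the singular values, then the Eckart-Young error
M  = randn(6, 4)
σM = svdvals(M)
opnorm(M) ≈ σM[1]            # spectral norm = σ₁
norm(M)   ≈ norm(σM)         # Frobenius norm = √(σ₁² + ⋯ + σᵣ²)
nucM      = sum(σM)          # nuclear norm
# best rank-k approximation Mₖ and its spectral-norm error σ_{k+1}
k  = 2
Fm = svd(M)
Mk = Fm.U[:, 1:k] * Diagonal(Fm.S[1:k]) * Fm.Vt[1:k, :]
opnorm(M - Mk) ≈ σM[k + 1]   # true, per Eckart-Young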

Singular vectors and Rayleigh quotient

  • Goal: Maximize the Rayleigh quotient $$ \text{maximize} \quad f(\mathbf{x}) = \frac{\|\mathbf{A} \mathbf{x}\|^2}{\|\mathbf{x}\|^2} = \frac{\mathbf{x}' \mathbf{A}' \mathbf{A} \mathbf{x}}{\mathbf{x}' \mathbf{x}} = \frac{\mathbf{x}' \mathbf{S} \mathbf{x}}{\mathbf{x}' \mathbf{x}}. $$

  • Let's calculate the partial derivatives of the objective function $f(\mathbf{x})$ $$ \frac{\partial f(\mathbf{x})}{\partial x_i} = \frac{\left(2\sum_j s_{ij} x_j\right) (\mathbf{x}' \mathbf{x}) - (\mathbf{x}' \mathbf{S} \mathbf{x}) (2x_i)}{(\mathbf{x}' \mathbf{x})^2} = 2 (\mathbf{x}' \mathbf{x})^{-1} \left(\sum_j s_{ij} x_j - f(\mathbf{x}) x_i \right). $$ Collecting partial derivatives into the gradient vector and setting it to zero $$ \nabla f(\mathbf{x}) = 2 (\mathbf{x}' \mathbf{x})^{-1} \left[ \mathbf{S} \mathbf{x} - f(\mathbf{x}) \cdot \mathbf{x} \right] = \mathbf{0} $$ yields $$ \mathbf{S} \mathbf{x} = f(\mathbf{x}) \cdot \mathbf{x}. $$ Thus the optimal $\mathbf{x}$ must be an eigenvector of $\mathbf{S} = \mathbf{A}' \mathbf{A}$ with corresponding eigenvalue $f(\mathbf{x})$. Which one shall we choose? Apparently the top right singular vector $\mathbf{x}^\star = \mathbf{v}_1$ gives us the maximal value which is equal to $\lambda_1 = \sigma_1^2$.

    In HW6, you show this result using an alternative proof (elementary linear algebra arguments).

  • Above we have shown $$ \max_{\mathbf{x} \ne \mathbf{0}} \frac{\|\mathbf{A} \mathbf{x}\|}{\|\mathbf{x}\|} = \sigma_1, $$ which is the spectral norm (or $\ell_2$ norm) of a matrix $\mathbf{A}$.

In [50]:
Random.seed!(216)
m, n, r = 5, 3, 2
A = randn(m, r) * randn(r, n)
Out[50]:
5×3 Matrix{Float64}:
  1.19194   -1.06587    -0.282852
 -0.782863   0.582771    0.49484
  1.35372   -0.920401   -1.08576
  0.629091  -0.529823   -0.235526
 -0.129245   0.0789172   0.127265
In [51]:
x = randn(n)
norm(A * x) / norm(x)
Out[51]:
1.3019523541430873
In [52]:
Asvd = svd(A)
Out[52]:
SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
5×3 Matrix{Float64}:
 -0.550507    0.73403   -0.129657
  0.382519    0.107063  -0.916621
 -0.677041   -0.623757  -0.355273
 -0.296395    0.223011  -0.129409
  0.0662395   0.104547   0.00539119
singular values:
3-element Vector{Float64}:
 2.8564495864018364
 0.5518082957047505
 7.29153183536249e-17
Vt factor:
3×3 Matrix{Float64}:
 -0.723686   0.558421  0.405518
  0.133192  -0.463537  0.87601
  0.677155   0.687968  0.261078
In [53]:
x = Asvd.Vt[1, :]
norm(A * x) / norm(x)
Out[53]:
2.8564495864018364
  • Similarly, the second right singular vector maximizes the Rayleigh quotient subject to an orthogonality constraint \begin{eqnarray*} \text{maximize} &\quad& f(\mathbf{x}) = \frac{\|\mathbf{A} \mathbf{x}\|^2}{\|\mathbf{x}\|^2} = \frac{\mathbf{x}' \mathbf{A}' \mathbf{A} \mathbf{x}}{\mathbf{x}' \mathbf{x}} = \frac{\mathbf{x}' \mathbf{S} \mathbf{x}}{\mathbf{x}' \mathbf{x}} \\ \text{subject to} &\quad& \mathbf{x} \perp \mathbf{v}_1. \end{eqnarray*}

  • Submatrices have smaller singular values. Let $\mathbf{B}$ be a submatrix of $\mathbf{A} \in \mathbb{R}^{m \times n}$. Then $$ \|\mathbf{B}\| \le \|\mathbf{A}\| $$ or equivalently $$ \sigma_1 (\mathbf{B}) \le \sigma_1 (\mathbf{A}). $$

    Proof: Say $\mathbf{B}$ keeps a subset of the rows and columns of $\mathbf{A}$. For any $\mathbf{y}$, let $\tilde{\mathbf{y}} \in \mathbb{R}^n$ hold the entries of $\mathbf{y}$ in the kept column positions and be zero elsewhere, so that $\|\tilde{\mathbf{y}}\| = \|\mathbf{y}\|$ and $\mathbf{B} \mathbf{y}$ is a subvector of $\mathbf{A} \tilde{\mathbf{y}}$, hence $\|\mathbf{B} \mathbf{y}\| \le \|\mathbf{A} \tilde{\mathbf{y}}\|$. Then $$ \sigma_1 (\mathbf{B}) = \max_{\mathbf{y}} \frac{\|\mathbf{B} \mathbf{y}\|}{\|\mathbf{y}\|} \le \max_{\tilde{\mathbf{y}}} \frac{\|\mathbf{A} \tilde{\mathbf{y}}\|}{\|\tilde{\mathbf{y}}\|} \le \max_{\mathbf{x}} \frac{\|\mathbf{A} \mathbf{x}\|}{\|\mathbf{x}\|} = \sigma_1 (\mathbf{A}). $$

In [54]:
svd(A[1:3, 1:2])
Out[54]:
SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
3×2 Matrix{Float64}:
 -0.642327  -0.744931
  0.392978  -0.118181
 -0.658015   0.656591
singular values:
2-element Vector{Float64}:
 2.483072964049883
 0.15272120046204632
Vt factor:
2×2 Matrix{Float64}:
 -0.790967  0.611858
  0.611858  0.790967
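Comparing with the full matrix confirms the claim numerically:

# σ₁ of a submatrix never exceeds σ₁ of the full matrix
svdvals(A[1:3, 1:2])[1] ≤ svdvals(A)[1]   # true: 2.4831 ≤ 2.8564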

Principal component analysis (PCA)

  • PCA is a dimension reduction technique that finds the most informative linear combinations of the $p$ random variables $\mathbf{X} \in \mathbb{R}^p$. Mathematically it finds the linear combinations of the $p$ random variables that have the largest variances. If we know the population covariance of the $p$ variables is $\text{Cov}(\mathbf{X}) = \boldsymbol{\Sigma} \in \mathbb{R}^{p \times p}$, then $$ \text{Var}(\mathbf{a}' \mathbf{X}) = \mathbf{a}' \text{Cov}(\mathbf{X}) \mathbf{a} = \mathbf{a}' \boldsymbol{\Sigma} \mathbf{a}. $$ The larger the magnitude of $\mathbf{a}$, the larger the variance $\text{Var}(\mathbf{a}' \mathbf{X})$, so it makes sense to maximize the normalized version $$ \text{maximize} \quad \frac{\mathbf{a}' \boldsymbol{\Sigma} \mathbf{a}}{\mathbf{a}' \mathbf{a}}. $$

  • Given a data matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$, where there are $n$ observations of $p$ variables. We would substitute in the sample covariance matrix $$ \widehat{\boldsymbol{\Sigma}} = \frac{\tilde{\mathbf{X}}'\tilde{\mathbf{X}}}{n-1}, $$ where $\tilde{\mathbf{X}}$ is the column-centered data, and maximize the Rayleigh quotient $$ \text{maximize} \quad \frac{\mathbf{a}' \widehat{\boldsymbol{\Sigma}} \mathbf{a}}{\mathbf{a}' \mathbf{a}}. $$ From earlier discussion we know the optimal $\mathbf{a}^\star$ maximizing this Rayleigh quotient is the top eigenvector of $\widehat{\boldsymbol{\Sigma}}$, or equivalently, the top right singular vector $\mathbf{v}_1$ of $\tilde{\mathbf{X}}$, achieving the optimal value $\lambda_1$.

    Similarly, the $k$-th right singular vector $\mathbf{v}_k$ maximizes $$ \frac{\mathbf{a}' \widehat{\boldsymbol{\Sigma}} \mathbf{a}}{\mathbf{a}' \mathbf{a}} $$ subject to the constraint $\mathbf{a} \perp \mathbf{v}_i$ for $i = 1,\ldots,k-1$.

  • Columns of $$ \mathbf{V}_r = \begin{pmatrix} \mid & & \mid \\ \mathbf{v}_1 & \cdots & \mathbf{v}_r \\ \mid & & \mid \end{pmatrix} \in \mathbb{R}^{p \times r} $$ are called the principal components.

  • Columns of $\tilde{\mathbf{X}} \mathbf{V}_r \in \mathbb{R}^{n \times r}$ are called the principal scores.

PCA example: Fisher's Iris data

In [55]:
# load iris dataset
iris = dataset("datasets", "iris")
Out[55]:
150×5 DataFrame (125 rows omitted)
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     Cat…
─────┼───────────────────────────────────────────────────────────
   1 │ 5.1          3.5         1.4          0.2         setosa
   2 │ 4.9          3.0         1.4          0.2         setosa
   3 │ 4.7          3.2         1.3          0.2         setosa
   4 │ 4.6          3.1         1.5          0.2         setosa
   5 │ 5.0          3.6         1.4          0.2         setosa
   6 │ 5.4          3.9         1.7          0.4         setosa
   7 │ 4.6          3.4         1.4          0.3         setosa
   8 │ 5.0          3.4         1.5          0.2         setosa
   9 │ 4.4          2.9         1.4          0.2         setosa
  10 │ 4.9          3.1         1.5          0.1         setosa
  11 │ 5.4          3.7         1.5          0.2         setosa
  12 │ 4.8          3.4         1.6          0.2         setosa
  13 │ 4.8          3.0         1.4          0.1         setosa
   ⋮ │ ⋮            ⋮           ⋮            ⋮           ⋮
 139 │ 6.0          3.0         4.8          1.8         virginica
 140 │ 6.9          3.1         5.4          2.1         virginica
 141 │ 6.7          3.1         5.6          2.4         virginica
 142 │ 6.9          3.1         5.1          2.3         virginica
 143 │ 5.8          2.7         5.1          1.9         virginica
 144 │ 6.8          3.2         5.9          2.3         virginica
 145 │ 6.7          3.3         5.7          2.5         virginica
 146 │ 6.7          3.0         5.2          2.3         virginica
 147 │ 6.3          2.5         5.0          1.9         virginica
 148 │ 6.5          3.0         5.2          2.0         virginica
 149 │ 6.2          3.4         5.4          2.3         virginica
 150 │ 5.9          3.0         5.1          1.8         virginica
In [56]:
# retrieve features and labels
X_labels = convert(Vector, iris[!, 5])
Out[56]:
150-element Vector{CategoricalArrays.CategoricalValue{String, UInt8}}:
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 ⋮
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
In [57]:
# retrieve the 4 continuous features
X = (ModelMatrix(ModelFrame(
    @formula(1 ~ 1 + SepalLength + SepalWidth + PetalLength + PetalWidth + Species), 
    iris,
    contrasts = Dict(:Species => StatsModels.FullDummyCoding()))).m)[:, 2:5]
Out[57]:
150×4 Matrix{Float64}:
 5.1  3.5  1.4  0.2
 4.9  3.0  1.4  0.2
 4.7  3.2  1.3  0.2
 4.6  3.1  1.5  0.2
 5.0  3.6  1.4  0.2
 5.4  3.9  1.7  0.4
 4.6  3.4  1.4  0.3
 5.0  3.4  1.5  0.2
 4.4  2.9  1.4  0.2
 4.9  3.1  1.5  0.1
 5.4  3.7  1.5  0.2
 4.8  3.4  1.6  0.2
 4.8  3.0  1.4  0.1
 ⋮
 6.0  3.0  4.8  1.8
 6.9  3.1  5.4  2.1
 6.7  3.1  5.6  2.4
 6.9  3.1  5.1  2.3
 5.8  2.7  5.1  1.9
 6.8  3.2  5.9  2.3
 6.7  3.3  5.7  2.5
 6.7  3.0  5.2  2.3
 6.3  2.5  5.0  1.9
 6.5  3.0  5.2  2.0
 6.2  3.4  5.4  2.3
 5.9  3.0  5.1  1.8
In [58]:
# center but not scale
X̃ = zscore(X, mean(X, dims=1), ones(1, size(X, 2)))
Out[58]:
150×4 Matrix{Float64}:
 -0.743333    0.442667   -2.358  -0.999333
 -0.943333   -0.0573333  -2.358  -0.999333
 -1.14333     0.142667   -2.458  -0.999333
 -1.24333     0.0426667  -2.258  -0.999333
 -0.843333    0.542667   -2.358  -0.999333
 -0.443333    0.842667   -2.058  -0.799333
 -1.24333     0.342667   -2.358  -0.899333
 -0.843333    0.342667   -2.258  -0.999333
 -1.44333    -0.157333   -2.358  -0.999333
 -0.943333    0.0426667  -2.258  -1.09933
 -0.443333    0.642667   -2.258  -0.999333
 -1.04333     0.342667   -2.158  -0.999333
 -1.04333    -0.0573333  -2.358  -1.09933
  ⋮
  0.156667   -0.0573333   1.042   0.600667
  1.05667     0.0426667   1.642   0.900667
  0.856667    0.0426667   1.842   1.20067
  1.05667     0.0426667   1.342   1.10067
 -0.0433333  -0.357333    1.342   0.700667
  0.956667    0.142667    2.142   1.10067
  0.856667    0.242667    1.942   1.30067
  0.856667   -0.0573333   1.442   1.10067
  0.456667   -0.557333    1.242   0.700667
  0.656667   -0.0573333   1.442   0.800667
  0.356667    0.342667    1.642   1.10067
  0.0566667  -0.0573333   1.342   0.600667
In [59]:
# Xsvd.V contains the principal components
Xsvd = svd(X̃)
Xsvd.V
Out[59]:
4×4 adjoint(::Matrix{Float64}) with eltype Float64:
  0.361387   -0.656589   0.58203     0.315487
 -0.0845225  -0.730161  -0.597911   -0.319723
  0.856671    0.173373  -0.0762361  -0.479839
  0.358289    0.075481  -0.545831    0.753657
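As a sanity check (a sketch), PCA via the eigen-decomposition of the sample covariance matrix agrees with the SVD route; the eigenvalues of the sample covariance are σᵢ²/(n − 1):

# eigen-decomposition of the sample covariance vs. SVD of the centered data
nobs = size(X̃, 1)
Shat = X̃'X̃ / (nobs - 1)                                     # 4×4 sample covariance
eigvals(Symmetric(Shat)) ≈ sort(Xsvd.S .^ 2 ./ (nobs - 1))  # true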
In [60]:
# compute the top 3 principal scores
Y = X̃ * Xsvd.V[:, 1:3]
Out[60]:
150×3 Matrix{Float64}:
 -2.68413  -0.319397    0.0279148
 -2.71414   0.177001    0.210464
 -2.88899   0.144949   -0.0179003
 -2.74534   0.318299   -0.0315594
 -2.72872  -0.326755   -0.0900792
 -2.28086  -0.74133    -0.168678
 -2.82054   0.0894614  -0.257892
 -2.62614  -0.163385    0.0218793
 -2.88638   0.578312   -0.0207596
 -2.67276   0.113774    0.197633
 -2.50695  -0.645069    0.075318
 -2.61276  -0.0147299  -0.10215
 -2.78611   0.235112    0.206844
  ⋮
  1.16933   0.16499    -0.281836
  2.10761  -0.372288   -0.0272911
  2.31415  -0.183651   -0.322694
  1.92227  -0.409203   -0.113587
  1.41524   0.574916   -0.296323
  2.56301  -0.277863   -0.29257
  2.41875  -0.304798   -0.504483
  1.94411  -0.187532   -0.177825
  1.52717   0.375317    0.121898
  1.76435  -0.0788589  -0.130482
  1.90094  -0.116628   -0.723252
  1.39019   0.282661   -0.36291
In [61]:
# group principal scores by species label for color coding
setosa     = Y[X_labels .== "setosa"    , :]
versicolor = Y[X_labels .== "versicolor", :]
virginica  = Y[X_labels .== "virginica" , :];
In [62]:
# visualize the first 2 principal components in a 2D interactive plot
p = scatter(setosa[:, 1], setosa[:, 2], marker=:circle, linewidth=0)
scatter!(versicolor[:, 1], versicolor[:, 2], marker=:circle, linewidth=0)
scatter!(virginica[:, 1], virginica[:, 2], marker=:circle, linewidth=0)
plot!(p, xlabel="PC1", ylabel="PC2")
Out[62]:
In [63]:
# visualize the first 3 principal components in a 3D interactive plot
p = scatter(setosa[:, 1], setosa[:, 2], setosa[:, 3], marker=:circle, linewidth=0)
scatter!(versicolor[:, 1], versicolor[:, 2], versicolor[:, 3], marker=:circle, linewidth=0)
scatter!(virginica[:, 1], virginica[:, 2], virginica[:, 3], marker=:circle, linewidth=0)
plot!(p, xlabel="PC1", ylabel="PC2", zlabel="PC3")
Out[63]:

PCA example: genomics

The picture above is from the article Genes mirror geography within Europe by Novembre et al. (2008), published in Nature.

Matrix completion

Missing data is ubiquitous. Matrix completion (a hot topic in machine learning) aims to recover the missing values in a huge matrix.

Candès and Tao proposed a technique to complete a matrix $\mathbf{Y}$ with a large number of missing entries by solving the optimization problem \begin{eqnarray*} &\text{minimize}& \|\mathbf{X}\|_{\text{nuc}} \\ &\text{subject to}& x_{ij} = y_{ij} \text{ for all observed entries } y_{ij}. \end{eqnarray*} That is, we seek the matrix with minimal nuclear norm that agrees with the observed entries.
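Solving this program exactly requires a convex optimization package, which this notebook does not load. As a hypothetical sketch of the idea using only LinearAlgebra, here is a soft-impute style heuristic (iterative singular value soft-thresholding, a standard relaxation related to nuclear-norm minimization, not the exact constrained program above); the data, threshold τ, and iteration count are made-up illustration values:

# soft-impute sketch: repeatedly soft-threshold the singular values while
# keeping the observed entries fixed (hypothetical data and tuning values)
function softimpute(Y, obs; τ = 1.0, iters = 200)
    X = zeros(size(Y))
    X[obs] = Y[obs]
    for _ in 1:iters
        F = svd(X)
        X = F.U * Diagonal(max.(F.S .- τ, 0)) * F.Vt   # shrink singular values
        X[obs] = Y[obs]                                # re-impose observed entries
    end
    return X
end

Y   = randn(50, 3) * randn(3, 40)         # a rank-3 matrix to recover
obs = rand(size(Y)...) .< 0.5             # observe roughly half of the entries
Xhat = softimpute(Y, obs)
norm((Xhat - Y)[.!obs]) / norm(Y[.!obs])  # relative error on the missing entries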

See an example here.

Compressed sensing

See an example here.