using Images, LinearAlgebra, Random, RDatasets, StatsBase, StatsModels, StatsPlots
gr() # GR backend
Plots.GRBackend()
1 SVD introduction
We saw in earlier lectures that a square matrix can have complex eigenvalues and eigenvectors. Symmetric matrices have the nice property that all eigenvalues and eigenvectors are real. The singular value decomposition (SVD) generalizes the spectral decomposition to general rectangular matrices.
Let \(\mathbf{A} \in \mathbb{R}^{m \times n}\) with \(\text{rank}(\mathbf{A})=r\); we assume \(m \ge n\). Instead of the eigen-equation, we now have \[
\begin{align}
\mathbf{A} \mathbf{v}_i &= \sigma_i \mathbf{u}_i, \quad i = 1,\ldots,r \\
\mathbf{A} \mathbf{v}_i &= 0 \, \mathbf{u}_i, \quad i = r+1,\ldots,n,
\end{align}
\] where the left singular vectors \(\mathbf{u}_i \in \mathbb{R}^m\) are orthonormal, the right singular vectors \(\mathbf{v}_i \in \mathbb{R}^n\) are orthonormal, and the singular values \[
\sigma_1 \ge \cdots \ge \sigma_r > \sigma_{r+1} = \cdots = \sigma_{n} = 0.
\]
\(\mathcal{C}(\mathbf{A}) = \mathcal{C}(\mathbf{U}_r)\); \(\quad \mathbf{U}_r \mathbf{U}_r'\) is the orthogonal projector onto \(\mathcal{C}(\mathbf{A})\).
\(\mathcal{N}(\mathbf{A}') = \mathcal{C}(\mathbf{U}_{m-r})\); \(\quad \mathbf{U}_{m-r} \mathbf{U}_{m-r}'\) is the orthogonal projector onto \(\mathcal{N}(\mathbf{A}')\).
\(\mathcal{C}(\mathbf{A}') = \mathcal{C}(\mathbf{V}_{r})\); \(\quad \mathbf{V}_{r} \mathbf{V}_{r}'\) is the orthogonal projector onto \(\mathcal{C}(\mathbf{A}')\).
\(\mathcal{N}(\mathbf{A}) = \mathcal{C}(\mathbf{V}_{n-r})\); \(\quad \mathbf{V}_{n-r} \mathbf{V}_{n-r}'\) is the orthogonal projector onto \(\mathcal{N}(\mathbf{A})\).
\(\text{rank}(\mathbf{A}) = r = \text{\# positive singular values}\).
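The code cell below compares two projectors P1 and P2; since the cell defining them is not shown in this excerpt, here is a minimal sketch (an assumption about the setup, not necessarily the original code) of two ways to construct the orthogonal projector onto \(\mathcal{C}(\mathbf{A})\): from the left singular vectors \(\mathbf{U}_r\) and from the Moore-Penrose inverse (discussed later).

using LinearAlgebra, Random
Random.seed!(216)
A = randn(5, 2) * randn(2, 3)   # a 5×3 matrix with rank r = 2
r = rank(A)
Asvd = svd(A)
Ur = Asvd.U[:, 1:r]             # left singular vectors spanning C(A)
P1 = Ur * Ur'                   # projector onto C(A) from the SVD
P2 = A * pinv(A)                # projector onto C(A) via the Moore-Penrose inverse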
# they should be equal by the uniqueness of orthogonal projector
P1 ≈ P2
true
3 Proof of existence of SVD using eigen-decomposition of Gram matrix
Relating SVD to eigen-decomposition. From SVD \(\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}'\), we have \[
\begin{eqnarray*}
\mathbf{A}' \mathbf{A} &=& (\mathbf{V} \boldsymbol{\Sigma} \mathbf{U}') (\mathbf{U} \boldsymbol{\Sigma} \mathbf{V}') = \mathbf{V} \boldsymbol{\Sigma}^2 \mathbf{V}' \\
\mathbf{A} \mathbf{A}' &=& (\mathbf{U} \boldsymbol{\Sigma} \mathbf{V}') (\mathbf{V} \boldsymbol{\Sigma} \mathbf{U}') = \mathbf{U} \boldsymbol{\Sigma}^2 \mathbf{U}'.
\end{eqnarray*}
\] This says
\(\mathbf{u}_i\) are simply the eigenvectors of the symmetric matrix \(\mathbf{A} \mathbf{A}'\)
\(\mathbf{v}_i\) are simply the eigenvectors of the symmetric matrix \(\mathbf{A}' \mathbf{A}\)
\(\sigma_i\) are simply \(\sqrt{\lambda_i}\), where \(\lambda_i\) are the (shared) nonzero eigenvalues of \(\mathbf{A}' \mathbf{A}\) and \(\mathbf{A} \mathbf{A}'\).
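These relations are easy to confirm numerically; a small sketch (the matrix here is just a random example):

using LinearAlgebra, Random
Random.seed!(216)
A = randn(5, 3)
Asvd = svd(A)
λ = eigen(Symmetric(A' * A)).values   # eigenvalues of A'A, in ascending order
# singular values of A are the square roots of the eigenvalues of A'A
Asvd.S ≈ sqrt.(reverse(λ))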
Proof of existence of SVD: Start from positive eigenvalues \(\lambda_i > 0\) and corresponding (orthonormal) eigenvectors \(\mathbf{v}_i\) of \(\mathbf{A}' \mathbf{A}\). Set \(\sigma_i = \sqrt{\lambda_i}\) and \[
\mathbf{u}_i = \frac{\mathbf{A} \mathbf{v}_i}{\sigma_i}, \quad i = 1,\ldots,r.
\] Verify that \(\mathbf{u}_i\) are orthonormal. Lastly, augment \(\mathbf{u}_i\)s by an orthonormal basis in \(\mathcal{N}(\mathbf{A}') = \mathcal{N}(\mathbf{A} \mathbf{A}')\) and augment \(\mathbf{v}_i\)s by an orthonormal basis in \(\mathcal{N}(\mathbf{A}) = \mathcal{N}(\mathbf{A}' \mathbf{A})\). Finally we verify that \[
\sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i' = \sum_{i=1}^r \sigma_i \frac{\mathbf{A} \mathbf{v}_i}{\sigma_i} \mathbf{v}_i' = \mathbf{A} \sum_{i=1}^r \mathbf{v}_i \mathbf{v}_i' = \mathbf{A} \sum_{i=1}^n \mathbf{v}_i \mathbf{v}_i' = \mathbf{A} \mathbf{I} = \mathbf{A}.
\] The third equality is because \(\mathbf{v}_i \in \mathcal{N}(\mathbf{A})\) for \(i=r+1,\ldots,n\).
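The construction above can also be carried out numerically; a sketch assuming a full-rank \(\mathbf{A}\) (so \(r = n\) and no augmentation is needed):

using LinearAlgebra, Random
Random.seed!(216)
A = randn(5, 3)
λ, V = eigen(Symmetric(A' * A))     # eigen-decomposition of the Gram matrix
σ = sqrt.(reverse(λ))               # singular values in descending order
V = V[:, end:-1:1]                  # reorder eigenvectors accordingly
U = (A * V) ./ σ'                   # uᵢ = A vᵢ / σᵢ, column by column
U' * U ≈ Matrix(I, 3, 3), U * Diagonal(σ) * V' ≈ A   # uᵢ are orthonormal and ∑ σᵢ uᵢ vᵢ' = A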
Question: If \(\mathbf{A} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}'\) is symmetric and indefinite (has negative eigenvalues), then what is its SVD?
Question: If \(\mathbf{A} = \mathbf{x} \mathbf{y}'\), find the singular values and singular vectors. What does \(|\lambda| \le \sigma_1\) imply?
Answer: TODO.
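Not the worked answer, but a numerical sketch one can use to explore the rank-one question (random x and y purely for illustration):

using LinearAlgebra, Random
Random.seed!(216)
x, y = randn(5), randn(3)
A = x * y'                          # a rank-1 matrix
σ = svdvals(A)
# a single nonzero singular value, equal to ‖x‖‖y‖; the rest are numerically zero
σ[1] ≈ norm(x) * norm(y), all(σ[2:end] .< 1e-10 * σ[1])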
6 Geometry of SVD
Visualize how a \(2 \times 2\) matrix acts on a vector via SVD (picture from the linked paper): \[
\mathbf{A} \mathbf{x} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}' \mathbf{x}.
\]
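A quick numerical illustration of this pipeline on an arbitrary \(2 \times 2\) matrix: \(\mathbf{V}' \mathbf{x}\) rotates/reflects, \(\boldsymbol{\Sigma}\) stretches along the coordinate axes, and \(\mathbf{U}\) rotates/reflects again.

using LinearAlgebra, Random
Random.seed!(216)
A = randn(2, 2)
x = randn(2)
F = svd(A)
step1 = F.Vt * x                # rotate/reflect by V'
step2 = Diagonal(F.S) * step1   # stretch by σ₁, σ₂ along the axes
step3 = F.U * step2             # rotate/reflect by U
step3 ≈ A * x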
7 SVD and generalized inverse
Using the SVD, the Moore-Penrose inverse is given by \[
\mathbf{A}^+ = \mathbf{V} \boldsymbol{\Sigma}^+ \mathbf{U}' = \mathbf{V}_r \boldsymbol{\Sigma}_r^{-1} \mathbf{U}_r' = \sum_{i=1}^r \sigma_i^{-1} \mathbf{v}_i \mathbf{u}_i',
\] where \(\boldsymbol{\Sigma}^+ = \text{diag}(\sigma_1^{-1}, \ldots, \sigma_r^{-1}, 0, \ldots, 0)\) and \(r = \text{rank}(\mathbf{A})\). In HW6, you verify this definition satisfies all properties of the Moore-Penrose inverse. This is how the pinv function is implemented in Julia, Matlab, Python, and R.
Random.seed!(216)
m, n, r = 5, 3, 2
A = randn(m, r) * randn(r, n)
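Using the A generated in the cell above, a quick check (a sketch, separate from HW6) that the truncated-SVD formula reproduces pinv:

Asvd = svd(A)
# keep only the r positive singular values
Aplus = Asvd.V[:, 1:r] * Diagonal(1 ./ Asvd.S[1:r]) * Asvd.U[:, 1:r]'
Aplus ≈ pinv(A)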
These 3 norms already have different values for the identity matrix: \[\begin{eqnarray*}
\|\mathbf{I}_n\|_2 &=& 1 \\
\|\mathbf{I}_n\|_{\text{F}} &=& \sqrt{n} \\
\|\mathbf{I}_n\|_{\text{nuc}} &=& n.
\end{eqnarray*}\] Indeed for any orthogonal matrix \(\mathbf{Q} \in \mathbb{R}^{n \times n}\), \[\begin{eqnarray*}
\|\mathbf{Q}\|_2 &=& 1 \\
\|\mathbf{Q}\|_{\text{F}} &=& \sqrt{n} \\
\|\mathbf{Q}\|_{\text{nuc}} &=& n.
\end{eqnarray*}\]
Invariance under orthogonal transform. For all three norms, \[
\|\mathbf{Q}_1 \mathbf{A} \mathbf{Q}_2'\| = \|\mathbf{A}\| \text{ for orthogonal } \mathbf{Q}_1 \text{ and } \mathbf{Q}_2.
\]
Proof: HW6.
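The HW6 proof aside, these identities and the orthogonal invariance are easy to check numerically; a sketch with random orthogonal matrices obtained from QR factorizations:

using LinearAlgebra, Random
Random.seed!(216)
n = 4
Q = Matrix(qr(randn(n, n)).Q)            # a random n×n orthogonal matrix
# spectral, Frobenius, and nuclear norms of an orthogonal matrix: 1, √n, n
opnorm(Q), norm(Q), sum(svdvals(Q))

A  = randn(5, 3)
Q1 = Matrix(qr(randn(5, 5)).Q)
Q2 = Matrix(qr(randn(3, 3)).Q)
# all three norms are invariant under orthogonal transforms
opnorm(Q1 * A * Q2') ≈ opnorm(A), norm(Q1 * A * Q2') ≈ norm(A), sum(svdvals(Q1 * A * Q2')) ≈ sum(svdvals(A))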
Eckart-Young theorem. Assuming SVD of \(\mathbf{A} \in \mathbb{R}^{m \times n}\) with rank \(r\): \[
\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}' = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i'.
\] Then the matrix \[
\mathbf{A}_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i'
\] is the best rank-\(k\) approximation to the original matrix \(\mathbf{A}\) in the 3 matrix norms (\(\ell_2\), Frobenius, and nuclear). More precisely, \[
\|\mathbf{A} - \mathbf{B}\|
\] is minimized by \(\mathbf{A}_k\) among all matrices \(\mathbf{B}\) with rank \(\le k\).
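A numerical sketch of the theorem (random matrix, truncation at \(k = 2\)): the spectral-norm error is \(\sigma_{k+1}\) and the Frobenius error is \(\sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}\).

using LinearAlgebra, Random
Random.seed!(216)
A = randn(6, 4)
F = svd(A)
k = 2
Ak = F.U[:, 1:k] * Diagonal(F.S[1:k]) * F.V[:, 1:k]'   # best rank-k approximation
opnorm(A - Ak) ≈ F.S[k + 1], norm(A - Ak) ≈ norm(F.S[k+1:end])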
Proof of Eckart-Young theorem for the \(\ell_2\) norm. (optional)
First we note \[
\|\mathbf{A} - \mathbf{A}_k\|_2 = \left\|\sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i' - \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i' \right\|_2 = \left\|\sum_{i=k+1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i' \right\|_2 = \sigma_{k+1}.
\] We want to show that \[
\|\mathbf{A} - \mathbf{B}\|_2 = \max_{\mathbf{x} \ne \mathbf{0}} \frac{\|(\mathbf{A} - \mathbf{B}) \mathbf{x}\|}{\|\mathbf{x}\|} \ge \sigma_{k+1}
\] for any \(\mathbf{B}\) with \(\text{rank}(\mathbf{B}) \le k\). We show this by choosing a particular \(\mathbf{x}\) such that \[
\frac{\|(\mathbf{A} - \mathbf{B}) \mathbf{x}\|}{\|\mathbf{x}\|} \ge \sigma_{k+1}.
\] Choose \(\mathbf{x} \ne \mathbf{0}\) such that \[
\mathbf{B} \mathbf{x} = \mathbf{0} \text{ and } \mathbf{x} = \sum_{i=1}^{k+1} c_i \mathbf{v}_i.
\] Such an \(\mathbf{x}\) exists because (1) \(\mathcal{N}(\mathbf{B})\) has dimension \(\ge n-k\) since \(\text{rank}(\mathbf{B}) \le k\) (rank-nullity theorem) and (2) \(\mathbf{v}_1, \ldots, \mathbf{v}_{k+1}\) span a subspace of dimension \(k+1\); since \((n-k) + (k+1) > n\), \(\mathcal{N}(\mathbf{B})\) and \(\text{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_{k+1}\}\) have a non-trivial intersection. For this \(\mathbf{x}\), we have \[\begin{eqnarray*}
& & \|(\mathbf{A} - \mathbf{B}) \mathbf{x}\|^2 \\
&=& \|\mathbf{A} \mathbf{x}\|^2 \\
&=& \left\|\left(\sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i'\right)\left(\sum_{i=1}^{k+1} c_i \mathbf{v}_i\right)\right\|^2 \\
&=& \left\| \sum_{i=1}^{k+1} c_i \sigma_i \mathbf{u}_i \right\|^2 \\
&=& \sum_{i=1}^{k+1} c_i^2 \sigma_i^2 \\
&\ge& \left(\sum_{i=1}^{k+1} c_i^2\right) \sigma_{k+1}^2 \\
&=& \|\mathbf{x}\|^2 \sigma_{k+1}^2.
\end{eqnarray*}\] The proof is finished.
(Nathan Srebro’s) Proof of Eckart-Young theorem for the Frobenius norm. (optional)
Let \(\mathbf{B}\) be a matrix of rank \(k\). By the rank factorization, \(\mathbf{B} = \mathbf{C} \mathbf{R}\), where \(\mathbf{C} \in \mathbb{R}^{m \times k}\) and \(\mathbf{R} \in \mathbb{R}^{k \times n}\). Using the SVD of \(\mathbf{B}\), we may assume that \(\mathbf{C}\) has orthogonal columns (so \(\mathbf{C}' \mathbf{C} = \mathbf{D}\) is diagonal) and \(\mathbf{R}\) has orthonormal rows (so \(\mathbf{R} \mathbf{R}' = \mathbf{I}_k\)). We want to show \(\mathbf{C} = \mathbf{U}_k \boldsymbol{\Sigma}_k\) and \(\mathbf{R} = \mathbf{V}_k'\), where \((\boldsymbol{\Sigma}_k, \mathbf{U}_k, \mathbf{V}_k)\) are the top \(k\) singular values/vectors of \(\mathbf{A}\). In order for \(\mathbf{B} = \mathbf{C} \mathbf{R}\) to minimize \[
f(\mathbf{C}, \mathbf{R}) = \|\mathbf{A} - \mathbf{C} \mathbf{R}\|_{\text{F}}^2,
\] the gradient (partial derivatives) must vanish \[\begin{eqnarray*}
\frac{\partial f(\mathbf{C}, \mathbf{R})}{\partial \mathbf{C}} &=& - 2(\mathbf{A} - \mathbf{C} \mathbf{R}) \mathbf{R}' = \mathbf{O}_{m \times k} \\
\frac{\partial f(\mathbf{C}, \mathbf{R})}{\partial \mathbf{R}} &=& - 2 \mathbf{C}' (\mathbf{A} - \mathbf{C} \mathbf{R}) = \mathbf{O}_{k \times n}.
\end{eqnarray*}\] The first equation shows \[
\mathbf{A} \mathbf{R}' = \mathbf{C} \mathbf{R} \mathbf{R}' = \mathbf{C}.
\] Then by the second equation \[
\mathbf{A}' \mathbf{A} \mathbf{R}' = \mathbf{A}' \mathbf{C} = \mathbf{R}' \mathbf{C}' \mathbf{C} = \mathbf{R}' \mathbf{D}.
\] This says the columns of \(\mathbf{R}'\) (rows of \(\mathbf{R}\)) must be eigenvectors of \(\mathbf{A}' \mathbf{A}\), or equivalently, right singular vectors \(\mathbf{v}_i\) of \(\mathbf{A}\). Similarly the columns of \(\mathbf{C}\) must be eigenvectors of \(\mathbf{A} \mathbf{A}'\), or equivalently, left singular vectors \(\mathbf{u}_i\) of \(\mathbf{A}\): \[
\mathbf{A} \mathbf{A}' \mathbf{C} = \mathbf{A} \mathbf{R}' \mathbf{D} = \mathbf{C} \mathbf{D}.
\] Which \(\mathbf{u}_i\) and \(\mathbf{v}_i\) shall we take to minimize \(f\)? Apparently we should choose those with largest singular values to achieve the minimum value \(\sigma_{k+1}^2 + \cdots + \sigma_r^2\).
Let’s calculate the partial derivatives of the objective function \(f(\mathbf{x}) = \|\mathbf{A} \mathbf{x}\|^2 / \|\mathbf{x}\|^2 = \mathbf{x}' \mathbf{S} \mathbf{x} / (\mathbf{x}' \mathbf{x})\), where \(\mathbf{S} = \mathbf{A}' \mathbf{A}\): \[
\frac{\partial f(\mathbf{x})}{\partial x_i} = \frac{\left(2\sum_j s_{ij} x_j\right) (\mathbf{x}' \mathbf{x}) - (\mathbf{x}' \mathbf{S} \mathbf{x}) (2x_i)}{(\mathbf{x}' \mathbf{x})^2} = 2 (\mathbf{x}' \mathbf{x})^{-1} \left(\sum_j s_{ij} x_j - f(\mathbf{x}) x_i \right).
\] Collecting partial derivatives into the gradient vector and setting it to zero \[
\nabla f(\mathbf{x}) = 2 (\mathbf{x}' \mathbf{x})^{-1} \left[ \mathbf{S} \mathbf{x} - f(\mathbf{x}) \cdot \mathbf{x} \right] = \mathbf{0}
\] yields \[
\mathbf{S} \mathbf{x} = f(\mathbf{x}) \cdot \mathbf{x}.
\] Thus the optimal \(\mathbf{x}\) must be an eigenvector of \(\mathbf{S} = \mathbf{A}' \mathbf{A}\) with corresponding eigenvalue \(f(\mathbf{x})\). Which one shall we choose? Apparently the top right singular vector \(\mathbf{x}^\star = \mathbf{v}_1\) gives the maximal value, which equals \(\lambda_1 = \sigma_1^2\).
In HW6, you show this result using an alternative proof (elementary linear algebra arguments).
Above we have shown \[
\max_{\mathbf{x} \ne \mathbf{0}} \frac{\|\mathbf{A} \mathbf{x}\|}{\|\mathbf{x}\|} = \sigma_1,
\] which is the spectral norm (or \(\ell_2\) norm) of a matrix \(\mathbf{A}\).
Random.seed!(216)
m, n, r = 5, 3, 2
A = randn(m, r) * randn(r, n)
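With this A, a brief check (a sketch) that the spectral norm equals \(\sigma_1\) and that \(\mathbf{v}_1\) attains the maximal Rayleigh quotient \(\sigma_1^2\):

Asvd = svd(A)
v1 = Asvd.V[:, 1]
opnorm(A) ≈ Asvd.S[1], norm(A * v1)^2 / norm(v1)^2 ≈ Asvd.S[1]^2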
Similarly, the second right singular vector maximizes the Rayleigh quotient subject to an orthogonality constraint \[\begin{eqnarray*}
\text{maximize} &\quad& f(\mathbf{x}) = \frac{\|\mathbf{A} \mathbf{x}\|^2}{\|\mathbf{x}\|^2} = \frac{\mathbf{x}' \mathbf{A}' \mathbf{A} \mathbf{x}}{\mathbf{x}' \mathbf{x}} = \frac{\mathbf{x}' \mathbf{S} \mathbf{x}}{\mathbf{x}' \mathbf{x}} \\
\text{subject to} &\quad& \mathbf{x} \perp \mathbf{v}_1.
\end{eqnarray*}\]
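Numerically, restricting to \(\mathbf{x} \perp \mathbf{v}_1\) amounts to working with \(\mathbf{A}(\mathbf{I} - \mathbf{v}_1 \mathbf{v}_1')\); a sketch of the check that the constrained maximum is \(\sigma_2^2\), attained at \(\mathbf{v}_2\):

using LinearAlgebra, Random
Random.seed!(216)
A = randn(5, 3)
F = svd(A)
v1, v2 = F.V[:, 1], F.V[:, 2]
opnorm(A * (I - v1 * v1')) ≈ F.S[2], norm(A * v2)^2 / norm(v2)^2 ≈ F.S[2]^2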
Submatrices have smaller singular values. Let \(\mathbf{B}\) be a submatrix of \(\mathbf{A} \in \mathbb{R}^{m \times n}\). Then \[
\|\mathbf{B}\|_2 \le \|\mathbf{A}\|_2
\] or equivalently \[
\sigma_1 (\mathbf{B}) \le \sigma_1 (\mathbf{A}).
\]
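A small numerical illustration (arbitrary submatrix of a random matrix):

using LinearAlgebra, Random
Random.seed!(216)
A = randn(5, 4)
B = A[1:3, 2:4]              # a submatrix formed by selected rows and columns
opnorm(B) ≤ opnorm(A)        # σ₁(B) ≤ σ₁(A)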
PCA is a dimension reduction technique that finds the most informative linear combinations of the \(p\) random variables \(\mathbf{X} \in \mathbb{R}^p\). Mathematically it finds the linear combinations of the \(p\) random variables that have the largest variances. If we know the population covariance of the \(p\) variables is \(\text{Cov}(\mathbf{X}) = \boldsymbol{\Sigma} \in \mathbb{R}^{p \times p}\), then \[
\text{Var}(\mathbf{a}' \mathbf{X}) = \mathbf{a}' \text{Cov}(\mathbf{X}) \mathbf{a} = \mathbf{a}' \boldsymbol{\Sigma} \mathbf{a}.
\] Apparently, the larger the magnitude of \(\mathbf{a}\), the larger the variance \(\text{Var}(\mathbf{a}' \mathbf{X})\). It makes sense to maximize the normalized version \[
\text{maximize} \quad \frac{\mathbf{a}' \boldsymbol{\Sigma} \mathbf{a}}{\mathbf{a}' \mathbf{a}}.
\]
Given a data matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\) with \(n\) observations of \(p\) variables, we substitute in the sample covariance matrix \[
\widehat{\boldsymbol{\Sigma}} = \frac{\tilde{\mathbf{X}}'\tilde{\mathbf{X}}}{n-1},
\] where \(\tilde{\mathbf{X}}\) is the column-centered data, and maximize the Rayleigh quotient \[
\text{maximize} \quad \frac{\mathbf{a}' \widehat{\boldsymbol{\Sigma}} \mathbf{a}}{\mathbf{a}' \mathbf{a}}.
\] From earlier discussion we know the optimal \(\mathbf{a}^\star\) maximizing this Rayleigh quotient is the top eigenvector of \(\widehat{\boldsymbol{\Sigma}}\), or equivalently, the top right singular vector \(\mathbf{v}_1\) of \(\tilde{\mathbf{X}}\), achieving the optimal value \(\lambda_1\).
Similarly, the \(k\)-th right singular vector \(\mathbf{v}_k\) maximizes \[
\frac{\mathbf{a}' \widehat{\boldsymbol{\Sigma}} \mathbf{a}}{\mathbf{a}' \mathbf{a}}
\] subject to the constraint \(\mathbf{a} \perp \mathbf{v}_i\) for \(i =1,\ldots,k-1\).
Columns of \[
\mathbf{V}_r = \begin{pmatrix} \mid & & \mid \\
\mathbf{v}_1 & \cdots & \mathbf{v}_r \\
\mid & & \mid
\end{pmatrix} \in \mathbb{R}^{p \times r}
\] are called the principal components.
Columns of \(\tilde{\mathbf{X}} \mathbf{V}_r \in \mathbb{R}^{n \times r}\) are called the principal scores.
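The plotting cells below refer to Y and X_labels, whose defining cell is not shown in this excerpt. A minimal sketch of the kind of setup they presumably come from, namely principal scores of the iris measurements obtained from the SVD of the column-centered data (the variable names match the cells below; the exact original code may differ):

using LinearAlgebra, RDatasets, Statistics
iris = dataset("datasets", "iris")
X = Matrix(iris[:, 1:4])             # 150 × 4 matrix of the four measurements
X_labels = iris[!, :Species]         # species labels for color coding
Xc = X .- mean(X, dims = 1)          # column-centered data (X̃ above)
F = svd(Xc)
Y = Xc * F.V[:, 1:3]                 # principal scores on the first three PCs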
# group results by testing set labels for color coding
setosa     = Y[X_labels .== "setosa"    , :]
versicolor = Y[X_labels .== "versicolor", :]
virginica  = Y[X_labels .== "virginica" , :];
# visualize first 2 principal components in 2D interactive plot
p = scatter(setosa[:, 1], setosa[:, 2], marker=:circle, linewidth=0)
scatter!(versicolor[:, 1], versicolor[:, 2], marker=:circle, linewidth=0)
scatter!(virginica[:, 1], virginica[:, 2], marker=:circle, linewidth=0)
plot!(p, xlabel="PC1", ylabel="PC2")
# visualize first 3 principal components in 3D interactive plot
p = scatter(setosa[:, 1], setosa[:, 2], setosa[:, 3], marker=:circle, linewidth=0)
scatter!(versicolor[:, 1], versicolor[:, 2], versicolor[:, 3], marker=:circle, linewidth=0)
scatter!(virginica[:, 1], virginica[:, 2], virginica[:, 3], marker=:circle, linewidth=0)
plot!(p, xlabel="PC1", ylabel="PC2", zlabel="PC3")