Zero vector: \[
\mathbf{0}_n = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}.
\] The subscript denotes the length of the vector; it is sometimes omitted when obvious from context.
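In Julia, a zero vector of a given length can be created with the zeros function; for illustration (the length 5 below is arbitrary):
# the zero vector 0₅ (Float64 entries by default)
zeros(5)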
Scalar-vector multiplication. For a scalar \(\alpha \in \mathbb{R}\) and a vector \(\mathbf{x} \in \mathbb{R}^n\), \[
\alpha \mathbf{x} = \begin{pmatrix}
\alpha x_1 \\ \alpha x_2 \\ \vdots \\ \alpha x_n
\end{pmatrix}.
\]
α = 0.5
x = [1, 2, 3, 4, 5]
α * x
5-element Vector{Float64}:
0.5
1.0
1.5
2.0
2.5
Elementwise multiplication or Hadamard product. For two vectors \(\mathbf{x}, \mathbf{y} \in \mathbb{R}^n\) (of the same length), \[
\mathbf{x} \circ \mathbf{y} = \begin{pmatrix}
x_1 y_1 \\ \vdots \\ x_n y_n
\end{pmatrix}.
\]
# in Julia, the dot operator performs elementwise operations
x = [1, 2, 3, 4, 5]
y = [6, 7, 8, 9, 10]
x .* y
5-element Vector{Int64}:
6
14
24
36
50
For scalars \(\alpha_1, \ldots, \alpha_k \in \mathbb{R}\) and vectors \(\mathbf{x}_1, \ldots, \mathbf{x}_k \in \mathbb{R}^n\), the linear combination\[
\sum_{i=1}^k \alpha_i \mathbf{x}_i = \alpha_1 \mathbf{x}_1 + \cdots + \alpha_k \mathbf{x}_k
\] is a sum of scalar-vector products.
x = [1, 2, 3, 4, 5]
y = [6, 7, 8, 9, 10]
1 * x + 0.5 * y
Examples of inner product. In-class exercise: express the following quantities using inner products of a vector \(\mathbf{x} \in \mathbb{R}^n\) with another vector.
Norm of a vector: \(\rule[-0.1cm]{1cm}{0.15mm}\) flops.
These vector operations are all order \(n\) algorithms.
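For example, the norm of \(\mathbf{x}\) can be written using the inner product of \(\mathbf{x}\) with itself, \(\|\mathbf{x}\| = \sqrt{\mathbf{x}'\mathbf{x}}\); a quick Julia check (norm comes from the LinearAlgebra standard library):
using LinearAlgebra

x = randn(5)
# ‖x‖ = √(x'x)
norm(x) ≈ sqrt(x'x)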
# info about my computer
versioninfo(verbose = true)
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin22.4.0)
uname: Darwin 23.0.0 Darwin Kernel Version 23.0.0: Fri Sep 15 14:43:05 PDT 2023; root:xnu-10002.1.13~1/RELEASE_ARM64_T6020 arm64 arm
CPU: Apple M2 Max:
speed user nice sys idle irq
#1-12 2400 MHz 1465885 s 0 s 773574 s 26205121 s 0 s
Memory: 96.0 GB (50415.953125 MB free)
Uptime: 369108.0 sec
Load Avg: 2.779296875 2.48828125 2.7109375
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
Threads: 2 on 8 virtual cores
Environment:
XPC_FLAGS = 0x0
PATH = /Applications/Julia-1.9.app/Contents/Resources/julia/bin:/Users/huazhou/.julia/conda/3/aarch64/bin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/Applications/quarto/bin
TERM = xterm-256color
HOME = /Users/huazhou
FONTCONFIG_PATH = /Users/huazhou/.julia/artifacts/e6b9fb44029423f5cd69e0cbbff25abcc4b32a8f/etc/fonts
Assume that one Apple M2 performance core can do 8 double-precision flops per CPU cycle (?) at 2.4 GHz (cycles/second). Then the theoretical throughput of a single performance core on my laptop is \[
8 \times (2.4 \times 10^9) = 19.2 \times 10^9 \text{ flops/second} = 19.2 \text{ GFLOPS}
\] in double precision. I estimate my computer takes about \[
\frac{10^7 - 1}{19.2 \times 10^9} \approx 0.00052 \text{ seconds} = 520 \text{ microseconds}
\] to sum a vector of length \(n = 10^7\) using a single performance core.
# the actual run time
n = 10^7
x = randn(n)
sum(x) # compile
@time sum(x);
0.001599 seconds (1 allocation: 16 bytes)
using BenchmarkTools

@benchmark sum($x)
BenchmarkTools.Trial: 3271 samples with 1 evaluation.
Range (min … max):  1.461 ms … 1.937 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     1.502 ms             ┊ GC (median):    0.00%
Time  (mean ± σ):   1.520 ms ± 59.934 μs ┊ GC (mean ± σ):  0.00% ± 0.00%
[histogram: frequency by time, 1.46 ms to 1.71 ms]
Memory estimate: 0 bytes, allocs estimate: 0.
1.6 Norm, distance, angle
The Euclidean norm or L2 norm of a vector \(\mathbf{x} \in \mathbb{R}^n\) is \[
\|\mathbf{x}\| = \|\mathbf{x}\|_2 = (\mathbf{x}'\mathbf{x})^{1/2} = \sqrt{x_1^2 + \cdots + x_n^2}.
\] The Euclidean/L2 norm captures the Euclidean length of the vector.
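For example, in Julia (norm is from the LinearAlgebra standard library; the 3-4-5 vector below is just a convenient example):
using LinearAlgebra

x = [3.0, 4.0]
# Euclidean/L2 norm: ‖x‖ = √(3² + 4²) = 5
@show norm(x)             # norm defaults to the 2-norm
@show sqrt(sum(abs2, x))  # same value, from the definition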
The L1 norm of a vector \(\mathbf{x} \in \mathbb{R}^n\) is \[
\|\mathbf{x}\|_1 = |x_1| + \cdots + |x_n|.
\] It is also known as the Manhattan or taxicab norm: the L1 norm is the distance you have to travel from the origin \(\mathbf{0}_n\) to the destination \(\mathbf{x} = (x_1, \ldots, x_n)'\), in a way that resembles how a taxicab drives between city blocks to arrive at its destination.
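A corresponding Julia check (norm(x, 1) is the built-in p-norm with p = 1):
using LinearAlgebra

x = [3.0, -4.0]
# L1 norm: |3| + |-4| = 7
@show norm(x, 1)   # built-in 1-norm
@show sum(abs, x)  # from the definition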
Positive definiteness: \(\|\mathbf{x}\| \ge 0\) for any vector \(\mathbf{x}\). \(\|\mathbf{x}\| = 0\) if and only if \(\mathbf{x}=\mathbf{0}\).
Homogeneity: \(\|\alpha \mathbf{x}\| = |\alpha| \|\mathbf{x}\|\) for any scalar \(\alpha\) and vector \(\mathbf{x}\).
Triangle inequality: \(\|\mathbf{x} + \mathbf{y}\| \le \|\mathbf{x}\| + \|\mathbf{y}\|\) for any \(\mathbf{x}, \mathbf{y} \in \mathbb{R}^n\).
Proof: use the Cauchy-Schwarz inequality. TODO in class.
Cauchy-Schwarz inequality: \(|\mathbf{x}' \mathbf{y}| \le \|\mathbf{x}\| \|\mathbf{y}\|\) for any \(\mathbf{x}, \mathbf{y} \in \mathbb{R}^n\). The equality holds when (1) \(\mathbf{x} = \mathbf{0}\) or \(\mathbf{y}=\mathbf{0}\) or (2) \(\mathbf{x} \ne \mathbf{0}\), \(\mathbf{y} \ne \mathbf{0}\), and \(\mathbf{x} = \alpha \mathbf{y}\) for some \(\alpha \ne 0\).
Proof: Assume \(\mathbf{y} \ne \mathbf{0}\) (the case \(\mathbf{y} = \mathbf{0}\) is trivial). The function \(f(t) = \|\mathbf{x} - t \mathbf{y}\|_2^2 = \|\mathbf{x}\|_2^2 - 2t (\mathbf{x}' \mathbf{y}) + t^2\|\mathbf{y}\|_2^2\) is minimized at \(t^\star = (\mathbf{x}'\mathbf{y}) / \|\mathbf{y}\|^2\) with minimal value \(0 \le f(t^\star) = \|\mathbf{x}\|^2 - (\mathbf{x}'\mathbf{y})^2 / \|\mathbf{y}\|^2\). Rearranging gives \((\mathbf{x}'\mathbf{y})^2 \le \|\mathbf{x}\|^2 \|\mathbf{y}\|^2\), and taking square roots yields the inequality.
There are at least 5 other proofs of CS inequality on Wikipedia.
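A numerical spot check of the Cauchy-Schwarz inequality on random vectors, in the same spirit as the triangle inequality check below:
using LinearAlgebra

x, y = randn(5), randn(5)
@show abs(x'y)
@show norm(x) * norm(y)
@show abs(x'y) ≤ norm(x) * norm(y)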
The first three properties (positive definiteness, homogeneity, and the triangle inequality) are the defining properties of any vector norm.
# check triangle inequality on random vectors
using LinearAlgebra

@show x = randn(5)
@show y = randn(5)
@show norm(x + y)
@show norm(x) + norm(y)
@show norm(x + y) ≤ norm(x) + norm(y)
The (Euclidean) distance between vectors \(\mathbf{x}\) and \(\mathbf{y}\) is defined as \(\|\mathbf{x} - \mathbf{y}\|\).
Properties of distances.
Nonnegativity. \(\|\mathbf{x} - \mathbf{y}\| \ge 0\) for all \(\mathbf{x}\) and \(\mathbf{y}\). And \(\|\mathbf{x} - \mathbf{y}\| = 0\) if and only if \(\mathbf{x} = \mathbf{y}\).
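For example, a quick distance computation in Julia (small vectors chosen so the answer is easy to verify by hand):
using LinearAlgebra

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]
# ‖x − y‖ = √((-3)² + (-4)² + 0²) = 5
norm(x - y)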
The average of a vector \(\mathbf{x}\) is \[
\operatorname{avg}(\mathbf{x}) = \bar{\mathbf{x}} = \frac{x_1 + \cdots + x_n}{n} = \frac{\mathbf{1}' \mathbf{x}}{n}.
\]
The root mean square (RMS) of a vector is \[
\operatorname{rms}(\mathbf{x}) = \sqrt{\frac{x_1^2 + \cdots + x_n^2}{n}} = \frac{\|\mathbf{x}\|}{\sqrt n}.
\]
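A short numerical check of the identity \(\operatorname{rms}(\mathbf{x}) = \|\mathbf{x}\| / \sqrt{n}\):
using LinearAlgebra

x = randn(10)
@show sqrt(sum(abs2, x) / length(x))  # rms from the definition
@show norm(x) / sqrt(length(x))       # ‖x‖ / √n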
The standard deviation of a vector \(\mathbf{x}\) is \[
\operatorname{std}(\mathbf{x}) = \sqrt{\frac{(x_1 - \bar{\mathbf{x}})^2 + \cdots + (x_n - \bar{\mathbf{x}})^2}{n}} = \operatorname{rms}(\mathbf{x} - \bar{\mathbf{x}} \mathbf{1}) = \frac{\|\mathbf{x} - (\mathbf{1}' \mathbf{x} / n) \mathbf{1}\|}{\sqrt n}.
\]
# mean and std are in the Statistics standard library
using Statistics

x = randn(5)
@show mean(x)^2 + std(x, corrected = false)^2
@show norm(x)^2 / length(x)
# floating point arithmetic is not exact
@show mean(x)^2 + std(x, corrected = false)^2 ≈ norm(x)^2 / length(x)
Angle between two nonzero vectors \(\mathbf{x}, \mathbf{y}\) is \[
\theta = \angle (\mathbf{x}, \mathbf{y}) = \operatorname{arccos} \left(\frac{\mathbf{x}'\mathbf{y}}{\|\mathbf{x}\|\|\mathbf{y}\|}\right).
\] This is the unique value of \(\theta \in [0, \pi]\) that satisfies \[
\cos \theta = \frac{\mathbf{x}'\mathbf{y}}{\|\mathbf{x}\|\|\mathbf{y}\|}.
\]
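For instance, the angle between \((1, 0)'\) and \((1, 1)'\) should be \(\pi/4\); a quick Julia check:
using LinearAlgebra

x = [1.0, 0.0]
y = [1.0, 1.0]
# θ = arccos(x'y / (‖x‖‖y‖)) = arccos(1/√2) = π/4
θ = acos(x'y / (norm(x) * norm(y)))
θ ≈ π / 4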
1.7 Linear independence
Example: Consider vectors in \(\mathbb{R}^3\): \[
\begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}.
\] The fourth vector \(\begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}\) is in some sense redundant because it can be expressed as a linear combination of the other 3 vectors \[
\begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} = 1 \cdot \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + 2 \cdot \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} + 3 \cdot \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}.
\] Similarly, each one of these 4 vectors can be expressed as a linear combination of the other 3. We say these four vectors are linearly dependent.
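A one-line Julia verification of this particular combination:
e1, e2, e3 = [1, 0, 0], [0, 1, 0], [0, 0, 1]
# (1, 2, 3)' = 1·e₁ + 2·e₂ + 3·e₃
[1, 2, 3] == 1 * e1 + 2 * e2 + 3 * e3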
A set of vectors \(\mathbf{a}_1, \ldots, \mathbf{a}_k \in \mathbb{R}^n\) are linearly dependent if there exist constants \(\alpha_1, \ldots, \alpha_k\), which are not all zeros, such that \[
\alpha_1 \mathbf{a}_1 + \cdots + \alpha_k \mathbf{a}_k = \mathbf{0}.
\] They are linearly independent if they are not linearly dependent. That is, if \(\alpha_1 \mathbf{a}_1 + \cdots + \alpha_k \mathbf{a}_k = \mathbf{0}\), then \(\alpha_1 = \cdots = \alpha_k = 0\) (this is usually how we show that a set of vectors is linearly independent).
Theorem: Unit vectors \(\mathbf{e}_1, \ldots, \mathbf{e}_n \in \mathbb{R}^n\) are linearly independent.
Proof: TODO in class.
Theorem: If \(\mathbf{x}\) is a linear combination of linearly independent vectors \(\mathbf{a}_1, \ldots, \mathbf{a}_k\), that is, \(\mathbf{x} = \alpha_1 \mathbf{a}_1 + \cdots + \alpha_k \mathbf{a}_k\), then the coefficients \(\alpha_1, \ldots, \alpha_k\) are unique.
Proof: TODO in class. Hint: proof by contradiction.
Independence-dimension inequality or order-dimension inequality. If the vectors \(\mathbf{a}_1, \ldots, \mathbf{a}_k \in \mathbb{R}^n\) are linearly independent, then \(k \le n\).
In words, there can be at most \(n\) linearly independent vectors in \(\mathbb{R}^n\); any set of \(n+1\) or more vectors in \(\mathbb{R}^n\) must be linearly dependent.
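A numerical illustration using rank from LinearAlgebra: any 4 vectors in \(\mathbb{R}^3\) span a subspace of dimension at most 3, so they must be linearly dependent.
using LinearAlgebra

A = randn(3, 4)  # 4 random vectors in ℝ³, stored as columns
rank(A)          # at most 3 < 4, so the columns are linearly dependent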
Proof (optional): We show this by induction on \(n\). Base case \(n = 1\): let \(a_1, \ldots, a_k \in \mathbb{R}^1\) be linearly independent. We must have \(a_1 \ne 0\). If \(k \ge 2\), every other element could be expressed as a multiple \(a_i = (a_i / a_1) a_1\) of the first element \(a_1\), contradicting linear independence. Thus \(k = 1 \le n\).
Induction hypothesis: suppose \(n \ge 2\) and the independence-dimension inequality holds in \(\mathbb{R}^{n-1}\). We partition the vectors \(\mathbf{a}_i \in \mathbb{R}^n\) as \[
\mathbf{a}_i = \begin{pmatrix} \mathbf{b}_i \\ \alpha_i \end{pmatrix}, \quad i = 1,\ldots,k,
\] where \(\mathbf{b}_i \in \mathbb{R}^{n-1}\) and \(\alpha_i \in \mathbb{R}\).
First suppose \(\alpha_1 = \cdots = \alpha_k = 0\). Then the vectors \(\mathbf{b}_1, \ldots, \mathbf{b}_k\) are linearly independent: \(\sum_{i=1}^k \beta_i \mathbf{b}_i = \mathbf{0}\) if and only if \(\sum_{i=1}^k \beta_i \mathbf{a}_i = \mathbf{0}\), which is only possible for \(\beta_1 = \cdots = \beta_k = 0\) because the vectors \(\mathbf{a}_i\) are linearly independent. The vectors \(\mathbf{b}_i\) therefore form a linearly independent collection of \((n-1)\)-vectors. By the induction hypothesis we have \(k \le n-1\) so \(k \le n\).
Next we assume the scalars \(\alpha_i\) are not all zero, say \(\alpha_j \ne 0\). We define a collection of \(k-1\) vectors \(\mathbf{c}_i\) of length \(n-1\) as follows: \[
\mathbf{c}_i = \mathbf{b}_i - \frac{\alpha_i}{\alpha_j} \mathbf{b}_j, \quad i = 1, \ldots, j-1, \qquad \mathbf{c}_i = \mathbf{b}_{i+1} - \frac{\alpha_{i+1}}{\alpha_j} \mathbf{b}_j, \quad i = j, \ldots, k-1.
\] These \(k-1\) vectors are linearly independent: if \(\sum_{i=1}^{k-1} \beta_i \mathbf{c}_i = \mathbf{0}\), then \[
\sum_{i=1}^{j-1} \beta_i \begin{pmatrix} \mathbf{b}_i \\ \alpha_i \end{pmatrix} + \gamma \begin{pmatrix} \mathbf{b}_j \\ \alpha_j \end{pmatrix} + \sum_{i=j+1}^k \beta_{i-1} \begin{pmatrix} \mathbf{b}_i \\ \alpha_i \end{pmatrix} = \mathbf{0}
\] with \(\gamma = - \alpha_j^{-1} \left( \sum_{i=1}^{j-1} \beta_i \alpha_i + \sum_{i=j+1}^k \beta_{i-1} \alpha_i \right)\). Since the vectors \(\mathbf{a}_i\) are linearly independent, the coefficients \(\beta_i\) and \(\gamma\) are all zero. This in turn implies that the vectors \(\mathbf{c}_1, \ldots, \mathbf{c}_{k-1}\) are linearly independent. By the induction hypothesis, \(k-1 \le n-1\), so we have established \(k \le n\).
1.8 Basis
A set of \(n\) linearly independent vectors \(\mathbf{a}_1, \ldots, \mathbf{a}_n \in \mathbb{R}^n\) is called a basis for \(\mathbb{R}^n\).
Fact: the zero vector \(\mathbf{0}_n\) cannot be a basis vector in \(\mathbb{R}^n\). Why?
Theorem: Any vector \(\mathbf{x} \in \mathbb{R}^n\) can be expressed as a linear combination of basis vectors \(\mathbf{x} = \alpha_1 \mathbf{a}_1 + \cdots + \alpha_n \mathbf{a}_n\) for some \(\alpha_1, \ldots, \alpha_n\), and these coefficients are unique. This is called expansion of \(\mathbf{x}\) in the basis \(\mathbf{a}_1, \ldots, \mathbf{a}_n\).
Proof of existence by contradiction (optional). Suppose \(\mathbf{x}\) can NOT be expressed as a linear combination of the basis vectors. Consider any linear combination with \(\alpha_1 \mathbf{a}_1 + \cdots + \alpha_n \mathbf{a}_n + \beta \mathbf{x} = \mathbf{0}\). Then \(\beta = 0\); otherwise \(\mathbf{x} = -\beta^{-1}(\alpha_1 \mathbf{a}_1 + \cdots + \alpha_n \mathbf{a}_n)\), contradicting our assumption. Also \(\alpha_1 = \cdots = \alpha_n = 0\) by linear independence of \(\mathbf{a}_1, \ldots, \mathbf{a}_n\). Therefore \(\alpha_1 = \cdots = \alpha_n = \beta = 0\), so \(\mathbf{a}_1, \ldots, \mathbf{a}_n, \mathbf{x}\) are linearly independent, contradicting the independence-dimension inequality.
Proof of uniqueness: TODO in class.
Example: Unit vectors \(\mathbf{e}_1, \ldots, \mathbf{e}_n\) form a basis for \(\mathbb{R}^n\). Expansion of a vector \(\mathbf{x} \in \mathbb{R}^n\) in this basis is \[
\mathbf{x} = x_1 \mathbf{e}_1 + \cdots + x_n \mathbf{e}_n.
\]
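As a numerical illustration of expansion in a general (not necessarily orthonormal) basis, the coefficients can be found by solving a linear system; the basis below is an arbitrary example:
a1, a2, a3 = [1.0, 0, 0], [1.0, 1, 0], [1.0, 1, 1]  # a basis of ℝ³
x = [2.0, 3.0, 4.0]
A = [a1 a2 a3]  # basis vectors as columns
α = A \ x       # expansion coefficients of x in this basis
@show α
@show α[1] * a1 + α[2] * a2 + α[3] * a3 ≈ x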
1.9 Orthonormal basis
A set of vectors \(\mathbf{a}_1, \ldots, \mathbf{a}_k\) are (mutually) orthogonal if \(\mathbf{a}_i \perp \mathbf{a}_j\) for any \(i \ne j\). They are normalized if \(\|\mathbf{a}_i\|=1\) for all \(i\). They are orthonormal if they are both orthogonal and normalized.
Orthonormality is often expressed compactly by \(\mathbf{a}_i'\mathbf{a}_j = \delta_{ij}\), where \[
\delta_{ij} = \begin{cases}
1 & \text{if } i = j \\
0 & \text{if } i \ne j
\end{cases}
\] is the Kronecker delta notation.
Theorem: An orthonormal set of vectors is linearly independent.
Orthonormal expansion. If \(\mathbf{a}_1, \ldots, \mathbf{a}_n \in \mathbb{R}^n\) is an orthonormal basis, then for any vector \(\mathbf{x} \in \mathbb{R}^n\), \[
\mathbf{x} = (\mathbf{a}_1'\mathbf{x}) \mathbf{a}_1 + \cdots + (\mathbf{a}_n'\mathbf{x}) \mathbf{a}_n.
\]
Proof: Take inner product with \(\mathbf{a}_i\) on both sides.
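A numerical check of the orthonormal expansion, using an orthonormal basis obtained from the QR factorization of a random matrix (just a convenient way to generate one):
using LinearAlgebra

n = 4
Q = Matrix(qr(randn(n, n)).Q)  # columns q₁, …, qₙ form an orthonormal basis of ℝⁿ
x = randn(n)
# x = (q₁'x) q₁ + ⋯ + (qₙ'x) qₙ
xhat = sum((Q[:, i]'x) * Q[:, i] for i in 1:n)
@show Q'Q ≈ I   # orthonormality
@show xhat ≈ x  # the expansion recovers x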
The Gram-Schmidt (G-S) algorithm takes vectors \(\mathbf{a}_1, \ldots, \mathbf{a}_k \in \mathbb{R}^n\) and, at iteration \(i\): (1) orthogonalizes \(\mathbf{a}_i\) against the previously computed vectors, \(\tilde{\mathbf{q}}_i = \mathbf{a}_i - (\mathbf{q}_1'\mathbf{a}_i)\mathbf{q}_1 - \cdots - (\mathbf{q}_{i-1}'\mathbf{a}_i)\mathbf{q}_{i-1}\); (2) stops early if \(\tilde{\mathbf{q}}_i = \mathbf{0}\); (3) otherwise normalizes \(\mathbf{q}_i = \tilde{\mathbf{q}}_i / \|\tilde{\mathbf{q}}_i\|\).
If G-S does not stop early (in step 2), \(\mathbf{a}_1, \ldots, \mathbf{a}_k\) are linearly independent.
If G-S stops early in iteration \(i=j\), then \(\mathbf{a}_j\) is a linear combination of \(\mathbf{a}_1, \ldots, \mathbf{a}_{j-1}\) and \(\mathbf{a}_1, \ldots, \mathbf{a}_{j-1}\) are linearly independent.
# for i = 3
# orthogonalization
@show q̃3 = a3 - (q1'a3) * q1 - (q2'a3) * q2
# test for linear independence
@show norm(q̃3) ≈ 0
# normalization
@show q3 = q̃3 / norm(q̃3);
Show by induction that \(\mathbf{q}_1, \ldots, \mathbf{q}_i\) are orthonormal (optional):
Assume it’s true for \(i-1\).
Orthogonalization step ensures that \(\tilde{\mathbf{q}}_i \perp \mathbf{q}_1, \ldots, \tilde{\mathbf{q}}_i \perp \mathbf{q}_{i-1}\). To show this, take inner product of both sides with \(\mathbf{q}_j\), \(j < i\)\[
\mathbf{q}_j' \tilde{\mathbf{q}}_i = \mathbf{q}_j' \mathbf{a}_i - (\mathbf{q}_1' \mathbf{a}_i) (\mathbf{q}_j' \mathbf{q}_1) - \cdots - (\mathbf{q}_{i-1}' \mathbf{a}_i) (\mathbf{q}_j' \mathbf{q}_{i-1}) = \mathbf{q}_j' \mathbf{a}_i - \mathbf{q}_j' \mathbf{a}_i = 0.
\]
So \(\mathbf{q}_1, \ldots, \mathbf{q}_i\) are orthogonal. The normalization step ensures \(\mathbf{q}_i\) has unit norm.
Suppose G-S has not terminated by iteration \(i\). Then
\(\mathbf{a}_i\) is a linear combination of \(\mathbf{q}_1, \ldots, \mathbf{q}_i\), and
\(\mathbf{q}_i\) is a linear combination of \(\mathbf{a}_1, \ldots, \mathbf{a}_i\).
Computational complexity of G-S algorithm:
Step 1 of iteration \(i\) requires (1) \(i-1\) inner products, \(\mathbf{q}_1' \mathbf{a}_i, \ldots, \mathbf{q}_{i-1}' \mathbf{a}_i\), which cost \((i-1)(2n-1)\) flops, and (2) \(2n(i-1)\) flops to compute \(\tilde{\mathbf{q}}_i\).
Step 2 of iteration \(i\) requires less than \(n\) flops.
Step 3 of iteration \(i\) requires about \(3n\) flops to normalize \(\tilde{\mathbf{q}}_i\).
Assuming no early termination, total computational cost of the GS algorithm to orthonormalize a set of \(k\) vectors in \(\mathbb{R}^n\) is: \[
\sum_{i=1}^k [(4n-1) (i - 1) + 3n] = (4n - 1) \frac{k(k-1)}{2} + 3nk \approx 2nk^2 = O(nk^2),
\] using \(\sum_{i=1}^k (i - 1) = k(k-1)/2\).
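To make the three steps concrete, here is a minimal Julia sketch of the G-S procedure described above (the tolerance tol used in the dependence test is an arbitrary choice):
using LinearAlgebra

# Gram-Schmidt on the columns a₁, …, a_k of A.
# Returns Q = [q₁ ⋯ q_k] with orthonormal columns, or nothing if the
# columns are linearly dependent (early termination in step 2).
function gram_schmidt(A::AbstractMatrix; tol = 1e-10)
    n, k = size(A)
    Q = zeros(n, k)
    for i in 1:k
        # step 1: orthogonalize aᵢ against q₁, …, qᵢ₋₁
        q̃ = A[:, i]
        for j in 1:i-1
            q̃ -= (Q[:, j]' * A[:, i]) * Q[:, j]
        end
        # step 2: test for linear dependence
        norm(q̃) ≤ tol && return nothing
        # step 3: normalize
        Q[:, i] = q̃ / norm(q̃)
    end
    return Q
end

A = randn(5, 3)
Q = gram_schmidt(A)
Q'Q ≈ I  # true: orthonormal columns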