using Pkg
Pkg.activate(pwd())
Pkg.instantiate()
using GraphPlot, Graphs, ImageCore, ImageIO, ImageMagick, ImageShow,
LinearAlgebra, MatrixDepot, MLDatasets, QuartzImageIO,
RDatasets, StatsModels, TextAnalysis
Vector $\mathbf{x} \in \mathbb{R}^{n}$: $$ \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}. $$
Matrix $\mathbf{X} = (x_{ij}) \in \mathbb{R}^{m \times n}$: $$ \mathbf{X} = \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{pmatrix}. $$
In statistics, tabular data is often summarized by a predictor matrix or covariate matrix or design matrix or feature matrix, which is denoted by $\mathbf{X}$ by convention. Each row of the feature matrix is an observation, and each column is a covariate/measurement/feature.
The famous Fisher's Iris data:
# the famous Fisher's Iris data
# <https://en.wikipedia.org/wiki/Iris_flower_data_set>
iris = dataset("datasets", "iris")
We can turn a tabular data set into a feature matrix according to a model formula:
# use full dummy coding (one-hot coding) for categorical variable Species
iris_X = ModelMatrix(ModelFrame(
@formula(1 ~ 1 + SepalLength + SepalWidth + PetalLength + PetalWidth + Species),
iris,
contrasts = Dict(:Species => StatsModels.FullDummyCoding()))).m
# first training sample: image, digit label
# MNIST.traindata(1)
MNIST(split=:train)[1]
# first training digit
X = MNIST(split=:train)[1][1]
# apparently it's digit 5
convert2image(MNIST, X)
# 2nd training image in CIFAR10
X = CIFAR10(split=:train)[2].features
# is this a truck?
convert2image(CIFAR10, X)
Text data (webpage, blog, twitter) can be transformed to numeric matrices for statistical analysis as well. For example, the 29 State of the Union Addresses by U.S. presidents, from George W Bush in 1989 to Donald Trump in 2017, can be represented by a $29 \times 9610$ document term matrix, where each row stands for one speech and each column is a word that ever appears in these speeches. An entry $x_{ij}$ of the matrix counts the number of occurrences of word $j$ in speech $i$.
sotupath = joinpath(dirname(pathof(TextAnalysis)), "..", "test/data/sotu")
Base.Filesystem.readdir(sotupath)
crps = DirectoryCorpus(sotupath)
# Donald Trump 2017 SOTU address
text(crps[29])
standardize!(crps, StringDocument)
remove_case!(crps)
prepare!(crps, strip_punctuation)
update_lexicon!(crps)
update_inverse_index!(crps)
m = DocumentTermMatrix(crps)
D = dtm(m, :dense)
m.terms
The world wide web (WWW) with $n$ webpages can be described by a connectivity matrix or adjacency matrix $\mathbf{A} \in \{0,1\}^{n \times n}$ with entry \begin{eqnarray*} a_{ij} = \begin{cases} 1 & \text{if page $i$ links to page $j$} \\ 0 & \text{otherwise} \end{cases}. \end{eqnarray*} According to Internet Live Stats, $n \approx 1.98$ billion now. The smaller SNP/web-Google data set contains a web of 916,428 pages.
mdinfo("SNAP/web-Google") |> show
md = mdopen("SNAP/web-Google")
md.A
Here is a visulization of the SNAP/web-Google network
Such a directed graph can also be represented by an indicence matrix $\mathbf{B} \in \{-1,0,1\}^{m \times n}$ where $m$ is the number of verticies and $n$ is the number of edges. The entries of an incidence matrix are \begin{eqnarray*} b_{ij} = \begin{cases} -1 & \text{if edge $j$ starts at vertex $i$} \\ 1 & \text{if edge $j$ ends at vertex $i$} \\ 0 & \text{otherwise} \end{cases}. \end{eqnarray*}
Here is a directed graph with 4 nodes and 5 edges.
# a simple directed graph on GS p16
g = SimpleDiGraph(4)
add_edge!(g, 1, 2)
add_edge!(g, 1, 3)
add_edge!(g, 2, 3)
add_edge!(g, 2, 4)
add_edge!(g, 4, 3)
gplot(g, nodelabel=["x1", "x2", "x3", "x4"], edgelabel=["b1", "b2", "b3", "b4", "b5"])
# adjacency matrix A
convert(Matrix{Int64}, adjacency_matrix(g))
# incidence matrix B
convert(Matrix{Int64}, incidence_matrix(g))