Course Introduction

Biostat 216

Author

Dr. Hua Zhou @ UCLA

Published

September 26, 2023

System information (for reproducibility):

versioninfo()

Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 2 on 8 virtual cores

Load packages:

using Pkg
Pkg.activate(pwd())
Pkg.instantiate()

  Activating project at `~/Documents/github.com/ucla-biostat-216/2023fall/slides/01-intro`

using GraphPlot, Graphs, ImageCore, ImageIO, ImageShow, 
    LinearAlgebra, MatrixDepot, MLDatasets, QuartzImageIO, 
    RDatasets, StatsModels, TextAnalysis

[ Info: verify download of index files...
[ Info: reading database
[ Info: adding metadata...
[ Info: adding svd data...
[ Info: writing database
[ Info: used remote sites are sparse.tamu.edu with MAT index and math.nist.gov with HTML index

1 Introduction

1.1 Subject of linear algebra

Vector $\mathbf{x} \in \mathbb{R}^{n}$: \[ \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}. \]
Matrix $\mathbf{X} = (x_{ij}) \in \mathbb{R}^{m \times n}$: \[ \mathbf{X} = \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{pmatrix}. \]
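
As a minimal sketch in Julia (the numeric values are made up for illustration), a vector in $\mathbb{R}^3$ and a matrix in $\mathbb{R}^{2 \times 3}$ can be created and indexed as follows; `X[i, j]` corresponds to the entry $x_{ij}$.

```julia
# a vector x in R^3 (illustrative values)
x = [1.0, 2.0, 3.0]

# a 2×3 matrix X; spaces separate columns, semicolons separate rows
X = [1.0 2.0 3.0; 4.0 5.0 6.0]

length(x)   # number of entries n
size(X)     # (m, n) = (number of rows, number of columns)
X[2, 3]     # entry in row 2, column 3
```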

1.2 Examples of vectors and matrices

1.2.1 Design matrix

In statistics, tabular data is often summarized by a predictor matrix (also called a covariate matrix, design matrix, or feature matrix), which is denoted by $\mathbf{X}$ by convention. Each row of the feature matrix is an observation, and each column is a covariate/measurement/feature.

The famous Fisher’s Iris data:

# the famous Fisher's Iris data
# <https://en.wikipedia.org/wiki/Iris_flower_data_set>
iris = dataset("datasets", "iris")

150×5 DataFrame

125 rows omitted

Row	SepalLength	SepalWidth	PetalLength	PetalWidth	Species
	Float64	Float64	Float64	Float64	Cat…
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5.0	3.6	1.4	0.2	setosa
6	5.4	3.9	1.7	0.4	setosa
7	4.6	3.4	1.4	0.3	setosa
8	5.0	3.4	1.5	0.2	setosa
9	4.4	2.9	1.4	0.2	setosa
10	4.9	3.1	1.5	0.1	setosa
11	5.4	3.7	1.5	0.2	setosa
12	4.8	3.4	1.6	0.2	setosa
13	4.8	3.0	1.4	0.1	setosa
⋮	⋮	⋮	⋮	⋮	⋮
139	6.0	3.0	4.8	1.8	virginica
140	6.9	3.1	5.4	2.1	virginica
141	6.7	3.1	5.6	2.4	virginica
142	6.9	3.1	5.1	2.3	virginica
143	5.8	2.7	5.1	1.9	virginica
144	6.8	3.2	5.9	2.3	virginica
145	6.7	3.3	5.7	2.5	virginica
146	6.7	3.0	5.2	2.3	virginica
147	6.3	2.5	5.0	1.9	virginica
148	6.5	3.0	5.2	2.0	virginica
149	6.2	3.4	5.4	2.3	virginica
150	5.9	3.0	5.1	1.8	virginica

We can turn a tabular data set into a feature matrix according to a model formula:

# use full dummy coding (one-hot coding) for categorical variable Species
iris_X = ModelMatrix(ModelFrame(
    @formula(1 ~ 1 + SepalLength + SepalWidth + PetalLength + PetalWidth + Species), 
    iris,
    contrasts = Dict(:Species => StatsModels.FullDummyCoding()))).m

150×8 Matrix{Float64}:
 1.0  5.1  3.5  1.4  0.2  1.0  0.0  0.0
 1.0  4.9  3.0  1.4  0.2  1.0  0.0  0.0
 1.0  4.7  3.2  1.3  0.2  1.0  0.0  0.0
 1.0  4.6  3.1  1.5  0.2  1.0  0.0  0.0
 1.0  5.0  3.6  1.4  0.2  1.0  0.0  0.0
 1.0  5.4  3.9  1.7  0.4  1.0  0.0  0.0
 1.0  4.6  3.4  1.4  0.3  1.0  0.0  0.0
 1.0  5.0  3.4  1.5  0.2  1.0  0.0  0.0
 1.0  4.4  2.9  1.4  0.2  1.0  0.0  0.0
 1.0  4.9  3.1  1.5  0.1  1.0  0.0  0.0
 1.0  5.4  3.7  1.5  0.2  1.0  0.0  0.0
 1.0  4.8  3.4  1.6  0.2  1.0  0.0  0.0
 1.0  4.8  3.0  1.4  0.1  1.0  0.0  0.0
 ⋮                        ⋮         
 1.0  6.0  3.0  4.8  1.8  0.0  0.0  1.0
 1.0  6.9  3.1  5.4  2.1  0.0  0.0  1.0
 1.0  6.7  3.1  5.6  2.4  0.0  0.0  1.0
 1.0  6.9  3.1  5.1  2.3  0.0  0.0  1.0
 1.0  5.8  2.7  5.1  1.9  0.0  0.0  1.0
 1.0  6.8  3.2  5.9  2.3  0.0  0.0  1.0
 1.0  6.7  3.3  5.7  2.5  0.0  0.0  1.0
 1.0  6.7  3.0  5.2  2.3  0.0  0.0  1.0
 1.0  6.3  2.5  5.0  1.9  0.0  0.0  1.0
 1.0  6.5  3.0  5.2  2.0  0.0  0.0  1.0
 1.0  6.2  3.4  5.4  2.3  0.0  0.0  1.0
 1.0  5.9  3.0  5.1  1.8  0.0  0.0  1.0
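
To see what full dummy coding does, the sketch below rebuilds a small design matrix by hand using only Base Julia and `LinearAlgebra`. The four observations are taken from rows 1, 2, 139, and 150 of the Iris table, so only two species appear in this subset; the names `species_levels`, `onehot`, and `X_small` are illustrative and not part of StatsModels. With an intercept plus one indicator column per level, the indicator columns add up to the intercept column, so the columns are linearly dependent.

```julia
using LinearAlgebra

# four observations taken from the Iris table (rows 1, 2, 139, 150)
sepal_length = [5.1, 4.9, 6.0, 5.9]
species = ["setosa", "setosa", "virginica", "virginica"]

# one indicator (dummy) column per species level present in this subset
species_levels = unique(species)
onehot = [Float64(s == l) for s in species, l in species_levels]

# design matrix: intercept, numeric covariate, indicator columns
X_small = hcat(ones(length(species)), sepal_length, onehot)

# the indicator columns sum to the intercept column,
# so X_small does not have full column rank
rank(X_small), size(X_small, 2)
```

Dropping either the intercept or one indicator column would remove this redundancy.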

1.2.2 Grayscale images

Neural networks can classify handwritten digits with high accuracy. Each handwritten digit is represented by a grayscale image. The famous MNIST data set contains 60,000 training images and 10,000 test images. Each image is a $28 \times 28$ matrix:

# first training sample: image, digit label
# MNIST.traindata(1)
MNIST(split=:train)[1]

(features = Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], targets = 5)

# first training digit (the `features` field of the first sample)
X = MNIST(split=:train)[1].features

28×28 Matrix{Float32}:
 0.0  0.0  0.0  0.0  0.0  0.0        …  0.0       0.0        0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0           0.0       0.0        0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0           0.0       0.0        0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0           0.0       0.0        0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0           0.215686  0.533333   0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0        …  0.67451   0.992157   0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0           0.886275  0.992157   0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0           0.992157  0.992157   0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0           0.992157  0.831373   0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0           0.992157  0.529412   0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0        …  0.992157  0.517647   0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0           0.956863  0.0627451  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0117647     0.521569  0.0        0.0  0.0  0.0
 ⋮                        ⋮          ⋱                       ⋮         
 0.0  0.0  0.0  0.0  0.0  0.494118      0.0       0.0        0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.533333      0.0       0.0        0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.686275      0.0       0.0        0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.101961      0.0       0.0        0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.65098    …  0.0       0.0        0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0           0.0       0.0        0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.968627      0.0       0.0        0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.498039      0.0       0.0        0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0           0.0       0.0        0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0        …  0.0       0.0        0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0           0.0       0.0        0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0           0.0       0.0        0.0  0.0  0.0

# apparently it's digit 5
convert2image(MNIST, X)
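
Because a grayscale image is just a matrix of intensities, ordinary array operations apply to it. The sketch below uses a random $28 \times 28$ array as a stand-in for an MNIST digit (so it runs without downloading the data; the name `img` is illustrative) and shows how such an image can be flattened into a vector of length $28 \cdot 28 = 784$ and reshaped back.

```julia
# stand-in for a 28×28 grayscale image with intensities in [0, 1)
img = rand(Float32, 28, 28)

# stack the columns into a single vector of length 784
v = vec(img)

# reshape recovers the original matrix exactly
img_back = reshape(v, 28, 28)
img_back == img
```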

1.2.3 Color images

CIFAR-10 is a collection of 50,000 training images and 10,000 test images, each belonging to 1 of 10 mutually exclusive classes (frog, truck, …). Each color image is represented by three channels: R (red), G (green), B (blue). Each channel is a $32 \times 32$ intensity matrix.

# 2nd training image in CIFAR10
X = CIFAR10(split=:train)[2].features

32×32×3 Array{Float32, 3}:
[:, :, 1] =
 0.603922  0.54902   0.54902   0.533333  …  0.686275   0.647059   0.639216
 0.494118  0.568627  0.545098  0.537255     0.611765   0.611765   0.619608
 0.411765  0.490196  0.45098   0.478431     0.603922   0.623529   0.639216
 0.4       0.486275  0.576471  0.517647     0.576471   0.513726   0.568627
 0.490196  0.588235  0.541176  0.592157     0.607843   0.368627   0.168627
 0.607843  0.596078  0.517647  0.709804  …  0.631373   0.4        0.0745098
 0.67451   0.682353  0.666667  0.796078     0.627451   0.423529   0.0784314
 0.705882  0.698039  0.698039  0.815686     0.654902   0.501961   0.290196
 0.556863  0.52549   0.670588  0.815686     0.647059   0.603922   0.52549
 0.435294  0.431373  0.752941  0.796078     0.596078   0.611765   0.466667
 0.415686  0.521569  0.858824  0.701961  …  0.639216   0.713726   0.431373
 0.427451  0.639216  0.917647  0.662745     0.643137   0.701961   0.388235
 0.482353  0.752941  0.898039  0.643137     0.521569   0.490196   0.239216
 ⋮                                       ⋱             ⋮          
 0.454902  0.556863  0.694118  0.717647  …  0.243137   0.137255   0.0235294
 0.4       0.376471  0.396078  0.47451      0.2        0.0823529  0.0392157
 0.372549  0.388235  0.396078  0.356863     0.172549   0.054902   0.0980392
 0.352941  0.372549  0.345098  0.368627     0.152941   0.0431373  0.2
 0.282353  0.34902   0.403922  0.356863     0.168627   0.054902   0.266667
 0.235294  0.313726  0.368627  0.301961  …  0.4        0.231373   0.352941
 0.219608  0.254902  0.254902  0.270588     0.215686   0.192157   0.454902
 0.301961  0.329412  0.32549   0.380392     0.12549    0.211765   0.52549
 0.368627  0.360784  0.352941  0.345098     0.0901961  0.317647   0.54902
 0.356863  0.376471  0.309804  0.298039     0.164706   0.403922   0.560784
 0.341176  0.301961  0.266667  0.25098   …  0.239216   0.482353   0.560784
 0.309804  0.278431  0.262745  0.278431     0.364706   0.513726   0.560784

[:, :, 2] =
 0.694118  0.627451  0.607843  0.576471  …  0.654902  0.603922   0.580392
 0.537255  0.6       0.572549  0.556863     0.603922  0.596078   0.580392
 0.407843  0.490196  0.45098   0.47451      0.627451  0.631373   0.611765
 0.396078  0.505882  0.6       0.521569     0.6       0.509804   0.529412
 0.513726  0.631373  0.588235  0.615686     0.6       0.345098   0.12549
 0.65098   0.643137  0.568627  0.756863  …  0.603922  0.360784   0.0352941
 0.745098  0.737255  0.721569  0.870588     0.639216  0.419608   0.054902
 0.780392  0.741176  0.741176  0.890196     0.678431  0.505882   0.266667
 0.611765  0.545098  0.690196  0.87451      0.647059  0.584314   0.490196
 0.470588  0.435294  0.764706  0.858824     0.596078  0.592157   0.431373
 0.419608  0.498039  0.854902  0.760784  …  0.635294  0.694118   0.396078
 0.407843  0.611765  0.913725  0.721569     0.639216  0.686275   0.364706
 0.47451   0.752941  0.929412  0.729412     0.541176  0.505882   0.247059
 ⋮                                       ⋱            ⋮          
 0.458824  0.560784  0.694118  0.717647  …  0.25098   0.145098   0.0235294
 0.396078  0.380392  0.4       0.482353     0.207843  0.0862745  0.0352941
 0.372549  0.396078  0.403922  0.368627     0.168627  0.0470588  0.0862745
 0.34902   0.376471  0.34902   0.380392     0.141176  0.0235294  0.176471
 0.27451   0.34902   0.403922  0.368627     0.172549  0.0509804  0.25098
 0.235294  0.317647  0.372549  0.313726  …  0.423529  0.25098    0.352941
 0.223529  0.262745  0.262745  0.282353     0.219608  0.192157   0.443137
 0.305882  0.337255  0.337255  0.392157     0.101961  0.188235   0.498039
 0.376471  0.372549  0.364706  0.356863     0.054902  0.282353   0.509804
 0.372549  0.388235  0.321569  0.305882     0.133333  0.364706   0.521569
 0.352941  0.313726  0.27451   0.258824  …  0.207843  0.447059   0.52549
 0.317647  0.286275  0.270588  0.286275     0.32549   0.47451    0.521569

[:, :, 3] =
 0.733333  0.662745  0.643137  0.607843  …  0.65098   0.501961   0.470588
 0.533333  0.603922  0.584314  0.572549     0.627451  0.509804   0.478431
 0.372549  0.462745  0.439216  0.47451      0.666667  0.556863   0.521569
 0.388235  0.517647  0.623529  0.545098     0.639216  0.498039   0.490196
 0.545098  0.678431  0.635294  0.654902     0.647059  0.372549   0.12549
 0.705882  0.686275  0.603922  0.776471  …  0.670588  0.407843   0.0470588
 0.823529  0.784314  0.745098  0.878431     0.705882  0.470588   0.0745098
 0.839216  0.768627  0.752941  0.901961     0.729412  0.537255   0.27451
 0.611765  0.537255  0.686275  0.882353     0.682353  0.596078   0.478431
 0.431373  0.4       0.741176  0.85098      0.627451  0.603922   0.419608
 0.384314  0.470588  0.85098   0.776471  …  0.666667  0.701961   0.388235
 0.4       0.611765  0.933333  0.768627     0.67451   0.701961   0.356863
 0.458824  0.733333  0.921569  0.745098     0.596078  0.529412   0.243137
 ⋮                                       ⋱            ⋮          
 0.403922  0.533333  0.698039  0.752941  …  0.301961  0.172549   0.0431373
 0.32549   0.333333  0.380392  0.486275     0.270588  0.117647   0.0470588
 0.298039  0.329412  0.360784  0.341176     0.223529  0.0705882  0.0862745
 0.309804  0.341176  0.321569  0.360784     0.184314  0.0352941  0.164706
 0.270588  0.337255  0.388235  0.345098     0.223529  0.0784314  0.262745
 0.239216  0.301961  0.337255  0.258824  …  0.478431  0.301961   0.396078
 0.211765  0.235294  0.211765  0.215686     0.25098   0.227451   0.478431
 0.282353  0.298039  0.27451   0.313726     0.113725  0.203922   0.521569
 0.329412  0.313726  0.294118  0.27451      0.054902  0.286275   0.533333
 0.278431  0.305882  0.25098   0.25098      0.141176  0.376471   0.545098
 0.278431  0.243137  0.215686  0.207843  …  0.223529  0.470588   0.556863
 0.27451   0.239216  0.215686  0.231373     0.356863  0.513726   0.564706

# is this a truck?
convert2image(CIFAR10, X)
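
The three channels can be pulled out of a $32 \times 32 \times 3$ array by indexing its third dimension. The sketch below uses a random array as a stand-in for a CIFAR-10 image; the names are illustrative, and the unweighted channel average is just one simple, assumed way to reduce a color image to a single intensity matrix.

```julia
# stand-in for a 32×32×3 color image with intensities in [0, 1)
rgb = rand(Float32, 32, 32, 3)

# each channel is a 32×32 intensity matrix
red   = rgb[:, :, 1]
green = rgb[:, :, 2]
blue  = rgb[:, :, 3]

# unweighted average of the three channels, another 32×32 matrix
gray = (red .+ green .+ blue) ./ 3
size(gray)
```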

1.2.4 Text data

Text data (webpages, blogs, Twitter posts) can be transformed to numeric matrices for statistical analysis as well. For example, the 29 State of the Union Addresses by U.S. presidents, from George H. W. Bush in 1989 to Donald Trump in 2017, can be represented by a $29 \times 9610$ document term matrix, where each row stands for one speech and each column is a word that appears in at least one of these speeches. An entry $x_{ij}$ of the matrix counts the number of occurrences of word $j$ in speech $i$.
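
The counting rule for $x_{ij}$ can be sketched by hand on a toy corpus using only Base Julia. The three short documents are made up, and lowercasing followed by splitting on whitespace is a simplified stand-in for real text preprocessing.

```julia
# toy corpus (made-up documents)
docs = ["the cat sat", "the dog sat on the mat", "cat and dog"]

# lowercase and split each document into words
tokens = [split(lowercase(d)) for d in docs]

# sorted vocabulary: every word that appears in any document
terms = sort(unique(vcat(tokens...)))

# D_toy[i, j] = number of occurrences of term j in document i
D_toy = [count(==(t), toks) for toks in tokens, t in terms]
```

Each row sum of `D_toy` equals the number of words in the corresponding document.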

# path to the State of the Union texts shipped with the TextAnalysis package
sotupath = joinpath(dirname(pathof(TextAnalysis)), "..", "test", "data", "sotu")
readdir(sotupath)

29-element Vector{String}:
 "Bush_1989.txt"
 "Bush_1990.txt"
 "Bush_1991.txt"
 "Bush_1992.txt"
 "Bush_2001.txt"
 "Bush_2002.txt"
 "Bush_2003.txt"
 "Bush_2004.txt"
 "Bush_2005.txt"
 "Bush_2006.txt"
 "Bush_2007.txt"
 "Bush_2008.txt"
 "Clinton_1993.txt"
 ⋮
 "Clinton_1998.txt"
 "Clinton_1999.txt"
 "Clinton_2000.txt"
 "Obama_2009.txt"
 "Obama_2010.txt"
 "Obama_2011.txt"
 "Obama_2012.txt"
 "Obama_2013.txt"
 "Obama_2014.txt"
 "Obama_2015.txt"
 "Obama_2016.txt"
 "Trump_2017.txt"

crps = DirectoryCorpus(sotupath)
# Donald Trump 2017 SOTU address
text(crps[29])

"Thank you very much. Mr. Speaker, Mr. Vice President, Members of Congress, the First Lady of the United States, and citizens of America: Tonight, as we mark the conclusion of our celebration of Black History Month, we are reminded of our Nation's path towards civil righ" ⋯ 28704 bytes ⋯ "n me in dreaming big and bold, and daring things for our country. I am asking everyone watching tonight to seize this moment. Believe in yourselves, believe in your future, and believe, once more, in America.\n\nThank you, God bless you, and God bless the United States.\n"

standardize!(crps, StringDocument)
remove_case!(crps)
prepare!(crps, strip_punctuation)
update_lexicon!(crps)
update_inverse_index!(crps)
m = DocumentTermMatrix(crps)
D = dtm(m, :dense)

29×9610 Matrix{Int64}:
 3  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  1  0   0   0
 3  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  1  0   0   0
 1  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  1  1  0   0   0
 0  1  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  1  0  0   0   0
 2  8  0  1  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0   0   0
 0  0  0  0  0  0  0  0  0  0  0  0  5  …  0  1  0  0  0  0  0  0  0   0   0
 0  2  0  2  0  0  0  0  0  0  0  1  3     0  0  0  0  0  0  0  0  0   0   0
 0  2  1  0  0  0  0  0  0  0  0  0  5     0  0  0  0  0  0  0  0  0   0   0
 0  0  0  0  0  0  0  0  0  0  0  0  1     1  0  0  0  0  0  0  0  0   0   0
 0  1  0  0  0  0  0  0  0  0  0  0  2     1  0  1  0  1  0  0  0  0  67  31
 1  1  1  0  0  0  0  0  0  0  0  0  2  …  1  0  0  0  0  0  0  0  0   0   0
 1  0  1  0  0  1  0  0  0  0  0  0  0     0  0  0  0  1  0  0  0  0   0   0
 2  6  1  0  0  2  0  0  0  0  0  0  0     0  0  0  0  0  0  0  1  0   0   0
 ⋮              ⋮              ⋮        ⋱     ⋮              ⋮            
 0  3  1  3  0  2  0  0  0  0  1  0  1     0  1  0  1  0  0  0  1  0   0   0
 3  3  0  1  1  4  0  0  0  0  0  0  1     0  0  0  0  0  0  0  1  0   0   0
 1  7  0  2  1  3  0  0  0  0  0  0  0     0  0  0  0  0  0  0  1  0   0   0
 1  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0   0   0
 3  3  1  0  2  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  45   1
 1  0  1  1  1  2  0  0  0  0  0  0  1     0  0  0  0  0  0  0  0  0  47   1
 1  0  1  0  1  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0   3   0
 2  0  1  0  0  0  0  1  0  0  0  0  0     0  0  0  0  0  0  0  0  0  41   1
 2  0  1  0  0  0  2  0  0  0  0  0  0  …  0  1  0  0  0  0  0  0  0  62   7
 0  0  0  0  0  0  0  0  0  0  0  0  1     0  1  0  0  0  0  0  0  0   0   0
 0  2  0  0  1  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0   0   0
 2  0  2  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0   0   0

m.terms

9610-element Vector{String}:
 "1"
 "10"
 "100"
 "1000"
 "10000"
 "100000"
 "1010"
 "102"
 "103"
 "104"
 "105"
 "108"
 "11"
 ⋮
 "zarfos"
 "zarqawi"
 "zero"
 "zeroemission"
 "zeros"
 "zimbabwe"
 "zion"
 "zone"
 "zones"
 "ј"
 "–"
 "…"

1.2.5 Networks

The world wide web (WWW) with $n$ webpages can be described by a connectivity matrix or adjacency matrix $\mathbf{A} \in \{0,1\}^{n \times n}$ with entries \[ a_{ij} = \begin{cases} 1 & \text{if page $i$ links to page $j$} \\ 0 & \text{otherwise} \end{cases}. \] According to Internet Live Stats, $n \approx 1.98$ billion now. The smaller SNAP/web-Google data set contains a web of 916,428 pages.

mdinfo("SNAP/web-Google") |> show

# SNAP/web-Google

###### MatrixMarket matrix coordinate pattern general

---

  * UF Sparse Matrix Collection, Tim Davis
  * http://www.cise.ufl.edu/research/sparse/matrices/SNAP/web-Google
  * name: SNAP/web-Google
  * [Web graph from Google]
  * id: 2301
  * date: 2002
  * author: Google
  * ed: J. Leskovec
  * fields: name title A id date author ed kind notes
  * kind: directed graph

---

  * notes:
  * Networks from SNAP (Stanford Network Analysis Platform) Network Data Sets,
  * Jure Leskovec http://snap.stanford.edu/data/index.html
  * email jure at cs.stanford.edu
  * 
  * Google web graph
  * 
  * Dataset information
  * 
  * Nodes represent web pages and directed edges represent hyperlinks between them.
  * The data was released in 2002 by Google as a part of Google Programming
  * Contest.
  * 
  * Dataset statistics
  * Nodes   875713
  * Edges   5105039
  * Nodes in largest WCC    855802 (0.977)
  * Edges in largest WCC    5066842 (0.993)
  * Nodes in largest SCC    434818 (0.497)
  * Edges in largest SCC    3419124 (0.670)
  * Average clustering coefficient  0.6047
  * Number of triangles     13391903
  * Fraction of closed triangles    0.05523
  * Diameter (longest shortest path)    22
  * 90-percentile effective diameter    8.1
  * 
  * Source (citation)
  * 
  * J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. Community Structure in Large
  * Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters.
  * arXiv.org:0810.1355, 2008.
  * 
  * Google programming contest, 2002
  * http://www.google.com/programming-contest/
  * 
  * Files
  * File    Description
  * web-Google.txt.gz   Webgraph from the Google programming contest, 2002

---

916428 916428 5105039

md = mdopen("SNAP/web-Google")
md.A

916428×916428 SparseArrays.SparseMatrixCSC{Bool, Int64} with 5105039 stored entries:
⎡⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎤
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎣⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎦
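
Only 5,105,039 of the $916{,}428^2$ entries of this matrix are stored, which is why a sparse format is used. The sketch below relies on the `SparseArrays` standard library and a made-up web of 4 pages (the names `src`, `dst`, and `A_toy` are illustrative) to show how an adjacency matrix could be assembled from a list of links.

```julia
using SparseArrays

# made-up web with 4 pages; link k goes from page src[k] to page dst[k]
src = [1, 1, 2, 4]
dst = [2, 3, 3, 1]
n = 4

# sparse adjacency matrix: only the entries equal to 1 are stored
A_toy = sparse(src, dst, fill(true, length(src)), n, n)

nnz(A_toy)          # number of stored entries (links)
nnz(A_toy) / n^2    # fraction of all entries that are stored
```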

Here is a visualization of the SNAP/web-Google network.

Such a directed graph can also be represented by an incidence matrix $\mathbf{B} \in \{-1,0,1\}^{m \times n}$, where $m$ is the number of vertices and $n$ is the number of edges. The entries of an incidence matrix are \[ b_{ij} = \begin{cases} -1 & \text{if edge $j$ starts at vertex $i$} \\ 1 & \text{if edge $j$ ends at vertex $i$} \\ 0 & \text{otherwise} \end{cases}. \]

Here is a directed graph with 4 nodes and 5 edges.

# a simple directed graph on GS p16
g = SimpleDiGraph(4)
add_edge!(g, 1, 2)
add_edge!(g, 1, 3)
add_edge!(g, 2, 3)
add_edge!(g, 2, 4)
add_edge!(g, 4, 3)
gplot(g, nodelabel=["x1", "x2", "x3", "x4"], edgelabel=["b1", "b2", "b3", "b4", "b5"])

# adjacency matrix A
convert(Matrix{Int64}, adjacency_matrix(g))

4×4 Matrix{Int64}:
 0  1  1  0
 0  0  1  1
 0  0  0  0
 0  0  1  0

# incidence matrix B
convert(Matrix{Int64}, incidence_matrix(g))

4×5 Matrix{Int64}:
 -1  -1   0   0   0
  1   0  -1  -1   0
  0   1   1   0   1
  0   0   0   1  -1
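
Both matrices can also be assembled directly from the edge list, which makes the definitions of $a_{ij}$ and $b_{ij}$ concrete. This sketch uses only Base Julia; the variable names are illustrative, and the edges are listed in the same order as the `add_edge!` calls.

```julia
# edges (start vertex, end vertex) of the 4-node directed graph
edge_list = [(1, 2), (1, 3), (2, 3), (2, 4), (4, 3)]
nv, ne = 4, length(edge_list)

A = zeros(Int, nv, nv)   # adjacency matrix
B = zeros(Int, nv, ne)   # incidence matrix
for (j, (s, t)) in enumerate(edge_list)
    A[s, t] = 1      # vertex s links to vertex t
    B[s, j] = -1     # edge j starts at vertex s
    B[t, j] = 1      # edge j ends at vertex t
end

# every column of B holds one -1 and one +1, so each column sums to zero
sum(B, dims = 1)
```

The row sums of `A` give the number of outgoing edges of each vertex, and the column sums give the number of incoming edges.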

1.3 A view of statistics (or data science)

XKCD #1838