In statistics, tabular data is often summarized by a matrix known variously as the predictor matrix, covariate matrix, design matrix, or feature matrix, conventionally denoted by \(\mathbf{X}\). Each row of the feature matrix is an observation, and each column is a covariate (also called a measurement or feature).
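As a minimal sketch of this layout, the toy matrix below uses made-up numbers (an intercept column, a height, and a weight are hypothetical features chosen only for illustration). Rows index observations and columns index features.

```julia
# Hypothetical toy data: 4 observations (rows), 3 features (columns).
# Column 1 is an intercept, columns 2 and 3 are made-up measurements.
X = [1.0 170.0 65.0;
     1.0 182.0 80.0;
     1.0 165.0 54.0;
     1.0 175.0 72.0]

n, p = size(X)     # n observations, p features
obs2 = X[2, :]     # the second observation (one row)
feat2 = X[:, 2]    # the second feature across all observations (one column)
```

Indexing a row returns one observation; indexing a column returns one feature measured across all observations.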
Neural networks can classify handwritten digits with high accuracy. Each handwritten digit is represented by a grayscale image. The famous MNIST data set contains 60,000 training images and 10,000 test images. Each image is a \(28 \times 28\) matrix:
# first training sample: image, digit label
# MNIST.traindata(1)
MNIST(split=:train)[1]
CIFAR-10 is a collection of 50,000 training images and 10,000 test images, each belonging to 1 of 10 mutually exclusive classes (frog, truck, …). Each color image is represented by three channels: R (red), G (green), B (blue). Each channel is a \(32 \times 32\) intensity matrix.
# 2nd training image in CIFAR10
X = CIFAR10(split=:train)[2].features
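To make the channel structure concrete without downloading the data, here is a small sketch that uses a random array as a stand-in for one image. The \(32 \times 32 \times 3\) layout (width, height, channel) is an assumption about how the image is stored; check the dimensions of the actual array before relying on it.

```julia
# Stand-in for one CIFAR-10 image: a 32×32×3 array of intensities in [0, 1].
# Random values are used here; a real image would come from the data set.
img = rand(Float32, 32, 32, 3)

# Each channel is a 32×32 intensity matrix (assumed channel order: R, G, B).
R = img[:, :, 1]
G = img[:, :, 2]
B = img[:, :, 3]

# Flattening the image gives one feature vector of length 32 * 32 * 3,
# which could serve as a single row of a feature matrix.
x = vec(img)
```

Because Julia arrays are stored in column-major order, the first 1024 entries of `x` are exactly the entries of the first channel.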
Text data (webpages, blogs, Twitter posts) can be transformed into numeric matrices for statistical analysis as well. For example, the 29 State of the Union Addresses by U.S. presidents, from George H. W. Bush in 1989 to Donald Trump in 2017, can be represented by a \(29 \times 9610\) document-term matrix, where each row stands for one speech and each column for a word that appears in at least one of these speeches. An entry \(x_{ij}\) of the matrix counts the number of occurrences of word \(j\) in speech \(i\).
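The counting rule for \(x_{ij}\) can be sketched on a hypothetical two-document toy corpus using only base Julia. The tokenization here (lowercasing and splitting on whitespace) is a simplifying assumption; a text-analysis package would typically also handle punctuation, stemming, and stop words.

```julia
# Hypothetical toy corpus: each string plays the role of one speech.
docs = ["the state of the union is strong",
        "the union is strong and the economy is growing"]

# Tokenize by lowercasing and splitting on whitespace.
tokens = [String.(split(lowercase(d))) for d in docs]

# Vocabulary: sorted unique words across all documents (the columns).
vocab = sort(unique(vcat(tokens...)))
col = Dict(w => j for (j, w) in enumerate(vocab))

# X[i, j] counts the occurrences of word j in document i.
X = zeros(Int, length(docs), length(vocab))
for (i, toks) in enumerate(tokens), w in toks
    X[i, col[w]] += 1
end
```

Each row of `X` sums to the number of tokens in the corresponding document, which gives a quick sanity check on the counts.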
crps = DirectoryCorpus(sotupath)
# Donald Trump 2017 SOTU address
text(crps[29])
"Thank you very much. Mr. Speaker, Mr. Vice President, Members of Congress, the First Lady of the United States, and citizens of America: Tonight, as we mark the conclusion of our celebration of Black History Month, we are reminded of our Nation's path towards civil righ" ⋯ 28704 bytes ⋯ "n me in dreaming big and bold, and daring things for our country. I am asking everyone watching tonight to seize this moment. Believe in yourselves, believe in your future, and believe, once more, in America.\n\nThank you, God bless you, and God bless the United States.\n"
The World Wide Web (WWW) with \(n\) webpages can be described by a connectivity matrix or adjacency matrix \(\mathbf{A} \in \{0,1\}^{n \times n}\) with entries \[\begin{eqnarray*}
a_{ij} = \begin{cases}
1 & \text{if page $i$ links to page $j$} \\
0 & \text{otherwise}
\end{cases}.
\end{eqnarray*}\] According to Internet Live Stats, \(n \approx 1.98\) billion at the time of writing. The smaller SNAP/web-Google data set contains a web of 916,428 pages.
mdinfo("SNAP/web-Google") |> show
# SNAP/web-Google
###### MatrixMarket matrix coordinate pattern general
---
* UF Sparse Matrix Collection, Tim Davis
* http://www.cise.ufl.edu/research/sparse/matrices/SNAP/web-Google
* name: SNAP/web-Google
* [Web graph from Google]
* id: 2301
* date: 2002
* author: Google
* ed: J. Leskovec
* fields: name title A id date author ed kind notes
* kind: directed graph
---
* notes:
* Networks from SNAP (Stanford Network Analysis Platform) Network Data Sets,
* Jure Leskovec http://snap.stanford.edu/data/index.html
* email jure at cs.stanford.edu
*
* Google web graph
*
* Dataset information
*
* Nodes represent web pages and directed edges represent hyperlinks between them.
* The data was released in 2002 by Google as a part of Google Programming
* Contest.
*
* Dataset statistics
* Nodes 875713
* Edges 5105039
* Nodes in largest WCC 855802 (0.977)
* Edges in largest WCC 5066842 (0.993)
* Nodes in largest SCC 434818 (0.497)
* Edges in largest SCC 3419124 (0.670)
* Average clustering coefficient 0.6047
* Number of triangles 13391903
* Fraction of closed triangles 0.05523
* Diameter (longest shortest path) 22
* 90-percentile effective diameter 8.1
*
* Source (citation)
*
* J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. Community Structure in Large
* Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters.
* arXiv.org:0810.1355, 2008.
*
* Google programming contest, 2002
* http://www.google.com/programming-contest/
*
* Files
* File Description
* web-Google.txt.gz Webgraph from the Google programming contest, 2002
---
916428 916428 5105039
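With 916,428 rows and columns but only 5,105,039 nonzero entries, a matrix like this is practical to store only in sparse form. As a minimal sketch, the snippet below builds a sparse adjacency matrix for a hypothetical 4-page web using the `SparseArrays` standard library; the link list is made up for illustration.

```julia
using SparseArrays

# Hypothetical toy web with n = 4 pages.
# Each pair (i, j) means page i links to page j.
links = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 3)]
n = 4

# Sparse adjacency matrix: A[i, j] = 1 if page i links to page j, else 0.
src = first.(links)
dst = last.(links)
A = sparse(src, dst, ones(Int, length(links)), n, n)

outdeg = vec(sum(A, dims=2))   # number of links leaving each page
indeg = vec(sum(A, dims=1))    # number of links pointing to each page
```

Row sums of the adjacency matrix give out-degrees and column sums give in-degrees, which follows directly from the definition of \(a_{ij}\).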
Here is a visualization of the SNAP/web-Google network:
Such a directed graph can also be represented by an incidence matrix \(\mathbf{B} \in \{-1,0,1\}^{m \times n}\), where \(m\) is the number of vertices and \(n\) is the number of edges. The entries of an incidence matrix are \[\begin{eqnarray*}
b_{ij} = \begin{cases}
-1 & \text{if edge $j$ starts at vertex $i$} \\
1 & \text{if edge $j$ ends at vertex $i$} \\
0 & \text{otherwise}
\end{cases}.
\end{eqnarray*}\]
Here is a directed graph with 4 nodes and 5 edges.
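As a minimal sketch of the incidence-matrix definition, the snippet below builds \(\mathbf{B}\) for a hypothetical edge list with 4 vertices and 5 edges. The edge list is an assumption made for illustration and is not necessarily the graph pictured.

```julia
# Hypothetical edge list for a directed graph with 4 vertices and 5 edges.
# Each pair (s, t) is an edge from vertex s to vertex t.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
m, n = 4, length(edges)   # m vertices, n edges

# Incidence matrix: one row per vertex, one column per edge.
B = zeros(Int, m, n)
for (j, (s, t)) in enumerate(edges)
    B[s, j] = -1   # edge j starts at vertex s
    B[t, j] = 1    # edge j ends at vertex t
end

# Each column holds exactly one -1 and one 1, so every column sums to zero.
colsums = vec(sum(B, dims=1))
```

The zero column sums are a quick consistency check: every edge leaves exactly one vertex and enters exactly one vertex.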