# Topological Data Analysis

Gregory Clark

July 30, 2014

## Homotopy Groups

The corner stone algebraic topology, $$\pi_1$$, is the group of equivalence classes of loops where two loops are equivalent if one can be continuously deformed into the other.

More generally, other homotopy groups are defined on equivalence classes of higher dimensional “loops” (maps from higher dimensional spheres).

These groups are topological invariants, and computing them would help us identify the shape of spaces. However, . . .

## Intractible Computations

However, homotopy groups are notoriously difficult to compute (due to generator and relation computations in free groups). [Carlsson] [Hatcher]

## Simplicial Homology

Homology groups are easier to compute.*

$$H_n = \frac{Z_n}{B_n} = \frac{\text{ker}(\delta_n)}{\text{im}(\delta_{n+1})}$$

*Technical conditions apply.

## Betti Numbers

The $$k^{\text{th}}$$ Betti number is the number of generators of the $$k^{\text{th}}$$ homology group.

$$\beta_k = \text{rank}(H_k) = \text{log}_2 \frac{|\text{ker}(\delta_k)|}{|\text{im}(\delta_{k+1})|}$$

We may calculate Betti numbers using the Smith normal forms of matrix represenations of the boundary operators $$\delta_i$$.

The $$k^{\text{th}}$$ Betti number counts equivalence classes of $$k$$-dimensional surfaces (the number of “$$k$$-dimensional holes”) in the space.

## Wedge of Circles

For example, let $$X = S^1 \vee S^1$$ be the wedge of two circles.

What are the Betti numbers of $$X$$?

 $$\beta_0 =$$ the number of connected components $$= 1$$ $$\beta_1 =$$ the number of one-dimensional holes $$= 2$$ $$\beta_k =$$ $$0 \quad \text{ for all } k > 1$$

## Betti Numbers of a Dataset

Suppose we have a dataset containing $$n$$ points in $$\mathbb{R}^d$$.

What are the Betti numbers of that set?

 $$\beta_0 =$$ the number of connected components $$= n$$ $$\beta_1 =$$ the number of one-dimensional holes $$= 0$$ $$\beta_k =$$ $$0 \quad \text{ for all } k > 0$$

Okay, that’s fine... but not very interesting.

## Simplicial Complexes

Simplicial complexes are generalizations of graphs.

Instead of edges that connect two vertices, simplicial complexes have simplices which connect any number of vertices.

## A Formal Definition

A simplicial complex is a finite collection of sets $$A$$ such that $$\alpha \in A$$ and $$\beta \subset \alpha$$ implies $$\beta \in A$$.
The sets in $$A$$ of cardinality $$k+1$$ are called $$k$$-simplices.

We often blur the distinction between an abstract simplicial complex and its geometric realization in Euclidean space.

## Čech Complex

The Čech complex is the nerve of the balls of radius $$r$$ around each point.

$$\text{Čech}(S, r) = \big\{\sigma \subset S : \bigcap_{x \in \sigma} B_x(r) \neq \varnothing \big\}$$

## Vietoris-Rips Complex

The Vietoris-Rips complex contains all edges between points that are within $$2r$$ along with every higher-dimensional simplex whose edges are all included.

$$\text{Vietoris-Rips}(S, r) = \big\{\sigma \subset S : \text{diam } \sigma \le 2r \big\}$$

## Vietoris-Rips Lemma

Let $$S$$ be a finite set of points in $$\mathbb{R}^d$$, and let $$r > 0$$. Then $$\text{Čech}(S, r) \subset \text{Vietoris-Rips}(S, r) \subset \text{Čech}(S, \sqrt{2}r).$$

We tend to favor Vietoris-Rips complexes since they are easier to compute.

## Caveat

Both Čech and Vietoris-Rips complexes could produce a simplicial complex of higher dimension than the space we started in!

## Delaunay Complex

$$\text{Delaunay}(S) = \big\{\sigma \subset S : \bigcap_{u \in \sigma} V_u \neq \varnothing \big\}$$ where $$V_u$$ is the Voronoi cell of $$u$$.

## Alpha Complex

$$\text{Alpha}(S) = \big\{\sigma \subset S : \bigcap_{u \in \sigma} R_u(r) \neq \varnothing \big\} \\ \quad \quad \quad \quad \; \, = \text{Čech}(S, r) \cap \text{Delaunay}(S)$$
where $$R_u(r) = B_u(r) \cap V_u$$.

## (Lazy) Witness Complex

In practice using all of the points in a dataset can be wasteful in terms of computing time and memory. The Witness construction strategically chooses landmark points and creates a simplicial complex that is (hopefully) still a faithful topological representation of the underlying space.

Lazy-Witness : Witness : : Vietoris-Rips : Čech

## Another Caveat

Real world data is often messy, and chosing a particular radius $$r$$ can yield a simplicial complex with artificial features.

$$\beta_1 = 3$$
$$\beta_1 = 2$$

## Persistent Homology

The features that persist for a large range of resolutions are more likely to represent genuine features of the underlying space.

## Plex: Simplicial Complexes in Matlab

Plex 2.5.1
Authors:Patrick Perry
Vin de Silva
Started:2000
Updated:2006 (deprecated)
Languagefilesblankcommentcode
C++852709384310206
Bourne Shell684313398280
MATLAB6762921652219
m471581091110
make1710953206
SUM:25664211400827230

## JPlex: Persistent Homology Library

JPlex (PLEX3)
John Chakerian
Started:2006
Updated:2011 “phased out”
Languagefilesblankcommentcode
HTML623048128324667
Java57167468008969
Bourne Shell481610167759
Bourne Again Shell16988570
C++117530370
MATLAB9106398309
R2445170
Ant1200143
SUM:1415977963443020

## JavaPlex: Persistent Homology Library

javaPlex 4.2.0
Authors:Andrew Tausz
Mikael Vejdemo-Johansson
Started:2010
Updated:yesterday
Languagefilesblankcommentcode
HTML405158917715130927
Java25956951262023457
Bourne Shell681910167773
MATLAB190176914464656
Bourne Again Shell16988570
C++117530370
Ant13211223
R2445170
Arduino Sketch13263164
Visual Basic1290122
SUM:8712458123008168498

## Phom: Persistent Homology in R

phom 1.0.3
Authors:Andrew Tausz
Started:2011
Updated:February 2014
Languagefilesblankcommentcode
C++14719176
R25945118
SUM:245312561969

## Mapper

Mapper
Authors:Gurjeet Singh
Started:2007
Updated:2009
Languagefilesblankcommentcode
MATLAB656215227
HTML12138114
SUM:777253341

## Python Mapper

Python Mapper 0.1.6
Authors:Daniel Müllner
Aravindakshan Babu
Started:2011
Updated:March 2014
Languagefilesblankcommentcode
Bourne Shell83907451925464
m471063919850
Python1917487659131
C++13291402705
HTML2200296
DOS Batch1231166
make2245130
SUM:417218560548208

## Dionysus: Persistent Homology Library

Dionysus
Authors:Dmitriy Morozov
Started:2006
Updated:December 2013
Languagefilesblankcommentcode
C++1618767870
Python177348265
CMake102114125
make15010
SUM:9414488216112

## CTL: Computational Topology Library

CTL
Authors:Ryan H. Lewis
Started:August 2013
Updated:July 2014
Languagefilesblankcommentcode
C++2230911091814
CMake21113168393
SUM:101127843248102

## CHomP: Computational Homology Project

People at Rutgers are also doing things. I'm not sure exactly what, but it does involve 40,000+ lines of C/C++ code related to persistent homology.

## Phom Example 1

R code:

N <- 50
x1 <- rnorm(N, mean=0, sd=1)
y1 <- rnorm(N, mean=0, sd=1)
z1 <- rnorm(N, mean=0, sd=1)
x2 <- rnorm(N, mean=8, sd=1)
y2 <- rnorm(N, mean=8, sd=1)
z2 <- rnorm(N, mean=8, sd=1)
V1 <- cbind(x1, y1, z1)
V2 <- cbind(x2, y2, z2)
V <- as.matrix(rbind(V1, V2))
plot(V)


(Plotted with processing.js.)

R code:

library('phom')

max_f <- 9

dim <- 0
intervals <- pHom(V, dim, max_f, mode='vr', metric='euclidean')
plotBarcodeDiagram(intervals, dim, max_f, title = '')

dim <- 1
intervals <- pHom(V, dim, max_f, mode='vr', metric='euclidean')
plotBarcodeDiagram(intervals, dim, max_f, title = '')

dim <- 2
intervals <- pHom(V, dim, max_f, mode='vr', metric='euclidean')
plotBarcodeDiagram(intervals, dim, max_f, title = '')


Note that "vr" stands for the Vietoris-Rips complex.

## Mapper Complex

The “Mapper complex” is the nerve of all partial clusters taken on the preimages of a covering of the range of a filter function $$f : S \to Z$$.

$$\text{Mapper}(S, f, \mathcal{A}, \text{Clust}) = \big\{\sigma \subset P : \bigcap_{s \in \sigma} s \neq \varnothing \big\}$$
where $$\mathcal{A}$$ is a cover of the range of $$f : S \to Z$$, $$\text{Clust} : \mathcal{P}(S) \to \mathcal{P}(\mathcal{P}(S))$$ is a clustering function, and

$$P = \bigcup_{U \in \mathcal{A}} \text{Clust}\big(f^{-1}(U)\big).$$

## A Circle

Matlab:

X = randn(300, 2);
X = X./(sqrt(sum(X.*X,2))*ones(1, 2));
d = L2_distance(X',X',1);
filter = d(1, :);
scatter(X(:,1),X(:,2),1000,filter,'.');
axis equal;


## Mapper Results

Matlab:

filterSamples = 5;
overlapPct = 40;
[adja, nodeInfo, levelIdx] = mapper(d,...
filter, 1/filterSamples, overlapPct);
%%%%
for i=1:length(nodeInfo)
ecc(i) = nodeInfo{i}.filter;
setSize(i) = length(nodeInfo{i}.set);
end
writeDotFile('circle.dot', adja, ecc, setSize);


Bash:

\$ neato -Tpng circle.dot -o mapper-output.png


## Mapper Results in the Browser

writeJsonFile('circle.json', adja, ecc, setSize);


## Miller-Reaven Diabetes Dataset

Projection Pursuit

## Reproduced Results Reproduced

R code:

require(locfit)
#### locfit 1.5-9.1 	 2013-03-22
data("chemdiab")
normdiab <- rbind(chemdiab)
for (i in 1:5) {
normdiab[i] <- scale(chemdiab[i],center=FALSE)
}
write.csv(normdiab,'diab-norm.csv',row.names=FALSE)


## Microarray Data

NKI gene expression [van de Vijver]

Agilent Microarray

## PAD = DSGA + Mapper

Progression Analysis of Disease is the one-two punch of Disease-Specific Genomic Analysis and Mapper.

## DSGA = HSM + FLAT

DSGA is the one-two punch of decomposing disease data with a Healthy State Model created from normal tissue data and the FLAT construction.

## Flat = data desparsing + PCA

Flat is the one-two punch of data desparsing and Principal Component Analysis.

NKI Dataset

## Normal Breast Tissue Data

UNC Microarray Database(my source)

Stanford Microarray Database(the paper's source)

I replaced empty cells with NaN, deleted some poorly formatted rows, and used Matlab to impute missing data and change from log10 scale to log2 scale.

Matlab:

normal_bt = csvread('normal-breast-tissue-data.csv');
normal_bt_imputed = knnimpute(normal_bt, 10);
csvwrite('normal-breast-tissue-imputed.csv', normal_bt_imputed);

nki_data_imputed = knnimpute(nki_data, 10);
nki_data_imputed_log2 = nki_data_imputed .* 3.32192809489;
csvwrite('nki-chang-complete-imputed-log2.csv', nki_data_imputed_log2);


## References

1. A. Tausz, phom: Persistent Homology in R, Version 1.0.1, 2011. Available at CRAN http://cran.r-project.org.
2. G. Carlsson. Topology and Data. Bull. Amer. Math. Soc., 46:255-308, 2009.
3. H. Edelsbrunner, J. L. Harer, Computational Topology: An Introduction, AMS 2010
4. N. De Jong, A Probabilistic Approach To Simplicial Homology, UTK, 2012
5. A. J. Zomorodian, Topology for Computing, 2005
6. ...

### Thank you to everybody who came out to support me, especially my committee:

Dr. Fernando Schwartz

Dr. Michael Langston

Dr. Michael Berry