Dimensionality Reduction

Guest Lecture by Professor Anders Hast
Distinguished University Teacher, InfraVis, UU Node
Research page: andershast.com
email: anders.hast@it.uu.se
InfraVis: infravis.se
Questions
How and why is dimensionality reduction an important tool?
Objectives
We will look at tools for visualising what cannot easily be seen, i.e. dimensionality reduction of high-dimensional data
Share insights and experience from Anders’s own research
The Essence of Machine Learning: Classification

Decision boundary
How can a model separate the “blue” from the “red”?
Which model is better: the green curve or the black one?
Classification challenges:
- Black curve: the model will sometimes guess “wrong” for new data.
- Green curve: the model will make even more wrong guesses. Why?
“Outliers” or special cases have too much impact on the classification boundary: the green curve overfits them (see the sketch below).
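A minimal sketch of the black-vs-green distinction, using scikit-learn on synthetic data (the models and parameter values are illustrative assumptions, not from the lecture):

```python
# Sketch: a smooth vs. an overly flexible decision boundary.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Black curve": a smooth, regularised boundary
smooth = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
# "Green curve": a very flexible boundary that chases every outlier
wiggly = SVC(kernel="rbf", C=1000, gamma=10).fit(X_train, y_train)

print("smooth train/test:", smooth.score(X_train, y_train), smooth.score(X_test, y_test))
print("wiggly train/test:", wiggly.score(X_train, y_train), wiggly.score(X_test, y_test))
# Typically the wiggly model scores higher on the training data but lower
# on unseen test data: the outliers had too much impact on its boundary.
```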
Dimensionality reduction:
Project from several dimensions to fewer, often 2D or 3D.
Remember: we get a distorted picture of the high-dimensional space!

Some Dimensionality Reduction Techniques:
PCA (on Iris Data)
Fisher’s iris data consists of measurements on the sepal length, sepal width, petal length, and petal width for 150 iris specimens. There are 50 specimens from each of three species.
Pretty good separation of classes
However, PCA often fails for high-dimensional data, as the clusters will overlap!
(Figure: the Iris data and its 2D PCA projection)
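A minimal sketch of this projection, assuming scikit-learn (which ships the Iris data):

```python
# Sketch: project the 4D Iris measurements down to 2D with PCA.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()                    # 150 specimens x 4 measurements
pca = PCA(n_components=2)             # project 4D -> 2D
X_2d = pca.fit_transform(iris.data)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_) # how much variance each axis keeps
```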


t-SNE
t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualising high-dimensional data by giving each datapoint a location in a two- or three-dimensional map.
The t-SNE algorithm comprises two main stages.
First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability.
Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimises the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map.
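As a hedged sketch, the same Iris data can be embedded with scikit-learn’s t-SNE; the perplexity value below is an illustrative default, not from the lecture:

```python
# Sketch: embed the Iris data in 2D with t-SNE.
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X = load_iris().data
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)          # (150, 2) map coordinates

print(X_2d.shape)
# Note: distances in the 2D map are distorted; only local neighbourhoods
# are (approximately) preserved.
```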
UMAP
Uniform manifold approximation and projection (UMAP) is a nonlinear dimensionality reduction technique.
Visually, it is similar to t-SNE, but it assumes that the data is uniformly distributed on a locally connected Riemannian manifold and that the Riemannian metric is locally constant or approximately locally constant.
UMAP is newer and typically faster than t-SNE, and is therefore preferred by many.
However, it tends to separate clusters more strongly. But is that always better?
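A minimal UMAP sketch, assuming the third-party umap-learn package (pip install umap-learn); the parameter values are illustrative defaults:

```python
# Sketch: embed the Iris data in 2D with UMAP.
import umap
from sklearn.datasets import load_iris

X = load_iris().data
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1,
                    random_state=42)
X_2d = reducer.fit_transform(X)       # (150, 2) embedding

print(X_2d.shape)
# Smaller n_neighbors emphasises local structure (tighter clusters);
# larger values preserve more of the global layout.
```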
Face Recognition (FR) Use case
InfraVis slides on FR

Keypoints
Dimensionality reduction techniques are useful for exploring your high-dimensional data!
- But they are not only nice pictures:
- Make discoveries!
- New results!
- Use visualisation and clustering for classification.