Dimensionality Reduction


Guest Lecture by Professor Anders Hast

Questions

  • How and why is dimensionality reduction an important tool?

Objectives

  • We will look at tools for visualising what cannot easily be seen, i.e. reducing the dimensionality of high-dimensional data

  • Share insights and experience from Anders’s own research

The Essence of Machine Learning: Classification

[Figure: ml_classification — two candidate decision boundaries (green and black) separating the blue and red classes]

Decision boundary

How can a model separate the “blue” from the “red”?

Which model is better, the green curve or the black one?

Classification challenges:

  • Black curve: the model will sometimes guess “wrong” for new data

  • Green curve: the model will make even more wrong guesses. Why?

    “Outliers” or special cases have too much impact on the classification boundaries (a sketch of this trade-off follows this list).
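
A minimal sketch of the trade-off, assuming scikit-learn (already in the exercise environment). The moons dataset and the k-nearest-neighbours models are illustrative choices standing in for the green and black curves, not taken from the lecture:

    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Illustrative dataset (not from the lecture): two noisy 2D classes
    # standing in for the "blue" and "red" points.
    X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # k=1 traces every outlier (the wiggly "green curve");
    # k=15 gives a smoother boundary (the "black curve").
    for k in (1, 15):
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        print(f"k={k:2d}  train accuracy={model.score(X_train, y_train):.2f}  "
              f"test accuracy={model.score(X_test, y_test):.2f}")

The k=1 model memorises the training outliers (perfect training score) but generalises worse to new data, just like the green curve.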

Dimensionality reduction:

Project from several dimensions to fewer, often 2D or 3D.

Remember: we get a distorted picture of the high-dimensional space!

[Figure: dim_reduction_bunny — projecting a 3D shape down to 2D]
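
One way to see this distortion concretely is to compare pairwise distances before and after a projection. A small sketch, assuming scikit-learn and SciPy; the digits dataset and the PCA projection are illustrative choices, not from the lecture:

    import numpy as np
    from scipy.spatial.distance import pdist
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    # Illustrative data: 64-dimensional digit images, projected down to 2D.
    X = load_digits().data
    X2 = PCA(n_components=2).fit_transform(X)

    # Correlate pairwise distances before and after the projection;
    # a value well below 1 shows the 2D picture is distorted.
    d_high = pdist(X[:200])
    d_low = pdist(X2[:200])
    print("distance correlation:", np.corrcoef(d_high, d_low)[0, 1])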

Some Dimensionality Reduction Techniques:

PCA (on Iris Data)

  • Fisher’s iris data consists of measurements on the sepal length, sepal width, petal length, and petal width for 150 iris specimens. There are 50 specimens from each of three species.

  • Pretty good separation of classes

  • However, PCA often fails for high-dimensional data, as the clusters will overlap! (A PCA sketch follows this list.)
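
A minimal sketch of PCA on the iris data described above, assuming scikit-learn's bundled copy of the dataset:

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    iris = load_iris()
    # Project the four measurements onto the first two principal components.
    X2 = PCA(n_components=2).fit_transform(iris.data)

    plt.scatter(X2[:, 0], X2[:, 1], c=iris.target)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.title("PCA of Fisher's iris data")
    plt.show()

As noted above, the three species separate fairly well in the first two principal components.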

t-SNE

  • t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualising high-dimensional data by giving each datapoint a location in a two- or three-dimensional map.

  • The t-SNE algorithm comprises two main stages.

  • First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability.

  • Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimises the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map. (A sketch follows this list.)
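
A minimal sketch of running t-SNE, assuming scikit-learn's TSNE implementation; the digits dataset and the perplexity value are illustrative choices, not from the lecture:

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    # Illustrative dataset choice.
    digits = load_digits()
    # Perplexity roughly sets how many neighbours each point "attends to"
    # and is the main knob to experiment with.
    X2 = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(digits.data)

    plt.scatter(X2[:, 0], X2[:, 1], c=digits.target, s=5)
    plt.title("t-SNE embedding of the digits data")
    plt.show()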

UMAP

  • Uniform manifold approximation and projection (UMAP) is a nonlinear dimensionality reduction technique.

  • Visually, it is similar to t-SNE, but it assumes that the data is uniformly distributed on a locally connected Riemannian manifold and that the Riemannian metric is locally constant or approximately locally constant.

  • UMAP is newer and is therefore preferred by many.

  • However, it tends to separate clusters more strongly! But is that always better? (A sketch follows this list.)
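
A minimal sketch, assuming the umap-learn package from the exercise environment; the digits dataset and the parameter values are illustrative choices, not from the lecture:

    import matplotlib.pyplot as plt
    import umap
    from sklearn.datasets import load_digits

    # Illustrative dataset choice.
    digits = load_digits()
    # n_neighbors trades local detail against global structure;
    # min_dist controls how tightly points are packed in the embedding.
    X2 = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(digits.data)

    plt.scatter(X2[:, 0], X2[:, 1], c=digits.target, s=5)
    plt.title("UMAP embedding of the digits data")
    plt.show()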

Face Recognition (FR) Use Case

Keypoints

  • Dimensionality reduction techniques are useful for exploring your high-dimensional data!

  • But they give you more than nice pictures:

    • Make discoveries!

    • Get new results!

    • Use visualisation and clustering for classification

Exercise

You will find a Jupyter notebook called DimRed.ipynb in the tarball (Exercises/day4/Dim_reduction), which works on a face recognition dataset stored in the dataset folder. Try running the notebook, supplying the correct dataset path wherever required.

The environment required for this notebook can be installed with:

    pip install numpy matplotlib scikit-learn scipy pillow plotly umap-learn