Dimensionality Reduction


Guest Lecture by Professor Anders Hast

Questions

  • How and why is dimensionality reduction an important tool?

Objectives

  • We will look at tools for visualising what cannot easily be seen, i.e. reducing the dimensionality of high-dimensional data

  • Share insights and experience from Anders’s own research

The Essence of Machine Learning: Classification

[Figure: ml_classification — two candidate decision boundaries (green and black) separating the blue and red classes]

Decision boundary

How can a model separate the “blue” from the “red”?

Which model is better, the green curve or the black one?

Classification challenges:

  • Black curve: the model will sometimes guess “wrong” for new data

  • Green curve: the model will make even more wrong guesses. Why?

    “Outliers” or special cases have too much impact on the classification boundaries (a sketch of this trade-off follows this list).
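
A minimal sketch of the trade-off, assuming scikit-learn (already in the exercise environment). The moons dataset and the k-nearest-neighbours models are illustrative choices standing in for the green and black curves, not taken from the lecture:

    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Illustrative dataset (not from the lecture): two noisy 2D classes
    # standing in for the "blue" and "red" points.
    X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # k=1 traces every outlier (the wiggly "green curve");
    # k=15 gives a smoother boundary (the "black curve").
    for k in (1, 15):
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        print(f"k={k:2d}  train accuracy={model.score(X_train, y_train):.2f}  "
              f"test accuracy={model.score(X_test, y_test):.2f}")

The k=1 model memorises the training outliers (perfect training score) but generalises worse to new data, just like the green curve.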

Dimensionality reduction:

Project from several dimensions to fewer, often 2D or 3D.

Remember: we get a distorted picture of the high-dimensional space!

[Figure: dim_reduction_bunny — projecting a 3D shape down to 2D]
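
One way to see this distortion concretely is to compare pairwise distances before and after a projection. A small sketch, assuming scikit-learn and SciPy; the digits dataset and the PCA projection are illustrative choices, not from the lecture:

    import numpy as np
    from scipy.spatial.distance import pdist
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    # Illustrative data: 64-dimensional digit images, projected down to 2D.
    X = load_digits().data
    X2 = PCA(n_components=2).fit_transform(X)

    # Correlate pairwise distances before and after the projection;
    # a value well below 1 shows the 2D picture is distorted.
    d_high = pdist(X[:200])
    d_low = pdist(X2[:200])
    print("distance correlation:", np.corrcoef(d_high, d_low)[0, 1])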

Some Dimensionality Reduction Techniques:

PCA (on Iris Data)

  • Fisher’s iris data consists of measurements on the sepal length, sepal width, petal length, and petal width for 150 iris specimens. There are 50 specimens from each of three species.

  • Pretty good separation of classes

  • However, PCA often fails for high-dimensional data, as the clusters will overlap! (A PCA sketch follows this list.)
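
A minimal sketch of PCA on the iris data described above, assuming scikit-learn's bundled copy of the dataset:

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    iris = load_iris()
    # Project the four measurements onto the first two principal components.
    X2 = PCA(n_components=2).fit_transform(iris.data)

    plt.scatter(X2[:, 0], X2[:, 1], c=iris.target)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.title("PCA of Fisher's iris data")
    plt.show()

As noted above, the three species separate fairly well in the first two principal components.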

t-SNE

  • t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualising high-dimensional data by giving each datapoint a location in a two- or three-dimensional map.

  • The t-SNE algorithm comprises two main stages.

  • First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability.

  • Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimises the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map. (A sketch follows this list.)
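
A minimal sketch of running t-SNE, assuming scikit-learn's TSNE implementation; the digits dataset and the perplexity value are illustrative choices, not from the lecture:

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    # Illustrative dataset choice.
    digits = load_digits()
    # Perplexity roughly sets how many neighbours each point "attends to"
    # and is the main knob to experiment with.
    X2 = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(digits.data)

    plt.scatter(X2[:, 0], X2[:, 1], c=digits.target, s=5)
    plt.title("t-SNE embedding of the digits data")
    plt.show()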

UMAP

  • Uniform manifold approximation and projection (UMAP) is a nonlinear dimensionality reduction technique.

  • Visually, it is similar to t-SNE, but it assumes that the data is uniformly distributed on a locally connected Riemannian manifold and that the Riemannian metric is locally constant or approximately locally constant.

  • UMAP is newer and is therefore preferred by many.

  • However, it tends to separate clusters more strongly! But is that always better? (A sketch follows this list.)
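
A minimal sketch, assuming the umap-learn package from the exercise environment; the digits dataset and the parameter values are illustrative choices, not from the lecture:

    import matplotlib.pyplot as plt
    import umap
    from sklearn.datasets import load_digits

    # Illustrative dataset choice.
    digits = load_digits()
    # n_neighbors trades local detail against global structure;
    # min_dist controls how tightly points are packed in the embedding.
    X2 = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(digits.data)

    plt.scatter(X2[:, 0], X2[:, 1], c=digits.target, s=5)
    plt.title("UMAP embedding of the digits data")
    plt.show()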

Face Recognition (FR) Use Case

Keypoints

  • Dimensionality reduction techniques are useful for exploring your high-dimensional data!

  • But they give you more than nice pictures:

    • Make discoveries!

    • Get new results!

    • Use visualisation and clustering for classification

Exercise

You will find a Jupyter notebook called DimRed.ipynb in the tarball (Exercises/day4/Dim_reduction), which works on a face recognition dataset stored in the dataset folder. Try running the notebook, supplying the correct dataset path wherever required.

The environment required for this notebook can be installed with:

    pip install numpy matplotlib scikit-learn scipy pillow plotly umap-learn