Dimensionality Reduction

[InfraVis logo]

Guest Lecture by Professor Anders Hast

Questions

  • How and why is dimensionality reduction an important tool?

Objectives

  • We will look at tools for visualising what cannot easily be seen, i.e. reducing the dimensionality of high-dimensional data

  • Share insights and experience from Anders’s own research

The Essence of Machine Learning: Classification

[Figure: two classes of points with two candidate decision boundaries, one black and one green]

Decision boundary

How can a model separate the “blue” from the “red”?

Which model is better, the green curve or the black one?

Classification challenges:

  • Black curve: the model will sometimes guess “wrong” for new data

  • Green curve: the model will make even more wrong guesses. Why?

    “Outliers” or special cases have too much impact on the classification boundary (illustrated in the sketch below).
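To make this concrete, here is a minimal sketch, assuming scikit-learn and its toy make_moons data (neither is from the lecture), contrasting a heavily regularised boundary with a flexible one that lets outliers bend it:

    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Toy 2D data with two overlapping classes ("blue" and "red").
    X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # "Black curve": strong regularisation gives a smooth boundary.
    smooth = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X_train, y_train)

    # "Green curve": weak regularisation and a narrow kernel let single
    # outliers bend the boundary around themselves.
    flexible = SVC(kernel="rbf", C=100.0, gamma=50.0).fit(X_train, y_train)

    for name, model in [("smooth", smooth), ("flexible", flexible)]:
        print(name,
              "train:", round(model.score(X_train, y_train), 3),
              "test:", round(model.score(X_test, y_test), 3))

The flexible model typically scores higher on the training points but lower on the held-out test points; that is the extra wrong guessing described above.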

Dimensionality reduction:

Project from several dimensions to fewer, often 2D or 3D.

Remember: we get a distorted picture of the high-dimensional space!

[Figure: a 3D bunny model projected down to 2D, illustrating dimensionality reduction]

Some Dimensionality Reduction Techniques:

PCA (on Iris Data)

  • Fisher’s iris data consists of measurements on the sepal length, sepal width, petal length, and petal width for 150 iris specimens. There are 50 specimens from each of three species.

  • Pretty good separation of classes

  • However, PCA often fails for high-dimensional data, as the clusters will overlap! (see the sketch below)
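A minimal sketch of that projection, assuming scikit-learn (which ships Fisher’s iris data); the standardisation step is an assumption, not necessarily what was shown in the lecture:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    iris = load_iris()                             # 150 specimens x 4 measurements
    X = StandardScaler().fit_transform(iris.data)  # standardise each measurement

    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X)                    # project 4D -> 2D

    print("explained variance ratio:", pca.explained_variance_ratio_)
    # Scatter-plotting X_2d coloured by iris.target shows the three
    # species separating fairly well along the first two components.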

t-SNE

  • t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualising high-dimensional data by giving each datapoint a location in a two- or three-dimensional map.

  • The t-SNE algorithm comprises two main stages.

  • First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability.

  • Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimises the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map (see the sketch below).
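Both stages happen inside a single library call in practice. A minimal sketch, assuming scikit-learn’s TSNE with its digits dataset as stand-in data:

    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    digits = load_digits()  # 1797 images, each a 64-dimensional vector

    # Both stages (pairwise probabilities in high dimensions, then KL
    # minimisation in 2D) run inside fit_transform; perplexity sets the
    # effective neighbourhood size.
    tsne = TSNE(n_components=2, perplexity=30, random_state=0)
    X_2d = tsne.fit_transform(digits.data)  # one 2D location per datapoint

    # Scatter-plotting X_2d coloured by digits.target shows similar
    # digits landing close together in the map.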

UMAP

  • Uniform manifold approximation and projection (UMAP) is a nonlinear dimensionality reduction technique.

  • Visually, it is similar to t-SNE, but it assumes that the data is uniformly distributed on a locally connected Riemannian manifold and that the Riemannian metric is locally constant or approximately locally constant.

  • UMAP is newer and therefore preferred by many.

  • However, it tends to separate clusters better! But is that always better? (see the sketch below)
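A minimal sketch, assuming the third-party umap-learn package (pip install umap-learn) rather than anything specific from the lecture:

    import umap
    from sklearn.datasets import load_digits

    digits = load_digits()
    reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1,
                        random_state=42)
    X_2d = reducer.fit_transform(digits.data)  # 2D embedding

    # n_neighbors controls how local the manifold approximation is;
    # min_dist controls how tightly embedded points may pack, which is
    # one reason UMAP clusters often look more separated than t-SNE's.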

Face Recognition (FR) Use Case

Key Points

  • Dimensionality reduction techniques are useful for exploring your high-dimensional data!

  • But it is not only about nice pictures:

    • Make discoveries!

    • New results!

    • Use visualisation and clustering for classification (see the sketch below)
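As one hedged illustration of that last point (not the lecture’s actual face-recognition pipeline), this sketch clusters a t-SNE embedding, assuming scikit-learn throughout, and checks how well the clusters match the known labels:

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE
    from sklearn.metrics import adjusted_rand_score

    digits = load_digits()
    X_2d = TSNE(n_components=2, random_state=0).fit_transform(digits.data)

    # Cluster in the low-dimensional map, then check how well the
    # clusters agree with the true classes.
    clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_2d)
    print("adjusted Rand index:", adjusted_rand_score(digits.target, clusters))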