Dimensionality Reduction


Guest Lecture by Professor Anders Hast

Questions

  • How and why is dimensionality reduction an important tool?

Objectives

  • We will look at tools for visualising what cannot easily be seen, i.e. dimensionality reduction of high-dimensional data

  • Share insights and experience from Anders’s own research

Visualisation ↔ Science

[Figure: Voronoi regions]

Clustering

[Figure: machine learning classification]

Decision boundary

  • We will also see that you can make discoveries in your visualisations!

What is a typical machine learning task?

  • Distinguish between different classes based on their features

  • Features usually have more than three dimensions, often hundreds or even thousands!

  • The idea is to find a separating boundary (a curve or hyperplane) in high-dimensional space

  • Usually we visualise this in 2D since it is easier to understand!

  • We will look at several techniques to do this!

  • If we can separate in 2D, it can often also be done in high-dimensional space, and vice versa! (A small code sketch of a 2D decision boundary follows below.)
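
To make this concrete, here is a minimal sketch (my own illustration, not code from the lecture) that fits a linear classifier to 2D toy data with scikit-learn and reports the resulting separating line; the dataset and classifier choice are purely illustrative.

```python
# Illustrative sketch: a linear decision boundary on 2D toy data.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Two well-separated classes in 2D (toy data, chosen only for illustration)
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

clf = LogisticRegression().fit(X, y)

# The decision boundary is the line w0*x0 + w1*x1 + b = 0
w = clf.coef_[0]
b = clf.intercept_[0]
print(f"Decision boundary: {w[0]:.2f}*x0 + {w[1]:.2f}*x1 + {b:.2f} = 0")
print("Training accuracy:", clf.score(X, y))
```

Here the boundary is a straight line; the same idea carries over to curved boundaries and to hyperplanes in higher-dimensional feature spaces.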

Dimensionality reduction:

  • Project from several dimensions to fewer, often 2D or 3D

  • Remember: we get a distorted picture of the high-dimensional space!

  • Some techniques
    • SOM

    • PCA

    • t-SNE

    • UMAP

[Figure: dimensionality reduction of a bunny model]

Some Dimensionality Reduction Techniques:

PCA (on Iris Data)

  • PCA = “find the directions where the data varies the most.”

  • PCA finds a new coordinate system that fits the data:

  • The first axis (1st principal component) points where the data spreads out the most.

  • The second axis (2nd principal component) is perpendicular to the first and captures the next largest spread.

  • The eigenvectors are the directions of those new axes — the principal components.

  • The eigenvalues tell you how much variance (spread) each component captures.

  • Fisher’s iris data consists of measurements on the sepal length, sepal width, petal length, and petal width for 150 iris specimens.

  • There are 50 specimens from each of three species.

  • One axis per feature (which ones are discriminative?)

  • Follow each specimen along the lines (a minimal PCA code sketch follows after this list)
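
A minimal PCA sketch on the iris data, in the spirit of the scikit-learn example linked at the end of this page; the plotting details are my own illustrative choices.

```python
# PCA on Fisher's iris data: project the 4 measurements down to 2 components.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X, y = iris.data, iris.target          # 150 specimens, 4 features each

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# How much variance (spread) each principal component captures
print("Explained variance ratio:", pca.explained_variance_ratio_)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="viridis")
plt.xlabel("1st principal component")
plt.ylabel("2nd principal component")
plt.title("PCA of the iris data")
plt.show()
```

The explained variance ratios are the eigenvalues mentioned above, normalised so that they sum to one.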

t-SNE

  • A dimensionality-reduction method for visualising high-dimensional data in 2D or 3D

  • It keeps similar points close together and dissimilar ones far apart

  • Works by turning distances between points into probabilities of being neighbours, both in the original space and in the low-dimensional map

  • Then it moves points to make those probabilities match (minimizing KL divergence)

  • Uses a Student’s t-distribution in 2D to keep clusters separated and avoid crowding (a small code sketch follows below)
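
A minimal t-SNE sketch on the same iris data; the perplexity value and random seed are illustrative assumptions, not settings from the lecture.

```python
# t-SNE: embed the 4D iris data into 2D while preserving local neighbourhoods.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

iris = load_iris()
X, y = iris.data, iris.target

# Perplexity roughly sets how many neighbours each point "expects"
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="viridis")
plt.title("t-SNE embedding of the iris data")
plt.show()
```

The resulting layout is quite sensitive to the perplexity and the random seed, so it is worth trying a few values before drawing conclusions from it.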

UMAP

  • A nonlinear dimensionality-reduction method, like t-SNE, used to visualize high-dimensional data in 2D or 3D

  • Based on manifold theory — it assumes your data lies on a curved surface within a high-dimensional space

  • Builds a graph of local relationships (who’s close to whom) in the original space, then finds a low-dimensional layout that preserves those relationships (a small code sketch follows below)
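
A minimal UMAP sketch using the umap-learn package from the exercise environment; the n_neighbors and min_dist values are illustrative assumptions.

```python
# UMAP: build a neighbourhood graph in 4D and lay it out in 2D.
import matplotlib.pyplot as plt
import umap  # provided by the umap-learn package
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# n_neighbors balances local vs. global structure; min_dist controls crowding
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_2d = reducer.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="viridis")
plt.title("UMAP embedding of the iris data")
plt.show()
```

Compared with t-SNE, UMAP tends to be faster and to preserve somewhat more global structure, but the layout still depends on the parameter choices.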

Face Recognition (FR) Use case

Keypoints

  • We looked at several dimensionality reduction techniques

  • They are useful for exploring your high-dimensional data!

  • But they are not only nice pictures:
    • Make discoveries!

    • New results!

Exercise

You will find a Jupyter notebook in the tarball called DimRed.ipynb (Exercises/day4/Dim_reduction), which works on a face recognition dataset kept in the dataset folder. Try running the notebook, supplying the correct dataset path wherever required.

The environment required for this notebook can be installed with: pip install numpy matplotlib scikit-learn scipy pillow plotly umap-learn jupyter

Sample examples from the documentation: https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html#sphx-glr-auto-examples-decomposition-plot-pca-iris-py , https://plotly.com/python/t-sne-and-umap-projections/