Linear Discriminant Analysis (LDA)

New in version 0.6.

Linear discriminant analysis (LDA) [1] is a method for finding the linear combinations of features that best separate two or more classes of items. The output of LDA may be used as a linear classifier, or for dimensionality reduction prior to classification.

See also: Principal Component Analysis (PCA)

Usage Explanation

To reduce a dataset x, whose labels are stored in the array labels, to a new dataset new_x containing only n columns:

new_x = pa.preprocess.LDA(x, labels, n) 

The sorted array of scatter-matrix eigenvalues for dataset x with labels stored in labels can be obtained as follows:

eigenvalues = pa.preprocess.LDA_discriminants(x, labels) 

Minimal Working Examples

In this example we create a dataset x of 150 random samples. Every sample is described by 4 values and a label. The labels are stored in the array labels.

First, it is useful to inspect the eigenvalues of the scatter matrix to determine how many columns it is reasonable to keep:

import numpy as np
import padasip as pa

np.random.seed(100) # constant seed to keep the results consistent

N = 150 # number of samples
classes = np.array(["1", "a", 3]) # names of classes
cols = 4 # number of features (columns in dataset)

x = np.random.random((N, cols)) # random data
labels = np.random.choice(classes, size=N) # random labels

print(pa.preprocess.LDA_discriminants(x, labels))

which prints

>>> [  2.90863957e-02   2.28352079e-02   1.23545720e-18  -1.61163011e-18]

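The last two eigenvalues are numerically zero (on the order of 1e-18), so only the first two discriminants carry information. This is expected, because LDA on three classes yields at most two non-zero discriminants. A reasonable number of columns to keep is therefore 2. The following code reduces the number of features to 2.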
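
The choice of 2 can also be made programmatically. The following is a minimal sketch, not part of padasip: the helper count_significant and its rel_tol threshold are hypothetical names introduced here for illustration, and the eigenvalues are copied from the output printed above.

```python
import numpy as np

# Eigenvalues copied from the printed output above (descending order).
eigenvalues = np.array(
    [2.90863957e-02, 2.28352079e-02, 1.23545720e-18, -1.61163011e-18]
)


def count_significant(eigenvalues, rel_tol=1e-9):
    """Count eigenvalues that are not negligible relative to the largest.

    Hypothetical helper (an assumption, not a padasip function).
    rel_tol is an arbitrary illustrative threshold.
    """
    magnitudes = np.abs(np.asarray(eigenvalues, dtype=float))
    largest = magnitudes.max()
    if largest == 0.0:
        return 0
    return int(np.sum(magnitudes > rel_tol * largest))


n = count_significant(eigenvalues)
print(n)  # prints 2
```

The resulting n could then be passed as the n argument of pa.preprocess.LDA. A relative threshold is used so that the decision does not depend on the overall scale of the data.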

import numpy as np
import padasip as pa

np.random.seed(100) # constant seed to keep the results consistent

N = 150 # number of samples
classes = np.array(["1", "a", 3]) # names of classes
cols = 4 # number of features (columns in dataset)

x = np.random.random((N, cols)) # random data
labels = np.random.choice(classes, size=N) # random labels

new_x = pa.preprocess.LDA(x, labels, n=2)

To check that the size of the new dataset is correct, we can print the shapes as follows:

>>> print("Shape of original dataset: {}".format(x.shape))
Shape of original dataset: (150, 4)
>>> print("Shape of new dataset: {}".format(new_x.shape))
Shape of new dataset: (150, 2)

References

[1] Ronald A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.

Code Explanation

padasip.preprocess.lda.LDA(x, labels, n=False)

Linear Discriminant Analysis function.

Args:

  • x : input matrix (2d array), each row represents one sample
  • labels : list of labels (iterable), each item is the label of the sample with the corresponding index

Kwargs:

  • n : number of features returned (integer) - how many columns should the output keep

Returns:

  • new_x : matrix with reduced size (the number of columns equals n)

padasip.preprocess.lda.LDA_base(x, labels)

Base function used for Linear Discriminant Analysis.

Args:

  • x : input matrix (2d array), each row represents one sample
  • labels : list of labels (iterable), each item is the label of the sample with the corresponding index

Returns:

  • eigenvalues, eigenvectors : eigenvalues and eigenvectors from LDA analysis
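
To illustrate where such eigenvalues and eigenvectors come from, the following is a minimal sketch of textbook Fisher LDA written in plain NumPy. It is not padasip's source code: the function name lda_sketch is hypothetical, and the actual implementation may differ in scaling, normalization, or numerical details, so the values need not match those printed in the examples above.

```python
import numpy as np


def lda_sketch(x, labels):
    """Textbook LDA sketch (hypothetical helper, not padasip's code).

    Returns eigenvalues in descending order and the matching
    eigenvectors as columns.
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    n_features = x.shape[1]
    overall_mean = x.mean(axis=0)

    s_within = np.zeros((n_features, n_features))   # within-class scatter
    s_between = np.zeros((n_features, n_features))  # between-class scatter
    for label in np.unique(labels):
        members = x[labels == label]
        class_mean = members.mean(axis=0)
        centered = members - class_mean
        s_within += centered.T @ centered
        diff = (class_mean - overall_mean).reshape(-1, 1)
        s_between += members.shape[0] * (diff @ diff.T)

    # Eigen-decomposition of pinv(S_w) @ S_b; the pseudo-inverse is
    # used so that a singular S_w does not raise an error.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(s_within) @ s_between)
    # Discard tiny imaginary parts that can appear through round-off.
    eigvals, eigvecs = eigvals.real, eigvecs.real
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]


rng = np.random.RandomState(100)
x = rng.random_sample((150, 4))                      # random data
labels = rng.choice(np.array(["1", "a", "3"]), size=150)

eigenvalues, eigenvectors = lda_sketch(x, labels)
new_x = x @ eigenvectors[:, :2]                      # keep 2 discriminants
print(new_x.shape)  # prints (150, 2)
```

Projecting the data onto the eigenvectors with the largest eigenvalues gives the reduced dataset. With three classes the between-class scatter matrix has rank at most 2, which is why at most two eigenvalues are meaningfully different from zero.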

padasip.preprocess.lda.LDA_discriminants(x, labels)

Linear Discriminant Analysis helper for determining how many columns of the data should be kept.

Args:

  • x : input matrix (2d array), each row represents one sample
  • labels : list of labels (iterable), each item is the label of the sample with the corresponding index

Returns:

  • discriminants : array of eigenvalues sorted in descending order