Source code for padasip.preprocess.lda

"""
.. versionadded:: 0.6

Linear discriminant analysis (LDA)
is a method used to determine the features
that separates some classes of items. The output of LDA may be used as
a linear classifier, or for dimensionality reduction for purposes of
classification.

.. contents::
   :local:
   :depth: 1

See also: :ref:`preprocess-pca`

Usage Explanation
********************

For reduction of data-set :code:`x` with labels stored in array (:code:`labels`)
to new dataset :code:`new_x` containg just :code:`n` number of
columns

.. code-block:: python

    new_x = pa.preprocess.LDA(x, labels, n)

The sorted array of scattermatrix eigenvalues for dataset :code:`x` described
with variable :code:`labels` can be obtained as follows

.. code-block:: python

    eigenvalues = pa.preprocess.LDA_discriminants(x, labels)


Minimal Working Examples
*****************************

In this example we create data-set :code:`x` of 150 random samples. Every sample
is described by 4 values and label. The labels are stored in
array :code:`labels`.

Firstly, it is good to see the eigenvalues of scatter matrix to determine
how many rows is reasonable to reduce

.. code-block:: python

    import numpy as np
    import padasip as pa

    np.random.seed(100) # constant seed to keep the results consistent

    N = 150 # number of samples
    classes = np.array(["1", "a", 3]) # names of classes
    cols = 4 # number of features (columns in dataset)

    x = np.random.random((N, cols)) # random data
    labels = np.random.choice(classes, size=N) # random labels

    print pa.preprocess.LDA_discriminants(x, labels)

what prints

>>> [  2.90863957e-02   2.28352079e-02   1.23545720e-18  -1.61163011e-18]

From this output it is obvious that reasonable number of columns to keep is 2.
The following code reduce the number of features to 2.

.. code-block:: python

    import numpy as np
    import padasip as pa

    np.random.seed(100) # constant seed to keep the results consistent

    N = 150 # number of samples
    classes = np.array(["1", "a", 3]) # names of classes
    cols = 4 # number of features (columns in dataset)

    x = np.random.random((N, cols)) # random data
    labels = np.random.choice(classes, size=N) # random labels

    new_x = pa.preprocess.LDA(x, labels, n=2)

to check if the size of new data-set is really correct we can print the shapes
as follows

>>> print "Shape of original dataset: {}".format(x.shape)
Shape of original dataset: (150, 4)
>>> print "Shape of new dataset: {}".format(new_x.shape)
Shape of new dataset: (150, 2)


Code Explanation
*****************
"""
from __future__ import division
import numpy as np

[docs]def LDA_base(x, labels): """ Base function used for Linear Discriminant Analysis. **Args:** * `x` : input matrix (2d array), every row represents new sample * `labels` : list of labels (iterable), every item should be label for \ sample with corresponding index **Returns:** * `eigenvalues`, `eigenvectors` : eigenvalues and eigenvectors \ from LDA analysis """ classes = np.array(tuple(set(labels))) cols = x.shape[1] # mean values for every class means = np.zeros((len(classes), cols)) for i, cl in enumerate(classes): means[i] = np.mean(x[labels == cl], axis=0) # scatter matrices scatter_within = np.zeros((cols, cols)) for cl, mean in zip(classes, means): scatter_class = np.zeros((cols, cols)) for row in x[labels == cl]: dif = row - mean scatter_class += np.dot(dif.reshape(cols, 1), dif.reshape(1, cols)) scatter_within += scatter_class total_mean = np.mean(x, axis=0) scatter_between = np.zeros((cols, cols)) for cl, mean in zip(classes, means): dif = mean - total_mean dif_product = np.dot(dif.reshape(cols, 1), dif.reshape(1, cols)) scatter_between += x[labels == cl, :].shape[0] * dif_product # eigenvalues and eigenvectors from scatter matrices scatter_product = np.dot(np.linalg.inv(scatter_within), scatter_between) eigen_values, eigen_vectors = np.linalg.eig(scatter_product) return eigen_values, eigen_vectors
[docs]def LDA(x, labels, n=False): """ Linear Discriminant Analysis function. **Args:** * `x` : input matrix (2d array), every row represents new sample * `labels` : list of labels (iterable), every item should be label for \ sample with corresponding index **Kwargs:** * `n` : number of features returned (integer) - how many columns should the output keep **Returns:** * new_x : matrix with reduced size (number of columns are equal `n`) """ n = n if n else x.shape[1] - 1 assert x.shape[1] > n, "The requested n is bigger than \ number of features in x." # make the LDA eigen_values, eigen_vectors = LDA_base(x, labels) # sort the eigen vectors according to eigen values eigen_order = eigen_vectors.T[(-eigen_values).argsort()] return eigen_order[:n].dot(x.T).T
[docs]def LDA_discriminants(x, labels): """ Linear Discriminant Analysis helper for determination how many columns of data should be reduced. **Args:** * `x` : input matrix (2d array), every row represents new sample * `labels` : list of labels (iterable), every item should be label for \ sample with corresponding index **Returns:** * `discriminants` : array of eigenvalues sorted in descending order """ # validate inputs try: x = np.array(x) except: raise ValueError('Impossible to convert x to a numpy array.') # make the LDA eigen_values, eigen_vectors = LDA_base(x, labels) return eigen_values[(-eigen_values).argsort()]