"""
.. versionadded:: 0.6
Linear discriminant analysis (LDA)
is a method used to determine the features
that separates some classes of items. The output of LDA may be used as
a linear classifier, or for dimensionality reduction for purposes of
classification.
.. contents::
:local:
:depth: 1
See also: :ref:`preprocess-pca`
Usage Explanation
********************

To reduce a dataset :code:`x`, with labels stored in the array :code:`labels`,
to a new dataset :code:`new_x` containing just :code:`n`
columns:

.. code-block:: python

    new_x = pa.preprocess.LDA(x, labels, n)

The sorted array of scatter-matrix eigenvalues for the dataset :code:`x`
described by the labels in :code:`labels` can be obtained as follows:

.. code-block:: python

    eigenvalues = pa.preprocess.LDA_discriminants(x, labels)

Minimal Working Examples
*****************************

In this example we create a dataset :code:`x` of 150 random samples. Every
sample is described by 4 values and a label. The labels are stored in the
array :code:`labels`.

First, it is useful to inspect the eigenvalues of the scatter matrix to
determine how many columns it is reasonable to keep:

.. code-block:: python

    import numpy as np
    import padasip as pa

    np.random.seed(100) # constant seed to keep the results consistent

    N = 150 # number of samples
    classes = np.array(["1", "a", 3]) # names of classes
    cols = 4 # number of features (columns in dataset)

    x = np.random.random((N, cols)) # random data
    labels = np.random.choice(classes, size=N) # random labels

    print(pa.preprocess.LDA_discriminants(x, labels))

which prints

>>> [ 2.90863957e-02 2.28352079e-02 1.23545720e-18 -1.61163011e-18]

Only the first two eigenvalues are significantly different from zero (with
three classes, at most :math:`C - 1 = 2` discriminants can be nonzero), so a
reasonable number of columns to keep is 2.
The following code reduces the number of features to 2.

.. code-block:: python

    import numpy as np
    import padasip as pa

    np.random.seed(100) # constant seed to keep the results consistent

    N = 150 # number of samples
    classes = np.array(["1", "a", 3]) # names of classes
    cols = 4 # number of features (columns in dataset)

    x = np.random.random((N, cols)) # random data
    labels = np.random.choice(classes, size=N) # random labels

    new_x = pa.preprocess.LDA(x, labels, n=2)

To check that the size of the new dataset is really correct, we can print the
shapes as follows

>>> print("Shape of original dataset: {}".format(x.shape))
Shape of original dataset: (150, 4)
>>> print("Shape of new dataset: {}".format(new_x.shape))
Shape of new dataset: (150, 2)

Code Explanation
*****************
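
In outline, the functions below build the within-class and between-class
scatter matrices and solve the eigenproblem of their product. For classes
:math:`c = 1, \dots, C` with class means :math:`\mu_c`, total mean
:math:`\mu`, and :math:`N_c` samples in class :math:`c`:

.. math::

    S_W = \sum_{c=1}^{C} \sum_{x \in c} (x - \mu_c)(x - \mu_c)^{T},
    \qquad
    S_B = \sum_{c=1}^{C} N_c (\mu_c - \mu)(\mu_c - \mu)^{T}

The discriminants returned by :code:`LDA_discriminants` are the eigenvalues
of :math:`S_W^{-1} S_B`, and :code:`LDA` projects the data onto the
eigenvectors belonging to the largest eigenvalues.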
"""
from __future__ import division
import numpy as np
def LDA_base(x, labels):
"""
Base function used for Linear Discriminant Analysis.
**Args:**
* `x` : input matrix (2d array), every row represents new sample
* `labels` : list of labels (iterable), every item should be label for \
sample with corresponding index
**Returns:**
* `eigenvalues`, `eigenvectors` : eigenvalues and eigenvectors \
from LDA analysis
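
    A minimal sketch of using the returned pair directly (assuming
    :code:`x` and :code:`labels` as in the module examples; this mirrors
    what :code:`LDA` does internally):

    .. code-block:: python

        eigen_values, eigen_vectors = LDA_base(x, labels)
        order = (-eigen_values).argsort() # strongest discriminants first
        projection = eigen_vectors.T[order][:2] # keep two discriminants
        new_x = projection.dot(x.T).T # project every sample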
"""
classes = np.array(tuple(set(labels)))
cols = x.shape[1]
# mean values for every class
means = np.zeros((len(classes), cols))
for i, cl in enumerate(classes):
means[i] = np.mean(x[labels == cl], axis=0)
# scatter matrices
scatter_within = np.zeros((cols, cols))
for cl, mean in zip(classes, means):
scatter_class = np.zeros((cols, cols))
for row in x[labels == cl]:
dif = row - mean
scatter_class += np.dot(dif.reshape(cols, 1), dif.reshape(1, cols))
scatter_within += scatter_class
total_mean = np.mean(x, axis=0)
scatter_between = np.zeros((cols, cols))
for cl, mean in zip(classes, means):
dif = mean - total_mean
dif_product = np.dot(dif.reshape(cols, 1), dif.reshape(1, cols))
scatter_between += x[labels == cl, :].shape[0] * dif_product
# eigenvalues and eigenvectors from scatter matrices
scatter_product = np.dot(np.linalg.inv(scatter_within), scatter_between)
eigen_values, eigen_vectors = np.linalg.eig(scatter_product)
return eigen_values, eigen_vectors
def LDA(x, labels, n=False):
"""
Linear Discriminant Analysis function.
**Args:**
* `x` : input matrix (2d array), every row represents new sample
* `labels` : list of labels (iterable), every item should be label for \
sample with corresponding index
**Kwargs:**
    * `n` : number of features returned (integer) - how many columns \
    should the output keep
**Returns:**
    * `new_x` : matrix with reduced size (the number of columns equals `n`)
"""
n = n if n else x.shape[1] - 1
assert x.shape[1] > n, "The requested n is bigger than \
number of features in x."
# make the LDA
eigen_values, eigen_vectors = LDA_base(x, labels)
# sort the eigen vectors according to eigen values
eigen_order = eigen_vectors.T[(-eigen_values).argsort()]
return eigen_order[:n].dot(x.T).T
def LDA_discriminants(x, labels):
"""
    Linear Discriminant Analysis helper for determining how many columns
    the data can be reduced to.
**Args:**
* `x` : input matrix (2d array), every row represents new sample
* `labels` : list of labels (iterable), every item should be label for \
sample with corresponding index
**Returns:**
* `discriminants` : array of eigenvalues sorted in descending order
"""
# validate inputs
    try:
        x = np.array(x)
    except Exception:
        raise ValueError('Impossible to convert x to a numpy array.')
# make the LDA
eigen_values, eigen_vectors = LDA_base(x, labels)
return eigen_values[(-eigen_values).argsort()]