Linear Discriminant Analysis


Discriminant analysis is a predictive, ad hoc classification technique, so named because the groups or classes are known before the classification is carried out, unlike post hoc techniques such as decision trees, where the classification groups are derived from running the technique itself without being known in advance.


It is an ideal technique for building a predictive model to forecast the group or class to which an observation belongs, based on characteristics that define its profile.

As the name implies, discriminant analysis helps identify the characteristics that differentiate (discriminate) two or more groups, and to create a function capable of distinguishing, as accurately as possible, the members of one group from another.

On the other hand, LDA is also a dimensionality reduction method: taking the n independent variables of the data set, the method extracts p <= n new independent variables that contribute most to the class separation of the dependent variable.

The interpretation of the differences between groups consists of determining:

  • To what extent a set of characteristics allows us to extract dimensions that differentiate the groups, and
  • Which of these characteristics contribute the most to those dimensions, that is, which have the greatest power of discrimination.

In summary, we can say that Linear Discriminant Analysis (LDA) has the following applications:

  • It is widely used as a dimensionality reduction technique
  • It is used as a pre-processing step for pattern classification
  • Its objective is to project a data set onto a smaller space

The objective of LDA is to project the feature space (a data set of n dimensions) onto a smaller subspace k, where k <= n - 1, while maintaining the class-discriminatory information.

Both PCA (Principal Component Analysis) and LDA (Linear Discriminant Analysis) are linear transformation techniques used for dimensionality reduction; however, PCA is unsupervised while LDA is supervised, given its relationship with the dependent variable.

Source: https://sebastianraschka.com/Articles/2014_python_lda.html
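To make this distinction concrete, the following minimal sketch (using made-up toy data, not the wine data set discussed later) shows that in scikit-learn PCA is fitted with X alone, while LDA also needs the class labels y:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy data: 100 samples, 5 features, 3 classes (assumed only for this comparison)
X = np.random.rand(100, 5)
y = np.random.randint(0, 3, 100)

# PCA is unsupervised: it is fitted with X alone
X_pca = PCA(n_components=2).fit_transform(X)

# LDA is supervised: it also needs the dependent variable y
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)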

The 5 general steps for linear discriminant analysis are listed below (a minimal numpy sketch of these steps follows the list):

  1. Compute the d-dimensional mean vectors for the different classes of the data set.
  2. Compute the scatter matrices (between classes and within classes).
  3. Compute the eigenvectors (e1, e2, ..., ed) and their corresponding eigenvalues (λ1, λ2, ..., λd) for the scatter matrices.
  4. Sort the eigenvectors in decreasing order of their eigenvalues and select the k eigenvectors with the largest eigenvalues to form a d x k matrix W (where each column represents an eigenvector).
  5. Use this d x k matrix of eigenvectors to transform the samples into the new subspace. This can be summarized by the matrix multiplication Y = X x W (where X is an n x d matrix representing the n samples and Y is the n x k transformation of the samples into the new subspace).
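As a reference, here is a minimal numpy sketch of these five steps; the toy data and variable names are illustrative assumptions, not the wine example developed later:

import numpy as np

# Toy data: n = 150 samples, d = 4 features, 3 classes (illustrative assumptions)
X = np.random.rand(150, 4)
y = np.random.randint(0, 3, 150)
d = X.shape[1]
classes = np.unique(y)

# Step 1: d-dimensional mean vector for each class
means = {c: X[y == c].mean(axis=0) for c in classes}
overall_mean = X.mean(axis=0)

# Step 2: within-class (S_W) and between-class (S_B) scatter matrices
S_W = np.zeros((d, d))
S_B = np.zeros((d, d))
for c in classes:
    Xc = X[y == c]
    S_W += (Xc - means[c]).T @ (Xc - means[c])
    diff = (means[c] - overall_mean).reshape(d, 1)
    S_B += Xc.shape[0] * (diff @ diff.T)

# Step 3: eigenvectors and eigenvalues of inv(S_W) . S_B
eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)

# Step 4: sort by decreasing eigenvalue and keep the k largest eigenvectors (d x k matrix W)
k = 2
order = np.argsort(eig_vals.real)[::-1]
W = eig_vecs[:, order[:k]].real

# Step 5: project the samples into the new subspace, Y = X x W (n x k matrix)
Y = X @ W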

If the objective is to reduce a d-dimensional data set by projecting it onto a k-dimensional subspace (where k < d), how do we know what size to choose for k (the dimension of the new subspace), and how do we know whether that subspace correctly represents the data?

We compute the scatter matrices from the data set and obtain their eigenvectors; each of these vectors is associated with an eigenvalue, which indicates its size or magnitude.

If we observe that all the eigenvalues have a similar magnitude, this is a good indicator that the data is already projected onto a good subspace.

But if some of the eigenvalues are much larger than the others, it is preferable to keep only the eigenvectors with the largest eigenvalues, since they contain more information about the distribution of the data. Eigenvalues close to zero are less informative and should not be used to build the new subspace.
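In scikit-learn this inspection can be done without computing the eigenvalues by hand. A brief sketch, assuming X and y already hold the independent and dependent variables:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Fit without limiting n_components so that all discriminants are available
lda_full = LinearDiscriminantAnalysis()
lda_full.fit(X, y)

# Share of between-class variance captured by each linear discriminant;
# components with a ratio close to zero add little discriminatory information
print(lda_full.explained_variance_ratio_)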

Implementation of LDA with Python

For the exercise with Python we will consider a data set with the characteristics of three categories of wine. The characteristics consist of 13 variables (degree of alcohol, acidity, aroma, etc.) used to form three customer segments that represent the classes or groups. In total there are 179 records or samples.

Alcohol | Malic Acid | Ash | Ash Alcanity | Magnesium | Total Phenols | Flavanoids | Nonflavanoid Phenols | Proanthocyanins | Color_Intensity | Hue | OD280 | Proline | Customer Segment
14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 | 1
13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 | 1
13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 | 1
14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 | 1
13.24 | 2.59 | 2.87 | 21 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 | 1
14.2 | 1.76 | 2.45 | 15.2 | 112 | 3.27 | 3.39 | 0.34 | 1.97 | 6.75 | 1.05 | 2.85 | 1450 | 1
14.39 | 1.87 | 2.45 | 14.6 | 96 | 2.5 | 2.52 | 0.3 | 1.98 | 5.25 | 1.02 | 3.58 | 1290 | 1
14.06 | 2.15 | 2.61 | 17.6 | 121 | 2.6 | 2.51 | 0.31 | 1.25 | 5.05 | 1.06 | 3.58 | 1295 | 1
14.83 | 1.64 | 2.17 | 14 | 97 | 2.8 | 2.98 | 0.29 | 1.98 | 5.2 | 1.08 | 2.85 | 1045 | 1
13.86 | 1.35 | 2.27 | 16 | 98 | 2.98 | 3.15 | 0.22 | 1.85 | 7.22 | 1.01 | 3.55 | 1045 | 1
14.1 | 2.16 | 2.3 | 18 | 105 | 2.95 | 3.32 | 0.22 | 2.38 | 5.75 | 1.25 | 3.17 | 1510 | 1
14.12 | 1.48 | 2.32 | 16.8 | 95 | 2.2 | 2.43 | 0.26 | 1.57 | 5 | 1.17 | 2.82 | 1280 | 1
13.75 | 1.73 | 2.41 | 16 | 89 | 2.6 | 2.76 | 0.29 | 1.81 | 5.6 | 1.15 | 2.9 | 1320 | 1
14.75 | 1.73 | 2.39 | 11.4 | 91 | 3.1 | 3.69 | 0.43 | 2.81 | 5.4 | 1.25 | 2.73 | 1150 | 1
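If the Vinos.csv file is not at hand, the same 13-feature wine data ships with scikit-learn; this sketch (the file name and the segment numbering are assumptions) rebuilds an equivalent CSV with the class in the last column:

import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df['Customer_Segment'] = wine.target + 1   # segments 1, 2 and 3, as in the table above
df.to_csv('Vinos.csv', index=False)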

To start with the model, we load the file with the data and separate into X the independent variables (columns 1 to 13) and into y the dependent variable (column 14) with the customer segment.

# LDA
# Import of libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importation of the dataset
dataset = pd.read_csv('Vinos.csv')
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

We divide the data set into a training set and a test set to train the LDA model, then we apply feature scaling and create the LDA model to reduce the dimension to 2 variables.

To verify that with 2 of the 13 variables we can still predict or classify the 3 groups of wines, we use a logistic regression and check the confusion matrix.

In [1]:
# Import of libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
In [3]:
# Importation of the dataset
dataset = pd.read_csv('Vinos.csv')
X = dataset.iloc[:, 0:13].values
Y = dataset.iloc[:, 13].values
In [4]:
# We divide the data set into training sample and 
# test sample
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, 
                                                    test_size = 0.2, 
                                                    random_state = 0)
In [5]:
# Adjustment of Scales
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
In [6]:
# Applying LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
In [7]:
# We check the resulting independent variables with 
# a Logistic regression to determine that with only two
# variables we get the right prediction
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
Out [7]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=0, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
In [8]:
# Prediction of the test set to 
# compare the results
y_pred = classifier.predict(X_test)
In [9]:
# We create the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
In [10]:
# Visualization of training data
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.title('Logistic Regression (Training Set)')
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.legend()
plt.show()
In [11]:
# Visualization of the test results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.title('Logistic Regression (Test Set)')
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.legend()
plt.show()

We observe that the confusion matrix for the group prediction made by the logistic regression on the new, reduced-dimension variables is very accurate and shows no errors.

Confusion matrix:
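Since the image of the matrix is not reproduced here, a quick way to display the matrix computed in cell In [9], together with the corresponding accuracy, is the following sketch (the exact values depend on the split obtained above):

from sklearn.metrics import accuracy_score

print(cm)                                        # rows: actual segment, columns: predicted segment
print('Accuracy:', accuracy_score(y_test, y_pred))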
