Showing posts with label Dimensionality reduction. Show all posts
Showing posts with label Dimensionality reduction. Show all posts

Principal Component Analysis Explained with Examples


(Originally published on August 21, 2018)

Any machine learning model building task begins with a collection of data vectors wherein each vector consists of a fixed number of components. These components represent the measurements, known as attributes or features, deemed useful for the given machine learning task at hand. The number of components, i.e. the size of the vector, is termed as the dimensionality of the feature space. When the number of features is large, we are often interested in reducing their number to limit the number of training examples needed  to strike a proper balance with the number of model parameters.  One way to reduce the number of features is to look for a subset of original features via some suitable search technique. Another way to reduce the number of features or dimensionality is to map or transform the original features in to another feature space of smaller dimensionality. The Principal Component Analysis (PCA) is an example of this feature transformation approach where the new features are constructed by applying a linear transformation on the original set of features. The use of PCA does not require knowledge of the class labels associated with each data vector. Thus, PCA is characterized as a linear, unsupervised  technique for dimensionality reduction.

Basic Idea Behind PCA

The basic idea behind PCA is to exploit the correlations between the original features. To understand this, lets look at the following two plots showing how a pair of variables vary together. In the left plot, there is no relationship in how X-Y values are varying; the values seem to be varying randomly. On the other hand, the variations in the right side plot exhibits a pattern; the Y values are moving up in a linear fashion. In terms of correlation, we say that the values in the left plot show no correlation while the values in the right plot show good correlation. It is not hard to notice that given a X-value from the right plot, we can reasonably guess the Y-value; however this cannot be done for X-values in the left graph. This means that we can represent the data in the right plot with a good approximation lying along a line, that is we can reduce the original two-dimensional data in one dimensions. Thus achieving dimensionality reduction. Of course, such a reduction is not possible for data in the left plot where there is no correlation in X-Y pairs of values.

PCA Steps

Now that we know the basic idea behind the PCA, let's look at the steps needed to perform PCA. These are:

  • We start with N d-dimensionaldata vectors, $\boldsymbol{x}_i, i= 1, \cdots,N$, and find the eigenvalues and eigenvectors of the sample covariance matrix of size d x d using the given data
  • We select the top k eigenvalues,  d, and use the corresponding eigenvectors to define the linear transformation matrix A of size k x d for transforming original features into the new space.
  • Obtain the transformed vectors, $\boldsymbol{y}_i, i= 1, \cdots,N$, using the following relationship. Note the transformation involves first shifting the origin of the original feature space using the mean of the input vectors as shown below, and then applying the transformation. 

$\boldsymbol{y}_i = \bf{A}(\bf{x}_i - \bf{m}_x)$

  • The transformed vectors are the ones we then use for visualization and building our predictive model. We can also recover the original data vectors with certain error by using the following relationship

$\boldsymbol{\hat x}_i = \boldsymbol{A}^t\boldsymbol{y}_i + \boldsymbol{m}_x$

  • The mean square error (mse) between the original and reconstructed vectors is the sum of the eigenvalues whose corresponding eigenvectors are not used in the transformation matrix A.

$ e_{mse} = \sum\limits_{j=k+1}\limits^d \lambda_j$

  • Another way to look at how good the PCA is doing is by calculating the percentage variability, P, captured by the eigenvectors corresponding to top k eigenvalues. This is expressed by the following formula

$ P = \frac{\sum\limits_{j=1}^k \lambda_j}{\sum_{j=1}^d \lambda_j}$

A Simple PCA Example

Let's look at PCA computation in python using 10 vectors in three dimensions. The PCA calculations will be done following the steps given above. Lets first describe the input vectors, calculate the mean vector and the covariance matrix.






Next, we get the eigenvalues and eigenvectors. We are going to reduce the data to two dimensions. So we form the transformation matrix A using the eigenvectors of top two eigenvalues.


With the calculated A matrix, we transform the input vectors to obtain vectors in two dimensions to complete the PCA operation.


Looking at the calculated mean square error, we find that it is not equal to the smallest eigenvalue (0.74992815) as expected. So what is the catch here? It turns out that the formula used in calculating the covariance matrix assums the number of examples, N, to be large. In our case, the number of examples is rather small, only 10. Thus, if we multiply the mse value by N/(N-1), known as the small sample correction, we will get the result identical to the smallest eigenvalue. As N becomes large, the ratio N/(N-1) approaches unity and no such correction is required.

The above PCA computation was deliberately done through a series of steps. In practice, the PCA can be easily done using the scikit-learn implementation as shown below.


Before wrapping up, let me summarize a few takeaways about PCA.

  • You should not expect PCA to provide much reduction in dimensionality if the original features have little correlation with each other.
  • It is often a good practice to perform data normalization prior to applying PCA. The normalization converts each feature to have a zero mean and unit variance. Without normalization, the features showing large variance tend to dominate the result. Such large variances could also be caused by the scales used for measuring different features. This can be easily done by using sklearn.preprocessing.StandardScalar class.
  • Instead of performing PCA using the covariance matrix, we can also use the correlation matrix. The correlation matrix has a built-in normalization of features and thus the data normalization is not needed. Sometimes, the correlation matrix is referred to as the standardized covariance matrix.
  • Eigenvalues and eigenvectors are typically calculated by the singular value decomposion (SVD) method of matrix factorization. Thus, PCA and SVD are often viewed the same. But you should remember that the starting point for PCA is a collection of data vectors that are needed to compute sample covariance/correlation matrices to perform eigenvector decomposition which is often done by SVD.

CCA for Finding Latent Relationships and Dimensionality Reduction

Canonical Correlation Analysis (CCA) is a powerful statistical technique. In machine learning and multimedia information retrieval, CCA plays a vital role in uncovering intricate relationships between different sets of variables. In this blog post, we will look into this powerful technique and show how it can be used for finding hidden correlations as well as for dimensionality reduction.

To understand CCA's capabilities, let’s take a look at two sets of observations, X and Y, shown below. These two sets of observations are made on the same set of objects and each observation represents a different variable.

Two Sets of Observations


When computing the pairwise correlation between the column vectors of X and Y, we obtain the following set of values, where the entry at (i,j) represents the correlation between the i-th column of X and the j-th column of Y.

The resulting correlation values give us some insight between the two sets of measurements. The correlation values show moderate to almost no correlation between the columns of the two datasets except a relatively higher correlation between the second column of X and the third column of Y.

Hidden Relationship

It looks like there is not much of a relationship between X and Y. Is that so? Let's wait before concluding that X and Y do not have much of a relationship.

Lets transform X and Y into one-dimensional arrays, a and b, using the vectors [-0.427 -0.576 0.696] and [0 0 -1].

a = X[-0.427 -0.576 0.696]T

b = Y[0 0 -1]T

Now, let's calculate the correlation between a and b. Wow! we get a correlation value of 0.999, meaning that the two projections of X and Y are very strongly correlated. In other words, there is a very strong hidden relationship present in our two sets of observations. Wow! How did we end up getting a and b?” The answer to this is the canonical correlation analysis. 

What is Canonical Correlation Analysis?

Canonical correlation analysis is a technique that looks for pairs of basis vectors for two sets of variables X and Y such that the correlation between the projections of X and Y  onto these basis vectors are mutually maximized. In other words, the transformed arrays show much higher correlation to bring out any hidden relationship. The number of pairs of such basis vectors is limited to the smallest dimensionality of X and Y. For example, if X is an array of size nxp and Y of size nxq, then the number of basis vectors cannot exceed min{p,q}.

Assume wx and wy be the pair of basis vectors projecting X and Y into a and b given by a=Xwx, and b=Ywy. The projections a and b are called the scores or the canonical variates. The correlation between the projections, after some algebraic manipulation, can be expressed as:

$\Large \rho = \frac{\bf{w}_{x}^T \bf{C}_{xy}\bf{w}_{y}}{\sqrt{\bf{w}_{x}^T \bf{C}_{xx}\bf{w}_{x}\bf{w}_{y}^T \bf{C}_{yy}\bf{w}_{y}}}$,

where CxxCxy and Cyy  are three covariance matrices. The canonical correlations between X and Y are found by solving the eigenvalue equations

$ \bf{C}_{xx}^{-1}\bf{C}_{xy}\bf{C}_{yy}^{-1}\bf{C}_{yx}\bf{w}_x = \rho^2 \bf{w}_x$

$ \bf{C}_{yy}^{-1}\bf{C}_{yx}\bf{C}_{xx}^{-1}\bf{C}_{xy}\bf{w}_y = \rho^2 \bf{w}_y$

The eigenvalues in the above solution correspond to the squared canonical correlations and the corresponding eigenvectors yield the needed basis vectors. The number of non-zero solutions to these equations are limited to the smallest dimensionality of X and Y.

CCA Example

Let’s take a look at an example using the wine dataset from the sklearn library. We will divide the13 features of the dataset into X and Y sets of observations. The class labels in our example will act as hidden or latent feature. First, we will load the data, split it into X and Y and perform feature normalization.

from sklearn.datasets import load_wine
import numpy as np
wine = load_wine()
X = wine.data[:, :6]# Form X using first six features
Y = wine.data[:, 6:]# Form Y using the remaining seven features
# Perform feature normalization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
Y = scaler.fit_transform(Y)

Next, we import CCA object and fit the data. After that we obtain the canonical variates. In the code below, we are calculating 3 projections, X_c and Y_c, each for X and Y.

from sklearn.cross_decomposition import CCA
cca = CCA(n_components=3)
cca.fit(X, Y)
X_c, Y_c = cca.transform(X, Y)
We can now calculate the canonical correlation coefficients to see what correlation values are obtained.
cca_corr = np.corrcoef(X_c.T, Y_c.T).diagonal(offset=3)
print(cca_corr)

[0.90293514 0.73015495 0.51667522]

The highest canonical correlation value is 0.9029, indicating a strong hidden relationship between the two sets of vectors. Let us now try to visualize whether these correlations have captured any hidden relationship or not. In the present example, the underlying latent information not available to CCA is the class-membership of different measurements in X and Y. To check this, I have plotted the scatter plots of the three sets of x-y canonical variates where each variate pair is colored using the class label not accessible to CCA. These plots are shown below. It is clear that the canonical variates associated with the highest correlation coefficient show the existence of three groups in the scatter plot. This means that CCA is able to discern the presence of a hidden variable that reflects the class membership of the different observations.


Summary

Canonical Correlation Analysis (CCA) is a valuable statistical technique that enables us to uncover hidden relationships between two sets of variables. By identifying the most significant patterns and correlations, CCA helps gain valuable insights with numerous potential applications. CCA can be also used for dimensionality reduction. In machine and deep learning, CCA has been used for cross-modal learning and cross-modal retrieval.