Data dimensionality reduction
As pointed out before, if we have more dimensions (or variables) than samples in our data, we can either augment the data or reduce its dimensionality. Now, we will address the basics of the latter.
We will look into reducing dimensions both in supervised and unsupervised ways with both small and large datasets.
Supervised algorithms
Supervised algorithms for dimensionality reduction are so called because they take the labels of the data into account to find better representations. Such methods often yield good results. Perhaps the most popular kind is called linear discriminant analysis (LDA), which we'll discuss next.
Linear discriminant analysis
Scikit-learn has a LinearDiscriminantAnalysis class that can easily reduce data to a desired number of components.
By number of components, we mean the number of dimensions desired. The term comes from principal component analysis (PCA), a statistical approach that computes the eigenvectors and eigenvalues of the covariance matrix of a centered dataset; the eigenvectors associated with the largest eigenvalues are the most important, or principal, components. When we use PCA to reduce to a specific number of components, we keep the most important components of the space induced by the eigenvectors and eigenvalues of the covariance matrix of the data.
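To see this connection in practice, here is a minimal sketch on hypothetical toy data (not the heart disease dataset) showing that the explained variances reported by scikit-learn's PCA match the largest eigenvalues of the sample covariance matrix:

import numpy as np
from sklearn.decomposition import PCA

# toy data with one dominant direction of variance
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3)) * np.array([5.0, 1.0, 0.1])

# eigendecomposition of the sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(X_toy, rowvar=False))

# PCA keeps the components associated with the largest eigenvalues
pca = PCA(n_components=2).fit(X_toy)
print(eigvals[::-1][:2])          # two largest eigenvalues
print(pca.explained_variance_)    # should match, up to numerical noise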
LDA and other dimensionality reduction techniques also have a similar philosophy in which they aim to find low-dimensional spaces (based on the number of components desired) that can better represent the data based on other properties of the data.
If we use the heart disease dataset as an example, we can perform LDA to reduce the entire dataset from 13 dimensions to 2 dimensions, all the while using the labels [0, 1, 2, 3, 4] to inform the LDA algorithm how to better separate the groups represented by those labels.
To achieve this, we can follow these steps:
- First, we reload the data and drop the missing values:
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# coerce every column to numeric and drop rows with missing values
df = pd.read_csv('processed.cleveland.data', header=None)
df = df.apply(pd.to_numeric, errors='coerce').dropna()
Notice that we did not have to deal with missing values on the heart disease dataset before, because pandas automatically ignores them. Here, however, we strictly convert the data into numbers: specifying errors='coerce' forces any value that cannot be converted (such as the '?' placeholders used for missing entries) to become NaN. Consequently, dropna() removes the rows containing those values, which would otherwise cause LDA to fail.
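As a quick, hypothetical illustration of what errors='coerce' does:

import pandas as pd

# values that cannot be parsed as numbers (such as '?') become NaN
print(pd.to_numeric(pd.Series(['1.0', '?', '3']), errors='coerce'))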
- Next, we prepare the X and y variables to contain the data and targets, respectively, and we perform LDA as follows:
# the first 13 columns are the features; column 13 is the target
X = df[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]].values
y = df[13].values

dr = LinearDiscriminantAnalysis(n_components=2)
X_ = dr.fit_transform(X, y)
In this example, X_ contains the entire dataset represented in two dimensions, as given by n_components=2. The choice of two components is simply to illustrate graphically how the data looks. But you can change this to any number of components you desire.
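A scatter plot similar to the one in Figure 3.8 can be produced with matplotlib; this is just one possible way to visualize the reduced data, not necessarily the exact code behind the figure:

import matplotlib.pyplot as plt

# color each point by its target value (0 = no heart disease, 1-4 = disease)
plt.scatter(X_[:, 0], X_[:, 1], c=y, cmap='viridis')
plt.colorbar(label='target')
plt.xlabel('First LDA component')
plt.ylabel('Second LDA component')
plt.show()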
Figure 3.8 depicts how the 13-dimensional dataset looks if compressed, or reduced, down to two dimensions:
Notice how the values with 0 (no heart disease) are mostly clustered toward the left side, while the rest of the values (that is, 1, 2, 3, and 4, which represent heart disease) seem to cluster toward the right side. This is a nice property that was not observed in Figures 3.2 to 3.6 when we picked two columns out of the 13.
Technically speaking, the relevant information of the 13 dimensions is still contained in the LDA-induced two dimensions. If the data seems to be separable in these low-dimensional representations, a deep learning algorithm may have a good chance of learning representations to classify or regress on the data with high performance.
While LDA can offer a very nice way to perform dimensionality reduction informed by the labels in the data, we might not always have labeled data, or we may not want to use the labels that we have. In those cases we can, and we should, explore other robust methodologies that require no label information, such as unsupervised techniques, which we'll discuss next.
Unsupervised techniques
Unsupervised techniques are the most popular methods because they need no prior information about labels. We begin with a kernelized version of PCA and then we move on to methods that operate on larger datasets.
Kernel PCA
This variant of PCA uses kernel methods to estimate distances, variances, and other parameters that determine the major components of the data (Schölkopf, B., et al. (1997)). It may take a bit longer than regular PCA to produce a solution, but it is often worth using over the traditional approach.
The KernelPCA class of scikit-learn can be used as follows:
from sklearn.decomposition import KernelPCA

# reduce the 13-dimensional heart disease data to two dimensions
dr = KernelPCA(n_components=2, kernel='linear')
X_ = dr.fit_transform(X)
Again, we use two dimensions as the new space, and we use a 'linear' kernel. Other popular choices for the kernel include the following:
- 'rbf' for a radial basis function kernel
- 'poly' for a polynomial kernel
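For instance, a radial basis function kernel could be tried as follows; the gamma value here is an arbitrary choice that you would normally tune:

# same reduction as before, but with a non-linear (RBF) kernel
dr = KernelPCA(n_components=2, kernel='rbf', gamma=0.01)
X_ = dr.fit_transform(X)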
The result of using kernel PCA is shown in Figure 3.9. The diagram again shows a clustering arrangement of the negative class (no heart disease, a value of 0) toward the bottom left of the KPCA-induced space. The positive class (heart disease, values ≥ 1) tends to cluster upward:
Compared to Figure 3.8, LDA produces a slightly better space in which the groups can be separated. However, KPCA does a good job in spite of not knowing the actual target classes. Now, LDA and KPCA might take no time on small datasets, but what if we have a lot of data? We will discuss some options next.
Large datasets
The previous examples will work well with moderate-sized datasets. However, when dealing with very large datasets, that is, with many dimensions or many samples, some algorithms may not function at their best. In the worst case, they will fail to produce a solution. The next two unsupervised algorithms are designed to function well for large datasets by using a technique called batch training. This technique is well known and has been applied in machine learning successfully (Hinton, G. E. (2012)).
The main idea is to divide the dataset into small (mini) batches and partially make progress toward finding a global solution to the problem at hand.
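A rough sketch of this idea is shown below, using scikit-learn's IncrementalPCA, a related estimator (not one of the algorithms discussed next) that exposes batch-wise updates directly through partial_fit; the MiniBatch classes in the following sections handle the batching internally through their batch_size argument:

import numpy as np
from sklearn.decomposition import IncrementalPCA

# stand-in data: 1,000 samples with 784 features each
X_big = np.random.rand(1000, 784)

# feed the data to the estimator one mini-batch at a time
ipca = IncrementalPCA(n_components=2)
for batch in np.array_split(X_big, 20):   # 20 batches of 50 samples
    ipca.partial_fit(batch)               # update the running solution

X_reduced = ipca.transform(X_big)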
Sparse PCA
We'll first look into a sparse-coding version of PCA available in scikit-learn as MiniBatchSparsePCA. This algorithm will determine the best transformation into a subspace that satisfies a sparsity constraint.
Follow the next steps to reduce the dimensions of the MNIST dataset, which has 784 dimensions and 70,000 samples. It is large enough for our purposes, but even larger datasets can be handled in the same way:
- We begin by reloading the data and preparing it for the sparse PCA encoding:
from sklearn.datasets import fetch_openml

# download MNIST from OpenML: 70,000 samples with 784 pixel values each
mnist = fetch_openml('mnist_784')
X = mnist.data
- Then we perform the dimensionality reduction as follows:
from sklearn.decomposition import MiniBatchSparsePCA

# learn a two-dimensional sparse representation, 50 samples at a time
dr = MiniBatchSparsePCA(n_components=2, batch_size=50,
                        normalize_components=True)
X_ = dr.fit_transform(X)
Here, the MiniBatchSparsePCA() constructor takes three arguments:
- n_components, which we set to 2 for visualization purposes.
- batch_size determines how many samples the algorithm will use at a time. We set it to 50, but larger numbers may cause the algorithm to slow down.
- normalize_components refers to the preprocessing of the data by centering it, that is, giving it a zero mean; we recommend doing this every time, especially if you have data that is highly correlated, such as images. Note that newer releases of scikit-learn have removed this argument and always apply this normalization.
The MNIST dataset transformed using sparse PCA looks as depicted in Figure 3.10:
As you can see, the separation between classes is not perfectly clear. There are some definite clusters of digits, but classification does not look like a straightforward task because of the overlap between groups. This is caused in part by the fact that many digits look alike: it makes sense, for example, to see the numerals 1 and 7 clustered close together (on the left side, top and bottom), or 3 and 8 (toward the middle and top).
But let's also use another popular and useful algorithm called Dictionary Learning.
Dictionary Learning
Dictionary Learning is the process of learning a basis of transformations, called a dictionary, in a way that scales easily to very large datasets (Mairal, J., et al. (2009)).
The algorithm is available in scikit-learn through the MiniBatchDictionaryLearning class. We can use it as follows:
from sklearn.decomposition import MiniBatchDictionaryLearning

# learn a two-atom dictionary and encode MNIST in that space
dr = MiniBatchDictionaryLearning(n_components=2, batch_size=50)
X_ = dr.fit_transform(X)
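If you want to inspect what was learned, the dictionary atoms are stored in the fitted estimator; for MNIST, each atom is a 784-dimensional vector that can be reshaped into a 28×28 image:

# two dictionary atoms of 784 pixels each
print(dr.components_.shape)               # (2, 784)
atom = dr.components_[0].reshape(28, 28)  # viewable as a 28x28 image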
The constructor MiniBatchDictionaryLearning() takes arguments similar to those of MiniBatchSparsePCA(), with the same meaning. The results of the learned space are shown in Figure 3.11:
As can be seen, there is significant overlap among the classes even though there are clearly defined clusters. This could lead to poor performance if this two-dimensional data is used as input to train a classifier. This does not necessarily mean that the algorithms are bad; it may simply mean that two dimensions are not the best choice for the final number of dimensions. Continue reading to learn more about this.
Regarding the number of dimensions
Reducing dimensions is not always a necessary step. But it is highly recommended for data that is highly correlated, for example, images.
All the dimensionality reduction techniques discussed here strive to remove redundant information from the data while preserving the important content. If we ask an algorithm to reduce a non-correlated, non-redundant dataset from 13 dimensions to 2, that sounds a bit risky; perhaps 8 or 9 would be a better choice.
No serious-minded machine learner would try to reduce a non-correlated, non-redundant dataset of 784 dimensions to only 2. Even when the data is highly correlated and redundant, like the MNIST dataset, going from 784 down to 2 is a big stretch: it is a risky decision that may throw away important, discriminant, relevant information; perhaps 50 or 100 would be a better choice.
There is no general way of finding how many dimensions are right. It is a process that requires experimentation: if you want to become good at this, do your due diligence and run at least two or more experiments with different numbers of dimensions.
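One common heuristic for choosing a starting point in such experiments is to look at how much of the data's variance a plain PCA retains as a function of the number of components. This is only a sketch of that idea, using the cumulative explained variance ratio on whatever data is currently held in X (the MNIST features at this point):

import numpy as np
from sklearn.decomposition import PCA

# fit a full PCA and find how many components retain, say, 95% of the variance
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_dims = int(np.argmax(cumulative >= 0.95)) + 1
print(n_dims, 'components retain 95% of the variance')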