Dimensionality reduction is often regarded as being part of the exploring step. It’s useful when there are too many features to plot. You could create a scatter-plot matrix, but that only shows you two features at a time. Dimensionality reduction is also useful as a pre-processing step for other machine learning algorithms.
Most dimensionality reduction algorithms are unsupervised. This means that they don’t employ the labels of the data points in order to construct the lower-dimensional mapping.
In this section we’ll look at two techniques: PCA, which stands for Principal Components Analysis (Pearson 1901), and t-SNE, which stands for t-distributed Stochastic Neighbor Embedding (Maaten and Hinton 2008). Both are implemented in Tapkee, a C++ library for dimensionality reduction that offers many algorithms, including:
- Locally Linear Embedding
- Isomap
- PCA
- t-SNE

Tapkee’s website contains more information about these algorithms. Although Tapkee is mainly a library that can be included in other applications, it also offers a command-line tool. We’ll use this to perform dimensionality reduction on our wine data set.
If you aren’t running the Data Science Toolbox, you’ll need to download and compile Tapkee yourself. First make sure that you have CMake installed. On Ubuntu, you simply run:
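```
# Install CMake (required to build Tapkee) from the Ubuntu repositories
$ sudo apt-get install cmake
```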
Please consult Tapkee’s website for instructions for other operating systems. Then execute the following commands to download the source and compile it:
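```
# A sketch of the usual steps, assuming Tapkee's GitHub repository and its
# standard CMake build; consult Tapkee's website if the layout has changed.
$ git clone https://github.com/lisitsyn/tapkee.git
$ cd tapkee
$ mkdir build && cd build
$ cmake ..
$ make
```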
This creates a binary executable named `tapkee`.
To scale the features, we standardize each column so that it has zero mean and unit variance.
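The snippet below is a sketch of that step using awk (not necessarily the tools used in the book); the file names wine-both.csv and wine-features-scaled.csv, and the assumption that the type label is the last column, may need to be adapted:

```
$ < wine-both.csv awk -F, -v OFS=, '
    NR == 1 { next }                          # skip the header row
    {
      line[NR] = $0; nf = NF; n++             # buffer the rows for a second pass
      for (i = 1; i < NF; i++) {              # last column is assumed to be the label
        sum[i] += $i; sumsq[i] += $i * $i
      }
    }
    END {
      for (i = 1; i < nf; i++) {
        mean[i] = sum[i] / n
        sd[i] = sqrt(sumsq[i] / n - mean[i] * mean[i])
        if (sd[i] == 0) sd[i] = 1             # avoid dividing by zero for constant columns
      }
      for (r = 2; r <= NR; r++) {
        split(line[r], f, ",")
        out = ""
        for (i = 1; i < nf; i++) {
          z = (f[i] - mean[i]) / sd[i]        # z-score
          sep = (i > 1) ? OFS : ""
          out = out sep z
        }
        print out                             # scaled features only; label dropped
      }
    }' > wine-features-scaled.csv
```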
Now we apply both dimensionality reduction techniques and visualize the resulting two-dimensional mappings.
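The sketch below assumes that the tapkee binary reads comma-separated feature values on standard input, writes the embedded coordinates to standard output, and selects the algorithm with a --method flag; check the tool’s help output if your build differs:

```
$ tapkee --method pca < wine-features-scaled.csv > wine-pca.csv
$ tapkee --method t-sne < wine-features-scaled.csv > wine-tsne.csv
```

The embedded coordinates in each output file can then be plotted with your favorite plotting tool, colored by wine type.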
Figure 9.1: PCA
Note that there’s not a single GNU coreutil (i.e., classic command-line tool) in this step. Now that’s the power of the command line!