In Machine Learning (ML) classification problems, the final categorization often depends on many factors, some of them redundant, which makes the training set hard to visualize and work with. These factors are called features, and the process of reducing their number is called dimensionality reduction. It is divided into feature selection and feature extraction.
Dimensionality reduction plays an essential role in simplifying complicated data sets. For instance, in a simple junk email classifier, features that largely overlap can be combined into a single feature that carries the same information. Another way to view dimensionality reduction is to imagine projecting a three-dimensional problem onto a two-dimensional plane, or even onto a single line.
Dimensionality reduction comprises two components:
- Feature selection: identifies a manageable subset of the original features that still models the problem well. It is commonly carried out with one of three kinds of method: Filter, Wrapper, and Embedded.
- Feature extraction: transforms the data from a higher-dimensional space to a lower-dimensional one, producing new features rather than choosing among the existing ones.
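To make the Filter approach to feature selection concrete, here is a minimal sketch that uses only statistics of the data and no model. It drops features that are constant and features that are almost perfectly correlated with one already kept. The function name `filter_select`, the thresholds, and the spam-style feature names are all hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def filter_select(columns, min_variance=1e-8, max_abs_corr=0.95):
    """Return the names of the features kept by a simple filter.

    columns: dict mapping feature name -> list of numeric values.
    Thresholds are illustrative assumptions, not recommended defaults.
    """
    kept = []
    for name, values in columns.items():
        if pvariance(values) <= min_variance:
            continue  # (near-)constant feature carries no information
        if any(abs(pearson(values, columns[k])) > max_abs_corr for k in kept):
            continue  # redundant with a feature that is already kept
        kept.append(name)
    return kept


# Hypothetical junk-email features (made-up values).
features = {
    "num_links": [0, 5, 1, 8, 0, 7],
    "num_urls": [0, 10, 2, 16, 0, 14],   # exactly 2 * num_links: redundant
    "has_header": [1, 1, 1, 1, 1, 1],    # constant
    "num_exclaims": [3, 0, 4, 1, 0, 5],
}
selected = filter_select(features)
print(selected)
```

Wrapper and Embedded methods differ in that they involve a learning algorithm when judging features, so they are not covered by this sketch.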
Common dimensionality reduction techniques include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Generalized Discriminant Analysis (GDA). PCA, for instance, maps the data to a lower-dimensional space in a way that preserves as much of the variance as possible. Even so, the variance along the discarded directions is lost, so some information may be lost in the process.
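The following is a minimal sketch of PCA, assuming NumPy is available and using the eigendecomposition of the covariance matrix (one of several equivalent ways to compute it). The data is synthetic: three-dimensional points lying close to a two-dimensional plane. The function name `pca` and all variable names are illustrative.

```python
import numpy as np


def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                              # center each feature
    cov = np.cov(Xc, rowvar=False)             # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]          # sort by variance, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]     # directions of most variance
    Z = Xc @ components                        # lower-dimensional coordinates
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return Z, components, mean, explained_ratio


# Synthetic 3D data that lies near a 2D plane, plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.0, 1.0, -0.3]])
X = latent @ mixing + 0.05 * rng.normal(size=(200, 3))

Z, components, mean, ratio = pca(X, n_components=2)

# Map back to 3D to see how much information the projection discarded.
X_hat = Z @ components.T + mean
reconstruction_error = np.mean((X - X_hat) ** 2)
print(Z.shape, ratio.sum(), reconstruction_error)
```

The reconstruction error is small but not zero, which is the information loss described above: whatever varied along the dropped direction cannot be recovered from the two retained coordinates.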
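Dimensionality reduction techniques can be classified as either linear or nonlinear, depending on whether the mapping from the original space to the reduced space is a linear transformation. PCA and LDA, for example, are linear methods.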
Dimensionality reduction has both advantages and disadvantages. It compresses the data, which saves storage space and computation time, and it helps remove redundant features. However, it can discard some information, and with PCA it is not always clear how many principal components to keep. Moreover, PCA captures only linear relationships between variables, which is a limitation when the underlying structure is nonlinear, and it can fail when the data is not adequately characterized by its mean and covariance.
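One common heuristic for deciding how many principal components to keep is to take the smallest number whose cumulative explained variance reaches a chosen threshold. The sketch below assumes NumPy; the 95% threshold, the function name `choose_n_components`, and the synthetic data are assumptions made for illustration, not a rule stated by the text above.

```python
import numpy as np


def choose_n_components(X, threshold=0.95):
    """Smallest number of components whose cumulative variance ratio
    reaches `threshold`, together with the cumulative ratios."""
    X = np.asarray(X, dtype=float)
    # Eigenvalues of the covariance matrix, largest first.
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    eigvals = np.clip(eigvals, 0.0, None)      # guard against tiny negatives
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # Index of the first cumulative ratio that is >= threshold.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(eigvals)), cumulative


# Synthetic 5D data driven by two underlying factors plus small noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 2))
mixing = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0, 1.0]])
X = latent @ mixing + 0.05 * rng.normal(size=(300, 5))

k, cumulative = choose_n_components(X, threshold=0.95)
print(k, np.round(cumulative, 3))
```

On data like this, generated from two factors, the cumulative ratio rises sharply over the first two components and then flattens, which is the pattern the threshold is meant to detect. Real data rarely shows such a clean break, so the threshold remains a judgment call.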