Feature engineering is the series of steps that transform raw data into features suitable for machine learning algorithms, particularly predictive models. Predictive models require an outcome variable alongside predictor variables, and the strongest predictor variables are created and selected during feature engineering. Since 2016, some machine learning tools have begun to incorporate automated feature engineering. Feature engineering primarily comprises four processes: feature creation, transformation, feature extraction, and feature selection.
In essence, feature engineering is the creation, alteration, extraction, and selection of the best variables, or features, for building an accurate machine learning model. These processes involve:
Creation of Features

The first phase of feature engineering is identifying all pertinent predictor variables that could fit into the model. This is a primarily theoretical task, accomplished by reviewing related literature, consulting domain experts, or brainstorming.
Often, while developing predictive models, individuals make the mistake of concentrating solely on the accessible data without thinking about the data they need. This common error can lead to two main issues:
- Crucial predictor variables might be omitted from the model. In a model predicting property values, for instance, knowing the kind of property is crucial. If this data is not readily available, it must be sought before developing a predictive model.
- Variables that need to be derived from existing data might not be. For example, the Body Mass Index (BMI) is an excellent indicator of health outcomes. BMI is calculated by dividing an individual's weight by the square of their height. Including BMI alongside height, weight, and other relevant factors will give much better results than including height and weight alone.
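Deriving a variable like BMI from existing columns is a one-line operation in most data tools. As a minimal sketch (the column names and values here are invented for illustration), using pandas:

```python
import pandas as pd

# Hypothetical patient data; column names are illustrative assumptions.
patients = pd.DataFrame({
    "height_m": [1.60, 1.75, 1.82],
    "weight_kg": [55.0, 80.0, 95.0],
})

# Derive BMI = weight / height^2 as a new feature alongside the originals.
patients["bmi"] = patients["weight_kg"] / patients["height_m"] ** 2

print(patients)
```

The model then receives `bmi` as a predictor in addition to the raw height and weight columns, rather than having to learn that ratio itself.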
Transformation of Features

Transformation is the process of changing a predictor variable to enhance its performance in the predictive model. Key factors to consider when transforming features include:
- Compatibility with the data types that machine learning and statistical models expect.
- Ease of interpretation: prediction models in which all predictors are on the same scale are more straightforward to understand.
- Improved prediction accuracy.
- Avoidance of computational errors, since some algorithms can produce incorrect results with very large input values.
Extraction of Features

Unlike transformations, which create new variables by altering existing ones individually, feature extraction derives new variables from sets of other variables. For instance, Principal Component Analysis (PCA) can reduce a large number of predictor variables to a manageable number of components.
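As a minimal sketch of PCA-based extraction, using scikit-learn on synthetic data (the sample sizes and noise level are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 10 predictors driven by 2 underlying factors.
latent = rng.normal(size=(200, 2))      # two hidden factors
mixing = rng.normal(size=(2, 10))       # each predictor mixes the factors
X = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

# Extract 2 principal components from the 10 original predictors.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (200, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```

The model can then be trained on the two extracted components instead of the ten original predictors, with little loss of information when the predictors are highly correlated.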
Selection of Features
Feature selection is about choosing which predictor variables to include in a model. It might seem simple to include every available feature and let the predictive model decide, but the reality is more complex: the algorithm may not be designed to handle every available factor, and the machine might even crash under the sheer weight of potential predictor variables.
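One simple, automated selection strategy is a univariate filter that keeps the predictors most associated with the outcome. A minimal sketch using scikit-learn's `SelectKBest` on synthetic data (one of many possible strategies; the problem sizes here are arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression problem: 20 candidate predictors, only 5 informative.
X, y = make_regression(n_samples=300, n_features=20,
                       n_informative=5, random_state=0)

# Keep the 5 predictors most associated with the outcome (univariate F-test).
selector = SelectKBest(score_func=f_regression, k=5)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                    # (300, 5)
print(selector.get_support(indices=True))  # indices of the kept predictors
```

Filters like this are cheap but consider each predictor in isolation; wrapper methods that evaluate whole feature subsets against model performance are more thorough and more expensive.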
In conclusion, feature selection involves a combination of intuition, theory, and evaluation of how different feature combinations perform in predictive modelling. This also underlines the importance of testing, CI/CD, and monitoring in ML systems, which tend to be more fragile than anticipated.