Machine learning fundamentally merges statistical methods with computational techniques. It is driven by algorithms, or models, which are at their core statistical approximations. Because data distributions vary in practice, every applied model carries its share of flaws; and since models are mere approximations, none of them can be entirely accurate.
These limitations are typically captured by the terms “bias” and “variance.” A model with high bias oversimplifies, paying too little attention to the training points and underfitting the data. A model with high variance, on the other hand, fits the training data too closely and fails to adapt to unseen test points.
These constraints become a conundrum when the candidate models differ only marginally, such as when choosing between a random forest and a gradient boosting model, or between two versions of the same decision tree algorithm. Both, indeed, exhibit high variance and low bias.
Essential Model Selection
Model selection is the process of determining the most suitable model after the candidates have been evaluated on the relevant criteria.
Resampling techniques employ a simple strategy: reorganize the data into new samples and check whether the model performs well on samples it was not trained on. Resampling thus gives a sense of whether the model will generalize effectively.
Random splits randomly sample a portion of the data and divide it into training, test, and, ideally, validation sets. This method helps ensure that the original population is well represented across all the subsets, avoiding biased sampling.
When the validation set is used for model selection, it is important to be clear about its role. With both a test and a validation set present, the validation set effectively functions as a second test set. The test set supports model evaluation during feature selection and tuning: the best feature set and model parameters are chosen based on test-set performance. The validation set is then reserved for the final assessment on entirely unseen data points.
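A minimal sketch of such a three-way random split with scikit-learn's `train_test_split`, following the naming above, where the validation set is kept for the final assessment (the 60/20/20 ratios and the synthetic dataset are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# First set aside 20% as the validation set, held back for the
# final assessment on entirely unseen data points.
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Split the remainder into training and test sets; the test set
# guides feature selection and tuning (0.25 of 80% -> 20% overall).
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_test), len(X_val))  # 600 200 200
```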
The cross-validation protocol shuffles the dataset randomly and splits it into k groups. One group is then held out as the test set, the remaining groups are used as the training set, and the model is evaluated on the held-out group. Repeating this process for each of the k groups yields a distinct score per test group.
Through this iterative process, model selection becomes relatively easy: average the k scores and opt for the model that scores highest.
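The procedure can be sketched with scikit-learn's `KFold` and `cross_val_score` (the logistic regression model, k=5, and the synthetic dataset are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Shuffle the data and split it into k=5 groups; each group
# serves once as the test set while the rest form the training set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(scores)         # one accuracy score per test group
print(scores.mean())  # average score used to compare models
```

Running the same loop for each candidate model and comparing the mean scores implements the selection step described above.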
Stratified K-Fold follows the same approach as K-Fold cross-validation, the distinguishing factor being that stratified k-fold preserves the distribution of the target variable's values in each fold, whereas plain k-fold cross-validation doesn't.
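This difference is easiest to see on an imbalanced target. A small sketch with scikit-learn's `StratifiedKFold` (the 80/20 class split and fold count are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)  # imbalanced target: 80% / 20%

# Stratification keeps the class proportions in every fold.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each 5-point test fold preserves the ratio: 4 zeros, 1 one.
    print(np.bincount(y[test_idx]))  # [4 1]
```

With plain `KFold` on the same data, some test folds could end up with no minority-class points at all.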
The bootstrap is commonly regarded as one of the most robust techniques for deriving a stable model, and it resembles the random-splitting method in its use of random sampling.
The first step is to decide on a sample size, usually equal to that of the original dataset. Then choose a random data point from the original dataset and add it to the bootstrap sample, repeating this draw N times, where N is the sample size.
The bootstrap sample is thus built from data points resampled, with replacement, from the original dataset. This means you may come across multiple instances of the same data point in the bootstrap sample.
The model is trained on the bootstrap sample and then tested on the data points absent from it, referred to as the out-of-bag samples.
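The steps above can be sketched with NumPy and scikit-learn; the model choice and dataset are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, random_state=0)
n = len(X)  # sample size N, equal to the original dataset size

# Draw N indices with replacement: duplicates are expected.
boot_idx = rng.integers(0, n, size=n)

# Out-of-bag samples are the points never drawn into the bootstrap.
oob_mask = ~np.isin(np.arange(n), boot_idx)

model = LogisticRegression(max_iter=1000).fit(X[boot_idx], y[boot_idx])
print("OOB size:", oob_mask.sum())  # roughly 37% of n on average
print("OOB accuracy:", model.score(X[oob_mask], y[oob_mask]))
```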
While model selection and model evaluation processes can initially seem complex, they become part and parcel of your routine with frequent practice and dedicated time investment. Different challenges necessitate different strategies, making it crucial to choose methods that align with your project requirements.