Data science often has to cope with incomplete information. Missing data obstructs effective modeling and analysis, so rows with missing entries must be handled properly, either by removing them or by substituting suitable values.
Classifications of Missing Data
1. Missing Completely At Random (MCAR):
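This is the most arbitrary type of missing data: the probability that a value is missing is unrelated to any values in the data, so one feature's missing data points are influenced neither by the values of other features nor by the missing values themselves. It's the ideal case for handling missing data.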
2. Missing At Random (MAR):
Here, the probability that a feature's value is missing depends on the observed values of other features, but not on the missing value itself.
3. Missing Not At Random (MNAR):
Here, the probability that a value is missing depends on the unobserved value itself. This is a significant issue and warrants closer inspection of the data collection process. Understanding the reason behind the missing data is vital. For instance, if a majority of survey participants skipped a particular question, it is essential to find out why. Was the question ambiguous?
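The three mechanisms can be made concrete by simulating them. The sketch below is illustrative only: the `age` and `income` variables, the missingness probabilities, and the cut-offs are all made-up assumptions, not values from any real data set.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical survey variables (synthetic, for illustration only).
age = rng.integers(18, 80, size=n)
income = rng.normal(50_000, 15_000, size=n)

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: income is more likely to be missing for older respondents,
# i.e. missingness depends on another, fully observed feature.
mar_mask = rng.random(n) < np.where(age > 60, 0.4, 0.1)

# MNAR: high incomes are more likely to be withheld,
# i.e. missingness depends on the missing value itself.
mnar_mask = rng.random(n) < np.where(income > 65_000, 0.5, 0.05)

summary = {
    "full mean": income.mean(),
    "observed mean under MCAR": income[~mcar_mask].mean(),
    "observed mean under MNAR": income[~mnar_mask].mean(),
}
for name, value in summary.items():
    print(f"{name}: {value:,.0f}")
```

Under MNAR the mean of the observed incomes falls below the full mean, because the withheld values are systematically the high ones. That is why this case cannot be repaired by looking at the observed data alone.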
Strategies for Managing Missing Values
Assessing the Scope:
After detecting the missing values in the data, gauging the extent of these absences becomes paramount.
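As a minimal sketch of this step, assuming the data sits in a pandas DataFrame, the share of missing values can be measured per feature and per case. The toy columns below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, np.nan],
    "income": [52000, 61000, np.nan, 58000, 49000],
    "city": ["Oslo", "Lima", None, "Pune", "Kyiv"],
})

# Percentage of missing values in each feature (column).
missing_per_feature = df.isna().mean() * 100

# Percentage of missing values in each case (row).
missing_per_case = df.isna().mean(axis=1) * 100

# Number of cases with no missing value at all.
complete_cases = int(df.notna().all(axis=1).sum())

print(missing_per_feature)
print(missing_per_case)
print("complete cases:", complete_cases)
```

The per-feature and per-case percentages are the figures that the thresholds in the strategies below are compared against.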
Overlooking Insignificant Missing Values:
If the missing values constitute less than 10% of an individual case, they can generally be overlooked, unless the data is MAR or MNAR, in which case ignoring them can bias the results. Even then, there should be enough complete cases to sustain the selected analytical method.
Deletion of Features:
As a rule of thumb, if a feature has over 5% of its data missing and the data is classified as MCAR or MAR, it's prudent to consider removing that feature. If any cases have missing values in the dependent variable(s), it's best to discard those cases to avoid artificially inflating the relationships with the independent variables.
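A minimal pandas sketch of both rules follows. The column names, the choice of `target` as the dependent variable, and the order of the two steps are assumptions made for illustration; the 5% threshold is the rule of thumb mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "target" plays the role of the dependent variable.
df = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "feature_b": [np.nan, np.nan, np.nan, 1.0, 2.0, 3.0],
    "target": [0.0, 1.0, np.nan, 1.0, 0.0, 1.0],
})

threshold = 0.05  # 5% rule of thumb

# Discard cases whose dependent variable is missing.
with_target = df.dropna(subset=["target"])

# Share of missing values in each remaining independent feature.
missing_share = with_target.drop(columns=["target"]).isna().mean()

# Drop the features whose missing share exceeds the threshold.
to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = with_target.drop(columns=to_drop)

print("dropped features:", to_drop)
print(reduced)
```

On a data set this small, any single missing value exceeds 5%, so the threshold only becomes meaningful with realistic sample sizes.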
Deletion of Cases:
This involves eliminating instances with missing values in one or more attributes. While straightforward, this method can shrink the sample considerably, and it can introduce bias if the data is not missing completely at random.
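The loss of sample size is easy to see on a toy example. The sketch below assumes a pandas DataFrame with hypothetical columns, in which three of the four cases each miss a single value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; three of the four cases miss exactly one value.
df = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0],
    "weight": [65.0, 72.0, np.nan, 80.0],
    "age": [30.0, 41.0, 29.0, np.nan],
})

# Keep only the cases with no missing value in any attribute.
complete = df.dropna()

# Fraction of the original sample that survives the deletion.
retained = len(complete) / len(df)

print(complete)
print(f"retained {retained:.0%} of the cases")
```

Only 3 of the 12 cells are missing, yet only one of the four cases survives, because the gaps are spread over different rows.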
Regression:
Here, a feature with missing data is treated as the dependent variable and the other features as predictors. A linear equation is fitted on the cases where the feature is observed and is then used to predict the missing data points. This method assumes a linear relationship, which might not always be accurate.
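A minimal sketch of this idea with a single predictor, using NumPy's `polyfit` to fit the line. The columns `x` and `y` are hypothetical, and the sketch assumes that the predictor `x` is fully observed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "y" has gaps, "x" is fully observed.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [3.0, 5.0, np.nan, 9.0, np.nan],
})

observed = df["y"].notna()

# Fit y = slope * x + intercept on the cases where y is observed.
slope, intercept = np.polyfit(df.loc[observed, "x"], df.loc[observed, "y"], deg=1)

# Predict y for the cases where it is missing and fill the gaps.
df.loc[~observed, "y"] = slope * df.loc[~observed, "x"] + intercept

print(df)
```

Because the observed points here lie exactly on the line y = 2x + 1, the gaps are filled with values on that line. With real data the fit is only approximate, and the imputed values inherit the linearity assumption.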
K-Nearest Neighbour (KNN):
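This method replaces a missing value using the values of the k nearest neighbors, that is, the k most similar cases according to a distance measure computed on the other features. For quantitative attributes the neighbors' values are typically averaged, and for qualitative attributes the most frequent value is typically taken. It's suitable for both kinds of attributes, but it becomes computationally expensive as the number of cases and variables grows.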
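For quantitative attributes, one readily available implementation is scikit-learn's `KNNImputer`, which by default measures similarity with a NaN-aware Euclidean distance and averages the neighbors' values. The sketch below uses a small made-up matrix and assumes `n_neighbors=2`. Qualitative attributes would need to be encoded first, which is not shown here.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small made-up matrix: rows are cases, columns are numeric features.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is replaced by the mean of that feature
# over the 2 nearest cases that have the feature observed.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

In the first row, the two nearest cases with an observed third feature are the second and third rows, so the gap is filled with the mean of 3 and 5.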