Exploratory Data Analysis (EDA)

Introduction to Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is an approach to examining datasets by summarizing their key properties, typically with visual methods. Before any modeling, EDA helps uncover what the raw data actually contains. Extracting relevant insights from long series of numbers or large spreadsheets is rarely straightforward and can be tedious or overwhelming. EDA techniques exist to make this process easier.

Classification of EDA Techniques

EDA techniques can be classified along two dimensions. The first distinguishes non-graphical from graphical methods. The second distinguishes univariate from multivariate methods, with bivariate being the most common multivariate case.

Steps in Conducting EDA

1. Understanding the Background

Before digging into the data, get a general understanding of where it comes from and what it is for. Talk to executives or the product team to gather as much background and context as possible. Whether you intend to predict a trend or simply carry out research can greatly influence the direction and focus of your EDA.

2. Handling Missing Data

Once you have defined the direction of your study, it is time to delve into the data itself. Begin by identifying any missing data. For this and subsequent analyses, it is advisable to evaluate each feature one by one and prioritize them according to their relevance to your study.

Despite our best efforts, identifying the reasons behind missing data is not always straightforward. This is why the field of statistical imputation exists, offering a variety of solutions dedicated to this issue. The approach you take largely depends on your data type. For instance, in time series data with no trend or seasonality, missing values can be filled in with the mean or median.
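The two steps above, counting missing values per feature and filling them with the median, can be sketched as follows. This is a minimal illustration using only the Python standard library; the feature names, the values, and the convention that None marks a missing entry are assumptions made for the example, not part of any real dataset.

```python
from statistics import median

# Hypothetical toy dataset: each key is a feature, None marks a missing value.
data = {
    "delivered_orders": [120, None, 131, 127, None, 125],
    "fulfilled_orders": [118, 122, 130, None, 121, 124],
}

def count_missing(columns):
    """Return the number of missing (None) entries per feature."""
    return {name: sum(v is None for v in values) for name, values in columns.items()}

def impute_median(values):
    """Replace each None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

missing = count_missing(data)
imputed = {name: impute_median(values) for name, values in data.items()}

print(missing)
print(imputed["delivered_orders"])
```

Median imputation is only reasonable under the condition stated above (no trend or seasonality). For a series that does trend, a constant fill value would distort the pattern, and a different method would be needed.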

3. Analyzing Data Shape

Next, analyze the shape of the data. For a time series dataset, observe how the feature changes over time: it might exhibit seasonality or a linear trend in either the positive or negative direction. Calculate the mean and variance of the feature as well, then draw conclusions from the observed patterns. Features with extremely low or high variance may warrant further investigation.
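As a rough sketch of these checks, the snippet below computes the mean, the sample variance, and a least-squares slope against the time index for a single feature. The series values are invented for illustration, and using a fitted slope as a quick indicator of linear trend direction is one possible choice among several.

```python
from statistics import mean, variance

# Hypothetical weekly values of one feature, in time order.
series = [10.0, 12.0, 13.0, 15.0, 18.0, 19.0]

def trend_slope(values):
    """Least-squares slope of the values against their time index 0..n-1."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = mean(values)
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

feature_mean = mean(series)
feature_var = variance(series)  # sample variance (divides by n - 1)
slope = trend_slope(series)     # > 0 suggests an upward linear trend

print(feature_mean, feature_var, slope)
```

A slope near zero does not rule out seasonality; a periodic pattern can average out to no net trend, so plotting the series remains worthwhile.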

Probability Density Functions (PDFs) and Probability Mass Functions (PMFs) are invaluable tools for characterizing the distribution of continuous and discrete features, respectively.
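In practice the true PDF or PMF is unknown and is estimated from the sample. The sketch below shows one simple way to do that: relative frequencies for a discrete feature, and a normalized histogram for a continuous one. The feature names, values, bin count, and range are assumptions chosen for the example.

```python
from collections import Counter

def empirical_pmf(values):
    """Relative frequency of each distinct value of a discrete feature."""
    counts = Counter(values)
    n = len(values)
    return {value: count / n for value, count in sorted(counts.items())}

def histogram_density(values, bins, low, high):
    """Histogram-based density estimate: bar heights whose areas sum to 1.

    Assumes every value lies within [low, high].
    """
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        idx = min(int((v - low) / width), bins - 1)  # v == high goes in the last bin
        counts[idx] += 1
    n = len(values)
    return [c / (n * width) for c in counts]

# Hypothetical discrete feature: orders placed per customer.
orders_per_customer = [1, 2, 2, 3, 1, 2, 4, 2]
pmf = empirical_pmf(orders_per_customer)

# Hypothetical continuous feature: delivery time in hours.
delivery_hours = [1.2, 1.9, 2.1, 2.4, 2.8, 3.1, 3.3, 3.9]
density = histogram_density(delivery_hours, bins=3, low=1.0, high=4.0)

print(pmf)
print(density)
```

The histogram estimate depends heavily on the number of bins, so it is worth trying a few bin counts before reading much into the shape.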

4. Correlation Analysis

Correlation analysis investigates the relationship between two variables. For instance, a scatter plot clearly depicts the correlation between two discrete features like 'Delivered Orders' and 'Fulfilled Orders'. However, creating such a plot for every pair of features is time-consuming when the dataset has many of them. Constructing the Pearson correlation matrix is a practical alternative. It measures the linear association between each pair of features in your dataset and assigns each pair a value between -1 and 1. Positive values indicate a positive relationship, negative values indicate a negative one, and values near 0 indicate little or no linear association.
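To make the computation concrete, here is a minimal sketch of a Pearson correlation matrix written with only the Python standard library. The three features and their values are made up for illustration; with a data analysis library you would normally call its built-in correlation routine instead of writing this by hand.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Nested dict holding the Pearson coefficient for every pair of features."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical toy dataset.
data = {
    "delivered_orders": [120, 126, 131, 127, 122, 125],
    "fulfilled_orders": [118, 124, 130, 125, 121, 124],
    "refund_requests": [9, 7, 4, 6, 8, 7],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

The matrix is symmetric with ones on the diagonal, so only the entries above (or below) the diagonal need to be inspected. Because the coefficient captures linear association only, a value near 0 does not exclude a nonlinear relationship.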

Concluding Notes

Taking note of these feature correlations is crucial, as they can provide useful insights for your study. You may or may not find significant relationships between the features in your dataset.

To wrap up: be mindful of missing data in your dataset, try to understand its cause, and devise a plan to address it. Briefly describe your features and categorize them, since the category determines which visual and statistical methods you employ. Visualizing your data distribution gives you a better understanding of it and can reveal unexpected elements. Familiarizing yourself with how your data changes over time and across samples can also prove valuable. Finally, keep track of relationships among your data attributes; these associations might turn out to be useful down the line.
