Data Preprocessing using Python

Data Reduction using Variance Threshold, Univariate Feature Selection, Recursive Feature Elimination, and PCA.

Muskan Jindal
5 min read · Oct 28, 2021

When building a machine learning model on real-world data, it is rare that all the variables in the dataset are useful. Adding redundant variables reduces the generalization capability of the model and may also lower the overall accuracy of a classifier. Furthermore, each additional variable increases the overall complexity of the model. The model may generalize better if irrelevant features are removed. Thus, feature selection is an indispensable part of building machine learning models.

The goal of feature selection in machine learning is to find the best set of features that allows one to build useful models of studied phenomena. In this blog, we will compare the performance of several feature selection approaches on the same data.

For carrying out data reduction, I have used the Iris dataset from the sklearn.datasets library.

First, I imported all required libraries and inspected the dataset. The data has four distinct features. Some additional noise features were then added to the dataset to test the effectiveness of the different feature selection methods.
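A minimal sketch of this setup follows; the exact number and distribution of the noise columns (here, ten Gaussian noise features with a fixed random seed) are assumptions chosen to bring the total to 14 features.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris data as a DataFrame (4 original features).
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# Append 10 random noise columns so the dataset has 14 features in total.
rng = np.random.RandomState(42)
noise = pd.DataFrame(
    rng.normal(size=(X.shape[0], 10)),
    columns=[f"noise_{i}" for i in range(10)],
)
X = pd.concat([X, noise], axis=1)
print(X.shape)  # (150, 14)
```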

Now there are 14 features in the dataset. Before applying any feature selection approach, we must split the data, because features should be chosen using only the training set, not the entire dataset. To evaluate both the feature selection and the model, a portion of the data is set aside as a test set, so its information remains hidden while we select features and train the model.
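A sketch of this split, assuming a 70/30 stratified split with a fixed random seed:

```python
from sklearn.model_selection import train_test_split

# Hold out a test set before any feature selection so the test data
# never influences which features are chosen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```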

Variance Threshold

Variance Threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features. Our dataset has no zero-variance features, so the data is unaffected here.
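A minimal sketch of applying VarianceThreshold with its default threshold of 0:

```python
from sklearn.feature_selection import VarianceThreshold

# The default threshold=0.0 removes only zero-variance (constant) features.
selector = VarianceThreshold(threshold=0.0)
X_train_vt = selector.fit_transform(X_train)

# No feature in this dataset is constant, so all 14 columns survive.
print(X_train_vt.shape[1])  # 14
```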

Univariate Feature Selection

  • Univariate feature selection works by selecting the best features based on univariate statistical tests.
  • We compare each feature to the target variable, to see whether there is a statistically significant relationship between them.
  • When we analyze the relationship between one feature and the target variable we ignore the other features. That is why it is called ‘univariate’.
  • Each feature has its own test score.
  • Finally, all the test scores are compared, and the features with top scores will be selected.
  • These objects take as input a scoring function that returns univariate scores and p-values (or only scores for SelectKBest and SelectPercentile), as in the sketch after this list:
      – For regression: f_regression, mutual_info_regression
      – For classification: chi2, f_classif, mutual_info_classif
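Below is a sketch of univariate selection on this data. It assumes SelectKBest with f_classif and k=4; chi2 would work the same way but requires non-negative feature values, which the Gaussian noise columns added above would violate.

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Score every feature against the target with the ANOVA F-test and keep
# the 4 highest-scoring features (k=4 is an assumption).
skb = SelectKBest(score_func=f_classif, k=4)
X_train_uni = skb.fit_transform(X_train, y_train)
X_test_uni = skb.transform(X_test)

# Which columns survived?
print(X_train.columns[skb.get_support()].tolist())
```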

Recursive Feature Elimination

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features, and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from the current set of features. That procedure is repeated recursively on the pruned set until the desired number of features is reached.
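A sketch of RFE on this data, assuming a LogisticRegression estimator and keeping four features; any estimator exposing coef_ or feature_importances_ could be used instead.

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Use the linear model's coefficients as the importance measure and
# recursively drop the weakest feature until 4 remain.
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=4, step=1)
rfe.fit(X_train, y_train)

print(X_train.columns[rfe.support_].tolist())  # selected features
print(rfe.ranking_)                            # rank 1 = selected
```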

Differences Between Before and After Using Feature Selection

You can observe a clear difference in precision, recall, F1-score, and accuracy between the two outputs. This shows how feature selection can improve the performance of the model.

Before using Feature Selection

After using Feature Selection
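The outputs above come from classification reports like the one sketched below; the choice of LogisticRegression as the classifier is an assumption, and the “after” run reuses the RFE mask from the previous section.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

clf = LogisticRegression(max_iter=1000)

# Before: train and evaluate on all 14 features (including the noise).
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

# After: train and evaluate on the RFE-selected features only.
clf.fit(X_train.loc[:, rfe.support_], y_train)
print(classification_report(y_test, clf.predict(X_test.loc[:, rfe.support_])))
```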

Principal Component Analysis (PCA)

One way to speed up the fitting of a machine learning algorithm is to change the optimization technique; a more common approach is to reduce the dimensionality of the input with Principal Component Analysis (PCA).

In many machine learning applications, it helps to be able to visualize your data. Visualizing data in two or three dimensions is straightforward, but the Iris dataset used here is four-dimensional. PCA is used to reduce the data so the DataFrame can be visualized in 2D as well as 3D.

PCA to 2D Projection:
There are four columns in the original data (sepal length, sepal width, petal length, and petal width). The code in this part converts four-dimensional data into two-dimensional data. The two main axes of variation are represented by the new components.

To visualize the result, the principal-component DataFrame is concatenated with the target column along axis=1.
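A sketch of the 2D projection, assuming the four original features are standardized before PCA and the points are colored by the target class:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the four original Iris features, then project onto 2 components.
features = iris.data  # the 4 original columns
scaled = StandardScaler().fit_transform(features)

pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

pc_df = pd.DataFrame(components, columns=["PC1", "PC2"])
# Concatenate the target column along axis=1 so it can be used for coloring.
final_df = pd.concat([pc_df, iris.target], axis=1)

final_df.plot.scatter(x="PC1", y="PC2", c="target", colormap="viridis")
plt.show()
```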

PCA Projection to 3D:

The original data has four columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original four-dimensional data into three dimensions. The new components are simply the three main dimensions of variation.
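A sketch of the 3D projection, reusing the standardized features from the 2D example:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the standardized features onto 3 principal components instead of 2.
pca3 = PCA(n_components=3)
components3 = pca3.fit_transform(scaled)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(components3[:, 0], components3[:, 1], components3[:, 2],
           c=iris.target, cmap="viridis")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
plt.show()
```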

In this blog, I have tried to use different feature selection methods on the same data and evaluated their performances.

Catch the entire code here.
