Feature Selection#

Welcome to this tutorial on Feature Selection, where we’ll explore practical Python examples to master this crucial data analysis technique.

Feature Selection, also called variable subset selection, is the process of choosing the most relevant features for model construction. Dropping irrelevant or redundant features can improve model accuracy and reduces the computational cost of training.

In this tutorial, we’ll employ the Iris dataset to illustrate various feature selection methods. Follow the step-by-step instructions and execute the code snippets by pressing SHIFT+ENTER in each code cell.

Importing Libraries and Configuration#

import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SequentialFeatureSelector

Settings#

# Configure pandas to display all columns of a DataFrame when printed to the console
pd.set_option('display.max_columns', None)

# Configure pandas to display all rows of a DataFrame when printed to the console
pd.set_option('display.max_rows', None)

Load data#

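# Load the Iris dataset
iris = datasets.load_iris()

# Standardize the features (zero mean, unit variance) for the model-based methods used later
scaler = StandardScaler()
normalized_data = scaler.fit_transform(iris.data)

# Build a DataFrame from the raw (unscaled) measurements and append the target column
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["species"] = iris.target
iris_df.head()
iris_df.describe()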
       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)     species
count         150.000000        150.000000         150.000000        150.000000  150.000000
mean            5.843333          3.057333           3.758000          1.199333    1.000000
std             0.828066          0.435866           1.765298          0.762238    0.819232
min             4.300000          2.000000           1.000000          0.100000    0.000000
25%             5.100000          2.800000           1.600000          0.300000    0.000000
50%             5.800000          3.000000           4.350000          1.300000    1.000000
75%             6.400000          3.300000           5.100000          1.800000    2.000000
max             7.900000          4.400000           6.900000          2.500000    2.000000

Filter#

Filter methods assess features based on their intrinsic properties, like the correlation with the target variable, independent of any predictive model. These methods are fast and effective for preliminary feature selection.
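
As a small illustration of scoring features against the target without training a predictive model, the sketch below ranks the Iris features with scikit-learn's SelectKBest. The scoring function f_classif (a one-way ANOVA F-test) and k=2 are assumptions chosen for this example, not settings used elsewhere in this tutorial.

```python
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_classif

iris = datasets.load_iris()

# Score each feature against the class labels with a one-way ANOVA F-test,
# then keep the k highest-scoring features (k=2 is an arbitrary choice here).
kbest = SelectKBest(score_func=f_classif, k=2)
kbest.fit(iris.data, iris.target)

# Show the score of every feature
for name, score in zip(iris.feature_names, kbest.scores_):
    print(f"{name}: F = {score:.1f}")

# Boolean mask of the features that were kept
kbest_mask = kbest.get_support()
print("kept:", [n for n, keep in zip(iris.feature_names, kbest_mask) if keep])
```

Because each feature is scored on its own, this kind of filter runs quickly but cannot detect features that are only useful in combination.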

Variance Threshold#

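This approach assumes that features with higher variance carry more information. The following code uses VarianceThreshold with a threshold of 0.6; features whose variance does not exceed this threshold are removed. Because variance depends on a feature's scale, the selector is applied to the raw measurements in iris.data (after standardization, every feature would have a variance of 1).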

from sklearn.feature_selection import VarianceThreshold
threshold = 0.6

# Apply VarianceThreshold
selector = VarianceThreshold(threshold=threshold)
data_reduced = selector.fit_transform(iris.data)

# Boolean mask of the kept features; append True so the species column is displayed too
selected_features = np.append(selector.get_support(), True)
display(iris_df.iloc[:, selected_features].head())
   sepal length (cm)  petal length (cm)  species
0                5.1                1.4        0
1                4.9                1.4        0
2                4.7                1.3        0
3                4.6                1.5        0
4                5.0                1.4        0
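
To see why exactly these two features pass the 0.6 threshold, it can help to print the variance of every feature next to the selector's decision. The sketch below is a self-contained check; the name vt_selector is introduced only for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold

iris = datasets.load_iris()
threshold = 0.6

vt_selector = VarianceThreshold(threshold=threshold).fit(iris.data)

# variances_ holds the per-feature variance computed during fit
for name, var, keep in zip(iris.feature_names,
                           vt_selector.variances_,
                           vt_selector.get_support()):
    print(f"{name}: variance = {var:.3f}, kept = {keep}")
```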

Wrapper#

Wrapper methods assess feature subsets using a predictive model, iteratively adding or removing features to search for a well-performing combination. These methods are computationally intensive because every candidate subset requires training and evaluating the model.

Sequential Feature Selection#

Sequential Feature Selection (SFS) is a wrapper method that greedily adds features (forward selection) or removes them (backward elimination) one step at a time. At each step it keeps the change that gives the best cross-validated model performance, which makes the choice less dependent on a single train/test split. Because the search is greedy, the resulting subset is not guaranteed to be the best possible one.

# Initialize k-Nearest Neighbors estimator for feature evaluation
knn = KNeighborsClassifier(n_neighbors=3)

# Forward selection to identify top 2 features
sfs_forward = SequentialFeatureSelector(knn, n_features_to_select=2, direction='forward')
sfs_forward.fit(normalized_data, iris.target)
selected_forward = sfs_forward.get_support()
print(f"forward selection result: {selected_forward}")
# Append target species column in the display
selected_forward = np.append(selected_forward, True)
display(iris_df.iloc[:, selected_forward].head())
# Backward elimination to identify top 2 features
sfs_backward = SequentialFeatureSelector(knn, n_features_to_select=2, direction='backward')
sfs_backward.fit(normalized_data, iris.target)
selected_backward = sfs_backward.get_support()
print(f"backward selection result: {selected_backward}")
# Append target species column in the display
selected_backward = np.append(selected_backward, True)
display(iris_df.iloc[:, selected_backward].head())
forward selection result: [False False  True  True]
   petal length (cm)  petal width (cm)  species
0                1.4               0.2        0
1                1.4               0.2        0
2                1.3               0.2        0
3                1.5               0.2        0
4                1.4               0.2        0
backward selection result: [False False  True  True]
   petal length (cm)  petal width (cm)  species
0                1.4               0.2        0
1                1.4               0.2        0
2                1.3               0.2        0
3                1.5               0.2        0
4                1.4               0.2        0
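
To make the forward variant more concrete, here is a minimal sketch of a greedy forward search written by hand with cross_val_score. The helper greedy_forward_selection is hypothetical and written only for this illustration; it is not scikit-learn's implementation, and the 5-fold setting is an assumption.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)
y = iris.target


def greedy_forward_selection(estimator, X, y, n_features, cv=5):
    """Hypothetical helper: add one feature at a time, keeping the best CV score."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features:
        best_score, best_feature = -np.inf, None
        for candidate in remaining:
            columns = selected + [candidate]
            # Mean cross-validated accuracy of the model on this candidate subset
            score = cross_val_score(estimator, X[:, columns], y, cv=cv).mean()
            if score > best_score:
                best_score, best_feature = score, candidate
        selected.append(best_feature)
        remaining.remove(best_feature)
    return sorted(selected)


chosen = greedy_forward_selection(KNeighborsClassifier(n_neighbors=3), X, y, n_features=2)
print([iris.feature_names[i] for i in chosen])
```

Each round retrains the model once per remaining feature and per fold, which is where the computational cost of wrapper methods comes from.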

Embedded Method#

In embedded methods, feature selection is built into the learning algorithm itself: the model selects features as part of its own training process.

Lasso Regression#

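Lasso regression is linear regression with an L1 regularization term, which penalizes the sum of the absolute values of the coefficients. The penalty encourages simpler models by shrinking the coefficients of less useful features to exactly zero, so the features with non-zero coefficients are the selected ones. The following code shows how to use Lasso regression to select features; note that it treats the class labels 0, 1, 2 as a numeric target.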

from sklearn.linear_model import Lasso
# Apply Lasso Regression for feature selection
lasso = Lasso(alpha=0.1)
lasso.fit(normalized_data, iris.target)

# Get the selected features (non-zero coefficients)
selected_features = lasso.coef_ != 0
# Print the coefficients from Lasso regression to show feature importance
print(f"Lasso coefficients: {lasso.coef_}")
# Append target species column in the display
selected_features = np.append(selected_features, True)
# Display the selected features
selected_features_df = iris_df.loc[:, selected_features]

selected_features_df.head()
Lasso coefficients: [ 0.         -0.          0.26332996  0.42746631]
   petal length (cm)  petal width (cm)  species
0                1.4               0.2        0
1                1.4               0.2        0
2                1.3               0.2        0
3                1.5               0.2        0
4                1.4               0.2        0
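
How many features survive depends on the regularization strength alpha. The sketch below refits Lasso for a few alpha values and counts the non-zero coefficients; the alpha grid is an arbitrary choice made for this illustration, and the class labels are again treated as a numeric target.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)

alphas = [0.01, 0.1, 0.5, 1.0]  # illustrative values, not tuned
nonzero_counts = []
for alpha in alphas:
    model = Lasso(alpha=alpha).fit(X, iris.target)
    # Count the features whose coefficient was not shrunk to zero
    nonzero_counts.append(int(np.sum(model.coef_ != 0)))
    print(f"alpha={alpha}: coefficients={np.round(model.coef_, 3)}")

print(dict(zip(alphas, nonzero_counts)))
```

A larger alpha applies a stronger penalty and therefore tends to keep fewer features; with a sufficiently large alpha every coefficient becomes zero. In practice alpha is usually chosen by cross-validation, for example with LassoCV.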