Data Preprocessing II#

This tutorial contains Python examples for data transformation, focusing on techniques for data normalization and discretization, . Follow the step-by-step instructions below carefully. To execute the code, click on the corresponding cell and press the SHIFT+ENTER keys simultaneously.

Data Transformation I#

Sometimes, the original values of an attribute might not be ideally suited for data analysis or modeling purposes. Data transformation is a process that involves converting the entire set of an attribute’s values into a new series of values. The approach is strategically utilized to improve the suitability of the data for different data mining objectives, facilitate more effective analyses, and potentially improve model accuracy and efficiency.

In this tutorial, we’ll focus on the “LotArea” attribute from the house price dataset, exploring transformation functions, data normalization, and data discretization.

Importing Libraries and Configuration#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.cluster import KMeans
from scipy.special import boxcox1p
from scipy.stats import skew
%matplotlib inline
from matplotlib.pylab import rcParams

Settings#

# Set the default figure size for matplotlib plots to 15 inches wide by 6 inches tall
rcParams["figure.figsize"] = (15, 6)

# Increase the default font size of the titles in matplotlib plots to extra-extra-large
rcParams["axes.titlesize"] = "xx-large"

# Make the titles of axes in matplotlib plots bold for better visibility
rcParams["axes.titleweight"] = "bold"

# Set the default location of the legend in matplotlib plots to the upper left corner
rcParams["legend.loc"] = "upper left"

# Configure pandas to display all columns of a DataFrame when printed to the console
pd.set_option('display.max_columns', None)

# Configure pandas to display all rows of a DataFrame when printed to the console
pd.set_option('display.max_rows', None)

Load Data#

data = datasets.fetch_openml(name="house_prices", as_frame=True)
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df['MedHouseVal'] = data.target
print('Number of samples = %d' % (df.shape[0]))
print('Number of attributes = %d' % (df.shape[1]))
display (df.head(n=10))
Number of samples = 1460
Number of attributes = 81
/opt/conda/lib/python3.11/site-packages/sklearn/datasets/_openml.py:1022: FutureWarning: The default value of `parser` will change from `'liac-arff'` to `'auto'` in 1.4. You can set `parser='auto'` to silence this warning. Therefore, an `ImportError` will be raised from 1.4 if the dataset is dense and pandas is not installed. Note that the pandas parser may return different data types. See the Notes Section in fetch_openml's API doc for details.
  warn(
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition MedHouseVal
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196.0 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NaN Attchd 2003.0 RFn 2 548 TA TA Y 0 61 0 0 0 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0.0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976.0 RFn 2 460 TA TA Y 298 0 0 0 0 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162.0 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001.0 RFn 2 608 TA TA Y 0 42 0 0 0 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0.0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998.0 Unf 3 642 TA TA Y 0 35 272 0 0 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350.0 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000.0 RFn 3 836 TA TA Y 192 84 0 0 0 0 NaN NaN NaN 0 12 2008 WD Normal 250000
5 6 50 RL 85.0 14115 Pave NaN IR1 Lvl AllPub Inside Gtl Mitchel Norm Norm 1Fam 1.5Fin 5 5 1993 1995 Gable CompShg VinylSd VinylSd None 0.0 TA TA Wood Gd TA No GLQ 732 Unf 0 64 796 GasA Ex Y SBrkr 796 566 0 1362 1 0 1 1 1 1 TA 5 Typ 0 NaN Attchd 1993.0 Unf 2 480 TA TA Y 40 30 0 320 0 0 NaN MnPrv Shed 700 10 2009 WD Normal 143000
6 7 20 RL 75.0 10084 Pave NaN Reg Lvl AllPub Inside Gtl Somerst Norm Norm 1Fam 1Story 8 5 2004 2005 Gable CompShg VinylSd VinylSd Stone 186.0 Gd TA PConc Ex TA Av GLQ 1369 Unf 0 317 1686 GasA Ex Y SBrkr 1694 0 0 1694 1 0 2 0 3 1 Gd 7 Typ 1 Gd Attchd 2004.0 RFn 2 636 TA TA Y 255 57 0 0 0 0 NaN NaN NaN 0 8 2007 WD Normal 307000
7 8 60 RL NaN 10382 Pave NaN IR1 Lvl AllPub Corner Gtl NWAmes PosN Norm 1Fam 2Story 7 6 1973 1973 Gable CompShg HdBoard HdBoard Stone 240.0 TA TA CBlock Gd TA Mn ALQ 859 BLQ 32 216 1107 GasA Ex Y SBrkr 1107 983 0 2090 1 0 2 1 3 1 TA 7 Typ 2 TA Attchd 1973.0 RFn 2 484 TA TA Y 235 204 228 0 0 0 NaN NaN Shed 350 11 2009 WD Normal 200000
8 9 50 RM 51.0 6120 Pave NaN Reg Lvl AllPub Inside Gtl OldTown Artery Norm 1Fam 1.5Fin 7 5 1931 1950 Gable CompShg BrkFace Wd Shng None 0.0 TA TA BrkTil TA TA No Unf 0 Unf 0 952 952 GasA Gd Y FuseF 1022 752 0 1774 0 0 2 0 2 2 TA 8 Min1 2 TA Detchd 1931.0 Unf 2 468 Fa TA Y 90 0 205 0 0 0 NaN NaN NaN 0 4 2008 WD Abnorml 129900
9 10 190 RL 50.0 7420 Pave NaN Reg Lvl AllPub Corner Gtl BrkSide Artery Artery 2fmCon 1.5Unf 5 6 1939 1950 Gable CompShg MetalSd MetalSd None 0.0 TA TA BrkTil TA TA No GLQ 851 Unf 0 140 991 GasA Ex Y SBrkr 1077 0 0 1077 1 0 1 0 2 2 TA 5 Typ 2 TA Attchd 1939.0 RFn 1 205 Gd TA Y 0 4 0 0 0 0 NaN NaN NaN 0 1 2008 WD Normal 118000

Transformation Functions.#

Transformation functions systematically adjust data values. In the following example, we’ll apply a Box-Cox power transformation to the “LotArea” attribute, aiming to reduce skewness and achieve a more symmetric distribution.

skewness_before = skew(df["LotArea"])
print(f"Skewness before transformation: {skewness_before}")

# Apply the Box-Cox transformation (boxcox1p)
lambda_value = 0.1  # A commonly used lambda value for Box-Cox in many scenarios
df["Transformed_LotArea"] = boxcox1p(df["LotArea"], lambda_value)

# Calculate the skewness after transformation
skewness_after = skew(df["Transformed_LotArea"])
print(f"Skewness after transformation: {skewness_after}")

# Plotting the original and transformed data
fig, ax = plt.subplots(1, 2, figsize=(12, 3))

sns.histplot(df["LotArea"], ax=ax[0], kde=True, bins=45, element='step', color='blue')
ax[0].set_title("Original Data")
# rotate x-axis labels
for label in ax[0].get_xticklabels():
    label.set_rotation(45)

sns.histplot(df["Transformed_LotArea"], ax=ax[1], kde=True,  bins=45, element='step', color="crimson")
ax[1].set_title("Transformed Data")

plt.show()
Skewness before transformation: 12.195142125084478
Skewness after transformation: 0.4281475593423937
../_images/cb53d3b01ee2022151d6c60427fb218e3683e705824a4a38e6eda8c0f8d4cc31.png

Data Normalization#

Data normalization scales dataset values to a uniform range, crucial for balanced attribute influence in data mining models. The primary methods are:

  • Min-max normalization

  • Z-score normalization

  • Decimal scaling We will demonstrate these techniques on the “LotArea” attribute in the example below.

# Min-Max Normalization
min_max_scaler = MinMaxScaler()
df['Min_Max_Normalized_LotArea'] = min_max_scaler.fit_transform(df[['LotArea']]   )

# Z-Score Normalization (Standardization)
standard_scaler = StandardScaler()
df['Z_Score_Normalized_LotArea'] = standard_scaler.fit_transform(df[['LotArea']]   )

# Decimal Scaling - Manual implementation as before
max_abs_value = df['LotArea'] .abs().max()
num_decimal_places = np.ceil(np.log10(max_abs_value))
df['Decimal_Scaling_Normalized_LotArea'] = df['LotArea']  / (10**num_decimal_places)

# Display the first few rows to verify
display(df[['LotArea', 'Min_Max_Normalized_LotArea', 'Z_Score_Normalized_LotArea', 'Decimal_Scaling_Normalized_LotArea']].head())
LotArea Min_Max_Normalized_LotArea Z_Score_Normalized_LotArea Decimal_Scaling_Normalized_LotArea
0 8450 0.033420 -0.207142 0.00845
1 9600 0.038795 -0.091886 0.00960
2 11250 0.046507 0.073480 0.01125
3 9550 0.038561 -0.096897 0.00955
4 14260 0.060576 0.375148 0.01426

Data Discretization#

Data discretization transforms continuous data into discrete categories or intervals, simplifying the analysis of complex relationships.

  • Binning: Grouping data into categories.

  • Histogram Analysis: Visualizing data distribution across intervals.

  • Clustering Analysis: Organizing data into clusters based on similarity.

We will apply these techniques to the “LotArea” attribute in the upcoming example.

Binning#

# Equal-width binning into 4 bins
df['LotArea_Equalwidth'] = pd.cut(df['LotArea'], bins=4, labels=["Small", "Medium", "Large", "Very Large"])
display(df[['LotArea', 'LotArea_Equalwidth']].head())
display(df['LotArea_Equalwidth'].value_counts())

# Equal-depth binning into 4 bins
df['LotArea_EqualDepth'] = pd.qcut(df['LotArea'], q=4, labels=["Small", "Medium", "Large", "Very Large"])
display(df[['LotArea', 'LotArea_EqualDepth']].head())
display(df[ 'LotArea_EqualDepth'].value_counts())
LotArea LotArea_Equalwidth
0 8450 Small
1 9600 Small
2 11250 Small
3 9550 Small
4 14260 Small
LotArea_Equalwidth
Small         1453
Medium           3
Large            2
Very Large       2
Name: count, dtype: int64
LotArea LotArea_EqualDepth
0 8450 Medium
1 9600 Large
2 11250 Large
3 9550 Large
4 14260 Very Large
LotArea_EqualDepth
Small         365
Medium        365
Large         365
Very Large    365
Name: count, dtype: int64

Histogram analysis#

plt.figure(figsize=(10, 5))
sns.histplot(df["LotArea"], kde=True, bins=45, element='step', palette='tab10')
plt.title('Attribute Distributions of LotArea')
plt.ylabel('Frequency')
plt.show()
/tmp/ipykernel_3703/3600601536.py:2: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df["LotArea"], kde=True, bins=45, element='step', palette='tab10')
../_images/18d117b722733c018dcf0d8798be6d0565981c354afb10c8dcccea4fa01cad38.png

Clustering analysis#

# Apply KMeans clustering
kmeans = KMeans(n_clusters=4, random_state=0).fit(df[['LotArea']])

# Assign cluster labels to each data point for discretization
df['LotArea_Cluster'] = kmeans.labels_

# Optionally, you can map these cluster labels to more meaningful category names
cluster_mapping = {
    0: 'Cluster 1',
    1: 'Cluster 2',
    2: 'Cluster 3',
    3: 'Cluster 4'
}
df['LotArea_Cluster_Label'] = df['LotArea_Cluster'].map(cluster_mapping)

# Display the first few rows to see the clustering-based discretization
display(df[['LotArea', 'LotArea_Cluster', 'LotArea_Cluster_Label']].head())
/opt/conda/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
LotArea LotArea_Cluster LotArea_Cluster_Label
0 8450 0 Cluster 1
1 9600 0 Cluster 1
2 11250 3 Cluster 4
3 9550 0 Cluster 1
4 14260 3 Cluster 4