Data Preprocessing II#

This tutorial contains Python examples for data transformation, focusing on techniques for data normalization and discretization. Follow the step-by-step instructions below carefully. To execute the code, click on the corresponding cell and press SHIFT+ENTER.

Data Transformation I#

Sometimes, the original values of an attribute are not well suited to data analysis or modeling. Data transformation converts the entire set of an attribute’s values into a new set of values. It is used to make the data better suited to a given data mining objective, support more effective analysis, and potentially improve model accuracy and efficiency.

In this tutorial, we’ll focus on the “LotArea” attribute from the house price dataset, exploring transformation functions, data normalization, and data discretization.

Importing Libraries and Configuration#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.cluster import KMeans
from scipy.special import boxcox1p
from scipy.stats import skew
from matplotlib import rcParams
%matplotlib inline

Settings#

# Set the default figure size for matplotlib plots to 15 inches wide by 6 inches tall
rcParams["figure.figsize"] = (15, 6)

# Increase the default font size of the titles in matplotlib plots to extra-extra-large
rcParams["axes.titlesize"] = "xx-large"

# Make the titles of axes in matplotlib plots bold for better visibility
rcParams["axes.titleweight"] = "bold"

# Set the default location of the legend in matplotlib plots to the upper left corner
rcParams["legend.loc"] = "upper left"

# Configure pandas to display all columns of a DataFrame when printed to the console
pd.set_option('display.max_columns', None)

# Configure pandas to display all rows of a DataFrame when printed to the console
pd.set_option('display.max_rows', None)

Load Data#

# Fetch the house price dataset from OpenML as pandas objects
data = datasets.fetch_openml(name="house_prices", as_frame=True)
df = pd.DataFrame(data=data.data, columns=data.feature_names)
# Attach the target (the sale price) as an extra column named 'MedHouseVal'
df['MedHouseVal'] = data.target
print('Number of samples = %d' % (df.shape[0]))
print('Number of attributes = %d' % (df.shape[1]))
display(df.head(n=10))

Number of samples = 1460
Number of attributes = 81

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	LotConfig	LandSlope	Neighborhood	Condition1	Condition2	BldgType	HouseStyle	OverallQual	OverallCond	YearBuilt	YearRemodAdd	RoofStyle	RoofMatl	Exterior1st	Exterior2nd	MasVnrType	MasVnrArea	ExterQual	ExterCond	Foundation	BsmtQual	BsmtCond	BsmtExposure	BsmtFinType1	BsmtFinSF1	BsmtFinType2	BsmtFinSF2	BsmtUnfSF	TotalBsmtSF	Heating	HeatingQC	CentralAir	Electrical	1stFlrSF	2ndFlrSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	KitchenQual	TotRmsAbvGrd	Functional	Fireplaces	FireplaceQu	GarageType	GarageYrBlt	GarageFinish	GarageCars	GarageArea	GarageQual	GarageCond	PavedDrive	WoodDeckSF	OpenPorchSF	EnclosedPorch	3SsnPorch	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition	MedHouseVal
0	1	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	7	5	2003	2003	Gable	CompShg	VinylSd	VinylSd	BrkFace	196.0	Gd	TA	PConc	Gd	TA	No	GLQ	706	Unf	0	150	856	GasA	Ex	Y	SBrkr	856	854	1710	1	0	2	1	3	1	Gd	8	Typ	0	NaN	Attchd	2003.0	RFn	2	548	TA	TA	Y	0	61	0	0	NaN	NaN	NaN	0	2	2008	WD	Normal	208500
1	2	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	FR2	Gtl	Veenker	Feedr	Norm	1Fam	1Story	6	8	1976	1976	Gable	CompShg	MetalSd	MetalSd	None	0.0	TA	TA	CBlock	Gd	TA	Gd	ALQ	978	Unf	0	284	1262	GasA	Ex	Y	SBrkr	1262	0	1262	0	1	2	0	3	1	TA	6	Typ	1	TA	Attchd	1976.0	RFn	2	460	TA	TA	Y	298	0	0	0	NaN	NaN	NaN	0	5	2007	WD	Normal	181500
2	3	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	7	5	2001	2002	Gable	CompShg	VinylSd	VinylSd	BrkFace	162.0	Gd	TA	PConc	Gd	TA	Mn	GLQ	486	Unf	0	434	920	GasA	Ex	Y	SBrkr	920	866	1786	1	0	2	1	3	1	Gd	6	Typ	1	TA	Attchd	2001.0	RFn	2	608	TA	TA	Y	0	42	0	0	NaN	NaN	NaN	0	9	2008	WD	Normal	223500
3	4	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	Corner	Gtl	Crawfor	Norm	Norm	1Fam	2Story	7	5	1915	1970	Gable	CompShg	Wd Sdng	Wd Shng	None	0.0	TA	TA	BrkTil	TA	Gd	No	ALQ	216	Unf	0	540	756	GasA	Gd	Y	SBrkr	961	756	1717	1	0	1	0	3	1	Gd	7	Typ	1	Gd	Detchd	1998.0	Unf	3	642	TA	TA	Y	0	35	272	0	NaN	NaN	NaN	0	2	2006	WD	Abnorml	140000
4	5	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	FR2	Gtl	NoRidge	Norm	Norm	1Fam	2Story	8	5	2000	2000	Gable	CompShg	VinylSd	VinylSd	BrkFace	350.0	Gd	TA	PConc	Gd	TA	Av	GLQ	655	Unf	0	490	1145	GasA	Ex	Y	SBrkr	1145	1053	2198	1	0	2	1	4	1	Gd	9	Typ	1	TA	Attchd	2000.0	RFn	3	836	TA	TA	Y	192	84	0	0	NaN	NaN	NaN	0	12	2008	WD	Normal	250000
5	6	50	RL	85.0	14115	Pave	NaN	IR1	Lvl	AllPub	Inside	Gtl	Mitchel	Norm	Norm	1Fam	1.5Fin	5	5	1993	1995	Gable	CompShg	VinylSd	VinylSd	None	0.0	TA	TA	Wood	Gd	TA	No	GLQ	732	Unf	0	64	796	GasA	Ex	Y	SBrkr	796	566	1362	1	0	1	1	1	1	TA	5	Typ	0	NaN	Attchd	1993.0	Unf	2	480	TA	TA	Y	40	30	0	320	NaN	MnPrv	Shed	700	10	2009	WD	Normal	143000
6	7	20	RL	75.0	10084	Pave	NaN	Reg	Lvl	AllPub	Inside	Gtl	Somerst	Norm	Norm	1Fam	1Story	8	5	2004	2005	Gable	CompShg	VinylSd	VinylSd	Stone	186.0	Gd	TA	PConc	Ex	TA	Av	GLQ	1369	Unf	0	317	1686	GasA	Ex	Y	SBrkr	1694	0	1694	1	0	2	0	3	1	Gd	7	Typ	1	Gd	Attchd	2004.0	RFn	2	636	TA	TA	Y	255	57	0	0	NaN	NaN	NaN	0	8	2007	WD	Normal	307000
7	8	60	RL	NaN	10382	Pave	NaN	IR1	Lvl	AllPub	Corner	Gtl	NWAmes	PosN	Norm	1Fam	2Story	7	6	1973	1973	Gable	CompShg	HdBoard	HdBoard	Stone	240.0	TA	TA	CBlock	Gd	TA	Mn	ALQ	859	BLQ	32	216	1107	GasA	Ex	Y	SBrkr	1107	983	2090	1	0	2	1	3	1	TA	7	Typ	2	TA	Attchd	1973.0	RFn	2	484	TA	TA	Y	235	204	228	0	NaN	NaN	Shed	350	11	2009	WD	Normal	200000
8	9	50	RM	51.0	6120	Pave	NaN	Reg	Lvl	AllPub	Inside	Gtl	OldTown	Artery	Norm	1Fam	1.5Fin	7	5	1931	1950	Gable	CompShg	BrkFace	Wd Shng	None	0.0	TA	TA	BrkTil	TA	TA	No	Unf	0	Unf	0	952	952	GasA	Gd	Y	FuseF	1022	752	1774	0	0	2	0	2	2	TA	8	Min1	2	TA	Detchd	1931.0	Unf	2	468	Fa	TA	Y	90	0	205	0	NaN	NaN	NaN	0	4	2008	WD	Abnorml	129900
9	10	190	RL	50.0	7420	Pave	NaN	Reg	Lvl	AllPub	Corner	Gtl	BrkSide	Artery	Artery	2fmCon	1.5Unf	5	6	1939	1950	Gable	CompShg	MetalSd	MetalSd	None	0.0	TA	TA	BrkTil	TA	TA	No	GLQ	851	Unf	0	140	991	GasA	Ex	Y	SBrkr	1077	0	1077	1	0	1	0	2	2	TA	5	Typ	2	TA	Attchd	1939.0	RFn	1	205	Gd	TA	Y	0	4	0	0	NaN	NaN	NaN	0	1	2008	WD	Normal	118000

Transformation Functions#

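Transformation functions apply the same mathematical function to every value of an attribute. In the following example, we’ll apply a Box-Cox power transformation to the “LotArea” attribute, aiming to reduce skewness and achieve a more symmetric distribution. The code uses SciPy’s boxcox1p, which applies the Box-Cox transformation to 1 + x, so zero values are handled safely.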

# Calculate the skewness before transformation
skewness_before = skew(df["LotArea"])
print(f"Skewness before transformation: {skewness_before}")

# Apply the Box-Cox transformation to 1 + x (boxcox1p)
lambda_value = 0.1  # Fixed here for illustration; in practice, lambda is usually tuned to the data
df["Transformed_LotArea"] = boxcox1p(df["LotArea"], lambda_value)

# Calculate the skewness after transformation
skewness_after = skew(df["Transformed_LotArea"])
print(f"Skewness after transformation: {skewness_after}")

# Plotting the original and transformed data
fig, ax = plt.subplots(1, 2, figsize=(12, 3))

sns.histplot(df["LotArea"], ax=ax[0], kde=True, bins=45, element='step', color='blue')
ax[0].set_title("Original Data")
# rotate x-axis labels
for label in ax[0].get_xticklabels():
    label.set_rotation(45)

sns.histplot(df["Transformed_LotArea"], ax=ax[1], kde=True,  bins=45, element='step', color="crimson")
ax[1].set_title("Transformed Data")

plt.show()

Skewness before transformation: 12.195142125084478
Skewness after transformation: 0.4281475593423937

[Figure: side-by-side histograms of LotArea, “Original Data” (left) and “Transformed Data” (right)]
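
For reference, boxcox1p(x, lambda) computes ((1 + x)^lambda - 1) / lambda when lambda is nonzero, and log(1 + x) when lambda is zero. The following is a minimal sketch that checks this closed form against SciPy; the array x holds made-up values chosen for illustration and is not taken from the dataset.

```python
import numpy as np
from scipy.special import boxcox1p

# Hypothetical lot areas, chosen only for illustration
x = np.array([0.0, 1500.0, 8450.0, 215245.0])
lam = 0.1

# Closed form of the Box-Cox transform applied to 1 + x (valid for lam != 0)
manual = ((1.0 + x) ** lam - 1.0) / lam
library = boxcox1p(x, lam)

# For lam == 0 the transform reduces to log(1 + x)
log_case = boxcox1p(x, 0.0)

print(np.allclose(manual, library))        # closed form matches SciPy
print(np.allclose(log_case, np.log1p(x)))  # lam = 0 matches log(1 + x)
```

Because the transform is monotonic, it preserves the ordering of the values while compressing the long right tail.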

Data Normalization#

Data normalization rescales attribute values to a common scale so that attributes with large magnitudes do not dominate those with small magnitudes in data mining models. The primary methods are:

Min-max normalization
Z-score normalization
Decimal scaling

We will demonstrate these techniques on the “LotArea” attribute in the example below.

# Min-Max Normalization: maps values linearly onto [0, 1]
min_max_scaler = MinMaxScaler()
df['Min_Max_Normalized_LotArea'] = min_max_scaler.fit_transform(df[['LotArea']])

# Z-Score Normalization (Standardization): zero mean, unit standard deviation
standard_scaler = StandardScaler()
df['Z_Score_Normalized_LotArea'] = standard_scaler.fit_transform(df[['LotArea']])

# Decimal Scaling (manual implementation): divide by 10**j, where j is the
# smallest integer such that max(|value|) / 10**j < 1
max_abs_value = df['LotArea'].abs().max()
num_decimal_places = int(np.floor(np.log10(max_abs_value))) + 1
df['Decimal_Scaling_Normalized_LotArea'] = df['LotArea'] / (10 ** num_decimal_places)

# Display the first few rows to verify
display(df[['LotArea', 'Min_Max_Normalized_LotArea', 'Z_Score_Normalized_LotArea', 'Decimal_Scaling_Normalized_LotArea']].head())

	LotArea	Min_Max_Normalized_LotArea	Z_Score_Normalized_LotArea	Decimal_Scaling_Normalized_LotArea
0	8450	0.033420	-0.207142	0.00845
1	9600	0.038795	-0.091886	0.00960
2	11250	0.046507	0.073480	0.01125
3	9550	0.038561	-0.096897	0.00955
4	14260	0.060576	0.375148	0.01426
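
The three methods above reduce to short formulas. The sketch below applies them by hand with NumPy to a small made-up array (v is hypothetical, not taken from the dataset), which may help when checking the scaler outputs.

```python
import numpy as np

# Hypothetical values, chosen only for illustration
v = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max normalization: (v - min) / (max - min), result lies in [0, 1]
min_max = (v - v.min()) / (v.max() - v.min())

# Z-score normalization: (v - mean) / std, using the population standard
# deviation (ddof=0), which is also what StandardScaler uses
z_score = (v - v.mean()) / v.std(ddof=0)

# Decimal scaling: divide by 10**j, where j is the smallest integer
# such that max(|v|) / 10**j < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal_scaled = v / 10**j

print(min_max)
print(z_score)
print(decimal_scaled)
```

Note that the maximum here is exactly 10, so j is 2 and the largest scaled value is 0.1, which keeps every result strictly below 1.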

Data Discretization#

Data discretization transforms continuous data into discrete categories or intervals, simplifying the analysis of complex relationships. Common techniques include:

Binning: Grouping data into categories.
Histogram Analysis: Visualizing data distribution across intervals.
Clustering Analysis: Organizing data into clusters based on similarity.

We will apply these techniques to the “LotArea” attribute in the upcoming example.

Binning#

# Equal-width binning into 4 bins
df['LotArea_Equalwidth'] = pd.cut(df['LotArea'], bins=4, labels=["Small", "Medium", "Large", "Very Large"])
display(df[['LotArea', 'LotArea_Equalwidth']].head())
display(df['LotArea_Equalwidth'].value_counts())

# Equal-depth binning into 4 bins
df['LotArea_EqualDepth'] = pd.qcut(df['LotArea'], q=4, labels=["Small", "Medium", "Large", "Very Large"])
display(df[['LotArea', 'LotArea_EqualDepth']].head())
display(df['LotArea_EqualDepth'].value_counts())

	LotArea	LotArea_Equalwidth
0	8450	Small
1	9600	Small
2	11250	Small
3	9550	Small
4	14260	Small

LotArea_Equalwidth
Small         1453
Medium           3
Large            2
Very Large       2
Name: count, dtype: int64

	LotArea	LotArea_EqualDepth
0	8450	Medium
1	9600	Large
2	11250	Large
3	9550	Large
4	14260	Very Large

LotArea_EqualDepth
Small         365
Medium        365
Large         365
Very Large    365
Name: count, dtype: int64
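
The counts above differ sharply because equal-width bins split the value range evenly, while equal-depth bins split the observations evenly. The sketch below makes the bin edges visible on a small made-up series (the values are hypothetical) using the retbins option of pd.cut and pd.qcut.

```python
import pandas as pd

# Hypothetical skewed values: most are small, one is very large
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-width: the range [min, max] is split into intervals of equal length
width_bins, width_edges = pd.cut(values, bins=4, retbins=True)

# Equal-depth: edges are quantiles, so each bin gets about the same count
depth_bins, depth_edges = pd.qcut(values, q=4, retbins=True)

# Count the observations per bin, in bin order
width_counts = width_bins.value_counts().sort_index().tolist()
depth_counts = depth_bins.value_counts().sort_index().tolist()

print(width_edges, width_counts)
print(depth_edges, depth_counts)
```

With a single extreme value, almost all observations land in the first equal-width bin, which mirrors the LotArea result above.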

Histogram analysis#

plt.figure(figsize=(10, 5))
# No hue variable is used, so no palette is passed
sns.histplot(df["LotArea"], kde=True, bins=45, element='step')
plt.title('Attribute Distributions of LotArea')
plt.ylabel('Frequency')
plt.show()

[Figure: histogram of LotArea with KDE curve, titled “Attribute Distributions of LotArea”]
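
A histogram can also be used directly for discretization: the bin edges define the intervals, and each value is replaced by the index of the bin it falls into. The following is a minimal sketch with NumPy on made-up values (the array is hypothetical, not taken from the dataset).

```python
import numpy as np

# Hypothetical values, chosen only for illustration
values = np.array([1.0, 2.0, 2.5, 3.0, 7.0, 8.0, 9.5, 10.0])

# np.histogram returns the count per bin and the bin edges
counts, edges = np.histogram(values, bins=3)

# Assign each value a bin index using the interior edges only,
# so the maximum value falls in the last bin
bin_index = np.digitize(values, edges[1:-1])

print(counts)
print(edges)
print(bin_index)
```

Only the interior edges are passed to np.digitize so that the indices run from 0 to bins - 1 and agree with the histogram counts.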

Clustering analysis#

# Apply KMeans clustering (n_init is set explicitly to keep the current default of 10)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(df[['LotArea']])

# Assign cluster labels to each data point for discretization
df['LotArea_Cluster'] = kmeans.labels_

# Optionally, map the cluster labels to more readable category names.
# KMeans label numbers are arbitrary and do not reflect the order of lot sizes.
cluster_mapping = {
    0: 'Cluster 1',
    1: 'Cluster 2',
    2: 'Cluster 3',
    3: 'Cluster 4'
}
df['LotArea_Cluster_Label'] = df['LotArea_Cluster'].map(cluster_mapping)

# Display the first few rows to see the clustering-based discretization
display(df[['LotArea', 'LotArea_Cluster', 'LotArea_Cluster_Label']].head())

	LotArea	LotArea_Cluster	LotArea_Cluster_Label
0	8450	0	Cluster 1
1	9600	0	Cluster 1
2	11250	3	Cluster 4
3	9550	0	Cluster 1
4	14260	3	Cluster 4
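
KMeans assigns label numbers arbitrarily, so cluster 3 is not necessarily larger than cluster 0. If ordered categories are wanted, one option is to relabel the clusters by centroid value. The sketch below shows this on made-up, well-separated values (the data and the helper names order and rank_of_label are hypothetical).

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical values forming three well-separated groups
values = np.array(
    [1.0, 2.0, 3.0, 50.0, 51.0, 52.0, 100.0, 101.0, 102.0]
).reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)

# Rank clusters by centroid value: rank 0 = smallest centroid
order = np.argsort(kmeans.cluster_centers_.ravel())
rank_of_label = np.empty_like(order)
rank_of_label[order] = np.arange(len(order))

# Relabel every point with the rank of its cluster
ordered_labels = rank_of_label[kmeans.labels_]
print(ordered_labels)
```

After relabeling, a higher label always corresponds to a cluster with a larger centroid, so the labels can be mapped to names such as "Small" or "Large" in a meaningful order.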