Data Preprocessing III#
This tutorial contains Python examples for data transformation, focusing on techniques for sampling and data compression. Follow the step-by-step instructions below carefully. To execute the code, click on the corresponding cell and press the SHIFT+ENTER keys simultaneously.
Importing Libraries and Configuration#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pywt
from sklearn import datasets
%matplotlib inline
from matplotlib import rcParams
Settings#
# Set the default figure size for matplotlib plots to 15 inches wide by 6 inches tall
rcParams["figure.figsize"] = (15, 6)
# Increase the default font size of the titles in matplotlib plots to extra-extra-large
rcParams["axes.titlesize"] = "xx-large"
# Make the titles of axes in matplotlib plots bold for better visibility
rcParams["axes.titleweight"] = "bold"
# Set the default location of the legend in matplotlib plots to the upper left corner
rcParams["legend.loc"] = "upper left"
# Configure pandas to display all columns of a DataFrame when printed to the console
pd.set_option('display.max_columns', None)
# Configure pandas to display all rows of a DataFrame when printed to the console
pd.set_option('display.max_rows', None)
Sampling#
Sampling serves several purposes: it reduces data volume for exploratory analysis, lets algorithms scale to big data applications, and helps quantify the uncertainty that arises from varying data distributions. Two common approaches are sampling without replacement, where each selected instance is removed from the pool and cannot be selected again, and sampling with replacement, where each selected instance stays in the pool and may appear in the sample more than once.
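Before turning to the real dataset, the difference between the two modes can be sketched on a tiny frame. The frame `toy` and its column `value` are made up for illustration only; the seed is arbitrary.

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate the two sampling modes
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

# Without replacement: each row can be drawn at most once,
# so asking for all 5 rows simply returns a shuffled copy
without = toy.sample(n=5, replace=False, random_state=0)

# With replacement: a row stays in the pool after being drawn,
# so the sample may even be larger than the data itself
with_repl = toy.sample(n=8, replace=True, random_state=0)

print(sorted(without.index))       # each original row label appears exactly once
print(with_repl.index.is_unique)   # False: 8 draws from 5 rows must repeat a row
```

Requesting more rows than the frame contains only works with replace=True; without replacement pandas raises an error instead.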
In the upcoming example, we will demonstrate both sampling with and without replacement using the house_prices dataset.
data = datasets.fetch_openml(name="house_prices", as_frame=True)
df = pd.DataFrame(data=data.data, columns=data.feature_names)
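# The target of house_prices is the sale price of each house;
# the column name 'MedHouseVal' is kept so that it matches the output shown below
df['MedHouseVal'] = data.target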
print('Number of samples = %d' % (df.shape[0]))
print('Number of attributes = %d' % (df.shape[1]))
display(df.head(n=10))
Number of samples = 1460
Number of attributes = 81
 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | MedHouseVal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196.0 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NaN | Attchd | 2003.0 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0.0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976.0 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162.0 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001.0 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | None | 0.0 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998.0 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350.0 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000.0 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 | 6 | 50 | RL | 85.0 | 14115 | Pave | NaN | IR1 | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | Norm | 1Fam | 1.5Fin | 5 | 5 | 1993 | 1995 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | TA | TA | Wood | Gd | TA | No | GLQ | 732 | Unf | 0 | 64 | 796 | GasA | Ex | Y | SBrkr | 796 | 566 | 0 | 1362 | 1 | 0 | 1 | 1 | 1 | 1 | TA | 5 | Typ | 0 | NaN | Attchd | 1993.0 | Unf | 2 | 480 | TA | TA | Y | 40 | 30 | 0 | 320 | 0 | 0 | NaN | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
6 | 7 | 20 | RL | 75.0 | 10084 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | Somerst | Norm | Norm | 1Fam | 1Story | 8 | 5 | 2004 | 2005 | Gable | CompShg | VinylSd | VinylSd | Stone | 186.0 | Gd | TA | PConc | Ex | TA | Av | GLQ | 1369 | Unf | 0 | 317 | 1686 | GasA | Ex | Y | SBrkr | 1694 | 0 | 0 | 1694 | 1 | 0 | 2 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Attchd | 2004.0 | RFn | 2 | 636 | TA | TA | Y | 255 | 57 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 307000 |
7 | 8 | 60 | RL | NaN | 10382 | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | NWAmes | PosN | Norm | 1Fam | 2Story | 7 | 6 | 1973 | 1973 | Gable | CompShg | HdBoard | HdBoard | Stone | 240.0 | TA | TA | CBlock | Gd | TA | Mn | ALQ | 859 | BLQ | 32 | 216 | 1107 | GasA | Ex | Y | SBrkr | 1107 | 983 | 0 | 2090 | 1 | 0 | 2 | 1 | 3 | 1 | TA | 7 | Typ | 2 | TA | Attchd | 1973.0 | RFn | 2 | 484 | TA | TA | Y | 235 | 204 | 228 | 0 | 0 | 0 | NaN | NaN | Shed | 350 | 11 | 2009 | WD | Normal | 200000 |
8 | 9 | 50 | RM | 51.0 | 6120 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | OldTown | Artery | Norm | 1Fam | 1.5Fin | 7 | 5 | 1931 | 1950 | Gable | CompShg | BrkFace | Wd Shng | None | 0.0 | TA | TA | BrkTil | TA | TA | No | Unf | 0 | Unf | 0 | 952 | 952 | GasA | Gd | Y | FuseF | 1022 | 752 | 0 | 1774 | 0 | 0 | 2 | 0 | 2 | 2 | TA | 8 | Min1 | 2 | TA | Detchd | 1931.0 | Unf | 2 | 468 | Fa | TA | Y | 90 | 0 | 205 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 4 | 2008 | WD | Abnorml | 129900 |
9 | 10 | 190 | RL | 50.0 | 7420 | Pave | NaN | Reg | Lvl | AllPub | Corner | Gtl | BrkSide | Artery | Artery | 2fmCon | 1.5Unf | 5 | 6 | 1939 | 1950 | Gable | CompShg | MetalSd | MetalSd | None | 0.0 | TA | TA | BrkTil | TA | TA | No | GLQ | 851 | Unf | 0 | 140 | 991 | GasA | Ex | Y | SBrkr | 1077 | 0 | 0 | 1077 | 1 | 0 | 1 | 0 | 2 | 2 | TA | 5 | Typ | 2 | TA | Attchd | 1939.0 | RFn | 1 | 205 | Gd | TA | Y | 0 | 4 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 1 | 2008 | WD | Normal | 118000 |
In the following code, a sample of size 3 is randomly selected (without replacement) from the original data.
sample = df.sample(n=3)
sample
 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | MedHouseVal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1173 | 1174 | 50 | RL | 138.0 | 18030 | Pave | NaN | IR1 | Bnk | AllPub | Inside | Gtl | ClearCr | Norm | Norm | 1Fam | 1.5Fin | 5 | 6 | 1946 | 1994 | Gable | CompShg | MetalSd | MetalSd | None | 0.0 | TA | TA | CBlock | TA | TA | No | Rec | 152 | BLQ | 469 | 977 | 1598 | GasA | TA | Y | SBrkr | 1636 | 971 | 479 | 3086 | 0 | 0 | 3 | 0 | 3 | 1 | Ex | 12 | Maj1 | 1 | Gd | NaN | NaN | NaN | 0 | 0 | NaN | NaN | Y | 122 | 0 | 0 | 0 | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2007 | WD | Normal | 200500 |
994 | 995 | 20 | RL | 96.0 | 12456 | Pave | NaN | Reg | Lvl | AllPub | FR2 | Gtl | NridgHt | Norm | Norm | 1Fam | 1Story | 10 | 5 | 2006 | 2007 | Hip | CompShg | CemntBd | CmentBd | Stone | 230.0 | Ex | TA | PConc | Ex | TA | Gd | GLQ | 1172 | Unf | 0 | 528 | 1700 | GasA | Ex | Y | SBrkr | 1718 | 0 | 0 | 1718 | 1 | 0 | 2 | 0 | 3 | 1 | Ex | 7 | Typ | 1 | Gd | Attchd | 2008.0 | Fin | 3 | 786 | TA | TA | Y | 216 | 48 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 7 | 2009 | WD | Normal | 337500 |
324 | 325 | 80 | RL | 96.0 | 11275 | Pave | NaN | Reg | Lvl | AllPub | Corner | Gtl | NAmes | PosN | Norm | 1Fam | SLvl | 7 | 7 | 1967 | 2007 | Mansard | WdShake | Wd Sdng | Wd Sdng | BrkFace | 300.0 | Gd | Gd | CBlock | Gd | TA | No | Unf | 0 | Unf | 0 | 710 | 710 | GasA | Ex | Y | SBrkr | 1898 | 1080 | 0 | 2978 | 0 | 0 | 2 | 1 | 5 | 1 | Gd | 11 | Typ | 1 | Gd | BuiltIn | 1961.0 | Fin | 2 | 564 | TA | TA | Y | 240 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal | 242000 |
In the next example, we randomly select 1% of the data (without replacement) and display the selected samples. The random_state argument of sample() sets the seed of the random number generator, so the same rows are drawn each time the cell is run.
sample = df.sample(frac=0.01, random_state=1)
sample
 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | MedHouseVal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
258 | 259 | 60 | RL | 80.0 | 12435 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2001 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 172.0 | Gd | TA | PConc | Gd | TA | No | GLQ | 361 | Unf | 0 | 602 | 963 | GasA | Ex | Y | SBrkr | 963 | 829 | 0 | 1792 | 0 | 0 | 2 | 1 | 3 | 1 | Gd | 7 | Typ | 1 | TA | Attchd | 2001.0 | RFn | 2 | 564 | TA | TA | Y | 0 | 96 | 0 | 245 | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2008 | WD | Normal | 231500 |
267 | 268 | 75 | RL | 60.0 | 8400 | Pave | NaN | Reg | Bnk | AllPub | Inside | Mod | SWISU | Norm | Norm | 1Fam | 2.5Fin | 5 | 8 | 1939 | 1997 | Gable | CompShg | Wd Sdng | Wd Sdng | None | 0.0 | TA | TA | PConc | TA | TA | No | LwQ | 378 | Unf | 0 | 342 | 720 | GasA | Ex | Y | SBrkr | 1052 | 720 | 420 | 2192 | 0 | 0 | 2 | 1 | 4 | 1 | Gd | 8 | Typ | 1 | Gd | Detchd | 1939.0 | Unf | 1 | 240 | TA | TA | Y | 262 | 24 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 7 | 2008 | WD | Normal | 179500 |
288 | 289 | 20 | RL | NaN | 9819 | Pave | NaN | IR1 | Lvl | AllPub | Inside | Gtl | Sawyer | Norm | Norm | 1Fam | 1Story | 5 | 5 | 1967 | 1967 | Gable | CompShg | MetalSd | MetalSd | BrkFace | 31.0 | TA | Gd | CBlock | TA | TA | No | BLQ | 450 | Unf | 0 | 432 | 882 | GasA | TA | Y | SBrkr | 900 | 0 | 0 | 900 | 0 | 0 | 1 | 0 | 3 | 1 | TA | 5 | Typ | 0 | NaN | Detchd | 1970.0 | Unf | 1 | 280 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 122000 |
649 | 650 | 180 | RM | 21.0 | 1936 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | MeadowV | Norm | Norm | Twnhs | SFoyer | 4 | 6 | 1970 | 1970 | Gable | CompShg | CemntBd | CmentBd | None | 0.0 | TA | TA | CBlock | Gd | TA | Av | BLQ | 131 | GLQ | 499 | 0 | 630 | GasA | Gd | Y | SBrkr | 630 | 0 | 0 | 630 | 1 | 0 | 1 | 0 | 1 | 1 | TA | 3 | Typ | 0 | NaN | NaN | NaN | NaN | 0 | 0 | NaN | NaN | Y | 0 | 0 | 0 | 0 | 0 | 0 | NaN | MnPrv | NaN | 0 | 12 | 2007 | WD | Normal | 84500 |
1233 | 1234 | 20 | RL | NaN | 12160 | Pave | NaN | IR1 | Lvl | AllPub | Inside | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 5 | 5 | 1959 | 1959 | Hip | CompShg | Plywood | Plywood | BrkFace | 180.0 | TA | TA | CBlock | TA | TA | No | Rec | 1000 | Unf | 0 | 188 | 1188 | GasA | Fa | Y | SBrkr | 1188 | 0 | 0 | 1188 | 1 | 0 | 1 | 0 | 3 | 1 | TA | 6 | Typ | 0 | NaN | Attchd | 1959.0 | RFn | 2 | 531 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | NaN | MnPrv | NaN | 0 | 5 | 2010 | COD | Abnorml | 142000 |
167 | 168 | 60 | RL | 86.0 | 10562 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | NridgHt | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2007 | 2007 | Gable | CompShg | VinylSd | VinylSd | Stone | 300.0 | Gd | TA | PConc | Ex | TA | No | GLQ | 1288 | Unf | 0 | 294 | 1582 | GasA | Ex | Y | SBrkr | 1610 | 551 | 0 | 2161 | 1 | 0 | 1 | 1 | 3 | 1 | Ex | 8 | Typ | 1 | Gd | Attchd | 2007.0 | Fin | 3 | 789 | TA | TA | Y | 178 | 120 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 11 | 2007 | New | Partial | 325624 |
926 | 927 | 60 | RL | 93.0 | 11999 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | NridgHt | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2003 | 2004 | Hip | CompShg | VinylSd | VinylSd | BrkFace | 340.0 | Gd | TA | PConc | Gd | TA | No | Unf | 0 | Unf | 0 | 1181 | 1181 | GasA | Ex | Y | SBrkr | 1234 | 1140 | 0 | 2374 | 0 | 0 | 2 | 1 | 4 | 1 | Ex | 10 | Typ | 1 | Gd | BuiltIn | 2003.0 | Fin | 3 | 656 | TA | TA | Y | 104 | 100 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 285000 |
831 | 832 | 160 | FV | 30.0 | 3180 | Pave | Pave | Reg | Lvl | AllPub | Inside | Gtl | Somerst | Norm | Norm | TwnhsE | 2Story | 7 | 5 | 2005 | 2005 | Gable | CompShg | MetalSd | MetalSd | None | 0.0 | TA | TA | PConc | Gd | TA | No | Unf | 0 | Unf | 0 | 600 | 600 | GasA | Ex | Y | SBrkr | 520 | 600 | 80 | 1200 | 0 | 0 | 2 | 1 | 2 | 1 | Gd | 4 | Typ | 0 | NaN | Detchd | 2005.0 | RFn | 2 | 480 | TA | TA | Y | 0 | 166 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2006 | WD | Normal | 151000 |
1237 | 1238 | 60 | RL | 41.0 | 12393 | Pave | NaN | IR2 | Lvl | AllPub | FR2 | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2004 | 2005 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | Gd | TA | PConc | Gd | TA | No | Unf | 0 | Unf | 0 | 847 | 847 | GasA | Ex | Y | SBrkr | 847 | 1101 | 0 | 1948 | 0 | 0 | 2 | 1 | 4 | 1 | Gd | 8 | Typ | 1 | Gd | BuiltIn | 2004.0 | Fin | 2 | 434 | TA | TA | Y | 100 | 48 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2006 | WD | Normal | 195000 |
426 | 427 | 80 | RL | NaN | 12800 | Pave | NaN | Reg | Low | AllPub | Inside | Mod | SawyerW | Norm | Norm | 1Fam | SLvl | 7 | 5 | 1989 | 1989 | Gable | CompShg | Wd Sdng | Wd Sdng | BrkFace | 145.0 | Gd | TA | PConc | Gd | TA | Gd | GLQ | 1518 | Unf | 0 | 0 | 1518 | GasA | Gd | Y | SBrkr | 1644 | 0 | 0 | 1644 | 1 | 1 | 2 | 0 | 2 | 1 | Gd | 5 | Typ | 1 | TA | Attchd | 1989.0 | Fin | 2 | 569 | TA | TA | Y | 80 | 0 | 0 | 0 | 396 | 0 | NaN | NaN | NaN | 0 | 8 | 2009 | WD | Normal | 275000 |
487 | 488 | 20 | RL | 70.0 | 12243 | Pave | NaN | IR1 | Lvl | AllPub | Inside | Gtl | NWAmes | Norm | Norm | 1Fam | 1Story | 5 | 6 | 1971 | 1971 | Gable | CompShg | Plywood | Plywood | None | 0.0 | TA | TA | CBlock | Gd | TA | Av | ALQ | 998 | Unf | 0 | 486 | 1484 | GasA | Gd | Y | SBrkr | 1484 | 0 | 0 | 1484 | 0 | 0 | 2 | 0 | 3 | 1 | TA | 7 | Typ | 1 | TA | Attchd | 1971.0 | Unf | 2 | 487 | TA | TA | Y | 224 | 0 | 0 | 0 | 180 | 0 | NaN | NaN | NaN | 0 | 2 | 2007 | WD | Normal | 175000 |
375 | 376 | 30 | RL | NaN | 10020 | Pave | NaN | IR1 | Low | AllPub | Inside | Sev | Edwards | Norm | Norm | 1Fam | 1Story | 1 | 1 | 1922 | 1950 | Gable | CompShg | Wd Sdng | Wd Sdng | None | 0.0 | Fa | Fa | BrkTil | Fa | Po | Gd | BLQ | 350 | Unf | 0 | 333 | 683 | GasA | Gd | N | FuseA | 904 | 0 | 0 | 904 | 1 | 0 | 0 | 1 | 1 | 1 | Fa | 4 | Maj1 | 0 | NaN | NaN | NaN | NaN | 0 | 0 | NaN | NaN | Y | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 3 | 2009 | WD | Normal | 61000 |
1126 | 1127 | 120 | RL | 53.0 | 3684 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | Blmngtn | Norm | Norm | TwnhsE | 1Story | 7 | 5 | 2007 | 2007 | Hip | CompShg | VinylSd | VinylSd | BrkFace | 130.0 | Gd | TA | PConc | Gd | TA | No | Unf | 0 | Unf | 0 | 1373 | 1373 | GasA | Ex | Y | SBrkr | 1555 | 0 | 0 | 1555 | 0 | 0 | 2 | 0 | 2 | 1 | Gd | 7 | Typ | 1 | TA | Attchd | 2007.0 | Fin | 3 | 660 | TA | TA | Y | 143 | 20 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2009 | WD | Normal | 174000 |
53 | 54 | 20 | RL | 68.0 | 50271 | Pave | NaN | IR1 | Low | AllPub | Inside | Gtl | Veenker | Norm | Norm | 1Fam | 1Story | 9 | 5 | 1981 | 1987 | Gable | WdShngl | WdShing | Wd Shng | None | 0.0 | Gd | TA | CBlock | Ex | TA | Gd | GLQ | 1810 | Unf | 0 | 32 | 1842 | GasA | Gd | Y | SBrkr | 1842 | 0 | 0 | 1842 | 2 | 0 | 0 | 1 | 0 | 1 | Gd | 5 | Typ | 1 | Gd | Attchd | 1981.0 | Fin | 3 | 894 | TA | TA | Y | 857 | 72 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 11 | 2006 | WD | Normal | 385000 |
1033 | 1034 | 20 | RL | NaN | 8125 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 1Story | 7 | 5 | 2002 | 2002 | Gable | CompShg | VinylSd | VinylSd | Stone | 295.0 | Gd | TA | PConc | Gd | TA | No | GLQ | 986 | Unf | 0 | 668 | 1654 | GasA | Ex | Y | SBrkr | 1654 | 0 | 0 | 1654 | 1 | 0 | 2 | 0 | 3 | 1 | Gd | 6 | Typ | 0 | NaN | Attchd | 2002.0 | Unf | 3 | 900 | TA | TA | Y | 0 | 136 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Normal | 230000 |
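The table above has 15 rows because 1% of 1,460 is 14.6, which pandas rounds to a whole number of rows. The following sketch checks that, and the effect of the seed, on a stand-in frame; `toy` and its column `x` are hypothetical and only mimic the row count of the house_prices data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with the same number of rows as the house_prices frame
toy = pd.DataFrame({"x": np.arange(1460)})

a = toy.sample(frac=0.01, random_state=1)
b = toy.sample(frac=0.01, random_state=1)   # same seed as `a`
c = toy.sample(frac=0.01, random_state=2)   # different seed

print(len(a))                     # 15
print(a.index.equals(b.index))    # True: the same seed reproduces the same rows
print(a.index.equals(c.index))    # a different seed typically gives a different draw
```

Leaving random_state unset, as in the earlier n=3 example, gives a different sample on every run.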
Finally, we sample with replacement to create a sample whose size equals 1% of the entire data. With only 15 draws from 1,460 rows, duplicates are unlikely to appear; increasing the sample size (for example, by raising frac) makes duplicate instances easy to observe.
sample = df.sample(frac=0.01, replace=True, random_state=1)
sample
 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | MedHouseVal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1061 | 1062 | 30 | C (all) | 120.0 | 18000 | Grvl | NaN | Reg | Low | AllPub | Inside | Gtl | IDOTRR | Norm | Norm | 1Fam | 1Story | 3 | 4 | 1935 | 1950 | Gable | CompShg | MetalSd | MetalSd | None | 0.0 | Fa | TA | CBlock | TA | TA | No | Unf | 0 | Unf | 0 | 894 | 894 | GasA | TA | Y | SBrkr | 894 | 0 | 0 | 894 | 0 | 0 | 1 | 0 | 2 | 1 | TA | 6 | Typ | 0 | NaN | Detchd | 1994.0 | RFn | 3 | 1248 | TA | TA | Y | 0 | 20 | 0 | 0 | 0 | 0 | NaN | NaN | Shed | 560 | 8 | 2008 | ConLD | Normal | 81000 |
235 | 236 | 160 | RM | 21.0 | 1680 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | BrDale | Norm | Norm | TwnhsE | 2Story | 6 | 3 | 1971 | 1971 | Gable | CompShg | HdBoard | HdBoard | BrkFace | 604.0 | TA | TA | CBlock | TA | TA | No | ALQ | 358 | Unf | 0 | 125 | 483 | GasA | TA | Y | SBrkr | 483 | 504 | 0 | 987 | 0 | 0 | 1 | 1 | 2 | 1 | TA | 5 | Typ | 0 | NaN | Detchd | 1971.0 | Unf | 1 | 264 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 8 | 2008 | WD | Normal | 89500 |
1096 | 1097 | 70 | RM | 60.0 | 6882 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | IDOTRR | Norm | Norm | 1Fam | 2Story | 6 | 7 | 1914 | 2006 | Gable | CompShg | Wd Sdng | Wd Sdng | None | 0.0 | TA | TA | PConc | TA | TA | No | Unf | 0 | Unf | 0 | 684 | 684 | GasA | TA | Y | SBrkr | 773 | 582 | 0 | 1355 | 0 | 0 | 1 | 1 | 3 | 1 | Gd | 7 | Typ | 0 | NaN | NaN | NaN | NaN | 0 | 0 | NaN | NaN | Y | 136 | 0 | 115 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 3 | 2007 | WD | Normal | 127000 |
905 | 906 | 20 | RL | 80.0 | 9920 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 5 | 5 | 1954 | 1954 | Gable | CompShg | HdBoard | HdBoard | Stone | 110.0 | TA | TA | CBlock | TA | TA | No | Rec | 354 | LwQ | 290 | 412 | 1056 | GasA | TA | Y | SBrkr | 1063 | 0 | 0 | 1063 | 1 | 0 | 1 | 0 | 3 | 1 | TA | 6 | Typ | 0 | NaN | Attchd | 1954.0 | RFn | 1 | 280 | TA | TA | Y | 0 | 0 | 164 | 0 | 0 | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 128000 |
715 | 716 | 20 | RL | 78.0 | 10140 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | NWAmes | Norm | Norm | 1Fam | 1Story | 6 | 5 | 1974 | 1974 | Hip | CompShg | HdBoard | HdBoard | BrkFace | 174.0 | TA | TA | CBlock | Gd | TA | No | Unf | 0 | Unf | 0 | 1064 | 1064 | GasA | TA | Y | SBrkr | 1350 | 0 | 0 | 1350 | 0 | 0 | 2 | 0 | 3 | 1 | TA | 7 | Typ | 1 | TA | Attchd | 1974.0 | RFn | 2 | 478 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | NaN | MnPrv | NaN | 0 | 8 | 2009 | WD | Normal | 165000 |
847 | 848 | 20 | RL | 36.0 | 15523 | Pave | NaN | IR1 | Lvl | AllPub | CulDSac | Gtl | CollgCr | Norm | Norm | 1Fam | 1Story | 5 | 6 | 1972 | 1972 | Gable | CompShg | HdBoard | Plywood | None | 0.0 | TA | TA | CBlock | TA | TA | Av | BLQ | 460 | Unf | 0 | 404 | 864 | GasA | Ex | Y | SBrkr | 864 | 0 | 0 | 864 | 1 | 0 | 1 | 0 | 3 | 1 | TA | 5 | Typ | 1 | Fa | Attchd | 1972.0 | Unf | 1 | 338 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 8 | 2009 | WD | Normal | 133500 |
960 | 961 | 20 | RL | 50.0 | 7207 | Pave | NaN | IR1 | Lvl | AllPub | Inside | Gtl | BrkSide | Norm | Norm | 1Fam | 1Story | 5 | 7 | 1958 | 2008 | Gable | CompShg | Wd Sdng | Plywood | None | 0.0 | TA | Gd | CBlock | TA | TA | Gd | BLQ | 696 | Unf | 0 | 162 | 858 | GasA | Gd | Y | SBrkr | 858 | 0 | 0 | 858 | 1 | 0 | 1 | 0 | 2 | 1 | TA | 4 | Typ | 0 | NaN | NaN | NaN | NaN | 0 | 0 | NaN | NaN | Y | 117 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2010 | WD | Normal | 116500 |
144 | 145 | 90 | RM | 70.0 | 9100 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | Sawyer | RRAe | Norm | Duplex | 1Story | 5 | 5 | 1963 | 1963 | Gable | CompShg | HdBoard | HdBoard | BrkFace | 336.0 | TA | TA | CBlock | TA | TA | No | Rec | 1332 | Unf | 0 | 396 | 1728 | GasA | TA | Y | SBrkr | 1728 | 0 | 0 | 1728 | 1 | 0 | 2 | 0 | 6 | 2 | TA | 10 | Typ | 0 | NaN | Detchd | 1963.0 | Unf | 2 | 504 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 11 | 2006 | ConLI | Abnorml | 125000 |
129 | 130 | 20 | RL | 69.0 | 8973 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 5 | 7 | 1958 | 1991 | Gable | CompShg | Plywood | Plywood | BrkFace | 85.0 | TA | TA | CBlock | TA | TA | No | Rec | 567 | BLQ | 28 | 413 | 1008 | GasA | TA | Y | FuseA | 1053 | 0 | 0 | 1053 | 0 | 1 | 1 | 1 | 3 | 1 | Ex | 6 | Typ | 0 | NaN | 2Types | 1998.0 | RFn | 2 | 750 | TA | TA | Y | 0 | 80 | 0 | 180 | 0 | 0 | NaN | MnWw | NaN | 0 | 7 | 2006 | WD | Abnorml | 150000 |
749 | 750 | 50 | RL | 50.0 | 8405 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | Edwards | Norm | Norm | 1Fam | 1.5Fin | 4 | 3 | 1945 | 1950 | Gable | CompShg | WdShing | Wd Shng | None | 0.0 | TA | TA | Slab | NaN | NaN | NaN | NaN | 0 | NaN | 0 | 0 | 0 | Wall | TA | N | FuseF | 1088 | 441 | 0 | 1529 | 0 | 0 | 2 | 0 | 4 | 1 | TA | 9 | Mod | 0 | NaN | Detchd | 1945.0 | Unf | 1 | 240 | TA | TA | N | 92 | 0 | 185 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 4 | 2009 | WD | Normal | 98000 |
508 | 509 | 70 | RM | 60.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | OldTown | Norm | Norm | 1Fam | 2Story | 7 | 9 | 1928 | 2005 | Gambrel | CompShg | MetalSd | MetalSd | None | 0.0 | TA | Ex | BrkTil | TA | TA | No | Rec | 141 | Unf | 0 | 548 | 689 | GasA | Ex | Y | SBrkr | 689 | 689 | 0 | 1378 | 0 | 0 | 2 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1928.0 | Unf | 2 | 360 | TA | TA | N | 0 | 0 | 116 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 10 | 2008 | WD | Normal | 161000 |
1414 | 1415 | 50 | RL | 64.0 | 13053 | Pave | Pave | Reg | Bnk | AllPub | Inside | Gtl | BrkSide | Norm | Norm | 1Fam | 1.5Fin | 6 | 7 | 1923 | 2000 | Gambrel | CompShg | Wd Sdng | Wd Sdng | None | 0.0 | TA | TA | BrkTil | TA | TA | No | Unf | 0 | Unf | 0 | 833 | 833 | GasA | Gd | Y | SBrkr | 1053 | 795 | 0 | 1848 | 0 | 0 | 1 | 1 | 4 | 1 | Gd | 8 | Typ | 1 | Gd | Detchd | 1922.0 | Unf | 2 | 370 | TA | TA | N | 0 | 0 | 0 | 0 | 220 | 0 | NaN | NaN | NaN | 0 | 6 | 2008 | WD | Normal | 207000 |
1305 | 1306 | 20 | RL | 108.0 | 13173 | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | NridgHt | Norm | Norm | 1Fam | 1Story | 9 | 5 | 2006 | 2007 | Hip | CompShg | VinylSd | VinylSd | Stone | 300.0 | Gd | TA | PConc | Ex | TA | No | GLQ | 1572 | Unf | 0 | 80 | 1652 | GasA | Ex | Y | SBrkr | 1652 | 0 | 0 | 1652 | 1 | 0 | 2 | 0 | 2 | 1 | Ex | 6 | Typ | 2 | Ex | Attchd | 2006.0 | Fin | 2 | 840 | TA | TA | Y | 404 | 102 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 11 | 2009 | WD | Normal | 325000 |
1202 | 1203 | 50 | RM | 50.0 | 6000 | Pave | NaN | Reg | Lvl | AllPub | Corner | Gtl | BrkSide | Norm | Norm | 1Fam | 1.5Fin | 5 | 8 | 1925 | 1997 | Gable | CompShg | Wd Sdng | Wd Sdng | None | 0.0 | TA | TA | BrkTil | TA | TA | No | Unf | 0 | Unf | 0 | 884 | 884 | GasA | Ex | Y | SBrkr | 884 | 464 | 0 | 1348 | 1 | 0 | 1 | 0 | 3 | 1 | TA | 5 | Typ | 1 | Fa | Detchd | 1960.0 | Unf | 1 | 216 | TA | TA | N | 0 | 0 | 208 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2009 | WD | Normal | 117000 |
1300 | 1301 | 60 | RL | NaN | 10762 | Pave | NaN | IR1 | Lvl | AllPub | CulDSac | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1999 | 1999 | Gable | CompShg | VinylSd | VinylSd | None | 344.0 | Gd | TA | PConc | Gd | TA | No | GLQ | 694 | Unf | 0 | 284 | 978 | GasA | Ex | Y | SBrkr | 1005 | 978 | 0 | 1983 | 0 | 0 | 2 | 1 | 3 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 1999.0 | Fin | 2 | 490 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2009 | WD | Normal | 225000 |
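To see how duplicates become more frequent as the sample grows, one possible check is to count repeated row labels for several values of frac. This is a minimal sketch on a stand-in frame; `toy` and the helper `count_repeats` are hypothetical names introduced here, not part of the tutorial's code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with the same number of rows as the house_prices frame
toy = pd.DataFrame({"x": np.arange(1460)})

def count_repeats(frame, frac, seed=1):
    """Draw a sample with replacement and count draws that repeat an earlier row."""
    s = frame.sample(frac=frac, replace=True, random_state=seed)
    return len(s), int(s.index.duplicated().sum())

for frac in (0.01, 0.1, 0.5, 1.0):
    n_rows, n_repeated = count_repeats(toy, frac)
    print(f"frac={frac}: {n_rows} rows, {n_repeated} repeated draws")
```

When the sample is as large as the data (frac=1.0, the usual bootstrap setting), only about 63% of the original rows are expected to appear at least once; the rest of the sample consists of repeats.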
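Data Compression & Noise Reduction Using the Haar Discrete Wavelet Transform (DWT)#
Load an example image and generate a noisy version of it by adding Gaussian noise. Note that random_noise converts the image to floating point in the range [0, 1] before adding the noise.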
from skimage.data import camera
from skimage.util import random_noise
image = camera()
noisy_image = random_noise(image, mode='gaussian', var=0.01)
Perform a single-level 2D Discrete Wavelet Transform (DWT) using the Haar wavelet to decompose the image into four sub-bands, each half the height and half the width of the input:
cA: Approximation coefficients representing the low-frequency content of the image.
cH: Horizontal detail coefficients capturing edge information in the horizontal direction.
cV: Vertical detail coefficients capturing edge information in the vertical direction.
cD: Diagonal detail coefficients capturing edge information along the diagonals.
Data Compression: Primarily targets the “cA” component, which holds the significant low-frequency information. In the code below, the largest 70% of the “cA” coefficients are kept and the remaining ones are set to zero, so fewer nonzero values need to be stored.
Noise Reduction: Focuses on the “cH”, “cV”, and “cD” components to mitigate noise. By applying a thresholding technique, it eliminates coefficients below a certain magnitude believed to represent noise rather than true image content, thereby preserving essential edge information while reducing noise.
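To make the decomposition concrete, here is a hand-written single-level Haar transform in plain NumPy. It is only a sketch of the arithmetic on 2x2 blocks: the functions `haar_dwt2` and `haar_idwt2` are hypothetical helpers, and their sign and sub-band naming conventions are an assumption that may differ from those of pywt.dwt2.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar transform of an array with even height and width."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    cA = (a + b + c + d) / 2              # block average (scaled): low-frequency content
    cH = (a + b - c - d) / 2              # top row minus bottom row
    cV = (a - b + c - d) / 2              # left column minus right column
    cD = (a - b - c + d) / 2              # difference along the diagonals
    return cA, (cH, cV, cD)

def haar_idwt2(cA, details):
    """Invert haar_dwt2 exactly."""
    cH, cV, cD = details
    out = np.empty((2 * cA.shape[0], 2 * cA.shape[1]))
    out[0::2, 0::2] = (cA + cH + cV + cD) / 2
    out[0::2, 1::2] = (cA + cH - cV - cD) / 2
    out[1::2, 0::2] = (cA - cH + cV - cD) / 2
    out[1::2, 1::2] = (cA - cH - cV + cD) / 2
    return out

rng = np.random.default_rng(0)
x = rng.random((8, 8))                    # arbitrary test input
cA, (cH, cV, cD) = haar_dwt2(x)

print(cA.shape)                                        # (4, 4)
print(np.allclose(haar_idwt2(cA, (cH, cV, cD)), x))    # True: nothing is lost

# Hard thresholding: zero out small detail coefficients, keep the rest unchanged
t = 0.1                                   # arbitrary threshold for this toy input
cH_t = np.where(np.abs(cH) > t, cH, 0)
```

Without thresholding the transform is exactly invertible, so any loss of information comes only from the coefficients that are set to zero afterwards.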
coeffs = pywt.dwt2(noisy_image, 'haar')
cA, (cH, cV, cD) = coeffs
# Set a threshold to zero out small coefficients in the detail bands for denoising
threshold = 0.02
cH_denoised = np.where(np.abs(cH) > threshold, cH, 0)
cV_denoised = np.where(np.abs(cV) > threshold, cV, 0)
cD_denoised = np.where(np.abs(cD) > threshold, cD, 0)
# Reconstruct the image using the modified coefficients for denoising
denoised_image = pywt.idwt2((cA, (cH_denoised, cV_denoised, cD_denoised)), 'haar')
# Image compression: keep the 70% of approximation coefficients with the largest
# magnitude and set the remaining ones to zero
flat_cA = cA.flatten()
n_largest = int(flat_cA.size * 0.7)
largest_indices = np.argpartition(np.abs(flat_cA), -n_largest)[-n_largest:]
compressed_cA = np.zeros_like(flat_cA)
compressed_cA[largest_indices] = flat_cA[largest_indices]
compressed_cA = compressed_cA.reshape(cA.shape)
# Reconstruct the image from the compressed approximation coefficients
# together with the thresholded (denoised) detail coefficients
compressed_image = pywt.idwt2((compressed_cA, (cH_denoised, cV_denoised, cD_denoised)), 'haar')
# Plot the original, noisy, denoised, and compressed images
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
axs[0, 0].imshow(image, cmap='gray')
axs[0, 0].set_title('Original Image')
axs[0, 0].axis('off')
axs[0, 1].imshow(noisy_image, cmap='gray')
axs[0, 1].set_title('Noisy Image')
axs[0, 1].axis('off')
axs[1, 0].imshow(denoised_image, cmap='gray')
axs[1, 0].set_title('Denoised Image')
axs[1, 0].axis('off')
axs[1, 1].imshow(compressed_image, cmap='gray')
axs[1, 1].set_title('Compressed Image')
axs[1, 1].axis('off')
plt.show()
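The plots give only a visual impression. One way to put numbers on the result is to compute the peak signal-to-noise ratio (PSNR) against the clean image and the share of coefficients that were zeroed. The helpers `psnr` and `zero_fraction` below are hypothetical, and the synthetic gradient image is used only so that the sketch runs on its own.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    diff = np.asarray(reference, dtype=float) - np.asarray(estimate, dtype=float)
    mse = np.mean(diff ** 2)
    return np.inf if mse == 0 else 10 * np.log10(data_range ** 2 / mse)

def zero_fraction(coeffs):
    """Fraction of coefficients that are exactly zero."""
    coeffs = np.asarray(coeffs)
    return np.count_nonzero(coeffs == 0) / coeffs.size

# Synthetic stand-in: a horizontal gradient with Gaussian noise (std 0.1, i.e. var 0.01)
rng = np.random.default_rng(0)
clean = np.tile(np.linspace(0, 1, 64), (64, 1))
noisy = np.clip(clean + rng.normal(0, 0.1, clean.shape), 0, 1)

print(f"PSNR of noisy vs clean: {psnr(clean, noisy):.1f} dB")
print(zero_fraction(np.array([0.0, 1.5, 0.0, -2.0])))   # 0.5
```

With the tutorial's variables, the same helpers could be applied as psnr(image / 255.0, denoised_image) and zero_fraction(cH_denoised); the division by 255 assumes the original camera image is 8-bit, while the noisy and reconstructed images are floats in roughly [0, 1].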
