Feature Engineering
Dirty Cat: Never use one-hot encoding again!
Most ML models can only work with numeric data, so it is common practice to one-hot encode categorical data before feeding it into a model.
However, one-hot encoding creates a new column for each unique category in a column. For example, if your company has 300 unique job titles, one-hot encoding the job_title column turns that one column into 300 new columns (or 299 if you drop the first).
Adding that many sparse columns to your dataset can lead to the curse of dimensionality and cause model accuracy to suffer.
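To make the column blow-up concrete, here is a minimal sketch using pandas. The `job_title` values are synthetic and made up for illustration; only `pd.get_dummies` and its `drop_first` flag are actual pandas API.

```python
import pandas as pd

# Hypothetical data: 1,000 employees spread over 300 distinct job titles
titles = [f"title_{i % 300}" for i in range(1000)]
df = pd.DataFrame({"job_title": titles})

# One-hot encoding creates one indicator column per unique category
encoded = pd.get_dummies(df, columns=["job_title"])

# drop_first=True omits the first category, leaving one column fewer
encoded_drop_first = pd.get_dummies(df, columns=["job_title"], drop_first=True)

print(f"Columns before encoding: {df.shape[1]}")
print(f"Columns after one-hot encoding: {encoded.shape[1]}")
print(f"Columns after one-hot encoding (drop first): {encoded_drop_first.shape[1]}")
```

A single column becomes 300 (or 299) mostly-zero columns, which is the growth described above.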
How to solve this problem?
You could use CatBoost, which supports categorical features natively: 🌟 GitHub: https://github.com/catboost/catboost
However, if you want to use a scikit-learn algorithm, you can either pick one of the encoders in the category_encoders library: 🌟 GitHub: https://github.com/scikit-learn-contrib/category_encoders
Or you can use dirty_cat: 🌟 GitHub: https://github.com/dirty-cat/dirty_cat
pip install dirty_cat
I particularly like dirty_cat because it:
Needs just one line of code.
Automatically picks an encoder for each categorical column based on its cardinality.
Automatically passes continuous columns through, with no need for a column transformer.
Provides NLP n-gram similarity and topic-modeling-based encoders for string categories. For example, the classes “Firefighter/Rescuer III” and “Fire/Rescue Lieutenant” belong to the same topic: “firefighter, rescuer, rescue.”
Gives strong results in head-to-head comparisons: in the example below it reaches a test R² of 0.896 with 72 columns, versus 0.817 with 401 columns for one-hot encoding.
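As a rough illustration of the n-gram similarity idea mentioned in the list above, the sketch below compares job titles by the character 3-grams they share, using Jaccard similarity. This is an assumption-laden toy, not dirty_cat's actual implementation; the helper names `char_ngrams` and `ngram_similarity` are hypothetical, and the title "Accountant" is only there for contrast.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i : i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Titles that share substrings such as "fir" and "res" overlap in n-grams
related = ngram_similarity("Firefighter/Rescuer III", "Fire/Rescue Lieutenant")
# A title with no shared 3-grams scores zero
unrelated = ngram_similarity("Firefighter/Rescuer III", "Accountant")

print(f"related: {related:.2f}, unrelated: {unrelated:.2f}")
```

Unlike one-hot encoding, which treats every distinct string as unrelated to every other, a similarity like this lets near-duplicate categories end up close to each other in the encoded space.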
📖 Docs: https://dirty-cat.github.io/stable/index.html
# Imports
import warnings
import pandas as pd
import plotly.express as px
# Note: newer dirty_cat releases rename SuperVectorizer to TableVectorizer
from dirty_cat import SuperVectorizer
from dirty_cat.datasets import fetch_employee_salaries
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
warnings.filterwarnings("ignore", category=FutureWarning)
# Load a dataset with high-cardinality categorical features
employee_salaries = fetch_employee_salaries()
X = employee_salaries.X
y = employee_salaries.y
# Pre-processing: drop rows with a missing gender, keeping X and y aligned
mask = X["gender"].isna()
X = X[~mask]
y = y[~mask]
# Keep a small subset of columns for the comparison
X = X[
    [
        "gender",
        "department_name",
        "assignment_category",
        "employee_position_title",
        "year_first_hired",
    ]
]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)
# Print original data stats
print(f"Num cols before categorical encoding: {X_train.shape[1]}")
print(f"Num unique values in each col: \n{X_train.nunique()}")
Num cols before categorical encoding: 5
Num unique values in each col:
gender 2
department_name 37
assignment_category 2
employee_position_title 363
year_first_hired 51
dtype: int64
# Group the cols
cat_cols = [
"gender",
"department_name",
"assignment_category",
"employee_position_title",
]
cont_cols = ["year_first_hired"]
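The two lists above are written out by hand. As a hedged alternative, the same split can be inferred from column dtypes with pandas' `select_dtypes`. The toy frame below is hypothetical and only reuses the column names from this example; the `inferred_*` variable names are made up so they do not clash with `cat_cols` and `cont_cols`.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the example
toy = pd.DataFrame(
    {
        "gender": ["F", "M", "F"],
        "department_name": ["Police", "Fire and Rescue", "Police"],
        "assignment_category": ["Full time", "Part time", "Full time"],
        "employee_position_title": [
            "Police Officer III",
            "Firefighter/Rescuer III",
            "Police Sergeant",
        ],
        "year_first_hired": [1998, 2007, 2013],
    }
)

# Non-numeric columns are treated as categorical, numeric ones as continuous
inferred_cat_cols = toy.select_dtypes(exclude="number").columns.tolist()
inferred_cont_cols = toy.select_dtypes(include="number").columns.tolist()

print(f"Categorical: {inferred_cat_cols}")
print(f"Continuous: {inferred_cont_cols}")
```

This only works when dtypes reflect meaning; a numeric ID or code column would need to be moved to the categorical list manually.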
def plot_feature_importance(pipeline, features):
    importances = pipeline[1].feature_importances_
    feature_importance_df = (
        pd.DataFrame({"feature": features, "importances": importances})
        .sort_values("importances")
        .tail(10)
    )
    fig = px.bar(
        feature_importance_df, x="importances", y="feature", title="Feature Importance"
    )
    fig.show()


def pipeline_model(encoder, X_train, y_train, X_test, y_test):
    ## Pipeline Model
    pipeline = make_pipeline(encoder, RandomForestRegressor())
    ## Fit and evaluate model
    pipeline.fit(X_train, y_train)
    train_score = pipeline.score(X_train, y_train)
    test_score = pipeline.score(X_test, y_test)
    print(f"Train score: {train_score:.3f}, Test score: {test_score:.3f}")
    # Check the features
    features = pipeline[0].get_feature_names_out()
    print(f"Num cols after encoding: {len(features)}")
    print(f"Encoded columns: ...{features[-5:]}")
    # Plot Feature Importance
    plot_feature_importance(pipeline, features)
    return pipeline, features
# One Hot encoding - 🪢 Complex!
oh_encoder = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore", drop="first"), cat_cols),
    ("passthrough", cont_cols),
)
# Create and fit the pipeline model
pipeline, features = pipeline_model(oh_encoder, X_train, y_train, X_test, y_test)
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:188: UserWarning: Found unknown categories in columns [3] during transform. These unknown categories will be encoded as all zeros
warnings.warn(
Train score: 0.967, Test score: 0.817
Num cols after encoding: 401
Encoded columns: ...['onehotencoder__employee_position_title_Welder'
'onehotencoder__employee_position_title_Work Force Leader II'
'onehotencoder__employee_position_title_Work Force Leader III'
'onehotencoder__employee_position_title_Work Force Leader IV'
'passthrough__year_first_hired']
# Dirty Cat SuperVectorizer - 🧵 Simple!
super_encoder = SuperVectorizer()
# Create and fit the pipeline model
pipeline, features = pipeline_model(super_encoder, X_train, y_train, X_test, y_test)
Train score: 0.975, Test score: 0.896
Num cols after encoding: 72
Encoded columns: ...['warehouse, worker, caseworker', 'representative, recreation, construction', 'communications, telecommunications, safety', 'aide, disability, legislative', 'year_first_hired']
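The two column counts (401 and 72) can be reproduced with simple arithmetic from the unique-value counts printed earlier. The one-hot figure follows directly from `drop="first"`. The SuperVectorizer figure rests on two assumptions that the output is consistent with but that are not stated here: columns with fewer than 40 unique values are one-hot encoded without dropping a category, and higher-cardinality columns are reduced to 30 topic columns. `CARDINALITY_THRESHOLD` and `N_TOPICS` are hypothetical names for those assumed values.

```python
# Unique values per categorical column in the training split (from the printed stats)
n_unique = {
    "gender": 2,
    "department_name": 37,
    "assignment_category": 2,
    "employee_position_title": 363,
}
n_continuous = 1  # year_first_hired is passed through unchanged

# One-hot with drop="first": each column contributes (unique values - 1) columns
one_hot_cols = sum(k - 1 for k in n_unique.values()) + n_continuous

# Assumed SuperVectorizer behavior (an assumption, not confirmed by the text):
CARDINALITY_THRESHOLD = 40  # below this, one-hot encode with no dropped category
N_TOPICS = 30  # at or above it, encode into this many topic columns
super_cols = (
    sum(k if k < CARDINALITY_THRESHOLD else N_TOPICS for k in n_unique.values())
    + n_continuous
)

print(f"One-hot columns: {one_hot_cols}")
print(f"SuperVectorizer columns: {super_cols}")
```

Under these assumptions, almost all of the reduction comes from employee_position_title, which shrinks from 362 indicator columns to 30 topic columns.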