Feature Engineering

Dirty Cat

Dirty Cat: Never use one-hot encoding again!

  • Most ML models only work with numeric data, so it’s common practice to one-hot encode categorical data before feeding it into a model.

  • However, one-hot encoding creates a new column for each unique category in your column. E.g., if your company has 300 unique job titles, one-hot encoding the job_title column turns that single column into 300 (or 299, if you drop the first) new columns.

  • Adding so many columns to your dataset invites the curse of dimensionality and can cause model accuracy to suffer, as the sketch after this list shows.
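
For intuition, here is a minimal sketch of that blow-up (the job-title values are hypothetical placeholders):

# Hypothetical toy data: one column with 300 unique job titles
import pandas as pd

df = pd.DataFrame({"job_title": [f"title_{i}" for i in range(300)]})
print(df.shape)                  # (300, 1)
print(pd.get_dummies(df).shape)  # (300, 300): one new column per unique title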

How to solve this problem?

pip install dirty_cat

I particularly like dirty_cat because:

  • Simple implementation in just one line!

  • Automatically selects the best encoder to use for each categorical column.

  • Automatically passes through continuous columns without the need for column transformers.

  • Provides NLP n-gram-similarity and topic-modeling-based encoders for string categories. E.g., the classes “Firefighter/Rescuer III” and “Fire/Rescue Lieutenant” belong to the same topic “firefighter, rescuer, rescue” (see the sketch after this list).

  • Excellent results in head-to-head comparisons: in the demo below, it beats one-hot encoding on test score while producing far fewer columns.
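
To see the topic-modeling encoder in isolation, here is a minimal sketch built on dirty_cat’s GapEncoder, the encoder that SuperVectorizer applies to high-cardinality string columns (assuming a release that exposes get_feature_names_out; older versions call it get_feature_names):

from dirty_cat import GapEncoder

# A few related job titles, echoing the example above
titles = [
    ["Firefighter/Rescuer III"],
    ["Fire/Rescue Lieutenant"],
    ["Firefighter/Rescuer II"],
    ["Fire/Rescue Captain"],
]

# Learn 2 latent topics over the character n-grams of the strings
encoder = GapEncoder(n_components=2)
activations = encoder.fit_transform(titles)

# Each output column is a topic labeled by its most representative words,
# e.g. something like "firefighter, rescuer, rescue"
print(encoder.get_feature_names_out())
print(activations.shape)  # (4, 2): one topic-activation vector per title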

🌟 Github: https://github.com/dirty-cat/dirty_cat

📖 Docs: https://dirty-cat.github.io/stable/index.html

# Imports
import warnings

import pandas as pd
import plotly.express as px
from dirty_cat import SuperVectorizer
from dirty_cat.datasets import fetch_employee_salaries
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

warnings.filterwarnings("ignore", category=FutureWarning)

# Load a dataset with high-cardinality categorical features
employee_salaries = fetch_employee_salaries()
X = employee_salaries.X
y = employee_salaries.y

# Pre-processing: drop rows with a missing gender and align the target
mask = X.isna()["gender"]
X.dropna(subset=["gender"], inplace=True)
y = y[~mask]

# Keep a subset of informative columns
X = X[
    [
        "gender",
        "department_name",
        "assignment_category",
        "employee_position_title",
        "year_first_hired",
    ]
]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)

# Print original data stats
print(f"Num cols before categorical encoding: {X_train.shape[1]}")
print(f"Num unique values in each col: \n{X_train.nunique()}")
Num cols before categorical encoding: 5
Num unique values in each col: 
gender                       2
department_name             37
assignment_category          2
employee_position_title    363
year_first_hired            51
dtype: int64

# Group the columns by type
cat_cols = [
    "gender",
    "department_name",
    "assignment_category",
    "employee_position_title",
]
cont_cols = ["year_first_hired"]


def plot_feature_importance(pipeline, features):
    # Pull importances from the fitted RandomForest (second pipeline step)
    importances = pipeline[1].feature_importances_
    feature_importance_df = (
        pd.DataFrame({"feature": features, "importances": importances})
        .sort_values("importances")
        .tail(10)
    )
    fig = px.bar(
        feature_importance_df, x="importances", y="feature", title="Feature Importance"
    )
    fig.show()


def pipeline_model(encoder, X_train, y_train, X_test, y_test):
    # Chain the encoder with a RandomForest regressor
    pipeline = make_pipeline(encoder, RandomForestRegressor())

    # Fit, then report train and test scores (R² for regressors)
    pipeline.fit(X_train, y_train)
    train_score = pipeline.score(X_train, y_train)
    test_score = pipeline.score(X_test, y_test)
    print(f"Train score: {train_score:.3f}, Test score: {test_score:.3f}")

    # Check the features
    features = pipeline[0].get_feature_names_out()
    print(f"Num cols after encoding: {len(features)}")
    print(f"Encoded columns: ...{features[-5:]}")

    # Plot Feature Importance
    plot_feature_importance(pipeline, features)
    return pipeline, features

# One-hot encoding - 🪢 Complex!
oh_encoder = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore", drop="first"), cat_cols),
    ("passthrough", cont_cols),
)

# Create and fit the pipeline model
pipeline, features = pipeline_model(oh_encoder, X_train, y_train, X_test, y_test)
Train score: 0.967, Test score: 0.817
Num cols after encoding: 401
Encoded columns: ...['onehotencoder__employee_position_title_Welder'
 'onehotencoder__employee_position_title_Work Force Leader II'
 'onehotencoder__employee_position_title_Work Force Leader III'
 'onehotencoder__employee_position_title_Work Force Leader IV'
 'passthrough__year_first_hired']
# Dirty Cat SuperVectorizer - 🧵 Simple!
super_encoder = SuperVectorizer()

# Create and fit the pipeline model
pipeline, features = pipeline_model(super_encoder, X_train, y_train, X_test, y_test)
Train score: 0.975, Test score: 0.896
Num cols after encoding: 72
Encoded columns: ...['warehouse, worker, caseworker', 'representative, recreation, construction', 'communications, telecommunications, safety', 'aide, disability, legislative', 'year_first_hired']
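
With 72 encoded columns instead of 401, the SuperVectorizer pipeline is not only far simpler to set up but also generalizes better: test score 0.896 vs. 0.817 for the one-hot pipeline.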