Interactive Pipeline and Composite Estimators for Your End-to-End ML Model
Machine Learning Modeling | posted by ODSC Community, November 3, 2022
A data science model development pipeline involves various components, including data ingestion, data preprocessing, feature engineering, feature scaling, and modeling. A data scientist needs to write the learning and inference code for all of these components. For machine learning projects with heterogeneous data, the code structure often becomes messy and difficult for other team members to interpret.
A pipeline is a very handy construct that sequentially assembles all of your model development components. Using a pipeline, one can perform the learning and inference tasks with a comparatively cleaner code structure.
In this article, we will discuss how to use scikit-learn pipelines to structure your code by chaining estimators and using column transformers while developing an end-to-end machine learning model.
What is a Pipeline?
A pipeline lists all the data processing and feature engineering estimators sequentially in a clean code structure. Essentially, it chains multiple estimators into one. Pipelines are very convenient for learning and inference tasks and help avoid data leakage. One can also perform a grid search over the parameters of all estimators in the pipeline at once, as the sketch below shows.
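For instance, here is a minimal, self-contained sketch (using a made-up toy dataset, not the case-study data) of chaining a scaler and a classifier into one estimator and grid-searching a parameter of the final step via scikit-learn's stepname__parameter naming convention:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data purely for illustration
X, y = make_classification(n_samples=200, random_state=42)

# Chain a scaler and a classifier into a single estimator
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))

# Parameters of any step are addressed as <stepname>__<parameter>;
# make_pipeline names each step after its lowercased class name
param_grid = {'logisticregression__C': [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

Because the scaler is refitted inside every cross-validation fold, the grid search never leaks statistics from the validation data into training.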
I will develop an end-to-end machine learning model on a binary classification sample dataset (891 rows) with heterogeneous features. The sample dataset has 8 independent features of text, numerical, and categorical data types.
(Image by Author), Snapshot of the sample dataset
Usage:
The sample dataset includes a text feature (Name), categorical features (Sex, Embarked), and numerical features (Pclass, Age, SibSp, Parch, Fare).
The raw real-world dataset might contain a lot of missing values. We can use the SimpleImputer class from the scikit-learn package to impute them. For categorical features, we can chain a one-hot encoder followed by an SVD estimator for feature decomposition.

For text features, we can vectorize the text using CountVectorizer or TfidfVectorizer to convert the text data into numerical embeddings, followed by a dimensionality reduction estimator.
Pipeline 1 (for categorical features):
1) Most-frequent-value imputer
2) One-hot encoder
3) Truncated SVD decomposition

Pipeline 2 (for text-based features):
1) Tf-Idf vectorizer
2) Truncated SVD decomposition
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

most_frequent_imputer = SimpleImputer(strategy='most_frequent')
onehot_encoder = OneHotEncoder(handle_unknown='ignore')
vectorizer = TfidfVectorizer()

# Use a separate SVD instance per pipeline so that fitting one
# pipeline does not overwrite the components learned by the other
svd_cat = TruncatedSVD(n_components=2, random_state=42)
svd_text = TruncatedSVD(n_components=2, random_state=42)

# Pipeline 1: impute -> one-hot encode -> decompose
pipe1 = make_pipeline(most_frequent_imputer, onehot_encoder, svd_cat)
# Pipeline 2: vectorize -> decompose
pipe2 = make_pipeline(vectorizer, svd_text)
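As a quick illustrative check (assuming the sample data is loaded in a DataFrame named df, which the article does not show explicitly), each pipeline can be fitted on its matching subset of columns. Note that the text pipeline expects a 1-D array of strings, so the 'Name' column is passed as a Series rather than a one-column DataFrame:

# df is an assumed name for the sample DataFrame shown above
cat_features = pipe1.fit_transform(df[['Sex', 'Embarked']])  # shape (n_samples, 2)
text_features = pipe2.fit_transform(df['Name'])              # shape (n_samples, 2)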
Column Transformer for Heterogeneous Data:
The sample dataset contains features of various data types, from text to float and object dtypes, so each type of feature requires its own feature engineering strategy.

ColumnTransformer is a scikit-learn estimator that lets developers apply different feature engineering and data transformation steps to different sets of columns. The good part of column transformers is that the transformations happen within the pipeline, safe from data leakage issues.
Usage:
I have applied different feature transformation strategies to different sets of features:
Pipeline 1 for categorical features such as ‘Sex’ and ‘Embarked’
Pipeline 2 for text-based features such as ‘Name’
Mean imputer for numerical feature ‘Age’ as it has a lot of missing values.
The remaining numerical features don’t require any feature transformation, so they are passed through as-is using the ‘passthrough’ keyword
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import make_column_transformer

most_frequent_imputer = SimpleImputer(strategy='most_frequent')
onehot_encoder = OneHotEncoder(handle_unknown='ignore')
vectorizer = CountVectorizer()
mean_imputer = SimpleImputer(strategy='mean')
svd_cat = TruncatedSVD(n_components=2, random_state=42)
svd_text = TruncatedSVD(n_components=2, random_state=42)

pipe1 = make_pipeline(most_frequent_imputer, onehot_encoder, svd_cat)
pipe2 = make_pipeline(vectorizer, svd_text)

# Apply a different transformer to each subset of columns;
# the text column 'Name' is passed as a string (1-D), not a list
column_trans = make_column_transformer(
    (pipe1, ['Sex', 'Embarked']),
    (pipe2, 'Name'),
    (mean_imputer, ['Age']),
    ('passthrough', ['Fare', 'SibSp', 'Parch', 'Pclass']))
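A quick sanity check (assuming the sample data is in a DataFrame df, with an assumed target column named 'Survived') is to fit the transformer and inspect the output shape: each of the two pipelines contributes 2 SVD components, 'Age' contributes 1 column, and the 4 passthrough columns are appended, giving 9 columns in total:

# 'Survived' is an assumed target column name for this sketch
X = df.drop(columns=['Survived'])
transformed = column_trans.fit_transform(X)
print(transformed.shape)  # (n_samples, 9): 2 SVD + 2 SVD + 1 Age + 4 passthrough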
Modeling Pipeline:
After performing the data transformation steps, one can move to the modeling component. I will use a LogisticRegression estimator trained on the transformed dataset. Before the modeling stage, we can also include a StandardScaler estimator to standardize the transformed features.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# pipe1, pipe2, and mean_imputer are defined in the previous snippet
column_trans = make_column_transformer(
    (pipe1, ['Sex', 'Embarked']),
    (pipe2, 'Name'),
    (mean_imputer, ['Age']),
    ('passthrough', ['Fare', 'SibSp', 'Parch', 'Pclass']))

scaler = StandardScaler()
classifier = LogisticRegression(random_state=42)

# Final pipeline: column transformer -> scaler -> classifier
pipeline = make_pipeline(column_trans, scaler, classifier)
Visualizing the Pipeline:
A visual representation of the entire pipeline makes the end-to-end flow of the case study easy to interpret. Scikit-learn provides a set_config function that enables the developer to display a diagrammatic representation of the entire end-to-end pipeline.

By default, the set_config display parameter is ‘text’, which renders the pipeline in textual form. Setting it to ‘diagram’ switches to the interactive diagrammatic view shown below.
from sklearn import set_config

# Switch the default 'text' representation to an interactive diagram
set_config(display='diagram')

pipeline  # evaluating the pipeline in a notebook cell now renders the diagram
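If a shareable artifact is needed, the same diagram can also be written to a standalone HTML file with scikit-learn’s estimator_html_repr utility (the file name below is arbitrary):

from sklearn.utils import estimator_html_repr

# Save the interactive pipeline diagram as a standalone HTML page
with open('pipeline_diagram.html', 'w', encoding='utf-8') as f:
    f.write(estimator_html_repr(pipeline))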
(Image by Author), Diagrammatic interpretation of the entire pipeline
Learning and Inference:
One can train the model pipeline using the .fit() function and perform inference using the .predict() function, as in the sketch below.
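Putting it all together, a sketch of the full learning and inference loop might look like this (the file name 'train.csv' and the target column 'Survived' are assumptions for illustration; the article does not show the data-loading step):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 'train.csv' and the 'Survived' target column are assumed for this sketch
df = pd.read_csv('train.csv')
X = df[['Name', 'Sex', 'Embarked', 'Age', 'Fare', 'SibSp', 'Parch', 'Pclass']]
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Learning: every step of the pipeline is fitted on the training data only
pipeline.fit(X_train, y_train)

# Inference: the same transformations are applied to the test data automatically
y_pred = pipeline.predict(X_test)
print(accuracy_score(y_test, y_pred))

Because all imputation, encoding, decomposition, and scaling steps live inside the pipeline, they are fitted exclusively on the training split, which is exactly how the pipeline prevents data leakage.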
{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "8a6bc438", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "\n", "from sklearn.datasets import load_iris\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.decomposition import PCA\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.preprocessing import OneHotEncoder\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.preprocessing import LabelEncoder\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.compose import make_column_transformer\n", "from sklearn.decomposition import PCA\n", "from sklearn.utils import estimator_html_repr\n", "from sklearn.decomposition import TruncatedSVD\n", "from sklearn.metrics import *" ] }, { "cell_type": "code", "execution_count": 14, "id": "8126bc5c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(891, 9)\n" ] }, { "data": { "text/html": [ "
\n", "