This series looks at the development and deployment of machine learning (ML) models. In this post, you train an ML model and save that model so it can be deployed as part of an ML system. Part 1 gave an overview of the ML workflow, considering the stages involved in using machine learning and data science to deliver business value. Part 3 looks at how to deploy ML models on Google Cloud Platform (GCP).
Training a model that can be used as part of a machine learning system requires an understanding of your data, your business goals, and many other requirements, both technical and organizational.
In this post, you create a Python script that, when executed, trains an ML model and then saves it for future use.
First, I highlight some important considerations when training an ML model for your application.
Considerations before training the model
From model selection to dataset complexity and size, data practitioners must plan their resources and requirements carefully. The factors to consider before training a model include the following:
- Choice of model
- Explainability
- Model hyperparameters
- Choice of hardware
- Size of data
Choice of model
There are many classes of ML models that you could use to solve a problem. The model that you select is dependent upon your use case and possible constraints.
Explainability
If your model is to be deployed as part of a system that runs in a regulated industry, such as finance or healthcare, it will likely need to be explainable. This means that, for any prediction made by your model, it is possible to state the reasons why the model made that decision.
In such cases, you may wish to stick to models such as linear regression or logistic regression, which are easily explainable.
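As a minimal sketch of what "easily explainable" can mean in practice, the following example fits a logistic regression model and prints its learned coefficients. It assumes scikit-learn's bundled copy of the Iris dataset (introduced later in this post) rather than a local file, and `max_iter=1000` is an illustrative setting, not one used elsewhere in this post.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumption: use scikit-learn's bundled Iris data instead of a local file
iris = load_iris()
model = LogisticRegression(max_iter=1000)
model.fit(iris.data, iris.target)

# coef_ holds one row of weights per class and one column per feature
for class_name, weights in zip(iris.target_names, model.coef_):
    print(class_name)
    for feature_name, weight in zip(iris.feature_names, weights):
        print(f"  {feature_name}: {weight:+.3f}")
```

Each weight indicates how a feature pushes the model toward or away from a class. Because the features here are not scaled, treat the magnitudes as a rough guide rather than a strict ranking of importance.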
Model hyperparameters
Models have tunable hyperparameters. It is important to understand what these hyperparameters correspond to and how they impact the model.
The model’s performance can change greatly depending on the choice of hyperparameters.
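To illustrate, here is a small sketch that varies a single hyperparameter, the regularization setting `C` of scikit-learn's `LogisticRegression`, and compares cross-validated accuracy. The candidate values, the five-fold split, and the use of scikit-learn's bundled Iris data are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# C is the inverse of regularization strength: smaller values regularize more
scores_by_c = {}
for c in [0.001, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=c, max_iter=1000)
    # Mean accuracy over 5 cross-validation folds
    scores_by_c[c] = cross_val_score(model, X, y, cv=5).mean()
    print(f"C={c}: mean accuracy {scores_by_c[c]:.3f}")
```

For a broader search over several hyperparameters at once, scikit-learn's `GridSearchCV` wraps the same idea.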
Choice of hardware
Most data practitioners know that model training can often be accelerated on GPU. But even before you get to the model training stage, GPUs can greatly benefit your data science workflows.
Everything from preprocessing pipelines to data exploration and visualization can also be accelerated. This helps you iterate faster and try out more computationally expensive techniques.
Size of data
When your data is larger than the memory available to a single processor or machine, it is important to think about techniques that let you make the most of all of it.
Perhaps it makes sense to move to the GPU, using tools like RAPIDS, to accelerate your pandas and scikit-learn style workflows. Or maybe you want to investigate a scale-out framework like Dask, which can scale model training and data processing whether you’re working on CPU or GPU.
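Before reaching for those tools, it can help to see the basic idea of processing data in pieces. The sketch below uses plain pandas on a single machine as a stand-in; it is not RAPIDS or Dask, and the in-memory CSV, the `value` column, and the chunk size are hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file too large to load in one go
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

total = 0
row_count = 0
# With chunksize set, read_csv returns an iterator of DataFrames
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += chunk["value"].sum()
    row_count += len(chunk)

mean_value = total / row_count
print(f"Mean over {row_count} rows: {mean_value}")
```

Only one chunk is held in memory at a time, which is the same principle that scale-out frameworks apply across many workers.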
Understanding the dataset
In this post, you train a model on a classic dataset: the Iris dataset from the UCI Machine Learning Repository. This dataset contains measurements of petal length and width and sepal length and width for a set of 150 iris flowers. Each iris belongs to one of three species: setosa, versicolor, or virginica.
You use this data to train a classification model, with the goal of predicting the type of iris, based on petal and sepal dimensions.
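For a quick look at the data before training, the following sketch loads the copy of the Iris dataset that ships with scikit-learn. This is an assumption made for convenience: the rest of this post reads the UCI download from a local file, and the bundled copy encodes the species as integers rather than strings.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy stands in for the UCI download
iris = load_iris(as_frame=True)
df = iris.frame  # four feature columns plus an integer "target" column

print(df.shape)                               # rows and columns
print(list(iris.target_names))                # species names
print(df["target"].value_counts().to_dict())  # samples per species
print(df.describe())                          # summary statistics per column
```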
Training on a CPU
Before you can deploy an ML model, you must first build one. Begin by downloading the popular Iris dataset. This example assumes that the dataset is downloaded and saved as iris.data in your current working directory.
To train the logistic regression model, follow these steps:
- Read the training data.
- Split the training data into features and labels.
- Split the data into a training and test set (75% is training data and 25% is test data).
- Train a Logistic Regression model.
- Persist the trained model.
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def run_training():
    """
    Train the model
    """
    # Read the training data (the file has no header row, so name the columns)
    dataset = pd.read_csv(
        filepath_or_buffer="iris.data",
        names=["sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm", "class"]
    )

    # Split into features and labels
    X = dataset.drop("class", axis=1).copy()
    y = dataset["class"].copy()

    # Create train and test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=26)

    # Train the model
    model = LogisticRegression(random_state=26)
    model.fit(X_train, y_train)

    # Persist the trained model
    joblib.dump(model, "logistic_regression_v1.pkl")


if __name__ == "__main__":
    run_training()
The random_state parameter in both the train_test_split and LogisticRegression calls helps to ensure that this script produces the same results each time it is run.
Running the script produces a model, saved in the file logistic_regression_v1.pkl, which you can use to classify other irises, based on their petal and sepal dimensions.
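The following sketch shows one way the saved model might be loaded and used. It makes several assumptions so that it runs on its own: it trains on scikit-learn's bundled Iris data (with integer labels instead of the species strings in iris.data), writes the model to a temporary directory, and sets `max_iter=1000`. The measurements in `new_flower` are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: bundled Iris data, where labels are 0, 1, 2 instead of strings
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=26)

model = LogisticRegression(random_state=26, max_iter=1000)
model.fit(X_train, y_train)

# Save the model and load it back, as a separate serving process would
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "logistic_regression_v1.pkl")
    joblib.dump(model, model_path)
    loaded_model = joblib.load(model_path)

# Check the reloaded model on the held-out 25% of the data
accuracy = loaded_model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.3f}")

# Hypothetical flower: sepal length, sepal width, petal length, petal width (cm)
new_flower = [[5.1, 3.5, 1.4, 0.2]]
predicted = loaded_model.predict(new_flower)
print(f"Predicted class index: {predicted[0]}")
```

Evaluating on the held-out test set before deployment gives an early signal of how the model may behave on data it has not seen. Load pickled models only from sources you trust, and preferably with the same library versions used to save them.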
GPU-accelerated model training
In this example, you are working with a small dataset containing only 150 rows. Because the data is so small and simple, the model trains in seconds on a CPU.
However, when dealing with real-world datasets, it is not uncommon for model training to become a bottleneck. In such cases, you can often speed up the model training stage of your workflow by working on GPU, rather than CPU.
For example, RAPIDS provides a suite of open-source software tools that enable data scientists and engineers to quickly run workloads and data science pipelines on the GPU. By mimicking the APIs of common data science libraries, like pandas and scikit-learn, you can speed up your machine learning model training (as well as exploratory data science) with only minor code changes.
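As a rough sketch of what "minor code changes" might look like, the example below swaps the scikit-learn import for the cuML equivalent from RAPIDS when it is available, and falls back to scikit-learn otherwise. It assumes scikit-learn's bundled Iris data and casts everything to float32 as a conservative choice that both libraries accept. Treat it as an illustration of the API similarity, not as a benchmark or a tuned GPU workflow.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

try:
    # GPU path: cuML mirrors the scikit-learn estimator API
    from cuml.linear_model import LogisticRegression
    backend = "cuml"
except ImportError:
    # CPU fallback when RAPIDS is not installed
    from sklearn.linear_model import LogisticRegression
    backend = "sklearn"

X, y = load_iris(return_X_y=True)
# Assumption: float32 inputs and labels, which both backends accept
X = X.astype(np.float32)
y = y.astype(np.float32)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=26)

# The fit/predict calls are the same regardless of the backend
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
predictions = np.asarray(model.predict(X_test))

accuracy = float((predictions == y_test).mean())
print(f"Backend: {backend}, held-out accuracy: {accuracy:.3f}")
```

On a dataset this small, the GPU is unlikely to be faster than the CPU. The benefit tends to show up as the data grows.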
What’s next?
Now that you have a trained model, you can consider deploying it to production. In the next post, Machine Learning in Practice: Deploy an ML Model on Google Cloud Platform, you learn three ways to deploy your model on GCP.
Looking for help or guidance on anything from data ingestion to deploying machine learning? Join the Data Processing forum.