The goal of this blog is to cover the key topics to consider in operationalizing machine learning and to provide a practical guide for navigating the modern tools available along the way. To that end, the subsequent blogs will cover architecture concepts in more detail and help you apply them to your own model pipelines.
This blog series will not explain machine learning concepts. Rather, it tackles the auxiliary challenges, such as dealing with large data sets, computational requirements and optimizations, and the deployment of models and data to large software systems.
Most classical software applications are deterministic: the developer writes explicit lines of code that encapsulate the logic for the desired behavior.
ML software applications, by contrast, are probabilistic: the developer writes more abstract code and lets the computer write the rest in a human-unfriendly language, i.e. the weights or parameters of the ML model. Andrej Karpathy has written in detail about this difference in his blog.
This requires us to look at new methods for getting data, cleaning it, training, and deployment, since apart from code we also have weights and data that keep changing.
One of the first steps in a machine learning project is to gather data, clean it, and make it ready for experimenting and building models.
This work may start out as a manual process, but without automated pipelines to operationalize these ETL processes, the technical debt increases over time.
In addition, there needs to be a way to store large volumes of data, either in cloud storage or on a file storage system. Storage also means proper tooling for gathering and labeling data and for making data access scalable.
Finally, as the data is being transformed, it is key to keep track of data versions so that downstream, when the data is used for experimentation, training, or testing of algorithms, each run can be associated with a trackable version of the data.
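To make the idea of tying a run to a data version concrete, here is a minimal sketch that derives a version identifier from a file's contents and appends it to a run manifest. The function names, the 12-character hash prefix, and the JSON-lines manifest are assumptions made for illustration. Dedicated data versioning tools do considerably more than this.

```python
import hashlib
import json
from pathlib import Path


def dataset_version(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return a short content hash that changes whenever the file's bytes change."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:12]


def record_run(run_id: str, data_path: Path, manifest_path: Path) -> dict:
    """Append a (run, dataset version) entry to a JSON-lines manifest (hypothetical format)."""
    entry = {
        "run_id": run_id,
        "data_file": data_path.name,
        "data_version": dataset_version(data_path),
    }
    with manifest_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Because the identifier is computed from the bytes of the file, two runs that report the same version are guaranteed to have read identical data, regardless of file name or timestamp.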
Once data is gathered and explored, it is time to perform feature engineering and modeling. While some methods require strong domain knowledge to make sensible feature engineering decisions, others can learn significantly from the data. Models such as logistic regression, random forests, or deep learning models are then trained on the data.
There are multiple steps involved here and keeping track of experiment versions is essential for governance and reproducibility of previous experiments.
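As a rough illustration of what keeping track of experiment versions involves, the sketch below records each run's parameters and metrics in an append-only log and retrieves the best run later. The `Experiment` class, the `best_run` helper, and the log format are hypothetical and do not reflect the API of any particular tracking tool.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Experiment:
    """One training run: what was tried (params) and how it went (metrics)."""
    name: str
    params: dict
    metrics: dict = field(default_factory=dict)
    started_at: float = field(default_factory=time.time)

    def log_metric(self, key: str, value: float) -> None:
        self.metrics[key] = value

    def save(self, log_path: Path) -> None:
        # Append-only log: earlier runs are never overwritten.
        with log_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(self)) + "\n")


def best_run(log_path: Path, metric: str) -> dict:
    """Return the logged run with the highest value for the given metric."""
    lines = log_path.read_text(encoding="utf-8").splitlines()
    runs = [json.loads(line) for line in lines if line]
    return max(runs, key=lambda run: run["metrics"][metric])
```

A run would typically be created before training, updated with `log_metric` during evaluation, and saved once at the end, so that any previous result can be traced back to the exact parameters that produced it.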
Hence, it is essential to have both the tools and an IDE for managing experiments across Jupyter notebooks, scripts, and other artifacts. Such tools require provisioning of hardware and proper frameworks to allow data scientists to perform their jobs optimally.
Deployment and management of model and ML pipelines
After the model is trained and performing well, it is essential to deploy it into a product in order to leverage the output of this machine learning initiative, whether that is in the cloud or directly “on the edge”.
Deployment in Machine Learning systems can mean any of the following:
The following architecture diagram showcases this with AWS tools. Either of these setups allows a data scientist or machine learning engineer to deploy their models.
After the model has been deployed, it is also essential to understand how it performs in production in order to avoid issues with model or concept drift.
Tools that monitor and visualize data distributions, and that compute metrics for how the test data differs from the training data, can be leveraged to track ongoing model performance and to flag degradation early.
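One metric of this kind is the Population Stability Index (PSI), which compares how a feature is distributed in two samples. The sketch below is a minimal pure-Python version, assuming equal-width bins taken from the training sample and a small floor for empty bins. Both are choices made for this example, and the value at which to raise an alert is project-specific.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a new sample."""
    lo, hi = min(expected), max(expected)
    # Fall back to a width of 1.0 if all reference values are identical.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range land in the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        eps = 1e-6  # floor so that empty bins do not produce log(0)
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical samples give a PSI of zero, and the value grows as the two distributions move apart, so it can be computed per feature on a schedule and plotted over time.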
Usual software continuous integration (CI) principles do not directly translate into the world of ML. Data scientists and ML engineers are not writing code according to a prototype specification, so it feels unnatural to write unit tests.
CI for ML has two key goals:
CI should be built around test data: run the prediction scripts at regular intervals and validate the outputs received from the model against ground truth data.
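A check along these lines could look roughly like the following sketch, which scores a model on held-out examples and fails when accuracy drops below a threshold. The `threshold_model` function, the sample data, and the 0.9 threshold are hypothetical stand-ins for a real model and test set.

```python
from typing import Callable, Sequence


def accuracy(predict: Callable, inputs: Sequence, labels: Sequence) -> float:
    """Fraction of test examples where the prediction matches the ground truth."""
    correct = sum(1 for x, y in zip(inputs, labels) if predict(x) == y)
    return correct / len(labels)


def check_model(predict: Callable, inputs: Sequence, labels: Sequence,
                min_accuracy: float) -> None:
    """Raise AssertionError, failing the CI job, if the model is below the threshold."""
    score = accuracy(predict, inputs, labels)
    if score < min_accuracy:
        raise AssertionError(
            f"accuracy {score:.3f} is below the required {min_accuracy:.3f}"
        )


# Hypothetical stand-in for a trained model's prediction function.
def threshold_model(features: Sequence[float]) -> int:
    return int(sum(features) > 1.0)


# Hypothetical ground truth data kept under version control with the tests.
TEST_INPUTS = [[0.2, 0.1], [0.9, 0.8], [0.4, 0.3], [1.5, 0.2]]
TEST_LABELS = [0, 1, 0, 1]

check_model(threshold_model, TEST_INPUTS, TEST_LABELS, min_accuracy=0.9)
```

Running the same script on a schedule, as well as on every change to code or data, turns model quality into something a CI system can pass or fail on.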
This step combines all of the above-mentioned steps into a reproducible pipeline that can be run sequentially. Generally, it will include the following steps:
ETL of data: Create a robust self-service data pipeline along with a set of utility tools that make gathering, loading, and transforming data and building datasets a much faster and simpler task.
Establish a Continuous Integration pipeline that uses test data and runs prediction scripts to validate the reproducibility of the desired output and the performance of the model.
Then we can deploy the model into production, either for online prediction as a REST API, for offline predictions, or on the edge. Establishing a method for ongoing monitoring of these models in the deployed code enables full end-to-end performance tracking of the model.
Continuous delivery ensures that a pipeline for all of the above steps is created. Traditional software tools for continuous delivery can still be leveraged with a few tweaks, along with specific data processing and machine learning tools to establish an automated and reliable workflow.
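The sequential structure described above can be sketched as an ordered list of named steps, each receiving the output of the previous one. Every stage below is a hypothetical toy stand-in (a one-parameter linear model fitted to a handful of rows). In practice each stage would call out to real data, training, and deployment tooling.

```python
from typing import Any, Callable, List, Tuple


def run_pipeline(steps: List[Tuple[str, Callable[[Any], Any]]], initial: Any = None) -> Any:
    """Run each step in order, passing the previous step's output to the next."""
    result = initial
    for name, step in steps:
        print(f"running step: {name}")
        result = step(result)
    return result


def etl(_: Any) -> dict:
    # Toy data; rows with a missing feature are dropped during cleaning.
    raw = [(1.0, 2.1), (2.0, 3.9), (None, 1.0), (3.0, 6.2), (4.0, 8.1)]
    rows = [(x, y) for x, y in raw if x is not None]
    return {"train": rows[:-1], "test": rows[-1:]}


def train(data: dict) -> dict:
    # Least-squares slope for a line through the origin.
    rows = data["train"]
    slope = sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)
    return {"slope": slope, "test": data["test"]}


def validate(state: dict) -> dict:
    # Stop the pipeline if the held-out error is too large.
    error = max(abs(state["slope"] * x - y) for x, y in state["test"])
    if error > 0.5:
        raise ValueError(f"validation failed: max error {error:.3f}")
    return state


def deploy(state: dict) -> dict:
    # A real step would publish the model; here we only return a record.
    return {"model": {"slope": state["slope"]}, "status": "deployed"}


result = run_pipeline(
    [("etl", etl), ("train", train), ("validate", validate), ("deploy", deploy)]
)
```

Because a failed validation raises before the deploy step runs, a model that does not meet the bar never reaches production, which is the property an automated delivery workflow is meant to provide.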