There are many open questions in machine learning that are only going to be solved through breakthroughs in research. Those problems are down to data scientists and researchers.
At the same time, there are many challenges within production machine learning that closely parallel challenges in software engineering — problems we’ve spent decades solving.
What we need, and what we’re seeing happen, is for software engineering principles and patterns to be applied to the challenges of production machine learning.
Below are a few examples of how this is already happening:
You typically hear about “reproducibility” in reference to ML research, particularly when a paper doesn’t include enough information to recreate the experiment. However, reproducibility also comes up a lot in production ML.
Think of it this way — you’re on a team staffed with data scientists and engineers, and you’re all responsible for an image classification API. The data scientists are constantly trying new techniques and architectural tweaks to improve the model’s baseline performance, while at the same time, the model is constantly being retrained on new data.
Looking over the API’s performance, you see one moment a week ago where the model’s performance dropped significantly. What caused that drop? Without knowing exactly how the model was trained, and on what data, it’s impossible to know for sure.
In software engineering, we use version control to solve this. But, as many engineers have learned, you can’t just use Git to version control your model and training data. Git doesn’t handle very large files well, and this is a deal breaker when you’re handling gigabytes of raw data. Additionally, the training data, the experiment code, and the resulting model need to be versioned together as a single experiment.
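To make "versioned together as a single experiment" concrete, here is a minimal sketch of a manifest that records a content hash for each of the three artifacts and derives one experiment ID from them. This is only an illustration and not how any particular tool works internally; `file_digest`, `experiment_manifest`, and the manifest layout are hypothetical names chosen for this example.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def experiment_manifest(data: Path, code: Path, model: Path) -> dict:
    """Hypothetical manifest tying data, code, and model to one experiment."""
    parts = {
        "data": file_digest(data),
        "code": file_digest(code),
        "model": file_digest(model),
    }
    # The experiment ID changes if any one of the three artifacts changes.
    combined = json.dumps(parts, sort_keys=True).encode("utf-8")
    experiment_id = hashlib.sha256(combined).hexdigest()[:12]
    return {"experiment_id": experiment_id, **parts}
```

A small manifest like this can live in Git while the large files themselves are stored elsewhere, which is roughly the division of labor that makes versioning large artifacts practical.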
The principles of version control, however, are still applicable. That’s why Data Version Control (DVC) has become so popular among ML teams over the last few years. DVC’s maintainers explain the project like this:
Deploying models is one of the most commonly complained-about parts of production ML. The process is often characterized as a messy, hack-things-till-it-works procedure.
For context, to deploy a model as a web service for realtime inference, you need to:
And within each of those tasks, there is a world of glue code and hackery required:
The list can go on much longer than that.
In software engineering, we automate a lot of this with orchestration and DevOps tooling. In machine learning, there hasn’t been an equivalent tool. Lambda has size limits that rule out larger models, and Elastic Beanstalk and Elastic Container Service require a good deal of custom configuration under the hood to run inference, which defeats the point of using them.
We built an open source tool, Cortex, specifically because of this. Similar to Serverless or Beanstalk, Cortex takes simple config files, and then deploys model APIs to cloud infrastructure, automating all of the underlying DevOps:
A model’s performance can change over time for a number of reasons. Training data can change, training techniques can be tweaked, user behavior can change, etc.
Catching performance issues and rolling models back is a nontrivial challenge, with a variety of hacked-together solutions used by teams in the field.
To a software engineer, this sounds very familiar. Applications can also experience periods of degraded performance, oftentimes for similar reasons. There’s an entire ecosystem of monitoring tools built exactly for this, like Datadog and New Relic.
Monitoring model performance, however, is an ML-specific task. You can measure things like latency or errors, but measuring prediction accuracy requires tools built for models.
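As a rough illustration of what "tools built for models" have to do, the sketch below tracks accuracy over a rolling window of predictions and flags when it falls under a threshold. It assumes ground-truth labels eventually arrive for each prediction, which is not always true in practice. `RollingAccuracyMonitor`, its window size, and its threshold are hypothetical and do not reflect how any specific monitoring product works.

```python
from collections import deque
from typing import Optional


class RollingAccuracyMonitor:
    """Hypothetical monitor: accuracy over the last `window` labeled predictions."""

    def __init__(self, window: int = 100, threshold: float = 0.9) -> None:
        # A bounded deque drops the oldest outcome once the window is full.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, predicted, actual) -> None:
        """Store whether a prediction matched its ground-truth label."""
        self.outcomes.append(predicted == actual)

    def accuracy(self) -> Optional[float]:
        """Return accuracy over the window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self) -> bool:
        """Return True when windowed accuracy is below the threshold."""
        current = self.accuracy()
        return current is not None and current < self.threshold
```

Unlike latency or error rate, this metric cannot be computed from the request alone. It needs the label, which is the part that general-purpose application monitoring does not provide.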
But the principles still apply. There are now tools specifically for monitoring prediction accuracy in real time, like Weights & Biases:
The familiarity of these production ML challenges is part of what makes them so frustrating. It feels like they should be easy to solve — after all, we’ve spent decades building tools to solve identical problems.
And then you try, and you realize that while the problems are the same in spirit, the production ML challenges are just specific enough that software engineering tools don’t generalize.
As a result, the pragmatic approach becomes hacking a data science workflow to work in production, sort of like eating soup with a fork because no one has invented a spoon.
Now, with the emergence of machine learning engineering, we’re seeing that change. There is a new focus on building tools that allow us to use ML in production. As a result, the barrier between interesting ML experiments and useful ML applications is coming down.
In the near future, any software engineer with some basic knowledge of machine learning will be able to use ML as a part of their stack.