
The Data Daily

Industrializing AI & Machine Learning Applications with Kubeflow

Enable data scientists to make scaling and production-ready ML products
Background
Over the last few years, there has been a lot of talk about industrializing AI & ML applications. You might have heard of the famous Netflix Prize, as well as the challenges of taking winning Kaggle solutions into the real world. This is mainly because:
Building an AI / ML model is different from scaling and maintaining it. One is a typical data science problem; the other is a typical engineering problem.
To be more specific, productionizing an AI & ML algorithm exposes gaps from both product and people perspectives:
Product — hidden technical debt caused by unclear or unscoped engineering work. (An AI / ML product = a small part of AI & ML modeling + a large amount of DevOps and engineering work.)
By Hidden Technical Debt in Machine Learning Systems
People — a lack of DevOps and engineering skill sets in data science teams. (When talking to data scientist friends, I found that not many of them are familiar with concepts like CI/CD (Continuous Integration & Continuous Deployment), microservices, containerization, Kubernetes, and sometimes even Git.)
To Minimize the Gaps
To close these gaps, the industry has introduced new positions such as ML engineer or ML DevOps, either to fill the knowledge gaps within data science teams or to productionize ML models on data scientists' behalf.
On the other hand, enterprises are building unified AI & ML platforms (such as H2O, DataRobot, and SageMaker) to automate much of the engineering work and minimize the DevOps effort required from data scientists.
Kubeflow, one of the newer candidates in this space, has been evolving rapidly and has attracted a lot of interest over the last few months. In this article, I'd like to discuss how Kubeflow can help resolve some fundamental challenges of AI & ML industrialization.
What is Kubeflow?
Our goal is to make scaling machine learning (ML) models and deploying them to production as simple as possible.
This statement from the Kubeflow website is self-explanatory. Essentially, Kubeflow leverages the existing Kubernetes (K8s) ecosystem to make ML highly scalable and operable in a microservice manner.
On top of that, it integrates open source components and tools to cover the different stages of the data science workflow, including:
Kubeflow Notebooks — Python Jupyter Notebooks on Kubernetes
Kubeflow Pipelines — ML pipeline orchestration
Fairing — model building, training, and deployment
Katib — hyperparameter tuning on Kubernetes
Seldon — model serving
I think Kubeflow has the potential to be a great AI & ML platform, as it addresses our challenges of running AI & ML in production in terms of:
Platform Scalability
Machine Learning Code Deployment & Maintenance
Development Experience
Scalability
AI & ML are closely tied to big data nowadays: data scientists are exposed to terabytes to petabytes of data, and the nature of data science itself requires many "test & learn" loops at different stages — EDA, feature engineering, model selection, hyperparameter tuning. This creates a strong relationship between platform scalability and the efficiency of building scalable AI & ML products.
Previously, we had VMs (with hundreds of GBs of RAM) sitting on internal servers, and all a data scientist needed to do was SSH into a VM and start modeling. However, this was not scalable, as there is always a hard upper limit on the RAM (e.g. up to 1 TB) and cores available in each VM.
Later, Hadoop YARN and Spark were introduced, which effectively solved the scalability issues and processed data very quickly. But in the early days, those platforms were not very user-friendly for data scientists and did not support ML very well, as they were initially designed for data processing rather than the matrix computation that ML requires.
Recently, Kubernetes (K8s), a container-based orchestration engine for automating the deployment, scaling, and management of containerized applications, has become very popular in industry for hosting all kinds of services, which in turn has prompted many people to ask whether it can be a good match for ML workloads as well.
Kubeflow (TensorFlow + Kubernetes) is one of these K8s-based ML solutions, and there are already industry cases proving the advantages of running ML applications on K8s, such as:
Better resiliency through simpler auto-scaling of your ML models (containers) using CRDs (Custom Resource Definitions) such as TFJob. The cluster is also much easier to scale up based on business needs if it is deployed in the cloud.
Better support for Python tooling compared to Hadoop, which is Java-focused.
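As a concrete illustration of the CRD approach, a TFJob for distributed training is just a custom resource manifest. The sketch below assembles one as a plain Python dict; the API version, image name, and replica counts are hypothetical examples rather than a definitive spec, and may differ depending on your Kubeflow release.

```python
# Illustrative sketch: build a TFJob custom-resource manifest as a plain dict.
# The image name and replica counts are hypothetical; the apiVersion may
# differ depending on your Kubeflow release.
def make_tfjob(name, image, workers=2):
    """Build a TFJob manifest describing a distributed training job."""
    def replica_spec(count):
        return {
            "replicas": count,
            "template": {
                "spec": {
                    "containers": [{"name": "tensorflow", "image": image}],
                },
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Chief": replica_spec(1),        # coordinates the job
                "Worker": replica_spec(workers), # scaled up or down as needed
            },
        },
    }


job = make_tfjob("mnist-train", "example.io/mnist:latest", workers=4)
```

Submitting such a manifest (e.g. via kubectl or the Kubernetes API) lets the training operator create and supervise the worker pods, which is what provides the resiliency and scaling described above.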
Based on the above, compared to YARN and traditional VMs, Kubeflow is more promising in that it not only provides scalability but also enables data scientists to work with the tools they are already familiar with.
Deployment & Maintenance
I used to see models in production as a black box (probably the same way others see AI & ML algorithms). Before a model can be promoted to the production environment, we need to make it testable, self-contained, version-controlled, and managed by CI/CD pipelines. However, this work may not be fun for data scientists, and doing it may not be the best use of their time.
Kubeflow clearly understands data scientists quite well: it provides an SDK for its pipelines that lets data scientists create a pre-compiled archive (with the Dockerfile and YAML files ready) from pipelines defined in Python, which can then be passed directly to CI/CD pipelines for review and testing, and eventually be promoted to production environments. This effectively minimizes the time data scientists spend writing YAML files and simplifies the whole ML deployment process.
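To make the idea concrete, here is a minimal, stdlib-only sketch of what such a compile step conceptually does: it serializes a pipeline defined in Python into a self-contained archive that CI/CD can pick up. This is an illustration only, not the actual Kubeflow Pipelines SDK API; the pipeline name, step names, and images are made up.

```python
import io
import json
import zipfile


# Stdlib-only sketch of the idea behind the Pipelines SDK compile step:
# a pipeline defined in Python becomes a reviewable, promotable archive.
# (Not the real kfp API; names and images below are hypothetical.)
def compile_pipeline(name, steps):
    """Serialize an ordered list of container steps into a zip archive."""
    spec = {
        "name": name,
        # Each step is a container image plus the command it runs.
        "steps": [
            {"name": step, "image": image, "command": command}
            for step, image, command in steps
        ],
    }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("pipeline.json", json.dumps(spec, indent=2))
    return buf.getvalue()


archive = compile_pipeline("churn-model", [
    ("preprocess", "example.io/prep:0.1", ["python", "prep.py"]),
    ("train", "example.io/train:0.1", ["python", "train.py"]),
])
```

The resulting bytes can be checked into a release artifact store or handed to a CI/CD job, which is the hand-off the SDK automates for data scientists.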
In the chart below, data scientists only need to spend their time on the yellow parts, and the Kubeflow Pipelines SDK will take care of the rest, from source code repository to container registry.
By Kubeflow Community
Once your container is registered, Kubeflow leverages Argo as its pipeline engine to orchestrate those container-based pipelines, which makes each end-to-end ML pipeline very easy to test, scale, and maintain in a microservices manner.
This effectively simplifies ML workflows, and from data scientists' perspective it also improves their experience in terms of:
Each ML model or component is now self-contained, including its dependencies, versioning, etc., which makes it portable and testable by data scientists.
The whole platform becomes more robust: if a newly deployed container breaks in the production environment, any team member can easily track down where the problem is without understanding the entire architecture.
The lifecycle of ML models/services becomes clearer. Data scientists can either update a current ML service by redeploying a newer version, or release a new container to conduct A/B testing and then retire the old one.
From a data governance and risk perspective, secrets can now be managed at the container level to ensure each service or component has only the minimum required access to data assets and resources.
Here is an example of a Kubeflow pipeline for an end-to-end XGBoost model; each of the boxes below is a container produced by the Python SDK.
Multi-user Isolation
This is another common requirement, especially when there are multiple teams or product streams in your business. The exact reasons depend on your business and teams, but essentially it prevents data scientists from different streams from stepping on each other's jobs and resources, improving their dev experience.
Kubeflow uses Kubernetes resource quotas to manage resources per namespace.
In the current version (v0.6), this is only available for Notebooks.
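For context, a Kubernetes ResourceQuota is just a namespaced object with hard limits, so giving each team its own capped namespace is straightforward. The sketch below assembles such a manifest as a plain Python dict; the namespace name and limit values are hypothetical examples.

```python
# Illustrative sketch: build a ResourceQuota manifest capping the compute
# available in one team's namespace. Namespace and limits are hypothetical.
def make_resource_quota(namespace, cpu, memory, gpus=0):
    """Build a ResourceQuota manifest for a single team namespace."""
    hard = {"requests.cpu": str(cpu), "requests.memory": memory}
    if gpus:
        # GPU quota uses the extended resource name exposed by the
        # NVIDIA device plugin.
        hard["requests.nvidia.com/gpu"] = str(gpus)
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {"hard": hard},
    }


quota = make_resource_quota("team-fraud", cpu=16, memory="64Gi", gpus=2)
```

Applied to a cluster, a quota like this stops one stream's experiments from starving another stream's jobs, which is exactly the isolation goal described above.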
By Kubeflow Community
Monitoring
Kubeflow uses Prometheus to monitor GPU, CPU, network, and RAM usage for each pod/worker, which sounds like a feature designed for DevOps. However, from my experience, it is also useful for data scientists in that it:
Helps data scientists better understand their resource usage in production so that they can optimize pipeline schedules and dependencies in an informed way.
Helps data scientists detect bottlenecks in their models by monitoring distributed computing across multiple workers and spotting improvements that can be made (e.g. data partitioning, parameter tuning, etc.).
Serves as a reference for balancing the cost of model training against model performance, making their work more cost-effective.
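For example, Prometheus exposes an HTTP query API that data scientists can call directly to pull these numbers into their own analysis. The sketch below only builds such a query URL using the standard library; the server address and pod name are hypothetical, while container_memory_usage_bytes is a standard cAdvisor metric scraped by Prometheus.

```python
from urllib.parse import urlencode


# Illustrative sketch: build a Prometheus HTTP API query URL for per-pod
# memory usage. The server address and pod name are hypothetical examples.
def memory_query_url(base, pod):
    """Return the /api/v1/query URL for a pod's container memory usage."""
    promql = f'container_memory_usage_bytes{{pod="{pod}"}}'
    return f"{base}/api/v1/query?" + urlencode({"query": promql})


url = memory_query_url("http://prometheus.example:9090", "train-worker-0")
```

Fetching that URL (e.g. with any HTTP client) returns a JSON result that can be fed straight into a notebook to compare resource usage against training cost.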
Open Source
Last but not least, one of Kubeflow's key advantages in my mind is that it is open source. Of course, there are benefits to both vendor-specific and open source solutions.
For me, it comes down to agility and flexibility: whether you are a large organization with a multi-cloud strategy or a small-to-medium startup that needs the flexibility to redesign or migrate its AI & ML infrastructure in the future. If your key requirements fall into those buckets, then open source is advantageous for you.
Anywhere you are running Kubernetes, you should be able to run Kubeflow.
That said, Kubeflow has a very active and committed community, and it has real momentum: releasing new features monthly, holding product design meetings weekly, and actively updating its roadmap based on feedback.
Commits over time — by Kubeflow GitHub
Overall
So far, we have discussed the major gaps in AI & ML industrialization and some of the solutions people are exploring. Kubeflow, as one of the trending tools, can help us succeed in data science projects across several of these aspects.
We also need to keep in mind that Kubeflow is far from perfect at this stage (v0.6.2):
Kubeflow is under heavy development, and there is no guarantee that future releases will be compatible with older versions. For example, starting from v0.6, the community decided to move from Ambassador to Istio as its service mesh, and from ksonnet to kustomize for managing deployment templates.
In addition, Kubeflow won't be enterprise-ready until v1.0 (January 2020); for example, multi-tenancy and user isolation have been deprioritized for now, which might be important features for your data science team.
What’s Next
In this article, I've mainly focused on how Kubeflow can help industrialize AI & ML products from an operations perspective and at the team level.
Next, I'd like to look at it from the perspective of individual data scientists: how Kubeflow enables them to build end-to-end AI & ML models using JupyterHub, Katib and Seldon…
