
Scaling Machine Learning in the Real World

Machine Learning is evolving at a breakneck pace across platforms, tools, technologies, stacks and operationalization. However, there remains a gap in scaling in the real world. It becomes apparent both in established players who are in the later stages of solving reproducibility, measurement and versioning issues, and in newer players who try out a few models or algorithms and get stymied by the operational aspects. Even in larger or more mature organizations, the scaling problem manifests because the real world is in a constant state of flux, and that flux is reflected in the data and the models built on it. In the Data Science world this is referred to as hidden technical debt, and it is one of the active areas of development.

In this article, I will share the lessons I have learned in dealing with such challenges and lay out some architectural principles and guidance for approaching these problems and building the capability to scale in the real world.

Taking a typical Data Science workflow, as illustrated in one of the papers by R. Olson, we can see how it differs from usual software development in its use of experiments and a feedback decision loop. The key principles that allow this workflow to scale are: Reproducibility, Testing and Monitoring, and Measurement and Observability.

Reproducibility is not an easy or straightforward problem. There are many dimensions to consider, the most important being data, domain, external dependencies, result measurement and evaluation criteria. However, there are options that help in getting started. The key principle here is versioning: not just the data and the model, but all artifacts associated with an execution, such as code, pipelines, evaluation metrics, accepted and rejected results, and experimentation or A/B test requirements.

In the software stack, one of the options that captures these requirements is DVC (Data Version Control, dvc.org). It is an open source framework that sits on top of Git and integrates with cloud storage services such as AWS S3 and GCP GCS, allowing ML, NLP and DL pipelines to be versioned, tracked and reproduced. An excerpt from the DVC documentation:

“Data processing or ML pipelines typically start with large raw datasets, include intermediate featurization and training stages, and produce a final model, as well as accuracy metrics.

Versioning large data files and directories for data science is great, but not enough. How is data filtered, transformed, or used to train ML models? DVC introduces a mechanism to capture data pipelines — series of data processes that produce a final result.

DVC pipelines and their data can also be easily versioned (using Git). This allows you to better organize your project, and reproduce your workflow and results later exactly as they were built originally!”

In DVC, pipeline stages and commands, their data I/O, interdependencies, and results (intermediate or final) are specified in dvc.yaml, which can be written manually or built using the helper command dvc run. This allows DVC to restore one or more pipelines later (see dvc repro).
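To make this concrete, here is a minimal sketch of reading a versioned dataset through DVC's Python API; the repository URL, file path and tag are hypothetical placeholders:

```python
import dvc.api

# Open a dataset exactly as it existed at a given Git revision.
# The repo URL, file path and tag below are hypothetical placeholders.
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/example/ml-project",
    rev="v1.2",  # Git tag or commit that pins code, pipeline and data together
) as f:
    header = f.readline()
    print(header)
```

Because the revision pins the code, the pipeline definition and the data together, anyone on the team can pull exactly the inputs a given model was trained on.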

In the real world, this solution gets mapped across environment layers: from local development with a less restricted build configuration, to integrated test environments built with CI/CD tools such as Jenkins or Travis, to production deployment environments with tight restrictions.

The currency of the world is change, and that change is represented in data. If an ML, NLP or DL model is serving results that are out of sync with current patterns, it negatively impacts relevance, accuracy and bias, and as a result business metrics around revenue, engagement, loyalty and trust. A recent example is the change in customer behavior and interaction under Covid. Models that adapted to the change increased customer engagement, deepened brand trust and established loyalty, while the reverse holds true for models that drifted as the data changed.
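One lightweight way to start catching this kind of drift is to compare the distribution of a feature, or of the model's predictions, in recent traffic against the training baseline. Below is a minimal sketch using a two-sample Kolmogorov-Smirnov test from SciPy; the data and the significance threshold are illustrative, not prescriptive:

```python
import numpy as np
from scipy.stats import ks_2samp

def drifted(baseline: np.ndarray, recent: np.ndarray, alpha: float = 0.01) -> bool:
    """Flag drift when recent values are unlikely to come from the baseline distribution."""
    statistic, p_value = ks_2samp(baseline, recent)
    return p_value < alpha

# Hypothetical usage: compare live feature values against the training-time values.
baseline = np.random.default_rng(0).normal(0.0, 1.0, 10_000)  # stand-in for training data
recent = np.random.default_rng(1).normal(0.5, 1.0, 2_000)     # stand-in for live traffic
print(drifted(baseline, recent))  # True: the distribution has shifted
```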

Testing for Machine Learning is fundamentally different from software testing. The combination of data, multi-step pipelines, offline and online training and serving, and failure points across the layers are some of the key factors. Google published an influential paper, The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction, that lays down guidelines for testing.

With this guidance, we can architect the test framework along these boundaries: Data, Infrastructure, Model, Integration and Monitoring, with some level of overlap. At a higher level, this segmentation can be viewed as Data + Model + Code, along with production/operational needs such as Monitoring. On the software side, this translates to a combination of test scripts (PyTest, for example), an execution model (Docker), a tracking framework (MLflow), logging (the ELK stack: Elasticsearch/Logstash/Kibana) and monitoring (Prometheus, Grafana).
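To illustrate the Data and Model boundaries, here is a minimal PyTest sketch; the schema, feature ranges and accuracy numbers are hypothetical stand-ins for what would come from your own data contracts and evaluation runs:

```python
import pandas as pd
import pytest

EXPECTED_COLUMNS = {"user_id", "age", "purchase_amount"}  # hypothetical data contract

@pytest.fixture
def batch() -> pd.DataFrame:
    # In a real suite this would load a sampled production batch or a fixture file.
    return pd.DataFrame({"user_id": [1, 2], "age": [34, 51], "purchase_amount": [20.5, 7.0]})

def test_schema(batch):
    # Data test: the incoming batch matches the agreed schema.
    assert set(batch.columns) == EXPECTED_COLUMNS

def test_feature_ranges(batch):
    # Data test: feature values stay within plausible bounds.
    assert batch["age"].between(0, 120).all()
    assert (batch["purchase_amount"] >= 0).all()

def test_model_beats_baseline():
    # Model test: the trained model must outperform a trivial baseline.
    model_accuracy = 0.87     # stand-in; would be loaded from the latest evaluation run
    baseline_accuracy = 0.70  # stand-in; e.g. majority-class accuracy
    assert model_accuracy > baseline_accuracy
```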

Given the number of boundaries and the types of tests involved, along with the execution and feedback loop, this is a significantly complex undertaking. The approach is to come up with a plan guided by an analysis of ML technical debt, technical capabilities and foundational needs, together with an architectural and roadmap view. The key outcome is to start building the framework guided by that plan, get field usage from the teams, and start the feedback cycle.

Measurement is critical to establishing any baseline against which to gauge accuracy and relevance. There are many aspects to measurement: raw metrics, aggregated metrics, baselines, deviations, explained and unexplained variation, and more. This is one of the fast-evolving areas of ML, and it strongly overlaps with operations and the business. Measurement also overlaps with Testing and Monitoring, the key difference being that Measurement focuses tightly on Observability: the ability to determine what is going on inside a system by looking at it from the outside. Twitter has discussed observability in its tech blog posts, and a great reference is Cindy Sridharan's article. Quoting from her article:

“Monitoring” is best suited to report the overall health of systems. Monitoring, as such, is best limited to key business and systems metrics derived from time-series based instrumentation, known failure modes as well as blackbox tests. “Observability”, on the other hand, aims to provide highly granular insights into the behavior of systems along with rich context, perfect for debugging purposes.

Metrics can harness the power of mathematical modeling (sampling, aggregation, summarization, correlation) and prediction to derive knowledge of the behavior of a system over intervals of time in the present and future. Since numbers are optimized for storage, processing, compression, and retrieval, metrics enable longer retention of data as well as easier querying. This makes metrics perfectly suited to building dashboards that reflect historical trends depicting operational concerns (Latency, Memory/CPU usage) as well as prediction monitoring (Median & mean prediction values, Min/Max, Standard deviation)

While metrics show the trends of a service or an application, logs focus on specific events. The purpose of logs is to preserve as much information as possible on a specific occurrence. The information in logs can be used to investigate incidents and to help with root-cause analysis.
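On the metrics side, a minimal sketch of what this instrumentation can look like with the Prometheus Python client is shown below; the metric names and the port are illustrative choices:

```python
import random
import time

from prometheus_client import Histogram, start_http_server

# Illustrative metric names; Grafana dashboards would be built on top of these.
PREDICTION_LATENCY = Histogram("prediction_latency_seconds", "Time spent producing a prediction")
PREDICTION_VALUE = Histogram("prediction_value", "Distribution of model prediction scores")

@PREDICTION_LATENCY.time()
def predict(features):
    score = random.random()  # stand-in for a real model call
    PREDICTION_VALUE.observe(score)
    return score

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    while True:
        predict({"example": 1})
        time.sleep(1)
```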

In terms of mapping to the software stack, non-production or development environments are well suited to solutions such as MLflow, while production-grade environments need the scale, speed and reliability of ELK, Prometheus and Grafana type stacks.
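For the development side, experiment tracking with MLflow might look like the following sketch; the experiment name, parameters and metric values are placeholders:

```python
import mlflow

mlflow.set_experiment("churn-model")  # hypothetical experiment name

with mlflow.start_run():
    # Record the knobs and results of a training run so it can be compared and reproduced.
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("max_depth", 8)
    mlflow.log_metric("validation_auc", 0.91)
```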

Any conversation about scaling or productionizing data science needs to touch on Continuous Integration/Continuous Deployment, or CI/CD as it is known in the DevOps and MLOps world. That is an entire topic worthy of its own discussion. Briefly, when we integrate and automate Reproducibility, Testing, Monitoring, Measurement and Observability, the end result is a CI/CD execution model or framework. In future articles, I will explore how these core elements come together to form the CI/CD pipeline.

I hope this post gives you some useful information on the real-world aspects of Machine Learning and helps in the journey towards scaling. I look forward to hearing your feedback.
