AI is embedding itself into the products and processes of virtually every industry. But implementing AI at scale remains an unresolved, frustrating issue for most organizations. Businesses can help ensure success of their AI efforts by scaling teams, processes, and tools in an integrated, cohesive manner. This is all part of an emerging discipline called MLOps.
AI is no longer exclusively for digital native companies like Amazon, Netflix, or Uber. Dow Chemical Company recently used machine learning to accelerate its R&D process for Polyurethane formulations by 200,000x — from 2–3 months to just 30 seconds. And Dow isn’t alone. A recent index from Deloitte shows how companies across sectors are operationalizing AI to drive business value. Unsurprisingly, Gartner predicts that more than 75% of organizations will shift from piloting AI technologies to operationalizing them by the end of 2024 — which is where the real challenges begin.
AI is most valuable when it is operationalized at scale. For business leaders who wish to maximize business value using AI, scale refers to how deeply and widely AI is integrated into an organization’s core product or service and business processes.
Unfortunately, scaling AI in this sense isn’t easy. Getting one or two AI models into production is very different from running an entire enterprise or product on AI. And as AI is scaled, problems can (and often do) scale, too. For example, one financial company lost $20,000 in 10 minutes because one of its machine learning models began to misbehave. With no visibility into the root issue — and no way to even identify which of its models was malfunctioning — the company was left with no choice but to pull the plug. All models were rolled back to much earlier iterations, which severely degraded performance and erased weeks of effort.
Organizations that are serious about AI have started to adopt a new discipline, defined loosely as “MLOps” or Machine Learning Operations. MLOps seeks to establish best practices and tools to facilitate rapid, safe, and efficient development and operationalization of AI. When implemented well, MLOps can significantly shorten time to market. Implementing MLOps requires investing time and resources in three key areas: processes, people, and tools.
Building the models and algorithms that power AI is a creative process that requires constant iteration and refinement. Data scientists prepare the data, create features, train the model, tune its parameters, and validate that it works. When the model is ready to be deployed, software engineers and IT operationalize it, monitoring the output and performance continually to ensure the model works robustly in production. Finally, a governance team needs to oversee the entire process to ensure that the AI model being built is sound from an ethics and compliance standpoint.
Given the complexity involved here, the first step to making AI scale is standardization: a way to build models in a repeatable fashion and a well-defined process to operationalize them. In this way, creating AI is closely akin to manufacturing: The first widget a company makes is always bespoke; scaling the manufacturing to produce lots of widgets and then optimizing their design continuously is where a repeatable development and manufacturing process becomes essential. But with AI, many companies struggle with this process.
It’s easy to see why. Bespoke processes are (by nature) fraught with inefficiency. Yet many organizations fall into the trap of reinventing the wheel every time they operationalize a model. In the case of the financial company discussed above, the lack of a repeatable way to monitor model performance caused expensive and slow-to-remedy failures. One-off processes like these can spell big trouble once research models are released into production.
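A repeatable monitoring step does not have to be elaborate. The following is a minimal sketch, not the financial company's actual system: it assumes each prediction can later be compared with an observed outcome, and the class name `ModelMonitor`, the window size, and the error threshold are all hypothetical. The point it illustrates is that tracking health per model makes it possible to see which model is misbehaving instead of rolling everything back.

```python
from collections import defaultdict, deque


class ModelMonitor:
    """Tracks a rolling error rate per model (illustrative sketch only)."""

    def __init__(self, window=100, max_error_rate=0.2):
        # Both defaults are assumed values, not recommendations.
        self.max_error_rate = max_error_rate
        # One bounded history per model ID; the oldest outcomes drop off automatically.
        self._outcomes = defaultdict(lambda: deque(maxlen=window))

    def record(self, model_id, prediction, actual):
        """Store whether a single prediction matched the observed outcome."""
        self._outcomes[model_id].append(prediction == actual)

    def error_rate(self, model_id):
        """Share of incorrect predictions within the rolling window."""
        history = self._outcomes[model_id]
        if not history:
            return 0.0
        return 1.0 - sum(history) / len(history)

    def misbehaving(self):
        """Return the IDs of models whose rolling error rate exceeds the threshold."""
        return sorted(
            model_id
            for model_id in self._outcomes
            if self.error_rate(model_id) > self.max_error_rate
        )
```

With every production model reporting through the same interface, a call to `misbehaving()` narrows an incident to specific models, so only those need to be rolled back.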
The process standardization piece of MLOps helps streamline development, implementation, and refinement of models, enabling teams to build AI capabilities in a rapid but responsible manner.
To standardize, organizations should collaboratively define a “recommended” process for AI development and operationalization, and provide tools to support the adoption of that process. For example, the organization can develop a standard set of libraries to validate AI models, thus encouraging consistent testing and validation. Standardization at hand-off points in the AI lifecycle (e.g., from data science to IT) is particularly important, as it allows different teams to work independently and focus on their core competencies without worrying about unexpected, disruptive changes.
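As a rough illustration of what a shared validation library might contain, the sketch below defines one common entry point that every team calls before hand-off. The names `validate_model` and `ValidationReport`, the accuracy metric, and the 0.9 threshold are assumptions made for this example; a real library would cover whatever checks the organization agrees on.

```python
from dataclasses import dataclass


@dataclass
class ValidationReport:
    """Result of a standard validation run (hypothetical structure)."""
    accuracy: float
    n_examples: int
    passed: bool


def validate_model(predict, examples, min_accuracy=0.9):
    """Score a model on labeled examples and compare against a shared threshold.

    predict      -- any callable mapping features to a predicted label
    examples     -- sequence of (features, label) pairs
    min_accuracy -- assumed organization-wide bar; 0.9 is only a placeholder
    """
    if not examples:
        raise ValueError("validation needs at least one labeled example")
    correct = sum(1 for features, label in examples if predict(features) == label)
    accuracy = correct / len(examples)
    return ValidationReport(accuracy, len(examples), accuracy >= min_accuracy)
```

Because the function accepts any callable, data scientists keep their choice of modeling library while IT receives the same report format for every model.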
MLOps tools such as Model Catalogs and Feature Stores can support this standardization.
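To make the idea of a model catalog concrete, here is a deliberately small in-memory sketch. It does not reflect the API of any particular product; `ModelCatalog`, `ModelEntry`, and their fields are hypothetical, and a production catalog would persist entries and store far more metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    """One registered version of a model (hypothetical fields)."""
    name: str
    version: int
    owner: str
    metrics: dict


class ModelCatalog:
    """Keeps every version of every model so any of them can be looked up later."""

    def __init__(self):
        self._entries = {}  # model name -> list of ModelEntry, oldest first

    def register(self, name, owner, metrics):
        """Add a new version; version numbers start at 1 and increase by one."""
        versions = self._entries.setdefault(name, [])
        entry = ModelEntry(name, len(versions) + 1, owner, dict(metrics))
        versions.append(entry)
        return entry

    def latest(self, name):
        """Return the most recently registered version of a model."""
        return self._entries[name][-1]

    def get(self, name, version):
        """Return a specific earlier version, e.g. as a rollback target."""
        return self._entries[name][version - 1]
```

Recording an owner and metrics with each version gives engineering and governance a shared record of what is deployed, and makes a targeted rollback to the previous version possible.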
AI development used to be the responsibility of an AI “data science” team, but AI at scale can’t be built by a single team. It requires a variety of distinct skill sets, and very few individuals possess all of them. For example, a data scientist creates algorithmic models that can accurately and consistently predict behavior, while an ML engineer optimizes, packages, and integrates research models into products and monitors their quality on an ongoing basis. One individual will seldom fulfill both roles well. Compliance, governance, and risk require an even more distinct set of skills. As AI is scaled, more and more expertise is required.
To successfully scale AI, business leaders should build and empower specialized, dedicated teams that can focus on high-value strategic priorities that only their team can accomplish. Let data scientists do data science; let engineers do the engineering; let IT focus on infrastructure.
Two team structures have emerged as organizations scale their AI footprint. First, there is the “pod model,” where AI product development is undertaken by a small team made up of a data scientist, data engineer, and ML or software engineer. Second, there is the “Center of Excellence” (COE) model, in which the organization pools all of its data science experts, who are then assigned to different product teams depending on requirements and resource availability. Both approaches have been implemented successfully, and each comes with its own pros and cons. The pod model is best suited for fast execution but can lead to knowledge silos, whereas the COE model shares knowledge more readily but can be slower to execute. In contrast to data science and IT, governance teams are most effective when they sit outside of the pods and COEs.
Finally, we come to tools. Given that trying to standardize production of AI and ML is a relatively new project, the ecosystem of data science and machine learning tools is highly fragmented — to build a single model, a data scientist works with roughly a dozen different, highly specialized tools and stitches them together. On the other side, IT or governance uses a completely different set of tools, and these distinct toolchains don’t easily talk to each other. As a result, it’s easy to do one-off work, but building a robust, repeatable workflow is difficult.
Ultimately, this limits the speed at which AI can be scaled across an organization. A scattershot collection of tools can lead to long times to market and AI products being built without adequate oversight.
But as AI scales across an organization, collaboration becomes more fundamental to success. Faster iteration demands ongoing contributions from stakeholders across the model lifecycle, and finding the correct tool or platform is an essential step. Tools and platforms that support AI at scale must support creativity, speed, and safety. Without the right tools in place, a business will struggle to uphold all of them concurrently.
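When picking MLOps tools for their organization, leaders should consider four things: interoperability with existing infrastructure, suitability for both data science and IT, support for collaboration, and built-in governance.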
More often than not, there will be some existing AI infrastructure already in place. To reduce friction in adopting a new tool, choose one that will interoperate with the existing ecosystem. On the production side, model services must work with DevOps tools already approved by IT (e.g., tools for logging, monitoring, governance). Ensure that new tools will work with the existing IT ecosystem or can be easily extended to provide this support. For organizations moving from on-premises infrastructure to the cloud, find tools that will work in a hybrid setting, as cloud migration often takes multiple years.
Tools to scale AI have three primary user groups: the data scientists who build models, the IT teams who maintain the AI infrastructure and run AI models in production, and the governance teams who oversee the use of models in regulated scenarios.
Of these, data science and IT tend to have opposing needs. To enable data scientists to do their best work, a platform must get out of the way, offering them the flexibility to use libraries of their choice and work independently without requiring constant IT or engineering support. On the other hand, IT needs a platform that imposes constraints and ensures that production deployments follow predefined and IT-approved paths. An ideal MLOps platform can do both. In practice, this challenge is frequently solved by picking one platform for building models and another for operationalizing them.
As described above, AI is a multi-stakeholder initiative. As a result, an MLOps tool must make it easy for data scientists to work with engineers and vice versa, and for both of these personas to work with governance and compliance. In the year of the Great Resignation, knowledge sharing and ensuring business continuity in the face of employee churn are crucial. In AI product development, while the speed of collaboration between data science and IT determines speed to market, governance collaboration ensures that the product being built is one that should be built at all.
With AI and ML, governance becomes much more critical than in other applications. AI governance is not limited to security or access control in an application. It is responsible for ensuring that an application is aligned with an organization’s ethical code, that the application is not biased against a protected group, and that decisions made by the AI application can be trusted. As a result, it becomes essential for any MLOps tool to bake in practices for responsible and ethical AI, including capabilities like “pre-launch” checklists for responsible AI usage, model documentation, and governance workflows.
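A “pre-launch” checklist can be enforced in code as well as on paper. The sketch below is one possible shape for such a gate; the check names in `REQUIRED_CHECKS` are hypothetical stand-ins drawn from the concerns listed above, and each organization would define its own list.

```python
# Hypothetical checklist items; a governance team would own the real list.
REQUIRED_CHECKS = ("bias_review", "model_documentation", "governance_signoff")


def missing_checks(completed, required=REQUIRED_CHECKS):
    """Return the required checks that have not been completed, in checklist order."""
    done = set(completed)
    return [check for check in required if check not in done]


def ready_to_launch(completed, required=REQUIRED_CHECKS):
    """A model may launch only when no required check is outstanding."""
    return not missing_checks(completed, required)
```

Wiring a function like `ready_to_launch` into the deployment path means a model cannot reach production until governance has signed off, without governance needing to sit inside each product team.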
In the race to scale AI and realize more business value through predictive technology, leaders are always looking for ways to get ahead of the pack. AI shortcuts like pre-trained models and licensed APIs can be valuable in their own right, but scaling AI for maximum ROI demands that organizations focus on how they operationalize AI. The businesses with the best models or smartest data scientists aren’t necessarily the ones who are going to come out on top; success will go to the companies that can implement and scale smartly to unlock the full potential of AI.