Basics of Data Science Product Management: The ML Workflow

I’ve spent the last few years applying data science to different aspects of business. These use cases include internal machine learning (ML) tools, analytics reports, data pipelines, prediction APIs, and, more recently, end-to-end ML products.

I’ve had my fair share of successful and unsuccessful ML products. There are even reports of ML product horror stories where the developed solutions ended up failing to address the problems they were supposed to solve. To a large extent, this gap can be closed by properly managing ML products to ensure that they end up being genuinely useful to users.

Something I quickly learned was that managing ML products is difficult because of the complexities and uncertainties involved in the different steps of the ML workflow:

1) Review of related literature (RRL)
2) Data collection and processing
3) Model training and experimentation
4) Deployment

Given the difficulties in the ML workflow and our resource constraints (e.g. time, people, storage, and compute), how do we make sure that our approach to each of the steps will maximize the value contribution of the ML model to the overall ML product?

In this blog post I aim to give an overview of each of these steps, while illustrating some of the foreseeable challenges and the frameworks that I’ve found to be useful in optimizing the ML workflow. Note that we consider an ML workflow as optimized when it creates an ML model with the highest value add to the overall product, given the available resources.

Also, this post is intended for readers who have at least a high-level understanding of machine learning concepts, tools, and workflows; however, don’t worry if you aren’t familiar with some of the concepts that I mention. I often link to articles that explain these concepts, and even for those that I don’t, a quick Google search should give you a high-level understanding.

The review of related literature (RRL) step of the ML workflow involves reading up on existing approaches, datasets, and other resources that may be relevant to the methodology we will use for our model. This gives us a set of possible design choices for our methodology, such as the features to engineer, the preprocessing to apply, the model architecture, and whether to use pre-trained models.

Most design choices are taken from related literature, but it’s also possible for some components to be created from “scratch”, usually because aspects of the project differ from anything found in the related literature. For example, if the model being trained aims to predict a person’s income based on their online purchases and there are no papers or blog posts published on the topic, then your team will likely have to come up with its own design choices for feature engineering.

Another relevant product management (PM) concern is deciding how much time to allocate to RRL before starting model development. In my experience, 1 – 2 weeks is a good amount of time to spend focused on RRL. In some cases it also makes sense to have some team members focus solely on RRL while others conduct experiments based on the RRL findings; this is more feasible with larger teams.

The RRL is also supposed to inform us about the resources that are available for our use. From there, we have to understand the extent to which these resources can reduce the amount of work, or even remove the need for certain parts of our workflow.

For example, pre-trained models can speed up training while requiring less training data, which means we can allocate less time to model training and dataset generation (in the case of supervised models). In the context of semantic search, we can even eliminate the need for model training entirely by using features generated by a pre-trained BERT model.
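
To make this concrete, here is a minimal sketch of semantic search using only embeddings from a pre-trained encoder, so no model training is needed. It assumes a recent version of the sentence-transformers library; the model name, documents, and query are purely illustrative.

```python
# Minimal sketch: semantic search with pre-trained embeddings, no model training.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # pre-trained encoder, no fine-tuning

documents = [
    "How do I reset my password?",
    "Shipping usually takes 3-5 business days.",
    "You can cancel your subscription from the account page.",
]
query = "I forgot my login credentials"

# Encode the corpus once, then encode each incoming query at search time.
doc_embeddings = model.encode(documents, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank documents by cosine similarity to the query.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best_match = documents[int(scores.argmax())]
print(best_match)  # expected: the password reset document
```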

For available resources in the form of labelled datasets, it’s important to determine whether the labels are consistent with the objectives we want our model to achieve. For example, if we want to train a model that can identify the sentiment of a customer service email but don’t have a sufficiently large customer service dataset, would the IMDB reviews dataset be relevant for training the model? We can hypothesize an answer, but it would be even better to test whether a model trained on IMDB reviews correctly detects sentiment in the context of customer service emails. We will discuss data in more detail in the next section.
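
One hedged way to run such a test is sketched below: fit a sentiment model on review-style text, then score a small hand-labelled sample of customer service emails. The tiny “IMDB-style” training set here is only a stand-in for the real dataset, and all texts and labels are illustrative.

```python
# Minimal sketch: check whether sentiment learned from movie reviews transfers
# to customer service emails, using a small hand-labelled email sample.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# Stand-in for a model trained on IMDB reviews (1 = positive, 0 = negative).
imdb_texts = [
    "A wonderful film, I loved every minute of it.",
    "Terrible plot and awful acting, a complete waste of time.",
    "Great performances and a satisfying ending.",
    "Boring, slow, and disappointing from start to finish.",
]
imdb_labels = [1, 0, 1, 0]
imdb_model = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(imdb_texts, imdb_labels)

# A few customer service emails labelled by hand.
emails = [
    "Thanks so much, the replacement arrived quickly and works great.",
    "This is the third time my order has been delayed. Very frustrating.",
    "Support resolved my issue within an hour, really impressed.",
    "I still have not received a refund after two weeks.",
]
email_labels = [1, 0, 1, 0]

# If the transfer is poor, this report will make it obvious before you build on the model.
print(classification_report(email_labels, imdb_model.predict(emails)))
```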

In summary, the RRL step informs us about existing approaches and available resources that will be helpful in designing and planning our ML workflow.

The data collection and processing step involves getting the data and processing it into a state where it is ready to be fed into the ML model for either training or evaluation. The approach to this step mainly depends on seven factors:

1) Size of the data
2) Required compute power
3) Available resources (storage and compute)
4) Quality of database documentation
5) Cleanliness of the data
6) Source of the data
7) Availability of ground truth data

Of these factors, the size of the data (1) and the required compute power (2) determine the amount of resources needed to pull and process the data. When available resources (3) are limited (e.g. fixed resources in an on-premise data center), data pulls become slower and more expensive for large datasets with complex queries, which justifies additional query-optimization efforts and data sampling. In the case of Hive queries, this could mean running optimized queries on a few partitions at a time for the initial data pulls. Note that scarce storage and compute is less of an issue on cloud computing platforms (e.g. GCP and AWS), since these platforms let you use resources “as needed” rather than within the constraints of a physical data center.
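
As an illustration, here is a rough sketch of what a partitioned and sampled initial pull might look like with PySpark against a Hive table. The table, columns, partition value, and output path are all assumptions for the example.

```python
# Minimal sketch: pull a small, cheap sample from one Hive partition first,
# instead of scanning the full table.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sampled-data-pull")
    .enableHiveSupport()
    .getOrCreate()
)

# Restrict the query to a single partition so the initial pull stays fast and cheap.
df = spark.sql(
    """
    SELECT customer_id, purchase_amount, purchase_ts
    FROM analytics.purchases
    WHERE dt = '2020-01-01'          -- one partition only
    """
)

# Work with a 1% sample while iterating on the processing logic,
# then scale up once the pipeline is stable.
sample = df.sample(fraction=0.01, seed=42)
sample.write.mode("overwrite").parquet("/tmp/purchases_sample")
```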

While the cleanliness of data (5) may increase resource requirements for data processing, the quality of database documentation (4) doesn’t affect the resources needed. However, both of these factors affect how long it takes to write the queries required to pull data from the necessary databases. A huge bottleneck often comes from not knowing which tables contain the exact fields that you need. A well documented database will make it easy to find the correct fields while a poorly documented database will slow down your workflow significantly. Similarly, unclean data slows you down because additional (often very tedious) steps are necessary to bring the fields into the format required for processing.

The source of your data (6) also comes with associated risks. If the data source is a public API or a website (accessed via a web crawler), there are added issues of availability, maintainability, and legality. In the case of APIs, it’s very possible that what was once “free” to use will later require a monthly subscription. Another issue is when the owner of the source suddenly imposes rate limits on how much data you can pull in a given period (e.g. 5 requests per day).

For websites being crawled, maintenance becomes a huge issue because web crawlers are typically very sensitive to any changes in the websites they crawl. Web crawling can also have legal issues if the website being scraped explicitly prohibits crawling in its Terms & Conditions, so it’s best to check before doing any crawling. For these reasons, it’s important to watch out for these risks when your ML product requires pulling from public APIs and websites. Knowing about them early allows you to plan for alternatives in case things go south.

Lastly, the availability of ground truth data (7) for model evaluation determines whether we will have to design a way to label data. Generally, labelled data has two possible sources: 1) external parties, and 2) users. Depending on the difficulty of the task being modelled, “high quality” labels from external parties come at varying cost. For technical tasks (e.g. information extraction from medical journals), only domain experts can provide high quality labels, while for simple tasks (e.g. dog vs cat image classification), domain experts are not needed. So why don’t we just have domain experts label data all the time? Because they’re typically more expensive (imagine paying a doctor per hour to label data). In other words, the cost of hiring people to label data increases as the task being modelled becomes more complex.

A clever workaround is to get users to create your ground truth data for you. This can be done by embedding the “labelling” into the ML product while minimizing friction, or even turning the labelling process into a value-adding activity. A great example is Facebook giving users the ability to “tag” friends to let them know they’re in a photo. By tagging your friends, you are “labelling” Facebook’s image data. Users benefit because they get to connect with their friends, while Facebook benefits because it gets “free” labelled data. Everyone wins!

I plan to further discuss the topic of embedding evaluation into your AI product in a future blog post.

Now we get to the most famous part of the ML workflow: actually training the model and experimenting with different approaches!

You might have read articles that say training the model is only 20% of a data scientist’s job. This is quite accurate, but from a PM standpoint, the main things to consider are:

1) The experiments you plan to run
2) Training runtime
3) The model’s accuracy metric
4) The baseline performance

Before you start training your models, you should already have some idea of the experiments (1) you plan to run based on your team’s hypotheses (supported by the findings from the RRL step). List these experiments and rank them based on expected improvement to the model (in terms of accuracy) and the incremental time required. The experiments with the highest expected improvement and lowest time required should be on top.
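
One simple way to combine the two ranking criteria is to sort by expected gain per unit of time, as in the sketch below. The experiment names, expected gains, and time estimates are illustrative placeholders, not measurements.

```python
# Minimal sketch: rank a backlog of planned experiments by expected accuracy
# gain per day of effort.
experiments = [
    {"name": "add TF-IDF features",      "expected_gain": 0.02,  "days": 1},
    {"name": "swap to pre-trained BERT", "expected_gain": 0.05,  "days": 5},
    {"name": "tune regularization",      "expected_gain": 0.01,  "days": 0.5},
    {"name": "ensemble existing models", "expected_gain": 0.015, "days": 0.5},
]

# Highest expected gain per day of effort goes to the top of the backlog.
ranked = sorted(
    experiments,
    key=lambda e: e["expected_gain"] / e["days"],
    reverse=True,
)

for rank, exp in enumerate(ranked, start=1):
    print(rank, exp["name"], round(exp["expected_gain"] / exp["days"], 3))
```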

In addition, it’s important to distinguish between experiments that require retraining the model and experiments that don’t. For example, after a model is trained for text classification, one can still experiment with different text preprocessing methods, assuming the format of the input text stays consistent. For experiments that don’t require retraining, the time cost is typically much lower, since training runtime takes up a large proportion of an experiment’s time cost. This means many non-retraining experiments can be treated as “quick wins” that the team can perform rapidly.

One thing to note about non-training experiments is that they’re often the type of experiment you can only do after you already have a few trained models (e.g. model ensembling requires models that are already trained). Because of this, I typically allocate the last stretch of the experimentation period to rapid non-training experiments, while the longer training experiments are performed at the very beginning.
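
As a concrete example of such a “quick win”, the sketch below averages the predicted probabilities of two already-trained models; no retraining is involved. The toy dataset and model choices are stand-ins for models produced by earlier experiments.

```python
# Minimal sketch: a non-retraining experiment that averages the predicted
# probabilities of two models that have already been trained.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# These stand in for models trained during earlier experiments.
model_a = LogisticRegression(max_iter=1000).fit(X_train, y_train)
model_b = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Averaging their predicted probabilities requires no retraining at all.
ensemble_proba = (model_a.predict_proba(X_val) + model_b.predict_proba(X_val)) / 2
ensemble_pred = np.argmax(ensemble_proba, axis=1)

print("model A:  ", accuracy_score(y_val, model_a.predict(X_val)))
print("model B:  ", accuracy_score(y_val, model_b.predict(X_val)))
print("ensemble: ", accuracy_score(y_val, ensemble_pred))
```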

Training runtime (2) refers to how long it takes for one version of a model to train on a dataset. This matters because, for experiments that require training, it determines the number of experiments you can run with different versions of the model, given the timelines you’ve committed to for your project.

This implies that you may well not have enough time to do all of the experiments you plan to conduct. In any case, experiments that rank higher should be conducted first; however, the team can also increase the number of possible experiments by decreasing the time required to execute each one. A simple way to do this is to reduce training runtime.

So how do we control training runtime? Typically, by either simplifying the model (e.g. reducing the number of neural network layers) or reducing the size of the dataset (e.g. choosing a subset of examples or features). A simple approach that works quite well for me is to start with the simplest version of the model on the smallest reasonable sample of data (i.e. the version with the fastest runtime). Then, increase model complexity and sample size incrementally until you reach your target training runtime – the runtime that will allow you to conduct all your planned experiments. It’s also important to keep track of accuracy and runtime at each level of model complexity and sample size, so you understand the time tradeoffs associated with increasing them.
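
A rough sketch of this incremental approach is shown below: train the same model on progressively larger samples and record runtime and accuracy at each step. The dataset, model, and sample sizes are illustrative.

```python
# Minimal sketch: grow the training sample incrementally and record accuracy
# and runtime at each size, to understand the runtime/accuracy tradeoff.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, n_features=30, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

results = []
for sample_size in [1_000, 5_000, 10_000, len(X_train)]:
    model = RandomForestClassifier(n_estimators=100, random_state=0)

    start = time.time()
    model.fit(X_train[:sample_size], y_train[:sample_size])
    runtime = time.time() - start

    acc = accuracy_score(y_val, model.predict(X_val))
    results.append((sample_size, round(runtime, 2), round(acc, 4)))

# Use this table to decide how much data you can afford per experiment.
for row in results:
    print(row)
```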

One more way to reduce training runtime is to upgrade the hardware you have access to (e.g. GPUs and CPUs). Fortunately, cloud computing platforms let you customize the available resources at different price points. Since the cost of an upgrade is fully documented, the upgrade decision can be informed by the upgrade cost and the expected speedup in training runtime. To fully utilize these resources, many training libraries (e.g. PyTorch, TensorFlow, scikit-learn) make it easy to speed up operations via parallelization.
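
For instance, many scikit-learn estimators expose an n_jobs parameter that spreads training across CPU cores, so an instance with more cores translates directly into shorter training runtime. A minimal sketch, with an illustrative dataset:

```python
# Minimal sketch: parallelize training across all available CPU cores.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=50_000, n_features=50, random_state=0)

# n_jobs=-1 uses all available cores; on a larger cloud instance this
# alone can cut training runtime substantially.
model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
model.fit(X, y)
```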

Finally, we have to define the model’s accuracy metric (3), which we will use to evaluate the results of each experiment. The important thing to remember is that the accuracy metric should be a measure of the model’s value contribution to the overall ML product. This means you need a deep understanding of the problem your ML product is trying to solve; from there, work back to the specific task the model is performing and identify exactly how it helps the product solve that problem.

A common error is to choose an accuracy metric simply because it’s the usual metric for a certain task. It’s incredibly important that the model’s accuracy metric is consistent with its value contribution to the product: otherwise, a better-performing model will not translate to a better product, and the experimentation will have been for naught.

For example, at Airbnb, host preferences are modelled by the probability that a host will accept a booking request. Given that listings with higher acceptance probabilities are shown first to guests, more booking requests end up being accepted, which translates to a revenue uplift for Airbnb. The more accurate the model, the higher the acceptance rates, and the higher the revenue.

Once the value metric has been identified, the team needs a baseline performance (4) to determine whether the model is “successful”, and to what extent. One framework I use for setting a baseline is to work out the level of performance at which the model becomes actually “useful” to the ML product. This can be difficult to do, but it’s valuable because it 1) forces us to think about exactly how the model is supposed to provide value to users, 2) gives us an initial target performance for assessing usefulness, and 3) incentivises the team to have actual conversations with users to understand the model’s role in providing value to them.
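
Before that usefulness threshold is agreed on, a quick sanity check is to compare the model against a naive baseline such as always predicting the majority class; this is only a floor, not the usefulness target itself. A minimal sketch with illustrative data, using scikit-learn’s DummyClassifier:

```python
# Minimal sketch: compare the candidate model against a naive baseline
# (always predicting the majority class). A model that cannot clearly beat
# this is unlikely to be useful to the product.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("baseline accuracy:", baseline.score(X_val, y_val))
print("model accuracy:   ", model.score(X_val, y_val))
```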

In the long run, usefulness can be derived more empirically by running A/B tests to compare different versions of the model in terms of impact on actual product value metrics (e.g. revenue, signups, orders). Airbnb is a good example of a company that has a data science culture of assessing models using experiments via A/B testing. This makes sure that their models are actually improving the product.

Deployment is the last step in the ML workflow and the approach mainly depends on 1) whether predictions are needed in realtime or periodically in batches, and 2) latency requirements.

If realtime predictions are necessary, then the model will likely be deployed as an API, where predictions are retrieved by sending requests to that API. This is often done using app frameworks like Flask or Django running on a separate instance (e.g. an AWS EC2 machine) or on FaaS platforms (e.g. AWS Lambda), with the prediction script running in the cloud.
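
A minimal sketch of what such a prediction API could look like with Flask is below; the model file name, feature payload, and port are assumptions for the example.

```python
# Minimal sketch: serving a trained model as a realtime prediction API with Flask.
import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("model.joblib")  # model trained and serialized beforehand

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                    # e.g. {"features": [1.2, 3.4, 5.6]}
    prediction = model.predict([payload["features"]])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A client would then retrieve a prediction with a request such as: curl -X POST -H "Content-Type: application/json" -d '{"features": [1.2, 3.4, 5.6]}' http://localhost:5000/predict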

If predictions are needed in batches at a specified cadence (e.g. daily, monthly), then a job that periodically runs a prediction script on new data is sufficient. For this case, tools like crontab (built into Linux) and workflow frameworks (e.g. Airflow, Luigi) do the job. For reasons that are widely documented, I prefer Airflow for batch deployment.
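
For the Airflow route, a daily batch prediction job might be declared roughly as in the sketch below (assuming Airflow 2.x); the DAG id, schedule, and the body of the prediction function are placeholders.

```python
# Minimal sketch: an Airflow DAG that runs a batch prediction script daily.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_batch_predictions():
    # Load new data, score it with the trained model, and write results
    # to wherever the product reads predictions from.
    ...

with DAG(
    dag_id="daily_batch_predictions",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    predict_task = PythonOperator(
        task_id="run_batch_predictions",
        python_callable=run_batch_predictions,
    )
```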

Latency requirements basically answer the question: how quickly does the model need to be able to make predictions? Typically, applications will have service level agreements (SLAs) that hold prediction latency to a certain standard.

For example, when you open the Medium app and it shows you a recommended list of articles, that list is probably either 1) “predicted” and stored once a day, then shown to you when you open the app (batch deployment), or 2) “predicted” as you open the app (realtime deployment). The first just needs a prediction SLA fast enough that users see an updated list by the time they open the app in the morning (e.g. within 5 hours from midnight every day). For the second, the SLA has to be much more stringent (e.g. within 1 second of opening the app), since the prediction happens as the user opens the app and users won’t want to wait long to see their recommended articles.

It’s apparent from the illustration above that latency is more likely to be an issue for realtime deployment. So how do we bring latency down to an acceptable level (the SLA), assuming it currently isn’t being met? Imagine, for example, that prediction takes 5 seconds instead of the 1-second SLA after opening the app.

This is where Machine Learning Engineering (MLE) comes in. MLE involves the use of special techniques (e.g. parallelization) to speed up the runtimes of deployed machine learning models with the goal of reaching set SLAs. It is the ML engineer’s job to bring the 5 second prediction runtime down to the target 1 second prediction runtime.
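
One common example of such a technique is replacing per-row prediction calls with a single vectorized (batched) call, as in the hedged sketch below; whether this alone closes a 5-second gap depends entirely on the model and serving setup. The model and request data are illustrative.

```python
# Minimal sketch: score all rows of a request in one vectorized call instead
# of looping over them one by one.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

rows = np.random.rand(500, 20)  # features arriving in one request

# Slow path: one model call per row.
start = time.time()
slow_preds = [model.predict(row.reshape(1, -1))[0] for row in rows]
print("per-row loop:       ", round(time.time() - start, 3), "seconds")

# Fast path: a single batched call, which the library vectorizes internally.
start = time.time()
fast_preds = model.predict(rows)
print("single batched call:", round(time.time() - start, 3), "seconds")
```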

At this point, you should have an overall understanding of the ML workflow and some of the different things to consider in designing your approach to each of the steps. It is important to note that each step should be discussed in detail with your data science team at the end of each sprint so that the planned approach at any point in time is the optimal one given all available information.

In addition, I am of the view that there is a considerable advantage to having experience working in data science before becoming a data science product manager. Setting backlogs, prioritizing tasks, estimating timelines, and communicating how the ML product works and provides value all require an in-depth understanding of the ML workflow and the uncertainties associated with each step. Even just from the standpoint of prioritizing tasks and drawing up timelines, understanding the ML workflow means knowing the implications of new information coming in as the project moves forward.

I plan to give more detailed discussions of other aspects of data science product management in later blog posts (e.g. setting a backlog, prioritizing tasks, estimating timelines, aligning ML model objectives with the product’s value proposition, integrating evaluation into the ML product, managing stakeholders new to ML). If you have any suggestions on which ones to start with, feel free to comment or email me at lorenzo.ampil@gmail.com. Stay tuned!