
The Data Daily

Is AutoML ready for Business?

Do we still need Data Scientists, and will we in the future?

AutoML tools have been gaining traction for the last couple of years, both due to technological advancements and their potential to be leveraged by ‘Citizen Data Scientists’. AutoML is an interesting (often controversial) area of Data Science (DS) that aims to automate the design of Machine Learning (ML) and Deep Learning (DL) models, making them accessible to people without the specialized skills of a Data Scientist.

Let us start with a very high-level primer on Machine Learning (ML). Most of today’s ML models are supervised and applied to prediction or classification tasks. Given a dataset, the Data Scientist has to go through a laborious process called feature extraction, and the model’s accuracy depends heavily on the Data Scientist’s ability to pick the right feature set. For simplicity, each feature can be considered a column of a dataset provided as a CSV file.

Even if the dataset has only one feature, the model selection process still plays a vital part as different algorithms need to be tried to find the best fit depending on the dataset distribution. For example, let us consider a dataset consisting of two columns: ‘Salary’ and ‘Years of Experience’. The goal is to predict the salary based on the experience level. Fig. 1 shows a Linear Regression illustration of the problem scenario.

However, Linear Regression alone is often insufficient, and other regression techniques, e.g. Polynomial Regression, Support Vector Regression, Regression Trees, etc., need to be tried to find the best fit. The best fit in this case corresponds to a minimal prediction error (model accuracy), measured in terms of different metrics, e.g. Mean-Squared-Error (MSE); in Fig. 1, this corresponds to minimizing the (squared) lengths of the lines connecting the data points to the regression line.

Given this dataset and knowledge, a Data Scientist would proceed by writing programs to apply the different regression models on the dataset. Using a ML framework, e.g. scikit-learn, this translates to writing/modifying a few lines of code to implement the models and then analyzing the model accuracy in terms of different error metrics.
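As an illustration of what that looks like, the sketch below compares a few regression models from scikit-learn on a synthetic ‘Salary’ vs. ‘Years of Experience’ dataset (the data and model settings here are hypothetical, chosen only to make the comparison runnable):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic 'Years of Experience' vs 'Salary' data, purely for illustration.
rng = np.random.RandomState(0)
X = rng.uniform(1, 10, size=(50, 1))                        # years of experience
y = 30_000 + 8_000 * X[:, 0] + rng.normal(0, 5_000, 50)     # salary

# Candidate regression models the Data Scientist would try.
models = {
    "linear": LinearRegression(),
    "polynomial": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "svr": SVR(kernel="rbf", C=1e4),
    "tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}

# Fit each model and compare the Mean-Squared-Error on the training data.
scores = {name: mean_squared_error(y, m.fit(X, y).predict(X))
          for name, m in models.items()}
best = min(scores, key=scores.get)
print(best, scores[best])
```

In a real project the comparison would of course use a held-out test set rather than training error, but the shape of the loop is the same: a few lines per model, then compare error metrics.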

I guess by now you get the picture: the process so far, in terms of feature extraction and model selection, can be automated. And this is precisely what an AutoML tool does.

Let us now consider AutoML in a Deep Learning (DL) context. The advantage of DL is that the model learns the feature set by itself, without manual feature engineering. This is achieved by training large-scale neural networks, referred to as Deep Neural Networks (DNNs), over large labeled datasets. Training a DNN occurs over multiple iterations (epochs). Each forward run is coupled with a feedback loop: the classification errors identified at the end of a run with respect to the ground truth (training labels) are fed back to the previous (hidden) layers to adapt their parameter weights — ‘backpropagation’. A sample DNN architecture is illustrated in Fig. 2.

From an implementation perspective, if we wanted to write a Neural Network (NN) to solve the previous ‘Employee Salary’ prediction problem, a Data Scientist would write something like the code below using a DL framework, e.g. TensorFlow.
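A minimal Keras sketch of such a network might look as follows; the data, layer sizes, and training settings are illustrative assumptions, not a tuned solution:

```python
import numpy as np
import tensorflow as tf

# Synthetic salary data, purely for illustration.
rng = np.random.RandomState(0)
X = rng.uniform(1, 10, size=(50, 1)).astype("float32")      # years of experience
y = (30_000 + 8_000 * X[:, 0]
     + rng.normal(0, 5_000, 50)).astype("float32")          # salary

# A small fully-connected network: two hidden layers of 8 neurons each,
# with a single linear output neuron for the regression target.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Train for a handful of epochs (a real run would use many more).
model.fit(X, y, epochs=5, batch_size=10, verbose=0)

pred = model.predict(X, verbose=0)
print(pred.shape)
```

Even in this toy version, notice how many choices are made by hand: the number of layers, neurons per layer, activation, optimizer, batch size, and epochs.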

DNNs are, however, very tricky to build and train. Typically, DL models are painstakingly designed by a team of Data Scientists. This manual design process is difficult because the search space of all possible models can be combinatorially large — a typical 10-layer network can have ~10^10 candidate networks! For this reason, designing networks often takes a significant amount of time and experimentation, even for those with significant DL expertise.

In practice, this translates to a trial-and-error process, trying out different combinations of the configurable parameters mentioned above, e.g. the number of hidden layers, the number of neurons per layer, the activation function, the optimizer, the batch size, the number of training epochs, etc. There are some known architectures that have been shown to work well for specific types of problems: Artificial Neural Networks (ANNs) for prediction, Convolutional Neural Networks (CNNs) for image classification, and Recurrent Neural Networks (RNNs)/Long Short-Term Memory networks (LSTMs) for time-series forecasting. However, beyond this knowledge and the availability of some pre-trained NNs (e.g. models pre-trained on ImageNet for image classification), the process of developing and training a NN for a new problem — Neural Architecture Search (NAS) [1] — very much remains an open problem.
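The trial-and-error loop over a few of these parameters can be sketched with scikit-learn’s `MLPRegressor` and a grid search; the parameter values and the synthetic data below are assumptions made for the sake of a runnable example:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# Synthetic salary data, purely for illustration.
rng = np.random.RandomState(0)
X = rng.uniform(1, 10, size=(60, 1))
y = 30_000 + 8_000 * X[:, 0] + rng.normal(0, 5_000, 60)

# A small grid over some of the configurable parameters discussed above.
param_grid = {
    "hidden_layer_sizes": [(8,), (16,), (8, 8)],   # layers / neurons per layer
    "activation": ["relu", "tanh"],                # activation function
    "max_iter": [500],                             # cap on training iterations
}

# Cross-validated trial-and-error: fit every combination, keep the best.
search = GridSearchCV(
    MLPRegressor(solver="adam", random_state=0),
    param_grid, cv=3, scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```

This exhaustive grid is exactly the kind of mechanical search that an AutoML tool takes off the Data Scientist’s hands, typically with smarter strategies than brute force.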

Reinforcement Learning (RL) has been shown to be a promising approach for NAS [2], where a controller neural network proposes a “child” model architecture, which is then trained and evaluated for quality on a particular task. That feedback is then used to inform the controller how to improve its proposals for the next round. This process is repeated thousands of times — generating new architectures, testing them, and giving that feedback to the controller to learn from. Eventually the controller network learns to assign high probability to areas of the architecture space that achieve better accuracy on a held-out validation dataset.
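A full RL controller is well beyond a blog-sized example, but the outer loop of NAS — propose a child architecture, evaluate it on held-out data, use the score as feedback — can be caricatured with random search standing in for the learned controller. Everything below (the architecture space, the data, the number of rounds) is a simplifying assumption:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

# Synthetic salary data, purely for illustration.
rng = np.random.RandomState(0)
X = rng.uniform(1, 10, size=(60, 1))
y = 30_000 + 8_000 * X[:, 0] + rng.normal(0, 5_000, 60)

def sample_architecture(rng):
    """Stand-in for the RL controller's proposal: a random 'child' architecture."""
    depth = rng.randint(1, 4)                       # 1-3 hidden layers
    return tuple(int(rng.choice([4, 8, 16]))        # neurons per layer
                 for _ in range(depth))

best_arch, best_score = None, -np.inf
for _ in range(10):                                 # a real NAS runs thousands of rounds
    arch = sample_architecture(rng)
    model = MLPRegressor(hidden_layer_sizes=arch, max_iter=500, random_state=0)
    # The held-out validation score is the 'feedback' on this child model.
    score = cross_val_score(model, X, y, cv=3,
                            scoring="neg_mean_squared_error").mean()
    if score > best_score:
        best_arch, best_score = arch, score
print(best_arch)
```

The RL version replaces `sample_architecture` with a trained controller that shifts its proposals toward high-scoring regions of the search space, rather than sampling blindly.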

Having gone through a bit of AutoML internals, and having gained a better understanding of the ML/DL pipelines, let us now look at the maturity of current AutoML tools.

It would be fair to say that AutoML is at a stage where DL was a couple of years back — riding very high on expectations.

Gartner expects that “by 2020, more than 40% of data science tasks will be automated” (link).

Forrester analysts in their May 2019 Wave report [3] said that “just about every company will have a stand-alone AutoML tool. We expect this market to grow substantially as products get better and awareness increases of how these tools fit in the broader data science, ML, and AI landscape”. In the same report, they ranked DataRobot, H2O.ai, and dotData as the three leading providers of AutoML.

CB Insights [4] lists over 40 AutoML companies today. Here, it is important to mention that while labeling a tool as AutoML has become ‘cool’ these days, there is often not much difference in the AutoML capabilities (in terms of the underlying ML/DL algorithms) that they offer.

AI/ML practitioners will be aware of the challenges that arise when a ML/DL model is trained and deployed by different teams on different platforms. So today’s AutoML tools primarily help in resolving the end-to-end training-to-deployment issues for a number of known ML algorithms.

Also, while mature AutoML tools, e.g. DataRobot, can be quite expensive, AutoML has recently become quite accessible, with the major cloud platforms providing integrated AutoML capabilities: Google Cloud AutoML, Microsoft Azure Machine Learning Service, and AWS SageMaker Autopilot. It is only fair to say that these tools are still quite limited at this stage, supporting mostly basic regression-based forecasting and text classification. For instance, none of them support Deep Learning (or NAS) at this stage, which is most likely due to the very high computational overhead of running NAS.

To conclude, let us come back to our original question: “Does AutoML mean the end of Data Scientists, basically the need for specialized Data Science skills?”

To answer this, let us start with the user interface. If by AutoML we mean a tool which, given an Excel/CSV file as input data, is able to output a trained model with reasonable accuracy, then yes, current AutoML tools can do this today.

The challenge arises when the model needs to be improved. Remember, DS/ML is an iterative process. You will rarely encounter a scenario where you get very high accuracy on the first dataset that you provide. If you do, something is probably wrong, or your problem is too trivial :-) So the problems with AutoML start when it needs to improve a model.

The same reasoning that makes ‘Explainable AI’ so difficult [5] applies here as well. The AutoML tool has a very limited understanding of why a model behaves the way it does. It can explore the input data distribution to point out certain data characteristics that can be improved. However, it will not be able to recommend that adding new data, e.g. weather or location, will improve the model’s accuracy, because it lacks the business/domain knowledge. Along the same lines, it currently lacks the technical know-how to discover a completely new neural network architecture. For example, while it can recommend the right number of hidden layers for an ANN, its NAS will not be able to recommend that adding ‘memory’ would solve the vanishing gradient problem of RNNs — the insight that led to the discovery of LSTMs.

To summarize, AutoML tools today are far from replacing your skilled Data Scientists — you still need them. However, they can act as a complementary or standalone DS/ML platform for Data Scientists, significantly accelerating their work by automating many of the exploratory stages that form part of any new DS use-case.
