It has — perhaps somewhat prematurely — been called the sexiest job of the twenty-first century, but whether you buy into the Big Data hype or not, data science is here to stay.
The media promote the image of the data science ‘artiste’: a data bohemian who lives among free, like-minded spirits in lofty surroundings, and who receives sacks of money in exchange for genuine works of art created with any possible ‘cool’ tool that flutters by in whatever direction the wind is blowing that day.
The reality for many in the field is quite different. Corporations rarely grant anyone unfettered access to all data, and similarly they are not willing to try and buy every new tool that hits the market, simply to satisfy someone’s curiosity. Furthermore, industrial data science has requirements that are much stricter than what is commonly taught in programmes around the world, and it’s time to make the case for industrial data science.
Production quality, or its reverse, failures, and product safety are the main drivers of industrial data science.
Quality, cost, and delivery (QCD) are central to the operation of many industrial corporations. In manufacturing, for instance, quality is not just about providing customers with high-quality goods, it’s about ensuring that products are safe to use. This is especially true with safety-critical components, such as airbag control units, jet engines, and cardiopulmonary bypass pumps.
In manufacturing, failures nowadays happen in the so-called six sigma region, meaning that a few parts in a million are defective and need to be scrapped. Because many of the components that go into the products are expensive, real-time early-warning systems are needed. Sure, it’s expensive to scrap parts in production but it is even more costly when failures occur in the field, where they can potentially have deadly consequences. Recent examples of such field failures include GM’s faulty ignition switch and Takata’s airbag inflator.
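To put a number on ‘a few parts in a million’, here is a small sketch of the usual six sigma arithmetic. It assumes the common convention of a 1.5 sigma long-term shift of the process mean, so that the nearest specification limit sits 4.5 sigma away, and it ignores the negligible tail on the far side. The function name is mine, for illustration only.

```python
import math


def upper_tail_ppm(z: float) -> float:
    """Parts per million of a standard normal distribution that lie above z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0)) * 1e6


# Assumption: six sigma quoted with the conventional 1.5 sigma long-term shift,
# so the effective distance to the nearest specification limit is 4.5 sigma.
six_sigma_ppm = upper_tail_ppm(6.0 - 1.5)
print(f"{six_sigma_ppm:.1f} defective parts per million")
```

Under that convention the result is roughly 3.4 defective parts per million, which is the figure usually quoted for six sigma processes.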
To ensure that any analytics model captures failures correctly and in a timely manner, every single relevant event needs to be captured and processed in the correct order. For complex event processing (CEP) or real-time scoring engines, this means that all pertinent events must be guaranteed to be delivered at least once, preferably exactly once. At-most-once delivery is simply not good enough.
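One way to get from at-least-once delivery to effectively-once, in-order processing is to deduplicate and re-sequence on the consumer side. The following is a minimal sketch, assuming every event carries a unique identifier and a gap-free sequence number assigned at the source; the class and field names are hypothetical and not taken from any particular CEP product.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class Event:
    event_id: str   # unique per event, used to drop redelivered duplicates
    sequence: int   # position in the source stream, assumed to start at 0
    payload: float


@dataclass
class OrderedDeduplicator:
    """Turns an at-least-once, possibly reordered stream into in-order,
    process-once calls to `handler`."""
    handler: Callable[[Event], None]
    next_sequence: int = 0
    seen_ids: Set[str] = field(default_factory=set)
    pending: Dict[int, Event] = field(default_factory=dict)

    def receive(self, event: Event) -> None:
        if event.event_id in self.seen_ids:
            return  # redelivery of an event we already accepted
        self.seen_ids.add(event.event_id)
        self.pending[event.sequence] = event
        # Release buffered events only while there is no gap in the sequence.
        while self.next_sequence in self.pending:
            self.handler(self.pending.pop(self.next_sequence))
            self.next_sequence += 1


processed: List[Event] = []
dedup = OrderedDeduplicator(handler=processed.append)
incoming = [
    Event("b", 1, 2.0),  # arrives early
    Event("a", 0, 1.0),
    Event("b", 1, 2.0),  # duplicate delivery
    Event("c", 2, 3.0),
]
for event in incoming:
    dedup.receive(event)
print([e.event_id for e in processed])
```

A production version would also need to persist its state and bound the memory used for `seen_ids` and `pending`; the sketch leaves both out.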
With sentiment analysis based on tweets, for instance, your model does not really suffer if an occasional event (i.e. tweet) is delivered in the wrong order or is dropped altogether. The simplest mode, at-most-once delivery, is already good enough. As long as most tweets arrive the model will be fine. This is not the case in industrial settings.
When introducing a model with tuned hyperparameters or just a different, more accurate model, the trade-off between false positives and false negatives often leaves some wiggle room. This trade-off is typically examined by means of the ROC curve or even a collection of confusion matrices for different cut-offs.
In models that are supposed to detect failures, more false negatives cannot be permitted. A false negative is a part that ought to have been scrapped but was flagged as OK and thus moved on to the next processing step. More false positives are not ideal either, because additional resources need to be used to analyse the potential failure, but at least it does not represent a risk that cannot be tolerated.
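The asymmetry can be made concrete when picking a cut-off: treat the number of tolerated false negatives as a hard budget and only then minimize the false positives. The sketch below assumes the model emits a score per part and that parts scoring at or above the cut-off are flagged; the function names and the toy data are illustrative assumptions.

```python
from typing import Optional, Sequence, Tuple


def confusion_counts(scores: Sequence[float], is_defective: Sequence[bool],
                     cutoff: float) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) when parts with score >= cutoff are flagged."""
    tp = fp = fn = tn = 0
    for score, defective in zip(scores, is_defective):
        flagged = score >= cutoff
        if flagged and defective:
            tp += 1
        elif flagged:
            fp += 1
        elif defective:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def pick_cutoff(scores: Sequence[float], is_defective: Sequence[bool],
                max_false_negatives: int = 0) -> Optional[float]:
    """Highest observed score usable as cut-off within the false negative budget.

    False negatives can only grow and false positives can only shrink as the
    cut-off rises, so the highest admissible cut-off has the fewest false positives.
    """
    best = None
    for cutoff in sorted(set(scores)):
        _, _, fn, _ = confusion_counts(scores, is_defective, cutoff)
        if fn <= max_false_negatives:
            best = cutoff
    return best


scores = [0.05, 0.10, 0.40, 0.35, 0.80, 0.90]
is_defective = [False, False, False, True, True, True]
cutoff = pick_cutoff(scores, is_defective, max_false_negatives=0)
print(cutoff, confusion_counts(scores, is_defective, cutoff))
```

In this toy data set one good part scores higher than a defective one, so the zero-miss cut-off accepts one false positive rather than letting the defective part through.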
Consequently, redundancy is also a critical ingredient: there must not be a single point of failure in the data flow. In case the real-time scoring or CEP engine is unavailable, there either has to be a backup that kicks in immediately or the model must revert to a default solution that is good enough.
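A fallback chain of this kind can be sketched in a few lines. The example assumes that each scorer is a callable that either returns a decision or raises when its engine is unavailable, and that the conservative default is to flag the part for inspection; both the names and that default are my assumptions, not a prescription.

```python
from typing import Callable, Mapping, Sequence

# A scorer returns True when the part should be flagged as a potential failure.
Scorer = Callable[[Mapping[str, float]], bool]


def score_with_fallback(part: Mapping[str, float],
                        scorers: Sequence[Scorer],
                        default_flag: bool = True) -> bool:
    """Try each scorer in order of preference and fall back to a default.

    Assumption: when nothing is available, flagging the part (True) is the
    safe choice, because a missed failure costs more than an extra inspection.
    """
    for scorer in scorers:
        try:
            return scorer(part)
        except Exception:
            # Broad on purpose in this sketch: any failure of one engine
            # should hand over to the next one instead of stopping the line.
            continue
    return default_flag


def primary_engine(part: Mapping[str, float]) -> bool:
    # Stand-in for a real-time scoring engine that is currently down.
    raise ConnectionError("scoring engine unreachable")


def backup_rule(part: Mapping[str, float]) -> bool:
    # Hypothetical static rule used as the good-enough default solution.
    return part["temperature"] > 250.0


flag = score_with_fallback({"temperature": 262.0}, [primary_engine, backup_rule])
print(flag)
```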
The level of accuracy is also quite different in ordinary data science. The Netflix Prize awarded $1m to the data science team that beat Netflix’s own user rating predictions by 10% (in terms of the RMSE). In the parts-per-million (ppm) arena, such improvements in accuracy for mature analytical models are almost unheard of. Whereas a huge effort to improve the accuracy by a fraction of a per cent may not be warranted in Google AdWords campaigns or Netflix recommendations, such an improvement may save loads of cash in industry.
We have already looked at streaming data, but typically corporations also have plenty of data at rest inside data warehouses and ERP systems. Master data is a classic example but it is not the only data that can be found inside RDBMSs. In fact, some data may not be available as streams, for example, read-only machine logs or certain data sets from a manufacturing execution system (MES). These systems haven’t historically been designed to satisfy the current appetite for data.
The idea of schema-on-read (e.g. HBase and other NoSQL stores) or even schema-on-the-fly (e.g. Apache Drill) is a utopia in many corporations. A considerable cleansing and integration effort is required.
In addition, SaaS is not an option: the network traffic would immediately become a bottleneck. More importantly though, few industrial companies want to send their data outside of their corporate firewall. As such, only on-premises solutions are possible.
In themselves, these considerations do not limit data science in any significant way. However, not everything is plug-and-play. Some of the most popular open-source libraries in R and Python have almost zero support for MapReduce or Spark. Yes, there is MLlib, but it is not nearly as complete as what Python and R have to offer.
What is even more critical is that some open-source solutions come with limited or no (commercial) support, which is not something many large corporations are too happy about. Related to that is that these solutions rarely offer the whole experience: a reusable model repository with solid built-in documentation capabilities that comes with one-click deployment seems to be mainly available in commercial suites. This DIY mentality is fine and perhaps even encouraged in start-ups and academia but many industrial companies have a hard time with what some perceive as a willy-nilly attitude towards data and software.
This leads us naturally to continuous integration (CI) issues. In production environments, continuous integration and DevOps are common practice. Analytical models that have been developed for one facility or plant may have to be rolled out to (and maintained at) other locations. These may have similar requirements but that is not necessarily true; not every company builds its manufacturing plants according to Intel’s Copy Exactly philosophy. Hence, a platform and deployment standards are required. The platform ought to be capable of simple re-training of the models and running several versions in parallel for comparison purposes.
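Running several versions in parallel can be as simple as letting one version make the decision while the others score the same input in the background. Here is a minimal sketch of that idea, with a hypothetical `VersionedScorer` class and two toy models standing in for real ones; it is not meant to mirror any specific platform.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Sequence, Tuple

Model = Callable[[Sequence[float]], float]


@dataclass
class VersionedScorer:
    """Keeps several model versions side by side; only one drives decisions."""
    models: Dict[str, Model] = field(default_factory=dict)
    active_version: Optional[str] = None

    def register(self, version: str, model: Model, activate: bool = False) -> None:
        self.models[version] = model
        # The first registered version becomes active unless told otherwise.
        if activate or self.active_version is None:
            self.active_version = version

    def score(self, features: Sequence[float]) -> Tuple[float, Dict[str, float]]:
        """Return the active version's score plus all scores for comparison."""
        if self.active_version is None:
            raise RuntimeError("no model version registered")
        all_scores = {version: model(features) for version, model in self.models.items()}
        return all_scores[self.active_version], all_scores


scorer = VersionedScorer()
scorer.register("v1", lambda x: sum(x) / len(x))  # toy model: mean of the features
scorer.register("v2", lambda x: max(x))           # toy candidate: maximum of the features
decision_score, comparison = scorer.score([0.25, 0.75])
print(decision_score, comparison)
```

Logging the comparison dictionary per scored part gives the evidence needed to decide whether a re-trained version should be promoted at a given plant.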
You could argue that data scientists and data engineers account for the distinction: the former come up with the ‘creative’ ideas and the latter do the professional implementation. However, as data science matures within companies I doubt that these companies want to stay in single-project mode forever, especially since that can create an atmosphere in which the data scientists become the frivolous artists and the data engineers their impresarios. I’ll have more to say on business processes in a moment, so bear with me on this one.
Closely related to the architecture is the impact a solution has on production. Ideally, the impact in terms of performance degradation is negligible. Many legacy systems were never designed to be accessed from the outside continuously or even at all. It is therefore possible that these systems are affected negatively by connecting modern systems that voraciously consume data. In the case of manufacturing execution systems, the performance impact can be disastrous. Any increase in the cycle time of products, simply because the MES has to deal with an additional load, is unacceptable. Industrial data science solutions need to be minimally invasive yet operate with maximum (positive) impact.
Even if you buy into the data scientist/engineer divide, data scientists cannot access data with impunity. They too have to be mindful of performance considerations when connecting to live systems. This of course is just another argument that the line between data scientist and data engineer in an industrial setting is not very clear. In fact, industrial data scientists have to be able to deal with such concerns independently.
Data science in most of its guises is currently still in what I call single-project mode. A problem with potential is identified, a project is initiated, and if the project is successful, a permanent solution is developed. That works for many modern organizations, but in an industrial setting, where much of the work has been automated or is at least following a script, that won’t be a long-term solution. Repeatable, well-documented business processes are paramount. Hence, at some point industrial companies need to mature from project-focussed to process-oriented data science.
A framework such as CRISP-DM or SEMMA may help when doing projects, as that at least ensures that there is no variation in the way the projects are done. More importantly, it allows projects to be handed over to operations with proper documentation of each step; I personally recommend CRISP-DM as it covers the whole data science life cycle from business and data understanding to evaluation and deployment.
Such framework standardization is basically a baby step towards the industrialization of data science within an organization. I am well aware that some data scientists do not like the idea of formalizing the entire process, but that is a crucial component of industrial data science.
On a somewhat related note, there is sometimes an impedance mismatch between the data and business teams with regard to how ‘projects’ are done. Some organizations have not yet embraced agile and stick with classical project management, which is perfectly fine. However, data science teams often work in an agile fashion. In itself that is not a problem, but it can cause duplication of work.
Domain knowledge is obviously crucial to data science. It’s what makes feature engineering and selection so much more effective. However, an industrial data scientist must know not just about the processes and related control technologies (SPC/APC and run-to-run), but basic knowledge about physics, chemistry, engineering, biology, pharmacology, or medicine may be required in certain situations too. On top of that, in order to be able to talk with process engineers effectively, knowledge of automatic optical inspection (AOI) technologies, defect engineering, FMEA/FTA, Six Sigma, drug approval processes, and many more is a must.
Because of this complexity, analytical models can easily have tens of thousands of variables prior to feature engineering, especially when data scientists perform analyses across the value chain. A single product may be assembled from tens to hundreds of individual parts, all of which typically have a few dozen process steps each. In case ICs and/or on-board sensors are involved, the number of steps for the IC alone is typically a few hundred, each with single measurement values, time series, and images of (microscopic) structures.
Although Facebook may be able to experiment on its users with impunity and while Uber can launch a service that appears to be against the law in some countries, industrial companies have to abide by strict laws to ensure the safety of their products. You may not believe that based on the recent Volkswagen scandal though. Nevertheless, regulatory compliance is not an option, and familiarity with applicable laws is a must for industrial data scientists.
Since we’re on the topic of processes, I firmly believe that industrial data scientists need to embrace process mining, which is a fairly young discipline. It allows process flows to be discovered (i.e. learned) from raw events, which in turn allows deviations from the norm or a pre-defined standard (i.e. process compliance) to be identified. Similarly, graph analysis will become more commonplace in industry, as it is a natural framework to analyse value chains and study the genealogy of product components.
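To give a flavour of what discovering a flow from raw events means, the sketch below builds the directly-follows relation, one of the simplest building blocks in process mining, and uses it for a basic compliance check. It assumes events arrive as (case, timestamp, activity) rows; the lot identifiers, activity names, and allowed transitions are made up for illustration, and real discovery algorithms go well beyond this.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

EventRow = Tuple[str, int, str]  # (case_id, timestamp, activity)
Transition = Tuple[str, str]


def build_traces(events: Iterable[EventRow]) -> Dict[str, List[str]]:
    """Group raw events by case and order each case by timestamp."""
    by_case: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for case_id, timestamp, activity in events:
        by_case[case_id].append((timestamp, activity))
    return {case: [activity for _, activity in sorted(rows)]
            for case, rows in by_case.items()}


def directly_follows(traces: Dict[str, List[str]]) -> Dict[Transition, int]:
    """Count how often activity b directly follows activity a in any trace."""
    counts: Dict[Transition, int] = defaultdict(int)
    for trace in traces.values():
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return dict(counts)


def deviating_cases(traces: Dict[str, List[str]],
                    allowed: Set[Transition]) -> Dict[str, List[Transition]]:
    """Return, per case, the transitions that the reference process does not allow."""
    deviations = {}
    for case, trace in traces.items():
        bad = [(a, b) for a, b in zip(trace, trace[1:]) if (a, b) not in allowed]
        if bad:
            deviations[case] = bad
    return deviations


events = [
    ("lot-1", 1, "etch"), ("lot-1", 2, "clean"), ("lot-1", 3, "inspect"),
    ("lot-2", 2, "inspect"), ("lot-2", 1, "etch"),  # rows out of order, clean skipped
]
traces = build_traces(events)
graph = directly_follows(traces)
allowed = {("etch", "clean"), ("clean", "inspect")}
print(graph)
print(deviating_cases(traces, allowed))
```

The transition counts are also the edges of a directed graph, which is one reason process mining and graph analysis sit so naturally side by side.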
Communication is often mentioned as the key skill that separates good data scientists from the truly great. It’s true, but that is only one side of the story. The other is that great communication skills still don’t get you anywhere when the organization resists, and there will be plenty of cases where a data scientist’s persistence is futile. Sometimes your hands are tied. An example that springs to mind is employee monitoring, which quickly lands you in hot water with the law or trade unions. No matter how smoothly you talk or how great your idea really is, no sometimes really means just that: no.
Anyway, dealing with corporate IT is definitely not ‘sexy’, to come back to the epithet. In fact, it can be downright frustrating. I’ve already mentioned the potential problem when interfacing with mission-critical systems, such as an MES, but beyond technical reasons, there may be political or even circumstantial ones. People responsible for different aspects (e.g. IT vs quality control) frequently have different agendas.
Managing expectations is of course not unique to industrial data scientists, but the need to integrate with a plethora of systems is not that common outside of industry. Not even data monsters like Google have such a huge variety of interlocking gears as most industrial companies.
About 50-80% of a data scientist's time is spent on data cleansing and preparation. What is almost never mentioned, yet definitely a part of the life of an industrial data scientist, is that waiting for the corporate cogs to align may take even longer than the time a data scientist needs to whip the data into shape. This is rarely something an industrial data scientist can control, but it is a reality that few are prepared for.
Without data governance, a data scientist is in a unique position to see the skeletons in the closet. Along with that comes the risk of becoming the designated data janitor because the data quality issues are more pressing. That risk is higher in an industrial giant than in a lean start-up, since ‘strategic re-evaluations’ are more common in the former than in the latter.
Depending on your perspective, industrial data science is either a specialization or an extension of regular data science. In some ways, it adds constraints, mainly organizational and architectural ones. In other ways, it requires a data scientist to be even more of an all-rounder with broader domain knowledge and technical expertise. Nevertheless, industrial data science exists, and it’s time we developed realistic programmes for it, because the people aren’t going to train themselves.