Empowering Data Science with an Effective Team
In an organization looking to move toward data-driven decision making, data scientists must work in concert with other teams. Collaboration requires more than just understanding how to use the latest machine learning techniques to analyze data.
It requires a complete understanding of the business problem to be solved. From a people-oriented standpoint, this includes developing a sense for how business users will best interact with machine learning tools. From a technical standpoint, it also requires high quality data and a way to securely deploy models in production.
To support the data science effort, we recommend that Data Scientists embed within teams composed of the following:
Human Centered Design practitioners responsible for ensuring the end user is at the center of solutions
DataOps practitioners responsible for maintaining high quality data
DevSecOps practitioners responsible for pipelines, security, and delivery
Each group has a role to play in ensuring the integrity of the data pipeline from collection through communication of insights. These teams can offer speed to insight across many potential use cases, such as reports and dashboarding, advanced analytics, natural language processing (NLP), and computer vision (image recognition, classification, and detection).
Human Centered Design
Human-centered design (HCD) is an approach to problem solving, commonly used in design and management, that develops solutions to problems by involving the human perspective in all steps of the problem-solving process.
As we’ve written about in other articles, this team aims to learn directly from end users about what’s currently not working for them in order to prevent pitfalls that could arise from deploying a machine-learning-powered tool into the existing workflow. HCD facilitates communication across stakeholder groups to ensure the end product resolves the right problem rather than deepening existing feedback loops.
HCD is an important part of building a machine learning tool. This team ensures the needs of the end users are met, the handoff between machine and human is seamless, and negative feedback loops are prevented to the greatest extent possible.
An example of the importance of HCD for machine learning is illustrated by Google’s efforts to develop analytics to detect diabetic retinopathy. Google discovered that while the model worked well in the lab, it failed 20% of the time in real-world settings, where photos were often taken in poor lighting conditions.
Had the team incorporated HCD practices from the outset, they would have directed their attention to the environment into which the computer vision tool would be deployed, and they would have adjusted their system design accordingly.
Machine learning is poised to impact our day-to-day lives through new media such as augmented and virtual reality and ground-breaking technologies such as self-driving cars. The imperative to develop a deep understanding of users’ needs through an HCD process will only grow as these systems become increasingly ubiquitous.
DataOps
Data quality is paramount to fostering trust in any data product. To create a culture founded on unwavering faith in data quality, an organization should develop strong data governance that links data to the strategic vision. This may involve modernizing tools that ingest, store, and catalog data. This is where the DataOps team comes in.
Issues of data quality are further exacerbated by poor handoffs between storage systems and a lack of well-documented metadata. Poor data governance contributes to the oft-cited estimate that, in a typical organization, data scientists spend up to 80% of their time finding and cleaning data.
To rectify these challenges, DataOps should work with leadership to perform a data health check on existing systems, create a data strategy that advances business outcomes, and encode best practices through formalized data governance that includes training and cultural elements around the data lifecycle.
In addition, DataOps works with teams on the ground to extract and store metadata and use it to create a data catalog. This information can be communicated to potential users in the form of a data profile that also contains information about the owner, description, and other tags (e.g., training, validation, or testing for a data science use case). Where possible, we recommend storing data as immutable in order to maintain consistency and to properly track provenance and lineage of versioned datasets that result from the feature exploration process. The data catalog facilitates identification of high-quality datasets, dramatically decreasing time to develop new algorithms or perform transfer learning on existing machine learning models.
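To make this concrete, here is a minimal sketch of what a data-profile record in such a catalog might capture. The `DataProfile` class, its field names, and the example values are our own illustration, not any particular catalog product’s schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative data-profile record; the fields mirror the catalog entry
# described above (owner, description, tags, version for immutability).
@dataclass
class DataProfile:
    name: str
    owner: str
    description: str
    tags: list[str] = field(default_factory=list)  # e.g. "training", "validation", "testing"
    version: str = "v1"  # immutable datasets get a new version rather than in-place edits
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

profile = DataProfile(
    name="claims_2023_q4",
    owner="data-ops@example.org",  # hypothetical owner
    description="De-identified insurance claims, Q4 2023 snapshot.",
    tags=["training"],
)
print(profile)
```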
Moreover, the DataOps team is responsible for providing robust protection against underlying bias in the dataset as well as potential bias in resulting machine learning products. Tools such as FairTest and AI Fairness 360 can be integrated into the data pipeline to uncover potential bias in features before machine learning malpractice takes place.
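As a rough illustration, the sketch below computes two standard group-fairness metrics with AI Fairness 360 on a toy dataset; the column names, group definitions, and data are invented for the example.

```python
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

# Toy frame: 'label' is the favorable outcome, 'age_group' the protected
# attribute (1 = privileged, 0 = unprivileged). Values are illustrative.
df = pd.DataFrame({
    "age_group": [1, 1, 1, 0, 0, 0],
    "income":    [50, 60, 55, 52, 58, 61],
    "label":     [1, 1, 0, 0, 0, 1],
})

dataset = BinaryLabelDataset(
    df=df,
    label_names=["label"],
    protected_attribute_names=["age_group"],
)

metric = BinaryLabelDatasetMetric(
    dataset,
    unprivileged_groups=[{"age_group": 0}],
    privileged_groups=[{"age_group": 1}],
)

# Disparate impact near 1.0 and mean difference near 0.0 suggest the
# favorable label is distributed evenly across the two groups.
print("Disparate impact:", metric.disparate_impact())
print("Mean difference:", metric.mean_difference())
```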
Our team has witnessed the importance of bias prevention in the healthcare field with regard to social determinants of health. As machine learning is used to detect diseases (e.g., lung and breast cancer) and to assess potentially fraudulent claims, it’s essential that models are trained on datasets unbiased with respect to race, gender, and age.
DataOps also has a role to play in protecting datasets against corruption, such as the threat of data poisoning, which occurs when samples are inserted into training data specifically to degrade model performance. The field of computer vision offers many perplexing examples of this phenomenon.
A commonly used pipeline architecture ingests data into the model for test and evaluation through a REST-based web service or a tool such as Apache NiFi. Output from the model should then be saved to the data catalog for further analysis, including assessment of bias, and potential reintroduction into the data lifecycle.
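A minimal sketch of such a REST-based inference service, here using Flask; the model artifact path, endpoint name, and payload format are assumptions for illustration rather than a prescribed design.

```python
from flask import Flask, jsonify, request
import joblib  # assumes the trained model was serialized with joblib

app = Flask(__name__)
model = joblib.load("model.joblib")  # hypothetical artifact path

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    features = payload["features"]  # e.g. [[5.1, 3.5, 1.4, 0.2]]
    prediction = model.predict(features).tolist()
    # In the pipeline described above, the inputs and outputs would also be
    # written back to the data catalog for later analysis, including bias checks.
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```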
Data Scientists
This is the team responsible for training and testing machine learning-based data products. As the field matures, elements of data science “tinkering” are becoming increasingly routinized and standardized.
For example, hyperparameter tuning, once considered a “dark art,” is now yielding to scientific methodologies and replicable best practices. An exhaustive hyperparameter grid search is no longer needed; instead, machine learning engineers can use clues from early training iterations to home in on optimal hyperparameter settings. Many experiments can be run simultaneously on cloud-based instances, with results logged in a tool such as MLflow.
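For instance, a few lines of MLflow tracking are enough to log each hyperparameter configuration as a separate run; the training function below is a stand-in for real training code.

```python
import mlflow

def train_and_evaluate(learning_rate: float, batch_size: int) -> float:
    """Placeholder for a real training loop; returns a validation score."""
    return 0.9 - abs(learning_rate - 1e-3)  # stand-in result for the sketch

# Log one configuration per run; many such runs can be launched in parallel
# on cloud instances and compared side by side in the MLflow UI.
for lr in (1e-2, 1e-3, 1e-4):
    with mlflow.start_run():
        mlflow.log_param("learning_rate", lr)
        mlflow.log_param("batch_size", 64)
        mlflow.log_metric("val_accuracy", train_and_evaluate(lr, batch_size=64))
```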
Increasingly, data science should be thought of as an operations role, not the domain of a specialist performing one-off experiments on data samples within Jupyter Lab.
DevSecOps
DevSecOps should have experience with containers, including Kubernetes-based container orchestration tools such as Red Hat OpenShift and Amazon Elastic Kubernetes Service (EKS). The team should verify that the environment can scale and select the right tools to support the model as it performs inference in production.
These cloud engineers can write automated deployment scripts using tools such as Terraform and Ansible to stand up and tear down test harnesses in the cloud or in an on-premises data center. We recommend Jenkins for the DevSecOps pipeline: it can coordinate model deployment, run all necessary tests, and capture metrics for analysis, as in the sketch below.
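As a rough sketch of what such automation might look like, the script below drives a deploy/test/teardown cycle through the Terraform CLI from Python. The directory layout and test command are assumptions, and in practice this logic would typically live inside a Jenkins pipeline stage rather than a standalone script.

```python
import subprocess

def run(cmd: list[str], cwd: str = "infra") -> None:
    """Run a CLI command in the Terraform working directory, failing loudly."""
    subprocess.run(cmd, cwd=cwd, check=True)

# Stand up a disposable test harness, run the model's tests against it,
# then tear everything down -- the cycle a Jenkins stage would automate.
run(["terraform", "init"])
run(["terraform", "apply", "-auto-approve"])
try:
    run(["pytest", "tests/"], cwd=".")  # hypothetical integration test suite
finally:
    run(["terraform", "destroy", "-auto-approve"])
```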
Summary
By conceptualizing the data science workflow as taking place across these four teams, an organization can ensure that its business units are moving in concert toward strategic aims.
HCD sets forth requirements, engages the end user in designing the solution, generates stakeholder buy-in, and reduces user-adoption risks around machine learning deployment. DataOps collects and warehouses high-quality data. Data Scientists capitalize on process-improvement opportunities through data product solutions. DevSecOps creates data pipelines, ensures security, and delivers data-driven tools.
Underlying the work of all four teams, a strong data strategy is critical to producing actionable insights and deploying advanced analytics.