
Architecting Data Science Applications on Snowflake

As machine learning (ML) adoption continues to grow, delivering more sophisticated insights across industries, it has become increasingly important for companies to develop capabilities that maximize the value of data within their organizations. Unlocking value in this modern analytics paradigm requires an adaptive platform with the ability to leverage data in a way that supports evolving business use cases.

The Snowflake Data Cloud offers a range of native platform tools and extensibility features to meet this growing need. Throughout this article, we’ll explore how these features can address common challenges within the ML lifecycle and support enterprise-scale data science applications on the Snowflake platform.

Using example reference architectures, design patterns, and best practices based on real-world experience and conversations with Snowflake, we’ll dive deeper into streamlining data ingestion with data sharing, standardizing feature engineering with a feature store, generating predictions with external ML services such as Amazon SageMaker, and monitoring performance with Snowsight dashboards.

As data platforms expand to meet the increasing volume and velocity of data generated by upstream producers, the complexity of the data ingestion process tends to grow as well. Mounting dependencies put pressure on both the platform and the organization, often constraining capabilities and limiting realized value.

Snowflake’s data sharing offering addresses this by allowing two or more Snowflake accounts to leverage the platform’s multi-layer architecture: consumers are granted access to data in the decoupled storage layer, while owners retain full control of governance through access policies defined in the metadata services layer. Traditional load options remain available, including the CLI-based SnowSQL and a connector ecosystem with support for Kafka, Spark, and Python, but sharing effectively eliminates the traditional data ingestion process, streamlining the workflows of various roles within the development team.
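
As a minimal sketch, assuming hypothetical database, table, and account names, the provider creates a share and grants it object privileges, and the consumer mounts the share as a read-only database:

```sql
-- Provider account: create a share and expose selected objects to it
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = consumer_acct;

-- Consumer account: mount the share as a read-only database (no data is copied)
CREATE DATABASE shared_sales FROM SHARE provider_acct.sales_share;
SELECT COUNT(*) FROM shared_sales.public.orders;
```

Because the consumer queries the provider’s storage layer directly, the shared data is always current and never duplicated.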

Data rarely lands within the platform in a format suitable for processing by an ML application. Instead, raw data commonly flows through an extract, transform, load (ETL) or extract, load, transform (ELT) pipeline, where it is enriched prior to predictive modeling. In practice, these pipelines often grow beyond manageable scope, warranting a workflow orchestration tool or some form of modularization. Even then, processes can still suffer from inefficiencies and inconsistencies related to repetitive computation.

One method of addressing these concerns is implementing a feature store: a consistent, curated collection of transformed data created for shared use across multiple teams and ML models. These reusable features reduce the amount of repeated processing required in ML applications, accelerating and standardizing development. When designing a feature store, organizations should weigh questions such as which entities and level of granularity the features describe, how frequently features need to be refreshed, and who owns each feature’s definition and quality.

To illustrate how this architecture can be implemented on the Snowflake platform, consider the pattern sketched below, which leverages Snowflake’s native streams, tasks, and stored procedures in combination with Snowpark to prepare data for use in ML.
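
A minimal sketch of the pattern, assuming a hypothetical raw_events source table and customer_features target table: a stream captures newly landed rows, and a scheduled task incrementally merges aggregates into the feature table (for more complex transformations, the MERGE body could be replaced by a call to a Snowpark stored procedure):

```sql
-- Capture incremental changes on the (hypothetical) raw_events table
CREATE OR REPLACE STREAM raw_events_stream ON TABLE raw_events;

-- Scheduled task: runs only when the stream has new rows,
-- merging incremental aggregates into the shared feature table
CREATE OR REPLACE TASK refresh_customer_features
  WAREHOUSE = transform_wh
  SCHEDULE  = '15 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('RAW_EVENTS_STREAM')
AS
  MERGE INTO customer_features f
  USING (
      SELECT customer_id,
             COUNT(*)      AS new_events,
             MAX(event_ts) AS last_event_at
      FROM raw_events_stream
      WHERE METADATA$ACTION = 'INSERT'
      GROUP BY customer_id
  ) s
  ON f.customer_id = s.customer_id
  WHEN MATCHED THEN UPDATE SET
      f.event_count   = f.event_count + s.new_events,
      f.last_event_at = GREATEST(f.last_event_at, s.last_event_at)
  WHEN NOT MATCHED THEN
      INSERT (customer_id, event_count, last_event_at)
      VALUES (s.customer_id, s.new_events, s.last_event_at);

-- Tasks are created suspended; resume to begin the schedule
ALTER TASK refresh_customer_features RESUME;
```

Consuming the stream inside the task advances its offset, so each run processes only the rows that arrived since the previous run.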

Even with a reusable semantic data layer in place, organizations still have a long way to go between inception and value delivery in ML applications. There is a significant cost of ownership to building, deploying, and maintaining models that is shared by data scientists, ML engineers, DevOps engineers, and others across the organization. As applications mature, development and maintenance requirements become more difficult to fulfill. Without proper management, this often leads to the creation of technical debt within the organization.

Snowflake’s external functions provide the platform-agnostic extensibility organizations need to access a rich ecosystem of ML solutions. Organizations also benefit from the optionality this architecture provides; as the operational context changes, so too can the components of their technical infrastructure.

In this section, we’ll focus on connecting to the Amazon Web Services (AWS) platform and the SageMaker ML service using Snowflake integration objects along with external stages, tables, and functions. The sketch below illustrates one approach to integrating the two platforms, allowing users to leverage a variety of AWS resources to generate ML predictions.
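
As a minimal sketch, assuming a hypothetical Amazon API Gateway endpoint that fronts a SageMaker inference endpoint and an IAM role provisioned for Snowflake (the role ARN and URLs are placeholders):

```sql
-- Integration object: Snowflake assumes the (hypothetical) IAM role to call API Gateway
CREATE OR REPLACE API INTEGRATION sagemaker_api_int
  API_PROVIDER = aws_api_gateway
  API_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-external-fn-role'
  API_ALLOWED_PREFIXES = ('https://abc123.execute-api.us-east-1.amazonaws.com/prod')
  ENABLED = TRUE;

-- External function: row batches are sent to the endpoint, which proxies
-- each request to the SageMaker model and returns its predictions
CREATE OR REPLACE EXTERNAL FUNCTION predict_churn(features OBJECT)
  RETURNS VARIANT
  API_INTEGRATION = sagemaker_api_int
  AS 'https://abc123.execute-api.us-east-1.amazonaws.com/prod/predict';

-- Generate predictions directly from SQL against the feature store
SELECT customer_id,
       predict_churn(OBJECT_CONSTRUCT('event_count', event_count)) AS churn_score
FROM customer_features;
```

Because the external function is invoked from ordinary queries, predictions can be generated without manually moving data out of Snowflake.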

By automating traditionally slow, manual processes, these tools increase the velocity with which teams can deliver and scale ML products while reducing complexity. This pattern decomposes traditionally monolithic ML applications by introducing a loosely coupled microservices architecture, simplifying both development and deployment. A customizable SageMaker pipeline object streamlines various responsibilities within the data science workflow, and its execution results in one of three terminal outcomes: a successful run, a failed run, or a manually stopped run.

Depending on the deployment strategy, the inference output of a pipeline can be re-ingested into Snowflake in several ways: for example, bulk loading from an external stage with COPY INTO, continuous loading with Snowpipe, or querying results in place through an external table, as sketched below.
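
A minimal sketch of the batch pattern, assuming a hypothetical S3 prefix where a SageMaker batch transform job writes CSV output and an existing storage integration named s3_int:

```sql
-- External stage over the (hypothetical) S3 prefix holding inference output
CREATE OR REPLACE STAGE sagemaker_output_stage
  URL = 's3://example-bucket/batch-output/'
  STORAGE_INTEGRATION = s3_int;

-- Option 1: query the files in place through an external table
CREATE OR REPLACE EXTERNAL TABLE churn_predictions_ext (
    customer_id VARCHAR AS (VALUE:c1::VARCHAR),
    churn_score FLOAT   AS (VALUE:c2::FLOAT)
  )
  LOCATION = @sagemaker_output_stage
  FILE_FORMAT = (TYPE = CSV);

-- Option 2: bulk load into a native table for downstream joins and reporting
COPY INTO churn_predictions
  FROM @sagemaker_output_stage
  FILE_FORMAT = (TYPE = CSV);
```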

Pipelines can also be integrated with other SageMaker features that build on this architecture to further simplify the ML process, such as the Model Registry for cataloging and approving model versions and Model Monitor for detecting drift in data and prediction quality.

With solution infrastructure in place, organizations must turn their attention toward performance evaluation. Monitoring and reporting impose challenges related to the reliability and accessibility of information, and a failure to properly address these concerns risks leaving the organization rudderless. Without the ability to gauge and communicate the degree to which success criteria are met, product teams will have trouble quantifying ROI and capitalizing on opportunities for improvement. This hampers effectiveness and limits the ability of solutions to realize their full potential.

Snowsight Dashboards provide an on-platform option to meet the need for a reporting solution, enabling users to extract valuable insights using embedded data visualization capabilities. Query-based tiles include support for aggregation and filtering operations, allowing users to define custom logic behind visualizations. In this SQL-driven architecture, a dashboard consisting of many tiles acts as a visual representation of the individual queries made against the underlying source data, allowing users to consolidate tasks within the data visualization process. Support for dashboard sharing between Snowflake users simplifies distribution, encouraging process transparency and data democratization amongst stakeholders.
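
As a minimal sketch, a single tile might chart daily model performance from a hypothetical churn_predictions table that stores each score alongside the outcome eventually observed:

```sql
-- Tile query: daily average predicted score vs. observed churn rate
SELECT DATE_TRUNC('day', scored_at) AS score_date,
       AVG(churn_score)             AS avg_predicted_score,
       AVG(IFF(churned, 1, 0))      AS observed_churn_rate
FROM churn_predictions
GROUP BY score_date
ORDER BY score_date;
```

A widening gap between the two series signals model drift and a candidate trigger for retraining.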

Emerging trends in ML continue to redefine how organizations leverage data, shifting the basis of competition toward modern, data-driven applications. Activating these capabilities can pose a significant challenge to organizations. Platforms like Snowflake can help enterprises abstract away complexity and develop scalable data science platforms that enable digital transformation.
