The Data Daily

Astronomer’s Cloud-Based Data Orchestration Brings Efficiency

I have recently written about the organizational and cultural aspects of being data-driven, and the potential advantages data-driven organizations stand to gain by responding faster to worker and customer demands for more innovative, data-rich applications and personalized experiences. I have also explained that data-driven processes require more agile, continuous data processing, with an increased focus on extract, load and transform processes — as well as change data capture and automation and orchestration — as part of a DataOps approach to data management. Safeguarding the health of data pipelines is fundamental to ensuring data is integrated and processed in the sequence required to generate business intelligence. The significance of these data pipelines to delivering data-driven business strategies has led to the emergence of vendors, such as Astronomer, focused on enabling organizations to orchestrate data engineering pipelines and workflows.

Astronomer was founded in 2018, building on a background in data engineering services to create a business around Apache Airflow, the open-source workflow monitoring and management project. Apache Airflow began as an internal development project at Airbnb in 2014 and became an Apache Software Foundation project in 2016. It enables data engineers to use Python to programmatically author, schedule and monitor workflows. Astronomer’s workforce has been heavily involved in the Apache Airflow development project since the company’s foundation, contributing to significant enhancements such as the scheduler performance and high-availability capabilities delivered in 2020 with Apache Airflow 2.0.

It was also in 2020 that Astronomer decided to focus on building a cloud offering through which it could deliver Apache Airflow as a managed platform, rather than providing technical support services for Apache Airflow deployments. The resulting Astro service was launched in June 2022. Its development was financed by venture capital funding, including a $213 million Series C round announced in March 2022 that was led by Insight Partners, along with Meritech Capital, Salesforce Ventures, J.P. Morgan, K5 Global, Sutter Hill Ventures, Venrock and Sierra Ventures. The funding round also facilitated Astronomer’s acquisition of data pipeline observability and data lineage specialist Datakin, which was founded by the creators of the OpenLineage and Marquez open-source projects. The addition of data lineage capabilities based on the OpenLineage project brings data observability functionality to Astronomer’s Airflow-as-a-service offering, with the company positioning the combination as a data orchestration cloud service.

Demand for more agile data pipelines is driven by the need for real-time data processing. More frequent data analysis requires data to be integrated, cleansed, enriched, transformed and processed for analysis in a continuous and agile process. As such, data-driven organizations are increasingly treating the steps involved in extracting, integrating, aggregating, preparing, transforming and loading data as a continual process. Data pipelines enable the flow of information through the organization, and are increasingly scheduled, automated and orchestrated by data engineers without the need for constant manual intervention. I assert that by 2024, six in 10 organizations will adopt data engineering processes that span data integration, transformation and preparation, producing repeatable data pipelines that create more agile information architectures.

Apache Airflow was designed to support robust and healthy data pipelines by providing a platform that can be used by data engineers to programmatically author, schedule and monitor data workflows. These could be data integration workflows to extract and combine data from multiple sources and load it into a target data platform. Additionally, Airflow can be used to schedule and orchestrate data science and machine learning pipelines involving multiple data integration and processing steps as well as operational analytics pipelines to generate immediate insight from operational applications. As such, Airflow has a potential role to play in both MLOps and Analytics Ops.
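To make the idea of programmatically authored workflows concrete, the following is a minimal sketch of dependency-ordered pipeline execution using only the Python standard library. It is not Airflow code and does not use the Airflow API; the task names, the sample data and the run_pipeline helper are all hypothetical, chosen only to illustrate how an orchestrator runs each step after the steps it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each receives a dict of its upstream results.
def extract_orders(_upstream):
    return [{"customer_id": 1, "amount": 40}, {"customer_id": 2, "amount": 60}]

def extract_customers(_upstream):
    return {1: "acme", 2: "globex"}

def transform(upstream):
    # Combine the two extracts into one enriched record set.
    customers = upstream["extract_customers"]
    return [
        {"customer": customers[o["customer_id"]], "amount": o["amount"]}
        for o in upstream["extract_orders"]
    ]

def load(upstream):
    # Stand-in for writing to a target platform: return the total loaded.
    return sum(row["amount"] for row in upstream["transform"])

TASKS = {
    "extract_orders": extract_orders,
    "extract_customers": extract_customers,
    "transform": transform,
    "load": load,
}

# Each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {
    "transform": {"extract_orders", "extract_customers"},
    "load": {"transform"},
}

def run_pipeline(tasks, dependencies):
    """Run every task after its upstream tasks; return (order, results)."""
    order, results = [], {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
        order.append(name)
    return order, results

order, results = run_pipeline(TASKS, DEPENDENCIES)
```

A real orchestrator adds what this sketch leaves out: scheduling, retries, parallel execution of independent tasks and monitoring of each run.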

While data engineers can deploy and run Apache Airflow on-premises or in the cloud, Astronomer’s Astro service delivers these capabilities as a managed service available on Amazon Web Services, Google Cloud or Microsoft Azure. In developing Astro, Astronomer reengineered Airflow for the cloud, with optimized configuration and auto-scaling capabilities. The managed service approach is also designed to reduce infrastructure consumption for long-term tasks and to relieve data engineers of security, upgrading and other management responsibilities, enabling them to focus on data pipelines. Astro also offers users the ability to visually monitor activity and data pipeline dependencies and, thanks to the acquisition of Datakin, the ability to collect lineage metadata as well as identify and monitor data quality metrics to improve trust in data. As I noted earlier this year, monitoring the quality and reliability of data is a key component of data observability's role in ensuring healthy data pipelines.

Although the capabilities offered by Astronomer today do not equate to a fully fledged data observability offering, information collected via OpenLineage can be used to support data observability as well as root cause analysis and impact planning. I assert that, through 2025, data observability will continue to be a priority for the evolution of data operations products as vendors deliver more automated approaches to data engineering, improving trust in enterprise data.
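As a rough illustration of how lineage metadata can support root cause analysis and impact planning, the sketch below walks a small lineage graph in both directions. The record layout (input dataset, job, output dataset), the dataset names and the helper functions are assumptions made for this example; they are not the OpenLineage event format or any Astronomer API.

```python
from collections import defaultdict, deque

# Hypothetical lineage records: (input dataset, job, output dataset).
LINEAGE = [
    ("raw.orders", "clean_orders", "staging.orders"),
    ("raw.customers", "clean_customers", "staging.customers"),
    ("staging.orders", "build_revenue", "marts.revenue"),
    ("staging.customers", "build_revenue", "marts.revenue"),
    ("marts.revenue", "refresh_dashboard", "bi.revenue_dashboard"),
]

def _build_graph(records, reverse=False):
    """Map each dataset to its direct consumers (or producers if reversed)."""
    graph = defaultdict(set)
    for src, _job, dst in records:
        if reverse:
            graph[dst].add(src)
        else:
            graph[src].add(dst)
    return graph

def _reachable(graph, start):
    """Breadth-first search: every dataset reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

def upstream_of(dataset, records=LINEAGE):
    """Root cause candidates: everything the dataset was derived from."""
    return _reachable(_build_graph(records, reverse=True), dataset)

def downstream_of(dataset, records=LINEAGE):
    """Impact set: everything that would be affected by a change."""
    return _reachable(_build_graph(records), dataset)
```

If the revenue dashboard looks wrong, upstream_of("bi.revenue_dashboard") lists the datasets worth inspecting; before changing raw.orders, downstream_of("raw.orders") lists what the change could affect.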

While Apache Airflow and OpenLineage provide the core building blocks of the Astro data orchestration cloud service, there are opportunities for Astronomer to expand on this functionality with capabilities such as automated anomaly detection, alerting and root cause analysis. That said, I recommend that all organizations currently managing Apache Airflow deployments, or considering the use of Apache Airflow to orchestrate data engineering pipelines and workflows, include Astronomer in their evaluations.
