The Data Daily

The Modern Data Lakehouse: An Architectural Innovation | 7wData

Imagine having self-service access to all business data, wherever it resides, and being able to explore it all at once. Imagine answering burning business questions almost instantly, without waiting for data to be found, shared, and ingested. Imagine independently discovering rich new business insights from structured and unstructured data working together, without having to beg for data sets to be made available. As data analysts and data scientists, we would all love to be able to do these things, and much more. This is the promise of the modern data lakehouse architecture.

According to Gartner, Inc. analyst Sumit Pal, in “Exploring Lakehouse Architecture and Use Cases,” published January 11, 2022: “Data lakehouses integrate and unify the capabilities of data warehouses and data lakes, aiming to support AI, BI, ML, and data engineering on a single platform.” This sounds good on paper, but how do we build it in reality, in our own organizations, and deliver on the promise of self-service across all data?

Cloudera has been supporting data lakehouse use cases for many years, using open source engines on open data and table formats. This allows data engineering, data science, data warehousing, and machine learning to work on the same data, on premises or in any cloud. Innovations in the cloud have driven data explosions. We’re asking new and more complex questions of our data to gain even greater insights. We’re bringing in new data sets in real time, from more diverse sources than ever before. These innovations bring new challenges for our data management solutions. Meeting them requires architecture changes and the adoption of new table formats that can support massive scale, offer greater flexibility in compute engines and data types, and simplify schema evolution.

Scale: With the massive growth of new data born in the cloud comes a need for cloud-native data formats for files and tables. These new formats need to accommodate massive increases in scale while shortening the response windows for accessing, analyzing, and using these data sets for business insights. To respond to this challenge, we need to incorporate a new, cloud-native table format that is ready for the scope and scale of our modern data.

Flexibility: With the increased maturity and expertise around advanced analytics techniques, we demand more. We need more insights from more of our data, leveraging more data types and levels of curation. With this in mind, it’s clear that no “one size fits all” architecture will work here; we need a diverse set of data services, fit for each workload and purpose, backed by optimized compute engines and tools.

Schema evolution: With fast-moving data and real-time data ingestion, we need new ways to keep up with data quality, consistency, accuracy, and overall integrity. Data changes in numerous ways: its shape and form change, and so do its volume, variety, and velocity. As each data set transforms throughout its life cycle, we need to be able to accommodate that without burden and delay, while maintaining data performance, consistency, and trustworthiness.
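To make the schema evolution requirement concrete, here is a minimal Python sketch of a table whose schema can change without rewriting data that has already been stored. It is a toy model, not Iceberg's API: the `ToyTable` class and its methods are hypothetical names invented for illustration. The one idea it borrows is tracking each column by a stable ID rather than by name, which is how Iceberg approaches the problem.

```python
class ToyTable:
    """Toy in-memory table (hypothetical, for illustration only)."""

    def __init__(self, columns):
        # Current schema: stable field ID -> current column name.
        self.schema = {i + 1: name for i, name in enumerate(columns)}
        self._next_id = len(columns) + 1
        # Each "file" is a list of rows keyed by field ID; never rewritten.
        self.files = []

    def _id_of(self, name):
        # Look up the stable ID behind a column's current name.
        return next(fid for fid, n in self.schema.items() if n == name)

    def append(self, rows):
        # Store rows under field IDs so later renames do not touch them.
        self.files.append(
            [{self._id_of(k): v for k, v in row.items()} for row in rows]
        )

    def add_column(self, name):
        # Metadata-only change: older files simply lack this field ID.
        self.schema[self._next_id] = name
        self._next_id += 1

    def rename_column(self, old, new):
        # Metadata-only change: the ID stays, only the name moves.
        self.schema[self._id_of(old)] = new

    def scan(self):
        # Project every stored row onto the current schema;
        # columns missing from an older file read as None.
        return [
            {name: row.get(fid) for fid, name in self.schema.items()}
            for data_file in self.files
            for row in data_file
        ]


table = ToyTable(["id", "amount"])
table.append([{"id": 1, "amount": 9.5}])
table.rename_column("amount", "total")   # old file is left untouched
table.add_column("currency")             # old file is left untouched
table.append([{"id": 2, "total": 3.0, "currency": "EUR"}])
rows = table.scan()
```

After these steps, `rows` holds both records under the current column names, with `currency` set to `None` for the row written before that column existed. A real table format also has to handle type changes, column drops, and concurrent writers, all of which this sketch leaves out.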

Apache Iceberg, a top-level Apache project, is a next-generation, cloud-native table format built to take on the challenges of the modern data lakehouse. It is designed to be open and to scale to petabyte datasets. Today, Iceberg enjoys a large, active open source community with solid innovation investment and significant industry adoption. Cloudera has incorporated Apache Iceberg as a core element of the Cloudera Data Platform (CDP) and, as a result, is a highly active contributor.

Iceberg was born out of necessity to take on the challenges of modern analytics, and is particularly well suited to data born in the cloud.
