Many of you who have seen my LinkedIn profile may be wondering what my profile headline means. For the longest time, the Data Warehouse has been the gold standard for integrating data from disparate systems into a single database to analyze and derive business insights. After working in the data analytics space for over 20 years, I have never been more excited about a new technology and architecture: the Lakehouse, which I truly believe is a material advancement in the way we manage and leverage data.
For those of you who are not familiar with the Lakehouse, here is a simplified definition: a Lakehouse combines the best features of a Data Lake and a Data Warehouse. A Data Lake provides highly scalable, low-cost storage for structured and unstructured data, while a Data Warehouse provides high-performance queries on structured data. A Lakehouse provides both.
Below I give four reasons why I believe the Lakehouse is what we should be building going forward, rather than a Data Warehouse.
Most enterprises currently have three types of data storage systems.
a) Application Databases — Transactional systems which capture data from all operations in the enterprise, e.g. HR, Finance, CRM, and Sales.
b) Data Lakes — These are catch-all cloud storage systems which store structured and unstructured data like application data backups, logs, web-click-streams, pictures, videos etc.
c) Data Warehouses — Integrated, cleansed data organized in a way to enhance query performance so that we can run reports and dashboards quickly.
When building data pipelines, most data engineering teams move data from Application Databases into Data Lakes, and then move a subset of that data into a Data Warehouse for reporting purposes.
With the Lakehouse architecture, the Data Lake and Data Warehouse are combined into a single Lakehouse, so your data moves across just two types of systems. A Lakehouse can also ingest both streaming and batch data into the same data structure, which removes the extra step of consolidating the two. The result is a more streamlined data pipeline with fewer hops and faster time to value: in most cases it takes less than five minutes to get your data from the point it was generated to the point where it can be reported on in a dashboard.
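To make the batch-plus-streaming point concrete, here is a minimal PySpark sketch of both kinds of ingestion landing in the same table. It assumes the open-source Delta Lake and Kafka connector packages are available on the cluster; the bucket, paths, and topic name are hypothetical.

```python
from pyspark.sql import SparkSession

# Spark session with the open-source Delta Lake extensions enabled.
spark = (
    SparkSession.builder
    .appName("lakehouse-ingest")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

table_path = "s3://my-bucket/lakehouse/events"  # hypothetical table location

# Batch ingestion: append a daily extract into the table.
batch_df = spark.read.json("s3://my-bucket/raw/events/2023-01-01/")
batch_df.write.format("delta").mode("append").save(table_path)

# Streaming ingestion: continuously append click-stream events into the SAME table.
stream_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "clicks")
    .load()
)
(
    stream_df.writeStream.format("delta")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/events")
    .outputMode("append")
    .start(table_path)
)
```

Because both writers target the same table, downstream reports and models see one consolidated data structure instead of separate batch and streaming copies.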
As data teams mature, enterprises are no longer satisfied with building only traditional business intelligence (BI) reports; many now have in-house data scientists who use artificial intelligence and machine learning (AI/ML) algorithms to discover hidden patterns in their data and predict the future. Most of these data scientists rely on Python to build their AI/ML models and need large amounts of input data to build reliable models. This has driven the growth of Data Lakes, which typically hold the data these data scientists need (as shown in Fig 1). But when an enterprise's data is spread across Data Lakes and Data Warehouses, data scientists have to hunt for their data in two different systems, bring it together on a Data Lake, and only then start building their models. In addition, Data Warehouses typically don't support Python and offer only a SQL interface, making them difficult for data scientists to use as a data source.
By integrating Data Warehouses and Data Lakes into a single Lakehouse that supports both SQL and Python, data scientists can query a single system (the Lakehouse), which acts as the single source of truth for enterprise-wide data. This simplifies and speeds up building both BI reports and AI/ML models.
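As a rough sketch of what "one system, two audiences" looks like in practice, the snippet below reads a single Lakehouse table and serves both a BI-style SQL aggregate and a pandas DataFrame handed off to model training. The table path and column names are hypothetical, and the Spark session is assumed to have the same Delta Lake configuration as the earlier sketch.

```python
from pyspark.sql import SparkSession

# Assumes a Delta-enabled Spark session as configured in the earlier sketch.
spark = SparkSession.builder.appName("lakehouse-analytics").getOrCreate()

# One table, two consumers.
sales = spark.read.format("delta").load("s3://my-bucket/lakehouse/sales")
sales.createOrReplaceTempView("sales")

# BI-style aggregate in SQL, e.g. feeding a dashboard.
monthly_revenue = spark.sql("""
    SELECT date_trunc('month', order_date) AS month, SUM(amount) AS revenue
    FROM sales
    GROUP BY 1
    ORDER BY 1
""")
monthly_revenue.show()

# The same table pulled into pandas as training input for an ML model.
training_df = sales.select("customer_id", "amount", "order_date").toPandas()
```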
So what makes a Lakehouse superior to a plain old Data Lake? It all boils down to the way the data is stored.
a) Format: Instead of storing data in plain human-readable formats such as CSV or JSON, a Lakehouse stores it in the compressed, columnar Parquet format, which computers can read and write much faster. Parquet is also an open-source format, so your data is not locked away in a proprietary format: you can use Python to access it from your laptop without paying a vendor to reach your own data (see the sketch after this list).
b) Logs: In addition to storing the data as Parquet files, a Lakehouse keeps track of the metadata of those files and maintains a log of all the operations executed on them, which allows the data inside the files to be organized and managed far more effectively (also illustrated in the sketch after this list). A full description of how these log files help is beyond the scope of this article, but suffice it to say that the log/metadata files are the secret sauce (well, not so secret, since they are open source) that makes managing data in a Lakehouse that much easier and more efficient.
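Here is a minimal Python sketch of both points, run from a laptop. The paths are hypothetical, reading Parquet with pandas requires pyarrow or fastparquet, and the log layout shown is the one used by open-source Delta Lake tables.

```python
import glob
import json

import pandas as pd

# (a) Parquet is an open format: any Python process can read the data files
# directly, with no warehouse engine or vendor license in the way.
df = pd.read_parquet("/data/lakehouse/events/part-00000.parquet")
print(df.head())

# (b) In a Delta Lake table, the transaction log is a folder of
# newline-delimited JSON commit files; each line is an action such as
# "metaData", "add" (a new parquet file), "remove", or "commitInfo".
for commit_file in sorted(glob.glob("/data/lakehouse/events/_delta_log/*.json")):
    with open(commit_file) as f:
        for line in f:
            action = json.loads(line)
            print(commit_file, list(action.keys()))
```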
Currently there are three major open-source projects that help you build a Lakehouse: Delta Lake, Apache Iceberg, and Apache Hudi. Each of these formats is supported by major companies as well as individual community developers.
The fourth and final point is the cost of storing and querying the data. Because the data sits in cheap cloud storage such as AWS S3, Azure ADLS, or GCP GCS, storing a TB of data for a month usually costs in the range of $20-$50, depending on your disaster recovery and high availability requirements. The other cost you incur is compute, typically an Apache Spark cluster, which runs around $2/hr for a small cluster. For smaller workloads you don't even need a Spark cluster; you can query the data from your laptop or a cloud virtual machine using Python. Unlike Data Warehouses, which are typically always on even when you are not using them, in a Lakehouse architecture you turn on the cluster only when you need to query the data. Newer cloud-based Data Warehouses do provide on-demand compute, but their price points are usually higher than those of Spark clusters, and they don't let you query your data from your laptop without a cluster up and running.
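For the cluster-less case, a sketch of what laptop-only querying can look like is below: a single Python process scanning Lakehouse Parquet files straight out of object storage. The bucket, path, and column names are hypothetical, and reading from S3 with pandas assumes the s3fs package is installed and credentials are configured.

```python
import pandas as pd

# Scan the table's parquet files directly from object storage; no warehouse
# or Spark cluster needs to be running for this query.
events = pd.read_parquet("s3://my-bucket/lakehouse/events/")

# A small ad-hoc aggregate, paid for only while this script runs.
daily_counts = events.groupby("event_date").size()
print(daily_counts.tail())
```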
A Lakehouse helps you simplify data processing and democratize the use of data across your organization at the lowest possible cost. This is a game changer for enterprises, small and large, that are falling behind in the pursuit of leveraging data for competitive advantage. So say goodbye to Data Warehouses and hello to the Lakehouse; you will never look back, and you will be a superhero to all your data users.