Power BI Datamart vs. Dataflow vs. Dataset

Datamart, Dataflow, and Dataset are all Power BI components that deal with data. I have presented on these components many times, and one of the questions I often get is: what is the difference between Dataflow, Dataset, and Datamart? So I thought I would explain it in an article. In this post, you will learn the differences between these three components, when and where to use each, and how they work together alongside the other components of Power BI.

Power BI Dataflow is the data transformation component of Power BI. It is a Power Query process that runs in the cloud, independently of any Power BI report or dataset, and stores the data in Azure Data Lake Storage (or Dataverse). Dataflow is not limited to Power BI; it can be created and used in other services, such as the Power Platform (Power Apps). Dataflow gives you the transformation engine of Power Query plus a storage option; in other words, a re-usable ETL (Extract, Transform, Load) component.
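To make that concrete, each table (entity) in a dataflow is defined by a Power Query (M) script. Below is a minimal sketch, assuming a hypothetical Azure SQL source; the server, database, table, and column names are placeholders. The same script could live in a PBIX file, but in a dataflow it runs in the cloud on a schedule, and its output lands in the dataflow's storage rather than inside a report.

```
// A minimal sketch of a dataflow entity in Power Query M.
// The server, database, table, and column names are hypothetical.
let
    // Connect to a source database
    Source = Sql.Database("myserver.database.windows.net", "SalesDb"),
    // Navigate to the table to extract
    Orders = Source{[Schema = "dbo", Item = "Orders"]}[Data],
    // Keep only the columns the model needs
    SelectedColumns = Table.SelectColumns(
        Orders,
        {"OrderID", "OrderDate", "CustomerID", "Amount"}
    ),
    // Enforce data types as part of the transformation
    TypedColumns = Table.TransformColumnTypes(
        SelectedColumns,
        {{"OrderDate", type date}, {"Amount", type number}}
    )
in
    TypedColumns
```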

To learn more about Dataflow, I recommend the article/video below:

Power BI Dataset is the object that contains the connection to the data source, the data tables, the data itself, the relationships between tables, and the DAX calculations. The Dataset is usually hidden in the Power BI Desktop view, but it can easily be seen in the Power BI Service. A Power BI Dataset is commonly used when sharing a model between multiple visualization objects (such as multiple Power BI reports, paginated reports, and even Excel reports). The ability to use a shared Dataset gives you a re-usable modeling layer in Power BI.

To learn more about Datasets and how they are used in Power BI, read the article below:

Power BI Datamart is a recently added component of the Power BI ecosystem. It is a combination of a Dataflow, an Azure SQL Database (acting like a data warehouse), and a Dataset, and it comes with a unified editor in the Power BI Service. In other words, Datamart is more like a container around other components of Power BI (Dataflow, Azure SQL Database, and Dataset).

To learn more about Datamart, read the article below:

Datamart, Dataflow, and Dataset are Power BI components that store and work with data. Now that you know the definitions, let's talk about the differences between the three components.

Dataflow decouples the Power Query logic and code from the Power BI file so that it can be used in multiple files. You get data from many different data sources, do the transformations using Power Query Online to get the data into the shape you want, set a scheduled refresh for it, and load it into one of the storage options (Azure Data Lake Storage or Dataverse).

Dataflow is the only component of the three that can also be created outside of Power BI. You do not need a business intelligence or data analysis requirement to use a Dataflow. Sometimes you may need a Dataflow just for data integration; for example, you may want to get data from some source systems, transform it, and store it in data storage for other applications to use.

Dataflows in Power BI are often used for data analysis purposes, but you can also create dataflows in the Power Platform under the Power Apps portal.

Using a shared Dataset, you can re-use the DAX calculations and relationships you created for one model in other Power BI files. If you want to re-use the DAX measures, calculated columns and tables, hierarchies, field-level formatting, and relationships defined in your model across multiple files, a shared Dataset does that for you. You can have multiple reports connecting to it and re-using the same data model.

A question that usually comes to mind is: what if you already have a data warehouse? Or, even if there is no data warehouse, what if you consider the storage of your Dataflow as a data warehouse? Can't you connect to that with DirectQuery? Isn't the Power BI Dataset an unnecessary layer on top of it? Why do you need a Power BI Dataset in those cases?

Power BI Dataset stores the data in an in-memory engine. The in-memory storage ensures the best performance in reports and visualizations, as the interactions in the report will be the fastest. It also brings a powerful calculation language, DAX, to help with analytical requirements and calculations. So even if you already have a data warehouse, I would still highly recommend using a Dataset on top of it.

Dataflow is the data transformation layer in your Power BI implementation. The terminology for this layer is ETL (Extract, Transform, Load): it extracts data from the data sources, transforms the data, and loads it into the Dataflow's storage (CDM folders in Azure Data Lake Storage, or Dataverse).

Dataset is the layer of all the calculations and modeling. It gets data from the Dataflow (or other sources) and builds an in-memory data model using the Power BI (Analysis Services) engine.

The result of a Dataflow is fed into a Dataset for further modeling; a Dataflow by itself is not a visualization-ready component.

Because the Dataset is an in-memory model built and ready for visualization, its result is usually used directly to build visualizations.

A Dataflow usually gets its data directly from the data source, unless you use a linked entity or a computed entity.
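For illustration, a computed entity is simply a query that references another entity in the same dataflow instead of the original source. Here is a minimal sketch in Power Query M, assuming the hypothetical Orders entity from the earlier example:

```
// A minimal sketch of a computed entity, assuming an "Orders" entity
// already exists in the dataflow. Entity and column names are hypothetical.
let
    // Reference another entity instead of the original data source
    Source = Orders,
    // Aggregate order amounts per customer
    TotalPerCustomer = Table.Group(
        Source,
        {"CustomerID"},
        {{"TotalAmount", each List.Sum([Amount]), type number}}
    )
in
    TotalPerCustomer
```

Note that linked and computed entities are Premium features of Power BI dataflows.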

Although a Dataset can get data directly from a data source, it is a best practice for a shared Dataset to get its data from Dataflows. This supports a multi-developer implementation of Power BI.
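For reference, this is roughly what the connection looks like from the Dataset side, using the Power Platform Dataflows connector in Power Query. The workspace, dataflow, and entity names below are hypothetical, and the navigation record fields are an approximation of what the connector generates for you, so treat this as a sketch rather than exact code.

```
// A rough sketch of a Dataset table sourced from a dataflow entity.
// Workspace, dataflow, and entity names are hypothetical, and the
// navigation record fields approximate the connector-generated code.
let
    Source = PowerPlatform.Dataflows(null),
    Workspaces = Source{[Id = "Workspaces"]}[Data],
    SalesWorkspace = Workspaces{[workspaceName = "Sales"]}[Data],
    SalesDataflow = SalesWorkspace{[dataflowName = "Sales Dataflow"]}[Data],
    Orders = SalesDataflow{[entity = "Orders", version = ""]}[Data]
in
    Orders
```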

One of the reasons to use Dataflows and shared Datasets is to decouple the layers, so that multiple developers can build the Power BI solution at the same time. In such an environment, the skillset needed for a Dataflow developer is all about Power Query and how to build a star schema. No DAX or visualization skills are required for a Dataflow developer.

On the other hand, the Dataset developer needs to know everything about relationships and calculations in Power BI using DAX. Although the Dataset developer may also know Power Query and visualization, those are not their primary skills.

The result of a Dataflow is intended for data modelers. It is not a great approach to give the output of a Dataflow to report visualizers, because that output still has to be loaded into a model, with proper relationships and calculations added to it.

The result of a Dataset, on the other hand, is ready for report visualizers. They can simply make a live connection to the shared Dataset and build their visualizations from it.

Using Dataflows, you reduce the need to copy and paste your Power Query scripts into other files; you can re-use a table across multiple files.

Using a shared dataset, you can have multiple reports using the same calculations and data model without duplicating the code.

Dataflows and Datasets are not replacements for each other.

They are two essential components of Power BI and have their own places in the Power BI architecture for a multi-developer scenario.

You have learned a lot about the difference between Dataflow and Dataset. But what about the Datamart? Shouldn't that be in the comparison list?

Power BI Datamart is more like a container of components than a single object itself. When you create a Power BI Datamart, you are creating a Dataflow, an Azure SQL Database, and a Dataset. This means that a Datamart already has all the benefits mentioned above for the Dataflow and the Dataset, plus an extra component: the Azure SQL Database. After being processed by the Dataflow, the data is loaded into the Azure SQL Database. Some call this a data warehouse, and some may even call it a data mart; either way, Power BI Datamart includes all three components: Dataflow, Azure SQL Database, and Dataset.

Power BI Datamart also brings a unified web UI to build and manage all these components. With the appearance of the Datamart, the multi-developer architecture of Power BI looks like the diagram below:

Now the question is: would the Datamart replace the Dataset and the Dataflow? Let's find out.

Would Datamart replace Dataflow? No, certainly not. Dataflow is a component by itself. As mentioned earlier in this article, you can build and use a Dataflow without needing a BI solution, whereas a Datamart is normally useful when you are building a BI solution. If you just want to integrate some tables into storage and re-use them in other applications, a Dataflow by itself can be used.

Another use case for a Dataflow by itself is that, even if you have a Datamart, you may still want to implement multiple layers of Dataflows for staging and transformation. Multiple layers of Dataflows are an effective technique for data source isolation and for re-using Dataflow components; a rough sketch follows, and I have explained the technique in the article below:
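As a hedged sketch of this layering in Power Query M (all server, table, and column names are hypothetical): a staging dataflow extracts the source table as-is, and a transformation dataflow references that staging entity and shapes it. If the source system changes, only the staging layer needs to be updated.

```
// Staging dataflow -- query "StgCustomers": extract the source table as-is.
// Server, database, and table names are hypothetical.
let
    Source = Sql.Database("erp-server.database.windows.net", "ERP"),
    Customers = Source{[Schema = "dbo", Item = "Customers"]}[Data]
in
    Customers

// Transformation dataflow -- query "DimCustomer": reference the staging
// entity (a linked entity) and shape it into a dimension table.
let
    Source = StgCustomers,
    Renamed = Table.RenameColumns(Source, {{"Cust_Key", "CustomerKey"}}),
    Cleaned = Table.SelectRows(Renamed, each [CustomerKey] <> null)
in
    Cleaned
```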

Would Datamart replace the Dataset? The answer to this question is also no. Although it is now easier to create a Dataset from the unified UI of the Datamart, that does not remove the need for the Dataset as a separate component of its own. There are plenty of use cases for a standalone Dataset. Imagine you are implementing an architecture in which the data transformation is done using another service (such as Azure Data Factory) and the data warehouse is in Azure Synapse. You can still use a Power BI Dataset to build the data model and calculations on top of that, without building a full Datamart.

Another use case is that, even if you use a Power BI Datamart, you may still create chained Datasets on top of your existing Dataset. These chained Datasets use DirectQuery to a Power BI Dataset (which in this case is part of a Datamart) but can have other data sources combined in them. Chained Datasets are a very useful method for working with self-service data analysts in an enterprise organization.

Now it is time for the million-dollar question: which of these components should you use? Each of them comes with some benefits, as you saw above. Let's answer that through a scenario.

James is a BI developer who is building a new BI solution. His implementation includes stages such as getting data from the source, doing the transformation, loading it into a data warehouse, writing calculations in the model, and visualization. Power BI Datamart enables him to build most of that in one unified structure, while keeping a multi-layered approach that can be easily maintained in the future.

After some time, James realizes that the data transformation side of his implementation is heavy. He wants the data transformation to be isolated from the data source so that, if the source changes, his solution still works with minimal changes. So he adds more Dataflows, as staging and transformation Dataflows, to his architecture.

James has some other colleagues who are business analysts in other departments. They want to use the Dataset that James provided as their main source, but add more data sources and some calculations of their own. They create chained Datasets on top of it.

The above scenario uses all three of these components in one architecture. Choosing which components you need will mainly depend on what re-usable components you want and where you want to use them.
