In my previous article, How To Automate Your Statistical Data Analysis, I proposed a Python script for automating statistical data analysis using the headless BI layer. Since then, I have been looking for data analysis and machine learning tools. Recently I came across MindsDB, and I was surprised by how powerful it can be. MindsDB is an open-source Machine Learning (ML) SQL server enabling users to set up machine learning workflows using SQL for the most popular databases and data warehouses.
Traditionally, MindsDB connects to databases or data warehouses, creates models, and then the model is consumed by, for example, Business Intelligence (BI) tools. There are several reasons why this approach might not be efficient. The main reason is that modern BI tools define a metric layer that provides them with reusability and work efficiency. Metrics can also be used for ML models, providing them with the same properties — reusability and work efficiency. A strong argument for using a metric layer in ML models is that ML and BI work on the same metrics (e.g., the exact definition of revenue) with minimal effort.
Modern BI tools do not want to keep the metric layer to themselves, they want to expose metrics and their usage to other data consumers through the previously mentioned so-called headless BI layer. In this article, I want to share with you this modern approach using the integration of MindsDB with a modern BI tool supporting the metric layer and a headless BI layer — GoodData.
In the integration, I would like to show how I create a model from the insight created in the analytical tool, whose predictions I display back in the analytical tool so end users can see the past, the present, and the future in a single insight.
In this chapter, I will give an overview of the integration. Let’s imagine a situation where we have a Postgres database with e-shop data. We will model a base insight and use appropriate tools with metric support to make our work easier (semantic model). Having the insight we would like to see the forecast for the insight. Thanks to our wisely chosen tool, we can use a headless BI layer to access the insight, make a forecast, and then create a new insight with data from the base insight and the forecast. Thanks to this flow, we see the desired output, and it is worth mentioning that we are providing data scientists with the same data with the same metrics as data analysts — consistency is preserved thanks to metrics and a headless BI layer. The work of data scientists and data analysts is more efficient and consistent, as they both work on the same data.
Both tools, MindsDB and GoodData, expose docker images, which makes them very developer-friendly. All we need to do to model the situation described above is to define the docker-compose file with services we want to use, and we are ready to explore them. I created a docker-compose file for this article, and all materials are available in the GitHub repository. Feel free to use it, and modify it as you wish.
The docker-compose file starts with three services; gooddata-cn-ce, gooddata-fdw, and mindsdb. Let us begin with the BI tool running in the gooddata-cn-ce (GoodData.CN Community Edition) service. I prepared a script that sets up the environment in the gooddata-cn-ce service.
The environment contains a GoodData.CN workspace named demo. The workspace is an environment for organizing, analyzing, and presenting data. Inside the demo workspace, we can find a connected Postgres data source, a logical data model (LDM), two metrics, and insight. Let us discuss the terms I mentioned above. For those familiar with database modeling, think of LDM as a conceptual model used by the metric layer to define metrics without further knowledge about database structure.
There are two metrics prepared in the environment — `order amount` and `revenue`. Metrics are created using MAQL syntax, similar to SQL syntax but with several modifications to simplify it. As you can see in the following figures, the metric `order amount` is used within the `revenue`, simplifying metrics usage even more and demonstrating how metrics can be reused and nested inside each other.
The important thing to note is that metrics define only aggregation without context (alternative name for context is dimensionality). The metrics may be used when creating insights using a simple drag-and-drop UI design. Another great thing is that the metrics will adapt to the context in which you are using them. For this article, I have prepared an insight — revenue per month, segmented by region.
The insight is the output of GoodData.CN that we will take into account during the integration.
To get familiar with GoodData, I recommend checking out GoodData University, GoodData Documentation, and GoodData Community Slack.
As we mentioned at the beginning of the article, we want to utilize the headless BI layer to enable integration with MindsDB. In our case, we want to connect to it using the JDBC protocol. That is possible through the gooddata-fdw (GoodData Foreign Data Wrapper, or just GoodData FDW for short) service. The service is a Postgres database with a Python package extension. GoodData FDW is a SQL gateway to GoodData, which uses the technology of Postgres Foreign Data Wrapper. We will connect to GoodData using our favorite database manager.
We want to enable all components to be able to integrate using SQL through a standard JDBC protocol, to make metrics and insights readily accessible. Disabling access to data for other components or consumers is counterproductive. The real magic behind GoodData FDW is that insights are being represented as foreign tables in Postgres. By selecting a foreign table, an insight is executed in GoodData in the background.
To set up the GoodData FDW:
When the headless BI layer is set up, we can proceed to MindsDB. I mention the basic functionalities of MindsDB in this article. To learn more about MindsDB, check out the MindsDB Documentation or reach out on the MindsDB Community Slack.
First, we connect to GoodData FDW, which we treat as a Postgres database. After the database is connected, we are ready to use MindsDB’s full potential. Let us predict revenue development from the insight we defined in GoodData. The insight from the GoodData is represented as a table in the Postgres database.
Let us train a model that will predict revenue values in the future in month-time granularity for each region. We trigger the training of a model using the command above. When the model is successfully trained, we can use our model for predictions. The query below will show the first four revenue predictions for the South region.
The output of MindsDB is a model created from a foreign table from headless BI. The foreign table is an abstraction of an insight we looked at earlier in the GoodData.CN environment.
Let us embed predictions in the GoodData. Let’s store results in the original data source, so we can combine raw data with predictions in the most flexible way — using SQL. When the table is created, we will clean the data in the database manager and create a view for convenience. Then we will be able to access our predictions.
We can reuse GoodData.CN to consume the predictions we have just created, as we can see in the figure below.
The past revenue is visualized in red, and the predicted revenue is visualized in green. The blue connection between past and predicted revenue is only for convenience, and it is created using the metric listed below.
By consuming the model’s data, we are closing a cycle, which started from insight and resulted in insight enriched with predictions. This insight can then be integrated into your web applications using GoodData UI React library, as well as by other clients (e.g. other BI tools) through the foreign tables.
The integration use case described in this article presents a new approach to the cooperation of ML and BI tools through a headless BI layer. The concept of a cycle (insight -> ML model -> insight) makes sense to me, as the next evolutionary step in data analysis.
Thanks to the adaptability of MindsDB and GoodData, we can integrate both tools without much effort, and we can expose data to other data consumers.
Besides the self-hosted option using docker image, MindsDB and GoodData provide helm chart deployment and a hosted version (MindsDB Cloud and GoodData Cloud respectively), which takes away one more excuse as to why these tools should not be part of the modern data stack.
For future work, I believe it would be beneficial to further improve this integration and explore options for integration into CICD. The CICD option could relate to both components independently — in the case of GoodData, for example, preserving integration of insights and dashboards after new changes, and MindsDB to trigger retraining a model on new data. If you are interested in CICD with GoodData, I would like to redirect you to the following articles How To Automate Data Analytics Using CI/CD and How To Build a Modern Data Pipeline.
I would like to hear your opinion about this approach. Do you think ML and BI tools should interact through a headless BI layer? Does it make sense to train a model on top of the result of another training? Would our approach help here? Do you have other advanced ML/AI use cases which could benefit from our approach? Please, let me know what you think in the comments.
I am looking forward to hearing your point of view on this topic.