
How we built a tool for validating big data workflows


Klarna’s Checkout product is integrated with 130,000+ online merchants. Making credit and fraud assessments for every single purchase in real time requires tonnes of high-quality data available at a moment’s notice. A group of data transformation workflows fetches the required data for each assessment.

Even though risk and fraud assessment workflows might have the highest name recognition at Klarna, they are by no means the only workflows executed on a regular basis. There are currently about 300 workflows executed per day, each producing data that is important for Klarna’s day-to-day business operations.

Everything would be fine if those workflows were consistent and never changed, but this is not the case. Workflows often need to be improved and updated to keep up with evolving business needs and the rapid growth of data. Some examples of these updates include:

For any effort to improve or update a workflow, there is a hard requirement to preserve the quality of the output data. Hence, we need data validation to ensure that this is the case.

Is that a simple task? Not really: in general, validating big data workflows is a bit like looking for a needle in a haystack.

Preserving data quality after developing a new version of a workflow is critical. Manual data validation can be time-consuming, error-prone, and tedious, so it might seem like a good idea to either skip any meaningful validation and hope for the best, or to abort the update. Both alternatives have negative impacts: they lead either to producing data of poor quality or to restricting attempts to improve the organization’s methods.

As a result, much-needed workflow updates and improvements might end up being postponed or canceled altogether after a short-term cost-benefit analysis, which over time produces slower workflows with degraded data quality. All of this is a waste of manpower and computing resources, so we decided to do something about it.

Below is a short description of Diftong, a big data validation tool that provides an automated and more agile way of doing large-scale data validation by comparing the output of a workflow before an update with the output of its post-update equivalent.

Diftong automatically computes overall statistics of the differences between two tabular databases that have the same schema. The validation process in Diftong can be described in three main steps:

1. Deduplication: count how many times each row appears in the two databases.
2. Row validation: compute row-based statistics of the differences between the databases.
3. Column validation: for tables that contain differences, compute column-based statistics.

To explain the steps in a clearer way, we will go through a running example that consists of two databases with identical schemas that are to be validated; let’s call them DB1 and DB2. Each database contains a table named “Users” with four columns (id, name, salary, and birthday). A sample of the data is shown below. Again, this is just a simple example to explain how the tool works, while the real use cases involve much larger databases with millions of records.
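To make the later snippets concrete, the following sketch sets up the two “Users” tables of the running example as Spark DataFrames. The schema follows the example above (id, name, salary, birthday); the storage format and paths are placeholder assumptions, and none of this is part of Diftong itself.

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, IntegerType, StringType, DoubleType, DateType
)

spark = SparkSession.builder.appName("diftong-running-example").getOrCreate()

# Schema of the "Users" table from the running example.
users_schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("salary", DoubleType()),
    StructField("birthday", DateType()),
])

# DB1 and DB2 are two databases with identical schemas; here their "Users"
# tables are read from hypothetical storage locations.
db1_users = spark.read.schema(users_schema).parquet("/path/to/DB1/users")
db2_users = spark.read.schema(users_schema).parquet("/path/to/DB2/users")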

As duplicate rows (rows with matching values in all columns) may be present in big data, we need to account for them. So each row in both databases (DB1 and DB2) is counted in this step, and the results are stored in two new databases (DDB1 and DDB2). Each new database contains the same tables and data as the original one, but with one additional column that records the number of times each row appears in a table, to be used in later stages of the validation process.

Back to our running example, the deduplication step detected that the row with id=2 is duplicated in the second database as shown below.
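A minimal sketch of this counting step, continuing from the DataFrames above, could look as follows in PySpark. The function name and the name of the extra column are illustrative choices, not Diftong’s actual implementation.

from pyspark.sql import DataFrame

def add_row_counts(table: DataFrame, count_col: str = "row_count") -> DataFrame:
    # Group on every column (full-row equality), so exact duplicates collapse
    # into a single row together with the number of times they appeared.
    return (
        table
        .groupBy(*table.columns)
        .count()                               # Spark names this column "count"
        .withColumnRenamed("count", count_col)
    )

# The deduplicated tables of the example (DDB1 and DDB2) would then be:
# ddb1_users = add_row_counts(db1_users)
# ddb2_users = add_row_counts(db2_users)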

This step gives an overview of the total number of changes in each table by calculating the differences between all rows. Detecting differences and similarities at the row level gives a clear idea of whether any changes have occurred. If they have, a deeper level of statistics, based on each column in that table, is calculated in the next step.

The generated statistics in this step are:

Looking at the data in the “Users” table in DDB1 and DDB2, we can see that five rows out of seven contain differences in the values.

Hence, the generated statistics from this step reflect those changes, and we can see in the following table that there is a 71.4% difference in the data (5 of the 7 rows differ).
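One way to derive such row-level numbers is to compare the deduplicated tables row by row, including the duplicate-count column, and report how large a share of the rows has no exact match on the other side. The sketch below illustrates that idea in PySpark; it is an approximation for this post, not a description of Diftong’s internal queries, and the metric names are made up.

from pyspark.sql import DataFrame

def row_difference_stats(ddb1_table: DataFrame, ddb2_table: DataFrame) -> dict:
    # Rows of one table that have no exact match (all columns, including the
    # duplicate count) in the other table, and vice versa.
    only_in_first = ddb1_table.exceptAll(ddb2_table)
    only_in_second = ddb2_table.exceptAll(ddb1_table)

    total_first = ddb1_table.count()
    total_second = ddb2_table.count()
    diff_first = only_in_first.count()
    diff_second = only_in_second.count()

    return {
        "rows_in_first": total_first,
        "rows_in_second": total_second,
        "rows_only_in_first": diff_first,
        "rows_only_in_second": diff_second,
        # For example, 5 differing rows out of 7 gives roughly 71.4%.
        "pct_diff_first": 100.0 * diff_first / total_first if total_first else 0.0,
        "pct_diff_second": 100.0 * diff_second / total_second if total_second else 0.0,
    }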

If differences were detected in the previous step, a validation based on the values in each column is needed to get a more detailed view of the changes in the data. In addition, this step gives a better understanding of the distribution of the data by measuring how the differences are spread. In other words, starting from the results of the row validation step, column-based statistics are calculated for the tables that contain differences.

Looking at the field-level differences displayed in our example below, we can see that:

Note that extra rows, such as the record with id=4, are not taken into consideration.

Since there is always one difference in each column, the generated statistics (Max, Min, Q1, Q2, and Q3) end up having the same value in this example.
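As a rough sketch of how per-column statistics of this kind could be derived, one option is to compare the value frequencies of each column between the two tables, count how many values disagree, and then summarise those per-column counts with Min, Max, and the quartiles. The PySpark code below illustrates that approach; it is only an illustration of the idea, not Diftong’s actual queries.

import statistics

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def column_difference_stats(ddb1_table: DataFrame, ddb2_table: DataFrame,
                            count_col: str = "row_count") -> dict:
    data_columns = [c for c in ddb1_table.columns if c != count_col]
    per_column_diffs = {}

    for column in data_columns:
        # Frequency of each value in this column, weighted by the duplicate count.
        freq1 = ddb1_table.groupBy(column).agg(F.sum(count_col).alias("freq1"))
        freq2 = ddb2_table.groupBy(column).agg(F.sum(count_col).alias("freq2"))

        # A value counts as a difference if it is missing on one side or its
        # frequency does not match.
        joined = freq1.join(freq2, on=column, how="full_outer")
        per_column_diffs[column] = joined.where(
            F.col("freq1").isNull()
            | F.col("freq2").isNull()
            | (F.col("freq1") != F.col("freq2"))
        ).count()

    counts = sorted(per_column_diffs.values())
    # Summarise the per-column difference counts with Min, quartiles, and Max.
    q1, q2, q3 = statistics.quantiles(counts, n=4) if len(counts) > 1 else (counts[0],) * 3
    return {
        "per_column_differences": per_column_diffs,
        "min": counts[0], "q1": q1, "q2": q2, "q3": q3, "max": counts[-1],
    }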

Using Diftong in real use cases related to Klarna’s core business has shown that the ability to identify differences between large-scale databases helps in managing data transformation workflows.

In particular, Diftong has helped Klarna validate optimizations of workflows. It has added great value to the company, as the automatic comparison of databases enables a more agile approach to workflow improvements and updates. More specifically, such comparisons:

This study is published in the Journal of Big Data, where the full article with further examples and test cases can be found.

This post was co-written by Johan Petrini. Many thanks to our Klarna colleagues in the Data Domain, especially Erik Zeitler, and to Uppsala University in Sweden for their support.
