
Helpful strategies for improving data quality in data lakes

Ingesting large volumes of disparate data can yield a rich source of information — but it's also a recipe for data chaos. Use these tips to improve data quality as your data lake grows.

For as long as there’s been data, enterprises have tried to store it and make it useful. Unfortunately, the way enterprises store data often does little to make it useful. Yes, I’m talking about data lakes.

The promise of data lakes is clear: A central place for an enterprise to push its data. In some ways, data lakes could be seen as the next generation of data warehouses. Unlike the warehouse, however, data lakes allow companies to dump data into the lake without cleansing and preparing it beforehand.

This approach simply delays the inevitable need to make sense of that data. However, properly applied data quality initiatives can simplify and standardize the way data lakes are used. In this guide, learn useful ways to make all that data accessible to the business analysts, data scientists and others in your company who get paid to make sense of it.

A data lake is a central repository for storing data of any source or nature: structured, semi-structured or unstructured. Unlike a hierarchical data warehouse that keeps data in files and folders, a data lake stores data in a flat architecture built on object storage, with each object tagged with metadata for easier, faster retrieval.

SEE: 4 steps to purging big data from unstructured data lakes (TechRepublic)

Unlike a data warehouse, which requires incoming data to be stored in a common schema to allow for easier processing, data lakes allow enterprises to store data in its raw format. Data warehouses tend to store data in relational formats, pulling structured data from line-of-business applications and transactional systems. They allow for fast SQL queries but tend to be expensive and proprietary.

Data warehouses are also often misused, as Decodable CEO Eric Sammer has argued, putting expensive, slow batch-oriented ETL processes between applications to move data. Data lakes, by contrast, tend to store data in open formats and allow for a broader range of analytical queries.

That is, if you can first make sense of the data.

This is the first and most pressing problem of data lakes: Learning how to make sense of that wildly disparate data.

In an interview, David Meyer, SVP of Product Management at Databricks, a leading provider of data lake and data warehouse solutions, described data lakes as “great in a lot of ways” because “you can stuff all your data in them.”

The problem, however, is that “they don’t have a lot of characteristics that you’d want to do data [analytics] and AI at scale.” He went on to say that “they weren’t transactional or ACID compliant. They weren’t fast.”

Databricks has fixed many of those problems by layering things like governance capabilities on top and then open sourcing them. As an example, they developed the Delta Lake format, for which Google Cloud recently announced support. The Delta Lake format essentially turns a data lake into a warehouse.

Though they don’t suffer from the same problems as data warehouses, data lakes can be expensive to implement and maintain — in part because even skilled practitioners may find it difficult to manage them.

The lack of structure may seem liberating when data is being ingested, but it can be burdensome when an enterprise hopes to make sense of the data. Absent something like the Databricks governance overlay, data lakes are often plagued by poor governance and security.

Even so, there’s enough promise in data lakes that enterprises will continue to invest in them for their data management needs. So how can enterprises use data lakes wisely?

One answer to the traditional data lake is to turn it into something else. Databricks first came up with the idea of a “data lakehouse,” bringing together the best of data lakes and data warehouses by adding a transactional storage layer on top of the data lake.

This means, as Meyer has described, “you don’t have to copy data. You can leave the data where it is.” The data stays in the lake, but if it’s stored in the open source storage framework of Delta Lake, you can apply data warehousing tools from Databricks, Google’s BigQuery or any other vendor that supports the format in order to improve data quality.
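
To make the lakehouse idea more concrete, here is a minimal PySpark sketch of storing lake data in the open Delta Lake format and querying it in place. The bucket paths and the example query are placeholders, and it assumes Spark with the delta-spark package available; treat it as an illustration of the pattern rather than a production recipe.

```python
# Minimal lakehouse sketch: keep lake data in the open Delta Lake format so
# warehouse-style (ACID, transactional) operations work on it without copying.
# Assumes Spark plus the delta-spark package; paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Raw events already sitting in the lake (hypothetical path and schema).
raw = spark.read.json("s3://example-lake/raw/events/")

# Write them as a Delta table: same object storage underneath, but now with
# ACID transactions, schema enforcement and time travel.
raw.write.format("delta").mode("overwrite").save("s3://example-lake/delta/events/")

# Any engine that supports the Delta format can now query or update the table
# where it sits, rather than copying it into a separate warehouse.
events = spark.read.format("delta").load("s3://example-lake/delta/events/")
events.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) AS event_count FROM events").show()
```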

As I’ve written before, there are several approaches to improving data quality that generally apply to data lakes. As tempting as it can be to dump data into a lake without concern for schema, a smarter approach is to apply some thought beforehand. Many companies are now completing extensive data cleansing and preparation projects prior to adding their data to data lake environments.

You probably don’t want to undertake the burden of rebuilding databases after the fact. To keep up with your competitors, think ahead and standardize data formats when data is being ingested; this step can remove a great deal of the pain associated with data preparation.
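
As one way to picture that up-front standardization, the sketch below applies an agreed schema while ingesting raw JSON rather than letting each producer invent its own shape. The field names and paths are hypothetical assumptions; the point is that schema problems surface at ingestion rather than at analysis time.

```python
# Sketch: enforce a standard schema at ingestion time instead of dumping raw
# files as-is. Field names and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("ingest-with-schema").getOrCreate()

# The schema agreed with the teams producing this data.
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("ordered_at", TimestampType()),
])

# FAILFAST stops the job on records that can't be parsed into the agreed
# schema, so problems surface at ingestion rather than months later.
orders = (
    spark.read
    .schema(order_schema)
    .option("mode", "FAILFAST")
    .json("s3://example-lake/incoming/orders/")
)

# Land the standardized data in a common columnar format.
orders.write.mode("append").parquet("s3://example-lake/standardized/orders/")
```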

That’s right: Despite the promise of unfettered data lake freedom, you actually are going to want to implement strong data governance policies and practices to ensure your data lake doesn’t become a data swamp. Data governance dictates how an organization manages its data throughout the data’s lifecycle, from acquisition to disposal, as well as the different modes of usage in between.

Though data governance involves tooling, it’s much more than that: It also involves the processes people must follow to ensure the security, availability and integrity of data.

Implied in this is the reality that data quality is more a matter of process than tooling. These processes include defining “good enough” standards for data quality and making it a recurring agenda item when the data governance board meets.
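
Here is a rough sketch of what a “good enough” standard might look like in code: a handful of threshold-based checks the governance board agrees on and runs on a schedule. The thresholds, column names and table path are assumptions for illustration, not a standard implementation.

```python
# Sketch: codify agreed "good enough" thresholds as automated checks.
# Thresholds, columns and the table path are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("quality-checks").getOrCreate()
orders = spark.read.parquet("s3://example-lake/standardized/orders/")

total = orders.count()
null_ids = orders.filter(F.col("order_id").isNull()).count()
duplicate_ids = total - orders.select("order_id").distinct().count()

# Each check: (observed value, agreed maximum).
checks = {
    "null_order_id_rate": (null_ids / total if total else 0.0, 0.001),       # <= 0.1%
    "duplicate_order_id_rate": (duplicate_ids / total if total else 0.0, 0.01),  # <= 1%
}

failures = {name: (value, limit) for name, (value, limit) in checks.items() if value > limit}
if failures:
    # In practice this would alert the data governance board rather than just raise.
    raise ValueError(f"Data quality thresholds exceeded: {failures}")

print("All data quality checks passed:", {name: value for name, (value, _) in checks.items()})
```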

SEE: Data governance checklist for your organization (TechRepublic Premium)

Such processes help to ensure that employees can trust the data they’re using to fuel an array of operational use cases, especially AI/ML operations. As AI and ML take on greater prominence and more use cases in the enterprise, data consistency, integrity and overall quality only grow in business value.

On a related note, you probably don’t want to retroactively seek out and sanitize data containing private information after it’s already in the data lake. It’s smart to pseudonymize personally identifiable information before or as it enters the data lake. Taking this approach allows you to meet GDPR regulations and store the data indefinitely.
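
As a rough sketch of pseudonymization at the point of ingestion, the example below replaces direct identifiers with a keyed hash before the record ever lands in the lake. The key handling, field names and the suggestion that this alone satisfies GDPR are simplified assumptions; any real deployment should be reviewed with your privacy and legal teams.

```python
# Sketch: pseudonymize direct identifiers with a keyed hash (HMAC-SHA256)
# before records land in the lake. Key management, field names and the exact
# regulatory treatment are simplified assumptions for illustration.
import hashlib
import hmac
import os

# In practice the key would come from a secrets manager, not an env-var default.
PSEUDONYM_KEY = os.environ.get("PSEUDONYM_KEY", "change-me").encode()

def pseudonymize(value: str) -> str:
    """Return a stable, non-reversible token for a piece of PII."""
    return hmac.new(PSEUDONYM_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

def scrub_record(record: dict) -> dict:
    """Replace direct identifiers before the record is written to the lake."""
    scrubbed = dict(record)
    for field in ("email", "phone", "customer_name"):  # hypothetical PII fields
        if scrubbed.get(field) is not None:
            scrubbed[field] = pseudonymize(str(scrubbed[field]))
    return scrubbed

# Example usage with a hypothetical incoming event.
event = {"order_id": "o-123", "email": "jane@example.com", "amount": 42.0}
print(scrub_record(event))
```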

It’s also important to remember that data silos and haphazard data quality are a reflection of the people and organizations that create them. As such, one of the best ways to improve data quality within data lakes is to improve the organizational structure that feeds data into the lake.

Consider investing in data quality training for your staff, and be sure to offer them regular training on data security best practices and general data literacy.

No matter how well you do with the rest of these tips, your company must hire and retain strong data engineers if you want to set your data lakes up for success. Whatever process created the data and its silos, accessing that data remains a task best suited to a data engineer, who is not the same thing as a data scientist or business analyst.

Hard as it may be to hire data scientists, data engineers are even more scarce — perhaps one data engineer is on staff for every 100 data scientists or business analysts in any given company. A data engineer prepares data for operational and/or analytical uses, and they’re in short supply. However, their skills are worth the investment it will take to bring them on board for data lake and data quality management.

Disclosure: I work for MongoDB but the views expressed herein are mine.
