The Data Daily

How to Generate Business Value From Unstructured Data Analytics

With the advancement of powerful big data processing platforms and algorithms comes the ability to analyze increasingly large and complex datasets. This goes well beyond the structured and semi-structured datasets that are compatible with a data warehouse, as there is considerable business value to be gained from unstructured data analytics.

Why organizations need the ability to process unstructured data

The quantity and diversity of unstructured data continue to grow. Unstructured data accounts for between 70% and 90% of all data generated, and its growth is estimated at around 60% year over year, amounting to hundreds of zettabytes of data. While it is certainly valuable to govern the storage of and access to such data in a cloud data warehouse, most of the value comes from custom processing of unstructured data for specific use cases.

The best-known examples of unstructured data analytics come from the medical and automotive fields. The value in unstructured medical data is clear: lives are saved through a deep understanding of imaging data from the human body, for example. Other industries also have many real-world use cases for unstructured data, such as sentiment analysis, predictive analytics and real-time decision-making. There are, of course, no restrictions on the type of data: images, audio and text may all contain valuable information.

On Databricks, any type of data can be processed in a meaningful way without having to move or copy it, as the most recent machine learning libraries are natively supported. This allows our customers to include all properties of unstructured datasets (from social media posts and metadata to catalog images) in their analysis and models.

That brings us to the real 4 Vs of unstructured data: value, value, value and value. Here, we have curated a set of example use cases from various industries based on unstructured data, along with the attained business value.

Most use cases based on unstructured data follow a similar computational pattern. Compared to the analysis and modeling of structured data, they typically require a substantial feature extraction step before modeling. In other words, the unstructured data needs structuring. Beyond that, there is no fundamental difference from conventional machine learning.
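To make the pattern concrete, here is a minimal sketch of "structuring" unstructured text before modeling. It is an illustration only, not the approach used in any of the use cases: the reviews, labels and all function names are hypothetical, and a simple nearest-centroid classifier stands in for whatever model a real project would use. It relies only on the Python standard library.

```python
import re
from collections import Counter

# Hypothetical toy reviews; texts and labels are made up for illustration.
REVIEWS = [
    ("great product, works great", "positive"),
    ("love it, great value", "positive"),
    ("terrible quality, broke fast", "negative"),
    ("awful support, terrible experience", "negative"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(texts):
    """Map every distinct token to a fixed column index."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(tokens)}


def extract_features(text, vocabulary):
    """Feature extraction: free text -> fixed-length vector of token counts."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]


def train_centroids(rows, vocabulary):
    """Average the feature vectors per label (a nearest-centroid model)."""
    sums, totals = {}, Counter()
    for text, label in rows:
        vec = extract_features(text, vocabulary)
        acc = sums.setdefault(label, [0.0] * len(vocabulary))
        for i, value in enumerate(vec):
            acc[i] += value
        totals[label] += 1
    return {label: [v / totals[label] for v in acc] for label, acc in sums.items()}


def predict(text, vocabulary, centroids):
    """Return the label whose centroid is closest to the text's features."""
    vec = extract_features(text, vocabulary)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


VOCAB = build_vocabulary(text for text, _ in REVIEWS)
CENTROIDS = train_centroids(REVIEWS, VOCAB)
print(predict("great value", VOCAB, CENTROIDS))
print(predict("terrible quality", VOCAB, CENTROIDS))
```

Once `extract_features` has turned each text into a numeric vector, the modeling step no longer depends on the data having been unstructured. For images or audio, the extraction step would typically be a pretrained deep learning model rather than token counts.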

The Databricks Lakehouse Platform natively allows for processing unstructured data, as the data can be ingested in the same way as (semi-)structured data. Here, we follow the medallion architecture, in which raw data is progressively refined into a consumable form.
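The progressive refinement can be sketched in a few lines. This is a toy, in-memory illustration using the bronze, silver and gold layer names commonly associated with the medallion architecture. It does not use any Databricks API, and the records, field names and functions are hypothetical. On a real platform each layer would be a persisted table rather than a Python list.

```python
from collections import defaultdict

# Hypothetical raw records, e.g. social media posts as they arrive.
RAW_POSTS = [
    {"id": 1, "product": "shoe", "text": "  Love these shoes!  "},
    {"id": 2, "product": "shoe", "text": "Too tight, sadly"},
    {"id": 3, "product": "bag", "text": ""},  # empty post, dropped in silver
    {"id": 4, "product": "bag", "text": "Great bag, great price"},
]


def to_bronze(raw_records):
    """Bronze: keep the raw content unchanged, only tag it with its source."""
    return [dict(record, source="social_media") for record in raw_records]


def to_silver(bronze_records):
    """Silver: clean the text, drop empty posts, derive structured columns."""
    silver_records = []
    for record in bronze_records:
        text = record["text"].strip().lower()
        if not text:
            continue
        silver_records.append(
            {
                "id": record["id"],
                "product": record["product"],
                "text": text,
                "word_count": len(text.split()),
            }
        )
    return silver_records


def to_gold(silver_records):
    """Gold: aggregate per product into a summary ready for consumption."""
    stats = defaultdict(lambda: {"posts": 0, "words": 0})
    for record in silver_records:
        stats[record["product"]]["posts"] += 1
        stats[record["product"]]["words"] += record["word_count"]
    return {
        product: {"posts": s["posts"], "avg_words": s["words"] / s["posts"]}
        for product, s in stats.items()
    }


bronze = to_bronze(RAW_POSTS)
silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Keeping the bronze layer untouched means the silver and gold steps can be re-run with improved cleaning or feature extraction logic without ingesting the source data again.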

For a detailed explanation of the general approach to modeling unstructured data using deep learning on Databricks, see the article How to Manage End-to-end Deep Learning Pipelines with Databricks.

Did you know that in addition to its native support for unstructured data analytics, Databricks has set a world record when it comes to data warehousing performance? That is what we mean by a Lakehouse: a place where data engineers, data scientists and data analysts work together on any data-driven use case, from advanced machine learning to performant and reliable BI workloads, delivering business value to our customers.

If you are looking specifically for best practices around image processing on Databricks, check out this past Data + AI Summit session on image processing and this related image processing blog. A further resource shows how to use images in a recommender system. For natural language processing, there is this recent blog post that contains a solution accelerator for adverse drug event detection.
