The Data Daily

Databricks and The Modern Data Warehouse Enable Data Scientists

The big data analytics market is set to reach $103 billion by 2023 (Techjury, 2021). It is well documented that the volumes of data are increasing. However, what’s less clear is that as data grows, so does the number of transformations and computations required at every stage of the data journey inside an organisation.

Therefore, enterprises need a scalable compute platform for every part of the data journey, one that handles loosely structured data as well as more structured data sources. The journey includes data storage, but it also includes compute for the less visible parts of the data journey in the organisation.

The cost of acquiring new customers and maintaining those relationships in an online environment versus bricks and mortar is significant. –  Stephen Cohen, American educator

By 2025, global data will grow to 175 zettabytes, according to IDC forecasts. Data Science enables companies to efficiently understand very large volumes of data from multiple sources and derive valuable insights to make smarter data-driven decisions. Data Science is widely used in various industry domains, including marketing, healthcare, finance, banking, policy work, and more. That explains why Data Science is important.

Organisations often want data science, but they are not sure how to get there. Often, they start by hiring the data scientist first, who is then frustrated because the robust data engineering needed for data preparation and data modelling is not yet in place. The data scientist then tries to adopt the role of the data engineer to show progress, but this is not a satisfactory outcome as they can get stuck in the data engineer role.

All these requirements demand a technology that is flexible enough to work with data anywhere on the continuum from unstructured to structured. Further, the technology needs to support extensive parallelisation so that data can be landed, written and returned quickly enough for business intelligence and data science workloads.

There is increased interest in how the data is protected and managed from the business as well as the consumer perspective. This is driven by the growth of data generation by companies and consumers and an increased focus on data collection.

Databricks is one solution that can help organisations seamlessly combine powerful data engineering and intricate data science, helping data scientists to move quickly to the data science work itself.

Fortunately, this is where Azure Databricks comes in. Azure Databricks offers an open-source technology backed by a robust cloud-managed Data Engineering & AI platform aimed at the enterprise.

Based on Apache Spark, Databricks is an integrated platform that makes the power of Apache Spark available to enterprises. Azure Databricks offers an integrated platform for data engineers and data scientists to collaborate in their preferred language, whether it is SQL, Python or Scala to transform and load the data. Azure Databricks is currently in production at a range of businesses, including Shell, Devon Energy and LINX Cargo Care Group.

Databricks offers the functionality needed for data engineering activities, and it also offers a range of features that are especially popular with data scientists. It can work with other tools to offer an integrated experience for a full data engineering estate, from ingestion to predictive analytics. Additionally, it supports data visualisation through integration with Power BI. Let’s look at the problems that Databricks solves for a range of roles along the data journey.

A scientist can discover a new star, but he cannot make one. He would have to ask an engineer to do it for him. – Gordon Lindsay Glegg

ETL stands for ‘Extract, Transform, Load’. It refers to the process used to consolidate data from diverse source databases, applications and systems. The data includes a variety of Big Data sources, such as key value, graph, time series and even location data. The data could also include unstructured data, such as text, waveform (sound), or sensor data. The ETL process also includes modifying the data to match the target system’s required formatting and placing it into a destination database.
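As a minimal sketch of the three ETL steps described above, the following uses only Python's standard library. An in-memory CSV string stands in for a source system and SQLite stands in for the destination database. The table name, column names and sample rows are hypothetical, and nothing here is specific to Databricks.

```python
import csv
import io
import sqlite3

# Hypothetical source data: a small CSV export standing in for a source system.
SOURCE_CSV = """customer_id,country,amount
1,uk,10.50
2,UK,4.25
3,de,7.00
"""


def extract(raw_csv):
    """Extract: read rows from the source (here, an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows):
    """Transform: fix types and formatting to match the target schema."""
    return [
        (int(r["customer_id"]), r["country"].strip().upper(), float(r["amount"]))
        for r in rows
    ]


def load(records, conn):
    """Load: write the cleaned records into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(customer_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()


# Run the three steps in order against an in-memory destination database.
conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)

# Query the destination to confirm the inconsistent country codes were unified.
totals = dict(conn.execute("SELECT country, SUM(amount) FROM sales GROUP BY country"))
```

Keeping extract, transform and load as separate functions means each stage can be replaced independently, for example by swapping the CSV reader for a different source without touching the transformation logic.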

Organisations need ways to cope with the increase in volumes and types of data. It has been noted that 95% of businesses cite the need to manage unstructured data as a problem for their business. Examples of source data include streaming data, data warehouses themselves, data stores in the cloud and sensor data.

Traditionally, data integration is a very rigid process and centralizes the data. It is hard to get data into a structure and to get it back out again. Databricks can help support the analysis of data warehouses, data lakes and other data sources in a single platform from the starting point of historical data through to predictive analytics.

In traditional Business Intelligence environments, technologies such as Microsoft SQL Server Integration Services (SSIS) perform Extract, Transform and Load (ETL) to transform source data, blend it, and then put the data in a data store. However, SSIS is often required to perform extensive data transformations, and these cannot be easily parallelized.

97.2% of organizations are investing in big data and AI, believing that it is a way to support innovation. Databricks can support data innovation throughout the data journey for the data engineer as well as the data scientist. Databricks makes data more useful by allowing many different sources and types of disparate data to be blended. It accomplishes data blending through multi-stage pipelines that operate on complex data stores, and it can scale to run these pipelines, thereby simplifying the problem of complex data integration. From that point, the data can be used with machine learning models, business intelligence tools such as Microsoft’s Power BI, or even further data exports for other systems downstream.

Azure Databricks enables data engineers to expedite ETL pipelines by parallelizing operations over scalable compute clusters. If your data is likely to grow rapidly in volume, then this is a great option for processing data reliably. Further, it is easier to move to this option by leveraging existing SQL skillsets, using SQL in Databricks notebooks to extract, transform and load data into an Azure data lake or a Databricks Delta Lake.
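The idea of parallelizing a transformation can be sketched on a single machine: split the data into partitions, apply the same function to each partition independently, and combine the results. The example below is only an illustration of that pattern using Python's standard library thread pool. It is not Spark or the Databricks API, and the record layout and function names are hypothetical. On a real cluster the partitions would be distributed across worker nodes rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Split the rows into roughly equal chunks, one per worker."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def transform_partition(partition):
    """Apply the same transformation to every row in one partition."""
    return [
        {"id": r["id"], "amount_pence": round(r["amount"] * 100)}
        for r in partition
    ]


# Hypothetical input: 1,000 small sales records.
rows = [{"id": i, "amount": i * 0.5} for i in range(1000)]
partitions = split_into_partitions(rows, num_partitions=4)

# Each partition is transformed independently, so the work can run in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    transformed_partitions = list(pool.map(transform_partition, partitions))

# Combine the partial results back into one dataset, ordered by id.
result = sorted(
    (r for part in transformed_partitions for r in part),
    key=lambda r: r["id"],
)
```

Because no partition depends on another, adding workers (or, on a cluster, adding nodes) is what allows the same transformation to keep pace as data volumes grow.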

What does this mean for the data scientist? The demand for predictive analytics, including methods based on machine learning techniques, has advanced to such a degree that organizations want to make the most of their data, and feel that the time is right to do so. Data is viewed as a corporate asset, and people have an interest in its contribution to the success of the organisation.

Advanced data science was once the domain of leading-edge, high-tech companies that could afford the hardware. Now, these advanced approaches are adopted across many private and public sector organizations, and they are not limited to large enterprises. Cloud computing is the architecture of choice for many start-ups, who can get started without much initial investment.

The growth of data is further driving the need for solid data engineering and data science practices, and this gives the data scientist the opportunity to show what they can do with the data. Databricks uses Apache Spark, which is a well-regarded open-source data science technology that has survived where other open-source technologies have fallen out of favour. Azure Databricks offers interactive data science through the use of Notebooks, which integrate with Azure DevOps for continuous integration and continuous deployment.

From the data scientist’s perspective, it is possible to work with Notebooks while ensuring that the work fits into a DataOps strategy.

“DataOps is the hub for collecting and distributing data, with a mandate to provide controlled access to systems of record for customer and marketing performance data, while protecting privacy, usage restrictions and data integrity.” Definition from Gartner.

Gartner is now recognizing DataOps on their hype cycle, so we can expect that the industry will start to follow this trend. DataOps can bring the data disciplines together by adopting principles and practices that resolve the conflicting goals of the different data tribes in the organisation, helping the whole organisation to derive value from their data.

Organisations can get very caught up in the arguments for and against different languages, as evidenced by an internet search for ‘R vs Python’, for example. Because Databricks offers such flexibility in language choice, Data Scientists can work together in their preferred language. Databricks is also very easy to spin up: it has a one-click deploy methodology and is auto-configured for the organization.

Once Databricks is set up, it is possible to set a default language, such as Scala, but it is important to note that it is also possible to use other languages such as Java, SQL, R and Python. You can also access Databricks using .NET, which opens data engineering and data science to developers who prefer using .NET. Python developers can also use PySpark to access the Apache Spark engine which lies at the heart of Databricks. To summarise, there is plenty of choice to support collaboration between different groups of data scientists, as well as between data engineers and data scientists.

Databricks is extremely straightforward to set up within Azure, with a single one-click deployment in the Azure Portal. Databricks manages the infrastructure, and all the data scientists need to do is select the types of Azure Virtual Machines during setup. Here is an example of the Databricks cluster setup, which is very easy to use.
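As a rough textual illustration of the choices made during cluster setup, the sketch below expresses a cluster definition as a Python dictionary and serializes it to JSON. The field names follow the Databricks Clusters REST API as commonly documented, but they should be checked against the current documentation. Every value (the cluster name, runtime label, VM size and worker counts) is a hypothetical placeholder, and the validation helper is purely illustrative.

```python
import json

# Hypothetical cluster definition; every value is an illustrative placeholder.
cluster_spec = {
    "cluster_name": "data-science-sandbox",  # hypothetical name
    "spark_version": "13.3.x-scala2.12",     # example runtime label; check your workspace
    "node_type_id": "Standard_DS3_v2",       # example Azure VM size
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,            # shut down when idle to limit cost
}


def validate_cluster_spec(spec):
    """Return a list of problems found in the spec (empty if none)."""
    problems = []
    for key in ("cluster_name", "spark_version", "node_type_id"):
        if not spec.get(key):
            problems.append(f"missing {key}")
    scale = spec.get("autoscale", {})
    if scale.get("min_workers", 0) > scale.get("max_workers", 0):
        problems.append("min_workers exceeds max_workers")
    return problems


problems = validate_cluster_spec(cluster_spec)

# Serialize to JSON, the form in which such a definition would typically be submitted.
payload = json.dumps(cluster_spec, indent=2)
```

Keeping the definition as data rather than as clicks in a portal makes it easier to review and to store in version control alongside the notebooks it supports.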

In Azure, ownership and control of data is with the customer. Databricks includes security controls that make it easy to leverage Spark in enterprises with thousands of users. Azure Databricks adheres to the same compliance standards as the rest of Azure. One of the many features of Databricks is how it natively integrates with Azure security and other data services, creating a secure environment for the data used. Using Databricks, productivity can be boosted by up to 25%, expediting collaboration between Data Engineers and Data Scientists.

Databricks scales automatically in terms of storage and compute, which are provisioned according to usage. In turn, these resources can be shared between Data Science teams, all while preserving permissions through Azure Active Directory integration.

Once these services are ready, users can then manage the Databricks cluster through the Azure Databricks UI or features such as autoscaling. Data scientists often work with data that is extremely sensitive, so it is reassuring to know that Azure Databricks uses Azure Active Directory Single Sign-On to streamline access and permissions. Here is an example of the Azure Databricks login screen, showing that Azure Active Directory Single Sign-On has been configured for seamless login to Databricks.

Data scientists can have confidence in meeting enterprise requirements efficiently. For example, it is straightforward to automate jobs through scheduling. All metadata, such as scheduled jobs, is stored in an Azure Database with geo-replication for fault tolerance. This is crucial for large enterprises, which need the assurance that business-critical jobs will run as scheduled.
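To make the scheduling idea concrete, here is a hedged sketch of what a scheduled notebook job definition might look like, written as a Python dictionary. The field names are based on the Databricks Jobs REST API as commonly documented and should be verified against the current documentation. The job name, notebook path, cluster id placeholder and timezone are all hypothetical.

```python
import json

# Hypothetical job definition; names, paths and ids are placeholders.
job_settings = {
    "name": "nightly-sales-refresh",
    "tasks": [
        {
            "task_key": "refresh_sales",
            "notebook_task": {"notebook_path": "/Shared/etl/refresh_sales"},
            "existing_cluster_id": "<cluster-id>",  # placeholder, not a real id
        }
    ],
    "schedule": {
        # Quartz cron syntax: seconds, minutes, hours, day of month, month, day of week.
        "quartz_cron_expression": "0 0 2 * * ?",  # every day at 02:00
        "timezone_id": "Europe/London",
        "pause_status": "UNPAUSED",
    },
}

# Label each cron field so the schedule is easy to read and review.
QUARTZ_FIELDS = ["seconds", "minutes", "hours", "day_of_month", "month", "day_of_week"]
cron_parts = dict(
    zip(QUARTZ_FIELDS, job_settings["schedule"]["quartz_cron_expression"].split())
)

job_payload = json.dumps(job_settings, indent=2)
```

Note that Quartz cron expressions start with a seconds field, unlike the five-field Unix cron format, which is a common source of mistakes when schedules are migrated from other tools.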

Another important benefit is the integration between Azure Databricks and other Azure technologies. In this example, the AzureML SDK library is installed as part of the Databricks solution. The AzureML SDK allows data scientists to include machine learning models as part of the Azure data science platform.

Azure Databricks integrates with Azure Machine Learning and its AutoML capabilities, offering further benefits to the data science teams to showcase the power of their models and results to the rest of the business.

For example, the data science team can use Azure Databricks to train models using Spark MLlib. Once the model is developed and tested, it can be deployed to the Azure Kubernetes Service for robust production requirements, or, alternatively, to the Azure Container Instance service. Azure Databricks can also use automated machine learning capabilities through the Azure ML SDK. Alternatively, Azure Databricks can be used as a compute target from an Azure Machine Learning pipeline. The Notebook can be used with Azure DevOps, providing support for a DataOps process within the enterprise.

In conclusion, Databricks in Azure offers a variety of benefits, ranging from close integration with Azure services to optimized connectors to data sources, as well as one-click management directly from the Azure console. Azure Databricks can greatly simplify building enterprise-grade production data applications, allowing teams of data scientists and data engineers to work effectively with a streamlined architecture and to deliver more efficiently and quickly.
