
The Key to Building Data Pipelines for Machine Learning

As a consumer of goods and services, you experience the results of machine learning (ML) whenever the institutions you rely on use ML to run their operations. You may receive a text message from your bank asking you to verify a credit card transaction it has paused, or an online travel site may email you personalized accommodation offers for your next personal or business trip.

The work that happens behind the scenes to make these experiences possible is easy to overlook. An important portion of it is done by the data engineering teams that build the data pipelines used to train and deploy those ML models. Where they once focused on pipelines feeding traditional data warehouses, today’s data engineering teams build more technically demanding continuous pipelines that feed applications driven by artificial intelligence and ML algorithms. These pipelines must be cost-effective, fast, and reliable regardless of workload type and use case.

Because of the diversity of data sources and the volume of data to be processed, traditional data processing tools cannot meet the performance and reliability requirements of modern machine learning applications. The need to build reliable pipelines for these workloads, coupled with advances in distributed, high-performance computing, has given rise to big data processing engines such as Hadoop.

Let’s quickly review the engines and frameworks most often used in data engineering to support ML efforts, in particular Hadoop/Hive, Spark, and Presto.

While the typical steps of data engineering may be standardized across industries — data exploration, building pipelines, data orchestration, and delivery — organizations have various workload types and data engineering needs. That’s where support for multiple engines and workloads becomes essential. Let’s talk about which engines are most effective for each stage of the data engineering cycle:

Data engineering always starts with data exploration: inspecting the data to understand its characteristics and what it represents. The insights gained during this stage shape the amount and type of work the data scientist will do during the data preparation phase.

Hadoop/Hive is an excellent choice for exploring larger unstructured data sets because of its inexpensive storage and its SQL compatibility. Spark works well for data sets that require a programmatic approach, such as the file formats widely used in healthcare insurance processing. Presto, on the other hand, provides a quick and easy way to access data from a variety of sources using the industry-standard SQL query language.
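
As a rough illustration, the sketch below shows what programmatic exploration with PySpark might look like. The storage path, file format, and the "claim_type" column are hypothetical stand-ins for whatever data set you are profiling, not details from the article.

```python
# Minimal PySpark data-exploration sketch. The path and column names below
# are illustrative assumptions only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-exploration").getOrCreate()

# Load the raw data set to be profiled.
claims = spark.read.parquet("s3://example-bucket/claims/")

# Inspect the schema and basic summary statistics.
claims.printSchema()
claims.describe().show()

# Count nulls per column; gaps found here drive later data preparation work.
claims.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in claims.columns]
).show()

# Check the distribution of a categorical column of interest.
claims.groupBy("claim_type").count().orderBy(F.desc("count")).show()

spark.stop()
```

When the data already sits behind a SQL-accessible catalog, the same kind of profiling can instead be expressed as SQL queries against Hive or Presto.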

Data pipelines carry and process data from source systems to the business intelligence (BI) and ML applications that consume it.
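
To make that concrete, here is a minimal batch pipeline sketch in PySpark that reads raw events, cleans and aggregates them, and writes the result where BI and ML workloads can pick it up. All paths, formats, and column names are hypothetical assumptions for illustration.

```python
# Minimal batch pipeline sketch: extract raw events, transform them, and
# load a curated output for BI dashboards and ML training jobs.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("events-pipeline").getOrCreate()

# Extract: read raw event data from a landing zone (illustrative path).
events = spark.read.json("s3://example-bucket/raw/events/")

# Transform: drop malformed rows, parse timestamps, and derive a daily
# per-user activity count that downstream consumers can use.
cleaned = (
    events
    .dropna(subset=["user_id", "event_time"])
    .withColumn("event_time", F.to_timestamp("event_time"))
    .withColumn("event_date", F.to_date("event_time"))
)

daily_activity = cleaned.groupBy("user_id", "event_date").count()

# Load: write partitioned output where BI tools and ML jobs read it.
(
    daily_activity.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-bucket/curated/daily_activity/")
)

spark.stop()
```

In practice a job like this would be scheduled and monitored by an orchestration layer rather than run by hand, which is where the orchestration stage of the data engineering cycle comes in.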
