In this next Oceans of Data series article covering Apache Spark, I continue where I left off in Part 1. Let’s have fun playing with data on a Spark Cluster. I am going to share how to spin up a Spark cluster and start analyzing data using Spark SQL and data frame operations. I’ll also show you a little pySpark, a Python programming interface to Spark. For R fans, we there is an option called SparkR that I just showed briefly in an IBM Watson Data Platform article.
For the hands-on exploration, we will use Databricks Community Edition. Databricks provides a lovely free online Spark software development environment. If you do not already have an account, you can register for one at this link: https://databricks.com/ce. Note Databricks is not the only option for Spark. There are many other widely available Spark options and flavors from most of the top data platform vendors AWS, IBM, etc.
Spark’s distributed data-sharing concept is called “Resilient Distributed Datasets,” or RDD. RDDs are fault-tolerant collections of objects partitioned across a cluster that can be queried in parallel and used in a variety of workload types. RDDs are created by applying operations called “transformations” with map, filter, and groupBy clauses. They can persist in memory for rapid reuse. If an RDD data does not fit in memory, Spark will overflow it to disk.
Apache Spark Architecture: Spark master-worker architecture distributes work. When you run a Spark program, it actually consists of two programs 1) a driver program and 2) a workers program. The driver program runs on one machine and the worker program runs either on cluster nodes or in local threads on the same machine.
Dataset: A Dataset is Apache Spark’s newest distributed collection and can be considered a combination of DataFrames and RDDs. It provides the typed interface that is available in RDDs while providing a lot of conveniences of DataFrames. It will be the core abstraction going forward.
DataFrame API: A DataFrame is collection of distributed Row types. These provide a flexible interface and are similar in concept to the DataFrames you may be familiar with in Python (pandas) and R. Unlike existing data frame APIs in R and Python, DataFrame operations in Spark SQL go through an extensible relational optimizer called Catalyst. DataFrame objects represent a logical plan to compute a dataset. No execution on Spark occurs until an output operation is called.
Spark SQL’s DataFrame API supports inline definition of user defined functions (UDFs) including machine learning predictions. UDFs can be queried from Spark SQL. Thus Spark predictive queries are made much more accessible for embedding into intelligent reports and applications.
Spark SQL: Spark SQL is a module in Apache Spark that integrates relational processing with Spark’s functional programming API for querying data and calling complex analytics libraries. Spark SQL uses a nested data model based on Hive. It supports all major SQL data types, complex data types and user-defined types. Like the Shark earlier project, Spark SQL can cache hot data in memory using columnar storage
Transformations: Transformations are operations that will not be completed until you call an action. An example of a transformation might be to convert an integer into a float or to filter a set of values. (select, distinct, groupBy, sum, orderBy, filter, limit)
Transformations are a way of creating a new DataFrame from an existing one, or from a data source. Notably Spark uses lazy evaluation with transformations. That means that the results are not computed right away. Instead, Spark remembers the set of transformations that you requested, and then will apply those to the base DataFrame when an Action is called.
Actions: Actions are commands that are computed by Spark at the time of execution. They run transformations to return a result. (show, count, collect, save). Avoid using collect actions on big data sets since that command returns all records.
Ok, let’s spin up a Spark Cluster. After you log into Databricks Community Edition, click on the Clusters Icon. Then click Create Cluster.
Databricks Community Edition contains a plethora of learning resources. After logging in, explore the Introduction to Apache Spark on Databricks Notebook to get started with loading, querying and visualizing data.
Notebooks are extremely popular for big data analytics. If you have not used Jupyter or Zeppelin, I highly recommend getting familiar them and the awesome public collections of sample Notebooks on github.
To import data, Databricks provides another Notebook with links and example code. We will simply upload data by clicking on the Tables button on the side bar and then clicking on Create Table link at the top of the tables panel. We’ll select a csv file and upload it.
If we want to cache our table on the solid-state drives (SSDs) of the Spark cluster, click on the arrow on the right side of the table name and choose Cache.
Now to add a Python Notebook for playing with our uploaded data. Navigate to Workspace, select Create and Notebook. Select Python and the cluster you want it to query.
Spark DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, JSON files, Parquet files, distributed file systems (HDFS), cloud storage (S3), or existing RDDs. DataFrames can be also support third-party data formats.
One significant difference and nuance between SQL and Spark SQL is that Spark SQL does not support DELETE. Spark DataFrames are immutable which means they cannot be changed once you create them. Instead of modifying a Spark DataFrame, you must create a new DataFrame.
One of the most common ways to analyze data is to query it. Spark SQL, previously project Shark, supports similar queries to columnar database engines. It is compatible with existing Hive data, queries, and UDFs. Spark SQL is often used with data discovery tools like Tableau via JDBC/ODBC connectivity. When analyzing data with Spark SQL you will use sqlContext functions.
If you are using a Python Notebook, pandas Python data analysis library that you will reference. To create pySpark data frames from pandas, use sqlContext.createDataFrame on an RDD of pyspark.sql.Row objects. Here is a super simple example.
Now we’ll query and visualize our uploaded data on the Spark cluster with the Python Notebook. Here we’ll execute SQL commands by invoking %sql or using sqlContext.sql(…).
Reviewing our results, we see Databricks has cooked in a few plotting features for basic tables, charts and histograms. Additional visualization types may require referencing a Python plotting library by uploading it to our Databricks Workspace.
For creating data frames from Hadoop Distributed File System (HDFS), text files, JSON files, Apache Parquet, Hypertable, Amazon S3, Apache HBase or other Hadoop input format, you’ll simply read the data with sqlContext.read.
Spark SQL has many more commands than we covered in this basic introduction including string, math, statistical, date, hashing, and windowing functions. More advanced User Defined Functions (UDFs) allow you to query machine learning models.
Although the Notebooks are nice, there are other ways to query and use Spark. Spark also has a command line shell utility called spark-sql located in the bin directory of your Spark installation. This can be used with loading and reporting routines.
Spark SQL can also be queried through a Thrift JDBC/OBDC Server interface that is based on HiveServer2 project. Thrift is an Apache project used for cross-language service development. Beeline is a Java console command line shell that can be used with it to test the Spark SQL Thrift server.
For Tableau fans, there are several Spark SQL flavor connectors available. Tableau can connect to Spark version 1.2.1 and later. You will need to install a driver from https://www.tableau.com/support/drivers. The documentation for using Tableau with Spark SQL is at http://onlinehelp.tableau.com/current/pro/desktop/en-us/help.htm#examples_sparksql.html. Unfortunately, the free Databricks Community Edition that I am using for this series does not support ODBC or I would have shown it. If you want to upgrade to a paid version, here are the instructions to connect to a Databricks Spark cluster.
For a much deeper dive into Spark SQL, DataFrames and Python pySpark, read the white paper and run through the linked examples on the Databricks Community Edition. Another fabulous resource for getting comfortable with Spark SQL is the Apache Spark SQL, DataFrames and Datasets Guide and the older SlideShare tutorial. I’d also recommend bookmarking the Python Language docs.