AWS Big Data Blog

10 Best Practices for Amazon Redshift Spectrum

by Po Hong and Peter Dalton | 24 JUN 2017
Amazon Redshift Spectrum enables you to run Amazon Redshift SQL queries against data that is stored in Amazon S3. With Amazon Redshift Spectrum, you can extend the analytic power of Amazon Redshift beyond the data that is stored on local disks in your data warehouse. You can query vast amounts of data in your Amazon S3 “data lake” without having to go through a tedious and time-consuming extract, transform, and load (ETL) process. Amazon Redshift Spectrum applies sophisticated query optimization and scales processing across thousands of nodes to deliver fast performance.
In this blog post, we have collected 10 important best practices for Amazon Redshift Spectrum, grouped into several functional areas.
These guidelines are the product of many interactions and direct project work with Amazon Redshift customers.
Amazon Redshift vs. Amazon Athena
AWS customers often ask us: Amazon Athena or Amazon Redshift Spectrum? When should I use one over the other?
When to use Amazon Athena
Amazon Athena supports a use case in which you want interactive ad-hoc queries to run against data that is stored in Amazon S3 using SQL. The serverless architecture in Amazon Athena frees you from having to provision a cluster to perform queries. You are charged based on the amount of S3 data scanned by each query. You can get significant cost savings and better performance by compressing, partitioning, or converting your data into a columnar format, which reduces the amount of data that Amazon Athena needs to scan to execute a query. All the major BI tools and SQL clients that use JDBC can be used with Amazon Athena. You can also use Amazon QuickSight for easy visualization.
When to use Amazon Redshift
We recommend using Amazon Redshift on large sets of structured data. Amazon Redshift Spectrum gives you the freedom to store your data where you want, in the format you want, and have it available for processing when you need it. With Amazon Redshift Spectrum, you don’t have to worry about scaling your cluster. It lets you separate storage and compute, allowing you to scale each independently. You can even run multiple Amazon Redshift clusters against the same Amazon S3 data lake, enabling limitless concurrency. Amazon Redshift Spectrum automatically scales out to thousands of instances. So queries run quickly, whether they are processing a terabyte, a petabyte, or even an exabyte.
Set up the test environment
For information about prerequisites and steps to get started in Amazon Redshift Spectrum, see Getting Started with Amazon Redshift Spectrum.
You can use any data set to perform the tests that validate the best practices outlined in this blog post. One important requirement is that the S3 files for the largest table need to be in three separate data formats: CSV, non-partitioned Parquet, and partitioned Parquet. How to convert from one file format to another is beyond the scope of this blog post. For more information on how this can be done, check out the following resources:
Converting to Columnar Formats
Creating the external schema
Use the Amazon Athena data catalog as the metadata store, and create an external schema named “spectrum” as follows:
create external schema spectrum
from data catalog
database 'spectrumdb'
iam_role 'arn:aws:iam::<aws-account-id>:role/aod-redshift-role'
create external database if not exists;
The Redshift cluster and the data files in Amazon S3 must be in the same AWS region. Your Redshift cluster needs authorization to access your external data catalog in Amazon Athena and your data files in Amazon S3. You provide that authorization by referencing an AWS Identity and Access Management (IAM) role (e.g., aod-redshift-role) that is attached to your cluster. For more information, see Create an IAM Role for Amazon Redshift.
Defining external tables
As examples, an Amazon Redshift Spectrum external table using partitioned Parquet files and another external table using CSV files are defined as follows:
CREATE external table spectrum.LINEITEM_PART_PARQ (
  L_ORDERKEY BIGINT, L_PARTKEY BIGINT, L_SUPPKEY BIGINT, L_LINENUMBER INT,
  L_QUANTITY DECIMAL(12,2), L_EXTENDEDPRICE DECIMAL(12,2), L_DISCOUNT DECIMAL(12,2),
  L_TAX DECIMAL(12,2), L_RETURNFLAG VARCHAR(128), L_LINESTATUS VARCHAR(128),
  L_COMMITDATE VARCHAR(128), L_RECEIPTDATE VARCHAR(128), L_SHIPINSTRUCT VARCHAR(128),
  L_SHIPMODE VARCHAR(128), L_COMMENT VARCHAR(128))
partitioned by (L_SHIPDATE VARCHAR(128))
stored as PARQUET
location 's3://<your-bucket>/<prefix>/lineitem_partition/';

CREATE external table spectrum.LINEITEM_CSV (
  L_ORDERKEY BIGINT, L_PARTKEY INT, L_SUPPKEY INT, L_LINENUMBER INT,
  L_QUANTITY DECIMAL(12,2), L_EXTENDEDPRICE DECIMAL(12,2), L_DISCOUNT DECIMAL(12,2),
  L_TAX DECIMAL(12,2), L_RETURNFLAG VARCHAR(128), L_LINESTATUS VARCHAR(128),
  L_SHIPDATE VARCHAR(128), L_COMMITDATE VARCHAR(128), L_RECEIPTDATE VARCHAR(128),
  L_SHIPINSTRUCT VARCHAR(128), L_SHIPMODE VARCHAR(128), L_COMMENT VARCHAR(128))
row format delimited fields terminated by '|'
stored as textfile
location 's3://<your-bucket>/<prefix>/lineitem_csv/';
Querying data
To recap, Amazon Redshift Spectrum uses external tables to query data that is stored in Amazon S3. You can query an external table using the same SELECT syntax you use with other Amazon Redshift tables. External tables are read-only, so you can't write to them.
You first create an external schema that references an external database, which can reside in either an Amazon Athena data catalog or an Apache Hive metastore, such as Amazon EMR. Then you create an external table in Amazon Redshift using this external schema. You must reference the external table in your SELECT statements by prefixing the table name with the schema name, without needing to create and load the table into Amazon Redshift.
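For example, once the external schema and tables above are in place, a quick sanity check is just a schema-qualified SELECT. The following is a minimal sketch against the spectrum.LINEITEM_CSV table defined earlier; the filter value is an arbitrary illustration:

select l_returnflag, count(*) as cnt
from spectrum.lineitem_csv
where l_shipdate = '1995-03-15'
group by l_returnflag;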
The external schema references a database in the external data catalog. This requires an IAM role that authorizes your cluster to access Amazon S3 and Amazon Athena on your behalf.
If you would like to perform your tests using Amazon Redshift Spectrum, the following two queries would be a good start:
QUERY 1:

SELECT l_returnflag, l_linestatus,
       sum(l_quantity) as sum_qty,
       sum(l_extendedprice) as sum_base_price,
       sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
       sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
       avg(l_quantity) as avg_qty,
       avg(l_extendedprice) as avg_price,
       avg(l_discount) as avg_disc,
       count(*) as count_order
FROM lineitem
WHERE l_shipdate <= date '1995-03-15'
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;

QUERY 2:

SELECT l_orderkey,
       sum(l_extendedprice*(1-l_discount)) as revenue,
       o_orderdate,
       o_shippriority
FROM customer, orders, lineitem
WHERE c_mktsegment = 'BUILDING'
  AND c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND o_orderdate < date '1995-03-15'
  AND l_shipdate > date '1995-03-15'
GROUP BY l_orderkey, o_orderdate, o_shippriority
ORDER BY revenue desc, o_orderdate
LIMIT 20;

The second query joins three tables (customer, orders, and lineitem) and can be very useful for comparing Amazon Redshift Spectrum's performance with that of native Amazon Redshift.
Best practices for concurrency
These recommended practices can help you optimize your concurrent workload performance using Amazon Redshift Spectrum.
1. Use Amazon Redshift Spectrum to improve scan-intensive concurrent workloads
Amazon Redshift Spectrum resides on dedicated Amazon Redshift servers that are independent of your cluster. It pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Amazon Redshift Spectrum layer, so queries use much less of your cluster’s processing capacity. In addition, Amazon Redshift Spectrum scales intelligently. Based on the demands of your queries, Amazon Redshift Spectrum can potentially use thousands of instances to take advantage of massively parallel processing (MPP).
For some concurrent, scan- and aggregate-intensive workloads, Amazon Redshift Spectrum may perform better on average than native Amazon Redshift.
The most resource-intensive aspect of any MPP system is the data-load process. This is because it competes with active analytic queries not only for compute resources, but also for locks on the tables through multiversion concurrency control (MVCC). By contrast, if you add new files to an existing external table with Amazon Redshift Spectrum by writing them to Amazon S3 and then updating the metadata to register them as new partitions, you eliminate this workload from the Amazon Redshift cluster. This has an immediate and direct positive impact on concurrency.
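For instance, after landing new Parquet files in S3, registering them as a partition is a single metadata operation. The following sketch uses the spectrum.LINEITEM_PART_PARQ table defined earlier; the bucket, prefix, and partition value are illustrative placeholders:

alter table spectrum.lineitem_part_parq
add if not exists partition (l_shipdate='1998-09-01')
location 's3://<your-bucket>/<prefix>/lineitem_partition/l_shipdate=1998-09-01/';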
2. Use multiple on-demand Amazon Redshift clusters to scale concurrency
Amazon Redshift Spectrum stores data in Amazon S3, which can be accessed by multiple Amazon Redshift clusters to improve concurrent workload performance. A common Amazon Redshift customer scenario is what to do with seasonal spiky, highly concurrent query workloads. Before Amazon Redshift Spectrum, to handle the increased concurrency, customers often had to spin up multiple “read-only” Amazon Redshift clusters by restoring from a snapshot. The problem with this approach is that for a large Amazon Redshift data warehouse with hundreds of terabytes of data, the restore process can take a long time, resulting in data latency issues.
With Amazon Redshift Spectrum, you can move the largest tables to Amazon S3, and each Amazon Redshift cluster needs to keep only a small amount of data on local disks. Because of the reduction of data volume, it would be much faster to spin up or restore multiple “read-only” Amazon Redshift clusters to handle these seasonal spiky query workloads (see Figure 1). To reduce cost, you should terminate these “on-demand” Amazon Redshift clusters as soon as they have finished the job.
Figure 1:  Multiple read-only Amazon Redshift clusters access a shared Redshift Spectrum layer
Amazon Redshift customers have been using pgbouncer-rr to simplify and control client query routing when deploying multiple Redshift clusters to scale concurrency. More details can be found in the blog post Query Routing and Rewrite: Introducing pgbouncer-rr for Amazon Redshift and PostgreSQL.
Best practices for storage
In terms of storage optimization, think about reducing the I/O workload at every step. That points toward using a columnar file format, using compression to fit more records into each storage block, and using a format that supports data partitioning. The file formats supported in Amazon Redshift Spectrum include CSV, TSV, Parquet, SequenceFile, and RCFile.
A further optimization is to use compression. Currently, Amazon Redshift Spectrum supports Gzip, Snappy, and BZ2.
3. Use Apache Parquet files for better performance and lower cost
Apache Parquet is a columnar storage format that is available to any project in the Apache Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. For more details, see the Apache Parquet information page.
Amazon Redshift Spectrum reads from S3 only the columns of a file that are needed for the query. Amazon Redshift Spectrum supports predicate pushdown (also called predicate filtering).
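To see both optimizations at work, consider a query that references only a couple of columns of the partitioned Parquet table defined earlier. This is an illustrative sketch (the filter value is arbitrary): only the referenced columns are read from S3, and the filter on the partition column limits the scan.

select l_shipmode, sum(l_extendedprice) as total_price
from spectrum.lineitem_part_parq
where l_shipdate = '1998-09-01'
group by l_shipmode;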
Amazon Redshift Spectrum charges you by the amount of data that is scanned from S3 per query. Parquet stores data in a columnar format, so Amazon Redshift Spectrum can eliminate unneeded columns during the scan. As an example, it would be interesting to compare the difference in query performance between CSV text files and Parquet partitioned files.
Various tests have shown that partitioned Parquet files not only perform faster, but are also much more cost-effective than non-partitioned row-based CSV files.
You can use SVL_S3QUERY_SUMMARY to gain insight into several interesting S3 metrics for the query that uses partitioned Parquet files:
select * from SVL_S3QUERY_SUMMARY where query=<query-id>;
Pay special attention to two interesting metrics: s3_scanned_rows and s3query_returned_rows. Compared to CSV files, you will notice a tremendous reduction in the amount of data that is returned from Amazon Redshift Spectrum to native Amazon Redshift for final processing.
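To make the comparison concrete, you can pull the key scan metrics for a CSV-based run and a Parquet-based run of the same query side by side. The following sketch assumes you have both query IDs at hand (the IDs are placeholders):

select query, segment, s3_scanned_rows, s3_scanned_bytes,
       s3query_returned_rows, s3query_returned_bytes, files
from svl_s3query_summary
where query in (<csv-query-id>, <parquet-query-id>)
order by query, segment;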
4. Partition Parquet files on frequently used columns
When you’re deciding on the optimal partition columns, consider the following:
Columns that are used as common filters are good candidates.
Excessively granular partitioning adds time for retrieving partition information. However, it can help in partition pruning and reduce the amount of data scanned from S3.
Actual performance varies depending on file placement, query pattern, file size distribution, number of files in a partition, number of qualified partitions, etc.
Measure and avoid data skew on partitioning columns.
One important consideration is that file size distribution should be as uniform as possible. That is, it’s better to have ten 256-MB Parquet files than one 1 GB file and six 256-MB files.
To illustrate the powerful benefits of partition pruning, consider creating two external tables using the Parquet format: one table is not partitioned, and the other is partitioned at the day level.
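The day-partitioned table is the spectrum.LINEITEM_PART_PARQ table defined earlier. A non-partitioned counterpart can be sketched along the following lines; the table name and S3 location are illustrative placeholders:

CREATE external table spectrum.LINEITEM_NOPART_PARQ (
  L_ORDERKEY BIGINT, L_PARTKEY BIGINT, L_SUPPKEY BIGINT, L_LINENUMBER INT,
  L_QUANTITY DECIMAL(12,2), L_EXTENDEDPRICE DECIMAL(12,2), L_DISCOUNT DECIMAL(12,2),
  L_TAX DECIMAL(12,2), L_RETURNFLAG VARCHAR(128), L_LINESTATUS VARCHAR(128),
  L_SHIPDATE VARCHAR(128), L_COMMITDATE VARCHAR(128), L_RECEIPTDATE VARCHAR(128),
  L_SHIPINSTRUCT VARCHAR(128), L_SHIPMODE VARCHAR(128), L_COMMENT VARCHAR(128))
stored as PARQUET
location 's3://<your-bucket>/<prefix>/lineitem_parq/';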
Scanning a partitioned external table can be 2x to 4x faster than scanning a non-partitioned external table.
How do you know if “partition-pruning” is in effect? You can use the following SQL to analyze the effectiveness of partition pruning. If the query touches only a few partitions, you can verify if everything behaves as expected:
SELECT query, segment,
       max(assigned_partitions) as total_partitions,
       max(qualified_partitions) as qualified_partitions
FROM svl_s3partition
WHERE query=<query-id>
GROUP BY 1,2;
Best practices for cluster configuration
5. Optimize Redshift Spectrum performance with the right Amazon Redshift cluster configuration
There are two levels of Amazon S3 request parallelism control for an Amazon Redshift Spectrum query:
Query level (up to 10 per query per slice)
This number depends on how many concurrent queries are running
Works to limit the number of threads used to support S3 scanning
Node level (for all S3 queries that run on a node; value varies for node type)
The more powerful the node type, the higher the limit
The simple math is as follows: when the total number of files for a query is less than or equal to the query-level parallelism (up to 10 per slice) multiplied by the total number of slices in your cluster, the cluster already has enough request parallelism to scan all of the files concurrently, and adding more nodes is unlikely to speed up the Redshift Spectrum portion of the query.
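To check how this plays out for a given query, you can look at the file count and the S3 request parallelism that Redshift Spectrum actually achieved. This is a sketch; the query ID is a placeholder:

select query, files, splits, avg_request_parallelism, max_request_parallelism
from svl_s3query_summary
where query = <query-id>;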
