Automating Redshift Maintenance

Automating Redshift Maintenance


Overview

Amazon Redshift requires regular maintenance to make sure performance remains at optimal levels.  This regular housekeeping falls on the user as Redshift does not automatically reclaim disk space, re-sort new rows that are added, or recalculate the statistics of tables.  This is done when the user issues the VACUUM and ANALYZE statements.

This article will show you how this regular maintenance can be automated with Matillion ETL. 


Automatic Sorting

Amazon Redshift stores table data on disk in sorted order according to a table’s sort keys. The query optimizer and the query processor use the information about where the data is located to reduce the number of blocks that need to be scanned and thereby improve query speed.

If you load your tables in small increments (such as daily updates that represent a small percentage of the total number of rows in the table), running VACUUM regularly will help ensure that individual vacuum operations go quickly.

You can use Matillion ETL for Redshift to automatically find tables that have an unsorted row % between certain thresholds and then execute the VACUUM command for those tables. 

NOTE: If your table has a large unsorted region, a deep copy is much faster than a vacuum.


A deep copy recreates and repopulates a table by using a bulk insert, which automatically sorts the table.

To begin, create an orchestration job with a grid variable to contain the list of tables that we will use the Vacuum Tables Component.

Once you have the grid variable created, you can use the Query Result to Grid Component to query the svv_table_info view using the following query, then map to the grid variable previously created using the following query.

SELECT "table" FROM svv_table_info WHERE unsorted BETWEEN 10 AND 60

This will select tables that require sorting whilst excluding tables which may benefit from a deep copy instead.

Connect the Query Result to Grid Component  to a Vacuum Tables Component and configure the properties to use the grid variable table as input for the Tables to Vacuum parameter and set the Vacuum Options to SORT ONLY as shown in the screenshot below.


 

Automatic Reindex

The query engine is able to use sort order to efficiently select which data blocks need to be scanned to process a query. For an interleaved sort, Amazon Redshift analyzes the sort key column values to determine the optimal sort order. If the distribution of key values changes, or skews, as rows are added, the sort strategy will no longer be optimal, and the performance benefit of sorting will degrade. To reanalyze the sort key distribution you can run a VACUUM REINDEX. The reindex operation is time consuming, so to decide whether a table will benefit from a reindex, build an orchestration to find a list of tables that have an interleaved skew ratio of more than 1.4.

To begin, create an orchestration job with a grid variable to contain the list of tables that we will use in the Vacuum Tables Component .

Once you have the grid variable created, you can use the Query Result to Grid Component to query the svv_interleaved_columns view using the following query, then map to the grid variable previously created using the following query.

SELECT stv_tbl_perm.name as "table"
FROM svv_interleaved_columns, stv_tbl_perm
WHERE svv_interleaved_columns.tbl = stv_tbl_perm.id
AND interleaved_skew > 1.4;

This will select a list of tables that may benefit from reindexing according to Amazon best practice advice.

Connect the Query Result to Grid Component  to a Vacuum Tables Component and configure the properties to use the grid variable table as input for the Tables to Vacuum parameter and set the Vacuum Options to REINDEX as shown in the screenshot below.


 

Automating Analyze

The ANALYZE command updates the statistics metadata, which enables the query optimizer to generate more accurate query plans.  The ANALYZE operation is time consuming, so to decide whether a table will benefit from a reindex, build an orchestration to find a list of tables that have out of date statistics.

To begin, create an orchestration job with a grid variable to contain the list of tables that we will use in the Analyze Tables Component.

Once you have the grid variable created, you can use the Query Result to Grid Component to query the svv_interleaved_columns view using the following query, then map to the grid variable previously created using the following query.

SELECT "table" FROM svv_table_info WHERE stats_off > 10

This will select a list of tables that have out of date statistics.  A value of 0 is up to date while 100 is completely out of date.

Connect the Query Result to Grid Component  to a Analyze Tables Component and configure the properties to use the grid variable table as input for the Tables to Analyze parameter as shown in the screenshot below.

As the VACUUM and ANALYZE statements are time consuming and resource intensive it is recommended to schedule them for periods of low activity on your cluster.

Without regular housekeeping Redshift performance will suffer and having a system in place that will automatically run these maintenance jobs makes managing your cluster that much easier.


For further reading the following resources may be useful: