Schemas

Schemas


Overview

Matillion ETL is aware that you may have multiple schemas. Each Matillion ETL environment specifies a default schema.

The default schema is used by Matillion ELT in the following ways:

  • It is the default place that tables are searched for in components that read and write data such as Table Input.
  • It is the schema where Matillion ETL will write its own internal views that represent transformation components.

Although powerful, using multiple schemas adds an additional layer of complexity which needs to be carefully considered and managed.

Single-schema setup

When creating your first project, an environment is created for you. An environment describes your connection to where your tables are stored, and includes a default schema. Often, this is called ‘public’.

Default Schema

Default Schema

Important Information

  • If you have no particular need for multiple schemas you can simply leave this as ‘public’, ignore multiple schemas.
  • In any component that allows you to specify a specific schema, just leave the default setting of [Environment Default].
Schema - Environment Default

Schema - Environment Default

In case, you decided to change the default schema, you only have to change it at one place (the environment) and everything will continue to work in the new schema, only, if you leave the "Schema" settings as [Environement Default] while configuring properties of the component.

Whenever a component asks for a schema, it will almost always need a table or tables too - the schema defines which set of tables are visible.

Schema

Schema

Single-schema per environment setup

This is the setup favoured internally within Matillion for its own projects and balances the power of multiple schemas with the ease of use of Matillion ETL. It is particularly useful when you want to separate the different environments within a single database, but expect all the same database tables to exist in each schema.

There is always a currently selected environment which can be changed at any time in the environments panel. Many operations will use the currently selected environment, and therefore its default schema, unless you choose to use a different environment.

Selected Environment

Selected Environment

Please Note

Each environment can (and should) use a different Default Schema.

You can run a job in any environment, so you can design your ETLs and run them against a test environment, and when fully tested and production ready you can then schedule them to run in a Staging or Production environment.

The environment to run a job in is available everywhere - from the scheduler, within the SQS queue message, and the right-click menu on a job designer.

Schedule Job

Schedule Job

Component-level Explicit Schema

This setup can be useful if you want to use the “Single-schema per environment” setup outlined above, with the exception that all of those environments can share some of their data available in another schema. For example, source data may be staged into a single schema, which can then be run through to a Test or Production environment as above.

In this case, following the setup for the “Single-schema per environment”, but when reading or writing particular tables override the default schema within the component.

Change Schema

Change Schema

Once this is set, you may run the job in any environment but the selected schema will always be used for that component regardless of the default set in the environment.

Warning

You must ensure the database user specified in the environment has the necessary access to read/write data in those additional schemas.

Component-level Explicit Schema using Variables

One common best practice that Matillion customers employ is to use three schemas arranged as shown in the diagram below.

This approach gives you separation of concerns between the schemas and allows security grants to be applied at schema level. It means that the views created by Matillion ETL do not "pollute" the staging data schema and are not visible to analysis users using a Data Warehouse Analysis tool. Setting this up is best achieved with variables.

The downside of section 3 (using Component-level explicit schemas) is that once they are selected they are fixed. If you ever rename a schema, or use multiple clusters or multiple databases within a cluster, your schemas may not be named consistently and your job may fail when run against another cluster.

For this, you must use a second level of indirection using variables. All environments, in addition to specifying the database connection details and default schema, carry a set of variables which can be referenced in many places. Define a variable to use as the schema using the ProjectManage Environment Variables screen:

Manage Environment Variable

Manage Environment Variable

Then use the variable name when referencing a schema in a component:

Change Schema to Environment Variable

Change Schema to Environment Variable

The schema is a dynamic property, and takes it values from the variable value associated with the environment the job runs in.

New Schema-OK

New Schema-OK

Creating External Schemas

Using external tables requires the availability of Amazon Spectrum. Currently-supported regions are US East (Northern Virginia), US East (Ohio), US West (Oregon), as well as Europe (Ireland) and Asia Pacific (Tokyo). For information on how to connect Amazon Redshift Spectrum to your Matillion ETL instance, see here.

Please follow the steps below to Create External Schemas:

  1. Expand the Environments list at the bottom-left of the client.
  2. Right click on the intended environment (one that is associated with the Redshift cluster we previously enabled Amazon Redshift Spectrum policies on).
  3. Select Create External Schema from the right-click menu.
Create External Schemas

Create External Schemas

Please provide the below details required to create new external schema.

  1. External Schema: Enter a name for your new external schema. Ensure this name does not already exist as a schema of any kind.
  2. Data Catalog: Add the name of your athena data catalog. A new catalog will be created if the name entered does not correspond to an existing Athena catalog associated with your AWS account.
  3. Role Arn:Add the Role ARN of the role used to allow Amazon Redshift Spectrum access to your EC2 instance.
Create External Schemas details

Create External Schemas details

Now components within Matillion that make use of external tables (and thus, Amazon Redshift Spectrum) can be used providing they use this external schema.

Please Note

Any tables that exist within the linked Athena Catalog will be instantly available for use in Matillion ETL through the external schema that links to them.

Refresh Source Schema

Sometimes you might effect change of a schema that Matillion ETL is using outside of the client. An example of this might be creating a new table or view via console commands and then attempting to find that table in a Data Staging component, only to find it does not appear. This is due to Matillion ETL using a cache of table data for many components.

To sync the Matillion ETL client with your platform and resolve this issue, right-click the component in question and select Refresh Source Schema.

Refresh Source Schema

Refresh Source Schema

Please Note

The Refresh Source Schema is not enabled for all the components. Although, It does visible for some of the "Data Staging components".