Matillion ETL understands that you may have multiple schema. Each Matillion ETL environment uses a default schema.
The default schema is used by Matillion ETL in the following ways:
- It's the default location that tables are searched against within components that read and write data, such as Table Input.
- The default schema is where Matillion ETL will write its own internal views that represent transformation components.
Although powerful, using multiple schemas adds an additional layer of complexity.
When creating your first project, an environment is created for you. An environment describes your connection to where your tables are stored, and includes a default schema. Often, this is called public.
- If you have no particular need for multiple schema, leave this as public and ignore multiple schema.
- In any component that allows you to specify a schema, just leave the default setting of [Environment Default].
If you choose to change the default schema, you only have to change it at one place (the environment) and everything will continue to work in the new schema if you retain the [Environment Default] variable value.
When a component asks for a schema, it usually also needs a table (or tables, too). The schema defines which set of tables are visible.
Single-schema per environment setup
This setup balances the power of multiple schema with the ease of use of Matillion ETL. It is particularly useful when you want to separate the different environments within a single database, but expect all the same database tables to exist in each schema.
There is always a currently selected environment that can be changed at any time in the Environment panel. Many operations will use the currently selected environment—and therefore its default schema—unless you choose to use a different environment.
Each environment can (and should) use a different default schema.
You can run a job in any environment. You can design your ETL jobs and run them against a test environment, and when fully tested and production-ready, you can schedule them to run in a staging or production environment.
The environment that you select to run a job in is available for configuration widely across the application.
Component-level explicit schema
This setup can be useful if you want to use the "single-schema per environment" setup outlined above, with the exception that all of those environments can share some of their data available in another schema. For example, source data may be staged into a single schema, which can then be run through to a test or production environment as above.
In this case, follow the setup for the "single-schema per environment", but when reading or writing particular tables, override the default schema within the component.
Once this is configured, you may run the job in any environment, but the selected schema will always be used for that component, regardless of the default value set in the environment.
You must ensure the database user specified in the environment has the necessary access to read/write data in those additional schemas.
Component-level explicit schema using variables
One common best practice that Matillion customers employ is to use three schemas arranged as shown in the diagram below.
This approach gives you separation of concerns between the schemas and allows security grants to be applied at schema level. It means that the views created by Matillion ETL do not "pollute" the staging data schema and are not visible to analysis users using a data warehouse analysis tool. Setting this up is best achieved with variables.
The downside of using component-level explicit schema is that once they are selected, they are fixed. If you ever rename a schema, or use multiple clusters or multiple databases within a cluster, your schemas may not be named consistently and your job may fail when run against another cluster.
For this, you must use a second level of indirection using variables. All environments, in addition to specifying the database connection details and default schema, carry a set of variables that can be referenced in many places. Define a variable to use as the schema using the Project → Manage Environment Variables dialog.
Then use the variable name when referencing a schema in a component:
The schema is a dynamic property, and takes it values from the variable value associated with the environment the job runs in.
Amazon Redshift: Creating external schemas
Using external tables requires the availability of Amazon Redshift Spectrum.
Currently supported regions are us-east-1 (N. Virginia), us-east-2 (Ohio), us-west-2 (Oregon), eu-west-1 (Ireland), and ap-northeast-1 (Tokyo). Read Amazon Redshift endpoints and quotas to learn more about supported regions. To learn how to connect Amazon Redshift Spectrum to your Matillion ETL instance, read Getting Started with Amazon Redshift Spectrum.
Please follow the steps below to create an external schema:
- Expand the Environments list in the lower-left of the app.
- Right-click on the intended environment (one that is associated with the Redshift cluster you have previously enabled Amazon Redshift Spectrum policies on).
- Click Create External Schema.
Specify details required to create the new external schema.
- External Schema: Enter a name for your new external schema. Must be unique.
- Data Catalog: Add the name of your Athena data catalog. A new catalog will be created if the name entered does not correspond to an existing Athena catalog associated with your AWS account.
- Role ARN: Add the Role ARN of the role used to allow Amazon Redshift Spectrum access to your EC2 instance.
Now, components within Matillion ETL that make use of external tables (and thus, Amazon Redshift Spectrum) can be used, providing they use this external schema.
Any tables that exist within the linked Athena Catalog will be instantly available for use in Matillion ETL through the external schema that links to them.
Refresh Source Schema
Sometimes you might change a schema that Matillion ETL is using outside of the client.
Scenarios may occur where you wish to effect change to your schema while outside Matillion ETL. An example of this would be creating a new table using console commands and then locating that table with a data staging component, only to find it does not appear. This is due to Matillion ETL using a cache of table data for many components.
To sync the Matillion ETL client with your platform and resolve this issue, right-click the component in question and select Refresh Source Schema.
Not all components make use of Refresh Source Schema.