Building a Data Vault
This article demonstrates how to implement a simple Data Vault model using Matillion ETL. A publicly available civil aviation dataset is used in the examples throughout, and is available for download as an attachment. This article illustrates a Matillion implementation of the following three-tier data architecture.
The three data tiers in Redshift are:
- Staging - transient data, loaded either fully or incrementally from external sources.
- Data Vault - the middle tier data structures (i.e. the enterprise data warehouse).
- Star Schema - the data presentation layer, for use in reporting and analytics.
In this case we’re using Data Vault techniques to model the middle tier, rather than an alternative such as 3rd normal form (3NF).
So what is Data Vault, and what makes it a good choice for this purpose?
Data Vault overview
Data Vault is a data modeling approach, typically used to design the relational tables in an enterprise data warehouse. It’s actually a specialised form of 3NF modeling, which requires that all objects be decomposed into their fundamental parts, and which assumes that all relationships may be many-to-many.
This approach addresses some of the limitations of other commonly used approaches:
- 2-tier (“Kimball” style) - Known for inflexibility and higher maintenance costs.
- 3NF middle tier - Can be difficult to adapt to changes in the business, especially when source systems are altered. Can also become abstract and difficult to interpret.
Data Vault’s primary focus is on the core concepts of the underlying business. In this aviation example they would include “flight”, “airport”, “passenger” and “aircraft”. Every core concept is decomposed into its fundamental parts, and these parts are defined by three kinds of relational database tables: Hubs, Links and Satellites.
Data Vault Hub tables
Hub tables represent a core business concept. Records are identified by a natural business key, expressed in a way that anyone working in the business could understand.
An “aircraft” for example would be uniquely identified by its tail number. Other hub tables might have a multi-part identifier. In every case, the natural business key allows core entities to be handled without ambiguity.
Hub tables are represented by a blue square.
The only other columns on a Hub table are:
- A surrogate key (to simplify joins).
- The timestamp when the entity was first recorded.
- The name of the source system from which it originated.
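Putting those columns together, a Hub might look like the following sketch (the table and column names are illustrative, not taken from the attached implementation):

```sql
-- Hypothetical Hub table for the "aircraft" concept.
CREATE TABLE h_aircraft (
    aircraft_hub_key INTEGER      NOT NULL,  -- surrogate key, to simplify joins
    tail_number      VARCHAR(16)  NOT NULL,  -- the natural business key
    load_timestamp   TIMESTAMP    NOT NULL,  -- when the entity was first recorded
    record_source    VARCHAR(64)  NOT NULL   -- the source system it originated from
);
```

Note how narrow the table is: a Hub carries nothing but the business key and its audit columns.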
Data Vault Link tables
Link tables represent relationships between hubs. Data Vault permits any relationship to be many-to-many, so Link tables are exactly like 3NF intersection (or “associative”) entities. Records are uniquely identified by two or more Hub table surrogate keys.
One flight, for example, would always be taken by exactly one aircraft. In a 3NF model you’d expect the aircraft ID to be a foreign key column on the flight table. But what if an aircraft had to be substituted due to mechanical problems? Modeling this in 3NF becomes more complex, maybe requiring abstract concepts such as a “flight-aircraft-relationship-category”, with data quality rules that are difficult to enforce. But in Data Vault it’s as simple as having two Link records for one flight, differentiated by some contextual Satellite data.
Link tables are represented by a red hexagon.
The only other columns on a Link table are:
- A surrogate key (to enable the addition of a satellite if necessary).
- The timestamp when the link was first recorded.
- The name of the source system from which the information originated.
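A Link between the flight and aircraft Hubs might therefore be sketched like this (names are illustrative):

```sql
-- Hypothetical Link table relating flights to aircraft.
-- Uniqueness comes from the combination of the two Hub surrogate keys.
CREATE TABLE l_flight_aircraft (
    flight_aircraft_link_key INTEGER     NOT NULL,  -- surrogate key for the link itself
    flight_hub_key           INTEGER     NOT NULL,  -- reference to the flight Hub
    aircraft_hub_key         INTEGER     NOT NULL,  -- reference to the aircraft Hub
    load_timestamp           TIMESTAMP   NOT NULL,  -- when the link was first recorded
    record_source            VARCHAR(64) NOT NULL   -- the source system it came from
);
```

An aircraft substitution then simply becomes a second row for the same flight_hub_key, with a later load_timestamp.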
Data Vault Satellite tables
Satellite tables are used to record all the “context” (or “attributes”) of a Hub or Link, and are always related to exactly one Hub or Link table. For example a Satellite table could record the departure and arrival times of a flight, both scheduled and actual. For an aircraft-flight Link table, a satellite could record whether this was the original or a substitute aircraft.
It’s fine to have multiple Satellite tables for a Hub, and this is especially useful when data is sourced from multiple places. Similarly it’s easy to add new Satellite tables when source systems change, rather than having to face the typically expensive 3NF problem of adding new sparse columns and having to back-fill data.
Satellite tables are represented by a yellow square.
The only other columns on a Satellite table are:
- The timestamp when the information was recorded.
- The name of the source system from which the information originated.
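A Satellite holding flight context could be sketched as follows (the attribute columns are illustrative; a real implementation would carry whatever context the source supplies):

```sql
-- Hypothetical Satellite holding the contextual attributes of a flight.
CREATE TABLE s_flight (
    flight_sat_key      INTEGER     NOT NULL,  -- surrogate key for this version
    flight_hub_key      INTEGER     NOT NULL,  -- reference to the flight Hub
    scheduled_departure TIMESTAMP,
    actual_departure    TIMESTAMP,
    scheduled_arrival   TIMESTAMP,
    actual_arrival      TIMESTAMP,
    delayed_flag        CHAR(1),               -- derived by a business rule
    load_timestamp      TIMESTAMP   NOT NULL,  -- when this version was recorded
    record_source       VARCHAR(64) NOT NULL   -- the source system it came from
);
```

Each change to a flight's context becomes a new row, so the Satellite accumulates full history over time.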
Two of the core business concepts in aviation are Aircraft and Airport. These are clearly important, widely used and understood, easy to identify, and so should be modeled as Data Vault Hub entities.
Every flight always involves one aircraft, and is always scheduled to leave from a departure Airport, bound for an arrival Airport. So it would be possible to model the “flight” area as a Link table joining the Aircraft and Airport hubs, illustrated below:
This would be a good start in a 3NF model, but it rather undermines the importance of the “flight” concept. How would it link to Passengers? And what if the flight was diverted or cancelled? Passengers who have paid for a seat would not want to hear that the “flight” they booked no longer exists!
There are many occasions when an entity could be modelled as either a Hub or a Link. But in this case, a “flight” probably really is a core business concept, and should therefore be modelled as a Hub. With the Link and Satellite (context) tables added, the flight “ensemble” would look more like the example below:
Now that the “data warehouse” part of the data model is established, let’s consider how to present the data for use in reporting.
Presentation Layer data model
Focusing on the “flight” area, what would be the best design for a star schema to support queries?
There’s a clear justification for a Flight fact table, to support queries like flights by date, by departure or arrival airport, by aircraft type etc. The Flight fact table would include core attributes such as the arrival and departure times. It would very likely also include calculated columns, for example a “delayed” flag. The calculated columns would be set according to business rules (such as actual vs scheduled arrival or departure time), and would be perfect for enabling analysis or measuring KPIs.
However, from a passenger’s point of view, a “flight” is more like a dimension, involved in queries such as passengers by route. If you’ve already got a flight fact in a 2-tier data architecture then there’s no flexibility: you’re stuck maintaining both the fact and a new “dimension” version of the same data. You also bear the cost of maintaining the calculated columns in two places (for example, to answer how many passengers were on delayed flights). This introduces the risk that the rule defining “delayed” is implemented differently in each place.
The good news is that by deriving the flight’s “delayed” flag in the Data Vault layer, it’s available for use in any number of different star schema configurations. This would include both Type 1 and Type 2 (time variant) versions of the same data.
All the star schema objects can be derived from the Data Vault structures. The star schema design itself (a bus matrix) can be altered as needed without significant rework.
In fact, we don’t even need to physically materialise the star schema: it can simply be made up of a layer of views.
In common with most Matillion implementations, there is a standalone job to be run once only, which creates the permanent Redshift objects:
- Data Vault
- Star schema
For ease of reading, the Data Vault tables are grouped into Hubs, Links and Satellites. Once these have been created, the star schema views which use them can be added.
Since the Hub and Link tables are very narrow, there’s a strong argument for distributing these across all nodes rather than by the hub surrogate key. This is a tuning option that you should consider when planning the physical layout of the Data Vault.
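In Redshift terms, distributing a table across all nodes means creating it with DISTSTYLE ALL, which replicates the full table to every node so joins against it never need data redistribution. A sketch (table name illustrative):

```sql
-- Replicate the narrow Airport Hub to every node so that joins to it
-- are always co-located. Suitable because Hub tables stay very narrow.
CREATE TABLE h_airport (
    airport_hub_key INTEGER     NOT NULL,
    airport_code    VARCHAR(8)  NOT NULL,  -- natural business key
    load_timestamp  TIMESTAMP   NOT NULL,
    record_source   VARCHAR(64) NOT NULL
)
DISTSTYLE ALL;
```

The much larger Satellite tables, by contrast, are better candidates for key distribution on the Hub surrogate key.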
Extract and Load
The “EL” part of “ELT” is to bring the data into Redshift from source. In this example we’ll use Matillion ETL’s sample flight and plane data set, which is available in S3.
Your own implementation will almost certainly be an incremental load, and probably taken from a variety of data sources.
After the Extract and Load have completed, the current (or incremental) data is available as database tables in the Staging layer. It can now be transformed into shape using Matillion Transformation jobs.
In this example we make no distinction between “raw” and “business” Data Vault structures: the derivations and business rules are applied at a single point, during initial data preparation. So the stages are:
- Hub load
- Link load
- Satellite load
This transformation involves two main steps:
- Ensuring that it’s fit for ingestion into Data Vault - i.e. with clear business keys, no duplicates, and business rules enforced to set any missing values
- Derivation of extra information, for example the “delayed” flag on every flight
The Flight data preparation is fairly typical in this respect, using the Matillion Calculator to tidy up datatypes and apply business rules. Deduplication is performed using Redshift’s ROW_NUMBER() analytic function inside a Calculator component.
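The de-duplication pattern that the Calculator component generates can be expressed in plain SQL along these lines (table and column names are illustrative):

```sql
-- De-duplicate staged flight rows: keep exactly one row per business key,
-- preferring the most recently loaded record.
SELECT *
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (
               PARTITION BY flight_number, flight_date  -- the business key
               ORDER BY load_timestamp DESC             -- newest first
           ) AS rn
    FROM stg_flights s
)
WHERE rn = 1;
```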
The Airport data load illustrates how to use a Unite component to let Redshift do the hard work of de-duplication.
All Hub tables can be loaded in parallel since Data Vault guarantees that there are no inter-dependencies between them.
Recall that every Hub table’s purpose is to keep a permanent record of a core business concept. The most important technical factor, then, is to ensure that the business key is recorded correctly and there are no duplicates.
The Airport hub load is a typical example, and illustrates some commonly-used techniques.
Uniqueness is guaranteed by the “New H_Airport” component, which is a SQL EXCEPT transformation, using Redshift to do the hard work of filtering out all Airport records which are already known and present. This ensures that incremental loads don’t create duplicates.
New surrogate keys are created using a combination of an analytic ROW_NUMBER function and a SQL MAX to find the largest surrogate key that’s already present. This removes the need for an identity column, and ensures there are no gaps in the surrogate key sequence.
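The EXCEPT filter and the surrogate key derivation might be sketched together in SQL like this (table and column names are illustrative, not taken from the attached job):

```sql
-- Step 1: EXCEPT keeps only airports not already present in the Hub,
-- so incremental loads cannot create duplicates.
-- Step 2: new surrogate keys continue from the current maximum, with
-- ROW_NUMBER guaranteeing a gap-free sequence without an identity column.
INSERT INTO h_airport (airport_hub_key, airport_code, load_timestamp, record_source)
SELECT (SELECT COALESCE(MAX(airport_hub_key), 0) FROM h_airport)
         + ROW_NUMBER() OVER (ORDER BY n.airport_code) AS airport_hub_key,
       n.airport_code,
       GETDATE()      AS load_timestamp,
       'stg_airports' AS record_source
FROM (
    SELECT airport_code FROM stg_airports
    EXCEPT
    SELECT airport_code FROM h_airport   -- filter out already-known airports
) n;
```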
The Aircraft hub load follows a very similar pattern, but includes a Unite step to include aircraft which are involved in flights, but which are not present in the aircraft reference data.
This illustrates how Matillion deals with late arriving dimensions in a Data Vault model. It neatly avoids the problems of having to make retroactive changes that can occur with a 2-tier data model.
New records in all the Link tables can be added once the Hub updates are complete. Again they can all be loaded in parallel since Data Vault guarantees that there are no inter-dependencies.
Link load transformations typically start with the records which identify the relationship, and join to two (or more) Hub tables to acquire the necessary surrogate keys.
The L_Flight_Aircraft load below shows the Matillion implementation pattern.
If the relationship is likely to change over time, then there’s no problem adding multiple Links between the same Hub tables. The load_timestamp column can always be used to find the latest relationship. Alternatively a Satellite table could be used to provide additional context and full time variance.
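The core of a Link load, expressed as SQL, is the join back to each Hub to acquire surrogate keys (names are illustrative; a real job would also filter out links already present):

```sql
-- Build new Link rows by joining the staged relationship to both Hubs.
INSERT INTO l_flight_aircraft
    (flight_aircraft_link_key, flight_hub_key, aircraft_hub_key,
     load_timestamp, record_source)
SELECT (SELECT COALESCE(MAX(flight_aircraft_link_key), 0) FROM l_flight_aircraft)
         + ROW_NUMBER() OVER (ORDER BY hf.flight_hub_key),
       hf.flight_hub_key,
       ha.aircraft_hub_key,
       GETDATE(),
       'stg_flights'
FROM stg_flights s
JOIN h_flight   hf ON hf.flight_number = s.flight_number   -- pick up flight key
                  AND hf.flight_date   = s.flight_date
JOIN h_aircraft ha ON ha.tail_number   = s.tail_number;    -- pick up aircraft key
```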
The final stage of the Data Vault load, and likely to be physically the largest, is the Satellite table update.
The Flight Satellite load below illustrates the typical Matillion pattern.
There are two main features:
- Join to the Hub table to find its surrogate key.
- Compare against the latest Satellite record to check if there have been any changes.
The order of the loads guarantees that every necessary Hub table is already in place, so the join to the Hub table can be an inner join.
The data comparison uses a Matillion ETL Detect Changes component, comparing the newly-loaded, current columns against the latest version of the Satellite data. Only new or changed records are relevant, so they are filtered before being appended into the S_Flight table. Inserts take advantage of an identity column to provide a unique identifier.
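The same change-detection logic can be sketched in SQL (the Matillion job uses a Detect Changes component for the comparison; table, column and attribute names here are illustrative, and the identity column on S_Flight supplies the surrogate key):

```sql
-- Append only new or changed context rows, by comparing staged data
-- against the latest Satellite version for each flight.
INSERT INTO s_flight
    (flight_hub_key, scheduled_departure, actual_departure, delayed_flag,
     load_timestamp, record_source)
SELECT hf.flight_hub_key, s.scheduled_departure, s.actual_departure,
       s.delayed_flag, GETDATE(), 'stg_flights'
FROM stg_flights s
JOIN h_flight hf                       -- inner join: the Hub row is guaranteed
  ON hf.flight_number = s.flight_number
 AND hf.flight_date   = s.flight_date
LEFT JOIN (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY flight_hub_key
                              ORDER BY load_timestamp DESC) AS rn
    FROM s_flight
) latest
  ON latest.flight_hub_key = hf.flight_hub_key
 AND latest.rn = 1
WHERE latest.flight_hub_key IS NULL                                   -- brand new flight
   OR NVL(latest.delayed_flag, '') <> NVL(s.delayed_flag, '')         -- or changed context
   OR latest.scheduled_departure   <> s.scheduled_departure
   OR latest.actual_departure      <> s.actual_departure;
```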
The resulting data captures how contextual data related to the Hub has changed over time, and enables the automatic creation of time variant dimensions.
Star Schema implementation
The middle-tier Data Vault structures are just ordinary relational database tables, and could be queried directly for reporting purposes. But it’s usually more convenient for end-users to have a star schema presentation layer.
The Data Vault contains enough information to build these structures, and that could easily be accomplished by writing some more Matillion Transformation jobs, and running them once the Data Vault loads had finished.
But there’s actually no need to physically materialise any data at all. This example demonstrates how to set up a layer of views which look like star schema tables, but which in fact have:
- Zero maintenance cost.
- Zero storage costs.
- High flexibility.
This is illustrated by having dimensional views based on the same underlying data, representing three different “viewpoints”: a type 1 dimension, a type 2 dimension and a fact table.
D_Flight: a type 1 dimension
This dimension simply shows the latest contextual information for every flight.
The main technique is to use an analytic function to filter the S_Flight satellite, to only show the most recent data.
Since this guarantees that there’s only one record per Hub, the dimension surrogate key can simply be the Hub’s surrogate key. This information will be used later when populating the fact table.
The example Transformation job uses a Fixed Flow component to add the fixed, special-purpose value -1 meaning “unknown”.
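The whole type 1 dimension therefore reduces to a single view, along these lines (view and column names are illustrative):

```sql
-- Type 1 dimension as a view: the latest Satellite row per flight,
-- keyed by the Hub surrogate key, plus a fixed -1 "unknown" member.
CREATE VIEW d_flight AS
SELECT flight_hub_key AS flight_dim_key,
       scheduled_departure, actual_departure, delayed_flag
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY flight_hub_key
                              ORDER BY load_timestamp DESC) AS rn
    FROM s_flight s
)
WHERE rn = 1
UNION ALL
SELECT -1, NULL, NULL, NULL;   -- the special-purpose "unknown" row
```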
D_Flight_TV: a type 2 dimension
This dimension is time variant, meaning that there are separate records for every version of a flight that has been recorded over time.
The data flow is actually simpler than for the non-time-variant version, because it can simply use all the Satellite records, taking advantage of the associated timestamp to derive the time variance attributes.
A Calculator component contains analytic functions to create three extra time variance columns for every row:
- Version - using ROW_NUMBER, ordered by the record timestamp.
- Valid To - using LEAD, ordered by timestamp again, to subtract one second from the next timestamp in the series.
- Current Flag - using ROW_NUMBER again, and set to Y only if it’s the last update in the sequence.
This technique guarantees that there are no overlapping time ranges, no multiple “current” records, and none of the other data integrity problems that typically bedevil manually maintained time-variant dimensional structures.
Because there may be multiple versions over time, the dimension surrogate key must be the Satellite’s surrogate key this time.
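The three time-variance columns described above might be derived in SQL as follows (view and column names are illustrative, not taken from the attached job):

```sql
-- Type 2 (time-variant) dimension view: one row per Satellite version,
-- keyed by the Satellite surrogate key.
CREATE VIEW d_flight_tv AS
SELECT flight_sat_key AS flight_tv_dim_key,
       flight_hub_key,
       scheduled_departure, actual_departure, delayed_flag,
       -- Version: sequence number within each flight's history
       ROW_NUMBER() OVER (PARTITION BY flight_hub_key
                          ORDER BY load_timestamp) AS version,
       load_timestamp AS valid_from,
       -- Valid To: one second before the next version (NULL for the latest)
       DATEADD(second, -1,
               LEAD(load_timestamp) OVER (PARTITION BY flight_hub_key
                                          ORDER BY load_timestamp)) AS valid_to,
       -- Current Flag: 'Y' only for the last update in the sequence
       CASE WHEN ROW_NUMBER() OVER (PARTITION BY flight_hub_key
                                    ORDER BY load_timestamp DESC) = 1
            THEN 'Y' ELSE 'N' END AS current_flag
FROM s_flight;
```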
F_Flight: a fact table
Creating a Flight fact table from the same Data Vault structures follows a similar pattern to the dimensions. The H_Flight table defines the granularity, guaranteeing that there will always be exactly one fact record per flight.
Two different join techniques are needed to set up the foreign keys to the dimensions, depending on whether or not the dimension is time variant.
For a time-variant dimension such as D_Flight_TV, it’s necessary to use the latest Satellite record, remembering that we earlier used the Satellite table’s surrogate key as the dimension key.
To establish links to the other, non-time-variant dimensions, it’s sufficient to left-join to the relevant Data Vault Link table. This contains the associated Hub surrogate key which was used earlier as the dimension key.
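Combining the two join techniques, the fact view might be sketched like this (names are illustrative, and only a couple of dimension keys are shown):

```sql
-- Fact view at one row per flight: granularity comes from H_Flight.
-- The time-variant dimension key is the latest Satellite surrogate key;
-- the non-time-variant aircraft key comes via the Link table.
CREATE VIEW f_flight AS
SELECT hf.flight_hub_key,
       COALESCE(latest.flight_sat_key, -1) AS flight_tv_dim_key,  -- -1 = unknown
       COALESCE(la.aircraft_hub_key, -1)   AS aircraft_dim_key,
       latest.delayed_flag
FROM h_flight hf
LEFT JOIN (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY flight_hub_key
                              ORDER BY load_timestamp DESC) AS rn
    FROM s_flight
) latest
  ON latest.flight_hub_key = hf.flight_hub_key
 AND latest.rn = 1
LEFT JOIN l_flight_aircraft la
  ON la.flight_hub_key = hf.flight_hub_key;
```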
At this point, you now have a fully functional three-tier data warehouse, using:
- A Data Vault model as the middle tier.
- A star schema presentation layer, made entirely of views, which requires no storage, loading or maintenance.
The Matillion implementation is attached for download, so you can try out this modelling technique in your own environment. To use it, import the JSON file, then run the following jobs:
- DV DDL (only ever needs to be run once).
- DV Extract and Load data.
- DV AAA Xforms.