External Table (BigQuery)

This article is specific to the following platform: BigQuery.

Create External Table

Create a table that references data stored in an external storage system, such as Google Cloud Storage.

For full information on working with tables on Google Cloud Platform, see the official Google Cloud documentation.
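
Conceptually, the component issues the equivalent of a BigQuery CREATE EXTERNAL TABLE statement. The following is a minimal, illustrative sketch only; the project, dataset, table, column, and bucket names are placeholders, not values from this article:

CREATE EXTERNAL TABLE `my-project.my_dataset.my_external_table` (
  id INT64,
  name STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://bucket-name/location/*.csv']
);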



BigQuery Properties

Property Setting Description
Name Text The descriptive name for the component.

This is automatically determined from the table name when the New Table Name property is first set.
Project Text Enter the name of the Google Cloud Platform Project that the table belongs to.
Dataset Text Enter the name of the Google Cloud Platform Dataset that the table belongs to.
New Table Name Text The name of the new external table to be created.
Table Metadata Column Name The name of the new column.
Data Type For more information on available BigQuery data types please refer to the GCP documentation.
String: this type can hold any kind of data, subject to a maximum size.
Integer: this type is suitable for whole-number types (no decimals).
Float: this type is suitable for numeric types, with or without decimals.
Numeric: this data type is suitable for data of an exact numeric value, allowing 38 digits of precision and 9 decimal digits of scale.
Boolean: this type is suitable for data that is either 'true' or 'false'.
Date: a formatted date object without time. See the GCP documentation.
Time: a formatted time object without date. See the GCP documentation.
DateTime: a formatted timestamp containing both date and time that is easily readable by the user. See the GCP documentation.
Timestamp: this type is a timestamp left unformatted (exists as Unix/Epoch Time).
Mode The field mode. Default is 'NULLABLE'.
NULLABLE: Field allows null values
REQUIRED: Field does not accept null values
REPEATED: Field can accept multiple values
Table Metadata (Nested) When the Define Nested Metadata checkbox is ticked inside the Table Metadata property, a tree structure can be defined for metadata. See Example 2 below for details.
Create/Replace Select Create: The default option. Creates a new table; this will generate an error if a table with the same name already exists, but will never destroy existing data.
Create if not exists: This will only create a new table if a table of the same name does not already exist. It will not destroy existing data. If the schema of the existing table does not match the schema defined in this component, no attempt is made to fix or correct it, which could lead to errors later in the job if a table unexpectedly already exists, or has a schema that differs from the one defined in this component.
Replace: This drops any existing table of the same name and then creates a new table. This guarantees that, after the component succeeds, the table matches the schema defined in this component; however, any existing data in the table will be lost. Note: since other database objects may depend upon this table, drop ... cascade is used, which may actually remove many other database objects.
Google Storage URL Location Select The URL of the Google Storage bucket to get the files from. This follows the format gs://bucket-name/location, where location is optional.
Compression Select Whether the input file is compressed in GZIP format or not compressed at all.
File Format Select Cloud Datastore Backup
CSV
JSON (New line delimited): this requires an additional "JSON Format".
Number of Errors Allowed Text The maximum number of individual parsing errors tolerated before the whole load fails. Invalid values, up to this limit, will be substituted with nulls. This value defaults to 0.
Ignore Unknown Values Select Yes: Accept rows that contain values that do not match the schema. Unknown values are ignored. Will ignore extra values at the end of a line for CSV files.
No: Omit any rows with invalid values.
Delimiter Select The delimiter that separates columns. The default is a Comma. A [TAB] character can be specified as "\t".
CSV Quoter Text Specifies the character to be used as the quote character when using the CSV option.
Encoding Select The encoding the data is in. This defaults to UTF-8.
Header Rows To Skip Text The number of rows at the top of the file to ignore - defaults to 0.
Allow quoted newlines Select Yes: Allow a CSV value to contain a newline character when the value is encased in quotation marks.
No: A new line character, regardless of quotations, is always considered a new row.
Allow Jagged Rows Select Yes: Missing values are treated as 'null' but accepted.
No: Rows with missing data are treated as bad records. Note: a bad record will count toward the 'Number of Errors Allowed' count.
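
Many of the properties above map directly onto options of BigQuery's CREATE EXTERNAL TABLE DDL. The following is an illustrative sketch of that mapping for a CSV source; the names, column list, and option values are placeholders rather than settings taken from this article:

CREATE EXTERNAL TABLE `my-project.my_dataset.my_external_table` (
  id INT64,
  name STRING
)
OPTIONS (
  format = 'CSV',                          -- File Format
  uris = ['gs://bucket-name/location/*'],  -- Google Storage URL Location
  compression = 'GZIP',                    -- Compression
  field_delimiter = ',',                   -- Delimiter
  quote = '"',                             -- CSV Quoter
  encoding = 'UTF-8',                      -- Encoding
  skip_leading_rows = 0,                   -- Header Rows To Skip
  allow_quoted_newlines = FALSE,           -- Allow quoted newlines
  allow_jagged_rows = FALSE,               -- Allow Jagged Rows
  max_bad_records = 0,                     -- Number of Errors Allowed
  ignore_unknown_values = TRUE             -- Ignore Unknown Values
);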

Example

In this example, we will be referencing data that is held in a Google Cloud Storage bucket. Buckets, along with the data they hold, can be viewed, managed, and created through the GCP console.

To begin, we use the data staging component 'Jira Query' to load our desired data into a table, then unload it (as a CSV) to a specific place on a storage bucket where it will be held. Since we only want to query this large data set very occasionally, it is prudent to avoid staging it to a BigQuery table, so we use an external table instead. Thus, after unloading the data, we attach the External Table component before any transformations take place. The job is shown below.

The External Table component is used to create the external table that will reference our data. Its properties are shown below. Since the data was unloaded in CSV format, we prepare the External Table component to expect the same. It may be necessary to make use of the 'Ignore Unknown Values' and 'Number of Errors Allowed' properties if your CSV file is unusually formatted.

In the 'Table Metadata' property, we have provided column names and data types so that the component can correctly assess the data found in the CSV.

Now this data can be queried and sampled using a Transformation job. Sampling the data provides a quick test to ensure the external table has been created correctly and is referencing the data.
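
Once created, the external table can be queried like any ordinary table. A quick sample amounts to a simple query such as the following (the table name is a placeholder):

SELECT *
FROM `my-project.my_dataset.my_external_table`
LIMIT 10;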



Example 2 - Nested Tables

To load structured data (such as a JSON file), it is necessary to create an external table capable of handling that structure. This can be done by ticking the 'Define Nested Metadata' checkbox in the 'Table Metadata' property.

Below is a snippet of a newline-delimited JSON file that contains nested data. There are four top-level records, each consisting of a single field named "s" that contains a nested set of columns: "col1", an integer, and "col2", a string.

{"s": {"col1": 1, "col2": "one"}}
{"s": {"col1": 2, "col2": "two"}}
{"s": {"col1": 3, "col2": "three"}}
{"s": {"col1": 4, "col2": "four"}}

Loaded into a table as-is, this data would be of severely limited use in transformations, given its format. However, the Create External Table component can have a nested structure defined in the Table Metadata property by checking the Define Nested Metadata box.

To begin, we add a new structure by right-clicking the Columns structure and selecting Add. In the new menu that appears, we specify that our new Column Type is to be a structure, and name it as we like. In this case, we name it "s" to match our rather arbitrary JSON.

Note: Struct, Array, and Field names MUST match those in the JSON, so that data can be mapped correctly. Failing to do so is unlikely to cause an error message, but will cause Matillion ETL to overlook the data in the source files.

Note: Similar to the above, not all columns in the source JSON need to be defined, and users are free to be selective over the data they include in the external table.


Now that we have added the 's' structure to our table, we need to add the data nested inside it. To do so, right-click the 's' structure we just created and again click Add. This time, we select Field as the column type and specify what data type to expect. We repeat this process for each column to be added.


Since we added those columns to our 's' structure, they are nested within it in our metadata, matching the structure of the JSON. We're now ready to complete the configuration for the new External Table.

The metadata of our new table now matches the JSON from earlier, and the data can be loaded.
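
For reference, the table defined in this walkthrough corresponds broadly to the following BigQuery DDL. This is an illustrative sketch only; the project, dataset, table, and bucket names are placeholders:

CREATE EXTERNAL TABLE `my-project.my_dataset.nested_external_table` (
  s STRUCT<
    col1 INT64,   -- Field 'col1', Integer
    col2 STRING   -- Field 'col2', String
  >
)
OPTIONS (
  format = 'NEWLINE_DELIMITED_JSON',
  uris = ['gs://bucket-name/location/*.json']
);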