Cloud Storage Load

Allows a user to load data stored on the Google Cloud Storage service into an existing BigQuery table. Many of the configuration settings on this component have sensible defaults, mirroring the defaults provided by BigQuery when an option is not specified.

Note: This component requires working Google Cloud Storage Credentials with read access to the source data file(s).
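For orientation, the operation this component performs corresponds to a standard BigQuery load job reading from a gs:// URI. The minimal sketch below uses the google-cloud-bigquery Python client; the project, bucket path, and table names are hypothetical, and it is illustrative only rather than a description of how the component is implemented.

```python
from google.cloud import bigquery

# Minimal load of CSV files from Cloud Storage into an existing table.
# The project, bucket, and table names below are placeholders.
client = bigquery.Client(project="my-project")

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/*.csv",               # Google Storage URL Location
    "my-project.my_dataset.my_table",             # existing target table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,  # File Format
        skip_leading_rows=1,                      # Header rows to skip
    ),
)
load_job.result()  # wait for the load job to complete
```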


Properties

Property | Setting | Description
Name | Text | The descriptive name for the component.
Project | Text | The Google Cloud project in which to find or create the target table.
Dataset | Select | Select the table dataset. The special value, [Environment Default], will use the dataset defined in the environment. For more information on Google Cloud Datasets, visit the official documentation.
Target Table | Select | Select an existing table to load data into.
Load Columns | Select Multiple | One or more columns that exist in the target table.
Google Storage URL Location | Text/Tree | The URL of the Google Cloud Storage bucket to get the files from. This follows a format such as gs://bucket-name/location, where location is optional.
File Format | Select | The type of data expected to load. Some formats may require additional formatting, explained in the GCP Documentation. Available options are: Cloud Datastore Backup, CSV, JSON (New line delimited), ORC, and PARQUET (see here for details). Component properties will change to reflect the choice made here, offering options specific to the chosen file type.
Write Preference | Select | Append to Table: append the loaded data to an existing table. Overwrite Table: overwrite an existing table with the loaded data. Write if Empty: only write the loaded data if the target table is empty.
Number of Errors Allowed | Text | The maximum number of individual parsing errors tolerated before the whole load fails. Errors up to this limit are substituted as null values. This value defaults to 0.
Ignore Unknown Values | Select | Yes: accept rows that contain values that do not match the schema; unknown values are ignored. Note: for CSV files, extra values at the end of a line are ignored. No: omit any rows with invalid values.
Delimiter | Text | The delimiter that separates columns. The default is a comma. A [TAB] character can be specified as "\t".
CSV Quoter | Text | The quote character used to enclose records. The default (field left blank) is a double quote ("). To indicate no quote character at all, use an empty string.
Encoding | Select | The encoding type of the data to be loaded. This defaults to UTF-8.
Header rows to skip | Text | The number of rows at the top of the file to ignore. Defaults to 0.
Allow Quoted Newlines | Select | Yes: allow a CSV value to contain a newline character when the value is enclosed in quotation marks. No: a newline character, regardless of quotation marks, is always considered a new row.
Allow jagged rows | Select | Yes: missing values are treated as 'null' but accepted. No: rows with missing data are treated as bad records. Note: a bad record will count toward the 'Number of Errors Allowed' limit.
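These properties correspond closely to the options of a BigQuery load job configuration. As an illustration only, the sketch below shows one way the CSV-related settings above could be expressed with the google-cloud-bigquery Python client, using the defaults described in the table; it is not the component's actual implementation.

```python
from google.cloud import bigquery

# Hypothetical mapping of the component's CSV properties onto a BigQuery
# load job configuration. Values shown mirror the defaults described above.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,                   # File Format
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # Write Preference: Append to Table
    max_bad_records=0,                                         # Number of Errors Allowed
    ignore_unknown_values=True,                                # Ignore Unknown Values: Yes
    field_delimiter=",",                                       # Delimiter
    quote_character='"',                                       # CSV Quoter
    encoding="UTF-8",                                          # Encoding
    skip_leading_rows=0,                                       # Header rows to skip
    allow_quoted_newlines=False,                               # Allow Quoted Newlines: No
    allow_jagged_rows=False,                                   # Allow jagged rows: No
)
```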

Example

In this example, we have a table of email data that we wish to back up to a bucket for long-term storage. We also want to create a copy of the table to transform, leaving the original intact. One of the many ways to do this is to unload the table to a bucket using the Cloud Storage Unload component, then reload that data into a new table using the Cloud Storage Load component. The job layout is shown below.

The table data is unloaded by the Cloud Storage Unload component and a new table is created using a Create Table component. Now we can look at the Cloud Storage Load properties, shown below. Many properties have been set automatically and we need only edit a few. It is important to mirror the details with which we unloaded the data - namely, a CSV file with a header row that should be ignored. We load all available columns (note that these must be set in the Create Table component!) and point the component at the 'Target Table', which will become our copied table.
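For comparison, the same unload-and-reload round trip can be sketched with the google-cloud-bigquery Python client. The project, dataset, and bucket names below are hypothetical, as is 'emails' for the original table; 'doc_email' is the new copy created in this example.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")              # hypothetical project
bucket_uri = "gs://my-bucket/backup/emails-*.csv"            # hypothetical bucket path

# 1. Unload the source table to Cloud Storage (the Cloud Storage Unload step).
#    'emails' is a hypothetical name for the original table of email data.
client.extract_table("my-project.my_dataset.emails", bucket_uri).result()

# 2. Create the empty 'doc_email' copy with the same schema (the Create Table step).
source = client.get_table("my-project.my_dataset.emails")
copy = bigquery.Table("my-project.my_dataset.doc_email", schema=source.schema)
client.create_table(copy)

# 3. Reload the exported files into the new table (the Cloud Storage Load step),
#    skipping the header row written during the unload.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
)
client.load_table_from_uri(bucket_uri, copy, job_config=job_config).result()
```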

The job can now be run, unloading the data and then reloading it into our new 'doc_email' table. We can use a Table Input component in a Transformation job to sample the data and confirm that the entire workflow has worked as expected.
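Equivalently, a quick sample query against the new table, similar to what the Table Input component runs when sampling, can confirm the copy; the project and dataset names are the same hypothetical ones used above.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project
rows = client.query("SELECT * FROM my_dataset.doc_email LIMIT 10").result()
for row in rows:
    print(dict(row))  # print a small sample of the copied rows
```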