Data Stagers - Support

This document provides general troubleshooting guidance for Matillion’s Data Loading components. It focuses on Orchestration jobs which use Load/Unload components to copy data from a source system into your target database.

To connect to a 3rd Party REST API, particularly those not already supported by Matillion ETL, see here or Contact Support.

 

Data Loading with an ELT architecture

In Matillion's ELT architecture there are two phases:

  1. You first extract the data from source, then load it into the target cloud data warehouse. This is done using Orchestration jobs, and is the “EL” of ELT.
  2. Secondly, after the data has been loaded, you can then transform it inside the database. Typical transformations often include filtering, adding calculated fields and aggregation. These things are done using Transformation jobs and form the “T” of ELT.

Note that:

  • Data extraction is a read-only operation. With a very few exceptions, Matillion ETL does not support writing back to data sources.
  • Data integration is performed at the Transformation stage, using the Join component. Unlike an ETL tool, Matillion ETL does not integrate data from different sources during the extraction stage.
 

There are two main categories of problem:

  • Physical errors - your job either won't run at all, or else starts running but fails.
  • Logical errors - your job does run and finishes successfully, but the resulting data looks wrong.
 

For both categories, this document notes common scenarios to check, and provides instructions on ways to proceed.

Some of the instructions refer to the shared jobs which are attached to this page as .melt files. To install a shared job, navigate to Project → Manage Shared Jobs → Import.

To run a shared job which you have previously imported:

  • Open an existing or new Orchestration Job
  • Find the Shared Jobs panel at bottom left of the UI
  • Locate and drag the required shared job onto the canvas
  • Some shared jobs require parameters, and you must supply these by editing the component’s properties.
 
ELT data architecture

A good data architecture to follow with Matillion ETL is to first extract and load the data (using an Orchestration job) and then immediately afterwards append it into a permanent target table (using a Transformation job).

By default, every time a Load component executes, you’ll end up with a database table containing only the newly-extracted-and-loaded data. Load components either drop and re-create or else first truncate the table.

For this reason:

  • It’s good practice to follow a standard for naming Target Tables. For example always prefix the name with stg_ or load_
  • After the Load component has finished, you will need to run a Transformation job to (optionally transform) the new data and copy it into the permanent table.
  • You may treat the stg_ or load_ tables as if they were temporary. For example you may choose to drop them as a final step. Alternatively, simply wait until the next time the job executes and the target table will first be recreated or truncated anyway.
 

The Matillion ETL pattern often looks like this:

 

Physical Errors

With this category of problem the component either won’t even start running, or else starts but fails. A common cause of this is validation failure.

 
Component validation

Do any of the components have a red border?

If a component has a red border, you must click it, then go to the properties tab at the bottom center of the screen. Scroll through all the properties of the component, providing or correcting any values which do not have a green OK symbol.

Note that some properties cascade: they may appear, disappear or change depending on the choice made for earlier properties.

A component which appears on the canvas with a red border has failed to validate, and will probably fail to run. One possible exception to this is if you have used variables in any of the properties.

 
Using Variables

You can use variables in most component properties, referencing the value with ${variable_name} syntax. Variable names are case sensitive, so it’s best to copy and paste the names.

Are any of the component’s properties using variables?

  • Matillion ETL always uses the variable’s default value for validation.
  • Default values are specific to an Environment. If you have created more than one Environment (e.g. dev/test/production), you will need to specify a default value for every Environment.
  • Variables can be updated at runtime, meaning that a component which failed to validate may nevertheless still run successfully.
  • During development, it’s good practice to get the component working using hardcoded values. Once you have seen it working, you can then replace the hardcoded values with variables as required.
 

The first thing to check is that all environment variables have appropriate default values for the Environment you are using. For example:

Before (no default value)

Then:

 

After (with a default value)

Then:

You may need to force a revalidation after updating the default value. You can do this from the context menu which appears when you right-click the canvas of the job:

For debugging, it’s very useful to add a Python Script component immediately before the Load component, to simply display the value of the variable. You do not need to wrap the variable value in ${} inside a Python script.

After running the component, you will find the Python Script’s output in the Tasks window.

If you are running the job inside an iterator, or if you are using job-level variables, the values are likely to change every time the job is run. We strongly recommend that you use a Python Script component as shown above to record the actual values supplied at runtime. This is also valuable for audit purposes.

 
Variable lifecycle

Updates you make to the value of a variable do not persist beyond the execution of the job which made the update. This is especially important to remember when you are running a component on its own using “Run Component” or “Run from Component”:

Here’s an example:

  • A Job starts to run, and includes a Python Script component which calls context.updateVariable(..) to update the value of a variable.
  • A later component starts to execute. It sees the updated value.
  • The Job fails
  • You re-run the failed component on its own. It sees the default value of the variable.
 

Here’s another example involving multiple jobs:

  • Job A runs, and includes a Python Script which calls context.updateVariable(..) to update the value of a variable.
  • Later in the execution, a component inside Job A runs. It sees the updated value.
  • Job B starts (maybe while job A is still running, or maybe afterwards)
  • All components inside job B see the default value of the variable.
  • Job B must make its own changes to the variable if necessary, using context.updateVariable(..) inside a Python script. Those changes will not be visible to job A if it’s still running.

The only way to change the default value of a variable is by hand through the web user interface. Matillion ETL does not have an API endpoint or scripting method which can update the default value of a variable.

 

Dynamic variables

If you need to set variables dynamically at runtime, use a Python Script component in Jython mode.

If you’re setting a date or timestamp then it’s often best to use a Text datatype (rather than a DateTime), since you can then explicitly control the format and the timezone.

For example, to have a variable updated to today’s date in an clearly specified format, you could use the following Python script:

from datetime import datetime
context.updateVariable('updatedtm', "{:%Y-%m-%d}".format(datetime.now()))
print updatedtm

This document shows additional format masks. For example to include hours, minutes and seconds you could use

"{:%Y-%m-%dT%H:%M:%S}"

Matillion ETL defaults to UTC, so you might also choose to convert to your local timezone, like this:

from datetime import datetime

import pytz
local_tz = pytz.timezone('US/Mountain')
context.updateVariable('updatedtm', "{:%Y-%m-%d}".format(datetime.now().replace(tzinfo=pytz.utc).astimezone(local_tz)))
print updatedtm

Note that JavaScript expressions are not supported as default values for variables, or for parameter values. The expression may validate successfully, but evaluation is likely to fail at runtime.

 

Network connectivity

In order to successfully validate and run a data extraction and load component, your Matillion ETL  virtual machine needs network connectivity to the source API, database or service.

 

Internet connectivity

If you are trying to query an internet-based service (such as Facebook, Twitter or ZenDesk) then the Matillion ETL VM itself requires internet access.

To check that your Matillion ETL instance has internet access, install and run the below shared job for your version of Matillion ETL:

Download for Matillion ETL for Redshift - Check Internet Access - Redshift.melt
Download for Matillion ETL for Snowflake - Check Internet Access - Snowflake.melt
Download for Matillion ETL for Google BigQuery - Check Internet Access - BigQuery.melt


The job will only run successfully if Matillion ETL has internet access.

 

Database Query

In order to query data from a relational database (such as Oracle or SQL Server), you would use a Database Query component. Matillion ETL ships with some JDBC drivers pre-loaded, but you can use any Type 4 JDBC driver for which you have downloaded the JAR files.

Your Matillion ETL VM must have network access to the database server. This may require that you perform network administration outside Matillion ETL. You will need to know:

  • The database server’s IP address or hostname
  • The port number

To check that your Matillion ETL instance has access to the database server, install and run the Check Network Access shared job below:

Download for Matillion ETL for Redshift - Check Network Access - Redshift.melt
Download for Matillion ETL for Snowflake - Check Network Access - Snowflake.melt
Download for Matillion ETL for Google BigQuery - Check Network access - BigQuery.melt
 

Note: You may first need to sudo yum install nc from an SSH session into your Matillion ETL VM.

You will need to set the hostname and port number for your source database in the component properties.

The Check Network Access job will only run successfully if you have given your Matillion ETL VM the necessary network access.

 
Database Query performance

Four main steps are involved which affect the runtime performance of a Database Query:

 
Factor Where is the load?
1 Data extraction from source On the source database
2 Data transfer across the network to Matillion ETL On the network between source database and Matillion ETL
3 Matillion ETL creating temporary file(s) in cloud storage. Shared between Matillion ETL and the cloud storage provider.
4 Target database executing a bulk load. On the target database.

Steps 3 and 4 may be time-consuming. So during performance tuning it’s sometimes difficult to pinpoint which area to concentrate on. You may encounter a situation where the Database Query starts to run, begins to stage records, and maybe runs for several minutes or longer before eventually failing.

To perform extra testing in these circumstances, you should install and run the JDBC Query Tester shared job below:

Download for Matillion ETL for Redshift - JDBC Query Tester - Redshift.melt
Download for Matillion ETL for Snowflake - JDBC Query Tester - Snowflake.melt
Download for Matillion ETL for Google BigQuery - JDBC Query Tester - BigQuery.melt

You will need to set five properties to run this shared job:

  • JDBC Driver Class Name: The class name of JDBC driver, for example net.sourceforge.jtds.jdbc.Driver
  • Classpath: The full path to the JDBC driver jar file, for example /usr/share/emerald/WEB-INF/lib/jtds-1.3.1.jar. If more than one .jar file is needed then separate the filenames with : signs.
  • JDBC URL: The JDBC URL, for example jdbc:jtds:sqlserver://yourhostname.com/testdata
  • Properties: A semicolon-separated list of JDBC properties used by the driver. For example:
    user=u;password=p.
    If the property values contain an = sign, use \u003d instead. If the property values contain a ; sign, use \u003b instead. If you do not need any extra properties, set the value to a single space.
  • ​SELECT Statement: Make sure this is identical to the SQL query being performed by the Database Query component

If the JDBC Query Tester job fails:

  • Verify network connectivity to the source database as described earlier in this section
  • Check the JDBC driver documentation and verify that you have set the properties appropriately.
 

If the JDBC Query Tester job succeeds but the equivalent Matillion ETL Database Query component fails, please contact Matillion Support.

If both the JDBC Query Tester and the equivalent Matillion ETL Database Query component ran successfully:

  • If they both took about the same amount of time to finish, you need to concentrate tuning efforts outside Matillion ETL, on the SQL being executed against the source database.
  • If the JDBC Query Tester was significantly faster, please contact Matillion Support.
 

Logical Errors

With this category of problem the job runs successfully, but the resulting data looks wrong.

 
Authentication or Authorization failures

These often manifest at runtime as fairly obvious “invalid password” type errors, as below.

Authorization failures may appear superficially similar, for example saying that a table name is not valid. It may be the case that the credentials are valid, but the source application does not allow the privilege to read from the desired data source.

Check that the username/password or OAuth token is correct and is appropriately privileged.  In the Project / Manage OAuth menu, you should see a “Configured” status. OAuth credentials do occasionally expire so if you’re using OAuth and encountering an authentication failure it’s usually worth setting up new OAuth credentials and testing once with those instead.

 
Data errors

There are some common things to check if a Load component has run successfully but the data is not as expected.

  • Load components won’t necessarily extract all the data every time: they will only extract the data that matches the criteria you specified in the parameters or WHERE clause.
  • If you are using variables among the parameters, check that the actual runtime values are as expected. It’s good practice to audit the actual runtime values as described earlier.
  • Date parameters are especially vulnerable to misinterpretation due to formatting ambiguities. Check the component or API documentation to confirm the required format. Some load components allow “magic” strings with special meaning, such as LAST_WEEK.
  • Many components have a Limit property that sometimes defaults to 100. This is useful for testing, but you should remove the value when moving to production.
 
Auto Debug mode

Many load components have an “instrumentation” switch that you can enable to capture additional information into the Task History. Matillion support will notify you when this is a good option.

To switch it on, locate the Auto Debug and Debug Level properties, and set them as below.

Now, whenever you run the component, it will generate extra instrumentation into the Task History.

To extract the text, open Project → Task History, locate and expand the task, and press the ellipsis which expands the text.

Highlight the entire text, then copy and paste it into a new text document. Send the document to support@matillion.com in reply to the email support thread.

Never leave a Load component with this extra instrumentation enabled. It consumes a potentially large amount of space every time it’s run. Always switch the Auto Debug back Off after one execution of the component.