
Cannot connect to Matillion running on Google Cloud

We have Matillion running on a Google Cloud instance. It used to work fine until it crashed mid-session (connection lost); after restarting the VM, connecting was no longer possible (the browser reports that the site does not exist).
I wasn't doing anything unusual at the time. The last action I took was running a SQL query containing the DENSE_RANK function (I'm not sure whether that is related; the query ran fine, and then the crash came). I had first run the query directly in BigQuery to make sure it gives the right result.

It is also not possible to access the VM via SSH or in any other way that would let us read out logs; no errors are shown when starting or stopping it.

We are currently setting up a new instance and moving over our data, which works, but it would be good to know what caused the issue so we can avoid it in the future.

23 Community Answers

Matillion Agent  

Damian Chan —

Hello Kevin,

How are you connecting to your instance? Via a URL or by using the public IP address of the compute engine?

Best Regards,
Damian


Kevin Blank —

Hi Damian,

I'm using the public IP address of the Compute Engine instance.


Matillion Agent  

Damian Chan —

Hello Kevin,

Is it possible that your Public IP address is now different after restarting your compute engine?

Best Regards,
Damian


Kevin Blank —

Hi Damian,
Yes, that is possible, as it usually changes with each restart, but I am trying to connect to the right one.
One theory we have is that the connection simply takes too long and eventually times out. Connecting used to take some time (about 30 seconds to a minute), but not that long.


Matillion Agent  

Damian Chan —

Hello Kevin,

I assume the instance is in a running state at the moment? Have there been any recent changes to the firewall rules?

Best Regards,
Damian


Kevin Blank —

Hello Damian,

We've managed to take a closer look at the storage logs. It appears that the filesystem on the disk Matillion was using has somehow been corrupted (we don't yet know how that happened).
We are trying to recover the data that's on there. Do you have any information on what data we need and where to find it? We are mostly interested in the work that went into the jobs we had set up.

regards,
Kevin


Matillion Agent  

Damian Chan —

Hello Kevin,

Does this mean you have SSH access now? Or are you still left with no way of interacting with the compute engine?

Just to confirm: you want to recover the orchestration/transformation jobs you've created in Matillion? Presumably the data you're ingesting and transforming is stored in BigQuery.

Also, would you be able to send us a copy of the logs you're looking at, please? Thank you.

Best Regards,
Damian


Kevin Blank —

Hi Damian,

SSH access still doesn't work, but we can get some logging by looking at the serial port of the compute engine.
Yes, we want to recover the orchestration and transformation jobs; they take data from BigQuery.

Here is the log we are looking at:
--------------------------------
69484] tsc: Refined TSC clocksource calibration: 2299.839 MHz
[ 1.840724] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
[ 1.843890] md: Waiting for all devices to be available before autodetect
[ 1.845987] md: If you don't use raid, use raid=noautodetect
[ 1.847939] md: Autodetecting RAID arrays.
[ 1.849373] md: autorun ...
[ 1.850384] md: ... autorun DONE.
[ 1.851558] List of all partitions:
[ 1.852552] No filesystem could mount root, tried:
[ 1.854001] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[ 1.857058] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-957.1.3.el7.x86_64 #1
[ 1.859413] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 1.861551] Call Trace:
[ 1.862435] [<ffffffff90561e41>] dump_stack+0x19/0x1b
[ 1.864274] [<ffffffff9055b550>] panic+0xe8/0x21f
[ 1.865334] [<ffffffff90b86761>] mount_block_root+0x291/0x2a0
[ 1.867239] [<ffffffff90b867c3>] mount_root+0x53/0x56
[ 1.868560] [<ffffffff90b86902>] prepare_namespace+0x13c/0x174
[ 1.870209] [<ffffffff90b863df>] kernel_init_freeable+0x1f8/0x21f
[ 1.872339] [<ffffffff90b85b1f>] ? initcall_blacklist+0xb0/0xb0
[ 1.874027] [<ffffffff9054ff40>] ? rest_init+0x80/0x80
[ 1.876042] [<ffffffff9054ff4e>] kernel_init+0xe/0x100
[ 1.877424] [<ffffffff90574c37>] ret_from_fork_nospec_begin+0x21/0x21
[ 1.878819] [<ffffffff9054ff40>] ? rest_init+0x80/0x80
[ 1.882076] Kernel Offset: 0xee00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
------------------------------
It can't mount the filesystem, so presumably it has been corrupted. Maybe Matillion crashed during some crucial save, and that led to the corruption?
We'd also be happy to try to recover any Matillion logs if those could be of use.

regards,
Kevin
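
Serial console output like the log above can typically be retrieved from outside the VM with gcloud, even when SSH is down. A minimal sketch; the instance and zone names in the usage comment are placeholders:

```shell
# Sketch: fetch a GCE VM's serial console log when SSH is unavailable.
get_serial_log() {
  local instance="$1" zone="$2"
  # Port 1 carries the kernel/boot messages, like the panic shown above.
  gcloud compute instances get-serial-port-output "$instance" \
    --zone "$zone" --port 1
}

# Usage (placeholder names):
# get_serial_log my-matillion-vm europe-west1-b
```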


Matillion Agent  

Kalyan Arangam —

Hi Kevin,

Do you have any snapshots/backups of this instance you can restore from? Do you have a backup of the Matillion jobs so you can move them to a new server if necessary?

I'm no Linux expert. Would you be able to retrieve the Matillion logs even with the file system not being available?

/var/log/tomcat/catalina.out

Best
Kalyan


Kevin Blank —

Unfortunately we don't have backups (yes, I know that's careless; we had only been using Matillion for a few weeks and hadn't gotten around to including it in our backup workflows).

If we have the path, we can try to restore them. Can you also give me the path where the jobs are stored?


Matillion Agent  

Laura Malins —

Hi Kevin

This shouldn’t have occurred simply from running a Matillion job. This error occurs when something has happened on the file system, such as removing a partition or editing the bootloader.

I assume you tried to mount it to an existing VM which you could access? If so it looks like you’re limited in options here.

The data for the jobs is stored in a Postgres database installed on the instance. You could try using a database tool to connect to it, but I would be doubtful that would work.

As Kalyan says, if you can get those log files please share them and we may be able to help further.

Thanks
Laura
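
If the corrupted disk can later be mounted on another VM, one way to get at the Postgres job database is to point a throwaway Postgres server at the old data directory and dump it. This is only a sketch: the mount point and data-directory path are assumptions and may differ on your image.

```shell
# Sketch: dump the job database from an old disk mounted at "$1".
# The pgsql data-directory path below is an assumption; verify it first.
dump_old_matillion_db() {
  local mountpoint="$1"                        # e.g. /mnt/olddisk
  local datadir="$mountpoint/var/lib/pgsql/data"
  # Run a temporary server on a side port so it can't clash with any
  # Postgres already running on this instance; dump everything; stop it.
  sudo -u postgres pg_ctl -D "$datadir" -o "-p 5433" -w start
  sudo -u postgres pg_dumpall -p 5433 > matillion_jobs_backup.sql
  sudo -u postgres pg_ctl -D "$datadir" -w stop
}
```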


Kevin Blank —

Hi Laura,

Thanks for the fast answer! We'll try to recover the logs and see what comes of that. Let's hope it tells us something about what happened and how to avoid it in the future (and maybe lets us recover the lost data, which is a lot).

Cheers,
Kevin


Matillion Agent  

Laura Malins —

Hi Kevin

You’re welcome. Please do keep us posted here and share any log files so we can help you if possible.

If you’re sure nothing has changed on the instance it might be worthwhile reaching out to Google to get their opinion on what may have gone wrong. We’ve never seen this occur before, across any products, without the customer making a change to the file system.

Thanks
Laura


Kevin Blank —

Hi Laura,

After many more hours of investigation, we now know definitively that the bootloader on that particular disk is broken (we still don't know why). That means all the data should still be intact, and we can mount the disk as secondary storage on a new Matillion setup.

Do you have any pointers on how we would go about migrating the database from the old storage to the new?

Cheers,
Kevin


Matillion Agent  

Kalyan Arangam —

Hi Kevin,

Here’s a general outline of what you may try. No guarantees this will work, though.

- take a snapshot/backup of the current disk before you proceed
- create a new Matillion instance
- attach the non-starting disk to this instance
- stop the postgres and tomcat services
- copy the data across from the old disk to the new disk, overwriting existing files
- start the postgres and tomcat services

Please can you confirm you managed to mount this disk in a second instance?

Best
Kalyan
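
The outline above can be sketched as shell steps. All disk, instance, path, and service-unit names here are placeholders (they vary by image), and as Kalyan says, snapshot first:

```shell
# Sketch of the recovery outline; every name below is a placeholder.
snapshot_disk() {    # back up the current disk before touching anything
  gcloud compute disks snapshot "$1" --zone "$2" --snapshot-names "$1-rescue"
}
attach_old_disk() {  # attach the non-starting disk to the new instance
  gcloud compute instances attach-disk "$1" --zone "$2" --disk "$3"
}
stop_services() {    # unit names may differ (e.g. a versioned postgresql unit)
  sudo systemctl stop tomcat postgresql
}
copy_data() {        # overwrite the new instance's Postgres data with the old
  sudo rsync -a "$1"/var/lib/pgsql/ /var/lib/pgsql/  # old disk mounted at "$1"
}
start_services() {   # bring Matillion back up
  sudo systemctl start postgresql tomcat
}
```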


Kevin Blank —

Hi Kalyan,

Thanks a bunch for your help! We have managed to recover all the data (except the accounts, but that's only a very minor loss).
Unfortunately, we weren't able to find any log files in the folder you mentioned, or near it, so those might be lost.

What I can say is that the last change I made in Matillion was never saved, so presumably it crashed during that save procedure.

Thanks a lot for the quick and good help you provided!

Best regards,
Kevin


Matillion Agent  

Laura Malins —

Hi Kevin

Thanks for confirming that you’re back up and running with only minor loss. It’s odd that there weren’t any logs at all in /var/log/tomcat/ (note that you do need sudo permissions to access it). It’s unfortunate, as Kalyan and I have been discussing this and we’re keen to get to the bottom of what went on here.

However if you’re happy I’ll close this case. Please do reach out again if anything else comes up.

Thanks
Laura


Kevin Blank —

Hi Laura,
We tried to access the folder with sudo, but there was still nothing relevant there.
After working with it again for a bit, I had another crash. This time it did manage to restart, but we looked more closely because we feared it could be the same issue (it seemed similar). While I was working, Matillion lost the connection and sent me to the "500 - Internal Server Error" page, which came up even when I accessed the root directly.
A restart of the VM fixed the issue, but we checked the log from when the issue happened, and the last few actions before the crash seem odd:
-----------------------
Jan 8 17:06:30 matillion-etl-for-bigquery-pro-2-2-vm polkitd[5520]: Started polkitd version 0.112
Jan 8 17:06:31 matillion-etl-for-bigquery-pro-2-2-vm systemd: Started Authorization Manager.
Jan 8 17:06:31 matillion-etl-for-bigquery-pro-2-2-vm systemd: Reloading.
Jan 8 17:06:31 matillion-etl-for-bigquery-pro-2-2-vm systemd: Reloading.
Jan 8 17:06:31 matillion-etl-for-bigquery-pro-2-2-vm systemd: Stopping Run automatic yum updates as a cron job...
Jan 8 17:06:31 matillion-etl-for-bigquery-pro-2-2-vm systemd: Stopped Run automatic yum updates as a cron job.
Jan 8 17:06:31 matillion-etl-for-bigquery-pro-2-2-vm systemd: Starting Run automatic yum updates as a cron job...
Jan 8 17:06:31 matillion-etl-for-bigquery-pro-2-2-vm systemd: Started Run automatic yum updates as a cron job.
Jan 8 17:06:31 matillion-etl-for-bigquery-pro-2-2-vm systemd: Stopped Dynamic System Tuning Daemon.
Jan 8 17:06:31 matillion-etl-for-bigquery-pro-2-2-vm systemd: Starting Dynamic System Tuning Daemon...
Jan 8 17:06:32 matillion-etl-for-bigquery-pro-2-2-vm systemd: Reloading.
Jan 8 17:06:32 matillion-etl-for-bigquery-pro-2-2-vm systemd: Stopping Command Scheduler...
Jan 8 17:06:32 matillion-etl-for-bigquery-pro-2-2-vm systemd: Stopped Command Scheduler.
Jan 8 17:06:32 matillion-etl-for-bigquery-pro-2-2-vm systemd: Started Command Scheduler.
Jan 8 17:06:32 matillion-etl-for-bigquery-pro-2-2-vm systemd: Started Dynamic System Tuning Daemon.
Jan 8 17:06:33 matillion-etl-for-bigquery-pro-2-2-vm systemd: Reloading.
Jan 8 17:06:34 matillion-etl-for-bigquery-pro-2-2-vm systemd: Reloading.
Jan 8 17:07:04 matillion-etl-for-bigquery-pro-2-2-vm systemd: Reloading.
Jan 8 17:07:04 matillion-etl-for-bigquery-pro-2-2-vm systemd: Reloading.
Jan 8 17:09:55 matillion-etl-for-bigquery-pro-2-2-vm systemd: Removed slice User Slice of root.
---------------------------------------------------------

Especially the last action (which happened right when it crashed) seems like it has the potential to break things.

As I said, a restart fixed the issue, and it might be unrelated, but we retrieved the logs that go with it; maybe they will help you get to the bottom of it. I can see lots of exceptions around the time the crash happened, so it could be related.
https://we.tl/t-LDTeg9MGsi

cheers,
Kevin


Matillion Agent  

Laura Malins —

Hi Kevin

Thank you for the log. We’ve downloaded it.

The log snippet you refer to is nothing to be worried about, however I am concerned about what you’re seeing in the logs, especially this error:

org.apache.catalina.loader.WebappClassLoaderBase openJARs
WARNING: Failed to open JAR [null]
java.io.FileNotFoundException: /usr/share/emerald/WEB-INF/lib/emerald-admin-1.36.5.jar (No such file or directory)

It looks like something has corrupted on the instance.

Can you please try spinning up a new instance of Matillion and using the Server Migration tool to migrate all of your jobs across?

https://bigquerysupport.matillion.com/customer/en/portal/articles/2904980-server-migration-tool-?b_id=8914

Thanks
Laura


Kevin Blank —

Hi Laura,

Thanks a lot for the help!
We've set up a new instance and migrated everything over - so far everything works fine, if any more crashes happen we'll let you know.

Cheers,
Kevin


Matillion Agent  

Laura Malins —

Hi Kevin

Thank you for the update. I’ll leave this open for now so please do let us know how you get on.

Thanks
Laura


Mayank Singhal —

Hi Kevin,

I have been facing the same issue since last week.
After starting the instance, I am also not able to log in to the Matillion instance via the URL or via the public IP.
Could you please let me know what steps you followed to resolve this issue?
I want to retrieve my jobs.

Thanks and Regards,
Mayank Singhal
+917506736006
mastekgcpasset@gmail.com
mnksinghal4@gmail.com


Matillion Agent  

Dan D'Orazio —

Hi Mayank -

If you have a backup/snapshot of the instance, then we would recommend creating a new instance from the snapshot. If not, and the system has been corrupted, here’s a general outline of what you may try. No guarantees this will work, though.

- take a snapshot/backup of the current disk before you proceed
- create a new Matillion instance
- attach the non-starting disk to this instance
- stop the postgres and tomcat services
- copy the data across from the old disk to the new disk, overwriting existing files
- start the postgres and tomcat services

Let us know how this works out.

Best -
Dan
