Cannot connect to Matillion running on Google Cloud
We have Matillion running on a Google Cloud instance. It used to work fine until it crashed while we were working (connection lost); after restarting the VM, connecting was no longer possible (the browser reports that the site does not exist). I wasn't doing anything unusual at the time. The last action I took was running a SQL query containing the "dense_rank" function (not sure whether that has anything to do with it; the query ran fine, and then the crash came. I had run the query directly on BigQuery first to make sure it gives the right result).
It is also not possible to access the VM via SSH or in any other way to read out some logs, and it shows no errors when starting or stopping.
We are currently setting up a new instance and moving over our data, which works, but it would be good to know what caused the issue so we can avoid it in the future.
21 Community Answers
Damian Chan —
How are you connecting to your instance? Via a URL or by using the public IP address of the compute engine?
Hi Damian, yes, that is possible, as the IP usually changes with each restart, but I am trying to connect to the right one. One theory we have is that the connection just takes a long time and eventually times out; it used to take a while to connect (about 30 seconds to a minute), but not that long.
We've managed to get a closer look at the logs of the storage: it appears that the filesystem on the disk Matillion was using was somehow corrupted (we don't know yet how that happened). We are trying to recover the data that's on there. Do you have any information on what data we need and where to find it? We are mostly interested in the work that was done in the jobs we had set up.
Does this mean you have SSH access now, or are you still left with no method of interacting with the compute engine?
So just to confirm: you want to recover the orchestration/transformation jobs that you’ve created in Matillion? Presumably the data that you’re ingesting and transforming is stored in BigQuery.
Also, would you be able to send us a copy of the logs that you’re looking at, please? Thank you.
SSH access still doesn't work, but we can get some logging by looking at the serial port of the compute engine. Yes, we want to recover the orchestration and transformation jobs; they take data from BigQuery.
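For anyone following along: when SSH is down, the serial console output can be pulled through the GCE API with the gcloud CLI, since it does not go through the VM itself. A minimal sketch (the instance name and zone below are placeholders; substitute your own):

```shell
# Placeholders -- replace with your own instance name and zone.
INSTANCE=matillion-etl-vm
ZONE=europe-west1-b

# Port 1 carries the kernel/boot messages. Because this reads the serial
# console via the GCE API, it works even when the VM refuses SSH connections.
gcloud compute instances get-serial-port-output "$INSTANCE" \
  --zone "$ZONE" --port 1
```

The same output is also visible in the Cloud Console under the instance's "Serial port 1 (console)" view.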
Here is the log we are looking at:
--------------------------------
69484] tsc: Refined TSC clocksource calibration: 2299.839 MHz
[ 1.840724] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
[ 1.843890] md: Waiting for all devices to be available before autodetect
[ 1.845987] md: If you don't use raid, use raid=noautodetect
[ 1.847939] md: Autodetecting RAID arrays.
[ 1.849373] md: autorun ...
[ 1.850384] md: ... autorun DONE.
[ 1.851558] List of all partitions:
[ 1.852552] No filesystem could mount root, tried:
[ 1.854001] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[ 1.857058] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-957.1.3.el7.x86_64 #1
[ 1.859413] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 1.861551] Call Trace:
[ 1.862435] [<ffffffff90561e41>] dump_stack+0x19/0x1b
[ 1.864274] [<ffffffff9055b550>] panic+0xe8/0x21f
[ 1.865334] [<ffffffff90b86761>] mount_block_root+0x291/0x2a0
[ 1.867239] [<ffffffff90b867c3>] mount_root+0x53/0x56
[ 1.868560] [<ffffffff90b86902>] prepare_namespace+0x13c/0x174
[ 1.870209] [<ffffffff90b863df>] kernel_init_freeable+0x1f8/0x21f
[ 1.872339] [<ffffffff90b85b1f>] ? initcall_blacklist+0xb0/0xb0
[ 1.874027] [<ffffffff9054ff40>] ? rest_init+0x80/0x80
[ 1.876042] [<ffffffff9054ff4e>] kernel_init+0xe/0x100
[ 1.877424] [<ffffffff90574c37>] ret_from_fork_nospec_begin+0x21/0x21
[ 1.878819] [<ffffffff9054ff40>] ? rest_init+0x80/0x80
[ 1.882076] Kernel Offset: 0xee00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
------------------------------
It can't mount the filesystem, so presumably it has been corrupted; maybe Matillion crashed during some crucial save, and that led to the corruption? We'd also be happy to try to recover any Matillion logs if those could be of use.
Thanks for the fast answer! We'll try to recover the logs and see what comes out of that. Let's hope it tells us something about what happened and how to avoid it in the future (and maybe lets us recover the lost data, which is a lot).
You’re welcome. Please do keep us posted here and share any log files so we can help you if possible.
If you’re sure nothing has changed on the instance it might be worthwhile reaching out to Google to get their opinion on what may have gone wrong. We’ve never seen this occur before, across any products, without the customer making a change to the file system.
After many more hours of looking into it, we now know definitively that the bootloader on that particular disk is broken (why, we still don't know). That means all the data should still be intact, and we can mount it as secondary storage on a new Matillion setup.
Do you have any pointers on how we would go about migrating the database from the old storage to the new?
Here’s a general outline of what you may try; no guarantees this will work, though.
- take a snapshot/backup of the current disk before you proceed
- create a new Matillion instance
- attach the non-booting disk to this instance as a secondary disk
- stop the postgres and tomcat services
- copy the data across from the old disk to the new one, overwriting what is there
- start the postgres and tomcat services
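The steps above could be sketched roughly as the shell session below. This is an assumption-laden illustration, not a tested procedure: the device name (/dev/sdb1), mount point, data path (/var/lib/pgsql, a common PostgreSQL data directory on CentOS-based images), and service names (tomcat, postgresql) are all placeholders that may differ on your instance, so verify each one before copying anything.

```shell
# Sketch only -- device, paths, and service names are assumptions;
# confirm them on your own instance (lsblk, systemctl list-units) first.

# 1. After attaching the old disk in the GCE console, find its device name.
lsblk

# 2. Mount the old disk read-only so nothing on it can be altered.
sudo mkdir -p /mnt/olddisk
sudo mount -o ro /dev/sdb1 /mnt/olddisk   # /dev/sdb1 is a placeholder

# 3. Stop the services that serve the UI and hold the job metadata.
sudo systemctl stop tomcat postgresql

# 4. Copy the database data directory from the old disk over the new
#    instance's copy (trailing slashes matter to rsync).
sudo rsync -a /mnt/olddisk/var/lib/pgsql/ /var/lib/pgsql/

# 5. Restart the services and check the UI.
sudo systemctl start postgresql tomcat
```

Mounting the old disk read-only is deliberate: if the copy goes wrong, the snapshot plus the untouched source disk mean the attempt can be repeated.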
Please can you confirm you managed to mount this disk in a second instance?
Thanks a bunch for your help! We have managed to recover all the data (except the accounts, but that's only a very minor loss). Unfortunately, we weren't able to find any log files in the folder you mentioned or anywhere nearby, so those might be lost.
What I can say is that the last change I made in Matillion was not saved, so presumably it crashed during the save procedure for that.
Thanks a lot for the quick and good help you provided!
Thanks for confirming that you’re back up and running with only minor loss. It’s odd that there weren’t any logs at all in /var/log/tomcat/. You do need sudo permissions to access it. It’s unfortunate as Kalyan and I have been chatting about this and we’re keen to get to the bottom of what went on here.
However if you’re happy I’ll close this case. Please do reach out again if anything else comes up.
Hi Laura, we tried to access the folder with sudo, but there was still nothing relevant there. After working with it again for a bit, I had another crash. This time it did manage to restart, but we looked a bit closer, as we feared it could be the same issue (it seemed similar). First of all, while I was working it lost the connection and sent me to the "500 - Internal Server Error" page, which came up even when accessing the root directly. A restart of the VM fixed the issue, but we checked the log from the time the issue happened, and the last few actions before the crash seem odd:
-----------------------
Jan 8 17:06:30 matillion-etl-for-bigquery-pro-2-2-vm polkitd: Started polkitd version 0.112
Jan 8 17:06:31 matillion-etl-for-bigquery-pro-2-2-vm systemd: Started Authorization Manager.
Jan 8 17:06:31 matillion-etl-for-bigquery-pro-2-2-vm systemd: Reloading.
Jan 8 17:06:31 matillion-etl-for-bigquery-pro-2-2-vm systemd: Reloading.
Jan 8 17:06:31 matillion-etl-for-bigquery-pro-2-2-vm systemd: Stopping Run automatic yum updates as a cron job...
Jan 8 17:06:31 matillion-etl-for-bigquery-pro-2-2-vm systemd: Stopped Run automatic yum updates as a cron job.
Jan 8 17:06:31 matillion-etl-for-bigquery-pro-2-2-vm systemd: Starting Run automatic yum updates as a cron job...
Jan 8 17:06:31 matillion-etl-for-bigquery-pro-2-2-vm systemd: Started Run automatic yum updates as a cron job.
Jan 8 17:06:31 matillion-etl-for-bigquery-pro-2-2-vm systemd: Stopped Dynamic System Tuning Daemon.
Jan 8 17:06:31 matillion-etl-for-bigquery-pro-2-2-vm systemd: Starting Dynamic System Tuning Daemon...
Jan 8 17:06:32 matillion-etl-for-bigquery-pro-2-2-vm systemd: Reloading.
Jan 8 17:06:32 matillion-etl-for-bigquery-pro-2-2-vm systemd: Stopping Command Scheduler...
Jan 8 17:06:32 matillion-etl-for-bigquery-pro-2-2-vm systemd: Stopped Command Scheduler.
Jan 8 17:06:32 matillion-etl-for-bigquery-pro-2-2-vm systemd: Started Command Scheduler.
Jan 8 17:06:32 matillion-etl-for-bigquery-pro-2-2-vm systemd: Started Dynamic System Tuning Daemon.
Jan 8 17:06:33 matillion-etl-for-bigquery-pro-2-2-vm systemd: Reloading.
Jan 8 17:06:34 matillion-etl-for-bigquery-pro-2-2-vm systemd: Reloading.
Jan 8 17:07:04 matillion-etl-for-bigquery-pro-2-2-vm systemd: Reloading.
Jan 8 17:07:04 matillion-etl-for-bigquery-pro-2-2-vm systemd: Reloading.
Jan 8 17:09:55 matillion-etl-for-bigquery-pro-2-2-vm systemd: Removed slice User Slice of root.
---------------------------------------------------------
Especially the last action (which happened right when it crashed) looks like it has the potential to break things.
As I said, a restart fixed the issue, and it might be unrelated, but we captured the logs from around that time; maybe they help you get to the bottom of it. I can see lots of exceptions at around the time the crash happened, so it could be related. https://we.tl/t-LDTeg9MGsi
The log snippet you refer to is nothing to be worried about, however I am concerned about what you’re seeing in the logs, especially this error:
WARNING: Failed to open JAR [null]
java.io.FileNotFoundException: /usr/share/emerald/WEB-INF/lib/emerald-admin-1.36.5.jar (No such file or directory)
It looks like something has become corrupted on the instance.
Can you please try spinning up a new instance of Matillion and using the Server Migration tool to migrate all of your jobs across?