Service outage
Incident Report for Intuo
Postmortem

A severe multi-hour outage started around 8am on Friday the 21st of September. At the time, we were performing regular server security upgrades, which included a Linux kernel upgrade to version 4.4.0-1067.77.

After the kernel upgrade, the application servers stopped working, and we identified unusual activity on the database. Our first assessment led us to believe the database was corrupted, so we recreated the production database instance. No data was lost in the process. After we recreated the database, the servers worked for 30 minutes before they started failing again. We then turned our attention to the servers and rolled back the Linux kernel upgrade. After monitoring the servers for another hour under high load, we concluded that the kernel rollback had not completely solved the issue.

Our application servers were still going down every 30 to 60 minutes. At this point, we decided to get in touch with Amazon Web Services support. After our support ticket was escalated to the AWS Engineering team, they investigated our database servers and concluded that some of our database queries were consuming significant amounts of memory (RAM). Our performance monitoring tools did not catch these poorly performing queries, but the AWS team helped us identify them through enhanced monitoring.

We subsequently improved all of these queries and pushed a final fix around 8pm on the 21st of September. We are continuing to monitor our systems and will keep improving the performance of our database queries over time.

Posted 11 months ago. Sep 21, 2018 - 22:06 CEST

Resolved
We have successfully resolved the database issue and the system is now fully operational. Sorry for all the inconvenience we caused.
Posted 11 months ago. Sep 21, 2018 - 19:13 CEST
Monitoring
We have restored the database from a backup and are monitoring the situation. No data was lost.
Posted 11 months ago. Sep 21, 2018 - 12:07 CEST
Update
We are continuing to work on a fix for this issue.
Posted 11 months ago. Sep 21, 2018 - 11:25 CEST
Identified
We have found that the outage is being caused by faulty database hardware at the Amazon location, and we are restoring the database.
Posted 11 months ago. Sep 21, 2018 - 10:42 CEST
Update
We are continuing to monitor for any further issues.
Posted 11 months ago. Sep 21, 2018 - 10:11 CEST
Monitoring
We experienced a service outage this morning. The issue has been resolved and we are currently monitoring the platform.
Posted 11 months ago. Sep 21, 2018 - 10:11 CEST
This incident affected: Intuo application.