We experienced a severe multi-hour outage starting around 8am on Friday, September 21st. It began during routine server security upgrades, which included a Linux kernel upgrade to version 4.4.0-1067.77.
After the kernel upgrade, the application servers stopped working, and we noticed unusual activity on the database. Our initial assessment led us to believe the database was corrupted, so we recreated the production database instance; no data was lost in the process. After the recreation, the servers ran normally for about 30 minutes before failing again. We then turned our attention to the servers themselves and rolled back the Linux kernel upgrade. After monitoring for another hour under high load, we concluded that the kernel rollback had not fully resolved the issue.
Our application servers were still going down every 30 to 60 minutes, so at this point we contacted Amazon Web Services support. After our support ticket was escalated to the AWS Engineering team, they investigated our database servers and found queries that were consuming significant amounts of memory (RAM). Our performance monitoring tools had not caught these non-performant queries, but the AWS team helped us identify them through Enhanced Monitoring.
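For teams doing this kind of triage themselves, the core step is simple: export per-query resource statistics from whatever monitoring source is available and rank them by memory use. The snippet below is only an illustrative sketch with invented sample data; the real query text and figures from our incident are not shown here, and the field names are assumptions.

```python
# Hypothetical triage sketch: given per-query memory statistics (as one might
# export from a monitoring tool or the database's own statistics views),
# rank the queries by peak memory to surface the worst offenders.
# All data below is invented for illustration.

def top_memory_queries(stats, n=3):
    """Return the n entries with the highest peak memory usage (MB)."""
    return sorted(stats, key=lambda s: s["peak_mem_mb"], reverse=True)[:n]

sample_stats = [
    {"query": "SELECT * FROM orders ORDER BY created_at", "peak_mem_mb": 2048},
    {"query": "SELECT id FROM users WHERE email = ?",     "peak_mem_mb": 4},
    {"query": "SELECT * FROM events",                     "peak_mem_mb": 3500},
]

for entry in top_memory_queries(sample_stats, n=2):
    print(entry["peak_mem_mb"], entry["query"])
```

Ranking by peak rather than average memory matters here: a query that is cheap on average but occasionally sorts a huge result set in memory is exactly the kind our dashboards missed.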
We subsequently optimized all of the offending queries and pushed a final fix around 8pm on September 21st. We are continuing to monitor our systems and will keep improving the performance of our database queries over time.
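The incident report does not name our database engine or the specific queries involved, so as a generic illustration of the class of fix, the sketch below uses SQLite to show how adding an index changes a query plan from a full table scan (which can pull and sort an entire table in memory) to an index search. The table and column names are hypothetical.

```python
import sqlite3

# Illustrative sketch only: hypothetical schema, using SQLite's
# EXPLAIN QUERY PLAN to compare plans before and after adding an index.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

def plan(sql):
    """Return SQLite's query-plan description for the given statement."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT total FROM orders WHERE customer_id = 42"
print(plan(query))  # without an index: a full scan of the orders table

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
print(plan(query))  # with the index: a search using idx_orders_customer
```

The same principle applies at larger scale: queries that scan or sort whole tables are the ones most likely to consume RAM quietly until load pushes the database over its memory limits.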