On Thursday, September 13th, 2018, customers using the iPaper platform experienced load issues with their flipbooks. Some flipbooks would load without any issues, while others would intermittently return an error instead. The issue had little impact on pop-ups, but it directly affected our flipbook viewer, administration system, and API. During the incident, end users and administrators experienced outages lasting from 1 to 5 minutes.
We apologize to our customers who were impacted by this incident. We have already changed internal processes to minimize potential incidents while we reassess our current development and release workflow.
Our automated monitoring notified us of an issue at 08:26 (CEST) on September 13th, 2018. The initial symptoms were flipbooks not loading as expected, along with very slow responses from our administration system and API. Since some flipbooks were still being served from our servers, it was initially investigated as a load issue. As not all requests to our system returned error codes, and the problem affected our flipbook viewer, administration system, and API alike, it was clear that this was a widespread issue.
The first suspected source of the issue was high load on our database server, but an initial look at the metrics didn't show any higher load than usual. Going through the database's internal metrics, we spotted two queries that weren't performing as expected. A hotfix for those queries was applied and deployed at 10:44, which removed the initially perceived high load on the database.
After a period with stable responses and no errors showing up in our logs, the system appeared to have recovered. Around 11:50 a second round of outages began, which led to further investigation of a new caching functionality that had been added as part of that day's upgrade.
The caching functionality hadn't shown any signs of deviating from the rest of the system at that point. Further investigation revealed that the new caching functionality, which was meant to prevent high load on our system, was itself causing sporadic high load. In the process of fetching the data to be cached, the query fetched data in a much wider scope than was needed. That led to substantial load spikes when certain queries were performed, which ended in timeouts. When a certain number of timeouts was hit, our application servers halted the website as a safety measure to avoid impacting the whole system. Once restarted, the website would run for 20 to 40 minutes before showing direct signs of degraded performance again.
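To make the halt-on-timeouts safety measure described above more concrete, here is a minimal sketch of that kind of mechanism. The class name, threshold, and counter logic are all illustrative assumptions, not iPaper's actual implementation:

```python
# Hypothetical sketch: an application server counts consecutive query
# timeouts and stops serving the site once a threshold is hit, so slow
# queries can't keep piling up and drag down the whole system.

class SiteGuard:
    def __init__(self, threshold=5):
        self.threshold = threshold          # timeouts tolerated before halting
        self.consecutive_timeouts = 0
        self.halted = False

    def record_success(self):
        # A successful query resets the streak of timeouts.
        self.consecutive_timeouts = 0

    def record_timeout(self):
        self.consecutive_timeouts += 1
        if self.consecutive_timeouts >= self.threshold:
            # Safety measure: stop serving requests entirely.
            self.halted = True


guard = SiteGuard(threshold=3)
guard.record_timeout()
guard.record_success()      # recovery resets the counter
guard.record_timeout()
guard.record_timeout()
guard.record_timeout()      # third consecutive timeout trips the halt
assert guard.halted
```

A mechanism like this explains the observed pattern: the site runs fine until enough slow queries accumulate, then goes down all at once until manually restarted.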
At 13:29 a fix for the overly wide query scope was deployed, and the system was reported stable again at 13:41.
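The shape of the root cause and its fix can be illustrated with a small, self-contained example. The table, column names, and SQLite usage below are purely hypothetical stand-ins for the real schema; the point is the difference between fetching everything and letting a WHERE clause narrow the scope:

```python
import sqlite3

# Build a toy dataset: 100 flipbooks with 50 pages each.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (flipbook_id INTEGER, page_no INTEGER, data TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [(fb, p, "x") for fb in range(100) for p in range(50)],
)

def fill_cache_wide(flipbook_id):
    # Before the fix: the query pulls pages for *every* flipbook and
    # filters in application code, so each cache fill scans 5,000 rows.
    rows = conn.execute("SELECT flipbook_id, page_no, data FROM pages").fetchall()
    return [r for r in rows if r[0] == flipbook_id]

def fill_cache_narrow(flipbook_id):
    # After the fix: the WHERE clause limits the scope to the one
    # flipbook being cached, touching only the 50 rows needed.
    return conn.execute(
        "SELECT flipbook_id, page_no, data FROM pages WHERE flipbook_id = ?",
        (flipbook_id,),
    ).fetchall()

# Both return the same result; only the amount of data scanned differs.
assert fill_cache_wide(42) == fill_cache_narrow(42)
```

With a small test dataset both versions look equally fast, which is exactly why this class of bug tends to surface only under production data volumes.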
At 15:20 the incident was closed as no further signs of issues surfaced.
Our monitoring proved effective, alerting us shortly after the first requests failed to be served. While the engineering team became aware of the issue quickly, we want to improve our dashboard metrics even further, as we didn't have direct monitoring of the websites' running state. That would have made it clear that a manual restart of the websites could have reduced the number of dropped requests while a hotfix was being worked on. Beyond the running state of the websites themselves, we will re-evaluate the metrics we currently have available to see if we can improve our response time even more.
As the root cause of the issue was a data lookup with too wide a scope, we will add further steps to our QA process to make sure we test with a large enough amount of data to surface such a problem in the future. We will also evaluate whether load tests could have surfaced the issue before it hit production.
Due to the timing of the issue, we will move our update process outside of periods with a high number of visitors to the system. Until we are confident that our change management process covers the potential impact and risk, we will keep to that schedule to ensure as little impact as possible.
Again, we would like to apologize to our customers for the impact of this incident, and assure you that we are putting effort into eliminating potential issues that could cause downtime like this again.