On Wednesday, August 8th, 2018, customers using the iPaper platform were experiencing intermittent load issues for flipbooks. The issue did not affect feeds, pop-ups, the API, or the administration system. During the incident, end users would experience flipbooks failing to load for periods of 1-2 minutes, at intervals of 5-10 minutes. We apologize to our customers who were impacted during this incident, and we are taking immediate steps to ensure a similar issue is significantly less likely to happen, as well as to ensure we can remediate similar issues more quickly going forward.
Our automated monitoring first notified us of an issue at 17:37 (CEST) on August 8th, 2018. Initial symptoms indicated that flipbooks were hanging while loading, though the rest of the system was performing nominally. Further investigation revealed increased memory load on our application servers, as well as requests queuing up rather than being served as usual.
As an initial attempt to solve the issue, we deployed larger application server instances with significantly more memory capacity. Unfortunately, this did not solve the issue, as requests were still queuing up after a while. We kept the system partially running by preemptively clearing the hanging requests from the queues. While flipbooks were still intermittently failing to load, this ensured that most customers were minimally affected while we continued diagnosing the issue.
Investigating the hanging requests that caused the queue backlog, we found that all of them were trying to download external assets for processing. Looking further, all of these requests were attempting to download data from AWS S3, where we store image assets for flipbooks. While such requests usually either succeed or fail, in this case they were hanging without failing. This caused the requests to stay around significantly longer than usual, which in turn caused other requests to queue up and ultimately fail.
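The failure mode described above, where a handful of hung downloads starves an entire worker pool while nothing technically fails, can be illustrated with a minimal simulation. The worker pool, tick loop, and `simulate` function below are illustrative constructs, not a description of our actual request pipeline:

```python
from collections import deque

def simulate(num_workers, tasks, ticks):
    """Toy model of a fixed-size worker pool. Normal tasks take one tick;
    'hang' tasks never complete and permanently occupy a worker, so later
    tasks pile up in the queue even though no request has actually failed."""
    queue = deque(tasks)
    busy = []          # remaining ticks per occupied worker; None = hung forever
    served = 0
    for _ in range(ticks):
        busy = [t - 1 if t is not None else None for t in busy]
        served += sum(1 for t in busy if t == 0)        # tasks completed this tick
        busy = [t for t in busy if t is None or t > 0]  # free up finished workers
        while queue and len(busy) < num_workers:        # hand out queued work
            busy.append(None if queue.popleft() == "hang" else 1)
    return served, len(queue)                           # (completed, backlog)

# All requests healthy: the queue drains completely.
print(simulate(2, ["ok"] * 6, ticks=10))                               # → (6, 0)
# Two hung downloads capture both workers; healthy requests stay queued.
print(simulate(2, ["ok", "hang", "ok", "hang", "ok", "ok"], ticks=10)) # → (2, 2)
```

Note that in the second run the hung tasks never count as failures; they simply hold their workers, which is why failure-oriented monitoring alone does not surface this condition.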
At 20:37 we deployed a filter which blocked the requests we knew were hanging, restoring stability for the majority of our clients. At 21:02 we deployed a final fix which circumvented the root network issue that caused the downloads to hang, and limited the impact of potential future hangs by aggressively terminating and retrying hanging requests before they could cause any issues.
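The terminate-and-retry approach can be sketched as a small wrapper around the download call. Everything here is hypothetical rather than taken from our codebase: `fetch` is assumed to enforce its own per-attempt timeout (e.g. a socket read timeout) so that no single attempt can hang indefinitely, and the attempt count and backoff values are placeholders:

```python
import time

def fetch_with_retry(fetch, url, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call fetch(url); on TimeoutError, retry with exponential backoff
    instead of letting a single request occupy a worker indefinitely.
    `sleep` is injectable so the backoff can be skipped in tests."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except TimeoutError as exc:
            last_exc = exc
            sleep(base_delay * (2 ** attempt))  # back off: 0.1s, 0.2s, 0.4s, ...
    raise last_exc  # all attempts timed out: fail fast rather than hang
```

The key property is that the worst-case time a worker spends on one asset is now bounded by the per-attempt timeout times the attempt count, so a misbehaving upstream can no longer tie up the pool indefinitely.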
The system was monitored closely afterwards. After the final fix was deployed, we saw an immediate effect: all flipbooks were loading normally, with no requests hanging or queuing any longer.
While our overall monitoring worked and notified us within a single minute of the issue occurring, it took us too long to identify that requests were hanging rather than failing, and thus filling up the request queue. We have ample monitoring and reporting of failures, but lacked easy access to metrics that would enable us to identify the long-running requests that ultimately caused the failure. Once those requests were identified, diagnosing the reason for the hangs and deploying a fix was swift.
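One way to gain visibility into long-running requests, rather than only failures, is to record each in-flight request with its start time and periodically surface the ones exceeding a threshold. The `InFlightTracker` class below is a hypothetical sketch of that idea, not a description of our monitoring stack; the clock is injectable so the behavior can be verified without real waiting:

```python
import threading
import time

class InFlightTracker:
    """Track requests currently in flight with their start times, so that
    long-running (possibly hung) requests can be surfaced on a dashboard
    instead of being invisible until they fail."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._lock = threading.Lock()
        self._inflight = {}  # request id -> (label, start time)
        self._next_id = 0

    def start(self, label):
        """Register a request as in flight; returns an id for finish()."""
        with self._lock:
            self._next_id += 1
            self._inflight[self._next_id] = (label, self._clock())
            return self._next_id

    def finish(self, request_id):
        """Remove a completed request from the in-flight set."""
        with self._lock:
            self._inflight.pop(request_id, None)

    def slow_requests(self, threshold):
        """Return labels of requests running longer than `threshold` seconds."""
        now = self._clock()
        with self._lock:
            return [label for label, started in self._inflight.values()
                    if now - started > threshold]
```

With a metric like this exported, an alert on "any request in flight longer than N seconds" would have pointed directly at the hanging S3 downloads, rather than leaving them to be inferred from queue growth.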
We will be implementing several changes to ensure an issue like this is significantly less likely to happen, as well as speeding up the recovery process:
We would again like to apologize for the impact this incident had on our customers and their clients. We take all issues that affect the availability and performance of our system extremely seriously, especially when they affect all customer-facing parts of our system. Having identified and fixed this issue, we are confident that it will not recur, and that the changes we are making will improve our overall durability.