Intermittent flipbook load failures
Incident Report for iPaper
Postmortem

Incident Summary

On Wednesday, August 8th, 2018, customers using the iPaper platform were experiencing intermittent load issues for flipbooks. This issue did not affect feeds, pop-ups, the API, or the administration system. During the incident, end users would experience flipbooks failing to load for periods of 1-2 minutes, at intervals of 5-10 minutes. We apologize to our customers who were impacted during this incident, and we are taking immediate steps to ensure a similar issue is significantly less likely to happen, as well as to ensure we can remediate similar issues more quickly going forward.

Detailed Description of Impact & Remediation

Our automated monitoring first notified us of an issue at 17:37 (CEST) on August 8th, 2018. Initial symptoms indicated that flipbooks were hanging while loading, though the rest of the system was performing nominally. Further investigation revealed an increased memory load on our application servers, as well as requests queuing up, rather than being served as usual.

As an initial attempt to solve the issue, we deployed larger application server instances with significantly more memory capacity. Unfortunately, this did not solve the issue, as requests still queued up after a while. The system was kept partially up by preemptively clearing the hanging requests from the queues. While flipbooks were still intermittently failing to load, this ensured that most customers were minimally affected while we continued diagnosing the issue.

Investigating the hanging requests that caused the queue backlog, we concluded that all of them were trying to download external assets for processing. Looking further, all of these requests were attempting to download data from AWS S3, where we store image assets for flipbooks. While such requests will usually either succeed or fail, in this case they were hanging without failing. This caused them to stay around significantly longer than usual, causing other requests to queue up and ultimately fail.
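To illustrate why a hanging request is worse than a failing one, the following toy simulation (not iPaper's actual code; all names and numbers are illustrative assumptions) models a fixed pool of workers draining a request queue. A request that fails releases its worker immediately, but one that hangs pins the worker indefinitely, so the backlog behind it only grows:

```python
import math

# Toy, deterministic model of the failure mode. Each request has a service
# time; math.inf represents a request that hangs without failing.
def simulate(durations, workers, horizon):
    """Assign requests in order to the earliest-free worker.

    Returns (completed, queued): requests finished within `horizon`
    time units, and requests that never even started by then.
    """
    free_at = [0.0] * workers        # time at which each worker frees up
    completed = 0
    queued = 0
    for d in durations:
        w = min(range(workers), key=lambda i: free_at[i])
        start = free_at[w]
        if start >= horizon:
            queued += 1              # stuck behind hung workers
            continue
        free_at[w] = start + d
        if d != math.inf and free_at[w] <= horizon:
            completed += 1           # fast success or fast failure
    return completed, queued
```

With two workers and a horizon of 10 time units, ten one-unit requests all complete; but if the first two requests hang, both workers are pinned at once and the remaining eight requests simply queue up, which mirrors the backlog described above.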

At 20:37 we deployed a filter that blocked the requests we knew were hanging, restoring stability for the majority of our clients. At 21:02 we deployed a final fix that circumvented the root network issue causing the downloads to hang, and that limits potential hangs by aggressively terminating and retrying hanging requests before they cause any issues.
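The terminate-and-retry approach can be sketched as follows. This is a hedged illustration rather than iPaper's actual code: the function name, timeout, and retry count are assumptions, and the `opener` parameter exists only so the retry logic can be exercised in isolation.

```python
import urllib.request
import urllib.error

def fetch_asset(url, timeout=5.0, retries=3, opener=urllib.request.urlopen):
    """Download an asset, aggressively terminating and retrying hangs.

    The timeout bounds both connect and read, so a request that neither
    succeeds nor fails is forcibly terminated instead of hanging forever,
    and is then retried a bounded number of times.
    """
    last_error = None
    for _ in range(retries):
        try:
            with opener(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError, OSError) as exc:
            last_error = exc         # a hang now surfaces as a timeout
    raise last_error
```

The key property is that a hang is converted into a fast, explicit failure after at most `retries × timeout` seconds, so it can no longer occupy a worker indefinitely.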

The system was monitored closely afterwards. After the final fix was deployed, we saw an immediate effect: all flipbooks were loading normally, with no requests hanging or queueing any longer.

Preventing Similar Issues

While our overall monitoring worked and notified us of the issue within a single minute of it occurring, it took us too long to identify that requests were hanging rather than failing, and were thus causing the request queue to fill up. We have ample monitoring and reporting of failures, but lacked easy access to metrics that would have enabled us to identify the long-running requests that ultimately caused the failure. Once those requests were identified, diagnosing the reason they were hanging, as well as deploying a fix, was swift.

We will be implementing several changes to ensure an issue like this is significantly less likely to happen, as well as speeding up the recovery process:

  • Expand our metric monitoring to include metrics specific to long-running requests, request execution concurrency, and queue size across our fleet of servers.
  • Reduce timeouts on certain endpoints, ensuring that an endpoint that under normal usage should never take more than 5 seconds to execute cannot continue executing for any longer than necessary. While this will still cause hanging requests to fail, it won't take down other parts of the system.
  • Expand our recovery diagnosis plan to include soft-failing requests, where we don't get an explicit network error but rather a hang that never resolves.
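As a sketch of the first of these metrics, a minimal in-flight request tracker might look like the following. This is an illustrative assumption of how such a metric could be collected, not our production monitoring code; all class and method names are hypothetical.

```python
import threading
import time

class RequestTracker:
    """Track in-flight requests so long-running ones become visible.

    Exposes the two metrics discussed above: which requests have been
    running longer than a threshold, and the current in-flight count.
    """
    def __init__(self, slow_threshold=5.0):
        self.slow_threshold = slow_threshold
        self._inflight = {}          # request id -> monotonic start time
        self._lock = threading.Lock()
        self._next_id = 0

    def start(self):
        """Register a new request; returns its id."""
        with self._lock:
            self._next_id += 1
            self._inflight[self._next_id] = time.monotonic()
            return self._next_id

    def finish(self, req_id):
        """Mark a request as completed (or failed)."""
        with self._lock:
            self._inflight.pop(req_id, None)

    def long_running(self, now=None):
        """Return ids of requests exceeding the slow threshold."""
        now = time.monotonic() if now is None else now
        with self._lock:
            return [rid for rid, t0 in self._inflight.items()
                    if now - t0 > self.slow_threshold]

    def inflight_count(self):
        with self._lock:
            return len(self._inflight)
```

Sampling `long_running()` and `inflight_count()` periodically would have surfaced the hanging S3 downloads long before the queue backlog became visible to end users.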

We would again like to apologize for the impact this incident had on our customers and their clients. We take all issues that affect the availability and performance of our system extremely seriously, especially when they affect all customer-facing parts of the system. Having identified and fixed this issue, we are confident that it will not recur, and that the fixes have improved our overall durability.

Posted Aug 09, 2018 - 10:53 CEST

Resolved
We are confident that the underlying stability issue has been resolved and that all flipbooks are performing nominally, as they have been for the last several hours. We will follow up with a full post mortem once we have covered all details internally.
Posted Aug 08, 2018 - 22:33 CEST
Monitoring
The underlying issue has been identified and corrected. All flipbooks have been operational for the last couple of hours. We are continuing to monitor the system before finally considering the issue resolved.
Posted Aug 08, 2018 - 21:35 CEST
Investigating
We're seeing intermittent load failures of flipbooks and are investigating the issue.
Posted Aug 08, 2018 - 18:01 CEST
This incident affected: Flipbooks.