On October 22, 2018, customers experienced a series of intermittent Flipbook loading issues. Feeds, pop-ups, the API and the administration system were not affected. During the incident, end users would see Flipbooks fail to load entirely, or parts thereof fail to load, for short periods of 1-2 minutes at a time. We apologize to all of our customers who were impacted by the incident.
At 10:29 (CEST) our automated monitoring system notified us of a problem reaching our Flipbooks. The first investigation showed that one of our application servers had stopped serving content. As a result, content was sometimes not served to the browser; when the affected request was the initial page load, the Flipbook appeared not to respond at all.
Because we had not seen this behavior before, the team was fairly certain that it was related to the release we made on Thursday/Friday the week before. Since the application server had stopped responding entirely, we were confident that the cause was either a series of uncaught errors, where the application server shuts down as a preventive measure, or recursive code without an exit condition. While one part of the team went through the recently deployed code to identify changes that included recursion, another part started investigating crash dumps from the server to identify the actual failing parts.
At 11:23 we deployed a toggle that disabled a new feature we had released the week before. We chose to disable that feature because we could see a large amount of traffic going down that feature's code path. Once the feature was disabled, the system returned to a stable state, which gave the team room to identify the root cause.
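As an illustration of the kind of toggle described above, here is a minimal sketch in Python. It is not our production code; the names `FeatureToggles`, `new_feature` and `handle_request` are hypothetical and chosen only for this example.

```python
class FeatureToggles:
    """In-memory toggle store (hypothetical; a real system would load this from config)."""

    def __init__(self, defaults: dict[str, bool]):
        self._flags = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        # Unknown toggles are treated as disabled, which is the safe default.
        return self._flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled


def handle_request(toggles: FeatureToggles) -> str:
    # Route traffic away from the new code path when the toggle is off.
    if toggles.is_enabled("new_feature"):
        return "new code path"
    return "stable code path"


toggles = FeatureToggles({"new_feature": True})
toggles.set("new_feature", False)  # the mitigation step: disable without a rollback
```

The point of such a toggle is that a suspect code path can be taken out of traffic quickly while the investigation continues, and switched back on later without another release of the feature itself.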
At 12:05 the cause of the dying application server was identified as a recursion issue with our error page. We had recently changed the way we serve error pages, and unfortunately that change could cause recursive requests in the system: an error occurring while rendering the error page triggered another attempt to render the error page. A fix was deployed at 13:17. It makes sure that the system falls back to the server's built-in error pages if an error occurs while we are rendering the custom error page.
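The shape of this fix can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the actual change we deployed: `render_error_response`, `BUILT_IN_ERROR_PAGE` and the two sample renderers are hypothetical names, and the built-in page is represented by a static string.

```python
from typing import Callable

# Stand-in for the server's built-in error page (assumption for this sketch).
BUILT_IN_ERROR_PAGE = "<h1>500 Internal Server Error</h1>"


def render_error_response(
    render_custom_page: Callable[[Exception], str], error: Exception
) -> str:
    """Try the custom error page once; never re-enter error handling."""
    try:
        return render_custom_page(error)
    except Exception:
        # The custom page itself failed. Routing this second failure back
        # into the error handler is what creates the recursion, so serve
        # the static built-in page instead.
        return BUILT_IN_ERROR_PAGE


def working_custom_page(error: Exception) -> str:
    return f"<h1>Something went wrong</h1><p>{error}</p>"


def broken_custom_page(error: Exception) -> str:
    # Simulates an error occurring while rendering the error page.
    raise RuntimeError("error page template failed")
```

The important property is that a failure inside the custom error page ends in a page that cannot itself fail, so the chain of error handling always terminates.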
After monitoring the stability of the system for a while, we enabled the new feature again at 14:19 and kept monitoring for further issues.
At 16:30 the issue was closed after roughly 2 hours of monitoring during which no further errors surfaced.
As the underlying issue was caused by an error page failing, we have made sure that if an error happens while rendering the custom error page, we fall back to the server's built-in error page.
The main reason for us keeping the custom error page is being able to guide our users in case there is an issue with the configuration of the Flipbook. This provides highly valuable feedback to our users, and keeps them in the safe and well-known UI of our environment.
Besides adding the safeguard of falling back to the built-in error pages, we are working on adding anomaly detection to catch increases in errors in specific areas of the system, which can be caused by errors in new features.
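One simple way to detect such increases is to compare the current error count for an area against its own recent baseline. The Python sketch below illustrates that idea only; it does not describe the detection we are building, and the class name `ErrorRateMonitor`, the area names and the default thresholds are all assumptions made for this example.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class ErrorRateMonitor:
    """Flags an area when its error count jumps well above its recent baseline."""

    def __init__(self, window: int = 30, sigmas: float = 3.0, min_samples: int = 5):
        self._sigmas = sigmas
        self._min_samples = min_samples
        # One bounded history of per-minute error counts for each area.
        self._history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, area: str, errors_per_minute: int) -> bool:
        """Record a sample and return True if it looks anomalous."""
        history = self._history[area]
        anomalous = False
        if len(history) >= self._min_samples:
            baseline = mean(history)
            # Floor the spread at 1 so a flat baseline does not flag tiny changes.
            spread = max(pstdev(history), 1.0)
            anomalous = errors_per_minute > baseline + self._sigmas * spread
        history.append(errors_per_minute)
        return anomalous


monitor = ErrorRateMonitor()
for count in [2, 3, 2, 4, 3, 2]:
    monitor.observe("error-page", count)  # normal traffic builds the baseline
spike_detected = monitor.observe("error-page", 40)
```

Tracking each area separately is what lets a spike in one code path, such as a newly released feature, stand out even when the overall error volume of the system still looks normal.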