System unavailable
Incident Report for iPaper
Postmortem

Incident Summary

On Tuesday, 24th July 2018, customers using the iPaper platform experienced sporadically unreachable flipbooks and pop-ups, as well as sporadic issues with the API and the administration system. Each period in which the system was unresponsive lasted 1-7 minutes. We apologize to our customers who were impacted during this incident, and we are taking immediate steps to improve the platform's stability and availability.

Detailed Description of Impact

On Tuesday, 24th July 2018, from 14:42 to 22:35 (CEST), our system responded with 404 Not Found or 504 Gateway Timeout to some requests. Each period in which the system responded this way lasted 1-7 minutes, which we know is far from what our customers expect, and we truly apologize.

Automated monitoring alerted the engineering team at 14:44 about issues with serving flipbooks to our customers. After an initial overview, the incident was investigated as a networking issue in our infrastructure. Unfortunately, these investigations led to no result, and the focus moved to the database, as the application servers were unable to fetch data during the periods of downtime. Further investigation of the database revealed that our processor of new incoming flipbooks was holding unintended long-running transaction locks, which prevented the rest of the system from fetching data.

At 22:35 the final investigations pinpointed the root cause, and the processor was shut down until a final fix for the lock issue was ready. At 23:02 the fixed processor was deployed and then re-enabled to process new flipbooks. The systems were monitored overnight, and by the morning no further issues with serving flipbooks had been found.

The issues with our processor led to a long queue of new flipbooks awaiting processing, which has resulted in our customers having to wait for their new flipbooks on the 25th of July 2018.
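
The locking behaviour described above can be reproduced in miniature. The following is only an illustrative sketch: it uses SQLite (in its default rollback-journal mode) as a stand-in, since this report does not name the database involved, and the table, the titles, and the connection names `processor` and `app_server` are hypothetical.

```python
import os
import sqlite3
import tempfile

# Hypothetical database file; SQLite stands in for the real database.
path = os.path.join(tempfile.mkdtemp(), "demo.db")

# isolation_level=None lets us issue BEGIN/COMMIT explicitly.
processor = sqlite3.connect(path, isolation_level=None)
processor.execute("CREATE TABLE flipbooks (id INTEGER PRIMARY KEY, title TEXT)")
processor.execute("INSERT INTO flipbooks VALUES (1, 'Summer catalogue')")

# A second connection plays the application server.
# It waits at most 0.2 seconds for a lock before giving up.
app_server = sqlite3.connect(path, timeout=0.2)

# The "processor" opens a transaction that takes an exclusive lock
# and keeps it open while it continues working.
processor.execute("BEGIN EXCLUSIVE")
processor.execute("UPDATE flipbooks SET title = 'Autumn catalogue' WHERE id = 1")

try:
    app_server.execute("SELECT title FROM flipbooks").fetchall()
    blocked = False
except sqlite3.OperationalError:
    # SQLite raises "database is locked" once the wait times out.
    blocked = True

# Once the long transaction ends, reads succeed again.
processor.execute("COMMIT")
rows = app_server.execute("SELECT title FROM flipbooks").fetchall()
```

While the transaction is open, the read from the second connection fails; as soon as it commits, the same read returns data. Other databases use different lock modes and timeouts, so the details will differ, but the pattern of one long-held lock starving unrelated readers is the same.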

Remediation and Prevention

The engineering team received alerts within 2 minutes of the first errors appearing in the system, and immediately began troubleshooting the issue. Around 22:15 the root cause was discovered, and it was confirmed at 22:35, when the processor was shut down to ensure that flipbooks were available at all times again.

In addition to solving the issue, we will be implementing some changes:
• Whenever changes are made that affect the length of a transaction scope, there will be further review and testing to ensure our scopes are as short as possible, to prevent deadlocks.
• Due to the extended queue in our processor, we will look at improving the speed of processing new flipbooks, so that we can clear the queue faster in the future when volumes are high.
• To speed up diagnosis in the future, we will add further monitoring to our system. In particular, our database will be monitored more closely so that we have better data available for debugging issues that might turn up.
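
As a rough illustration of what a short transaction scope means in practice, the sketch below does the slow work before any transaction is opened and wraps only the final write in one. It is not iPaper's actual processor code: SQLite is used as a stand-in database, and `render_pages`, `process_flipbook`, and the `flipbooks` table are hypothetical names chosen for the example.

```python
import sqlite3
import time


def render_pages(flipbook_id: int) -> str:
    """Hypothetical stand-in for the slow processing step."""
    time.sleep(0.05)  # simulate expensive work
    return f"rendered-{flipbook_id}"


def process_flipbook(conn: sqlite3.Connection, flipbook_id: int) -> float:
    """Process one flipbook; return how long the write transaction was open."""
    # Slow work happens first, with no transaction open and no locks held.
    result = render_pages(flipbook_id)

    # Only the final write runs inside a transaction, so locks are held briefly.
    started = time.perf_counter()
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "INSERT INTO flipbooks (id, content) VALUES (?, ?)",
            (flipbook_id, result),
        )
    return time.perf_counter() - started


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flipbooks (id INTEGER PRIMARY KEY, content TEXT)")
held = process_flipbook(conn, 1)
```

Returning the time the transaction was held also gives a simple number to log or alert on, which fits the additional database monitoring mentioned in the last point.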

We would again like to apologize for the impact this incident had on all our customers and their clients. We take all issues that affect the availability and performance of our system extremely seriously, especially when they affect all customer-facing parts of our system. We will investigate further to ensure that there are no other pitfalls that could affect our availability and performance, so that our customers can always trust that their flipbooks will be online.

Posted Jul 25, 2018 - 20:38 CEST

Resolved
All parts of the iPaper platform have been stable for 12+ hours now, and we consider the incident closed. As promised last night, a full postmortem will follow later today.

Due to the issues we had yesterday, we have a sizable queue of new flipbooks pending processing. We expect processing of those to be completed within this business day.

A large share of the new flipbooks in the queue come from a few customers, but the distribution in our processor ensures that we process evenly across customers, so no one should be stuck in traffic.
Posted Jul 25, 2018 - 11:46 CEST
Update
We have identified the issue and resolved it. We will continue our monitoring and will post a full postmortem within the next 24 hours.
Posted Jul 24, 2018 - 23:26 CEST
Monitoring
We have adjusted parts of our infrastructure, and all parts of our system are online again.
Posted Jul 24, 2018 - 17:20 CEST
Investigating
We are currently investigating the continued outages.
Posted Jul 24, 2018 - 16:41 CEST
Monitoring
All parts of our system are back and operational again, but we'll keep monitoring for any potential errors.
Posted Jul 24, 2018 - 15:52 CEST
Investigating
We are currently experiencing an outage in our hosting setup. We are working on a solution to get everything back online.
Posted Jul 24, 2018 - 14:47 CEST
This incident affected: Admin, API, Flipbooks, and Pop-ups.