On Tuesday, 24th July 2018, customers using the iPaper platform experienced sporadically unreachable flipbooks and pop-ups, as well as sporadic issues with the API and the administration system. Each period in which the system was unresponsive lasted 1-7 minutes. We apologize to our customers who were impacted by this incident, and we are taking immediate steps to improve the platform's stability and availability.
On Tuesday, 24th July 2018, from 14:42 to 22:35 (CEST), our system responded with 404 Not Found or 504 Gateway Timeout to some requests. Each of these periods lasted 1-7 minutes, which we know is far from what our customers expect, and we truly apologize. Automated monitoring alerted the engineering team at 14:44 about issues with serving flipbooks to our customers. After an initial overview, the incident was first investigated as a networking issue in our infrastructure. Unfortunately, that investigation led to no result, and the focus moved to the database, since the application servers were unable to fetch data during the periods of downtime. Further investigation of the database showed that our processor of new incoming flipbooks held unintentionally long transaction locks, which prevented the rest of the system from fetching data. At 22:35 the final investigations pinpointed the root cause, and the processor was shut down until a fix for the lock issue was ready. At 23:02 the fixed processor was deployed and then re-enabled to process new flipbooks. The systems were monitored overnight, and by the morning no further issues with serving flipbooks had been found.
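To illustrate the failure mode in general terms, the sketch below shows how one connection that holds a lock inside an open transaction can make reads from another connection fail until the transaction ends. It uses SQLite purely as a stand-in; the database engine, the table name `flipbooks`, and the 0.1 second timeout are assumptions for illustration and do not describe the actual iPaper setup.

```python
import os
import sqlite3
import tempfile


def reader_blocked_by_open_transaction():
    """Return True if a read fails while another connection holds a lock."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")

        # "Processor" connection. isolation_level=None means we issue
        # BEGIN/COMMIT ourselves instead of relying on implicit transactions.
        processor = sqlite3.connect(path, isolation_level=None)
        processor.execute(
            "CREATE TABLE flipbooks (id INTEGER PRIMARY KEY, name TEXT)"
        )
        processor.execute("INSERT INTO flipbooks (name) VALUES ('catalog')")

        # "Application server" connection that gives up after 0.1 s of waiting.
        reader = sqlite3.connect(path, timeout=0.1)

        processor.execute("BEGIN EXCLUSIVE")  # lock is taken and held
        try:
            try:
                reader.execute("SELECT name FROM flipbooks").fetchall()
                blocked = False
            except sqlite3.OperationalError:  # "database is locked"
                blocked = True
        finally:
            processor.execute("COMMIT")  # lock is released

        # Once the transaction has ended, the same read succeeds.
        rows = reader.execute("SELECT name FROM flipbooks").fetchall()
        reader.close()
        processor.close()
        return blocked and rows == [("catalog",)]
```

The longer the transaction stays open, the longer every other reader has to wait or fail, which is why the duration of the transaction matters more than the amount of data it touches.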
The issues with our processor led to a long queue of new flipbooks waiting to be processed, which meant that customers had to wait for their new flipbooks on 25th July 2018.
The engineering team received alerts within 2 minutes of the first errors appearing in the system and immediately began troubleshooting. The root cause was discovered around 22:15 and confirmed at 22:35, when the processor was shut down to ensure that flipbooks were available at all times again.
In addition to solving the issue, we will be implementing the following changes:
• Whenever a change affects the length of a transaction scope, it will go through further review and testing to ensure that our scopes are as short as possible, so that locks are not held longer than necessary and deadlocks are prevented.
• Because of the extended queue in our processor, we will look at improving the speed of processing new flipbooks, so that we can clear the queue faster when we see high volumes in the future.
• To speed up diagnosis in the future, we will add further monitoring to our system. The database in particular will be monitored more closely, so that we have better data available for debugging any issues that might turn up.
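As a minimal sketch of the first and third points, the example below does the slow processing before any transaction is opened, keeps the transaction to the writes alone, and records any transaction that is held longer than a time budget. Everything in it is an assumption for illustration: SQLite stands in for the real database, and the names `short_transaction`, `render_pages`, `process_flipbook`, `make_db`, and `MAX_TX_SECONDS` are hypothetical, not part of the iPaper codebase.

```python
import sqlite3
import time
from contextlib import contextmanager

# Hypothetical budget: transactions held longer than this are reported.
MAX_TX_SECONDS = 0.5

# (label, seconds) pairs; a real system would log or alert instead.
slow_transactions = []


@contextmanager
def short_transaction(conn, label, max_seconds=MAX_TX_SECONDS):
    """Run the body in one transaction and record it if it was held too long."""
    started = time.monotonic()
    try:
        with conn:  # commits on success, rolls back on an exception
            yield conn
    finally:
        held = time.monotonic() - started
        if held > max_seconds:
            slow_transactions.append((label, held))


def render_pages(name):
    """Stand-in for the expensive processing step (hypothetical)."""
    return [f"{name}-page-{i}" for i in range(3)]


def process_flipbook(conn, name):
    # Do the slow work first, while no database locks are held.
    pages = render_pages(name)
    # Open the transaction only for the writes themselves.
    with short_transaction(conn, f"store:{name}"):
        cur = conn.execute("INSERT INTO flipbooks (name) VALUES (?)", (name,))
        conn.executemany(
            "INSERT INTO pages (flipbook_id, label) VALUES (?, ?)",
            [(cur.lastrowid, page) for page in pages],
        )
    return len(pages)


def make_db():
    """Create an in-memory database with the two illustrative tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE flipbooks (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE pages (flipbook_id INTEGER, label TEXT)")
    return conn
```

Calling `process_flipbook(make_db(), "catalog")` stores one flipbook with three pages. If a later change moved `render_pages` inside the `with` block, the transaction would grow with the processing time, and the entries in `slow_transactions` would make that visible in review and testing.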
We would again like to apologize for the impact this incident had on all our customers and their clients. We take all issues that affect the availability and performance of our system extremely seriously, especially when they affect all customer-facing parts of the system. We will investigate further to ensure that there are no other pitfalls that could affect our availability and performance, so that our customers can always trust that their flipbooks will be online.