On Friday the 7th of May 2021, customers experienced a partial outage of the iPaper platform. All parts of our system (Admin, Flipbook viewer, Display and Popups) were periodically not responding with the expected content. During the outage, customers saw varying levels of responsiveness from our platform. During the worst period (between 13:08 (CEST) and 18:43 (CEST)), the platform responded to only 25% of requests, while at other times it fluctuated between 25% and 100%.
The incident was the result of an underlying hardware failure in our data storage, which led to sporadic outages and forced us to perform an emergency storage migration.
We apologize to all of our customers who were affected by this incident.
At 07:51 (CEST) our team received notifications that one of our hosting zones reported no functional servers available. We immediately started investigating the issue and spun up extra capacity in the healthy zone to ensure capacity for the expected upcoming traffic.
At 09:30 (CEST) we had identified that the hosting zone itself was not inaccessible; the problem was in the communication between our data storage and our application servers.
At 10:24 (CEST) we detected an increase in errors related to the connection to our data storage, which led to sporadic outages. This made us escalate to an emergency storage migration.
At 12:42 (CEST) we initiated the migration of our data storage to mitigate the sporadic outages.
At 13:08 (CEST) we detected increased latency from our platform, which will have resulted in our end users experiencing downtime.
At 15:17 (CEST) application servers started to respond well from the first parts of the storage that had been migrated.
At 18:43 (CEST) we finished the last part of our migration. Vital metrics, such as latency and request count, dropped back to normal levels. Flipbook viewer, Displays and Popups loaded as expected from all application servers. An issue in our administration system was detected that prevented our customers from accessing their data.
At 18:54 (CEST) we had mitigated the loading issue in our administration system.
At 19:19 (CEST) our Customer Care team reported that new flipbooks weren't being processed for publication.
At 20:08 (CEST) the flipbook processing was back to normal, as verified by our Customer Care team.
During the migration of our data storage we discovered a few minor issues, which have already been addressed in our storage migration plan. In addition to these updates, we have put in place a more frequent schedule for testing our migrations, to ensure that our plan is always up to date.
To ensure we are able to catch such incidents faster, we will be incorporating a wider list of metrics into our health checks. Beyond the health checks themselves, we will gather even more metrics to assist us in pinpointing issues for faster resolution.
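As a rough illustration of the idea, the Python sketch below shows a health check that looks at storage connectivity as well as server reachability, and reports which signal failed. It is a minimal sketch under stated assumptions: the field names, the function names and the thresholds are hypothetical and chosen for the example only, and they do not describe our actual health check configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HealthSample:
    """One observation window for a single application server (hypothetical shape)."""
    server_reachable: bool      # the classic "is the server up" signal
    storage_error_rate: float   # fraction of storage calls that failed, 0.0 to 1.0
    storage_latency_ms: float   # latency of storage calls in milliseconds
    success_ratio: float        # fraction of requests answered as expected, 0.0 to 1.0


# Illustrative thresholds only (assumptions, not production values).
MAX_STORAGE_ERROR_RATE = 0.01
MAX_STORAGE_LATENCY_MS = 250.0
MIN_SUCCESS_RATIO = 0.99


def failing_checks(sample: HealthSample) -> List[str]:
    """Return the names of all signals that are outside their threshold."""
    failures = []
    if not sample.server_reachable:
        failures.append("server_unreachable")
    if sample.storage_error_rate > MAX_STORAGE_ERROR_RATE:
        failures.append("storage_error_rate")
    if sample.storage_latency_ms > MAX_STORAGE_LATENCY_MS:
        failures.append("storage_latency")
    if sample.success_ratio < MIN_SUCCESS_RATIO:
        failures.append("success_ratio")
    return failures


def is_healthy(sample: HealthSample) -> bool:
    """A server counts as healthy only when no signal is failing."""
    return not failing_checks(sample)
```

Returning the list of failing signals, instead of a single up/down flag, is what helps with pinpointing: a reachable server with failing storage signals points at the storage connection and not at the hosting zone.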
In the final stages of our migration we noticed that a few services didn't have access to our data storage due to incorrect service mapping. We will initiate a process to automate this step, followed by a validation of it. By automating this step, we will no longer have a server trying to contact the old data storage after it has been migrated away from.
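The validation step could, for example, take the form sketched below in Python: after a migration, every service's configured storage endpoint is compared against the set of retired endpoints, and the check fails if any service still points at the old storage. This is a sketch under assumptions only; the service names, the endpoint names and both function names are hypothetical and do not reflect our actual setup.

```python
from typing import Dict, List, Set


def stale_service_mappings(mappings: Dict[str, str], retired_endpoints: Set[str]) -> List[str]:
    """Return the services (sorted by name) that still point at a retired storage endpoint."""
    return sorted(
        service
        for service, endpoint in mappings.items()
        if endpoint in retired_endpoints
    )


def validate_service_mappings(mappings: Dict[str, str], retired_endpoints: Set[str]) -> None:
    """Raise an error if any service is still mapped to storage that was migrated away from."""
    stale = stale_service_mappings(mappings, retired_endpoints)
    if stale:
        raise RuntimeError(
            "services still mapped to retired storage: " + ", ".join(stale)
        )


# Hypothetical example data: service name -> storage endpoint it is configured to use.
example_mappings = {
    "service-a": "storage-new.example",
    "service-b": "storage-old.example",
    "service-c": "storage-new.example",
}
example_retired = {"storage-old.example"}
```

With the example data above, `stale_service_mappings(example_mappings, example_retired)` reports `service-b`, and `validate_service_mappings` raises until that mapping is corrected. Running such a check as the final step of a migration would surface a leftover mapping before customers notice it.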