Intermittent load failures

Incident Report for iPaper

Postmortem

Incident Summary

On Friday the 7th of May 2021, customers experienced a partial outage of the iPaper platform. All parts of our system (Admin, Flipbook viewer, Display and Popups) were periodically not responding with the expected content. During the outage, customers saw varying levels of responsiveness from our platform. During the worst period (between 13:08 (CEST) and 18:43 (CEST)) the platform responded to only 25% of requests, while at other times the response rate fluctuated between 25% and 100%.

The incident was the result of an underlying hardware failure in our data storage, which led to sporadic outages and forced us to perform an emergency storage migration.

We apologize to all of our customers who were affected by this incident.

Detailed Description of Impact & Remediation

At 07:51 (CEST) our team received notifications that one of our hosting zones reported no functional servers available. We immediately started investigating the issue and spun up extra capacity in the healthy zone to handle the expected upcoming traffic.

At 09:30 (CEST) we had identified that the hosting zone itself was not inaccessible; the problem lay in the communication between our data storage and our application servers.

At 10:24 (CEST) we detected an increase in errors related to the connection to our data storage, which led to sporadic outages. This made us escalate to an emergency storage migration.

At 12:42 (CEST) we initiated the migration of our data storage to mitigate the sporadic outages.

At 13:08 (CEST) we detected increased latency from our platform, which will have resulted in downtime for our end users.

At 15:17 (CEST) application servers started to respond well from the first parts of the storage that had been migrated.

At 18:43 (CEST) we finished the last part of our migration. Vital metrics such as latency and request count dropped back to normal levels. Flipbook viewer, Displays and Popups loaded as expected from all application servers. An issue in our administration system was detected, which prevented our customers from accessing their data.

At 18:54 (CEST) we had mitigated the loading issue in our administration system.

At 19:19 (CEST) our Customer Care team reported that new flipbooks weren't being processed for publication.

At 20:08 (CEST) flipbook processing was back to normal, as verified by our Customer Care team.

Prevent Similar Issues

While migrating our data storage we discovered a few minor issues, which have already been addressed in our storage migration plan. Besides these updates, we have set a more frequent schedule for testing our migrations, to ensure that our plan is always up to date.

To catch such incidents faster, we will incorporate a wider list of metrics into our health checks. Beyond the health checks themselves, we will gather even more metrics to help us pinpoint issues and resolve them faster.
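As an illustration only, a health check that looks at several metrics rather than a single signal could be sketched as below. The metric names and limits are hypothetical and do not describe our actual monitoring setup; the sketch only shows the idea of reporting which metrics are out of bounds, so that the failing part of the system is easier to pinpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    name: str                     # metric name (hypothetical)
    limit: float                  # boundary between healthy and unhealthy
    higher_is_worse: bool = True  # False for ratios where low values are bad


# Hypothetical thresholds, chosen for illustration only.
THRESHOLDS = [
    Threshold("storage_error_rate", 0.01),
    Threshold("p95_latency_ms", 800.0),
    Threshold("request_success_ratio", 0.99, higher_is_worse=False),
]


def failing_metrics(sample, thresholds=THRESHOLDS):
    """Return the names of metrics that are missing or breach their limit."""
    failures = []
    for t in thresholds:
        value = sample.get(t.name)
        if value is None:
            # A metric that is not reported is treated as a failure.
            failures.append(t.name)
            continue
        breached = value > t.limit if t.higher_is_worse else value < t.limit
        if breached:
            failures.append(t.name)
    return failures


def is_healthy(sample, thresholds=THRESHOLDS):
    """A server is healthy only if no metric fails."""
    return not failing_metrics(sample, thresholds)
```

Returning the list of failing metrics, rather than a single yes/no answer, is what makes it possible to tell a storage connection problem apart from a general capacity problem.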

In the final stages of our migration we noticed that a few services did not have access to our data storage due to incorrect service mapping. We will automate this mapping step and follow it with a validation step. With this automation in place, no server will try to contact the old data storage after it has been migrated away from.
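A validation step of this kind could, as a rough sketch, look like the following. The service names, endpoint names and the shape of the mapping are assumptions made for the example and do not reflect our actual configuration; the point is only that every expected service is checked for a mapping, and that no mapping still points at the retired storage.

```python
def find_mapping_problems(mapping, expected_services, retired_endpoint):
    """Return {service: problem} for services with a missing or stale mapping.

    mapping           -- dict of service name -> storage endpoint (hypothetical shape)
    expected_services -- every service that must be able to reach the storage
    retired_endpoint  -- the old storage endpoint that was migrated away from
    """
    problems = {}
    for service in sorted(expected_services):
        endpoint = mapping.get(service)
        if endpoint is None:
            problems[service] = "missing mapping"
        elif endpoint == retired_endpoint:
            problems[service] = "still points at retired storage"
    return problems


# Example with made-up names: one stale mapping and one missing mapping.
example_mapping = {
    "viewer": "storage-new",
    "admin": "storage-old",
}
example_problems = find_mapping_problems(
    example_mapping,
    expected_services={"viewer", "admin", "processing"},
    retired_endpoint="storage-old",
)
```

Running such a check as the last step of a migration, and treating a non-empty result as a failure, would surface the affected services before customers do.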

Posted May 12, 2021 - 15:44 CEST

Resolved

We have found no further issues and will mark the incident as resolved.

Posted May 07, 2021 - 23:19 CEST

Monitoring

All known issues have been resolved, and we'll keep monitoring the system to ensure everything is working as it should.

Posted May 07, 2021 - 19:06 CEST

Update

We are completing the final stages of our resolution and should be back online within 30 minutes. Next update at 19:00 (GMT+2).

Posted May 07, 2021 - 18:35 CEST

Update

We are still progressing well with the plan. By the looks of it, we will be up and running at full speed again in a couple of hours. Next update will be at 18:30 (GMT+2).

Posted May 07, 2021 - 17:31 CEST

Update

We are progressing well according to plan with resolving the issue. Next update will be at 17:30 (GMT+2).

Posted May 07, 2021 - 16:31 CEST

Update

We are seeing progress towards resolution, though we can't supply an ETA yet. Next update will be at 16:30 (GMT+2).

Posted May 07, 2021 - 15:30 CEST

Update

We are still working on resolving the issue. Next update will be at 15:30 (GMT+2).

Posted May 07, 2021 - 14:45 CEST

Identified

We are currently seeing intermittent load issues with Admin, Flipbooks, Popups and Display. We are working on resolving the issue.

Posted May 07, 2021 - 13:49 CEST

This incident affected: Flipbooks (Viewer, Admin, Display Viewer, Pop-ups).