System Unavailable

Incident Report for iPaper

Postmortem

Incident Summary

On the 9th of July, 2020, customers experienced a total outage of the iPaper platform. All parts of our system (Flipbooks, Display, API and Popups) were inaccessible for an extended period of time. For the majority of the downtime customers saw no improvement in the situation, because the notification channels failed to notify the staff who were on call that night.

We apologize to all of our customers that were impacted during the incident.

Detailed Description of Impact & Remediation

At 02:33 (CEST) our automated monitoring system detected that our system had become unresponsive and triggered alerts. However, these alerts did not get through to the personnel on call that night, which led to an extended period of downtime.

At 04:40 (CEST) several customers that are highly dependent on our systems notified us through the regular support channels, which led us to reach out to the staff on call that night through extended notification channels.

At 04:42 (CEST) our incident response began with an investigation of the issues in our system, with the aim of a fast mitigation to get all customers back online.

At 04:50 (CEST) we identified that all our application servers had stopped responding to other parts of our infrastructure and were thereby no longer serving any requests to our platform. Because the configuration of our application servers had been changed on the 8th of July, new application servers were enrolled with the previous configuration to ensure that they were not affected by the new configuration.

At 05:07 (CEST) our automated monitoring system detected that our system was back online, and all system metrics again showed the usual high level of activity on all parts of the system.
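
The timestamps above can be turned into durations with a few lines of arithmetic. The following is a minimal sketch in Python, not part of our tooling; the dictionary keys and the function name are illustrative labels chosen for this example, and only the times themselves come from the timeline.

```python
from datetime import datetime

# Timestamps (CEST) copied from the timeline above, all on 9 July 2020.
# The key names are illustrative labels, not terms from our monitoring system.
TIMELINE = {
    "detected": "02:33",
    "customers_reported": "04:40",
    "response_started": "04:42",
    "cause_identified": "04:50",
    "recovered": "05:07",
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Time from first alert until the system was reported back online.
total_outage = minutes_between(TIMELINE["detected"], TIMELINE["recovered"])
# Time during which the alert had fired but nobody was working on it.
unnoticed = minutes_between(TIMELINE["detected"], TIMELINE["response_started"])
# Time from the start of the investigation until recovery.
active_response = minutes_between(TIMELINE["response_started"], TIMELINE["recovered"])

print(f"Total outage:    {total_outage} min")
print(f"Unnoticed:       {unnoticed} min")
print(f"Active response: {active_response} min")
print(f"Share unnoticed: {unnoticed / total_outage:.0%}")
```

Under these timestamps, most of the outage falls in the period before anyone on call had been reached, with the active response making up the remainder.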

Preventing Similar Issues

While we already had automated monitoring combined with notification channels and escalation policies, we have revised the setup. Because of the broken notification channels, we have set up further notification channels with earlier escalation options, including off-duty personnel, to ensure we can mitigate issues faster outside regular working hours. We attribute the majority of the downtime to the broken notification channels: once we had identified the issue, we were back online as soon as we could enroll new servers, roughly 20 minutes later. Our recorded history shows faster response times, and we have to go back to October 2018 to find any downtime of the same or longer length.
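
To illustrate what an escalation policy with additional channels and earlier escalation can look like, here is a minimal sketch in Python. It is not our actual configuration: the channel names, audiences, and minute thresholds are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes the alert has gone unacknowledged
    channel: str        # hypothetical channel name
    audience: str       # hypothetical group that is notified


# Hypothetical policy: several independent channels for the on-call staff,
# then an early escalation to off-duty personnel if nobody acknowledges.
POLICY = [
    EscalationStep(0, "push", "on-call"),
    EscalationStep(5, "sms", "on-call"),
    EscalationStep(10, "phone", "on-call"),
    EscalationStep(15, "phone", "off-duty"),
]


def steps_due(policy: list[EscalationStep], minutes_unacknowledged: int) -> list[EscalationStep]:
    """Return every step that should have fired by the given time."""
    return [step for step in policy if step.after_minutes <= minutes_unacknowledged]


# After 12 minutes without acknowledgement, three on-call channels have fired.
for step in steps_due(POLICY, 12):
    print(f"{step.after_minutes:>2} min: {step.channel} -> {step.audience}")
```

The point of the design is that a single failing channel no longer blocks the alert: each step uses a different route, and a wider audience is reached after a short, fixed delay.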

To restore stability, we have reverted the configuration change on the application servers that failed, and ensured that we follow the same practices and flows that have worked well for us throughout the last year, even with the increase in traffic we have seen in the spring of 2020. We have introduced further steps in the process of changing central configurations on our application servers, adding further reviews and test procedures before changes are enrolled into the production environment.

Posted Jul 15, 2020 - 08:40 CEST

Resolved

We experienced a complete system outage from 2:30 (CEST) lasting until 5:08 (CEST).

We have ensured that the system is back online, and we are investigating the outage to ensure it does not happen again. We will publish a post-mortem as soon as we have identified all the facets of the issue.

Posted Jul 09, 2020 - 02:30 CEST