Between the 9th of July 2020 and the 5th of August 2020 we experienced a series of issues that made our systems unavailable to our customers. The outage periods were on the order of 30 minutes, with an outlier of 2 hours and 38 minutes caused by failed notifications, which we previously addressed in the post-mortem on the incident of the 9th of July 2020.
We apologize to all of our customers who were affected by these incidents.
When we published the post-mortem on the 9th of July, we saw issues that appeared to correlate with the new application server configuration we had put into action on the 8th of July. This later proved not to be the case, as we discovered during the outages on the 31st of July and the 4th and 5th of August. Because we had changed the server configuration on the 8th of July, and had no previous record of servers showing the symptoms we saw then, it had seemed highly likely at the time that the issue stemmed from that configuration.
On the 31st of July we responded to an increasing error rate reported by our automated monitoring system, showing the same symptoms we had seen on the 9th of July. Because the incident happened during peak hours for our services, we believed we were facing capacity issues on our servers. With that bottleneck in mind we mitigated the first outage by increasing capacity, but after monitoring for an hour we received new notifications that the error rate was rising again. Since the issues persisted despite the added capacity, we increased capacity further, to 200% of normal, to buy more time for finding the root cause. Unfortunately, the investigation surfaced no sign of the underlying issue, so we concluded that our capacity had simply not kept up with the increased traffic we were receiving at that point in time. After consulting the metrics logged in our monitoring system and seeing that the system had normalized, we concluded it had been a capacity issue.
On the 4th of August we were again notified by our automated monitoring about increasing error rates. Drawing on our previous experience, we increased capacity to mitigate the issue, which brought the systems back online. At the same time, we started looking into alternative measures that could surface the root cause so we could fix the issue permanently, as it had become clear that the bottleneck was not our capacity to process incoming requests. Based on that finding, we launched further initiatives to locate the root cause: we raised capacity to 300% of normal to rule it out as a factor in our investigations, and we built a range of scripts to assist us in debugging the issue.
On the 5th of August we were notified by the automated monitoring, and we immediately applied the countermeasures we had earlier seen resolve the issue. While waiting for the countermeasures to take effect, we found the root cause of the incidents, thanks to the scripts we had prepared after spending a substantial amount of time researching our options, as there were no clear signals of what was wrong on the servers. During the previous incident we had seen a server that had supposedly decommissioned itself from our infrastructure and therefore could not serve any content. With the scripts we had produced, we performed forensics on that server and found that it was suffering from port exhaustion. When a server loses the ability to open connections because a process is not releasing its temporarily used (ephemeral) ports, the server appears decommissioned in the infrastructure. During the earlier incidents we had been inspecting the parts of our system that are frequently under heavy load, but this time the root cause was one of our less used services. Having found the root cause, we set up a temporary mitigation to keep the issue from affecting the whole server, buying time to solve the actual issue.
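To illustrate the kind of forensics involved, here is a minimal sketch (not our actual scripts; the function and sample data are hypothetical) that tallies sockets per TCP state from a Linux /proc/net/tcp-style table. A large pile-up of sockets stuck in non-closed states such as CLOSE_WAIT is the tell-tale sign of port exhaustion:

```python
from collections import Counter

# Hex state codes used in Linux's /proc/net/tcp socket table.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def socket_states(table_lines):
    """Count sockets per TCP state from /proc/net/tcp-formatted lines."""
    counts = Counter()
    for line in table_lines[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) >= 4:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts

# On a live server one would read the real table instead:
#   with open("/proc/net/tcp") as f:
#       print(socket_states(f.readlines()))
sample = [
    "  sl  local_address rem_address   st ...",
    "   0: 00000000:1F90 00000000:0000 0A 00000000:00000000 ...",
    "   1: 0100007F:A3D2 0100007F:1F90 08 00000000:00000000 ...",
    "   2: 0100007F:A3D4 0100007F:1F90 08 00000000:00000000 ...",
]
print(socket_states(sample))
```

Running such a count repeatedly over time makes a process that never releases its ports stand out long before the pool is empty.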
After further inspection of the failing service, we found a flow through the system that ended up never releasing the port it had used. The affected service is one of the older parts of our system and only runs in a synchronous context, but a component it leverages had switched to running in an asynchronous context by default. Due to this mismatch in context, the internal requests were never completed, which resulted in the component exhausting the free ports on the servers. We have ensured that the context takes care of these completions, so that we no longer block ports that are not in use anymore.
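The failure mode can be sketched in a few lines of Python (our stack differs; the class and its counter are hypothetical stand-ins): when a synchronous caller invokes a component whose cleanup has become a coroutine, the coroutine object is created but never driven to completion, so the port it guards is never released.

```python
import asyncio
import warnings

class Connection:
    """Hypothetical connection whose close() became async by default."""
    open_ports = 0  # stand-in for the OS's pool of ephemeral ports

    def __init__(self):
        Connection.open_ports += 1

    async def close(self):
        # This body only runs if the coroutine is actually awaited.
        Connection.open_ports -= 1

def release_buggy(conn):
    # Synchronous caller, unaware close() is now async: this merely
    # creates a coroutine object and discards it -- the port leaks.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # silence "never awaited"
        conn.close()

def release_fixed(conn):
    # Explicitly drive the coroutine to completion from sync code.
    asyncio.run(conn.close())

release_buggy(Connection())
print(Connection.open_ports)  # the first port is still counted as in use

release_fixed(Connection())
print(Connection.open_ports)  # the second port was properly released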
Because the incident was caused by port exhaustion, we had no measure in place to counter it automatically. We do have automatic scaling in place for when we experience higher loads or unhealthy servers, but even if it had been triggered by the port exhaustion, it would not have solved the issue; the issue would have returned within an hour or two. To reclaim the blocked ports on the servers we applied a fix, and we have since verified that the underlying issue has been resolved.
To keep the system responsive and running well, we have expanded the range of metrics we use to determine whether a server is running as expected. We already track a wide range of numbers that show the overall health of the servers and other parts of our infrastructure, but we will now take port usage into account as well, as it has the potential for fatal impact.
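As a sketch of what such a check looks like (the threshold and port range here are illustrative defaults, not our production values): compare the number of ports in use against the ephemeral port range the server is allowed to use, and alert well before the pool runs dry.

```python
def port_usage_alert(used_ports, port_range=(32768, 60999), threshold=0.8):
    """Return True when ephemeral-port usage crosses the given fraction.

    port_range mirrors a typical Linux net.ipv4.ip_local_port_range;
    on a live server one would read the real bounds from
    /proc/sys/net/ipv4/ip_local_port_range instead of hardcoding them.
    """
    capacity = port_range[1] - port_range[0] + 1
    return used_ports / capacity >= threshold

print(port_usage_alert(1200))    # well within the pool: no alert
print(port_usage_alert(27000))   # nearing exhaustion: alert fires
```

Feeding this metric into the existing automated monitoring means a leak like the one described above would page us long before servers start appearing decommissioned.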
Because the incidents were caused by more frequent use of one of the older parts of our system, we have applied measures to prevent this from happening again while we modernize the service to natively handle an asynchronous context.