Loadbalancer issue
Incident Report for Vapor Cloud
Postmortem

Issue surrounding our loadbalancers, October 10, 2017

We encountered a small problem with new deployments to one of our webservers and started investigating. The investigation quickly narrowed to our loadbalancer configuration. This is normally something our automated systems detect. Before we could complete the investigation, the configuration issue took down the loadbalancers across our entire cluster.

We quickly identified the specific configuration and rebooted our loadbalancers so that they would initialize the new configuration faster.

The root cause

The root cause lies in the way we cache configuration in our loadbalancers. We allocate a fixed amount of memory in which our loadbalancers cache the configuration, for faster lookups in the application registry. That cache had become too small for the configuration it has to hold, so the loadbalancers could no longer write to the memory buffer.
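The failure mode can be illustrated with a minimal Python sketch. This is not the actual loadbalancer implementation: the class name `FixedSizeConfigCache`, the byte-based capacity, and the sample registry entries are all assumptions made for the example. It only shows how a cache with a fixed allocation starts rejecting writes once the configuration outgrows it.

```python
from __future__ import annotations


class CacheFullError(Exception):
    """Raised when an entry does not fit in the remaining cache space."""


class FixedSizeConfigCache:
    """Hypothetical in-memory configuration cache with a fixed byte budget."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries: dict[str, bytes] = {}

    def write(self, key: str, value: bytes) -> None:
        # Overwriting a key frees the space held by its previous value.
        reclaimed = len(self._entries.get(key, b""))
        needed = self.used_bytes - reclaimed + len(value)
        if needed > self.capacity_bytes:
            # The allocation never grows, so the write is rejected.
            raise CacheFullError(
                f"{key!r} needs {needed} bytes, capacity is {self.capacity_bytes}"
            )
        self._entries[key] = value
        self.used_bytes = needed

    def lookup(self, key: str) -> bytes | None:
        return self._entries.get(key)


# Illustrative usage with made-up registry entries (13 bytes each).
cache = FixedSizeConfigCache(capacity_bytes=32)
cache.write("app-a", b"10.0.0.1:8080")
cache.write("app-b", b"10.0.0.2:8080")
try:
    cache.write("app-c", b"10.0.0.3:8080")  # a third entry exceeds 32 bytes
    third_write_failed = False
except CacheFullError:
    third_write_failed = True
```

In this sketch the first two entries fit, and the third is rejected while the existing entries remain readable. A real loadbalancer may fail differently when its buffer is full.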

We have systems that should detect this, but they apparently malfunctioned. We will look into improving our configuration detection system.

How we are planning to prevent this from happening again

This was very much an edge case. We have optimized our entire caching policy to make it more stable. In addition, we will improve our procedures so that our automated systems are tested more thoroughly and we can be sure they work as expected.
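As a rough illustration of what testing a detection system can look like, the Python sketch below pairs a cache utilization check with a self-test that feeds it known inputs. The function names and the 80% and 95% thresholds are hypothetical and do not describe our actual monitoring.

```python
def cache_alert_level(used_bytes: int, capacity_bytes: int,
                      warn_ratio: float = 0.8,
                      critical_ratio: float = 0.95) -> str:
    """Classify cache utilization as 'ok', 'warning', or 'critical'."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    ratio = used_bytes / capacity_bytes
    if ratio >= critical_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


def self_test() -> None:
    """Feed the check known inputs so a silently broken detector is noticed."""
    assert cache_alert_level(10, 100) == "ok"
    assert cache_alert_level(85, 100) == "warning"
    assert cache_alert_level(99, 100) == "critical"
    # A cache that is already over capacity must still raise an alert.
    assert cache_alert_level(150, 100) == "critical"


self_test()
```

Running the self-test on a schedule, and not only at deploy time, is one way to notice when the detector itself has stopped working.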

Total downtime

We estimate that the total downtime was between 5 and 10 minutes, depending on the host the project was located on.

We sincerely apologize for any inconvenience.

Posted Oct 10, 2017 - 17:01 UTC

Resolved
We are now confident the issue is resolved. Everything is back up at full capacity.
Posted Oct 10, 2017 - 16:34 UTC
Monitoring
Our tests show the system is running again. We will keep monitoring it.
Posted Oct 10, 2017 - 16:19 UTC
Identified
We are currently experiencing a major outage on our loadbalancers. We are working hard on fixing the problem.
Posted Oct 10, 2017 - 16:16 UTC