We encountered a small problem with new deployments to one of our web servers and began investigating. We quickly narrowed our focus to our load balancer configurations, an area our automated systems normally monitor. Before we could complete a full investigation, the configuration issue caused our load balancers to go down across the entire cluster.
We quickly identified the specific configuration at fault and rebooted our load balancers so they would initialize the new configuration faster.
The root cause was the way we cache configuration in our load balancers. We allocate a fixed-size memory buffer in which each load balancer caches its configuration, allowing faster lookups in the application registry. The configuration had outgrown this fixed buffer, so the load balancers could no longer write to it.
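The failure mode above can be sketched as a fixed-size buffer that rejects any configuration larger than its capacity. This is a minimal illustration under assumed names and sizes, not our actual load balancer code.

```python
class ConfigCache:
    """Hypothetical fixed-size in-memory cache for load balancer
    configuration. Illustrative only: the class name, capacity, and
    config format are assumptions, not the real implementation."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self._buffer: bytes = b""

    def write(self, config: bytes) -> None:
        # A configuration larger than the fixed buffer cannot be
        # cached -- the failure mode described above.
        if len(config) > self.capacity_bytes:
            raise MemoryError(
                f"config ({len(config)} bytes) exceeds cache "
                f"capacity ({self.capacity_bytes} bytes)"
            )
        self._buffer = config

    def read(self) -> bytes:
        return self._buffer


cache = ConfigCache(capacity_bytes=64)
cache.write(b"upstreams: [app-1, app-2]")  # small config fits fine
try:
    cache.write(b"x" * 128)  # grown config no longer fits
except MemoryError as exc:
    print(f"cache write failed: {exc}")
```

Once a write fails this way, every lookup keeps serving the stale (or empty) cached configuration, which is why restarting the load balancers with a corrected configuration resolved the incident.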
We have systems in place to detect this condition, but they malfunctioned. We will look into improving our configuration detection system.
This was a rare edge case. We have since revised our entire caching policy to make it more stable. Beyond that, we will improve our procedures, expand the coverage of our automated tests, and verify that these systems work as expected.
We estimate the total downtime was between 5 and 10 minutes, depending on the host a project was located on.
We sincerely apologize for any inconvenience.