ClassLink depends on a variety of third-party services to deliver fast and reliable Single Sign-On. Today, we experienced Amazon Web Services (AWS) issues that impacted our customers for over 4 hours.
This morning, at approximately 8 am EST, we noticed many 502 “Bad Gateway” errors coming from our load balancers for LaunchPad. For users, this manifested as slow logins or 502 “Bad Gateway” error pages.
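For readers who want a concrete picture of how an error spike like this shows up in monitoring, the sketch below queries CloudWatch for 502 counts emitted by a load balancer. It is illustrative only: it assumes an Application Load Balancer (Classic ELBs expose a similar HTTPCode_ELB_5XX metric), and the load balancer dimension value is a placeholder, not our actual resource.

```python
# Illustrative sketch: count 502s a load balancer returned over the last hour.
# Assumes an Application Load Balancer; the dimension value is a placeholder.
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_ELB_502_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/launchpad-lb/1234567890abcdef"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,                # 5-minute buckets
    Statistics=["Sum"],
)

# Print the 502 count per 5-minute window, oldest first.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], int(point["Sum"]))
```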
Our team immediately began investigating the cause of the “Bad Gateway” errors but could not uncover any abnormalities in our server infrastructure. All of our nodes were under 30% CPU. (Our systems automatically scale out when CPU utilization crosses a 50% threshold.)
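For context, a rule like the one described above is commonly expressed as a target-tracking scaling policy on average CPU. The sketch below shows one way to configure it with boto3; the group and policy names are placeholders, and this is not necessarily the exact mechanism behind our production autoscaling.

```python
# Illustrative sketch: add capacity when the Auto Scaling group's average CPU
# exceeds 50%. Names are placeholders, not our production configuration.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="launchpad-frontend-asg",   # placeholder group name
    PolicyName="cpu-50-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,   # scale out as average CPU crosses 50%
    },
)
```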
One of the websites we routinely visit is Downdetector, https://downdetector.com. We look for outage patterns that could point to more widespread issues with routing, DNS, or AWS itself. Several other large websites experienced “Bad Gateway” errors during the same time frame in which ClassLink had problems.
We immediately reached out to our contacts at AWS for assistance.
At ~8:30 am EST, our team began reprovisioning our front-end servers, which were experiencing the “Bad Gateway” errors. It became clear that there was an underlying network issue preventing traffic from reaching our servers behind the Amazon Load Balancers, even though the logs of these load balancers showed no problems. At this point, we suspected either corrupt networking on the application servers or an issue with the Amazon Load Balancers, with the latter seeming less likely given their clean logs and past reliability.
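One useful check in a situation like this is to ask the load balancer itself whether it considers its back-end targets healthy, since its own view of the targets can differ from what its access logs suggest. The sketch below is illustrative only; it assumes an Application Load Balancer with a target group, and the ARN is a placeholder.

```python
# Illustrative sketch: report each back-end target's health as the load
# balancer sees it. The target group ARN below is a placeholder.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

health = elbv2.describe_target_health(
    TargetGroupArn=(
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
        "targetgroup/launchpad-frontend/abcdef1234567890"
    )
)

for target in health["TargetHealthDescriptions"]:
    print(
        target["Target"]["Id"],
        target["TargetHealth"]["State"],           # e.g. "healthy" or "unhealthy"
        target["TargetHealth"].get("Reason", ""),  # e.g. "Target.Timeout"
    )
```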
At ~9:20 am EST, we refreshed the configuration on our main LaunchPad load balancers and noticed a dip in the number of “Bad Gateway” errors returned. We continued to swap out application servers while waiting for a response from our contacts at AWS. Even after most of the front-end web servers had been reprovisioned, we continued to see large numbers of “Bad Gateway” errors coming from our main load balancers.
At ~12:30 pm EST, we refreshed the configuration on our main API load balancers, and the system almost immediately stabilized.
This issue took much longer to resolve than we expected, and we always strive to minimize any impact on our customers. The AWS Load Balancers, which we ultimately determined to be the cause of the errors, have worked flawlessly for years, and their logs did not indicate any issues. After today’s experience, our team knows what to look for and how to resolve this issue if we encounter it again.
Our team has a meeting scheduled with Amazon Web Services to review the issues and prevent them from occurring in the future.