During the early morning hours (Eastern time) on Wednesday, August 17, 2022, many users could not log in to ClassLink LaunchPad. The following is an Incident Report to ensure trust and transparency between our clients and ClassLink.
Overview of Incident
Our infrastructure team reacted immediately when login issues were first discovered and began activating alternate login pathways to restore system performance. Within ~30 minutes, intermittent logins were restored. Within ~90 minutes, nearly 75% of login load capacity was restored, and by ~120 minutes, essentially all logins were successful. We later traced the root cause to some new virtual servers.
Our team is always hard at work on improving scaling and stability. Towards that goal, we routinely upgrade our infrastructure to provide faster and more reliable services. From time to time, this includes incorporating new-generation virtual servers that have greater compute capacity than previous generations. Over this past weekend, we added some new-generation servers to our authentication server clusters. No issues came up on Monday or Tuesday.
However, on Wednesday, these new-generation servers actually worked too fast. They consumed all the available network connections on the servers themselves, which prevented outbound communication to the authentication infrastructure. It is unusual for a new generation of virtual server to be incompatible with our existing infrastructure, and it has never happened to us before. Further, because the issue occurred days after these servers were introduced to our architecture, they were not initially suspected as the cause, which delayed our resolution of the issue. We have since discontinued using these new-generation servers until we can better test how they behave within our server architecture.
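For readers who want a concrete picture of this failure mode, the following is a minimal illustrative sketch, not a description of ClassLink's actual code or remediation. It assumes a server that handles each login by opening an outbound connection, and shows one common way to cap how many such connections are held at once so that a faster server cannot use up every available connection. The names OutboundLimiter, simulate, and handle_login are hypothetical.

```python
import threading
import time
from contextlib import contextmanager


class OutboundLimiter:
    """Caps how many outbound connections one server may hold at once (hypothetical helper)."""

    def __init__(self, max_connections):
        self._slots = threading.BoundedSemaphore(max_connections)
        self._lock = threading.Lock()
        self.in_use = 0
        self.peak = 0  # highest number of simultaneous connections observed

    @contextmanager
    def connection(self, timeout=5.0):
        # Wait for a free slot instead of opening yet another connection.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no outbound connection slot available")
        with self._lock:
            self.in_use += 1
            self.peak = max(self.peak, self.in_use)
        try:
            yield
        finally:
            with self._lock:
                self.in_use -= 1
            self._slots.release()


def simulate(workers, max_connections):
    """Run `workers` concurrent logins and return the peak number of open connections."""
    limiter = OutboundLimiter(max_connections)

    def handle_login():
        with limiter.connection():
            time.sleep(0.01)  # stand-in for a call to the authentication backend

    threads = [threading.Thread(target=handle_login) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return limiter.peak
```

In this sketch, simulate(50, 8) starts 50 concurrent logins but never holds more than 8 outbound connections at a time; the remaining requests wait for a slot rather than exhausting the server's connections.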
On Wednesday, August 17, 2022:
- ~7:20 AM Eastern Time (11:20 AM UTC): system indicators and reports of users not able to log in to ClassLink LaunchPad
- ~7:27 AM Eastern Time (11:27 AM UTC): initial post about the issue on status.classlink.com
- ~9:00 AM Eastern Time (1:00 PM UTC): 75% login load capacity restored
- ~9:30 AM Eastern Time (1:30 PM UTC): 100% login load capacity restored
- ~9:41 AM Eastern Time (1:41 PM UTC): Quality Assurance testing completed and incident marked resolved
We sincerely apologize for the outage you may have experienced.
Below is a Calendly link if you wish to schedule a meeting. I will be happy to go over any details of this outage and our plans to prevent similar outages in the future.
Stanley Watts, CTO