Slack outage update

12/18/2023

We often encounter (and analyze) incidents where work in one part of an app or service has an unanticipated flow-on impact. These incidents underscore the importance of understanding the entire service delivery chain so that you are aware of every dependency and interconnection, which helps keep impact and footprint to a minimum. Having a detailed grasp of all dependencies is particularly important when making manual changes.

Organizations today normally rely on highly automated change processes: teams are technology-assisted to begin with, and deployment and rollback may occur with little, if any, human intervention. However, there will always be exceptions where changes need to be made manually, outside of the standardized, automated change process. These manual changes require special care, and can be especially challenging because they require engineering teams to understand the intricacies of change processes without the usual automated deployment checks and balances to assist them. Mistakes may be more likely to happen, perhaps causing an outage or service disruption.

We saw this play out this past fortnight, as a manual TLS/SSL certificate change by Microsoft introduced the type of error that an automated system would probably have detected and prevented. Read on to learn more about this outage.

On July 24, Microsoft experienced an issue that impacted connectivity to SharePoint Online and OneDrive for Business services. First observed around 17:05 UTC, it appeared to impact connectivity for users globally.

Figure 1. SharePoint Online and OneDrive Business connectivity impacted globally.

Users encountered a certificate error when attempting to access SharePoint Online and OneDrive due to an erroneous change in the SSL certificate that prevented the establishment of a secure connection to the services.

Figure 2. Certificate issued on July 24 shows incorrect domain name.

Approximately ten minutes later, at around 17:15 UTC, it appeared to be replaced with a valid certificate, and SharePoint and OneDrive service reachability was restored for most users by around 17:20 UTC. Around 21:34 UTC, Microsoft announced that the outage was the result of a configuration issue and had been resolved.

A curious aspect of this outage is that it seemed to be triggered by a manual change. Generally speaking, a majority of outages today appear to be triggered by unexpected conditions encountered during the operation of automated change or deployment processes. Detection and rollback are also often automated. There were likely specific and valid reasons that Microsoft engineers manually replaced the certificate. However, with manual updates, having checks and validations in place to guard against human error is especially important.

It only takes a degradation or outage in one component to have a flow-on impact, potentially taking out the entire service. Any deviation from standard operating procedures and documented change processes can introduce risk. Every component in an end-to-end service delivery chain, and the teams and individuals responsible for those components, need to work in sync to maintain the service’s availability. The incident (and others like it) also highlights how important it is to have a certificate that’s valid in all aspects (with valid credentials, a correct domain name, etc.), as well as appropriate change verification steps. Microsoft likely had such processes in place, and it’s not clear how the erroneous change made it through, but events like these are good reminders for organizations to make sure they have strategies in place to help catch issues like a domain name mismatch.
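As an illustration of the kind of check that can catch a certificate domain name mismatch before a manual change goes live, here is a minimal Python sketch. It is not a description of Microsoft's tooling. The function names (name_matches, uncovered_hostnames) and the example.com hostnames are hypothetical, and the matching rule is deliberately simplified: names are compared case-insensitively, and a wildcard is accepted only as the entire left-most label.

```python
def name_matches(cert_name: str, hostname: str) -> bool:
    """Return True if one DNS name from a certificate covers hostname.

    Simplified rule (an assumption of this sketch): comparison is
    case-insensitive and "*" is honoured only as the whole left-most label.
    """
    pattern = cert_name.lower().rstrip(".").split(".")
    host = hostname.lower().rstrip(".").split(".")
    if len(pattern) != len(host):
        # A wildcard stands in for exactly one label, so lengths must agree.
        return False
    if pattern[0] == "*":
        # Require at least two fixed labels after the wildcard ("*.com" is rejected).
        return len(pattern) >= 3 and host[0] != "" and pattern[1:] == host[1:]
    return pattern == host


def uncovered_hostnames(cert_names, required_hostnames):
    """Return the required hostnames that no certificate name covers."""
    return [
        host
        for host in required_hostnames
        if not any(name_matches(name, host) for name in cert_names)
    ]


if __name__ == "__main__":
    # Hypothetical names: the replacement certificate was issued for the wrong domain.
    new_cert_names = ["*.example-online.com", "drive.example.com"]
    served_hostnames = ["files.example.com", "drive.example.com"]
    print(uncovered_hostnames(new_cert_names, served_hostnames))
    # -> ['files.example.com']
```

A check like this could run as a required step in a manual change procedure: if the returned list is not empty, the certificate is not deployed. In practice the DNS names would be read from the certificate's subject alternative name entries rather than typed in by hand.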
