Microsoft explains the cause of yesterday’s massive service outage

If infosec is about Confidentiality, Integrity…and Availability, then taking your systems offline because of an earlier decision to “support a complex cross-cloud migration” is a big Availability fail:

[…] As Microsoft explained, the authentication and login issues behind yesterday’s outage were caused by an error that affected the correct rotation of the signing keys used to support Azure AD’s use of OpenID.

Signing keys are public/private cryptographic key pairs: the identity platform signs the security tokens it issues with the private key, and apps verify those signatures with the public key.

Microsoft’s identity platform rotates signing keys on a periodic basis for security purposes, with apps being required to handle key rollover events so that authentication attempts don’t fail.
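As a rough illustration of what handling key rollover can look like on the app side, here is a minimal sketch that looks up a key by its ID and re-fetches the published key set once when the ID is unknown. The `KeyResolver` class and the `fetch_keys` callable are hypothetical names for this example; a real app would download the provider's published key metadata and verify the token signature, which is left out here.

```python
from typing import Callable, Dict


class KeyResolver:
    """Hypothetical helper: maps a key ID (kid) to public key material."""

    def __init__(self, fetch_keys: Callable[[], Dict[str, str]]):
        # fetch_keys stands in for downloading the published key metadata.
        self._fetch_keys = fetch_keys
        self._cache: Dict[str, str] = fetch_keys()

    def resolve(self, kid: str) -> str:
        # An unknown kid may simply mean the keys were rotated since the
        # last fetch, so refresh the cache once before giving up.
        if kid not in self._cache:
            self._cache = self._fetch_keys()
        try:
            return self._cache[kid]
        except KeyError:
            raise LookupError(f"no published signing key with kid={kid!r}") from None
```

If the key is still missing after the refresh, the token has to be rejected, which is what apps did once the key disappeared from the published metadata. A production version would probably also rate-limit refreshes so that tokens with bogus key IDs cannot trigger a fetch on every request.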

“As part of standard security hygiene, an automated system, on a time-based schedule, removes keys that are no longer in use,” Microsoft said.

“Over the last few weeks, a particular key was marked as ‘retain’ for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that ‘retain’ state, leading it to remove that particular key.”
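To make the failure mode concrete, here is a minimal sketch of a time-based cleanup that honors a retention flag. The `SigningKey` fields, the `keys_to_remove` function and the 30-day threshold are all assumptions made for illustration; nothing here reflects Microsoft's actual automation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class SigningKey:
    kid: str
    last_used: datetime
    retain: bool = False  # hypothetical flag mirroring the 'retain' state


def keys_to_remove(
    keys: List[SigningKey],
    now: datetime,
    max_idle: timedelta = timedelta(days=30),  # assumed threshold
) -> List[SigningKey]:
    """Return keys idle for longer than max_idle, skipping retained ones."""
    return [
        k for k in keys
        if not k.retain and now - k.last_used > max_idle
    ]
```

The bug described in the quote corresponds to dropping the `not k.retain` check: the retained key then looks like any other idle key and is scheduled for removal.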

Once the signing key was removed, despite being marked for longer retention, apps using Azure AD authentication services immediately stopped trusting tokens signed with it.

As a result, all user login attempts to affected apps and services were rejected, and users could no longer access their accounts.

To mitigate the issue, Microsoft engineers rolled the key metadata back to its state before the worldwide service outage began.

However, the rollback did not end the outage immediately because apps run on different “server implementations that handle caching differently.”

Users continued experiencing issues until the impacted apps managed to pick up the updated key metadata and refresh their caches.
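The lag between the rollback and recovery can be sketched with a simple time-to-live cache. `CachedKeySet`, its one-hour TTL in the usage below, and the injectable `clock` are assumptions for illustration only; actual server implementations cache key metadata in their own ways.

```python
from typing import Callable, Set


class CachedKeySet:
    """Hypothetical cache of published key IDs, refreshed only after a TTL."""

    def __init__(
        self,
        fetch: Callable[[], Set[str]],
        ttl_seconds: float,
        clock: Callable[[], float],
    ):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys = fetch()
        self._fetched_at = clock()

    def has_key(self, kid: str) -> bool:
        # Re-fetch only once the cached copy is older than the TTL.
        if self._clock() - self._fetched_at >= self._ttl:
            self._keys = self._fetch()
            self._fetched_at = self._clock()
        return kid in self._keys


# Usage: simulated time, so no real waiting is involved.
now = [0.0]
published = {"old-key", "new-key"}
cache = CachedKeySet(lambda: set(published), 3600, lambda: now[0])

published.discard("old-key")   # the key is removed by mistake
now[0] = 3600
cache.has_key("old-key")       # cache refreshes and picks up the bad metadata
published.add("old-key")       # the rollback restores the key
now[0] = 3700
still_failing = not cache.has_key("old-key")  # stale until the next refresh
```

In this sketch an app that refreshed during the bad window keeps rejecting the restored key until its TTL expires, so apps with longer TTLs recover later than those with shorter ones.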

[…]

Original Article