Azure Network Infrastructure - Issues accessing a subset of Microsoft services
Incident Report for Workspot
Resolved
The issue has been mitigated. Microsoft has shared the preliminary post-incident review below regarding this incident (Tracking ID: KTY1-HW8):


This is our Preliminary PIR that we endeavor to publish within 3 days of incident mitigation to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a "Final" PIR with additional details/learnings.


What happened?

Between 11:45 and 13:58 UTC on 30 July 2024, a subset of customers experienced intermittent connection errors, timeouts, or latency spikes while connecting to Microsoft services that leverage Azure Front Door (AFD) and Azure Content Delivery Network (CDN). The impact extended to downstream services that rely on AFD and Azure CDN – including the Azure portal, and a subset of Microsoft 365 and Microsoft Purview services. From 13:58 to 19:43 UTC, a smaller set of customers continued to observe a low rate of connection timeouts.


What went wrong and why?

Azure Front Door (AFD) is Microsoft's scalable platform for web acceleration, global load balancing, and content delivery, operating in nearly 200 locations worldwide – including datacenters within Azure regions and edge sites. AFD and Azure CDN are built with platform defenses against network and application layer Distributed Denial-of-Service (DDoS) attacks. In addition, these services rely on the Azure network DDoS protection service to mitigate attacks at the network layer. You can read more about these protection mechanisms at https://learn.microsoft.com/azure/ddos-protection/ddos-protection-overview and https://learn.microsoft.com/azure/frontdoor/front-door-ddos.

Between 10:15 and 10:45 UTC, a volumetric distributed TCP SYN flood DDoS attack occurred at multiple Azure Front Door and CDN sites. This attack was automatically mitigated by the Azure Network DDoS protection service and had minimal customer impact.

At 11:45 UTC, as the Network DDoS protection service was disengaging and resuming default traffic routing to the Azure Front Door service, the network routes could not be updated within one specific site in Europe. This happened because the Network DDoS control plane could not reach that specific site, due to a local power outage. Consequently, traffic inside Europe continued to be forwarded to AFD through our DDoS protection services, instead of returning directly to AFD. This event in isolation would not have caused any impact.

However, an unrelated latent network configuration issue caused traffic from outside Europe to also be routed to the DDoS protection system within Europe. This led to localized congestion, which caused customers to experience high latency and connectivity failures across multiple regions. The vast majority of the impact was mitigated by 13:58 UTC, around two hours later, when we resolved the routing issue. A small subset of customers whose applications lack retry logic may have experienced residual effects until 19:43 UTC.


How did we respond?

Our internal monitors detected impact at our Europe edge sites at 11:47 UTC, immediately prompting a series of investigations. Once we identified that the network routes could not be updated within that one specific site, we updated the DDoS protection configuration to avoid the traffic congestion. These changes successfully mitigated most of the impact by 13:58 UTC. Availability returned to pre-incident levels by 19:43 UTC, once the default network policies were fully restored.


How are we making incidents like this less likely or less impactful?

- We have already added the missing network device configuration that resulted in the traffic redirection, to ensure a DDoS mitigation issue in one geography cannot spread to other geographies in the Europe region. (Completed)
- We are enhancing our existing validation and monitoring in the Azure network to detect invalid configurations. (Estimated completion: November 2024)
- We are improving our monitoring to detect cases where our DDoS protection service is unreachable from the control plane but is still serving traffic. (Estimated completion: November 2024)



How can customers make incidents like this less impactful?

- For customers of Azure Front Door/Azure CDN products, implementing retry logic in your client-side applications can help handle temporary failures when connecting to a service or network resource during mitigations of network layer DDoS attacks; a minimal sketch follows this list. For more information, refer to our recommended error-handling design patterns: https://learn.microsoft.com/azure/well-architected/resiliency/app-design-error-handling#implement-retry-logic.
- Applications that use exponential backoff in their retry strategy may have seen success, since an immediate retry during intervals of high packet loss is likely to have encountered high packet loss as well, whereas a retry conducted during a period of lower loss would likely have succeeded. For more details on retry patterns, refer to https://learn.microsoft.com/azure/architecture/patterns/retry.
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/azure/architecture/framework/resiliency.
- Finally, ensure that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These alerts can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
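To illustrate the two retry-related recommendations above, here is a minimal client-side sketch of retry logic with exponential backoff and jitter, written in Python using only the standard library. The endpoint URL, attempt count, and delay values are illustrative assumptions rather than values taken from this incident or from Azure documentation.

```python
# Minimal sketch: retry a request with exponential backoff and jitter.
# The URL and tuning values below are illustrative placeholders.
import random
import time
import urllib.error
import urllib.request


def fetch_with_retries(url: str, max_attempts: int = 5, base_delay: float = 0.5,
                       max_delay: float = 8.0, timeout: float = 5.0) -> bytes:
    """Fetch a URL, retrying transient failures with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the error to the caller.
            # Exponential backoff (0.5s, 1s, 2s, 4s, ...) capped at max_delay,
            # with random jitter so many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            print(f"Attempt {attempt} failed ({exc}); retrying in up to {delay:.1f}s")
            time.sleep(random.uniform(0, delay))


if __name__ == "__main__":
    # Hypothetical endpoint fronted by AFD/Azure CDN, used only for illustration.
    body = fetch_with_retries("https://example.azurefd.net/health")
    print(f"Received {len(body)} bytes")
```

The jitter is the important design choice during an event like this one: it spreads retries out over time, so a large population of clients recovering simultaneously does not synchronize its retry attempts against capacity that is still constrained.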


How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/KTY1-HW8
Posted Aug 07, 2024 - 16:30 UTC
Monitoring
As per Microsoft, they have implemented networking configuration changes, and telemetry shows improvement in service availability.
Posted Jul 30, 2024 - 15:57 UTC
Update
Microsoft updated the status page with the below details:
We have implemented networking configuration changes and have performed failovers to alternate networking paths to provide relief. Monitoring telemetry shows improvement in service availability from approximately 14:10 UTC onwards, and we are continuing to monitor to ensure full recovery.
Posted Jul 30, 2024 - 15:57 UTC
Investigating
Microsoft updated its status page to indicate that it is investigating reports of issues connecting to Microsoft services globally. Customers may experience timeouts connecting to Azure services.

Please refer to https://azure.status.microsoft/en-us/status for the latest update.
Posted Jul 30, 2024 - 13:36 UTC