Resolved -
Microsoft has informed us that the outage in the region has been mitigated. Please find the summary below, as provided by Microsoft:
STATUS: Mitigated 9/10/2025 9:55:15 PM UTC
SUMMARY OF IMPACT:
What happened?
Between 09:12 UTC and 18:50 UTC on 10 September 2025, a platform issue resulted in an impact on multiple Azure services in the East US 2 region, more specifically, two zones (Az02 and Az03). Impacted customers may have experienced error notifications when performing service management operations, such as creating, deleting, updating, scaling, starting, or stopping, for resources hosted in this region. The primary impacted services were Virtual Machines and Virtual Machine Scale Sets, but this would have resulted in issues for services dependent on such Compute resources, such as Azure Databricks, Azure Kubernetes Service, Azure Synapse Analytics, Backup, and Data Factory.
Customers who still see failed or unhealthy resources should attempt to update or redeploy the resource.
What do we know so far?
Our investigation identified that the issue impacting resource provisioning in East US 2 was linked to a failure in the platform component responsible for managing resource placement. The system is designed to recover quickly from transient issues, but in this case, the prolonged performance degradation caused recovery mechanisms themselves to become a source of instability.
The incident was primarily driven by a combination of platform recovery behavior and sustained performance degradation. While customer-generated load remained within expected limits, internal platform services began retrying failed operations aggressively when performance issues emerged. These retries, intended to support resilience, instead created a surge in internal system activity.
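The retry surge described above is the failure mode that capped exponential backoff with jitter is designed to prevent. As an illustration only (this is not Microsoft's internal retry implementation), a minimal Python sketch of such a policy:

```python
import random

def backoff_delays(max_attempts=5, base=0.5, cap=30.0, rng=random.Random(42)):
    """Yield capped exponential backoff delays with full jitter.

    Fixed-interval, aggressive retries can amplify load during degradation;
    growing and randomizing the delay de-synchronizes retrying clients and
    bounds the extra traffic they generate.
    """
    for attempt in range(max_attempts):
        # Exponential growth, capped so no delay exceeds `cap` seconds.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling].
        yield rng.uniform(0, ceiling)

delays = list(backoff_delays())
print(delays)
```

A caller would sleep for each yielded delay between failed attempts; the fixed seed above is only for reproducibility of the example.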
How did we respond?
09:12 UTC on 10 September 2025 – Customer impact began.
09:13 UTC on 10 September 2025 – Our monitoring systems observed a rise in failure rates, triggering an alert and prompting our team to initiate an investigation.
12:08 UTC on 10 September 2025 – We identified unhealthy dependencies in core infrastructure components as initial contributing factors.
13:34 UTC on 10 September 2025 – Began mitigation efforts, which included: restarting critical service components to restore functionality, rerouting workloads away from affected infrastructure, initiating multiple recovery cycles for the impacted backend service, working through internal workload backlogs upon recovery to reach the current healthy state, and executing capacity operations to free up resources.
18:50 UTC on 10 September 2025 – After a period of monitoring to validate the health of services, we were confident that the control plane service was restored, and no further impact was observed on downstream services for this issue.
What happens next?
Our team will be completing an internal retrospective to understand the incident in more detail. We will publish a Preliminary Post Incident Review (PIR) within approximately 72 hours to share more details on what happened and how we responded. After our internal retrospective is completed, generally within 14 days, we will publish a Final Post Incident Review with any additional details and learnings.
To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs
The impact times above represent the full incident duration, so they are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
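For the granular impact monitoring that the guidance above points to, one simple approach is to compute per-resource failure rates for your own service management operations over the incident window. A minimal sketch, with the record keys (`resource`, `status`) being illustrative rather than the actual Azure Activity Log schema:

```python
from collections import defaultdict

def failure_rates(operations):
    """Compute the per-resource failure rate from operation records.

    Each record is a dict with illustrative keys: 'resource' (a resource
    identifier) and 'status' ('Succeeded' or 'Failed'). Real Activity Log
    entries carry different field names and more detail.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for op in operations:
        totals[op["resource"]] += 1
        if op["status"] == "Failed":
            failures[op["resource"]] += 1
    return {r: failures[r] / totals[r] for r in totals}

ops = [
    {"resource": "vm-a", "status": "Failed"},
    {"resource": "vm-a", "status": "Succeeded"},
    {"resource": "vm-b", "status": "Succeeded"},
]
print(failure_rates(ops))  # vm-a: 0.5, vm-b: 0.0
```

Comparing rates like these against the incident window shows which of your resources were actually affected, rather than relying on the region-wide impact times.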
Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness
Stay informed about your Azure services
- Visit Azure Service Health to get your personalized view of possible impacted Azure resources, downloadable Issue Summaries, and engineering updates.
- Set up service health alerts to stay notified of future service issues, planned maintenance, or health advisories.
Sep 17, 04:05 UTC
Monitoring -
A fix has been implemented and we are monitoring the results.
Sep 11, 15:49 UTC
Update -
As per Microsoft, this issue has been mitigated. Here is the incident summary shared by Microsoft:
SUMMARY OF IMPACT:
What happened?
Between 09:12 UTC and 18:50 UTC on 10 September 2025, a platform issue resulted in an impact to multiple Azure services in the East US 2 region, more specifically two zones (Az02 and Az03). Impacted customers may have experienced error notifications when performing service management operations - such as creating, deleting, updating, scaling, starting, or stopping - for resources hosted in this region. The primary impacted services were Virtual Machines and Virtual Machine Scale Sets, but this would have resulted in issues for services dependent upon such Compute resources, such as Azure Databricks, Azure Kubernetes Service, Azure Synapse Analytics, Backup, and Data Factory.
Customers who still see failed or unhealthy resources should attempt to update or redeploy the resource.
What do we know so far?
Our investigation identified that the issue impacting resource provisioning in East US 2 was linked to a failure in the platform component responsible for managing resource placement. The system is designed to recover quickly from transient issues, but in this case, the prolonged performance degradation caused recovery mechanisms themselves to become a source of instability.
The incident was primarily driven by a combination of platform recovery behavior and sustained performance degradation. While customer-generated load remained within expected limits, internal platform services began retrying failed operations aggressively when performance issues emerged. These retries, intended to support resilience, instead created a surge in internal system activity.
How did we respond?
• 09:12 UTC on 10 September 2025 – Customer impact began.
• 09:13 UTC on 10 September 2025 – Our monitoring systems observed a rise in failure rates, triggering an alert and prompting our team to initiate an investigation.
• 12:08 UTC on 10 September 2025 – We identified unhealthy dependencies in core infrastructure components as initial contributing factors.
• 13:34 UTC on 10 September 2025 – Began mitigation efforts, which included: restarting critical service components to restore functionality, rerouting workloads away from affected infrastructure, initiating multiple recovery cycles for the impacted backend service, working through internal workload backlogs upon recovery to reach the current healthy state, and executing capacity operations to free up resources.
• 18:50 UTC on 10 September 2025 – After a period of monitoring to validate the health of services, we were confident that the control plane service was restored, and no further impact was observed on downstream services for this issue.
Sep 10, 22:26 UTC
Update -
Below is the summary provided by Microsoft regarding the ongoing issue in the East US 2 region:
"Current Status:
We detected the issue through automated monitoring following a spike in failure rates. We have identified a performance issue in a core infrastructure component responsible for managing resource placement. This is causing delays and failures in virtual machine provisioning. The issue stems from severe transaction delays and high system load in two zones of the region.
Active Recovery Efforts:
• Zone 2 (Az02) – Gradually re-enabling traffic (~50%) to Zone 2 using controlled allocation strategies. This should introduce more capacity into the region and allow higher allocation success rates.
• Zone 3 (Az03) and Zone 1 (Az01) – Recovered, but because traffic is only partially enabled for Zone 2, customers may still see allocation failures in these zones. As Zone 2 traffic increases, allocation success should improve across all zones.
Note that the 'logical' zones used by each customer subscription may correspond to different physical zones - customers can use the Locations API to understand this mapping and confirm which resources run in the affected physical AZs.
The next update will be provided within 60 minutes, or sooner if significant progress is made.
"
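The logical-to-physical zone mapping mentioned in Microsoft's update can be read from the `availabilityZoneMappings` field that the subscription Locations API returns per region. A minimal Python sketch; the sample response fragment below is illustrative, not real data for this subscription:

```python
def zone_mapping(location_info):
    """Map logical zone names to physical zone identifiers for one region.

    `location_info` mimics one element of the subscription Locations API
    response; only the fields used here are modeled.
    """
    return {
        m["logicalZone"]: m["physicalZone"]
        for m in location_info.get("availabilityZoneMappings", [])
    }

# Illustrative sample fragment (the mapping differs per subscription):
sample = {
    "name": "eastus2",
    "availabilityZoneMappings": [
        {"logicalZone": "1", "physicalZone": "eastus2-az3"},
        {"logicalZone": "2", "physicalZone": "eastus2-az1"},
        {"logicalZone": "3", "physicalZone": "eastus2-az2"},
    ],
}
print(zone_mapping(sample))
```

With such a mapping in hand, a subscription whose logical zone "2" maps to physical Az02 would know its resources sat in an impacted zone even though another subscription's zone "2" did not.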
Sep 10, 19:00 UTC
Update -
Below is the summary provided by Microsoft regarding the ongoing issue in the East US 2 region:
"Current Status:
We detected the issue through automated monitoring following a spike in failure rates. The root cause has been traced to a backend service responsible for managing resource placement, which is experiencing performance degradation. This has led to delays and failures in resource creation and management.
Our engineering teams have attempted several recovery actions, including restarting key service components and shifting workloads away from affected infrastructure. However, these efforts have not yet fully resolved the issue due to system-level constraints.
Mitigation Actions Taken:
• Restarted critical service components to restore functionality.
• Attempted to reroute workloads from affected infrastructure.
• Initiated multiple recovery cycles for the impacted backend service.
Active Recovery Efforts:
• Zone 3 (Az03) is showing signs of improvement after targeted recovery actions. System performance has stabilized and is being closely monitored.
• Zone 2 (Az02) is undergoing similar recovery steps. While new resource deployments remain restricted, existing resources are beginning to recover.
• We are working to redistribute workloads to healthier zones (Az01 and Az03). However, limited capacity in these zones is causing throttling and delays.
• In parallel, we are exploring emergency capacity expansion to alleviate resource constraints and accelerate recovery.
Note that the 'logical' zones used by each customer subscription may correspond to different physical zones - customers can use the Locations API to understand this mapping and confirm which resources run in the affected physical Availability Zones (AZs).
Next Steps:
We continue to monitor all zones and prioritize recovery in Az02 to restore capacity in the region."
Sep 10, 16:50 UTC
Update -
As per the latest update from Microsoft, they have observed that two of the three impacted zones have returned to a healthy state, and failure rates are now trending downward. Customers may see signs of recovery over time.
This failure issue is limited to the East US 2 region.
Sep 10, 15:40 UTC
Update -
We are continuing to work with Microsoft.
Sep 10, 14:43 UTC
Identified -
We are aware that some customers are experiencing issues with VM provisioning and resuming operations in the East US and East US 2 Azure regions.
Our team has already engaged Microsoft Azure Support with a Severity 1 ticket. According to Microsoft, the issue is related to unhealthy dependencies that are causing failures. They are actively investigating to determine the root cause and identify possible mitigation steps.
We will share further updates as soon as Microsoft provides more information.
Sep 10, 14:43 UTC