Resolved -
Update:
We have received the Preliminary Post Incident Review (PIR) from Microsoft for the Azure East US control plane issue that impacted your environment on 24 April 2026.
Here is a summary of what happened and why:
STATUS: RCA
COMMUNICATION:
This is our Preliminary PIR to share what we know so far. After our internal retrospective is completed (generally within 14 days), we will publish a Final PIR with additional details.
What happened?
Between 11:30 UTC on 24 April and 00:15 UTC on 25 April 2026, customers may have experienced failures or delays when attempting to provision, scale, or update resources in East US. Beyond this, a smaller subset of impacted customers may have experienced intermittent connectivity issues on existing workloads (including Virtual Machines and Azure Virtual Desktop sessions) for scenarios dependent on unhealthy internal service dependencies.
The issue began with impact to a subset of customers in a single Availability Zone (physical AZ-01), but as demand shifted, similar symptoms were observed for a subset of customers in AZ-02 and AZ-03. While none of these zones were impacted for the full duration of the incident, customers experienced periods of impact in each zone.
The following services were among those affected: Azure Application Gateway, Azure App Service, Azure Batch, Azure Cache for Redis, Azure Data Explorer, Azure Data Factory, Azure Databricks, Azure Health Data Services, Azure Kubernetes Service (AKS), Azure Red Hat OpenShift, Azure Service Fabric, Azure Synapse Analytics, Azure Virtual Desktop, Azure Virtual Machines, Azure Virtual Network Manager, Azure VMware Solution, Oracle Database@Azure, Virtual Machine Scale Sets – and potentially additional services that were dependent on new compute allocations in the region.
Note: Logical availability zones assigned to customer subscriptions may map to different physical availability zones. Customers can use the Locations API to understand this mapping: https://learn.microsoft.com/rest/api/resources/subscriptions/list-locations?HTTP#availabilityzonemappings.
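To illustrate, the mapping can be retrieved programmatically. The following is a minimal Python sketch, assuming the azure-identity and requests packages and an authenticated session (for example via az login); the api-version shown is an assumption, so consult the linked documentation for the current one.

```python
# Sketch: list logical-to-physical availability zone mappings for a subscription.
# The subscription ID is a placeholder; the api-version is an assumption.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION_ID = "<your-subscription-id>"

token = DefaultAzureCredential().get_token("https://management.azure.com/.default")
resp = requests.get(
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}/locations",
    params={"api-version": "2022-12-01"},
    headers={"Authorization": f"Bearer {token.token}"},
)
resp.raise_for_status()

# Each location may expose availabilityZoneMappings (logicalZone -> physicalZone).
for loc in resp.json().get("value", []):
    for zone in loc.get("availabilityZoneMappings") or []:
        print(loc["name"], zone["logicalZone"], "->", zone["physicalZone"])
```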
What went wrong and why?
The Azure PubSub service is a key component of the networking control plane, acting as an intermediary between resource providers and networking agents on Azure hosts. Resource providers, such as the Network Resource Provider, publish customer configurations during Virtual Machine or networking create, update, or delete operations. Networking agents (subscribers) on the hosts retrieve these configurations to program the host's networking stack. Additionally, the service functions as a cache, ensuring efficient retrieval of configurations during VM reboots or restarts. This capability is essential for deployments, resource allocation, and traffic management in Azure Virtual Network (VNet) environments.
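To make the publish/subscribe-with-cache pattern described above concrete, here is a minimal conceptual sketch in Python. This is not Microsoft's implementation; all class and method names are illustrative.

```python
# Conceptual sketch: a broker that pushes configuration to live subscribers and
# caches the last-known configuration per resource, so an agent re-attaching
# after a VM reboot can immediately recover state. Names are illustrative.
from collections import defaultdict
from typing import Callable, Dict, List


class ConfigPubSub:
    def __init__(self) -> None:
        self._cache: Dict[str, dict] = {}                 # last-known config per resource
        self._subs: Dict[str, List[Callable]] = defaultdict(list)

    def publish(self, resource_id: str, config: dict) -> None:
        """Called by a resource provider on create/update/delete."""
        self._cache[resource_id] = config                 # cache for later retrieval
        for handler in self._subs[resource_id]:           # push to live subscribers
            handler(config)

    def subscribe(self, resource_id: str, handler: Callable) -> None:
        """Called by a host networking agent interested in a resource."""
        self._subs[resource_id].append(handler)
        if resource_id in self._cache:                    # replay cached state on attach
            handler(self._cache[resource_id])


# Usage: a subscriber attaching after the fact still receives the cached config,
# mirroring how the cache serves VM reboots without a fresh publish.
bus = ConfigPubSub()
bus.publish("vm-123/nic-0", {"vnet": "prod-vnet", "rules": ["allow-443"]})
bus.subscribe("vm-123/nic-0", lambda cfg: print("programming host with", cfg))
```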
During normal platform operations, one partition of this PubSub service in AZ-01 became unhealthy and automatically attempted to fail over to a secondary replica. The failover did not complete successfully, resulting in a partial loss of control plane availability within AZ-01. We intervened to investigate and attempted a manual failover of the primary partition, but this attempt was also unsuccessful.
Shortly afterward, we observed a similar condition in AZ-03, which led to a partial loss of control plane availability in AZ-03 as well. As the investigation progressed, we suspected that a previously deployed update to a regional control plane dependency had introduced a latent regression. This issue did not surface during earlier validation and only manifested when failover conditions were triggered under sustained production load.
As part of our mitigation efforts, we identified a version from the prior week that represented a Last Known Good (LKG) state. We first applied this rollback in AZ-03, which successfully restored control plane service health in that zone. Based on this, we began rolling back the affected components in AZ-01. By design, rollback operations are executed in stages by Azure Fabric controllers using update domains to ensure platform safety, so recovery proceeds incrementally.
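As a rough illustration of that staged pattern, the sketch below rolls back one update domain at a time and gates on health before proceeding. It is a simplification under stated assumptions; the real Fabric controller logic is more involved, and every name here is hypothetical.

```python
# Sketch: roll back update domains (UDs) one at a time, verifying health before
# moving on, which limits blast radius. All functions are hypothetical stand-ins.
import time

UPDATE_DOMAINS = ["UD-0", "UD-1", "UD-2", "UD-3", "UD-4"]
LKG_VERSION = "lkg-prior-week"  # Last Known Good build identified during mitigation


def apply_version(ud: str, version: str) -> None:
    print(f"rolling back {ud} to {version} ...")  # placeholder for the real deploy step


def is_healthy(ud: str) -> bool:
    return True  # placeholder for a real health probe


def staged_rollback() -> None:
    for ud in UPDATE_DOMAINS:
        apply_version(ud, LKG_VERSION)
        deadline = time.time() + 600  # wait up to 10 minutes per domain
        while not is_healthy(ud):
            if time.time() > deadline:
                raise RuntimeError(f"{ud} did not stabilize; halting rollback")
            time.sleep(15)
    print("rollback complete across all update domains")


staged_rollback()
```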
While mitigation was in progress, the platform was unable to maintain two healthy instances of the PubSub service across availability zones simultaneously, which is a requirement for normal replication and control plane operations. This resulted in a loss of quorum for the service. As the system attempted to rebalance, impact shifted between availability zones, leading to periods of degraded behavior across multiple zones.
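Conceptually, the quorum condition can be pictured as follows. This sketch assumes the two-healthy-instance requirement described above; everything else is illustrative.

```python
# Sketch: control plane writes are only accepted while a quorum of healthy
# replicas exists across zones. The threshold of two mirrors the requirement
# described above; the rest is illustrative.
REQUIRED_HEALTHY = 2


def has_quorum(replica_health: dict) -> bool:
    """replica_health maps zone name -> whether the replica there is healthy."""
    return sum(1 for healthy in replica_health.values() if healthy) >= REQUIRED_HEALTHY


replicas = {"AZ-01": False, "AZ-02": True, "AZ-03": False}
if not has_quorum(replicas):
    # With quorum lost, the service must reject or queue writes, which surfaces
    # to customers as provisioning failures or delays.
    print("quorum lost: control plane writes unavailable")
```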
Similar failure patterns began to appear in AZ-02 and again in AZ-03, expanding the scope of impact across the region. For AZ-02 we initiated and completed a rollback. Although AZ-03 had previously shown recovery following the rollback, subsequent instability indicated that the rollback in that zone had not fully completed, due to an orchestration fault. As impact reemerged, rollback operations in AZ-03 were restarted and then completed, fully restoring service health.
How did we respond?
• 11:30 UTC on 24 April 2026 – Customer impact began. We observed failures or delays when customers attempted to provision, scale, or update resources in the affected region.
• 11:38 UTC on 24 April 2026 – We detected an issue in AZ-01. A control plane partition became unhealthy and automatic failover attempts did not complete successfully.
• 11:38–13:40 UTC on 24 April 2026 – We attempted manual failover in AZ-01. These efforts did not successfully restore service.
• 13:40 UTC on 24 April 2026 – We identified a recently deployed update as the likely cause of the issue.
• 13:50 UTC on 24 April 2026 – We began observing similar symptoms in AZ-03, indicating the issue was affecting multiple availability zones.
• 14:07 UTC on 24 April 2026 – We initiated rollback to a previously known good version in AZ-03.
• 15:03 UTC on 24 April 2026 – We observed significant recovery in AZ-03. Control plane availability exceeded 99%.
• 15:04 UTC on 24 April 2026 – We initiated rollback actions to AZ-01.
• 18:52 UTC on 24 April 2026 – We observed significant improvement in AZ-01 as rollback progressed.
• 19:02 UTC on 24 April 2026 – We confirmed AZ-01 had recovered to greater than 99% availability, while the rollback continued in the background.
• 19:05 UTC on 24 April 2026 – We observed similar symptoms in AZ-02 as load redistributed across the region.
• 19:10 UTC on 24 April 2026 – We initiated rollback to a known good version in AZ-02.
• 21:02 UTC on 24 April 2026 – We observed instability reappear in AZ-03. We determined this was because the rollback had not yet completed across all update domains. Consequently, we manually unblocked the rollback across the remaining update domains in AZ-03 to ensure stable recovery.
• 22:39 UTC on 24 April 2026 – We confirmed rollback was fully completed in AZ-03.
• 23:22 UTC on 24 April 2026 – We confirmed rollback was fully completed in AZ-02, completing PubSub mitigation across all affected zones.
• 00:15 UTC on 25 April 2026 – We validated downstream service recovery and PubSub health across all zones in the region.
How are we making incidents like this less likely or less impactful?
• We have assessed the risk of recurrence in other high-volume regions, and have taken steps to roll back this PubSub service in those regions out of an abundance of caution. (Completed)
• We are investing in improving our test coverage surrounding the failure cases and load patterns that contributed to this incident, to catch issues like this one before they reach production. (Estimated completion: TBD)
• We are working to reduce rollback complexity, so that we can mitigate issues like this more quickly in the future. (Estimated completion: TBD)
• This is our Preliminary PIR to share what we know so far. After our internal retrospective is completed (generally within 14 days), we will publish a Final PIR with additional details.
How can customers make incidents like this less impactful?
• Consider using Availability Zones (AZs) to run your services across physically separate locations within an Azure region. To help services be more resilient to localized failures like this one (which predominantly impacted different zones at different times), many Azure services support zonal, zone-redundant, and/or always-available configurations; a minimal zone-redundant deployment sketch follows this list: https://docs.microsoft.com/azure/availability-zones/az-overview
• For mission-critical workloads, customers should consider a multi-region geodiversity strategy to avoid impact from incidents like this one that impacted a single region: https://learn.microsoft.com/azure/architecture/patterns/geodes and https://learn.microsoft.com/azure/well-architected/design-guides/regions-availability-zones
• More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
• The impact times above represent the full incident duration, so they are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact, see: https://aka.ms/AzPIR/Monitoring
• Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
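As an example of the zone-redundant configurations mentioned above, many ARM resource types accept a zones array, and listing multiple zones spreads the resource across them. The following Python sketch requests a zone-redundant Standard public IP via the ARM REST API; the resource type is chosen for brevity, and the placeholders and api-version are assumptions, so consult the documentation for your resource type.

```python
# Sketch: request a zone-redundant resource by listing multiple zones in the
# ARM payload (shown for a Standard public IP address). Placeholders and the
# api-version are assumptions.
import requests
from azure.identity import DefaultAzureCredential

SUB, RG, NAME = "<subscription-id>", "<resource-group>", "zr-demo-pip"
url = (
    f"https://management.azure.com/subscriptions/{SUB}/resourceGroups/{RG}"
    f"/providers/Microsoft.Network/publicIPAddresses/{NAME}"
)
body = {
    "location": "eastus",
    "sku": {"name": "Standard"},
    "zones": ["1", "2", "3"],  # zone-redundant: spread across all three logical zones
    "properties": {"publicIPAllocationMethod": "Static"},
}

token = DefaultAzureCredential().get_token("https://management.azure.com/.default")
resp = requests.put(
    url,
    params={"api-version": "2023-05-01"},
    headers={"Authorization": f"Bearer {token.token}"},
    json=body,
)
resp.raise_for_status()
print("requested zones:", resp.json().get("zones"))
```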
Apr 28, 06:52 UTC
Monitoring -
Update: Microsoft has shared the update below on the ongoing Azure platform incident impacting Virtual Machines and Virtual Machine Scale Sets in the East US region.
Current Status: Resolved
What happened?
Between 11:39 UTC on 24 April and 00:15 UTC on 25 April 2026, a platform issue resulted in impact to a subset of Azure services in the East US region. Impacted customers experienced failures or delays when provisioning, scaling, or updating Azure resources. Intermittent connectivity issues may also have been observed on existing running workloads, including Virtual Machines and Azure Virtual Desktop sessions.
This issue is now mitigated. An update with more information will be provided shortly.
Next Steps:
The next update will be shared once Microsoft provides the RCA. In the meantime, Workspot will continue to monitor impacted pools and virtual desktops to ensure stability and identify any residual issues.
Apr 25, 04:15 UTC
Update -
Microsoft has shared the update below on the ongoing Azure platform incident impacting Virtual Machines and Virtual Machine Scale Sets in the East US region.
Current Status: Rollback is complete for Availability Zones AZ-01 and AZ-03, and is still in progress for AZ-02. Mitigation is expected to complete within the next hour.
Customers should observe improvements to their services, and downstream services are also starting to recover.
NEXT STEPS:
The next update will be shared within 60 minutes, or as events warrant.
Apr 24, 23:38 UTC
Update -
Microsoft has shared the update below on the ongoing Azure platform incident impacting Virtual Machines and Virtual Machine Scale Sets in the East US region.
CURRENT STATUS:
We detected this issue through automated monitoring after identifying an unusual drop in service success rates. The issue was caused by a recent change, and rollback actions are in progress across availability zones (AZs). Customers using resources in AZ-03 may begin to see recovery; rollback in AZ-01 is currently in progress. To understand the logical-to-physical availability zone mapping for your subscription, please refer to this documentation: https://learn.microsoft.com/rest/api/resources/subscriptions/list-locations?HTTP#availabilityzonemappings
There is currently no confirmed estimate for full resolution.
The next update will be shared within the next 60 minutes, or as events warrant.
Apr 24, 17:56 UTC
Update -
Current Status:
At this time, no new update has been received from Microsoft beyond their previous communication. The incident remains under investigation on their side.
We are continuing to monitor the situation closely and are tracking customer impact, which currently includes multiple reported cases of virtual desktop connectivity issues and VM accessibility failures.
Next Update:
We will provide an update as soon as Microsoft shares further information or once there is any material change in status.
Apr 24, 17:06 UTC
Update -
Microsoft has shared an update on the ongoing Azure platform incident impacting Virtual Machines and Virtual Machine Scale Sets in the East US region.
Microsoft identified the issue through automated monitoring, which detected an unusual increase in failures within the service supporting virtual machine operations in East US. Microsoft is currently investigating the underlying cause and is actively working toward mitigation.
Current Status:
The issue remains in progress and under investigation by Microsoft. Customers may continue to experience VM connectivity or access issues in East US, and related impact may also be seen in East Asia.
Microsoft indicated that the next update will be shared within approximately 60 minutes, or sooner if there are meaningful developments.
Workspot continues to monitor the situation closely and will provide further updates as new information becomes available from Microsoft.
Apr 24, 15:52 UTC
Update -
Update from Microsoft:
Service: Virtual Machine Scale Sets, Virtual Machines
Region: East US, East Asia
Impact Statement: Starting at 11:39 UTC on 24 Apr 2026, a subset of customers using Virtual Machines and Virtual Machine Scale Sets in East US may experience issues with resources hosted in this region.
Current Status: We are aware of this issue and are actively investigating. The next update will be provided as events warrant.
IMPACTED SERVICE(S) AND REGION(S):
Service Name --> Region
Virtual Machine Scale Sets --> East US
Virtual Machines --> East Asia
Apr 24, 14:38 UTC
Update -
Impact Statement: Starting at 11:39 UTC on 24 Apr 2026, a subset of customers using Virtual Machines and Virtual Machine Scale Sets in East US may experience issues with resources hosted in this region.
Current Status: We are aware of this issue and are actively investigating. The next update will be provided as events warrant.
IMPACTED SERVICE(S) AND REGION(S):
Service Name --> Region
Virtual Machine Scale Sets --> East US
Virtual Machines --> East Asia
Apr 24, 13:38 UTC
Investigating -
Workspot has identified an issue affecting accessibility of virtual desktops hosted in the Azure East US region.
Microsoft has confirmed an ongoing Azure platform incident impacting a subset of Virtual Machines and Virtual Machine Scale Sets in this region. As a result, some customers may experience loss of network connectivity or inability to access their VMs.
This Azure incident aligns with the connectivity issues currently being observed.
Workspot is monitoring the situation closely and is in communication with Microsoft regarding the incident status. We will provide updates as they become available.
Apr 24, 13:31 UTC