According to Microsoft, this issue has been resolved. Here is the incident summary shared by Microsoft:
Issue Summary: Between 09:07 UTC and 16:25 UTC on 29 May 2025, a platform issue resulted in an impact to the following services in the East US region:
- Virtual Machines & Virtual Machine Scale Sets: Error notifications when performing service management operations (such as create, delete, update, scaling, start, stop) for resources hosted in this region. This impact was restricted to a single Availability Zone (AZ01), Physical AZ01. Retries may have been successful.
- Azure Synapse Analytics: Issues while executing Spark jobs through Synapse Pipelines or Notebooks, encountering the error code "CLUSTER_CREATION_TIMED_OUT". Retries may have been successful.
- Azure Data Factory: Activity or Pipeline run failures and delays due to dataflow activity failures.
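Since the summary notes that retries may have been successful, the following is a minimal client-side sketch of retrying a failed operation with capped exponential backoff and jitter. It is an illustration only, not Microsoft's guidance or an Azure SDK API: `TransientError`, `retry_with_backoff`, and all parameter names and defaults are hypothetical, and the caller is assumed to map a retryable failure (for example a timeout or throttling response) onto `TransientError`.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for a retryable failure (e.g. a timeout or throttling)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is exhausted.

    All names and defaults here are assumptions chosen for illustration.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Out of attempts: surface the last failure to the caller.
                raise
            # Exponential backoff capped at max_delay, with full jitter so
            # many clients do not retry at the same moment.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Spreading retries out with jitter matters during an incident like this one, where the stated cause was backend components hitting operational limits: immediate, synchronized retries would add to the load rather than relieve it.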
MS Response Timeline:
- 09:07 UTC: Customer impact began.
- 09:12 UTC: Auto-recovery attempts started, including load-shedding and failover.
- 09:15 UTC: Service monitoring detected spikes in VM failures; investigation began.
- 11:45 UTC: Platform engineers terminated problematic service instances to free compute resources.
- 12:53 UTC: Services started processing backlogged VM requests, with some customers still seeing timeouts and throttling.
- 13:15 UTC: Engineers redirected VM deployment traffic to alternate management services to speed recovery.
- 13:48 UTC: Failover progress noted, backlog began draining.
- 13:58 UTC: Azure Data Factory service restored.
- 14:09 UTC: Azure Synapse Analytics service restored.
- 16:25 UTC: All services fully restored; customer impact mitigated.
Posted May 30, 2025 - 12:30 UTC
Identified
Issue Summary:
Start Time: 09:15 UTC on 29 May 2025
Impact: Errors may occur during service management operations (create, delete, update, scale, start, stop) for VMs.
Cause: A sudden spike in usage has caused backend VM components to hit operational limits, resulting in delays and failures.
Current Status: Microsoft is mitigating the issue by failing over to a healthy backend instance.
If your resources are in the East US region, please monitor the updates under Service Health in your Azure subscription.