Azure Service Alert – East US Region VMs might be impacted

Incident Report for Workspot

Resolved

According to Microsoft, this issue has been resolved. Microsoft shared the following incident summary:

Issue Summary:
Between 09:07 UTC and 16:25 UTC on 29 May 2025, a platform issue resulted in an impact to the following services in the East US region:
- Virtual Machines & Virtual Machine Scale Sets: Error notifications when performing service management operations - such as create, delete, update, scaling, start, stop - for resources hosted in this region. This impact was restricted to a single Availability Zone (AZ01), Physical AZ01. Retries may have been successful.
- Azure Synapse Analytics: Issues while executing Spark jobs through Synapse Pipelines or Notebooks, encountering the error code "CLUSTER_CREATION_TIMED_OUT". Retries may have been successful.
- Azure Data Factory: Activity or Pipeline run failures and delays due to dataflow activity failures.
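Because the summary notes that retries may have been successful, automation that issues management operations can usually ride out this kind of transient failure by retrying with exponential backoff. The following is a minimal sketch only, not Microsoft's or Workspot's implementation: `TransientError`, `retry_with_backoff`, and all parameter names and defaults are hypothetical, and the operation passed in stands for whatever call your tooling makes to start, stop, or scale a VM.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for a retryable failure, such as a timeout or throttling response."""


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is exhausted.

    All names and default values here are illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Out of attempts: surface the last failure to the caller.
                raise
            # Exponential backoff capped at max_delay, with full jitter so that
            # many clients do not retry in lockstep against a struggling backend.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

Only errors that are known to be transient should be mapped to `TransientError`; validation or authorization failures should not be retried. The jitter matters during an incident like this one, where backlog and throttling mean that synchronized retries can slow recovery.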

MS Response Timeline:
- 09:07 UTC: Customer impact began.
- 09:12 UTC: Auto-recovery attempts started, including load-shedding and failover.
- 09:15 UTC: Service monitoring detected spikes in VM failures; investigation began.
- 11:45 UTC: Platform engineers terminated problematic service instances to free compute resources.
- 12:53 UTC: Services started processing backlogged VM requests, with some customers still seeing timeouts and throttling.
- 13:15 UTC: Engineers redirected VM deployment traffic to alternate management services to speed recovery.
- 13:48 UTC: Failover progress noted, backlog began draining.
- 13:58 UTC: Azure Data Factory service restored.
- 14:09 UTC: Azure Synapse Analytics service restored.
- 16:25 UTC: All services fully restored; customer impact mitigated.

Posted May 30, 2025 - 12:30 UTC

Identified

Issue Summary:

Start Time: 09:15 UTC on 29 May 2025
Impact: Errors may occur during service management operations (create, delete, update, scale, start, stop) for VMs.
Cause: A sudden spike in usage has caused backend VM components to hit operational limits, resulting in delays and failures.
Current Status: Microsoft is mitigating the issue by failing over to a healthy backend instance.

If your resources are in the East US region, please monitor updates under Service Health in your Azure subscription.

Posted May 29, 2025 - 14:03 UTC

This incident affected: Workspot Control.