Databricks issue in West Europe
Resolved
Jun 11, 2024 at 01:08am UTC
We had no failures in any region on Monday 10 June and believe the issue is now resolved.
We will continue to monitor in case it resurfaces.
Affected services
Updated
Jun 03, 2024 at 05:03am UTC
Databricks have deployed a platform change in Australia East (Sydney) and Europe West (Amsterdam). Since those changes were deployed on 29 May (Europe time), we have seen no failures in Databricks in Sydney or Amsterdam.
We are working with Databricks to get this out-of-cycle change deployed in UK South (London), which has had no failures in the last 2 days. We would also like the change to be deployed in our other regions (UAE North, US East, Southeast Asia, Japan East, Canada Central). These regions have lower usage/throughput and have not seen failures to the same extent.
Updated
May 24, 2024 at 09:49pm UTC
There have been no job failures in Europe West, UK South and Australia East (the high load locations) since we deployed the mitigation yesterday.
There may still be some slower-running jobs where usage is high. We are working with Databricks on a mitigation for this.
Updated
May 23, 2024 at 11:41pm UTC
The updated configuration is now deployed in all regions.
Updated
May 23, 2024 at 11:11pm UTC
There were 32 failures on 23 May in EUW, so we have accelerated deployment of a mitigation.
We have deployed the updated configuration to Production in AUE and will shortly deploy it to the other high-load regions (EUW and UKS).
The change adjusts the version of Java used by Databricks to avoid a Java bug. Based on our testing in Staging, this materially improves throughput.
The Databricks team continue to work on the residual issues.
Updated
May 23, 2024 at 11:02pm UTC
In collaboration with Databricks we have identified an additional mitigation which will be deployed into Production in Australia and Europe today.
Updated
May 22, 2024 at 05:03am UTC
Databricks continue to work on the identified bug. We do not have an ETA for a fix being deployed and continue to monitor for failures in Data Studio.
Since the mitigations we deployed last week, there have been 3 failures in Data Studio in the week commencing 20 May (across 2000+ executions).
Updated
May 16, 2024 at 03:13am UTC
We are still working on this with the Databricks platform team.
Since Monday we have had a 0.3% failure rate in Amsterdam, 0.6% in Sydney and 0.08% in London.
If your Data Studio activity fails, please retry.
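If you trigger activities from your own scripts, the retry can be automated. The following is a minimal sketch only: `run_activity` is a hypothetical callable standing in for however you start the activity and is assumed to raise an exception on failure or timeout. The attempt count and delays are illustrative and are not values recommended by us or by Databricks.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_activity(
    run_activity: Callable[[], T],
    max_attempts: int = 3,
    base_delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call run_activity until it succeeds or max_attempts is reached.

    run_activity is a hypothetical callable (an assumption for this sketch)
    that starts the activity and raises an exception if it fails or times out.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return run_activity()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            # Wait longer after each failure: base, 2x base, 4x base, ...
            sleep(base_delay_seconds * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Doubling the wait between attempts avoids resubmitting immediately to a cluster that may still be under load or restarting.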
Updated
May 05, 2024 at 11:32pm UTC
We are still working with the Databricks team to identify the cause of the issue.
In the meantime we have:
- scaled up capacity
- set the clusters to restart more regularly
- added notifications for the L3 team to see job failures/timeouts
- granted access to the L3 team to complete restarts
Updated
Apr 24, 2024 at 07:18am UTC
The issue has recurred.
The L4 team are working with Microsoft and Databricks to identify a root cause.
In the meantime we have:
- scaled up the EUW cluster to provide more memory headroom
- onboarded the L3 support team to monitor the cluster state and manually restart the cluster if needed
Updated
Apr 22, 2024 at 09:00pm UTC
Issue resolved through a cluster restart.
The team are investigating to find the root cause.
Created
Apr 22, 2024 at 04:00pm UTC
A Databricks issue prevented DBR jobs from completing.
The issue was resolved with a cluster restart.
The team are investigating the root cause.