Databricks issue in West Europe
Resolved
Jun 11 at 11:08am AEST
We have had no failures in any region on Monday 10 June and believe the issue is now resolved.
We will continue to monitor in case it resurfaces.
Updated
Jun 03 at 03:03pm AEST
Databricks have deployed a platform change in Australia East (Sydney) and Europe West (Amsterdam). Since the change was deployed on 29 May (Europe time), we have seen no Databricks failures in Sydney or Amsterdam.
We are working with Databricks to get this out-of-cycle change deployed in UK South (London), which has had no failures in the last 2 days. We would also like the change deployed in our other regions (UAE North, US East, Southeast Asia, Japan East, Canada Central). These regions have lower usage/throughput and have not seen the same failures.
Updated
May 25 at 07:49am AEST
There have been no job failures in Europe West, UK South and Australia East (the high-load locations) since we deployed the mitigation yesterday.
There may still be some slower-running jobs where usage is high; we are working with Databricks on a mitigation for this.
Updated
May 24 at 09:41am AEST
The updated configuration is now deployed in all regions.
Updated
May 24 at 09:11am AEST
There were 32 failures on 23 May in EUW, so we have accelerated deployment of a mitigation.
We have deployed the updated configuration to Production in AUE and will shortly deploy it to the other high-load regions (EUW and UKS).
The change adjusts the version of Java used by Databricks to avoid a Java bug; based on our testing in Staging, this materially improves throughput.
The Databricks team continue to work on the residual issues.
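For reference, below is a minimal sketch of how a JVM version change like this can be rolled out to a Databricks cluster, assuming the JNAME environment variable mechanism is used to select the JDK build and the cluster is updated via the clusters/edit REST endpoint. The workspace URL, token, cluster ID, runtime and node type values are placeholders, not the actual configuration deployed for this incident.

```python
# Illustrative sketch only: selecting an alternative JDK build for a Databricks
# cluster via the JNAME environment variable and the clusters/edit endpoint.
# All identifiers below are placeholders, not the real values for this incident.
import requests

WORKSPACE_URL = "https://adb-0000000000000000.0.azuredatabricks.net"  # placeholder
TOKEN = "dapiXXXXXXXXXXXXXXXX"  # placeholder personal access token

cluster_spec = {
    "cluster_id": "0523-000000-abcdefgh",        # placeholder cluster ID
    "cluster_name": "data-studio-prod",          # placeholder name
    "spark_version": "13.3.x-scala2.12",         # example LTS runtime
    "node_type_id": "Standard_DS4_v2",           # example Azure node type
    "num_workers": 8,
    # Pin the JVM used by the cluster to a specific JDK build.
    "spark_env_vars": {"JNAME": "zulu11-ca-amd64"},  # placeholder JDK build name
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/edit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Cluster update accepted; the new JVM applies on the next restart.")
```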
Updated
May 24 at 09:02am AEST
In collaboration with Databricks we have identified an additional mitigation which will be deployed into Production in Australia and Europe today.
Updated
May 22 at 03:03pm AEST
Databricks continue to work on the identified bug. We do not have an ETA for a fix being deployed and continue to monitor for failures in Data Studio.
There have been 3 failures in Data Studio in the week commencing 20 May (across 2,000+ executions), following the mitigations we deployed last week.
Updated
May 16 at 01:13pm AEST
We are still working on this with the Databricks platform team.
Since Monday we have had a 0.3% failure rate in Amsterdam, 0.6% in Sydney and 0.08% in London.
If your Data Studio activity fails, please retry.
Updated
May 06 at 09:32am AEST
We are still working with the Databricks team to identify the cause of the issue.
In the meantime we have:
- scaled up capacity
- set the clusters to restart more regularly
- added notifications so the L3 team can see job failures/timeouts
- granted the L3 team access to perform restarts (a manual restart can be triggered as in the sketch below)
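A minimal sketch of what a manual restart by the L3 team could look like, assuming the Databricks clusters/restart REST endpoint is used; the workspace URL, token and cluster ID shown are placeholders, not the values for this environment.

```python
# Illustrative sketch only: triggering a manual restart of a Databricks cluster
# through the REST API. Placeholders stand in for the real workspace and cluster.
import requests

WORKSPACE_URL = "https://adb-0000000000000000.0.azuredatabricks.net"  # placeholder
TOKEN = "dapiXXXXXXXXXXXXXXXX"  # placeholder personal access token

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/restart",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"cluster_id": "0523-000000-abcdefgh"},  # placeholder cluster ID
)
resp.raise_for_status()
print("Restart request accepted.")
```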
Updated
Apr 24 at 05:18pm AEST
The issue has recurred.
The L4 team are working with Microsoft and Databricks to identify a root cause.
In the meantime we have:
- scaled up the EUW cluster to provide more memory headroom
- onboarded the L3 support team to monitor the cluster state and manually restart the cluster if needed
Updated
Apr 23 at 07:00am AEST
Issue resolved through a cluster restart.
The team are investigating the root cause.
Created
Apr 23 at 02:00am AEST
There was a Databricks issue that prevented DBR jobs from completing.
The issue was resolved with a cluster restart.
The team are investigating the root cause.