Downtime

Databricks issue in West Europe

Apr 23 at 02:00am AEST

Resolved
Jun 11 at 11:08am AEST

We have had no failures in any region on Monday 10 June and believe the issue is now resolved.

We will continue to monitor in case it resurfaces.

Updated
Jun 03 at 03:03pm AEST

Databricks have deployed a platform change in Australia East (Sydney) and Europe West (Amsterdam). Since those changes were deployed on 29 May (Europe time), we have seen no failures in Databricks in Sydney or Amsterdam.

We are working with Databricks to get this out-of-cycle change deployed in UK South (London), which has had no failures in the last 2 days. We would also like the change to be deployed in our other regions (UAE North, US East, Southeast Asia, Japan East, Canada Central). These regions have lower usage/throughput and have not seen failures to the same extent.

Updated
May 25 at 07:49am AEST

There have been no job failures in Europe West, UK South and Australia East (the high load locations) since we deployed the mitigation yesterday.

There may still be some slower-running jobs where usage is high; we are working with Databricks on a mitigation for this.

Updated
May 24 at 09:41am AEST

The updated configuration is now deployed in all regions.

Updated
May 24 at 09:11am AEST

There were 32 failures on 23 May in EUW, so we have accelerated deployment of a mitigation.

We have deployed an updated configuration to Production in AUE and will shortly deploy it to the other high-load regions (EUW and UKS).

The change adjusts the version of Java used by Databricks to avoid a Java bug. Based on our testing in Staging, this materially improves throughput.

The Databricks team continue to work on the residual issues.

Updated
May 24 at 09:02am AEST

In collaboration with Databricks we have identified an additional mitigation which will be deployed into Production in Australia and Europe today.

Updated
May 22 at 03:03pm AEST

Databricks continue to work on the identified bug. We do not have an ETA on a fix being deployed and continue to monitor for failures on Data Studio.

Since the mitigations we deployed last week, there have been 3 failures in Data Studio in the week commencing 20 May (across 2000+ executions).

Updated
May 16 at 01:13pm AEST

We are still working on this with the Databricks platform team.

Since Monday we have had a 0.3% failure rate in Amsterdam, 0.6% in Sydney and 0.08% in London.

If your Data Studio activity fails, please retry.

Updated
May 06 at 09:32am AEST

We are still working with the Databricks team to identify the cause of the issue.

In the meantime we have:
- scaled up capacity
- set the clusters to restart more regularly
- added notifications for the L3 team to see job failures/timeouts
- granted access to the L3 team to complete restarts

Updated
Apr 24 at 05:18pm AEST

The issue has recurred.

The L4 team are working with Microsoft and Databricks to identify a root cause.

In the meantime we have:
- scaled up the EUW cluster to provide more memory headroom
- onboarded the L3 support team to monitor the cluster state and manually restart the cluster if needed

Updated
Apr 23 at 07:00am AEST

Issue resolved through a cluster restart.

The team are investigating to find the root cause.

Created
Apr 23 at 02:00am AEST

A Databricks issue prevented DBR jobs from completing.

The issue was resolved with a cluster restart.

The team are investigating the root cause.