Databricks issue in West Europe

Resolved
Jun 11, 2024 at 01:08am UTC

We have had no failures in any regions on Monday 10 June and believe the issue is now resolved.

We will continue to monitor in case it resurfaces.

Updated
Jun 03, 2024 at 05:03am UTC

Databricks have deployed a platform change in Australia East (Sydney) and Europe West (Amsterdam). Since those changes were deployed on 29 May (Europe time), we have seen no failures in Databricks in Sydney or Amsterdam.

We are working with Databricks to get this out-of-cycle change deployed in UK South (London), which has had no failures in the last 2 days. We would also like the change to be deployed in our other regions (UAE North, US East, Southeast Asia, Japan East, Canada Central). These regions have lower usage/throughput and have not seen failures to the same extent.

Updated
May 24, 2024 at 09:49pm UTC

There have been no job failures in Europe West, UK South and Australia East (the high-load locations) since we deployed the mitigation yesterday.

Some jobs may still run more slowly where usage is high. We are working on a mitigation for this with Databricks.

Updated
May 23, 2024 at 11:41pm UTC

The updated configuration is now deployed in all regions.

Updated
May 23, 2024 at 11:11pm UTC

There were 32 failures on 23 May in EUW, so we have accelerated deployment of a mitigation.

We have deployed the updated configuration to Production in AUE and will shortly deploy it to the other high-load regions (EUW and UKS).

The change adjusts the version of Java used by Databricks to avoid a Java bug. Based on our testing in Staging, this materially improves throughput.

The Databricks team continue to work on the residual issues.

Updated
May 23, 2024 at 11:02pm UTC

In collaboration with Databricks we have identified an additional mitigation which will be deployed into Production in Australia and Europe today.

Updated
May 22, 2024 at 05:03am UTC

Databricks continue to work on the identified bug. We do not have an ETA on a fix being deployed and continue to monitor for failures on Data Studio.

There have been 3 failures in Data Studio in the week commencing 20 May (across 2000+ executions), since the mitigations we deployed last week.

Updated
May 16, 2024 at 03:13am UTC

We are still working on this with the Databricks platform team.

Since Monday we have had a 0.3% failure rate in Amsterdam, 0.6% in Sydney and 0.08% in London.

If your Data Studio activity fails, please retry.
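
For automated callers, a retry can be sketched as below. This is a minimal illustration only: `run_with_retry` and the `activity` callable are hypothetical names, not part of Data Studio or Databricks, and the attempt count and delays are assumed values rather than recommendations from the platform team.

```python
import time


def run_with_retry(activity, max_attempts=3, base_delay_s=30.0, sleep=time.sleep):
    """Call `activity` until it succeeds or `max_attempts` is reached.

    `activity` is a hypothetical zero-argument callable that starts the
    job and raises an exception if the job fails or times out.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return activity()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            # Wait longer after each failure (30 s, 60 s, ... by default)
            # so retries do not add load to an already busy cluster.
            sleep(base_delay_s * 2 ** (attempt - 1))
```

The `sleep` parameter exists only so the waiting can be swapped out in tests. With the failure rates reported above, one or two retries should normally be enough.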

Updated
May 05, 2024 at 11:32pm UTC

We are still working with the Databricks team to identify the cause of the issue.

In the meantime we have:
- scaled up capacity
- set the clusters to restart more regularly
- added notifications for the L3 team to see job failures/timeouts
- granted access to the L3 team to complete restarts

Updated
Apr 24, 2024 at 07:18am UTC

The issue has recurred.

The L4 team are working with Microsoft and Databricks to identify a root cause.

In the meantime we have:
- scaled up the EUW cluster to provide more memory headroom
- onboarded the L3 support team to monitor the cluster state and manually restart the cluster if needed

Updated
Apr 22, 2024 at 09:00pm UTC

Issue resolved through a cluster restart.

The team are investigating to find the root cause.

Created
Apr 22, 2024 at 04:00pm UTC

There was a Databricks issue that meant that DBR jobs did not complete.

The issue was resolved with a cluster restart.

The team are investigating the root cause.
