Azure Databricks Disaster Recovery Automation for a Cloud Data Platform

Client: Enterprise Data Engineering Team (Cloud Gravity Solutions initiative)
Stack: Azure Databricks, Terraform, Azure DevOps
Deliverable: Reusable infrastructure-as-code module for cross-workspace DR replication

Marko Skendo & Ditmir Spahiu

4/16/2024

The Challenge

The client ran a production Azure Databricks workspace with hundreds of jobs, notebooks, clusters, SQL warehouses, and instance pools — all built up organically over time. They had no disaster recovery strategy. A regional outage or workspace corruption would mean days of manual reconstruction and significant data engineering downtime.

The core problem: Databricks workspaces contain complex, interdependent resources (jobs referencing specific cluster IDs, notebooks nested in directory trees, instance pools with workspace-specific identifiers) that can't simply be copied between environments.

The Solution

We designed and built a Terraform module that automatically replicates an entire Databricks workspace to a secondary Azure region, acting as a live DR target.

Key engineering challenges solved:

  • Dual-provider pattern — A single terraform apply reads from the primary workspace and writes to the DR workspace simultaneously, using Terraform provider aliases (databricks.primary_site / databricks.dr_site)

  • ID translation layer — Instance pool IDs are workspace-specific; we built a local mapping table that transparently remaps resource references between environments, preventing broken dependencies

  • Dependency-aware replication — Directories are created before notebooks, notebooks before jobs, pools before clusters — Terraform's dependency graph handles ordering automatically with no manual sequencing

  • 15+ task type support — Spark JAR, Python, Notebook, SQL, DBT, Pipeline, and more — all replicated with full schedule, notification, and Git integration configs preserved

  • Resilient configuration — Extensive use of Terraform's try() and can() functions handles optional fields gracefully, so the module works across workspaces with different feature sets
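
The patterns above can be sketched in a few lines of Terraform. This is an illustrative sketch, not the client module: the variable names (`primary_workspace_url`, `primary_etl_pool_id`, `source_job`) and all resource values are placeholders, and a real job would replicate many more fields.

```hcl
# Dual-provider pattern: two aliased providers in one configuration.
# Reads target the primary workspace, writes target the DR workspace.
provider "databricks" {
  alias = "primary_site"
  host  = var.primary_workspace_url
}

provider "databricks" {
  alias = "dr_site"
  host  = var.dr_workspace_url
}

# The pool re-created in DR receives a new, workspace-specific ID.
resource "databricks_instance_pool" "dr_pool" {
  provider                              = databricks.dr_site
  instance_pool_name                    = "etl-pool"
  node_type_id                          = "Standard_DS3_v2"
  idle_instance_autotermination_minutes = 15
}

# ID translation layer: map each primary pool ID to its DR counterpart
# so replicated jobs never carry dangling cross-workspace references.
locals {
  pool_id_map = {
    (var.primary_etl_pool_id) = databricks_instance_pool.dr_pool.id
  }
}

# Job replicated into DR. Terraform's dependency graph guarantees the
# pool exists before the job is created (no manual sequencing), and
# try() tolerates optional fields that some workspaces don't set.
resource "databricks_job" "dr_job" {
  provider = databricks.dr_site
  name     = var.source_job.name

  timeout_seconds = try(var.source_job.timeout_seconds, 0)

  new_cluster {
    num_workers      = 2
    spark_version    = "13.3.x-scala2.12"
    instance_pool_id = local.pool_id_map[var.primary_etl_pool_id]
  }

  notebook_task {
    notebook_path = var.source_job.notebook_path
  }
}
```

Because the pool mapping is an ordinary Terraform `locals` value, jobs replicated into DR reference it through normal expression evaluation, so the remapping is invisible to whoever consumes the module.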

The Result

  • ~0 manual effort to stand up a DR workspace — a single pipeline run replicates the full environment

  • Recovery time objective (RTO) dramatically reduced, from days of manual reconstruction to under an hour

  • The module is fully parameterized and reusable across other client workspaces with no code changes

  • Integrated with Azure DevOps pipelines for automated, scheduled sync of DR state

"Before this, a Databricks outage meant hours of archaeology through the UI. Now we run one pipeline and the DR site is current."