Spark Sync (Databricks)
The SparkSync class provides automated, convention-based synchronisation between a Spark catalog (e.g. Databricks / Unity Catalog) and TitanRDM. It extends ConventionSync to automatically read from and write to Spark catalog tables.
Naming Convention
SparkSync follows a three-level naming convention for Spark tables:
{catalog}.{schema}.{domain_abbreviation}_{database_table_name}
For example:
- Download target: dev.rdmin.clin_sites
- Upload source: dev.rdmout.clin_sites
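The convention above can be sketched as a small helper. This is an illustration only and not part of the SDK: the function name `convention_table_name` is hypothetical, and it assumes the domain abbreviation and database table name are already in the form TitanRDM stores them.

```python
def convention_table_name(catalog: str, schema: str,
                          domain_abbreviation: str,
                          database_table_name: str) -> str:
    """Build {catalog}.{schema}.{domain_abbreviation}_{database_table_name}."""
    parts = {
        "catalog": catalog,
        "schema": schema,
        "domain_abbreviation": domain_abbreviation,
        "database_table_name": database_table_name,
    }
    # Reject empty components so a malformed name fails early
    for label, value in parts.items():
        if not value:
            raise ValueError(f"{label} must not be empty")
    return f"{catalog}.{schema}.{domain_abbreviation}_{database_table_name}"


print(convention_table_name("dev", "rdmin", "clin", "sites"))  # dev.rdmin.clin_sites
```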
Setup
from titan_rdm_sdk import TitanRDMClient
from titan_rdm_sdk.spark_sync import SparkSync
# Authenticate
client = TitanRDMClient(
    url=TITAN_URL,
    client_id=TITAN_CLIENT_ID,
    client_secret=TITAN_CLIENT_SECRET,
)
# Resolve branch
branch = client.get_branch_by_name("prod")
# Create SparkSync (automatically picks up the active SparkSession in Databricks)
sync = SparkSync(client=client, spark=spark)
In Databricks, the spark variable is available globally. In other Spark environments, pass your SparkSession explicitly.
Upload: Spark Catalog → TitanRDM
Upload an Entire Domain
Upload all deployed tables in a domain. SparkSync reads each table from {catalog}.{schema}.{abbreviation}_{database_table_name} and uploads it to TitanRDM:
results = sync.upload_sync_by_convention(
    branch_id=branch.id,
    source_catalog="dev",
    source_schema="rdmout",
    target_domain_name="Clinics",
)
for r in results:
    print(f" {r['table']}: {r['rows']} rows — {r['status']}")
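A scheduled job usually needs to fail loudly when any table did not sync. The sketch below assumes each result is a dict with the `table`, `rows`, and `status` keys used in the loop above, and that a successful sync reports the status string `"success"`. That status value is an assumption, so check what your SDK version actually returns. `summarise_results` is a hypothetical helper, not an SDK function.

```python
def summarise_results(results, ok_status="success"):
    """Return (total_rows, failed_tables) for a list of per-table result dicts."""
    total_rows = 0
    failed = []
    for r in results:
        if r["status"] == ok_status:
            total_rows += r["rows"]
        else:
            failed.append(r["table"])
    return total_rows, failed


# Sample data standing in for the return value of upload_sync_by_convention
sample = [
    {"table": "Site", "rows": 120, "status": "success"},
    {"table": "Org Unit", "rows": 0, "status": "failed"},
]
total, failed = summarise_results(sample)
if failed:
    print(f"{len(failed)} table(s) did not sync: {', '.join(failed)}")
```

In a notebook job, raising an exception when `failed` is non-empty would mark the run as failed instead of only printing a message.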
Upload Specific Tables
Upload only selected tables from the domain:
results = sync.upload_sync_by_convention(
    branch_id=branch.id,
    source_catalog="dev",
    source_schema="rdmout",
    target_domain_name="Clinics",
    target_table_names=["Site", "Delivery Centre", "Org Unit"],
)
Upload Parameters
| Parameter | Type | Required | Description |
| branch_id | int | Yes | Target branch ID |
| source_catalog | str | Yes | Source catalog name (e.g. 'dev') |
| source_schema | str | Yes | Source schema name (e.g. 'rdmout') |
| target_domain_name | str | Yes | Exact domain name in TitanRDM |
| target_table_names | list[str] | No | Filter to specific table names |
| description | str | No | Import batch description |
| correlation_code | str | No | Tracking identifier |
Download: TitanRDM → Spark Catalog
Download an Entire Domain
Download all deployed tables in a domain and write them to your Spark catalog:
results = sync.download_sync_by_convention(
    branch_id=branch.id,
    target_catalog="dev",
    target_schema="rdmin",
    source_domain_name="Clinics",
)
for r in results:
    print(f" {r['table']}: {r['rows']} rows — {r['status']}")
Download Specific Tables
Download only selected tables from the domain:
results = sync.download_sync_by_convention(
    branch_id=branch.id,
    target_catalog="dev",
    target_schema="rdmin",
    source_domain_name="Clinics",
    source_table_names=["Site", "Delivery Centre", "Org Unit"],
)
Download Parameters
| Parameter | Type | Required | Description |
| branch_id | int | Yes | Target branch ID |
| target_catalog | str | Yes | Destination catalog (e.g. 'dev') |
| target_schema | str | Yes | Destination schema (e.g. 'rdmin') |
| source_domain_name | str | Yes | Exact domain name in TitanRDM |
| source_table_names | list[str] | No | Filter to specific table names |
| correlation_code | str | No | Tracking identifier prefix |
| poll_interval | float | No | Seconds between export checks (default: 2.0) |
| max_wait | float | No | Max seconds to wait per export (default: 300.0) |
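How `poll_interval` and `max_wait` interact can be pictured as a simple polling loop. The sketch below is not the SDK's implementation. `wait_until_ready` and the `is_ready` callback are hypothetical names used only to illustrate the timing.

```python
import time


def wait_until_ready(is_ready, poll_interval=2.0, max_wait=300.0):
    """Call is_ready() every poll_interval seconds until it returns True.

    Returns the number of checks made. Raises TimeoutError when another
    sleep would run past max_wait seconds.
    """
    deadline = time.monotonic() + max_wait
    attempts = 0
    while True:
        attempts += 1
        if is_ready():
            return attempts
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError(f"not ready after {max_wait} seconds")
        time.sleep(poll_interval)


# Stand-in for an export that becomes ready on the third check
checks = iter([False, False, True])
attempts = wait_until_ready(lambda: next(checks), poll_interval=0.01, max_wait=1.0)
print(attempts)  # 3
```

With the documented defaults, each export would be checked about every 2 seconds and abandoned after about 5 minutes, so very large tables may need a larger `max_wait`.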
Prerequisites
Before running SparkSync:
1. Create schemas in your catalog:

    CREATE SCHEMA IF NOT EXISTS dev.rdmin;
    CREATE SCHEMA IF NOT EXISTS dev.rdmout;

2. Populate upload source tables in rdmout with data that matches TitanRDM's database_table_name values.

3. Install the SDK in your cluster:

    %pip install titan-rdm-sdk

4. Store credentials in a Databricks secret scope:

    databricks secrets create-scope --scope titan-rdm
    databricks secrets put --scope titan-rdm --key url
    databricks secrets put --scope titan-rdm --key client_id
    databricks secrets put --scope titan-rdm --key client_secret
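Before an upload, it can help to confirm that the expected source tables exist. The sketch below assumes PySpark 3.4 or later, where `spark.catalog.tableExists` accepts a fully qualified table name. `missing_tables` is a hypothetical helper, and the table name in the usage comment is only an example that follows the naming convention.

```python
def missing_tables(spark, qualified_names):
    """Return the names from qualified_names that the Spark catalog does not contain."""
    return [
        name for name in qualified_names
        if not spark.catalog.tableExists(name)
    ]


# In a Databricks notebook, `spark` is the session provided by the runtime:
# missing = missing_tables(spark, ["dev.rdmout.clin_sites"])
# if missing:
#     raise RuntimeError(f"Missing upload source tables: {missing}")
```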
Complete Example
from titan_rdm_sdk import TitanRDMClient
from titan_rdm_sdk.spark_sync import SparkSync
# Configuration
TITAN_URL = dbutils.secrets.get(scope="titan-rdm", key="url")
TITAN_CLIENT_ID = dbutils.secrets.get(scope="titan-rdm", key="client_id")
TITAN_CLIENT_SECRET = dbutils.secrets.get(scope="titan-rdm", key="client_secret")
CATALOG = "dev"
DOWNLOAD_SCHEMA = "rdmin"
UPLOAD_SCHEMA = "rdmout"
# Initialise
client = TitanRDMClient(url=TITAN_URL, client_id=TITAN_CLIENT_ID, client_secret=TITAN_CLIENT_SECRET)
branch = client.get_branch_by_name("prod")
sync = SparkSync(client=client, spark=spark)
# Download all Clinics tables → dev.rdmin
download_results = sync.download_sync_by_convention(
    branch_id=branch.id,
    target_catalog=CATALOG,
    target_schema=DOWNLOAD_SCHEMA,
    source_domain_name="Clinics",
)
# Upload all Clinics tables from dev.rdmout → TitanRDM
upload_results = sync.upload_sync_by_convention(
    branch_id=branch.id,
    source_catalog=CATALOG,
    source_schema=UPLOAD_SCHEMA,
    target_domain_name="Clinics",
)
SparkSync vs Manual Convention Sync
| Feature | Manual (ConventionSync) | SparkSync |
| Catalog read/write | Manual spark.table() / .saveAsTable() | Automatic |
| Table filtering | Manual loop logic | Pass target_table_names / source_table_names |
| Batch management | Manual get_upload() / complete() | Handled internally |
| Lines of code | ~60 per direction | ~5 per direction |
Widgets for Parameterised Notebooks
Use Databricks widgets to make your sync notebooks configurable:
dbutils.widgets.text("branch_name", "prod", "Branch Name")
dbutils.widgets.text("download_schema", "rdmin", "Download Schema")
dbutils.widgets.text("upload_schema", "rdmout", "Upload Schema")
dbutils.widgets.text("catalog", "hive_metastore", "Catalog")
BRANCH_NAME = dbutils.widgets.get("branch_name")
DOWNLOAD_SCHEMA = dbutils.widgets.get("download_schema")
UPLOAD_SCHEMA = dbutils.widgets.get("upload_schema")
CATALOG = dbutils.widgets.get("catalog")
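Widget values are free text, so validating them before they are used in table names can catch typos early. This is a minimal sketch, assuming catalog and schema names are restricted to letters, digits, and underscores (your workspace may allow more). `require_identifier` is a hypothetical helper, not part of the SDK or of `dbutils`.

```python
import re

# Simple identifier: a letter or underscore, then letters, digits, or underscores
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def require_identifier(value: str, label: str) -> str:
    """Return value stripped of surrounding whitespace, or raise if it is not a simple identifier."""
    cleaned = value.strip()
    if not _IDENTIFIER.fullmatch(cleaned):
        raise ValueError(f"{label} is not a valid identifier: {value!r}")
    return cleaned


# In a notebook: CATALOG = require_identifier(dbutils.widgets.get("catalog"), "catalog")
print(require_identifier(" hive_metastore ", "catalog"))  # hive_metastore
```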
Example Notebook
For a complete working example, see the SparkSync Example Notebook.
Next Steps
- Convention Sync (Pandas) — Understand the base class
- Platform Integrations — BigQuery and Snowflake equivalents
- Example Notebooks — Download ready-to-use Databricks notebooks