# Databricks
This guide walks you through connecting Databricks to Solid so you can get insights into your data lineage, usage, and quality.
## Overview
Solid requires access to:
- The schema of your Databricks datasets
- Query logs from the Query History table in Databricks
- System tables for metadata extraction
You have two options for connecting Databricks to Solid:
| Option | Best For | What You'll Do |
|---|---|---|
| Automatic Pull (Recommended) | Ongoing use and continuous monitoring | Set up a service principal with API token |
| Manual Export | Quick proof of concept or one-time setup | Export data using Python notebooks and upload |
## Option 1: Automatic Pull (Recommended)

**Best for:** Production use with continuous monitoring and updates.

### Step 1: Create Authentication Credentials

Choose one of the following authentication methods:

#### Option A: Service Principal (Recommended)
1. As a Databricks account admin, log into the Databricks Manage Account Console → User Management → Service Principals tab
2. Click Add service principal, enter a name (e.g., `solid_collector`), and click Add
3. Ensure the service principal has the following entitlements (Location: Workspace settings → Identity and access → Service principals → Configurations → Entitlements):
   - Databricks SQL access
   - Workspace access
4. Create a service principal token following the Databricks documentation and save the token securely
#### Option B: Personal Access Token (Not Recommended)

- Create a Databricks personal access token for workspace users as described in the Databricks documentation
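Databricks personal access tokens begin with the prefix `dapi`. As a quick sanity check before saving a token into Solid, you can verify its general shape; a minimal sketch (the prefix is the only Databricks-specific part, and the character-class and length bound are assumptions, not a spec):

```python
import re

def looks_like_databricks_pat(token: str) -> bool:
    """Heuristic check: personal access tokens start with 'dapi' followed
    by a long lowercase-hex string. This is a shape check, not validation."""
    return re.fullmatch(r"dapi[0-9a-f\-]{30,}", token) is not None

# Placeholder token for illustration only -- never hardcode real tokens
assert looks_like_databricks_pat("dapi" + "0" * 32)
assert not looks_like_databricks_pat("not-a-token")
```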
### Step 2: Grant Permissions

#### System Tables Access

Solid extracts metadata from system tables. Follow these steps:
1. Enable system tables in Databricks by following the Databricks system tables documentation
2. Execute the instructions for each workspace that has Unity Catalog
3. Grant the Solid service principal permission to read the system tables:

```sql
GRANT USE CATALOG ON CATALOG system TO <solid_collector_service_principal>;
GRANT SELECT ON SCHEMA system.information_schema TO <solid_collector_service_principal>;
GRANT SELECT ON SCHEMA system.query TO <solid_collector_service_principal>;
```

#### Catalog Access
Grant the Solid service principal permissions on each catalog you want to integrate. These grants cascade to all schemas within the catalog:

```sql
GRANT USE CATALOG ON CATALOG <CATALOG> TO <solid_collector_service_principal>;
GRANT USE SCHEMA ON CATALOG <CATALOG> TO <solid_collector_service_principal>;
GRANT SELECT ON CATALOG <CATALOG> TO <solid_collector_service_principal>;
```

**Example:**

```sql
GRANT USE CATALOG ON CATALOG analytics_prod TO solid_collector;
GRANT USE SCHEMA ON CATALOG analytics_prod TO solid_collector;
GRANT SELECT ON CATALOG analytics_prod TO solid_collector;
```

#### User List Access
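If you have several catalogs to onboard, you can generate the grant statements instead of typing them by hand. A minimal sketch; the catalog names and principal in the example are hypothetical:

```python
# The three catalog-level privileges used in this guide
PRIVILEGES = ["USE CATALOG", "USE SCHEMA", "SELECT"]

def catalog_grants(catalogs, principal):
    """Build the grant statements from this guide for each catalog."""
    return [
        f"GRANT {p} ON CATALOG {c} TO {principal};"
        for c in catalogs
        for p in PRIVILEGES
    ]

# Example: two hypothetical catalogs
for stmt in catalog_grants(["analytics_prod", "analytics_dev"], "solid_collector"):
    print(stmt)
```

Paste the printed statements into the Databricks SQL editor and run them as an admin.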
To retrieve the list of users, you need a Workspace admin account.
### Step 3: Create SQL Warehouse

1. Create a SQL Warehouse following the Databricks SQL Warehouse documentation
2. Grant the service principal or user access to the SQL Warehouse:
   - Click the Permissions button at the top right of the SQL Warehouse configuration page
   - Add the user/service principal
   - Grant "Can use" permission
### Step 4: Verify Data Access
Confirm that the SQL Warehouse has access to the catalogs, schemas, and tables you want to integrate.
Run the following commands in the Databricks SQL editor to verify access:
```sql
-- Verify system tables access
SHOW TABLES IN system.information_schema;

-- Verify catalog access
SHOW CATALOGS;

-- Verify schema access (replace <CATALOG> with your catalog name)
SHOW SCHEMAS IN <CATALOG>;

-- Verify table access (replace with your catalog and schema)
SHOW TABLES IN <CATALOG>.<SCHEMA>;

-- Verify detailed table information
DESCRIBE EXTENDED <CATALOG>.<SCHEMA>.<TABLE>;
```

**Important:** If you're testing with a different user token than the one that will be configured in Solid, the test results may not reflect the actual permissions Solid will have.
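The verification queries above can also be run programmatically, giving you a pass/fail summary per check. A sketch that works with any DB-API-style cursor (for example, one opened with the `databricks-sql-connector` package); the placeholder catalog must be substituted before running against a real warehouse:

```python
# (name, query) pairs mirroring the manual checks in this step
VERIFICATION_QUERIES = [
    ("system tables access", "SHOW TABLES IN system.information_schema"),
    ("catalog access", "SHOW CATALOGS"),
    # Substitute your own catalog name below before running
    ("schema access", "SHOW SCHEMAS IN <CATALOG>"),
]

def run_checks(cursor, checks=VERIFICATION_QUERIES):
    """Execute each check; return {name: True/False}.
    Any exception (permission denied, bad identifier) counts as a failure."""
    results = {}
    for name, query in checks:
        try:
            cursor.execute(query)
            results[name] = True
        except Exception:
            results[name] = False
    return results
```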
### Step 5: Configure Solid

Once you've completed the setup:

1. Log into the Solid platform
2. Navigate to Settings → Integrations → Databricks
3. Enter the following information:
   - Workspace URL: Your Databricks workspace URL (e.g., `https://yourcompany.databricks.com`)
   - Token: The personal access token or service principal token
   - SQL Warehouse ID: The ID of the SQL Warehouse created in Step 3
4. Click Test Connection
5. Select the Catalogs and Schemas you want to monitor
6. Click Save
Solid will begin analyzing your Databricks metadata.
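A common cause of failed connection tests is a workspace URL missing its `https://` prefix or carrying a trailing path. A small sketch you can use to normalize the value before entering it (the example hostname is hypothetical):

```python
from urllib.parse import urlparse

def normalize_workspace_url(raw: str) -> str:
    """Ensure the URL has an https scheme and drop any trailing path/slash."""
    raw = raw.strip()
    if "://" not in raw:
        raw = "https://" + raw
    parsed = urlparse(raw)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"not a valid workspace URL: {raw!r}")
    return f"https://{parsed.netloc}"

print(normalize_workspace_url("yourcompany.databricks.com/"))
# -> https://yourcompany.databricks.com
```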
## Option 2: Manual Export

**Best for:** Testing Solid or if you can't grant direct access yet.

### Steps
- Export schema and query history using Python notebooks in Databricks
- Archive the exported files (ZIP or GZIP)
- Upload to the Solid Azure Storage container
#### Export Schema Information

Open a notebook in Databricks and paste this code:

```python
from pyspark.sql import Row

# List to hold the schema information
schema_info = []

# Step 1: Get all schemas
schemas = spark.sql("SHOW SCHEMAS").collect()

# Iterate through each schema
for schema in schemas:
    schema_name = schema["databaseName"]

    # Step 2: Get all tables in the schema
    tables = spark.sql(f"SHOW TABLES IN {schema_name}").collect()

    for table in tables:
        table_name = table["tableName"]

        # Step 3: Get the fields and data types of the table
        fields = spark.sql(f"DESCRIBE {schema_name}.{table_name}").collect()

        for field in fields:
            schema_info.append(
                Row(
                    schema_name=schema_name,
                    table_name=table_name,
                    field_name=field["col_name"],
                    data_type=field["data_type"],
                )
            )

# Convert the list to a DataFrame
schema_df = spark.createDataFrame(schema_info)

# Save the DataFrame as CSV
schema_df.toPandas().to_csv("/dbfs/path/to/save/databricks_schema.csv", index=False)
```

#### Export Query History
Open a notebook in Databricks and paste this code:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import sql
import time
import json

# Initialize the Databricks Workspace client
w = WorkspaceClient()

# Calculate the current time in milliseconds
end_time_ms = int(time.time() * 1000)

# Calculate the time 29 days ago in milliseconds
start_time_ms = end_time_ms - (29 * 24 * 60 * 60 * 1000)

# Retrieve the list of queries within the specified time range
queries = w.query_history.list(
    filter_by=sql.QueryFilter(
        query_start_time_range=sql.TimeRange(
            start_time_ms=start_time_ms,
            end_time_ms=end_time_ms
        )
    )
)

# Convert the list of queries to JSON format and write to a file
with open('/dbfs/path/to/save/query_history.json', 'w') as f:
    json.dump([query.as_dict() for query in queries], f, indent=4)
```

#### Upload to Solid
- Download the exported files from DBFS
- Archive the files (ZIP or GZIP format)
- Upload to the Solid Azure Storage container
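The archiving step above can be scripted; a minimal sketch using the standard library (the file names are placeholders for whatever the notebooks exported):

```python
import zipfile
from pathlib import Path

def archive_exports(files, archive_path="solid_export.zip"):
    """Bundle the exported files into a single ZIP for upload."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            # Store by base name so the archive has a flat layout
            zf.write(f, arcname=Path(f).name)
    return archive_path

# Example usage with the files produced by the notebooks above:
# archive_exports(["databricks_schema.csv", "query_history.json"])
```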
## What Data Does Solid Collect?

### Metadata

Solid collects table and column information from system tables:
- Database/catalog names
- Schema names
- Table names
- Column names and data types
- Table properties and descriptions
- Clustering information
### Query History
Solid collects query execution information:
- Query text
- Execution time
- User information
- Query status
- Resource usage metrics
### System Tables

Solid accesses the following system tables:

- `system.information_schema`: Metadata about catalogs, schemas, tables, and columns
- `system.query`: Query execution history and statistics
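As an illustration of the kind of data read from these schemas, a query like the following returns recent query activity (assuming the standard `system.query.history` table; exact column availability may vary by Databricks release):

```sql
SELECT statement_text, executed_by, start_time, execution_status
FROM system.query.history
ORDER BY start_time DESC
LIMIT 10;
```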
## Troubleshooting

### Connection Issues
If you're having trouble connecting:
- Verify the workspace URL is correct and includes the protocol (https://)
- Check the token is valid and hasn't expired
- Confirm the SQL Warehouse is running and accessible
- Verify all grants completed successfully by running the verification queries in Step 4
### Permission Issues
If Solid can't access certain catalogs or tables:
- Check service principal permissions by running `SHOW GRANTS ON CATALOG <catalog_name>`
- Verify system tables are enabled for your workspace
- Ensure the SQL Warehouse has access to the Unity Catalog metastore
- Confirm catalog and schema grants were applied correctly
### SQL Warehouse Issues
If the SQL Warehouse isn't accessible:
- Verify the service principal has "Can use" permission on the SQL Warehouse
- Check the SQL Warehouse is running and not in a stopped state
- Ensure the SQL Warehouse is connected to the correct metastore
## Security Best Practices
- Use service principals instead of personal access tokens for production integrations
- Store tokens securely and never commit them to version control
- Rotate tokens regularly according to your organization's security policies
- Grant minimum necessary permissions following the principle of least privilege
- Monitor token usage through Databricks audit logs
- Use dedicated SQL Warehouses for integration workloads to isolate resource usage
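In keeping with the "never commit tokens" guidance above, scripts that talk to Databricks should read the token from the environment rather than embedding it. A minimal sketch using `DATABRICKS_TOKEN`, the variable name the Databricks CLI and SDK also honor:

```python
import os

def get_databricks_token() -> str:
    """Read the API token from the environment; fail fast if missing."""
    token = os.environ.get("DATABRICKS_TOKEN")
    if not token:
        raise RuntimeError(
            "DATABRICKS_TOKEN is not set; export it instead of hardcoding tokens"
        )
    return token
```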
