Databricks

This guide walks you through connecting Databricks to Solid so you can get insights into your data lineage, usage, and quality.

Overview

Solid requires access to:

  • The schema of your Databricks datasets
  • Query logs from the Query History table in Databricks
  • System tables for metadata extraction

You have two options for connecting Databricks to Solid:

Option | Best For | What You'll Do
Automatic Pull (Recommended) | Ongoing use and continuous monitoring | Set up a service principal with an API token
Manual Export | Quick proof of concept or one-time setup | Export data using Python notebooks and upload

Option 1: Automatic Pull (Recommended)

Best for: Production use with continuous monitoring and updates.

Step 1: Create Authentication Credentials

Choose one of the following authentication methods:

Option A: Service Principal (Recommended)

  1. As a Databricks account admin, log in to the Databricks account console and go to User Management → Service Principals

  2. Click Add service principal, enter a name (e.g., solid_collector), and click Add

  3. Ensure the Service Principal has the following entitlements:

    • Databricks SQL access
    • Workspace access

    You can find these under Workspace settings → Identity and access → Service principals → Configurations → Entitlements.
  4. Create a Service Principal Token following the Databricks documentation and save the token securely

Option B: Personal Access Token (Not Recommended)

  1. Create a Databricks personal access token for workspace users as described in the Databricks documentation

Step 2: Grant Permissions

System Tables Access

Solid extracts metadata from system tables. Follow these steps:

  1. Enable System Tables in Databricks by following the Databricks system tables documentation

    • Execute the instructions for each workspace that has Unity Catalog
  2. Grant permissions to the Solid Service Principal to read system tables:

GRANT USE CATALOG ON CATALOG system TO <solid_collector_service_principal>;
GRANT USE SCHEMA, SELECT ON SCHEMA system.information_schema TO <solid_collector_service_principal>;
GRANT USE SCHEMA, SELECT ON SCHEMA system.query TO <solid_collector_service_principal>;

Catalog Access

Grant permissions to the Solid Service Principal for each catalog you want to integrate. This will cascade to all schemas within the catalog:

GRANT USE CATALOG ON CATALOG <CATALOG> TO <solid_collector_service_principal>;
GRANT USE SCHEMA ON CATALOG <CATALOG> TO <solid_collector_service_principal>;
GRANT SELECT ON CATALOG <CATALOG> TO <solid_collector_service_principal>;

Example:

GRANT USE CATALOG ON CATALOG analytics_prod TO solid_collector;
GRANT USE SCHEMA ON CATALOG analytics_prod TO solid_collector;
GRANT SELECT ON CATALOG analytics_prod TO solid_collector;
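If you are onboarding several catalogs, the three grants can be generated per catalog rather than typed by hand. A convenience sketch — the catalog list and principal name are placeholders, and the backticks simply quote the principal identifier:

```python
# Generate the three per-catalog GRANT statements for each catalog you
# want Solid to read. Catalog names and the principal are placeholders.
PRIVILEGES = ["USE CATALOG", "USE SCHEMA", "SELECT"]

def grant_statements(catalogs, principal):
    """Return the GRANT statements for every catalog/privilege pair."""
    return [
        f"GRANT {priv} ON CATALOG {catalog} TO `{principal}`;"
        for catalog in catalogs
        for priv in PRIVILEGES
    ]

for stmt in grant_statements(["analytics_prod", "analytics_dev"], "solid_collector"):
    print(stmt)
```

You can paste the printed statements into the Databricks SQL editor, or execute them programmatically against the SQL Warehouse.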

User List Access

Retrieving the workspace user list requires workspace admin privileges. If you want Solid to collect user information, the service principal (or user) whose token you configure must be a workspace admin.

Step 3: Create SQL Warehouse

  1. Create a SQL Warehouse following the Databricks SQL Warehouse documentation

  2. Grant the service principal or user access to the SQL Warehouse:

    • Click the Permissions button at the top right of the SQL Warehouse configuration page
    • Add the user/service principal
    • Grant "Can use" permission

Step 4: Verify Data Access

Confirm that the SQL Warehouse has access to the catalogs, schemas, and tables you want to integrate.

Run the following commands in the Databricks SQL editor to verify access:

-- Verify system tables access
SHOW TABLES IN system.information_schema;

-- Verify catalog access
SHOW CATALOGS;

-- Verify schema access (replace <CATALOG> with your catalog name)
SHOW SCHEMAS IN <CATALOG>;

-- Verify table access (replace with your catalog and schema)
SHOW TABLES IN <CATALOG.SCHEMA>;

-- Verify detailed table information
DESCRIBE EXTENDED <CATALOG.SCHEMA.TABLE>;

Important: Run these checks with the same principal (token) that Solid will use. Results from a different user's token may not reflect the permissions Solid will actually have.
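These checks can also be scripted with the Databricks Python SDK's Statement Execution API, so they run with exactly the token Solid will use. A sketch, assuming the databricks-sdk package is installed and using placeholder warehouse, catalog, and schema names:

```python
# The verification queries from this step; catalog/schema names are placeholders.
def verification_queries(catalog, schema):
    return [
        "SHOW TABLES IN system.information_schema",
        "SHOW CATALOGS",
        f"SHOW SCHEMAS IN {catalog}",
        f"SHOW TABLES IN {catalog}.{schema}",
    ]

def run_checks(warehouse_id, catalog, schema):
    # Requires the databricks-sdk package. WorkspaceClient() reads
    # DATABRICKS_HOST / DATABRICKS_TOKEN from the environment -- set these
    # to the token you will configure in Solid.
    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()
    for query in verification_queries(catalog, schema):
        resp = w.statement_execution.execute_statement(
            statement=query, warehouse_id=warehouse_id, wait_timeout="30s"
        )
        print(query, "->", resp.status.state)

# run_checks("<warehouse_id>", "analytics_prod", "reporting")  # uncomment to run
```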

Step 5: Configure Solid

Once you've completed the setup:

  1. Log into the Solid platform
  2. Navigate to Settings → Integrations → Databricks
  3. Enter the following information:
    • Workspace URL: Your Databricks workspace URL, including the https:// protocol (e.g., https://yourcompany.cloud.databricks.com)
    • Token: The personal access token or service principal token
    • SQL Warehouse ID: The ID of the SQL Warehouse created in Step 3
  4. Click Test Connection
  5. Select the Catalogs and Schemas you want to monitor
  6. Click Save

Solid will begin analyzing your Databricks metadata.


Option 2: Manual Export

Best for: Testing Solid or if you can't grant direct access yet.

Steps

  1. Export schema and query history using Python notebooks in Databricks
  2. Archive the exported files (ZIP or GZIP)
  3. Upload to the Solid Azure Storage container

Export Schema Information

Open a notebook in Databricks and paste this code:

from pyspark.sql import Row

# Lists to hold the schema information
schema_info = []

# Step 1: Get all schemas (SHOW SCHEMAS only lists the current catalog;
# run "USE CATALOG <catalog>" first, or loop over SHOW CATALOGS, to cover others)
schemas = spark.sql("SHOW SCHEMAS").collect()

# Iterate through each schema
for schema in schemas:
    schema_name = schema["databaseName"]  # some Spark versions name this column "namespace"
    
    # Step 2: Get all tables in the schema
    tables = spark.sql(f"SHOW TABLES IN `{schema_name}`").collect()
    
    for table in tables:
        table_name = table["tableName"]
        
        # Step 3: Get the fields and data types of the table
        fields = spark.sql(f"DESCRIBE `{schema_name}`.`{table_name}`").collect()
        
        for field in fields:
            # Skip DESCRIBE section markers (e.g., "# Partition Information")
            # and blank separator rows
            if not field["col_name"] or field["col_name"].startswith("#"):
                continue
            schema_info.append(
                Row(
                    schema_name=schema_name,
                    table_name=table_name,
                    field_name=field["col_name"],
                    data_type=field["data_type"],
                )
            )

# Convert the list to a DataFrame
schema_df = spark.createDataFrame(schema_info)

# Save the DataFrame
schema_df.toPandas().to_csv("/dbfs/path/to/save/databricks_schema.csv", index=False)

Export Query History

Open a notebook in Databricks and paste this code:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import sql
import time
import json

# Initialize the Databricks Workspace client
w = WorkspaceClient()

# Calculate the current time in milliseconds
end_time_ms = int(time.time() * 1000)

# Calculate the time 29 days ago in milliseconds
start_time_ms = end_time_ms - (29 * 24 * 60 * 60 * 1000)

# Retrieve the list of queries within the specified time range
queries = w.query_history.list(
    filter_by=sql.QueryFilter(
        query_start_time_range=sql.TimeRange(
            start_time_ms=start_time_ms,
            end_time_ms=end_time_ms
        )
    )
)

# Convert the queries to dictionaries and write them to a JSON file
with open('/dbfs/path/to/save/query_history.json', 'w') as f:
    json.dump([query.as_dict() for query in queries], f, indent=4)

Upload to Solid

  1. Download the exported files from DBFS
  2. Archive the files (ZIP or GZIP format)
  3. Upload to the Solid Azure Storage container
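The archiving step can be done directly in a notebook with Python's standard zipfile module. A minimal sketch — the file paths are the placeholder DBFS paths used in the export code above:

```python
import zipfile

def archive_exports(paths, archive_path):
    """Bundle the exported files into a single ZIP archive for upload."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            # arcname keeps only the file name, so the archive has a flat layout
            zf.write(path, arcname=path.rsplit("/", 1)[-1])
    return archive_path

# archive_exports(
#     ["/dbfs/path/to/save/databricks_schema.csv",
#      "/dbfs/path/to/save/query_history.json"],
#     "/dbfs/path/to/save/solid_export.zip",
# )
```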

What Data Does Solid Collect?

Metadata

Solid collects table and column information from system tables:

  • Database/catalog names
  • Schema names
  • Table names
  • Column names and data types
  • Table properties and descriptions
  • Clustering information

Query History

Solid collects query execution information:

  • Query text
  • Execution time
  • User information
  • Query status
  • Resource usage metrics

System Tables

Solid accesses the following system tables:

  • system.information_schema: Metadata about catalogs, schemas, tables, and columns
  • system.query: Query execution history and statistics (e.g., the system.query.history table)
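As an illustration of what this query history looks like, the helper below builds a query over system.query.history. The column names (statement_text, executed_by, start_time) are assumptions based on current Databricks system tables — check DESCRIBE system.query.history on your workspace before relying on them:

```python
# Build a query over system.query.history for the last N days.
# Column names are assumptions -- verify against your workspace.
def recent_queries_sql(days=30):
    return (
        "SELECT statement_text, executed_by, start_time "
        "FROM system.query.history "
        f"WHERE start_time >= current_timestamp() - INTERVAL {days} DAYS "
        "ORDER BY start_time DESC"
    )

print(recent_queries_sql(7))
```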

Troubleshooting

Connection Issues

If you're having trouble connecting:

  1. Verify the workspace URL is correct and includes the protocol (https://)
  2. Check the token is valid and hasn't expired
  3. Confirm the SQL Warehouse is running and accessible
  4. Verify all grants completed successfully by running the verification queries in Step 4
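A common cause of connection failures is a malformed workspace URL. A quick pure-Python sanity check you can run before configuring Solid — the rules here are illustrative, not an exhaustive validation:

```python
from urllib.parse import urlparse

def check_workspace_url(url):
    """Return a list of problems found with a Databricks workspace URL."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme != "https":
        problems.append("URL must start with https://")
    if not parsed.netloc:
        problems.append("URL has no host name")
    if parsed.path not in ("", "/"):
        problems.append("URL should not include a path after the host")
    return problems

print(check_workspace_url("yourcompany.cloud.databricks.com"))
# a bare host with no scheme fails the https:// check
```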

Permission Issues

If Solid can't access certain catalogs or tables:

  1. Check service principal permissions by running SHOW GRANTS ON CATALOG <catalog_name>
  2. Verify system tables are enabled for your workspace
  3. Ensure the SQL Warehouse has access to the Unity Catalog metastore
  4. Confirm catalog and schema grants were applied correctly

SQL Warehouse Issues

If the SQL Warehouse isn't accessible:

  1. Verify the service principal has "Can use" permission on the SQL Warehouse
  2. Check the SQL Warehouse is running and not in a stopped state
  3. Ensure the SQL Warehouse is connected to the correct metastore

Security Best Practices

  • Use service principals instead of personal access tokens for production integrations
  • Store tokens securely and never commit them to version control
  • Rotate tokens regularly according to your organization's security policies
  • Grant minimum necessary permissions following the principle of least privilege
  • Monitor token usage through Databricks audit logs
  • Use dedicated SQL Warehouses for integration workloads to isolate resource usage