# Databricks
This guide walks you through connecting Databricks to Solid so you can get insights into your data lineage, usage, and quality.
## Overview
Solid requires access to:
- The schema of your Databricks datasets
- Query logs from the Query History table in Databricks
- System tables for metadata extraction
You have two options for connecting Databricks to Solid:
| Option | Best For | What You'll Do |
|---|---|---|
| Automatic Pull (Recommended) | Ongoing use and continuous monitoring | Set up a service principal with API token |
| Manual Export | Quick proof of concept or one-time setup | Export data using Python notebooks and upload |
## Option 1: Automatic Pull (Recommended)

**Best for:** Production use with continuous monitoring and updates.

### Step 1: Create Authentication Credentials

Choose one of the following authentication methods:

#### Option A: Service Principal (Recommended)
1. As a Databricks account admin, log into the Databricks Manage Account Console → User Management → Service Principals tab
2. Click Add service principal, enter a name (e.g., `solid_collector`), and click Add
3. Ensure the service principal has the following entitlements (Location: Workspace settings → Identity and access → Service principals → Configurations → Entitlements):
   - Databricks SQL access
   - Workspace access
4. Create a service principal token following the Databricks documentation and save the token securely
#### Option B: Personal Access Token (Not Recommended)

- Create a Databricks personal access token for workspace users as described in the Databricks documentation
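Databricks personal access tokens begin with the prefix `dapi`. As a quick sanity check before saving a token into Solid, you can verify its general shape; a minimal sketch (the prefix is the only Databricks-specific part, and the character-class and length bound are assumptions, not a spec):

```python
import re

def looks_like_databricks_pat(token: str) -> bool:
    """Heuristic check: personal access tokens start with 'dapi' followed
    by a long lowercase-hex string. This is a shape check, not validation."""
    return re.fullmatch(r"dapi[0-9a-f\-]{30,}", token) is not None

# Placeholder token for illustration only -- never hardcode real tokens
assert looks_like_databricks_pat("dapi" + "0" * 32)
assert not looks_like_databricks_pat("not-a-token")
```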
### Step 2: Grant Permissions

#### System Tables Access

Solid extracts metadata from system tables. Follow these steps:
1. Enable system tables in Databricks by following the Databricks system tables documentation
2. Execute the instructions for each workspace that has Unity Catalog
3. Grant the Solid service principal permission to read the system tables:

```sql
GRANT USE CATALOG ON CATALOG system TO <solid_collector_service_principal>;
GRANT SELECT ON SCHEMA system.information_schema TO <solid_collector_service_principal>;
GRANT SELECT ON SCHEMA system.query TO <solid_collector_service_principal>;
```

#### Catalog Access
Grant the Solid service principal permissions on each catalog you want to integrate. These grants cascade to all schemas within the catalog:

```sql
GRANT USE CATALOG ON CATALOG <CATALOG> TO <solid_collector_service_principal>;
GRANT USE SCHEMA ON CATALOG <CATALOG> TO <solid_collector_service_principal>;
GRANT SELECT ON CATALOG <CATALOG> TO <solid_collector_service_principal>;
```

**Example:**

```sql
GRANT USE CATALOG ON CATALOG analytics_prod TO solid_collector;
GRANT USE SCHEMA ON CATALOG analytics_prod TO solid_collector;
GRANT SELECT ON CATALOG analytics_prod TO solid_collector;
```

#### User List Access
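If you have several catalogs to onboard, you can generate the grant statements instead of typing them by hand. A minimal sketch; the catalog names and principal in the example are hypothetical:

```python
# The three catalog-level privileges used in this guide
PRIVILEGES = ["USE CATALOG", "USE SCHEMA", "SELECT"]

def catalog_grants(catalogs, principal):
    """Build the grant statements from this guide for each catalog."""
    return [
        f"GRANT {p} ON CATALOG {c} TO {principal};"
        for c in catalogs
        for p in PRIVILEGES
    ]

# Example: two hypothetical catalogs
for stmt in catalog_grants(["analytics_prod", "analytics_dev"], "solid_collector"):
    print(stmt)
```

Paste the printed statements into the Databricks SQL editor and run them as an admin.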
To retrieve the list of users, you need a Workspace admin account.
### Step 3: Create SQL Warehouse

1. Create a SQL Warehouse following the Databricks SQL Warehouse documentation
2. Grant the service principal or user access to the SQL Warehouse:
   - Click the Permissions button at the top right of the SQL Warehouse configuration page
   - Add the user/service principal
   - Grant "Can use" permission
### Step 4: Verify Data Access
Confirm that the SQL Warehouse has access to the catalogs, schemas, and tables you want to integrate.
Run the following commands in the Databricks SQL editor to verify access:
```sql
-- Verify system tables access
SHOW TABLES IN system.information_schema;

-- Verify catalog access
SHOW CATALOGS;

-- Verify schema access (replace <CATALOG> with your catalog name)
SHOW SCHEMAS IN <CATALOG>;

-- Verify table access (replace with your catalog and schema)
SHOW TABLES IN <CATALOG>.<SCHEMA>;

-- Verify detailed table information
DESCRIBE EXTENDED <CATALOG>.<SCHEMA>.<TABLE>;
```

**Important:** If you're testing with a different user token than the one that will be configured in Solid, the test results may not reflect the actual permissions Solid will have.
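The verification queries above can also be run programmatically, giving you a pass/fail summary per check. A sketch that works with any DB-API-style cursor (for example, one opened with the `databricks-sql-connector` package); the placeholder catalog must be substituted before running against a real warehouse:

```python
# (name, query) pairs mirroring the manual checks in this step
VERIFICATION_QUERIES = [
    ("system tables access", "SHOW TABLES IN system.information_schema"),
    ("catalog access", "SHOW CATALOGS"),
    # Substitute your own catalog name below before running
    ("schema access", "SHOW SCHEMAS IN <CATALOG>"),
]

def run_checks(cursor, checks=VERIFICATION_QUERIES):
    """Execute each check; return {name: True/False}.
    Any exception (permission denied, bad identifier) counts as a failure."""
    results = {}
    for name, query in checks:
        try:
            cursor.execute(query)
            results[name] = True
        except Exception:
            results[name] = False
    return results
```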
### Step 5: Configure Solid

Once you've completed the setup:

1. Log into the Solid platform
2. Navigate to Settings → Integrations → Databricks
3. Enter the following information:
   - Workspace URL: Your Databricks workspace URL (e.g., `https://yourcompany.databricks.com`)
   - Token: The personal access token or service principal token
   - SQL Warehouse ID: The ID of the SQL Warehouse created in Step 3
4. Click Test Connection
5. Select the Catalogs and Schemas you want to monitor
6. Click Save
Solid will begin analyzing your Databricks metadata.
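A common cause of failed connection tests is a workspace URL missing its `https://` prefix or carrying a trailing path. A small sketch you can use to normalize the value before entering it (the example hostname is hypothetical):

```python
from urllib.parse import urlparse

def normalize_workspace_url(raw: str) -> str:
    """Ensure the URL has an https scheme and drop any trailing path/slash."""
    raw = raw.strip()
    if "://" not in raw:
        raw = "https://" + raw
    parsed = urlparse(raw)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"not a valid workspace URL: {raw!r}")
    return f"https://{parsed.netloc}"

print(normalize_workspace_url("yourcompany.databricks.com/"))
# -> https://yourcompany.databricks.com
```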
## Option 2: Manual Export

**Best for:** Testing Solid or if you can't grant direct access yet.

### Steps
- Export schema and query history using Python notebooks in Databricks
- Archive the exported files (ZIP or GZIP)
- Upload to the Solid Azure Storage container
#### Export Schema Information

Open a notebook in Databricks and paste this code:

```python
from pyspark.sql import Row

# List to hold the schema information
schema_info = []

# Step 1: Get all schemas
schemas = spark.sql("SHOW SCHEMAS").collect()

# Iterate through each schema
for schema in schemas:
    schema_name = schema["databaseName"]

    # Step 2: Get all tables in the schema
    tables = spark.sql(f"SHOW TABLES IN {schema_name}").collect()

    for table in tables:
        table_name = table["tableName"]

        # Step 3: Get the fields and data types of the table
        fields = spark.sql(f"DESCRIBE {schema_name}.{table_name}").collect()

        for field in fields:
            schema_info.append(
                Row(
                    schema_name=schema_name,
                    table_name=table_name,
                    field_name=field["col_name"],
                    data_type=field["data_type"],
                )
            )

# Convert the list to a DataFrame
schema_df = spark.createDataFrame(schema_info)

# Save the DataFrame as CSV
schema_df.toPandas().to_csv("/dbfs/path/to/save/databricks_schema.csv", index=False)
```

#### Export Query History
Open a notebook in Databricks and paste this code:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import sql
import time
import json

# Initialize the Databricks Workspace client
w = WorkspaceClient()

# Calculate the current time in milliseconds
end_time_ms = int(time.time() * 1000)

# Calculate the time 29 days ago in milliseconds
start_time_ms = end_time_ms - (29 * 24 * 60 * 60 * 1000)

# Retrieve the list of queries within the specified time range
queries = w.query_history.list(
    filter_by=sql.QueryFilter(
        query_start_time_range=sql.TimeRange(
            start_time_ms=start_time_ms,
            end_time_ms=end_time_ms
        )
    )
)

# Convert the list of queries to JSON format and write to a file
with open('/dbfs/path/to/save/query_history.json', 'w') as f:
    json.dump([query.as_dict() for query in queries], f, indent=4)
```

#### Upload to Solid
- Download the exported files from DBFS
- Archive the files (ZIP or GZIP format)
- Upload to the Solid Azure Storage container
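The archiving step above can be scripted; a minimal sketch using the standard library (the file names are placeholders for whatever the notebooks exported):

```python
import zipfile
from pathlib import Path

def archive_exports(files, archive_path="solid_export.zip"):
    """Bundle the exported files into a single ZIP for upload."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            # Store by base name so the archive has a flat layout
            zf.write(f, arcname=Path(f).name)
    return archive_path

# Example usage with the files produced by the notebooks above:
# archive_exports(["databricks_schema.csv", "query_history.json"])
```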
## What Data Does Solid Collect?

### Metadata

Solid collects table and column information from system tables:
- Database/catalog names
- Schema names
- Table names
- Column names and data types
- Table properties and descriptions
- Clustering information
### Query History
Solid collects query execution information:
- Query text
- Execution time
- User information
- Query status
- Resource usage metrics
### System Tables

Solid accesses the following system tables:

- `system.information_schema`: Metadata about catalogs, schemas, tables, and columns
- `system.query`: Query execution history and statistics
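As an illustration of the kind of data read from these schemas, a query like the following returns recent query activity (assuming the standard `system.query.history` table; exact column availability may vary by Databricks release):

```sql
SELECT statement_text, executed_by, start_time, execution_status
FROM system.query.history
ORDER BY start_time DESC
LIMIT 10;
```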
## Troubleshooting

### Connection Issues
If you're having trouble connecting:
- Verify the workspace URL is correct and includes the protocol (https://)
- Check the token is valid and hasn't expired
- Confirm the SQL Warehouse is running and accessible
- Verify all grants completed successfully by running the verification queries in Step 4
### Permission Issues
If Solid can't access certain catalogs or tables:
- Check service principal permissions by running `SHOW GRANTS ON CATALOG <catalog_name>`
- Verify system tables are enabled for your workspace
- Ensure the SQL Warehouse has access to the Unity Catalog metastore
- Confirm catalog and schema grants were applied correctly
### SQL Warehouse Issues
If the SQL Warehouse isn't accessible:
- Verify the service principal has "Can use" permission on the SQL Warehouse
- Check the SQL Warehouse is running and not in a stopped state
- Ensure the SQL Warehouse is connected to the correct metastore
## Security Best Practices
- Use service principals instead of personal access tokens for production integrations
- Store tokens securely and never commit them to version control
- Rotate tokens regularly according to your organization's security policies
- Grant minimum necessary permissions following the principle of least privilege
- Monitor token usage through Databricks audit logs
- Use dedicated SQL Warehouses for integration workloads to isolate resource usage
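In keeping with the "never commit tokens" guidance above, scripts that talk to Databricks should read the token from the environment rather than embedding it. A minimal sketch using `DATABRICKS_TOKEN`, the variable name the Databricks CLI and SDK also honor:

```python
import os

def get_databricks_token() -> str:
    """Read the API token from the environment; fail fast if missing."""
    token = os.environ.get("DATABRICKS_TOKEN")
    if not token:
        raise RuntimeError(
            "DATABRICKS_TOKEN is not set; export it instead of hardcoding tokens"
        )
    return token
```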
