BigQuery

This guide explains how to set up a dedicated Service Account and Custom Role in Google Cloud Platform (GCP) to allow Solid to analyze your BigQuery environment and build data lineage and usage profiles.

Overview

Solid requires specific permissions to crawl metadata and query history from your BigQuery environment.

Permissions Required

  • Project Access: To list projects under your account
  • Dataset & Table Access: To extract metadata (schemas, clustering, descriptions)
  • Job Access: To read query logs from completed jobs
  • [Optional] Read Access: To collect data samples (statistics/null counts)

IMPORTANT: You must have Owner or IAM Admin permissions on the relevant GCP project(s) to create these roles and service accounts.


Step 1: Create a Custom Role

To follow the principle of least privilege, create a custom role with only the necessary job-related permissions.

Instructions

  1. In the GCP Console, go to IAM & Admin → Roles
  2. Select your project at the top of the dashboard
  3. Click + Create Role
  4. Enter the following details:
    • Title: Solid Data Role
    • Role launch stage: General Availability
  5. Click + Add Permissions and filter/add the following:
    • bigquery.jobs.create
    • bigquery.jobs.get
    • bigquery.jobs.list
    • bigquery.jobs.listAll
    • bigquery.tables.getData (Optional: Only if data profiling is required)
  6. Click Create
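If you prefer the command line, the same role can be created with the gcloud CLI. This is a sketch: the role ID solid_data_role and PROJECT_ID are placeholders, not names Solid requires.

```shell
# Create a custom role containing only the job-related permissions from Step 1.
# "solid_data_role" and PROJECT_ID are placeholders -- substitute your own values.
gcloud iam roles create solid_data_role \
    --project="PROJECT_ID" \
    --title="Solid Data Role" \
    --stage="GA" \
    --permissions="bigquery.jobs.create,bigquery.jobs.get,bigquery.jobs.list,bigquery.jobs.listAll"

# Append bigquery.tables.getData to the --permissions list only if data profiling is required.
```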

Step 2: Create a Service Account

Instructions

  1. In the GCP Console, go to IAM & Admin → Service Accounts
  2. Click + Create Service Account
  3. Enter the following details:
    • Name: Solid Data
    • Click Create and Continue
  4. Assign Roles: Add the following two roles:
    • BigQuery Metadata Viewer (Predefined)
    • Solid Data Role (The custom role you created in Step 1)
  5. Click Done
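The service account creation and role grants above can also be scripted. The account ID solid-data and the custom role ID solid_data_role below are assumptions mirroring the console names:

```shell
# Create the service account (display name matches Step 2).
gcloud iam service-accounts create solid-data \
    --project="PROJECT_ID" \
    --display-name="Solid Data"

# Grant the predefined BigQuery Metadata Viewer role plus the custom role from Step 1.
SA_EMAIL="solid-data@PROJECT_ID.iam.gserviceaccount.com"
gcloud projects add-iam-policy-binding "PROJECT_ID" \
    --member="serviceAccount:${SA_EMAIL}" \
    --role="roles/bigquery.metadataViewer"
gcloud projects add-iam-policy-binding "PROJECT_ID" \
    --member="serviceAccount:${SA_EMAIL}" \
    --role="projects/PROJECT_ID/roles/solid_data_role"
```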

Supporting Multi-Project Access

If your data is spread across multiple GCP projects, repeat the following for each project:

  1. Go to the target project's IAM & Admin page
  2. Click Grant Access
  3. Enter the email address of the service account you just created
  4. Assign the same two roles listed above:
    • BigQuery Metadata Viewer
    • Solid Data Role
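For many projects, the per-project grants can be applied in a loop. The project IDs and role ID below are placeholders; note that project-level custom roles exist only in the project where they were created, so the custom role must be created in each target project (or once at the organization level):

```shell
# Service account created in Step 2 (placeholder email).
SA_EMAIL="solid-data@PROJECT_ID.iam.gserviceaccount.com"

for project in "project-a" "project-b"; do
    gcloud projects add-iam-policy-binding "${project}" \
        --member="serviceAccount:${SA_EMAIL}" \
        --role="roles/bigquery.metadataViewer"
    # Assumes the custom role was also created in ${project}.
    gcloud projects add-iam-policy-binding "${project}" \
        --member="serviceAccount:${SA_EMAIL}" \
        --role="projects/${project}/roles/solid_data_role"
done
```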

Step 3: Create a JSON Key for Solid

  1. In the Service Accounts list, click on the account you just created (Solid Data)
  2. Go to the Keys tab
  3. Click Add Key → Create new key
  4. Select JSON and click Create
  5. A JSON file will download automatically

Security Note: Keep this JSON file secure—you will upload it to the Solid platform. Do not share it or commit it to version control.
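The key can also be generated from the CLI. The output filename is an example; the email must match the service account you created:

```shell
# Creates a new JSON key and writes it to the given file.
gcloud iam service-accounts keys create solid-data-key.json \
    --iam-account="solid-data@PROJECT_ID.iam.gserviceaccount.com"
```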


Step 4: Configure Solid

Once you have the JSON key:

  1. Log into the Solid platform
  2. Navigate to Settings → Integrations → BigQuery
  3. Upload the JSON Key file
  4. Click Test Connection
    • If the connection fails, verify that:
      • The service account has the correct roles assigned
      • The JSON key file is valid and not corrupted
      • All required permissions are granted
  5. Once successful, select the Datasets you wish to monitor
  6. Click Save
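Before uploading the key, you can optionally sanity-check it from your own machine with the gcloud and bq CLIs. This is a quick smoke test, not part of the Solid setup itself:

```shell
# Authenticate as the service account using the downloaded key.
gcloud auth activate-service-account --key-file="solid-data-key.json"

# Listing datasets confirms the key is valid and metadata access is granted.
bq ls --project_id="PROJECT_ID"
```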

Solid will begin indexing your BigQuery metadata.


What Data Does Solid Collect?

For transparency, here are the specific queries and data sources Solid uses when collecting information from BigQuery.

Note: BigQuery calculates usage and identifies users differently from Snowflake: Solid parses query logs rather than reading aggregated DML history tables.

Metadata & Lineage

Solid joins COLUMNS, TABLES, TABLE_OPTIONS, and COLUMN_FIELD_PATHS to fetch descriptions and clustering information:

SELECT 
    t1.*, 
    t2.*, 
    t3.option_value AS TABLE_COMMENT, 
    t4.description AS COLUMN_COMMENT
FROM 
    `project.region.INFORMATION_SCHEMA.COLUMNS` t1
LEFT JOIN `project.region.INFORMATION_SCHEMA.TABLES` t2 
    ON t1.table_schema = t2.table_schema
    AND t1.table_name = t2.table_name
LEFT JOIN `project.region.INFORMATION_SCHEMA.TABLE_OPTIONS` t3 
    ON t1.table_schema = t3.table_schema
    AND t1.table_name = t3.table_name 
    AND t3.option_name = 'description'
LEFT JOIN `project.region.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS` t4 
    ON t1.table_schema = t4.table_schema
    AND t1.table_name = t4.table_name 
    AND t1.column_name = t4.field_path
WHERE 
    t1.table_schema IN ('your_datasets');

Usage & User Tracking

Unlike Snowflake, BigQuery does not have an aggregated DML history table. Instead, Solid identifies usage and active users by parsing the Query Logs from INFORMATION_SCHEMA.JOBS_BY_PROJECT.

  • Usage: Aggregates inserted_row_count, deleted_row_count, etc., from JOBS_BY_PROJECT where the statement type is INSERT, UPDATE, DELETE, or MERGE
  • Users: Extracts user_email and last_success_login from the query history to identify active data consumers
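The exact query Solid runs is internal, but against the JOBS_BY_PROJECT schema the usage metric can be sketched roughly as follows. The dml_statistics struct is a real column of that view; using MAX(end_time) as the last-activity timestamp is an assumption made here for illustration:

```sql
SELECT 
    user_email,
    statement_type,
    SUM(dml_statistics.inserted_row_count) AS inserted_rows,
    SUM(dml_statistics.updated_row_count) AS updated_rows,
    SUM(dml_statistics.deleted_row_count) AS deleted_rows,
    MAX(end_time) AS last_activity
FROM 
    `project.region.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE 
    statement_type IN ('INSERT', 'UPDATE', 'DELETE', 'MERGE')
    AND state = 'DONE'
GROUP BY 
    user_email, statement_type;
```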

Query History

Solid collects job information from the INFORMATION_SCHEMA.JOBS_BY_PROJECT view to understand query patterns and data access:

SELECT 
    job_id,
    creation_time,
    start_time,
    end_time,
    user_email,
    statement_type,
    query,
    total_bytes_processed,
    total_slot_ms,
    state,
    error_result
FROM 
    `project.region.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE 
    creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 365 DAY)
    AND state = 'DONE'
    AND statement_type IN ('SELECT', 'INSERT', 'UPDATE', 'DELETE', 'MERGE')
ORDER BY 
    creation_time DESC;

Data Profiling (Optional)

Note: This only applies if you granted the bigquery.tables.getData permission in Step 1.

Solid samples up to 100,000 rows per table to profile your data:

Numeric Columns

SELECT 
    'project:dataset.table.column' AS column_name,
    'NUMERIC' AS data_type,
    MIN(column) AS min_value,
    MAX(column) AS max_value,
    AVG(column) AS avg_value,
    COUNT(*) AS row_count,
    COUNT(DISTINCT column) AS distinct_count,
    COUNTIF(column IS NULL) AS null_count
FROM (
    SELECT column FROM `project.dataset.table` LIMIT 100000
);

Text Columns

SELECT 
    'project:dataset.table.column' AS column_name,
    'TEXT' AS data_type,
    COUNT(*) AS row_count,
    COUNT(DISTINCT column) AS distinct_count,
    COUNTIF(column IS NULL) AS null_count,
    STRING_AGG(DISTINCT column, '|||' ORDER BY column LIMIT 100) AS text_distinct_values
FROM (
    SELECT column FROM `project.dataset.table` LIMIT 100000
);

Date Columns

SELECT 
    'project:dataset.table.column' AS column_name,
    'DATE' AS data_type,
    COUNT(*) AS row_count,
    COUNT(DISTINCT column) AS distinct_count,
    COUNTIF(column IS NULL) AS null_count,
    MIN(column) AS date_min_value,
    MAX(column) AS date_max_value
FROM (
    SELECT column FROM `project.dataset.table` LIMIT 100000
);

Troubleshooting

Connection Issues

If you're having trouble connecting:

  1. Verify the JSON key file is valid and properly formatted
  2. Check service account permissions by going to IAM & Admin → IAM and confirming the service account has the required roles
  3. Ensure the custom role was created with all necessary permissions
  4. Verify project access if working across multiple projects
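One quick structural check on the key file (step 1 above) can be done with a few lines of Python from the shell. A valid service account key has type "service_account" and contains client_email and private_key fields. The heredoc below creates a stand-in file with redacted values purely for illustration; point the check at your real downloaded key instead:

```shell
# Stand-in key file with the fields a real key would contain (values redacted).
cat > sample-key.json <<'EOF'
{
  "type": "service_account",
  "project_id": "my-project",
  "private_key_id": "REDACTED",
  "private_key": "REDACTED",
  "client_email": "solid-data@my-project.iam.gserviceaccount.com",
  "client_id": "REDACTED",
  "token_uri": "https://oauth2.googleapis.com/token"
}
EOF

# Fail loudly if the file is not parseable JSON or is missing required fields.
python3 - <<'EOF'
import json

key = json.load(open("sample-key.json"))
required = {"type", "client_email", "private_key", "token_uri"}
missing = required - key.keys()
assert key.get("type") == "service_account", "not a service account key"
assert not missing, f"missing fields: {missing}"
print("key structure looks valid")
EOF
```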

Permission Issues

If Solid can't access certain datasets or tables:

  1. Check the service account roles in each project that contains data you want to monitor
  2. Verify dataset permissions by ensuring the service account has access to specific datasets
  3. Confirm the custom role includes all required permissions listed in Step 1

Data Profiling Not Working

If data profiling is failing:

  1. Verify you granted the bigquery.tables.getData permission in the custom role
  2. Check table-level permissions to ensure the service account can read data
  3. Review BigQuery quotas to ensure you're not hitting API limits

Security Best Practices

  • Store the JSON key securely and never commit it to version control
  • Rotate service account keys regularly (recommended every 365 days)
  • Use the principle of least privilege by only granting bigquery.tables.getData if data profiling is absolutely necessary
  • Monitor service account activity through GCP's audit logs
  • Restrict service account access to only the projects and datasets that need to be monitored