Monitor Databricks with Grafana Cloud for instant visibility into your workloads

- The integration ships a zero-configuration exporter and three prebuilt dashboards out of the box: cost overview, jobs and pipelines, and SQL warehouse performance.
- FinOps teams can track DBU consumption and cost trends, SREs can monitor job success rates and latency, and BI teams can watch query error rates and concurrency bottlenecks.
- Key metrics such as `databricks_billing_cost_estimate_usd_sliding` and `databricks_pipeline_freshness_lag_seconds_sliding` support fast anomaly triage.
If you're running Databricks workloads, you've probably asked yourself these types of questions: _How much is this costing me?_ _Why did that job fail last night?_ _Why are my dashboard queries suddenly slow?_
We've been there, too. Databricks is fantastic for data engineering, ML, and analytics. But once you start running jobs, pipelines, and SQL queries at scale, you need a way to keep tabs on what's happening. That's why we built the **Databricks integration for Grafana Cloud**.
With this integration, you can pull metrics from your Databricks workspaces directly into Grafana Cloud—no custom exporters to manage, no dashboards to build from scratch. You get visibility into billing, job reliability, and SQL warehouse performance all in one place.
**Who should use the Databricks integration for Grafana Cloud**
Different teams care about different things when it comes to Databricks:
- **FinOps teams** want to know where the money is going: DBU consumption, cost trends, surprise spikes—the usual suspects.
- **Platform and SRE teams** need to know if jobs and pipelines are healthy. Are they succeeding? How long are they taking? Are we meeting SLAs?
- **Analytics and BI teams** care about SQL warehouse performance. If query latency spikes or error rates climb, their dashboards break, and they hear about it.
We designed this integration with all three groups in mind.
**What you get: dashboards**
This integration comes with three prebuilt dashboards you'll see in your Grafana instance once you've installed it.
**Databricks overview**
This is your executive summary: costs, DBU consumption, and high-level reliability metrics. It's a quick snapshot for spotting anomalies and tracking overall platform health.
At the top, you'll see stat panels with the numbers that matter: total cost over the past 24 hours, day-over-day cost change, total DBUs consumed, and aggregate success rates for jobs and pipelines. Below that, time series panels show trends over time, and tables break down costs by SKU and workspace.

**Key metrics:**
- `databricks_billing_cost_estimate_usd_sliding`
- `databricks_billing_dbus_sliding`
- `databricks_job_run_status_sliding`
- `databricks_pipeline_run_status_sliding`
**Databricks jobs and pipelines**
This is for platform and SRE teams, providing visibility into the performance of your jobs and pipelines so you can quickly identify issues and ensure data workloads run reliably.
You'll see job and pipeline throughput, success rates, and duration trends. There are drill-down panels so you can filter by workspace, job name, or pipeline name when you're investigating a specific workload. The collapsed rows at the bottom give you detailed views for individual jobs and pipelines.

**Key metrics:**
- `databricks_job_runs_sliding`
- `databricks_job_run_duration_seconds_sliding` (p50, p95, p99)
- `databricks_pipeline_runs_sliding`
- `databricks_pipeline_freshness_lag_seconds_sliding`
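As a concrete reading of the freshness metric: a pipeline's lag is simply the time elapsed since its last successful update. Here's a minimal Python sketch of that idea (the inputs are illustrative; in the integration, the real value is derived from `system.lakeflow.pipeline_update_timeline`):

```python
from datetime import datetime, timedelta, timezone

def freshness_lag_seconds(last_successful_update, now=None):
    """Seconds since a pipeline last completed successfully -- the idea
    behind databricks_pipeline_freshness_lag_seconds_sliding."""
    now = now or datetime.now(timezone.utc)
    return (now - last_successful_update).total_seconds()

# Example: a pipeline that last succeeded 45 minutes ago.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last = now - timedelta(minutes=45)
print(freshness_lag_seconds(last, now))  # 2700.0
```

Alerting on this value catches pipelines that are technically "succeeding" but not running often enough to keep downstream data fresh.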
**Databricks warehouses and queries**
This is for analytics and BI teams, providing visibility into warehouse and query performance so you can quickly identify bottlenecks and keep SQL workloads running smoothly.
You get query throughput, latency percentiles, error rates, and concurrency metrics. Tables at the bottom show the top warehouses by query volume, errors, and latency—useful for spotting which warehouse is giving you trouble. You can filter by workspace or warehouse ID to narrow things down.

**Key metrics:**
- `databricks_queries_sliding`
- `databricks_query_duration_seconds_sliding` (p50, p95, p99)
- `databricks_query_errors_sliding`
- `databricks_queries_running_sliding`
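The duration percentiles listed above are plain percentile math over raw run durations. A quick local sketch of the same computation using Python's `statistics` module, with synthetic data (the integration computes these server-side):

```python
import statistics

def duration_percentiles(durations_seconds):
    """Summarize raw durations the way the dashboard panels do."""
    # statistics.quantiles with n=100 returns the 1st..99th percentile
    # cut points; index 49 is p50, 94 is p95, 98 is p99.
    q = statistics.quantiles(sorted(durations_seconds), n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

# Synthetic durations of 1..100 seconds, purely for illustration.
print(duration_percentiles(list(range(1, 101))))
# {'p50': 50.5, 'p95': 95.95, 'p99': 99.99}
```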
**What you get: alerts**
The integration comes with 14 alerting rules out of the box. They're organized by persona, so you can route them to the right teams.
**For FinOps**
- **DatabricksWarnSpendSpike:** Fires when day-over-day cost jumps more than 25%
- **DatabricksCriticalSpendSpike:** Fires when it jumps more than 50%
- **DatabricksWarnNoBillingData:** Fires if no billing data comes in for two hours
- **DatabricksCriticalNoBillingData:** Fires if it's been four hours
**For platform and SRE teams**
- **DatabricksWarnJobFailureRate:** Fires when job failure rate exceeds 10%
- **DatabricksCriticalJobFailureRate:** Fires at 20%
- **DatabricksWarnJobDurationRegression:** Fires when job duration is 30% above the seven-day median
- **DatabricksCriticalJobDurationRegression:** Fires at 60% above
Similar alerts exist for pipelines.
**For analytics and BI teams**
- **DatabricksWarnSqlQueryErrorRate:** Fires when SQL error rate exceeds 5%
- **DatabricksCriticalSqlQueryErrorRate:** Fires at 10%
- **DatabricksWarnSqlQueryLatencyRegression:** Fires when p95 latency is 30% above the seven-day median
- **DatabricksCriticalSqlQueryLatencyRegression:** Fires at 60% above
All the thresholds are configurable; these are just sensible defaults to get you started.
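For example, the spend-spike pair boils down to a day-over-day percentage check. Here's a hedged Python sketch of that logic (the actual rules are alert rules evaluated in Grafana Cloud; the function and argument names below are made up for illustration):

```python
def spend_spike_severity(today_usd, yesterday_usd, warn_pct=25.0, crit_pct=50.0):
    """Classify day-over-day spend change the way the spend-spike
    alerts do. Defaults mirror DatabricksWarnSpendSpike (25%) and
    DatabricksCriticalSpendSpike (50%); both are configurable."""
    if yesterday_usd <= 0:
        return None  # no baseline to compare against
    change_pct = (today_usd - yesterday_usd) / yesterday_usd * 100
    if change_pct > crit_pct:
        return "critical"
    if change_pct > warn_pct:
        return "warning"
    return None

print(spend_spike_severity(180.0, 100.0))  # critical (an 80% jump)
```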
**How the integration works under the hood**
The integration uses an open source exporter we built called `databricks-prometheus-exporter`. It connects to your Databricks workspace through a SQL Warehouse and queries Databricks System Tables—the same tables Databricks uses internally for billing, audit logs, and operational data.
We've embedded the exporter into Alloy, so you don't need to run it separately. Just configure Alloy with your Databricks credentials and it handles the rest.
Here's what gets collected:
| **Domain** | **System tables queried** | **What you get** |
| --- | --- | --- |
| **Billing** | `system.billing.usage`, `system.billing.list_prices` | DBU consumption, cost estimates by workspace and SKU |
| **Jobs** | `system.lakeflow.job_run_timeline`, `system.lakeflow.jobs` | Run counts, success/failure rates, duration percentiles |
| **Pipelines** | `system.lakeflow.pipeline_update_timeline`, `system.lakeflow.pipelines` | Pipeline status, duration, data freshness lag |
| **SQL queries** | `system.query.history` | Query throughput, latency percentiles, error rates |
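To make the billing row concrete: the cost estimate is essentially DBU usage joined with list prices. A toy Python sketch of that join (field names and prices here are illustrative, not the actual `system.billing` schemas):

```python
# Hypothetical rows shaped like the exporter's billing queries:
# DBU usage per SKU, joined with a list-price lookup.
usage = [
    {"sku": "JOBS_COMPUTE", "workspace": "prod", "dbus": 120.0},
    {"sku": "SQL_COMPUTE", "workspace": "prod", "dbus": 40.0},
]
list_prices = {"JOBS_COMPUTE": 0.25, "SQL_COMPUTE": 0.5}  # USD/DBU, made up

cost_by_sku = {
    row["sku"]: row["dbus"] * list_prices[row["sku"]] for row in usage
}
print(cost_by_sku)  # {'JOBS_COMPUTE': 30.0, 'SQL_COMPUTE': 20.0}
```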
**Getting started**
Here's how to set it up:
1. In your Grafana instance, go to **Connections** > **Add new connection** and search for **Databricks**.
2. Follow the setup wizard to configure Alloy. You'll need four things from your Databricks workspace:
- **Server hostname:** Your workspace URL (something like `dbc-abc123.cloud.databricks.com`)
- **Warehouse HTTP path:** The SQL warehouse that'll run the queries
- **Client ID:** The OAuth2 client ID for your service principal
- **Client secret:** The corresponding secret
3. Grant your service principal access to the system tables. The setup instructions include the exact SQL `GRANT` statements you need.
4. Click **Install dashboards and alerts**, and you're done.
The whole thing takes about 10 minutes if you already have a service principal set up.
**A few things to keep in mind**
**Billing data has lag**
Databricks billing data in system tables has an inherent lag of 24 to 48 hours. This is a Databricks limitation, not something we can work around. The cost numbers you see in the dashboards are great for trend analysis, but don't expect real-time billing.
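One practical consequence: when you compare days, compare complete days that fall outside the lag window. A small Python helper illustrating the idea (this is a convention for reading the data, not something the integration enforces):

```python
from datetime import date, timedelta

def comparison_days(today, lag_days=2):
    """Pick two complete days older than the 24-48h billing lag for a
    day-over-day comparison. lag_days=2 is an assumed safety margin."""
    latest = today - timedelta(days=lag_days)
    return latest, latest - timedelta(days=1)

# On Jan 10, compare Jan 8 against Jan 7 rather than the incomplete
# most recent days.
print(comparison_days(date(2024, 1, 10)))
```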
**Scrape interval and timeouts**
The integration uses a 10-minute scrape interval by default. The exporter queries can take 90 to 120 seconds to run (it's querying a lot of data), so the scrape timeout is set to nine minutes. If you're seeing gaps in your data, check that your SQL Warehouse isn't auto-suspending between scrapes.
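If you suspect gaps, one quick diagnostic is to look at the spacing of sample timestamps for any series and flag intervals longer than the scrape interval plus some slack for the long-running exporter queries. A rough Python sketch (the slack value is an assumption):

```python
def find_gaps(timestamps, interval_s=600, slack_s=120):
    """Return (start, end) pairs where consecutive samples are further
    apart than the 10-minute scrape interval plus slack."""
    ts = sorted(timestamps)
    return [(a, b) for a, b in zip(ts, ts[1:]) if b - a > interval_s + slack_s]

# Samples every 600s, with one missing scrape after t=1200.
samples = [0, 600, 1200, 3000, 3600]
print(find_gaps(samples))  # [(1200, 3000)]
```

A run of gaps that lines up with your warehouse's auto-suspend timer is a strong hint the warehouse is going to sleep between scrapes.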
**Pipeline table permissions**
The `system.lakeflow.pipeline_update_timeline` table sometimes needs explicit `SELECT` permissions beyond the standard System Tables grants. If you're not seeing pipeline metrics, double-check that your service principal has access to this table.
**Try it out**
We think this integration makes it a lot easier to keep an eye on your Databricks workspaces, whether you care about costs, job reliability, or SQL performance. The dashboards and alerts give you a solid starting point, and you can customize from there.
Give it a try and let us know what you think. We hang out in the Grafana Community Slack. Drop by the #integrations channel if you have questions or feedback.
And if you're monitoring other data platforms, you might also be interested in our Snowflake integration, which offers similar capabilities.
_The Grafana Cloud integrations team contributed to this blog post._
_Grafana Cloud_ _is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case._ _Sign up for free now!_