Monitoring & Alerting

Monitor pipeline health, cluster utilization, query performance, and cost in real time.

6 min read · Updated April 2025

The NATIS Observability Console provides real-time metrics for all platform components. Set up alerts to notify your team when pipelines fail, cluster costs exceed budget, or query SLAs are breached.

Key Monitoring Dashboards

Pipeline Health — success rate, average runtime, failure trends, SLA breach heatmap
Cluster Utilization — CPU, memory, network I/O per cluster and per user
SQL Performance — query latency P50/P90/P99, queue depth, data scanned per warehouse
Cost Dashboard — DBU consumption by team, pipeline, user; month-to-date vs budget
Data Quality — row counts, null rates, schema drift alerts across all pipeline outputs
Security Events — login failures, permission denied events, policy violations
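The P50/P90/P99 figures on the SQL Performance dashboard are latency percentiles. This page does not say how the console computes them, so the following is only a minimal sketch of one common definition (the nearest-rank method). The `percentile` helper and the sample latencies are illustrative assumptions, not part of NATIS.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


# Hypothetical query latencies in milliseconds (made-up data for illustration)
latencies_ms = [120, 95, 310, 150, 88, 2400, 130, 175, 101, 140]

summary = {f"P{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}
print(summary)  # {'P50': 130, 'P90': 310, 'P99': 2400}
```

In this sample a single slow query (2400 ms) dominates P99 while leaving P50 untouched, which is why tail percentiles are usually the better basis for a Query SLA alert than an average.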

Creating Alerts

1. Navigate to Admin Console → Monitoring → Alerts → New Alert.
2. Select the alert type: Pipeline Failure, Cluster Cost, Query SLA, Data Quality, or Custom Metric.
3. Define the condition: metric, comparison operator, threshold, and evaluation window.
4. Configure notification channels: email, Slack, Microsoft Teams, PagerDuty, or webhook.
5. Set the alert frequency: notify on every occurrence, or suppress repeat notifications for N minutes after the first trigger.
6. Click Save Alert.
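The condition defined in step 3 combines a metric, a comparison operator, and a threshold. As a rough mental model only, the check can be sketched as below. The `Condition` class and the operator names other than `greater_than` are hypothetical, the evaluation window is reduced to a single sampled value, and none of this reflects how the platform evaluates alerts internally.

```python
import operator
from dataclasses import dataclass

# Hypothetical operator names; only "greater_than" appears in the example on this page.
OPERATORS = {
    "greater_than": operator.gt,
    "less_than": operator.lt,
    "equals": operator.eq,
}


@dataclass
class Condition:
    metric: str
    operator: str
    threshold: float

    def is_breached(self, sample: dict) -> bool:
        """Return True if the sampled metric value violates the threshold."""
        compare = OPERATORS[self.operator]
        return compare(sample[self.metric], self.threshold)


condition = Condition(metric="duration_minutes", operator="greater_than", threshold=90)

print(condition.is_breached({"duration_minutes": 104}))  # True
print(condition.is_breached({"duration_minutes": 72}))   # False
```

Note that `greater_than` is strict in this sketch: a run of exactly 90 minutes would not trigger the alert.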

Alert Configuration Example

YAML

# Alert: Pipeline SLA breach
name: daily_sales_sla_breach
type: pipeline_duration
pipeline: daily_sales_pipeline
condition:
  metric: duration_minutes
  operator: greater_than
  threshold: 90  # Alert if pipeline takes more than 90 minutes
  window: last_run

notifications:
  - channel: slack
    target: "#data-ops"
    message: "⚠️ daily_sales_pipeline exceeded 90-minute SLA: {{ duration }} min"
  - channel: email
    to: [data-lead@company.com]
    subject: "[NATIS] Pipeline SLA Breach — daily_sales_pipeline"

cooldown_minutes: 60  # Don't re-alert for 60 min after first trigger
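The effect of `cooldown_minutes` can be illustrated with a small sketch. It assumes the cooldown is measured from the last notification that was actually sent and that suppressed triggers do not restart the timer; the `CooldownGate` class is hypothetical, and the platform's real behavior may differ.

```python
from datetime import datetime, timedelta


class CooldownGate:
    """Suppress repeat notifications for a fixed period after one is sent."""

    def __init__(self, cooldown_minutes: int):
        self.cooldown = timedelta(minutes=cooldown_minutes)
        self.last_sent = None  # time of the most recent notification, if any

    def should_notify(self, now: datetime) -> bool:
        if self.last_sent is not None and now - self.last_sent < self.cooldown:
            return False  # still inside the cooldown period, so stay quiet
        self.last_sent = now
        return True


gate = CooldownGate(cooldown_minutes=60)
start = datetime(2025, 4, 1, 9, 0)

# The alert condition fires at 0, 30, 59, 60, and 90 minutes after the start time.
decisions = [gate.should_notify(start + timedelta(minutes=m)) for m in (0, 30, 59, 60, 90)]
print(decisions)  # [True, False, False, True, False]
```

With a 60-minute cooldown, the triggers at 30 and 59 minutes are suppressed, the one at 60 minutes notifies again, and the one at 90 minutes falls inside the new cooldown period.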
