Monitoring & Alerting
Monitor pipeline health, cluster utilization, query performance, and cost in real time.
The NATIS Observability Console provides real-time metrics for all platform components. Set up alerts to notify your team when pipelines fail, cluster costs exceed budget, or query SLAs are breached.
Key Monitoring Dashboards
- Pipeline Health — success rate, average runtime, failure trends, SLA breach heatmap
- Cluster Utilization — CPU, memory, network I/O per cluster and per user
- SQL Performance — query latency P50/P90/P99, queue depth, data scanned per warehouse
- Cost Dashboard — DBU consumption by team, pipeline, user; month-to-date vs budget
- Data Quality — row counts, null rates, schema drift alerts across all pipeline outputs
- Security Events — login failures, permission denied events, policy violations
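The P50/P90/P99 figures on the SQL Performance dashboard are latency percentiles. The console's exact calculation is not documented here, so the following is only a minimal sketch using the nearest-rank method. The `percentile` helper and the sample latencies are hypothetical.

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it.
    Hypothetical helper; the console may interpolate differently."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not (isinstance(p, int) and 0 < p <= 100):
        raise ValueError("p must be an integer in 1..100")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Made-up query latencies in milliseconds, for illustration only.
latencies_ms = [120, 95, 310, 101, 98, 2050, 130, 115, 99, 104]
summary = {f"P{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}
print(summary)
```

A single slow query (2050 ms) barely moves P50 but dominates P99, which is why an SLA alert on a high percentile catches tail latency that an average would hide.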
Creating Alerts
1. Navigate to Admin Console → Monitoring → Alerts → New Alert.
2. Select the alert type: Pipeline Failure, Cluster Cost, Query SLA, Data Quality, or Custom Metric.
3. Define the condition: metric, comparison operator, threshold, and evaluation window.
4. Configure notification channels: email, Slack, Microsoft Teams, PagerDuty, or webhook.
5. Set alert frequency: notify on every occurrence, or suppress for N minutes after first trigger.
6. Click Save Alert.
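To make the interaction between the condition (step 3) and the suppression window (step 5) concrete, here is a minimal sketch of how such a rule could be evaluated. It is not the console's implementation. The `AlertRule` class, its field names, and every operator name other than `greater_than` are assumptions made for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Optional

# Only "greater_than" appears in the documented example; the rest are assumed.
OPERATORS = {
    "greater_than": operator.gt,
    "less_than": operator.lt,
    "equals": operator.eq,
}


@dataclass
class AlertRule:
    """Hypothetical in-memory model of one alert rule."""
    metric: str
    op: str
    threshold: float
    cooldown_minutes: int = 0                  # 0 = notify on every occurrence
    last_notified_at: Optional[float] = None   # time of last notification, in minutes

    def should_notify(self, value: float, now_minutes: float) -> bool:
        # 1. The condition must be breached.
        if not OPERATORS[self.op](value, self.threshold):
            return False
        # 2. Suppress repeats that fall inside the cooldown window.
        if (self.last_notified_at is not None
                and now_minutes - self.last_notified_at < self.cooldown_minutes):
            return False
        self.last_notified_at = now_minutes
        return True


rule = AlertRule("duration_minutes", "greater_than", threshold=90, cooldown_minutes=60)
results = [
    rule.should_notify(95, now_minutes=0),    # breach, first trigger
    rule.should_notify(120, now_minutes=30),  # breach, but inside cooldown
    rule.should_notify(80, now_minutes=45),   # no breach
    rule.should_notify(100, now_minutes=60),  # breach, cooldown elapsed
]
print(results)
```

In this sketch the cooldown is measured from the last notification that was actually sent, so a continuously breaching metric produces at most one notification per window.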
Alert Configuration Example
YAML
# Alert: Pipeline SLA breach
name: daily_sales_sla_breach
type: pipeline_duration
pipeline: daily_sales_pipeline
condition:
  metric: duration_minutes
  operator: greater_than
  threshold: 90          # Alert if pipeline takes more than 90 minutes
  window: last_run
notifications:
  - channel: slack
    target: "#data-ops"
    message: "⚠️ daily_sales_pipeline exceeded 90-minute SLA: {{ duration }} min"
  - channel: email
    to: [data-lead@company.com]
    subject: "[NATIS] Pipeline SLA Breach — daily_sales_pipeline"
cooldown_minutes: 60     # Don't re-alert for 60 min after first trigger
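The Slack message in the example contains a `{{ duration }}` placeholder, which is presumably filled in with the measured runtime when the alert fires. The template engine the platform uses is not specified on this page, so the sketch below only illustrates the general idea with a simple regular-expression substitution. The `render_message` function and the `context` dictionary are hypothetical.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_message(template: str, context: dict) -> str:
    """Replace each {{ name }} with context[name].
    Unknown placeholders are left untouched so a typo stays visible
    in the delivered message instead of silently disappearing."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        return str(context[key]) if key in context else match.group(0)
    return PLACEHOLDER.sub(substitute, template)


template = "daily_sales_pipeline exceeded 90-minute SLA: {{ duration }} min"
message = render_message(template, {"duration": 104})
print(message)
```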