SQL Engine Reference

Complete reference for the NATIS SQL Engine — syntax, functions, performance tips, and supported SQL dialects.

9 min read · Updated May 2025

The NATIS SQL Engine is ANSI SQL-compliant with extensions for Delta Lake operations, time-travel queries, and ML inference. It supports 700+ built-in functions and is compatible with HiveQL, SparkSQL, and standard ANSI SQL syntax.

Running SQL Queries

SQL
-- Basic aggregation
SELECT 
  region,
  product_category,
  DATE_TRUNC('month', sale_date) AS month,
  SUM(revenue) AS total_revenue,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM catalog.gold.sales_transactions
WHERE sale_date >= '2024-01-01'
GROUP BY 1, 2, 3
ORDER BY month DESC, total_revenue DESC;

-- Time travel query (query data as of a specific date)
SELECT * FROM catalog.silver.customers
TIMESTAMP AS OF '2024-06-01 00:00:00'
WHERE country = 'VN';

-- Window function example
SELECT
  customer_id,
  order_date,
  order_value,
  SUM(order_value) OVER (
    PARTITION BY customer_id 
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS cumulative_spend
FROM catalog.silver.orders;

Delta Lake SQL Extensions

Command | Description | Example
--- | --- | ---
MERGE INTO | Upsert: insert or update based on condition | MERGE INTO target USING source ON key WHEN MATCHED THEN UPDATE SET ...
DESCRIBE HISTORY | View table version history | DESCRIBE HISTORY catalog.silver.orders
RESTORE TABLE | Restore table to a previous version | RESTORE TABLE orders TO VERSION AS OF 5
OPTIMIZE | Compact small files for better read performance | OPTIMIZE catalog.silver.orders WHERE date = '2024-06-01'
VACUUM | Remove old files beyond retention period | VACUUM catalog.silver.orders RETAIN 168 HOURS
ZORDER BY | Co-locate data for faster filtered reads | OPTIMIZE orders ZORDER BY (customer_id, order_date)
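The MERGE INTO row in the table above elides the full syntax. Written out, a typical upsert looks like the sketch below; the target and source table names are illustrative, not part of this reference:

```sql
-- Upsert staged customer updates into a Delta table
-- (catalog.silver.customers and catalog.bronze.customer_updates
--  are hypothetical names used for illustration)
MERGE INTO catalog.silver.customers AS target
USING catalog.bronze.customer_updates AS source
  ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET
    target.email = source.email,
    target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, updated_at)
  VALUES (source.customer_id, source.email, source.updated_at);
```

Rows whose `customer_id` already exists in the target are updated in place; all other source rows are inserted, so the statement is idempotent when re-run on the same source snapshot.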

Performance Best Practices

The Query Profiler (accessible via the ⓘ icon in the SQL Workspace) shows the execution plan, stage durations, data scanned, and cache hit rates for any query.
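Before opening the profiler, you can also inspect a query's plan directly in SQL. Assuming the engine's SparkSQL compatibility extends to the standard EXPLAIN statement (an assumption, not confirmed by this page), a quick check looks like:

```sql
-- Inspect the logical/physical plan for a query before running it
-- (table name is illustrative)
EXPLAIN
SELECT region, SUM(revenue) AS total_revenue
FROM catalog.gold.sales_transactions
WHERE sale_date >= '2024-01-01'
GROUP BY region;
```

Look for full-table scans on large tables in the output; those are the queries that benefit most from the partitioning and ZORDER tips below.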

  • Use ZORDER BY on columns frequently used in WHERE clauses for 10-100x faster reads
  • Run OPTIMIZE weekly on high-write tables to consolidate small files
  • Partition large tables by a low-cardinality column (e.g., date, region, country)
  • Use approximate functions like APPROX_COUNT_DISTINCT for large-scale analytics
  • Cache frequently accessed DataFrames in Spark notebooks with .cache() or .persist()
  • Use BROADCAST hints for small lookup tables in JOIN operations
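The last two tips can be combined in a single query. A sketch with hypothetical table names, assuming SparkSQL-style hint syntax:

```sql
-- Approximate distinct counts plus a broadcast hint for a small lookup table
-- (catalog.silver.orders and catalog.gold.region_lookup are illustrative)
SELECT /*+ BROADCAST(r) */
  r.region_name,
  APPROX_COUNT_DISTINCT(o.customer_id) AS approx_customers
FROM catalog.silver.orders AS o
JOIN catalog.gold.region_lookup AS r
  ON o.region_id = r.region_id
GROUP BY r.region_name;
```

Broadcasting the lookup table avoids a shuffle of the large orders table, and the approximate count trades a small, bounded error for a much cheaper aggregation at scale.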
