Data Ingestion

Real-Time Streaming Ingestion

Ingest real-time event streams from Kafka, Kinesis, and Event Hubs into your lakehouse.

7 min read · Updated April 2025

NATIS Structured Streaming processes unbounded real-time data streams, using Apache Spark Structured Streaming under the hood. Streams are consumed from message brokers such as Kafka, Kinesis, and Event Hubs, and written to Delta Lake tables with exactly-once semantics.
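The exactly-once guarantee comes from pairing checkpointed source offsets with batch-id-aware sink commits: if a micro-batch is replayed after a crash, the sink recognizes the batch id and skips the duplicate write. The following standalone Python sketch (hypothetical names, no NATIS APIs) illustrates the idea:

```python
# Minimal simulation of exactly-once delivery via checkpointed offsets
# plus an idempotent, batch-id-aware sink. All names are hypothetical.

class TransactionalSink:
    """Commits a micro-batch only if its batch_id has not been seen before."""
    def __init__(self):
        self.rows = []
        self.committed_batches = set()

    def commit(self, batch_id, records):
        if batch_id in self.committed_batches:
            return  # replayed batch after a restart: skip, no duplicates
        self.rows.extend(records)
        self.committed_batches.add(batch_id)

def run(source, sink, checkpoint, crash_after_batch=None):
    """Process micro-batches of 2 records, resuming from the checkpoint."""
    batch_id = checkpoint.get("next_batch", 0)
    while batch_id * 2 < len(source):
        records = source[batch_id * 2 : batch_id * 2 + 2]
        sink.commit(batch_id, records)
        if crash_after_batch == batch_id:
            return  # crash before the checkpoint is updated
        checkpoint["next_batch"] = batch_id + 1
        batch_id += 1

events = ["e0", "e1", "e2", "e3"]
sink, checkpoint = TransactionalSink(), {}
run(events, sink, checkpoint, crash_after_batch=0)  # batch 0 written, not checkpointed
run(events, sink, checkpoint)                        # restart replays batch 0
print(sink.rows)  # each event appears exactly once
```

Even though batch 0 is written and then replayed, the sink's idempotent commit keeps every event appearing exactly once in the output.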

Kafka Streaming Setup

PYTHON
from natis.streaming import StreamReader

# Read from Kafka topic
stream = (
    StreamReader
    .from_kafka(
        bootstrap_servers="kafka.example.com:9092",
        topic="user-events",
        consumer_group="natis-consumer-01",
        start_offset="latest",  # or "earliest" or specific offset
        schema_registry="https://schema-registry.example.com"
    )
    .with_json_schema({
        "event_id": "string",
        "user_id": "string",
        "event_type": "string",
        "timestamp": "timestamp",
        "properties": "map<string,string>"
    })
)

# Write to Delta Lake
(
    stream
    .write_delta(
        table="catalog.raw.user_events",
        mode="append",
        trigger_interval="30 seconds",
        checkpoint_location="/checkpoints/user-events"
    )
    .start()
)

Stream Processing Guarantees

Guarantee | Description | Enabled By Default
— | — | —
Exactly-Once | Each record is processed exactly once, even across restarts | Yes (Delta Lake)
Checkpointing | Progress is saved to the checkpoint directory for fault tolerance | Yes
Schema Evolution | New columns in the source schema are handled automatically | Configurable
Late Data Handling | Watermarking handles out-of-order events | Configurable
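Watermarking bounds how late an out-of-order event may arrive and still be processed: the watermark trails the maximum observed event time by a configured delay, and events older than the watermark are dropped. This plain-Python sketch (hypothetical, not the NATIS API) shows the mechanism:

```python
# Sketch of event-time watermarking for late-data handling. Events with
# an event time older than (max observed event time - delay) are dropped.

def apply_watermark(events, delay_seconds):
    """Yield (event_time, payload) pairs, dropping too-late arrivals."""
    max_event_time = float("-inf")
    for event_time, payload in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - delay_seconds
        if event_time >= watermark:
            yield event_time, payload
        # else: event arrived later than the watermark allows and is dropped

arrivals = [(100, "a"), (130, "b"), (95, "late"), (125, "c")]
kept = list(apply_watermark(arrivals, delay_seconds=30))
print(kept)  # (95, "late") is dropped: 95 < 130 - 30
```

A larger delay tolerates more disorder but forces the engine to retain more in-flight state, so the delay is a latency/completeness tradeoff.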

For low-latency use cases, use a shorter trigger_interval such as '5 seconds'. To process all currently available data in a single run and then stop, use the 'availableNow' trigger instead. Shorter intervals reduce latency but increase cost, because each micro-batch commit to Delta Lake carries fixed overhead.
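The cost effect is roughly linear in commit frequency, as a quick back-of-the-envelope calculation shows:

```python
# Micro-batch commits per hour at different trigger intervals: shorter
# intervals produce proportionally more Delta Lake commits (and more
# small files to compact later).

def commits_per_hour(trigger_interval_seconds):
    return 3600 // trigger_interval_seconds

for interval in (5, 30, 60):
    print(f"{interval:>2}s trigger -> {commits_per_hour(interval)} commits/hour")
# A 5-second trigger commits 6x as often as the default 30-second interval.
```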
