Real-Time Streaming Ingestion
Ingest real-time event streams from Kafka, Kinesis, and Event Hubs into your lakehouse.
NATIS Structured Streaming processes unbounded real-time data streams using Apache Spark Structured Streaming under the hood. Streams are consumed from message brokers and written to Delta Lake tables with exactly-once semantics.
Kafka Streaming Setup
```python
from natis.streaming import StreamReader

# Read from a Kafka topic
stream = (
    StreamReader
    .from_kafka(
        bootstrap_servers="kafka.example.com:9092",
        topic="user-events",
        consumer_group="natis-consumer-01",
        start_offset="latest",  # or "earliest" or a specific offset
        schema_registry="https://schema-registry.example.com",
    )
    .with_json_schema({
        "event_id": "string",
        "user_id": "string",
        "event_type": "string",
        "timestamp": "timestamp",
        "properties": "map<string,string>",
    })
)

# Write to Delta Lake
(
    stream
    .write_delta(
        table="catalog.raw.user_events",
        mode="append",
        trigger_interval="30 seconds",
        checkpoint_location="/checkpoints/user-events",
    )
    .start()
)
```
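The `checkpoint_location` is what makes restarts safe: committed offsets are persisted there, so a restarted stream resumes from the last committed position instead of reprocessing or skipping records. A minimal sketch of that idea in plain Python (the file layout, function names, and single-integer offset here are illustrative, not the actual checkpoint format):

```python
import json
import os
import tempfile

def load_offset(checkpoint_dir: str) -> int:
    """Return the last committed offset, or 0 if no checkpoint exists."""
    path = os.path.join(checkpoint_dir, "offset.json")
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_offset"]

def commit_offset(checkpoint_dir: str, next_offset: int) -> None:
    """Record the next offset to read, after a successful write."""
    path = os.path.join(checkpoint_dir, "offset.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_offset": next_offset}, f)
    os.replace(tmp, path)  # atomic rename: the commit is all-or-nothing

def run_batch(records, checkpoint_dir, sink):
    """Process records from the last checkpoint, then commit progress."""
    start = load_offset(checkpoint_dir)
    batch = records[start:]
    sink.extend(batch)  # write the micro-batch to the sink
    commit_offset(checkpoint_dir, start + len(batch))

checkpoint = tempfile.mkdtemp()
sink = []
events = ["e1", "e2", "e3"]
run_batch(events, checkpoint, sink)  # first run processes e1..e3
events += ["e4"]
run_batch(events, checkpoint, sink)  # a restart resumes at e4 only
print(sink)  # ['e1', 'e2', 'e3', 'e4'] — no duplicates after restart
```

Because the offset commit and the Delta write are coordinated, a failure between them causes reprocessing of at most one micro-batch, which Delta Lake's transactional writes then deduplicate, yielding exactly-once results.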
Stream Processing Guarantees
| Guarantee | Description | Enabled by Default |
| --- | --- | --- |
| Exactly-Once | Each record is processed exactly once, even on restarts | Yes (Delta Lake) |
| Checkpointing | Progress saved to the checkpoint directory for fault tolerance | Yes |
| Schema Evolution | New columns in the source schema handled automatically | Configurable |
| Late Data Handling | Watermarking to handle out-of-order events | Configurable |
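Watermarking bounds how long the stream waits for out-of-order events: the watermark is the maximum event time seen so far minus an allowed delay, and events older than the watermark are treated as too late. A minimal sketch of that admission rule in plain Python (the class name and 10-minute delay are illustrative, not a NATIS API):

```python
from datetime import datetime, timedelta

class Watermark:
    """Tracks the max event time seen, minus an allowed lateness delay."""

    def __init__(self, delay: timedelta):
        self.delay = delay
        self.max_event_time = datetime.min

    def admit(self, event_time: datetime) -> bool:
        """Advance the watermark and decide whether the event is on time."""
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.delay
        return event_time >= watermark

wm = Watermark(delay=timedelta(minutes=10))
t0 = datetime(2024, 1, 1, 12, 0)
print(wm.admit(t0))                          # True: first event sets the clock
print(wm.admit(t0 - timedelta(minutes=5)))   # True: 5 min late, within the delay
print(wm.admit(t0 - timedelta(minutes=15)))  # False: beyond the 10-min watermark
```

A longer delay tolerates more disorder but holds aggregation state longer; a shorter delay frees state sooner but drops more late events.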
For low-latency use cases, use a short trigger_interval such as '5 seconds'. Note that 'availableNow' is not a low-latency setting: it processes all data available when the stream starts and then stops, which suits scheduled, batch-style runs. Shorter intervals increase cost, since each micro-batch incurs a commit.
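The cost tradeoff is easy to quantify, since commit overhead scales with the number of micro-batches per unit time (illustrative arithmetic, not measured NATIS figures):

```python
def commits_per_hour(trigger_interval_seconds: float) -> float:
    """Micro-batch commits in one hour at a given trigger interval."""
    return 3600 / trigger_interval_seconds

print(commits_per_hour(30))  # 120.0 commits/hour at the 30-second trigger above
print(commits_per_hour(5))   # 720.0 commits/hour at 5 seconds: 6x the commits
```

Dropping the interval from 30 seconds to 5 seconds cuts worst-case latency by the same factor but multiplies per-commit costs (Delta transaction log writes, small files to compact) sixfold.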