Data Ingestion

Batch Data Connectors

Configure batch ingestion jobs to load data from databases, file systems, and cloud storage.

8 min read · Updated May 2025

On this page

Creating a Batch Ingestion Job
Connection String Formats
Incremental Load Configuration

Creating a Batch Ingestion Job

1. Navigate to Pipelines → Ingestion → New Ingestion Job.
2. Select the source type (e.g., PostgreSQL) and click Configure Connection.
3. Enter connection details: host, port, database, username, and password. Click Test Connection.
4. Select the tables or queries to ingest. Use the schema browser to preview data.
5. Configure the load strategy: Full Load, Incremental (by timestamp column), or Partition-based.
6. Set the destination: select a catalog, schema, and target table name in the Raw Zone.
7. Configure the schedule (cron expression or preset frequency) and click Save & Activate.

Connection String Formats

# PostgreSQL
source:
  type: postgresql
  host: db.example.com
  port: 5432
  database: production
  username: natis_reader
  password: "{{ secret:db_password }}"  # stored in NATIS Secrets Manager
  ssl: require

# MySQL
source:
  type: mysql
  host: db.example.com
  port: 3306
  database: orders
  username: natis_reader
  password: "{{ secret:mysql_password }}"

# S3 (File Source)
source:
  type: s3
  bucket: my-data-bucket
  prefix: /raw/sales/
  file_format: parquet
  region: ap-southeast-1
  credentials:
    type: iam_role
    role_arn: arn:aws:iam::123456789012:role/natis-s3-reader  # AWS account IDs are 12 digits
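
To check connectivity outside NATIS (for example with psql or a database driver), the PostgreSQL fields above can be assembled into a standard libpq-style URI. The sketch below is illustrative only: the helper name `postgres_uri` is hypothetical, and mapping the YAML `ssl` value onto libpq's `sslmode` parameter is an assumption, not documented NATIS behavior.

```python
from urllib.parse import quote, urlencode

def postgres_uri(host, port, database, username, password, ssl=None):
    """Build a libpq-style URI from the fields used in the PostgreSQL source block."""
    # Percent-encode credentials so characters such as '@' or '/' stay unambiguous.
    auth = f"{quote(username, safe='')}:{quote(password, safe='')}"
    uri = f"postgresql://{auth}@{host}:{port}/{quote(database, safe='')}"
    if ssl:
        # Assumption: the YAML `ssl` value corresponds to libpq's `sslmode`.
        uri += "?" + urlencode({"sslmode": ssl})
    return uri

# The password here is a made-up value; in NATIS it would come from the secret reference.
uri = postgres_uri("db.example.com", 5432, "production",
                   "natis_reader", "p@ss/word", ssl="require")
print(uri)
```

Percent-encoding the credentials matters because an unescaped `@` or `/` in a password would otherwise be read as part of the URI structure.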

Incremental Load Configuration

For incremental loads, NATIS tracks the high-water mark of your chosen incremental key (usually a timestamp or auto-increment ID), that is, the highest key value loaded so far. On each run, only records whose key value is greater than the last high-water mark are loaded.
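
The high-water mark logic can be sketched in a few lines. This is a minimal illustration of the general technique, not NATIS's implementation; the function name `incremental_batch` and the sample rows are hypothetical.

```python
from datetime import datetime

def incremental_batch(rows, key, high_water_mark):
    """Return rows newer than the mark, plus the mark to store for the next run."""
    new_rows = [r for r in rows if r[key] > high_water_mark]
    # If nothing new arrived, the stored mark stays unchanged.
    next_mark = max((r[key] for r in new_rows), default=high_water_mark)
    return new_rows, next_mark

# Hypothetical source table with an updated_at incremental key.
source = [
    {"id": 1, "updated_at": datetime(2025, 5, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2025, 5, 1, 10, 0)},
    {"id": 3, "updated_at": datetime(2025, 5, 1, 11, 0)},
]

batch, hwm = incremental_batch(source, "updated_at", datetime(2025, 5, 1, 9, 30))
print([r["id"] for r in batch], hwm)

# A second run with the new mark finds nothing further to load.
batch2, hwm2 = incremental_batch(source, "updated_at", hwm)
```

Because the comparison is strictly greater-than, a record that shares the exact key value of the stored mark is not picked up again on the next run.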

Upsert mode requires the source table to have a primary key or unique identifier. Without one, NATIS falls back to Append Only mode, which can produce duplicate records.
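
The difference between the two load types can be seen with a toy in-memory target. This is a sketch of the general append and upsert semantics under simplified assumptions (lists of dicts instead of tables); the function names are hypothetical and do not reflect NATIS internals.

```python
def append(target, batch):
    """Append: every incoming row is added, even if its key already exists."""
    return target + batch

def upsert(target, batch, primary_key):
    """Upsert: an incoming row replaces the existing row with the same key."""
    merged = {row[primary_key]: row for row in target}
    for row in batch:
        merged[row[primary_key]] = row
    return list(merged.values())

target = [{"id": 1, "status": "pending"}]
batch = [{"id": 1, "status": "shipped"}, {"id": 2, "status": "pending"}]

appended = append(target, batch)        # id 1 now appears twice
upserted = upsert(target, batch, "id")  # id 1 appears once, with the new status
print(len(appended), len(upserted))
```

Without a key to match on, there is no way to tell that the incoming row for id 1 is an update rather than a new record, which is why a missing primary key leads to duplicates.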

Parameter        Description
Incremental Key  Column used to detect new or updated records (e.g., updated_at)
Load Type        Append Only or Upsert (Upsert requires a primary key)
Lookback Window  Additional time range to catch late-arriving records (default: 1 hour)
Parallelism      Number of parallel threads for extraction (default: auto)
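
One plausible reading of the Lookback Window is that it moves the extraction lower bound earlier than the stored high-water mark. The sketch below assumes exactly that (the window is subtracted from the mark); this is an assumption for illustration, and the helper name `extraction_lower_bound` is hypothetical.

```python
from datetime import datetime, timedelta

def extraction_lower_bound(high_water_mark, lookback=timedelta(hours=1)):
    """Assumed behavior: re-scan from (mark - lookback) instead of from the mark."""
    return high_water_mark - lookback

stored_mark = datetime(2025, 5, 1, 12, 0)
lower = extraction_lower_bound(stored_mark)

# A record stamped 11:40 that reached the source only after the 12:00 run.
late_row = {"id": 7, "updated_at": datetime(2025, 5, 1, 11, 40)}

missed_without_lookback = not (late_row["updated_at"] > stored_mark)
caught_with_lookback = late_row["updated_at"] > lower
print(lower, missed_without_lookback, caught_with_lookback)
```

Under this assumption, rows inside the window are read again on every run, so a lookback window pairs naturally with Upsert; with Append Only, the re-read rows could be written twice.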
