
Auto Loader

What is Auto Loader?

A Databricks feature for continuous, incremental ingestion from cloud object storage (S3, Azure Blob Storage/ADLS, GCS). It handles incremental file discovery, schema inference and evolution, and scales to high file volumes with minimal operational overhead.

Key options you’ll actually use:

Format

  • cloudFiles.format → file type (csv, json, parquet, avro, orc, text).

File detection

  • cloudFiles.maxFilesPerTrigger → cap the number of files processed per micro-batch.
  • cloudFiles.includeExistingFiles → process the historical backlog on the first run.
  • cloudFiles.allowOverwrites → reprocess files that are overwritten in the source.

Schema

  • cloudFiles.schemaLocation → where Auto Loader stores inferred schemas and metadata (required for schema inference/evolution).
  • cloudFiles.inferColumnTypes → infer numbers/booleans/dates (instead of treating everything as strings).
  • cloudFiles.schemaEvolutionMode → how to react to new columns (addNewColumns, rescue, failOnNewColumns, or none).
  • cloudFiles.schemaHints → declare expected types for specific columns (strong guardrails).

Filtering

  • cloudFiles.pathGlobFilter → include files by glob (e.g., *.json).
  • cloudFiles.excludePattern → exclude files by pattern (e.g., *_backup.json).

Performance

  • cloudFiles.useNotifications → use storage event notifications instead of directory listing (faster/cheaper at scale).
  • cloudFiles.queueName, plus cloudFiles.subscriptionId / cloudFiles.resourceGroup on Azure, or SQS/SNS on AWS → where change events arrive.
  • cloudFiles.backfillInterval → periodic backfill scans to catch files missed by notifications.

Errors & control

  • cloudFiles.ignoreCorruptFiles, cloudFiles.ignoreMissingFiles → skip bad/missing inputs.
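Assuming both filtering options follow glob semantics, a quick local simulation with Python's fnmatch (file names are hypothetical) shows how an include filter of *.json combined with an exclude pattern of *_backup.json would partition a listing:

```python
from fnmatch import fnmatch

# Hypothetical listing of files in the source path
files = [
    "events_2024-01-01.json",
    "events_2024-01-01_backup.json",
    "events_2024-01-02.json",
    "readme.txt",
]

# Simulate cloudFiles.pathGlobFilter (include) and
# cloudFiles.excludePattern (exclude) with glob matching
include_glob = "*.json"
exclude_glob = "*_backup.json"

ingested = [
    f for f in files
    if fnmatch(f, include_glob) and not fnmatch(f, exclude_glob)
]
print(ingested)  # ['events_2024-01-01.json', 'events_2024-01-02.json']
```

This is only a local sketch of the matching logic, not how Auto Loader evaluates filters internally.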
Minimal example (AWS S3 → Delta):

    df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "dbfs:/schemas/ingestion/")
        .option("cloudFiles.inferColumnTypes", "true")
        .option("cloudFiles.includeExistingFiles", "true")
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
        # Optional performance (file notification mode):
        # .option("cloudFiles.useNotifications", "true")
        # .option("cloudFiles.queueName", "sqs-auto-loader-queue")  # AWS example
        .load("s3://my-bucket/data/"))

    (df.writeStream
       .format("delta")
       .option("checkpointLocation", "dbfs:/checkpoints/data/")
       .start("dbfs:/delta/data/"))
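When several pipelines share the same schema guardrails, the options can be bundled into a reusable dict and passed to the reader via .options(**autoloader_opts). A sketch (option names as listed above; the paths, column names, and types are illustrative, not from the source):

```python
# Reusable Auto Loader option set (values are illustrative).
# Apply with: spark.readStream.format("cloudFiles").options(**autoloader_opts)
autoloader_opts = {
    "cloudFiles.format": "json",
    "cloudFiles.schemaLocation": "dbfs:/schemas/ingestion/",
    "cloudFiles.inferColumnTypes": "true",
    # Guard key columns against bad inference with schema hints:
    "cloudFiles.schemaHints": "event_ts TIMESTAMP, amount DECIMAL(18,2)",
    "cloudFiles.schemaEvolutionMode": "addNewColumns",
}
```

Keeping the option set in one place makes it easier to apply the same guardrails consistently across streams.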
    
    
          

Auto Loader = fewer manual jobs, safer schema changes, and ingestion that keeps up with your lake.