Auto Loader
What is Auto Loader?
A Databricks feature for incremental, continuous ingestion of files from cloud storage (S3, Azure Blob/ADLS, GCS). It handles incremental file discovery, schema inference and evolution, and scales to very large numbers of files with minimal operational overhead.
Key options you'll actually use:
Format: cloudFiles.format (json, csv, parquet, avro, ...)
File detection: cloudFiles.useNotifications, cloudFiles.includeExistingFiles
Schema: cloudFiles.schemaLocation, cloudFiles.inferColumnTypes, cloudFiles.schemaEvolutionMode
Filtering: pathGlobFilter
Performance: cloudFiles.maxFilesPerTrigger, cloudFiles.maxBytesPerTrigger
Errors & control: cloudFiles.rescuedDataColumn, badRecordsPath
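As a sketch of how the filtering, performance, and error options above slot into a reader (all paths and limit values here are illustrative, not from the original):

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    # Filtering: only pick up files matching this glob
    .option("pathGlobFilter", "*.csv")
    # Performance: cap how much each micro-batch ingests
    .option("cloudFiles.maxFilesPerTrigger", "1000")
    .option("cloudFiles.maxBytesPerTrigger", "10g")
    # Errors: values that don't fit the schema land in this
    # column instead of failing the stream
    .option("cloudFiles.rescuedDataColumn", "_rescued_data")
    # schemaLocation is required whenever the schema is inferred
    .option("cloudFiles.schemaLocation", "dbfs:/schemas/example/")
    .load("s3://my-bucket/raw/"))

Note that maxFilesPerTrigger and maxBytesPerTrigger are rate limits per micro-batch, not total limits; the stream still processes everything eventually.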
Minimal example (AWS S3 → Delta):
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "dbfs:/schemas/ingestion/")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.includeExistingFiles", "true")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    # Optional performance (file notification mode):
    # .option("cloudFiles.useNotifications", "true")
    # .option("cloudFiles.queueName", "sqs-auto-loader-queue")  # AWS example
    .load("s3://my-bucket/data/"))

(df.writeStream
    .format("delta")
    .option("checkpointLocation", "dbfs:/checkpoints/data/")
    .start("dbfs:/delta/data/"))
Auto Loader = fewer manual jobs, safer schema changes, and ingestion that keeps up with your lake.
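If you'd rather run the pipeline as a scheduled job than a continuous stream, the same reader can be paired with an availableNow trigger (Spark 3.3+ / recent Databricks runtimes). A minimal sketch, reusing the hypothetical paths from the example above:

(df.writeStream
    .format("delta")
    .option("checkpointLocation", "dbfs:/checkpoints/data/")
    # Process all files discovered so far, then stop --
    # the checkpoint lets the next run pick up where this one left off
    .trigger(availableNow=True)
    .start("dbfs:/delta/data/"))

This gives batch-style cost control while keeping Auto Loader's exactly-once file tracking.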