
What is DuckLake, and why is it interesting?

DuckLake is an open lakehouse format that stores table data in open files like Parquet while managing all metadata in a SQL database. That design aims to make lakehouse management simpler, faster, and more reliable.

Instead of relying on many metadata files, DuckLake moves catalog and table metadata into relational tables managed through SQL transactions.
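To make the idea concrete, here is a minimal sketch of a SQL-backed catalog using Python's built-in sqlite3 module. The table names, columns, and helper functions are invented for illustration and are not DuckLake's actual catalog schema. It only shows the principle: registering new data files and creating a snapshot happen in one transaction, so readers see either all of a commit or none of it.

```python
import sqlite3

# Toy catalog schema, invented for this example (NOT DuckLake's real tables).
SCHEMA = """
CREATE TABLE snapshot (snapshot_id INTEGER PRIMARY KEY, note TEXT);
CREATE TABLE data_file (
    path           TEXT PRIMARY KEY,
    table_name     TEXT NOT NULL,
    begin_snapshot INTEGER NOT NULL,
    end_snapshot   INTEGER            -- NULL while the file is still live
);
"""


def open_catalog():
    """Create an in-memory catalog database with the toy schema."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    return con


def commit_files(con, table_name, paths, note=""):
    """Register new Parquet files and a new snapshot in one transaction."""
    with con:  # commits on success, rolls back if any statement raises
        cur = con.execute("INSERT INTO snapshot (note) VALUES (?)", (note,))
        snapshot_id = cur.lastrowid
        con.executemany(
            "INSERT INTO data_file (path, table_name, begin_snapshot) "
            "VALUES (?, ?, ?)",
            [(p, table_name, snapshot_id) for p in paths],
        )
    return snapshot_id


def files_at(con, table_name, snapshot_id):
    """Return the files visible to a reader pinned at snapshot_id."""
    rows = con.execute(
        "SELECT path FROM data_file "
        "WHERE table_name = ? AND begin_snapshot <= ? "
        "AND (end_snapshot IS NULL OR end_snapshot > ?) "
        "ORDER BY path",
        (table_name, snapshot_id, snapshot_id),
    )
    return [r[0] for r in rows]
```

In this sketch, a reader pinned at an older snapshot id keeps seeing the older set of files, which is the basic mechanism behind time travel. A failed commit (for example, a duplicate file path) rolls back the snapshot row as well, so no half-written state is left in the catalog.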

That brings a few interesting advantages:

  • Open storage: data stays in open formats like Parquet on object storage.
  • SQL-backed metadata: metadata lives in a database instead of file-heavy metadata chains.
  • ACID transactions: updates are transactional, including transactions that span multiple tables.
  • Snapshots and time travel: DuckLake supports snapshots and querying changes across snapshots.
  • Schema evolution and transactional DDL: table and schema changes are transactional.
  • Partition-aware pruning: query planning can use partition and file statistics to reduce reads early.
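As a rough illustration of the pruning point in the list above, the following sketch keeps per-file statistics in a SQL table and asks the catalog which files a query needs to read. The file_stats table, its columns, and the sample rows are hypothetical, and the real DuckLake planner is more involved. The point is only that file selection becomes an ordinary SQL query against the metadata.

```python
import sqlite3

# Hypothetical per-file statistics: (path, partition value, min id, max id).
FILES = [
    ("d1/f1.parquet", "2024-01-01", 1, 100),
    ("d1/f2.parquet", "2024-01-01", 101, 200),
    ("d2/f3.parquet", "2024-01-02", 1, 150),
]


def build_stats_catalog(rows):
    """Load file statistics into an in-memory SQL table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE file_stats ("
        "path TEXT, part_date TEXT, min_id INTEGER, max_id INTEGER)"
    )
    con.executemany("INSERT INTO file_stats VALUES (?, ?, ?, ?)", rows)
    con.commit()
    return con


def files_to_scan(con, part_date, lo, hi):
    """Select files in one partition whose [min_id, max_id] overlaps [lo, hi]."""
    rows = con.execute(
        "SELECT path FROM file_stats "
        "WHERE part_date = ? AND max_id >= ? AND min_id <= ? "
        "ORDER BY path",
        (part_date, lo, hi),
    )
    return [r[0] for r in rows]
```

With the sample rows, a query for ids 150 to 180 on 2024-01-01 selects only d1/f2.parquet. The other two files are excluded before any Parquet data is opened.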
Why use it?

Because it offers a lightweight way to combine:

  • open Parquet storage
  • a SQL catalog
  • shared concurrent access
  • DuckDB simplicity
For development, the catalog can even be a local DuckDB file. For more centralized setups, DuckLake can use systems like PostgreSQL, SQLite, MySQL, or MotherDuck as the catalog backend.
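In practice, the catalog backend is chosen when the lake is attached from DuckDB. The helper below is a sketch that only builds the ATTACH statement as a string. The connection-string prefixes reflect my understanding of the DuckLake documentation and should be checked against the current docs before use. The function name, the alias my_lake, and the data_files/ path are placeholders, and MotherDuck is left out because I am not certain of its exact string.

```python
# Connection-string prefixes per catalog backend. These follow the pattern
# described in the DuckLake docs as I understand it; treat them as assumptions.
CATALOG_PREFIX = {
    "duckdb": "ducklake:",
    "sqlite": "ducklake:sqlite:",
    "postgres": "ducklake:postgres:",
    "mysql": "ducklake:mysql:",
}


def _quote(text):
    """Escape single quotes for use inside a SQL string literal."""
    return text.replace("'", "''")


def attach_sql(backend, target, alias="my_lake", data_path="data_files/"):
    """Build an ATTACH statement for the given catalog backend.

    backend   -- one of the keys in CATALOG_PREFIX
    target    -- file name or connection parameters for the catalog database
    alias     -- name the lake gets inside the DuckDB session (placeholder)
    data_path -- where the Parquet data files are stored (placeholder)
    """
    prefix = CATALOG_PREFIX[backend]
    return (
        f"ATTACH '{prefix}{_quote(target)}' AS {alias} "
        f"(DATA_PATH '{_quote(data_path)}');"
    )


# Local development: the catalog is a single DuckDB file.
dev_sql = attach_sql("duckdb", "metadata.ducklake")

# Shared setup: the catalog lives in PostgreSQL (database name is hypothetical).
shared_sql = attach_sql("postgres", "dbname=ducklake_catalog host=localhost")
```

With the duckdb Python package installed, the resulting string could be passed to a connection's execute method after running INSTALL ducklake. That step is not shown here because it needs the extension to be downloaded.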

The interesting part is that DuckLake lets multiple compute nodes read and write the same dataset through the central catalog. That addresses a concurrency limitation of plain DuckDB, where a database file can be opened for writing by only one process at a time.

In short: DuckLake is trying to make the lakehouse model simpler by keeping the data open and letting a real SQL database do what it does best: manage metadata and transactions.