{"id":3833,"date":"2026-05-17T08:54:43","date_gmt":"2026-05-17T01:54:43","guid":{"rendered":"https:\/\/daiilynews.cu.ma\/?p=3833"},"modified":"2026-05-17T08:54:43","modified_gmt":"2026-05-17T01:54:43","slug":"anatomy-of-duck-db-for-python-developers","status":"publish","type":"post","link":"https:\/\/daiilynews.cu.ma\/?p=3833","title":{"rendered":"Anatomy of Duck DB for Python Developers"},"content":{"rendered":"<p>Introduction &#8211; SQL without a Server<\/p>\n<p>Pandas is widely used for data analysis: most data analysts and data engineers rely on its table-like DataFrame structure for quick analysis. The drawback is that it struggles once the data grows beyond a few GB, and spinning up Postgres or Redshift is overkill for a quick analysis. DuckDB fills this gap with zero-setup columnar SQL.<\/p>\n<p>Getting Started &#8211; zero config, instant power<\/p>\n<p>DuckDB is an open-source OLAP database management system designed for analytics and for running within the same process as the application. It is lightweight and can work directly with data files such as CSV and Parquet without needing a server.<\/p>\n<p>Installation and first query<\/p>\n<p>pip install duckdb &#8211; no ports to open, no configuration and no daemon<\/p>\n<p>In-Memory and Persistent Database &#8211; Two Operating Modes<\/p>\n<p>In-Memory<\/p>\n<p>When a DuckDB connection is created without specifying a file, the database lives entirely in RAM.<\/p>\n<p>import duckdb<br \/>\ncon = duckdb.connect()          # or duckdb.connect(&#39;:memory:&#39;)<\/p>\n<p>All data is stored in RAM and no files are written to disk.<br \/>\nReads and writes are extremely fast since there is no disk I\/O overhead.<br \/>\nData is completely lost when the connection closes.<br \/>\nThere are no file locking or cross-process concurrency concerns.<\/p>\n<p>Persistent Mode<\/p>\n<p>When the user provides a file path, DuckDB can write the 
results to disk in a single .duckdb file.<\/p>\n<p>con = duckdb.connect(&#39;my_database.duckdb&#39;)<\/p>\n<p>Tables, schemas and indexes are persisted.<br \/>\nUses a columnar storage format with compression and buffered I\/O.<br \/>\nEither one process opens the file for reading and writing, or multiple processes open it in read-only mode.<br \/>\nSupports WAL (write-ahead logging) for crash recovery.<\/p>\n<p>Powerful Pattern<\/p>\n<p>DuckDB allows you to mix both modes: you can start in memory and attach a persistent database, or use COPY\/EXPORT to snapshot an in-memory result to disk.<\/p>\n<p>con = duckdb.connect()<\/p>\n<p># Query a CSV, transform it, save the result to a persistent file<br \/>\ncon.execute(&quot;&quot;&quot;<br \/>\n    COPY (SELECT region, SUM(sales) AS total FROM read_csv(&#39;data.csv&#39;)<br \/>\n         GROUP BY region<br \/>\n     )<br \/>\n    TO &#39;results.parquet&#39; (FORMAT PARQUET)<br \/>\n&quot;&quot;&quot;)<\/p>\n<p>You get the speed of in-memory processing, which accelerates the pipeline, with the option to persist the result.<\/p>\n<p>Reading Files Directly &#8211; CSV, Parquet, JSON, Arrow<\/p>\n<p>Query a CSV without loading it into memory first<\/p>\n<p>SELECT * FROM read_csv(&#39;data.csv&#39;, auto_detect=true);<\/p>\n<p>Auto-detects delimiter, compression and data types<br \/>\nCan skip malformed rows when ignore_errors=true is set<br \/>\nCan read multiple CSVs at once: read_csv(&#39;data\/*.csv&#39;)<\/p>\n<p>Parquet<\/p>\n<p>SELECT * FROM read_parquet(&#39;data.parquet&#39;);<br \/>\n-- even from S3 directly (uses the httpfs extension)<br \/>\nSELECT * FROM read_parquet(&#39;s3:\/\/bucket\/data\/*.parquet&#39;);<\/p>\n<p>Exploits column pruning: only the columns you need are read<br \/>\nLeverages row group 
skipping using Parquet&#8217;s built-in min\/max statistics<br \/>\nNative support for nested types (structs, lists, maps)<\/p>\n<p>JSON\/NDJSON<\/p>\n<p>SELECT * FROM read_json(&#39;events.ndjson&#39;, auto_detect=true);<\/p>\n<p>Infers the schema automatically from the data<br \/>\nNDJSON (newline-delimited JSON) is read efficiently, one record per line<br \/>\nCan unnest deeply nested JSON fields using DuckDB&#8217;s json_extract, UNNEST, or -&gt; operators<\/p>\n<p>Apache Arrow<\/p>\n<p>import duckdb<br \/>\nimport pyarrow as pa<br \/>\n# df is assumed to be an existing pandas DataFrame<br \/>\narrow_table = pa.Table.from_pandas(df)<br \/>\nduckdb.query(&quot;SELECT * FROM arrow_table&quot;)<\/p>\n<p>Zero-copy integration: DuckDB reads from Arrow memory without serialization<br \/>\nIdeal for pipelines where data never needs to touch disk<\/p>\n<p>SQL Beyond SELECT<\/p>\n<p>DuckDB is not just a query engine; it supports rich SQL that covers data transformation and creation, plus some syntax extensions that are not available in most databases.<\/p>\n<p>Full Suite of Window Functions<\/p>\n<p>SELECT<br \/>\n    customer,<br \/>\n    ordered_at,<br \/>\n    amount,<\/p>\n<p>    -- Running total<br \/>\n    SUM(amount) OVER (PARTITION BY customer ORDER BY ordered_at) AS running_tot,<\/p>\n<p>    -- Lag\/lead comparisons<br \/>\n    LAG(amount) OVER (PARTITION BY customer ORDER BY ordered_at) AS prev_amt,<\/p>\n<p>    -- Percentile rank<br \/>\n    PERCENT_RANK() OVER (ORDER BY amount) AS pct_rank,<\/p>\n<p>    -- Named window reuse<br \/>\n    FIRST_VALUE(amount) OVER w AS first_order<br \/>\nFROM orders<br \/>\nWINDOW w AS (PARTITION BY customer ORDER BY ordered_at);<\/p>\n<p>DuckDB also supports the QUALIFY clause, which filters on a window function result without a subquery.<\/p>\n<p>SELECT * FROM orders<br \/>\nQUALIFY ROW_NUMBER() OVER (PARTITION BY 
customer ORDER BY amount DESC) = 1;<\/p>\n<p>PIVOT and UNPIVOT<\/p>\n<p>Many databases make you write CASE WHEN expressions by hand to pivot data. DuckDB does it natively.<\/p>\n<p>-- PIVOT: rows to columns<br \/>\nPIVOT orders ON region USING SUM(amount) GROUP BY year;<\/p>\n<p>-- UNPIVOT: columns to rows<br \/>\nUNPIVOT sales_wide ON q1, q2, q3, q4 INTO NAME quarter VALUE revenue;<\/p>\n<p>Multi-Database SQL<\/p>\n<p>-- Attach another DuckDB file<br \/>\nATTACH &#39;archive.duckdb&#39; AS archive;<\/p>\n<p>-- Cross-database join<br \/>\nSELECT a.*, b.region<br \/>\nFROM main.orders a<br \/>\nJOIN archive.customers b ON a.customer_id = b.id;<\/p>\n<p>-- Attach a PostgreSQL database (uses the postgres extension)<br \/>\nATTACH &#39;postgres:\/\/user:pass@host\/db&#39; AS pg (TYPE POSTGRES);<br \/>\nSELECT * FROM pg.public.users LIMIT 10;<\/p>\n<p>DuckDB + pandas + Polars &#8211; Choosing Your Stack<\/p>\n<p>DuckDB does not replace pandas or Polars; it solves a specific problem. The sweet spot is to use DuckDB for SQL-shaped operations and pandas\/Polars for row-level Python logic.<\/p>\n<p>Where DuckDB Shines<\/p>\n<p>Feature engineering for ML: window functions and GROUP BYs for feature computation are often faster and more readable in DuckDB than in pandas, before handing the result over to scikit-learn or PyTorch.<br \/>\nUnit testing dbt models locally: DuckDB lets you run a complete dbt project locally without a cloud warehouse, providing a fast feedback loop for data engineers.<br \/>\nLightweight ETL pipelines: you can read raw Parquet from S3, transform it with SQL, and write the cleaned output back without any Spark cluster or Airflow jobs.<\/p>\n<p>Conclusion<\/p>\n<p>DuckDB lets you think in SQL for analytical tasks without worrying about infrastructure setup. 
Anyone using Python can use DuckDB to analyze larger files that would be painful to handle in plain pandas. Given these advantages, it is equally important to know where DuckDB should not be used: concurrent writes from multiple processes, OLTP workloads, and long-running multi-user services.<\/p>\n<p>Reference: https:\/\/duckdb.org\/docs\/current\/data\/overview<\/p>\n<p><a href=\"https:\/\/dev.to\/varunjoshi12\/anatomy-of-duck-db-for-python-developers-emh\">Source link<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction &#8211; SQL without a Server Pandas is widely used for data analysis: most data analysts and data engineers rely on its table-like DataFrame structure for quick analysis. The drawback is that it struggles once the data grows beyond a few GB, and spinning up Postgres or Redshift is [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":3834,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[676],"tags":[761,765,1333,762,1476,763,764,860,905,760],"class_list":["post-3833","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech-ai","tag-coding","tag-community","tag-database","tag-development","tag-duckdb","tag-engineering","tag-inclusive","tag-programming","tag-python","tag-software"],"_links":{"self":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/3833","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3833"}],"version-history":[{"count":0,"href":"https:\/\/
daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/3833\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/media\/3834"}],"wp:attachment":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3833"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3833"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3833"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}