DAILY NEWS

Stay Ahead, Stay Informed – Every Day

Anatomy of DuckDB for Python Developers



Introduction – SQL without a Server

Pandas is the default tool for data analysis: almost every data analyst and data engineer uses its table-like DataFrame structure for quick exploration. The drawback is that it struggles once the data grows beyond a few GB, and spinning up Postgres or Redshift is overkill for a quick analysis. DuckDB fills this gap with zero-setup columnar SQL.

Getting Started – zero config, instant power

DuckDB is an open-source OLAP database management system designed for analytics and built to run inside the same process as the application. It is lightweight and can query data files such as CSV and Parquet directly, without a server.

Installation and first query

pip install duckdb

No ports to open, no configuration, and no daemon.

In-Memory and Persistent Database – Two Operating Modes

In-Memory

When a DuckDB connection is created without specifying a file, the database lives entirely in RAM.

import duckdb
con = duckdb.connect()  # or duckdb.connect(':memory:')

All data is stored in RAM; no files are written to disk.
Very fast reads and writes, since there is no disk I/O.
Data is lost when the connection closes.
No file locking or cross-process concurrency concerns.

Persistent Mode

When you pass a file path, DuckDB stores the database on disk in a single file (conventionally with a .duckdb extension).

con = duckdb.connect('my_database.duckdb')

Tables, schemas, and indexes are persisted.
Uses a columnar storage format with compression and buffered I/O.
Only one process can open the file for writing at a time; multiple processes can read it when all of them open it read-only.
Supports WAL (write-ahead logging) for crash recovery.

Powerful Pattern

DuckDB lets you mix both modes: start in memory, then either attach a persistent database or use COPY/EXPORT to snapshot an in-memory result to disk.

con = duckdb.connect()

# Query a CSV, aggregate it, and save the result to a Parquet file
con.execute("""
COPY (
    SELECT region, SUM(sales) AS total
    FROM read_csv('data.csv')
    GROUP BY region
)
TO 'results.parquet' (FORMAT PARQUET)
""")

You get the speed of in-memory processing for the pipeline, with the option to persist the result.

Reading Files Directly – CSV, Parquet, JSON, Arrow

Query CSV without loading into memory

SELECT * FROM read_csv('data.csv', auto_detect=true);

- Auto-detects delimiter, compression, and data types
- Can skip malformed rows instead of failing (with ignore_errors=true)
- Can read multiple CSVs at once: read_csv('data/*.csv')

Parquet

SELECT * FROM read_parquet('data.parquet');
-- even from S3 directly (requires the httpfs extension)
SELECT * FROM read_parquet('s3://bucket/data/*.parquet');

Exploits column pruning: it reads only the columns you need
Leverages row-group skipping using Parquet's built-in min/max statistics
Native support for nested types (structs, lists, maps)

JSON/NDJSON

SELECT * FROM read_json('events.ndjson', auto_detect=true);

- Auto-infers the schema from the data
- NDJSON (newline-delimited JSON) streams efficiently, line by line
- Can unnest deeply nested JSON fields using json_extract, UNNEST, or the -> operator

Apache Arrow

import pyarrow as pa
# df is an existing pandas DataFrame
arrow_table = pa.Table.from_pandas(df)
duckdb.query("SELECT * FROM arrow_table")

- Zero-copy integration: DuckDB reads from Arrow memory without serialization
- Ideal for pipelines where data never needs to touch disk

SQL Beyond SELECT

DuckDB is not just a query engine: it supports rich SQL that covers data transformation, table creation, and some syntax extensions that are not available in most databases.

Full Suite of WINDOW Functions

SELECT
    customer,
    ordered_at,
    amount,

    -- Running total
    SUM(amount) OVER (PARTITION BY customer ORDER BY ordered_at) AS running_tot,

    -- Lag/lead comparisons
    LAG(amount) OVER (PARTITION BY customer ORDER BY ordered_at) AS prev_amt,

    -- Percentile rank
    PERCENT_RANK() OVER (ORDER BY amount) AS pct_rank,

    -- Named window reuse
    FIRST_VALUE(amount) OVER w AS first_order
FROM orders
WINDOW w AS (PARTITION BY customer ORDER BY ordered_at);

DuckDB also supports the QUALIFY clause, which filters on a window function's result without a subquery.

SELECT * FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer ORDER BY amount DESC) = 1;

PIVOT and UNPIVOT

Most databases make you write CASE WHEN expressions by hand to pivot. DuckDB does it natively.

-- PIVOT: rows to columns
PIVOT orders ON region USING SUM(amount) GROUP BY year;

-- UNPIVOT: columns to rows
UNPIVOT sales_wide ON q1, q2, q3, q4 INTO NAME quarter VALUE revenue;

Multi-Database SQL

-- Attach another DuckDB file
ATTACH 'archive.duckdb' AS archive;

-- Cross-database join
SELECT a.*, b.region
FROM main.orders a
JOIN archive.customers b ON a.customer_id = b.id;

-- Attach a PostgreSQL database
ATTACH 'postgres://user:pass@host/db' AS pg (TYPE POSTGRES);
SELECT * FROM pg.public.users LIMIT 10;

DuckDB + pandas + Polars – Choosing Your Stack

DuckDB does not replace pandas or Polars; it solves a narrower problem. The sweet spot is to use DuckDB for SQL-shaped operations and pandas or Polars for row-level Python logic.

Where DuckDB shines

Feature engineering for ML: window functions and GROUP BY aggregations for feature computation are often faster and more readable in DuckDB than in pandas, before the result is handed to scikit-learn or PyTorch.
Unit testing dbt models locally: DuckDB lets you run a complete dbt project locally without a cloud warehouse, giving data engineers a fast feedback loop.
Lightweight ETL pipelines: you can read raw Parquet from S3, transform it with SQL, and write the cleaned output back without a Spark cluster or Airflow jobs.

Conclusion

DuckDB lets you think in SQL for analytical tasks without worrying about infrastructure setup. Anyone using Python can use DuckDB to analyze larger files that would give plain pandas a headache. It is equally important to know where DuckDB should not be used: concurrent writes, OLTP workloads, and long-running multi-user services.

Reference: https://duckdb.org/docs/current/data/overview




PostgreSQL Benchmarking Tool & SQLite Internals: API Error Handling, Join Optimization



Today’s Highlights

This week’s highlights feature a new multi-backend benchmarking tool for PostgreSQL, alongside deep dives into SQLite’s C API error handling and practical insights into optimizing joins with CASE statements.

paradedb/benchmarker: a workload agnostic, multi-backend benchmarking tool. (r/PostgreSQL)

Source: https://reddit.com/r/PostgreSQL/comments/1tbh7j2/paradedbbenchmarker_a_workload_agnostic/

The ParadeDB team has open-sourced Benchmarker, a new workload-agnostic, multi-backend benchmarking framework built on top of Grafana k6. This tool is designed to provide comprehensive insights into database performance, with a strong initial focus on PostgreSQL. It allows developers and database administrators to rigorously test database configurations, versions, and even different database systems under various synthetic and real-world workloads.

Benchmarker helps users understand latency, throughput, and resource utilization by enabling them to define custom test scenarios using a JavaScript API (k6 scripts). This capability is crucial for identifying performance bottlenecks, validating the impact of database changes, and ensuring new systems meet stringent performance requirements before they are deployed to production. By offering a standardized and repeatable method for performance measurement, the tool significantly aids in effective performance tuning and strategic migration planning within the PostgreSQL ecosystem and beyond.

Comment: This looks like a robust, open-source framework for database performance engineers. Leveraging k6 is smart, offering a flexible way to compare PostgreSQL performance across different setups and prevent regressions.

Reply: sqlite3_create_function_v2() error handling inconsistency (SQLite Forum)

Source: https://sqlite.org/forum/info/050cbc2c58fd2c05e80e6d4ebc6cb264611f676f7b339d1e1f0876163e066e5e

A discussion on the SQLite forum explores a potential inconsistency in the error handling mechanisms of SQLite’s sqlite3_create_function_v2() C API. This function is fundamental for developers who want to extend SQLite’s capabilities by registering custom SQL functions, embedding application-specific logic directly into the database engine. The thread delves into the nuances of how errors—such as invalid input, runtime exceptions within the custom function, or resource limitations—are expected to be propagated and handled by the API.

Understanding these error propagation paths is critical for building robust and reliable SQLite extensions. Inconsistent behavior can lead to unpredictable application crashes, data integrity issues, or extremely difficult-to-debug problems in embedded database environments. The conversation likely dissects specific code examples, return values, and internal SQLite error codes, offering insights into best practices for ensuring custom functions gracefully handle errors and communicate them effectively back to the SQLite core and the calling application.

Comment: Debugging error paths in C APIs can be a nightmare. This deep dive into sqlite3_create_function_v2()’s error handling is essential for anyone serious about writing stable, performant SQLite extensions.

Reply: Joins with CASE statement dont match index (SQLite Forum)

Source: https://sqlite.org/forum/info/1a8da89554683ba858846d409c820d0bb96154ee7c2ba5ea8b9b19a3e6c09eed

This SQLite forum thread tackles a common performance challenge: when SQL queries using JOIN operations in conjunction with CASE statements fail to leverage existing database indexes efficiently. In SQLite, as with many relational databases, effective index utilization is paramount for query performance, particularly when dealing with large datasets. The issue arises because CASE expressions within a join condition or WHERE clause can sometimes obfuscate the underlying logic, preventing the query optimizer from recognizing and applying relevant indexes, often resulting in costly full table scans.

The discussion provides invaluable insights for performance tuning in the SQLite ecosystem, shedding light on the internal workings of its query planner. It likely explores specific query patterns, demonstrates the impact using EXPLAIN QUERY PLAN outputs, and proposes practical workarounds. These might include refactoring complex CASE logic into separate computed columns or pre-processing data to enable index usage, helping developers to significantly improve query execution speed while maintaining data accuracy.

Comment: Hitting optimizer limits with CASE statements in joins is a classic performance gotcha. This SQLite discussion provides crucial insights for crafting efficient queries and understanding when to refactor complex logic to enable index scans.




Your AI database agent does not know what revenue means



The fastest way to get a wrong answer from an AI database agent is to ask a simple business question.

What was revenue last month?

That sounds easy.

The database has invoices, subscriptions, payments, refunds, credits, discounts, taxes, trials, failed charges, and test accounts.

The model sees tables.

Your business sees definitions.

If those definitions are not part of the system, the model has to guess.

Valid SQL can still be wrong

A table called payments may include failed attempts.

subscriptions may include trials.

amount may be gross, net, pre-tax, post-tax, or stored in cents.

created_at may mean invoice creation, payment capture, or customer signup.

An AI agent can write syntactically valid SQL against all of that and still answer the wrong question.

This is why natural-language SQL needs metric context, not just schema context.

Approved views beat clever prompts

A prompt can tell the model how to calculate MRR.

An approved view makes the definition executable.

Instead of exposing raw invoice and payment tables, expose something like:

reporting.monthly_recurring_revenue

with reviewed columns, tenant scope, time grain, currency assumptions, and test-account filtering already handled.

The model still helps users ask flexible questions.

But the business definition lives in infrastructure, not in a fragile instruction.
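A minimal sketch of the idea, using Python's built-in `sqlite3` so it runs anywhere. The `payments` schema, the sample rows, and the view name `reporting_monthly_revenue` are all assumptions made for illustration; a real definition of revenue would come from your own finance rules.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE payments (
        id INTEGER PRIMARY KEY,
        amount_cents INTEGER,   -- assumed: gross amount stored in cents
        status TEXT,            -- 'captured' or 'failed'
        is_test INTEGER,        -- 1 for test accounts
        captured_month TEXT
    );
    INSERT INTO payments VALUES
        (1, 5000, 'captured', 0, '2024-05'),
        (2, 5000, 'failed',   0, '2024-05'),
        (3, 9900, 'captured', 1, '2024-05');

    -- The reviewed definition lives in the view, not in a prompt.
    CREATE VIEW reporting_monthly_revenue AS
    SELECT captured_month AS month,
           SUM(amount_cents) / 100.0 AS revenue
    FROM payments
    WHERE status = 'captured' AND is_test = 0
    GROUP BY captured_month;
""")

# Valid SQL that answers the wrong question: failed and test rows included.
naive = con.execute(
    "SELECT SUM(amount_cents) / 100.0 FROM payments"
).fetchone()[0]

# The same question asked through the approved view.
approved = con.execute(
    "SELECT revenue FROM reporting_monthly_revenue WHERE month = '2024-05'"
).fetchone()[0]
con.close()

print(naive, approved)
```

Both queries are syntactically fine; only the one that goes through the view applies the exclusions the business agreed on.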

What should travel with the tool

For AI reporting, the MCP tool should carry context such as:

metric description
allowed dimensions
time zone and grain
exclusions
freshness timestamp
exact vs estimated status
scope and tenant boundaries
warnings the final answer must preserve

Otherwise the model may produce a confident answer while hiding the caveats that matter.
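One way to picture this context is as a small record that travels with every result. The sketch below is not an MCP API; `MetricContext`, its field names, and `render_answer` are hypothetical, chosen to mirror the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricContext:
    """Hypothetical metadata attached to a metric result."""
    description: str
    allowed_dimensions: tuple
    time_zone: str
    grain: str
    exclusions: tuple
    fresh_as_of: str
    is_exact: bool
    tenant_scope: str
    warnings: tuple = ()


def render_answer(value, ctx):
    """Build the final answer text so that caveats cannot be dropped."""
    status = "exact" if ctx.is_exact else "estimated"
    lines = [
        f"{ctx.description}: {value}",
        f"Data as of {ctx.fresh_as_of} ({status}, {ctx.grain}, {ctx.time_zone})",
    ]
    lines.extend(f"Warning: {w}" for w in ctx.warnings)
    return "\n".join(lines)


ctx = MetricContext(
    description="Monthly recurring revenue",
    allowed_dimensions=("plan", "region"),
    time_zone="UTC",
    grain="month",
    exclusions=("test accounts", "failed charges"),
    fresh_as_of="2024-06-01T00:00:00Z",
    is_exact=False,
    tenant_scope="tenant_123",
    warnings=("excludes refunds issued after month end",),
)
text = render_answer(4200, ctx)
print(text)
```

Because the warnings are rendered by code rather than left to the model, they reach the reader even when the generated summary would have omitted them.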

Longer version: Metric definitions for AI database agents

The practical rule:

If a metric is important enough for a leadership meeting, it is important enough to define before an agent calculates it.


