dataengineering – DAILY NEWS

TECH & AI

Understanding Apache Airflow DAGs: Structure, Communication, and Deployment

jackminion Jun 23, 2026 0

Apache Airflow has become one of the most widely used workflow orchestration platforms for building, scheduling, and monitoring data pipelines. At the heart of Airflow lies the Directed Acyclic Graph (DAG), a structure that defines how tasks are organized and executed. Understanding DAGs is essential for anyone working with data engineering, ETL pipelines, or workflow automation.

What is a DAG?A Directed Acyclic Graph (DAG) is a collection of tasks organized in a way that defines dependencies and execution order.

Directed- means tasks have a specific direction of execution.
Acyclic- means there are no loops; a task cannot eventually depend on itself.
Graph- represents the relationship between tasks.

Basic DAG StructureA typical Airflow DAG consists of:

DAG definition
Tasks (Operators or TaskFlow functions)
Dependencies

from airflow.sdk import dag, task
from datetime import datetime
@dag(
start_date=datetime(2026, 1, 1),
schedule=”@daily”,
catchup=False
)
def sample_dag():
@task def extract():
return “data”
@task def transform(data):
return data.upper()
@task def load(data):
print(data)
load(transform(extract()))
sample_dag()

Enter fullscreen mode

Exit fullscreen mode

This DAG follows a simple Extract → Transform → Load pattern.

Task Communication with XCom

Tasks in Airflow are isolated from one another. To share information between tasks, Airflow provides Cross-Communication (XCom).

XCom allows tasks to push and pull small pieces of data.

Deploying DAGs with SCP

In many production environments, Airflow runs on a remote Linux server. Instead of manually recreating DAG files, engineers often use Secure Copy Protocol (SCP) to transfer DAGs.

scp gas_prices_dag.py user@server:/home/user/airflow/dags/

Enter fullscreen mode

Exit fullscreen mode

This command securely copies the DAG file to the server’s DAG directory.

SCP is especially useful when deploying updated pipelines from a development machine to a production Airflow environment.

Running Airflow Services with nohup

Airflow components such as the scheduler and webserver need to remain running even after a terminal session closes.

The nohup command helps achieve this.

nohup airflow standalone &

Enter fullscreen mode

Exit fullscreen mode

This starts the scheduler in the background and prevents it from stopping when the terminal closes.The output is redirected to log files for troubleshooting.

Managing Airflow with systemd

For production environments, systemd is the preferred way to manage Airflow services.

A systemd service can automatically:

Start Airflow after system boot
Restart failed services
Manage logs
Monitor service health

Monitoring and Troubleshooting DAGs

Airflow provides a web interface where users can:

Trigger DAG runs
Monitor task execution
View task logs
Retry failed tasks
Inspect XCom values

ConclusionApache Airflow DAGs provide a powerful way to orchestrate complex workflows and data pipelines. By understanding DAG structure, task dependencies, XCom communication, and deployment techniques such as SCP, nohup, and systemd, data engineers can build reliable and maintainable ETL systems. Whether running a simple daily pipeline or a large-scale production workflow, mastering DAGs is the foundation of effective workflow orchestration with Apache Airflow.

Source link

TECH & AI

Why PostgreSQL and ClickHouse Work So Well Together

jackminion May 11, 2026 0

A lot of people compare PostgreSQL and ClickHouse like they are competing databases.

They really are not.

In fact, modern data systems often use both together.

And once you understand what each database is optimized for, the reason becomes pretty obvious.

The biggest mistake people make is expecting both databases to behave similarly.

They are built for entirely different workloads.

PostgreSQL is primarily an OLTP database.

ClickHouse is primarily an OLAP database.

That single difference changes almost everything about how they think internally.

PostgreSQL is extremely good at handling transactional workloads.

Things like:

user data
payments
inventory
banking records
order systems
application state

These are systems where:

consistency matters
updates happen frequently
rows are modified constantly
transactions must be reliable

For example:

UPDATE inventory
SET stock = stock – 1
WHERE product_id = 101;

Enter fullscreen mode

Exit fullscreen mode

This kind of workload is where PostgreSQL shines.

You want:

ACID guarantees
reliable transactions
row-level updates
strong consistency

PostgreSQL is designed around exactly that.

ClickHouse approaches data very differently.

Instead of optimizing for frequent row updates, it optimizes for analytical queries across massive datasets.

Things like:

metrics
observability
logs
event streams
analytical dashboards
time-series workloads

For example:

SELECT
service_name,
avg(response_time_ms)
FROM metrics
WHERE timestamp >= now() – INTERVAL 1 HOUR
GROUP BY service_name;

Enter fullscreen mode

Exit fullscreen mode

This is a completely different style of workload.

Instead of:

modifying small numbers of rows

ClickHouse is optimized for:

scanning huge amounts of data efficiently
aggregating billions of records
compressing analytical datasets
fast columnar reads

This is honestly the simplest way I think about it now.

PostgreSQL usually stores:

current application state
transactional business data
operational records

ClickHouse usually stores:

analytical history
events
metrics
large-scale queryable telemetry

One powers the application.

The other explains what the application is doing.

This is where things get interesting.

In many modern architectures, PostgreSQL becomes the operational source of truth.

Then data flows into ClickHouse for analytics.

Something like this:

Application
↓
PostgreSQL
↓
CDC / Airbyte / Kafka
↓
ClickHouse
↓
Dashboards / Analytics / Observability

Enter fullscreen mode

Exit fullscreen mode

This pattern is far more common than many people realize.

Because each database is doing what it is best at.

PostgreSQL can do analytical queries.

But analytical workloads behave very differently from transactional workloads.

For example:

scanning billions of rows
large aggregations
observability queries
real-time analytics
historical trend analysis

These workloads stress databases differently.

ClickHouse is optimized around:

columnar storage
vectorized execution
aggressive compression
analytical query execution

That is why queries over huge datasets often feel dramatically faster in ClickHouse.

This is another common misunderstanding.

ClickHouse is incredible for analytics.

But transactional systems require things like:

frequent updates
transactional consistency
row-level modifications
operational application state

That is not the primary design goal of ClickHouse.

You generally do not want your:

user authentication system
banking transactions
inventory updates
operational business logic

to depend entirely on analytical database behavior.

What I personally find interesting is how these systems complement each other instead of replacing each other.

PostgreSQL handles:

ClickHouse handles:

That separation creates much cleaner architectures.

Instead of forcing one database to solve every problem, each system handles the workload it was designed for.

One thing that makes this architecture powerful is CDC (Change Data Capture).

Instead of manually exporting data repeatedly, systems can stream changes from PostgreSQL into ClickHouse continuously.

Tools like:

Debezium
Airbyte
Kafka pipelines

make this pattern extremely practical now.

The operational system continues running normally while analytical systems receive data almost in real time.

The differences go deeper than just “transactions vs analytics”.

PostgreSQL thinks heavily about:

rows
transactional consistency
updates
locking
relational integrity

ClickHouse thinks heavily about:

columns
compression
merges
partitions
analytical scans
aggregation efficiency

Even their storage engines reflect completely different priorities.

Once you stop viewing databases as competitors and instead view them as workload-specific systems, the architecture starts making much more sense.

PostgreSQL handles the operational side.

ClickHouse handles the analytical side.

Together, they create systems that can:

process transactions reliably
scale analytical workloads efficiently
support observability
power dashboards
retain huge historical datasets

without forcing a single database to do everything.

The more I learn about databases, the more I realize that most modern architectures are really about separation of responsibilities.

PostgreSQL and ClickHouse work well together because they optimize for fundamentally different problems.

One is built to preserve business state reliably.

The other is built to analyze massive amounts of history efficiently.

And when combined properly, they complement each other extremely well.

Source link