A Layer Map of the Modern Data Engineering Ecosystem
The biggest change in the data stack over the past decade is that "the Hadoop-era unified stack has fallen apart": where one Hadoop ecosystem used to package storage, compute, scheduling, and querying, each layer has now split into independent star projects. Build a production-grade data platform today and you will hear Debezium, Kafka, Airbyte, Spark, Flink, dbt, Iceberg, ClickHouse, Trino, Redis, Superset, Metabase, Airflow, DataHub — a long list of names that correspond to different cells on the same layered architecture diagram.
This article organizes that map into six layers: Sources → Ingestion → Compute → Storage → Query → Serve, plus two cross-cutting concerns spanning the full stack: Orchestration and Governance. Each project is unpacked into four parts: History / Core Design / Users / Demo Code. The demos don't aim to be complete runnable projects — only the minimal snippet that expresses each tool's "soul." After reading, you should be able to answer two questions about each tool: why does it live in this layer, and what does its simplest line of code look like.
Overview — Six Layers · One Architecture Diagram
First, the full picture. The diagram below is the skeleton of this article — six horizontal layers from top to bottom, each with its one or two most mainstream representative projects. Airflow on the left spans vertically because scheduling is a cross-cutting concern "effective on all compute nodes"; Catalog at the bottom is the other cross-cutting concern — the truth about "what data exists" across all storage is registered in the metadata catalog.
A few facts worth pointing out from this diagram:
- Layer 3 “Compute” and Layer 5 “Query” share the same projects (Spark / Spark SQL, Flink / Flink SQL): this is because modern distributed engines are dual-identity “compute + query” — the same jar, the same cluster, can run offline ETL or accept interactive SQL requests.
- Layer 4 isn’t a single box but two parallel hubs — the lakehouse (S3 + Iceberg) solves “complete history + cross-engine sharing,” and real-time OLAP (ClickHouse) solves “low latency + high concurrency.” The two hubs coexist, not replace each other: the lakehouse stores “everything,” OLAP materializes “the most recent hot-query window.”
- Airflow and Catalog aren’t tools of any one layer; they’re cross-cutting: Airflow schedules all compute tasks, Catalog registers metadata for all tables in storage — their position in the diagram is deliberately outside the layer bands.
Below we go layer by layer.
Sources — OLTP databases · logs · SaaS
This layer isn’t strictly “tools” but rather an inventory of “where data comes from.” Roughly three categories:
- Business system relational databases — orders / users / inventory, typically Postgres / MySQL / MongoDB / Oracle. Characteristics: strict schema, single-table row counts from millions to billions, frequent row-level changes. Key fact: the true source-of-truth of these databases is the WAL / binlog / oplog, not the table itself; CDC tools all read from here.
- Application telemetry and server logs — user behavior events, app / web SDK reports, backend access logs. Characteristics: loose schema, many fields, append-only (almost no updates), TB-scale daily volume.
- Third-party SaaS APIs — Salesforce (CRM), Stripe (payments), Google Ads (advertising), HubSpot (marketing), etc. Characteristics: schema dictated by the other side, collection by pull (API polling) or push (webhook), frequency ranging from minutes to hours.
Tool selection across the next layers (Ingestion / Compute / Query) is fundamentally determined by which of these categories your data sources fall into, and whether you need real-time or batch: the CDC path goes Debezium → Kafka → Flink, telemetry goes through Kafka, and SaaS goes through Airbyte / Fivetran.
Ingestion — Debezium · Kafka · Airbyte · Fivetran
The ingestion layer is the engineering form of “moving data from source to data platform.” The key split is streaming vs batch: database changes and event streams must use streaming pipelines (Debezium → Kafka); SaaS pulls fit scheduled batch (Airbyte / Fivetran).
Debezium · Database Change Capture (CDC)
History — Debezium is led by Red Hat and was open-sourced in 2016. Its founding idea is "treat database changes as event streams": instead of polling whole tables to compute diffs, read the MySQL binlog / Postgres WAL / MongoDB oplog directly and convert each INSERT / UPDATE / DELETE into a real-time JSON event. Early Debezium was tightly coupled to Kafka Connect as its runtime; later releases (consolidated in the 2.0 release, 2022) promoted Debezium Server (a standalone process that can write directly to Pulsar / Kinesis / RabbitMQ) and Debezium Engine (embeddable into any JVM application).
Core design — Three layers of abstraction:
- Connector — one per database, reads the WAL / binlog and formats changes into a unified Debezium event (before / after / op (c/u/d) / ts_ms / source).
- Runtime — Kafka Connect / Debezium Server / Debezium Engine, consumes connector output and writes downstream.
- Schema registry and evolution — maintains Avro / Protobuf schemas via Confluent Schema Registry or Apicurio for schema evolution.
The essential difficulty of CDC is seamlessly stitching the initial full snapshot to the subsequent incremental stream. Debezium solves this with watermark-based incremental snapshots: while snapshotting, it keeps consuming the binlog, and uses watermarks to mark cut-off points that guarantee no missed and no duplicated changes.
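In practice, an incremental snapshot is triggered through a signal table: you insert a row into a table the connector is configured to watch (via signal.data.collection), and Debezium begins a chunked, watermarked snapshot while continuing to stream changes. A sketch — the table name debezium_signal and the target table are assumptions for illustration:

```sql
-- Trigger an ad-hoc incremental snapshot of one table.
-- The connector picks up this row, then snapshots orders.line_items
-- chunk by chunk while the binlog keeps streaming in parallel.
INSERT INTO debezium_signal (id, type, data)
VALUES (
  'adhoc-1',
  'execute-snapshot',
  '{"data-collections": ["orders.line_items"], "type": "incremental"}'
);
```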
Users — LinkedIn (early major contributor), Convoy, WePay, JD.com, ByteDance, and others run Debezium CDC pipelines in production. A common combination in China is Debezium + Kafka + Flink CDC (the flink-cdc-connectors project contributed by Alibaba essentially uses Debezium’s Engine library to read binlog). It’s the de facto open-source CDC standard, with almost no competitive open-source alternatives (closed-source competitors include Fivetran HVR, Striim, Qlik Replicate).
Demo code — Debezium MySQL Connector configuration (registered via Kafka Connect REST API in JSON):
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql-prod.internal",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "${SECRET:mysql_pwd}",
    "database.server.id": "184054",
    "topic.prefix": "inventory",
    "database.include.list": "orders,inventory",
    "table.include.list": "orders.line_items,inventory.products",
    "schema.history.internal.kafka.bootstrap.servers": "kafka-1:9092",
    "schema.history.internal.kafka.topic": "schema-changes.inventory",
    "snapshot.mode": "initial"
  }
}
Once registered, per-row change events appear in Kafka topic inventory.orders.line_items; consumers only need to subscribe to this topic to receive a downstream sync stream with sub-second latency.
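Each message on that topic follows the Debezium envelope described above. An abridged UPDATE event might look roughly like this (schematic; field values are made up for illustration):

```json
{
  "before": { "id": 1004, "quantity": 1 },
  "after":  { "id": 1004, "quantity": 3 },
  "op": "u",
  "ts_ms": 1715472000000,
  "source": { "connector": "mysql", "db": "orders", "table": "line_items" }
}
```

Downstream consumers diff before/after to reconstruct the row-level change, and op tells them whether to insert, update, or delete.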
Kafka · Distributed Commit Log
History — LinkedIn started the Kafka project internally in 2010 (Jay Kreps, Neha Narkhede, Jun Rao), to solve the problem of two systems duplicating effort: “user activity events + operations monitoring data.” Open-sourced in 2011, contributed to Apache in 2012, and the three founders left LinkedIn in 2014 to found Confluent (the commercial company). Over a decade, Kafka has evolved from a pure message queue into an “event streaming platform”:
- 2015 — Kafka Connect (standard connector framework; Debezium is based on this)
- 2016 — Kafka Streams (Java client stream processing library, no external cluster needed)
- 2017 — KSQL (SQL on Kafka, later renamed ksqlDB)
- 2022 — KRaft (self-managed metadata, gradually replacing ZooKeeper, production-ready)
- 2024 — Tiered Storage GA (cold data tiered to S3, lowering long-tail log retention cost)
Core design — Kafka’s foundation is abstracting messages as “partitioned, ordered append logs.”
- Topic is split into multiple Partitions, each Partition is a sequentially-written file on disk; Consumer tracks where it’s read using offsets.
- Each Partition has N replicas in the Broker cluster, Leader handles reads/writes, Followers sync; the ISR (In-Sync Replicas) mechanism ensures replicas immediately take over if Leader fails.
- Key trade-off: sequential disk writes + zero-copy + page cache lets Kafka run at hundreds of thousands of msg/s on commodity hardware on a single Broker; but consumers must handle at-least-once semantics and idempotency themselves.
- KRaft uses Raft consensus to replace ZooKeeper, placing metadata into Kafka’s own log, simplifying deployment from “two clusters” to “one cluster.”
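The at-least-once caveat is concrete: after a rebalance or crash, the same record can be redelivered, so consumers must dedupe on a stable key. Here is a broker-free sketch of the idempotency pattern in plain Python (no Kafka client; apply_once and its offset-tracking state are illustrative, not a Kafka API):

```python
# Simulate at-least-once delivery: the same (partition, offset) pair can be
# seen twice after a consumer restart. An idempotent consumer remembers the
# highest offset it has applied per partition and silently skips replays.
def apply_once(records, state=None):
    """records: iterable of (partition, offset, value). Returns applied values."""
    state = state if state is not None else {}  # partition -> last applied offset
    applied = []
    for partition, offset, value in records:
        if offset <= state.get(partition, -1):
            continue  # duplicate delivery: already applied, skip
        applied.append(value)
        state[partition] = offset
    return applied

# Offsets 0-2 delivered, then 1-3 redelivered after a simulated crash.
batch = [(0, 0, "a"), (0, 1, "b"), (0, 2, "c"),
         (0, 1, "b"), (0, 2, "c"), (0, 3, "d")]
print(apply_once(batch))  # ['a', 'b', 'c', 'd']
```

Real consumers persist this state transactionally with their side effects (e.g. an upsert keyed by offset), which is what turns at-least-once delivery into effectively-once processing.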
Users — Kafka is the de facto standard distributed data bus. LinkedIn, Uber, Netflix, Airbnb, Twitter, Alibaba, ByteDance, and Meituan all use it; LinkedIn's internal Kafka clusters process 7 trillion messages per day. Its commercial steward Confluent (NASDAQ: CFLT, market cap $7B+) monetizes primarily through the hosted Confluent Cloud and the surrounding ecosystem (Kafka Streams / Connect / Schema Registry / ksqlDB). Competitors:
- Redpanda — a C++ rewrite, Kafka-protocol compatible, pitched on lower latency / no JVM / no ZooKeeper
- Apache Pulsar — open-sourced by Yahoo!, separates broker / bookie, stronger multi-tenant features
- AWS Kinesis / GCP Pub/Sub — cloud-hosted, ecosystem-locked but low ops cost
Demo code — Java Producer + Consumer (Kafka’s official client, JVM is its native runtime):
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.*;
import java.time.Duration;
import java.util.*;

public class QuickStart {
    public static void main(String[] args) {
        // Producer: send one JSON event, keyed by user id
        Properties pp = new Properties();
        pp.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092");
        pp.put("key.serializer", StringSerializer.class.getName());
        pp.put("value.serializer", StringSerializer.class.getName());
        try (var producer = new KafkaProducer<String, String>(pp)) {
            producer.send(new ProducerRecord<>("user.events", "u-42",
                "{\"event\":\"login\",\"ts\":1715472000}"));
        }

        // Consumer: join a group and poll the same topic
        Properties cp = new Properties();
        cp.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092");
        cp.put("group.id", "analytics-pipeline");
        cp.put("auto.offset.reset", "earliest");
        cp.put("key.deserializer", StringDeserializer.class.getName());
        cp.put("value.deserializer", StringDeserializer.class.getName());
        try (var consumer = new KafkaConsumer<String, String>(cp)) {
            consumer.subscribe(List.of("user.events"));
            while (true) {
                for (var r : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.println(r.key() + " " + r.value());
                }
            }
        }
    }
}
Producer writes to partitions by key hash; Consumer joins a group and is automatically assigned partitions for parallel consumption — this is what Kafka’s “horizontal scaling” looks like to the client.
Airbyte · Open-Source ELT Connectors
History — Airbyte was founded in 2020 by former LiveRamp engineers Michel Tricot and John Lafleur. It targets the pain point of “connectors for long-tail SaaS data sources”: every company needs to extract Stripe, Salesforce, Mixpanel, HubSpot, Intercom, etc. into its warehouse, but closed-source Fivetran is too expensive and in-house connectors are laborious. Airbyte’s pitch is an open-source + community-contributed connector catalog, currently with 600+ Source / Destination connectors. Business model: Airbyte Open Source (MIT) + Airbyte Cloud (SaaS, billed per row) + Airbyte Enterprise (self-hosted commercial). 2022 Series B at $1.5B valuation.
Core design — Airbyte packages every connector as a Docker container and communicates with the scheduler via the Airbyte Protocol (JSON over stdin/stdout). This design lets any language and any-dependency connector be integrated. Core components:
- Worker — schedules and runs connector containers
- Source / Destination Connector — containers that actually pull / write data, implementing the protocol interfaces (spec / check / discover / read / write)
- Connection — binds a Source and a Destination into a sync task, with schedule, incremental cursor column, normalization, and other config
- CDK (Connector Development Kit) — Python / low-code templates that reduce the cost of contributing a new connector to a few hundred lines of YAML
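The Airbyte Protocol messages themselves are just newline-delimited JSON on stdout. For illustration (schematic; stream name and field values are made up), a RECORD message emitted by a source looks like:

```json
{
  "type": "RECORD",
  "record": {
    "stream": "customers",
    "data": { "id": 42, "email": "a@example.com" },
    "emitted_at": 1715472000000
  }
}
```

Because the contract is just "JSON lines over stdin/stdout between containers," a connector can be written in any language with any dependencies.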
PyAirbyte, released in 2023, lets you call a connector directly as a Python library without starting the full service — very popular in notebook scenarios.
Users — Mid-sized companies + data teams are the mainstay; Cargill, Calendly, Perplexity, Replicate, and others use it. It’s the de facto open-source ELT standard. Competitors:
- Meltano — spun out of GitLab, based on the Singer Tap standard
- dlt (data load tool) — Python-native, pitched as "embedded ELT," inspired by dbt
- Stitch — acquired by Talend, SaaS-only, author of Singer protocol
- Hevo Data — Indian team, commercial SaaS
Demo code — Airbyte’s manifest.yaml (low-code connector, declarative pulling from a REST API):
version: 0.50.0
type: DeclarativeSource
streams:
  - name: customers
    primary_key: ["id"]
    retriever:
      record_selector:
        extractor:
          field_path: ["data"]
      requester:
        type: HttpRequester
        url_base: "https://api.example.com/v1"
        path: "/customers"
        http_method: "GET"
        authenticator:
          type: BearerAuthenticator
          api_token: "{{ config['api_key'] }}"
      paginator:
        type: DefaultPaginator
        page_token_option:
          type: RequestOption
          field_name: "cursor"
          inject_into: "request_parameter"
This YAML is a complete Airbyte Source connector without writing a line of Python.
Fivetran · Closed-Source Managed ELT
History — Fivetran was founded by George Fraser and Taylor Brown in 2012 (YC W13), eight years before Airbyte. Its pitch is "zero-maintenance SaaS sync" — customers just authorize; no deployment, no connector maintenance, fully SaaS. This positioning made it the default ELT choice during the 2017–2020 rise of cloud warehouses (Snowflake, BigQuery, Redshift); a 2021 Series D valued it at $5.6B, and a merger with HVR the same year rounded out its CDC capability. The business model is pure MAR (monthly active rows) billing — one of the most expensive layers in the data stack, which is precisely what gave rise to open-source alternatives like Airbyte.
Core design — Fivetran uses a unified “schema-on-write” model: each connector automatically replicates the source schema to the target warehouse; new columns / new tables sync automatically. CDC uses binlog; SaaS uses API. All connectors are closed-source; Fivetran maintains quality and upgrades themselves. This contrasts cleanly with Airbyte: Fivetran sells “peace of mind,” Airbyte sells “control + cheap.”
Users — Notion, Square, DocuSign, Autodesk, and many other public companies use Fivetran. Combined with Snowflake / BigQuery / dbt, it forms the “modern data stack” trio (EL → Fivetran; T → dbt; Storage → Snowflake).
Demo code — Terraform configuration for a Fivetran connector (Fivetran itself is UI/API-driven, but managing it with Terraform is the community-mainstream practice):
resource "fivetran_connector" "stripe_to_snowflake" {
  group_id       = fivetran_group.prod.id
  service        = "stripe"
  sync_frequency = 60 # minutes

  destination_schema {
    name = "stripe_raw"
  }

  config {
    secret_key = var.stripe_secret_key
  }
}

resource "fivetran_connector_schedule" "stripe_schedule" {
  connector_id  = fivetran_connector.stripe_to_snowflake.id
  schedule_type = "auto"
  paused        = false
}
After terraform apply, a few minutes later Snowflake will have dozens of tables like stripe_raw.charges / stripe_raw.customers mirroring the Stripe API object model.
Compute — Spark · Flink · dbt
Once data is in the platform, it must be cleaned, aggregated, joined, and feature-engineered — this is the work of the compute layer. The three mainstream engines split by “processing paradigm”: Spark for heavy batch + general compute, Flink for stream-native + stateful compute, dbt for SQL templating + in-warehouse transformation.
Apache Spark · Distributed Batch and ML
History — Spark originated at UC Berkeley AMPLab (2009), Matei Zaharia’s PhD project, open-sourced in 2010, contributed to Apache in 2014 as a top-level project. At a time when Hadoop MapReduce was clunky and slow, Spark proposed the RDD (Resilient Distributed Dataset) abstraction + in-memory compute, eliminating the IO-to-disk overhead of each MapReduce round; typical ETL jobs ran 10–100× faster. Subsequent evolution:
- 2014 — Spark SQL + DataFrame API (Catalyst SQL optimizer)
- 2016 — Structured Streaming (micro-batch stream processing)
- 2018 — Spark 2.3 introduces native Spark on K8s
- 2020 — Spark 3.0 + Adaptive Query Execution (AQE)
- 2023 — Spark Connect (C/S architecture, slim client)
- 2024 — Spark 4.0 preview; Photon (Databricks' closed-source C++ vectorized execution engine) remains a commercial-only accelerator
Databricks, founded in 2013 by Matei and other AMPLab alumni, is Spark's commercial mothership — valued at $62B in late 2024, with an IPO widely anticipated. In the cloud, managed Spark largely means Databricks or the hyperscalers' EMR / Dataproc.
Core design — Spark’s core is “staged DAG execution”:
- DataFrame / Dataset API → Catalyst optimizer (rule-based + cost-based) → physical plan → DAG of Stages → Tasks
- Stage boundaries are shuffles (similar to MapReduce's shuffle, but data is buffered in memory and spilled to disk only when necessary)
- AQE adjusts join strategies / shuffle partition count at runtime based on actual data distribution — the key to Spark 3+ performance leap
- Data can come from HDFS / S3 / JDBC / Iceberg / Delta / Kafka, all abstracted as DataFrame
- ML (MLlib) and GraphX are submodules of the same engine, making Spark also a common tool for large-scale feature engineering
Users — Almost every large data team uses Spark. Apple, Uber, Netflix, LinkedIn, ByteDance, Alibaba, Tencent all run PB-scale Spark workloads. Spark’s strengths are comprehensive batch + ML integration, weakness is stream processing (micro-batch latency can’t beat Flink). Common combo: Iceberg + Spark + Airflow runs daily / hourly ETL.
Demo code — A Scala aggregation job (reads Iceberg, writes back to Iceberg). Spark itself is written in Scala; the Scala API is the closest to the engine:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder
  .appName("daily_orders_summary")
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.iceberg.type", "glue")
  .getOrCreate()

import spark.implicits._

val orders = spark.read.table("iceberg.raw.orders").where($"status" === "paid")

val daily = orders
  .groupBy(to_date($"created_at").as("day"), $"country")
  .agg(count("*").as("order_count"),
       sum($"total_cents").as("gmv_cents"))
  .orderBy("day", "country")

daily.writeTo("iceberg.marts.daily_orders").overwritePartitions()
Running this job, Spark automatically handles Iceberg table partition pruning, predicate pushdown, and shuffle tuning.
Apache Flink · Stream-Native Stateful Compute
History — Flink originated at the Stratosphere research project at TU Berlin (2010); the core team (Stephan Ewen, Kostas Tzoumas, etc.) founded Data Artisans in 2014, renamed Stratosphere to Flink, and contributed it to Apache, becoming a top-level project the same year. Flink’s killer feature is true streaming + event time + exactly-once semantics — three things that were trade-offs in the Spark Streaming era; Flink gave a unified solution.
In 2019 Alibaba acquired Data Artisans for €90M, renaming it Ververica; Flink saw large-scale deployment at Alibaba for Double 11 real-time dashboards / real-time risk control / real-time recommendation. In 2023 Confluent acquired Immerok (a subsequent venture from the original Data Artisans founding team), betting on “Kafka + Flink” as a unified de facto standard.
Core design — Flink’s core abstractions are DataStream (unbounded stream) and State (operator-held local state).
- Event time + Watermark — data carries timestamps; watermarks tell the system “I can assume all events before t have arrived,” enabling correct window aggregation
- Checkpoint / Savepoint — uses the Chandy-Lamport algorithm to periodically take distributed snapshots of all operator state; on failure, rolls back to the most recent checkpoint, achieving exactly-once
- State Backend — state can live on heap, off-heap, or RocksDB (default for large state); RocksDB state can be backed by S3 for incremental checkpoints
- Flink SQL / Table API — upper-layer SQL interface, integrating smoothly with Kafka / Iceberg / JDBC, lowering the barrier to entry
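To make the watermark idea concrete, here is a single-process sketch of event-time tumbling windows in plain Python (not Flink — the function tumble and its parameters are illustrative): events carry timestamps, the watermark trails the maximum seen timestamp by an allowed lateness, and a window is emitted only once the watermark passes its end.

```python
# Event-time tumbling windows with a watermark, in miniature.
# watermark = max_event_time - lateness; a window [start, start + size)
# is final once watermark >= its end, so late-but-in-bounds events
# (like (8, 5) below) still land in the right window.
def tumble(events, size, lateness):
    """events: list of (ts, value) in arrival order. Returns [(window_start, sum)]."""
    windows, max_ts, emitted = {}, float("-inf"), []
    for ts, v in events:
        start = ts - ts % size
        windows[start] = windows.get(start, 0) + v
        max_ts = max(max_ts, ts)
        watermark = max_ts - lateness
        # Emit every window whose end the watermark has passed.
        for s in sorted(k for k in windows if k + size <= watermark):
            emitted.append((s, windows.pop(s)))
    return emitted

# Out-of-order events; window size 10, allowed lateness 5.
events = [(1, 10), (12, 20), (8, 5), (17, 1), (26, 7)]
print(tumble(events, 10, 5))  # → [(0, 15), (10, 21)]
```

Flink does the same thing, but with the window state partitioned across operators, persisted in a state backend, and snapshotted by checkpoints.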
Users — Alibaba (entire group), Uber (real-time pricing), Netflix (real-time recommendation features), LinkedIn (real-time data streams), ByteDance (real-time warehouse), DiDi (real-time risk control). The default engine for real-time warehouses at large Chinese tech firms. Competitors:
- Spark Structured Streaming — micro-batch, second-plus latency, but Spark ecosystem unified
- Kafka Streams — Java client library, lightweight but JVM-only and tightly bound to Kafka
- Materialize — based on Differential Dataflow, SQL-first, incremental materialized views
- RisingWave — open-sourced by a Chinese team, positioning similar to Materialize
Demo code — Flink SQL real-time aggregation, consuming Kafka stream and writing into Iceberg:
-- Source: Kafka topic
CREATE TABLE orders_stream (
  order_id BIGINT,
  user_id BIGINT,
  amount_cents BIGINT,
  ts TIMESTAMP_LTZ(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders.v1',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json',
  'scan.startup.mode' = 'latest-offset'
);

-- Sink: Iceberg table
CREATE TABLE orders_5m_agg WITH (
  'connector' = 'iceberg',
  'catalog-name' = 'glue',
  'database-name' = 'marts'
) AS
SELECT
  TUMBLE_START(ts, INTERVAL '5' MINUTE) AS window_start,
  COUNT(*) AS orders,
  SUM(amount_cents) AS gmv_cents
FROM orders_stream
GROUP BY TUMBLE(ts, INTERVAL '5' MINUTE);
After submission, Flink continuously consumes Kafka, aggregates over 5-minute event-time windows, writes exactly-once into Iceberg, with second-level latency.
dbt · In-Warehouse SQL Modeling
History — dbt (data build tool) was started in 2016 by Tristan Handy at the consulting firm Fishtown Analytics (later renamed dbt Labs), alongside cofounders Drew Banin and Connor McArthur. dbt didn't invent a new compute engine — it's a SQL templating compiler + execution framework — compiling models/orders.sql into CREATE TABLE ... and pushing it to Snowflake / BigQuery / Redshift / Databricks / Postgres / Trino / DuckDB for execution.
Its growth logic: in the cloud-warehouse era, compute is outsourced to Snowflake; what data teams really need is an "engineered SQL workflow" — version control, dependency management, testing, documentation, CI/CD. dbt filled this gap perfectly and birthed the new job title "Analytics Engineer." Its 2022 Series D valued it at $4.2B. The commercial layer dbt Cloud (hosted IDE + scheduling + docs) is the main revenue source. In 2025 dbt Labs launched dbt Fusion (a Rust engine built on its SDF acquisition, with ~30× faster SQL / model parsing).
Core design — Built around a few key concepts:
- Model — a .sql file containing a SELECT, compiled by dbt into CREATE TABLE / VIEW AS
- {{ ref('upstream_model') }} — inter-model references; from these, dbt automatically builds a DAG to determine execution order
- {{ source('raw', 'orders') }} — references to raw tables, decoupling them from downstream models
- Test — built-in unique / not_null / accepted_values / relationships, plus custom SQL tests
- Macro — Jinja templates, reusable SQL fragments
- Snapshot — SCD Type 2 historical dimension tables
- Materialization — view / table / incremental / ephemeral; changing one config line switches a model from full rebuild to incremental
- Package — install model / macro collections from dbt Hub, npm-style
dbt’s strength is forcing analytics workflow to conform to “software engineering”: every model has an owner, has tests, has docs, goes through git commits / CI / pull request review — a stark contrast with the chaos of “write an ad-hoc SQL inside a BI tool” from ten years ago.
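The ref-driven DAG is simple enough to sketch: extract the {{ ref('...') }} names from each model's SQL, then topologically sort. A toy version in plain Python (this is not dbt's implementation; the model names and build_dag helper are made up for illustration):

```python
import re
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# model name -> raw SQL; {{ ref('x') }} declares a dependency on model x.
models = {
    "stg_orders": "select * from raw.orders",
    "dim_users":  "select * from raw.users",
    "fct_orders": "select * from {{ ref('stg_orders') }} "
                  "join {{ ref('dim_users') }} using (user_id)",
}

def build_dag(models):
    """Return model names in a valid execution order (upstreams first)."""
    deps = {name: set(re.findall(r"ref\('([^']+)'\)", sql))
            for name, sql in models.items()}
    return list(TopologicalSorter(deps).static_order())

order = build_dag(models)
print(order)  # fct_orders comes after both of its upstreams
```

dbt run does essentially this at scale — parse every model, build the graph, then execute it level by level against the warehouse.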
Users — Standard issue for nearly every modern data stack team; JetBlue, Casper, HubSpot, GitLab, Notion, Vercel, and many more. The de facto standard. The open-source ecosystem dbt-utils / dbt-expectations / elementary / re_data is very active. Competitors:
- SQLMesh — open-sourced by Toby Mao (ex-Airbnb) and others, pitched on automatic schema migration + true column-level lineage
- Dataform — acquired by Google, integrated into BigQuery
- Lea — simplified SQL build tool, Python-integrated
Demo code — An incremental model + test:
-- models/marts/fct_orders.sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='append_new_columns'
) }}

select
    o.order_id,
    o.user_id,
    o.created_at,
    o.total_cents,
    u.country
from {{ source('raw', 'orders') }} o
left join {{ ref('dim_users') }} u using (user_id)
{% if is_incremental() %}
where o.updated_at >= (select coalesce(max(updated_at), '1970-01-01') from {{ this }})
{% endif %}

# models/marts/_schema.yml
version: 2
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: user_id
        tests:
          - relationships:
              to: ref('dim_users')
              field: user_id
After dbt run --select fct_orders, dbt compiles to MERGE INTO / INSERT INTO and pushes it to the target warehouse, then runs dbt test for validation.
Storage — Lakehouse (S3 + Iceberg) · Real-Time OLAP (ClickHouse)
The biggest change in the storage layer in recent years is that “lakehouse” has replaced the binary of “data lake vs data warehouse.” The bottom layer is cheap object storage like S3 / R2, and on top of it sits an open table format (Iceberg / Delta / Hudi) providing ACID + schema evolution, readable / writable by any compute engine. Real-time OLAP is another coexisting form: data is materialized into compute-storage-integrated engines like ClickHouse / Doris / StarRocks, trading multi-engine sharing for millisecond response.
S3 / R2 · Object Storage Substrate
History — Amazon S3 launched in 2006, the first object store in cloud computing and one of the oldest AWS services. Its 11 nines (99.999999999%) of durability, 99.99% availability, and unlimited objects per bucket essentially closed the question of "how do I safely store PBs of data." Cloudflare R2 (GA 2022) differentiates on "zero egress fees" — S3's egress is so expensive that "pulling my data out of AWS to use elsewhere" was nearly infeasible; R2 broke that barrier, and in 2024 AWS responded by announcing free egress (with conditions). Two others commonly mentioned: Google Cloud Storage (GCS) and Backblaze B2.
Core design — S3 is an HTTP REST + strong consistency + immutable object storage model:
- Once uploaded, an object can’t be modified (overwrite creates a new version), suitable for append-only data
- Organized by prefix (pseudo directory), but not a real directory tree — listing keys under a prefix is O(n) scanning
- Tiered storage: Standard / Infrequent / Glacier (cold), billed on storage + requests + egress three dimensions
- Bandwidth within the same region is nearly free; cross-region / egress to internet is the expensive part
A key S3 constraint: LIST operations are slow (default 1000 keys/page; large directories take minutes to scan), so lakehouse table formats (Iceberg / Delta) must maintain their own manifest files to avoid LIST.
Users — the bedrock of nearly every modern cloud-native data platform. S3 holds hundreds of trillions of objects — exabytes of data (per AWS's public 2023 figures). R2 is growing fast; Cloudflare's 2024 report indicates R2 has stored trillions of objects.
Demo code — AWS CLI (the most common entrypoint for object-storage day-to-day ops):
# Write (single file with SSE encryption)
aws s3 cp part-00.parquet \
s3://my-lakehouse/raw/orders/dt=2026-05-11/part-00.parquet \
--sse AES256
# Sync entire directory (incremental, only transfers changed files)
aws s3 sync ./local-dir/ s3://my-lakehouse/raw/orders/dt=2026-05-11/
# Cloudflare R2 is S3 API compatible — just switch the endpoint
aws s3 cp file.parquet s3://my-r2-bucket/path/ \
--endpoint-url https://<account>.r2.cloudflarestorage.com
In real production, raw operations like this are rare; most data writes are indirect through Iceberg / Spark / Trino; CLI is reserved for ops: bucket policy, lifecycle rules, inspections.
Apache Iceberg · Open Table Format
History — Iceberg was started by Ryan Blue and Daniel Weeks at Netflix in 2017, open-sourced in 2018, and became an Apache top-level project in 2020. The core pain points it aims to solve are several fatal flaws of the Hive table format in large-data-lake scenarios:
- LIST is slow when finding files across an entire partition
- No ACID; concurrent writes frequently corrupt data
- No schema evolution (can only add columns, can’t change types / rename)
- No time travel / rollback
Iceberg's solution is "adding an independent metadata layer above the file layer": each table has a metadata.json pointing to a manifest list, the manifest list points to manifest files, and manifests point to data files — three levels of pointers. Each write produces a new metadata.json; ACID is achieved by atomically swapping the current-metadata pointer in the catalog (a simple atomic rename in file-system-based catalogs).
In 2024, Databricks acquired Tabular (the company founded by Ryan Blue and Daniel Weeks, Iceberg's main commercial entity) for a reported $1B+, outbidding Snowflake — which responded by open-sourcing its Polaris catalog for Iceberg. Iceberg has become the de facto standard: AWS Glue, Athena, Snowflake, BigQuery, Databricks (native support since 2024), Cloudflare R2 SQL, ClickHouse, Trino, Flink, and Spark can all read Iceberg.
Core design — Three-level metadata:
- Metadata file (JSON) — table schema, partition spec, snapshots list, current-snapshot pointer
- Manifest list (Avro) — one per write, listing all manifest files involved in this write
- Manifest file (Avro) — lists specific data files (Parquet / ORC / Avro) with paths, statistics, partition values
Read path: uses manifest column statistics (min/max) for partition + file pruning, avoiding scanning irrelevant data. Write path: each commit atomically replaces the metadata file pointer, achieving MVCC.
Trade-off: Iceberg isn’t bound to a specific compute engine (vs Delta Lake’s early tight Spark binding), but commit conflicts require retries.
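File pruning from manifest statistics can be sketched in a few lines. This is a toy model, not Iceberg's actual classes — the file paths and the prune helper are made up for illustration: each manifest entry carries per-column min/max, and a scan keeps only files whose range can intersect the predicate.

```python
# Toy manifest: each data file records min/max statistics for one column.
# A scan for a range predicate skips every file whose stats prove it
# cannot contain matching rows -- no data IO at all for pruned files.
manifest = [
    {"path": "s3://lake/f1.parquet", "min": 1,   "max": 100},
    {"path": "s3://lake/f2.parquet", "min": 101, "max": 500},
    {"path": "s3://lake/f3.parquet", "min": 501, "max": 900},
]

def prune(manifest, lo, hi):
    """Files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f["path"] for f in manifest if f["min"] <= hi and f["max"] >= lo]

print(prune(manifest, 120, 480))  # → ['s3://lake/f2.parquet']
```

Because these statistics live in small Avro manifests rather than in the data files, the planner decides what to read without ever issuing an S3 LIST.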
Users — the mainstream choice for large data lakes at Netflix, Apple, Airbnb, Adobe, ByteDance, Stripe, and others. Competitors:
- Delta Lake — Databricks-led, partial open-source in 2019, fully open-sourced in 2022 (Delta Lake 3.0 UniForm supports dual-writing Iceberg metadata). The ecosystem is most complete on the Databricks platform.
- Apache Hudi — open-sourced by Uber in 2017, pitched on upsert + incremental queries; still advantageous in CDC / high-upsert scenarios, but community momentum trails Iceberg.
Demo code — Spark SQL creating + maintaining an Iceberg table:
-- Create table
CREATE TABLE iceberg.marts.fct_orders (
  order_id BIGINT,
  user_id BIGINT,
  amount_cents BIGINT,
  country STRING,
  created_at TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(created_at), country)
TBLPROPERTIES (
  'write.format.default' = 'parquet',
  'write.target-file-size-bytes' = '134217728'  -- 128 MB
);

-- Upsert
MERGE INTO iceberg.marts.fct_orders t
USING staging.orders_delta s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Time travel
SELECT * FROM iceberg.marts.fct_orders
FOR TIMESTAMP AS OF '2026-05-10 00:00:00';

-- Expire old snapshots (GC)
CALL iceberg.system.expire_snapshots('marts.fct_orders', TIMESTAMP '2026-05-01');
ClickHouse · Real-Time OLAP
History — ClickHouse originated as an internal Yandex project in 2009, powering reporting queries for Yandex.Metrica (Russia's answer to Google Analytics) — processing hundreds of billions of events per day with sub-second aggregation. Open-sourced in 2016, it quickly became the de facto open-source OLAP engine. In 2021 the core team formed ClickHouse, Inc. (US), independent of Yandex, later raising at a $6B+ valuation. Commercialization centers on the hosted ClickHouse Cloud.
Core design — All of ClickHouse’s performance advantages come from three foundations: columnar storage + MergeTree + vectorized execution:
- The MergeTree engine is the default table engine family (including ReplacingMergeTree / SummingMergeTree / AggregatingMergeTree variants); writes are ORDER BY-sorted, with background merging of small files into large ones — similar to LSM-tree thinking
- Data is stored by column; a query reads only the columns it uses, so I/O scales with the columns touched, not the row count
- Vectorized execution — operators process a block of rows at a time instead of one row, with aggressive SIMD optimization
- Sparse index — one index mark per N rows (index_granularity, default 8192); range scans prune by ORDER BY prefix fields
- Distributed table + sharding + replication for horizontal scaling
- Materialized views — real-time aggregation on write (essentially attaching a trigger to the write path; engines like SummingMergeTree do incremental automatically)
Trade-off: fast writes and fast queries, but weak UPDATE / DELETE (asynchronous mutations, inefficient, batch-only). This makes ClickHouse a typical "append-mostly" tool — perfect for real-time warehouses and OLAP, unsuited to transactional workloads.
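The sparse-index bullet above can be made concrete with a toy Python model (this is an illustration of the idea, not ClickHouse internals): keep one "mark" per granule of an ORDER BY-sorted column, then binary-search the marks to find which granules a range scan actually has to read.

```python
import bisect

GRANULE = 8192  # ClickHouse's default index_granularity

def build_marks(sorted_keys):
    """One index entry per granule: the first key of every GRANULE rows."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), GRANULE)]

def granules_for_range(marks, lo, hi):
    """Half-open range of granule indices that may contain keys in [lo, hi]."""
    # Last granule whose starting key is still <= lo:
    first = max(bisect.bisect_right(marks, lo) - 1, 0)
    # First granule that starts strictly after hi — nothing beyond it matters:
    last = bisect.bisect_right(marks, hi)
    return first, last

keys = list(range(1_000_000))            # pretend: a sorted ORDER BY column
marks = build_marks(keys)                # ~123 marks instead of 1M row entries
first, last = granules_for_range(marks, 500_000, 500_100)
print(first, last)                       # → 61 62: only granule 61 is scanned
```

A million-row column is indexed by ~123 marks; a narrow range scan touches a single granule — that is why the sparse index stays tiny and fits in memory.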
Users — Cloudflare (raw log analytics, tens of billions of events per day), Uber (real-time risk), ByteDance (user behavior analytics; Cnch is ByteDance’s customized version of ClickHouse), Spotify, eBay, Disney, DiDi, Alibaba (the Hologres team were major ClickHouse contributors), and others. Competitors:
- Apache Doris (formerly Palo at Baidu; open-sourced in 2017, graduated as an Apache top-level project in 2022) — many financial / government case studies in China
- StarRocks (forked from Doris, open-sourced in 2020, commercialized by CelerData) — used by ByteDance / Tencent / Meituan in China
- Druid (commercialized by Imply, originated at Metamarkets) — veteran real-time OLAP
- Pinot (open-sourced by LinkedIn, joined Apache in 2018) — used by LinkedIn / Uber
Demo code — Table + materialized view:
-- Detail table
CREATE TABLE events (
event_date Date,
event_time DateTime,
user_id UInt64,
event_type LowCardinality(String),
country LowCardinality(String),
revenue_cents UInt64
)
ENGINE = MergeTree
PARTITION BY event_date
ORDER BY (event_date, event_type, user_id)
TTL event_date + INTERVAL 90 DAY;
-- Materialized view: hourly aggregation
CREATE MATERIALIZED VIEW events_hourly_mv
ENGINE = SummingMergeTree
PARTITION BY event_date
ORDER BY (event_date, event_hour, country, event_type)
AS SELECT
event_date,
toStartOfHour(event_time) AS event_hour,
country,
event_type,
count() AS events,
sum(revenue_cents) AS revenue_cents
FROM events
GROUP BY event_date, event_hour, country, event_type;
-- Query (hits materialized view, millisecond response)
SELECT country, sum(revenue_cents) / 100 AS revenue_usd
FROM events_hourly_mv
WHERE event_date = today() AND event_type = 'purchase'
GROUP BY country ORDER BY revenue_usd DESC LIMIT 10;
Query — Trino · Spark SQL · Redis
Once data is in storage, a “query entrypoint” is needed to accept SQL / KV requests. Trino handles cross-source federated interactive SQL, Spark SQL handles heavy analytics / feature engineering (the Spark from §04, in SQL form), Redis handles KV / cache / online metrics.
Trino · Federated SQL Engine
History — Trino’s predecessor is Presto, developed at Facebook in 2012 (Martin Traverso, David Phillips, Eric Hwang, Dain Sundstrom) to bring interactive SQL on Hadoop down to second-scale response times. Open-sourced in 2013, it quickly became the industry’s representative MPP SQL engine. In 2018 the original core team left Facebook over governance conflicts, forked PrestoSQL (renamed Trino in 2020, because Facebook had donated the Presto trademark to the Linux Foundation but retained governance), and founded Starburst for commercialization. What the market calls “Presto” today generally means Trino; the original PrestoDB is maintained under the Linux Foundation’s Presto Foundation with much lower activity.
Trino’s core positioning is “one SQL engine to query all data sources”: no data replication, no unified format requirement; via connectors, federated queries against Hive / Iceberg / Delta / MySQL / Postgres / Kafka / Elasticsearch / Cassandra / MongoDB … 50+ data sources.
Core design — Trino is shared-nothing MPP:
- One Coordinator (SQL parse + plan + schedule)
- N Workers (executing physical operators)
- Connector abstraction — each data source implements ConnectorMetadata / ConnectorSplitManager / ConnectorPageSource; data flows between workers as pages (column batches)
- Fully in-memory + pipelined execution; intermediate results don’t spill to disk (which is also why Trino isn’t suited to very large batch jobs — OOM risk)
- Supports dynamic filtering (at runtime, pushes the build side’s join keys down to the probe side)
Trade-off: Trino’s strength is interactive queries (second-scale) + cross-source joins; its weakness is fault tolerance on long-running jobs (failures are expensive to re-run; recent releases added fault-tolerant execution, Project Tardigrade, to mitigate this).
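Dynamic filtering from the design list above can be illustrated with a toy in-memory join (a sketch of the idea, not Trino's implementation): collect the build side's join keys first, then push them down as a filter so non-matching probe-side rows are dropped at the scan, before the join ever sees them.

```python
def dynamic_filter_join(build_rows, probe_rows):
    """Toy hash join with dynamic filtering: the build side's join keys are
    collected at runtime and pushed down to the probe-side scan as a filter."""
    build = {}
    for key, payload in build_rows:
        build.setdefault(key, []).append(payload)

    dynamic_filter = set(build)  # derived at runtime from the small build side

    # Probe-side "scan": the pushed-down filter drops rows before the join.
    survivors = [(k, p) for k, p in probe_rows if k in dynamic_filter]

    joined = [(k, b, p) for k, p in survivors for b in build[k]]
    return joined, len(survivors)

dims = [(1, "US"), (2, "JP")]                          # small dimension (build) side
facts = [(i % 10, f"order-{i}") for i in range(100)]   # large fact (probe) side
joined, survived = dynamic_filter_join(dims, facts)
print(len(joined), survived)  # → 20 20: 80 of 100 probe rows never reach the join
```

In real Trino the same effect happens across workers: the fact-table scan skips splits and rows whose join keys cannot possibly match, which is why star-schema joins against a small dimension table stay fast.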
Users — Netflix (main interactive query entrypoint), LinkedIn, Lyft, Pinterest, Stripe, ByteDance, and others. Starburst Galaxy (hosted Trino) is the commercial mainstay. Competitors:
- Apache Spark SQL — batch-leaning, wider ecosystem
- Apache Doris / StarRocks / ClickHouse — OLAP databases, require data to be loaded first
- DuckDB — single-machine analytics, often faster in < 100GB scenarios
- AWS Athena — AWS’s managed query service built on Presto / Trino, mainly for querying S3 + Iceberg
Demo code — Trino cross-Iceberg + MySQL join (classic scenario: joining OLTP dimension tables with lakehouse fact tables):
SELECT
o.country,
p.product_name,
SUM(o.amount_cents) / 100 AS revenue_usd
FROM iceberg.marts.fct_orders o
JOIN mysql.production.products p
ON o.product_id = p.id
WHERE o.created_at >= DATE '2026-05-01'
AND o.country IN ('US', 'JP', 'CN')
GROUP BY 1, 2
ORDER BY revenue_usd DESC
LIMIT 50;
When Trino executes this, the Iceberg connector reads S3, the MySQL connector reads the prod DB, and the join runs in worker memory. The tables are joined only at query time — no pre-ETL needed.
Spark SQL · Pointer to §04 Compute
Spark SQL has already been covered in §04 Compute; here we only highlight the “query identity” angle:
- The same Spark cluster can run offline ETL (jobs submitted via spark-submit) and also expose a Thrift Server (JDBC / ODBC endpoint) to SQL clients — at which point it acts as a quasi-real-time SQL engine
- Compared with Trino: Spark SQL fits heavy analytics (minute-scale is acceptable, fault tolerance is strong), Trino fits interactive queries (second-scale responses, but weaker fault tolerance)
- ML feature engineering defaults to Spark SQL (because downstream is PySpark MLlib / Pandas API), BI dashboards default to Trino
Redis · Cache and KV
History — Redis was written in C by Salvatore Sanfilippo in 2009 (an Italian programmer, as a side project) to solve performance problems in his real-time stats product. In 2010 VMware hired Sanfilippo to continue development full-time; in 2015 Redis Labs (later renamed Redis Inc.) took over sponsorship. Over a decade, Redis has evolved from “in-memory KV” to “in-memory data structure server”: string / hash / list / set / sorted-set / bitmap / hyperloglog / stream / geo / json (module) / vector (module) — almost any business scenario maps to some Redis structure.
In March 2024, Redis changed its license from BSD to RSALv2 + SSPL (dual license, restricting cloud-vendor commercial use); the community immediately forked Valkey under the Linux Foundation, with AWS / Google / Oracle / Snowflake all contributing engineering effort; competitors like Redict, Garnet (Microsoft, written in C#, marketing higher throughput) emerged in parallel. Today “Redis alternatives” is a real topic in the data stack discussion.
Core design — Redis is a textbook “single-threaded event loop + in-memory data structures”:
- All commands execute serially in a single main thread (IO multiplexing pattern), so no locking is needed
- Persistence: RDB (snapshots) + AOF (append-only log) for dual protection
- HA: Sentinel (failover) or Cluster (sharding)
- Redis Modules (since 4.0) allow third-party data-type extensions: RedisJSON, RediSearch (including vector search), RedisGraph, RedisAI have all seen production use
- 6.0 introduces multi-threaded IO (command processing remains single-threaded, but IO is multi-threaded), bumping throughput another tier
Users — Nearly every internet company runs Redis: Twitter, GitHub, Stack Overflow, Snapchat, Pinterest, etc. Typical usage in the data stack:
- Cache ClickHouse / Trino query results by key in Redis (avoid repeated querying)
- As the hot store for real-time metrics (Flink real-time sink into Redis)
- As the online portion of ML feature stores (feature stores like Feast default to Redis)
Demo code — Go caching + sorted-set real-time leaderboard (Redis’s typical consumer is a Go / Java backend service):
package main

import (
	"context"
	"encoding/json"
	"time"

	"github.com/redis/go-redis/v9"
)

// Row and the helpers below are stand-ins so the snippet compiles.
type Row struct {
	Country string
	Revenue int64
}

type ScoreUpdate struct {
	UserID string
	Score  float64
}

func expensiveClickHouseQuery() []Row { return nil } // placeholder for the real query

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "redis.internal:6379"})

	// 1. Cache an expensive query result (5-minute TTL)
	cacheKey := "top_countries:2026-05-11"
	var rows []Row
	if v, err := rdb.Get(ctx, cacheKey).Result(); err == nil {
		json.Unmarshal([]byte(v), &rows) // cache hit
	} else {
		rows = expensiveClickHouseQuery() // cache miss: query ClickHouse
		b, _ := json.Marshal(rows)
		rdb.Set(ctx, cacheKey, b, 5*time.Minute)
	}

	// 2. Sorted-set real-time leaderboard
	updates := []ScoreUpdate{{UserID: "u1", Score: 10}}
	for _, u := range updates {
		rdb.ZIncrBy(ctx, "leaderboard:2026-w19", u.Score, u.UserID)
	}

	// 3. Read the Top 10
	top10, _ := rdb.ZRevRangeWithScores(ctx, "leaderboard:2026-w19", 0, 9).Result()
	_ = top10
}
Serve — BI · Backend API · ML Applications
The serve layer is where data is finally “consumed.” Tools at this layer are not “mutually exclusive” like the lower layers but divided by user role: data analysts use BI dashboards, business systems use backend APIs, algorithm / ML teams use model inference.
Apache Superset · Data Analytics Platform
History — Superset was developed internally at Airbnb by Maxime Beauchemin in 2015 (the same author as Airflow!), open-sourced in 2016, entered Apache Incubator in 2017, became a top-level project in 2021. After leaving Airbnb, Maxime founded Preset (2019) for hosted SaaS Superset. Superset’s design philosophy is “exploration-oriented + engineer-friendly”: write SQL to explore → save as dataset → add charts → assemble dashboard, all completed in the browser.
Core design — Three core abstractions:
- Database connection — connects to anything SQL-supporting via SQLAlchemy / DB API (Trino / Iceberg via Trino / ClickHouse / Postgres / Snowflake / BigQuery …)
- Dataset — a SQL query or table as the data source for subsequent charts
- Chart / Dashboard — 80+ visualizations (based on ECharts / Plotly)
Features: SQL Lab (in-browser SQL editor with syntax highlighting + autocomplete + history); Jinja templating (dynamically compose SQL with {{ filter_values('country') }}); Row-level security (per-user WHERE clauses); Embedded analytics (SDK to embed in your own product).
Users — Airbnb, Lyft, Twitter, AT&T, Bloomberg, ByteDance, JD.com, DiDi, and others. First-tier of open-source BI. Preset Cloud is the hosted commercial version.
Demo code — Superset’s dataset YAML (used with superset import-datasources for batch management):
database_name: trino-lakehouse
sqlalchemy_uri: trino://datateam@trino-prod.internal:8443/iceberg
tables:
- table_name: fct_orders_daily
schema: marts
columns:
- column_name: day
type: DATE
is_dttm: true
- column_name: country
type: VARCHAR
- column_name: gmv_cents
type: BIGINT
metrics:
- metric_name: gmv_usd
expression: "SUM(gmv_cents) / 100.0"
- metric_name: orders
expression: "SUM(order_count)"
Metabase · Self-Service BI
History — Metabase was open-sourced by Sameer Al-Sakran in 2015; the commercial company, Metabase Inc., was formed at the same time. Its positioning differs from Superset’s: “non-technical users can use it too” — queries are generated by picking dimensions / metrics / filters in the UI, no SQL required. Metabase’s 2024 Series D valued it at $1.6B. Business model: open source (AGPL) + Pro / Enterprise (closed-source add-ons) + Cloud.
Core design — Built around the “Question” abstraction: each chart is a question, and questions can be constructed via Notebook (UI drag-and-drop) or SQL editor.
- X-ray — auto-exploration on a table / segment, popping up common questions and charts (suits users who don’t know where to start)
- Dashboard + Subscription — scheduled email pushes (most-used by management)
- Models (essentially views + metadata) — similar to dbt models, defining “clean tables” at the BI layer
- Embedding — embedding dashboards into your own product (JWT signature + iframe)
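The embedding bullet above can be sketched end-to-end with a hand-rolled HS256 JWT (stdlib only; in production you would use a JWT library, and the host, secret, and dashboard id here are made up): sign {resource, params, exp} with the embedding secret from Metabase's admin settings, then drop the token into the iframe URL.

```python
import base64, hashlib, hmac, json, time

def b64url(data: bytes) -> str:
    """Base64url without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_embed_token(secret: str, dashboard_id: int, ttl_s: int = 600) -> str:
    """Build an HS256 JWT of the shape Metabase expects for signed embedding."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {
        "resource": {"dashboard": dashboard_id},
        "params": {},                      # locked filter values would go here
        "exp": int(time.time()) + ttl_s,   # short-lived token
    }
    signing_input = (b64url(json.dumps(header).encode()) + "." +
                     b64url(json.dumps(payload).encode()))
    sig = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)

# Hypothetical values — the real secret comes from Metabase's admin settings.
token = sign_embed_token("embedding-secret-from-admin-panel", dashboard_id=7)
iframe_url = f"https://metabase.internal/embed/dashboard/{token}#bordered=true"
print(iframe_url[:45])
```

The backend signs the token per request (so each user only sees what their locked params allow) and the frontend renders the iframe — the browser never sees the secret.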
Users — Mainstream for SMBs; 39k+ stars on GitHub. Compared with Superset: Metabase is more lightweight / easier to onboard, but its customization / visualization richness lags Superset. Mid-sized companies often install both: Metabase for business PMs, Superset for data analysts.
Demo code — Metabase is a pure HTTP API; curl is the most direct demonstration:
# Login to get session token
TOKEN=$(curl -s https://metabase.internal/api/session \
-H "Content-Type: application/json" \
-d '{"username":"bot","password":"secret"}' | jq -r .id)
# Run a saved question (id=42) for JSON results
curl -s -X POST https://metabase.internal/api/card/42/query/json \
-H "X-Metabase-Session: $TOKEN" | jq '.[0:3]'
In real production, scripting Metabase is rare; the vast majority of interaction is through the web UI.
Backend API / ML Application · Not a Single Tool
The backend API cell isn’t a specific project, but application code that exposes the data platform’s query capabilities to business systems:
- A FastAPI / Spring Boot / Go service that internally translates requests into Trino / ClickHouse queries + Redis cache
- Or a frontend pulls warehouse tables directly via a GraphQL gateway (Hasura, PostGraphile)
- Or data is sync’d back to OLTP DBs (reverse ETL, tools like Hightouch / Census)
ML applications are similar; a typical pipeline:
- Feature engineering → Spark + Iceberg / Flink + Redis (feature stores like Feast manage both offline / online ends)
- Training → Spark MLlib / PyTorch on Ray / sklearn / xgboost / lightgbm
- Inference → inference services (Triton, TorchServe, KServe) + Redis vector / Qdrant / Pinecone for vector search
- RAG / LLM → stack a vector DB layer on top of the data stack, coexisting with traditional OLAP
This layer has no single mainstream tool — it’s an assembly of dozens of lightweight services, and it forms the interface between data teams and business engineering teams.
Orchestration and Governance — Airflow · Glue Catalog · DataHub
Scheduling and governance are cross-cutting concerns. Airflow decides “what task runs when, and what depends on what,” Glue Catalog decides “what is the schema of this table of mine,” DataHub decides “how did this table’s lineage form, who owns it, and what’s the quality.”
Apache Airflow · Workflow Orchestration
History — Airflow was developed by Maxime Beauchemin at Airbnb in 2014 (the same author as Superset), open-sourced in 2015, entered the Apache Incubator in 2016, and became a top-level project in 2019. Its core innovation is “workflow as Python code” — DAGs aren’t written in YAML or XML but in Python, with task dependencies declared through ordinary Python calls. This idea wasn’t mainstream in 2015 (Oozie / Azkaban represented the XML / YAML camp) but quickly became the standard. Astronomer is the commercial mothership, valued at $1.7B in 2022; it acquired Datakin (the company behind OpenLineage) in 2022.
Core design — Main concepts:
- DAG — one Python file = one DAG, described via @dag / @task decorators or Operator instances
- Operator — a task type (BashOperator / PythonOperator / SparkSubmitOperator / KubernetesPodOperator …)
- Sensor — wait for some condition (file appears / time arrives / upstream task completes)
- Scheduler / Worker / Webserver — three main components, distributable
- XCom — pass small values between tasks (< a few KB)
- TaskFlow API (2.0+) — pure Python decorator-style dependency declaration; DAGs read like normal code
- Provider package — hundreds of official plugins for Snowflake / dbt / Databricks / GCP / AWS / Slack and more
Users — Almost every data team runs Airflow. Airbnb, Lyft, Square, Robinhood, Tesla, Google (Cloud Composer is hosted Airflow), AWS (MWAA), ByteDance, DiDi, and others. Competitor differentiation:
- Dagster — open-sourced by Elementl, marketing asset-centric (treating “data assets” as first-class, not tasks)
- Prefect — Python-first, marketing “code is scheduling, hardly any new DSL needed,” clean
- Argo Workflows — fully K8s-native, YAML DAGs, ops-leaning SRE
- Mage / Kestra / Temporal — each has distinguishing features (UI-friendly / YAML / long-running stateful workflows)
Demo code — TaskFlow API writing an ETL DAG:
from datetime import datetime
from airflow.decorators import dag, task
@dag(
schedule='0 2 * * *', # Every day at 2:00
start_date=datetime(2026, 5, 1),
catchup=False,
tags=['marts'],
)
def daily_orders_pipeline():
@task
def extract_kafka_offsets() -> dict:
# Get yesterday's end offset from Kafka
return {'orders.v1': 12345678}
@task
def run_spark(offsets: dict) -> str:
# Trigger Spark job reading Iceberg + Kafka, writing back to Iceberg
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
# ... simplified: in practice use an Operator instead of @task ...
return 'iceberg.marts.fct_orders'
@task
def trigger_dbt(table: str) -> None:
# Trigger dbt run on downstream models
...
offsets = extract_kafka_offsets()
table = run_spark(offsets)
trigger_dbt(table)
daily_orders_pipeline()
The scheduler picks up this file and automatically builds the dependency graph: extract → spark → dbt, running once daily at 2:00.
AWS Glue Data Catalog · Metadata Service
History — AWS Glue launched in 2017; Glue Data Catalog is its metadata service component, API fully compatible with Apache Hive Metastore (a key point, since it means any Hive-compatible engine can use Glue as a metastore). Glue itself also includes Glue ETL (Spark-based ETL) and Glue Studio (UI), but in real production “using Glue” essentially means using Glue Catalog.
Core design — Glue Catalog is a fully managed metastore service:
- Stores Database / Table / Partition / Column / Property metadata
- Doesn’t store the data itself; data still lives in S3
- Fine-grained permissions via IAM + Lake Formation
- Seamless integration with Athena / EMR / Redshift Spectrum / Databricks / Iceberg / Snowflake (supported in 2024) and others
- Crawler — scans S3 paths and auto-infers schema, registering in catalog (suits ad-hoc data)
Trade-off: AWS-locked, rarely used cross-cloud; but within AWS it’s the de facto entrypoint for Iceberg / Delta / Hive metadata.
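The Crawler's schema inference can be sketched as a toy (nothing to do with Glue's actual classifiers — purely to show the mechanic): sample records, take the union of fields seen, and widen the type when samples disagree.

```python
def infer_schema(records):
    """Toy crawler-style inference: union of fields across sampled records,
    widening int -> double -> string when samples disagree."""
    WIDEN = {"int": 0, "double": 1, "string": 2}  # widening order

    def type_of(v):
        if isinstance(v, bool) or v is None:
            return "string"        # keep the toy simple: treat these as string
        if isinstance(v, int):
            return "int"
        if isinstance(v, float):
            return "double"
        return "string"

    schema = {}
    for rec in records:
        for col, val in rec.items():
            t = type_of(val)
            prev = schema.get(col, t)
            schema[col] = max(prev, t, key=WIDEN.__getitem__)  # widen on conflict
    return schema

sample = [
    {"order_id": 1, "amount": 9.99, "country": "US"},
    {"order_id": 2, "amount": 5, "note": "gift"},  # amount is int here; new column
]
print(infer_schema(sample))
# → {'order_id': 'int', 'amount': 'double', 'country': 'string', 'note': 'string'}
```

This also shows why crawler output on messy ad-hoc data drifts toward string columns — any conflicting sample widens the type, which is the usual complaint about crawled schemas.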
Users — Any data team of meaningful scale on AWS defaults to Glue Catalog as the metastore; EMR / Athena / Redshift Spectrum all default to it. Non-AWS alternatives: Hive Metastore (veteran, self-managed), Project Nessie (Iceberg-native catalog with git-like branches), Polaris (open-sourced by Snowflake), Unity Catalog (Databricks-led, open-sourced in 2024).
Demo code — AWS CLI registering an Iceberg table (in real scenarios, most are auto-registered by Spark / Iceberg writers; here we show what the low-level interface looks like):
cat > table.json <<'EOF'
{
"Name": "fct_orders",
"TableType": "EXTERNAL_TABLE",
"Parameters": {
"table_type": "ICEBERG",
"metadata_location": "s3://my-lakehouse/marts/fct_orders/metadata/00000-xxx.metadata.json"
},
"StorageDescriptor": {
"Location": "s3://my-lakehouse/marts/fct_orders/"
}
}
EOF
aws glue create-table \
--database-name marts \
--table-input file://table.json
After this, Athena / Spark / Trino can all SELECT * FROM marts.fct_orders and see this Iceberg table.
DataHub · Data Asset Platform
History — DataHub evolved from LinkedIn’s internal WhereHows project, open-sourced in 2019 as a next-generation metadata platform, joined Linux Foundation in 2020. Acryl Data (originally LinkedIn DataHub core team) was founded in 2021 for commercialization, 2023 Series A at ~$300M valuation. DataHub addresses metadata + governance problems: “whose table is this, what’s the upstream/downstream lineage, was it modified recently, does it meet compliance” — a different layer from Glue Catalog (which only handles schema registration).
Core design — Several key abstractions:
- Entity — Dataset / Dashboard / Chart / Pipeline / DataJob / DataFlow / MLModel / Tag / Glossary / User / Group …, uniformly modeled
- Aspect — a “facet” of an entity (schema, ownership, documentation, lineage, deprecation, test results)
- MCE / MCP (Metadata Change Event / Proposal) — event protocol for metadata writes
- GraphQL API — upper-layer query interface; UI is built on GraphQL
- Push + pull dual modes — both proactive external push (Python emitter) and pull via ingestion source (Trino / Snowflake / dbt / Airflow / Kafka and dozens more)
DataHub’s killer feature is “column-level lineage”: after parsing dbt / Spark queries, it can say precisely “fct_orders.country comes from dim_users.country via a join.”
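Column-level lineage is, underneath, just a graph over (table, column) nodes; a toy sketch of upstream traversal using the article's own fct_orders.country example (this is an illustration, not DataHub's data model):

```python
from collections import deque

# (table, column) -> upstream (table, column) edges — a toy lineage graph
LINEAGE = {
    ("fct_orders", "country"): [("dim_users", "country")],
    ("dim_users", "country"): [("raw.users", "country_code")],
    ("fct_orders", "amount_cents"): [("raw.orders", "amount")],
}

def upstream(node):
    """All transitive upstream columns of a (table, column) node, via BFS."""
    seen, queue = [], deque([node])
    while queue:
        for parent in LINEAGE.get(queue.popleft(), []):
            if parent not in seen:
                seen.append(parent)
                queue.append(parent)
    return seen

print(upstream(("fct_orders", "country")))
# → [('dim_users', 'country'), ('raw.users', 'country_code')]
```

The hard part in practice is not the traversal but building the edges — which is why DataHub parses dbt manifests and Spark query plans to extract them automatically.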
Users — LinkedIn (in-house), Saxo Bank, Adevinta, Wolt, Pinterest, ByteDance, Stripe, and others. Competitors:
- OpenMetadata (Collate company) — open-sourced 2022, later rising; more modern API / UI
- Amundsen (Lyft, open-sourced 2019) — earlier project but with weaker community momentum
- Marquez (originated at WeWork, later joined LF AI & Data) — marketing OpenLineage standard
- Atlan / Castor / Select Star / Sifflet — commercial SaaS data catalogs
Demo code — Python emitter pushing a Dataset:
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
DatasetPropertiesClass, OwnershipClass, OwnerClass, OwnershipTypeClass,
)
from datahub.emitter.mcp import MetadataChangeProposalWrapper
emitter = DatahubRestEmitter(gms_server='http://datahub-gms:8080')
urn = make_dataset_urn(platform='iceberg', name='marts.fct_orders', env='PROD')
mcp = MetadataChangeProposalWrapper(
entityUrn=urn,
aspect=DatasetPropertiesClass(
description='Daily orders fact table, grain (day, country, product_id)',
customProperties={'owner_team': 'commerce-data'},
),
)
emitter.emit(mcp)
mcp2 = MetadataChangeProposalWrapper(
entityUrn=urn,
aspect=OwnershipClass(owners=[
OwnerClass(
owner='urn:li:corpuser:frank',
type=OwnershipTypeClass.DATAOWNER,
)
]),
)
emitter.emit(mcp2)
After writing, marts.fct_orders in the DataHub UI gets a description, owner, tags; lineage is automatically pulled from dbt / Airflow metadata.
Data Flow Timeline — write path · read path
The architecture diagram only paints “which components are at which layer,” but misses “how data actually flows end-to-end.” The diagram below depicts the complete path of a single transaction (one order) from creation to consumption: the top is two parallel write paths (real-time / batch), the middle is the two shared hubs ClickHouse + Iceberg marts, and the bottom is three classes of read paths distributed from those hubs (API / BI / ML).
Key points to read from this diagram:
- The write path has two parallel entry points: CDC (Postgres → Debezium → Kafka) is streaming, SaaS (API → Airbyte) is batch; both ultimately converge into Kafka or Iceberg.
- The same data passes through two hubs: Flink real-time materializes high-frequency aggregations into ClickHouse (second-scale latency); Spark / dbt lands the full history into Iceberg (minute-to-hour scale). This isn’t replacement, it’s coexistence — real-time dashboards query ClickHouse, deep analytics / ML / ad-hoc query Iceberg.
- The read path splits by consumption scenario: Trino + Superset for data analysts / business, Backend API + Redis for business systems, Spark MLlib + Model Server for ML / AI.
- The two cross-cutting bands: Catalog is the source of truth for “what data exists where,” Airflow is the rhythm source for “what happens when” — they aren’t on the main path but every main path depends on them.
Industry Practice — Netflix · Uber · LinkedIn · Airbnb · Pinterest · Shopify · Stripe
To ground theoretical frameworks, the quickest way is to read the engineering blogs of companies that run the same stack to the extreme. Below we pick seven Western big-tech companies with sustained output on data infrastructure, each with one or two representative recent blogs or papers, dissecting what their stacks look like and what unique problems they solve. This section answers: “what do all the components discussed earlier look like running in production?”
Netflix · Iceberg’s Birthplace + Maestro Orchestration
Netflix is the birthplace of Apache Iceberg — Ryan Blue and Daniel Weeks designed it internally at Netflix to solve Hive’s pain points on S3 (slow LIST, no ACID, no schema evolution, no time travel), open-sourced in 2018. Today the core combo of Netflix’s data platform is Iceberg + Spark + Trino + Flink + Maestro (self-built workflow engine, open-sourced in 2024).
The recent representative blog Maestro: Netflix’s Workflow Orchestrator recounts Maestro’s design philosophy: horizontally scalable stateless microservices + CockroachDB persistence, replacing the previous-generation single-leader system Meson — an architecture shaped by the requirements of extreme scale plus multilingual users (SQL / Python / Notebook all needed). The companion blog Incremental Processing using Netflix Maestro and Apache Iceberg details how Maestro pairs with Iceberg for incremental materialization.
Lessons for Chinese readers: Netflix chose Iceberg over Delta fundamentally for “multi-engine neutrality” — Spark / Trino / Flink / Presto / Hive / Impala all need to read the same table. If your stack has more than one compute engine, that’s a strong signal.
Uber · Hudi’s Birthplace + Real-Time Data Stack
Uber is the birthplace of Apache Hudi (Vinoth Chandar et al., open-sourced 2017), and the industry’s most complete “stream-batch unified” industrial case. The most authoritative reference is the SIGMOD 2021 paper Real-time Data Infrastructure at Uber (Yupeng Fu, Chinmay Soman), which systematically explains Uber’s internal Kafka + Flink + Pinot + Hive + Hudi collaboration — from ride orders to supply-demand matching to risk control, nearly all trip data runs through this pipeline.
Hudi’s origin blog Hudi: Uber Engineering’s Incremental Processing Framework on Apache Hadoop clarifies the pain points Hudi solves: upserts on large tables, which Hive-era systems handled by “rewriting all of the past N days’ partitions,” replaced by Hudi with ACID + indexes + incremental consumption.
Lessons for Chinese readers: Uber’s stack is the most worth-comparing template for ride-hailing / instant-delivery / supply-demand-matching businesses in China — Flink + Pinot for sub-second real-time decisioning, Hive + Hudi for offline modeling, with Kafka as the shared event source rather than two separate pipelines.
LinkedIn · Kafka’s Birthplace + DataHub
LinkedIn is the birthplace of Apache Kafka (Jay Kreps, Neha Narkhede, Jun Rao, 2010), and the origin of DataHub, Pinot, Samza, Voldemort. The whole “log-centric” data architecture philosophy traces back to Jay Kreps’ 2013 widely-cited long-form essay The Log: What every software engineer should know about real-time data’s unifying abstraction — distilling “log” as the unified abstraction across databases, distributed systems, and data integration. This article is a “must-read classic” for data engineers; after reading you’ll understand why Kafka is more than just a message queue.
In 2020 LinkedIn open-sourced their internal metadata platform: Open sourcing DataHub: LinkedIn’s metadata search and discovery platform — most enterprises’ metadata / lineage / data governance references today evolved from this architecture.
Lessons for Chinese readers: LinkedIn’s design philosophy is “event streams are the truth, databases are materialized views” — this is the ideological root of the CDC + stream processing paradigm.
Airbnb · Airflow + Superset + Minerva
Airbnb is the mothership where Maxime Beauchemin simultaneously incubated Airflow + Superset — these two projects are today the open-source de facto standards for workflow orchestration and BI respectively. Airflow: a workflow management platform (2015) is the foundational article for the workflow orchestration field — before this the industry was writing XML / YAML DAGs; Airflow first established “DAG is Python code” as the standard.
More recent How Airbnb achieved metric consistency at scale (2021) introduces the internal “metric layer” Minerva — 12,000+ metrics and 4,000+ dimensions consistent on a single source of truth; underlying Airflow + Hive + Spark + Presto + Druid. The Minerva concept later gave rise to dbt Semantic Layer, Cube.js, and other metric-layer products.
Lessons for Chinese readers: The biggest takeaway from the Airbnb case is “metric definition is a first-class citizen of data architecture” — don’t let the same KPI have ten SQL implementations across ten dashboards; this is the most common pitfall for mid-sized teams.
Pinterest · 500 PB Data Lake + Spinner Orchestration
Pinterest is a large-scale Iceberg user (disclosed at AWS re:Invent 2025: 500 PB+ data lake, 100K+ tables) and built its own workflow platform, Spinner, on top of Airflow. Spinner: Pinterest’s Workflow Platform (2022) recounts the migration from the old scheduler Pinball to Spinner (based on Airflow + Kubernetes), carrying 4,000+ workflows, 10,000 daily flow runs, 38,000 tasks.
The early-2026 next-generation lake ingestion framework Next Generation DB Ingestion at Pinterest is more recent practice: Debezium / TiCDC + Kafka + Flink + Spark + Iceberg unified CDC lake-ingestion framework, compressing online-OLTP-to-offline-lakehouse latency from hours to minutes.
Lessons for Chinese readers: Pinterest is the Western company whose business shape most resembles ByteDance / Xiaohongshu (high-frequency user interaction + recommendations + content); their stack choices have the highest reference value.
Shopify · dbt + BigQuery Benchmark
Shopify is the typical example of the “lightweight data stack” route — no Spark / Flink / Kafka-class heavy components; it relies entirely on BigQuery’s compute plus dbt’s engineering discipline, scaled to public-company size. The representative blog How to Build a Production Grade Workflow with SQL Modelling (2020) details the internal solution codenamed “Seamster”: dbt + BigQuery, with data models divided into three layers (Base / Application-ready / Presentation) and 200+ data scientists collaborating.
Lessons for Chinese readers: Shopify is a counter-example to the assumption that running at scale requires the full stack — if your team isn’t Uber / Netflix scale, this stack is more worth copying than “I should also adopt Kafka + Flink.” The China equivalents of BigQuery are Alibaba MaxCompute / ByteHouse from Volcano Engine, with similar stack thinking.
Stripe · Trino + Iceberg Replacing Hive
Stripe’s early stack was represented by the 2016 Reproducible research: Stripe’s approach to data science — a combination of Jupyter + Hadoop + Scalding + Impala + Redshift. The 2023 Trino Fest talk Inspecting Trino on ice updates the current state: Iceberg has fully replaced Hive, with Trino as the federated SQL entrypoint; a Trino event listener audits every query into PostgreSQL, used to reverse-engineer how Iceberg tables are actually queried.
Lessons for Chinese readers: Stripe is a sample of data-stack continuous iteration — the same company transitioning fully from Hadoop+Hive to Iceberg+Trino over seven years validates that the lakehouse + federated query route from §05 / §06 holds up over time.
How to Choose — by team size and business shape
When transplanting this ecosystem to your own organization, very few people deploy all 16 projects from day one. Real selection follows several rules of thumb:
| Team / Stage | Minimum viable stack | Key omissions |
|---|---|---|
| Early (< 10-person data team) | Fivetran / Airbyte + Snowflake/BigQuery + dbt + Metabase | Skip Kafka / Flink / Iceberg / ClickHouse / Airflow (use dbt Cloud for scheduling) |
| Mid (10–50 people) | Airbyte + Iceberg + Spark + dbt + Trino + Superset + Airflow | Add lakehouse + orchestration + federated query; Kafka / Flink depends on real-time needs |
| Late (>50 people + real-time business) | Full stack: Debezium + Kafka + Flink + Iceberg + ClickHouse + dbt + Trino + Redis + Airflow + DataHub | Governance / lineage / real-time warehouse — none can be omitted |
| AI / Recs / Risk-heavy | All of the above + Feature Store (Feast) + Vector DB (Pinecone/Qdrant) + Ray | ML pipeline becomes the new main path |
Several common mistakes:
- Adopting Kafka blindly — without a real event-stream workload (second-level latency plus multi-consumer fan-out), Kafka is a heavy operational burden. For a few daily batch syncs, Airbyte / Fivetran is entirely sufficient.
- Using dbt as Airflow — dbt is a modeling tool, not a workflow engine. Complex dependencies and cross-system orchestration (trigger Spark, then dbt, then reverse ETL, then a Slack notification) still need Airflow.
- Treating lakehouse vs. real-time warehouse as either-or — this is a false dichotomy. Running both is the norm: the lakehouse stores "everything," OLAP materializes "the most recent hot data."
- Leaving the metadata catalog until last — bolting a catalog onto a data estate that has already grown chaotic is extremely expensive to migrate. Use Glue / DataHub from day one, even if only for schema registration.
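The "day one" catalog habit is cheap to start. A minimal sketch of registering a table schema in the AWS Glue Data Catalog via boto3 — the pure function builds the `TableInput` payload for `glue.create_table()`, and the actual API call (database name, bucket path, and columns are all hypothetical) sits behind the main guard since it needs AWS credentials:

```python
def glue_table_input(name, columns, location):
    """Build the TableInput payload for glue.create_table().

    columns: list of (name, type) pairs, e.g. [("user_id", "bigint")].
    """
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


if __name__ == "__main__":
    import boto3  # needs AWS credentials at call time

    glue = boto3.client("glue")
    glue.create_table(
        DatabaseName="analytics",  # hypothetical database
        TableInput=glue_table_input(
            "orders",
            [("order_id", "bigint"), ("amount", "decimal(12,2)"), ("created_at", "timestamp")],
            "s3://my-lake/orders/",  # hypothetical bucket
        ),
    )
```

Ten lines of registration per table buys you a queryable answer to "what data exists" before any governance tooling lands; Trino, Spark, and Athena can all resolve the same Glue entry.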
One final judgment: the direction this stack is evolving in isn't "one project eats all layers" but "the boundary of every layer blurs":
- Iceberg + ClickHouse cooperation (ClickHouse 24.5+ can natively read and write Iceberg) → the lakehouse / OLAP boundary blurs
- Snowflake / Databricks both building “lakehouse + AI” platforms → compute / storage / query three-layer vertical integration
- Confluent + Flink + Kafka unification → Ingestion / Processing two layers merge
- dbt Fusion + Iceberg + Trino → Modeling / Query / Storage tighter integration
- The rise of the AI layer (Vector DB / Feature Store / Agent runtime) is stacking a whole new column of components on the right side of the data stack
Understanding why each layer exists is the only way to know what to cut and what to keep when the boundaries shift. This article is the starting point for that judgment.
References — project sites · big-tech blogs · industry surveys
Project Sites
- Apache Kafka — distributed commit log. kafka.apache.org
- Debezium — database CDC framework. debezium.io
- Airbyte — open-source ELT connectors. airbyte.com
- Fivetran — closed-source managed ELT. fivetran.com
- Apache Spark — distributed batch + ML engine. spark.apache.org
- Apache Flink — stream-native stateful compute. flink.apache.org
- dbt — in-warehouse SQL modeling and engineering. getdbt.com
- Amazon S3 — de facto object-storage standard. aws.amazon.com/s3
- Cloudflare R2 — S3-compatible, zero egress fees. cloudflare.com/developer-platform/r2
- Apache Iceberg — open table format. iceberg.apache.org
- Delta Lake — Iceberg peer competitor (Databricks-led). delta.io
- Apache Hudi — Iceberg peer competitor (Uber-led). hudi.apache.org
- ClickHouse — columnar real-time OLAP. clickhouse.com
- Trino — MPP federated SQL engine. trino.io
- Redis — in-memory data structure store. redis.io
- Valkey — Linux Foundation fork after Redis’s 2024 license change. valkey.io
- Apache Superset — open-source BI and data exploration. superset.apache.org
- Metabase — self-service BI. metabase.com
- Apache Airflow — workflow orchestration. airflow.apache.org
- Dagster / Prefect — Airflow peer alternatives. dagster.io · prefect.io
- AWS Glue Data Catalog — metadata service. aws.amazon.com/glue
- DataHub — data asset + lineage + governance platform. datahubproject.io
- OpenMetadata — DataHub peer alternative. open-metadata.org
Commercial Motherships
- Confluent (Kafka) — founded by Jay Kreps, Neha Narkhede, Jun Rao in 2014. confluent.io
- Databricks (Spark) — founded by Matei Zaharia et al. in 2013, 2024 valuation $62B. databricks.com
- Astronomer (Airflow) — 2022 valuation $1.7B. astronomer.io
- dbt Labs (dbt) — founded by Tristan Handy in 2016 at Fishtown Analytics, 2022 valuation $4.2B. getdbt.com
- Starburst (Trino) — founded by Trino founding team led by Martin Traverso in 2018. starburst.io
- Acryl Data (DataHub) — founded by LinkedIn DataHub core team in 2021. acryldata.io
- Preset (Superset) — founded by Maxime Beauchemin in 2019. preset.io
- Snowflake — acquired Tabular, Iceberg's main commercial entity, in 2024 for $1B+. snowflake.com announcement
Big-Tech Blogs and Papers
- Jay Kreps, “The Log: What every software engineer should know about real-time data’s unifying abstraction” (LinkedIn Engineering, 2013) — foundational long-form for Kafka / stream-processing paradigm. engineering.linkedin.com
- Maxime Beauchemin, “Airflow: a workflow management platform” (Airbnb Engineering, 2015) — original Airflow public release post. medium.com/airbnb-engineering
- Vinoth Chandar & Prasanna Rajaperumal, “Hudi: Uber Engineering’s Incremental Processing Framework on Apache Hadoop” (Uber, 2017) — Hudi origin blog. uber.com/blog/hoodie
- Kerem Sahin, Mars Lan, Shirshanka Das, “Open sourcing DataHub: LinkedIn’s metadata search and discovery platform” (LinkedIn, 2020-02) — DataHub open-sourcing. linkedin.com/blog/engineering
- Michelle Ark & Chris Wu, “How to Build a Production Grade Workflow with SQL Modelling” (Shopify Engineering, 2020-11) — dbt + BigQuery three-layer modeling (Seamster). shopify.engineering
- Robert Chang et al., “How Airbnb achieved metric consistency at scale” (Airbnb Engineering, 2021-04) — first Minerva metric layer article. medium.com/airbnb-engineering
- Yupeng Fu & Chinmay Soman, “Real-time Data Infrastructure at Uber” (SIGMOD 2021) — Uber real-time stack system paper (Kafka + Flink + Pinot + Hudi). arxiv.org/abs/2104.00087
- Ace Haidrey et al., “Spinner: Pinterest’s Workflow Platform” (Pinterest Engineering, 2022-02) — Airflow + K8s self-built orchestration. medium.com/pinterest-engineering
- Kevin Liu, “Inspecting Trino on ice” (Trino Fest 2023, Stripe) — industrial case of Iceberg + Trino replacing Hive. trino.io/blog
- Jun He et al., “Maestro: Netflix’s Workflow Orchestrator” (Netflix Tech Blog) — horizontal scaling + CockroachDB replaces Meson. netflixtechblog.com
- “Incremental Processing using Netflix Maestro and Apache Iceberg” (Netflix Tech Blog) — Maestro + Iceberg incremental materialization. netflixtechblog.com
- Liang Mou et al., “Next Generation DB Ingestion at Pinterest” (Pinterest Engineering, 2026-02) — Debezium / TiCDC + Kafka + Flink + Iceberg CDC lake ingestion. medium.com/pinterest-engineering
- Dan Frank, “Reproducible research: Stripe’s approach to data science” (Stripe Blog, 2016) — early data stack + reproducibility. stripe.com/blog
Industry Surveys and Comparisons
- “Apache Hudi vs Delta Lake vs Apache Iceberg: Lakehouse Feature Comparison” (Onehouse) — in-depth comparison of three open table formats, frequently cited in Chinese community. onehouse.ai/blog
- Netflix Tech Blog — continuous data-infrastructure updates. netflixtechblog.com
- Uber Engineering Blog — first-hand Hudi / Pinot / Flink cases. uber.com/blog/engineering
- LinkedIn Engineering Blog. engineering.linkedin.com
- Airbnb Engineering Blog. medium.com/airbnb-engineering
- Pinterest Engineering Blog. medium.com/pinterest-engineering
- Databricks Blog — frontline perspective on lakehouse / AI platform. databricks.com/blog
- Confluent Blog — Kafka / Flink / event-driven architecture. confluent.io/blog