A Layer Map of the Modern Data Engineering Ecosystem

The biggest change in the data stack over the past decade is that “the Hadoop-era unified stack has fallen apart”: where one Hadoop ecosystem used to package storage, compute, scheduling, and querying, each layer has now split into independent star projects. Build a production-grade data platform today and you will hear Debezium, Kafka, Airbyte, Spark, Flink, dbt, Iceberg, ClickHouse, Trino, Redis, Superset, Metabase, Airflow, DataHub — a long list of names that correspond to different cells on the same layered architecture diagram.

This article organizes that map into six layers: Sources → Ingestion → Compute → Storage → Query → Serve, plus two cross-cutting concerns spanning the full stack: Orchestration and Governance. Each project is unpacked into four parts: History / Core Design / Users / Demo Code. The demos don’t aim to be complete runnable projects, only the minimal snippet that expresses each tool’s “soul.” After reading, you should be able to answer two questions for every tool: why it lives in this layer, and what the simplest line of code looks like.

Overview — Six Layers · One Architecture Diagram

First, the full picture. The diagram below is the skeleton of this article: six horizontal layers from top to bottom, each showing the 1–2 most mainstream representative projects. Airflow on the left spans vertically because scheduling is a cross-cutting concern that touches every compute layer; the Catalog at the bottom is another cross-cutting concern, since the truth about “what data exists” in every storage system is registered in the metadata catalog.

[Architecture diagram: 01 SOURCES (OLTP DBs: Postgres · MySQL · MongoDB; events / logs: web · app · server; SaaS APIs: Salesforce · Stripe · ads) → 02 INGEST (Debezium CDC · Kafka streaming bus · Airbyte / Fivetran SaaS batch sync) → 03 COMPUTE (Spark heavy batch + ML · Flink streaming real-time · dbt in-warehouse SQL modeling) → 04 STORAGE (lakehouse, storage-compute separation: S3 / R2 + Iceberg, ~$0.023/GB/month, de facto standard, with Delta / Hudi as peer competitors; real-time OLAP: ClickHouse, columnar + MergeTree, low latency · high concurrency) → 05 QUERY (Trino federated SQL · Spark SQL complex analytics · Redis result cache / KV) → 06 SERVE (Superset / Metabase BI · backend APIs · ML / AI applications). Cross-cutting: Airflow orchestration (trigger batch jobs · invoke dbt · monitor streaming jobs · run reverse ETL) and Glue Catalog / DataHub governance (schema registry · data lineage · data quality · permissions).]
Six-layer architecture diagram — the division of labor in the mainstream data stack and representative projects. Orange borders are hubs and cross-cutting layers (orchestration / governance), blue is the streaming path, solid arrows are primary data flow direction.

A few facts worth pointing out from this diagram:

Below we go layer by layer.

Sources — OLTP databases · logs · SaaS

This layer isn’t strictly “tools” but rather an inventory of where data comes from. Roughly three categories: OLTP databases (Postgres / MySQL / MongoDB) holding transactional business state; event and log streams (web / app / server logs); and SaaS APIs (Salesforce, Stripe, ad platforms) whose data lives in third-party systems.

Tool selection in the three layers that follow (Ingestion / Compute / Query) is fundamentally determined by which categories your data sources fall into and whether you need real-time or batch: the CDC path goes Debezium → Kafka → Flink, telemetry goes through Kafka, and SaaS goes through Airbyte / Fivetran.

Ingestion — Debezium · Kafka · Airbyte · Fivetran

The ingestion layer is the engineering form of “moving data from source to data platform.” The key split is streaming vs batch: database changes and event streams must use streaming pipelines (Debezium → Kafka); SaaS pulls fit scheduled batch (Airbyte / Fivetran).

Debezium · Database Change Capture (CDC)

History — Debezium is a Red Hat-led project, open-sourced in 2016. Its original intent is to “treat database changes as event streams”: instead of polling whole tables to compute diffs, read the MySQL binlog / Postgres WAL / MongoDB oplog directly and convert each INSERT / UPDATE / DELETE into a real-time JSON event. Early Debezium was tightly coupled to Kafka Connect as its runtime; later releases added Debezium Server (a standalone process that can write directly to Pulsar / Kinesis / RabbitMQ) and Debezium Engine (embeddable into any JVM application).

Core design — Three layers of abstraction:

The essential difficulty of CDC is seamlessly splicing the initial full snapshot with the subsequent incremental stream. Debezium solves this with watermark-based incremental snapshots: while the snapshot runs, it keeps consuming the binlog, and uses watermarks to mark cut-off points that guarantee no misses and no duplicates.
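
For a feel of how this looks in operation, a hedged sketch: assuming the connector has been configured with a signaling table (Debezium’s signal.data.collection property) named debezium_signal, an ad-hoc incremental snapshot of one table is triggered by inserting a signal row while the connector keeps streaming the binlog:

-- Sketch: trigger an ad-hoc incremental snapshot (id and table names are illustrative)
INSERT INTO debezium_signal (id, type, data)
VALUES (
  'adhoc-2026-05-11',
  'execute-snapshot',
  '{"data-collections": ["orders.line_items"], "type": "incremental"}'
);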

Users — LinkedIn (early major contributor), Convoy, WePay, JD.com, ByteDance, and others run Debezium CDC pipelines in production. A common combination in China is Debezium + Kafka + Flink CDC (the flink-cdc-connectors project contributed by Alibaba essentially uses Debezium’s Engine library to read binlog). It’s the de facto open-source CDC standard, with almost no competitive open-source alternatives (closed-source competitors include Fivetran HVR, Striim, Qlik Replicate).

Demo code — Debezium MySQL Connector configuration (registered via Kafka Connect REST API in JSON):

{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql-prod.internal",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "${SECRET:mysql_pwd}",
    "database.server.id": "184054",
    "topic.prefix": "inventory",
    "database.include.list": "orders,inventory",
    "table.include.list": "orders.line_items,inventory.products",
    "schema.history.internal.kafka.bootstrap.servers": "kafka-1:9092",
    "schema.history.internal.kafka.topic": "schema-changes.inventory",
    "snapshot.mode": "initial"
  }
}

Once registered, per-row change events appear in Kafka topic inventory.orders.line_items; consumers only need to subscribe to this topic to receive a downstream sync stream with sub-second latency.
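
To make the event stream concrete, this is roughly the payload of a single UPDATE event (simplified to the envelope’s standard fields; the values are invented):

{
  "op": "u",
  "ts_ms": 1715472000000,
  "source": { "connector": "mysql", "db": "orders", "table": "line_items", "pos": 4096 },
  "before": { "id": 1001, "product_id": 7, "qty": 1 },
  "after":  { "id": 1001, "product_id": 7, "qty": 2 }
}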

Kafka · Distributed Commit Log

History — LinkedIn started the Kafka project internally in 2010 (Jay Kreps, Neha Narkhede, Jun Rao) to replace two pipelines that duplicated effort: one for user activity events, one for operations monitoring data. It was open-sourced in 2011, graduated to an Apache top-level project in 2012, and the three founders left LinkedIn in 2014 to found Confluent (the commercial company). Over a decade, Kafka has evolved from a pure message queue into an “event streaming platform”:

Core design — Kafka’s foundation is abstracting messages as “partitioned, ordered append logs.”

Users — Kafka is the de facto standard distributed data bus. LinkedIn, Uber, Netflix, Airbnb, Twitter, Alibaba, ByteDance, and Meituan all use it; LinkedIn’s internal Kafka clusters process 7 trillion messages per day. The commercial layer, Confluent (NASDAQ: CFLT, market cap $7B+), monetizes primarily through the hosted Confluent Cloud service and the surrounding ecosystem (Kafka Streams / Connect / Schema Registry / ksqlDB). Competitors:

Demo code — Java Producer + Consumer (Kafka’s official client, JVM is its native runtime):

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.*;
import java.time.Duration;
import java.util.*;

// Producer: connection config + serializers
Properties p = new Properties();
p.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092");
p.put("key.serializer",   StringSerializer.class.getName());
p.put("value.serializer", StringSerializer.class.getName());

try (var producer = new KafkaProducer<String, String>(p)) {
    producer.send(new ProducerRecord<>("user.events", "u-42",
        "{\"event\":\"login\",\"ts\":1715472000}"));
}

p.put("group.id", "analytics-pipeline");
p.put("auto.offset.reset", "earliest");
p.put("key.deserializer",   StringDeserializer.class.getName());
p.put("value.deserializer", StringDeserializer.class.getName());

try (var consumer = new KafkaConsumer<String, String>(p)) {
    consumer.subscribe(List.of("user.events"));
    while (true) {
        for (var r : consumer.poll(Duration.ofSeconds(1))) {
            System.out.println(r.key() + " " + r.value());
        }
    }
}

Producer writes to partitions by key hash; Consumer joins a group and is automatically assigned partitions for parallel consumption — this is what Kafka’s “horizontal scaling” looks like to the client.
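
That parallelism is fixed when the topic is created. A sketch with the stock CLI tools (broker address and topic name reuse the demo above):

# Create a topic with 6 partitions and 3 replicas:
# up to 6 consumers in one group can then read in parallel
bin/kafka-topics.sh --bootstrap-server kafka-1:9092 \
    --create --topic user.events --partitions 6 --replication-factor 3

# Inspect per-partition offsets and lag for the consumer group
bin/kafka-consumer-groups.sh --bootstrap-server kafka-1:9092 \
    --describe --group analytics-pipeline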

Airbyte · Open-Source ELT Connectors

History — Airbyte was founded in 2020 by former LiveRamp engineers Michel Tricot and John Lafleur. It targets the pain point of “connectors for long-tail SaaS data sources”: every company needs to extract Stripe, Salesforce, Mixpanel, HubSpot, Intercom, etc. into its warehouse, but closed-source Fivetran is too expensive and in-house connectors are laborious. Airbyte’s pitch is an open-source + community-contributed connector catalog, currently with 600+ Source / Destination connectors. Business model: Airbyte Open Source (MIT) + Airbyte Cloud (SaaS, billed per row) + Airbyte Enterprise (self-hosted commercial). 2022 Series B at $1.5B valuation.

Core design — Airbyte packages every connector as a Docker container and communicates with the scheduler via the Airbyte Protocol (JSON over stdin/stdout). This design lets connectors be written in any language, with any dependencies. Core components:

PyAirbyte, released in 2024, lets you call a connector as a Python library directly without starting a full Airbyte deployment — very popular in notebook scenarios.
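
A minimal PyAirbyte sketch, assuming the source-stripe connector and illustrative credentials:

import airbyte as ab

# get_source downloads and runs the connector locally; no Airbyte server involved
source = ab.get_source(
    "source-stripe",
    config={
        "account_id": "acct_xxx",           # illustrative values
        "client_secret": "sk_live_xxx",
        "start_date": "2026-01-01T00:00:00Z",
    },
)
source.check()                        # validate credentials
source.select_streams(["customers"])  # pick a subset of streams
result = source.read()                # records land in a local cache (DuckDB by default)
df = result["customers"].to_pandas()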

Users — Mid-sized companies + data teams are the mainstay; Cargill, Calendly, Perplexity, Replicate, and others use it. It’s the de facto open-source ELT standard. Competitors:

Demo code — Airbyte’s manifest.yaml (low-code connector, declarative pulling from a REST API):

version: 0.50.0
type: DeclarativeSource
check:
  type: CheckStream
  stream_names: ["customers"]
streams:
  - type: DeclarativeStream
    name: customers
    primary_key: ["id"]
    retriever:
      type: SimpleRetriever
      record_selector:
        type: RecordSelector
        extractor:
          type: DpathExtractor
          field_path: ["data"]
      requester:
        type: HttpRequester
        url_base: "https://api.example.com/v1"
        path: "/customers"
        http_method: "GET"
        authenticator:
          type: BearerAuthenticator
          api_token: "{{ config['api_key'] }}"
      paginator:
        type: DefaultPaginator
        pagination_strategy:
          type: CursorPagination
          cursor_value: "{{ response.next_cursor }}"
        page_token_option:
          type: RequestOption
          field_name: "cursor"
          inject_into: "request_parameter"

This YAML is a complete Airbyte Source connector without writing a line of Python.

Fivetran · Closed-Source Managed ELT

History — Fivetran was founded by George Fraser and Taylor Brown in 2012 (YC W13), eight years before Airbyte. It markets “zero-maintenance SaaS sync”: customers just authorize, with no deployment and no connector maintenance, fully SaaS. This positioning made it the default ELT choice during the 2017–2020 rise of cloud warehouses (Snowflake, BigQuery, Redshift); a 2021 Series D valued it at $5.6B, and it later acquired HVR to round out its CDC capability. The business model is pure MAR (monthly active rows) billing, one of the most expensive layers in the data stack — which is precisely what gave rise to open-source alternatives like Airbyte.

Core design — Fivetran uses a unified “schema-on-write” model: each connector automatically replicates the source schema to the target warehouse; new columns / new tables sync automatically. CDC uses binlog; SaaS uses API. All connectors are closed-source; Fivetran maintains quality and upgrades themselves. This contrasts cleanly with Airbyte: Fivetran sells “peace of mind,” Airbyte sells “control + cheap.”

Users — Notion, Square, DocuSign, Autodesk, and many other public companies use Fivetran. Combined with Snowflake / BigQuery / dbt, it forms the “modern data stack” trio (EL → Fivetran; T → dbt; Storage → Snowflake).

Demo code — Terraform configuration for a Fivetran connector (Fivetran itself is UI/API-driven, but managing it with Terraform is the community-mainstream practice):

resource "fivetran_connector" "stripe_to_snowflake" {
  group_id  = fivetran_group.prod.id
  service   = "stripe"
  sync_frequency = 60   # minutes

  destination_schema {
    name = "stripe_raw"
  }

  config {
    secret_key = var.stripe_secret_key
  }
}

resource "fivetran_connector_schedule" "stripe_schedule" {
  connector_id     = fivetran_connector.stripe_to_snowflake.id
  schedule_type    = "auto"
  paused           = false
}

After terraform apply, a few minutes later Snowflake will have dozens of tables like stripe_raw.charges / stripe_raw.customers mirroring the Stripe API object model.

Compute — Spark · Flink · dbt

Once data is in the platform, it must be cleaned, aggregated, joined, and feature-engineered — this is the work of the compute layer. The three mainstream engines split by “processing paradigm”: Spark for heavy batch + general compute, Flink for stream-native + stateful compute, dbt for SQL templating + in-warehouse transformation.

Apache Spark · Distributed Batch and ML

History — Spark originated at UC Berkeley’s AMPLab (2009) as Matei Zaharia’s PhD project, was open-sourced in 2010, entered the Apache incubator in 2013, and became a top-level project in 2014. At a time when Hadoop MapReduce was clunky and slow, Spark proposed the RDD (Resilient Distributed Dataset) abstraction plus in-memory compute, eliminating the disk IO of each MapReduce round; typical ETL jobs ran 10–100× faster. Subsequent evolution:

Databricks (founded 2013 by Matei Zaharia and the AMPLab team) is the commercial mothership of Spark, valued at $62B in 2024 with an IPO widely anticipated. Managed Spark in the cloud largely means Databricks, AWS EMR, or GCP Dataproc.

Core design — Spark’s core is “staged DAG execution”:

Users — Almost every large data team uses Spark. Apple, Uber, Netflix, LinkedIn, ByteDance, Alibaba, Tencent all run PB-scale Spark workloads. Spark’s strengths are comprehensive batch + ML integration, weakness is stream processing (micro-batch latency can’t beat Flink). Common combo: Iceberg + Spark + Airflow runs daily / hourly ETL.

Demo code — A Scala aggregation job (reads Iceberg, writes back to Iceberg). Spark itself is written in Scala; the Scala API is the closest to the engine:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder
  .appName("daily_orders_summary")
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.iceberg.type", "glue")
  .getOrCreate()

import spark.implicits._

val orders = spark.read.table("iceberg.raw.orders").where($"status" === "paid")

val daily = orders
  .groupBy(to_date($"created_at").as("day"), $"country")
  .agg(count("*").as("order_count"),
       sum($"total_cents").as("gmv_cents"))
  .orderBy("day", "country")

daily.writeTo("iceberg.marts.daily_orders").overwritePartitions()

Running this job, Spark automatically handles Iceberg table partition pruning, predicate pushdown, and shuffle tuning.

Apache Flink · Stateful Stream Processing

History — Flink originated as the Stratosphere research project at TU Berlin (2010); the core team (Stephan Ewen, Kostas Tzoumas, and others) founded Data Artisans in 2014, renamed Stratosphere to Flink, and contributed it to Apache, where it became a top-level project the same year. Flink’s killer feature is true streaming + event time + exactly-once semantics — three things that were trade-offs in the Spark Streaming era; Flink gave a unified solution.

In 2019 Alibaba acquired Data Artisans for €90M, renaming it Ververica; Flink saw large-scale deployment at Alibaba for Double 11 real-time dashboards / real-time risk control / real-time recommendation. In 2023 Confluent acquired Immerok (a subsequent venture from the original Data Artisans founding team), betting on “Kafka + Flink” as a unified de facto standard.

Core design — Flink’s core abstractions are DataStream (unbounded stream) and State (operator-held local state).
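
A sketch of what keyed state looks like in the Java DataStream API: a per-user order counter held in ValueState. Order is an assumed input POJO; source, keyBy, and sink wiring are omitted:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// assumed POJO: public class Order { public long userId; /* ... */ }
public class OrderCounter extends KeyedProcessFunction<Long, Order, Tuple2<Long, Long>> {
    private transient ValueState<Long> count;  // local to the current key

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("order-count", Long.class));
    }

    @Override
    public void processElement(Order order, Context ctx, Collector<Tuple2<Long, Long>> out)
            throws Exception {
        Long current = count.value();              // null on the first event for this key
        long next = (current == null ? 0L : current) + 1;
        count.update(next);                        // snapshotted at checkpoints, restored on failure
        out.collect(Tuple2.of(ctx.getCurrentKey(), next));
    }
}
// usage sketch: orders.keyBy(o -> o.userId).process(new OrderCounter())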

Users — Alibaba (entire group), Uber (real-time pricing), Netflix (real-time recommendation features), LinkedIn (real-time data streams), ByteDance (real-time warehouse), DiDi (real-time risk control). The default engine for real-time warehouses at large Chinese tech firms. Competitors:

Demo code — Flink SQL real-time aggregation, consuming Kafka stream and writing into Iceberg:

-- Source: Kafka topic
CREATE TABLE orders_stream (
  order_id     BIGINT,
  user_id      BIGINT,
  amount_cents BIGINT,
  ts           TIMESTAMP_LTZ(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector'        = 'kafka',
  'topic'            = 'orders.v1',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format'           = 'json',
  'scan.startup.mode'= 'latest-offset'
);

-- Sink: Iceberg table
CREATE TABLE orders_5m_agg WITH (
  'connector' = 'iceberg',
  'catalog-name' = 'glue',
  'database-name' = 'marts'
) AS
SELECT
  TUMBLE_START(ts, INTERVAL '5' MINUTE) AS window_start,
  COUNT(*)                              AS orders,
  SUM(amount_cents)                     AS gmv_cents
FROM orders_stream
GROUP BY TUMBLE(ts, INTERVAL '5' MINUTE);

After submission, Flink continuously consumes Kafka, aggregates over 5-minute event-time windows, writes exactly-once into Iceberg, with second-level latency.

dbt · In-Warehouse SQL Modeling

History — dbt (data build tool) was created in 2016 by Tristan Handy at the consulting firm Fishtown Analytics (later renamed dbt Labs); the early core team included Drew Banin, Connor McArthur, and Jason Ganz. dbt didn’t invent a new compute engine — it’s just a SQL templating compiler + execution framework — compiling models/orders.sql into CREATE TABLE ... and pushing it to Snowflake / BigQuery / Redshift / Databricks / Postgres / Trino / DuckDB for execution.

Its growth logic: in the cloud-warehouse era, compute is outsourced to Snowflake; what data teams really need is an engineered SQL workflow — version control, dependency management, testing, documentation, CI/CD. dbt perfectly filled this gap and birthed the new job title “Analytics Engineer.” A 2022 Series D valued dbt Labs at $4.2B. The commercial layer dbt Cloud (hosted IDE + scheduling + docs) is the main revenue source. In 2025, dbt Labs launched dbt Fusion (a Rust engine built on the SDF acquisition, with roughly 30× faster SQL / model parsing).

Core design — Built around a few key concepts:

dbt’s strength is forcing analytics workflow to conform to “software engineering”: every model has an owner, has tests, has docs, goes through git commits / CI / pull request review — a stark contrast with the chaos of “write an ad-hoc SQL inside a BI tool” from ten years ago.

Users — Standard issue for nearly every modern data stack team; JetBlue, Casper, HubSpot, GitLab, Notion, Vercel, and many more. The de facto standard. The open-source ecosystem dbt-utils / dbt-expectations / elementary / re_data is very active. Competitors:

Demo code — An incremental model + test:

-- models/marts/fct_orders.sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='append_new_columns'
) }}

select
    o.order_id,
    o.user_id,
    o.created_at,
    o.total_cents,
    u.country
from {{ source('raw', 'orders') }} o
left join {{ ref('dim_users') }} u using (user_id)

{% if is_incremental() %}
where o.updated_at >= (select coalesce(max(updated_at), '1970-01-01') from {{ this }})
{% endif %}

# models/marts/_schema.yml
version: 2
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: user_id
        tests:
          - relationships:
              to: ref('dim_users')
              field: user_id

After dbt run --select fct_orders, dbt compiles to MERGE INTO / INSERT INTO and pushes it to the target warehouse, then runs dbt test for validation.
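
For intuition, on an incremental run the compiled statement pushed to a warehouse like Snowflake looks roughly like this (a sketch: dbt generates the temp-table-plus-merge dance automatically, and the analytics.marts schema name is illustrative):

-- rough shape of what dbt executes for the incremental model above
create temporary table fct_orders__dbt_tmp as (
    select ...   -- the compiled SELECT above, is_incremental() filter included
);

merge into analytics.marts.fct_orders t
using fct_orders__dbt_tmp s
  on t.order_id = s.order_id
when matched then update set
  user_id = s.user_id, created_at = s.created_at,
  total_cents = s.total_cents, country = s.country
when not matched then insert (order_id, user_id, created_at, total_cents, country)
  values (s.order_id, s.user_id, s.created_at, s.total_cents, s.country);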

Storage — Lakehouse (S3 + Iceberg) · Real-Time OLAP (ClickHouse)

The biggest change in the storage layer in recent years is that “lakehouse” has replaced the binary of “data lake vs data warehouse.” The bottom layer is cheap object storage like S3 / R2, and on top of it sits an open table format (Iceberg / Delta / Hudi) providing ACID + schema evolution, readable / writable by any compute engine. Real-time OLAP is another coexisting form: data is materialized into compute-storage-integrated engines like ClickHouse / Doris / StarRocks, trading multi-engine sharing for millisecond response.

S3 / R2 · Object Storage Substrate

History — Amazon S3 launched in 2006, the first object store in cloud computing and one of the oldest AWS services. Its 11 nines (99.999999999%) of durability, 99.99% availability, and unlimited objects per bucket essentially killed the “how do I safely store PBs of data” problem. The differentiation of Cloudflare R2 (GA 2022) is “zero egress”: S3’s egress traffic is so expensive that pulling your data out of AWS to use elsewhere is nearly infeasible; R2 broke this barrier, forcing AWS in 2024 to also announce free egress (with conditions). Two others commonly mentioned are Google Cloud Storage (GCS) and Backblaze B2.

Core design — S3 is an HTTP REST + strong consistency + immutable object storage model:

A key S3 constraint: LIST operations are slow (default 1000 keys/page; large directories take minutes to scan), so lakehouse table formats (Iceberg / Delta) must maintain their own manifest files to avoid LIST.
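
The paging is visible directly in the API; a small boto3 sketch (bucket and prefix are illustrative):

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")   # at most 1,000 keys per page

count = 0
for page in paginator.paginate(Bucket="my-lakehouse", Prefix="raw/orders/"):
    count += len(page.get("Contents", []))        # each page is one round trip
print(count, "objects")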

Users — The bedrock of nearly every modern cloud-native data platform. S3 holds hundreds of trillions of objects (per public 2023 figures). R2 is growing fast; Cloudflare’s 2024 reporting indicates R2 stores trillions of objects.

Demo code — AWS CLI (the most common entrypoint for object-storage day-to-day ops):

# Write (single file with SSE encryption)
aws s3 cp part-00.parquet \
    s3://my-lakehouse/raw/orders/dt=2026-05-11/part-00.parquet \
    --sse AES256

# Sync entire directory (incremental, only transfers changed files)
aws s3 sync ./local-dir/ s3://my-lakehouse/raw/orders/dt=2026-05-11/

# Cloudflare R2 is S3 API compatible — just switch the endpoint
aws s3 cp file.parquet s3://my-r2-bucket/path/ \
    --endpoint-url https://<account>.r2.cloudflarestorage.com

In real production, raw operations like this are rare; most data writes are indirect through Iceberg / Spark / Trino; CLI is reserved for ops: bucket policy, lifecycle rules, inspections.

Apache Iceberg · Open Table Format

History — Iceberg was started by Ryan Blue and Daniel Weeks at Netflix in 2017, open-sourced in 2018, and became an Apache top-level project in 2020. The core pain points it aims to solve are several fatal flaws of the Hive table format in large-data-lake scenarios:

  1. LIST is slow when finding files across an entire partition
  2. No ACID; concurrent writes frequently corrupt data
  3. No schema evolution (can only add columns, can’t change types / rename)
  4. No time travel / rollback

Iceberg’s solution is “adding an independent metadata layer above the file layer”: each table has a metadata.json pointing to a manifest list, the manifest list points to manifest files, and manifests point to data files — three levels of pointers. Each write produces a new metadata.json; ACID comes from atomically swapping the table’s current-metadata pointer in the catalog.

In 2024, Databricks acquired Tabular (founded by Ryan Blue and Daniel Weeks, Iceberg’s main commercial entity) for a reported $1B+, outbidding Snowflake (a notable move given that Databricks’ own Delta Lake is Iceberg’s direct competitor). Iceberg has become the de facto standard: AWS Glue, Athena, Snowflake, BigQuery, Databricks (native support since 2024), Cloudflare R2 SQL, ClickHouse, Trino, Flink, and Spark can all read Iceberg.

Core design — Three-level metadata:

Read path: uses manifest column statistics (min/max) for partition + file pruning, avoiding scanning irrelevant data. Write path: each commit atomically replaces the metadata file pointer, achieving MVCC.
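
On object storage, the three levels map to a directory layout roughly like this (names abbreviated; exact file names vary by writer):

s3://my-lakehouse/marts/fct_orders/
├── metadata/
│   ├── 00003-….metadata.json    <- current table metadata; the catalog holds this pointer
│   ├── snap-4021….avro          <- manifest list, one per snapshot
│   └── 9fa2…-m0.avro            <- manifest: data files + per-column min/max stats
└── data/
    └── created_at_day=2026-05-10/country=US/
        └── 00042-….parquet      <- actual data files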

Trade-off: Iceberg isn’t bound to a specific compute engine (vs Delta Lake’s early tight Spark binding), but commit conflicts require retries.

Users — The mainstream choice for large data lakes at Netflix, Apple, Airbnb, Adobe, ByteDance, Stripe, and others. Competitors:

Demo code — Spark SQL creating + maintaining an Iceberg table:

-- Create table
CREATE TABLE iceberg.marts.fct_orders (
  order_id     BIGINT,
  user_id      BIGINT,
  amount_cents BIGINT,
  country      STRING,
  created_at   TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(created_at), country)
TBLPROPERTIES (
  'write.format.default' = 'parquet',
  'write.target-file-size-bytes' = '134217728'
);

-- Upsert
MERGE INTO iceberg.marts.fct_orders t
USING staging.orders_delta s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Time travel
SELECT * FROM iceberg.marts.fct_orders
  FOR TIMESTAMP AS OF '2026-05-10 00:00:00';

-- Expire old snapshots (garbage-collects metadata and unreferenced data files)
CALL iceberg.system.expire_snapshots('marts.fct_orders', TIMESTAMP '2026-05-01');

ClickHouse · Real-Time OLAP

History — ClickHouse originated as an internal project at Yandex in 2009, supporting reporting queries for Yandex.Metrica (Russia’s Google Analytics): hundreds of billions of events per day with sub-second aggregation results. Open-sourced in 2016, it quickly became the de facto open-source OLAP engine. In 2021 the core team formed ClickHouse Inc. (US), independent from Yandex, since valued at over $6B. Commercialization focuses on ClickHouse Cloud (hosted).

Core design — All of ClickHouse’s performance advantages come from three foundations: columnar storage + MergeTree + vectorized execution:

Trade-off: fast writes and fast queries, but weak UPDATE / DELETE (asynchronous, inefficient, batch-only). This makes ClickHouse a typical “append-mostly” tool — perfect for real-time warehouses and OLAP, wrong for transactional workloads.
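
The weakness is visible in the syntax itself: row changes are asynchronous ALTER mutations, background rewrites of whole data parts rather than transactional statements:

-- Mutations: asynchronous rewrites, not row-level transactions
ALTER TABLE events DELETE WHERE user_id = 42;
ALTER TABLE events UPDATE revenue_cents = 0 WHERE event_type = 'refund';

-- Track progress; on a large table this can take minutes
SELECT command, is_done FROM system.mutations WHERE table = 'events';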

Users — Cloudflare (raw log analytics, tens of billions of events per day), Uber (real-time risk), ByteDance (user behavior analytics; its open-sourced ByConity fork, internally CNCH, is a cloud-native ClickHouse derivative), Spotify, eBay, Disney, DiDi, Alibaba, and others. Competitors:

Demo code — Table + materialized view:

-- Detail table
CREATE TABLE events (
    event_date Date,
    event_time DateTime,
    user_id UInt64,
    event_type LowCardinality(String),
    country LowCardinality(String),
    revenue_cents UInt64
)
ENGINE = MergeTree
PARTITION BY event_date
ORDER BY (event_date, event_type, user_id)
TTL event_date + INTERVAL 90 DAY;

-- Materialized view: hourly aggregation
CREATE MATERIALIZED VIEW events_hourly_mv
ENGINE = SummingMergeTree
PARTITION BY event_date
ORDER BY (event_date, event_hour, country, event_type)
AS SELECT
    event_date,
    toStartOfHour(event_time) AS event_hour,
    country,
    event_type,
    count() AS events,
    sum(revenue_cents) AS revenue_cents
FROM events
GROUP BY event_date, event_hour, country, event_type;

-- Query (hits materialized view, millisecond response)
SELECT country, sum(revenue_cents) / 100 AS revenue_usd
FROM events_hourly_mv
WHERE event_date = today() AND event_type = 'purchase'
GROUP BY country ORDER BY revenue_usd DESC LIMIT 10;

Query — Trino · Spark SQL · Redis

Once data is in storage, a “query entrypoint” is needed to accept SQL / KV requests. Trino handles cross-source federated interactive SQL, Spark SQL handles heavy analytics / feature engineering (the Spark from §04, in SQL form), Redis handles KV / cache / online metrics.

Trino · Federated SQL Engine

History — Trino’s predecessor is Presto, developed at Facebook in 2012 (Martin Traverso, David Phillips, Eric Hwang, Dain Sundstrom), to bring interactive SQL on Hadoop down to second-scale response times. Open-sourced in 2013, quickly becoming the representative MPP SQL engine in the industry. In 2018, the original core team left Facebook due to governance conflicts, forking PrestoSQL (later renamed Trino in 2020 because Facebook donated the Presto trademark to the Linux Foundation but retained governance) and founding Starburst for commercialization. What the market calls “Presto” today generally means Trino; the original PrestoDB is maintained by Linux Foundation but with low activity.

Trino’s core positioning is “one SQL engine to query all data sources”: no data replication, no unified format requirement; via connectors, federated queries against Hive / Iceberg / Delta / MySQL / Postgres / Kafka / Elasticsearch / Cassandra / MongoDB … 50+ data sources.

Core design — Trino is shared-nothing MPP:

Trade-off: Trino’s strength is interactive queries (second-scale) plus cross-source joins; its weakness is fault tolerance on long-running jobs (a failed query re-runs from scratch; recent releases introduced fault-tolerant execution to mitigate this).
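
Fault-tolerant execution is opt-in; a sketch of the relevant coordinator configuration (the S3 bucket for spooled exchange data is illustrative):

# etc/config.properties
retry-policy=TASK

# etc/exchange-manager.properties
# intermediate exchange data is spooled so it survives worker loss
exchange-manager.name=filesystem
exchange.base-directories=s3://my-trino-exchange/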

Users — Netflix (main interactive query entrypoint), LinkedIn, Lyft, Pinterest, Stripe, ByteDance, and others. Starburst Galaxy (hosted Trino) is the commercial mainstay. Competitors:

Demo code — Trino cross-Iceberg + MySQL join (classic scenario: joining OLTP dimension tables with lakehouse fact tables):

SELECT
    o.country,
    p.product_name,
    SUM(o.amount_cents) / 100 AS revenue_usd
FROM iceberg.marts.fct_orders o
JOIN mysql.production.products p
  ON o.product_id = p.id
WHERE o.created_at >= DATE '2026-05-01'
  AND o.country IN ('US', 'JP', 'CN')
GROUP BY 1, 2
ORDER BY revenue_usd DESC
LIMIT 50;

When Trino executes this, the Iceberg connector reads S3, the MySQL connector reads the production DB, and the join happens in worker memory. The two sources are joined only at query time; no pre-ETL is needed.

Spark SQL · Pointer to §04 Compute

Spark SQL has already been covered in §04 Compute; here we only highlight the “query identity” angle:

Redis · Cache and KV

History — Redis was written in C by Salvatore Sanfilippo in 2009 (an Italian programmer; it began as a side project) to solve performance problems in his real-time analytics site. In 2010 VMware hired Sanfilippo to keep developing it; in 2015 Redis Labs (later renamed Redis Inc.) took over sponsorship. Over a decade, Redis has evolved from “in-memory KV” to an “in-memory data structure server”: string / hash / list / set / sorted-set / bitmap / hyperloglog / stream / geo / json (module) / vector (module) — almost any business scenario maps to some Redis structure.

In March 2024, Redis changed its license from BSD to RSALv2 + SSPL (a dual license restricting cloud-vendor commercial use); the community immediately forked Valkey under the Linux Foundation, with AWS / Google / Oracle / Snap all contributing engineering effort; alternatives like Redict and Garnet (Microsoft, written in C#, advertising higher throughput) emerged in parallel. Today “Redis alternatives” is a real topic in data stack discussions.

Core design — Redis is a textbook “single-threaded event loop + in-memory data structures”:

Users — Nearly every internet company runs Redis: Twitter, GitHub, Stack Overflow, Snapchat, Pinterest, etc. Typical usage in the data stack:

Demo code — Go caching + sorted-set real-time leaderboard (Redis’s typical consumer is a Go / Java backend service):

package main

import (
    "context"
    "encoding/json"
    "time"

    "github.com/redis/go-redis/v9"
)

// Stand-ins so the sketch compiles; in reality these come from your query layer.
type Row struct {
    Country    string
    RevenueUSD float64
}

type Update struct {
    UserID string
    Score  float64
}

func expensiveClickHouseQuery() []Row { return nil }

var updates []Update

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: "redis.internal:6379"})

    // 1. Cache query result (5-minute TTL)
    cacheKey := "top_countries:2026-05-11"
    if v, err := rdb.Get(ctx, cacheKey).Result(); err == nil {
        var cached []Row
        json.Unmarshal([]byte(v), &cached) // cache hit: skip the warehouse entirely
    } else {
        rows := expensiveClickHouseQuery()
        b, _ := json.Marshal(rows)
        rdb.Set(ctx, cacheKey, b, 5*time.Minute)
    }

    // 2. Sorted-set real-time leaderboard
    for _, u := range updates {
        rdb.ZIncrBy(ctx, "leaderboard:2026-w19", u.Score, u.UserID)
    }

    // 3. Take the Top 10
    top10, _ := rdb.ZRevRangeWithScores(ctx, "leaderboard:2026-w19", 0, 9).Result()
    _ = top10
}

Serve — BI · Backend API · ML Applications

The serve layer is where data is finally “consumed.” Tools at this layer are not “mutually exclusive” like the lower layers but divided by user role: data analysts use BI dashboards, business systems use backend APIs, algorithm / ML teams use model inference.

Apache Superset · Data Analytics Platform

History — Superset was developed internally at Airbnb by Maxime Beauchemin in 2015 (the same author as Airflow!), open-sourced in 2016, entered Apache Incubator in 2017, became a top-level project in 2021. After leaving Airbnb, Maxime founded Preset (2019) for hosted SaaS Superset. Superset’s design philosophy is “exploration-oriented + engineer-friendly”: write SQL to explore → save as dataset → add charts → assemble dashboard, all completed in the browser.

Core design — Three core abstractions:

Features: SQL Lab (in-browser SQL editor with syntax highlighting + autocomplete + history); Jinja templating (dynamically compose SQL with {{ filter_values('country') }}); Row-level security (per-user WHERE clauses); Embedded analytics (SDK to embed in your own product).

Users — Airbnb, Lyft, Twitter, AT&T, Bloomberg, ByteDance, JD.com, DiDi, and others. First-tier of open-source BI. Preset Cloud is the hosted commercial version.

Demo code — Superset’s dataset YAML (used with superset import-datasources for batch management):

database_name: trino-lakehouse
sqlalchemy_uri: trino://datateam@trino-prod.internal:8443/iceberg
tables:
  - table_name: fct_orders_daily
    schema: marts
    columns:
      - column_name: day
        type: DATE
        is_dttm: true
      - column_name: country
        type: VARCHAR
      - column_name: gmv_cents
        type: BIGINT
    metrics:
      - metric_name: gmv_usd
        expression: "SUM(gmv_cents) / 100.0"
      - metric_name: orders
        expression: "SUM(order_count)"

Metabase · Self-Service BI

History — Metabase was open-sourced by Sameer Al-Sakran in 2015; Metabase Inc. is the commercial mothership formed concurrently. Its positioning differs from Superset: “non-technical users can use it too” — generate queries via UI dimension / metric / filter selection, without needing to write SQL. Metabase 2024 Series D at $1.6B. Business model: open-source (AGPL) + Pro / Enterprise (closed-source add-ons) + Cloud.

Core design — Built around the “Question” abstraction: each chart is a question, and questions can be constructed via Notebook (UI drag-and-drop) or SQL editor.

Users — Mainstream for SMBs; 39k+ stars on GitHub. Compared with Superset: Metabase is more lightweight / easier to onboard, but its customization / visualization richness lags Superset. Mid-sized companies often install both: Metabase for business PMs, Superset for data analysts.

Demo code — Metabase is a pure HTTP API; curl is the most direct demonstration:

# Login to get session token
TOKEN=$(curl -s https://metabase.internal/api/session \
    -H "Content-Type: application/json" \
    -d '{"username":"bot","password":"secret"}' | jq -r .id)

# Run a saved question (id=42) for JSON results
curl -s -X POST https://metabase.internal/api/card/42/query/json \
    -H "X-Metabase-Session: $TOKEN" | jq '.[0:3]'

In real production, scripting Metabase is rare; the vast majority of interaction is through the web UI.

Backend API / ML Application · Not a Single Tool

The backend API cell isn’t a specific project, but application code that exposes the data platform’s query capabilities to business systems:

ML applications are similar; a typical pipeline:

This layer has no single “mainstream tool” — it’s an assembly of dozens of lightweight services, and it is the interface between the data team and business engineering teams.
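
As a flavor of that glue code, a minimal sketch: a FastAPI endpoint serving a ClickHouse aggregate behind a Redis cache (host names, table, and metric are illustrative; the driver assumed is clickhouse-connect):

import json

import clickhouse_connect
import redis
from fastapi import FastAPI

app = FastAPI()
cache = redis.Redis(host="redis.internal")
ch = clickhouse_connect.get_client(host="clickhouse.internal")

@app.get("/metrics/daily-gmv")
def daily_gmv(country: str):
    key = f"daily_gmv:{country}"
    if (hit := cache.get(key)) is not None:
        return json.loads(hit)  # cache hit: no warehouse round trip
    rows = ch.query(
        "SELECT toDate(event_time) AS day, sum(revenue_cents) / 100 AS gmv_usd "
        "FROM events WHERE country = %(c)s AND event_type = 'purchase' "
        "GROUP BY day ORDER BY day",
        parameters={"c": country},
    ).result_rows
    payload = [{"day": str(day), "gmv_usd": float(gmv)} for day, gmv in rows]
    cache.set(key, json.dumps(payload), ex=300)  # 5-minute TTL
    return payload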

Orchestration and Governance — Airflow · Glue Catalog · DataHub

Scheduling and governance are cross-cutting concerns. Airflow decides what runs when and what depends on what; Glue Catalog answers “what is this table’s schema”; DataHub answers “how did this table’s lineage form, who owns it, and what’s its quality.”

Apache Airflow · Workflow Orchestration

History — Airflow was developed by Maxime Beauchemin at Airbnb in 2014 (the same author as Superset), open-sourced in 2015, entered the Apache Incubator in 2016, and became a top-level project in 2019. Its core innovation is “workflow as Python code”: DAGs aren’t written in YAML or XML but in Python, with task dependencies declared through Python function calls. This idea wasn’t mainstream in 2015 (Oozie / Azkaban were the XML / YAML camp) but quickly became the standard. Astronomer is the commercial mothership (2022 valuation $1.7B); it acquired Datakin, the company behind OpenLineage, in 2022.

Core design — Main concepts:

Users — Almost every data team runs Airflow. Airbnb, Lyft, Square, Robinhood, Tesla, Google (Cloud Composer is hosted Airflow), AWS (MWAA), ByteDance, DiDi, and others. Competitor differentiation:

Demo code — TaskFlow API writing an ETL DAG:

from datetime import datetime
from airflow.decorators import dag, task

@dag(
    schedule='0 2 * * *',          # Every day at 2:00
    start_date=datetime(2026, 5, 1),
    catchup=False,
    tags=['marts'],
)
def daily_orders_pipeline():

    @task
    def extract_kafka_offsets() -> dict:
        # Get yesterday's end offset from Kafka
        return {'orders.v1': 12345678}

    @task
    def run_spark(offsets: dict) -> str:
        # Trigger Spark job reading Iceberg + Kafka, writing back to Iceberg
        from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
        # ... simplified: in practice use an Operator instead of @task ...
        return 'iceberg.marts.fct_orders'

    @task
    def trigger_dbt(table: str) -> None:
        # Trigger dbt run on downstream models
        ...

    offsets = extract_kafka_offsets()
    table = run_spark(offsets)
    trigger_dbt(table)

daily_orders_pipeline()

The scheduler picks up this file and automatically builds the dependency graph: extract → spark → dbt, running once daily at 2:00.

AWS Glue Data Catalog · Metadata Service

History — AWS Glue launched in 2017; Glue Data Catalog is its metadata service component, API fully compatible with Apache Hive Metastore (a key point, since it means any Hive-compatible engine can use Glue as a metastore). Glue itself also includes Glue ETL (Spark-based ETL) and Glue Studio (UI), but in real production “using Glue” essentially means using Glue Catalog.

Core design — Glue Catalog is a fully managed metastore service:

Trade-off: AWS-locked, rarely used cross-cloud; but within AWS it’s the de facto entrypoint for Iceberg / Delta / Hive metadata.

Users — Any data team of meaningful scale on AWS defaults to Glue Catalog as the metastore; EMR / Athena / Redshift Spectrum all default to it. Non-AWS alternatives: Hive Metastore (veteran, self-managed), Project Nessie (Iceberg-native catalog with git-like branches), Polaris (open-sourced by Snowflake), Unity Catalog (Databricks-led, open-sourced in 2024).

Demo code — AWS CLI registering an Iceberg table (in real scenarios, most are auto-registered by Spark / Iceberg writers; here we show what the low-level interface looks like):

cat > table.json <<'EOF'
{
  "Name": "fct_orders",
  "TableType": "EXTERNAL_TABLE",
  "Parameters": {
    "table_type": "ICEBERG",
    "metadata_location": "s3://my-lakehouse/marts/fct_orders/metadata/00000-xxx.metadata.json"
  },
  "StorageDescriptor": {
    "Location": "s3://my-lakehouse/marts/fct_orders/"
  }
}
EOF

aws glue create-table \
    --database-name marts \
    --table-input file://table.json

After this, Athena / Spark / Trino can all SELECT * FROM marts.fct_orders and see this Iceberg table.

DataHub · Data Asset Platform

History — DataHub evolved from LinkedIn’s internal WhereHows project and was open-sourced in 2020 as a next-generation metadata platform. Acryl Data (founded by the original LinkedIn DataHub core team) was created in 2021 for commercialization, with a 2023 Series A at ~$300M valuation. DataHub addresses metadata + governance problems — “whose table is this, what’s the upstream / downstream lineage, was it modified recently, does it meet compliance” — a different layer from Glue Catalog (which only handles schema registration).

Core design — Several key abstractions:

DataHub’s killer feature is “column-level lineage”: after parsing dbt / Spark queries, it can say precisely “fct_orders.country comes from dim_users.country via a join.”

Users — LinkedIn (in-house), Saxo Bank, Adevinta, Wolt, Pinterest, ByteDance, Stripe, and others. Competitors:

Demo code — Python emitter pushing a Dataset:

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetPropertiesClass, OwnershipClass, OwnerClass, OwnershipTypeClass,
)
from datahub.emitter.mcp import MetadataChangeProposalWrapper

emitter = DatahubRestEmitter(gms_server='http://datahub-gms:8080')

urn = make_dataset_urn(platform='iceberg', name='marts.fct_orders', env='PROD')

mcp = MetadataChangeProposalWrapper(
    entityUrn=urn,
    aspect=DatasetPropertiesClass(
        description='Daily orders fact table, grain (day, country, product_id)',
        customProperties={'owner_team': 'commerce-data'},
    ),
)
emitter.emit(mcp)

mcp2 = MetadataChangeProposalWrapper(
    entityUrn=urn,
    aspect=OwnershipClass(owners=[
        OwnerClass(
            owner='urn:li:corpuser:frank',
            type=OwnershipTypeClass.DATAOWNER,
        )
    ]),
)
emitter.emit(mcp2)

After writing, marts.fct_orders in the DataHub UI gets a description, owner, tags; lineage is automatically pulled from dbt / Airflow metadata.
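
When lineage isn’t ingested automatically, it can also be declared with the same SDK. A table-level sketch (column-level lineage layers FineGrainedLineage on top of this):

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass, UpstreamClass, UpstreamLineageClass,
)

emitter = DatahubRestEmitter(gms_server='http://datahub-gms:8080')

upstream = make_dataset_urn(platform='iceberg', name='marts.dim_users', env='PROD')
downstream = make_dataset_urn(platform='iceberg', name='marts.fct_orders', env='PROD')

# Declare: fct_orders is derived from dim_users
emitter.emit(MetadataChangeProposalWrapper(
    entityUrn=downstream,
    aspect=UpstreamLineageClass(upstreams=[
        UpstreamClass(dataset=upstream, type=DatasetLineageTypeClass.TRANSFORMED),
    ]),
))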

Data Flow Timeline — write path · read path

The architecture diagram only shows which components sit at which layer; it misses how data actually flows end-to-end. The diagram below depicts the complete path of a single transaction (one order) from creation to consumption: at the top, two parallel write paths (real-time / batch); in the middle, the two shared hubs, ClickHouse + Iceberg marts; at the bottom, three classes of read paths fanning out from those hubs (API / BI / ML).

[Data flow diagram — WRITE PATH: real-time chain Postgres / MySQL (order created, writes WAL) → Debezium (reads binlog / WAL, emits events) → Kafka (orders.v1 topic) → Flink (streaming window aggregation); batch chain SaaS API (Stripe · Salesforce) → Airbyte (hourly batch pull) → Spark / Flink → Iceberg + S3 (lakehouse, full history) → dbt (stg → int → marts). HUBS (shared storage bridging read and write): ClickHouse (real-time warehouse, materialized views) and Iceberg marts (fact / dimension tables). READ PATH: backend API (FastAPI / Spring) + Redis result cache → frontend / customer; Trino federated SQL → Superset BI dashboards; Spark MLlib / Ray (features, training) → model server (Triton / KServe). Bands: Glue Catalog / DataHub (metadata + lineage; all tables register to the catalog); Airflow (trigger batch jobs · run dbt · invoke reverse ETL · monitor streaming jobs).]
Data flow timeline — the full path of an order from OLTP DB write to final consumption by BI / API / ML. Blue is the write flow (solid = streaming / dashed = batch), orange is the read flow, the middle orange boxes are the two shared hubs, and the bottom bands are governance and orchestration.

Key points to read from this diagram:

Industry Practice — Netflix · Uber · LinkedIn · Airbnb · Pinterest · Shopify · Stripe

To ground theoretical frameworks, the quickest way is to read the engineering blogs of companies that run the same stack to the extreme. Below we pick seven Western big-tech companies with sustained output on data infrastructure, each with one or two representative recent blogs or papers, dissecting what their stacks look like and what unique problems they solve. This section answers: “what do all the components discussed earlier look like running in production?”

Netflix · Iceberg’s Birthplace + Maestro Orchestration

Netflix is the birthplace of Apache Iceberg — Ryan Blue and Daniel Weeks designed it internally at Netflix to solve Hive’s pain points on S3 (slow LIST, no ACID, no schema evolution, no time travel), open-sourced in 2018. Today the core combo of Netflix’s data platform is Iceberg + Spark + Trino + Flink + Maestro (self-built workflow engine, open-sourced in 2024).

The recent representative blog Maestro: Netflix’s Workflow Orchestrator recounts Maestro’s design philosophy: horizontally scalable stateless microservices + CockroachDB persistence, replacing the previous-generation single-leader system Meson — an architecture reverse-engineered from “extreme scale + multilingual users (SQL / Python / Notebook all needed).” Companion blog Incremental Processing using Netflix Maestro and Apache Iceberg details how Maestro pairs with Iceberg for incremental materialization.

Lessons for Chinese readers: Netflix chose Iceberg over Delta fundamentally for “multi-engine neutrality” — Spark / Trino / Flink / Presto / Hive / Impala all need to read the same table. If your stack has more than one compute engine, that’s a strong signal to make the same choice.

Uber · Hudi’s Birthplace + Real-Time Data Stack

Uber is the birthplace of Apache Hudi (Vinoth Chandar et al., open-sourced 2017), and the industry’s most complete “stream-batch unified” industrial case. The most authoritative reference is the SIGMOD 2021 paper Real-time Data Infrastructure at Uber (Yupeng Fu, Chinmay Soman), which systematically explains Uber’s internal Kafka + Flink + Pinot + Hive + Hudi collaboration — from ride orders to supply-demand matching to risk control, nearly all trip data runs through this pipeline.

Hudi’s origin blog Hudi: Uber Engineering’s Incremental Processing Framework on Apache Hadoop clarifies the pain points Hudi solves: upserts on large tables, which Hive-era systems handled by “rewriting all of the past N days’ partitions,” replaced by Hudi with ACID + indexes + incremental consumption.

Lessons for Chinese readers: Uber’s stack is the template most worth studying for ride-hailing / instant-delivery / supply-demand-matching businesses in China — Flink + Pinot for sub-second real-time decisioning, Hive + Hudi for offline modeling, with Kafka as the shared event source rather than two separate pipelines.

LinkedIn · Kafka’s Birthplace + DataHub

LinkedIn is the birthplace of Apache Kafka (Jay Kreps, Neha Narkhede, Jun Rao, 2010), and the origin of DataHub, Pinot, Samza, Voldemort. The whole “log-centric” data architecture philosophy traces back to Jay Kreps’ 2013 widely-cited long-form essay The Log: What every software engineer should know about real-time data’s unifying abstraction — distilling “log” as the unified abstraction across databases, distributed systems, and data integration. This article is a “must-read classic” for data engineers; after reading you’ll understand why Kafka is more than just a message queue.

In 2020 LinkedIn open-sourced their internal metadata platform: Open sourcing DataHub: LinkedIn’s metadata search and discovery platform — most enterprises’ metadata / lineage / data governance references today evolved from this architecture.

Lessons for Chinese readers: LinkedIn’s design philosophy is “event streams are the truth, databases are materialized views” — this is the ideological root of the CDC + stream processing paradigm.

Airbnb · Airflow + Superset + Minerva

Airbnb is the mothership where Maxime Beauchemin simultaneously incubated Airflow + Superset — these two projects are today the open-source de facto standards for workflow orchestration and BI respectively. Airflow: a workflow management platform (2015) is the foundational article for the workflow orchestration field — before this the industry was writing XML / YAML DAGs; Airflow first established “DAG is Python code” as the standard.

More recent How Airbnb achieved metric consistency at scale (2021) introduces the internal “metric layer” Minerva — 12,000+ metrics and 4,000+ dimensions consistent on a single source of truth; underlying Airflow + Hive + Spark + Presto + Druid. The Minerva concept later gave rise to dbt Semantic Layer, Cube.js, and other metric-layer products.

Lessons for Chinese readers: The biggest takeaway from the Airbnb case is “metric definition is a first-class citizen of data architecture” — don’t let the same KPI have ten SQL implementations across ten dashboards; this is the most common pitfall for mid-sized teams.

Pinterest · 500 PB Data Lake + Spinner Orchestration

Pinterest is a large-scale Iceberg user (disclosed at AWS re:Invent 2025: 500 PB+ data lake, 100K+ tables) and built Spinner, its Airflow-based successor to the in-house Pinball scheduler. Spinner: Pinterest’s Workflow Platform (2022) recounts the migration from Pinball to Spinner (Airflow + Kubernetes), carrying 4,000+ workflows, 10,000 daily flow runs, and 38,000 tasks.

The early-2026 next-generation lake ingestion framework Next Generation DB Ingestion at Pinterest is more recent practice: Debezium / TiCDC + Kafka + Flink + Spark + Iceberg unified CDC lake-ingestion framework, compressing online-OLTP-to-offline-lakehouse latency from hours to minutes.

Lessons for Chinese readers: Pinterest is the Western company whose business shape most resembles ByteDance / Xiaohongshu (high-frequency user interaction + recommendations + content); their stack choices have the highest reference value.

Shopify · dbt + BigQuery Benchmark

Shopify is the typical sample of the “lightweight data stack” route — no heavy Spark / Flink / Kafka-class components; it relies entirely on BigQuery’s compute plus dbt’s engineering discipline, scaled to the size of a public company. The representative blog How to Build a Production Grade Workflow with SQL Modelling (2020) details the internal solution code-named “Seamster”: dbt + BigQuery, dividing data models into three layers (Base / Application-ready / Presentation), with 200+ data scientists collaborating.

Lessons for Chinese readers: Shopify is proof that you don’t need the full stack to run at scale — if your team isn’t Uber / Netflix sized, this stack is more worth copying than “I should also adopt Kafka + Flink.” China’s closest equivalents to BigQuery are Alibaba MaxCompute and Volcano Engine’s ByteHouse, with similar stack thinking.

Stripe · Trino + Iceberg Replacing Hive

Stripe’s early stack representative was the 2016 Reproducible research: Stripe’s approach to data science — a combination of Jupyter + Hadoop + Scalding + Impala + Redshift. The 2023 Trino Fest talk Inspecting Trino on ice updates the current state: Iceberg has fully replaced Hive, with Trino as the federated SQL entrypoint; using Trino event listener, each query is audited to PostgreSQL for reverse observation of Iceberg table actual usage patterns.

Lessons for Chinese readers: Stripe is a sample of continuous data-stack iteration — the same company moving fully from Hadoop + Hive to Iceberg + Trino over seven years validates that the lakehouse + federated-query route of §05 / §06 holds up over time.

How to Choose — by team size and business shape

When transplanting this ecosystem to your own organization, very few people deploy all 16 projects from day one. Real selection follows several rules of thumb:

Team / Stage | Minimum viable stack | Key omissions
Early (< 10-person data team) | Fivetran / Airbyte + Snowflake / BigQuery + dbt + Metabase | Skip Kafka / Flink / Iceberg / ClickHouse / Airflow (use dbt Cloud for scheduling)
Mid (10–50 people) | Airbyte + Iceberg + Spark + dbt + Trino + Superset + Airflow | Add lakehouse + orchestration + federated query; Kafka / Flink depends on real-time needs
Late (> 50 people + real-time business) | Full stack: Debezium + Kafka + Flink + Iceberg + ClickHouse + dbt + Trino + Redis + Airflow + DataHub | Governance / lineage / real-time warehouse — none can be omitted
AI / Recs / Risk-heavy | All of the above + Feature Store (Feast) + Vector DB (Pinecone / Qdrant) + Ray | ML pipeline becomes the new main path

Several common mistakes:

One final judgment: this stack isn’t evolving toward “one project that eats all layers” but toward “the boundaries between layers blurring”:

Understanding why each layer exists is the only way to know what to cut and what to keep when the boundaries shift. This article is the starting point of that judgment.

References — project sites · big-tech blogs · industry surveys

Project Sites

Commercial Motherships

Big-Tech Blogs and Papers

Industry Surveys and Comparisons