What Is Data Engineering?

Data engineering is the technical discipline of building and operating the systems and pipelines that acquire, transform, store, and serve data at scale — making data reliably available for analysts, data scientists, and AI applications. Data engineers are the builders of the data infrastructure that every data-driven organization depends on: they design and maintain the pipelines that move data from operational systems into the lakehouse, transform it through the Medallion Architecture, and make it available through governed APIs and query engines.

In the data lakehouse era, data engineering has become increasingly focused on open, portable tools: Apache Spark for batch ETL, Apache Flink for streaming ingestion, dbt for SQL-based transformation, Apache Airflow for orchestration, and Apache Iceberg for table management — all on open data in cloud object storage.

Modern Data Engineering Stack

The modern lakehouse data engineering stack, organized by functional category:

| Function | Tool | Purpose |
| --- | --- | --- |
| Streaming ingestion | Flink + Kafka | CDC and event streaming to Bronze Iceberg |
| Batch ingestion | Spark + Airbyte | JDBC loads and SaaS ELT to Bronze Iceberg |
| Transformation | Spark, dbt | Bronze to Silver to Gold Iceberg tables |
| Orchestration | Apache Airflow | Schedule and monitor pipeline DAGs |
| Table format | Apache Iceberg | ACID transactions, schema evolution, time travel |
| Query engine | Dremio, Trino | SQL analytics on Iceberg tables |
| Quality | dbt tests, Great Expectations | Data quality validation in pipelines |
Figure 1: The modern lakehouse data engineering stack — ingestion, transformation, orchestration, quality.
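To make the Bronze-to-Silver-to-Gold flow concrete, here is a toy illustration in plain Python. A real pipeline would run these stages as Spark jobs or dbt models against Iceberg tables; the field names and cleaning rules below are illustrative assumptions, not part of any actual schema.

```python
# Toy sketch of the medallion flow: Bronze (raw) -> Silver (cleaned) ->
# Gold (aggregated). The record shape and rules are illustrative assumptions.

def bronze_to_silver(raw_events):
    """Clean raw Bronze events: drop records missing an id, cast amounts."""
    silver = []
    for e in raw_events:
        if e.get("user_id") is None:
            continue  # in a real pipeline this record would be quarantined
        silver.append({"user_id": e["user_id"], "amount": float(e.get("amount", 0))})
    return silver

def silver_to_gold(silver_rows):
    """Aggregate cleaned Silver rows into a per-user revenue summary."""
    gold = {}
    for r in silver_rows:
        gold[r["user_id"]] = gold.get(r["user_id"], 0.0) + r["amount"]
    return gold

raw = [
    {"user_id": 1, "amount": "10.5"},
    {"user_id": None, "amount": "3.0"},  # rejected at the Silver stage
    {"user_id": 1, "amount": "4.5"},
]
print(silver_to_gold(bronze_to_silver(raw)))  # {1: 15.0}
```

The point of the sketch is the shape of the pipeline: each stage takes the previous stage's output, and quality rules are applied as early as possible so downstream Gold aggregates stay trustworthy.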

Table Management as a Data Engineering Responsibility

The lakehouse introduces a category of data engineering responsibility that didn't exist in managed data warehouse environments: table management. Apache Iceberg tables require active maintenance to sustain performance:

  • Compaction: Scheduling and monitoring file compaction jobs for Silver tables with frequent MERGE INTO operations
  • Z-Ordering: Periodic Z-order optimization for Gold tables with demanding query performance SLAs
  • Snapshot expiry: Cleaning up snapshot history to prevent metadata bloat
  • Partition evolution: Adjusting partition specs as data volumes grow and query patterns change
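The first three responsibilities above map onto Iceberg's Spark stored procedures (`rewrite_data_files` for compaction and Z-ordering, `expire_snapshots` for snapshot cleanup). The helper below just builds the `CALL` statements a maintenance job would submit to a Spark session; the catalog name (`lakehouse`) and table names are placeholder assumptions.

```python
# Builds Spark SQL CALL statements for two common Iceberg maintenance
# procedures. Procedure names follow Iceberg's Spark stored procedures;
# the catalog and table names are placeholder assumptions.

def compaction_sql(catalog, table, zorder_cols=None):
    """CALL statement for rewrite_data_files, optionally with a z-order sort."""
    if zorder_cols:
        strategy = (f", strategy => 'sort', "
                    f"sort_order => 'zorder({','.join(zorder_cols)})'")
    else:
        strategy = ""
    return f"CALL {catalog}.system.rewrite_data_files(table => '{table}'{strategy})"

def expire_snapshots_sql(catalog, table, retain_last=5):
    """CALL statement for expire_snapshots, keeping the newest N snapshots."""
    return (f"CALL {catalog}.system.expire_snapshots("
            f"table => '{table}', retain_last => {retain_last})")

print(compaction_sql("lakehouse", "gold.daily_revenue", ["region", "day"]))
print(expire_snapshots_sql("lakehouse", "silver.orders"))
```

In practice these statements would run via `spark.sql(...)` on a session configured with the Iceberg catalog, typically on a schedule rather than ad hoc.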

Data engineers who understand these table management responsibilities — and automate them through scheduled Airflow DAGs or Dremio's automatic optimization — maintain consistently high query performance without ad-hoc maintenance firefighting.
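Automating that maintenance starts with a policy check: decide which tables actually need work before scheduling anything. The sketch below shows one such check a scheduled job might run: flagging tables whose data-file count exceeds a small-file threshold. The stats and threshold are illustrative assumptions; in a real pipeline the counts would come from Iceberg's metadata tables.

```python
# A policy check an orchestrated maintenance job might run before queuing
# compaction: flag tables whose data-file count exceeds a threshold.
# The stats dict and threshold are illustrative assumptions; real file
# counts would be read from Iceberg metadata (e.g. the `files` table).

SMALL_FILE_THRESHOLD = 100  # max data files before compaction is queued

def tables_needing_compaction(table_stats, threshold=SMALL_FILE_THRESHOLD):
    """Return table names (sorted) whose file count exceeds the threshold."""
    return [name for name, files in sorted(table_stats.items()) if files > threshold]

stats = {"silver.orders": 412, "silver.customers": 37, "gold.daily_revenue": 150}
print(tables_needing_compaction(stats))  # ['gold.daily_revenue', 'silver.orders']
```

Wrapped in an orchestrator task, a check like this turns table maintenance from reactive firefighting into a routine, monitored pipeline step.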

Figure 2: Lakehouse table management as a data engineering responsibility — compaction, Z-order, expiry.

Summary

Data engineering is the discipline that makes the data lakehouse function — building the pipelines, maintaining the tables, and ensuring the data quality that enables every downstream analytical use case. Modern lakehouse data engineers combine streaming (Flink), batch (Spark), transformation (dbt), orchestration (Airflow), and table management (Iceberg) skills into a cohesive platform engineering capability. As lakehouses mature and tooling automates more table management, data engineering increasingly shifts toward data product thinking — treating each Silver and Gold dataset as a trusted data product that business consumers can rely on.