🔧 Career Cheat Sheet

Data Engineer

SQL, Spark, Kafka, Airflow, dbt, cloud platforms — everything to crack any DE interview

  • US Salary Range: $100K–$175K
  • Interview Q&As: 30+
  • Code Snippets: 40+
  • Core Tools: 6

📋 Table of Contents

  1. Quick Reference
  2. Core Concepts
  3. Advanced SQL
  4. Apache Spark / PySpark
  5. Apache Kafka
  6. Apache Airflow
  7. dbt
  8. Cloud Platforms
  9. Code Snippets
  10. Top 30 Interview Q&As
  11. Pipeline Design Patterns
  12. Glossary
Quick Reference Card
| Topic | Key Command / Concept | Remember This |
|---|---|---|
| Window function | ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) | PARTITION BY = groups, ORDER BY = ordering within group |
| CTE | WITH cte AS (SELECT ...) SELECT * FROM cte | Reusable, readable — may be inlined or materialized, engine-dependent |
| Spark repartition | df.repartition(200) / df.coalesce(10) | repartition = full shuffle; coalesce = narrow (no shuffle) |
| Kafka offset | auto.offset.reset = 'earliest'/'latest' | earliest = replay all; latest = new messages only |
| Delta MERGE | MERGE INTO target USING source ON condition WHEN MATCHED THEN UPDATE | Upsert pattern for incremental loads |
| Airflow DAG | @dag(schedule='0 6 * * *', catchup=False) | catchup=False prevents backfilling missed runs |
| dbt incremental | {% if is_incremental() %} WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }}) {% endif %} | Only process new/changed records |
| SCD Type 2 | Add valid_from, valid_to, is_current columns; MERGE on natural key | Keep full history — each change creates a new row |
| Snowflake warehouse | ALTER WAREHOUSE wh SET AUTO_SUSPEND = 60 | Suspend idle warehouses to save cost |
| BigQuery partition | PARTITION BY DATE(event_timestamp) | Partition pruning = massive cost savings |
| Spark broadcast join | spark.sql.autoBroadcastJoinThreshold = 10MB (default) | Broadcast small table to all executors — no shuffle |
| CAP Theorem | Can only guarantee 2 of: Consistency, Availability, Partition tolerance | Partitions will happen in a distributed system → choose C or A |
🧠
Core Concepts
Data Storage Architectures
| Architecture | Storage | Schema | Best For | Examples |
|---|---|---|---|---|
| Data Warehouse | Columnar, structured | Schema-on-write (predefined) | BI, analytics, SQL queries | Snowflake, BigQuery, Redshift |
| Data Lake | Object storage (raw) | Schema-on-read (flexible) | Raw storage, ML, exploration | S3, GCS, ADLS |
| Lakehouse | Object storage + table format | Schema enforcement optional | Unified analytics + ML | Databricks, Delta Lake, Apache Iceberg |
| Data Mart | Columnar, structured | Schema-on-write | Department-specific analytics | Subset of DW for Sales/Finance |
ETL vs ELT

ETL (Extract → Transform → Load)

  • Transform data BEFORE loading into destination
  • Traditional approach — used when destination can't handle raw data
  • Transformation logic lives in middle tier (Spark, custom code)
  • Use when: legacy DWs, compliance/PII masking before storage, complex transformations

ELT (Extract → Load → Transform)

  • Load raw data FIRST, transform inside the destination
  • Modern approach — cloud DWs are powerful enough to transform at scale
  • dbt runs transformations inside Snowflake/BigQuery using SQL
  • Use when: modern cloud DW, need raw data for multiple use cases, easier debugging
💡 Modern Best Practice
ELT + dbt is now the dominant pattern. Load raw data to S3/GCS (bronze), land it in your DW staging schema, then dbt transforms it into silver (cleaned) and gold (business-ready) layers.
Data Modeling Patterns

Star Schema

  • Central fact table with foreign keys to dimension tables
  • Denormalized — fast query performance
  • Best for: analytics, BI reporting
  • Example: fact_orders → dim_customer, dim_product, dim_date

Snowflake Schema

  • Normalized dimension tables (dimensions have sub-dimensions)
  • Saves storage but more complex JOINs
  • Example: dim_product → dim_category → dim_department
  • Use when: storage matters, many dimension hierarchies

Data Vault

  • Hubs (business keys), Links (relationships), Satellites (attributes)
  • Insert-only — full historical auditability
  • Highly scalable, parallel loading
  • Best for: enterprise DW with frequent schema changes
Slowly Changing Dimensions (SCD)
| Type | How It Works | Keeps History? | Use Case |
|---|---|---|---|
| Type 1 | Overwrite old value with new value | ❌ No | Correcting data errors (typos, fix address) |
| Type 2 | Insert new row; old row gets end date; add is_current flag | ✅ Full history | Track changes over time (customer moves, price changes) |
| Type 3 | Add "previous value" column to existing row | ⚠️ One previous version | Track one-time planned changes |
| Type 4 | Store current in main table, history in separate history table | ✅ In separate table | High-volume history with fast current lookups |

SCD Type 2 Implementation Pattern:

-- SCD Type 2 MERGE in SQL
-- Step 1: expire the current row when tracked attributes change,
-- and insert brand-new customers
MERGE INTO dim_customer AS target
USING staging_customer AS source
ON target.customer_id = source.customer_id AND target.is_current = TRUE
WHEN MATCHED AND (
    target.email != source.email OR target.city != source.city
) THEN UPDATE SET
    is_current = FALSE,
    valid_to = CURRENT_DATE
WHEN NOT MATCHED THEN INSERT (
    customer_id, email, city, valid_from, valid_to, is_current
) VALUES (
    source.customer_id, source.email, source.city, CURRENT_DATE, NULL, TRUE
);
-- Step 2: insert the new current version for each row expired in step 1
INSERT INTO dim_customer (customer_id, email, city, valid_from, valid_to, is_current)
SELECT source.customer_id, source.email, source.city, CURRENT_DATE, NULL, TRUE
FROM staging_customer source JOIN dim_customer target
  ON target.customer_id = source.customer_id
 AND target.is_current = FALSE
 AND target.valid_to = CURRENT_DATE;
Batch vs Streaming Processing

Batch Processing

  • What: Process accumulated data at scheduled intervals (hourly, daily)
  • Tools: Spark, dbt, Airflow, Hadoop MapReduce
  • Latency: Minutes to hours
  • Use cases: Daily reporting, data warehouse loads, model training, batch ML scoring
  • Pros: Simple, high throughput, efficient for large volumes, replay-friendly

Stream Processing

  • What: Process events as they arrive, continuously
  • Tools: Kafka Streams, Flink, Spark Structured Streaming, Kinesis
  • Latency: Milliseconds to seconds
  • Use cases: Fraud detection, real-time dashboards, recommendations, alerting
  • Challenges: Late data, exactly-once semantics, state management
Medallion Architecture (Bronze → Silver → Gold)
🥉 Bronze Layer
  • Raw data exactly as received from source
  • No transformations — preserve original
  • Schema-on-read
  • Append-only, immutable
  • All historical data retained
  • Source for reprocessing
🥈 Silver Layer
  • Cleaned, validated, standardized data
  • Deduplication, null handling
  • Data type enforcement
  • PII masked/tokenized
  • Joined with reference data
  • Conformed dimensions
🥇 Gold Layer
  • Business-ready aggregates
  • Fact & dimension tables
  • Denormalized for query speed
  • KPIs, metrics, reporting tables
  • ML feature tables
  • Served to BI tools / APIs
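The three layers can be sketched end to end in plain Python. This is a toy illustration of the contracts above, not a real table format; the records and field names are made up:

```python
# Toy medallion flow: bronze keeps raw events untouched, silver cleans,
# gold aggregates. Illustrative only; real pipelines use Spark/Delta.
raw_events = [
    {"order_id": 1, "amount": "59.99", "status": "completed"},
    {"order_id": 1, "amount": "59.99", "status": "completed"},  # duplicate
    {"order_id": 2, "amount": "-5.00", "status": "completed"},  # invalid amount
    {"order_id": 3, "amount": "12.50", "status": "completed"},
]

# Bronze: append-only, raw, immutable
bronze = list(raw_events)

# Silver: dedupe on order_id, enforce types, drop invalid rows
seen, silver = set(), []
for e in bronze:
    if e["order_id"] in seen:
        continue
    amount = float(e["amount"])        # data type enforcement
    if amount <= 0:
        continue                       # validation
    seen.add(e["order_id"])
    silver.append({**e, "amount": amount})

# Gold: business-ready aggregate
gold = {
    "order_count": len(silver),
    "total_revenue": sum(e["amount"] for e in silver),
}
```

Note how bronze is never mutated: if the silver rules change, silver and gold can be rebuilt from it at any time.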
🗄️
Advanced SQL — Deep Dive
Window Functions — Complete Reference
-- ROW_NUMBER: unique sequential number per partition
SELECT
  order_id, customer_id, order_date, amount,
  ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS order_rank
FROM orders;
-- Get each customer's most recent order: WHERE order_rank = 1

-- RANK vs DENSE_RANK
-- RANK: 1, 2, 2, 4  (skips 3 after tie)
-- DENSE_RANK: 1, 2, 2, 3  (no skip)
SELECT name, salary,
  RANK()       OVER (ORDER BY salary DESC) AS salary_rank,
  DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_dense_rank
FROM employees;
-- (avoid bare "rank" as an alias: it's a reserved word in several dialects)

-- LAG and LEAD: access previous/next row value
SELECT
  date, revenue,
  LAG(revenue, 1, 0)  OVER (ORDER BY date) AS prev_day_revenue,
  LEAD(revenue, 1, 0) OVER (ORDER BY date) AS next_day_revenue,
  revenue - LAG(revenue) OVER (ORDER BY date) AS day_over_day_change
FROM daily_metrics;

-- NTILE: divide into n buckets
SELECT customer_id, total_spend,
  NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile  -- 1=top 25%
FROM customer_summary;

-- SUM/AVG with frames: running total
SELECT date, revenue,
  SUM(revenue) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
  AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7d_avg
FROM daily_revenue;

-- FIRST_VALUE / LAST_VALUE
SELECT product_id, sale_date, price,
  FIRST_VALUE(price) OVER (PARTITION BY product_id ORDER BY sale_date) AS first_price,
  LAST_VALUE(price)  OVER (PARTITION BY product_id ORDER BY sale_date
                           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS latest_price
FROM price_history;
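To make the RANK vs DENSE_RANK tie behavior concrete, here is a small pure-Python sketch (not SQL) that reproduces both numbering schemes over a descending-sorted list:

```python
def rank_and_dense_rank(values):
    """Return (rank, dense_rank) lists for values sorted descending,
    mirroring SQL's RANK() and DENSE_RANK() tie handling."""
    ordered = sorted(values, reverse=True)
    ranks, dense = [], []
    for i, v in enumerate(ordered):
        if i > 0 and v == ordered[i - 1]:
            ranks.append(ranks[-1])   # tie: repeat the previous rank
            dense.append(dense[-1])   # tie: repeat the previous dense rank
        else:
            ranks.append(i + 1)       # RANK resumes at row position: skips after ties
            dense.append(dense[-1] + 1 if dense else 1)  # DENSE_RANK never skips
    return ranks, dense

salaries = [90, 80, 80, 70]
r, d = rank_and_dense_rank(salaries)
# r == [1, 2, 2, 4] and d == [1, 2, 2, 3], matching the comment above
```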
CTEs and Recursive CTEs
-- Standard CTE: readable, can be referenced multiple times
WITH monthly_revenue AS (
  SELECT DATE_TRUNC('month', order_date) AS month,
         SUM(amount) AS revenue
  FROM orders GROUP BY 1
),
revenue_with_growth AS (
  SELECT month, revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
    (revenue - LAG(revenue) OVER (ORDER BY month)) / LAG(revenue) OVER (ORDER BY month) * 100 AS growth_pct
  FROM monthly_revenue
)
SELECT * FROM revenue_with_growth WHERE month >= '2024-01-01';

-- Recursive CTE: organizational hierarchy
WITH RECURSIVE org_hierarchy AS (
  -- Anchor: start with the CEO (no manager)
  SELECT employee_id, name, manager_id, 0 AS level
  FROM employees WHERE manager_id IS NULL

  UNION ALL

  -- Recursive: join employees to their managers
  SELECT e.employee_id, e.name, e.manager_id, h.level + 1
  FROM employees e
  JOIN org_hierarchy h ON e.manager_id = h.employee_id
)
SELECT * FROM org_hierarchy ORDER BY level, name;
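The recursive CTE above can be mirrored in Python with an iterative expansion: start from the anchor rows (no manager), then repeatedly attach employees whose manager already has a level. The sample `employees` tuples below are made up for illustration:

```python
def org_levels(employees):
    """employees: list of (employee_id, name, manager_id) tuples.
    Returns {employee_id: level}; level 0 = no manager (the CEO)."""
    # Anchor member: employees with no manager
    levels = {e_id: 0 for e_id, _name, mgr in employees if mgr is None}
    changed = True
    while changed:  # recursive member: one "iteration" per hierarchy depth
        changed = False
        for e_id, _name, mgr in employees:
            if e_id not in levels and mgr in levels:
                levels[e_id] = levels[mgr] + 1
                changed = True
    return levels

emps = [(1, "ceo", None), (2, "vp", 1), (3, "eng_a", 2), (4, "eng_b", 2)]
hierarchy = org_levels(emps)
# hierarchy == {1: 0, 2: 1, 3: 2, 4: 2}
```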
SQL Query Optimization Techniques

Indexing Strategy

  • B-tree index: Default. Best for equality and range queries on columns in WHERE, JOIN, ORDER BY
  • Composite index: Index on (col_a, col_b). Most selective column first. Column order matters — query must use leftmost prefix
  • Covering index: Index includes all columns needed by query — avoids table lookup entirely
  • Partial index: Index only subset of rows (WHERE is_active = TRUE) — smaller, faster
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date DESC);

Query Plan Analysis

EXPLAIN ANALYZE
SELECT * FROM orders
WHERE customer_id = 123
  AND status = 'completed';
  • Seq Scan: Full table scan — bad for large tables
  • Index Scan: Uses index — good
  • Nested Loop: Good for small result sets
  • Hash Join: Good for large datasets without index
  • Merge Join: Both inputs sorted — efficient for ranges
  • Look for: high actual rows vs estimated rows (statistics stale → run ANALYZE)
Advanced SQL Patterns
-- PIVOT: rows to columns
SELECT *
FROM (SELECT month, category, revenue FROM sales)
PIVOT (SUM(revenue) FOR category IN ('Electronics', 'Clothing', 'Food'));

-- JSON operations (PostgreSQL)
SELECT
  user_id,
  metadata->>'email' AS email,           -- text extraction
  (metadata->'preferences'->>'theme')    -- nested access
FROM users WHERE metadata ? 'email';

-- Deduplication using ROW_NUMBER
DELETE FROM orders WHERE order_id IN (
  SELECT order_id FROM (
    SELECT order_id,
      ROW_NUMBER() OVER (PARTITION BY external_id ORDER BY created_at DESC) AS rn
    FROM orders
  ) t WHERE rn > 1
);

-- Running user cohort retention (classic DE interview question)
WITH cohorts AS (
  SELECT user_id, DATE_TRUNC('month', first_purchase) AS cohort_month
  FROM (SELECT user_id, MIN(order_date) AS first_purchase FROM orders GROUP BY 1) t
),
activity AS (
  SELECT o.user_id, c.cohort_month,
    DATEDIFF('month', c.cohort_month, o.order_date) AS period_number
  FROM orders o JOIN cohorts c USING(user_id)
)
SELECT cohort_month, period_number,
  COUNT(DISTINCT user_id) AS active_users
FROM activity
GROUP BY 1, 2 ORDER BY 1, 2;
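The deduplication pattern above (keep only the newest row per business key) can be sketched in plain Python to show exactly which rows the `rn = 1` filter keeps:

```python
def dedupe_latest(rows, key="external_id", ts="created_at"):
    """Keep only the most recent row per key: the survivors of
    ROW_NUMBER() ... ORDER BY created_at DESC with rn = 1."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[ts] > latest[k][ts]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"order_id": 1, "external_id": "A", "created_at": "2024-01-01"},
    {"order_id": 2, "external_id": "A", "created_at": "2024-01-03"},
    {"order_id": 3, "external_id": "B", "created_at": "2024-01-02"},
]
survivors = dedupe_latest(rows)
# order_id 2 (the latest "A") and order_id 3 survive; order_id 1 is removed
```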
Transaction Isolation Levels
| Level | Dirty Read | Non-Repeatable Read | Phantom Read | Use When |
|---|---|---|---|---|
| READ UNCOMMITTED | ✅ Possible | ✅ Possible | ✅ Possible | Never (for analytics only) |
| READ COMMITTED | ❌ Prevented | ✅ Possible | ✅ Possible | Default in most DBs — general use |
| REPEATABLE READ | ❌ Prevented | ❌ Prevented | ✅ Possible | Financial calculations needing consistent view |
| SERIALIZABLE | ❌ Prevented | ❌ Prevented | ❌ Prevented | Critical transactions (bank transfers) |
Apache Spark / PySpark
Spark Architecture

Components

  • Driver: Main process — creates SparkContext, submits jobs, coordinates execution. Runs on master node.
  • Executors: Worker processes that run tasks. Each has CPU cores + memory.
  • Cluster Manager: YARN, Kubernetes, Mesos, or Standalone. Allocates resources.
  • Task: Smallest unit of work — runs on one partition of data
  • Stage: Group of tasks that can run without shuffle
  • Job: Triggered by an action (collect, write, count)

Key Concepts

  • Lazy evaluation: Transformations don't execute until an action is called — allows Catalyst to optimize the full query plan
  • DAG: Directed Acyclic Graph of stages and tasks — Spark builds this before execution
  • Shuffle: Redistribution of data across partitions — expensive (involves disk I/O and network). Triggered by: groupBy, join, repartition, distinct
  • Spill: When data doesn't fit in memory, Spark spills to disk — very slow. Fix: increase memory or reduce data per partition
PySpark — Essential Operations
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

spark = (SparkSession.builder.appName("ETL")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
    .getOrCreate())

# Read data
df = spark.read.format("parquet").option("mergeSchema", "true").load("s3a://bucket/raw/orders/")

# Transformations
df_clean = (df
  .filter(F.col("amount") > 0)
  .withColumn("order_date", F.to_date("order_timestamp"))
  .withColumn("year_month", F.date_format("order_date", "yyyy-MM"))
  .withColumnRenamed("cust_id", "customer_id")
  .dropDuplicates(["order_id"])
  .na.fill({"discount": 0, "promo_code": "NONE"})
)

# Aggregation
monthly_revenue = df_clean.groupBy("year_month", "category").agg(
  F.sum("amount").alias("total_revenue"),
  F.count("order_id").alias("order_count"),
  F.countDistinct("customer_id").alias("unique_customers"),
  F.avg("amount").alias("avg_order_value"),
)

# Window functions in Spark
window_spec = Window.partitionBy("customer_id").orderBy("order_date")
df_with_rank = df_clean.withColumn("order_rank", F.row_number().over(window_spec))
df_first_order = df_with_rank.filter(F.col("order_rank") == 1)

# Delta Lake MERGE (upsert)
delta_table = DeltaTable.forPath(spark, "s3a://bucket/silver/orders/")
delta_table.alias("target").merge(
  df_clean.alias("source"),
  "target.order_id = source.order_id"
).whenMatchedUpdate(set={
  "amount": "source.amount",
  "status": "source.status",
  "updated_at": "current_timestamp()"
}).whenNotMatchedInsertAll().execute()

# Write with partitioning
df_clean.write.format("delta").mode("overwrite") \
  .partitionBy("year_month") \
  .option("overwriteSchema", "true") \
  .save("s3a://bucket/silver/orders/")
Spark Performance Optimization
| Problem | Symptom | Solution |
|---|---|---|
| Data skew | One executor takes 10x longer than others; one partition much larger | Salt join key (add random suffix), AQE (spark.sql.adaptive.enabled=true), skewHint() |
| Too many small files | Many small partitions; metadata overhead; slow file listing | coalesce() before write, OPTIMIZE in Delta Lake, delta.targetFileSize |
| Out of memory | GC overhead or OOM errors on executors | Increase executor memory, reduce partition size (repartition), cache only what's needed |
| Shuffle bottleneck | Shuffle Read/Write stage takes most time | Broadcast join for small tables, reduce groupBy cardinality, increase spark.sql.shuffle.partitions |
| Too few partitions | Not utilizing all cores; one executor saturated | Repartition: aim for 128MB–256MB per partition. Rule: cores × 2–4 |
# Adaptive Query Execution (Spark 3.0+) — enable always
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Broadcast join for small tables
df_orders.join(F.broadcast(df_small_reference), "product_id")

# Cache frequently used DataFrames
df_customers.cache()   # or .persist(StorageLevel.MEMORY_AND_DISK)
df_customers.count()   # trigger materialization

# Delta Lake OPTIMIZE + ZORDER for query performance
spark.sql("OPTIMIZE silver.orders ZORDER BY (customer_id, order_date)")
spark.sql("VACUUM silver.orders RETAIN 168 HOURS")  # 7-day retention
📨
Apache Kafka
Kafka Architecture Deep Dive

Core Components

  • Broker: Server that stores messages. Kafka cluster = multiple brokers. Each broker handles a subset of partitions.
  • Topic: Logical category of messages (like a table). Partitioned for parallelism.
  • Partition: Ordered, immutable log. Messages appended to end. Each partition is replicated across brokers.
  • Offset: Unique sequential ID for each message within a partition. Consumer tracks its offset to know where it left off.
  • Producer: Writes messages to topics. Determines partition via key hash or round-robin.
  • Consumer: Reads messages from topics. Stores offset in __consumer_offsets topic.
  • Consumer Group: Multiple consumers sharing work. Each partition consumed by exactly ONE consumer in a group.
  • ZooKeeper / KRaft: Manages broker metadata, leader election. KRaft replaces ZooKeeper in Kafka 3.x+
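Key-to-partition assignment is what guarantees per-key ordering: the producer hashes the message key modulo the partition count, so every event for a given key lands on the same partition. A minimal sketch of that property (using md5 as a stable stand-in for Kafka's actual murmur2 partitioner):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Deterministically map a message key to a partition.
    Kafka's default partitioner uses murmur2; md5 here is just a
    stable stand-in to show the same-key -> same-partition property."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = partition_for("u_123", 6)
p2 = partition_for("u_123", 6)
# p1 == p2: all events for user u_123 land on one partition,
# so that user's events are consumed in order
```

This is also why increasing the partition count of an existing topic breaks key ordering: the modulo changes, so the same key can map to a different partition.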

Delivery Semantics

Critical Interview Topic
At-most-once: Messages may be lost. Fire-and-forget.
At-least-once: No data loss, but duplicates possible. Consumer re-reads if it crashes before committing offset.
Exactly-once (EOS): No duplicates, no loss. Requires: idempotent producer (enable.idempotence=true) + transactional API + transactional consumers.

Key Producer Settings

  • acks=all: Wait for all in-sync replicas to acknowledge — strongest durability
  • retries=Integer.MAX_VALUE: Retry on transient failures
  • enable.idempotence=true: Exactly-once producer semantics
  • linger.ms=5: Batch messages for 5ms to improve throughput
  • batch.size=65536: Max batch size in bytes
  • compression.type=snappy: Compress batches — reduces network I/O
Kafka Python Example (kafka-python)
from kafka import KafkaProducer, KafkaConsumer
import json, time

# Producer — send events
producer = KafkaProducer(
    bootstrap_servers=['kafka:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    key_serializer=lambda k: k.encode('utf-8') if k else None,
    acks='all',              # wait for all replicas
    enable_idempotence=True,  # idempotent producer (requires a client version that supports this option)
    compression_type='snappy',
)

event = {"user_id": "u_123", "event": "purchase", "amount": 59.99, "ts": time.time()}
future = producer.send("user-events", key=event["user_id"], value=event)
record_metadata = future.get(timeout=10)  # blocks until ack received
print(f"Sent to partition {record_metadata.partition} @ offset {record_metadata.offset}")

producer.flush()
producer.close()

# Consumer — process events
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers=['kafka:9092'],
    group_id="analytics-pipeline",
    auto_offset_reset="earliest",   # replay from beginning on new consumer group
    enable_auto_commit=False,       # manual offset commit for at-least-once
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    max_poll_records=100,
)

for message in consumer:
    event = message.value
    # Process event
    process_event(event)
    # Commit offset AFTER successful processing
    consumer.commit()
🌬️
Apache Airflow
Modern Airflow DAG (TaskFlow API)
from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from datetime import datetime, timedelta

default_args = {
    "owner": "data-team",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "on_failure_callback": alert_slack,  # user-defined function that posts a failure alert to Slack
}

@dag(
    dag_id="orders_etl_pipeline",
    schedule="0 6 * * *",          # 6 AM daily
    start_date=datetime(2024, 1, 1),
    catchup=False,                   # don't backfill missed runs
    default_args=default_args,
    tags=["orders", "production"],
)
def orders_etl():

    @task()
    def extract_from_postgres(**context):
        """Extract yesterday's orders from OLTP database"""
        hook = PostgresHook(postgres_conn_id="prod_postgres")
        execution_date = context["ds"]   # YYYY-MM-DD of current run
        df = hook.get_pandas_df(
            "SELECT * FROM orders WHERE DATE(created_at) = %s",
            parameters=[execution_date],  # parameterized — avoids SQL injection
        )
        # Push to XCom (auto-handled by @task)
        return df.to_dict("records")

    @task()
    def transform(records):
        """Clean and transform records"""
        import pandas as pd
        df = pd.DataFrame(records)
        df["amount"] = df["amount"].astype(float)
        df = df[df["status"] != "cancelled"]
        df["processed_at"] = pd.Timestamp.now()
        return df.to_dict("records")

    @task()
    def load_to_s3(records, **context):
        """Write Parquet to S3"""
        import pandas as pd, io
        df = pd.DataFrame(records)
        s3_hook = S3Hook(aws_conn_id="aws_default")
        buffer = io.BytesIO()
        df.to_parquet(buffer, index=False)
        s3_hook.load_bytes(buffer.getvalue(),
            key=f"orders/date={context['ds']}/orders.parquet",
            bucket_name="my-data-lake", replace=True)

    # DAG structure: extract → transform → load
    raw = extract_from_postgres()
    clean = transform(raw)
    load_to_s3(clean)

orders_etl()
🔧
dbt (data build tool)
dbt Core Concepts

Materializations

| Type | Behavior | Use When |
|---|---|---|
| table | DROP + CREATE TABLE each run | Final gold layer models |
| view | CREATE OR REPLACE VIEW | Lightweight staging models |
| incremental | INSERT/MERGE only new records | Large tables — only process new data |
| ephemeral | CTE inlined into downstream query | Intermediate transformations |

Key Commands

dbt run                   # run all models
dbt run --select orders   # run specific model
dbt run --select +orders  # run + all ancestors
dbt test                  # run all tests
dbt test --select orders
dbt build                 # run + test together
dbt docs generate         # generate documentation
dbt docs serve            # serve docs locally
dbt compile               # compile without running
dbt debug                 # test connections
dbt snapshot              # run snapshots (SCD Type 2)
dbt Incremental Model Pattern
-- models/silver/stg_orders.sql
{{
  config(
    materialized = 'incremental',
    unique_key = 'order_id',
    incremental_strategy = 'merge',    -- 'merge' for upserts, 'append' for inserts only
    on_schema_change = 'sync_all_columns'
  )
}}

WITH source AS (
  SELECT * FROM {{ source('raw', 'orders') }}
  {% if is_incremental() %}
  -- Only process records newer than our latest
  WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
  {% endif %}
),
cleaned AS (
  SELECT
    order_id,
    customer_id,
    CAST(amount AS NUMERIC(12,2)) AS amount,
    LOWER(status) AS status,
    DATE(created_at) AS order_date,
    created_at,
    COALESCE(updated_at, created_at) AS updated_at
  FROM source
  WHERE amount > 0
    AND order_id IS NOT NULL
)
SELECT * FROM cleaned
dbt Tests, Sources & Macros
# schema.yml — define tests, sources, documentation
version: 2

sources:
  - name: raw
    database: my_database
    schema: raw_schema
    tables:
      - name: orders
        loaded_at_field: created_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}

models:
  - name: stg_orders
    description: "Cleaned and standardized orders from OLTP"
    columns:
      - name: order_id
        description: "Unique order identifier"
        tests:
          - not_null
          - unique
      - name: amount
        tests:
          - not_null
          - accepted_range:
              min_value: 0
              max_value: 100000
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'completed', 'refunded', 'cancelled']
☁️
Cloud Data Platforms
Snowflake Key Concepts

Architecture

  • Virtual Warehouse: Compute cluster (XS to 6XL). Auto-suspend when idle. Query performance scales with size. Cost = credits/hour × warehouse size.
  • Storage: Fully managed columnar storage (micro-partitions). Separate from compute → scale independently.
  • Time Travel: Access historical data up to 90 days. SELECT * FROM orders AT(OFFSET => -3600)
  • Zero-Copy Cloning: Instant, metadata-only clone of table/schema for dev/testing without duplicating data
  • Streams: Change data capture — track inserts/updates/deletes on a table. Consumed by Tasks.

Snowflake SQL Patterns

-- Auto-suspend warehouse (save cost)
ALTER WAREHOUSE analytics_wh SET
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;

-- Clustering key for large tables
ALTER TABLE orders CLUSTER BY (order_date);

-- Query historical data
SELECT * FROM orders
  AT (TIMESTAMP => '2024-01-15 10:00:00'::timestamp);

-- Zero-copy clone for dev
CREATE TABLE orders_dev CLONE orders;

-- Stream for CDC
CREATE STREAM orders_stream ON TABLE orders;
SELECT * FROM orders_stream WHERE METADATA$ACTION = 'INSERT';
BigQuery Key Concepts
| Feature | Details | Best Practice |
|---|---|---|
| Partitioning | Divide table into segments by date, integer range, or ingestion time. Queries scan only relevant partitions. | Always partition large tables by date. Filter on partition column in WHERE clause. |
| Clustering | Sort data within each partition by up to 4 columns. Auto-maintained. | Cluster by columns frequently used in WHERE and JOIN. Order: highest cardinality → lowest. |
| Slots | Unit of compute. On-demand: auto-scaled. Reservations: dedicated slots for predictable workloads. | Use on-demand for unpredictable workloads; reservations for production pipelines. |
| Materialized Views | Pre-computed query results, auto-refreshed. Smart rewrite — queries on base table can use MV. | Use for expensive aggregation queries run frequently. |
| INFORMATION_SCHEMA | System views for job metadata, table sizes, slot usage, query history | Monitor costs: SELECT * FROM INFORMATION_SCHEMA.JOBS_BY_PROJECT ORDER BY total_bytes_billed DESC LIMIT 20 |
🎯
Top 30 Interview Questions & Answers
Q1. What is the difference between a Data Lake and a Data Warehouse?
Answer: A Data Warehouse stores structured, processed data with a predefined schema (schema-on-write). It's optimized for SQL analytics and BI reporting. Data is cleaned and modeled before loading. Examples: Snowflake, BigQuery, Redshift. A Data Lake stores raw data in any format (structured, semi-structured, unstructured) with schema applied at query time (schema-on-read). It's cheaper, more flexible, but requires more processing before it's usable. Examples: S3, ADLS, GCS. A Lakehouse (Delta Lake, Iceberg) combines the low-cost storage of data lakes with the ACID transactions and schema enforcement of data warehouses. Modern architectures often use: raw data in object storage (bronze), Delta Lake for silver/gold, Snowflake/BigQuery for serving layer.
Q2. How do you handle late-arriving data in a streaming pipeline?
Answer: Late data is a fundamental challenge in streaming. Solutions: (1) Watermarks: Tell the stream processor to wait W time before considering a window complete. In Spark Structured Streaming: df.withWatermark("event_time", "2 hours"). Events arriving more than 2 hours late are dropped. (2) Reprocessing: Keep raw events in Kafka for 7 days. Periodically reprocess to correct historical windows. (3) Lambda/Kappa architecture: Batch layer corrects streaming layer's approximate results. (4) Accumulating snapshot: Update the same aggregate row as late data arrives (requires idempotent writes). (5) Allowed lateness in Flink: Process late events up to X hours late, emit corrections. The right strategy depends on latency SLA vs. correctness requirements.
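A watermark is essentially "max event time seen so far, minus the allowed lateness"; events older than the watermark are dropped. A minimal sketch of that rule (timestamps are plain integers for illustration, not real event times):

```python
def filter_with_watermark(events, allowed_lateness):
    """events: list of (event_time, payload) in arrival order.
    Keep an event only if it is no older than
    (max event_time seen so far) - allowed_lateness."""
    max_seen = float("-inf")
    kept, dropped = [], []
    for event_time, payload in events:
        max_seen = max(max_seen, event_time)
        if event_time >= max_seen - allowed_lateness:
            kept.append(payload)
        else:
            dropped.append(payload)  # too late: behind the watermark
    return kept, dropped

events = [(100, "a"), (105, "b"), (96, "late_ok"), (90, "too_late")]
kept, dropped = filter_with_watermark(events, allowed_lateness=10)
# kept == ["a", "b", "late_ok"]; "too_late" (time 90 < 105 - 10) is dropped
```

Real engines additionally hold window state open until the watermark passes the window end; the dropping rule above is the core idea.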
Q3. What is idempotency and why is it critical in data pipelines?
Answer: An idempotent operation produces the same result regardless of how many times it's executed. Critical because: pipelines WILL fail and be retried. If your pipeline isn't idempotent, retries cause duplicate data. Examples of idempotent patterns: (1) MERGE/UPSERT instead of INSERT: INSERT can create duplicates; MERGE on natural key is safe to retry. (2) Write to S3 with deterministic path: Overwrite same path on retry — same content, no duplicate files. (3) Use execution date as partition key: Overwrite year=2024/month=01/day=15/ on retry — no duplicates. (4) Truncate-and-insert: For small tables, delete all rows for the period, then insert fresh. (5) Exactly-once Kafka consumers: Use transactional API to prevent duplicate processing. Always design for "what happens if this runs twice?"
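The "what happens if this runs twice?" test is easy to demonstrate: a keyed upsert converges to the same state no matter how many times the same batch is applied. A plain-dict sketch standing in for MERGE:

```python
def upsert(table: dict, batch: list, key: str = "order_id") -> dict:
    """MERGE-style load: insert-or-overwrite by natural key.
    Re-running the same batch leaves the table unchanged (idempotent)."""
    for row in batch:
        table[row[key]] = row
    return table

batch = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 20.0}]
table = upsert({}, batch)
table_after_retry = upsert(dict(table), batch)  # simulate a retried run
# Both runs produce the identical two-row table: no duplicates
```

A plain append (`list.extend`) under the same retry would yield four rows, which is exactly the duplicate problem the answer describes.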
Q4. Explain SCD Type 2 and how you implement it.
Answer: SCD Type 2 (Slowly Changing Dimension Type 2) preserves full historical record of changes by inserting a new row for each change rather than overwriting. Implementation adds three columns: valid_from (when this version became active), valid_to (when this version was superseded — NULL for current), is_current (boolean flag). Implementation via MERGE: when source record matches target natural key and attributes changed → update old row (set valid_to = today, is_current = false) → insert new row (valid_from = today, valid_to = NULL, is_current = true). In dbt, use dbt snapshot with strategy: timestamp or check. In Delta Lake, MERGE handles this elegantly. Key uses: tracking customer address changes (for shipping), price changes over time, employee department changes.
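The expire-then-insert mechanics can be sketched in Python: when a tracked attribute changes, close the current version and append a new one. This is a toy model of the MERGE, with dates as plain strings and only `city` tracked:

```python
def apply_scd2(dim_rows, incoming, today):
    """dim_rows: dicts with customer_id, city, valid_from, valid_to,
    is_current. Applies one incoming record SCD Type 2 style."""
    for row in dim_rows:
        if row["customer_id"] == incoming["customer_id"] and row["is_current"]:
            if row["city"] == incoming["city"]:
                return dim_rows                # no change: nothing to do
            row["is_current"] = False          # expire the old version
            row["valid_to"] = today
            break
    dim_rows.append({                          # insert the new current version
        "customer_id": incoming["customer_id"], "city": incoming["city"],
        "valid_from": today, "valid_to": None, "is_current": True,
    })
    return dim_rows

dim = [{"customer_id": 1, "city": "NYC", "valid_from": "2023-01-01",
        "valid_to": None, "is_current": True}]
dim = apply_scd2(dim, {"customer_id": 1, "city": "Austin"}, "2024-06-01")
# Two rows now: the expired NYC version plus the current Austin version
```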
Q5. How does Kafka ensure exactly-once delivery?
Answer: Exactly-once in Kafka requires three components working together: (1) Idempotent Producer: enable.idempotence=true — each producer gets a PID, each message gets a sequence number. Broker rejects duplicate sequence numbers, preventing duplicate writes even on retries. (2) Transactional API: Producer begins transaction, writes to multiple partitions atomically, commits or aborts. If producer fails mid-write, uncommitted messages are never consumed. (3) Transactional Consumer: isolation.level=read_committed — consumer only reads committed transactions, never sees aborted messages. Kafka Streams provides exactly-once E2E: processing.guarantee=exactly_once_v2. This is critical for financial use cases (payment events, inventory updates).
Q6. What is data skew in Spark and how do you fix it?
Answer: Data skew occurs when data is unevenly distributed across partitions — one partition gets 10x more data than others. This causes some tasks to take much longer, bottlenecking the job. Detection: check Spark UI Stages tab — if one task is much slower than median. Causes: joining on a key with highly non-uniform distribution (e.g., joining on country_code where 90% of rows are 'US'). Fixes: (1) Salting: Add random suffix (0–9) to join key on the large table, expand the small table 10x with each suffix → distribute load. (2) AQE (Spark 3+): spark.sql.adaptive.skewJoin.enabled=true — automatically splits skewed partitions. (3) Broadcast join: If one side is small (<10MB), broadcast it to avoid shuffle entirely. (4) Repartition before groupBy: Use repartition on a different column to distribute data before the skewed operation.
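Salting is easiest to see with counts: appending a random suffix to the hot key on the large-table side splits one oversized partition into N balanced sub-keys (the small table is then replicated once per suffix so the join still matches). A pure-Python sketch with a made-up 90%-skewed dataset:

```python
import random
from collections import Counter

random.seed(42)
SALT_BUCKETS = 10

# 90% of rows share the hot key 'US'
rows = [{"country": "US"}] * 900 + [{"country": "CA"}] * 100

# Without salting, every 'US' row hashes to the same join key
unsalted = Counter(r["country"] for r in rows)

# With salting, the hot key fans out across 10 sub-keys
salted = Counter(
    f"{r['country']}_{random.randrange(SALT_BUCKETS)}" for r in rows
)

max_unsalted = max(unsalted.values())  # 900 rows behind one key
max_salted = max(salted.values())      # roughly 900 / 10 per salted sub-key
```

The largest partition shrinks by about the salt factor, which is exactly what unblocks the straggler task.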
Q7. How do you design a partitioning strategy for a large table?
Answer: Partitioning splits data into separate files/directories to enable partition pruning — the query engine reads only the relevant partition. Design principles: (1) Cardinality: Partition key should have low-to-medium cardinality. Date (365/year) is ideal. Too high cardinality (user_id) creates millions of tiny files — "small file problem". (2) Query patterns: Partition on the column most commonly used in WHERE clause. (3) Partition size: Target 128MB–1GB per partition file. Too small → overhead; too large → parallelism issues. (4) Common partitioning: Date/time (by year/month/day depending on volume), region/country (for geo-distributed data), event_type (for log data). In BigQuery, use date partitioning + clustering. In Delta Lake, partitionBy("year_month") + ZORDER BY customer_id for secondary filtering.
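Partition pruning is, at bottom, path filtering: because the partition value is encoded in the directory name, the engine can skip whole directories without opening a single file. A sketch of Hive-style date paths (the bucket and table names are made up):

```python
def partition_path(base: str, dt: str) -> str:
    """Hive-style partition layout: one directory per year/month/day."""
    year, month, day = dt.split("-")
    return f"{base}/year={year}/month={month}/day={day}/"

def prune(paths: list, year: str, month: str) -> list:
    """Keep only paths for the requested year/month. The engine never
    lists or reads the skipped directories -- that is partition pruning."""
    needle = f"year={year}/month={month}/"
    return [p for p in paths if needle in p]

paths = [partition_path("s3://lake/orders", d)
         for d in ("2024-01-15", "2024-01-16", "2024-02-01")]
jan = prune(paths, "2024", "01")
# Only the two January directories survive pruning
```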
Q8. Explain the Medallion Architecture and when you'd deviate from it.
Answer: Medallion (Bronze-Silver-Gold) is a layered data architecture: Bronze = raw data exactly as received, Silver = cleaned/validated/standardized data, Gold = business-aggregated, query-optimized data. Benefits: (1) Each layer has a clear contract and owner. (2) Bronze allows reprocessing from raw data at any time. (3) Separation of concerns — raw storage vs. business logic. (4) Failed pipelines can be re-run from Bronze without re-ingesting. When to deviate: if data is extremely simple and a two-layer approach is sufficient; for streaming where the boundary between layers is less clear; for very small datasets where overhead isn't justified. Some teams add a "Platinum" layer for ML-specific feature tables. The pattern works best with Delta Lake/Iceberg format for ACID support at each layer.
Q9. What is Change Data Capture (CDC) and how do you implement it?
Answer: CDC captures every INSERT, UPDATE, DELETE event on a source database and streams them downstream — enabling real-time replication without full table scans. Implementation methods: (1) Log-based CDC (best): Read database transaction logs (MySQL binlog, PostgreSQL WAL, SQL Server CDC). Tools: Debezium (open-source), AWS DMS, Fivetran. Low latency (<1s), no impact on source. (2) Trigger-based: Database triggers write changes to audit table. High overhead, not recommended for high-volume tables. (3) Polling/Timestamp-based: Query WHERE updated_at > last_run_time. Simple but misses deletes, vulnerable to missed updates if updated_at isn't maintained. CDC is critical for: building real-time data warehouse, synchronizing microservices databases, audit trails. Debezium → Kafka → Spark/Flink is the most common production pattern.
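Applying CDC events downstream can be sketched like this. The envelope fields (`op`, `before`, `after`) follow Debezium's format; the table and key names are illustrative:

```python
# Sketch of replaying Debezium-style change events into a local replica.
# Note how the delete is captured, which timestamp-polling CDC would miss.

def apply_event(table, event):
    op = event["op"]
    if op in ("c", "r", "u"):              # create / snapshot read / update
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":                        # delete
        table.pop(event["before"]["id"], None)

replica = {}
events = [
    {"op": "c", "after": {"id": 1, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"},
                "after":  {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 1, "status": "paid"}},
]
for e in events:
    apply_event(replica, e)
print(replica)     # {} -- the delete arrived as an event, not a silent gap
```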
Q10. How do you ensure data quality in a pipeline?
Answer: Multi-layer approach: (1) Schema validation: Enforce schema on ingest — fail fast if source changes (dbt on_schema_change, Delta Lake schema enforcement). (2) Business rule tests: dbt tests — not_null, unique, accepted_values, relationships (referential integrity), custom SQL tests. (3) Statistical monitoring: Great Expectations, dbt Elementary — monitor column means, null rates, distributions, detect anomalies. (4) Freshness checks: Alert if Bronze data isn't arriving on schedule (dbt source freshness). (5) Row count reconciliation: Compare source row count to destination. (6) Data contracts: Formal agreements between data producers and consumers — schema, freshness, SLAs. (7) Quarantine bad records: Route records failing quality checks to error table for investigation rather than silently dropping. Quality gates at each Medallion layer — Silver models should not load if Bronze quality checks fail.
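Point (7), quarantining instead of dropping, can be sketched as a split step. The check names and rules here are illustrative, not from any particular framework:

```python
# Route records that fail quality checks to a quarantine set, annotated with
# which checks failed, instead of silently discarding them.

CHECKS = {
    "order_id_present": lambda r: r.get("order_id") is not None,
    "amount_positive":  lambda r: (r.get("amount") or 0) > 0,
}

def split(records):
    clean, quarantine = [], []
    for r in records:
        failed = [name for name, check in CHECKS.items() if not check(r)]
        if failed:
            quarantine.append({**r, "_failed_checks": failed})  # keep evidence
        else:
            clean.append(r)
    return clean, quarantine

clean, bad = split([
    {"order_id": 1, "amount": 9.99},
    {"order_id": None, "amount": -5},
])
print(len(clean), bad[0]["_failed_checks"])
# 1 ['order_id_present', 'amount_positive']
```

The `_failed_checks` column makes the quarantine table self-explanatory when someone investigates it later.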
Q11–Q30 (Additional Critical Questions)
Q11. Explain ACID properties. Atomicity: all or nothing (transaction commits or fully rolls back). Consistency: data always meets constraints (FK, NOT NULL). Isolation: concurrent transactions don't see each other's intermediate state. Durability: committed transactions survive crashes (written to WAL before commit). Delta Lake brings ACID to object storage via transaction logs.

Q12. What is the difference between repartition and coalesce in Spark? repartition(n): triggers full shuffle — redistributes data evenly across n partitions. Use to increase or decrease partitions, or when you need data evenly distributed. coalesce(n): narrow transformation — combines existing partitions without full shuffle (only decreases partition count). Use when reducing partition count since it's much cheaper. Rule: increase partitions → repartition; decrease partitions → coalesce.
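A toy model (illustrative, not Spark's actual internals) makes the cost difference concrete: coalesce only glues whole existing partitions together, while repartition reassigns every individual row.

```python
import itertools

def coalesce(parts, n):
    # Merge existing partitions wholesale; rows never leave their grouping,
    # so no per-row shuffle -- but skew can survive the merge.
    out = [[] for _ in range(n)]
    for i, p in enumerate(parts):
        out[i % n].extend(p)
    return out

def repartition(rows, n):
    # Full shuffle: every row is individually reassigned (round-robin here).
    out = [[] for _ in range(n)]
    for i, r in enumerate(rows):
        out[i % n].append(r)
    return out

parts = [[1, 2], [3], [4, 5, 6], [7]]                 # 4 skewed partitions
print(coalesce(parts, 2))        # [[1, 2, 4, 5, 6], [3, 7]] -- still skewed
print(repartition(list(itertools.chain(*parts)), 2))  # [[1, 3, 5, 7], [2, 4, 6]]
```

This is also why coalesce can leave partitions unbalanced: it never moves individual rows, only whole partitions.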

Q13. What is a broadcast join in Spark and when do you use it? A broadcast join copies the smaller table to all executor nodes, avoiding shuffle of the large table. Triggered automatically when small table < spark.sql.autoBroadcastJoinThreshold (default 10MB). Force it: df_large.join(F.broadcast(df_small), "key"). Use when: one table fits in memory on each executor, join on a small lookup table (country codes, product categories). Avoids expensive shuffle of the large table.
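The same idea in plain Python (table contents are illustrative): build a hash map from the small lookup table once, the "broadcast", then probe it per large-side row, so the large side is never redistributed or sorted.

```python
# Map-side join: the small dimension table becomes an in-memory lookup that
# every "executor" could hold a full copy of; the fact rows stream past it.

country_names = {"US": "United States", "DE": "Germany"}   # small dim table

orders = [("o1", "US"), ("o2", "DE"), ("o3", "US")]        # large fact side

enriched = [(oid, cc, country_names.get(cc)) for oid, cc in orders]
print(enriched[0])     # ('o1', 'US', 'United States')
```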

Q14. Describe the difference between Lambda and Kappa architecture. Lambda has two parallel paths: batch layer (accurate, high latency) + speed layer (approximate, low latency) → serving layer merges results. Complex to maintain — same logic in two systems. Kappa: stream processing only. Kafka retains full log; reprocess from beginning when accuracy needed. Simpler, but streaming must handle both real-time and historical workloads. Kappa is preferred in modern systems with mature streaming frameworks.

Q15. How do you handle schema evolution in a data pipeline? Options: (1) Delta Lake / Iceberg: schema evolution supported — mergeSchema automatically handles additive changes. (2) dbt on_schema_change='sync_all_columns' adds new columns automatically. (3) Backward compatibility: always add new columns, never delete/rename (breaking change). (4) Schema registry (Confluent): enforce Avro/Protobuf schemas on Kafka topics — producers must register schemas, consumers know schema evolution rules. (5) Versioned data: land new schema as new path, transform both old and new versions in Silver.
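The additive-only rule in (3) can be sketched as a schema merge. This is an illustrative stand-in for what `mergeSchema` / `sync_all_columns` do, with example column names:

```python
# Union the columns seen across batches, backfill new columns with None for
# older rows, and never drop a column (dropping would be a breaking change).

def merge_schema(batches):
    columns = []                              # stable order: first seen wins
    for batch in batches:
        for row in batch:
            for col in row:
                if col not in columns:
                    columns.append(col)
    rows = [{c: row.get(c) for c in columns}
            for batch in batches for row in batch]
    return columns, rows

cols, rows = merge_schema([
    [{"id": 1, "email": "a@x.com"}],
    [{"id": 2, "email": "b@x.com", "plan": "pro"}],   # 'plan' added upstream
])
print(cols)        # ['id', 'email', 'plan']
print(rows[0])     # {'id': 1, 'email': 'a@x.com', 'plan': None}
```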

Q16. What is the difference between Star Schema and Data Vault? Star Schema: fact + dimension tables, denormalized, optimized for query performance, easier to understand. Best for stable business definitions. Data Vault: Hubs (business keys), Links (associations), Satellites (attributes + history). Fully normalized, insert-only, highly scalable, agnostic to business rules. Best for enterprise DW where source systems and business rules change frequently, need full auditability.

Q17. How do you monitor a data pipeline in production? Layers: (1) Infrastructure: Prometheus + Grafana for CPU, memory, disk, network. (2) Pipeline metadata: Airflow UI for task status, duration, SLA misses. (3) Data quality: dbt test results, Great Expectations alerts. (4) Data freshness: monitor last load time per table, alert if stale. (5) Business metrics: anomaly detection on key KPIs (sudden drops in order count = pipeline issue vs. real business drop). (6) Alerting: PagerDuty for critical failures, Slack for warnings. (7) Data lineage: dbt docs shows impact analysis — know downstream consumers before changing a model.

Q18. Explain Kafka Consumer Groups. Consumer group = set of consumers that collectively read from a topic. Each partition is assigned to exactly ONE consumer in the group — ensures no duplicate processing within the group. Add consumers to scale throughput (up to #partitions max). Multiple groups can read the same topic independently (different offsets per group). Consumer group coordinator (broker) handles rebalancing when consumers join/leave. Lag = difference between latest offset and consumer's committed offset — monitor with kafka-consumer-groups.sh or Burrow.
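The lag calculation itself is simple arithmetic per partition, as in this sketch (the offset numbers are illustrative):

```python
# Per partition: lag = latest (log-end) offset minus the group's committed
# offset. Summing across partitions gives total group lag.

def consumer_lag(end_offsets, committed):
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}

end_offsets = {0: 1500, 1: 1480, 2: 9900}    # broker's latest offsets
committed   = {0: 1500, 1: 1475, 2: 4000}    # group's committed offsets

lag = consumer_lag(end_offsets, committed)
print(lag)                 # {0: 0, 1: 5, 2: 5900} -- partition 2 is far behind
print(sum(lag.values()))   # 5905 total lag for the group
```

A single partition with large lag (like partition 2 here) is itself a skew signal: one consumer in the group is overloaded or stuck.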

Q19. What is dbt and why has it become dominant in modern data stacks? dbt (data build tool) brings software engineering best practices to SQL transformations: version control (Git), testing, documentation, modular design, dependency management. Engineers write SELECT statements; dbt handles CREATE TABLE/VIEW, generates DAGs, manages dependencies. Works inside your DW — no separate compute needed. Modern stack: dbt + Snowflake/BigQuery = full ELT pipeline. Key advantages: every model is tested, documented, and version-controlled; automatic lineage graph; code reuse via macros/packages; CI/CD integration via dbt Cloud or GitHub Actions.
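The "every model is tested and documented" claim lives in YAML alongside the models. A minimal `schema.yml` sketch (model and column names are illustrative):

```yaml
version: 2

models:
  - name: stg_orders
    description: "Staged orders, one row per order"
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - accepted_values:
              values: ["placed", "shipped", "returned"]
```

`dbt test` compiles each of these into a SQL query that fails the run if it returns rows.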

Q20. How do you optimize Snowflake query costs? (1) Partition pruning: always filter on cluster key column (date). (2) Auto-suspend: set AUTO_SUSPEND = 60 seconds on all warehouses. (3) Result caching: identical queries use cached results (24h TTL) — free. (4) Warehouse sizing: use smallest size that meets SLA. Scale up for complex queries, not for concurrent users (use multi-cluster). (5) COPY vs INSERT: COPY INTO is 10x faster for bulk loading. (6) Zero-copy clones for dev/test instead of copying data. (7) INFORMATION_SCHEMA.QUERY_HISTORY: identify expensive queries. (8) Materialized views for repeated expensive aggregations.
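Points (2) and (4) are one statement in practice. An illustrative example (the warehouse name and size are placeholders):

```sql
-- Smallest size that meets the SLA, suspend after 60 idle seconds,
-- and wake transparently on the next query.
ALTER WAREHOUSE analytics_wh SET
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60
  AUTO_RESUME    = TRUE;
```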

Q21–30 Summary. Other high-frequency topics:
  • Delta Lake Z-ORDER vs. traditional partitioning: Z-ORDER clusters data by multiple columns for multi-dimensional filtering
  • Airflow XComs: cross-task communication via key-value pairs; keep payloads small (metadata, not DataFrames)
  • Spark Catalyst Optimizer: rule-based + cost-based query planning that generates an optimized physical plan
  • Databricks Auto Loader: incremental file ingestion from S3/ADLS with schema inference and evolution; replaces manual file listing
  • Data contracts: formal schema + SLA agreements between producers and consumers, enforced via schema registries
  • Reverse ETL: warehouse data pushed to operational tools like Salesforce and Marketo via Hightouch/Census
  • Iceberg vs. Delta Lake: both bring ACID to object storage; Iceberg is an open standard with broad engine support, Delta Lake is Databricks-led with richer features
  • Great Expectations: data quality framework for defining and testing expectations on DataFrames and databases
  • dbt Exposures: document downstream consumers (BI dashboards, ML models) in the lineage graph
  • Airflow Sensors: wait for an external condition (S3KeySensor for S3 objects, SqlSensor for a query result, HttpSensor for API availability)
🏗️
Pipeline Design Patterns
Real-Time Analytics Pipeline
  • Sources: App events → Kafka (producers with schema registry)
  • Stream Processing: Flink/Spark Streaming — filter, enrich, aggregate
  • Storage: ClickHouse or Apache Druid for sub-second OLAP queries
  • Serving: REST API → real-time dashboards
  • Batch reconciliation: Daily Spark job corrects streaming results
  • Monitoring: Consumer lag, processing latency, error rates
Modern Batch Data Warehouse
  • Ingestion: Fivetran/Airbyte → raw tables in Snowflake/BigQuery
  • Transformation: dbt models — staging → intermediate → mart layers
  • Orchestration: Airflow or dbt Cloud schedules + dependencies
  • Quality: dbt tests + Great Expectations at each layer
  • Serving: BI tools connect to gold layer (Looker, Tableau, Power BI)
  • Governance: Column-level lineage, PII tagging, access controls
📖
Glossary
OLTP: Online Transaction Processing — row-oriented, optimized for reads/writes of individual records (e-commerce, banking)
OLAP: Online Analytical Processing — columnar, optimized for aggregation queries across large datasets
DAG: Directed Acyclic Graph — pipeline with defined dependencies, no cycles
Idempotency: Same operation produces the same result regardless of how many times it is executed
Backfill: Running a pipeline on historical data to populate past time periods
Compaction: Merging many small files into fewer large files for query efficiency
Schema drift: Unexpected changes to source data schema that break downstream pipelines
Watermark: Streaming concept — the maximum allowed lateness before a window is finalized
Partition pruning: Query engine skips partitions irrelevant to the WHERE clause — massive scan cost reduction
Data lineage: Tracking where data came from, how it was transformed, and where it goes
Data contract: Formal agreement between data producer and consumer on schema, freshness, and SLA
Exactly-once: Each message/event processed exactly one time — no duplicates, no data loss
Consumer lag: Difference between the latest Kafka offset and the consumer's committed offset — measures pipeline delay
Catalyst optimizer: Spark's query optimizer that plans and optimizes transformations before execution
Shuffle: Redistribution of data across Spark partitions — expensive due to network I/O
Feature store: Centralized repository of ML features — serves both training (batch) and inference (real-time)
Reverse ETL: Moving transformed data from the warehouse back to operational tools (CRM, marketing)
Z-ORDER: Delta Lake multi-dimensional clustering that optimizes filtering on multiple columns
PSI: Population Stability Index — measures distribution shift between two datasets (>0.25 = significant)
Semantic layer: Abstraction layer that defines business metrics consistently across BI tools

Thejaslearning — Data Engineer Cheat Sheet