🔧 Career Cheat Sheet

Data Engineer

SQL, Spark, Kafka, Airflow, dbt, cloud platforms — everything to crack any DE interview

  • US Salary Range: $100K–$175K
  • Interview Q&As: 30+
  • Code Snippets: 40+
  • Core Tools: 6

📋 Table of Contents

  1. Quick Reference
  2. Core Concepts
  3. Advanced SQL
  4. Apache Spark / PySpark
  5. Apache Kafka
  6. Apache Airflow
  7. dbt
  8. Cloud Platforms
  9. Code Snippets
  10. Top 30 Interview Q&As
  11. Pipeline Design Patterns
  12. Glossary
Quick Reference Card
| Topic | Key Command / Concept | Remember This |
|---|---|---|
| Window function | ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) | PARTITION BY = groups, ORDER BY = ordering within group |
| CTE | WITH cte AS (SELECT ...) SELECT * FROM cte | Reusable, readable — may be inlined or materialized, engine-dependent |
| Spark repartition | df.repartition(200) / df.coalesce(10) | repartition = full shuffle; coalesce = narrow (no shuffle) |
| Kafka offset | auto.offset.reset = 'earliest'/'latest' | earliest = replay all; latest = new messages only |
| Delta MERGE | MERGE INTO target USING source ON condition WHEN MATCHED THEN UPDATE | Upsert pattern for incremental loads |
| Airflow DAG | @dag(schedule='0 6 * * *', catchup=False) | catchup=False prevents backfilling missed runs |
| dbt incremental | {% if is_incremental() %} WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }}) {% endif %} | Only process new/changed records |
| SCD Type 2 | Add valid_from, valid_to, is_current columns; MERGE on natural key | Keep full history — each change creates a new row |
| Snowflake warehouse | ALTER WAREHOUSE wh SET AUTO_SUSPEND = 60 | Suspend idle warehouses to save cost |
| BigQuery partition | PARTITION BY DATE(event_timestamp) | Partition pruning = massive cost savings |
| Spark broadcast join | spark.sql.autoBroadcastJoinThreshold = 10MB (default) | Broadcast small table to all executors — no shuffle |
| CAP Theorem | Can only guarantee 2 of: Consistency, Availability, Partition tolerance | Partitions will happen in a distributed system → choose C or A |
🧠
Core Concepts
Data Storage Architectures
| Architecture | Storage | Schema | Best For | Examples |
|---|---|---|---|---|
| Data Warehouse | Columnar, structured | Schema-on-write (predefined) | BI, analytics, SQL queries | Snowflake, BigQuery, Redshift |
| Data Lake | Object storage (raw) | Schema-on-read (flexible) | Raw storage, ML, exploration | S3, GCS, ADLS |
| Lakehouse | Object storage + table format | Schema enforcement optional | Unified analytics + ML | Databricks, Delta Lake, Apache Iceberg |
| Data Mart | Columnar, structured | Schema-on-write | Department-specific analytics | Subset of DW for Sales/Finance |
ETL vs ELT

ETL (Extract → Transform → Load)

  • Transform data BEFORE loading into destination
  • Traditional approach — used when destination can't handle raw data
  • Transformation logic lives in middle tier (Spark, custom code)
  • Use when: legacy DWs, compliance/PII masking before storage, complex transformations

ELT (Extract → Load → Transform)

  • Load raw data FIRST, transform inside the destination
  • Modern approach — cloud DWs are powerful enough to transform at scale
  • dbt runs transformations inside Snowflake/BigQuery using SQL
  • Use when: modern cloud DW, need raw data for multiple use cases, easier debugging
💡 Modern Best Practice
ELT + dbt is now the dominant pattern. Load raw data to S3/GCS (bronze), land it in your DW staging schema, then dbt transforms it into silver (cleaned) and gold (business-ready) layers.
Data Modeling Patterns

Star Schema

  • Central fact table with foreign keys to dimension tables
  • Denormalized — fast query performance
  • Best for: analytics, BI reporting
  • Example: fact_orders → dim_customer, dim_product, dim_date

Snowflake Schema

  • Normalized dimension tables (dimensions have sub-dimensions)
  • Saves storage but more complex JOINs
  • Example: dim_product → dim_category → dim_department
  • Use when: storage matters, many dimension hierarchies

Data Vault

  • Hubs (business keys), Links (relationships), Satellites (attributes)
  • Insert-only — full historical auditability
  • Highly scalable, parallel loading
  • Best for: enterprise DW with frequent schema changes
Slowly Changing Dimensions (SCD)
| Type | How It Works | Keeps History? | Use Case |
|---|---|---|---|
| Type 1 | Overwrite old value with new value | ❌ No | Correcting data errors (typos, fix address) |
| Type 2 | Insert new row; old row gets end date; add is_current flag | ✅ Full history | Track changes over time (customer moves, price changes) |
| Type 3 | Add "previous value" column to existing row | ⚠️ One previous version | Track one-time planned changes |
| Type 4 | Store current in main table, history in separate history table | ✅ In separate table | High-volume history with fast current lookups |

SCD Type 2 Implementation Pattern:

-- SCD Type 2 MERGE in SQL
-- Step 1: expire the current row when tracked attributes change,
-- and insert brand-new customers
MERGE INTO dim_customer AS target
USING staging_customer AS source
ON target.customer_id = source.customer_id AND target.is_current = TRUE
WHEN MATCHED AND (
    target.email != source.email OR target.city != source.city
) THEN UPDATE SET
    is_current = FALSE,
    valid_to = CURRENT_DATE
WHEN NOT MATCHED THEN INSERT (
    customer_id, email, city, valid_from, valid_to, is_current
) VALUES (
    source.customer_id, source.email, source.city, CURRENT_DATE, NULL, TRUE
);
-- Step 2: insert the new current version for each row expired in step 1
INSERT INTO dim_customer (customer_id, email, city, valid_from, valid_to, is_current)
SELECT source.customer_id, source.email, source.city, CURRENT_DATE, NULL, TRUE
FROM staging_customer source JOIN dim_customer target
  ON target.customer_id = source.customer_id
 AND target.is_current = FALSE
 AND target.valid_to = CURRENT_DATE;
Batch vs Streaming Processing

Batch Processing

  • What: Process accumulated data at scheduled intervals (hourly, daily)
  • Tools: Spark, dbt, Airflow, Hadoop MapReduce
  • Latency: Minutes to hours
  • Use cases: Daily reporting, data warehouse loads, model training, batch ML scoring
  • Pros: Simple, high throughput, efficient for large volumes, replay-friendly

Stream Processing

  • What: Process events as they arrive, continuously
  • Tools: Kafka Streams, Flink, Spark Structured Streaming, Kinesis
  • Latency: Milliseconds to seconds
  • Use cases: Fraud detection, real-time dashboards, recommendations, alerting
  • Challenges: Late data, exactly-once semantics, state management
Medallion Architecture (Bronze → Silver → Gold)
🥉 Bronze Layer
  • Raw data exactly as received from source
  • No transformations — preserve original
  • Schema-on-read
  • Append-only, immutable
  • All historical data retained
  • Source for reprocessing
🥈 Silver Layer
  • Cleaned, validated, standardized data
  • Deduplication, null handling
  • Data type enforcement
  • PII masked/tokenized
  • Joined with reference data
  • Conformed dimensions
🥇 Gold Layer
  • Business-ready aggregates
  • Fact & dimension tables
  • Denormalized for query speed
  • KPIs, metrics, reporting tables
  • ML feature tables
  • Served to BI tools / APIs
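The three layers can be sketched end to end in plain Python. This is a toy illustration of the contracts above, not a real table format; the records and field names are made up:

```python
# Toy medallion flow: bronze keeps raw events untouched, silver cleans,
# gold aggregates. Illustrative only; real pipelines use Spark/Delta.
raw_events = [
    {"order_id": 1, "amount": "59.99", "status": "completed"},
    {"order_id": 1, "amount": "59.99", "status": "completed"},  # duplicate
    {"order_id": 2, "amount": "-5.00", "status": "completed"},  # invalid amount
    {"order_id": 3, "amount": "12.50", "status": "completed"},
]

# Bronze: append-only, raw, immutable
bronze = list(raw_events)

# Silver: dedupe on order_id, enforce types, drop invalid rows
seen, silver = set(), []
for e in bronze:
    if e["order_id"] in seen:
        continue
    amount = float(e["amount"])        # data type enforcement
    if amount <= 0:
        continue                       # validation
    seen.add(e["order_id"])
    silver.append({**e, "amount": amount})

# Gold: business-ready aggregate
gold = {
    "order_count": len(silver),
    "total_revenue": sum(e["amount"] for e in silver),
}
```

Note how bronze is never mutated: if the silver rules change, silver and gold can be rebuilt from it at any time.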
🗄️
Advanced SQL — Deep Dive
Window Functions — Complete Reference
-- ROW_NUMBER: unique sequential number per partition
SELECT
  order_id, customer_id, order_date, amount,
  ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS order_rank
FROM orders;
-- Get each customer's most recent order: WHERE order_rank = 1

-- RANK vs DENSE_RANK
-- RANK: 1, 2, 2, 4  (skips 3 after tie)
-- DENSE_RANK: 1, 2, 2, 3  (no skip)
SELECT name, salary,
  RANK()       OVER (ORDER BY salary DESC) AS salary_rank,
  DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_dense_rank
FROM employees;
-- (avoid bare "rank" as an alias: it's a reserved word in several dialects)

-- LAG and LEAD: access previous/next row value
SELECT
  date, revenue,
  LAG(revenue, 1, 0)  OVER (ORDER BY date) AS prev_day_revenue,
  LEAD(revenue, 1, 0) OVER (ORDER BY date) AS next_day_revenue,
  revenue - LAG(revenue) OVER (ORDER BY date) AS day_over_day_change
FROM daily_metrics;

-- NTILE: divide into n buckets
SELECT customer_id, total_spend,
  NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile  -- 1=top 25%
FROM customer_summary;

-- SUM/AVG with frames: running total
SELECT date, revenue,
  SUM(revenue) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
  AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7d_avg
FROM daily_revenue;

-- FIRST_VALUE / LAST_VALUE
SELECT product_id, sale_date, price,
  FIRST_VALUE(price) OVER (PARTITION BY product_id ORDER BY sale_date) AS first_price,
  LAST_VALUE(price)  OVER (PARTITION BY product_id ORDER BY sale_date
                           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS latest_price
FROM price_history;
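To make the RANK vs DENSE_RANK tie behavior concrete, here is a small pure-Python sketch (not SQL) that reproduces both numbering schemes over a descending-sorted list:

```python
def rank_and_dense_rank(values):
    """Return (rank, dense_rank) lists for values sorted descending,
    mirroring SQL's RANK() and DENSE_RANK() tie handling."""
    ordered = sorted(values, reverse=True)
    ranks, dense = [], []
    for i, v in enumerate(ordered):
        if i > 0 and v == ordered[i - 1]:
            ranks.append(ranks[-1])   # tie: repeat the previous rank
            dense.append(dense[-1])   # tie: repeat the previous dense rank
        else:
            ranks.append(i + 1)       # RANK resumes at row position: skips after ties
            dense.append(dense[-1] + 1 if dense else 1)  # DENSE_RANK never skips
    return ranks, dense

salaries = [90, 80, 80, 70]
r, d = rank_and_dense_rank(salaries)
# r == [1, 2, 2, 4] and d == [1, 2, 2, 3], matching the comment above
```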
CTEs and Recursive CTEs
-- Standard CTE: readable, can be referenced multiple times
WITH monthly_revenue AS (
  SELECT DATE_TRUNC('month', order_date) AS month,
         SUM(amount) AS revenue
  FROM orders GROUP BY 1
),
revenue_with_growth AS (
  SELECT month, revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
    (revenue - LAG(revenue) OVER (ORDER BY month)) / LAG(revenue) OVER (ORDER BY month) * 100 AS growth_pct
  FROM monthly_revenue
)
SELECT * FROM revenue_with_growth WHERE month >= '2024-01-01';

-- Recursive CTE: organizational hierarchy
WITH RECURSIVE org_hierarchy AS (
  -- Anchor: start with the CEO (no manager)
  SELECT employee_id, name, manager_id, 0 AS level
  FROM employees WHERE manager_id IS NULL

  UNION ALL

  -- Recursive: join employees to their managers
  SELECT e.employee_id, e.name, e.manager_id, h.level + 1
  FROM employees e
  JOIN org_hierarchy h ON e.manager_id = h.employee_id
)
SELECT * FROM org_hierarchy ORDER BY level, name;
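The recursive CTE above can be mirrored in Python with an iterative expansion: start from the anchor rows (no manager), then repeatedly attach employees whose manager already has a level. The sample `employees` tuples below are made up for illustration:

```python
def org_levels(employees):
    """employees: list of (employee_id, name, manager_id) tuples.
    Returns {employee_id: level}; level 0 = no manager (the CEO)."""
    # Anchor member: employees with no manager
    levels = {e_id: 0 for e_id, _name, mgr in employees if mgr is None}
    changed = True
    while changed:  # recursive member: one "iteration" per hierarchy depth
        changed = False
        for e_id, _name, mgr in employees:
            if e_id not in levels and mgr in levels:
                levels[e_id] = levels[mgr] + 1
                changed = True
    return levels

emps = [(1, "ceo", None), (2, "vp", 1), (3, "eng_a", 2), (4, "eng_b", 2)]
hierarchy = org_levels(emps)
# hierarchy == {1: 0, 2: 1, 3: 2, 4: 2}
```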
SQL Query Optimization Techniques

Indexing Strategy

  • B-tree index: Default. Best for equality and range queries on columns in WHERE, JOIN, ORDER BY
  • Composite index: Index on (col_a, col_b). Most selective column first. Column order matters — query must use leftmost prefix
  • Covering index: Index includes all columns needed by query — avoids table lookup entirely
  • Partial index: Index only subset of rows (WHERE is_active = TRUE) — smaller, faster
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date DESC);

Query Plan Analysis

EXPLAIN ANALYZE
SELECT * FROM orders
WHERE customer_id = 123
  AND status = 'completed';
  • Seq Scan: Full table scan — bad for large tables
  • Index Scan: Uses index — good
  • Nested Loop: Good for small result sets
  • Hash Join: Good for large datasets without index
  • Merge Join: Both inputs sorted — efficient for ranges
  • Look for: high actual rows vs estimated rows (statistics stale → run ANALYZE)
Advanced SQL Patterns
-- PIVOT: rows to columns
SELECT *
FROM (SELECT month, category, revenue FROM sales)
PIVOT (SUM(revenue) FOR category IN ('Electronics', 'Clothing', 'Food'));

-- JSON operations (PostgreSQL)
SELECT
  user_id,
  metadata->>'email' AS email,           -- text extraction
  (metadata->'preferences'->>'theme')    -- nested access
FROM users WHERE metadata ? 'email';

-- Deduplication using ROW_NUMBER
DELETE FROM orders WHERE order_id IN (
  SELECT order_id FROM (
    SELECT order_id,
      ROW_NUMBER() OVER (PARTITION BY external_id ORDER BY created_at DESC) AS rn
    FROM orders
  ) t WHERE rn > 1
);

-- Running user cohort retention (classic DE interview question)
WITH cohorts AS (
  SELECT user_id, DATE_TRUNC('month', first_purchase) AS cohort_month
  FROM (SELECT user_id, MIN(order_date) AS first_purchase FROM orders GROUP BY 1) t
),
activity AS (
  SELECT o.user_id, c.cohort_month,
    DATEDIFF('month', c.cohort_month, o.order_date) AS period_number
  FROM orders o JOIN cohorts c USING(user_id)
)
SELECT cohort_month, period_number,
  COUNT(DISTINCT user_id) AS active_users
FROM activity
GROUP BY 1, 2 ORDER BY 1, 2;
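The deduplication pattern above (keep only the newest row per business key) can be sketched in plain Python to show exactly which rows the `rn = 1` filter keeps:

```python
def dedupe_latest(rows, key="external_id", ts="created_at"):
    """Keep only the most recent row per key: the survivors of
    ROW_NUMBER() ... ORDER BY created_at DESC with rn = 1."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[ts] > latest[k][ts]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"order_id": 1, "external_id": "A", "created_at": "2024-01-01"},
    {"order_id": 2, "external_id": "A", "created_at": "2024-01-03"},
    {"order_id": 3, "external_id": "B", "created_at": "2024-01-02"},
]
survivors = dedupe_latest(rows)
# order_id 2 (the latest "A") and order_id 3 survive; order_id 1 is removed
```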
Transaction Isolation Levels
| Level | Dirty Read | Non-Repeatable Read | Phantom Read | Use When |
|---|---|---|---|---|
| READ UNCOMMITTED | ✅ Possible | ✅ Possible | ✅ Possible | Never (for analytics only) |
| READ COMMITTED | ❌ Prevented | ✅ Possible | ✅ Possible | Default in most DBs — general use |
| REPEATABLE READ | ❌ Prevented | ❌ Prevented | ✅ Possible | Financial calculations needing consistent view |
| SERIALIZABLE | ❌ Prevented | ❌ Prevented | ❌ Prevented | Critical transactions (bank transfers) |
Apache Spark / PySpark
Spark Architecture

Components

  • Driver: Main process — creates SparkContext, submits jobs, coordinates execution. Runs on master node.
  • Executors: Worker processes that run tasks. Each has CPU cores + memory.
  • Cluster Manager: YARN, Kubernetes, Mesos, or Standalone. Allocates resources.
  • Task: Smallest unit of work — runs on one partition of data
  • Stage: Group of tasks that can run without shuffle
  • Job: Triggered by an action (collect, write, count)

Key Concepts

  • Lazy evaluation: Transformations don't execute until an action is called — allows Catalyst to optimize the full query plan
  • DAG: Directed Acyclic Graph of stages and tasks — Spark builds this before execution
  • Shuffle: Redistribution of data across partitions — expensive (involves disk I/O and network). Triggered by: groupBy, join, repartition, distinct
  • Spill: When data doesn't fit in memory, Spark spills to disk — very slow. Fix: increase memory or reduce data per partition
PySpark — Essential Operations
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

spark = (SparkSession.builder.appName("ETL")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
    .getOrCreate())

# Read data
df = spark.read.format("parquet").option("mergeSchema", "true").load("s3a://bucket/raw/orders/")

# Transformations
df_clean = (df
  .filter(F.col("amount") > 0)
  .withColumn("order_date", F.to_date("order_timestamp"))
  .withColumn("year_month", F.date_format("order_date", "yyyy-MM"))
  .withColumnRenamed("cust_id", "customer_id")
  .dropDuplicates(["order_id"])
  .na.fill({"discount": 0, "promo_code": "NONE"})
)

# Aggregation
monthly_revenue = df_clean.groupBy("year_month", "category").agg(
  F.sum("amount").alias("total_revenue"),
  F.count("order_id").alias("order_count"),
  F.countDistinct("customer_id").alias("unique_customers"),
  F.avg("amount").alias("avg_order_value"),
)

# Window functions in Spark
window_spec = Window.partitionBy("customer_id").orderBy("order_date")
df_with_rank = df_clean.withColumn("order_rank", F.row_number().over(window_spec))
df_first_order = df_with_rank.filter(F.col("order_rank") == 1)

# Delta Lake MERGE (upsert)
delta_table = DeltaTable.forPath(spark, "s3a://bucket/silver/orders/")
delta_table.alias("target").merge(
  df_clean.alias("source"),
  "target.order_id = source.order_id"
).whenMatchedUpdate(set={
  "amount": "source.amount",
  "status": "source.status",
  "updated_at": "current_timestamp()"
}).whenNotMatchedInsertAll().execute()

# Write with partitioning
df_clean.write.format("delta").mode("overwrite") \
  .partitionBy("year_month") \
  .option("overwriteSchema", "true") \
  .save("s3a://bucket/silver/orders/")
Spark Performance Optimization
| Problem | Symptom | Solution |
|---|---|---|
| Data skew | One executor takes 10x longer than others; one partition much larger | Salt join key (add random suffix), AQE (spark.sql.adaptive.enabled=true), skewHint() |
| Too many small files | Many small partitions; metadata overhead; slow file listing | coalesce() before write, OPTIMIZE in Delta Lake, delta.targetFileSize |
| Out of memory | GC overhead or OOM errors on executors | Increase executor memory, reduce partition size (repartition), cache only what's needed |
| Shuffle bottleneck | Shuffle Read/Write stage takes most time | Broadcast join for small tables, reduce groupBy cardinality, increase spark.sql.shuffle.partitions |
| Too few partitions | Not utilizing all cores; one executor saturated | Repartition: aim for 128MB–256MB per partition. Rule: cores × 2–4 |
# Adaptive Query Execution (Spark 3.0+) — enable always
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Broadcast join for small tables
df_orders.join(F.broadcast(df_small_reference), "product_id")

# Cache frequently used DataFrames
df_customers.cache()   # or .persist(StorageLevel.MEMORY_AND_DISK)
df_customers.count()   # trigger materialization

# Delta Lake OPTIMIZE + ZORDER for query performance
spark.sql("OPTIMIZE silver.orders ZORDER BY (customer_id, order_date)")
spark.sql("VACUUM silver.orders RETAIN 168 HOURS")  # 7-day retention
📨
Apache Kafka
Kafka Architecture Deep Dive

Core Components

  • Broker: Server that stores messages. Kafka cluster = multiple brokers. Each broker handles a subset of partitions.
  • Topic: Logical category of messages (like a table). Partitioned for parallelism.
  • Partition: Ordered, immutable log. Messages appended to end. Each partition is replicated across brokers.
  • Offset: Unique sequential ID for each message within a partition. Consumer tracks its offset to know where it left off.
  • Producer: Writes messages to topics. Determines partition via key hash or round-robin.
  • Consumer: Reads messages from topics. Stores offset in __consumer_offsets topic.
  • Consumer Group: Multiple consumers sharing work. Each partition consumed by exactly ONE consumer in a group.
  • ZooKeeper / KRaft: Manages broker metadata, leader election. KRaft replaces ZooKeeper in Kafka 3.x+
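Key-to-partition assignment is what guarantees per-key ordering: the producer hashes the message key modulo the partition count, so every event for a given key lands on the same partition. A minimal sketch of that property (using md5 as a stable stand-in for Kafka's actual murmur2 partitioner):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Deterministically map a message key to a partition.
    Kafka's default partitioner uses murmur2; md5 here is just a
    stable stand-in to show the same-key -> same-partition property."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = partition_for("u_123", 6)
p2 = partition_for("u_123", 6)
# p1 == p2: all events for user u_123 land on one partition,
# so that user's events are consumed in order
```

This is also why increasing the partition count of an existing topic breaks key ordering: the modulo changes, so the same key can map to a different partition.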

Delivery Semantics

Critical Interview Topic
At-most-once: Messages may be lost. Fire-and-forget.
At-least-once: No data loss, but duplicates possible. Consumer re-reads if it crashes before committing offset.
Exactly-once (EOS): No duplicates, no loss. Requires: idempotent producer (enable.idempotence=true) + transactional API + transactional consumers.

Key Producer Settings

  • acks=all: Wait for all in-sync replicas to acknowledge — strongest durability
  • retries=Integer.MAX_VALUE: Retry on transient failures
  • enable.idempotence=true: Exactly-once producer semantics
  • linger.ms=5: Batch messages for 5ms to improve throughput
  • batch.size=65536: Max batch size in bytes
  • compression.type=snappy: Compress batches — reduces network I/O
Kafka Python Example (kafka-python)
from kafka import KafkaProducer, KafkaConsumer
import json, time

# Producer — send events
producer = KafkaProducer(
    bootstrap_servers=['kafka:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    key_serializer=lambda k: k.encode('utf-8') if k else None,
    acks='all',              # wait for all replicas
    enable_idempotence=True,  # idempotent producer (requires a client version that supports this option)
    compression_type='snappy',
)

event = {"user_id": "u_123", "event": "purchase", "amount": 59.99, "ts": time.time()}
future = producer.send("user-events", key=event["user_id"], value=event)
record_metadata = future.get(timeout=10)  # blocks until ack received
print(f"Sent to partition {record_metadata.partition} @ offset {record_metadata.offset}")

producer.flush()
producer.close()

# Consumer — process events
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers=['kafka:9092'],
    group_id="analytics-pipeline",
    auto_offset_reset="earliest",   # replay from beginning on new consumer group
    enable_auto_commit=False,       # manual offset commit for at-least-once
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    max_poll_records=100,
)

for message in consumer:
    event = message.value
    # Process event
    process_event(event)
    # Commit offset AFTER successful processing
    consumer.commit()
🌬️
Apache Airflow
Modern Airflow DAG (TaskFlow API)
from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from datetime import datetime, timedelta

default_args = {
    "owner": "data-team",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "on_failure_callback": alert_slack,  # user-defined function that posts a failure alert to Slack
}

@dag(
    dag_id="orders_etl_pipeline",
    schedule="0 6 * * *",          # 6 AM daily
    start_date=datetime(2024, 1, 1),
    catchup=False,                   # don't backfill missed runs
    default_args=default_args,
    tags=["orders", "production"],
)
def orders_etl():

    @task()
    def extract_from_postgres(**context):
        """Extract yesterday's orders from OLTP database"""
        hook = PostgresHook(postgres_conn_id="prod_postgres")
        execution_date = context["ds"]   # YYYY-MM-DD of current run
        df = hook.get_pandas_df(
            "SELECT * FROM orders WHERE DATE(created_at) = %s",
            parameters=[execution_date],  # parameterized — avoids SQL injection
        )
        # Push to XCom (auto-handled by @task)
        return df.to_dict("records")

    @task()
    def transform(records):
        """Clean and transform records"""
        import pandas as pd
        df = pd.DataFrame(records)
        df["amount"] = df["amount"].astype(float)
        df = df[df["status"] != "cancelled"]
        df["processed_at"] = pd.Timestamp.now()
        return df.to_dict("records")

    @task()
    def load_to_s3(records, **context):
        """Write Parquet to S3"""
        import pandas as pd, io
        df = pd.DataFrame(records)
        s3_hook = S3Hook(aws_conn_id="aws_default")
        buffer = io.BytesIO()
        df.to_parquet(buffer, index=False)
        s3_hook.load_bytes(buffer.getvalue(),
            key=f"orders/date={context['ds']}/orders.parquet",
            bucket_name="my-data-lake", replace=True)

    # DAG structure: extract → transform → load
    raw = extract_from_postgres()
    clean = transform(raw)
    load_to_s3(clean)

orders_etl()
🔧
dbt (data build tool)
dbt Core Concepts

Materializations

| Type | Behavior | Use When |
|---|---|---|
| table | DROP + CREATE TABLE each run | Final gold layer models |
| view | CREATE OR REPLACE VIEW | Lightweight staging models |
| incremental | INSERT/MERGE only new records | Large tables — only process new data |
| ephemeral | CTE inlined into downstream query | Intermediate transformations |

Key Commands

dbt run                   # run all models
dbt run --select orders   # run specific model
dbt run --select +orders  # run + all ancestors
dbt test                  # run all tests
dbt test --select orders
dbt build                 # run + test together
dbt docs generate         # generate documentation
dbt docs serve            # serve docs locally
dbt compile               # compile without running
dbt debug                 # test connections
dbt snapshot              # run snapshots (SCD Type 2)
dbt Incremental Model Pattern
-- models/silver/stg_orders.sql
{{
  config(
    materialized = 'incremental',
    unique_key = 'order_id',
    incremental_strategy = 'merge',    -- 'merge' for upserts, 'append' for inserts only
    on_schema_change = 'sync_all_columns'
  )
}}

WITH source AS (
  SELECT * FROM {{ source('raw', 'orders') }}
  {% if is_incremental() %}
  -- Only process records newer than our latest
  WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
  {% endif %}
),
cleaned AS (
  SELECT
    order_id,
    customer_id,
    CAST(amount AS NUMERIC(12,2)) AS amount,
    LOWER(status) AS status,
    DATE(created_at) AS order_date,
    created_at,
    COALESCE(updated_at, created_at) AS updated_at
  FROM source
  WHERE amount > 0
    AND order_id IS NOT NULL
)
SELECT * FROM cleaned
dbt Tests, Sources & Macros
# schema.yml — define tests, sources, documentation
version: 2

sources:
  - name: raw
    database: my_database
    schema: raw_schema
    tables:
      - name: orders
        loaded_at_field: created_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}

models:
  - name: stg_orders
    description: "Cleaned and standardized orders from OLTP"
    columns:
      - name: order_id
        description: "Unique order identifier"
        tests:
          - not_null
          - unique
      - name: amount
        tests:
          - not_null
          - accepted_range:
              min_value: 0
              max_value: 100000
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'completed', 'refunded', 'cancelled']
☁️
Cloud Data Platforms
Snowflake Key Concepts

Architecture

  • Virtual Warehouse: Compute cluster (XS to 6XL). Auto-suspend when idle. Query performance scales with size. Cost = credits/hour × warehouse size.
  • Storage: Fully managed columnar storage (micro-partitions). Separate from compute → scale independently.
  • Time Travel: Access historical data up to 90 days. SELECT * FROM orders AT(OFFSET => -3600)
  • Zero-Copy Cloning: Instant, metadata-only clone of table/schema for dev/testing without duplicating data
  • Streams: Change data capture — track inserts/updates/deletes on a table. Consumed by Tasks.

Snowflake SQL Patterns

-- Auto-suspend warehouse (save cost)
ALTER WAREHOUSE analytics_wh SET
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;

-- Clustering key for large tables
ALTER TABLE orders CLUSTER BY (order_date);

-- Query historical data
SELECT * FROM orders
  AT (TIMESTAMP => '2024-01-15 10:00:00'::timestamp);

-- Zero-copy clone for dev
CREATE TABLE orders_dev CLONE orders;

-- Stream for CDC
CREATE STREAM orders_stream ON TABLE orders;
SELECT * FROM orders_stream WHERE METADATA$ACTION = 'INSERT';
BigQuery Key Concepts
| Feature | Details | Best Practice |
|---|---|---|
| Partitioning | Divide table into segments by date, integer range, or ingestion time. Queries scan only relevant partitions. | Always partition large tables by date. Filter on partition column in WHERE clause. |
| Clustering | Sort data within each partition by up to 4 columns. Auto-maintained. | Cluster by columns frequently used in WHERE and JOIN. Order: highest cardinality → lowest. |
| Slots | Unit of compute. On-demand: auto-scaled. Reservations: dedicated slots for predictable workloads. | Use on-demand for unpredictable workloads; reservations for production pipelines. |
| Materialized Views | Pre-computed query results, auto-refreshed. Smart rewrite — queries on base table can use MV. | Use for expensive aggregation queries run frequently. |
| INFORMATION_SCHEMA | System views for job metadata, table sizes, slot usage, query history | Monitor costs: SELECT * FROM INFORMATION_SCHEMA.JOBS_BY_PROJECT ORDER BY total_bytes_billed DESC LIMIT 20 |
🎯
Top 30 Interview Questions & Answers
Q1. What is the difference between a Data Lake and a Data Warehouse?
Answer: A Data Warehouse stores structured, processed data with a predefined schema (schema-on-write). It's optimized for SQL analytics and BI reporting. Data is cleaned and modeled before loading. Examples: Snowflake, BigQuery, Redshift. A Data Lake stores raw data in any format (structured, semi-structured, unstructured) with schema applied at query time (schema-on-read). It's cheaper, more flexible, but requires more processing before it's usable. Examples: S3, ADLS, GCS. A Lakehouse (Delta Lake, Iceberg) combines the low-cost storage of data lakes with the ACID transactions and schema enforcement of data warehouses. Modern architectures often use: raw data in object storage (bronze), Delta Lake for silver/gold, Snowflake/BigQuery for serving layer.
Q2. How do you handle late-arriving data in a streaming pipeline?
Answer: Late data is a fundamental challenge in streaming. Solutions: (1) Watermarks: Tell the stream processor to wait W time before considering a window complete. In Spark Structured Streaming: df.withWatermark("event_time", "2 hours"). Events arriving more than 2 hours late are dropped. (2) Reprocessing: Keep raw events in Kafka for 7 days. Periodically reprocess to correct historical windows. (3) Lambda/Kappa architecture: Batch layer corrects streaming layer's approximate results. (4) Accumulating snapshot: Update the same aggregate row as late data arrives (requires idempotent writes). (5) Allowed lateness in Flink: Process late events up to X hours late, emit corrections. The right strategy depends on latency SLA vs. correctness requirements.
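A watermark is essentially "max event time seen so far, minus the allowed lateness"; events older than the watermark are dropped. A minimal sketch of that rule (timestamps are plain integers for illustration, not real event times):

```python
def filter_with_watermark(events, allowed_lateness):
    """events: list of (event_time, payload) in arrival order.
    Keep an event only if it is no older than
    (max event_time seen so far) - allowed_lateness."""
    max_seen = float("-inf")
    kept, dropped = [], []
    for event_time, payload in events:
        max_seen = max(max_seen, event_time)
        if event_time >= max_seen - allowed_lateness:
            kept.append(payload)
        else:
            dropped.append(payload)  # too late: behind the watermark
    return kept, dropped

events = [(100, "a"), (105, "b"), (96, "late_ok"), (90, "too_late")]
kept, dropped = filter_with_watermark(events, allowed_lateness=10)
# kept == ["a", "b", "late_ok"]; "too_late" (time 90 < 105 - 10) is dropped
```

Real engines additionally hold window state open until the watermark passes the window end; the dropping rule above is the core idea.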
Q3. What is idempotency and why is it critical in data pipelines?
Answer: An idempotent operation produces the same result regardless of how many times it's executed. Critical because: pipelines WILL fail and be retried. If your pipeline isn't idempotent, retries cause duplicate data. Examples of idempotent patterns: (1) MERGE/UPSERT instead of INSERT: INSERT can create duplicates; MERGE on natural key is safe to retry. (2) Write to S3 with deterministic path: Overwrite same path on retry — same content, no duplicate files. (3) Use execution date as partition key: Overwrite year=2024/month=01/day=15/ on retry — no duplicates. (4) Truncate-and-insert: For small tables, delete all rows for the period, then insert fresh. (5) Exactly-once Kafka consumers: Use transactional API to prevent duplicate processing. Always design for "what happens if this runs twice?"
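The "what happens if this runs twice?" test is easy to demonstrate: a keyed upsert converges to the same state no matter how many times the same batch is applied. A plain-dict sketch standing in for MERGE:

```python
def upsert(table: dict, batch: list, key: str = "order_id") -> dict:
    """MERGE-style load: insert-or-overwrite by natural key.
    Re-running the same batch leaves the table unchanged (idempotent)."""
    for row in batch:
        table[row[key]] = row
    return table

batch = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 20.0}]
table = upsert({}, batch)
table_after_retry = upsert(dict(table), batch)  # simulate a retried run
# Both runs produce the identical two-row table: no duplicates
```

A plain append (`list.extend`) under the same retry would yield four rows, which is exactly the duplicate problem the answer describes.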
Q4. Explain SCD Type 2 and how you implement it.
Answer: SCD Type 2 (Slowly Changing Dimension Type 2) preserves full historical record of changes by inserting a new row for each change rather than overwriting. Implementation adds three columns: valid_from (when this version became active), valid_to (when this version was superseded — NULL for current), is_current (boolean flag). Implementation via MERGE: when source record matches target natural key and attributes changed → update old row (set valid_to = today, is_current = false) → insert new row (valid_from = today, valid_to = NULL, is_current = true). In dbt, use dbt snapshot with strategy: timestamp or check. In Delta Lake, MERGE handles this elegantly. Key uses: tracking customer address changes (for shipping), price changes over time, employee department changes.
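The expire-then-insert mechanics can be sketched in Python: when a tracked attribute changes, close the current version and append a new one. This is a toy model of the MERGE, with dates as plain strings and only `city` tracked:

```python
def apply_scd2(dim_rows, incoming, today):
    """dim_rows: dicts with customer_id, city, valid_from, valid_to,
    is_current. Applies one incoming record SCD Type 2 style."""
    for row in dim_rows:
        if row["customer_id"] == incoming["customer_id"] and row["is_current"]:
            if row["city"] == incoming["city"]:
                return dim_rows                # no change: nothing to do
            row["is_current"] = False          # expire the old version
            row["valid_to"] = today
            break
    dim_rows.append({                          # insert the new current version
        "customer_id": incoming["customer_id"], "city": incoming["city"],
        "valid_from": today, "valid_to": None, "is_current": True,
    })
    return dim_rows

dim = [{"customer_id": 1, "city": "NYC", "valid_from": "2023-01-01",
        "valid_to": None, "is_current": True}]
dim = apply_scd2(dim, {"customer_id": 1, "city": "Austin"}, "2024-06-01")
# Two rows now: the expired NYC version plus the current Austin version
```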
Q5. How does Kafka ensure exactly-once delivery?
Answer: Exactly-once in Kafka requires three components working together: (1) Idempotent Producer: enable.idempotence=true — each producer gets a PID, each message gets a sequence number. Broker rejects duplicate sequence numbers, preventing duplicate writes even on retries. (2) Transactional API: Producer begins transaction, writes to multiple partitions atomically, commits or aborts. If producer fails mid-write, uncommitted messages are never consumed. (3) Transactional Consumer: isolation.level=read_committed — consumer only reads committed transactions, never sees aborted messages. Kafka Streams provides exactly-once E2E: processing.guarantee=exactly_once_v2. This is critical for financial use cases (payment events, inventory updates).
Q6. What is data skew in Spark and how do you fix it?
Answer: Data skew occurs when data is unevenly distributed across partitions — one partition gets 10x more data than others. This causes some tasks to take much longer, bottlenecking the job. Detection: check Spark UI Stages tab — if one task is much slower than median. Causes: joining on a key with highly non-uniform distribution (e.g., joining on country_code where 90% of rows are 'US'). Fixes: (1) Salting: Add random suffix (0–9) to join key on the large table, expand the small table 10x with each suffix → distribute load. (2) AQE (Spark 3+): spark.sql.adaptive.skewJoin.enabled=true — automatically splits skewed partitions. (3) Broadcast join: If one side is small (<10MB), broadcast it to avoid shuffle entirely. (4) Repartition before groupBy: Use repartition on a different column to distribute data before the skewed operation.
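Salting is easiest to see with counts: appending a random suffix to the hot key on the large-table side splits one oversized partition into N balanced sub-keys (the small table is then replicated once per suffix so the join still matches). A pure-Python sketch with a made-up 90%-skewed dataset:

```python
import random
from collections import Counter

random.seed(42)
SALT_BUCKETS = 10

# 90% of rows share the hot key 'US'
rows = [{"country": "US"}] * 900 + [{"country": "CA"}] * 100

# Without salting, every 'US' row hashes to the same join key
unsalted = Counter(r["country"] for r in rows)

# With salting, the hot key fans out across 10 sub-keys
salted = Counter(
    f"{r['country']}_{random.randrange(SALT_BUCKETS)}" for r in rows
)

max_unsalted = max(unsalted.values())  # 900 rows behind one key
max_salted = max(salted.values())      # roughly 900 / 10 per salted sub-key
```

The largest partition shrinks by about the salt factor, which is exactly what unblocks the straggler task.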
Q7. How do you design a partitioning strategy for a large table?
Answer: Partitioning splits data into separate files/directories to enable partition pruning — the query engine reads only the relevant partition. Design principles: (1) Cardinality: Partition key should have low-to-medium cardinality. Date (365/year) is ideal. Too high cardinality (user_id) creates millions of tiny files — "small file problem". (2) Query patterns: Partition on the column most commonly used in WHERE clause. (3) Partition size: Target 128MB–1GB per partition file. Too small → overhead; too large → parallelism issues. (4) Common partitioning: Date/time (by year/month/day depending on volume), region/country (for geo-distributed data), event_type (for log data). In BigQuery, use date partitioning + clustering. In Delta Lake, partitionBy("year_month") + ZORDER BY customer_id for secondary filtering.
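Partition pruning is, at bottom, path filtering: because the partition value is encoded in the directory name, the engine can skip whole directories without opening a single file. A sketch of Hive-style date paths (the bucket and table names are made up):

```python
def partition_path(base: str, dt: str) -> str:
    """Hive-style partition layout: one directory per year/month/day."""
    year, month, day = dt.split("-")
    return f"{base}/year={year}/month={month}/day={day}/"

def prune(paths: list, year: str, month: str) -> list:
    """Keep only paths for the requested year/month. The engine never
    lists or reads the skipped directories -- that is partition pruning."""
    needle = f"year={year}/month={month}/"
    return [p for p in paths if needle in p]

paths = [partition_path("s3://lake/orders", d)
         for d in ("2024-01-15", "2024-01-16", "2024-02-01")]
jan = prune(paths, "2024", "01")
# Only the two January directories survive pruning
```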
Q8. Explain the Medallion Architecture and when you'd deviate from it.
Answer: Medallion (Bronze-Silver-Gold) is a layered data architecture: Bronze = raw data exactly as received, Silver = cleaned/validated/standardized data, Gold = business-aggregated, query-optimized data. Benefits: (1) Each layer has a clear contract and owner. (2) Bronze allows reprocessing from raw data at any time. (3) Separation of concerns — raw storage vs. business logic. (4) Failed pipelines can be re-run from Bronze without re-ingesting. When to deviate: if data is extremely simple and a two-layer approach is sufficient; for streaming where the boundary between layers is less clear; for very small datasets where overhead isn't justified. Some teams add a "Platinum" layer for ML-specific feature tables. The pattern works best with Delta Lake/Iceberg format for ACID support at each layer.
Q9. What is Change Data Capture (CDC) and how do you implement it?
Answer: CDC captures every INSERT, UPDATE, DELETE event on a source database and streams them downstream — enabling real-time replication without full table scans. Implementation methods: (1) Log-based CDC (best): Read database transaction logs (MySQL binlog, PostgreSQL WAL, SQL Server CDC). Tools: Debezium (open-source), AWS DMS, Fivetran. Low latency (<1s), no impact on source. (2) Trigger-based: Database triggers write changes to audit table. High overhead, not recommended for high-volume tables. (3) Polling/Timestamp-based: Query WHERE updated_at > last_run_time. Simple but misses deletes, vulnerable to missed updates if updated_at isn't maintained. CDC is critical for: building real-time data warehouse, synchronizing microservices databases, audit trails. Debezium → Kafka → Spark/Flink is the most common production pattern.
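Applying CDC events downstream can be sketched like this. The envelope fields (`op`, `before`, `after`) follow Debezium's format; the table and key names are illustrative:

```python
# Sketch of replaying Debezium-style change events into a local replica.
# Note how the delete is captured, which timestamp-polling CDC would miss.

def apply_event(table, event):
    op = event["op"]
    if op in ("c", "r", "u"):              # create / snapshot read / update
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":                        # delete
        table.pop(event["before"]["id"], None)

replica = {}
events = [
    {"op": "c", "after": {"id": 1, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"},
                "after":  {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 1, "status": "paid"}},
]
for e in events:
    apply_event(replica, e)
print(replica)     # {} -- the delete arrived as an event, not a silent gap
```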
Q10. How do you ensure data quality in a pipeline?
Answer: Multi-layer approach: (1) Schema validation: Enforce schema on ingest — fail fast if source changes (dbt on_schema_change, Delta Lake schema enforcement). (2) Business rule tests: dbt tests — not_null, unique, accepted_values, relationships (referential integrity), custom SQL tests. (3) Statistical monitoring: Great Expectations, dbt Elementary — monitor column means, null rates, distributions, detect anomalies. (4) Freshness checks: Alert if Bronze data isn't arriving on schedule (dbt source freshness). (5) Row count reconciliation: Compare source row count to destination. (6) Data contracts: Formal agreements between data producers and consumers — schema, freshness, SLAs. (7) Quarantine bad records: Route records failing quality checks to error table for investigation rather than silently dropping. Quality gates at each Medallion layer — Silver models should not load if Bronze quality checks fail.
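Point (7), quarantining instead of dropping, can be sketched as a split step. The check names and rules here are illustrative, not from any particular framework:

```python
# Route records that fail quality checks to a quarantine set, annotated with
# which checks failed, instead of silently discarding them.

CHECKS = {
    "order_id_present": lambda r: r.get("order_id") is not None,
    "amount_positive":  lambda r: (r.get("amount") or 0) > 0,
}

def split(records):
    clean, quarantine = [], []
    for r in records:
        failed = [name for name, check in CHECKS.items() if not check(r)]
        if failed:
            quarantine.append({**r, "_failed_checks": failed})  # keep evidence
        else:
            clean.append(r)
    return clean, quarantine

clean, bad = split([
    {"order_id": 1, "amount": 9.99},
    {"order_id": None, "amount": -5},
])
print(len(clean), bad[0]["_failed_checks"])
# 1 ['order_id_present', 'amount_positive']
```

The `_failed_checks` column makes the quarantine table self-explanatory when someone investigates it later.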
Q11–Q30 (Additional Critical Questions)
Q11. Explain ACID properties. Atomicity: all or nothing (transaction commits or fully rolls back). Consistency: data always meets constraints (FK, NOT NULL). Isolation: concurrent transactions don't see each other's intermediate state. Durability: committed transactions survive crashes (written to WAL before commit). Delta Lake brings ACID to object storage via transaction logs.

Q12. What is the difference between repartition and coalesce in Spark? repartition(n): triggers full shuffle — redistributes data evenly across n partitions. Use to increase or decrease partitions, or when you need data evenly distributed. coalesce(n): narrow transformation — combines existing partitions without full shuffle (only decreases partition count). Use when reducing partition count since it's much cheaper. Rule: increase partitions → repartition; decrease partitions → coalesce.
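A toy model (illustrative, not Spark's actual internals) makes the cost difference concrete: coalesce only glues whole existing partitions together, while repartition reassigns every individual row.

```python
import itertools

def coalesce(parts, n):
    # Merge existing partitions wholesale; rows never leave their grouping,
    # so no per-row shuffle -- but skew can survive the merge.
    out = [[] for _ in range(n)]
    for i, p in enumerate(parts):
        out[i % n].extend(p)
    return out

def repartition(rows, n):
    # Full shuffle: every row is individually reassigned (round-robin here).
    out = [[] for _ in range(n)]
    for i, r in enumerate(rows):
        out[i % n].append(r)
    return out

parts = [[1, 2], [3], [4, 5, 6], [7]]                 # 4 skewed partitions
print(coalesce(parts, 2))        # [[1, 2, 4, 5, 6], [3, 7]] -- still skewed
print(repartition(list(itertools.chain(*parts)), 2))  # [[1, 3, 5, 7], [2, 4, 6]]
```

This is also why coalesce can leave partitions unbalanced: it never moves individual rows, only whole partitions.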

Q13. What is a broadcast join in Spark and when do you use it? A broadcast join copies the smaller table to all executor nodes, avoiding shuffle of the large table. Triggered automatically when small table < spark.sql.autoBroadcastJoinThreshold (default 10MB). Force it: df_large.join(F.broadcast(df_small), "key"). Use when: one table fits in memory on each executor, join on a small lookup table (country codes, product categories). Avoids expensive shuffle of the large table.
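The same idea in plain Python (table contents are illustrative): build a hash map from the small lookup table once, the "broadcast", then probe it per large-side row, so the large side is never redistributed or sorted.

```python
# Map-side join: the small dimension table becomes an in-memory lookup that
# every "executor" could hold a full copy of; the fact rows stream past it.

country_names = {"US": "United States", "DE": "Germany"}   # small dim table

orders = [("o1", "US"), ("o2", "DE"), ("o3", "US")]        # large fact side

enriched = [(oid, cc, country_names.get(cc)) for oid, cc in orders]
print(enriched[0])     # ('o1', 'US', 'United States')
```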

Q14. Describe the difference between Lambda and Kappa architecture. Lambda has two parallel paths: batch layer (accurate, high latency) + speed layer (approximate, low latency) → serving layer merges results. Complex to maintain — same logic in two systems. Kappa: stream processing only. Kafka retains full log; reprocess from beginning when accuracy needed. Simpler, but streaming must handle both real-time and historical workloads. Kappa is preferred in modern systems with mature streaming frameworks.

Q15. How do you handle schema evolution in a data pipeline? Options: (1) Delta Lake / Iceberg: schema evolution supported — mergeSchema automatically handles additive changes. (2) dbt on_schema_change='sync_all_columns' adds new columns automatically. (3) Backward compatibility: always add new columns, never delete/rename (breaking change). (4) Schema registry (Confluent): enforce Avro/Protobuf schemas on Kafka topics — producers must register schemas, consumers know schema evolution rules. (5) Versioned data: land new schema as new path, transform both old and new versions in Silver.
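The additive-only rule in (3) can be sketched as a schema merge. This is an illustrative stand-in for what `mergeSchema` / `sync_all_columns` do, with example column names:

```python
# Union the columns seen across batches, backfill new columns with None for
# older rows, and never drop a column (dropping would be a breaking change).

def merge_schema(batches):
    columns = []                              # stable order: first seen wins
    for batch in batches:
        for row in batch:
            for col in row:
                if col not in columns:
                    columns.append(col)
    rows = [{c: row.get(c) for c in columns}
            for batch in batches for row in batch]
    return columns, rows

cols, rows = merge_schema([
    [{"id": 1, "email": "a@x.com"}],
    [{"id": 2, "email": "b@x.com", "plan": "pro"}],   # 'plan' added upstream
])
print(cols)        # ['id', 'email', 'plan']
print(rows[0])     # {'id': 1, 'email': 'a@x.com', 'plan': None}
```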

Q16. What is the difference between Star Schema and Data Vault? Star Schema: fact + dimension tables, denormalized, optimized for query performance, easier to understand. Best for stable business definitions. Data Vault: Hubs (business keys), Links (associations), Satellites (attributes + history). Fully normalized, insert-only, highly scalable, agnostic to business rules. Best for enterprise DW where source systems and business rules change frequently, need full auditability.

Q17. How do you monitor a data pipeline in production? Layers: (1) Infrastructure: Prometheus + Grafana for CPU, memory, disk, network. (2) Pipeline metadata: Airflow UI for task status, duration, SLA misses. (3) Data quality: dbt test results, Great Expectations alerts. (4) Data freshness: monitor last load time per table, alert if stale. (5) Business metrics: anomaly detection on key KPIs (sudden drops in order count = pipeline issue vs. real business drop). (6) Alerting: PagerDuty for critical failures, Slack for warnings. (7) Data lineage: dbt docs shows impact analysis — know downstream consumers before changing a model.

Q18. Explain Kafka Consumer Groups. Consumer group = set of consumers that collectively read from a topic. Each partition is assigned to exactly ONE consumer in the group — ensures no duplicate processing within the group. Add consumers to scale throughput (up to #partitions max). Multiple groups can read the same topic independently (different offsets per group). Consumer group coordinator (broker) handles rebalancing when consumers join/leave. Lag = difference between latest offset and consumer's committed offset — monitor with kafka-consumer-groups.sh or Burrow.
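The lag calculation itself is simple arithmetic per partition, as in this sketch (the offset numbers are illustrative):

```python
# Per partition: lag = latest (log-end) offset minus the group's committed
# offset. Summing across partitions gives total group lag.

def consumer_lag(end_offsets, committed):
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}

end_offsets = {0: 1500, 1: 1480, 2: 9900}    # broker's latest offsets
committed   = {0: 1500, 1: 1475, 2: 4000}    # group's committed offsets

lag = consumer_lag(end_offsets, committed)
print(lag)                 # {0: 0, 1: 5, 2: 5900} -- partition 2 is far behind
print(sum(lag.values()))   # 5905 total lag for the group
```

A single partition with large lag (like partition 2 here) is itself a skew signal: one consumer in the group is overloaded or stuck.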

Q19. What is dbt and why has it become dominant in modern data stacks? dbt (data build tool) brings software engineering best practices to SQL transformations: version control (Git), testing, documentation, modular design, dependency management. Engineers write SELECT statements; dbt handles CREATE TABLE/VIEW, generates DAGs, manages dependencies. Works inside your DW — no separate compute needed. Modern stack: dbt + Snowflake/BigQuery = full ELT pipeline. Key advantages: every model is tested, documented, and version-controlled; automatic lineage graph; code reuse via macros/packages; CI/CD integration via dbt Cloud or GitHub Actions.
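The "every model is tested and documented" claim lives in YAML alongside the models. A minimal `schema.yml` sketch (model and column names are illustrative):

```yaml
version: 2

models:
  - name: stg_orders
    description: "Staged orders, one row per order"
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - accepted_values:
              values: ["placed", "shipped", "returned"]
```

`dbt test` compiles each of these into a SQL query that fails the run if it returns rows.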

Q20. How do you optimize Snowflake query costs? (1) Partition pruning: always filter on cluster key column (date). (2) Auto-suspend: set AUTO_SUSPEND = 60 seconds on all warehouses. (3) Result caching: identical queries use cached results (24h TTL) — free. (4) Warehouse sizing: use smallest size that meets SLA. Scale up for complex queries, not for concurrent users (use multi-cluster). (5) COPY vs INSERT: COPY INTO is 10x faster for bulk loading. (6) Zero-copy clones for dev/test instead of copying data. (7) INFORMATION_SCHEMA.QUERY_HISTORY: identify expensive queries. (8) Materialized views for repeated expensive aggregations.
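Points (2) and (4) are one statement in practice. An illustrative example (the warehouse name and size are placeholders):

```sql
-- Smallest size that meets the SLA, suspend after 60 idle seconds,
-- and wake transparently on the next query.
ALTER WAREHOUSE analytics_wh SET
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60
  AUTO_RESUME    = TRUE;
```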

Q21–30 Summary. Other high-frequency topics:
  • Delta Lake Z-ORDER vs. traditional partitioning: Z-ORDER clusters data by multiple columns for multi-dimensional filtering
  • Airflow XComs: cross-task communication via key-value pairs; keep payloads small (metadata, not DataFrames)
  • Spark Catalyst Optimizer: rule-based + cost-based query planning that generates an optimized physical plan
  • Databricks Auto Loader: incremental file ingestion from S3/ADLS with schema inference and evolution; replaces manual file listing
  • Data contracts: formal schema + SLA agreements between producers and consumers, enforced via schema registries
  • Reverse ETL: warehouse data pushed to operational tools like Salesforce and Marketo via Hightouch/Census
  • Iceberg vs. Delta Lake: both bring ACID to object storage; Iceberg is an open standard with broad engine support, Delta Lake is Databricks-led with richer features
  • Great Expectations: data quality framework for defining and testing expectations on DataFrames and databases
  • dbt Exposures: document downstream consumers (BI dashboards, ML models) in the lineage graph
  • Airflow Sensors: wait for an external condition (S3KeySensor for S3 objects, SqlSensor for a query result, HttpSensor for API availability)
🏗️
Pipeline Design Patterns
Real-Time Analytics Pipeline
  • Sources: App events → Kafka (producers with schema registry)
  • Stream Processing: Flink/Spark Streaming — filter, enrich, aggregate
  • Storage: ClickHouse or Apache Druid for sub-second OLAP queries
  • Serving: REST API → real-time dashboards
  • Batch reconciliation: Daily Spark job corrects streaming results
  • Monitoring: Consumer lag, processing latency, error rates
Modern Batch Data Warehouse
  • Ingestion: Fivetran/Airbyte → raw tables in Snowflake/BigQuery
  • Transformation: dbt models — staging → intermediate → mart layers
  • Orchestration: Airflow or dbt Cloud schedules + dependencies
  • Quality: dbt tests + Great Expectations at each layer
  • Serving: BI tools connect to gold layer (Looker, Tableau, Power BI)
  • Governance: Column-level lineage, PII tagging, access controls
📖
Glossary
OLTP: Online Transaction Processing — row-oriented, optimized for reads/writes of individual records (e-commerce, banking)
OLAP: Online Analytical Processing — columnar, optimized for aggregation queries across large datasets
DAG: Directed Acyclic Graph — pipeline with defined dependencies, no cycles
Idempotency: Same operation produces the same result regardless of how many times it is executed
Backfill: Running a pipeline on historical data to populate past time periods
Compaction: Merging many small files into fewer large files for query efficiency
Schema drift: Unexpected changes to source data schema that break downstream pipelines
Watermark: Streaming concept — the maximum allowed lateness before a window is finalized
Partition pruning: Query engine skips partitions irrelevant to the WHERE clause — massive scan cost reduction
Data lineage: Tracking where data came from, how it was transformed, and where it goes
Data contract: Formal agreement between data producer and consumer on schema, freshness, and SLA
Exactly-once: Each message/event processed exactly one time — no duplicates, no data loss
Consumer lag: Difference between the latest Kafka offset and the consumer's committed offset — measures pipeline delay
Catalyst optimizer: Spark's query optimizer that plans and optimizes transformations before execution
Shuffle: Redistribution of data across Spark partitions — expensive due to network I/O
Feature store: Centralized repository of ML features — serves both training (batch) and inference (real-time)
Reverse ETL: Moving transformed data from the warehouse back to operational tools (CRM, marketing)
Z-ORDER: Delta Lake multi-dimensional clustering that optimizes filtering on multiple columns
PSI: Population Stability Index — measures distribution shift between two datasets (>0.25 = significant)
Semantic layer: Abstraction layer that defines business metrics consistently across BI tools

Thejaslearning — Data Engineer Cheat Sheet