Data Storage Architectures
| Architecture | Storage | Schema | Best For | Examples |
|---|---|---|---|---|
| Data Warehouse | Columnar, structured | Schema-on-write (predefined) | BI, analytics, SQL queries | Snowflake, BigQuery, Redshift |
| Data Lake | Object storage (raw) | Schema-on-read (flexible) | Raw storage, ML, exploration | S3, GCS, ADLS |
| Lakehouse | Object storage + table format | Schema enforcement optional | Unified analytics + ML | Databricks, Delta Lake, Apache Iceberg |
| Data Mart | Columnar, structured | Schema-on-write | Department-specific analytics | Subset of DW for Sales/Finance |
ETL vs ELT
ETL (Extract → Transform → Load)
- Transform data BEFORE loading into destination
- Traditional approach — used when destination can't handle raw data
- Transformation logic lives in middle tier (Spark, custom code)
- Use when: legacy DWs, compliance/PII masking before storage, complex transformations
ELT (Extract → Load → Transform)
- Load raw data FIRST, transform inside the destination
- Modern approach — cloud DWs are powerful enough to transform at scale
- dbt runs transformations inside Snowflake/BigQuery using SQL
- Use when: modern cloud DW, need raw data for multiple use cases, easier debugging
💡 Modern Best Practice
ELT + dbt is now the dominant pattern. Load raw data to S3/GCS (bronze), land it in your DW staging schema, then dbt transforms it into silver (cleaned) and gold (business-ready) layers.
Data Modeling Patterns
Star Schema
- Central fact table with foreign keys to dimension tables
- Denormalized — fast query performance
- Best for: analytics, BI reporting
- Example: fact_orders → dim_customer, dim_product, dim_date
Snowflake Schema
- Normalized dimension tables (dimensions have sub-dimensions)
- Saves storage but more complex JOINs
- Example: dim_product → dim_category → dim_department
- Use when: storage matters, many dimension hierarchies
Data Vault
- Hubs (business keys), Links (relationships), Satellites (attributes)
- Insert-only — full historical auditability
- Highly scalable, parallel loading
- Best for: enterprise DW with frequent schema changes
Slowly Changing Dimensions (SCD)
| Type | How it Works | Keeps History? | Use Case |
|---|---|---|---|
| Type 1 | Overwrite old value with new value | ❌ No | Correcting data errors (typos, fix address) |
| Type 2 | Insert new row; old row gets end date; add is_current flag | ✅ Full history | Track changes over time (customer moves, price changes) |
| Type 3 | Add "previous value" column to existing row | ⚠️ One previous version | Track one-time planned changes |
| Type 4 | Store current in main table, history in separate history table | ✅ In separate table | High-volume history with fast current lookups |
SCD Type 2 Implementation Pattern:
-- Step 1: expire the current row when a tracked attribute changes,
-- and insert brand-new customers directly.
MERGE INTO dim_customer AS target
USING staging_customer AS source
ON target.customer_id = source.customer_id AND target.is_current = TRUE
WHEN MATCHED AND (
    target.email != source.email OR target.city != source.city  -- use a null-safe comparison (IS DISTINCT FROM) if these columns can be NULL
) THEN UPDATE SET
    is_current = FALSE,
    valid_to = CURRENT_DATE
WHEN NOT MATCHED THEN INSERT (
    customer_id, email, city, valid_from, valid_to, is_current
) VALUES (
    source.customer_id, source.email, source.city, CURRENT_DATE, NULL, TRUE
);

-- Step 2: insert a new current row for every customer whose old row was expired in step 1.
INSERT INTO dim_customer (customer_id, email, city, valid_from, valid_to, is_current)
SELECT source.customer_id, source.email, source.city, CURRENT_DATE, NULL, TRUE
FROM staging_customer source
JOIN dim_customer target
  ON target.customer_id = source.customer_id
 AND target.is_current = FALSE
 AND target.valid_to = CURRENT_DATE;
Batch vs Streaming Processing
Batch Processing
- What: Process accumulated data at scheduled intervals (hourly, daily)
- Tools: Spark, dbt, Airflow, Hadoop MapReduce (a minimal Airflow sketch follows this list)
- Latency: Minutes to hours
- Use cases: Daily reporting, data warehouse loads, model training, batch ML scoring
- Pros: Simple, high throughput, efficient for large volumes, replay-friendly
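A minimal sketch of what "scheduled intervals" looks like in an orchestrator, assuming Airflow 2.4+; the DAG id and the `load_orders` callable are illustrative placeholders, not a prescribed layout:

```python
# Minimal daily batch job sketch (assumes Airflow 2.4+; names are illustrative).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_orders():
    # Placeholder for the actual extract/load step (Spark job, dbt run, etc.)
    print("loading yesterday's orders batch")


with DAG(
    dag_id="daily_orders_batch",
    schedule="@daily",               # process accumulated data once per day
    start_date=datetime(2024, 1, 1),
    catchup=False,                   # don't automatically backfill missed runs
) as dag:
    PythonOperator(task_id="load_orders", python_callable=load_orders)
```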
Stream Processing
- What: Process events as they arrive, continuously
- Tools: Kafka Streams, Flink, Spark Structured Streaming, Kinesis (a watermarking sketch follows this list)
- Latency: Milliseconds to seconds
- Use cases: Fraud detection, real-time dashboards, recommendations, alerting
- Challenges: Late data, exactly-once semantics, state management
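As a rough sketch of how late data is bounded in practice, here is a windowed count with a watermark in Spark Structured Streaming (requires the spark-sql-kafka package; the broker address, topic, and column names are illustrative):

```python
# Windowed aggregation with a watermark: events more than 2 hours late are dropped.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load()
          .selectExpr("CAST(value AS STRING) AS payload", "timestamp AS event_time"))

# Wait up to 2 hours for stragglers before finalizing each 10-minute window.
counts = (events
          .withWatermark("event_time", "2 hours")
          .groupBy(F.window("event_time", "10 minutes"))
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
```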
Medallion Architecture (Bronze → Silver → Gold)
🥉 Bronze Layer
- Raw data exactly as received from source
- No transformations — preserve original
- Schema-on-read
- Append-only, immutable
- All historical data retained
- Source for reprocessing
🥈 Silver Layer
- Cleaned, validated, standardized data
- Deduplication, null handling
- Data type enforcement
- PII masked/tokenized
- Joined with reference data
- Conformed dimensions
🥇 Gold Layer
- Business-ready aggregates
- Fact & dimension tables
- Denormalized for query speed
- KPIs, metrics, reporting tables
- ML feature tables
- Served to BI tools / APIs (a compact sketch of all three layers follows below)
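A compact PySpark + Delta Lake sketch of the three layers, assuming a Delta-enabled Spark session; paths, columns, and the exact cleaning steps are illustrative rather than a prescribed layout:

```python
# Bronze -> Silver -> Gold in PySpark + Delta Lake (names are illustrative).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion").getOrCreate()

# Bronze: land raw files as-is, append-only
raw = spark.read.json("s3a://bucket/landing/orders/")
raw.write.format("delta").mode("append").save("s3a://bucket/bronze/orders/")

# Silver: clean, deduplicate, standardize types
silver = (spark.read.format("delta").load("s3a://bucket/bronze/orders/")
          .dropDuplicates(["order_id"])
          .filter(F.col("amount") > 0)
          .withColumn("order_date", F.to_date("order_timestamp")))
silver.write.format("delta").mode("overwrite").save("s3a://bucket/silver/orders/")

# Gold: business-ready aggregate served to BI
gold = silver.groupBy("order_date").agg(F.sum("amount").alias("daily_revenue"))
gold.write.format("delta").mode("overwrite").save("s3a://bucket/gold/daily_revenue/")
```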
Window Functions — Complete Reference
-- ROW_NUMBER: most recent order per customer (order_rank = 1 is the latest)
SELECT
    order_id, customer_id, order_date, amount,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS order_rank
FROM orders;

-- RANK vs DENSE_RANK: RANK leaves gaps after ties, DENSE_RANK does not
SELECT name, salary,
    RANK() OVER (ORDER BY salary DESC) AS salary_rank,
    DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_dense_rank
FROM employees;

-- LAG / LEAD: compare each row to the previous / next row
SELECT
    date, revenue,
    LAG(revenue, 1, 0) OVER (ORDER BY date) AS prev_day_revenue,
    LEAD(revenue, 1, 0) OVER (ORDER BY date) AS next_day_revenue,
    revenue - LAG(revenue) OVER (ORDER BY date) AS day_over_day_change
FROM daily_metrics;

-- NTILE: split customers into quartiles by total spend
SELECT customer_id, total_spend,
    NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile
FROM customer_summary;

-- Frame clauses: running total and 7-day rolling average
SELECT date, revenue,
    SUM(revenue) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
    AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7d_avg
FROM daily_revenue;

-- FIRST_VALUE / LAST_VALUE: LAST_VALUE needs the full frame to see the latest row
SELECT product_id, sale_date, price,
    FIRST_VALUE(price) OVER (PARTITION BY product_id ORDER BY sale_date) AS first_price,
    LAST_VALUE(price) OVER (PARTITION BY product_id ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS latest_price
FROM price_history;
CTEs and Recursive CTEs
-- CTE chain: monthly revenue with month-over-month growth
WITH monthly_revenue AS (
    SELECT DATE_TRUNC('month', order_date) AS month,
           SUM(amount) AS revenue
    FROM orders GROUP BY 1
),
revenue_with_growth AS (
    SELECT month, revenue,
        LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
        (revenue - LAG(revenue) OVER (ORDER BY month)) / NULLIF(LAG(revenue) OVER (ORDER BY month), 0) * 100 AS growth_pct
    FROM monthly_revenue
)
SELECT * FROM revenue_with_growth WHERE month >= '2024-01-01';

-- Recursive CTE: walk the org chart from the CEO (no manager) downward
WITH RECURSIVE org_hierarchy AS (
    SELECT employee_id, name, manager_id, 0 AS level         -- anchor: top of the tree
    FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.employee_id, e.name, e.manager_id, h.level + 1  -- recursive step: direct reports
    FROM employees e
    JOIN org_hierarchy h ON e.manager_id = h.employee_id
)
SELECT * FROM org_hierarchy ORDER BY level, name;
SQL Query Optimization Techniques
Indexing Strategy
- B-tree index: Default. Best for equality and range queries on columns in WHERE, JOIN, ORDER BY
- Composite index: Index on (col_a, col_b). Most selective column first. Column order matters — query must use leftmost prefix
- Covering index: Index includes all columns needed by query — avoids table lookup entirely
- Partial index: Index only subset of rows (WHERE is_active = TRUE) — smaller, faster
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date DESC);
Query Plan Analysis
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE customer_id = 123
AND status = 'completed';
- Seq Scan: Full table scan — bad for large tables
- Index Scan: Uses index — good
- Nested Loop: Good for small result sets
- Hash Join: Good for large datasets without index
- Merge Join: Both inputs sorted — efficient for ranges
- Look for: high actual rows vs estimated rows (statistics stale → run ANALYZE)
Advanced SQL Patterns
-- PIVOT: rows to columns (Snowflake / Oracle-style PIVOT syntax)
SELECT *
FROM (SELECT month, category, revenue FROM sales)
PIVOT (SUM(revenue) FOR category IN ('Electronics', 'Clothing', 'Food'));

-- JSON extraction (PostgreSQL JSONB operators)
SELECT
    user_id,
    metadata->>'email' AS email,
    (metadata->'preferences'->>'theme') AS theme
FROM users WHERE metadata ? 'email';

-- Deduplication: keep the newest row per external_id, delete the rest
DELETE FROM orders WHERE order_id IN (
    SELECT order_id FROM (
        SELECT order_id,
            ROW_NUMBER() OVER (PARTITION BY external_id ORDER BY created_at DESC) AS rn
        FROM orders
    ) t WHERE rn > 1
);

-- Cohort analysis: group users by first-purchase month, count active users per period
WITH cohorts AS (
    SELECT user_id, DATE_TRUNC('month', first_purchase) AS cohort_month
    FROM (SELECT user_id, MIN(order_date) AS first_purchase FROM orders GROUP BY 1) t
),
activity AS (
    SELECT o.user_id, c.cohort_month,
        DATEDIFF('month', c.cohort_month, o.order_date) AS period_number
    FROM orders o JOIN cohorts c USING(user_id)
)
SELECT cohort_month, period_number,
    COUNT(DISTINCT user_id) AS active_users
FROM activity
GROUP BY 1, 2 ORDER BY 1, 2;
Transaction Isolation Levels
| Level | Dirty Read | Non-Repeatable Read | Phantom Read | Use When |
|---|---|---|---|---|
| READ UNCOMMITTED | ✅ Possible | ✅ | ✅ | Rarely; only rough analytics where dirty reads are acceptable |
| READ COMMITTED | ❌ Prevented | ✅ | ✅ | Default in most DBs — general use |
| REPEATABLE READ | ❌ | ❌ Prevented | ✅ | Financial calculations needing consistent view |
| SERIALIZABLE | ❌ | ❌ | ❌ Prevented | Critical transactions (bank transfers) |
Spark Architecture
Components
- Driver: Main process — creates SparkContext, submits jobs, coordinates execution. Runs on the client machine (client deploy mode) or on a cluster node (cluster deploy mode).
- Executors: Worker processes that run tasks. Each has CPU cores + memory.
- Cluster Manager: YARN, Kubernetes, Mesos, or Standalone. Allocates resources.
- Task: Smallest unit of work — runs on one partition of data
- Stage: Group of tasks that can run without shuffle
- Job: Triggered by an action (collect, write, count)
Key Concepts
- Lazy evaluation: Transformations don't execute until an action is called — allows Catalyst to optimize the full query plan (a short sketch follows this list)
- DAG: Directed Acyclic Graph of stages and tasks — Spark builds this before execution
- Shuffle: Redistribution of data across partitions — expensive (involves disk I/O and network). Triggered by: groupBy, join, repartition, distinct
- Spill: When data doesn't fit in memory, Spark spills to disk — very slow. Fix: increase memory or reduce data per partition
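A tiny, self-contained illustration of lazy evaluation and where a shuffle appears; the column names and bucket logic are assumptions for the example:

```python
# Lazy evaluation: nothing runs until an action is called.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

df = spark.range(10_000).withColumn("amount", F.rand() * 100)          # lazy: nothing computed yet
filtered = df.filter(F.col("amount") > 50)                             # transformation: still lazy
grouped = filtered.groupBy((F.col("id") % 10).alias("bucket")).count() # groupBy will require a shuffle

grouped.explain()           # inspect the physical plan Catalyst produced
result = grouped.collect()  # action: the whole DAG executes now
```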
PySpark — Essential Operations
# Spark session with Delta Lake support
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

spark = (SparkSession.builder
         .appName("ETL")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .getOrCreate())

# Read raw Parquet, merging schemas across files
df = spark.read.format("parquet").option("mergeSchema", "true").load("s3a://bucket/raw/orders/")

# Cleaning: filter, derive columns, rename, dedupe, fill nulls
df_clean = (df
    .filter(F.col("amount") > 0)
    .withColumn("order_date", F.to_date("order_timestamp"))
    .withColumn("year_month", F.date_format("order_date", "yyyy-MM"))
    .withColumnRenamed("cust_id", "customer_id")
    .dropDuplicates(["order_id"])
    .na.fill({"discount": 0, "promo_code": "NONE"})
)

# Aggregation: monthly revenue metrics per category
monthly_revenue = df_clean.groupBy("year_month", "category").agg(
    F.sum("amount").alias("total_revenue"),
    F.count("order_id").alias("order_count"),
    F.countDistinct("customer_id").alias("unique_customers"),
    F.avg("amount").alias("avg_order_value"),
)

# Window function: first order per customer
window_spec = Window.partitionBy("customer_id").orderBy("order_date")
df_with_rank = df_clean.withColumn("order_rank", F.row_number().over(window_spec))
df_first_order = df_with_rank.filter(F.col("order_rank") == 1)

# Delta Lake upsert (MERGE) into the silver table
delta_table = DeltaTable.forPath(spark, "s3a://bucket/silver/orders/")
delta_table.alias("target").merge(
    df_clean.alias("source"),
    "target.order_id = source.order_id"
).whenMatchedUpdate(set={
    "amount": "source.amount",
    "status": "source.status",
    "updated_at": "current_timestamp()"
}).whenNotMatchedInsertAll().execute()

# Partitioned write (full overwrite of the silver table)
df_clean.write.format("delta").mode("overwrite") \
    .partitionBy("year_month") \
    .option("overwriteSchema", "true") \
    .save("s3a://bucket/silver/orders/")
Spark Performance Optimization
| Problem | Symptom | Solution |
|---|---|---|
| Data skew | One executor takes 10x longer than others; one partition much larger | Salt join key (add random suffix), AQE (spark.sql.adaptive.enabled=true), skewHint() |
| Too many small files | Many small partitions; metadata overhead; slow file listing | coalesce() before write, OPTIMIZE in Delta Lake, delta.targetFileSize |
| Out of memory | GC overhead or OOM errors on executors | Increase executor memory, reduce partition size (repartition), cache only what's needed |
| Shuffle bottleneck | Shuffle Read/Write stage takes most time | Broadcast join for small tables, reduce groupBy cardinality, increase spark.sql.shuffle.partitions |
| Too few partitions | Not utilizing all cores; one executor saturated | Repartition: aim for 128MB–256MB per partition. Rule: cores × 2-4 |
# Adaptive Query Execution: coalesce small partitions and split skewed ones at runtime
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Broadcast the small reference table to avoid shuffling the large one
df_orders.join(F.broadcast(df_small_reference), "product_id")

# Cache a reused DataFrame; the count() action materializes it
df_customers.cache()
df_customers.count()

# Delta Lake maintenance: compact files and co-locate data, then clean up old files (7 days)
spark.sql("OPTIMIZE silver.orders ZORDER BY (customer_id, order_date)")
spark.sql("VACUUM silver.orders RETAIN 168 HOURS")
dbt Core Concepts
Materializations
| Type | Behavior | Use When |
|---|---|---|
| table | DROP + CREATE TABLE each run | Final gold layer models |
| view | CREATE OR REPLACE VIEW | Lightweight staging models |
| incremental | INSERT/MERGE only new records | Large tables — only process new data |
| ephemeral | CTE inlined into downstream query | Intermediate transformations |
Key Commands
dbt run                      # run all models
dbt run --select orders      # run a single model
dbt run --select +orders     # run a model plus all of its upstream dependencies
dbt test                     # run all tests
dbt test --select orders     # test a single model
dbt build                    # run + test + snapshot + seed in dependency order
dbt docs generate            # build the documentation site and lineage graph
dbt docs serve               # serve the docs locally
dbt compile                  # compile Jinja/SQL without executing it
dbt debug                    # check the project config and warehouse connection
dbt snapshot                 # capture SCD Type 2 snapshots
dbt Incremental Model Pattern
{{
config(
materialized = 'incremental',
unique_key = 'order_id',
incremental_strategy = 'merge',
on_schema_change = 'sync_all_columns'
)
}}
WITH source AS (
SELECT * FROM {{ source('raw', 'orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
),
cleaned AS (
SELECT
order_id,
customer_id,
CAST(amount AS NUMERIC(12,2)) AS amount,
LOWER(status) AS status,
DATE(created_at) AS order_date,
created_at,
COALESCE(updated_at, created_at) AS updated_at
FROM source
WHERE amount > 0
AND order_id IS NOT NULL
)
SELECT * FROM cleaned
dbt Tests, Sources & Macros
version: 2

sources:
  - name: raw
    database: my_database
    schema: raw_schema
    tables:
      - name: orders
        loaded_at_field: created_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}

models:
  - name: stg_orders
    description: "Cleaned and standardized orders from OLTP"
    columns:
      - name: order_id
        description: "Unique order identifier"
        tests:
          - not_null
          - unique
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:  # requires the dbt_utils package
              min_value: 0
              max_value: 100000
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'completed', 'refunded', 'cancelled']
Q1. What is the difference between a Data Lake and a Data Warehouse?
Answer: A Data Warehouse stores structured, processed data with a predefined schema (schema-on-write). It's optimized for SQL analytics and BI reporting. Data is cleaned and modeled before loading. Examples: Snowflake, BigQuery, Redshift. A Data Lake stores raw data in any format (structured, semi-structured, unstructured) with schema applied at query time (schema-on-read). It's cheaper, more flexible, but requires more processing before it's usable. Examples: S3, ADLS, GCS. A Lakehouse (Delta Lake, Iceberg) combines the low-cost storage of data lakes with the ACID transactions and schema enforcement of data warehouses. Modern architectures often use: raw data in object storage (bronze), Delta Lake for silver/gold, Snowflake/BigQuery for serving layer.
Q2. How do you handle late-arriving data in a streaming pipeline?
Answer: Late data is a fundamental challenge in streaming. Solutions: (1) Watermarks: Tell the stream processor to wait W time before considering a window complete. In Spark Structured Streaming: df.withWatermark("event_time", "2 hours"). Events arriving more than 2 hours late are dropped. (2) Reprocessing: Keep raw events in Kafka for 7 days. Periodically reprocess to correct historical windows. (3) Lambda/Kappa architecture: Batch layer corrects streaming layer's approximate results. (4) Accumulating snapshot: Update the same aggregate row as late data arrives (requires idempotent writes). (5) Allowed lateness in Flink: Process late events up to X hours late, emit corrections. The right strategy depends on latency SLA vs. correctness requirements.
Q3. What is idempotency and why is it critical in data pipelines?
Answer: An idempotent operation produces the same result regardless of how many times it's executed. Critical because: pipelines WILL fail and be retried. If your pipeline isn't idempotent, retries cause duplicate data. Examples of idempotent patterns: (1) MERGE/UPSERT instead of INSERT: INSERT can create duplicates; MERGE on natural key is safe to retry. (2) Write to S3 with deterministic path: Overwrite same path on retry — same content, no duplicate files. (3) Use execution date as partition key: Overwrite year=2024/month=01/day=15/ on retry — no duplicates. (4) Truncate-and-insert: For small tables, delete all rows for the period, then insert fresh. (5) Exactly-once Kafka consumers: Use transactional API to prevent duplicate processing. Always design for "what happens if this runs twice?"
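As one possible sketch of pattern (3), here is a PySpark load that overwrites only the execution-date partition, so a retry is harmless; the paths, columns, and hard-coded date are illustrative (the date would normally be injected by the orchestrator):

```python
# Idempotent daily load: re-running it for the same date produces the same result.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("idempotent-load").getOrCreate()
execution_date = "2024-01-15"   # normally passed in by the scheduler

daily = (spark.read.parquet("s3a://bucket/raw/orders/")
         .filter(F.col("order_date") == execution_date))

# Dynamic partition overwrite replaces only the partitions present in `daily`
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
(daily.write
      .mode("overwrite")
      .partitionBy("order_date")
      .parquet("s3a://bucket/silver/orders/"))
```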
Q4. Explain SCD Type 2 and how you implement it.
Answer: SCD Type 2 (Slowly Changing Dimension Type 2) preserves full historical record of changes by inserting a new row for each change rather than overwriting. Implementation adds three columns: valid_from (when this version became active), valid_to (when this version was superseded — NULL for current), is_current (boolean flag). Implementation via MERGE: when source record matches target natural key and attributes changed → update old row (set valid_to = today, is_current = false) → insert new row (valid_from = today, valid_to = NULL, is_current = true). In dbt, use dbt snapshot with strategy: timestamp or check. In Delta Lake, MERGE handles this elegantly. Key uses: tracking customer address changes (for shipping), price changes over time, employee department changes.
Q5. How does Kafka ensure exactly-once delivery?
Answer: Exactly-once in Kafka requires three components working together: (1) Idempotent Producer: enable.idempotence=true — each producer gets a PID, each message gets a sequence number. Broker rejects duplicate sequence numbers, preventing duplicate writes even on retries. (2) Transactional API: Producer begins transaction, writes to multiple partitions atomically, commits or aborts. If producer fails mid-write, uncommitted messages are never consumed. (3) Transactional Consumer: isolation.level=read_committed — consumer only reads committed transactions, never sees aborted messages. Kafka Streams provides exactly-once E2E: processing.guarantee=exactly_once_v2. This is critical for financial use cases (payment events, inventory updates).
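A minimal sketch of the idempotent + transactional producer side using confluent-kafka-python; the broker address, topic names, and transactional id are illustrative:

```python
# Transactional producer: both writes become visible atomically, or not at all.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker:9092",
    "enable.idempotence": True,          # dedupe retries via PID + sequence numbers
    "transactional.id": "payments-producer-1",
})

producer.init_transactions()
producer.begin_transaction()
try:
    producer.produce("payments", key="order-123", value=b'{"amount": 42.0}')
    producer.produce("payment-audit", key="order-123", value=b'{"status": "ok"}')
    producer.commit_transaction()        # read_committed consumers see both messages together
except Exception:
    producer.abort_transaction()         # read_committed consumers never see the aborted writes
    raise
```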
Q6. What is data skew in Spark and how do you fix it?
Answer: Data skew occurs when data is unevenly distributed across partitions — one partition gets 10x more data than others. This causes some tasks to take much longer, bottlenecking the job. Detection: check Spark UI Stages tab — if one task is much slower than median. Causes: joining on a key with highly non-uniform distribution (e.g., joining on country_code where 90% of rows are 'US'). Fixes: (1) Salting: Add random suffix (0–9) to join key on the large table, expand the small table 10x with each suffix → distribute load. (2) AQE (Spark 3+): spark.sql.adaptive.skewJoin.enabled=true — automatically splits skewed partitions. (3) Broadcast join: If one side is small (<10MB), broadcast it to avoid shuffle entirely. (4) Repartition before groupBy: Use repartition on a different column to distribute data before the skewed operation.
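A rough sketch of fix (1), salting, in PySpark; the table paths, the skewed key country_code, and the salt factor of 10 are assumptions for illustration:

```python
# Salting a skewed join key: the large side gets a random salt 0-9,
# the small side is exploded once per salt value, then both join on the salted key.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salted-join").getOrCreate()
large = spark.read.parquet("s3a://bucket/events/")      # skewed on country_code
small = spark.read.parquet("s3a://bucket/countries/")   # small dimension table

N = 10
large_salted = large.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("country_code"), (F.rand() * N).cast("int").cast("string")))

small_salted = (small
    .withColumn("salt", F.explode(F.array([F.lit(i) for i in range(N)])))
    .withColumn("salted_key", F.concat_ws("_", F.col("country_code"), F.col("salt").cast("string"))))

joined = large_salted.join(small_salted, "salted_key")  # hot key is now spread across N partitions
```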
Q7. How do you design a partitioning strategy for a large table?
Answer: Partitioning splits data into separate files/directories to enable partition pruning — the query engine reads only the relevant partition. Design principles: (1) Cardinality: Partition key should have low-to-medium cardinality. Date (365/year) is ideal. Too high cardinality (user_id) creates millions of tiny files — "small file problem". (2) Query patterns: Partition on the column most commonly used in WHERE clause. (3) Partition size: Target 128MB–1GB per partition file. Too small → overhead; too large → parallelism issues. (4) Common partitioning: Date/time (by year/month/day depending on volume), region/country (for geo-distributed data), event_type (for log data). In BigQuery, use date partitioning + clustering. In Delta Lake, partitionBy("year_month") + ZORDER BY customer_id for secondary filtering.
Q8. Explain the Medallion Architecture and when you'd deviate from it.
Answer: Medallion (Bronze-Silver-Gold) is a layered data architecture: Bronze = raw data exactly as received, Silver = cleaned/validated/standardized data, Gold = business-aggregated, query-optimized data. Benefits: (1) Each layer has a clear contract and owner. (2) Bronze allows reprocessing from raw data at any time. (3) Separation of concerns — raw storage vs. business logic. (4) Failed pipelines can be re-run from Bronze without re-ingesting. When to deviate: if data is extremely simple and a two-layer approach is sufficient; for streaming where the boundary between layers is less clear; for very small datasets where overhead isn't justified. Some teams add a "Platinum" layer for ML-specific feature tables. The pattern works best with Delta Lake/Iceberg format for ACID support at each layer.
Q9. What is Change Data Capture (CDC) and how do you implement it?
Answer: CDC captures every INSERT, UPDATE, DELETE event on a source database and streams them downstream — enabling real-time replication without full table scans. Implementation methods: (1) Log-based CDC (best): Read database transaction logs (MySQL binlog, PostgreSQL WAL, SQL Server CDC). Tools: Debezium (open-source), AWS DMS, Fivetran. Low latency (<1s), no impact on source. (2) Trigger-based: Database triggers write changes to audit table. High overhead, not recommended for high-volume tables. (3) Polling/Timestamp-based: Query WHERE updated_at > last_run_time. Simple but misses deletes, vulnerable to missed updates if updated_at isn't maintained. CDC is critical for: building real-time data warehouse, synchronizing microservices databases, audit trails. Debezium → Kafka → Spark/Flink is the most common production pattern.
Q10. How do you ensure data quality in a pipeline?
Answer: Multi-layer approach: (1) Schema validation: Enforce schema on ingest — fail fast if source changes (dbt on_schema_change, Delta Lake schema enforcement). (2) Business rule tests: dbt tests — not_null, unique, accepted_values, relationships (referential integrity), custom SQL tests. (3) Statistical monitoring: Great Expectations, dbt Elementary — monitor column means, null rates, distributions, detect anomalies. (4) Freshness checks: Alert if Bronze data isn't arriving on schedule (dbt source freshness). (5) Row count reconciliation: Compare source row count to destination. (6) Data contracts: Formal agreements between data producers and consumers — schema, freshness, SLAs. (7) Quarantine bad records: Route records failing quality checks to error table for investigation rather than silently dropping. Quality gates at each Medallion layer — Silver models should not load if Bronze quality checks fail.
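As one way to implement pattern (7), a PySpark sketch of the quarantine gate; paths, rules, and column names are illustrative:

```python
# Quarantine pattern: split records that fail basic checks into an error table
# instead of silently dropping them.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("quality-gate").getOrCreate()
orders = spark.read.format("delta").load("s3a://bucket/bronze/orders/")

valid_rules = (F.col("order_id").isNotNull()
               & (F.col("amount") > 0)
               & F.col("status").isin("pending", "completed", "refunded", "cancelled"))

# coalesce() treats rows where a rule evaluates to NULL as invalid
flagged = orders.withColumn("is_valid", F.coalesce(valid_rules, F.lit(False)))
good = flagged.filter("is_valid").drop("is_valid")
bad = (flagged.filter(~F.col("is_valid")).drop("is_valid")
       .withColumn("quarantined_at", F.current_timestamp()))

good.write.format("delta").mode("append").save("s3a://bucket/silver/orders/")
bad.write.format("delta").mode("append").save("s3a://bucket/quarantine/orders/")
```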
Q11–Q30 (Additional Critical Questions)
Q11. Explain ACID properties. Atomicity: all or nothing (transaction commits or fully rolls back). Consistency: data always meets constraints (FK, NOT NULL). Isolation: concurrent transactions don't see each other's intermediate state. Durability: committed transactions survive crashes (written to WAL before commit). Delta Lake brings ACID to object storage via transaction logs.
Q12. What is the difference between repartition and coalesce in Spark? repartition(n): triggers full shuffle — redistributes data evenly across n partitions. Use to increase or decrease partitions, or when you need data evenly distributed. coalesce(n): narrow transformation — combines existing partitions without full shuffle (only decreases partition count). Use when reducing partition count since it's much cheaper. Rule: increase partitions → repartition; decrease partitions → coalesce.
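A quick, self-contained illustration of the difference (the DataFrame and partition counts are arbitrary):

```python
# repartition vs coalesce: full shuffle vs cheap partition merge.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions-demo").getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "customer_id")

df200 = df.repartition(200, "customer_id")  # full shuffle: hash-distributes rows across 200 partitions
df10 = df200.coalesce(10)                   # narrow: merges existing partitions, no full shuffle
print(df200.rdd.getNumPartitions(), df10.rdd.getNumPartitions())  # 200 10
```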
Q13. What is a broadcast join in Spark and when do you use it? A broadcast join copies the smaller table to all executor nodes, avoiding shuffle of the large table. Triggered automatically when small table < spark.sql.autoBroadcastJoinThreshold (default 10MB). Force it: df_large.join(F.broadcast(df_small), "key"). Use when: one table fits in memory on each executor, join on a small lookup table (country codes, product categories). Avoids expensive shuffle of the large table.
Q14. Describe the difference between Lambda and Kappa architecture. Lambda has two parallel paths: batch layer (accurate, high latency) + speed layer (approximate, low latency) → serving layer merges results. Complex to maintain — same logic in two systems. Kappa: stream processing only. Kafka retains full log; reprocess from beginning when accuracy needed. Simpler, but streaming must handle both real-time and historical workloads. Kappa is preferred in modern systems with mature streaming frameworks.
Q15. How do you handle schema evolution in a data pipeline? Options: (1) Delta Lake / Iceberg: schema evolution supported — mergeSchema automatically handles additive changes. (2) dbt on_schema_change='sync_all_columns' adds new columns automatically. (3) Backward compatibility: always add new columns, never delete/rename (breaking change). (4) Schema registry (Confluent): enforce Avro/Protobuf schemas on Kafka topics — producers must register schemas, consumers know schema evolution rules. (5) Versioned data: land new schema as new path, transform both old and new versions in Silver.
Q16. What is the difference between Star Schema and Data Vault? Star Schema: fact + dimension tables, denormalized, optimized for query performance, easier to understand. Best for stable business definitions. Data Vault: Hubs (business keys), Links (associations), Satellites (attributes + history). Fully normalized, insert-only, highly scalable, agnostic to business rules. Best for enterprise DW where source systems and business rules change frequently, need full auditability.
Q17. How do you monitor a data pipeline in production? Layers: (1) Infrastructure: Prometheus + Grafana for CPU, memory, disk, network. (2) Pipeline metadata: Airflow UI for task status, duration, SLA misses. (3) Data quality: dbt test results, Great Expectations alerts. (4) Data freshness: monitor last load time per table, alert if stale. (5) Business metrics: anomaly detection on key KPIs (sudden drops in order count = pipeline issue vs. real business drop). (6) Alerting: PagerDuty for critical failures, Slack for warnings. (7) Data lineage: dbt docs shows impact analysis — know downstream consumers before changing a model.
Q18. Explain Kafka Consumer Groups. Consumer group = set of consumers that collectively read from a topic. Each partition is assigned to exactly ONE consumer in the group — ensures no duplicate processing within the group. Add consumers to scale throughput (up to #partitions max). Multiple groups can read the same topic independently (different offsets per group). Consumer group coordinator (broker) handles rebalancing when consumers join/leave. Lag = difference between latest offset and consumer's committed offset — monitor with kafka-consumer-groups.sh or Burrow.
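A minimal sketch of one group member using confluent-kafka-python; the broker, topic, group id, and the `process` handler are placeholders. Running several copies with the same group.id splits the topic's partitions between them:

```python
# One consumer in a consumer group; offsets are committed only after processing.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "orders-analytics",   # all members of this group share the partitions
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,      # commit manually after the message is handled
})
consumer.subscribe(["orders"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        process(msg.value())          # hypothetical handler for the message payload
        consumer.commit(msg)          # advance this group's committed offset
finally:
    consumer.close()
```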
Q19. What is dbt and why has it become dominant in modern data stacks? dbt (data build tool) brings software engineering best practices to SQL transformations: version control (Git), testing, documentation, modular design, dependency management. Engineers write SELECT statements; dbt handles CREATE TABLE/VIEW, generates DAGs, manages dependencies. Works inside your DW — no separate compute needed. Modern stack: dbt + Snowflake/BigQuery = full ELT pipeline. Key advantages: every model is tested, documented, and version-controlled; automatic lineage graph; code reuse via macros/packages; CI/CD integration via dbt Cloud or GitHub Actions.
Q20. How do you optimize Snowflake query costs? (1) Partition pruning: always filter on cluster key column (date). (2) Auto-suspend: set AUTO_SUSPEND = 60 seconds on all warehouses. (3) Result caching: identical queries use cached results (24h TTL) — free. (4) Warehouse sizing: use smallest size that meets SLA. Scale up for complex queries, not for concurrent users (use multi-cluster). (5) COPY vs INSERT: COPY INTO is 10x faster for bulk loading. (6) Zero-copy clones for dev/test instead of copying data. (7) INFORMATION_SCHEMA.QUERY_HISTORY: identify expensive queries. (8) Materialized views for repeated expensive aggregations.
Q21–Q30 Summary: Other high-frequency topics: Delta Lake Z-ORDER vs traditional partitioning (Z-ORDER clusters data by multiple columns for multi-dimensional filtering); Airflow XComs (cross-task communication via key-value pairs, limit to small data — not DataFrames); Spark Catalyst Optimizer (rule-based + cost-based query planning, generates optimized physical plan); Databricks Auto Loader (incremental file ingestion from S3/ADLS with schema inference and evolution — replaces manual file listing); Data contracts (formal schema + SLA agreements between producers and consumers enforced via schema registries); Reverse ETL (warehouse data pushed to operational tools like Salesforce, Marketo via Hightouch/Census); Iceberg vs Delta Lake (both ACID on object storage — Iceberg is an open standard with broad engine support, Delta Lake is Databricks-led with richer features); Great Expectations (data quality framework — define and test expectations on DataFrames and databases); dbt Exposures (document downstream consumers — BI dashboards, ML models — in lineage graph); Airflow Sensors (wait for external condition — FileSensor for S3 files, SQLSensor for query result, HttpSensor for API availability).