Data Analyst Cheat Sheet

SQL, Statistics, Python, and Beyond



SQL for Analysts — Complete Reference

The 4 Join Types

Joins combine data from multiple tables. The 4 main types handle different use cases.

INNER JOIN

Returns only rows where key exists in BOTH tables.

SELECT u.user_id, u.name, o.order_id
FROM users u
INNER JOIN orders o ON u.user_id = o.user_id;
-- Result: Only users who have orders

Use case: "Show me only users who made purchases"

LEFT JOIN

Returns ALL rows from left table + matches from right table.

SELECT u.user_id, u.name, o.order_id
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id;
-- Result: All users, with NULL for those without orders

Use case: "Show me all users, flag who hasn't ordered yet"

RIGHT JOIN

Returns ALL rows from right table + matches from left table.

SELECT u.user_id, u.name, o.order_id
FROM users u
RIGHT JOIN orders o ON u.user_id = o.user_id;
-- Result: All orders, with NULL for orders without a user (data quality issue!)

Use case: Rare. Usually "find orphaned records"

FULL OUTER JOIN

Returns all rows from BOTH tables.

SELECT u.user_id, u.name, o.order_id
FROM users u
FULL OUTER JOIN orders o ON u.user_id = o.user_id;
-- Result: All users + all orders (NULLs on both sides if no match)

Use case: "Find data quality issues: orphaned users or orders"

Aggregation & Grouping

GROUP BY aggregates rows into groups. Use aggregate functions like COUNT, SUM, AVG.

GROUP BY Syntax

SELECT
  DATE(order_date) as order_day,
  COUNT(*) as num_orders,
  SUM(amount) as total_revenue,
  AVG(amount) as avg_order_value
FROM orders
WHERE amount > 10            -- WHERE filters BEFORE grouping
GROUP BY DATE(order_date)
HAVING COUNT(*) > 5          -- HAVING filters AFTER grouping
ORDER BY order_day DESC;

Key concept: WHERE runs before GROUP BY, HAVING runs after. Use HAVING to filter on aggregate results.

Window Functions

Window functions operate on a "window" of rows while keeping individual rows intact.

ROW_NUMBER, RANK, DENSE_RANK

SELECT
  user_id,
  amount,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY amount DESC) as rn,
  RANK() OVER (PARTITION BY user_id ORDER BY amount DESC) as rnk
FROM orders;

ROW_NUMBER: Always unique (1, 2, 3, 4, 5)

RANK: Ties get same rank, skips next (1, 2, 2, 4, 5)

DENSE_RANK: Ties get same rank, no skip (1, 2, 2, 3, 4)
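The three functions differ only in how they number ties. A quick way to see all three side by side is a throwaway in-memory SQLite table (window functions require SQLite 3.25+, bundled with modern Python; the data is made up):

```python
import sqlite3

# Throwaway in-memory table with a tie at amount = 90 (made-up data)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (user_id INT, amount INT)")
con.executemany("INSERT INTO orders VALUES (1, ?)",
                [(100,), (90,), (90,), (80,)])

rows = con.execute("""
    SELECT amount,
           ROW_NUMBER() OVER (ORDER BY amount DESC) AS rn,
           RANK()       OVER (ORDER BY amount DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY amount DESC) AS drnk
    FROM orders
    ORDER BY rn
""").fetchall()

for row in rows:
    print(row)
# (100, 1, 1, 1)
# (90, 2, 2, 2)   <- tie: RANK and DENSE_RANK repeat
# (90, 3, 2, 2)
# (80, 4, 4, 3)   <- RANK skips to 4, DENSE_RANK continues at 3
```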

LAG & LEAD — Access Previous/Next Row

SELECT
  DATE(order_date) as order_date,
  revenue,
  LAG(revenue) OVER (ORDER BY DATE(order_date)) as prev_day_revenue,
  LEAD(revenue) OVER (ORDER BY DATE(order_date)) as next_day_revenue,
  revenue - LAG(revenue) OVER (ORDER BY DATE(order_date)) as daily_change
FROM daily_revenue
ORDER BY order_date;

Use case: Calculate day-over-day changes, churn detection (user was active yesterday but not today)

SUM/AVG OVER — Running Totals

SELECT
  DATE(order_date) as order_date,
  revenue,
  SUM(revenue) OVER (
    ORDER BY DATE(order_date)
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) as cumulative_revenue,
  AVG(revenue) OVER (
    ORDER BY DATE(order_date)
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) as moving_7day_avg
FROM daily_revenue
ORDER BY order_date;

ROWS BETWEEN: Controls window size. UNBOUNDED PRECEDING = from the start. 6 PRECEDING = the current row plus the 6 before it (a 7-row window)

Subqueries vs CTEs

Subquery (Nested SELECT)

SELECT user_id, revenue
FROM (
  SELECT user_id, SUM(amount) as revenue
  FROM orders
  GROUP BY user_id
) as user_revenue
WHERE revenue > 100;

Pro: Inline, obvious what data is used

Con: Messy with multiple levels

CTE (Common Table Expression)

WITH user_revenue AS (
  SELECT user_id, SUM(amount) as revenue
  FROM orders
  GROUP BY user_id
)
SELECT user_id, revenue
FROM user_revenue
WHERE revenue > 100;

Pro: Readable, reusable, easy to test

Con: Can act as an optimization fence in some engines (e.g., PostgreSQL before version 12)

Date Functions

Common Date Operations

-- Truncate to period (PostgreSQL)
DATE_TRUNC('month', order_date)              -- 2024-01-01

-- Calculate difference (days between dates)
SELECT DATEDIFF(day, start_date, end_date)   -- SQL Server
SELECT DATE(end_date) - DATE(start_date)     -- PostgreSQL

-- Add days to a date
DATE_ADD(order_date, INTERVAL 7 DAY)         -- MySQL
order_date + INTERVAL '7 day'                -- PostgreSQL

-- Extract year/month/day
EXTRACT(YEAR FROM order_date)

-- Convert string to date
TO_DATE(order_date_str, 'YYYY-MM-DD')

String Functions

String Manipulation

SELECT
  UPPER(name),                            -- 'john' → 'JOHN'
  LOWER(name),                            -- 'JOHN' → 'john'
  TRIM(email),                            -- Remove leading/trailing spaces
  CONCAT(first_name, ' ', last_name),     -- Combine strings
  SUBSTR(email, 1, 5),                    -- First 5 characters
  SPLIT(tags, ','),                       -- Convert 'a,b,c' to array (BigQuery)
  REPLACE(email, '@old.com', '@new.com')
FROM users;

NULL Handling

NULL Operations

-- IS NULL checks
WHERE phone_number IS NULL        -- Correct
WHERE phone_number = NULL         -- WRONG! Always false

-- COALESCE: return first non-null value
COALESCE(phone_number, email, 'no contact')

-- NULLIF: return NULL if two values are equal, else return the first
NULLIF(actual_revenue, expected_revenue)   -- NULL if equal

-- Aggregation ignores NULLs
COUNT(DISTINCT customer_id)       -- Automatically excludes NULLs

Interview-Level SQL Problems

Problem 1: Find Users with Consecutive Month Purchases

WITH monthly_purchases AS (
  SELECT
    user_id,
    DATE_TRUNC('month', order_date) as month,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY DATE_TRUNC('month', order_date)
    ) as purchase_seq
  FROM orders
  GROUP BY user_id, DATE_TRUNC('month', order_date)
),
consecutive AS (
  SELECT
    user_id,
    month,
    month - (purchase_seq || ' months')::interval as cohort
  FROM monthly_purchases
)
SELECT DISTINCT user_id
FROM consecutive
GROUP BY user_id, cohort
HAVING COUNT(*) >= 2;

Problem 2: Calculate 7-Day Rolling Average Revenue

SELECT
  order_date,
  revenue,
  AVG(revenue) OVER (
    ORDER BY order_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) as rolling_7day_avg
FROM daily_revenue
ORDER BY order_date;

Problem 3: Find Second Highest Salary per Department

WITH ranked_salaries AS (
  SELECT
    department,
    employee_name,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) as salary_rank
  FROM employees
)
SELECT department, employee_name, salary
FROM ranked_salaries
WHERE salary_rank = 2;

Problem 4: Calculate Month-over-Month Growth

WITH monthly_revenue AS (
  SELECT DATE_TRUNC('month', order_date) as month, SUM(amount) as revenue
  FROM orders
  GROUP BY DATE_TRUNC('month', order_date)
)
SELECT
  month,
  revenue,
  LAG(revenue) OVER (ORDER BY month) as prev_month_revenue,
  ROUND(
    100.0 * (revenue - LAG(revenue) OVER (ORDER BY month))
    / LAG(revenue) OVER (ORDER BY month),
    2
  ) as mom_growth_pct
FROM monthly_revenue
ORDER BY month;

Problem 5: Find Churned Customers (Active Last Month, Inactive This Month)

WITH last_month AS (
  SELECT DISTINCT user_id FROM events
  WHERE DATE_TRUNC('month', event_date) =
        DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
),
this_month AS (
  SELECT DISTINCT user_id FROM events
  WHERE DATE_TRUNC('month', event_date) = DATE_TRUNC('month', CURRENT_DATE)
)
SELECT user_id FROM last_month
WHERE user_id NOT IN (SELECT user_id FROM this_month);

SQL Performance Tips

  • Index the columns you filter and join on
  • Filter early: apply WHERE before joining or aggregating
  • Aggregate before joining large tables
  • Avoid wrapping indexed columns in functions inside WHERE
  • Prefer EXISTS over IN for large subqueries
  • Select only the columns you need (avoid SELECT *)

Python for Data Analysis

Pandas Basics

Pandas is the workhorse library for data manipulation in Python.

Loading and Exploring Data

import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Explore
df.head()       # First 5 rows
df.tail()       # Last 5 rows
df.info()       # Data types, null counts
df.describe()   # Summary statistics
df.shape        # (rows, columns)
df.columns      # Column names
df.dtypes       # Data type of each column

Data Cleaning Pipeline

# Check for nulls
df.isnull().sum()    # Count nulls per column
df.isnull().any()    # Any nulls in each column?

# Handle nulls
df.fillna(0)         # Replace with a value
df.fillna(df.mean()) # Replace with column means
df.dropna()          # Remove rows with any null

# Remove duplicates
df.drop_duplicates()
df.drop_duplicates(subset=['user_id'])

# Fix data types
df['order_date'] = pd.to_datetime(df['order_date'])
df['amount'] = pd.to_numeric(df['amount'])

# Fix inconsistent values
df['category'] = df['category'].str.lower().str.strip()
df.replace({'M': 1, 'F': 0})

Filtering & Selection

# Boolean filtering
df[df['amount'] > 100]
df[(df['amount'] > 100) & (df['category'] == 'A')]

# Select columns
df[['user_id', 'amount']]
df.loc[:, ['user_id', 'amount']]

# Select by position
df.iloc[0:5, 0:3]    # First 5 rows, first 3 columns

# String matching
df[df['email'].str.contains('@gmail.com')]
df[df['name'].str.startswith('J')]

GroupBy & Aggregation

# Simple groupby
df.groupby('category')['amount'].sum()

# Multiple aggregations
df.groupby('category').agg({
    'amount': ['sum', 'mean', 'count'],
    'user_id': 'nunique'
})

# Pivot table
df.pivot_table(values='amount', index='user_id',
               columns='category', aggfunc='sum')

# Melt (wide to long)
pd.melt(df, id_vars=['user_id'], value_vars=['jan', 'feb'])

NumPy for Numerical Computing

Core NumPy Operations

import numpy as np

# Array creation
arr = np.array([1, 2, 3, 4, 5])
np.zeros(5)            # Array of zeros
np.ones((2, 3))        # 2x3 array of ones
np.arange(0, 10, 2)    # 0, 2, 4, 6, 8
np.linspace(0, 1, 10)  # 10 evenly spaced values from 0 to 1

# Vectorized operations
arr * 2          # Multiply each element
arr + arr        # Element-wise addition
np.sqrt(arr)     # Square root of each element

# Statistics
np.mean(arr)
np.median(arr)
np.std(arr)              # Standard deviation
np.percentile(arr, 75)   # 75th percentile

Matplotlib for Plotting

Basic Plotting

import matplotlib.pyplot as plt

# Line plot
plt.plot(x, y, label='Sales')
plt.xlabel('Date')
plt.ylabel('Revenue')
plt.title('Daily Revenue')
plt.legend()
plt.show()

# Bar chart
plt.bar(categories, values)

# Histogram
plt.hist(data, bins=30, edgecolor='black')

# Scatter plot
plt.scatter(x, y, alpha=0.5)

# Multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes[0, 0].plot(x1, y1)
axes[0, 1].bar(x2, y2)

Seaborn for Statistical Plots

Seaborn Examples

import seaborn as sns

# Correlation heatmap
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')

# Box plot (distribution by category)
sns.boxplot(x='category', y='amount', data=df)

# Violin plot
sns.violinplot(x='category', y='amount', data=df)

# Scatter with regression line
sns.regplot(x='x', y='y', data=df)

# Distribution plot
sns.histplot(data=df, x='amount', kde=True)

Statistics for Data Analysts

Descriptive Statistics

Common Measures

  • Mean: Average. Sensitive to outliers. Use for normally distributed data.
  • Median: Middle value. Robust to outliers. Use for skewed data.
  • Mode: Most frequent value. Use for categorical data.
  • Variance: Average squared deviation from mean. Units are squared.
  • Standard Deviation (SD): Square root of variance. Same units as data. 68% of data within 1 SD of mean (normal distribution).
  • IQR (Interquartile Range): Q3 - Q1. Middle 50% of data. Good for detecting outliers (outliers are >1.5×IQR from Q1 or Q3).
  • Percentile: Value below which X% of data falls. 50th percentile = median.
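The 1.5×IQR outlier rule from the list above can be sketched in a few lines (the sample data is made up):

```python
import numpy as np

data = np.array([10, 12, 12, 13, 14, 15, 15, 16, 18, 95])  # 95 looks suspect

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the 1.5×IQR fences

outliers = data[(data < lower) | (data > upper)]
print(outliers)   # [95]
```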

Distributions

Normal Distribution

Bell curve. Symmetric. Mean = Median = Mode.

68-95-99.7 Rule:

  • 68% of data within 1 SD
  • 95% within 2 SD
  • 99.7% within 3 SD

Example: Height, test scores, measurement errors
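The 68-95-99.7 figures can be checked against the normal CDF (a quick scipy sketch; the numbers themselves are standard):

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} SD: {p:.1%}")
# within 1 SD: 68.3%
# within 2 SD: 95.4%
# within 3 SD: 99.7%
```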

Skewed Distributions

Right-skewed (positive): Long tail on right. Mean > Median.

Example: Income (few high earners pull mean up)

Left-skewed (negative): Long tail on left. Mean < Median.

Example: Test scores where most score well (ceiling effect)

Hypothesis Testing

The Hypothesis Testing Workflow

  1. State Hypothesis:
    • H0 (Null): No difference/effect. Status quo. "The control and treatment are the same"
    • H1 (Alternative): There is a difference. "The treatment is different"
  2. Choose test: t-test (continuous), chi-squared (categorical), ANOVA (3+ groups)
  3. Calculate test statistic: How far is our data from H0?
  4. Calculate p-value: Probability of observing this result if H0 is true
  5. Interpret: If p < 0.05, reject H0 (result is statistically significant)
  6. Conclusion: State in business terms, not stat terms
Critical Misconception: p-value is NOT the probability that H0 is true. p-value = P(data | H0 is true). It's the probability of seeing our result if H0 were true.

Common Statistical Tests

T-Test (Comparing Means)

When: Comparing means of two groups, continuous data

Assumptions: Normal distribution, equal variance (unless Welch's t-test)

from scipy.stats import ttest_ind

# Compare two groups
control_group = [10, 12, 15, 18, 20]
treatment_group = [14, 16, 19, 22, 25]
t_stat, p_value = ttest_ind(control_group, treatment_group)
# If p_value < 0.05: groups are significantly different

Chi-Squared Test (Comparing Proportions)

When: Comparing proportions/frequencies, categorical data

Example: Does email subject line A or B get more clicks?

from scipy.stats import chi2_contingency

# Contingency table
contingency_table = [
    [100, 50],  # Subject A: 100 clicks, 50 no-clicks
    [95, 55],   # Subject B: 95 clicks, 55 no-clicks
]
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
# If p_value < 0.05: subjects perform differently

A/B Testing: Statistical Significance

Setup: Control (A) vs Variant (B). Sample size determined upfront.

Key concept: Statistical significance ≠ practical significance

  • Large sample size → small differences become "significant"
  • Effect size matters: What's the magnitude of the difference?
  • Always report confidence interval, not just p-value
# Calculate confidence interval for an A/B test
import numpy as np

control_conversion = 0.05     # 5%
variant_conversion = 0.055    # 5.5%
n = 10000                     # per group

# Standard error of the difference
se_control = np.sqrt(control_conversion * (1 - control_conversion) / n)
se_variant = np.sqrt(variant_conversion * (1 - variant_conversion) / n)
se_diff = np.sqrt(se_control**2 + se_variant**2)

# 95% confidence interval
diff = variant_conversion - control_conversion
ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)

Type I and Type II Errors

Type I Error (False Positive)

We reject H0 when H0 is actually true.

"We think variant is better, but actually no difference"

Probability: α (alpha) = 0.05

Risk: Launch variant that doesn't actually help

Type II Error (False Negative)

We fail to reject H0 when H0 is actually false.

"We think variants are same, but variant actually is better"

Probability: β (beta) = 0.20

Power: 1 - β = 0.80 (ability to detect true effect)

Risk: Miss a real improvement

Correlation vs Causation

Correlation does NOT imply causation. Examples:
  • Ice cream sales ↑ → Shark attacks ↑ (both caused by summer)
  • More firefighters → More fire damage (not causation; more firefighters sent to bigger fires)

To prove causation: Need randomized experiment, control for confounders, or time-ordered data with clear mechanism

Correlation Metrics

Pearson Correlation: Linear relationship. Range: -1 (perfect negative) to +1 (perfect positive).

Spearman Correlation: Rank-based. Better for non-linear relationships and outliers.

Caution: Correlation = 0.8 looks strong, but R² = 0.64 (only 64% of variance explained)
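A quick sketch of the Pearson/Spearman difference, using a made-up cubic relationship (perfectly monotonic but non-linear):

```python
from scipy.stats import pearsonr, spearmanr

# y grows monotonically with x, but not linearly (hypothetical data)
x = [1, 2, 3, 4, 5, 6]
y = [v**3 for v in x]

pearson_r, _ = pearsonr(x, y)
spearman_r, _ = spearmanr(x, y)

# Spearman sees a perfect monotonic relationship; Pearson is dragged
# below 1 by the curvature
print(round(spearman_r, 3))   # 1.0
print(round(pearson_r, 3))    # < 1
```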

Regression

Linear Regression Interpretation

from sklearn.linear_model import LinearRegression

# Fit model: Revenue = intercept + coefficient × Marketing_Spend
model = LinearRegression()
model.fit(X, y)

print(f"Intercept: {model.intercept_}")    # Base revenue
print(f"Coefficient: {model.coef_[0]}")    # ROI: revenue per $1 spent
print(f"R²: {model.score(X, y)}")          # % of variance explained

Coefficient = 2.5: For every $1 increase in marketing spend, revenue increases by $2.50 (holding other factors constant)

R² = 0.75: 75% of revenue variance is explained by marketing spend. Other factors account for 25%.

Data Visualization Best Practices

Chart Selection

Bar Chart

Use for: Comparing values across categories

Example: Revenue by product category

Pro tip: Sort bars by value (descending)

Line Chart

Use for: Trends over time

Example: Daily active users over 30 days

Pro tip: Don't truncate y-axis unless context requires it

Scatter Plot

Use for: Relationship between two variables

Example: Marketing spend vs Revenue

Pro tip: Add regression line if showing correlation

Histogram

Use for: Distribution of single variable

Example: Distribution of customer ages

Pro tip: Choose bin size carefully (too many/few distorts view)

Heatmap

Use for: Correlation matrix, or two categorical variables

Example: Correlation of metrics, user activity by hour/day

Pro tip: Color scale should be intuitive (red=bad, green=good)

Box Plot

Use for: Distribution + outliers across groups

Example: Order value by customer segment

Pro tip: Show all data points if N is small

Pie Chart

Use for: Proportions only (max 5 slices)

Example: Market share by competitor

Pro tip: Bar chart is usually better. Don't use pie with >5 segments.

Bubble Chart

Use for: Three variables (x, y, bubble size)

Example: Revenue, profit, company size

Pro tip: Bubble size should be area, not radius (square root scale)

Tableau/Power BI Skills

Key Tableau Concepts

  • Calculated Fields: Create new dimensions/measures on the fly. Example: [Sales] / [Units] = price per unit
  • Filters: Context filters (run first), Dimension filters, Measure filters
  • Parameters: Interactive dropdowns. Users select value, dashboard updates
  • LOD Expressions:
    • FIXED: Compute across all rows (e.g., total sales regardless of what's filtered)
    • INCLUDE: Add dimensions to aggregation
    • EXCLUDE: Remove dimensions from aggregation
  • Drill-down: Click on bar → see detail rows

Key Power BI Concepts

  • DAX Measures: Power BI's calculation language. Similar to calculated fields
  • Relationships: Connect tables (similar to SQL joins)
  • Row-Level Security (RLS): Different users see different data based on roles
  • Power Query: ETL tool. Transform data before loading to model

Storytelling with Data

Golden Rule: Don't make the audience read. Tell them what to see.
  • Lead with the insight, not the chart
  • One key message per slide
  • Annotate the chart: highlight key points, add arrows
  • Use color to emphasize (only highlight what matters)
  • Include context: is this good or bad? How does it compare to last month?

Dashboard Design

Best Practices

  • Executive Summary at top: KPI cards, headline metrics, key insights
  • Trend charts: How are we doing over time?
  • Drill-down capability: Click to see detail without cluttering main view
  • Appropriate detail level: C-suite sees high-level; analysts see detail
  • Self-explanatory: Legend shouldn't be needed. Hover-over tooltips explain what each metric means
  • Update frequency: Daily for operational dashboards, weekly/monthly for business reviews
  • White space: Don't cram everything. Breathing room improves readability

Common Visualization Mistakes

  • Truncating Y-axis: Makes small differences look huge (misleading)
  • Dual Y-axis: Confusing. Choose one scale or separate charts
  • Rainbow colors: Non-colorblind friendly. Use perceptually uniform palettes
  • 3D charts: Hard to read. Stick to 2D
  • Pie charts with 7+ slices: Can't compare slices. Use bar chart instead
  • No context: "Revenue is $10M" means nothing. $10M vs last year? vs budget?

Business Metrics & KPIs

E-Commerce Metrics

Core E-Commerce KPIs

  • GMV (Gross Merchandise Value): Total transaction value. Includes returns.
  • Conversion Rate: % of visitors who purchase. Formula: Orders / Sessions × 100
  • Cart Abandonment Rate: % of users who add to cart but don't buy. High = friction in checkout
  • AOV (Average Order Value): Total revenue / number of orders. Target: increase through upsell/cross-sell
  • Repeat Purchase Rate: % of customers who buy again. Measures loyalty and product quality
  • Return Rate: % of items returned. High return = quality issues or mismatched expectations
  • Customer Lifetime Value (LTV): Total revenue from a customer. Formula: AOV × Purchase Frequency × Customer Lifespan
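The LTV formula above, sketched with hypothetical inputs:

```python
# LTV = AOV × Purchase Frequency × Customer Lifespan (made-up numbers)
aov = 50.0                  # average order value, $
purchases_per_year = 4      # purchase frequency
lifespan_years = 3          # average customer lifespan

ltv = aov * purchases_per_year * lifespan_years
print(f"LTV: ${ltv:.2f}")   # LTV: $600.00
```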

SaaS Metrics

Core SaaS KPIs

  • MRR (Monthly Recurring Revenue): Total subscription revenue per month
  • ARR (Annual Recurring Revenue): MRR × 12. Used for valuation
  • Churn Rate: % of customers who cancel each month. Formula: (Customers Lost / Start of Month Customers) × 100. Target: <5%
  • Net Revenue Retention (NRR): (Starting ARR + Expansion − Contraction − Churn) / Starting ARR. > 100% = growing from existing customers alone
  • Expansion MRR: Revenue increase from existing customers (upsell, add-ons)
  • CAC (Customer Acquisition Cost): Total sales + marketing cost / new customers. Should be paid back in 12 months
  • LTV:CAC Ratio: Should be 3:1 minimum, 5:1+ is strong
  • Activation Rate: % of new users who reach aha moment. Critical for retention
  • Time-to-Value: How long until user experiences value? Should be <7 days
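The churn and NRR definitions above, sketched with made-up numbers:

```python
# Hypothetical month for a SaaS business
starting_arr = 1_000_000
expansion = 80_000      # upsells/add-ons to existing customers
contraction = 10_000    # downgrades
churned = 40_000        # cancelled revenue

nrr = (starting_arr + expansion - contraction - churned) / starting_arr
print(f"NRR: {nrr:.0%}")   # NRR: 103% -> growing from existing customers alone
```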

Marketing Metrics

Marketing KPIs

  • CTR (Click-Through Rate): Clicks / impressions. Measures ad appeal and relevance
  • CPC (Cost Per Click): Ad spend / clicks. Related to CTR and bid amount
  • CPM (Cost Per Mille): Ad spend / impressions × 1000. Standard metric for display ads
  • ROAS (Return on Ad Spend): Revenue from ad / ad spend. Target: >3:1
  • Attribution Models:
    • First-touch: 100% credit to first channel user touched
    • Last-touch: 100% credit to last channel before conversion
    • Linear: Equal credit to all channels
    • Time-decay: More credit to channels closer to conversion
    • Data-driven: ML assigns credit based on actual conversion patterns

Engagement Metrics

How to Measure Product Engagement

  • DAU/MAU: Daily / Monthly Active Users. DAU/MAU = stickiness. >20% is good, >50% is excellent
  • Session Length: Average time spent per visit. Longer = more engaged (or frustrated?)
  • Sessions per User: How often users return. Frequency of use
  • Feature Adoption Rate: % of users using feature X. Launch feature, measure after 4 weeks
  • Bounce Rate: % of sessions where user leaves without action. High bounce = poor landing page or mismatch with expectations
  • Pages per Session: How deep do users go? More = more engaged

Funnel Metrics by Stage

Customer Journey Stages & Key Metrics

Awareness: Impressions, reach, brand searches

Interest: Click-through rate, time on site, pages per session

Consideration: Add to cart, wishlist adds, content downloads

Intent: Free trial signups, demo requests, pricing page views

Purchase: Conversion rate, AOV, transaction completion rate

Loyalty: Repeat purchase rate, NPS, customer lifetime value

Advocacy: Referral rate, social shares, reviews posted

Top 30 Interview Questions & Answers

1. How do you approach a new dataset?

Structured approach:

  1. Understand the context: What question are we answering? What's the business impact?
  2. Check data quality: Row count, column names, data types, nulls, duplicates
  3. Explore distributions: df.describe(), df.info(), check for outliers
  4. Verify data integrity: Do values make sense? Any obvious data entry errors?
  5. Identify relationships: Correlations, patterns, anomalies
  6. Form hypothesis: What might explain the patterns I see?
  7. Deep dive: Answer the business question with targeted analysis
2. What is the difference between GROUP BY and WINDOW functions?

GROUP BY: Collapses rows. One row per group. Can't access individual row values.

WINDOW functions: Don't collapse rows. Keep individual rows while computing aggregate. Can mix individual and aggregate values.

Example:

  • GROUP BY: "Show me total revenue per category" → 3 rows (one per category)
  • WINDOW: "Show me each order with running total revenue per category" → All original rows, plus running total column
3. Write SQL to find the second highest salary per department
WITH ranked_salaries AS (
  SELECT
    department,
    employee_name,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) as salary_rank
  FROM employees
)
SELECT department, employee_name, salary
FROM ranked_salaries
WHERE salary_rank = 2;

Key points: Use RANK (handles ties), PARTITION BY department, ORDER BY salary DESC

4. How do you handle missing data?

Approach: Understand why it's missing, then decide

  1. Identify missing data type:
    • MCAR (Missing Completely At Random): Safe to ignore or impute
    • MAR (Missing At Random): Depends on other variables. May introduce bias
    • MNAR (Missing Not At Random): Systematic. Likely biased if ignored
  2. Options:
    • Remove rows: Simple, but loses data
    • Remove column: If >50% missing, likely not useful
    • Impute with mean/median: Assumes data is MCAR
    • Forward fill: For time series (carry last value forward)
    • Predictive imputation: Use other variables to predict missing value
    • Segment analysis: Analyze complete cases separately from incomplete
  3. Document your choice: Always note what you did and why
5. What is p-value and how do you interpret it?

Definition: Probability of observing our data (or more extreme) IF the null hypothesis is true.

Interpretation: p < 0.05 = reject null hypothesis = result is statistically significant

What p-value is NOT:

  • NOT the probability that H0 is true
  • NOT the probability that we made a mistake
  • NOT a measure of effect size (small p doesn't mean big effect)

Example: "p = 0.03 means there's a 3% chance of seeing a difference at least this large if the two groups truly have identical conversion rates."

Always report: p-value AND confidence interval AND effect size

6. How do you detect and handle outliers?

Detection methods:

  • IQR method: Outliers are < Q1 - 1.5×IQR or > Q3 + 1.5×IQR. Most common.
  • Z-score: Values with |z| > 3 are outliers (3 SD from mean)
  • Visual inspection: Box plots, scatter plots, histograms
  • Domain knowledge: "A $1 million order for a $50 product is clearly an error"

Handling options:

  • Remove: If data entry error
  • Cap: Set to max/min reasonable value
  • Transform: Log transform to reduce impact
  • Segment: Analyze separately (high-value customers vs typical)
  • Investigate: Why is this an outlier? Are there multiple reasons?

Critical: Never silently remove outliers. Always document.

7. Explain the difference between correlation and causation

Correlation: Two variables move together. -1 to +1.

Causation: Change in X directly causes change in Y.

They're NOT the same! Examples of correlation without causation:

  • Ice cream sales ↑ → Shark attacks ↑ (both caused by warm weather)
  • More firefighters → More fire damage (not causation; firefighters go to bigger fires)
  • Stock price ↑ → Company performance ↑ (causation runs both ways)

To prove causation:

  • Randomized controlled experiment (gold standard)
  • Control for confounders (use regression, propensity score matching)
  • Establish time order (X happened before Y)
  • Understand mechanism (why would X cause Y?)
8. How do you design an A/B test?

Step-by-step:

  1. Define hypothesis: "Changing button color from blue to red will increase click rate by 5%"
  2. Choose metric: What will you measure? (CTR, conversion, revenue)
  3. Calculate sample size: Use power calculator (typically 80% power, 95% confidence). Don't just guess.
  4. Run duration: 1-2 weeks minimum (account for day-of-week effects). Never stop early "because we see winner"—you'll be wrong
  5. Randomization: 50/50 split, but MUST be random and persistent (same user always sees same variant)
  6. Statistical testing: At end, run appropriate test (t-test for continuous, chi-squared for proportions)
  7. Report results: p-value, confidence interval, effect size. Not just "winner/loser"

Common mistakes: Under-powering, stopping early, peeking during test, running too many tests (multiple testing problem)
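Step 3's sample-size calculation can be sketched with the standard two-proportion formula (scipy for the z-values; the baseline rate, lift, and thresholds below are hypothetical):

```python
from scipy.stats import norm

# Detect a lift from 5.0% to 5.5% conversion, alpha=0.05 (two-sided), power=0.80
p1, p2 = 0.05, 0.055
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96
z_beta = norm.ppf(power)            # ~0.84

# Approximate n per group for a two-proportion z-test
p_bar = (p1 + p2) / 2
n = ((z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar)) / (p1 - p2) ** 2
print(f"~{int(n):,} users per group")
```

Small lifts on small baselines need surprisingly large samples, which is why "just guessing" the sample size under-powers most tests.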

9. What is Type I vs Type II error?

Type I (False Positive): Reject H0 when H0 is true. "We think variant is better but it's actually same."

  • Probability: α = 0.05
  • Risk: Launch change that doesn't help (wasted effort)

Type II (False Negative): Fail to reject H0 when H0 is false. "We think variants are same but variant is actually better."

  • Probability: β = 0.20
  • Power: 1 - β = 0.80 (ability to detect true effect)
  • Risk: Miss real improvement (opportunity cost)

Trade-off: Can't minimize both simultaneously. Usually prefer Type I control (α = 0.05), allow higher Type II (β = 0.20)

10. How would you analyze a 20% drop in conversion rate?

Systematic investigation:

  1. Verify it's real: Check for data pipeline errors, DB issues, data quality problems. Did something break?
  2. Segment the drop: Which user segments? Which geographies? Which device types? Which page?
  3. Identify what changed: Code deploy? Traffic source change? Site outage? Competitor launch?
  4. Look for correlation: Did anything else change around same time?
  5. Check external factors: Market change? Seasonality? Competitor action?
  6. Mitigation: If it's a bug, revert. If it's competitive, differentiate. If it's traffic source, redirect to secondary.
  7. Communication: Update stakeholders daily with findings and ETA to fix

Key: Don't assume root cause. Investigate systematically.

11-30. Additional Interview Questions

11. What is the difference between HAVING and WHERE? WHERE filters rows BEFORE grouping. HAVING filters AFTER grouping. Use HAVING when filtering on aggregate results.

12. How do you communicate findings to non-technical stakeholders? Use business language, not statistical jargon. Lead with insight. Show impact in dollars, not percentages. Use visuals. Anticipate questions.

13. What is cohort analysis and when do you use it? Group users by signup date, track behavior over time. Detects if product is improving. Reveals retention curves.

14. How do you validate your analysis? Cross-check with other data sources. Spot-check outliers. Test conclusions with different segments. Ask domain experts to validate.

15. Tell me about a time you found an insight that drove business impact. Pick real example. Show your process (what question, how you analyzed, what you found, what action taken, what was impact).

16. What is funnel analysis? Track users through stages (awareness → interest → purchase). Identify biggest drop-offs. Prioritize optimization on biggest leaks.

17. How do you calculate LTV? LTV = Average Revenue Per User / Churn Rate. Or: ARPU × (1 / Monthly Churn Rate). Critical for SaaS profitability.

18. What are attribution models? Ways to credit channels for conversions. First-touch, last-touch, linear, time-decay, data-driven. Each has trade-offs.

19. How do you detect data quality issues? Check for: duplicates, nulls, outliers, inconsistent formatting, logical errors (e.g., end_date before start_date), negative values where impossible

20. SQL optimization: how do you speed up slow queries? Add indexes on filter columns. Partition by date. Aggregate before joining. Avoid functions in WHERE. Use EXISTS instead of IN for large subqueries. Ask DBA for help.

21. How would you segment customers? RFM (Recency, Frequency, Monetary). Behavioral (usage patterns). Demographic. Geographic. Psychographic. Choose based on business need.

22. What is time series analysis? Analyzing data over time. Trend (direction), seasonality (patterns), outliers (anomalies). Forecasting uses historical patterns to predict future.

23. Python vs SQL for analysis? SQL: data extraction, large datasets, repeatable queries. Python: complex transforms, visualization, statistical tests, machine learning.

24. How do you approach a vague problem statement? Ask clarifying questions. What decision needs to be made? What data is available? What's the timeline? Who's the audience?

25. How do you handle imbalanced data? If 95% positive class: oversampling minority, undersampling majority, synthetic data (SMOTE), class weights in models, change evaluation metric (use precision/recall, not accuracy)

26. What is Simpson's Paradox? Trend reverses when data is aggregated. Caused by unequal subgroup sizes. Always segment before concluding.
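A minimal numeric sketch of Simpson's Paradox, using the classic kidney-stone treatment numbers often cited for it:

```python
# (successes, total) per stone size, for two treatments
a = {"small": (81, 87), "large": (192, 263)}
b = {"small": (234, 270), "large": (55, 80)}

def rate(successes, total):
    return successes / total

# Treatment A wins in EVERY subgroup...
for group in ("small", "large"):
    print(group, rate(*a[group]) > rate(*b[group]))  # True, True

# ...but B wins overall, because A handled far more hard (large-stone) cases
a_all = rate(81 + 192, 87 + 263)   # 273/350 ~ 0.78
b_all = rate(234 + 55, 270 + 80)   # 289/350 ~ 0.83
print(a_all < b_all)               # True
```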

27. How do you measure customer satisfaction? NPS (Net Promoter Score), CSAT (satisfaction rating), CES (Customer Effort Score), qualitative feedback, retention (voting with feet)

28. How would you A/B test something that takes months to measure? Use leading indicator (early proxy). Example: Daily engagement predicts eventual retention. Test early indicator instead of final outcome.

29. What is survivor bias? Ignoring failures (only looking at survivors). Example: successful companies cut cost, but failed companies also cut cost (you only see winners). Use control group or look at full population.

30. How do you stay current with analytics best practices? Read blogs (Reforge, Towards Data Science), take courses, practice with real data, collaborate with peers, experiment with new tools

Common Analyst Mistakes

1. Correlation ≠ Causation

Finding correlation without establishing causal mechanism or randomized experiment

2. Ignoring Sample Bias

Analyzing a subset of data that doesn't represent population (e.g., only analyzing users who stayed)

3. Multiple Testing Without Correction

Run 20 A/B tests, one will be "significant" by chance. Use Bonferroni correction or multiple testing adjustment
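A quick sketch of why 20 tests at α = 0.05 almost guarantee a false "hit", and what the Bonferroni correction does about it (assuming independent tests):

```python
# Family-wise false-positive risk with 20 independent tests at alpha = 0.05
alpha, m = 0.05, 20
p_any_false_positive = 1 - (1 - alpha) ** m
print(round(p_any_false_positive, 2))   # 0.64 -> a spurious "win" is likely

# Bonferroni correction: test each hypothesis at alpha / m instead
bonferroni_alpha = alpha / m            # = 0.0025
print(bonferroni_alpha)
```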

4. Stopping Test Early

Declaring "our result looks significant" partway through the test instead of waiting the full planned duration. Peeking inflates false positives and leads to wrong conclusions.

5. P-hacking

Testing many hypotheses, thresholds, or subgroups until you find "significant" result. Then claiming victory. Use pre-registered hypothesis.

6. No Context

"Revenue is up 10%"—up vs what? Last month? Last year? Budget? Competitor?

7. Outliers Removed Without Justification

Silently removing data to make conclusion prettier. Always document and justify.