SQL, Statistics, Python, and Beyond
Joins combine data from multiple tables. The 4 main types handle different use cases.
INNER JOIN: Returns only rows where the key exists in BOTH tables.
SELECT u.user_id, u.name, o.order_id
FROM users u
INNER JOIN orders o
ON u.user_id = o.user_id;
-- Result: Only users who have orders
Use case: "Show me only users who made purchases"
LEFT JOIN: Returns ALL rows from the left table plus matches from the right table.
SELECT u.user_id, u.name, o.order_id
FROM users u
LEFT JOIN orders o
ON u.user_id = o.user_id;
-- Result: All users, with NULL for those without orders
Use case: "Show me all users, flag who hasn't ordered yet"
RIGHT JOIN: Returns ALL rows from the right table plus matches from the left table.
SELECT u.user_id, u.name, o.order_id
FROM users u
RIGHT JOIN orders o
ON u.user_id = o.user_id;
-- Result: All orders, with NULL for orders without user (data quality issue!)
Use case: Rare. Usually "find orphaned records"
FULL OUTER JOIN: Returns all rows from BOTH tables.
SELECT u.user_id, u.name, o.order_id
FROM users u
FULL OUTER JOIN orders o
ON u.user_id = o.user_id;
-- Result: All users + all orders (NULLs on both sides if no match)
Use case: "Find data quality issues: orphaned users or orders"
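The same four join behaviors can be sketched in pandas (toy tables assumed; `merge`'s `how` and `indicator` parameters mirror the SQL join types):

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cat']})
orders = pd.DataFrame({'order_id': [10, 11], 'user_id': [1, 4]})  # order 11 is orphaned

inner = users.merge(orders, on='user_id', how='inner')  # only users with orders
left = users.merge(orders, on='user_id', how='left')    # all users, NaN if no order
outer = users.merge(orders, on='user_id', how='outer', indicator=True)

# Data quality check: orders whose user_id has no matching user
orphans = outer[outer['_merge'] == 'right_only']
```

`indicator=True` adds a `_merge` column (`left_only`, `right_only`, `both`), which makes the orphan check a one-liner.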
GROUP BY aggregates rows into groups. Use aggregate functions like COUNT, SUM, AVG.
SELECT
DATE(order_date) as order_day,
COUNT(*) as num_orders,
SUM(amount) as total_revenue,
AVG(amount) as avg_order_value
FROM orders
WHERE amount > 10 -- WHERE filters BEFORE grouping
GROUP BY DATE(order_date)
HAVING COUNT(*) > 5 -- HAVING filters AFTER grouping
ORDER BY order_day DESC;
Key concept: WHERE runs before GROUP BY, HAVING runs after. Use HAVING to filter on aggregate results.
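The same before/after distinction can be mirrored in pandas (toy data assumed): filtering rows before the groupby plays the role of WHERE, and filtering the aggregated frame plays the role of HAVING:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_day': ['d1'] * 6 + ['d2'] * 3,
    'amount':    [20, 30, 15, 40, 25, 5, 50, 60, 8],
})

# WHERE: filter rows BEFORE grouping
filtered = orders[orders['amount'] > 10]

# GROUP BY with aggregates
daily = filtered.groupby('order_day').agg(
    num_orders=('amount', 'size'),
    total_revenue=('amount', 'sum'),
)

# HAVING: filter on aggregate results AFTER grouping
busy_days = daily[daily['num_orders'] > 2]
```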
Window functions operate on a "window" of rows while keeping individual rows intact.
SELECT
user_id,
amount,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY amount DESC) as rn,
RANK() OVER (PARTITION BY user_id ORDER BY amount DESC) as rnk
FROM orders;
ROW_NUMBER: Always unique (1, 2, 3, 4, 5)
RANK: Ties get same rank, skips next (1, 2, 2, 4, 5)
DENSE_RANK: Ties get same rank, no skip (1, 2, 2, 3, 4)
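A small pure-Python sketch makes the three tie behaviors concrete (values sorted descending, as the OVER (... ORDER BY amount DESC) clause would sort them):

```python
def ranks_desc(values):
    """Return (sorted values, ROW_NUMBER, RANK, DENSE_RANK) for a descending sort."""
    vals = sorted(values, reverse=True)
    row_num, rank, dense = [], [], []
    for i, v in enumerate(vals):
        row_num.append(i + 1)                 # ROW_NUMBER: always unique
        if i > 0 and v == vals[i - 1]:
            rank.append(rank[-1])             # tie keeps the same rank
            dense.append(dense[-1])
        else:
            rank.append(i + 1)                # RANK: skips after a tie
            dense.append(dense[-1] + 1 if dense else 1)  # DENSE_RANK: no skip
    return vals, row_num, rank, dense

vals, row_num, rank, dense = ranks_desc([90, 100, 90, 80, 70])
```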
SELECT
DATE(order_date) as order_date,
revenue,
LAG(revenue) OVER (ORDER BY DATE(order_date)) as prev_day_revenue,
LEAD(revenue) OVER (ORDER BY DATE(order_date)) as next_day_revenue,
revenue - LAG(revenue) OVER (ORDER BY DATE(order_date)) as daily_change
FROM daily_revenue
ORDER BY order_date;
Use case: Calculate day-over-day changes, churn detection (user was active yesterday but not today)
SELECT
DATE(order_date) as order_date,
revenue,
SUM(revenue) OVER (ORDER BY DATE(order_date) ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cumulative_revenue,
AVG(revenue) OVER (ORDER BY DATE(order_date) ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as moving_7day_avg
FROM daily_revenue
ORDER BY order_date;
ROWS BETWEEN: Controls window size. UNBOUNDED PRECEDING = from the start. 6 PRECEDING with CURRENT ROW = a 7-row window (the current row plus the 6 before it).
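The two frames can be sketched in plain Python to show exactly which rows each window covers (toy revenue series assumed):

```python
def cumulative_sum(series):
    """UNBOUNDED PRECEDING AND CURRENT ROW: running total from the start."""
    total, out = 0, []
    for v in series:
        total += v
        out.append(total)
    return out

def moving_avg(series, window=7):
    """(window-1) PRECEDING AND CURRENT ROW: average of the current row
    and up to window-1 rows before it (shorter at the start)."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

revenue = [100, 200, 300, 400]
```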
SELECT user_id, revenue
FROM (
SELECT
user_id,
SUM(amount) as revenue
FROM orders
GROUP BY user_id
) as user_revenue
WHERE revenue > 100;
Subquery pro: inline, obvious what data is used
Subquery con: gets messy with multiple nesting levels
WITH user_revenue AS (
SELECT
user_id,
SUM(amount) as revenue
FROM orders
GROUP BY user_id
)
SELECT user_id, revenue
FROM user_revenue
WHERE revenue > 100;
CTE pro: readable, reusable, easy to test
CTE con: slight performance overhead in some engines
-- Truncate to period (PostgreSQL)
DATE_TRUNC('month', order_date) -- e.g. 2024-01-15 → 2024-01-01
-- Calculate difference (days between dates)
SELECT DATEDIFF(day, start_date, end_date) -- SQL Server
SELECT DATE(end_date) - DATE(start_date) -- PostgreSQL
-- Add days to date
DATE_ADD(order_date, INTERVAL 7 DAY) -- MySQL
order_date + INTERVAL '7 day' -- PostgreSQL
-- Extract year/month/day
EXTRACT(YEAR FROM order_date)
-- Convert to date
TO_DATE(order_date_str, 'YYYY-MM-DD')
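The same operations in Python's standard library, for when the work moves out of SQL (the variable names here are illustrative):

```python
from datetime import date, timedelta

order_date = date(2024, 1, 15)

month_start = order_date.replace(day=1)            # DATE_TRUNC('month', ...)
diff_days = (date(2024, 3, 1) - order_date).days   # DATEDIFF / date subtraction
plus_week = order_date + timedelta(days=7)         # + INTERVAL '7 day'
year = order_date.year                             # EXTRACT(YEAR FROM ...)
parsed = date.fromisoformat('2024-01-15')          # TO_DATE(..., 'YYYY-MM-DD')
```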
SELECT
UPPER(name),                        -- 'john' → 'JOHN'
LOWER(name),                        -- 'JOHN' → 'john'
TRIM(email),                        -- remove leading/trailing spaces
CONCAT(first_name, ' ', last_name), -- combine strings
SUBSTR(email, 1, 5),                -- first 5 characters
SPLIT(tags, ','),                   -- 'a,b,c' → array (dialect-specific: SPLIT in BigQuery, STRING_TO_ARRAY in PostgreSQL)
REPLACE(email, '@old.com', '@new.com')
FROM users;
-- IS NULL checks
WHERE phone_number IS NULL -- Correct
WHERE phone_number = NULL -- WRONG! Always false
-- COALESCE: return first non-null value
COALESCE(phone_number, email, 'no contact')
-- NULLIF: return null if two values equal, else return first
NULLIF(actual_revenue, expected_revenue) -- NULL if equal
-- Filter out nulls in aggregation
COUNT(DISTINCT customer_id) -- Automatically excludes NULLs
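COALESCE and NULLIF have simple Python analogues, treating None as NULL (a minimal sketch):

```python
def coalesce(*values):
    """Return the first non-None value, like SQL COALESCE."""
    return next((v for v in values if v is not None), None)

def nullif(a, b):
    """Return None if the two values are equal, else the first value, like SQL NULLIF."""
    return None if a == b else a
```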
WITH monthly_purchases AS (
SELECT
user_id,
DATE_TRUNC('month', order_date) as month,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY DATE_TRUNC('month', order_date)) as purchase_seq
FROM orders
GROUP BY user_id, DATE_TRUNC('month', order_date)
),
consecutive AS (
SELECT
user_id,
month,
month - (purchase_seq || ' months')::interval as cohort
FROM monthly_purchases
)
SELECT DISTINCT user_id
FROM consecutive
GROUP BY user_id, cohort
HAVING COUNT(*) >= 2;
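The gaps-and-islands trick above (month minus row number groups consecutive months into one cohort) can be cross-checked with a small Python version (illustrative helper; months encoded as year/month pairs):

```python
def has_consecutive_months(months, run_length=2):
    """True if the user purchased in at least run_length consecutive months."""
    # Encode each (year, month) as a single month index so Dec -> Jan still counts
    idx = sorted({y * 12 + m for y, m in months})
    streak = best = 1
    for prev, cur in zip(idx, idx[1:]):
        streak = streak + 1 if cur == prev + 1 else 1
        best = max(best, streak)
    return best >= run_length
```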
SELECT
order_date,
revenue,
AVG(revenue) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as rolling_7day_avg
FROM daily_revenue
ORDER BY order_date;
WITH ranked_salaries AS (
SELECT
department,
employee_name,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as salary_rank
FROM employees
)
SELECT department, employee_name, salary
FROM ranked_salaries
WHERE salary_rank = 2;
WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', order_date) as month,
SUM(amount) as revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
)
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) as prev_month_revenue,
ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY month)) / LAG(revenue) OVER (ORDER BY month), 2) as mom_growth_pct
FROM monthly_revenue
ORDER BY month;
WITH last_month AS (
SELECT DISTINCT user_id
FROM events
WHERE DATE_TRUNC('month', event_date) = DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
),
this_month AS (
SELECT DISTINCT user_id
FROM events
WHERE DATE_TRUNC('month', event_date) = DATE_TRUNC('month', CURRENT_DATE)
)
SELECT user_id
FROM last_month
WHERE user_id NOT IN (SELECT user_id FROM this_month);
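The same churn logic is a set difference once the two user lists are in Python (toy user IDs assumed):

```python
last_month_users = {'a', 'b', 'c'}
this_month_users = {'b', 'c', 'd'}

churned = last_month_users - this_month_users    # active last month, gone this month
new_users = this_month_users - last_month_users  # first seen this month
retained = last_month_users & this_month_users
```

Side note on the SQL version: NOT IN returns no rows at all if the subquery ever yields a NULL, so NOT EXISTS or an anti-join is the safer pattern.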
Avoid functions on indexed columns: WHERE YEAR(order_date) = 2024 won't use an index. Rewrite as WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'.
Pandas is the workhorse library for data manipulation in Python.
import pandas as pd
# Load data
df = pd.read_csv('data.csv')
# Explore
df.head()          # First 5 rows
df.tail()          # Last 5 rows
df.info()          # Data types, null counts
df.describe()      # Summary statistics
df.shape           # (rows, columns)
df.columns         # Column names
df.dtypes          # Data type of each column
# Check for nulls
df.isnull().sum()  # Count nulls per column
df.isnull().any()  # Any nulls in each column?
# Handle nulls
df.fillna(0)                           # Replace with a value
df.fillna(df.mean(numeric_only=True))  # Replace with column means
df.dropna()                            # Remove rows with any null
# Remove duplicates
df.drop_duplicates()
df.drop_duplicates(subset=['user_id'])
# Fix data types
df['order_date'] = pd.to_datetime(df['order_date'])
df['amount'] = pd.to_numeric(df['amount'])
# Fix inconsistent values
df['category'] = df['category'].str.lower().str.strip()
df.replace({'M': 1, 'F': 0})
# Boolean filtering
df[df['amount'] > 100]
df[(df['amount'] > 100) & (df['category'] == 'A')]
# Select columns
df[['user_id', 'amount']]
df.loc[:, ['user_id', 'amount']]
# Select by position
df.iloc[0:5, 0:3]  # First 5 rows, first 3 columns
# String matching (regex=False: treat the pattern literally)
df[df['email'].str.contains('@gmail.com', regex=False)]
df[df['name'].str.startswith('J')]
# Simple groupby
df.groupby('category')['amount'].sum()
# Multiple aggregations
df.groupby('category').agg({
    'amount': ['sum', 'mean', 'count'],
    'user_id': 'nunique'
})
# Pivot table
df.pivot_table(
    values='amount',
    index='user_id',
    columns='category',
    aggfunc='sum'
)
# Melt (wide to long)
pd.melt(df, id_vars=['user_id'], value_vars=['jan', 'feb'])
import numpy as np
# Array creation
arr = np.array([1, 2, 3, 4, 5])
np.zeros(5)            # Array of zeros
np.ones((2, 3))        # 2x3 array of ones
np.arange(0, 10, 2)    # 0, 2, 4, 6, 8
np.linspace(0, 1, 10)  # 10 evenly spaced values from 0 to 1
# Vectorized operations
arr * 2                # Multiply each element
arr + arr              # Element-wise addition
np.sqrt(arr)           # Square root of each element
# Statistics
np.mean(arr)
np.median(arr)
np.std(arr)            # Standard deviation
np.percentile(arr, 75) # 75th percentile
import matplotlib.pyplot as plt
# Line plot
plt.plot(x, y, label='Sales')
plt.xlabel('Date')
plt.ylabel('Revenue')
plt.title('Daily Revenue')
plt.legend()
plt.show()
# Bar chart
plt.bar(categories, values)
# Histogram
plt.hist(data, bins=30, edgecolor='black')
# Scatter plot
plt.scatter(x, y, alpha=0.5)
# Multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes[0, 0].plot(x1, y1)
axes[0, 1].bar(x2, y2)
import seaborn as sns
# Correlation heatmap
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
# Box plot (distribution by category)
sns.boxplot(x='category', y='amount', data=df)
# Violin plot
sns.violinplot(x='category', y='amount', data=df)
# Scatter with regression line
sns.regplot(x='x', y='y', data=df)
# Distribution plot
sns.histplot(data=df, x='amount', kde=True)
Bell curve. Symmetric. Mean = Median = Mode.
68-95-99.7 Rule: 68% of values fall within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3.
Example: Height, test scores, measurement errors
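The 68-95-99.7 figures follow directly from the normal CDF; a quick check with the standard library's error function:

```python
import math

def within_k_sigma(k):
    """P(|X - mu| <= k * sigma) for a normal distribution."""
    return math.erf(k / math.sqrt(2))

p1, p2, p3 = within_k_sigma(1), within_k_sigma(2), within_k_sigma(3)
# p1 ~ 0.683, p2 ~ 0.954, p3 ~ 0.997
```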
Right-skewed (positive): Long tail on right. Mean > Median.
Example: Income (few high earners pull mean up)
Left-skewed (negative): Long tail on left. Mean < Median.
Example: Test scores where most score well (ceiling effect)
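A tiny numeric example of the mean/median relationship under right skew (made-up income figures, in thousands):

```python
from statistics import mean, median

incomes = [30, 35, 40, 45, 50, 500]  # one high earner creates a right tail

avg = mean(incomes)     # pulled up by the outlier
mid = median(incomes)   # robust to the tail
```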
When: Comparing means of two groups, continuous data
Assumptions: Normal distribution, equal variance (unless Welch's t-test)
from scipy.stats import ttest_ind
# Compare two groups
control_group = [10, 12, 15, 18, 20]
treatment_group = [14, 16, 19, 22, 25]
t_stat, p_value = ttest_ind(control_group, treatment_group)
# If p_value < 0.05: groups are significantly different
When: Comparing proportions/frequencies, categorical data
Example: Does email subject line A or B get more clicks?
from scipy.stats import chi2_contingency
# Contingency table
contingency_table = [
    [100, 50],  # Subject A: 100 clicks, 50 no-clicks
    [95, 55],   # Subject B: 95 clicks, 55 no-clicks
]
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
# If p_value < 0.05: subjects perform differently
Setup: Control (A) vs Variant (B). Sample size determined upfront.
Key concept: Statistical significance ≠ practical significance
# Calculate confidence interval for A/B test
import numpy as np
control_conversion = 0.05   # 5%
variant_conversion = 0.055  # 5.5%
n = 10000                   # per group
# Standard error of each proportion, then of their difference
se_control = np.sqrt(control_conversion * (1 - control_conversion) / n)
se_variant = np.sqrt(variant_conversion * (1 - variant_conversion) / n)
se_diff = np.sqrt(se_control**2 + se_variant**2)
# 95% Confidence Interval
diff = variant_conversion - control_conversion
ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)
We reject H0 when H0 is actually true.
"We think the variant is better, but there's actually no difference"
Probability: α (alpha) = 0.05
Risk: Launch variant that doesn't actually help
We fail to reject H0 when H0 is actually false.
"We think the variants are the same, but the variant is actually better"
Probability: β (beta) = 0.20
Power: 1 - β = 0.80 (ability to detect true effect)
Risk: Miss a real improvement
To prove causation: Need randomized experiment, control for confounders, or time-ordered data with clear mechanism
Pearson Correlation: Linear relationship. Range: -1 (perfect negative) to +1 (perfect positive).
Spearman Correlation: Rank-based. Better for non-linear relationships and outliers.
Caution: Correlation = 0.8 looks strong, but R² = 0.64 (only 64% of variance explained)
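The Pearson/Spearman distinction can be demonstrated without any libraries: Spearman is just Pearson computed on ranks, so a monotonic but non-linear series gets Spearman = 1 while Pearson falls short (toy data):

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks (no tie handling needed for this toy data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]                   # monotonic but non-linear

pearson_r = pearson(x, y)                 # < 1: relationship isn't linear
spearman_r = pearson(ranks(x), ranks(y))  # 1.0: perfectly monotonic
```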
from sklearn.linear_model import LinearRegression
# Fit model: Revenue = intercept + coefficient × Marketing_Spend
model = LinearRegression()
model.fit(X, y)
print(f"Intercept: {model.intercept_}")  # Base revenue
print(f"Coefficient: {model.coef_[0]}")  # ROI: revenue per $1 spent
print(f"R²: {model.score(X, y)}")        # % of variance explained
Coefficient = 2.5: For every $1 increase in marketing spend, revenue increases by $2.50 (holding other factors constant)
R² = 0.75: 75% of revenue variance is explained by marketing spend. Other factors account for 25%.
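For a single predictor, the fitted coefficient is just cov(x, y)/var(x), which makes the interpretation easy to sanity-check by hand (exact linear toy data with an assumed ROI of 2.5):

```python
def fit_line(x, y):
    """Least-squares slope and intercept for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

spend = [10, 20, 30, 40]
revenue = [2.5 * s + 10 for s in spend]  # exact line: ROI 2.5, base revenue 10

coef, intercept = fit_line(spend, revenue)
```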
Use for: Comparing values across categories
Example: Revenue by product category
Pro tip: Sort bars by value (descending)
Use for: Trends over time
Example: Daily active users over 30 days
Pro tip: Don't truncate y-axis unless context requires it
Use for: Relationship between two variables
Example: Marketing spend vs Revenue
Pro tip: Add regression line if showing correlation
Use for: Distribution of single variable
Example: Distribution of customer ages
Pro tip: Choose bin size carefully (too many/few distorts view)
Use for: Correlation matrix, or two categorical variables
Example: Correlation of metrics, user activity by hour/day
Pro tip: Color scale should be intuitive (red=bad, green=good)
Use for: Distribution + outliers across groups
Example: Order value by customer segment
Pro tip: Show all data points if N is small
Use for: Proportions only (max 5 slices)
Example: Market share by competitor
Pro tip: Bar chart is usually better. Don't use pie with >5 segments.
Use for: Three variables (x, y, bubble size)
Example: Revenue, profit, company size
Pro tip: Bubble size should be area, not radius (square root scale)
Awareness: Impressions, reach, brand searches
Interest: Click-through rate, time on site, pages per session
Consideration: Add to cart, wishlist adds, content downloads
Intent: Free trial signups, demo requests, pricing page views
Purchase: Conversion rate, AOV, transaction completion rate
Loyalty: Repeat purchase rate, NPS, customer lifetime value
Advocacy: Referral rate, social shares, reviews posted
Structured approach: state what each construct does to the row count, then give a concrete example.
GROUP BY: Collapses rows. One row per group. Can't access individual row values.
WINDOW functions: Don't collapse rows. Keep individual rows while computing aggregate. Can mix individual and aggregate values.
Example:
WITH ranked_salaries AS (
SELECT
department,
employee_name,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as salary_rank
FROM employees
)
SELECT department, employee_name, salary
FROM ranked_salaries
WHERE salary_rank = 2;
Key points: Use RANK (handles ties), PARTITION BY department, ORDER BY salary DESC
Approach: first understand why the data is missing (random vs systematic), then decide how to handle it (drop, impute, or flag).
Definition: Probability of observing our data (or more extreme) IF the null hypothesis is true.
Interpretation: p < 0.05 = reject null hypothesis = result is statistically significant
What a p-value is NOT: the probability that the null hypothesis is true, the probability the result is due to chance, or a measure of effect size.
Example: "p = 0.03 means there's a 3% chance of seeing this data if the two groups truly have identical conversion rates."
Always report: p-value AND confidence interval AND effect size
Detection methods: z-score (e.g. |z| > 3), IQR rule (outside Q1 − 1.5×IQR to Q3 + 1.5×IQR), and visual checks (box plots, scatter plots).
Handling options: keep (if legitimate), cap/winsorize, transform (e.g. log), remove with documentation, or analyze separately.
Critical: Never silently remove outliers. Always document.
Correlation: Two variables move together. -1 to +1.
Causation: Change in X directly causes change in Y.
They're NOT the same! Examples of correlation without causation: ice cream sales and drowning deaths (both rise in summer), or shoe size and reading ability in children (both driven by age).
To prove causation: run a randomized experiment, control for confounders, or show a time-ordered relationship with a clear mechanism.
Step-by-step: define the hypothesis and success metric, calculate the required sample size, randomize assignment, run for the planned duration, then analyze significance and effect size.
Common mistakes: Under-powering, stopping early, peeking during test, running too many tests (multiple testing problem)
Type I (False Positive): Reject H0 when H0 is true. "We think variant is better but it's actually same."
Type II (False Negative): Fail to reject H0 when H0 is false. "We think variants are same but variant is actually better."
Trade-off: Can't minimize both simultaneously. Usually prefer Type I control (α = 0.05), allow higher Type II (β = 0.20)
Systematic investigation: check the data pipeline first (is the change real?), then segment (which users, platforms, regions?), then timing (what launched or changed?), then external factors (seasonality, competitors).
Key: Don't assume root cause. Investigate systematically.
11. What is the difference between HAVING and WHERE? WHERE filters rows BEFORE grouping. HAVING filters AFTER grouping. Use HAVING when filtering on aggregate results.
12. How do you communicate findings to non-technical stakeholders? Use business language, not statistical jargon. Lead with insight. Show impact in dollars, not percentages. Use visuals. Anticipate questions.
13. What is cohort analysis and when do you use it? Group users by signup date, track behavior over time. Detects if product is improving. Reveals retention curves.
14. How do you validate your analysis? Cross-check with other data sources. Spot-check outliers. Test conclusions with different segments. Ask domain experts to validate.
15. Tell me about a time you found an insight that drove business impact. Pick real example. Show your process (what question, how you analyzed, what you found, what action taken, what was impact).
16. What is funnel analysis? Track users through stages (awareness → interest → purchase). Identify biggest drop-offs. Prioritize optimization on biggest leaks.
17. How do you calculate LTV? LTV = ARPU / Monthly Churn Rate, i.e. ARPU × expected lifetime, where expected lifetime = 1 / churn months. Critical for SaaS profitability.
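The LTV arithmetic in question 17, with made-up numbers (expected customer lifetime is 1/churn months):

```python
arpu = 50.0           # average revenue per user per month (assumed)
monthly_churn = 0.05  # 5% of customers cancel each month

expected_lifetime_months = 1 / monthly_churn  # 20 months
ltv = arpu * expected_lifetime_months         # same as arpu / monthly_churn
```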
18. What are attribution models? Ways to credit channels for conversions. First-touch, last-touch, linear, time-decay, data-driven. Each has trade-offs.
19. How do you detect data quality issues? Check for: duplicates, nulls, outliers, inconsistent formatting, logical errors (e.g., end_date before start_date), negative values where impossible
20. SQL optimization: how do you speed up slow queries? Add indexes on filter columns. Partition by date. Aggregate before joining. Avoid functions in WHERE. Use EXISTS instead of IN for large subqueries. Ask DBA for help.
21. How would you segment customers? RFM (Recency, Frequency, Monetary). Behavioral (usage patterns). Demographic. Geographic. Psychographic. Choose based on business need.
22. What is time series analysis? Analyzing data over time. Trend (direction), seasonality (patterns), outliers (anomalies). Forecasting uses historical patterns to predict future.
23. Python vs SQL for analysis? SQL: data extraction, large datasets, repeatable queries. Python: complex transforms, visualization, statistical tests, machine learning.
24. How do you approach a vague problem statement? Ask clarifying questions. What decision needs to be made? What data is available? What's the timeline? Who's the audience?
25. How do you handle imbalanced data? If 95% positive class: oversampling minority, undersampling majority, synthetic data (SMOTE), class weights in models, change evaluation metric (use precision/recall, not accuracy)
26. What is Simpson's Paradox? Trend reverses when data is aggregated. Caused by unequal subgroup sizes. Always segment before concluding.
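A numeric illustration of Simpson's Paradox (figures in the style of the classic kidney-stone study; treatment A wins in every subgroup but loses in aggregate because it received more of the hard cases):

```python
# (successes, total) per treatment and case difficulty
a_easy, a_hard = (81, 87), (192, 263)
b_easy, b_hard = (234, 270), (55, 80)

def rate(group):
    successes, total = group
    return successes / total

# Aggregating hides the unequal subgroup mix
a_all = (a_easy[0] + a_hard[0], a_easy[1] + a_hard[1])
b_all = (b_easy[0] + b_hard[0], b_easy[1] + b_hard[1])
```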
27. How do you measure customer satisfaction? NPS (Net Promoter Score), CSAT (satisfaction rating), CES (Customer Effort Score), qualitative feedback, retention (voting with feet)
28. How would you A/B test something that takes months to measure? Use leading indicator (early proxy). Example: Daily engagement predicts eventual retention. Test early indicator instead of final outcome.
29. What is survivor bias? Ignoring failures (only looking at survivors). Example: successful companies cut cost, but failed companies also cut cost (you only see winners). Use control group or look at full population.
30. How do you stay current with analytics best practices? Read blogs (Reforge, Towards Data Science), take courses, practice with real data, collaborate with peers, experiment with new tools
Correlation ≠ causation: finding a correlation without establishing a causal mechanism or running a randomized experiment.
Selection bias: analyzing a subset that doesn't represent the population (e.g., only analyzing users who stayed).
Multiple testing: run 20 A/B tests and one will be "significant" by chance. Use a Bonferroni correction or another multiple-testing adjustment.
Stopping early: declaring "our result looks significant" partway through instead of running the planned duration will lead to wrong conclusions.
P-hacking: testing many hypotheses, thresholds, or subgroups until something comes out "significant", then claiming victory. Use a pre-registered hypothesis.
No baseline: "Revenue is up 10%": up versus what? Last month? Last year? Budget? Competitors?
Cherry-picking: silently removing data to make the conclusion prettier. Always document and justify exclusions.
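The multiple-testing pitfall above is plain arithmetic: at α = 0.05, twenty independent tests of true nulls produce about one false positive on average, and Bonferroni shrinks the per-test threshold accordingly:

```python
alpha = 0.05
num_tests = 20

# Expected false positives if every null hypothesis is actually true
expected_false_positives = alpha * num_tests

# Bonferroni: require p < alpha / num_tests on each individual test
bonferroni_threshold = alpha / num_tests

# Chance of at least one false positive without any correction
p_any_false_positive = 1 - (1 - alpha) ** num_tests
```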