SQL, Statistics, Python, and Beyond
Joins combine data from multiple tables. The 4 main types handle different use cases.
INNER JOIN: Returns only rows where the key exists in BOTH tables.
SELECT u.user_id, u.name, o.order_id
FROM users u
INNER JOIN orders o
ON u.user_id = o.user_id;
-- Result: Only users who have orders
Use case: "Show me only users who made purchases"
LEFT JOIN: Returns ALL rows from the left table plus matches from the right table.
SELECT u.user_id, u.name, o.order_id
FROM users u
LEFT JOIN orders o
ON u.user_id = o.user_id;
-- Result: All users, with NULL for those without orders
Use case: "Show me all users, flag who hasn't ordered yet"
RIGHT JOIN: Returns ALL rows from the right table plus matches from the left table.
SELECT u.user_id, u.name, o.order_id
FROM users u
RIGHT JOIN orders o
ON u.user_id = o.user_id;
-- Result: All orders, with NULL for orders without user (data quality issue!)
Use case: Rare. Usually "find orphaned records"
FULL OUTER JOIN: Returns all rows from BOTH tables.
SELECT u.user_id, u.name, o.order_id
FROM users u
FULL OUTER JOIN orders o
ON u.user_id = o.user_id;
-- Result: All users + all orders (NULLs on both sides if no match)
Use case: "Find data quality issues: orphaned users or orders"
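The same four join behaviors can be sketched in pandas (toy tables assumed; `merge`'s `how` and `indicator` parameters mirror the SQL join types):

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cat']})
orders = pd.DataFrame({'order_id': [10, 11], 'user_id': [1, 4]})  # order 11 is orphaned

inner = users.merge(orders, on='user_id', how='inner')  # only users with orders
left = users.merge(orders, on='user_id', how='left')    # all users, NaN if no order
outer = users.merge(orders, on='user_id', how='outer', indicator=True)

# Data quality check: orders whose user_id has no matching user
orphans = outer[outer['_merge'] == 'right_only']
```

`indicator=True` adds a `_merge` column (`left_only`, `right_only`, `both`), which makes the orphan check a one-liner.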
GROUP BY aggregates rows into groups. Use aggregate functions like COUNT, SUM, AVG.
SELECT
DATE(order_date) as order_day,
COUNT(*) as num_orders,
SUM(amount) as total_revenue,
AVG(amount) as avg_order_value
FROM orders
WHERE amount > 10 -- WHERE filters BEFORE grouping
GROUP BY DATE(order_date)
HAVING COUNT(*) > 5 -- HAVING filters AFTER grouping
ORDER BY order_day DESC;
Key concept: WHERE runs before GROUP BY, HAVING runs after. Use HAVING to filter on aggregate results.
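The same before/after distinction can be mirrored in pandas (toy data assumed): filtering rows before the groupby plays the role of WHERE, and filtering the aggregated frame plays the role of HAVING:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_day': ['d1'] * 6 + ['d2'] * 3,
    'amount':    [20, 30, 15, 40, 25, 5, 50, 60, 8],
})

# WHERE: filter rows BEFORE grouping
filtered = orders[orders['amount'] > 10]

# GROUP BY with aggregates
daily = filtered.groupby('order_day').agg(
    num_orders=('amount', 'size'),
    total_revenue=('amount', 'sum'),
)

# HAVING: filter on aggregate results AFTER grouping
busy_days = daily[daily['num_orders'] > 2]
```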
Window functions operate on a "window" of rows while keeping individual rows intact.
SELECT
user_id,
amount,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY amount DESC) as rn,
RANK() OVER (PARTITION BY user_id ORDER BY amount DESC) as rnk
FROM orders;
ROW_NUMBER: Always unique (1, 2, 3, 4, 5)
RANK: Ties get same rank, skips next (1, 2, 2, 4, 5)
DENSE_RANK: Ties get same rank, no skip (1, 2, 2, 3, 4)
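A small pure-Python sketch makes the three tie behaviors concrete (values sorted descending, as the OVER (... ORDER BY amount DESC) clause would sort them):

```python
def ranks_desc(values):
    """Return (sorted values, ROW_NUMBER, RANK, DENSE_RANK) for a descending sort."""
    vals = sorted(values, reverse=True)
    row_num, rank, dense = [], [], []
    for i, v in enumerate(vals):
        row_num.append(i + 1)                 # ROW_NUMBER: always unique
        if i > 0 and v == vals[i - 1]:
            rank.append(rank[-1])             # tie keeps the same rank
            dense.append(dense[-1])
        else:
            rank.append(i + 1)                # RANK: skips after a tie
            dense.append(dense[-1] + 1 if dense else 1)  # DENSE_RANK: no skip
    return vals, row_num, rank, dense

vals, row_num, rank, dense = ranks_desc([90, 100, 90, 80, 70])
```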
SELECT
DATE(order_date) as order_date,
revenue,
LAG(revenue) OVER (ORDER BY DATE(order_date)) as prev_day_revenue,
LEAD(revenue) OVER (ORDER BY DATE(order_date)) as next_day_revenue,
revenue - LAG(revenue) OVER (ORDER BY DATE(order_date)) as daily_change
FROM daily_revenue
ORDER BY order_date;
Use case: Calculate day-over-day changes, churn detection (user was active yesterday but not today)
SELECT
DATE(order_date) as order_date,
revenue,
SUM(revenue) OVER (ORDER BY DATE(order_date) ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cumulative_revenue,
AVG(revenue) OVER (ORDER BY DATE(order_date) ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as moving_7day_avg
FROM daily_revenue
ORDER BY order_date;
ROWS BETWEEN: Controls window size. UNBOUNDED PRECEDING = from the start. 6 PRECEDING with CURRENT ROW = a 7-row window (the current row plus the 6 before it).
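The two frames can be sketched in plain Python to show exactly which rows each window covers (toy revenue series assumed):

```python
def cumulative_sum(series):
    """UNBOUNDED PRECEDING AND CURRENT ROW: running total from the start."""
    total, out = 0, []
    for v in series:
        total += v
        out.append(total)
    return out

def moving_avg(series, window=7):
    """(window-1) PRECEDING AND CURRENT ROW: average of the current row
    and up to window-1 rows before it (shorter at the start)."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

revenue = [100, 200, 300, 400]
```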
SELECT user_id, revenue
FROM (
SELECT
user_id,
SUM(amount) as revenue
FROM orders
GROUP BY user_id
) as user_revenue
WHERE revenue > 100;
Subquery pro: inline, obvious what data is used
Subquery con: gets messy with multiple nesting levels
WITH user_revenue AS (
SELECT
user_id,
SUM(amount) as revenue
FROM orders
GROUP BY user_id
)
SELECT user_id, revenue
FROM user_revenue
WHERE revenue > 100;
CTE pro: readable, reusable, easy to test
CTE con: slight performance overhead in some engines
-- Truncate to period (PostgreSQL)
DATE_TRUNC('month', order_date) -- e.g. 2024-01-15 → 2024-01-01
-- Calculate difference (days between dates)
SELECT DATEDIFF(day, start_date, end_date) -- SQL Server
SELECT DATE(end_date) - DATE(start_date) -- PostgreSQL
-- Add days to date
DATE_ADD(order_date, INTERVAL 7 DAY) -- MySQL
order_date + INTERVAL '7 day' -- PostgreSQL
-- Extract year/month/day
EXTRACT(YEAR FROM order_date)
-- Convert to date
TO_DATE(order_date_str, 'YYYY-MM-DD')
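The same operations in Python's standard library, for when the work moves out of SQL (the variable names here are illustrative):

```python
from datetime import date, timedelta

order_date = date(2024, 1, 15)

month_start = order_date.replace(day=1)            # DATE_TRUNC('month', ...)
diff_days = (date(2024, 3, 1) - order_date).days   # DATEDIFF / date subtraction
plus_week = order_date + timedelta(days=7)         # + INTERVAL '7 day'
year = order_date.year                             # EXTRACT(YEAR FROM ...)
parsed = date.fromisoformat('2024-01-15')          # TO_DATE(..., 'YYYY-MM-DD')
```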
SELECT
UPPER(name),                        -- 'john' → 'JOHN'
LOWER(name),                        -- 'JOHN' → 'john'
TRIM(email),                        -- remove leading/trailing spaces
CONCAT(first_name, ' ', last_name), -- combine strings
SUBSTR(email, 1, 5),                -- first 5 characters
SPLIT(tags, ','),                   -- 'a,b,c' → array (dialect-specific: SPLIT in BigQuery, STRING_TO_ARRAY in PostgreSQL)
REPLACE(email, '@old.com', '@new.com')
FROM users;
-- IS NULL checks
WHERE phone_number IS NULL -- Correct
WHERE phone_number = NULL -- WRONG! Always false
-- COALESCE: return first non-null value
COALESCE(phone_number, email, 'no contact')
-- NULLIF: return null if two values equal, else return first
NULLIF(actual_revenue, expected_revenue) -- NULL if equal
-- Filter out nulls in aggregation
COUNT(DISTINCT customer_id) -- Automatically excludes NULLs
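COALESCE and NULLIF have simple Python analogues, treating None as NULL (a minimal sketch):

```python
def coalesce(*values):
    """Return the first non-None value, like SQL COALESCE."""
    return next((v for v in values if v is not None), None)

def nullif(a, b):
    """Return None if the two values are equal, else the first value, like SQL NULLIF."""
    return None if a == b else a
```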
WITH monthly_purchases AS (
SELECT
user_id,
DATE_TRUNC('month', order_date) as month,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY DATE_TRUNC('month', order_date)) as purchase_seq
FROM orders
GROUP BY user_id, DATE_TRUNC('month', order_date)
),
consecutive AS (
SELECT
user_id,
month,
month - (purchase_seq || ' months')::interval as cohort
FROM monthly_purchases
)
SELECT DISTINCT user_id
FROM consecutive
GROUP BY user_id, cohort
HAVING COUNT(*) >= 2;
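The gaps-and-islands trick above (month minus row number groups consecutive months into one cohort) can be cross-checked with a small Python version (illustrative helper; months encoded as year/month pairs):

```python
def has_consecutive_months(months, run_length=2):
    """True if the user purchased in at least run_length consecutive months."""
    # Encode each (year, month) as a single month index so Dec -> Jan still counts
    idx = sorted({y * 12 + m for y, m in months})
    streak = best = 1
    for prev, cur in zip(idx, idx[1:]):
        streak = streak + 1 if cur == prev + 1 else 1
        best = max(best, streak)
    return best >= run_length
```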
SELECT
order_date,
revenue,
AVG(revenue) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as rolling_7day_avg
FROM daily_revenue
ORDER BY order_date;
WITH ranked_salaries AS (
SELECT
department,
employee_name,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as salary_rank
FROM employees
)
SELECT department, employee_name, salary
FROM ranked_salaries
WHERE salary_rank = 2;
WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', order_date) as month,
SUM(amount) as revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
)
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) as prev_month_revenue,
ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY month)) / LAG(revenue) OVER (ORDER BY month), 2) as mom_growth_pct
FROM monthly_revenue
ORDER BY month;
WITH last_month AS (
SELECT DISTINCT user_id
FROM events
WHERE DATE_TRUNC('month', event_date) = DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
),
this_month AS (
SELECT DISTINCT user_id
FROM events
WHERE DATE_TRUNC('month', event_date) = DATE_TRUNC('month', CURRENT_DATE)
)
SELECT user_id
FROM last_month
WHERE user_id NOT IN (SELECT user_id FROM this_month);
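The same churn logic is a set difference once the two user lists are in Python (toy user IDs assumed):

```python
last_month_users = {'a', 'b', 'c'}
this_month_users = {'b', 'c', 'd'}

churned = last_month_users - this_month_users    # active last month, gone this month
new_users = this_month_users - last_month_users  # first seen this month
retained = last_month_users & this_month_users
```

Side note on the SQL version: NOT IN returns no rows at all if the subquery ever yields a NULL, so NOT EXISTS or an anti-join is the safer pattern.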
Avoid functions on indexed columns: WHERE YEAR(order_date) = 2024 won't use an index. Rewrite as WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'.
Pandas is the workhorse library for data manipulation in Python.
import pandas as pd
# Load data
df = pd.read_csv('data.csv')
# Explore
df.head()          # First 5 rows
df.tail()          # Last 5 rows
df.info()          # Data types, null counts
df.describe()      # Summary statistics
df.shape           # (rows, columns)
df.columns         # Column names
df.dtypes          # Data type of each column
# Check for nulls
df.isnull().sum()  # Count nulls per column
df.isnull().any()  # Any nulls in each column?
# Handle nulls
df.fillna(0)                           # Replace with a value
df.fillna(df.mean(numeric_only=True))  # Replace with column means
df.dropna()                            # Remove rows with any null
# Remove duplicates
df.drop_duplicates()
df.drop_duplicates(subset=['user_id'])
# Fix data types
df['order_date'] = pd.to_datetime(df['order_date'])
df['amount'] = pd.to_numeric(df['amount'])
# Fix inconsistent values
df['category'] = df['category'].str.lower().str.strip()
df.replace({'M': 1, 'F': 0})
# Boolean filtering
df[df['amount'] > 100]
df[(df['amount'] > 100) & (df['category'] == 'A')]
# Select columns
df[['user_id', 'amount']]
df.loc[:, ['user_id', 'amount']]
# Select by position
df.iloc[0:5, 0:3]  # First 5 rows, first 3 columns
# String matching (regex=False: treat the pattern literally)
df[df['email'].str.contains('@gmail.com', regex=False)]
df[df['name'].str.startswith('J')]
# Simple groupby
df.groupby('category')['amount'].sum()
# Multiple aggregations
df.groupby('category').agg({
    'amount': ['sum', 'mean', 'count'],
    'user_id': 'nunique'
})
# Pivot table
df.pivot_table(
    values='amount',
    index='user_id',
    columns='category',
    aggfunc='sum'
)
# Melt (wide to long)
pd.melt(df, id_vars=['user_id'], value_vars=['jan', 'feb'])
import numpy as np
# Array creation
arr = np.array([1, 2, 3, 4, 5])
np.zeros(5)            # Array of zeros
np.ones((2, 3))        # 2x3 array of ones
np.arange(0, 10, 2)    # 0, 2, 4, 6, 8
np.linspace(0, 1, 10)  # 10 evenly spaced values from 0 to 1
# Vectorized operations
arr * 2                # Multiply each element
arr + arr              # Element-wise addition
np.sqrt(arr)           # Square root of each element
# Statistics
np.mean(arr)
np.median(arr)
np.std(arr)            # Standard deviation
np.percentile(arr, 75) # 75th percentile
import matplotlib.pyplot as plt
# Line plot
plt.plot(x, y, label='Sales')
plt.xlabel('Date')
plt.ylabel('Revenue')
plt.title('Daily Revenue')
plt.legend()
plt.show()
# Bar chart
plt.bar(categories, values)
# Histogram
plt.hist(data, bins=30, edgecolor='black')
# Scatter plot
plt.scatter(x, y, alpha=0.5)
# Multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes[0, 0].plot(x1, y1)
axes[0, 1].bar(x2, y2)
import seaborn as sns
# Correlation heatmap
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
# Box plot (distribution by category)
sns.boxplot(x='category', y='amount', data=df)
# Violin plot
sns.violinplot(x='category', y='amount', data=df)
# Scatter with regression line
sns.regplot(x='x', y='y', data=df)
# Distribution plot
sns.histplot(data=df, x='amount', kde=True)
Bell curve. Symmetric. Mean = Median = Mode.
68-95-99.7 Rule: 68% of values fall within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3.
Example: Height, test scores, measurement errors
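The 68-95-99.7 figures follow directly from the normal CDF; a quick check with the standard library's error function:

```python
import math

def within_k_sigma(k):
    """P(|X - mu| <= k * sigma) for a normal distribution."""
    return math.erf(k / math.sqrt(2))

p1, p2, p3 = within_k_sigma(1), within_k_sigma(2), within_k_sigma(3)
# p1 ~ 0.683, p2 ~ 0.954, p3 ~ 0.997
```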
Right-skewed (positive): Long tail on right. Mean > Median.
Example: Income (few high earners pull mean up)
Left-skewed (negative): Long tail on left. Mean < Median.
Example: Test scores where most score well (ceiling effect)
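A tiny numeric example of the mean/median relationship under right skew (made-up income figures, in thousands):

```python
from statistics import mean, median

incomes = [30, 35, 40, 45, 50, 500]  # one high earner creates a right tail

avg = mean(incomes)     # pulled up by the outlier
mid = median(incomes)   # robust to the tail
```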
When: Comparing means of two groups, continuous data
Assumptions: Normal distribution, equal variance (unless Welch's t-test)
from scipy.stats import ttest_ind
# Compare two groups
control_group = [10, 12, 15, 18, 20]
treatment_group = [14, 16, 19, 22, 25]
t_stat, p_value = ttest_ind(control_group, treatment_group)
# If p_value < 0.05: groups are significantly different
When: Comparing proportions/frequencies, categorical data
Example: Does email subject line A or B get more clicks?
from scipy.stats import chi2_contingency
# Contingency table
contingency_table = [
    [100, 50],  # Subject A: 100 clicks, 50 no-clicks
    [95, 55],   # Subject B: 95 clicks, 55 no-clicks
]
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
# If p_value < 0.05: subjects perform differently
Setup: Control (A) vs Variant (B). Sample size determined upfront.
Key concept: Statistical significance ≠ practical significance
# Calculate confidence interval for A/B test
import numpy as np
control_conversion = 0.05   # 5%
variant_conversion = 0.055  # 5.5%
n = 10000                   # per group
# Standard error of each proportion, then of their difference
se_control = np.sqrt(control_conversion * (1 - control_conversion) / n)
se_variant = np.sqrt(variant_conversion * (1 - variant_conversion) / n)
se_diff = np.sqrt(se_control**2 + se_variant**2)
# 95% Confidence Interval
diff = variant_conversion - control_conversion
ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)
We reject H0 when H0 is actually true.
"We think the variant is better, but there's actually no difference"
Probability: α (alpha) = 0.05
Risk: Launch variant that doesn't actually help
We fail to reject H0 when H0 is actually false.
"We think the variants are the same, but the variant is actually better"
Probability: β (beta) = 0.20
Power: 1 - β = 0.80 (ability to detect true effect)
Risk: Miss a real improvement
To prove causation: Need randomized experiment, control for confounders, or time-ordered data with clear mechanism
Pearson Correlation: Linear relationship. Range: -1 (perfect negative) to +1 (perfect positive).
Spearman Correlation: Rank-based. Better for non-linear relationships and outliers.
Caution: Correlation = 0.8 looks strong, but R² = 0.64 (only 64% of variance explained)
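The Pearson/Spearman distinction can be demonstrated without any libraries: Spearman is just Pearson computed on ranks, so a monotonic but non-linear series gets Spearman = 1 while Pearson falls short (toy data):

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks (no tie handling needed for this toy data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]                   # monotonic but non-linear

pearson_r = pearson(x, y)                 # < 1: relationship isn't linear
spearman_r = pearson(ranks(x), ranks(y))  # 1.0: perfectly monotonic
```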
from sklearn.linear_model import LinearRegression
# Fit model: Revenue = intercept + coefficient × Marketing_Spend
model = LinearRegression()
model.fit(X, y)
print(f"Intercept: {model.intercept_}")  # Base revenue
print(f"Coefficient: {model.coef_[0]}")  # ROI: revenue per $1 spent
print(f"R²: {model.score(X, y)}")        # % of variance explained
Coefficient = 2.5: For every $1 increase in marketing spend, revenue increases by $2.50 (holding other factors constant)
R² = 0.75: 75% of revenue variance is explained by marketing spend. Other factors account for 25%.
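For a single predictor, the fitted coefficient is just cov(x, y)/var(x), which makes the interpretation easy to sanity-check by hand (exact linear toy data with an assumed ROI of 2.5):

```python
def fit_line(x, y):
    """Least-squares slope and intercept for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

spend = [10, 20, 30, 40]
revenue = [2.5 * s + 10 for s in spend]  # exact line: ROI 2.5, base revenue 10

coef, intercept = fit_line(spend, revenue)
```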
Use for: Comparing values across categories
Example: Revenue by product category
Pro tip: Sort bars by value (descending)
Use for: Trends over time
Example: Daily active users over 30 days
Pro tip: Don't truncate y-axis unless context requires it
Use for: Relationship between two variables
Example: Marketing spend vs Revenue
Pro tip: Add regression line if showing correlation
Use for: Distribution of single variable
Example: Distribution of customer ages
Pro tip: Choose bin size carefully (too many/few distorts view)
Use for: Correlation matrix, or two categorical variables
Example: Correlation of metrics, user activity by hour/day
Pro tip: Color scale should be intuitive (red=bad, green=good)
Use for: Distribution + outliers across groups
Example: Order value by customer segment
Pro tip: Show all data points if N is small
Use for: Proportions only (max 5 slices)
Example: Market share by competitor
Pro tip: Bar chart is usually better. Don't use pie with >5 segments.
Use for: Three variables (x, y, bubble size)
Example: Revenue, profit, company size
Pro tip: Bubble size should be area, not radius (square root scale)
Awareness: Impressions, reach, brand searches
Interest: Click-through rate, time on site, pages per session
Consideration: Add to cart, wishlist adds, content downloads
Intent: Free trial signups, demo requests, pricing page views
Purchase: Conversion rate, AOV, transaction completion rate
Loyalty: Repeat purchase rate, NPS, customer lifetime value
Advocacy: Referral rate, social shares, reviews posted
Structured approach: state what each construct does to the row count, then give a concrete example.
GROUP BY: Collapses rows. One row per group. Can't access individual row values.
WINDOW functions: Don't collapse rows. Keep individual rows while computing aggregate. Can mix individual and aggregate values.
Example:
WITH ranked_salaries AS (
SELECT
department,
employee_name,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as salary_rank
FROM employees
)
SELECT department, employee_name, salary
FROM ranked_salaries
WHERE salary_rank = 2;
Key points: Use RANK (handles ties), PARTITION BY department, ORDER BY salary DESC
Approach: first understand why the data is missing (random vs systematic), then decide how to handle it (drop, impute, or flag).
Definition: Probability of observing our data (or more extreme) IF the null hypothesis is true.
Interpretation: p < 0.05 = reject null hypothesis = result is statistically significant
What a p-value is NOT: the probability that the null hypothesis is true, the probability the result is due to chance, or a measure of effect size.
Example: "p = 0.03 means there's a 3% chance of seeing this data if the two groups truly have identical conversion rates."
Always report: p-value AND confidence interval AND effect size
Detection methods: z-score (e.g. |z| > 3), IQR rule (outside Q1 − 1.5×IQR to Q3 + 1.5×IQR), and visual checks (box plots, scatter plots).
Handling options: keep (if legitimate), cap/winsorize, transform (e.g. log), remove with documentation, or analyze separately.
Critical: Never silently remove outliers. Always document.
Correlation: Two variables move together. -1 to +1.
Causation: Change in X directly causes change in Y.
They're NOT the same! Examples of correlation without causation: ice cream sales and drowning deaths (both rise in summer), or shoe size and reading ability in children (both driven by age).
To prove causation: run a randomized experiment, control for confounders, or show a time-ordered relationship with a clear mechanism.
Step-by-step: define the hypothesis and success metric, calculate the required sample size, randomize assignment, run for the planned duration, then analyze significance and effect size.
Common mistakes: Under-powering, stopping early, peeking during test, running too many tests (multiple testing problem)
Type I (False Positive): Reject H0 when H0 is true. "We think variant is better but it's actually same."
Type II (False Negative): Fail to reject H0 when H0 is false. "We think variants are same but variant is actually better."
Trade-off: Can't minimize both simultaneously. Usually prefer Type I control (α = 0.05), allow higher Type II (β = 0.20)
Systematic investigation: check the data pipeline first (is the change real?), then segment (which users, platforms, regions?), then timing (what launched or changed?), then external factors (seasonality, competitors).
Key: Don't assume root cause. Investigate systematically.
11. What is the difference between HAVING and WHERE? WHERE filters rows BEFORE grouping. HAVING filters AFTER grouping. Use HAVING when filtering on aggregate results.
12. How do you communicate findings to non-technical stakeholders? Use business language, not statistical jargon. Lead with insight. Show impact in dollars, not percentages. Use visuals. Anticipate questions.
13. What is cohort analysis and when do you use it? Group users by signup date, track behavior over time. Detects if product is improving. Reveals retention curves.
14. How do you validate your analysis? Cross-check with other data sources. Spot-check outliers. Test conclusions with different segments. Ask domain experts to validate.
15. Tell me about a time you found an insight that drove business impact. Pick real example. Show your process (what question, how you analyzed, what you found, what action taken, what was impact).
16. What is funnel analysis? Track users through stages (awareness → interest → purchase). Identify biggest drop-offs. Prioritize optimization on biggest leaks.
17. How do you calculate LTV? LTV = ARPU / Monthly Churn Rate, i.e. ARPU × expected lifetime, where expected lifetime = 1 / churn months. Critical for SaaS profitability.
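The LTV arithmetic in question 17, with made-up numbers (expected customer lifetime is 1/churn months):

```python
arpu = 50.0           # average revenue per user per month (assumed)
monthly_churn = 0.05  # 5% of customers cancel each month

expected_lifetime_months = 1 / monthly_churn  # 20 months
ltv = arpu * expected_lifetime_months         # same as arpu / monthly_churn
```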
18. What are attribution models? Ways to credit channels for conversions. First-touch, last-touch, linear, time-decay, data-driven. Each has trade-offs.
19. How do you detect data quality issues? Check for: duplicates, nulls, outliers, inconsistent formatting, logical errors (e.g., end_date before start_date), negative values where impossible
20. SQL optimization: how do you speed up slow queries? Add indexes on filter columns. Partition by date. Aggregate before joining. Avoid functions in WHERE. Use EXISTS instead of IN for large subqueries. Ask DBA for help.
21. How would you segment customers? RFM (Recency, Frequency, Monetary). Behavioral (usage patterns). Demographic. Geographic. Psychographic. Choose based on business need.
22. What is time series analysis? Analyzing data over time. Trend (direction), seasonality (patterns), outliers (anomalies). Forecasting uses historical patterns to predict future.
23. Python vs SQL for analysis? SQL: data extraction, large datasets, repeatable queries. Python: complex transforms, visualization, statistical tests, machine learning.
24. How do you approach a vague problem statement? Ask clarifying questions. What decision needs to be made? What data is available? What's the timeline? Who's the audience?
25. How do you handle imbalanced data? If 95% positive class: oversampling minority, undersampling majority, synthetic data (SMOTE), class weights in models, change evaluation metric (use precision/recall, not accuracy)
26. What is Simpson's Paradox? Trend reverses when data is aggregated. Caused by unequal subgroup sizes. Always segment before concluding.
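A numeric illustration of Simpson's Paradox (figures in the style of the classic kidney-stone study; treatment A wins in every subgroup but loses in aggregate because it received more of the hard cases):

```python
# (successes, total) per treatment and case difficulty
a_easy, a_hard = (81, 87), (192, 263)
b_easy, b_hard = (234, 270), (55, 80)

def rate(group):
    successes, total = group
    return successes / total

# Aggregating hides the unequal subgroup mix
a_all = (a_easy[0] + a_hard[0], a_easy[1] + a_hard[1])
b_all = (b_easy[0] + b_hard[0], b_easy[1] + b_hard[1])
```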
27. How do you measure customer satisfaction? NPS (Net Promoter Score), CSAT (satisfaction rating), CES (Customer Effort Score), qualitative feedback, retention (voting with feet)
28. How would you A/B test something that takes months to measure? Use leading indicator (early proxy). Example: Daily engagement predicts eventual retention. Test early indicator instead of final outcome.
29. What is survivor bias? Ignoring failures (only looking at survivors). Example: successful companies cut cost, but failed companies also cut cost (you only see winners). Use control group or look at full population.
30. How do you stay current with analytics best practices? Read blogs (Reforge, Towards Data Science), take courses, practice with real data, collaborate with peers, experiment with new tools
Correlation ≠ causation: finding a correlation without establishing a causal mechanism or running a randomized experiment.
Selection bias: analyzing a subset that doesn't represent the population (e.g., only analyzing users who stayed).
Multiple testing: run 20 A/B tests and one will be "significant" by chance. Use a Bonferroni correction or another multiple-testing adjustment.
Stopping early: declaring "our result looks significant" partway through instead of running the planned duration will lead to wrong conclusions.
P-hacking: testing many hypotheses, thresholds, or subgroups until something comes out "significant", then claiming victory. Use a pre-registered hypothesis.
No baseline: "Revenue is up 10%": up versus what? Last month? Last year? Budget? Competitors?
Cherry-picking: silently removing data to make the conclusion prettier. Always document and justify exclusions.
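The multiple-testing pitfall above is plain arithmetic: at α = 0.05, twenty independent tests of true nulls produce about one false positive on average, and Bonferroni shrinks the per-test threshold accordingly:

```python
alpha = 0.05
num_tests = 20

# Expected false positives if every null hypothesis is actually true
expected_false_positives = alpha * num_tests

# Bonferroni: require p < alpha / num_tests on each individual test
bonferroni_threshold = alpha / num_tests

# Chance of at least one false positive without any correction
p_any_false_positive = 1 - (1 - alpha) ** num_tests
```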