πŸ€– Career Cheat Sheet

AI / ML Engineer

Everything you need to crack any ML interview β€” concepts, code, Q&A, system design

$110K–$210K — US Salary Range
30+ — Interview Q&As
50+ — Code Snippets
11 — ML Algorithms

πŸ“‹ Table of Contents

  1. Quick Reference Card
  2. ML Fundamentals
  3. All ML Algorithms
  4. Deep Learning
  5. NLP & LLMs
  6. MLOps
  7. Code Snippets
  8. Tools & Stack
  9. Top 30 Interview Q&As
  10. System Design Patterns
  11. Mistakes to Avoid
  12. Glossary
⚑
Quick Reference Card
Concept | Formula / Key Fact | When to Use
------- | ------------------ | -----------
Bias-Variance Tradeoff | Error = Bias² + Variance + Irreducible | Model selection, regularization decisions
Learning Rate | Typical: 0.001–0.01. Too high → diverge, too low → slow | All gradient-based training
Precision | TP / (TP + FP) | Spam detection, fraud — minimize false positives
Recall | TP / (TP + FN) | Cancer detection — minimize false negatives
F1 Score | 2 × (P × R) / (P + R) | Imbalanced datasets
ROC-AUC | Area under ROC curve. 0.5 = random, 1.0 = perfect | Binary classification evaluation
RMSE | √(Σ(yᵢ - ŷᵢ)² / n) | Regression — sensitive to outliers
L1 (Lasso) | Loss + λ‖w‖₁ | Feature selection — drives weights to 0
L2 (Ridge) | Loss + λ‖w‖₂² | Prevent overfitting, keeps all features
Dropout | Randomly zero out p% of neurons during training | Neural network regularization
Batch Norm | Normalize layer inputs: μ=0, σ=1 per mini-batch | Deep networks — stabilizes training
Adam Optimizer | Combines momentum + RMSprop. β₁=0.9, β₂=0.999 | Default optimizer — works well in most cases
Cross-Entropy Loss | -Σ yᵢ log(ŷᵢ) | Classification tasks
Softmax | e^(xᵢ) / Σⱼ e^(xⱼ) | Multi-class output — probabilities sum to 1
k-Fold CV | Split data into k parts; train on k-1, test on 1; repeat | Model evaluation, hyperparameter tuning
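As a quick sanity check, the precision/recall/F1 rows above can be computed directly from confusion-matrix counts (a minimal sketch; the helper name is our own):

```python
def classification_metrics(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)          # of predicted positives, how many are right
    recall = tp / (tp + fn)             # of actual positives, how many were caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
    return precision, recall, f1

# Example: 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = classification_metrics(8, 2, 4)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.667 0.727
```

Note that F1 simplifies to 2·TP / (2·TP + FP + FN), which is why it ignores true negatives entirely.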
πŸ’‘ Pro Tip
When in doubt about evaluation metric: imbalanced classes β†’ F1/AUC; cost-sensitive β†’ use custom loss; ranking β†’ NDCG; regression β†’ RMSE/MAE depending on outlier tolerance.
🧠
ML Fundamentals
Types of Machine Learning
Supervised
  • Labeled training data
  • Classification: predict category
  • Regression: predict number
  • Examples: spam detection, house prices
Unsupervised
  • No labels β€” find hidden patterns
  • Clustering (K-Means, DBSCAN)
  • Dimensionality reduction (PCA)
  • Examples: customer segmentation
Reinforcement
  • Agent learns via reward/penalty
  • Policy optimization (PPO, DQN)
  • Examples: game playing, robotics
  • Key: exploration vs exploitation
Bias-Variance Tradeoff
Total Error = BiasΒ² + Variance + Irreducible Noise

High Bias (Underfitting)

  • Model too simple β€” can't capture patterns
  • High training error AND high test error
  • Fix: More complex model, add features, reduce regularization
  • Example: Linear model on non-linear data

High Variance (Overfitting)

  • Model memorizes training data
  • Low training error, HIGH test error
  • Fix: More data, regularization, simpler model, dropout
  • Example: Deep tree with no max_depth
πŸ”‘ Key Insight
The sweet spot: model complex enough to learn patterns, but not so complex it memorizes noise. Cross-validation helps find this point.
Gradient Descent Variants
Variant | Update Rule | Pros | Cons
------- | ----------- | ---- | ----
Batch GD | Use all data per step | Stable, smooth convergence | Slow for large datasets
Stochastic GD (SGD) | Use 1 sample per step | Fast, can escape local minima | Noisy updates, may not converge
Mini-Batch GD | Use batch of 32–256 per step | Balance of speed + stability | Requires tuning batch size
w = w - Ξ± Γ— βˆ‡L(w) where Ξ± = learning rate, βˆ‡L = gradient of loss
Regularization Techniques

L1 Regularization (Lasso)

Loss_total = Loss + Ξ» Γ— Ξ£|wα΅’|
  • Drives some weights to exactly 0 β†’ feature selection
  • Produces sparse models
  • Use when: you suspect many features are irrelevant

L2 Regularization (Ridge)

Loss_total = Loss + Ξ» Γ— Ξ£wα΅’Β²
  • Shrinks weights toward 0 but not exactly 0
  • Keeps all features, just smaller weights
  • Use when: all features likely relevant

ElasticNet

Loss + Ξ±(λ₁|w| + Ξ»β‚‚wΒ²)
  • Combines L1 + L2. Best of both worlds.

Dropout

  • During training: randomly set neurons to 0 with probability p
  • Forces network to learn redundant representations
  • Typical p: 0.1–0.5 depending on layer

Early Stopping

  • Monitor validation loss; stop when it starts increasing
  • Simple and very effective
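A toy sketch of why L1 zeroes weights while L2 only shrinks them: a single L1 proximal (soft-thresholding) step snaps small weights to exactly 0, while an L2 gradient step multiplies them by a factor below 1 (helper names are our own):

```python
def l1_prox(w, lam):
    """One L1 soft-thresholding step: weights with |w| <= lam snap to exactly 0."""
    if abs(w) <= lam:
        return 0.0
    return w - lam if w > 0 else w + lam

def l2_shrink(w, lam):
    """One gradient step on the lam * w^2 penalty: multiplicative shrink, never exactly 0."""
    return w * (1 - 2 * lam)

print(l1_prox(0.05, 0.1), l2_shrink(0.05, 0.1))  # L1 snaps 0.05 to 0.0; L2 only shrinks it
```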
Feature Engineering

Encoding Categorical Variables

  • Label Encoding: Maps categories to integers. Use for ordinal data (low/med/high)
  • One-Hot Encoding: Creates binary columns per category. Use for nominal data
  • Target Encoding: Replace category with mean of target. Watch for leakage!
  • Frequency Encoding: Replace with count/frequency
  • Embedding: Learn dense representations (high cardinality)

Scaling Numerical Features

  • StandardScaler: (x - ΞΌ) / Οƒ β†’ mean=0, std=1. Use for linear models, SVMs
  • MinMaxScaler: (x - min) / (max - min) β†’ [0,1]. Use for NNs, distance-based
  • RobustScaler: Uses median & IQR. Robust to outliers
  • Log transform: For right-skewed distributions (e.g., income)

Handling Missing Values

Strategy | When to Use
-------- | -----------
Mean/Median/Mode imputation | MCAR data, small proportion missing
KNN imputation | MAR data, when neighbors share values
Model-based imputation | Complex missingness patterns
Drop rows/columns | If > 70% missing or data is MNAR
Indicator column | Missingness itself is informative
Handling Imbalanced Datasets

Data-Level Methods

  • Oversampling minority: SMOTE (Synthetic Minority Oversampling) β€” creates synthetic samples in feature space
  • Undersampling majority: Random or informed (Tomek links, ENN)
  • Class weights: class_weight='balanced' in sklearn β€” adjusts loss contribution per class

Algorithm-Level Methods

  • Choose appropriate metric: F1, PR-AUC, not accuracy
  • Adjust decision threshold (default 0.5 may not be optimal)
  • Use cost-sensitive learning
  • Ensemble methods (balanced random forest)
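Decision-threshold tuning from the list above can be sketched as a simple sweep over a validation set (a toy example; names are illustrative):

```python
def best_f1_threshold(probs, labels):
    """Sweep candidate thresholds on validation data; return the one maximizing F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in [i / 100 for i in range(1, 100)]:
        tp = sum(p >= t and y == 1 for p, y in zip(probs, labels))
        fp = sum(p >= t and y == 0 for p, y in zip(probs, labels))
        fn = sum(p < t and y == 1 for p, y in zip(probs, labels))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

probs  = [0.95, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1]   # model scores on validation set
labels = [1,    1,   1,   0,   1,   0,   0]       # true classes
t, f1 = best_f1_threshold(probs, labels)
# Here a threshold well below the default 0.5 catches the 0.3-scored positive.
```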
πŸ“Š
All ML Algorithms β€” Deep Dive
πŸ“ˆ Linear Regression
Regression Supervised

How it works: Fits a line Ε· = Ξ²β‚€ + β₁x₁ + ... + Ξ²β‚™xβ‚™ that minimizes MSE (Ordinary Least Squares)

  • Assumptions: Linearity, independence, homoscedasticity, normality of residuals
  • Key hyperparams: regularization strength (Ξ»), fit_intercept
  • When to use: Continuous target, linear relationships, interpretability needed
  • Pros: Fast, interpretable, no hyperparameter tuning needed
  • Cons: Can't model non-linear patterns, sensitive to outliers
πŸ”΅ Logistic Regression
Classification Supervised

How it works: Applies sigmoid to linear output β†’ probability. Decision boundary where P=0.5

Οƒ(z) = 1 / (1 + e⁻ᢻ)
  • Loss: Binary cross-entropy
  • Key hyperparams: C (inverse regularization), penalty (L1/L2), solver
  • When to use: Binary classification, need probability outputs, baseline model
  • Pros: Probabilistic output, fast, interpretable coefficients
  • Cons: Assumes linear decision boundary, requires feature scaling
🌲 Decision Trees
Both tasks Supervised

How it works: Recursively splits data on features that maximize information gain (entropy) or minimize Gini impurity

Gini = 1 - Σ pᵢ²
Entropy = -Σ pᵢ log(pᵢ)
  • Key hyperparams: max_depth, min_samples_split, min_samples_leaf, max_features
  • Pros: Interpretable, no scaling needed, handles mixed types
  • Cons: Prone to overfitting, unstable (small data changes β†’ different tree)
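The split criteria above can be computed directly (a minimal sketch, using log base 2 for entropy, the common convention):

```python
from math import log2

def gini(ps):
    """Gini impurity: 1 - sum(p_i^2). 0 = pure node, 0.5 = 50/50 binary split."""
    return 1 - sum(p * p for p in ps)

def entropy(ps):
    """Entropy: -sum(p_i * log2(p_i)). 0 = pure, 1 bit for a 50/50 binary split."""
    return sum(-p * log2(p) for p in ps if p > 0)

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 1.0  (maximally impure)
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # 0.0 0.0  (pure node)
```

A split is chosen to maximize the drop in impurity from parent to (weighted) children.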
🌳 Random Forest
Ensemble Bagging

How it works: Trains N trees on bootstrap samples, random feature subsets. Final prediction = majority vote (classification) or mean (regression)

  • Key hyperparams: n_estimators, max_depth, max_features, min_samples_leaf
  • Feature importance: Mean decrease in impurity across all trees
  • Pros: Robust to outliers, no scaling, handles high dimensions, gives feature importance
  • Cons: Slow prediction, not interpretable, memory intensive
  • When to use: Tabular data, feature selection, robust baseline
⚑ XGBoost / LightGBM
Ensemble Boosting

How it works: Trains trees sequentially, each correcting previous errors. Uses gradient of loss to determine next tree direction.

  • XGBoost: Level-wise tree growth, built-in regularization (L1, L2)
  • LightGBM: Leaf-wise growth (faster), histogram-based, better for large data
  • Key hyperparams: n_estimators, learning_rate, max_depth, subsample, colsample_bytree, reg_alpha, reg_lambda
  • Pros: State-of-art on tabular data, handles missing values, built-in CV
  • Cons: Many hyperparameters, risk of overfitting
πŸ”΄ Support Vector Machine
Classification Supervised

How it works: Finds hyperplane that maximizes margin between classes. Kernel trick maps to higher dimensions.

  • Kernels: Linear, RBF (Gaussian), Polynomial, Sigmoid
  • C parameter: Low C = wide margin (more misclassifications tolerated), High C = narrow margin (fewer misclassifications tolerated)
  • Gamma (RBF): Low = smooth boundary, High = complex boundary
  • Pros: Works well in high dimensions, effective when n_features > n_samples
  • Cons: Slow on large datasets, requires feature scaling, kernel choice
πŸ”΅ K-Nearest Neighbors
Both tasks Instance-based

How it works: At prediction time, finds k closest training points (by distance), returns majority vote or mean

  • Distance metrics: Euclidean (default), Manhattan, Minkowski, Cosine
  • Key hyperparams: k (n_neighbors), distance metric, weights (uniform/distance)
  • Choosing k: Low k = overfits, high k = underfits. Use √n as starting point
  • Pros: Simple, no training, naturally multi-class
  • Cons: Slow prediction O(n), curse of dimensionality, requires scaling
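The prediction step can be sketched in plain Python (a toy example; `knn_predict` is our own name):

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    neighbors = sorted(zip(X_train, y_train), key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ['a', 'a', 'a', 'b', 'b', 'b']
print(knn_predict(X, y, (0.5, 0.5)))  # label of the nearby cluster
```

The sort over all training points is the O(n) prediction cost mentioned above; KD-trees or approximate nearest-neighbor indexes mitigate it in practice.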
β­• K-Means Clustering
Unsupervised Clustering

How it works: Initialize k centroids β†’ assign each point to nearest centroid β†’ recompute centroids β†’ repeat until convergence

  • Choosing k: Elbow method (inertia), Silhouette score
  • K-Means++: Smart initialization to avoid bad local minima
  • Pros: Fast, scales to large datasets, simple
  • Cons: Assumes spherical clusters, must specify k, sensitive to outliers/scaling
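The assign-then-recompute loop above (Lloyd's algorithm) can be sketched as follows (a toy 2-D example; names are illustrative):

```python
from math import dist

def kmeans(points, centroids, iters=10):
    """K-Means sketch: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its assigned points."""
    for _ in range(iters):
        clusters = {i: [] for i in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(coord) for coord in zip(*members)) if members else centroids[i]
            for i, members in clusters.items()
        ]
    return centroids

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers = kmeans(pts, [(0, 0), (10, 10)])  # converges to the two cluster means
```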
πŸŒ€ DBSCAN
Unsupervised Density-based

How it works: Groups points that are closely packed together; marks outliers as noise. Expands clusters from core points.

  • Core point: Has β‰₯ min_samples neighbors within Ξ΅ radius
  • Key hyperparams: eps (neighborhood radius), min_samples
  • Pros: No need to specify k, finds arbitrary shapes, identifies outliers
  • Cons: Struggles with varying densities, sensitive to eps/min_samples
πŸ“‰ PCA (Dimensionality Reduction)
Unsupervised Dimensionality Reduction

How it works: Finds orthogonal axes (principal components) that capture maximum variance. Projects data onto top k components.

  • Steps: Standardize β†’ covariance matrix β†’ eigendecomposition β†’ select top k eigenvectors
  • Explained variance ratio: How much variance each component captures. Pick k where cumulative β‰₯ 95%
  • Pros: Reduces noise, speeds up training, enables 2D/3D visualization
  • Cons: Loses interpretability, linear only (use UMAP/t-SNE for non-linear)
πŸ“§ Naive Bayes
Classification Probabilistic
P(y|X) ∝ P(y) Γ— Ξ  P(xα΅’|y)
  • Assumes: Features are conditionally independent given class (often violated, but works well)
  • Variants: GaussianNB (continuous), MultinomialNB (word counts), BernoulliNB (binary)
  • Pros: Very fast, works well for text classification, small data
  • Cons: Independence assumption, poor probability calibration
🧬
Deep Learning
Neural Network Architecture

Layer Types

  • Dense (Fully Connected): Every neuron connected to every neuron in next layer. y = WΒ·x + b
  • Convolutional (Conv2D): Applies learnable filters to detect local patterns (edges, shapes)
  • Recurrent (LSTM/GRU): Maintains hidden state across sequence steps
  • Attention / Transformer: Computes pairwise relationships between all positions
  • Embedding: Maps discrete tokens to dense vectors
  • BatchNorm: Normalizes activations per mini-batch β†’ stable training
  • Dropout: Random neuron zeroing β†’ regularization

Activation Functions

Function | Formula | Use When
-------- | ------- | --------
ReLU | max(0, x) | Hidden layers (default)
Leaky ReLU | max(0.01x, x) | Avoid dying-ReLU problem
GELU | x·Φ(x) | Transformers (BERT, GPT)
Sigmoid | 1/(1+e⁻ˣ) | Binary output layer
Softmax | e^(xᵢ)/Σⱼ e^(xⱼ) | Multi-class output
Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | RNNs, [-1,1] output
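ReLU and a numerically stable softmax can be sketched as toy helpers (subtracting the max logit before exponentiating avoids overflow without changing the result):

```python
from math import exp

def relu(x):
    """max(0, x): cheap, non-saturating for positive inputs."""
    return max(0.0, x)

def softmax(xs):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(xs)
    exps = [exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # sums to 1; the largest logit gets the largest probability
```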
CNNs β€” Convolutional Neural Networks

How convolutions work: A filter (kernel) slides over input, computing dot products at each position β†’ creates feature map showing where patterns appear. Multiple filters detect multiple features.

  • Padding: 'same' keeps spatial dimensions, 'valid' reduces them
  • Stride: How many pixels the filter moves. Stride 2 halves spatial dimensions
  • Max Pooling: Downsamples by taking max in each window β†’ translation invariance
  • Global Average Pooling: Collapses spatial dims to 1Γ—1 β†’ used before final dense layer
Architecture | Key Innovation | Used For
------------ | -------------- | --------
VGG-16/19 | Deep with small 3×3 filters | Classification baseline
ResNet | Skip connections (residual blocks) — solve vanishing gradient | Deep classification, transfer learning
EfficientNet | Compound scaling (depth + width + resolution) | Efficient high-accuracy classification
YOLO | Single-pass real-time object detection | Object detection
U-Net | Encoder-decoder with skip connections | Image segmentation
Transformers & Attention Mechanism

Self-Attention: Each token attends to every other token. Computes how much each word should "focus on" other words in context.

Attention(Q, K, V) = softmax(QKα΅€ / √dβ‚–) Γ— V
  • Q, K, V: Query, Key, Value β€” learned linear projections of input
  • Multi-Head Attention: Run attention h times in parallel with different projections β†’ capture different types of relationships
  • Positional Encoding: Since attention is order-agnostic, add position information via sinusoidal encoding
  • BERT: Bidirectional encoder. Pre-trained with Masked Language Model (MLM) + Next Sentence Prediction (NSP). Used for understanding tasks (classification, NER, QA)
  • GPT: Decoder-only, autoregressive. Pre-trained to predict next token. Used for generation tasks.
  • T5: Encoder-Decoder. Frames all NLP tasks as text-to-text.
Interview Gold
"Why does attention scale by √dβ‚–?" β†’ Without scaling, large dβ‚– makes dot products large β†’ softmax becomes very peaked β†’ gradients vanish. Dividing by √dβ‚– keeps variance stable.
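The attention formula can be traced with plain lists (a toy 2-token, 2-dimensional example; the helper names are our own):

```python
from math import exp, sqrt

def matmul(A, B):
    """Standard matrix product of nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_row(xs):
    m = max(xs)
    exps = [exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    KT = [list(col) for col in zip(*K)]                      # transpose K
    scores = [[s / sqrt(d_k) for s in row] for row in matmul(Q, KT)]
    weights = [softmax_row(row) for row in scores]           # rows sum to 1
    return matmul(weights, V)                                # weighted sum of values

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = attention(Q, K, V)  # each query attends most to its matching key
```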
Backpropagation β€” Step by Step
  1. Forward pass: Compute activations layer by layer, store intermediate values (cache)
  2. Compute loss: Compare prediction Ε· with true label y using loss function
  3. Backward pass: Use chain rule to compute gradient of loss w.r.t. each parameter: βˆ‚L/βˆ‚w = (βˆ‚L/βˆ‚z)(βˆ‚z/βˆ‚w)
  4. Update weights: w = w - Ξ± Γ— βˆ‚L/βˆ‚w
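The four steps above can be traced on a single neuron with squared loss, then checked against a numerical gradient (a toy sketch; names are illustrative):

```python
from math import exp

def sigmoid(z):
    return 1 / (1 + exp(-z))

def forward_backward(w, b, x, y):
    """One neuron, squared loss. Forward pass caches z and a; backward pass
    applies the chain rule: dL/dw = dL/da * da/dz * dz/dw."""
    z = w * x + b                 # forward pass
    a = sigmoid(z)
    loss = (a - y) ** 2
    dL_da = 2 * (a - y)           # backward pass, outermost derivative first
    da_dz = a * (1 - a)           # sigmoid derivative
    dL_dw = dL_da * da_dz * x     # dz/dw = x
    dL_db = dL_da * da_dz         # dz/db = 1
    return loss, dL_dw, dL_db

loss, gw, gb = forward_backward(w=0.5, b=0.0, x=2.0, y=1.0)

# Sanity check against a central-difference numerical gradient
eps = 1e-6
num_gw = (forward_backward(0.5 + eps, 0.0, 2.0, 1.0)[0]
          - forward_backward(0.5 - eps, 0.0, 2.0, 1.0)[0]) / (2 * eps)
```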
Vanishing Gradient Problem
In deep networks, gradients multiplied through many sigmoid/tanh layers shrink to near 0 β†’ early layers learn very slowly. Solutions: ReLU activation, batch normalization, skip connections (ResNets), gradient clipping.
πŸ“
NLP & Large Language Models
Text Preprocessing Pipeline
# Standard NLP preprocessing pipeline
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time setup: nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')

def preprocess(text):
    text = text.lower()                              # lowercase
    text = re.sub(r'[^a-z\s]', '', text)             # keep letters and whitespace only
    tokens = word_tokenize(text)                     # tokenize
    stop_words = set(stopwords.words('english'))     # build the set once, not per token
    tokens = [t for t in tokens if t not in stop_words]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return ' '.join(tokens)
Text Representations β€” Evolution
Method | How It Works | Captures Semantics? | Use Case
------ | ------------ | ------------------- | --------
Bag of Words | Word-count vector (no order) | No | Simple classification
TF-IDF | TF × log(N/df) — penalizes common words | No | Information retrieval, search
Word2Vec | CBOW or Skip-gram neural network; king - man + woman ≈ queen | Yes (local) | Word similarity, analogies
GloVe | Matrix factorization on co-occurrence statistics | Yes (global) | Word similarity at scale
BERT Embeddings | Contextual — same word gets different embeddings in different sentences | Yes (deep) | All modern NLP tasks
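The TF-IDF row above can be sketched in a few lines (a toy example; note how a word appearing in every document gets idf = log(N/df) = 0 and therefore no weight):

```python
from math import log

def tf_idf(docs):
    """Toy TF-IDF: tf = raw count in the doc, idf = log(N / df).
    Rare, discriminative words score higher than ubiquitous ones."""
    N = len(docs)
    vocab = {w for doc in docs for w in doc}
    df = {w: sum(w in doc for doc in docs) for w in vocab}   # document frequency
    return [
        {w: doc.count(w) * log(N / df[w]) for w in set(doc)}
        for doc in docs
    ]

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
scores = tf_idf(docs)
# "the" appears in all 3 docs -> idf = log(3/3) = 0 -> TF-IDF score 0 everywhere
```

Production implementations (e.g., sklearn's TfidfVectorizer) add smoothing and normalization on top of this basic formula.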
Fine-tuning LLMs with Hugging Face
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch

# Load pretrained model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize dataset
def tokenize(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)

# Training configuration
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,         # Low LR for fine-tuning!
    warmup_steps=500,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(model=model, args=training_args,
                  train_dataset=tokenized["train"], eval_dataset=tokenized["test"])
trainer.train()
πŸš€
MLOps
MLOps Lifecycle
Stage | Tools | Key Tasks
----- | ----- | ---------
Data Management | DVC, Delta Lake, Feast | Versioning data, feature store, data lineage
Experiment Tracking | MLflow, W&B, Neptune | Log params, metrics, artifacts; compare runs
Model Training | sklearn, PyTorch, TF, Ray | Distributed training, hyperparameter tuning
Model Registry | MLflow Registry, Vertex AI | Version models, stage (staging/prod), lineage
Serving/Deployment | FastAPI, TorchServe, KServe, SageMaker | REST/gRPC endpoints, batch inference, A/B
Monitoring | Evidently AI, Seldon, Prometheus | Data drift, model drift, concept-drift alerts
MLflow Experiment Tracking
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score

mlflow.set_experiment("fraud_detection_v2")

with mlflow.start_run(run_name="rf_baseline"):
    # Log hyperparameters
    params = {"n_estimators": 200, "max_depth": 8, "class_weight": "balanced"}
    mlflow.log_params(params)

    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # Log metrics
    mlflow.log_metric("train_f1", f1_score(y_train, model.predict(X_train)))
    mlflow.log_metric("val_f1",   f1_score(y_val,   model.predict(X_val)))
    mlflow.log_metric("val_auc",  roc_auc_score(y_val, model.predict_proba(X_val)[:,1]))

    # Save model to registry
    mlflow.sklearn.log_model(model, "model", registered_model_name="FraudDetector")
FastAPI Model Serving
from fastapi import FastAPI
from pydantic import BaseModel
import mlflow.pyfunc
import numpy as np

app = FastAPI()
model = mlflow.pyfunc.load_model("models:/FraudDetector/Production")

class PredictionRequest(BaseModel):
    amount: float
    merchant_category: str
    hour_of_day: int
    user_avg_spend: float

class PredictionResponse(BaseModel):
    fraud_probability: float
    is_fraud: bool

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    # NOTE: merchant_category would need the same encoding used at training time
    features = np.array([[request.amount, request.hour_of_day, request.user_avg_spend]])
    proba = float(model.predict(features)[0])   # cast numpy scalar for JSON serialization
    return PredictionResponse(fraud_probability=proba, is_fraud=proba > 0.5)

# Run: uvicorn app:app --host 0.0.0.0 --port 8000
πŸ’»
Essential Code Snippets
Complete sklearn Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['job_type', 'education']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot',  OneHotEncoder(handle_unknown='ignore', sparse_output=False)),  # 'sparse' renamed in sklearn >= 1.2
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer,   numeric_features),
    ('cat', categorical_transformer, categorical_features),
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier',   GradientBoostingClassifier()),
])

# Grid search with cross-validation
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth':    [3, 5],
    'classifier__learning_rate': [0.05, 0.1],
}
cv = GridSearchCV(pipeline, param_grid, cv=StratifiedKFold(5), scoring='roc_auc', n_jobs=-1)
cv.fit(X_train, y_train)
print(f"Best AUC: {cv.best_score_:.4f}")
print(f"Best params: {cv.best_params_}")
PyTorch Training Loop
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

class MLPClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dims, num_classes, dropout=0.3):
        super().__init__()
        layers, in_dim = [], input_dim
        for h in hidden_dims:
            layers += [nn.Linear(in_dim, h), nn.BatchNorm1d(h), nn.GELU(), nn.Dropout(dropout)]
            in_dim = h
        layers.append(nn.Linear(in_dim, num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x): return self.net(x)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = MLPClassifier(50, [256, 128, 64], 2).to(device)
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=50)
criterion = nn.CrossEntropyLoss()

for epoch in range(50):
    model.train()
    train_loss = 0
    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        optimizer.zero_grad()
        loss = criterion(model(X_batch), y_batch)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
        optimizer.step()
        train_loss += loss.item()
    scheduler.step()

    model.eval()
    with torch.no_grad():
        val_preds = model(X_val.to(device)).argmax(dim=1).cpu()
    if epoch % 10 == 0:
        val_acc = (val_preds == y_val).float().mean().item()
        print(f"Epoch {epoch}: loss={train_loss/len(train_loader):.4f}, val_acc={val_acc:.3f}")
πŸ› οΈ
Tools & Stack
Tool | Category | Key Use | Must-Know Commands/API
---- | -------- | ------- | ----------------------
scikit-learn | ML Framework | Classical ML, preprocessing, evaluation | Pipeline, GridSearchCV, cross_val_score, train_test_split
PyTorch | Deep Learning | Research, custom models, NLP | nn.Module, DataLoader, optimizer.zero_grad(), loss.backward()
TensorFlow/Keras | Deep Learning | Production deployment, mobile | model.compile(), model.fit(), model.predict(), tf.data
Hugging Face | NLP/LLMs | Pretrained models, fine-tuning, inference | AutoTokenizer, pipeline(), Trainer, from_pretrained()
XGBoost/LightGBM | Boosting | Tabular data, competitions, production | xgb.train(), early_stopping_rounds, feature_importance
MLflow | MLOps | Experiment tracking, model registry | mlflow.log_params(), log_metric(), log_model(), load_model()
W&B (Weights & Biases) | MLOps | Rich experiment dashboards, sweeps | wandb.init(), wandb.log(), wandb.sweep()
pandas | Data | Data manipulation, EDA | groupby, merge, apply, pivot_table, read_csv, to_sql
NumPy | Numerical | Array operations, linear algebra | np.dot, np.stack, np.where, np.argmax, broadcasting
ONNX | Deployment | Framework-agnostic model format | torch.onnx.export(), onnxruntime.InferenceSession
🎯
Top 30 Interview Questions & Answers
Q1. Explain the bias-variance tradeoff and how you manage it in practice.
Answer: The bias-variance tradeoff captures two competing sources of error in ML models. Bias is error from incorrect assumptions β€” a highly biased model (like linear regression on non-linear data) fails to capture the true pattern (underfitting). Variance is error from sensitivity to small fluctuations in training data β€” a high-variance model (like a deep decision tree) memorizes noise and fails to generalize (overfitting). Total error = BiasΒ² + Variance + Irreducible noise. In practice: if train error is high β†’ high bias β†’ use a more complex model or add features. If train error is low but val error is high β†’ high variance β†’ add regularization, get more data, use dropout, reduce model complexity. Techniques like k-fold cross-validation help detect the issue, and learning curves (train vs. validation error vs. training set size) help diagnose it visually.
Q2. How does backpropagation work? Walk me through the algorithm.
Answer: Backpropagation is an algorithm to efficiently compute gradients of the loss w.r.t. all parameters using the chain rule. Steps: (1) Forward pass: compute output of each layer and store activations/cached values. (2) Compute loss: compare output Ε· to label y. (3) Backward pass: starting from the loss, compute βˆ‚L/βˆ‚aα΄Έ (gradient of loss w.r.t. final activation), then propagate backwards: βˆ‚L/βˆ‚Wα΅’ = βˆ‚L/βˆ‚aα΅’ Γ— βˆ‚aα΅’/βˆ‚zα΅’ Γ— βˆ‚zα΅’/βˆ‚Wα΅’. The key insight is the chain rule: the gradient of the loss w.r.t. a weight equals the gradient of the loss w.r.t. that layer's output, times the local gradient. (4) Update: w = w - Ξ± Γ— βˆ‚L/βˆ‚w. The vanishing gradient problem occurs when gradients shrink exponentially in deep networks β€” solved by ReLU activations, batch normalization, and residual connections.
Q3. What is the attention mechanism in Transformers? Why is it better than RNNs?
Answer: Attention computes pairwise relationships between all tokens simultaneously: Attention(Q,K,V) = softmax(QKα΅€/√dβ‚–)V. Each query looks at all keys to compute relevance scores, then uses those to weight-sum the values. Benefits over RNNs: (1) Parallelism: RNNs process tokens sequentially β€” can't parallelize. Transformers compute all positions simultaneously. (2) Long-range dependencies: RNNs struggle with very long sequences (vanishing gradient). Attention directly connects any two positions with O(1) operations. (3) Multi-head attention allows the model to jointly attend to different representation subspaces β€” one head may capture syntax, another semantics. The scaling by √dβ‚– prevents dot products from growing too large in high dimensions, which would make softmax gradients vanish.
Q4. When would you use XGBoost vs. a Neural Network?
Answer: XGBoost when: tabular/structured data with mixed feature types, interpretability matters, limited training data (<100K rows), fast training/iteration cycle needed, features require less engineering. Neural Networks when: unstructured data (images, text, audio, video), very large datasets (millions of rows), automatically learning hierarchical features is valuable, state-of-art performance is needed for perception tasks. Rule of thumb: for tabular data, try XGBoost first β€” it wins Kaggle competitions on tabular data. For anything involving sequences, images, or text, go neural. Also consider ensemble approaches (stacking XGBoost with NN outputs as features).
Q5. How do you handle imbalanced datasets?
Answer: Multiple strategies: (1) Resampling: SMOTE oversamples minority class by interpolating between existing minority samples; random undersampling removes majority samples. (2) Class weights: class_weight='balanced' in sklearn β€” increases the penalty for misclassifying minority class. (3) Threshold tuning: default 0.5 threshold is often wrong β€” optimize threshold using F1 or precision-recall curve on validation set. (4) Choose right metrics: Accuracy is misleading (a model predicting all majority is 99% accurate on 1% minority data). Use F1, precision-recall AUC, or Cohen's kappa. (5) Algorithm choice: tree ensembles (XGBoost's scale_pos_weight), anomaly detection (Isolation Forest) for extreme imbalance. (6) In production: monitor per-class performance separately.
Q6. What is the difference between bagging and boosting?
Answer: Bagging (Bootstrap Aggregating) trains N models in parallel on bootstrap samples (random subsets with replacement), then averages predictions. Reduces variance. Example: Random Forest. Each tree is independent and sees different data. Boosting trains models sequentially, each one focusing on samples the previous model got wrong. The final prediction is a weighted sum. Reduces bias. Examples: AdaBoost, XGBoost, LightGBM. Key differences: Bagging is parallelizable and more robust to overfitting; boosting is sequential, can overfit with too many estimators, but typically achieves better accuracy. Boosting tends to outperform bagging on tabular data when tuned properly.
Q7. How do you prevent overfitting in neural networks?
Answer: Multiple techniques: (1) Dropout: randomly zero out neurons during training (p=0.1–0.5 depending on layer size) β€” forces learning of robust features. (2) L1/L2 regularization (weight decay): adds penalty to large weights. (3) Early stopping: monitor val loss, stop when it starts increasing β€” simplest and most effective. (4) Data augmentation: for images β€” flips, crops, color jitter; for text β€” back-translation, synonym substitution. (5) Batch normalization: normalizes activations, acts as mild regularizer. (6) More data: most reliable fix. (7) Simpler architecture: reduce layers, hidden units. (8) Learning rate scheduling: cosine annealing helps escape sharp minima.
Q8. Explain the curse of dimensionality.
Answer: As the number of dimensions (features) increases, the volume of the space grows exponentially, so the available data becomes sparse. This causes several problems: (1) Distance-based algorithms (KNN, K-Means) fail because all points become approximately equidistant in high dimensions β€” "distance" loses meaning. (2) More dimensions require exponentially more training data to maintain the same density. (3) Models with many features are harder to train and prone to overfitting. Solutions: dimensionality reduction (PCA, UMAP, autoencoders), feature selection (L1 regularization, mutual information), domain knowledge to select relevant features, tree-based models which are more resistant (they select features at each split).
Q9. What is data leakage and how do you prevent it?
Answer: Data leakage occurs when information that would not be available at prediction time (from the test set, or from the future) makes its way into training, causing artificially inflated validation metrics that don't reflect real-world performance. Types: (1) Train-test contamination: normalizing using the entire dataset before splitting — use fit() on train only, then transform() on both. (2) Target leakage: using features that are causally downstream of the target (e.g., using "claim amount" to predict "insurance fraud"). (3) Temporal leakage: using future data to predict past events — always split time-series data chronologically. Prevention: always fit preprocessing inside cross-validation folds; use sklearn Pipelines; be skeptical of unrealistically good results; audit features carefully for temporal ordering.
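The train-test contamination fix can be sketched as a leakage-safe standardizer, mirroring sklearn's fit-on-train / transform-both pattern (toy helpers; the names are our own):

```python
def fit_scaler(train):
    """Compute mean and std on TRAINING data only (leakage-safe)."""
    mean = sum(train) / len(train)
    var = sum((x - mean) ** 2 for x in train) / len(train)
    return mean, var ** 0.5

def transform(values, mean, std):
    """Apply previously fitted statistics to any split."""
    return [(x - mean) / std for x in values]

train, test = [1.0, 2.0, 3.0, 4.0], [10.0]
mean, std = fit_scaler(train)              # fit on train only...
scaled_test = transform(test, mean, std)   # ...then reuse the SAME stats on test
```

Fitting on train+test instead would pull the mean toward the test outlier and silently leak test-set information into the model.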
Q10. How do you evaluate and compare machine learning models?
Answer: Start with the right metric for the task: F1/AUC for imbalanced classification, RMSE/MAE for regression, NDCG for ranking. Use stratified k-fold cross-validation (k=5 or 10) β€” never evaluate on a single train/test split. Report confidence intervals across folds. Check for statistical significance (McNemar's test for classification). Beyond accuracy: calibration (are probabilities accurate β€” use Platt scaling or isotonic regression if not), inference time (latency matters in production), memory footprint, interpretability (SHAP values), and fairness metrics (demographic parity, equalized odds across subgroups). In production: shadow mode (run new model alongside old, compare on real traffic), A/B testing, monitoring for data drift.
Q11. What is transfer learning and when do you use it?
Answer: Transfer learning reuses a model pre-trained on a large dataset (ImageNet, Wikipedia+BooksCorpus for BERT) for a different but related task. When to use: limited labeled data for your specific task, your task is in the same domain as the pre-training data, you need fast iteration. How to fine-tune: (1) Feature extraction: freeze all pretrained weights, add new classification head, train only the head. (2) Full fine-tuning: unfreeze all layers, train end-to-end with very small learning rate (2e-5 for BERT). (3) Gradual unfreezing: start with head, then progressively unfreeze layers (ULMFiT approach). Key: pretrained weights are a warm start β€” they've already learned low-level features (edges for vision, syntax for NLP), so your model needs less data to learn high-level task-specific features.
Q12. How does BERT work? What makes it different from GPT?
Answer: BERT (Bidirectional Encoder Representations from Transformers) is a transformer encoder pre-trained with two objectives: (1) Masked Language Model (MLM): randomly mask 15% of input tokens and predict them β€” forces bidirectional context understanding. (2) Next Sentence Prediction (NSP): given two sentences, predict whether the second follows the first. BERT is bidirectional β€” each token attends to all other tokens in both directions. GPT is a decoder-only autoregressive model β€” each token attends only to previous tokens (causal/left-to-right masking) and is trained to predict the next token. BERT excels at understanding tasks (classification, NER, extractive Q&A). GPT excels at generation tasks (text completion, dialogue, summarization). Modern LLMs like GPT-4 use only the decoder (generative pre-training), while models like T5 use encoder-decoder for sequence-to-sequence tasks.
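The MLM objective is easy to sketch in plain Python. This is simplified: real BERT replaces chosen tokens 80% with [MASK], 10% with a random token, and leaves 10% unchanged; here every chosen token becomes [MASK].

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Simplified BERT-style corruption: hide ~15% of tokens and
    remember the originals as the prediction targets."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            targets[i] = tok           # the model must recover this token
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = "the cat sat on the mat because it was tired".split()
corrupted, targets = mask_tokens(tokens)
```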
Q13. How would you deploy a machine learning model to production?
Answer: Full deployment pipeline: (1) Serialization: save model (joblib for sklearn, torch.save, ONNX for cross-framework compatibility). (2) API: wrap in FastAPI/Flask REST endpoint with input validation (Pydantic), error handling, and request logging. (3) Containerization: Docker container with all dependencies β€” ensures reproducibility. (4) Orchestration: Kubernetes for auto-scaling, rolling updates, health checks. (5) CI/CD: automated testing (unit + integration), model validation checks (performance must exceed threshold), gradual rollout (canary). (6) Monitoring: log predictions and inputs, detect data drift (distribution shift in features), model drift (performance degradation), set up alerts. (7) Retraining trigger: schedule or drift-based. Always keep the previous model version as fallback.
Q14. What is a confusion matrix and how do you interpret it?
Answer: A confusion matrix shows the breakdown of actual vs. predicted classes. For binary classification: TP (predicted positive, actually positive), FP (predicted positive, actually negative β€” Type I error), FN (predicted negative, actually positive β€” Type II error), TN (predicted negative, actually negative). Key metrics derived: Precision = TP/(TP+FP) β€” "of all predicted positives, how many are correct". Recall/Sensitivity = TP/(TP+FN) β€” "of all actual positives, how many did we catch". Specificity = TN/(TN+FP). F1 = harmonic mean of P and R. In practice: high precision, low recall β†’ model is conservative (misses positives). High recall, low precision β†’ model is liberal (many false alarms). Choose based on cost: medical diagnosis β†’ maximize recall (missing cancer is worse than false alarm); spam detection β†’ maximize precision (false positives annoy users).
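The derived metrics are one-liners; a pure-Python sketch with hypothetical counts:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard metrics derived from a binary confusion matrix."""
    precision = tp / (tp + fp)      # of predicted positives, how many correct
    recall = tp / (tp + fn)         # of actual positives, how many caught
    specificity = tn / (tn + fp)    # of actual negatives, how many correct
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, f1

# Hypothetical screening model: 80 caught, 20 false alarms,
# 10 missed, 890 correct negatives.
p, r, s, f1 = confusion_metrics(tp=80, fp=20, fn=10, tn=890)
# p = 80/100 = 0.8, r = 80/90 ~ 0.889
```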
Q15. What is regularization and why is it necessary?
Answer: Regularization adds a penalty term to the loss function to discourage overly complex models that memorize training data. Without regularization, a model can fit training data perfectly (zero training loss) by learning arbitrary decision boundaries that don't generalize. L1 (Lasso) adds Ξ»|w| β€” penalty proportional to absolute value of weights. This drives some weights exactly to zero, performing feature selection. L2 (Ridge) adds Ξ»wΒ² β€” penalty proportional to squared weights. This shrinks all weights toward zero but rarely to exactly zero. ElasticNet combines both. The hyperparameter Ξ» controls the regularization strength β€” too high causes underfitting, too low has no effect. Tune via cross-validation. In neural networks, L2 is called weight decay, and dropout acts as a stochastic regularizer by preventing co-adaptation of neurons.
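The feature-selection effect of L1 can be demonstrated directly (a sketch assuming scikit-learn; synthetic data where only two of ten features matter):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only features 0 and 1 are informative; the other 8 are pure noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

n_zero_l1 = int(np.sum(lasso.coef_ == 0))  # L1 zeroes out irrelevant features
n_zero_l2 = int(np.sum(ridge.coef_ == 0))  # L2 only shrinks, rarely exactly zero
```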
Q16. Explain gradient descent and its variants (SGD, Adam, RMSprop).
Answer: Gradient descent minimizes loss by iteratively moving parameters in the direction of steepest descent: w = w - Ξ±βˆ‡L. Variants: SGD: w = w - Ξ±βˆ‡L(single sample) β€” noisy but fast, can escape local minima, good for large datasets. SGD + Momentum: maintains velocity v = Ξ²v + Ξ±βˆ‡L, w = w - v β€” smooths oscillations, faster convergence. RMSprop: adapts the learning rate per parameter using a running mean of squared gradients β€” good for non-stationary objectives. Adam: combines momentum (first moment) + RMSprop (second moment), with bias correction. Default go-to optimizer. Hyperparams: β₁=0.9, Ξ²β‚‚=0.999, Ξ΅=1e-8. AdamW: Adam with decoupled weight decay β€” often better for transformers. In practice: start with Adam; if overfitting, try SGD + momentum (often better generalization); for transformers, use AdamW with linear warmup + cosine decay.
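The update rules can be written out directly; a numpy sketch minimizing the toy loss L(w) = (w - 3)Β² with both SGD + momentum and Adam (hyperparameters as quoted above; the toy loss and step counts are illustrative):

```python
import numpy as np

def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2 * (w - 3.0)

# SGD + momentum: v = beta*v + alpha*grad ; w -= v
w, v = 0.0, 0.0
for _ in range(200):
    v = 0.9 * v + 0.05 * grad(w)
    w -= v
w_momentum = w

# Adam: first/second moment estimates with bias correction
w, m, s = 0.0, 0.0, 0.0
b1, b2, alpha, eps = 0.9, 0.999, 0.1, 1e-8
for t in range(1, 201):
    g = grad(w)
    m = b1 * m + (1 - b1) * g          # first moment (momentum)
    s = b2 * s + (1 - b2) * g * g      # second moment (RMSprop-style)
    m_hat = m / (1 - b1 ** t)          # bias correction
    s_hat = s / (1 - b2 ** t)
    w -= alpha * m_hat / (np.sqrt(s_hat) + eps)
w_adam = w
```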
Q17. How do you choose hyperparameters for a model?
Answer: Methods in increasing sophistication: (1) Manual search: understand which hyperparameters matter most (learning rate is #1 for NNs, max_depth for trees). (2) Grid Search: exhaustive search over a specified grid. Works for small grids (<100 combos). (3) Random Search: sample random combos β€” often ~3x more efficient than grid for the same compute budget (Bergstra et al.). (4) Bayesian Optimization: builds a surrogate model of the objective function and samples promising regions (Optuna, Hyperopt) β€” typically far more sample-efficient than random search. (5) Population-Based Training (PBT): parallelize and adaptively reallocate compute. Key insights: learning rate is most important β€” always search it log-uniformly (0.0001 to 0.1). Regularization strength: log-uniform. Number of layers/units: linear. Use early stopping inside the hyperparameter search to save time.
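Random search with a log-uniform learning-rate prior is only a few lines. The objective here is a hypothetical stand-in for validation loss, with its minimum placed at lr = 1e-2:

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Sample log-uniformly in [low, high] -- the right prior for learning rates."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

def validation_loss(lr):
    # Hypothetical stand-in: quadratic in log-space, best at lr = 1e-2.
    return (math.log10(lr) + 2) ** 2

rng = random.Random(0)
trials = [sample_log_uniform(1e-4, 1e-1, rng) for _ in range(20)]
best_lr = min(trials, key=validation_loss)
```

Bayesian optimizers like Optuna replace the random sampler with a surrogate model but keep the same loop shape.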
Q18. What is a ROC curve and AUC? When would you use PR curve instead?
Answer: ROC curve plots True Positive Rate (Recall) vs. False Positive Rate (1 - Specificity) at every classification threshold. AUC (Area Under Curve) = probability that the model ranks a random positive higher than a random negative. AUC=0.5 β†’ random, AUC=1.0 β†’ perfect. ROC is threshold-independent and works well when classes are balanced. The Precision-Recall curve is better for imbalanced datasets: with a 1% positive class, the huge pool of true negatives keeps FPR low even when the model produces many false positives, so the ROC curve looks deceptively good. The PR curve shows the tradeoff between precision and recall directly β€” more informative when positives are rare (fraud detection, disease screening). Rule of thumb: heavily imbalanced classes β†’ PR-AUC; roughly balanced classes β†’ ROC-AUC.
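The ranking interpretation of AUC can be computed straight from its definition (pure Python, illustrative scores):

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC = probability that a random positive is scored above a random
    negative; ties count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])  # 8 of 9 pairs correct
```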
Q19. What are embeddings and why are they important?
Answer: Embeddings are dense, low-dimensional vector representations of high-cardinality discrete inputs (words, users, products, categories). Instead of one-hot encoding (sparse, no semantic meaning), embeddings place similar items close in vector space β€” learned from co-occurrence patterns. Why important: (1) Dimensionality reduction β€” 50,000 words β†’ 300-dim vectors. (2) Encode semantic similarity β€” "king" - "man" + "woman" β‰ˆ "queen". (3) Enable generalization β€” model can handle unseen words similar to trained words. Used in: NLP (word2vec, GloVe, BERT), recommender systems (user/item embeddings for collaborative filtering), e-commerce (product embeddings for similarity search), knowledge graphs. In practice: use pre-trained embeddings for NLP, learn task-specific embeddings for entities, store in FAISS/Pinecone for ANN search.
Q20. How do you detect and handle data drift in production?
Answer: Data drift occurs when the statistical properties of production data differ from training data. Types: Covariate shift: input distribution P(X) changes. Label shift: output distribution P(Y) changes. Concept drift: the relationship P(Y|X) changes (e.g., user behavior changes seasonally). Detection methods: (1) Statistical tests β€” KS test (continuous), chi-squared (categorical), Population Stability Index (PSI). (2) Monitor feature means, standard deviations, null rates over time. (3) Monitor the model output distribution (prediction drift). (4) Monitor actual performance when labels become available (performance drift). Tools: Evidently AI, WhyLogs, Great Expectations. Response: retrain on recent data; use domain adaptation techniques; alert for human review when thresholds are exceeded. A common rule of thumb: PSI > 0.25 signals significant drift requiring action.
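PSI itself is a short numpy function (a sketch; the bin count and the 1e-6 floor are common but arbitrary choices):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline (training) sample
    and a production sample of one feature."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) and division by zero.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)   # training-time feature distribution
same = rng.normal(0, 1, 10_000)       # production sample, no drift
shifted = rng.normal(1, 1, 10_000)    # production sample with a mean shift
```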
Q21–Q30 (Additional Critical Questions)
Q21. What is SHAP and how do you use it for model interpretability? SHAP (SHapley Additive exPlanations) assigns each feature a value representing its contribution to a prediction, grounded in game-theoretic Shapley values. The SHAP values for a prediction sum to the model output minus the expected (baseline) output. Use shap.TreeExplainer (fast for tree models), shap.DeepExplainer (NNs), shap.KernelExplainer (any model). Waterfall plots for individual predictions; summary plots for global feature importance.

Q22. Explain L1 vs L2 loss functions. L1 loss (MAE) = |y - Ε·|. Robust to outliers (gradient = constant). L2 loss (MSE) = (y - Ε·)Β². Penalizes large errors more heavily, gradient = 2(y - Ε·). Huber loss combines both β€” MAE beyond threshold Ξ΄, MSE within Ξ΄. Use MAE when outliers are real signal; MSE when outliers are noise.
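Huber takes only a few lines (delta = 1.0 is the common default; the example values are illustrative):

```python
def huber(y, y_hat, delta=1.0):
    """Huber loss: quadratic (MSE-like) for small errors, linear (MAE-like)
    beyond |error| = delta -- robust to outliers yet smooth at zero."""
    err = abs(y - y_hat)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * (err - 0.5 * delta)

small = huber(2.0, 1.5)    # |err| = 0.5 -> 0.5 * 0.25 = 0.125 (quadratic regime)
large = huber(10.0, 1.0)   # |err| = 9  -> 1 * (9 - 0.5) = 8.5 (linear regime)
```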

Q23. What is batch normalization and why does it help? After each layer, normalize activations: xΜ‚ = (x - ΞΌ) / Οƒ, then apply learnable scale Ξ³ and shift Ξ². Benefits: (1) Reduces internal covariate shift β€” stabilizes training. (2) Allows higher learning rates. (3) Acts as mild regularizer (adds noise from mini-batch statistics). (4) Reduces sensitivity to weight initialization.

Q24. How do CNNs handle spatial invariance? Weight sharing β€” the same filter is applied at all spatial positions. Max pooling β€” takes the maximum activation in each window, discarding exact position. Together these mean a cat in the top-left produces much the same response as a cat in the bottom-right. Data augmentation (random crops, flips) further improves invariance.

Q25. What is the difference between generative and discriminative models? Discriminative models (logistic regression, SVM, random forest) learn P(Y|X) β€” directly model the decision boundary. Generative models (Naive Bayes, GMM, VAE, GANs) learn P(X,Y) β€” model how data is generated, can generate new samples. Generative models are more powerful but harder to train and require more data.

Q26. How does a GAN work? Generator G learns to produce fake data G(z) from noise z. Discriminator D learns to distinguish real from fake. They play a minimax game: min_G max_D [logD(x) + log(1-D(G(z)))]. Training instability is the main challenge β€” use techniques like WGAN (Wasserstein distance), spectral normalization, progressive growing.

Q27. What is few-shot learning? Learning new concepts from very few examples (1-shot, 5-shot). Approaches: (1) Meta-learning (MAML) β€” learn to learn; the model's initialization is optimized for quick adaptation. (2) Prototypical Networks β€” compute a prototype embedding per class, classify by nearest prototype. (3) In-context learning with pretrained LLMs β€” put the few examples directly in the prompt; no weight updates at all.

Q28. Explain LoRA (Low-Rank Adaptation) for fine-tuning LLMs. LoRA freezes the pretrained weights W and adds trainable low-rank matrices: W' = W + BA where B∈Rᡈˣʳ, A∈Rʳˣᡏ, r β‰ͺ min(d,k). Only A and B are trained β€” cutting trainable parameters by up to ~10,000Γ— at GPT-3 scale while maintaining performance. r=4–16 works for most tasks. Used for Alpaca and LLaMA fine-tuning.
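The parameter arithmetic is worth verifying once by hand; a numpy sketch with hypothetical layer sizes (the paper's init, Gaussian A and zero B, keeps W' equal to W at step 0):

```python
import numpy as np

d, k, r = 512, 512, 8              # hypothetical layer sizes and LoRA rank
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))        # frozen pretrained weight
B = np.zeros((d, r))               # trainable, initialized to zero...
A = rng.normal(size=(r, k)) * 0.01 # ...so W' == W at the start of training

W_prime = W + B @ A                # effective weight during fine-tuning

full_params = d * k                # what full fine-tuning would train
lora_params = d * r + r * k        # only A and B are trained
reduction = full_params / lora_params
```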

Q29. What is RAG (Retrieval Augmented Generation)? Combines retrieval with generation. Query β†’ retrieve relevant documents from vector DB (FAISS, Pinecone) β†’ feed documents + query to LLM β†’ generate grounded answer. Solves: LLM hallucination, stale knowledge cutoff, private knowledge injection. Key components: embedding model (for retrieval), vector DB, chunking strategy, LLM for generation.
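The retrieval step reduces to a top-k cosine-similarity search; a numpy sketch with random stand-in embeddings (a real system would use an embedding model plus FAISS/Pinecone):

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k documents most similar to the query by
    cosine similarity -- what a vector DB does under the hood."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    return np.argsort(sims)[::-1][:k]

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 64))                    # stand-in document embeddings
query = docs[3] + rng.normal(scale=0.1, size=64)   # query close to document 3

retrieved = top_k(query, docs, k=2)
# The retrieved chunks would then be concatenated with the question
# into the LLM prompt to ground the generated answer.
```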

Q30. How do you design an ML system for real-time fraud detection? Requirements: sub-100ms latency, high recall, online learning. Architecture: features served from feature store (Redis for real-time, S3 for batch); rule-based pre-filter for obvious fraud; ML model (LightGBM or neural network) for scoring; ensemble/cascade for cost efficiency; post-processing (threshold, velocity rules). Key features: amount deviation from user baseline, merchant category, geolocation velocity (impossible travel), time of day, network features (shared device/email). Feedback loop: labeled fraud cases retrain model weekly. A/B test new models in shadow mode.
πŸ—οΈ
System Design Patterns
Real-Time Recommendation System
  • Candidate Generation: Matrix factorization or two-tower model β†’ retrieve top-1000 candidates from vector DB
  • Ranking: DNN ranker with user/item/context features β†’ score candidates
  • Feature Store: Redis for real-time features (clicks last 1h), Kafka for streaming updates
  • A/B testing: Multi-armed bandit for exploration vs. exploitation
  • Cold start: Content-based fallback for new users/items
  • Latency target: <50ms using pre-computed user embeddings
ML Training Pipeline at Scale
  • Data: Raw data in S3 β†’ Spark for feature engineering β†’ Feature store
  • Training: Distributed training (PyTorch DDP or Ray Train) on GPU cluster
  • Experiment tracking: MLflow β€” log all runs, compare metrics
  • CI/CD: Automated model quality gates (AUC must be > baseline before promoting)
  • Serving: Model registry β†’ Docker image β†’ Kubernetes deployment
  • Monitoring: Evidently AI dashboards for drift; PagerDuty alerts
⚠️
Common Mistakes to Avoid
Mistake | Why It's Wrong | Correct Approach
Fitting StandardScaler on entire dataset | Leaks test distribution into training | Fit only on train, transform both train and test
Using accuracy for imbalanced data | Misleading β€” 99% accuracy on a 1%-positive dataset by predicting all negative | Use F1, AUC, or PR-AUC
Not shuffling before train/test split | Temporal or class patterns in data order | Use shuffle=True or a stratified split
Tuning hyperparameters on the test set | Overfits to the test set β€” optimistic estimate | Use a validation set or nested CV for tuning
Ignoring class imbalance | Model biased toward majority class | SMOTE, class_weight, threshold tuning
Not checking for data leakage | Unrealistic validation metrics | Audit all features for temporal leakage
Learning rate too large | Loss diverges or oscillates | Start small; use an LR finder (fast.ai method)
Not setting random seeds | Irreproducible experiments | Set np.random.seed, torch.manual_seed, random.seed
πŸ“–
Terminology Glossary
Term | Definition
Epoch | One full pass through the entire training dataset
Batch | Subset of training data used in one gradient update
Iteration | One weight update (one batch processed)
Hyperparameter | Configuration set before training (learning rate, depth), as opposed to learned parameters (weights)
Inductive Bias | Assumptions a model makes about the problem (CNNs assume spatial locality)
Generalization | Model's ability to perform well on unseen data
Calibration | Alignment between predicted probabilities and actual frequencies
Latent Space | Compressed representation learned by a model (e.g., autoencoder bottleneck)
Tokenization | Splitting text into tokens (words, subwords, characters) for NLP models
Fine-tuning | Further training a pre-trained model on a specific task with a small learning rate
Perplexity | Measure of how well an LLM predicts a sequence. Lower = better. 2^(cross-entropy)
Temperature | Controls LLM output randomness. High temp β†’ diverse, creative. Low β†’ deterministic
Hallucination | LLM generates plausible-sounding but factually incorrect information
RAG | Retrieval-Augmented Generation β€” ground the LLM with retrieved documents
Quantization | Reduce model precision (FP32 β†’ INT8) to shrink size and speed up inference
Pruning | Remove low-importance weights/neurons to compress a model
Distillation | Train a small student model to mimic a large teacher model's outputs
Ensemble | Combine predictions of multiple models for better performance
Stacking | Use other models' predictions as input features for a meta-model
RLHF | Reinforcement Learning from Human Feedback β€” how GPT is aligned with human preferences

Thejaslearning β€” AI/ML Engineer Cheat Sheet