πŸ€– Career Cheat Sheet

AI / ML Engineer

Everything you need to crack any ML interview β€” concepts, code, Q&A, system design

$110K–$210K — US Salary Range
30+ — Interview Q&As
50+ — Code Snippets
11 — ML Algorithms

πŸ“‹ Table of Contents

  1. Quick Reference Card
  2. ML Fundamentals
  3. All ML Algorithms
  4. Deep Learning
  5. NLP & LLMs
  6. MLOps
  7. Code Snippets
  8. Tools & Stack
  9. Top 30 Interview Q&As
  10. System Design Patterns
  11. Mistakes to Avoid
  12. Glossary
⚑
Quick Reference Card
Concept | Formula / Key Fact | When to Use
------- | ------------------ | -----------
Bias-Variance Tradeoff | Error = Bias² + Variance + Irreducible | Model selection, regularization decisions
Learning Rate | Typical: 0.001–0.01. Too high → diverge, too low → slow | All gradient-based training
Precision | TP / (TP + FP) | Spam detection, fraud — minimize false positives
Recall | TP / (TP + FN) | Cancer detection — minimize false negatives
F1 Score | 2 × (P × R) / (P + R) | Imbalanced datasets
ROC-AUC | Area under ROC curve. 0.5 = random, 1.0 = perfect | Binary classification evaluation
RMSE | √(Σ(yᵢ - ŷᵢ)² / n) | Regression — sensitive to outliers
L1 (Lasso) | Loss + λ‖w‖₁ | Feature selection — drives weights to 0
L2 (Ridge) | Loss + λ‖w‖₂² | Prevent overfitting, keeps all features
Dropout | Randomly zero out p% of neurons during training | Neural network regularization
Batch Norm | Normalize layer inputs: μ=0, σ=1 per mini-batch | Deep networks — stabilizes training
Adam Optimizer | Combines momentum + RMSprop. β₁=0.9, β₂=0.999 | Default optimizer — works well in most cases
Cross-Entropy Loss | -Σ yᵢ log(ŷᵢ) | Classification tasks
Softmax | e^(xᵢ) / Σⱼ e^(xⱼ) | Multi-class output — probabilities sum to 1
k-Fold CV | Split data into k parts; train on k-1, test on 1; repeat | Model evaluation, hyperparameter tuning
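As a quick sanity check, the precision/recall/F1 rows above can be computed directly from confusion-matrix counts (a minimal sketch; the helper name is our own):

```python
def classification_metrics(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)          # of predicted positives, how many are right
    recall = tp / (tp + fn)             # of actual positives, how many were caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
    return precision, recall, f1

# Example: 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = classification_metrics(8, 2, 4)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.667 0.727
```

Note that F1 simplifies to 2·TP / (2·TP + FP + FN), which is why it ignores true negatives entirely.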
πŸ’‘ Pro Tip
When in doubt about evaluation metric: imbalanced classes β†’ F1/AUC; cost-sensitive β†’ use custom loss; ranking β†’ NDCG; regression β†’ RMSE/MAE depending on outlier tolerance.
🧠
ML Fundamentals
Types of Machine Learning
Supervised
  • Labeled training data
  • Classification: predict category
  • Regression: predict number
  • Examples: spam detection, house prices
Unsupervised
  • No labels β€” find hidden patterns
  • Clustering (K-Means, DBSCAN)
  • Dimensionality reduction (PCA)
  • Examples: customer segmentation
Reinforcement
  • Agent learns via reward/penalty
  • Policy optimization (PPO, DQN)
  • Examples: game playing, robotics
  • Key: exploration vs exploitation
Bias-Variance Tradeoff
Total Error = BiasΒ² + Variance + Irreducible Noise

High Bias (Underfitting)

  • Model too simple β€” can't capture patterns
  • High training error AND high test error
  • Fix: More complex model, add features, reduce regularization
  • Example: Linear model on non-linear data

High Variance (Overfitting)

  • Model memorizes training data
  • Low training error, HIGH test error
  • Fix: More data, regularization, simpler model, dropout
  • Example: Deep tree with no max_depth
πŸ”‘ Key Insight
The sweet spot: model complex enough to learn patterns, but not so complex it memorizes noise. Cross-validation helps find this point.
Gradient Descent Variants
Variant | Update Rule | Pros | Cons
------- | ----------- | ---- | ----
Batch GD | Use all data per step | Stable, smooth convergence | Slow for large datasets
Stochastic GD (SGD) | Use 1 sample per step | Fast, can escape local minima | Noisy updates, may not converge
Mini-Batch GD | Use batch of 32–256 per step | Balance of speed + stability | Requires tuning batch size
w = w - Ξ± Γ— βˆ‡L(w) where Ξ± = learning rate, βˆ‡L = gradient of loss
Regularization Techniques

L1 Regularization (Lasso)

Loss_total = Loss + Ξ» Γ— Ξ£|wα΅’|
  • Drives some weights to exactly 0 β†’ feature selection
  • Produces sparse models
  • Use when: you suspect many features are irrelevant

L2 Regularization (Ridge)

Loss_total = Loss + Ξ» Γ— Ξ£wα΅’Β²
  • Shrinks weights toward 0 but not exactly 0
  • Keeps all features, just smaller weights
  • Use when: all features likely relevant

ElasticNet

Loss + Ξ±(λ₁|w| + Ξ»β‚‚wΒ²)
  • Combines L1 + L2. Best of both worlds.

Dropout

  • During training: randomly set neurons to 0 with probability p
  • Forces network to learn redundant representations
  • Typical p: 0.1–0.5 depending on layer

Early Stopping

  • Monitor validation loss; stop when it starts increasing
  • Simple and very effective
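A toy sketch of why L1 zeroes weights while L2 only shrinks them: a single L1 proximal (soft-thresholding) step snaps small weights to exactly 0, while an L2 gradient step multiplies them by a factor below 1 (helper names are our own):

```python
def l1_prox(w, lam):
    """One L1 soft-thresholding step: weights with |w| <= lam snap to exactly 0."""
    if abs(w) <= lam:
        return 0.0
    return w - lam if w > 0 else w + lam

def l2_shrink(w, lam):
    """One gradient step on the lam * w^2 penalty: multiplicative shrink, never exactly 0."""
    return w * (1 - 2 * lam)

print(l1_prox(0.05, 0.1), l2_shrink(0.05, 0.1))  # L1 snaps 0.05 to 0.0; L2 only shrinks it
```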
Feature Engineering

Encoding Categorical Variables

  • Label Encoding: Maps categories to integers. Use for ordinal data (low/med/high)
  • One-Hot Encoding: Creates binary columns per category. Use for nominal data
  • Target Encoding: Replace category with mean of target. Watch for leakage!
  • Frequency Encoding: Replace with count/frequency
  • Embedding: Learn dense representations (high cardinality)

Scaling Numerical Features

  • StandardScaler: (x - ΞΌ) / Οƒ β†’ mean=0, std=1. Use for linear models, SVMs
  • MinMaxScaler: (x - min) / (max - min) β†’ [0,1]. Use for NNs, distance-based
  • RobustScaler: Uses median & IQR. Robust to outliers
  • Log transform: For right-skewed distributions (e.g., income)

Handling Missing Values

Strategy | When to Use
-------- | -----------
Mean/Median/Mode imputation | MCAR data, small proportion missing
KNN imputation | MAR data, when neighbors share values
Model-based imputation | Complex missingness patterns
Drop rows/columns | If > 70% missing or data is MNAR
Indicator column | Missingness itself is informative
Handling Imbalanced Datasets

Data-Level Methods

  • Oversampling minority: SMOTE (Synthetic Minority Oversampling) β€” creates synthetic samples in feature space
  • Undersampling majority: Random or informed (Tomek links, ENN)
  • Class weights: class_weight='balanced' in sklearn β€” adjusts loss contribution per class

Algorithm-Level Methods

  • Choose appropriate metric: F1, PR-AUC, not accuracy
  • Adjust decision threshold (default 0.5 may not be optimal)
  • Use cost-sensitive learning
  • Ensemble methods (balanced random forest)
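Decision-threshold tuning from the list above can be sketched as a simple sweep over a validation set (a toy example; names are illustrative):

```python
def best_f1_threshold(probs, labels):
    """Sweep candidate thresholds on validation data; return the one maximizing F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in [i / 100 for i in range(1, 100)]:
        tp = sum(p >= t and y == 1 for p, y in zip(probs, labels))
        fp = sum(p >= t and y == 0 for p, y in zip(probs, labels))
        fn = sum(p < t and y == 1 for p, y in zip(probs, labels))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

probs  = [0.95, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1]   # model scores on validation set
labels = [1,    1,   1,   0,   1,   0,   0]       # true classes
t, f1 = best_f1_threshold(probs, labels)
# Here a threshold well below the default 0.5 catches the 0.3-scored positive.
```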
πŸ“Š
All ML Algorithms β€” Deep Dive
πŸ“ˆ Linear Regression
Regression Supervised

How it works: Fits a line Ε· = Ξ²β‚€ + β₁x₁ + ... + Ξ²β‚™xβ‚™ that minimizes MSE (Ordinary Least Squares)

  • Assumptions: Linearity, independence, homoscedasticity, normality of residuals
  • Key hyperparams: regularization strength (Ξ»), fit_intercept
  • When to use: Continuous target, linear relationships, interpretability needed
  • Pros: Fast, interpretable, no hyperparameter tuning needed
  • Cons: Can't model non-linear patterns, sensitive to outliers
πŸ”΅ Logistic Regression
Classification Supervised

How it works: Applies sigmoid to linear output β†’ probability. Decision boundary where P=0.5

Οƒ(z) = 1 / (1 + e⁻ᢻ)
  • Loss: Binary cross-entropy
  • Key hyperparams: C (inverse regularization), penalty (L1/L2), solver
  • When to use: Binary classification, need probability outputs, baseline model
  • Pros: Probabilistic output, fast, interpretable coefficients
  • Cons: Assumes linear decision boundary, requires feature scaling
🌲 Decision Trees
Both tasks Supervised

How it works: Recursively splits data on features that maximize information gain (entropy) or minimize Gini impurity

Gini = 1 - Σ pᵢ²
Entropy = -Σ pᵢ log(pᵢ)
  • Key hyperparams: max_depth, min_samples_split, min_samples_leaf, max_features
  • Pros: Interpretable, no scaling needed, handles mixed types
  • Cons: Prone to overfitting, unstable (small data changes β†’ different tree)
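The split criteria above can be computed directly (a minimal sketch, using log base 2 for entropy, the common convention):

```python
from math import log2

def gini(ps):
    """Gini impurity: 1 - sum(p_i^2). 0 = pure node, 0.5 = 50/50 binary split."""
    return 1 - sum(p * p for p in ps)

def entropy(ps):
    """Entropy: -sum(p_i * log2(p_i)). 0 = pure, 1 bit for a 50/50 binary split."""
    return sum(-p * log2(p) for p in ps if p > 0)

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 1.0  (maximally impure)
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # 0.0 0.0  (pure node)
```

A split is chosen to maximize the drop in impurity from parent to (weighted) children.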
🌳 Random Forest
Ensemble Bagging

How it works: Trains N trees on bootstrap samples, random feature subsets. Final prediction = majority vote (classification) or mean (regression)

  • Key hyperparams: n_estimators, max_depth, max_features, min_samples_leaf
  • Feature importance: Mean decrease in impurity across all trees
  • Pros: Robust to outliers, no scaling, handles high dimensions, gives feature importance
  • Cons: Slow prediction, not interpretable, memory intensive
  • When to use: Tabular data, feature selection, robust baseline
⚑ XGBoost / LightGBM
Ensemble Boosting

How it works: Trains trees sequentially, each correcting previous errors. Uses gradient of loss to determine next tree direction.

  • XGBoost: Level-wise tree growth, built-in regularization (L1, L2)
  • LightGBM: Leaf-wise growth (faster), histogram-based, better for large data
  • Key hyperparams: n_estimators, learning_rate, max_depth, subsample, colsample_bytree, reg_alpha, reg_lambda
  • Pros: State-of-art on tabular data, handles missing values, built-in CV
  • Cons: Many hyperparameters, risk of overfitting
πŸ”΄ Support Vector Machine
Classification Supervised

How it works: Finds hyperplane that maximizes margin between classes. Kernel trick maps to higher dimensions.

  • Kernels: Linear, RBF (Gaussian), Polynomial, Sigmoid
  • C parameter: Low C = wide margin (more misclassifications tolerated), High C = narrow margin (fewer misclassifications tolerated)
  • Gamma (RBF): Low = smooth boundary, High = complex boundary
  • Pros: Works well in high dimensions, effective when n_features > n_samples
  • Cons: Slow on large datasets, requires feature scaling, kernel choice
πŸ”΅ K-Nearest Neighbors
Both tasks Instance-based

How it works: At prediction time, finds k closest training points (by distance), returns majority vote or mean

  • Distance metrics: Euclidean (default), Manhattan, Minkowski, Cosine
  • Key hyperparams: k (n_neighbors), distance metric, weights (uniform/distance)
  • Choosing k: Low k = overfits, high k = underfits. Use √n as starting point
  • Pros: Simple, no training, naturally multi-class
  • Cons: Slow prediction O(n), curse of dimensionality, requires scaling
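The prediction step can be sketched in plain Python (a toy example; `knn_predict` is our own name):

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    neighbors = sorted(zip(X_train, y_train), key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ['a', 'a', 'a', 'b', 'b', 'b']
print(knn_predict(X, y, (0.5, 0.5)))  # label of the nearby cluster
```

The sort over all training points is the O(n) prediction cost mentioned above; KD-trees or approximate nearest-neighbor indexes mitigate it in practice.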
β­• K-Means Clustering
Unsupervised Clustering

How it works: Initialize k centroids β†’ assign each point to nearest centroid β†’ recompute centroids β†’ repeat until convergence

  • Choosing k: Elbow method (inertia), Silhouette score
  • K-Means++: Smart initialization to avoid bad local minima
  • Pros: Fast, scales to large datasets, simple
  • Cons: Assumes spherical clusters, must specify k, sensitive to outliers/scaling
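The assign-then-recompute loop above (Lloyd's algorithm) can be sketched as follows (a toy 2-D example; names are illustrative):

```python
from math import dist

def kmeans(points, centroids, iters=10):
    """K-Means sketch: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its assigned points."""
    for _ in range(iters):
        clusters = {i: [] for i in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(coord) for coord in zip(*members)) if members else centroids[i]
            for i, members in clusters.items()
        ]
    return centroids

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers = kmeans(pts, [(0, 0), (10, 10)])  # converges to the two cluster means
```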
πŸŒ€ DBSCAN
Unsupervised Density-based

How it works: Groups points that are closely packed together; marks outliers as noise. Expands clusters from core points.

  • Core point: Has β‰₯ min_samples neighbors within Ξ΅ radius
  • Key hyperparams: eps (neighborhood radius), min_samples
  • Pros: No need to specify k, finds arbitrary shapes, identifies outliers
  • Cons: Struggles with varying densities, sensitive to eps/min_samples
πŸ“‰ PCA (Dimensionality Reduction)
Unsupervised Dimensionality Reduction

How it works: Finds orthogonal axes (principal components) that capture maximum variance. Projects data onto top k components.

  • Steps: Standardize β†’ covariance matrix β†’ eigendecomposition β†’ select top k eigenvectors
  • Explained variance ratio: How much variance each component captures. Pick k where cumulative β‰₯ 95%
  • Pros: Reduces noise, speeds up training, enables 2D/3D visualization
  • Cons: Loses interpretability, linear only (use UMAP/t-SNE for non-linear)
πŸ“§ Naive Bayes
Classification Probabilistic
P(y|X) ∝ P(y) Γ— Ξ  P(xα΅’|y)
  • Assumes: Features are conditionally independent given class (often violated, but works well)
  • Variants: GaussianNB (continuous), MultinomialNB (word counts), BernoulliNB (binary)
  • Pros: Very fast, works well for text classification, small data
  • Cons: Independence assumption, poor probability calibration
🧬
Deep Learning
Neural Network Architecture

Layer Types

  • Dense (Fully Connected): Every neuron connected to every neuron in next layer. y = WΒ·x + b
  • Convolutional (Conv2D): Applies learnable filters to detect local patterns (edges, shapes)
  • Recurrent (LSTM/GRU): Maintains hidden state across sequence steps
  • Attention / Transformer: Computes pairwise relationships between all positions
  • Embedding: Maps discrete tokens to dense vectors
  • BatchNorm: Normalizes activations per mini-batch β†’ stable training
  • Dropout: Random neuron zeroing β†’ regularization

Activation Functions

Function | Formula | Use When
-------- | ------- | --------
ReLU | max(0, x) | Hidden layers (default)
Leaky ReLU | max(0.01x, x) | Avoid dying-ReLU problem
GELU | x·Φ(x) | Transformers (BERT, GPT)
Sigmoid | 1/(1+e⁻ˣ) | Binary output layer
Softmax | e^(xᵢ)/Σⱼ e^(xⱼ) | Multi-class output
Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | RNNs, [-1,1] output
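ReLU and a numerically stable softmax can be sketched as toy helpers (subtracting the max logit before exponentiating avoids overflow without changing the result):

```python
from math import exp

def relu(x):
    """max(0, x): cheap, non-saturating for positive inputs."""
    return max(0.0, x)

def softmax(xs):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(xs)
    exps = [exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # sums to 1; the largest logit gets the largest probability
```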
CNNs β€” Convolutional Neural Networks

How convolutions work: A filter (kernel) slides over input, computing dot products at each position β†’ creates feature map showing where patterns appear. Multiple filters detect multiple features.

  • Padding: 'same' keeps spatial dimensions, 'valid' reduces them
  • Stride: How many pixels the filter moves. Stride 2 halves spatial dimensions
  • Max Pooling: Downsamples by taking max in each window β†’ translation invariance
  • Global Average Pooling: Collapses spatial dims to 1Γ—1 β†’ used before final dense layer
Architecture | Key Innovation | Used For
------------ | -------------- | --------
VGG-16/19 | Deep with small 3×3 filters | Classification baseline
ResNet | Skip connections (residual blocks) — solve vanishing gradient | Deep classification, transfer learning
EfficientNet | Compound scaling (depth + width + resolution) | Efficient high-accuracy classification
YOLO | Single-pass real-time object detection | Object detection
U-Net | Encoder-decoder with skip connections | Image segmentation
Transformers & Attention Mechanism

Self-Attention: Each token attends to every other token. Computes how much each word should "focus on" other words in context.

Attention(Q, K, V) = softmax(QKα΅€ / √dβ‚–) Γ— V
  • Q, K, V: Query, Key, Value β€” learned linear projections of input
  • Multi-Head Attention: Run attention h times in parallel with different projections β†’ capture different types of relationships
  • Positional Encoding: Since attention is order-agnostic, add position information via sinusoidal encoding
  • BERT: Bidirectional encoder. Pre-trained with Masked Language Model (MLM) + Next Sentence Prediction (NSP). Used for understanding tasks (classification, NER, QA)
  • GPT: Decoder-only, autoregressive. Pre-trained to predict next token. Used for generation tasks.
  • T5: Encoder-Decoder. Frames all NLP tasks as text-to-text.
Interview Gold
"Why does attention scale by √dβ‚–?" β†’ Without scaling, large dβ‚– makes dot products large β†’ softmax becomes very peaked β†’ gradients vanish. Dividing by √dβ‚– keeps variance stable.
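The attention formula can be traced with plain lists (a toy 2-token, 2-dimensional example; the helper names are our own):

```python
from math import exp, sqrt

def matmul(A, B):
    """Standard matrix product of nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_row(xs):
    m = max(xs)
    exps = [exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    KT = [list(col) for col in zip(*K)]                      # transpose K
    scores = [[s / sqrt(d_k) for s in row] for row in matmul(Q, KT)]
    weights = [softmax_row(row) for row in scores]           # rows sum to 1
    return matmul(weights, V)                                # weighted sum of values

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = attention(Q, K, V)  # each query attends most to its matching key
```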
Backpropagation β€” Step by Step
  1. Forward pass: Compute activations layer by layer, store intermediate values (cache)
  2. Compute loss: Compare prediction Ε· with true label y using loss function
  3. Backward pass: Use chain rule to compute gradient of loss w.r.t. each parameter: βˆ‚L/βˆ‚w = (βˆ‚L/βˆ‚z)(βˆ‚z/βˆ‚w)
  4. Update weights: w = w - Ξ± Γ— βˆ‚L/βˆ‚w
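The four steps above can be traced on a single neuron with squared loss, then checked against a numerical gradient (a toy sketch; names are illustrative):

```python
from math import exp

def sigmoid(z):
    return 1 / (1 + exp(-z))

def forward_backward(w, b, x, y):
    """One neuron, squared loss. Forward pass caches z and a; backward pass
    applies the chain rule: dL/dw = dL/da * da/dz * dz/dw."""
    z = w * x + b                 # forward pass
    a = sigmoid(z)
    loss = (a - y) ** 2
    dL_da = 2 * (a - y)           # backward pass, outermost derivative first
    da_dz = a * (1 - a)           # sigmoid derivative
    dL_dw = dL_da * da_dz * x     # dz/dw = x
    dL_db = dL_da * da_dz         # dz/db = 1
    return loss, dL_dw, dL_db

loss, gw, gb = forward_backward(w=0.5, b=0.0, x=2.0, y=1.0)

# Sanity check against a central-difference numerical gradient
eps = 1e-6
num_gw = (forward_backward(0.5 + eps, 0.0, 2.0, 1.0)[0]
          - forward_backward(0.5 - eps, 0.0, 2.0, 1.0)[0]) / (2 * eps)
```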
Vanishing Gradient Problem
In deep networks, gradients multiplied through many sigmoid/tanh layers shrink to near 0 β†’ early layers learn very slowly. Solutions: ReLU activation, batch normalization, skip connections (ResNets), gradient clipping.
πŸ“
NLP & Large Language Models
Text Preprocessing Pipeline
# Standard NLP preprocessing pipeline
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time setup: nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')

def preprocess(text):
    text = text.lower()                              # lowercase
    text = re.sub(r'[^a-z\s]', '', text)             # keep letters and whitespace only
    tokens = word_tokenize(text)                     # tokenize
    stop_words = set(stopwords.words('english'))     # build the set once, not per token
    tokens = [t for t in tokens if t not in stop_words]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return ' '.join(tokens)
Text Representations β€” Evolution
Method | How It Works | Captures Semantics? | Use Case
------ | ------------ | ------------------- | --------
Bag of Words | Word-count vector (no order) | No | Simple classification
TF-IDF | TF × log(N/df) — penalizes common words | No | Information retrieval, search
Word2Vec | CBOW or Skip-gram neural network; king - man + woman ≈ queen | Yes (local) | Word similarity, analogies
GloVe | Matrix factorization on co-occurrence statistics | Yes (global) | Word similarity at scale
BERT Embeddings | Contextual — same word gets different embeddings in different sentences | Yes (deep) | All modern NLP tasks
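The TF-IDF row above can be sketched in a few lines (a toy example; note how a word appearing in every document gets idf = log(N/df) = 0 and therefore no weight):

```python
from math import log

def tf_idf(docs):
    """Toy TF-IDF: tf = raw count in the doc, idf = log(N / df).
    Rare, discriminative words score higher than ubiquitous ones."""
    N = len(docs)
    vocab = {w for doc in docs for w in doc}
    df = {w: sum(w in doc for doc in docs) for w in vocab}   # document frequency
    return [
        {w: doc.count(w) * log(N / df[w]) for w in set(doc)}
        for doc in docs
    ]

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
scores = tf_idf(docs)
# "the" appears in all 3 docs -> idf = log(3/3) = 0 -> TF-IDF score 0 everywhere
```

Production implementations (e.g., sklearn's TfidfVectorizer) add smoothing and normalization on top of this basic formula.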
Fine-tuning LLMs with Hugging Face
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch

# Load pretrained model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize dataset
def tokenize(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)

# Training configuration
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,         # Low LR for fine-tuning!
    warmup_steps=500,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(model=model, args=training_args,
                  train_dataset=tokenized["train"], eval_dataset=tokenized["test"])
trainer.train()
πŸš€
MLOps
MLOps Lifecycle
Stage | Tools | Key Tasks
----- | ----- | ---------
Data Management | DVC, Delta Lake, Feast | Versioning data, feature store, data lineage
Experiment Tracking | MLflow, W&B, Neptune | Log params, metrics, artifacts; compare runs
Model Training | sklearn, PyTorch, TF, Ray | Distributed training, hyperparameter tuning
Model Registry | MLflow Registry, Vertex AI | Version models, stage (staging/prod), lineage
Serving/Deployment | FastAPI, TorchServe, KServe, SageMaker | REST/gRPC endpoints, batch inference, A/B
Monitoring | Evidently AI, Seldon, Prometheus | Data drift, model drift, concept-drift alerts
MLflow Experiment Tracking
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score

mlflow.set_experiment("fraud_detection_v2")

with mlflow.start_run(run_name="rf_baseline"):
    # Log hyperparameters
    params = {"n_estimators": 200, "max_depth": 8, "class_weight": "balanced"}
    mlflow.log_params(params)

    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # Log metrics
    mlflow.log_metric("train_f1", f1_score(y_train, model.predict(X_train)))
    mlflow.log_metric("val_f1",   f1_score(y_val,   model.predict(X_val)))
    mlflow.log_metric("val_auc",  roc_auc_score(y_val, model.predict_proba(X_val)[:,1]))

    # Save model to registry
    mlflow.sklearn.log_model(model, "model", registered_model_name="FraudDetector")
FastAPI Model Serving
from fastapi import FastAPI
from pydantic import BaseModel
import mlflow.pyfunc
import numpy as np

app = FastAPI()
model = mlflow.pyfunc.load_model("models:/FraudDetector/Production")

class PredictionRequest(BaseModel):
    amount: float
    merchant_category: str
    hour_of_day: int
    user_avg_spend: float

class PredictionResponse(BaseModel):
    fraud_probability: float
    is_fraud: bool

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    # NOTE: merchant_category would need the same encoding used at training time
    features = np.array([[request.amount, request.hour_of_day, request.user_avg_spend]])
    proba = float(model.predict(features)[0])   # cast numpy scalar for JSON serialization
    return PredictionResponse(fraud_probability=proba, is_fraud=proba > 0.5)

# Run: uvicorn app:app --host 0.0.0.0 --port 8000
πŸ’»
Essential Code Snippets
Complete sklearn Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['job_type', 'education']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot',  OneHotEncoder(handle_unknown='ignore', sparse_output=False)),  # 'sparse' renamed in sklearn >= 1.2
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer,   numeric_features),
    ('cat', categorical_transformer, categorical_features),
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier',   GradientBoostingClassifier()),
])

# Grid search with cross-validation
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth':    [3, 5],
    'classifier__learning_rate': [0.05, 0.1],
}
cv = GridSearchCV(pipeline, param_grid, cv=StratifiedKFold(5), scoring='roc_auc', n_jobs=-1)
cv.fit(X_train, y_train)
print(f"Best AUC: {cv.best_score_:.4f}")
print(f"Best params: {cv.best_params_}")
PyTorch Training Loop
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

class MLPClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dims, num_classes, dropout=0.3):
        super().__init__()
        layers, in_dim = [], input_dim
        for h in hidden_dims:
            layers += [nn.Linear(in_dim, h), nn.BatchNorm1d(h), nn.GELU(), nn.Dropout(dropout)]
            in_dim = h
        layers.append(nn.Linear(in_dim, num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x): return self.net(x)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = MLPClassifier(50, [256, 128, 64], 2).to(device)
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=50)
criterion = nn.CrossEntropyLoss()

for epoch in range(50):
    model.train()
    train_loss = 0
    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        optimizer.zero_grad()
        loss = criterion(model(X_batch), y_batch)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
        optimizer.step()
        train_loss += loss.item()
    scheduler.step()

    model.eval()
    with torch.no_grad():
        val_preds = model(X_val.to(device)).argmax(dim=1).cpu()
    if epoch % 10 == 0:
        val_acc = (val_preds == y_val).float().mean().item()
        print(f"Epoch {epoch}: loss={train_loss/len(train_loader):.4f}, val_acc={val_acc:.3f}")
πŸ› οΈ
Tools & Stack
Tool | Category | Key Use | Must-Know Commands/API
---- | -------- | ------- | ----------------------
scikit-learn | ML Framework | Classical ML, preprocessing, evaluation | Pipeline, GridSearchCV, cross_val_score, train_test_split
PyTorch | Deep Learning | Research, custom models, NLP | nn.Module, DataLoader, optimizer.zero_grad(), loss.backward()
TensorFlow/Keras | Deep Learning | Production deployment, mobile | model.compile(), model.fit(), model.predict(), tf.data
Hugging Face | NLP/LLMs | Pretrained models, fine-tuning, inference | AutoTokenizer, pipeline(), Trainer, from_pretrained()
XGBoost/LightGBM | Boosting | Tabular data, competitions, production | xgb.train(), early_stopping_rounds, feature_importance
MLflow | MLOps | Experiment tracking, model registry | mlflow.log_params(), log_metric(), log_model(), load_model()
W&B (Weights & Biases) | MLOps | Rich experiment dashboards, sweeps | wandb.init(), wandb.log(), wandb.sweep()
pandas | Data | Data manipulation, EDA | groupby, merge, apply, pivot_table, read_csv, to_sql
NumPy | Numerical | Array operations, linear algebra | np.dot, np.stack, np.where, np.argmax, broadcasting
ONNX | Deployment | Framework-agnostic model format | torch.onnx.export(), onnxruntime.InferenceSession
🎯
Top 30 Interview Questions & Answers
Q1. Explain the bias-variance tradeoff and how you manage it in practice.
Answer: The bias-variance tradeoff captures two competing sources of error in ML models. Bias is error from incorrect assumptions β€” a highly biased model (like linear regression on non-linear data) fails to capture the true pattern (underfitting). Variance is error from sensitivity to small fluctuations in training data β€” a high-variance model (like a deep decision tree) memorizes noise and fails to generalize (overfitting). Total error = BiasΒ² + Variance + Irreducible noise. In practice: if train error is high β†’ high bias β†’ use a more complex model or add features. If train error is low but val error is high β†’ high variance β†’ add regularization, get more data, use dropout, reduce model complexity. Techniques like k-fold cross-validation help detect the issue, and learning curves (train vs. validation error vs. training set size) help diagnose it visually.
Q2. How does backpropagation work? Walk me through the algorithm.
Answer: Backpropagation is an algorithm to efficiently compute gradients of the loss w.r.t. all parameters using the chain rule. Steps: (1) Forward pass: compute output of each layer and store activations/cached values. (2) Compute loss: compare output Ε· to label y. (3) Backward pass: starting from the loss, compute βˆ‚L/βˆ‚aα΄Έ (gradient of loss w.r.t. final activation), then propagate backwards: βˆ‚L/βˆ‚Wα΅’ = βˆ‚L/βˆ‚aα΅’ Γ— βˆ‚aα΅’/βˆ‚zα΅’ Γ— βˆ‚zα΅’/βˆ‚Wα΅’. The key insight is the chain rule: the gradient of the loss w.r.t. a weight equals the gradient of the loss w.r.t. that layer's output, times the local gradient. (4) Update: w = w - Ξ± Γ— βˆ‚L/βˆ‚w. The vanishing gradient problem occurs when gradients shrink exponentially in deep networks β€” solved by ReLU activations, batch normalization, and residual connections.
Q3. What is the attention mechanism in Transformers? Why is it better than RNNs?
Answer: Attention computes pairwise relationships between all tokens simultaneously: Attention(Q,K,V) = softmax(QKα΅€/√dβ‚–)V. Each query looks at all keys to compute relevance scores, then uses those to weight-sum the values. Benefits over RNNs: (1) Parallelism: RNNs process tokens sequentially β€” can't parallelize. Transformers compute all positions simultaneously. (2) Long-range dependencies: RNNs struggle with very long sequences (vanishing gradient). Attention directly connects any two positions with O(1) operations. (3) Multi-head attention allows the model to jointly attend to different representation subspaces β€” one head may capture syntax, another semantics. The scaling by √dβ‚– prevents dot products from growing too large in high dimensions, which would make softmax gradients vanish.
Q4. When would you use XGBoost vs. a Neural Network?
Answer: XGBoost when: tabular/structured data with mixed feature types, interpretability matters, limited training data (<100K rows), fast training/iteration cycle needed, features require less engineering. Neural Networks when: unstructured data (images, text, audio, video), very large datasets (millions of rows), automatically learning hierarchical features is valuable, state-of-art performance is needed for perception tasks. Rule of thumb: for tabular data, try XGBoost first β€” it wins Kaggle competitions on tabular data. For anything involving sequences, images, or text, go neural. Also consider ensemble approaches (stacking XGBoost with NN outputs as features).
Q5. How do you handle imbalanced datasets?
Answer: Multiple strategies: (1) Resampling: SMOTE oversamples minority class by interpolating between existing minority samples; random undersampling removes majority samples. (2) Class weights: class_weight='balanced' in sklearn β€” increases the penalty for misclassifying minority class. (3) Threshold tuning: default 0.5 threshold is often wrong β€” optimize threshold using F1 or precision-recall curve on validation set. (4) Choose right metrics: Accuracy is misleading (a model predicting all majority is 99% accurate on 1% minority data). Use F1, precision-recall AUC, or Cohen's kappa. (5) Algorithm choice: tree ensembles (XGBoost's scale_pos_weight), anomaly detection (Isolation Forest) for extreme imbalance. (6) In production: monitor per-class performance separately.
Q6. What is the difference between bagging and boosting?
Answer: Bagging (Bootstrap Aggregating) trains N models in parallel on bootstrap samples (random subsets with replacement), then averages predictions. Reduces variance. Example: Random Forest. Each tree is independent and sees different data. Boosting trains models sequentially, each one focusing on samples the previous model got wrong. The final prediction is a weighted sum. Reduces bias. Examples: AdaBoost, XGBoost, LightGBM. Key differences: Bagging is parallelizable and more robust to overfitting; boosting is sequential, can overfit with too many estimators, but typically achieves better accuracy. Boosting tends to outperform bagging on tabular data when tuned properly.
Q7. How do you prevent overfitting in neural networks?
Answer: Multiple techniques: (1) Dropout: randomly zero out neurons during training (p=0.1–0.5 depending on layer size) β€” forces learning of robust features. (2) L1/L2 regularization (weight decay): adds penalty to large weights. (3) Early stopping: monitor val loss, stop when it starts increasing β€” simplest and most effective. (4) Data augmentation: for images β€” flips, crops, color jitter; for text β€” back-translation, synonym substitution. (5) Batch normalization: normalizes activations, acts as mild regularizer. (6) More data: most reliable fix. (7) Simpler architecture: reduce layers, hidden units. (8) Learning rate scheduling: cosine annealing helps escape sharp minima.
Q8. Explain the curse of dimensionality.
Answer: As the number of dimensions (features) increases, the volume of the space grows exponentially, so the available data becomes sparse. This causes several problems: (1) Distance-based algorithms (KNN, K-Means) fail because all points become approximately equidistant in high dimensions β€” "distance" loses meaning. (2) More dimensions require exponentially more training data to maintain the same density. (3) Models with many features are harder to train and prone to overfitting. Solutions: dimensionality reduction (PCA, UMAP, autoencoders), feature selection (L1 regularization, mutual information), domain knowledge to select relevant features, tree-based models which are more resistant (they select features at each split).
Q9. What is data leakage and how do you prevent it?
Answer: Data leakage occurs when information that would not be available at prediction time (from the test set, or from the future) makes its way into training, causing artificially inflated validation metrics that don't reflect real-world performance. Types: (1) Train-test contamination: normalizing using the entire dataset before splitting — use fit() on train only, then transform() on both. (2) Target leakage: using features that are causally downstream of the target (e.g., using "claim amount" to predict "insurance fraud"). (3) Temporal leakage: using future data to predict past events — always split time-series data chronologically. Prevention: always fit preprocessing inside cross-validation folds; use sklearn Pipelines; be skeptical of unrealistically good results; audit features carefully for temporal ordering.
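The train-test contamination fix can be sketched as a leakage-safe standardizer, mirroring sklearn's fit-on-train / transform-both pattern (toy helpers; the names are our own):

```python
def fit_scaler(train):
    """Compute mean and std on TRAINING data only (leakage-safe)."""
    mean = sum(train) / len(train)
    var = sum((x - mean) ** 2 for x in train) / len(train)
    return mean, var ** 0.5

def transform(values, mean, std):
    """Apply previously fitted statistics to any split."""
    return [(x - mean) / std for x in values]

train, test = [1.0, 2.0, 3.0, 4.0], [10.0]
mean, std = fit_scaler(train)              # fit on train only...
scaled_test = transform(test, mean, std)   # ...then reuse the SAME stats on test
```

Fitting on train+test instead would pull the mean toward the test outlier and silently leak test-set information into the model.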
Q10. How do you evaluate and compare machine learning models?
Answer: Start with the right metric for the task: F1/AUC for imbalanced classification, RMSE/MAE for regression, NDCG for ranking. Use stratified k-fold cross-validation (k=5 or 10) β€” never evaluate on a single train/test split. Report confidence intervals across folds. Check for statistical significance (McNemar's test for classification). Beyond accuracy: calibration (are probabilities accurate β€” use Platt scaling or isotonic regression if not), inference time (latency matters in production), memory footprint, interpretability (SHAP values), and fairness metrics (demographic parity, equalized odds across subgroups). In production: shadow mode (run new model alongside old, compare on real traffic), A/B testing, monitoring for data drift.
Q11. What is transfer learning and when do you use it?
Answer: Transfer learning reuses a model pre-trained on a large dataset (ImageNet, Wikipedia+BooksCorpus for BERT) for a different but related task. When to use: limited labeled data for your specific task, your task is in the same domain as the pre-training data, you need fast iteration. How to fine-tune: (1) Feature extraction: freeze all pretrained weights, add new classification head, train only the head. (2) Full fine-tuning: unfreeze all layers, train end-to-end with very small learning rate (2e-5 for BERT). (3) Gradual unfreezing: start with head, then progressively unfreeze layers (ULMFiT approach). Key: pretrained weights are a warm start β€” they've already learned low-level features (edges for vision, syntax for NLP), so your model needs less data to learn high-level task-specific features.
Q12. How does BERT work? What makes it different from GPT?
Answer: BERT (Bidirectional Encoder Representations from Transformers) is a transformer encoder pre-trained with two objectives: (1) Masked Language Model (MLM): randomly mask 15% of input tokens and predict them β€” forces bidirectional context understanding. (2) Next Sentence Prediction (NSP): given two sentences, predict whether the second follows the first. BERT is bidirectional β€” each token attends to all other tokens in both directions. GPT is a decoder-only autoregressive model β€” each token attends only to previous tokens (causal/left-to-right masking) and is trained to predict the next token. BERT excels at understanding tasks (classification, NER, extractive Q&A). GPT excels at generation tasks (text completion, dialogue, summarization). Modern LLMs like GPT-4 use only the decoder (generative pre-training), while models like T5 use encoder-decoder for sequence-to-sequence tasks.
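The MLM objective is easy to sketch in plain Python. This is simplified: real BERT replaces chosen tokens 80% with [MASK], 10% with a random token, and leaves 10% unchanged; here every chosen token becomes [MASK].

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Simplified BERT-style corruption: hide ~15% of tokens and
    remember the originals as the prediction targets."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            targets[i] = tok           # the model must recover this token
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = "the cat sat on the mat because it was tired".split()
corrupted, targets = mask_tokens(tokens)
```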
Q13. How would you deploy a machine learning model to production?
Answer: Full deployment pipeline: (1) Serialization: save model (joblib for sklearn, torch.save, ONNX for cross-framework compatibility). (2) API: wrap in FastAPI/Flask REST endpoint with input validation (Pydantic), error handling, and request logging. (3) Containerization: Docker container with all dependencies β€” ensures reproducibility. (4) Orchestration: Kubernetes for auto-scaling, rolling updates, health checks. (5) CI/CD: automated testing (unit + integration), model validation checks (performance must exceed threshold), gradual rollout (canary). (6) Monitoring: log predictions and inputs, detect data drift (distribution shift in features), model drift (performance degradation), set up alerts. (7) Retraining trigger: schedule or drift-based. Always keep the previous model version as fallback.
Q14. What is a confusion matrix and how do you interpret it?
Answer: A confusion matrix shows the breakdown of actual vs. predicted classes. For binary classification: TP (predicted positive, actually positive), FP (predicted positive, actually negative β€” Type I error), FN (predicted negative, actually positive β€” Type II error), TN (predicted negative, actually negative). Key metrics derived: Precision = TP/(TP+FP) β€” "of all predicted positives, how many are correct". Recall/Sensitivity = TP/(TP+FN) β€” "of all actual positives, how many did we catch". Specificity = TN/(TN+FP). F1 = harmonic mean of P and R. In practice: high precision, low recall β†’ model is conservative (misses positives). High recall, low precision β†’ model is liberal (many false alarms). Choose based on cost: medical diagnosis β†’ maximize recall (missing cancer is worse than false alarm); spam detection β†’ maximize precision (false positives annoy users).
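The derived metrics are one-liners; a pure-Python sketch with hypothetical counts:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard metrics derived from a binary confusion matrix."""
    precision = tp / (tp + fp)      # of predicted positives, how many correct
    recall = tp / (tp + fn)         # of actual positives, how many caught
    specificity = tn / (tn + fp)    # of actual negatives, how many correct
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, f1

# Hypothetical screening model: 80 caught, 20 false alarms,
# 10 missed, 890 correct negatives.
p, r, s, f1 = confusion_metrics(tp=80, fp=20, fn=10, tn=890)
# p = 80/100 = 0.8, r = 80/90 ~ 0.889
```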
Q15. What is regularization and why is it necessary?
Answer: Regularization adds a penalty term to the loss function to discourage overly complex models that memorize training data. Without regularization, a model can fit training data perfectly (zero training loss) by learning arbitrary decision boundaries that don't generalize. L1 (Lasso) adds Ξ»|w| β€” penalty proportional to absolute value of weights. This drives some weights exactly to zero, performing feature selection. L2 (Ridge) adds Ξ»wΒ² β€” penalty proportional to squared weights. This shrinks all weights toward zero but rarely to exactly zero. ElasticNet combines both. The hyperparameter Ξ» controls the regularization strength β€” too high causes underfitting, too low has no effect. Tune via cross-validation. In neural networks, L2 is called weight decay, and dropout acts as a stochastic regularizer by preventing co-adaptation of neurons.
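The feature-selection effect of L1 can be demonstrated directly (a sketch assuming scikit-learn; synthetic data where only two of ten features matter):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only features 0 and 1 are informative; the other 8 are pure noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

n_zero_l1 = int(np.sum(lasso.coef_ == 0))  # L1 zeroes out irrelevant features
n_zero_l2 = int(np.sum(ridge.coef_ == 0))  # L2 only shrinks, rarely exactly zero
```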
Q16. Explain gradient descent and its variants (SGD, Adam, RMSprop).
Answer: Gradient descent minimizes loss by iteratively moving parameters in the direction of steepest descent: w = w - Ξ±βˆ‡L. Variants: SGD: w = w - Ξ±βˆ‡L(single sample) β€” noisy but fast, can escape local minima, good for large datasets. SGD + Momentum: maintains velocity v = Ξ²v + Ξ±βˆ‡L, w = w - v β€” smooths oscillations, faster convergence. RMSprop: adapts the learning rate per parameter using a running mean of squared gradients β€” good for non-stationary objectives. Adam: combines momentum (first moment) + RMSprop (second moment), with bias correction. Default go-to optimizer. Hyperparams: β₁=0.9, Ξ²β‚‚=0.999, Ξ΅=1e-8. AdamW: Adam with decoupled weight decay β€” often better for transformers. In practice: start with Adam; if overfitting, try SGD + momentum (often better generalization); for transformers, use AdamW with linear warmup + cosine decay.
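The update rules can be written out directly; a numpy sketch minimizing the toy loss L(w) = (w - 3)Β² with both SGD + momentum and Adam (hyperparameters as quoted above; the toy loss and step counts are illustrative):

```python
import numpy as np

def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2 * (w - 3.0)

# SGD + momentum: v = beta*v + alpha*grad ; w -= v
w, v = 0.0, 0.0
for _ in range(200):
    v = 0.9 * v + 0.05 * grad(w)
    w -= v
w_momentum = w

# Adam: first/second moment estimates with bias correction
w, m, s = 0.0, 0.0, 0.0
b1, b2, alpha, eps = 0.9, 0.999, 0.1, 1e-8
for t in range(1, 201):
    g = grad(w)
    m = b1 * m + (1 - b1) * g          # first moment (momentum)
    s = b2 * s + (1 - b2) * g * g      # second moment (RMSprop-style)
    m_hat = m / (1 - b1 ** t)          # bias correction
    s_hat = s / (1 - b2 ** t)
    w -= alpha * m_hat / (np.sqrt(s_hat) + eps)
w_adam = w
```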
Q17. How do you choose hyperparameters for a model?
Answer: Methods in increasing sophistication: (1) Manual search: understand which hyperparameters matter most (learning rate is #1 for NNs, max_depth for trees). (2) Grid Search: exhaustive search over a specified grid. Works for small grids (<100 combos). (3) Random Search: sample random combos β€” often ~3x more efficient than grid for the same compute budget (Bergstra et al.). (4) Bayesian Optimization: builds a surrogate model of the objective function and samples promising regions (Optuna, Hyperopt) β€” typically far more sample-efficient than random search. (5) Population-Based Training (PBT): parallelize and adaptively reallocate compute. Key insights: learning rate is most important β€” always search it log-uniformly (0.0001 to 0.1). Regularization strength: log-uniform. Number of layers/units: linear. Use early stopping inside the hyperparameter search to save time.
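Random search with a log-uniform learning-rate prior is only a few lines. The objective here is a hypothetical stand-in for validation loss, with its minimum placed at lr = 1e-2:

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Sample log-uniformly in [low, high] -- the right prior for learning rates."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

def validation_loss(lr):
    # Hypothetical stand-in: quadratic in log-space, best at lr = 1e-2.
    return (math.log10(lr) + 2) ** 2

rng = random.Random(0)
trials = [sample_log_uniform(1e-4, 1e-1, rng) for _ in range(20)]
best_lr = min(trials, key=validation_loss)
```

Bayesian optimizers like Optuna replace the random sampler with a surrogate model but keep the same loop shape.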
Q18. What is a ROC curve and AUC? When would you use PR curve instead?
Answer: ROC curve plots True Positive Rate (Recall) vs. False Positive Rate (1 - Specificity) at every classification threshold. AUC (Area Under Curve) = probability that the model ranks a random positive higher than a random negative. AUC=0.5 β†’ random, AUC=1.0 β†’ perfect. ROC is threshold-independent and works well when classes are balanced. The Precision-Recall curve is better for imbalanced datasets: with a 1% positive class, the huge pool of true negatives keeps FPR low even when the model produces many false positives, so the ROC curve looks deceptively good. The PR curve shows the tradeoff between precision and recall directly β€” more informative when positives are rare (fraud detection, disease screening). Rule of thumb: heavily imbalanced classes β†’ PR-AUC; roughly balanced classes β†’ ROC-AUC.
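The ranking interpretation of AUC can be computed straight from its definition (pure Python, illustrative scores):

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC = probability that a random positive is scored above a random
    negative; ties count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])  # 8 of 9 pairs correct
```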
Q19. What are embeddings and why are they important?
Answer: Embeddings are dense, low-dimensional vector representations of high-cardinality discrete inputs (words, users, products, categories). Instead of one-hot encoding (sparse, no semantic meaning), embeddings place similar items close in vector space β€” learned from co-occurrence patterns. Why important: (1) Dimensionality reduction β€” 50,000 words β†’ 300-dim vectors. (2) Encode semantic similarity β€” "king" - "man" + "woman" β‰ˆ "queen". (3) Enable generalization β€” model can handle unseen words similar to trained words. Used in: NLP (word2vec, GloVe, BERT), recommender systems (user/item embeddings for collaborative filtering), e-commerce (product embeddings for similarity search), knowledge graphs. In practice: use pre-trained embeddings for NLP, learn task-specific embeddings for entities, store in FAISS/Pinecone for ANN search.
Q20. How do you detect and handle data drift in production?
Answer: Data drift occurs when the statistical properties of production data differ from training data. Types: Covariate shift: input distribution P(X) changes. Label shift: output distribution P(Y) changes. Concept drift: the relationship P(Y|X) changes (e.g., user behavior changes seasonally). Detection methods: (1) Statistical tests β€” KS test (continuous), chi-squared (categorical), Population Stability Index (PSI). (2) Monitor feature means, standard deviations, null rates over time. (3) Monitor the model output distribution (prediction drift). (4) Monitor actual performance when labels become available (performance drift). Tools: Evidently AI, WhyLogs, Great Expectations. Response: retrain on recent data; use domain adaptation techniques; alert for human review when thresholds are exceeded. A common rule of thumb: PSI > 0.25 signals significant drift requiring action.
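PSI itself is a short numpy function (a sketch; the bin count and the 1e-6 floor are common but arbitrary choices):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline (training) sample
    and a production sample of one feature."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) and division by zero.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)   # training-time feature distribution
same = rng.normal(0, 1, 10_000)       # production sample, no drift
shifted = rng.normal(1, 1, 10_000)    # production sample with a mean shift
```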
Q21–Q30 (Additional Critical Questions)
Q21. What is SHAP and how do you use it for model interpretability? SHAP (SHapley Additive exPlanations) assigns each feature a value representing its contribution to a prediction, grounded in game-theoretic Shapley values. The SHAP values for a prediction sum to the model output minus the expected (baseline) output. Use shap.TreeExplainer (fast for tree models), shap.DeepExplainer (NNs), shap.KernelExplainer (any model). Waterfall plots for individual predictions; summary plots for global feature importance.

Q22. Explain L1 vs L2 loss functions. L1 loss (MAE) = |y - Ε·|. Robust to outliers (gradient = constant). L2 loss (MSE) = (y - Ε·)Β². Penalizes large errors more heavily, gradient = 2(y - Ε·). Huber loss combines both β€” MAE beyond threshold Ξ΄, MSE within Ξ΄. Use MAE when outliers are real signal; MSE when outliers are noise.
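Huber takes only a few lines (delta = 1.0 is the common default; the example values are illustrative):

```python
def huber(y, y_hat, delta=1.0):
    """Huber loss: quadratic (MSE-like) for small errors, linear (MAE-like)
    beyond |error| = delta -- robust to outliers yet smooth at zero."""
    err = abs(y - y_hat)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * (err - 0.5 * delta)

small = huber(2.0, 1.5)    # |err| = 0.5 -> 0.5 * 0.25 = 0.125 (quadratic regime)
large = huber(10.0, 1.0)   # |err| = 9  -> 1 * (9 - 0.5) = 8.5 (linear regime)
```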

Q23. What is batch normalization and why does it help? After each layer, normalize activations: xΜ‚ = (x - ΞΌ) / Οƒ, then apply learnable scale Ξ³ and shift Ξ². Benefits: (1) Reduces internal covariate shift β€” stabilizes training. (2) Allows higher learning rates. (3) Acts as mild regularizer (adds noise from mini-batch statistics). (4) Reduces sensitivity to weight initialization.

Q24. How do CNNs handle spatial invariance? Weight sharing β€” the same filter is applied at all spatial positions. Max pooling β€” takes the maximum activation in each window, discarding exact position. Together these mean a cat in the top-left produces much the same response as a cat in the bottom-right. Data augmentation (random crops, flips) further improves invariance.

Q25. What is the difference between generative and discriminative models? Discriminative models (logistic regression, SVM, random forest) learn P(Y|X) β€” directly model the decision boundary. Generative models (Naive Bayes, GMM, VAE, GANs) learn P(X,Y) β€” model how data is generated, can generate new samples. Generative models are more powerful but harder to train and require more data.

Q26. How does a GAN work? Generator G learns to produce fake data G(z) from noise z. Discriminator D learns to distinguish real from fake. They play a minimax game: min_G max_D [logD(x) + log(1-D(G(z)))]. Training instability is the main challenge β€” use techniques like WGAN (Wasserstein distance), spectral normalization, progressive growing.

Q27. What is few-shot learning? Learning new concepts from very few examples (1-shot, 5-shot). Approaches: (1) Meta-learning (MAML) β€” learn to learn; the model's initialization is optimized for quick adaptation. (2) Prototypical Networks β€” compute a prototype embedding per class, classify by nearest prototype. (3) In-context learning with pretrained LLMs β€” put the few examples directly in the prompt; no weight updates at all.

Q28. Explain LoRA (Low-Rank Adaptation) for fine-tuning LLMs. LoRA freezes the pretrained weights W and adds trainable low-rank matrices: W' = W + BA where B∈Rᡈˣʳ, A∈Rʳˣᡏ, r β‰ͺ min(d,k). Only A and B are trained β€” cutting trainable parameters by up to ~10,000Γ— at GPT-3 scale while maintaining performance. r=4–16 works for most tasks. Used for Alpaca and LLaMA fine-tuning.
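The parameter arithmetic is worth verifying once by hand; a numpy sketch with hypothetical layer sizes (the paper's init, Gaussian A and zero B, keeps W' equal to W at step 0):

```python
import numpy as np

d, k, r = 512, 512, 8              # hypothetical layer sizes and LoRA rank
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))        # frozen pretrained weight
B = np.zeros((d, r))               # trainable, initialized to zero...
A = rng.normal(size=(r, k)) * 0.01 # ...so W' == W at the start of training

W_prime = W + B @ A                # effective weight during fine-tuning

full_params = d * k                # what full fine-tuning would train
lora_params = d * r + r * k        # only A and B are trained
reduction = full_params / lora_params
```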

Q29. What is RAG (Retrieval Augmented Generation)? Combines retrieval with generation. Query β†’ retrieve relevant documents from vector DB (FAISS, Pinecone) β†’ feed documents + query to LLM β†’ generate grounded answer. Solves: LLM hallucination, stale knowledge cutoff, private knowledge injection. Key components: embedding model (for retrieval), vector DB, chunking strategy, LLM for generation.
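The retrieval step reduces to a top-k cosine-similarity search; a numpy sketch with random stand-in embeddings (a real system would use an embedding model plus FAISS/Pinecone):

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k documents most similar to the query by
    cosine similarity -- what a vector DB does under the hood."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    return np.argsort(sims)[::-1][:k]

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 64))                    # stand-in document embeddings
query = docs[3] + rng.normal(scale=0.1, size=64)   # query close to document 3

retrieved = top_k(query, docs, k=2)
# The retrieved chunks would then be concatenated with the question
# into the LLM prompt to ground the generated answer.
```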

Q30. How do you design an ML system for real-time fraud detection? Requirements: sub-100ms latency, high recall, online learning. Architecture: features served from feature store (Redis for real-time, S3 for batch); rule-based pre-filter for obvious fraud; ML model (LightGBM or neural network) for scoring; ensemble/cascade for cost efficiency; post-processing (threshold, velocity rules). Key features: amount deviation from user baseline, merchant category, geolocation velocity (impossible travel), time of day, network features (shared device/email). Feedback loop: labeled fraud cases retrain model weekly. A/B test new models in shadow mode.
πŸ—οΈ
System Design Patterns
Real-Time Recommendation System
  • Candidate Generation: Matrix factorization or two-tower model β†’ retrieve top-1000 candidates from vector DB
  • Ranking: DNN ranker with user/item/context features β†’ score candidates
  • Feature Store: Redis for real-time features (clicks last 1h), Kafka for streaming updates
  • A/B testing: Multi-armed bandit for exploration vs. exploitation
  • Cold start: Content-based fallback for new users/items
  • Latency target: <50ms using pre-computed user embeddings
ML Training Pipeline at Scale
  • Data: Raw data in S3 β†’ Spark for feature engineering β†’ Feature store
  • Training: Distributed training (PyTorch DDP or Ray Train) on GPU cluster
  • Experiment tracking: MLflow β€” log all runs, compare metrics
  • CI/CD: Automated model quality gates (AUC must be > baseline before promoting)
  • Serving: Model registry β†’ Docker image β†’ Kubernetes deployment
  • Monitoring: Evidently AI dashboards for drift; PagerDuty alerts
⚠️
Common Mistakes to Avoid
Mistake | Why It's Wrong | Correct Approach
Fitting StandardScaler on entire dataset | Leaks test distribution into training | Fit only on train, transform both train and test
Using accuracy for imbalanced data | Misleading β€” 99% accuracy on a 1%-positive dataset by predicting all negative | Use F1, AUC, or PR-AUC
Not shuffling before train/test split | Temporal or class patterns in data order | Use shuffle=True or a stratified split
Tuning hyperparameters on the test set | Overfits to the test set β€” optimistic estimate | Use a validation set or nested CV for tuning
Ignoring class imbalance | Model biased toward majority class | SMOTE, class_weight, threshold tuning
Not checking for data leakage | Unrealistic validation metrics | Audit all features for temporal leakage
Learning rate too large | Loss diverges or oscillates | Start small; use an LR finder (fast.ai method)
Not setting random seeds | Irreproducible experiments | Set np.random.seed, torch.manual_seed, random.seed
πŸ“–
Terminology Glossary
Term | Definition
Epoch | One full pass through the entire training dataset
Batch | Subset of training data used in one gradient update
Iteration | One weight update (one batch processed)
Hyperparameter | Configuration set before training (learning rate, depth), as opposed to learned parameters (weights)
Inductive Bias | Assumptions a model makes about the problem (CNNs assume spatial locality)
Generalization | Model's ability to perform well on unseen data
Calibration | Alignment between predicted probabilities and actual frequencies
Latent Space | Compressed representation learned by a model (e.g., autoencoder bottleneck)
Tokenization | Splitting text into tokens (words, subwords, characters) for NLP models
Fine-tuning | Further training a pre-trained model on a specific task with a small learning rate
Perplexity | Measure of how well an LLM predicts a sequence. Lower = better. 2^(cross-entropy)
Temperature | Controls LLM output randomness. High temp β†’ diverse, creative. Low β†’ deterministic
Hallucination | LLM generates plausible-sounding but factually incorrect information
RAG | Retrieval-Augmented Generation β€” ground the LLM with retrieved documents
Quantization | Reduce model precision (FP32 β†’ INT8) to shrink size and speed up inference
Pruning | Remove low-importance weights/neurons to compress a model
Distillation | Train a small student model to mimic a large teacher model's outputs
Ensemble | Combine predictions of multiple models for better performance
Stacking | Use other models' predictions as input features for a meta-model
RLHF | Reinforcement Learning from Human Feedback β€” how GPT is aligned with human preferences

Thejaslearning β€” AI/ML Engineer Cheat Sheet