Everything you need to crack any ML interview: concepts, code, Q&A, system design
| Concept | Formula / Key Fact | When to Use |
|---|---|---|
| Bias-Variance Tradeoff | Error = Bias² + Variance + Irreducible | Model selection, regularization decisions |
| Learning Rate | Typical: 0.001–0.01. Too high → diverges, too low → slow | All gradient-based training |
| Precision | TP / (TP + FP) | Spam detection, fraud → minimize false positives |
| Recall | TP / (TP + FN) | Cancer detection → minimize false negatives |
| F1 Score | 2 × (P × R) / (P + R) | Imbalanced datasets |
| ROC-AUC | Area under ROC curve. 0.5 = random, 1.0 = perfect | Binary classification evaluation |
| RMSE | √(Σ(yᵢ - ŷᵢ)² / n) | Regression → sensitive to outliers |
| L1 (Lasso) | Loss + λ‖w‖₁ | Feature selection → drives weights to 0 |
| L2 (Ridge) | Loss + λ‖w‖² | Prevent overfitting, keeps all features |
| Dropout | Randomly zero out p% of neurons during training | Neural network regularization |
| Batch Norm | Normalize layer inputs: μ=0, σ=1 per mini-batch | Deep networks → stabilizes training |
| Adam Optimizer | Combines momentum + RMSprop. β₁=0.9, β₂=0.999 | Default optimizer → works well in most cases |
| Cross-Entropy Loss | -Σ yᵢ log(ŷᵢ) | Classification tasks |
| Softmax | eˣⁱ / Σeˣʲ | Multi-class output → probabilities sum to 1 |
| k-Fold CV | Split data into k parts; train on k-1, test on 1; repeat | Model evaluation, hyperparameter tuning |
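A quick sanity check of the precision/recall/F1 rows above, computed by hand on a made-up set of labels:

```python
# A minimal sketch verifying the precision/recall/F1 formulas on toy labels
# (the labels are made up for illustration).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives: 2
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives: 1
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives: 2

precision = tp / (tp + fp)                          # 2/3
recall = tp / (tp + fn)                             # 1/2
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(round(precision, 3), round(recall, 3), round(f1, 3))
```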
| Variant | Update Rule | Pros | Cons |
|---|---|---|---|
| Batch GD | Use all data per step | Stable, smooth convergence | Slow for large datasets |
| Stochastic GD (SGD) | Use 1 sample per step | Fast, can escape local minima | Noisy updates, may not converge |
| Mini-Batch GD | Use batch of 32–256 per step | Balance of speed + stability | Requires tuning batch size |
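The mini-batch variant can be sketched in NumPy on a toy linear-regression problem (data, batch size, and learning rate here are illustrative, not prescriptive):

```python
# Mini-batch gradient descent on toy linear regression: shuffle each epoch,
# step on one batch at a time using the MSE gradient.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 64
for epoch in range(20):
    idx = rng.permutation(len(X))               # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # MSE gradient on batch
        w -= lr * grad
print(w)  # ~ [2.0, -1.0, 0.5]
```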
| Strategy | When to Use |
|---|---|
| Mean/Median/Mode imputation | MCAR data, small proportion missing |
| KNN Imputation | MAR data, when neighbors share values |
| Model-based imputation | Complex missingness patterns |
| Drop rows/columns | If > 70% missing or data is MNAR |
| Indicator column | Missingness itself is informative |
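A minimal sketch combining mean imputation with an indicator column via scikit-learn's `SimpleImputer` (the toy matrix is made up):

```python
# Mean imputation plus missingness-indicator columns: add_indicator=True
# appends one 0/1 column per feature that had missing values during fit.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 6.0]])
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)
# Columns: imputed feature 1, imputed feature 2, indicator for feature 1, indicator for feature 2
print(X_out)
```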
Linear Regression: Fits a line ŷ = β₀ + β₁x₁ + ... + βₙxₙ that minimizes MSE (Ordinary Least Squares).
Logistic Regression: Applies a sigmoid to the linear output → probability. Decision boundary where P = 0.5.
Decision Tree: Recursively splits data on the features that maximize information gain (entropy) or minimize Gini impurity.
Random Forest: Trains N trees on bootstrap samples with random feature subsets. Final prediction = majority vote (classification) or mean (regression).
Gradient Boosting: Trains trees sequentially, each correcting the previous errors. Uses the gradient of the loss to determine the next tree's direction.
SVM: Finds the hyperplane that maximizes the margin between classes. The kernel trick maps data to higher dimensions.
k-NN: At prediction time, finds the k closest training points (by distance) and returns the majority vote or mean.
k-Means: Initialize k centroids → assign each point to its nearest centroid → recompute centroids → repeat until convergence.
DBSCAN: Groups points that are closely packed together and marks outliers as noise. Expands clusters from core points.
PCA: Finds orthogonal axes (principal components) that capture maximum variance, then projects data onto the top k components.
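The k-means loop (initialize → assign → recompute → repeat) can be sketched in a few lines of NumPy; the two toy blobs and the fixed initialization are just for a reproducible demo:

```python
# Lloyd's algorithm on two well-separated 2-D blobs.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),    # blob around (0, 0)
               rng.normal(3, 0.3, (50, 2))])   # blob around (3, 3)

k = 2
centroids = X[[0, 50]].copy()   # init from data points (fixed here for reproducibility)
for _ in range(10):
    d = np.linalg.norm(X[:, None] - centroids[None], axis=2)  # point-to-centroid distances
    labels = d.argmin(axis=1)                                 # assign to nearest centroid
    new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new, centroids):                           # stop at convergence
        break
    centroids = new
print(centroids.round(2))  # close to the blob centers
```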
| Function | Formula | Use When |
|---|---|---|
| ReLU | max(0, x) | Hidden layers (default) |
| Leaky ReLU | max(0.01x, x) | Avoid dying ReLU problem |
| GELU | x·Φ(x) | Transformers (BERT, GPT) |
| Sigmoid | 1/(1+e⁻ˣ) | Binary output layer |
| Softmax | eˣⁱ/Σeˣʲ | Multi-class output |
| Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | RNNs, [-1,1] output |
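The table's formulas, written out in NumPy (the max-subtraction in softmax is a standard numerical-stability trick):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x):
    return np.maximum(0.01 * x, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), sigmoid(0.0), softmax(x).sum())
```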
How convolutions work: A filter (kernel) slides over the input, computing dot products at each position → creates a feature map showing where patterns appear. Multiple filters detect multiple features.
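A minimal "valid" 2-D convolution in NumPy (like most DL frameworks, it actually computes cross-correlation; the edge-detector kernel is illustrative):

```python
# Slide a kernel over the image, taking a dot product at each position.
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1  # "valid" output size
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = (image[i:i+kh, j:j+kw] * kernel).sum()
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge = np.array([[-1, 0, 1]] * 3, dtype=float)   # simple vertical-edge detector
print(conv2d(image, edge).shape)  # (3, 3)
```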
| Architecture | Key Innovation | Used For |
|---|---|---|
| VGG-16/19 | Deep with small 3×3 filters | Classification baseline |
| ResNet | Skip connections (residual blocks) → solve vanishing gradient | Deep classification, transfer learning |
| EfficientNet | Compound scaling (depth+width+resolution) | Efficient high-accuracy classification |
| YOLO | Single-pass real-time object detection | Object detection |
| U-Net | Encoder-decoder with skip connections | Image segmentation |
Self-Attention: Each token attends to every other token. Computes how much each word should "focus on" other words in context.
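Scaled dot-product self-attention, sketched in NumPy with Q = K = V (random toy vectors; real layers add learned projections and multiple heads):

```python
# Every token's query is scored against every token's key; the softmax
# weights then mix the value vectors.
import numpy as np

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (seq, seq) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # weighted sum of values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # 4 tokens, embedding dim 8
out = self_attention(x, x, x)      # Q = K = V for illustration
print(out.shape)  # (4, 8)
```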
```python
# Standard NLP preprocessing pipeline
import re

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads: nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

STOPWORDS = set(stopwords.words('english'))   # build the set once, not per token
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                                   # lowercase
    text = re.sub(r'[^a-z\s]', '', text)                  # remove punctuation/digits
    tokens = word_tokenize(text)                          # tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]    # drop stopwords
    tokens = [lemmatizer.lemmatize(t) for t in tokens]    # lemmatize
    return ' '.join(tokens)
```
| Method | How It Works | Captures Semantics? | Use Case |
|---|---|---|---|
| Bag of Words | Word count vector (no order) | No | Simple classification |
| TF-IDF | TF × log(N/df) → penalizes common words | No | Information retrieval, search |
| Word2Vec | CBOW or Skip-gram neural network. king − man + woman ≈ queen | Yes (local) | Word similarity, analogies |
| GloVe | Matrix factorization on co-occurrence statistics | Yes (global) | Word similarity at scale |
| BERT Embeddings | Contextual → same word has different embedding in different sentences | Yes (deep) | All modern NLP tasks |
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch

# Load pretrained model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize dataset (assumes `dataset` was already loaded, e.g. via datasets.load_dataset)
def tokenize(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)

# Training configuration
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,          # low LR for fine-tuning!
    warmup_steps=500,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(model=model, args=training_args,
                  train_dataset=tokenized["train"], eval_dataset=tokenized["test"])
trainer.train()
```
| Stage | Tools | Key Tasks |
|---|---|---|
| Data Management | DVC, Delta Lake, Feast | Versioning data, feature store, data lineage |
| Experiment Tracking | MLflow, W&B, Neptune | Log params, metrics, artifacts, compare runs |
| Model Training | sklearn, PyTorch, TF, Ray | Distributed training, hyperparameter tuning |
| Model Registry | MLflow Registry, Vertex AI | Version models, stage (staging/prod), lineage |
| Serving/Deployment | FastAPI, TorchServe, KServe, SageMaker | REST/gRPC endpoints, batch inference, A/B |
| Monitoring | Evidently AI, Seldon, Prometheus | Data drift, model drift, concept drift alerts |
```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score

mlflow.set_experiment("fraud_detection_v2")

with mlflow.start_run(run_name="rf_baseline"):
    # Log hyperparameters
    params = {"n_estimators": 200, "max_depth": 8, "class_weight": "balanced"}
    mlflow.log_params(params)

    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # Log metrics
    mlflow.log_metric("train_f1", f1_score(y_train, model.predict(X_train)))
    mlflow.log_metric("val_f1", f1_score(y_val, model.predict(X_val)))
    mlflow.log_metric("val_auc", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

    # Save model to registry
    mlflow.sklearn.log_model(model, "model", registered_model_name="FraudDetector")
```
```python
from fastapi import FastAPI
from pydantic import BaseModel
import mlflow.sklearn
import numpy as np

app = FastAPI()
# Load the sklearn flavor so predict_proba is available (pyfunc only exposes predict)
model = mlflow.sklearn.load_model("models:/FraudDetector/Production")

class PredictionRequest(BaseModel):
    amount: float
    merchant_category: str
    hour_of_day: int
    user_avg_spend: float

class PredictionResponse(BaseModel):
    fraud_probability: float
    is_fraud: bool

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    # NOTE: merchant_category is categorical and would need the same encoding used in training
    features = np.array([[request.amount, request.hour_of_day, request.user_avg_spend]])
    proba = float(model.predict_proba(features)[0, 1])
    return PredictionResponse(fraud_probability=proba, is_fraud=proba > 0.5)

# Run: uvicorn app:app --host 0.0.0.0 --port 8000
```
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['job_type', 'education']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),  # `sparse=` renamed in sklearn >= 1.2
])
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features),
])
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier()),
])

# Grid search with cross-validation
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [3, 5],
    'classifier__learning_rate': [0.05, 0.1],
}
cv = GridSearchCV(pipeline, param_grid, cv=StratifiedKFold(5), scoring='roc_auc', n_jobs=-1)
cv.fit(X_train, y_train)
print(f"Best AUC: {cv.best_score_:.4f}")
print(f"Best params: {cv.best_params_}")
```
```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

class MLPClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dims, num_classes, dropout=0.3):
        super().__init__()
        layers, in_dim = [], input_dim
        for h in hidden_dims:
            layers += [nn.Linear(in_dim, h), nn.BatchNorm1d(h), nn.GELU(), nn.Dropout(dropout)]
            in_dim = h
        layers.append(nn.Linear(in_dim, num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = MLPClassifier(50, [256, 128, 64], 2).to(device)
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=50)
criterion = nn.CrossEntropyLoss()

for epoch in range(50):
    model.train()
    train_loss = 0
    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        optimizer.zero_grad()
        loss = criterion(model(X_batch), y_batch)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
        optimizer.step()
        train_loss += loss.item()
    scheduler.step()

    model.eval()
    with torch.no_grad():
        val_preds = model(X_val.to(device)).argmax(dim=1).cpu()
    if epoch % 10 == 0:
        print(f"Epoch {epoch}: loss={train_loss/len(train_loader):.4f}")
```
| Tool | Category | Key Use | Must-Know Commands/API |
|---|---|---|---|
| scikit-learn | ML Framework | Classical ML, preprocessing, evaluation | Pipeline, GridSearchCV, cross_val_score, train_test_split |
| PyTorch | Deep Learning | Research, custom models, NLP | nn.Module, DataLoader, optimizer.zero_grad(), loss.backward() |
| TensorFlow/Keras | Deep Learning | Production deployment, mobile | model.compile(), model.fit(), model.predict(), tf.data |
| Hugging Face | NLP/LLMs | Pretrained models, fine-tuning, inference | AutoTokenizer, pipeline(), Trainer, from_pretrained() |
| XGBoost/LightGBM | Boosting | Tabular data, competitions, production | xgb.train(), early_stopping_rounds, feature_importance |
| MLflow | MLOps | Experiment tracking, model registry | mlflow.log_params(), log_metric(), log_model(), load_model() |
| W&B (Weights & Biases) | MLOps | Rich experiment dashboard, sweeps | wandb.init(), wandb.log(), wandb.sweep(), wandb.agent() |
| pandas | Data | Data manipulation, EDA | groupby, merge, apply, pivot_table, read_csv, to_sql |
| NumPy | Numerical | Array operations, linear algebra | np.dot, np.stack, np.where, np.argmax, broadcasting |
| ONNX | Deployment | Framework-agnostic model format | torch.onnx.export(), onnxruntime.InferenceSession |
| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| Fitting StandardScaler on entire dataset | Leaks test distribution into training | Fit only on train, transform both train and test |
| Using accuracy for imbalanced data | Misleading → 99% accuracy on a 1%-positive dataset by predicting all negative | Use F1, AUC, or PR-AUC |
| Not shuffling before train/test split | Temporal or class patterns in data order | Use shuffle=True or stratified split |
| Tuning hyperparameters on test set | Overfits to the test set → optimistic estimate | Use validation set or nested CV for tuning |
| Ignoring class imbalance | Model biased toward majority class | SMOTE, class_weight, threshold tuning |
| Not checking for data leakage | Unrealistic validation metrics | Audit all features for temporal leakage |
| Too large learning rate | Loss diverges or oscillates | Start small, use LR finder (fast.ai method) |
| Not setting random seeds | Irreproducible experiments | Set np.random.seed, torch.manual_seed, random.seed |
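The first row's fix, spelled out: fit scaling statistics on the training split only, then reuse them on test (toy data for illustration):

```python
# Leakage-free scaling: the scaler never sees test-set statistics.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(loc=5, scale=2, size=(100, 3))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42, shuffle=True)

scaler = StandardScaler().fit(X_train)   # statistics from train ONLY
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # reuse train mean/std -> no leakage
print(X_train_s.mean(axis=0).round(2))   # ~0 on train; test mean may differ slightly
```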
| Term | Definition |
|---|---|
| Epoch | One full pass through the entire training dataset |
| Batch | Subset of training data used in one gradient update |
| Iteration | One weight update (one batch processed) |
| Hyperparameter | Configuration set before training (learning rate, depth) vs learned parameters (weights) |
| Inductive Bias | Assumptions a model makes about the problem (CNNs assume spatial locality) |
| Generalization | Model's ability to perform well on unseen data |
| Calibration | Alignment between predicted probabilities and actual frequencies |
| Latent Space | Compressed representation learned by model (e.g., autoencoder bottleneck) |
| Tokenization | Splitting text into tokens (words, subwords, characters) for NLP models |
| Fine-tuning | Further training a pre-trained model on a specific task with a small learning rate |
| Term | Definition |
|---|---|
| Perplexity | Measure of how well an LLM predicts a sequence. Lower = better. 2^(cross-entropy) |
| Temperature | Controls LLM output randomness. High temp → diverse, creative; low → deterministic |
| Hallucination | LLM generates plausible-sounding but factually incorrect information |
| RAG | Retrieval-Augmented Generation → ground LLM with retrieved documents |
| Quantization | Reduce model precision (FP32 → INT8) to shrink size and speed up inference |
| Pruning | Remove low-importance weights/neurons to compress model |
| Distillation | Train small student model to mimic large teacher model outputs |
| Ensemble | Combine predictions of multiple models for better performance |
| Stacking | Use another model's predictions as input features for a meta-model |
| RLHF | Reinforcement Learning from Human Feedback → how GPT is aligned with human preferences |
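Temperature from the glossary, sketched on made-up logits: dividing by T before the softmax sharpens (low T) or flattens (high T) the distribution.

```python
import numpy as np

def softmax_with_temperature(logits, T):
    z = np.asarray(logits) / T         # scale logits by 1/T
    e = np.exp(z - z.max())            # stable softmax
    return e / e.sum()

logits = [2.0, 1.0, 0.1]
for T in (0.5, 1.0, 2.0):
    print(T, softmax_with_temperature(logits, T).round(3))
# Lower T concentrates probability on the top logit; higher T spreads it out.
```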
Thejaslearning – AI/ML Engineer Cheat Sheet