Master your interview with real-world questions and detailed answers
Large Language Models (LLMs): Transformer-based models trained on massive text corpora with billions of parameters. They can understand and generate human-like text, code, and structured outputs.
| Model | Params | Context | Company |
|---|---|---|---|
| GPT-4 | ~1.8T | 128K tokens | OpenAI |
| Claude 3.5 Sonnet | ~200B | 200K tokens | Anthropic |
| Gemini Ultra | Unknown | 1M tokens | Google |
| LLaMA 3.1 405B | 405B | 128K tokens | Meta (open source) |
| Mistral Large | ~123B | 128K tokens | Mistral AI |
Problem 1: Knowledge Cutoff
→ GPT-4 trained on data until April 2023 → asked about 2024 events → hallucination!
→ Doesn't know your company's internal documents!
Problem 2: Hallucination
→ LLMs confidently generate FALSE information
→ "Who is the CEO of XYZ Corp?" → makes up a name!
→ Cannot cite sources without retrieval
Problem 3: Context Length Limits
→ Can't fit a 10,000-page legal document in context
→ Even a 1M-token context = $$$$ expensive
Problem 4: Stale Knowledge
→ Your company's latest product specs change weekly
→ LLM doesn't know current stock prices or live data
SOLUTION: RAG — retrieve relevant documents first, THEN generate!
Phase 1: INDEXING (one-time setup)
Documents (PDFs, docs, websites)
→ Text Chunking (split into ~500 token chunks)
→ Embedding Model (convert each chunk to vector)
→ Vector Database (store all embeddings)
Phase 2: RETRIEVAL (at query time)
User Question → Embed question → Search vector DB
→ Find top-k most similar chunks (cosine similarity)
→ Retrieve actual text of those chunks
Phase 3: GENERATION
User Question + Retrieved Context → LLM Prompt
→ LLM generates answer grounded in retrieved context
→ LLM can cite which document it used!
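To make the Phase 2 retrieval step concrete, here is a minimal sketch of top-k retrieval with cosine similarity; the embeddings are random placeholders standing in for a real embedding model and chunk corpus:
import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve_top_k(query_vec, chunk_vecs, chunk_texts, k=3):
    # Score every chunk against the query, return the k most similar
    scores = [cosine_sim(query_vec, v) for v in chunk_vecs]
    top_idx = np.argsort(scores)[::-1][:k]
    return [(chunk_texts[i], scores[i]) for i in top_idx]

# Toy example with random "embeddings" standing in for a real model
rng = np.random.default_rng(0)
chunk_texts = ["refund policy...", "security policy...", "holiday calendar..."]
chunk_vecs = [rng.normal(size=1536) for _ in chunk_texts]
query_vec = rng.normal(size=1536)
print(retrieve_top_k(query_vec, chunk_vecs, chunk_texts, k=2))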
Setup:
- Company has 100,000 internal documents in SharePoint
- Employees want to ask questions like:
"What is our refund policy for enterprise customers?"
"What were the Q3 2024 quarterly targets for APAC?"
RAG Pipeline:
1. INDEXING:
100K SharePoint docs → Azure AI Document Intelligence (parse PDFs)
→ Split into 500-token chunks with 50-token overlap
→ Azure OpenAI text-embedding-ada-002 → 1536-dim vectors
→ Store in Azure AI Search (vector store)
2. RETRIEVAL:
Query: "refund policy enterprise" → embed query
→ Cosine similarity search → Top 5 most similar chunks found:
[Policy_Handbook_v3.pdf, p.12], [Enterprise_Agreement_2024.docx, p.4], ...
3. GENERATION:
Prompt = "Using ONLY the following context, answer the question.
"
+ retrieved_chunks + "
Question: " + user_query
→ GPT-4 generates answer with citations!
→ "According to Enterprise_Agreement_2024.docx, the refund window is 90 days..."
Benefits:
✅ Accurate — grounded in actual documents
✅ Citable — shows which document answered the question
✅ Always up-to-date — just re-index when docs change
✅ Secure — documents stay in your Azure environment
from langchain.document_loaders import PyPDFLoader, WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
# PHASE 1: INDEX DOCUMENTS
# Load documents
loader = PyPDFLoader("company_policy.pdf")
documents = loader.load()
# Split into chunks
splitter = RecursiveCharacterTextSplitter(
chunk_size=500, # ~500 characters per chunk (length_function=len counts characters, not tokens)
chunk_overlap=50, # 50-character overlap to preserve context
length_function=len
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
# Create embeddings + vector store
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./company_vectorstore" # Save to disk
)
# PHASE 2 + 3: RAG CHAIN (Retrieval + Generation together)
llm = ChatOpenAI(model_name="gpt-4-turbo", temperature=0) # Chat model class for GPT-4; low temp = more factual
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff", # "stuff" = put all retrieved docs in prompt
retriever=vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 5} # Retrieve top 5 most similar chunks
),
return_source_documents=True # Show which docs were used!
)
# Ask a question!
query = "What is the maximum refund amount for enterprise customers?"
result = qa_chain({"query": query})
print("Answer:", result["result"])
print("
Sources used:")
for doc in result["source_documents"]:
print(f" - {doc.metadata.get(`source`, `Unknown`)} (page {doc.metadata.get(`page`, `N/A`)})")
| Technique | What It Solves | How |
|---|---|---|
| HyDE (Hypothetical Doc Embeddings) | Query-document mismatch | Generate hypothetical answer → embed that for retrieval |
| Multi-Query Retrieval | Single query misses relevant docs | Generate 3 different query phrasings, retrieve for all |
| Re-ranking (Cross-Encoder) | Bi-encoder misses relevance nuances | Use BERT cross-encoder to re-rank top-k results |
| Contextual Compression | Retrieved chunks have noise | Compress each chunk to only relevant sentences |
| GraphRAG (Microsoft) | Multi-hop reasoning | Build knowledge graph of document relationships |
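As one concrete piece of Multi-Query Retrieval, the result lists from the different query phrasings have to be merged into a single ranking; a common choice is reciprocal rank fusion. The sketch below assumes the per-query retrieval has already happened (document IDs and phrasings are placeholders):
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    # Merge several ranked result lists (one per query phrasing) into one ranking.
    # Documents that rank highly in multiple lists float to the top.
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Three phrasings of the same question, each already retrieved (placeholder IDs)
results_per_query = [
    ["doc_12", "doc_7", "doc_3"],   # "refund policy enterprise"
    ["doc_7", "doc_12", "doc_9"],   # "enterprise customer refunds"
    ["doc_7", "doc_3", "doc_1"],    # "money-back terms for enterprise plans"
]
print(reciprocal_rank_fusion(results_per_query))  # doc_7 and doc_12 rank first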
Two architectures:
CBOW: Predict the center word from context words
→ ["The", "_", "sat", "on"] → predict "cat"
Skip-gram: Predict context words from the center word
→ "cat" → predict ["The", "sat", "on", "mat"]
Training: Sliding window over corpus text
→ Window size = 5: for each word, use 5 words on each side as context
Key Properties:
→ Similar words cluster together in vector space
→ Semantic relationships preserved (analogies work!)
→ King - Man + Woman ≈ Queen
→ Paris - France + Germany ≈ Berlin (capital relationship!)
→ Doctor - Man + Woman ≈ Nurse (problematic gender bias!)
from gensim.models import Word2Vec
import numpy as np
# Train Word2Vec on LinkedIn job descriptions
sentences = [desc.split() for desc in job_descriptions]
w2v = Word2Vec(
sentences,
vector_size=300, # 300-dim embeddings
window=10, # Context window
min_count=5, # Ignore rare words
workers=4, # Parallel training
sg=1, # Skip-gram (better for small corpus)
epochs=10
)
# Explore relationships
print(w2v.wv.most_similar("python", topn=5))
# [("pandas", 0.91), ("numpy", 0.89), ("scikit-learn", 0.87)]
# Skill arithmetic
result = w2v.wv.most_similar(
positive=["machine_learning", "python"],
negative=["statistics"], topn=3
)
# → "deep_learning" (ML + Python - Statistics)
Key Innovation: Uses GLOBAL co-occurrence statistics
Word2Vec: Local context windows
GloVe: Co-occurrence matrix over the entire corpus
J(θ) = Σ f(X_ij) (w_i^T w_j + b_i + b_j − log X_ij)²
→ X_ij = how often words i and j appear together globally
→ f(X_ij) = weighting function (downweights very frequent pairs)
GloVe vs Word2Vec:
→ GloVe: Better on analogy tasks (syntactic)
→ Word2Vec: Better on semantic similarity
→ GloVe: Faster to train (one pass over the co-occurrence matrix)
→ Word2Vec: Better for large, dynamic corpora
LinkedIn uses pre-trained GloVe for job title understanding:
"Software Engineer" ≈ "Software Developer" ≈ "SWE" (near synonyms)
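A quick way to experiment with pre-trained GloVe vectors is gensim's downloader; a minimal sketch, assuming the "glove-wiki-gigaword-100" model name from the gensim-data catalog (the first call downloads the vectors):
import gensim.downloader as api

# 100-dim GloVe vectors trained on Wikipedia + Gigaword (downloaded on first use)
glove = api.load("glove-wiki-gigaword-100")

print(glove.most_similar("engineer", topn=5))      # related job titles / terms
print(glove.similarity("developer", "engineer"))   # high score → near-synonyms
# Analogy arithmetic: king - man + woman ≈ queen
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))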
Key Innovation: Subword (character n-gram) embeddings!
Word2Vec/GloVe: word "python" → single vector
Unknown word "pythonic" → OOV error!
FastText: "python" → character n-grams
[<py, pyt, yth, tho, hon, on>]  (with < > marking word boundaries)
word vector = average of n-gram vectors
Benefits:
✅ Handles unknown/rare words (morphological variants)
✅ Works for new technical terms: "FastAPI", "LangChain"
✅ Better for morphologically rich languages (Tamil, Arabic)
✅ Handles typos: "pythn" still close to "python"
from gensim.models import FastText
ft_model = FastText(
sentences,
vector_size=300,
window=10,
min_count=1, # Can even use count=1 since handles OOV!
sg=1,
epochs=10
)
# FastText handles new/unseen words!
vector = ft_model.wv["ReactJS"] # Even if not in training data!
500M+ members with self-reported skills
20M+ job postings with required skills
Problem: "Machine Learning" = "ML" = "Statistical Learning" = "ML/AI"
→ How to match when terms don't exactly match?
Step 1: Train Word2Vec on LinkedIn corpus
→ 400M+ professional profiles
→ 20M+ job postings (updated monthly)
→ All text treated as sentences
Step 2: Create Skill Vectors
"machine_learning" → [0.72, -0.14, 0.55, ...]
"ml" → [0.71, -0.13, 0.53, ...] (similar!)
"deep_learning" → [0.68, -0.12, 0.51, ...] (related)
"sql" → [0.20, 0.45, -0.30, ...] (different domain)
Step 3: Job Matching
Job requires: ["Python", "Machine Learning", "AWS"]
User has: ["Python", "ML", "Amazon Web Services", "TensorFlow"]
Cosine similarity:
"Machine Learning" ↔ "ML" = 0.97 ✅ (MATCH!)
"AWS" ↔ "Amazon Web Services" = 0.95 ✅ (MATCH!)
"TensorFlow" ↔ "Machine Learning" = 0.78 (RELATED — show as bonus skill)
Step 4: Present match score
"You have 94% of required skills. Missing: Spark (but you have Pandas → related!)"
LinkedIn's Results:
→ Job match quality improved 35%
→ Application-to-hire rate increased 22%
→ Works across 26 languages simultaneously
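A rough sketch of the Step 3 matching logic, reusing the w2v model trained earlier; the 0.85 threshold and the underscore-joined skill tokens are illustrative assumptions:
import numpy as np

def best_match(required_skill, user_skills, w2v, threshold=0.85):
    # Find the user's closest skill to a required skill by cosine similarity
    scores = {s: w2v.wv.similarity(required_skill, s)
              for s in user_skills if s in w2v.wv and required_skill in w2v.wv}
    if not scores:
        return None, 0.0
    skill = max(scores, key=scores.get)
    return (skill, scores[skill]) if scores[skill] >= threshold else (None, scores[skill])

job_skills = ["python", "machine_learning", "aws"]
user_skills = ["python", "ml", "amazon_web_services", "tensorflow"]
matched = {req: best_match(req, user_skills, w2v) for req in job_skills}
coverage = sum(1 for skill, _ in matched.values() if skill) / len(job_skills)
print(matched, f"match score: {coverage:.0%}")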
| Aspect | Word2Vec | GloVe | FastText | BERT |
|---|---|---|---|---|
| OOV Words | ❌ No | ❌ No | ✅ Yes | ✅ Subwords |
| Context-Aware | ❌ No | ❌ No | ❌ No | ✅ Yes |
| Training Speed | Fast | Fast | Moderate | Slow |
| Inference Speed | Very Fast | Very Fast | Very Fast | Slow (GPU) |
| Best Use | General NLP | Analogy tasks | Morphology, OOV | State-of-the-art |
| Used by | Google (original), many companies | Stanford NLP, many companies | Meta, multilingual | Google Search |
Raw text is messy — preprocessing transforms it into a format models can learn from. Quality preprocessing often matters MORE than model choice!
import re
def clean_text(text):
    text = text.lower()                          # Lowercase
    text = re.sub(r"http\S+|www\S+", "", text)   # Remove URLs
    text = re.sub(r"<[^>]+>", "", text)          # Remove HTML tags
    text = re.sub(r"@\w+", "", text)             # Remove @mentions
    text = re.sub(r"#(\w+)", r"\1", text)        # Remove # (keep word)
    text = re.sub(r"[^\w\s]", "", text)          # Remove punctuation
    text = re.sub(r"\s+", " ", text).strip()     # Fix whitespace
    return text
# Google: Clean search queries
raw_query = "Best Python Tutorial!! @codewithme http://xyz.com #python"
clean_query = clean_text(raw_query)
print(clean_query)
# Output: "best python tutorial python"
from nltk.tokenize import word_tokenize, sent_tokenize
from transformers import BertTokenizer
# Word Tokenization
text = "Google processes 8.5 billion searches daily!"
tokens = word_tokenize(text)
# ["Google", "processes", "8.5", "billion", "searches", "daily", "!"]
# Sentence Tokenization
document = "Google was founded in 1998. It has many products."
sentences = sent_tokenize(document)
# ["Google was founded in 1998.", "It has many products."]
# Subword Tokenization (BERT/GPT use this!)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_tokens = tokenizer.tokenize("unforgettable experience")
# ["un", "##forget", "##table", "experience"]
# → Handles unknown words by breaking into subwords!
# → "transforming" → ["transform", "##ing"]
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ["running", "flies", "studies", "better", "dogs"]
for word in words:
stem = stemmer.stem(word)
lemma = lemmatizer.lemmatize(word, pos="v")
print(f"{word:12} | Stem: {stem:10} | Lemma: {lemma}")
# running | Stem: run | Lemma: run
# flies | Stem: fli | Lemma: fly
# studies | Stem: studi | Lemma: study
# better | Stem: better | Lemma: better
# dogs | Stem: dog | Lemma: dog
# Key difference:
# Stemming: Cuts suffix brutally (fli ❌)
# Lemmatization: Converts to dictionary base form (fly ✅)
# Google uses Lemmatization (WordNet-based) for search understanding
from nltk.corpus import stopwords
import nltk
# Stop word removal
stop_words = set(stopwords.words("english"))
tokens = ["best", "python", "tutorial", "for", "beginners", "in", "2024"]
filtered = [w for w in tokens if w not in stop_words]
print(filtered) # ["best", "python", "tutorial", "beginners", "2024"]
# POS Tagging (Part of Speech)
text = "Google searches billions of queries every second"
tagged = nltk.pos_tag(nltk.word_tokenize(text))
# [("Google", "NNP"), ("searches", "VBZ"), ("billions", "NNS"),
# ("of", "IN"), ("queries", "NNS"), ("every", "DT"), ("second", "NN")]
# POS Tags: NNP=Proper Noun, VBZ=Verb, NNS=Noun Plural
# Google uses POS tagging to:
# → Identify that "Python" in "python snake" vs "python programming" are different
# → Adjective before noun = product descriptor
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
from transformers import BertModel, BertTokenizer
import torch
# Method 1: TF-IDF (Traditional)
corpus = ["Google is a search engine", "Python is for data science",
"Machine learning uses Python"]
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1,2))
X_tfidf = vectorizer.fit_transform(corpus)
# Sparse matrix: 3 documents × 5000 features
# Good for search — fast, interpretable
# Method 2: Word2Vec Embeddings (Semantic)
sentences = [doc.lower().split() for doc in corpus]  # lowercase so lookups like "python" match
w2v_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=10)
python_vector = w2v_model.wv["python"] # 100-dimensional vector
similar_words = w2v_model.wv.most_similar("google")
print(similar_words) # ["search", "engine", "bing", ...]
# Method 3: BERT Embeddings (Context-aware — Google uses this!)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
text = "Apple is a great company"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
outputs = bert(**inputs)
embeddings = outputs.last_hidden_state[:, 0, :] # [CLS] token = sentence embedding
# Shape: (1, 768) — 768-dim context-aware embedding
# "Apple" in this context = tech company (not fruit!)
Google processes a query like: "best sushi restaurant near me open now"
Step 1: Normalization
→ Lowercase, expand contractions
→ Spell check: "resturant" → "restaurant" (using huge n-gram model)
Step 2: Query Understanding (BERT)
→ Named Entity Recognition: "sushi" = food type, "near me" = location intent
→ Intent Classification: Local search + temporal constraint
→ Semantic parsing: {food: sushi, type: restaurant, modifier: best, location: user_location, time: open_now}
Step 3: Query Expansion
→ Synonyms: "sushi restaurant" → also match "Japanese restaurant"
→ Related: "best" → consider "highly rated", "top-rated"
→ Personalization: YOUR location from GPS
Step 4: Document Retrieval + Ranking
→ BM25 initial retrieval (keyword matching)
→ BERT re-ranking (semantic relevance scoring)
→ PageRank + 200+ other signals
→ Return top 10 results in <100ms
Google processes each query through this pipeline in ~100ms!
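A small sketch of the Step 4 BM25 initial-retrieval stage, using the rank_bm25 package on a toy corpus (documents and scores are illustrative; in a real pipeline the top candidates would feed the BERT re-ranker):
from rank_bm25 import BM25Okapi   # pip install rank-bm25

corpus = [
    "Sakura Sushi — top rated Japanese restaurant, open until 11pm",
    "Best pizza places near downtown",
    "Highly rated sushi and ramen spots open now",
]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "best sushi restaurant open now".lower().split()
print(bm25.get_scores(query))               # keyword-match score per document
print(bm25.get_top_n(query, corpus, n=2))   # top candidates to pass to re-ranking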
→ Recommend 10-20 shows/movies per user on the homepage
→ Update recommendations when a user finishes watching something
→ Support different surfaces: Homepage, "Because you watched X", Search
→ Work for 250M+ users and 15,000+ titles
→ Handle new users (cold start) and new content
→ Latency: <100ms to serve recommendations (the user is already on the site!)
→ Throughput: 100K+ recommendation requests/second (peak evening hours)
→ Freshness: Incorporate a viewing event within 5-10 minutes
→ Availability: 99.99% uptime (downtime = lost revenue!)
→ Scalability: Handle 10x growth without architecture changes
OFFLINE PIPELINE (runs every few hours):
User behavior data → Feature Engineering → Model Training → Model Registry
NEARLINE PIPELINE (runs every few minutes):
Recent events → Pre-compute user embeddings → Cache (Redis)
ONLINE SERVING (real-time, <100ms):
User request → Feature lookup → Candidate generation → Ranking → Response
STAGE 1: CANDIDATE GENERATION (~1000 candidates)
Goal: Fast retrieval of potentially relevant items from 15,000+
Methods (run in parallel):
→ Collaborative Filtering: "Users like you watched..." (ALS / Neural CF)
→ Content-Based: "Similar to what you watched" (CNN embeddings)
→ Trending: "Popular in your country this week"
→ Continuations: "Continue watching..."
→ Newly released in your genres
STAGE 2: RANKING (~10-20 final results)
Goal: Precisely rank 1000 candidates using rich features
Model: Deep Learning Ranker (Wide & Deep / DNN)
Features:
→ Match score from Stage 1
→ User-item interaction features (watched 80%+ of this director's other films)
→ Context: Time of day, device type, country
→ Item features: Genre, language, release year, average rating
→ User features: Watch history (last 5 shows), subscription length, age
Output: Ranked list of 20 shows with predicted engagement score
Online Store (Redis — <5ms lookup):
→ user_recent_genres: last 30 days of genre history
→ user_avg_session_length: rolling 7-day average
→ user_last_watched: last 3 show IDs
Offline Store (S3/HDFS):
→ Full viewing history (3+ years)
→ Content metadata features
→ Training data for models
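A minimal sketch of the online-store lookup, assuming a local Redis instance; the key names, field names, and values are illustrative:
import redis

# Assumes a Redis instance is running locally; key/field names are illustrative
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# The nearline pipeline would write pre-computed features every few minutes
r.hset("user:42", mapping={
    "recent_genres": "crime,thriller,documentary",
    "avg_session_minutes": "47.5",
    "last_watched": "tt0903747,tt2861424,tt5290382",
})

# Online serving reads the whole feature hash in one round trip
features = r.hgetall("user:42")
print(features["recent_genres"], float(features["avg_session_minutes"]))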
User embedding: [0.72, -0.14, 0.55, ...] (128-dim)
Item embeddings: Matrix of 15,000 × 128
Naive approach: Compute similarity to all 15,000 → too slow!
Netflix uses: FAISS (Facebook AI Similarity Search)
→ Hierarchical Navigable Small World (HNSW) index
→ Retrieves top-1000 from 15,000 in <5ms
→ Approximate (~99% overlap with exact results) — good enough!
import faiss
import numpy as np
# Build FAISS index (offline, done once)
d = 128  # dimension
index = faiss.IndexHNSWFlat(d, 32)  # HNSW with M=32 connections
index.add(item_embeddings)  # float32 numpy array of all 15,000 item embeddings
# Online serving (fast!)
user_vector = get_user_embedding(user_id)  # From Redis cache
D, I = index.search(user_vector.reshape(1, -1), 1000)  # Distances + indices of top 1000
candidate_item_ids = I[0]  # Indices of the top-1000 candidates
Memorization (Wide part): Linear model on crossed features
→ "User watched crime show + evening → recommend crime shows"
→ Fast, handles sparse explicit patterns
Generalization (Deep part): DNN for unseen feature combinations
→ Learns complex user-content interactions
→ Handles new combinations not seen in training
Code:
import tensorflow as tf
from tensorflow.keras import layers
# Wide part (memorization): crossed sparse feature, one-hot encoded
# (vocabulary lists and feature names below are illustrative)
genre_col = tf.feature_column.categorical_column_with_vocabulary_list(
    "genre", ["crime", "comedy", "drama", "documentary", "thriller"])
time_col = tf.feature_column.categorical_column_with_vocabulary_list(
    "time_of_day", ["morning", "afternoon", "evening", "night"])
wide_features = [tf.feature_column.indicator_column(
    tf.feature_column.crossed_column([genre_col, time_col], hash_bucket_size=1000))]

# Deep part (generalization): embeddings + dense numeric features
deep_features = [
    tf.feature_column.embedding_column(genre_col, dimension=8),
    tf.feature_column.numeric_column("avg_rating"),
    tf.feature_column.numeric_column("days_since_release"),
]

# Keras functional model: DenseFeatures layers consume a dict of raw inputs
inputs = {
    "genre": tf.keras.Input(shape=(1,), dtype=tf.string, name="genre"),
    "time_of_day": tf.keras.Input(shape=(1,), dtype=tf.string, name="time_of_day"),
    "avg_rating": tf.keras.Input(shape=(1,), name="avg_rating"),
    "days_since_release": tf.keras.Input(shape=(1,), name="days_since_release"),
}
wide_input = layers.DenseFeatures(wide_features)(inputs)
deep_input = layers.DenseFeatures(deep_features)(inputs)
deep = layers.Dense(256, activation="relu")(deep_input)
deep = layers.Dense(128, activation="relu")(deep)
# Combine wide + deep and predict engagement probability
combined = layers.Concatenate()([wide_input, deep])
output = layers.Dense(1, activation="sigmoid")(combined)
model = tf.keras.Model(inputs=inputs, outputs=output)
New User:
→ Onboarding quiz: "What genres do you like?" (3 choices)
→ Show top trending content in their country
→ After 3 watches: Enough signal for CF!
New Content:
→ Use content metadata (genre, director, cast) for content-based retrieval
→ Show to users who watch similar content (small controlled group)
→ After 100 views: Enough for CF
→ Netflix has an editorial team manually curate launch recommendations
Traffic:
→ 250M users, 30% daily active = 75M DAU
→ Each opens the app 1-2x daily → ~100M recommendation requests/day
→ Peak: 20% of traffic in a 2-hour window = 100M × 0.2 / 7,200 ≈ 2,780 RPS
→ With a 10x safety margin: need to handle ~27,800 RPS
Storage:
→ User embeddings: 250M × 128 × 4 bytes = 128GB (fits in a Redis cluster!)
→ Item embeddings: 15K × 128 × 4 bytes = 7.7MB (trivially small)
→ FAISS index: tens of MB for HNSW over 15K items (graph links add overhead on top of the raw vectors)
Compute:
→ Stage 1 (ANN): <5ms per request
→ Stage 2 (DNN ranking): <20ms per 1000 candidates
→ Feature lookup: <5ms (Redis)
→ Total: ~30ms → within the 100ms SLA! ✅
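The same back-of-envelope numbers, checked in a few lines of Python:
# Back-of-envelope check of the traffic and storage estimates above
daily_requests = 100e6
peak_rps = daily_requests * 0.20 / (2 * 3600)       # 20% of traffic in a 2-hour window
print(f"peak: {peak_rps:,.0f} RPS, with 10x headroom: {peak_rps * 10:,.0f} RPS")

user_emb_gb = 250e6 * 128 * 4 / 1e9                 # 250M users x 128 floats x 4 bytes
item_emb_mb = 15_000 * 128 * 4 / 1e6
print(f"user embeddings: {user_emb_gb:.0f} GB, item embeddings: {item_emb_mb:.1f} MB")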
MLOps (Machine Learning Operations): The practice of applying DevOps principles to machine learning systems — automating, monitoring, and maintaining ML models in production.
Traditional Software: Code → Test → Deploy → Monitor
ML Challenge: Code + Data + Model → Test → Deploy → Monitor → Retrain
(Data changes! Model degrades! Distribution shifts!)
MLOps solves:
❌ "Works on my machine but not in production"
❌ "Model was 95% accurate in testing but 70% in production"
❌ "Data scientists train models, engineers deploy — no communication"
❌ "Can`t reproduce experiment — which data/hyperparams did we use?"
❌ "Don`t know when model performance degrades until users complain"
1. PROBLEM DEFINITION
→ Business KPI → ML metric mapping
→ Data availability assessment
→ Success criteria definition
2. DATA COLLECTION & VALIDATION
→ Data Pipeline: Kafka → Data Lake → Feature Store
→ Data validation: Great Expectations / TFX Data Validation
→ Check: Schema, distribution, missing values, biases
3. FEATURE ENGINEERING
→ Feature Store (Feast, Tecton): Share features across teams
→ Online features (Redis — <5ms): Real-time data
→ Offline features (BigQuery): Historical data for training
4. MODEL TRAINING
→ Experiment tracking: MLflow / W&B
→ Hyperparameter tuning: Optuna / Ray Tune / Vizier
→ Distributed training: Horovod / PyTorch DDP / TPUs
5. MODEL EVALUATION
→ Offline metrics: Accuracy, F1, AUC on holdout test set
→ Business metrics: Revenue impact, conversion rate
→ Bias & Fairness: Disaggregated metrics by demographic
6. MODEL DEPLOYMENT
→ Serving: TF Serving, Triton, BentoML
→ Deployment strategies: Blue-Green, Canary, Shadow
→ A/B testing integration
7. MONITORING & OBSERVABILITY
→ Data drift: KL divergence, PSI on input features
→ Model drift: Accuracy degradation over time
→ System health: Latency, throughput, errors
8. RETRAINING TRIGGER
→ Scheduled (weekly/monthly)
→ Performance threshold crossed
→ Data drift detected
→ New training data available
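For step 4 (experiment tracking), a minimal MLflow sketch; the experiment name, run name, and model are placeholders for whatever you actually train:
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

mlflow.set_experiment("churn-model")          # experiment name is illustrative
with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 200, "max_depth": 8}
    model = RandomForestClassifier(**params, random_state=42).fit(X_tr, y_tr)

    mlflow.log_params(params)                                     # hyperparameters
    mlflow.log_metric("f1", f1_score(y_te, model.predict(X_te)))  # offline metric
    mlflow.sklearn.log_model(model, "model")                      # versioned model artifact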
TFX Components (each is a separate step in the pipeline):
1. ExampleGen: Reads data from BigQuery/GCS
2. StatisticsGen: Compute data statistics
3. SchemaGen: Infer schema from statistics
4. ExampleValidator: Detect anomalies vs schema
5. Transform: Feature engineering (saved for serving!)
6. Trainer: Train model (TF/Keras)
7. Tuner: Hyperparameter optimization (Vizier)
8. Evaluator: Compare new model vs baseline
9. InfraValidator: Test model in a replica production env
10. Pusher: Deploy to TF Serving IF the model passes all checks
All steps are:
→ Logged in ML Metadata (MLMD)
→ Reproducible (same inputs → same outputs)
→ Cacheable (skip steps if inputs unchanged)
→ Automated via Kubeflow Pipelines or Vertex AI Pipelines
Google Search Ranking model:
→ 200+ features (query, page, user signals)
→ Retrained DAILY with new click data
→ Each version evaluated vs the current production model
→ Auto-deployed if improvement > 0.1% on offline metrics
→ Gradual rollout: 1% → 5% → 25% → 100% traffic
Google Ads CTR model:
→ Retrained HOURLY (advertising is very time-sensitive!)
→ Model must serve predictions in <10ms
→ Versioning: Keep the last 5 models in case rollback is needed
Types of Drift to Monitor:
1. Data Drift (Covariate Shift)
→ Input feature distribution changes
→ Example: COVID lockdowns → all user location features at "home"
→ Detection: PSI (Population Stability Index), KL Divergence
2. Label Drift (Prior Probability Shift)
→ Target variable distribution changes
→ Example: Fraud pattern changes with new fraud techniques
3. Model Drift (Concept Drift)
→ Relationship between inputs and target changes
→ Example: Economy changes → what predicts credit default changes
4. Prediction Drift
→ Distribution of model outputs changes
→ Even without ground truth labels — detectable immediately!
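Since PSI comes up repeatedly as a drift detector, here is a small self-contained implementation on synthetic data; the 10 bins and the PSI > 0.2 rule of thumb are common conventions, not fixed standards:
import numpy as np

def psi(reference, current, bins=10):
    # Population Stability Index between a reference and a current feature sample
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)   # avoid log(0) / division by zero
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct))

rng = np.random.default_rng(0)
last_month = rng.normal(35, 5, 10_000)   # e.g. a user-age feature at training time
this_week = rng.normal(40, 5, 10_000)    # the distribution has shifted
print(f"PSI = {psi(last_month, this_week):.3f}")   # > 0.2 is commonly treated as major drift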
Monitoring Tools:
→ Evidently AI (open source)
→ WhyLogs
→ AWS SageMaker Model Monitor
→ Google Cloud Model Monitoring
→ Arize AI
from evidently.test_suite import TestSuite
from evidently.tests import TestNumberOfMissingValues, TestFeatureDrift
# Compare current week vs last week data
data_drift_tests = TestSuite(tests=[
TestNumberOfMissingValues(lt=0.05), # Less than 5% missing
TestFeatureDrift(features=["age", "income", "location"])
])
data_drift_tests.run(reference_data=last_week_data, current_data=this_week_data)
report = data_drift_tests.as_dict()
if not report["summary"]["all_passed"]:
    alert_slack_channel("Data drift detected! Investigate before retraining.")
| Strategy | How It Works | Risk | Use Case |
|---|---|---|---|
| Blue-Green | Two identical envs; switch traffic instantly | Low | Zero-downtime deploys |
| Canary | Send 5% traffic to new model, then ramp up | Very Low | ML model updates |
| Shadow | Run new model in parallel, compare but don`t serve | None | Model evaluation |
| A/B Test | Split traffic, measure metric impact | Low | When business impact uncertain |
Definition: A sequence of data processing steps that moves data from source(s) to destination(s), transforming it along the way.
| Concept | ETL (Traditional) | ELT (Modern / Cloud) |
|---|---|---|
| Process Order | Extract → Transform → Load | Extract → Load → Transform |
| Where Transform? | Separate ETL server (Spark, Talend) | Inside the data warehouse (BigQuery) |
| Best For | Small-medium data, strict schema | Petabyte scale, flexible schema |
| Tools | Informatica, SSIS, Apache Spark | BigQuery, Snowflake, Redshift |
| Latency | Batch (hourly/daily) | Near real-time or streaming |
Architecture: Dremel + Capacitor + Borg (Google's internal systems)
Storage: Columnar (stores each column separately)
Compute: Serverless — auto-scales to thousands of machines
Network: Jupiter (Google's internal 1 petabit/second network)
Why Columnar Storage?
Row storage: [user_id, name, age, country, spending] — reads all columns
Column storage: [spending] column only — 90% less I/O for analytics!
BigQuery processes:
→ 20+ petabytes per day (Google scale)
→ Query 1TB in seconds for ~$5 (pay per query)
→ No indexes needed — full scan in parallel
Raw Sources:
→ Google Search logs (8.5B searches/day)
→ YouTube watch events (1B hours/day)
→ Google Ads clicks (90M/day)
→ Maps location data (1B+ users)
Pipeline:
Pub/Sub (real-time messaging)
→ Dataflow (Apache Beam — streaming ETL)
→ BigQuery (storage + analytics)
→ Looker Studio (dashboards)
→ BigQuery ML (in-SQL machine learning!)
-- Train a Logistic Regression MODEL entirely in SQL!
CREATE OR REPLACE MODEL `project.dataset.churn_model`
OPTIONS(
model_type = "LOGISTIC_REG",
input_label_cols = ["churned"],
max_iterations = 50,
learn_rate = 0.001
) AS
SELECT
days_since_last_active,
avg_session_duration_minutes,
features_used_count,
support_tickets_opened,
billing_plan,
churned -- 1 = churned, 0 = retained
FROM `project.dataset.user_features`
WHERE training_date BETWEEN "2024-01-01" AND "2024-06-30";
-- Evaluate the model
SELECT * FROM ML.EVALUATE(MODEL `project.dataset.churn_model`);
-- Predict churn for active users
SELECT
user_id,
predicted_churned,
predicted_churned_probs[OFFSET(1)].prob AS churn_probability
FROM ML.PREDICT(
MODEL `project.dataset.churn_model`,
(SELECT * FROM `project.dataset.user_features` WHERE is_active = 1)
)
ORDER BY churn_probability DESC
LIMIT 1000; -- Top 1000 highest churn risk users
Google Ads Real-Time Pipeline:
1. Ad click event occurs → Pub/Sub publishes event (milliseconds)
2. Dataflow job reads from Pub/Sub → processes stream
3. Validates click (bot detection, duplicate removal)
4. Joins with advertiser budget (Bigtable lookup — low latency)
5. Writes to BigQuery streaming buffer
6. Advertiser dashboard updates within 10 seconds!
Python Dataflow (Apache Beam) Example:
import apache_beam as beam
import json

def validate_click(click_event):
    """Keep only non-bot, non-duplicate clicks (used as a Filter predicate)"""
    return not click_event["is_bot"] and not click_event["is_duplicate"]
pipeline = beam.Pipeline()
(
pipeline
| "Read from Pub/Sub" >> beam.io.ReadFromPubSub(topic="ad_clicks_topic")
| "Parse JSON" >> beam.Map(json.loads)
| "Validate Clicks" >> beam.Filter(validate_click)
| "Enrich with Budget" >> beam.ParDo(LookupAdvertiserBudget())
| "Write to BigQuery" >> beam.io.WriteToBigQuery(
table="project:ads.processed_clicks",
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
)
)
pipeline.run()
Common Data Quality Issues:
1. Missing values (NULLs)
2. Duplicates (same event recorded twice)
3. Schema drift (column types change)
4. Late-arriving data (streaming: event time vs processing time)
5. Data drift (distribution shifts over time)
BigQuery Data Quality Checks:
-- Daily automated quality check
SELECT
date,
COUNT(*) AS total_records,
COUNT(DISTINCT user_id) AS unique_users,
SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS null_amounts,
SUM(CASE WHEN amount < 0 THEN 1 ELSE 0 END) AS negative_amounts
FROM daily_transactions -- table name illustrative
GROUP BY date
HAVING null_amounts * 100.0 / total_records > 1.0 -- Alert if >1% nulls
OR total_records < 1000000; -- Alert if fewer records than expected
Netflix uses SQL daily for content performance tracking, user behavior analysis, churn prediction, and A/B test analysis. Every recommendation, pricing decision, and content investment is backed by SQL queries.
-- Find shows gaining momentum (viewership growing fastest)
WITH daily_views AS (
SELECT
show_id, show_name,
view_date,
COUNT(DISTINCT user_id) AS daily_viewers,
LAG(COUNT(DISTINCT user_id), 7) OVER (
PARTITION BY show_id
ORDER BY view_date
) AS viewers_last_week
FROM viewing_events
WHERE view_date >= CURRENT_DATE - 30
GROUP BY show_id, show_name, view_date
),
growth_calc AS (
SELECT *,
ROUND((daily_viewers - viewers_last_week) * 100.0 /
NULLIF(viewers_last_week, 0), 2) AS week_over_week_growth,
RANK() OVER (ORDER BY daily_viewers DESC) AS viewership_rank,
ROW_NUMBER() OVER (PARTITION BY show_id ORDER BY view_date DESC) AS rn
FROM daily_views
WHERE viewers_last_week IS NOT NULL
)
SELECT show_name, daily_viewers, viewers_last_week,
week_over_week_growth, viewership_rank
FROM growth_calc
WHERE rn = 1
ORDER BY week_over_week_growth DESC
LIMIT 10;
-- Shows with fastest growing viewership week-over-week
-- Netflix Monthly Cohort Retention
-- "Of users who subscribed in Jan 2024, how many are still watching in month 3?"
WITH cohorts AS (
SELECT
user_id,
DATE_TRUNC('month', first_watch_date) AS cohort_month
FROM (
SELECT user_id, MIN(view_date) AS first_watch_date
FROM viewing_events
GROUP BY user_id
) first_views
),
cohort_activity AS (
SELECT
c.user_id,
c.cohort_month,
DATE_TRUNC('month', v.view_date) AS activity_month,
DATEDIFF('month', c.cohort_month, DATE_TRUNC('month', v.view_date)) AS months_since_start
FROM cohorts c
JOIN viewing_events v ON c.user_id = v.user_id
GROUP BY c.user_id, c.cohort_month, DATE_TRUNC('month', v.view_date)
),
cohort_sizes AS (
SELECT cohort_month, COUNT(DISTINCT user_id) AS cohort_size
FROM cohorts
GROUP BY cohort_month
)
SELECT
ca.cohort_month,
ca.months_since_start,
COUNT(DISTINCT ca.user_id) AS retained_users,
cs.cohort_size AS original_size,
ROUND(100.0 * COUNT(DISTINCT ca.user_id) / cs.cohort_size, 1) AS retention_rate
FROM cohort_activity ca
JOIN cohort_sizes cs ON ca.cohort_month = cs.cohort_month
WHERE ca.cohort_month >= '2024-01-01'
GROUP BY ca.cohort_month, ca.months_since_start, cs.cohort_size
ORDER BY ca.cohort_month, ca.months_since_start;
-- Flag users at risk of cancellation (churn prediction features)
WITH user_engagement AS (
SELECT
user_id,
COUNT(DISTINCT view_date) AS active_days_30,
COUNT(DISTINCT show_id) AS unique_shows_watched,
SUM(watch_duration_minutes) AS total_minutes,
AVG(watch_duration_minutes) AS avg_session_length,
MAX(view_date) AS last_watch_date,
DATEDIFF('day', MAX(view_date), CURRENT_DATE) AS days_since_last_watch,
COUNT(DISTINCT DATE_TRUNC('week', view_date)) AS active_weeks
FROM viewing_events
WHERE view_date >= CURRENT_DATE - 30
GROUP BY user_id
),
user_profile AS (
SELECT
u.user_id, u.subscription_plan, u.monthly_fee,
u.subscription_start_date,
DATEDIFF('month', u.subscription_start_date, CURRENT_DATE) AS tenure_months
FROM users u
)
SELECT
e.user_id,
p.subscription_plan,
p.tenure_months,
e.active_days_30,
e.total_minutes,
e.days_since_last_watch,
e.unique_shows_watched,
CASE
    WHEN e.days_since_last_watch > 14 AND e.active_days_30 < 3 THEN 'HIGH RISK'
    WHEN e.days_since_last_watch > 7 AND e.active_days_30 < 7 THEN 'MEDIUM RISK'
    ELSE 'LOW RISK'
END AS churn_risk,
ROUND(
    CASE
        WHEN e.days_since_last_watch > 14 THEN 0.85
        WHEN e.days_since_last_watch > 7 THEN 0.50
        WHEN e.active_days_30 < 7 THEN 0.30
        ELSE 0.05
    END, 2
) AS churn_probability_score
FROM user_engagement e
JOIN user_profile p ON e.user_id = p.user_id
ORDER BY churn_probability_score DESC;
-- Analyze Netflix "New UI" A/B Test Results
WITH experiment_data AS (
SELECT
e.user_id,
e.variant, -- 'control' or 'treatment'
e.assignment_date,
COUNT(v.view_date) AS sessions_in_period,
SUM(v.watch_duration_minutes) AS total_watch_minutes,
COUNT(DISTINCT v.show_id) AS unique_shows,
CASE WHEN s.user_id IS NOT NULL THEN 1 ELSE 0 END AS is_still_subscribed
FROM experiment_assignments e
LEFT JOIN viewing_events v
ON e.user_id = v.user_id
AND v.view_date BETWEEN e.assignment_date AND e.assignment_date + 14
LEFT JOIN active_subscribers s ON e.user_id = s.user_id
GROUP BY e.user_id, e.variant, e.assignment_date, s.user_id
)
SELECT
variant,
COUNT(*) AS users,
ROUND(AVG(total_watch_minutes), 1) AS avg_watch_minutes,
ROUND(AVG(unique_shows), 2) AS avg_shows_watched,
ROUND(AVG(sessions_in_period), 2) AS avg_sessions,
ROUND(100.0 * SUM(is_still_subscribed) / COUNT(*), 2) AS retention_rate,
ROUND(STDDEV(total_watch_minutes), 1) AS std_watch_minutes
FROM experiment_data
GROUP BY variant;
-- Result interpretation:
-- control: avg_watch=87min, retention=92%
-- treatment: avg_watch=93min, retention=94%
-- Run statistical test → +6min watch (+6.9%), +2% retention → LAUNCH!
Netflix processes petabytes daily. SQL optimization is critical:
1. Use PARTITION BY in window functions (not OVER() alone)
→ Prevents a full table scan across billions of rows
2. Filter early with CTEs
→ WHERE before JOIN reduces data size
3. Avoid SELECT * in production
→ Only select needed columns
4. Use approximate counts (HyperLogLog) for huge cardinality
→ SELECT APPROX_COUNT_DISTINCT(user_id) is ~100x faster!
5. Partition tables by date
→ WHERE view_date >= '2024-01-01' uses partition pruning
6. EXPLAIN ANALYZE before running heavy queries
→ Shows the execution plan, identifies slow operations
Netflix uses Apache Spark SQL for queries involving terabytes of data — standard SQL concepts apply, but distributed across 1000s of machines!
Assumption: Relationship between X and y is LINEAR
y = β₀ + β₁X₁ + β₂X₂ + ε
Check: Scatter plot of residuals vs fitted values → should be random (no pattern)
Fix if violated: Add polynomial terms (X²), log transform, use tree models
Uber example: Surge multiplier vs ride demand MAY be non-linear at extremes
Assumption: Observations are independent (no autocorrelation)
Check: Durbin-Watson test (value ≈2 = no autocorrelation)
Fix: Add time lag features, use time series models (ARIMA, LSTM)
Uber example: Surge at 5pm affects 5:01pm → VIOLATES independence!
Solution: Use LSTM instead of pure regression
Assumption: Residual variance is CONSTANT across all values of X
Violation (Heteroscedasticity): Variance changes — like a fan shape
Check: Breusch-Pagan test, Residuals vs Fitted plot
Fix: Log transform y, Weighted Least Squares, Robust Standard Errors
Uber example: Variance of surge multiplier is higher during peak hours
Log-transform surge multiplier before regression!
Assumption: Residuals are normally distributed
Check: Q-Q plot (points follow the diagonal), Shapiro-Wilk test
Fix: More data (CLT kicks in), remove outliers, transform the target
Note: Less critical with large samples (CLT makes tests valid)
Assumption: Predictor variables are NOT highly correlated with each other
Problem: If X1 and X2 are correlated, you can't determine their individual effects
Check: Variance Inflation Factor (VIF)
VIF_j = 1 / (1 - R²_j), where R²_j = R² from regressing X_j on all other predictors
VIF < 5: Acceptable
VIF 5-10: Concerning
VIF > 10: Serious multicollinearity!
Fix: Remove one correlated variable, use PCA, Ridge Regression (L2)
Uber example: "Temperature" and "Rain" might be correlated → check VIF
Target: Surge Multiplier (1.0x to 5.0x)
Features (X variables):
Time Features:
- Hour of day (0-23)
- Day of week (1-7)
- Is weekend? (0/1)
- Days until holiday
Demand Features:
- Active ride requests in zone (last 5 min)
- Requests per driver in zone
- Historical demand at same time/location last week
Supply Features:
- Available drivers in zone
- Driver acceptance rate (last 10 min)
- Estimated driver arrival time
Context Features:
- Weather (rain, temperature, wind)
- Major event nearby (stadium, concert)
- Special occasion (New Year, festival)
Location:
- Zone ID (encoded)
- Distance from city center
Simple Linear: surge = β₀ + β₁(demand) + β₂(supply) + β₃(weather) + ε
→ Fails: non-linear relationships, doesn't capture interactions
Uber's Actual Approach (Gradient Boosted Trees + Regression):
1. Feature engineering: demand/supply ratio, rolling averages
2. XGBoost regression for surge prediction
3. Linear regression for interpretability / A/B testing analysis
4. Combine: XGBoost prediction + linear correction for edge cases
Key Plots:
1. Residuals vs Fitted: Check linearity & homoscedasticity
→ Random scatter = good ✅
→ Funnel shape = heteroscedasticity ❌
→ Curved pattern = non-linearity ❌
2. Q-Q Plot: Check normality of residuals
→ Points on the diagonal line = normal ✅
→ Heavy tails = non-normal ❌
3. Scale-Location Plot: Check homoscedasticity
→ Flat, horizontal red line = good ✅
4. Cook's Distance: Detect influential outliers
→ Points > 4/n = investigate these observations!
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Uber-style surge prediction features
# X = [demand_supply_ratio, hour, is_raining, event_nearby, temperature]
X_train = uber_features
y_train = surge_multiplier # Target
# Train with statsmodels (for diagnostics)
X_with_const = sm.add_constant(X_train)
model = sm.OLS(y_train, X_with_const).fit()
print(model.summary()) # Full statistics
# Check all 5 assumptions:
residuals = model.resid
fitted = model.fittedvalues
# 1. Linearity: Residuals vs Fitted
import matplotlib.pyplot as plt
plt.scatter(fitted, residuals)
plt.axhline(y=0, color="red")
plt.xlabel("Fitted Values"); plt.ylabel("Residuals")
plt.title("Residuals vs Fitted — Check Linearity")
# 2. Normality: Q-Q Plot
sm.qqplot(residuals, line="45")
plt.title("Q-Q Plot — Check Normality")
# 3. Homoscedasticity: Breusch-Pagan test
from statsmodels.stats.diagnostic import het_breuschpagan
bp_stat, bp_pvalue, _, _ = het_breuschpagan(residuals, X_with_const)
print(f"Breusch-Pagan p-value: {bp_pvalue:.4f}")
print("Homoscedastic" if bp_pvalue > 0.05 else "Heteroscedastic — fix needed!")
# 4. Independence: Durbin-Watson
from statsmodels.stats.stattools import durbin_watson
dw = durbin_watson(residuals)
print(f"Durbin-Watson: {dw:.2f} (2=independent, 10]) # Flag high VIF features
# Fix heteroscedasticity: Log transform target
log_surge = np.log(surge_multiplier + 0.01)
model_log = sm.OLS(log_surge, X_with_const).fit()
print(f"
Original R²: {model.rsquared:.4f}")
print(f"Log-transformed R²: {model_log.rsquared:.4f}")
# Ridge regression to handle multicollinearity
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print(f"Ridge coefficients: {dict(zip(feature_names, ridge.coef_))}")
P(H|E) = P(E|H) × P(H) / P(E) Where: P(H|E) = Posterior: Updated belief AFTER seeing evidence P(E|H) = Likelihood: Probability of evidence given hypothesis P(H) = Prior: Initial belief BEFORE evidence P(E) = Marginal Likelihood: Normalizing constant Key Insight: "Update your prior belief with new evidence to get posterior belief" Analogy: Prior: "It's probably not raining" (30% chance — general belief) Evidence: You see people with umbrellas Likelihood: Very likely if raining, unlikely if not Posterior: "Now 90% chance it's raining!"
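A minimal numeric sketch of the umbrella example (the prior and likelihood values here are illustrative assumptions, so the exact posterior depends on them):
# Bayes' rule on the umbrella example — all numbers are illustrative assumptions
p_rain = 0.30                 # Prior P(H): belief it is raining
p_umb_given_rain = 0.90       # Likelihood P(E|H): umbrellas seen if raining
p_umb_given_dry = 0.10        # P(E|not H): umbrellas seen if not raining
# Marginal likelihood P(E) = P(E|H)P(H) + P(E|not H)P(not H)
p_umb = p_umb_given_rain * p_rain + p_umb_given_dry * (1 - p_rain)
# Posterior P(H|E) = P(E|H)P(H) / P(E)
p_rain_given_umb = p_umb_given_rain * p_rain / p_umb
print(f"P(rain | umbrellas seen) = {p_rain_given_umb:.2f}")  # ≈ 0.79 with these numbers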
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Probability | Long-run frequency of events | Degree of belief |
| Output | p-value, point estimate | Full posterior distribution |
| Prior Knowledge | Not used | Explicitly incorporated |
| Small Data | Unreliable | Works well with priors |
| Uncertainty | Fixed parameter, random data | Random parameter, fixed data |
| Industry Use | Clinical trials, standard tests | Real-time, adaptive systems |
Azure monitors millions of metrics per second:
- CPU usage, memory, network traffic
- API error rates, response times
- Revenue per minute, user sign-ins
Goal: Detect anomalies in real-time
Alert engineers BEFORE users are impacted
Minimize false alarms (alert fatigue!)
Frequentist approach:
"If CPU > 90% for 5 minutes → alert"
Problems:
→ Doesn't account for time of day (90% normal at 9am, anomaly at 3am)
→ No memory of recent patterns
→ Can't distinguish temporary spike vs sustained anomaly
Bayesian approach:
Prior: What's the typical CPU distribution at THIS time of day?
(learned from past 30 days of data)
Likelihood: How likely is TODAY's observation given that prior?
Posterior: Updated distribution → anomaly score
→ Automatically adapts to patterns!
Model: Bayesian Structural Time Series (BSTS)
State Space Decomposition:
y_t = μ_t + τ_t + β_t + ε_t
μ_t = trend component (drifts slowly over time)
τ_t = seasonal component (weekly, daily patterns)
β_t = regression component (event effects — holidays, deployments)
ε_t = anomaly term (what we want to detect!)
Bayesian Update:
At each time step t:
Prior: P(state at t | data up to t-1)
→ Based on Kalman Filter prediction
Likelihood: P(observation | state at t)
Posterior: P(state at t | all data up to t)
→ Updated belief about true state
Normal pattern: 0.1% error rate, Bayesian prior distribution N(0.001, 0.0001) Sudden spike: Error rate = 5% Bayesian posterior: "This is 3.2 standard deviations from normal" Anomaly Score = P(error ≥ 5% | normal conditions) = 0.00065 Alert threshold: P < 0.001 → FIRE ALERT! Message: "API error rate anomaly detected. Expected: 0.1% ± 0.03%, Actual: 5%" Advantage over threshold-based: → During traffic spikes, Bayesian prior adjusts automatically → Reduces false positives by 40% vs fixed thresholds → Azure SLA: detects issues 10 minutes earlier on average
import numpy as np
import pymc as pm
import matplotlib.pyplot as plt
# Bayesian model for CPU anomaly detection
with pm.Model() as anomaly_model:
# Prior: Normal CPU usage (learned from historical data)
mu_prior = pm.Normal("mu", mu=0.45, sigma=0.10) # Expected CPU ~45%
sigma_prior = pm.HalfNormal("sigma", sigma=0.10) # Uncertainty in normal CPU
# Likelihood: Current observations
cpu_observations = pm.Normal(
"cpu", mu=mu_prior, sigma=sigma_prior,
observed=current_cpu_readings
)
# Posterior inference
trace = pm.sample(2000, tune=1000, return_inferencedata=True)
# Get posterior distribution for current mean
posterior_mu = trace.posterior["mu"].values.flatten()
print(f"Posterior mean CPU: {posterior_mu.mean():.3f}")
print(f"95% Credible Interval: [{np.percentile(posterior_mu, 2.5):.3f}, {np.percentile(posterior_mu, 97.5):.3f}]")
# Anomaly detection
observed_cpu = 0.92 # 92% CPU — is this anomalous?
p_anomaly = np.mean(posterior_mu > observed_cpu)
print(f"P(normal CPU > {observed_cpu}) = {p_anomaly:.6f}")
if p_anomaly < 0.001:  # alert threshold from the example above
    print("🚨 ANOMALY ALERT: observed CPU is far outside the normal posterior!")
# Bayesian A/B-style comparison: samples_A / samples_B are posterior draws for two variants
# (e.g. conversion-rate posteriors built the same way as the model above)
p_B_better = np.mean(samples_B > samples_A)
print(f"P(B > A) = {p_B_better:.3f}")
print(f"Decision: {'Launch B ✅' if p_B_better > 0.95 else 'More data needed ⏳'}")
Definition: A controlled experiment where two versions (A=control, B=treatment) are shown to different user groups to determine which performs better statistically.
Null Hypothesis (H₀): No difference between A and B (new feature has no effect) Alt. Hypothesis (H₁): B is significantly better than A p-value: Probability of seeing this result IF H₀ is true → p < 0.05: Reject H₀ → statistically significant! Significance Level (α): Threshold (typically 0.05 = 5% chance of false positive) Statistical Power (1-β): Probability of detecting real effect (aim for 80%) Minimum Detectable Effect (MDE): Smallest improvement worth detecting
Google Search: 8.5 billion searches/day → Even 0.01% improvement = 850,000 better searches/day → A/B test must detect tiny improvements reliably → Can't show bad results to too many users → Many experiments running simultaneously (thousands!)
Hypothesis: New BERT-based ranking improves result quality
Primary Metric: Click-Through Rate (CTR) on top result
Secondary Metrics: Long-click rate, query reformulations, session length
Guardrail Metrics: Query latency < 200ms, ad revenue stable
Sample Size Calculation:
Baseline CTR = 34%
MDE = +1% (detect if CTR improves to 35%)
α = 0.05, Power = 0.80
Required sample size = 2 × [(Zα/2 + Zβ)² × p(1-p)] / MDE²
= 2 × [(1.96 + 0.84)² × 0.34×0.66] / 0.01²
≈ 35,000 searches per group
Google uses: Diversion unit = search query session (not individual query) → Ensures user sees consistent experience Traffic Split: 10% users → Control (current ranking) 10% users → Treatment (new BERT ranking) 80% users → Not in experiment (buffer) Why only 20%? → If new algorithm is terrible, limit damage to 10% of users → Can detect effects with 10% each side statistically
Duration: 2 weeks minimum → Day 1-3: Novelty effect (users behave differently with new things) → Week 1: Detect weekly patterns (Mon vs Fri behavior) → Week 2: Confirm effect is stable and real Monitor daily: → Is the experiment affecting critical systems? → Any unexpected spikes/crashes? → Is data pipeline working correctly?
Results: Control: CTR = 34.2% (n=180,000 sessions) Treatment: CTR = 35.1% (n=182,000 sessions) Two-Sample Z-Test for proportions: z = (p_T - p_C) / √[p̄(1-p̄)(1/n_T + 1/n_C)] z = (0.351 - 0.342) / √[0.346×0.654×(1/182000 + 1/180000)] z = 0.009 / 0.00158 = 5.70 p-value = P(|Z| > 5.70) = 0.0000000 << 0.05 → HIGHLY SIGNIFICANT! Launch the change! 95% Confidence Interval: [+0.7%, +1.1%] improvement in CTR
→ CTR improved 0.9% (statistically significant) → Latency OK (<200ms) → Ad revenue stable (+0.1%, not significant — good) → Long-click rate improved +2% (users finding better answers) → DECISION: LAUNCH to 100% of users
Instead of showing A to one group, B to another: → Interleave A and B results in single SERP → User implicitly votes by clicking → 100x more efficient — need 100x fewer users! Used by Google, Netflix, Spotify for ranking algorithms
Frequentist: Run fixed time, then decide Bayesian: Update beliefs continuously, stop when confident P(B > A | data) — direct probability! "87% confident that B is better by at least 0.5%" More intuitive, can stop early safely
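A minimal sketch of the Bayesian version using a Beta-Binomial model (the click counts are hypothetical); posterior draws like these are what the `samples_A` / `samples_B` comparison in the earlier anomaly-detection snippet referred to:
import numpy as np
# Hypothetical click counts for control (A) and treatment (B)
clicks_A, n_A = 6156, 18000
clicks_B, n_B = 6318, 18000
# Beta(1, 1) uniform prior + Binomial likelihood → Beta posterior
rng = np.random.default_rng(42)
samples_A = rng.beta(1 + clicks_A, 1 + n_A - clicks_A, size=100_000)
samples_B = rng.beta(1 + clicks_B, 1 + n_B - clicks_B, size=100_000)
p_B_better = np.mean(samples_B > samples_A)
lift = np.mean(samples_B - samples_A)
print(f"P(B > A | data) = {p_B_better:.3f}, expected lift = {lift*100:.2f} pp")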
Use pre-experiment metric data to reduce variance → Adjusts for individual user baseline differences → Can detect same effect with 50% fewer users! Google, Microsoft use this routinely
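A small sketch of the CUPED adjustment on simulated data (variable names and numbers are assumptions): each user's pre-experiment metric x is used as a covariate to shrink the variance of the in-experiment metric y.
import numpy as np

def cuped_adjust(y, x):
    # CUPED: y_adj = y - theta * (x - mean(x)), with theta = cov(x, y) / var(x)
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Simulated data: x = each user's pre-experiment metric, y = in-experiment metric
rng = np.random.default_rng(0)
x = rng.normal(10, 3, size=5000)
y = x + rng.normal(0.5, 2, size=5000)   # y is correlated with its pre-period value
y_adj = cuped_adjust(y, x)
print(f"Variance before: {y.var():.2f}, after CUPED: {y_adj.var():.2f}")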
import numpy as np
from scipy import stats
import statsmodels.stats.proportion as proportion
# Google Search A/B Test Results
n_control = 180000
n_treatment = 182000
clicks_control = 61560 # 34.2% CTR
clicks_treatment = 63882 # 35.1% CTR
p_control = clicks_control / n_control
p_treatment = clicks_treatment / n_treatment
# Two-proportion Z-test
z_stat, p_value = proportion.proportions_ztest(
count=[clicks_treatment, clicks_control],
nobs=[n_treatment, n_control],
alternative="larger" # One-tailed: is treatment better?
)
print(f"Control CTR: {p_control:.3f} ({p_control*100:.1f}%)")
print(f"Treatment CTR: {p_treatment:.3f} ({p_treatment*100:.1f}%)")
print(f"Lift: +{(p_treatment - p_control)*100:.2f}%")
print(f"Z-statistic: {z_stat:.2f}")
print(f"P-value: {p_value:.6f}")
print(f"Result: {"LAUNCH ✅" if p_value < 0.05 else "DO NOT LAUNCH ❌"}")
# 95% Confidence Interval for lift
ci = proportion.confint_proportions_2indep(
clicks_treatment, n_treatment,
clicks_control, n_control,
alpha=0.05, method="newcomb"
)
print(f"95% CI: [{ci[0]*100:.2f}%, {ci[1]*100:.2f}%]")
# Power analysis — how many users do we need?
from statsmodels.stats.power import NormalIndPower
analysis = NormalIndPower()
n = analysis.solve_power(
effect_size=0.01 / (0.34 * 0.66)**0.5, # standardized effect
alpha=0.05,
power=0.80,
alternative="two-sided"
)
print(f"Required sample size per group: {int(n):,}")
| Aspect | BERT | GPT (4) |
|---|---|---|
| Architecture | Transformer Encoder only | Transformer Decoder only |
| Reading Direction | Bidirectional (left + right) | Left-to-right only (autoregressive) |
| Pre-training Task | Masked Language Model (MLM) | Next Token Prediction (CLM) |
| Best For | Understanding tasks | Generation tasks |
| Use Cases | Classification, QA, NER | Chat, Code, Writing, Reasoning |
| Analogy | Good at reading comprehension | Good at creative writing |
| Company | Google (2018) | OpenAI (2020-2024) |
Parameters: ~1.8 Trillion (estimated) Training Data: 300 Billion+ tokens from internet, books, code Context Window: 128,000 tokens (GPT-4 Turbo) Training: Next token prediction on massive text corpus
Prompt: "The capital of France is" Step 1: Model sees ["The", "capital", "of", "France", "is"] Step 2: Predicts next token → "Paris" (probability distribution over 50K vocab) Step 3: Appends "Paris" to context Step 4: Predicts next token → "." (period) Step 5: Generates until stop token or max length Each step: Full forward pass through all transformer layers GPT-4 Turbo: 128 decoder layers × attention → ~50ms per token
1. Pre-training: Predict next token on 300B tokens of text → Learns world knowledge, language patterns, reasoning 2. Supervised Fine-Tuning (SFT): → Human trainers write ideal responses to prompts → Fine-tune GPT on these demonstrations 3. RLHF (Reinforcement Learning from Human Feedback): → Human raters rank different model responses → Train Reward Model on rankings → Use PPO (RL algorithm) to optimize GPT against reward model → Result: Aligned, helpful, harmless ChatGPT!
Microsoft invested $13B in OpenAI → Access to GPT-4 API → Azure OpenAI Service (enterprise-grade deployment) → Integrated into every Microsoft product as "Copilot"
How it works: 1. Developer writes comment/function signature 2. GitHub Copilot sends context to GPT-4 Codex 3. GPT generates code completion 4. Developer accepts/rejects suggestion Training: Fine-tuned GPT on 1B+ lines of GitHub code Results: → 55% of code written by Copilot (GitHub study 2023) → 88% developers say more productive → $19/month subscription, 1M+ users
Word: "Draft an email declining this meeting politely"
→ GPT reads email thread → generates professional decline
Excel: "Create a chart showing sales trends by region"
→ GPT writes Excel formula + creates chart
PowerPoint: "Create a 10-slide pitch deck about our Q4 results"
→ GPT generates slides from uploaded data
Teams: Real-time meeting transcript → GPT generates summary + action items
Enterprise API for GPT-4:
- Security: Data doesn't train OpenAI models
- Compliance: HIPAA, SOC2, EU Data Residency
- Scale: Auto-scales to millions of API calls
- Used by: Shell, KPMG, H&R Block, Volkswagen
Implementation:
import openai
openai.api_type = "azure"  # required when calling Azure OpenAI with the pre-1.0 SDK
openai.api_base = "https://YOUR-RESOURCE.openai.azure.com/"
openai.api_version = "2023-05-15"  # example Azure API version
openai.api_key = AZURE_KEY
response = openai.ChatCompletion.create(
engine="gpt-4-turbo",
messages=[{"role":"user", "content":"Summarize this sales report"}]
)
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
# Load GPT-2 (smaller GPT for demonstration)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
# Generate text (like GPT-4 but smaller)
def generate(prompt, max_new_tokens=100, temperature=0.7):
inputs = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
outputs = model.generate(
inputs,
max_new_tokens=max_new_tokens,
temperature=temperature, # Creativity: higher = more random
top_k=50, # Sample from top 50 tokens
top_p=0.95, # Nucleus sampling
do_sample=True,
pad_token_id=tokenizer.eos_token_id
)
return tokenizer.decode(outputs[0], skip_special_tokens=True)
# Test
result = generate("The future of artificial intelligence is")
print(result)
# For Microsoft Copilot-style chat
from openai import AzureOpenAI
client = AzureOpenAI(
azure_endpoint="https://YOUR-ENDPOINT.openai.azure.com",
api_key="YOUR-KEY",
api_version="2024-02-01"
)
response = client.chat.completions.create(
model="gpt-4-turbo",
messages=[
{"role": "system", "content": "You are a helpful Microsoft 365 Copilot assistant."},
{"role": "user", "content": "Summarize this quarterly report in 3 bullet points"}
],
max_tokens=500,
temperature=0.7
)
print(response.choices[0].message.content)
GPT-3: 175B parameters, 700GB storage, needs 8 A100 GPUs BERT-Large: 340M parameters, 1.3GB, 400ms inference For production (millions of requests): → Cost: Too expensive (GPU cost = $3-5 per 1000 requests) → Latency: 400ms too slow (need <50ms for real-time) → Edge: Cant run on phone/IoT device Solution: Model Compression via Quantization + Pruning
Idea: Reduce numerical precision of weights from 32-bit float → 8-bit or 4-bit integers.
| Precision | Bits | Memory | Speed | Accuracy Loss |
|---|---|---|---|---|
| FP32 (full) | 32 bits | 4 bytes/param | Baseline | None |
| FP16 (half) | 16 bits | 2 bytes/param | 2x faster | <0.1% |
| INT8 | 8 bits | 1 byte/param | 4x faster | <1% |
| INT4 | 4 bits | 0.5 bytes/param | 8x faster | 1-3% |
Example: BERT-Large with quantization FP32: 340M params × 4 bytes = 1.3GB, 400ms INT8: 340M params × 1 byte = 340MB, 100ms 4x faster, 4x smaller, <1% accuracy drop!
1. Post-Training Quantization (PTQ): → Quantize after training, no re-training needed → Fast but slight accuracy drop 2. Quantization-Aware Training (QAT): → Simulate quantization during training → Better accuracy, slower to implement → Used by Google on Pixel phones 3. Dynamic Quantization: → Weights stored as INT8, converted to FP32 at runtime → Good for NLP (LSTM, BERT)
Idea: Remove unimportant weights (close to zero) — like trimming dead branches from a tree.
Research shows: In a trained neural network, ~90% of weights are redundant and can be removed with minimal accuracy loss! Lottery Ticket Hypothesis (MIT 2019): Large networks contain small "winning ticket" subnetworks → These subnetworks can be trained alone to same accuracy!
| Type | What's Removed | Speedup |
|---|---|---|
| Weight Pruning | Individual weights (unstructured) | 2-5x (with sparse hardware) |
| Neuron Pruning | Entire neurons | 2-3x |
| Filter Pruning | Entire CNN filters | 2-4x (hardware-friendly) |
| Attention Head Pruning | BERT attention heads | 1.5-2x |
| Layer Pruning | Entire layers | 3-5x |
Step 1: Train Full Model → FP32 precision, full size Step 2: Pruning (remove 30-90% weights) → Use Azure ML Automated Pruning → Iterative: prune 10% → retrain → prune 10% → retrain Step 3: Knowledge Distillation → Large "Teacher" model → Small "Student" model → Azure ML has DistilBERT, DistilGPT2 built-in Step 4: Quantization → Azure Model Optimizer converts to INT8 → ONNX Runtime for deployment (works on any hardware) Step 5: Deploy on Azure → Azure Kubernetes Service (AKS) for auto-scaling → Azure IoT Edge for edge devices → Azure Cognitive Services API for customers
Use Case: Extract text from invoices, receipts, IDs Original Model: 500MB BERT-based model → Too slow (800ms), too large for edge After Optimization: - Pruning: Remove 40% of attention heads → 300MB - Quantization INT8: 300MB → 75MB - Knowledge Distillation to DistilBERT: 75MB → 30MB - Final: 30MB model, 80ms latency, 95% accuracy Deployed on: → Azure Cloud: millions of API calls/day → Azure IoT Edge: runs on factory floor, offline → Surface laptops: Windows Hello face recognition
import torch
import torch.nn.utils.prune as prune
from torch.quantization import quantize_dynamic
# 1. Dynamic Quantization (easiest — Microsoft recommends for BERT)
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
quantized_model = quantize_dynamic(
model,
{torch.nn.Linear}, # Quantize linear layers
dtype=torch.qint8 # INT8 quantization
)
print(f"Original: {get_model_size(model):.1f} MB") # ~420 MB
print(f"Quantized: {get_model_size(quantized_model):.1f} MB") # ~105 MB
# 2. Magnitude-based weight pruning (L1, unstructured — removes individual low-magnitude weights)
def prune_model(model, pruning_rate=0.3):
for name, module in model.named_modules():
if isinstance(module, torch.nn.Linear):
prune.l1_unstructured(module, name="weight", amount=pruning_rate)
prune.remove(module, "weight") # Make permanent
return model
pruned_model = prune_model(model, pruning_rate=0.4)
# 3. Export to ONNX for Azure deployment
dummy_input = (
torch.ones(1, 128, dtype=torch.long),
torch.ones(1, 128, dtype=torch.long)
)
torch.onnx.export(
quantized_model, dummy_input, "azure_model.onnx",
input_names=["input_ids", "attention_mask"],
output_names=["logits"],
dynamic_axes={"input_ids": {0: "batch_size", 1: "sequence"}}
)
print("Model exported for Azure deployment!")
Core Idea: Instead of reading all input equally, let the model focus on the most relevant parts when making each decision.
Reading: "The trophy didnt fit in the suitcase because it was too large" → To understand "it", your brain focuses on (attends to) "trophy" → Not "didnt", "fit", "because" — those are less relevant here Attention mechanism does the same!
1. For each query position, compute score with all keys: score(q, k) = q · k^T / √d_k (dot-product similarity) 2. Normalize with Softmax → get attention weights (sum to 1) α = softmax(scores) 3. Weighted sum of Values: output = Σ(α_i × v_i) High attention weight → that position is very relevant Low attention weight → ignore that position
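The three steps above in a few lines of NumPy (toy shapes and random data — not the production model):
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # 1. dot-product similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # 2. softmax → attention weights (sum to 1)
    return weights @ V, weights                                # 3. weighted sum of values

Q = np.random.randn(4, 8)   # 4 query positions, d_k = 8
K = np.random.randn(6, 8)   # 6 key/value positions
V = np.random.randn(6, 8)
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.sum(axis=-1))  # (4, 8); each row of weights sums to 1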
Why Multiple Heads? Different heads can focus on different aspects simultaneously.
Head 1: Focuses on syntactic relationships (subject-verb) Head 2: Focuses on semantic meaning (synonyms) Head 3: Focuses on long-range dependencies Head 4: Focuses on coreference (pronoun resolution) MultiHead(Q,K,V) = Concat(head_1,...,head_h) × W_O where head_i = Attention(Q×W_i^Q, K×W_i^K, V×W_i^V)
Example with 8 heads (BERT-Base): Each head: 768/8 = 96 dimensional → 8 different "views" of same input → Concatenated back to 768 dimensions → More expressive than single attention!
500M+ members, 20M+ job listings Must recommend RIGHT jobs to RIGHT people at RIGHT time User has: profile, skills, experience, connections, activity
User profile has: [title, skills, experience, education, location] Query: Job description embedding Key/Value: User profile sections Attention weights might be: - Job title match: 0.45 (highest — most relevant) - Required skills: 0.30 - Years experience: 0.15 - Education: 0.07 - Location: 0.03 → Model focuses on title + skills for this job recommendation
Job 1: "Junior Data Analyst" Job 2: "Data Analyst" Job 3: "Senior Data Analyst" Job 4: "Data Science Lead" → Attention learns career trajectory pattern → Recommends "Head of Data Science" or "VP of Analytics" → NOT "Junior Analyst" (backwards step)
Job requires: [Python, SQL, TensorFlow, Communication, Leadership] User has: [Python: 5yr, SQL: 3yr, R: 2yr, Java: 4yr] Cross-attention aligns: Python ↔ Python (high match: 0.9) SQL ↔ SQL (high match: 0.8) TensorFlow ↔ R (partial match: 0.3) Java ↔ none (irrelevant: 0.1) → 68% match → Show job with "Skill Gap: TensorFlow"
Input: User profile + Job description
↓
Dual Encoder with Multi-Head Attention:
User Tower: Self-attention over profile sections (12 heads)
Job Tower: Self-attention over job requirements (12 heads)
↓
Cross-Attention: Align user skills ↔ job requirements
↓
Relevance Score → Ranking
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class MultiHeadAttention(nn.Module):
def __init__(self, d_model=512, num_heads=8):
super().__init__()
self.d_k = d_model // num_heads
self.num_heads = num_heads
self.W_q = nn.Linear(d_model, d_model)
self.W_k = nn.Linear(d_model, d_model)
self.W_v = nn.Linear(d_model, d_model)
self.W_o = nn.Linear(d_model, d_model)
def split_heads(self, x, batch_size):
x = x.view(batch_size, -1, self.num_heads, self.d_k)
return x.transpose(1, 2) # (batch, heads, seq, d_k)
def forward(self, query, key, value, mask=None):
batch_size = query.shape[0]
Q = self.split_heads(self.W_q(query), batch_size)
K = self.split_heads(self.W_k(key), batch_size)
V = self.split_heads(self.W_v(value), batch_size)
# Scaled dot-product attention
scores = torch.matmul(Q, K.transpose(-2,-1)) / math.sqrt(self.d_k)
if mask is not None: scores = scores.masked_fill(mask==0, -1e9)
attn_weights = F.softmax(scores, dim=-1)
context = torch.matmul(attn_weights, V)
context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
return self.W_o(context), attn_weights
# LinkedIn: Get attention weights to see which skills matched
model = MultiHeadAttention(d_model=512, num_heads=8)
user_embedding = torch.randn(1, 5, 512)   # (batch, profile sections, d_model) — dummy data for illustration
job_embedding = torch.randn(1, 5, 512)    # (batch, job requirement tokens, d_model) — dummy data
output, attention_weights = model(user_embedding, job_embedding, job_embedding)
print(f"Top skill attention: {attention_weights[0, 0, :, :].topk(3)}")
Definition: Take a model trained on large data (source task), reuse its learned knowledge for a new, related task (target task).
Analogy: A doctor who learned biology, chemistry, anatomy (pre-training) can quickly specialize in cardiology (fine-tuning) — doesn't start from scratch!
Layer 1: Learns edges, corners (universal — any image task) Layer 2: Learns textures, patterns (still universal) Layer 3: Learns parts (eyes, wheels, branches) Layer 4: Learns objects (specific to original task) Layer 5: Classification head (task-specific — replace this!)
| Approach | What Gets Trained | Data Needed | Use Case |
|---|---|---|---|
| Feature Extraction | Only new head layers | Very little (100s) | Very different task/small data |
| Fine-Tuning (partial) | Last few layers + head | Medium (1K-10K) | Related task, medium data |
| Full Fine-Tuning | All layers + head | Large (100K+) | Similar task, lots of data |
Apple processes AI on the device (not cloud) for privacy. iPhone has limited compute. Solution: Transfer Learning enables small, efficient models that are powerful.
Pre-trained: MobileNetV3 on 1M+ faces (large-scale face dataset) Fine-tuned: Your face (30 images at setup) On-device: Neural Engine processes in <1ms Privacy: Face data NEVER leaves your phone
Pre-trained: Large CNN on 100M labeled images (Apple servers) Transfer to device: Compressed MobileNet (only 20MB) Fine-tuning: Learns YOUR photos — your dog, family, places Result: "People", "Places", "Pets" albums created automatically
Pre-trained: Acoustic model on 1B+ audio samples (server) Fine-tuned: Adapts to YOUR voice pattern On-device: Runs locally for "Hey Siri" — no internet needed Privacy: Voice never sent to Apple
Pre-trained: GPT-style language model on internet text Fine-tuned: Learns YOUR writing style, slang, names All on-device: No keystrokes ever sent to servers Result: Predicts "bae", custom words, Tamil names correctly
Training Pipeline: Large Model (Servers) → Knowledge Distillation → Small Model (Phone) Knowledge Distillation: Teacher Model: 340M params (BERT-Large) — server trained Student Model: ~14M params (DistilBERT-style compact model) — phone deployed Student learns from the teacher's soft predictions → 90% accuracy with 4% of parameters! Apple Neural Engine (ANE): - 16-core Neural Engine in A17 Pro chip - 38 TOPS (trillion operations per second) - Optimized for matrix multiplication (ML core ops) - Energy efficient: 10x less power than GPU
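A minimal PyTorch sketch of a distillation loss of this kind (temperature, weighting, and tensor shapes are illustrative assumptions, not Apple's actual training code):
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: student matches the teacher's softened distribution (KL divergence)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean"
    ) * (T * T)
    # Hard targets: standard cross-entropy on the true labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 10)   # batch of 8, 10 classes (dummy data)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))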
import tensorflow as tf
from tensorflow.keras.applications import MobileNetV3Large
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D, Dropout
from tensorflow.keras.models import Model
# Step 1: Load pre-trained MobileNetV3 (efficient for mobile)
base_model = MobileNetV3Large(
weights="imagenet",
include_top=False,
input_shape=(224, 224, 3)
)
# Step 2: Freeze base model (feature extraction phase)
base_model.trainable = False
# Step 3: Add custom classification head
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(256, activation="relu")(x)
x = Dropout(0.3)(x)
num_classes = 10  # example: number of target photo categories (assumption)
output = Dense(num_classes, activation="softmax")(x)
model = Model(inputs=base_model.input, outputs=output)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
loss="categorical_crossentropy",
metrics=["accuracy"])
# Phase 1: Train only head (10 epochs)
model.fit(train_data, epochs=10, validation_data=val_data)
# Step 4: Fine-tune last 30 layers
base_model.trainable = True
for layer in base_model.layers[:-30]:
layer.trainable = False
# Phase 2: Fine-tune with very low LR
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
loss="categorical_crossentropy",
metrics=["accuracy"])
model.fit(train_data, epochs=20, validation_data=val_data)
# Convert to TFLite for iPhone
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT] # Quantization
tflite_model = converter.convert()
# Save for iPhone deployment
with open("apple_model.tflite", "wb") as f:
f.write(tflite_model)
Transformers replaced RNNs by processing all words simultaneously using Self-Attention, making them much faster and more accurate.
| Component | Purpose |
|---|---|
| Self-Attention | Each word attends to every other word — understands context |
| Multi-Head Attention | Multiple attention heads = multiple perspectives simultaneously |
| Positional Encoding | Adds word order info (since no sequential processing) |
| Feed Forward Network | Non-linear transformation after attention |
| Layer Normalization | Stabilizes training |
| Encoder + Decoder | Encoder reads input, Decoder generates output |
Attention(Q, K, V) = softmax(QK^T / √d_k) × V Q = Query (what am I looking for?) K = Key (what do I contain?) V = Value (what info do I have?) d_k = key dimension (for scaling) Example: "The animal didn't cross the street because it was too tired" → "it" attends most strongly to "animal" — Self-attention resolves this!
Old GPT: "The cat sat on" → predicts next word
BERT: "The cat [MASK] on the mat" → predicts masked word
→ Reads BOTH left AND right context simultaneously
BERT-Base: 12 Transformer layers, 768 hidden size, 12 attention heads → 110M parameters BERT-Large: 24 Transformer layers, 1024 hidden size, 16 attention heads → 340M parameters
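A quick way to see the masked-word prediction from BERT's MLM pre-training in action, using the Hugging Face fill-mask pipeline:
from transformers import pipeline

# BERT predicts the [MASK] token using both left and right context
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The cat [MASK] on the mat.")[:3]:
    print(f"{pred['token_str']:>10}  (score: {pred['score']:.3f})")
# Top predictions are typically words like "sat" or "lay"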
Query: "Can you get medicine for someone pharmacy" Old System: Focused on "medicine", "pharmacy" → returned generic pharmacy info User Intent: Can someone ELSE pick up my prescription? Query: "2019 brazil traveler to USA need visa" Old System: "Brazil", "USA", "visa" → results about Americans going to Brazil Correct: Person FROM Brazil traveling TO USA — needs to understand direction
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import torch
# Load pre-trained BERT
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# Tokenize input
text = "Best smartphone under 50000 rupees in India"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
# Predict
with torch.no_grad():
outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=1)
print(f"Predicted class: {prediction.item()}")
# Fine-tune on custom data
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=16,
learning_rate=2e-5, # Small LR for fine-tuning
warmup_steps=500,
weight_decay=0.01,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset
)
trainer.train()
| Model | Key Difference | Use Case |
|---|---|---|
| BERT | Bidirectional encoder | Classification, NER, QA |
| RoBERTa | BERT + more data + better training | Better performance |
| DistilBERT | 40% smaller, 60% faster, ~97% of BERT accuracy | Mobile/edge deployment |
| GPT-4 | Unidirectional decoder | Text generation |
| T5 | Encoder-Decoder, text-to-text | Translation, summarization |
| Layer | Function | Example Output |
|---|---|---|
| Conv Layer | Extracts features using filters (edges, textures, shapes) | Feature maps |
| ReLU | Activation — removes negative values | Non-linear features |
| Pooling | Reduces spatial size, keeps key info | Smaller feature maps |
| Flatten | Converts 2D → 1D | 1D vector |
| Dense/FC | Classification/Regression | Final prediction |
Input Image (224x224x3)
↓
Conv Layer (32 filters, 3x3) → Feature Maps (222x222x32)
↓
ReLU Activation
↓
MaxPooling (2x2) → (111x111x32)
↓
Conv Layer (64 filters, 3x3) → (109x109x64)
↓
MaxPooling → (54x54x64)
↓
Flatten → (186,624 neurons)
↓
Dense Layer (512 neurons)
↓
Output Layer (Softmax)
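The flow above written as a minimal Keras model (layer sizes follow the diagram; the 10-class output layer is a placeholder):
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),   # → (222, 222, 32)
    layers.MaxPooling2D((2, 2)),                    # → (111, 111, 32)
    layers.Conv2D(64, (3, 3), activation="relu"),   # → (109, 109, 64)
    layers.MaxPooling2D((2, 2)),                    # → (54, 54, 64)
    layers.Flatten(),                               # → 186,624 values
    layers.Dense(512, activation="relu"),
    layers.Dense(10, activation="softmax")          # 10 classes (placeholder)
])
model.summary()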
Netflix found that 82% of viewer attention goes to thumbnails. Different users click on different visual styles for the SAME movie. Example: For "Stranger Things" — one user prefers the horror elements, another prefers character portraits.
Movie Frames → CNN (ResNet-50) → Embedding Vector (2048-dim)
↓
User History → Embedding Layer → User Vector (512-dim)
↓
Similarity Score = Dot Product
↓
Serve Top-Scoring Thumbnail
import tensorflow as tf
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
# Load pre-trained ResNet50 (trained on ImageNet)
base_model = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
# Add custom layers for thumbnail scoring
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(512, activation="relu")(x)
output = Dense(1, activation="sigmoid", name="engagement_score")(x)
# Netflix thumbnail model
netflix_model = Model(inputs=base_model.input, outputs=output)
# Freeze base, train only custom layers
for layer in base_model.layers:
layer.trainable = False
netflix_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# Training: X = thumbnail images, y = click rate (1=clicked, 0=not clicked)
netflix_model.fit(X_thumbnails, y_clicks, epochs=10, batch_size=32, validation_split=0.2)
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Precision = TP / (TP + FP) Question: "Of all cases we predicted as positive, how many were actually positive?" Focus: How accurate are our positive predictions?
Recall = TP / (TP + FN) Question: "Of all actual positive cases, how many did we catch?" Focus: How many positive cases did we miss?
F1 = 2 * (Precision * Recall) / (Precision + Recall) Balance between Precision and Recall
Dataset: 10,000 transactions Actual Frauds: 100 (1% - highly imbalanced!) Legitimate: 9,900 (99%)
| Predicted Fraud | Predicted Legitimate | Total | |
|---|---|---|---|
| Actual Fraud | 85 (TP) | 15 (FN) | 100 |
| Actual Legitimate | 200 (FP) | 9,700 (TN) | 9,900 |
| Total | 285 | 9,715 | 10,000 |
Precision = 85 / (85 + 200) = 85/285 = 0.298 = 29.8% → Only 30% of flagged transactions are actually fraud Recall = 85 / (85 + 15) = 85/100 = 0.85 = 85% → We caught 85% of all fraud cases F1-Score = 2 * (0.298 * 0.85) / (0.298 + 0.85) = 0.441 = 44.1% Accuracy = (85 + 9,700) / 10,000 = 97.85% → Misleading! High accuracy despite poor fraud detection
| Metric | What Happened | Business Impact |
|---|---|---|
| 85 True Positives | Correctly blocked 85 fraudulent transactions | ✅ Saved money, protected customers |
| 15 False Negatives | Missed 15 frauds (let them through) | ❌ Lost $15,000 (avg $1,000 per fraud) Customer chargebacks, reputation damage |
| 200 False Positives | Blocked 200 legitimate transactions | ❌ Lost sales: $40,000 (avg $200 per transaction) Customer frustration, abandoned carts |
| 9,700 True Negatives | Correctly approved legitimate transactions | ✅ Happy customers, revenue generated |
False Positives are very costly
Examples: Spam filtering (flagging a real email as spam is costly), content takedowns, stock-trade buy signals
Translation: "When you say YES, be really sure it's correct"
False Negatives are very costly
Examples: Cancer/disease screening, fraud detection, security threat alerts
Translation: "Don't miss any positive cases, even if some false alarms"
Need balance, both FP and FN are important
Examples: General-purpose classifiers, fraud scoring where flagged cases go to manual review
Strategy: MAXIMIZE RECALL (catch all frauds) Accept: More false positives (manual review) Reason: $5000 fraud loss >> cost of manual review
Strategy: MAXIMIZE PRECISION (avoid blocking legitimate customers) Accept: Some frauds slip through Reason: Customer frustration >> small fraud loss
Strategy: OPTIMIZE F1-SCORE (balance) Use: Risk scoring + customer history
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np
# Amazon fraud detection predictions
y_true = np.array([1,1,1,0,0,0,1,0,0,1]) # 1=Fraud, 0=Legitimate
y_pred = np.array([1,1,0,0,1,0,1,0,0,1])
# Calculate metrics
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"Precision: {precision:.2f}") # 0.75
print(f"Recall: {recall:.2f}") # 0.80
print(f"F1-Score: {f1:.2f}") # 0.77
# Detailed report
print(classification_report(y_true, y_pred,
target_names=['Legitimate', 'Fraud']))
# Confusion Matrix
cm = confusion_matrix(y_true, y_pred)
print(f"\nConfusion Matrix:\n{cm}")
# Adjust threshold for recall optimization
from sklearn.metrics import precision_recall_curve
# Get probabilities instead of binary predictions
y_probs = model.predict_proba(X_test)[:, 1]  # assumes a fitted classifier `model` and held-out X_test
# Find threshold that gives 95% recall
precisions, recalls, thresholds = precision_recall_curve(y_true, y_probs)
# For 95% recall, what precision do we get?
idx = np.where(recalls >= 0.95)[0][-1]  # highest threshold that still keeps recall ≥ 95%
optimal_threshold = thresholds[min(idx, len(thresholds) - 1)]
print(f"For 95% recall, use threshold: {optimal_threshold:.3f}")
print(f"This gives precision: {precisions[idx]:.2f}")
Increase Threshold (stricter): ┌──────────────────────────────┐ │ Precision ↑ (fewer FP) │ │ Recall ↓ (more FN) │ │ Example: Flag only 99% sure │ └──────────────────────────────┘ Decrease Threshold (lenient): ┌──────────────────────────────┐ │ Precision ↓ (more FP) │ │ Recall ↑ (fewer FN) │ │ Example: Flag if 30% sure │ └──────────────────────────────┘
Amazon's Dynamic Threshold:
How to Answer:
Red Flag Answers:
Gradient Descent: An optimization algorithm to minimize the cost function by iteratively moving in the direction of steepest descent.
Analogy: Like walking down a mountain in fog - you feel the slope and take small steps downhill to reach the valley (minimum).
Formula:
θ = θ - α * ∇J(θ) Where: θ = parameters (weights) α = learning rate ∇J(θ) = gradient (derivative of cost function)
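A minimal NumPy sketch of that update rule on a one-parameter least-squares problem (synthetic data):
import numpy as np

# Synthetic data: y ≈ 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = 3 * x + rng.normal(0, 0.1, 200)

theta, alpha = 0.0, 0.1                       # parameter and learning rate
for step in range(500):
    grad = np.mean(2 * (theta * x - y) * x)   # ∇J(θ) for J(θ) = mean squared error
    theta = theta - alpha * grad              # θ = θ - α ∇J(θ)

print(f"Learned theta ≈ {theta:.2f}")  # close to 3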
| Variant | Batch Size | Speed | Memory | Convergence | Best For |
|---|---|---|---|---|---|
| Batch GD | All data | Slow | High | Smooth | Small datasets |
| Stochastic GD | 1 sample | Fast | Low | Noisy | Online learning |
| Mini-batch GD | 32-512 | Balanced | Medium | Stable | Most cases |
for epoch in range(num_epochs):
gradient = calculate_gradient(entire_dataset)
weights = weights - learning_rate * gradient
Pros: Smooth convergence, guaranteed to reach global minimum (for convex functions)
Cons: Very slow for large datasets, high memory usage
Amazon Example: NOT used for product recommendations (billions of products) - too slow!
for epoch in range(num_epochs):
for sample in shuffled_dataset:
gradient = calculate_gradient(single_sample)
weights = weights - learning_rate * gradient
Pros: Fast updates, can escape local minima, low memory
Cons: Noisy updates, might not converge exactly
Amazon Example: Real-time fraud detection - updates model immediately with each transaction
batch_size = 256
for epoch in range(num_epochs):
for batch in create_batches(dataset, batch_size):
gradient = calculate_gradient(batch)
weights = weights - learning_rate * gradient
Pros: Balance of speed and stability, GPU-friendly, reduces variance
Cons: Requires batch size tuning
Amazon Example: Product recommendation system - processes 256-512 users per batch
v = β * v + (1-β) * gradient weights = weights - learning_rate * v β typically 0.9
Concept: Like a ball rolling downhill - builds momentum in consistent directions
Amazon Use: Alexa speech recognition - helps escape plateaus faster
m = β1 * m + (1-β1) * gradient # First moment (mean) v = β2 * v + (1-β2) * gradient² # Second moment (variance) m_hat = m / (1 - β1^t) # Bias correction v_hat = v / (1 - β2^t) weights = weights - α * m_hat / (√v_hat + ε) Typical values: β1 = 0.9 β2 = 0.999 ε = 1e-8 α = 0.001
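The same update written out directly for a single parameter minimizing J(θ) = θ² (synthetic example using the typical hyperparameter values above):
import numpy as np

beta1, beta2, eps, alpha = 0.9, 0.999, 1e-8, 0.001
theta, m, v = 5.0, 0.0, 0.0                    # start far from the minimum of J(θ) = θ²

for t in range(1, 10001):
    grad = 2 * theta                           # ∇J(θ) for J(θ) = θ²
    m = beta1 * m + (1 - beta1) * grad         # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad**2      # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)                 # bias correction
    v_hat = v / (1 - beta2**t)
    theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)

print(f"theta after Adam: {theta:.4f}")        # ends close to 0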
Why Adam is Best:
Amazon Production Example:
v = β * v + (1-β) * gradient² weights = weights - learning_rate * gradient / √(v + ε)
Amazon Use: Good for RNNs in Alexa natural language processing
G = G + gradient² weights = weights - learning_rate * gradient / √(G + ε)
Amazon Use: Sparse features in ad click prediction (many zero values)
| Use Case | Data Size | Recommended Optimizer | Batch Size |
|---|---|---|---|
| Product Recommendations | Billions of items | Adam + Mini-batch | 512-1024 |
| Real-time Fraud Detection | Streaming data | SGD | 1 |
| Image Classification (Rekognition) | Millions of images | Adam + Mini-batch | 128-256 |
| Search Query Understanding | Billions of queries | Adam | 256 |
| Alexa Speech Recognition | Millions of audio clips | Adam / RMSprop | 64-128 |
| Demand Forecasting | Historical sales data | Adam + Mini-batch | 128 |
| Ad Click Prediction | Sparse features | AdaGrad | 512 |
import tensorflow as tf
from tensorflow.keras.optimizers import SGD, Adam, RMSprop
# 1. Stochastic Gradient Descent
sgd = SGD(learning_rate=0.01)
model.compile(optimizer=sgd, loss='categorical_crossentropy')
# 2. SGD with Momentum
sgd_momentum = SGD(learning_rate=0.01, momentum=0.9)
# 3. Adam (Most Popular)
adam = Adam(
learning_rate=0.001,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-8
)
model.compile(optimizer=adam, loss='sparse_categorical_crossentropy')
# 4. RMSprop
rmsprop = RMSprop(learning_rate=0.001, rho=0.9)
# 5. AdaGrad
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.01)
# PyTorch Example
import torch.optim as optim
optimizer = optim.Adam(model.parameters(), lr=0.001)
for epoch in range(num_epochs):
for batch_data, batch_labels in data_loader:
optimizer.zero_grad()
outputs = model(batch_data)
loss = criterion(outputs, batch_labels)
loss.backward()
optimizer.step()
Bias: Error from oversimplified assumptions in the learning algorithm. High bias = underfitting.
Variance: Error from sensitivity to small fluctuations in training data. High variance = overfitting.
Goal: Find the sweet spot that minimizes total error (Bias² + Variance + Irreducible Error)
| Symptom | High Bias (Underfitting) | High Variance (Overfitting) |
|---|---|---|
| Training Error | High (~15%) | Low (~1%) |
| Validation Error | High (~16%) | Much Higher (~20%) |
| Gap | Small gap | Large gap |
Scenario: Search ranking model performs poorly
Step 1 - Diagnose:
Training accuracy: 92% Validation accuracy: 91% → High Bias (train ≈ validation, both below the desired accuracy — small gap) Action: Model is too simple
Step 2 - Fix: Add features like: query–product text match scores, user click/purchase history, product popularity and ratings
Alternative Scenario:
Training accuracy: 99% Validation accuracy: 78% → High Variance (large gap) Action: Model is overfitting
Fix: more training data, L1/L2 regularization, dropout, early stopping, or a simpler model
| Type | How it works | Output Quality | Use When |
|---|---|---|---|
| Extractive | Select important sentences directly from text | Grammatical, factual, may be disjointed | Need accuracy, legal docs, news |
| Abstractive | Generate new sentences (like a human summary) | Fluent, coherent, may hallucinate | Need readability, customer support |
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import sent_tokenize
import numpy as np
def extractive_summarize(document, n_sentences=3):
sentences = sent_tokenize(document)
# TF-IDF on sentences
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(sentences)
# Score each sentence by sum of TF-IDF weights
sentence_scores = np.array(tfidf_matrix.sum(axis=1)).flatten()
# Pick top n sentences (in original order)
top_indices = sorted(np.argsort(sentence_scores)[-n_sentences:])
summary = " ".join([sentences[i] for i in top_indices])
return summary
import numpy as np
import networkx as nx
from nltk.tokenize import sent_tokenize
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
def textrank_summarize(document, n_sentences=3):
sentences = sent_tokenize(document)
# Create sentence embeddings
model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
embeddings = model.encode(sentences)
# Build similarity matrix (graph)
similarity_matrix = cosine_similarity(embeddings)
np.fill_diagonal(similarity_matrix, 0) # No self-similarity
# Apply PageRank to find most "central" sentences
graph = nx.from_numpy_array(similarity_matrix)
scores = nx.pagerank(graph)
# Get top sentences
top_indices = sorted(scores, key=scores.get, reverse=True)[:n_sentences]
top_indices.sort() # Maintain original order
return " ".join([sentences[i] for i in top_indices])
from transformers import pipeline, T5Tokenizer, T5ForConditionalGeneration
# Method 1: Using pipeline (easy)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = """Google announced a major update to its search algorithm today.
The new update, called "Helpful Content Update 2.0", focuses on promoting
original, high-quality content while demoting AI-generated spam. The update
will roll out globally over the next two weeks and is expected to affect
approximately 15% of search results in English..."""
summary = summarizer(
article,
max_length=130, # Max output length
min_length=30, # Min output length
do_sample=False, # Greedy decoding (deterministic)
truncation=True
)[0]["summary_text"]
print(summary)
# "Google has announced a major update to its search algorithm,
# called Helpful Content Update 2.0, which aims to promote original
# content while reducing AI-generated spam across 15% of search results."
# Method 2: T5 (Google's model — "Text-to-Text Transfer Transformer")
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
# T5 takes text prompts! "summarize:", "translate:", "question:"
input_text = "summarize: " + article
input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)
summary_ids = model.generate(
input_ids,
max_length=150,
min_length=40,
num_beams=4, # Beam search — better quality
early_stopping=True,
length_penalty=2.0 # Encourage longer summaries
)
t5_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
When you ask: "What is machine learning?" Google doesn't just return pages — it shows a text box with a direct answer. This is EXTRACTIVE summarization: 1. TextRank identifies most relevant passage from top-ranked page 2. Answer Extraction model (BERT-based) extracts the answer span 3. Displayed as "Featured Snippet" Google Gemini — Abstractive (for complex questions): "Give me a summary of the latest AI research trends" → Gemini reads multiple sources → generates coherent summary → ABSTRACTIVE — new sentences that weren't in original documents
Teams Meeting (90 minutes) → Copilot generates 2-page summary: Step 1: Speech-to-Text (Azure AI Speech) → Transcribe entire meeting in real-time Step 2: Speaker Diarization → "Who said what" — label each segment with speaker name Step 3: Abstractive Summarization (GPT-4) → System Prompt: "You are a professional meeting assistant. Summarize this meeting transcript. Extract: 1. Key decisions made 2. Action items with owners 3. Open questions Format as bullet points." → GPT-4 generates structured summary Step 4: Action Item Extraction (NER-style) → Pattern: "[PERSON] will [ACTION] by [DATE]" → "John will send the proposal by Friday" → Creates task in Microsoft To Do automatically! Microsoft says: Copilot saves ~2.5 hours per week per user in meeting notes. At $30/month Copilot subscription × 100M enterprise users = huge business impact.
Problem: Product has 5,000 reviews — too many to read!
Solution: Generate 3-sentence abstract summary
Approach:
1. Filter reviews by rating (1-2 star separately from 4-5 star)
2. Run aspect-based extraction: {battery: neg, screen: pos, price: pos}
3. Generate: "Customers love the display quality and value for money.
The main concerns are battery life and build quality.
Most recommend this product for casual use."
Amazon A/B tested: Products with AI summaries get 15% more purchases
→ Clearer expectations → fewer returns!
ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
ROUGE-1: Unigram overlap between system summary and reference
ROUGE-2: Bigram overlap
ROUGE-L: Longest Common Subsequence
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
reference = "The cat sat on the mat near the window"
generated = "A cat rested on a mat by the window"
scores = scorer.score(reference, generated)
print(f"ROUGE-1: {scores["rouge1"].fmeasure:.3f}")
print(f"ROUGE-2: {scores["rouge2"].fmeasure:.3f}")
print(f"ROUGE-L: {scores["rougeL"].fmeasure:.3f}")
Definition: Unsupervised technique to discover abstract "topics" that occur in a collection of documents. No labels needed — discovers structure automatically!
Documents are mixtures of topics. Topics are distributions over words. Document (LinkedIn post): "My machine learning model improved prediction accuracy by 20%" Topic Mixture: → 70% "Data Science" topic → 20% "Career Achievement" topic → 10% "Technology" topic "Data Science" topic (high-probability words): model (0.12), accuracy (0.09), prediction (0.08), training (0.07), features (0.06)...
Generative Model (how LDA imagines documents are created):
1. For each document d, choose topic distribution θ_d ~ Dirichlet(α)
(e.g., 70% ML, 20% Career, 10% Tech)
2. For each word w in document:
a. Choose a topic z ~ Multinomial(θ_d)
b. Choose word w ~ Multinomial(φ_z) (word distribution of topic z)
LDA learns:
φ = Word distributions per topic (what words describe each topic?)
θ = Topic distributions per document (what mix of topics is each doc?)
Hyperparameters:
α (alpha): Low (0.1) = sparse topics (doc about 1-2 topics)
High (1.0) = mixed topics (doc about many topics)
β (beta): Low = specific topics (few words)
High = broad topics (many words)
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
# LinkedIn posts corpus
posts = linkedin_posts_text # List of strings
# Vectorize
vectorizer = CountVectorizer(max_features=5000, stop_words="english", min_df=5)
X = vectorizer.fit_transform(posts)
# Train LDA with 15 topics
lda = LatentDirichletAllocation(
n_components=15, # Number of topics
max_iter=50,
learning_method="online", # Good for large corpora
learning_offset=50.0,
random_state=42
)
lda.fit(X)
# Print top words per topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
top_words = [feature_names[i] for i in topic.argsort()[:-10:-1]]
print(f"Topic {topic_idx}: {', '.join(top_words)}")
Process: 1. Run LDA on all LinkedIn posts (billions!) 2. Each post gets a topic vector: [0.7, 0.1, 0.05, 0.15, ...] 3. Each user's "interest profile" = average topic of posts they engaged with 4. Recommend posts whose topic vector is close to user interest vector Example: User engages with: ML articles, Python tutorials, tech startup news User interest: [0.6 ML, 0.3 Tech, 0.1 Career] → Recommend next: "Deep Learning optimization techniques" (ML topic 0.8)
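A hedged sketch of that matching step, continuing from the `lda` and `vectorizer` fitted in the code above (the engaged and candidate posts are made-up examples):
import numpy as np

# Posts the user engaged with (hypothetical examples)
engaged_posts = [
    "Great tutorial on training neural networks in Python",
    "How we improved model accuracy with better feature engineering",
]
user_interest = lda.transform(vectorizer.transform(engaged_posts)).mean(axis=0)

# Score candidate posts by cosine similarity of their topic vectors to the user interest vector
candidates = ["Deep learning optimization techniques explained",
              "Five tips for negotiating your next salary"]
cand_topics = lda.transform(vectorizer.transform(candidates))
scores = cand_topics @ user_interest / (
    np.linalg.norm(cand_topics, axis=1) * np.linalg.norm(user_interest)
)
for post, score in sorted(zip(candidates, scores), key=lambda p: -p[1]):
    print(f"{score:.3f}  {post}")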
Problem: LinkedIn has 50,000+ different skill tags.
Users add: "ML", "Machine Learning", "Statistical Learning", "AI/ML"
→ These all mean the same thing!
Topic Modeling Solution:
→ Run LDA on job descriptions
→ Topic 3: {machine learning, neural networks, tensorflow, sklearn, models}
→ All synonyms cluster in same topic!
→ Build: "Machine Learning" (canonical) → ["ML", "Statistical Learning", "AI/ML"]
→ Skills graph: ML → related to [Deep Learning, Python, Statistics]
Recruiter: "Find me jobs similar to my current role"
User's Role: "Senior Data Scientist at Amazon, ML models, Python, team lead"
Step 1: Get topic distribution of user's role description
[0.5 Data Science, 0.3 Leadership, 0.2 Cloud/AWS]
Step 2: Compute Jensen-Shannon divergence between topic distributions
of all job postings vs user's profile
Step 3: Jobs with smallest JSD = most topically similar
→ Recommend: "Principal Data Scientist, ML Lead, AI Research Scientist"
This is faster than BERT embedding comparison for millions of jobs!
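A small sketch of Step 2 with SciPy's Jensen-Shannon distance (the user vector is the hypothetical one above; the job vectors are made-up comparisons):
import numpy as np
from scipy.spatial.distance import jensenshannon

user_topics = np.array([0.5, 0.3, 0.2])          # [Data Science, Leadership, Cloud/AWS]
job_topics = {
    "Principal Data Scientist": np.array([0.6, 0.25, 0.15]),
    "Sales Manager":            np.array([0.05, 0.6, 0.35]),
}
# jensenshannon returns the JS *distance* (sqrt of the divergence); smaller = more topically similar
for title, dist in job_topics.items():
    print(f"{title}: JSD = {jensenshannon(user_topics, dist):.3f}")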
Run LDA monthly on new LinkedIn posts: Compare topic word distributions: Jan 2024 vs Jul 2024 "Topic 7" word distribution change: Jan: Python(0.12), SQL(0.10), Tableau(0.08) → "Data Analytics" Jul: ChatGPT(0.15), LLM(0.13), Prompt(0.11) → "Generative AI" → LinkedIn alert: "Generative AI skills are trending in your network!" → Learning recommendations: "Complete LLM course to stay competitive"
| Model | Improvement Over LDA | Use Case |
|---|---|---|
| BERTopic | Uses BERT embeddings + clustering — better topics | Short texts, Twitter |
| NMF | Non-negative Matrix Factorization — faster, interpretable | Large-scale document collections |
| Top2Vec | Jointly learns document + topic embeddings | Semantic topic discovery |
| CTM (Contextualized) | BERT + LDA combined | Nuanced topic modeling |
# BERTopic — State of the art (LinkedIn might use this now)
from bertopic import BERTopic
topic_model = BERTopic(
language="english",
calculate_probabilities=True,
verbose=True,
nr_topics=20 # Reduce to 20 topics
)
topics, probs = topic_model.fit_transform(linkedin_posts)
# Visualize topics
topic_model.visualize_topics()
topic_model.visualize_barchart(top_n_topics=10)
# Get topic info
print(topic_model.get_topic_info().head(10))
Definition: Automatically determine the emotional tone (positive, negative, neutral) of text.
| Type | Output | Use Case |
|---|---|---|
| Binary | Positive/Negative | Product review classification |
| Ternary | Positive/Neutral/Negative | Customer feedback |
| Fine-grained | 1-5 stars | Rating prediction |
| Aspect-based | Sentiment per topic | "Great camera, poor battery" |
| Emotion Detection | Joy/Anger/Sadness/Fear | Mental health apps |
# Methods:
# 1. Rule-Based (VADER — fast, no training)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
tweet = "This movie was absolutely AMAZING!!! Best film ever 🎉"
scores = analyzer.polarity_scores(tweet)
# {"neg": 0.0, "neu": 0.25, "pos": 0.75, "compound": 0.87}
# compound > 0.05 = positive, < -0.05 = negative
# 2. ML-Based (Logistic Regression + TF-IDF)
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
pipeline = Pipeline([
("tfidf", TfidfVectorizer(ngram_range=(1,2), max_features=50000)),
("clf", LogisticRegression(C=1.0))
])
pipeline.fit(X_train, y_train)
# 3. Transformer-Based (Best accuracy — BERT fine-tuned)
from transformers import pipeline as hf_pipeline
sentiment_pipe = hf_pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
result = sentiment_pipe("This product is terrible and a waste of money!")
# [{"label": "NEGATIVE", "score": 0.9997}]
Definition: Identify and classify named entities (people, organizations, locations, dates, etc.) in text.
Input: "Elon Musk announced that Tesla will build a new factory in India in 2026"
NER Output:
Elon Musk → PERSON
Tesla → ORGANIZATION
India → LOCATION (GPE - Geopolitical Entity)
2026 → DATE
# Using spaCy (fast, production-ready)
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple CEO Tim Cook unveiled iPhone 15 in San Francisco yesterday")
for ent in doc.ents:
print(f"{ent.text:20} → {ent.label_:10} → {spacy.explain(ent.label_)}")
# Apple → ORG → Companies, agencies, institutions
# Tim Cook → PERSON → People, fictional characters
# iPhone 15 → PRODUCT → Objects, vehicles, foods, etc.
# San Francisco → GPE → Countries, cities, states
# yesterday → DATE → Absolute or relative dates
6,000 tweets/second → Apache Kafka (message queue) → Distributed across 100+ consumer groups → Each group processes geographic region
For each tweet: 1. Language Detection (FastText classifier) → "¿Cuál es tu película favorita?" → Spanish 2. NER Extraction (spaCy / custom Transformer) → Extract: people, hashtags, organizations, places, products 3. Sentiment Analysis (DistilBERT — 40% smaller and 60% faster than BERT) → @MicrosoftTeams is DOWN again! → NEGATIVE, sentiment=-0.87 4. Hashtag + Mention Analysis → #WorldCup → Group with other #WorldCup tweets 5. Topic Classification → Tech / Sports / Politics / Entertainment / Health
Twitter uses a Bayesian changepoint detection algorithm: For each entity/hashtag, track: - Volume_t = tweet count in last 30 minutes - Volume_expected = historical baseline (same time, same day of week) Trend Score = (Volume_t - Volume_expected) / std_deviation If Trend Score > threshold (typically 3σ): → Candidate for "What's Happening" section But also consider: - Unique users (not just volume) — prevent spam - Geographic clustering - Rate of acceleration (trending faster?) - Age of the trend (new vs sustained)
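A minimal sketch of the trend-score formula (counts and baseline values are made up):
import numpy as np

def trend_score(volume_now, historical_volumes):
    """(current volume - expected baseline) / std of the historical baseline."""
    expected = np.mean(historical_volumes)
    std = np.std(historical_volumes)
    return (volume_now - expected) / std

# #WorldCup tweet counts in the same 30-minute window over past weeks (hypothetical)
baseline = [1200, 1100, 1350, 1250, 1180, 1300]
current = 5400
score = trend_score(current, baseline)
print(f"Trend score: {score:.1f}σ → {'TRENDING' if score > 3 else 'normal'}")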
Not all trends shown to all users! If #ManchesterUnited is trending: → Show to: UK users, football fans, Man Utd followers → Don't show to: Users with no sports history Personalization: User interests × Global trend score → Personal trend rank
"Just tried the new Samsung Galaxy S25 camera and WOW it blows away iPhone! 📸" NER: → Samsung Galaxy S25 (PRODUCT) → iPhone (PRODUCT) Sentiment: POSITIVE (compound=0.82) Aspect-based: → Camera → POSITIVE → Comparison: Samsung > iPhone (sentiment) Brand Monitoring Alert: → Samsung team: "Camera getting positive mentions — keep it up!" → Apple team: "iPhone camera perception declining vs Samsung — investigate!" Twitter sells this brand intelligence data to companies!
import spacy
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import re
from collections import Counter, defaultdict
# Load models
nlp = spacy.load("en_core_web_sm")
vader = SentimentIntensityAnalyzer()
def process_tweet(tweet_text):
"""Full NLP pipeline for a single tweet"""
# Clean
    clean = re.sub(r"http\S+", "", tweet_text)  # Remove URLs
# NER
doc = nlp(clean)
entities = {ent.label_: ent.text for ent in doc.ents}
# Sentiment
sentiment_scores = vader.polarity_scores(clean)
sentiment_label = (
"positive" if sentiment_scores["compound"] > 0.05
else "negative" if sentiment_scores["compound"] < -0.05
else "neutral"
)
# Extract hashtags & mentions
    hashtags = re.findall(r"#\w+", tweet_text)
    mentions = re.findall(r"@\w+", tweet_text)
return {
"entities": entities,
"sentiment": sentiment_label,
"sentiment_score": sentiment_scores["compound"],
"hashtags": hashtags,
"mentions": mentions
}
# Trend detection
def detect_trends(tweets_stream, window_minutes=30):
"""Detect trending topics in tweet stream"""
entity_counts = Counter()
hashtag_counts = Counter()
sentiment_by_entity = defaultdict(list)
for tweet in tweets_stream:
result = process_tweet(tweet["text"])
for entity in result["entities"].values():
entity_counts[entity] += 1
sentiment_by_entity[entity].append(result["sentiment_score"])
for tag in result["hashtags"]:
hashtag_counts[tag] += 1
# Report top trends
print("📈 TRENDING ENTITIES:")
for entity, count in entity_counts.most_common(10):
avg_sentiment = sum(sentiment_by_entity[entity]) / len(sentiment_by_entity[entity])
print(f" {entity}: {count} mentions | Sentiment: {avg_sentiment:.2f}")
print("
🔥 TRENDING HASHTAGS:")
for tag, count in hashtag_counts.most_common(5):
print(f" {tag}: {count} mentions")
These exact patterns appear in Meta, Google, Amazon SQL interviews. Master these and you can handle 90% of data science SQL questions.
-- Meta Interview: "Calculate 7-day rolling average of daily active users"
WITH daily_dau AS (
SELECT
activity_date AS date,
COUNT(DISTINCT user_id) AS dau
FROM user_activity
GROUP BY activity_date
)
SELECT
date,
dau,
ROUND(AVG(dau) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
), 0) AS rolling_7day_avg,
SUM(dau) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) AS cumulative_users
FROM daily_dau
ORDER BY date;
-- Amazon: "Find user IDs who logged in for 3+ consecutive days"
WITH login_dates AS (
SELECT DISTINCT user_id, login_date
FROM user_logins
),
with_lag AS (
SELECT
user_id, login_date,
login_date - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS streak_group
FROM login_dates
),
streaks AS (
SELECT user_id, streak_group,
MIN(login_date) AS streak_start,
MAX(login_date) AS streak_end,
COUNT(*) AS streak_length
FROM with_lag
GROUP BY user_id, streak_group
)
SELECT DISTINCT user_id, streak_start, streak_end, streak_length
FROM streaks
WHERE streak_length >= 3
ORDER BY streak_length DESC;
-- Google: "Analyze the user signup funnel conversion at each step"
-- Funnel: Landing → Sign Up → Email Verify → Profile Complete → First Search
WITH funnel_steps AS (
SELECT
user_id,
MAX(CASE WHEN event_type = "landing_page_view" THEN 1 ELSE 0 END) AS step1_landing,
MAX(CASE WHEN event_type = "signup_click" THEN 1 ELSE 0 END) AS step2_signup,
MAX(CASE WHEN event_type = "email_verified" THEN 1 ELSE 0 END) AS step3_verified,
MAX(CASE WHEN event_type = "profile_completed" THEN 1 ELSE 0 END) AS step4_profile,
MAX(CASE WHEN event_type = "first_search" THEN 1 ELSE 0 END) AS step5_search
FROM user_events
WHERE event_date >= CURRENT_DATE - 30
GROUP BY user_id
)
SELECT
SUM(step1_landing) AS step1_count,
SUM(step2_signup) AS step2_count,
SUM(step3_verified) AS step3_count,
SUM(step4_profile) AS step4_count,
SUM(step5_search) AS step5_count,
ROUND(100.0 * SUM(step2_signup) / SUM(step1_landing), 1) AS step1_to_2_rate,
ROUND(100.0 * SUM(step3_verified) / SUM(step2_signup), 1) AS step2_to_3_rate,
ROUND(100.0 * SUM(step4_profile) / SUM(step3_verified), 1) AS step3_to_4_rate,
ROUND(100.0 * SUM(step5_search) / SUM(step4_profile), 1) AS step4_to_5_rate,
ROUND(100.0 * SUM(step5_search) / SUM(step1_landing), 1) AS overall_conversion_rate
FROM funnel_steps;
-- Amazon: "Find median order value per product category"
-- (AVG misleads for skewed data — median is better!)
-- Method 1: PERCENTILE_CONT (standard SQL)
SELECT
category,
COUNT(*) AS order_count,
ROUND(AVG(order_amount), 2) AS mean_order,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY order_amount) AS median_order,
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY order_amount) AS p25,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY order_amount) AS p75,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY order_amount) -
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY order_amount) AS IQR
FROM orders o
JOIN products p ON o.product_id = p.product_id
WHERE o.status = "completed"
GROUP BY category
ORDER BY median_order DESC;
-- Method 2: Without PERCENTILE_CONT (if not available)
WITH ranked AS (
SELECT category, order_amount,
ROW_NUMBER() OVER (PARTITION BY category ORDER BY order_amount) AS rn,
COUNT(*) OVER (PARTITION BY category) AS total_count
FROM orders o JOIN products p ON o.product_id = p.product_id
)
SELECT category, AVG(order_amount) AS median
FROM ranked
WHERE rn IN (FLOOR((total_count+1)/2.0), CEIL((total_count+1)/2.0))
GROUP BY category;
-- Meta: "What is the 30-day retention rate for new users?"
-- Retention = % of new users who are active again 30 days after their first activity
WITH new_users AS (
SELECT user_id, MIN(activity_date) AS first_activity_date
FROM user_activity
GROUP BY user_id
),
retention AS (
SELECT
nu.user_id,
nu.first_activity_date,
MAX(CASE
WHEN ua.activity_date = nu.first_activity_date + 30 THEN 1
ELSE 0
END) AS is_retained_day30
FROM new_users nu
LEFT JOIN user_activity ua ON nu.user_id = ua.user_id
WHERE nu.first_activity_date BETWEEN '2024-01-01' AND '2024-06-01'
GROUP BY nu.user_id, nu.first_activity_date
)
SELECT
first_activity_date AS cohort_date,
COUNT(*) AS new_users,
SUM(is_retained_day30) AS retained_users,
ROUND(100.0 * SUM(is_retained_day30) / COUNT(*), 2) AS day30_retention_rate
FROM retention
GROUP BY first_activity_date
ORDER BY cohort_date;
-- Amazon: "Find product pairs that are frequently bought together"
SELECT
oi1.product_id AS product_a,
oi2.product_id AS product_b,
COUNT(DISTINCT oi1.order_id) AS orders_with_both,
ROUND(100.0 * COUNT(DISTINCT oi1.order_id) /
(SELECT COUNT(DISTINCT order_id) FROM order_items), 4) AS support_pct
FROM order_items oi1
JOIN order_items oi2
  ON oi1.order_id = oi2.order_id
  AND oi1.product_id < oi2.product_id  -- avoid self-pairs and (A,B)/(B,A) duplicates
GROUP BY oi1.product_id, oi2.product_id
HAVING COUNT(DISTINCT oi1.order_id) >= 100  -- minimum support
ORDER BY orders_with_both DESC
LIMIT 20;
-- Result: "Diapers + Beer" urban legend, or "Phone + Phone Case"
-- Google: "Find top 3 most searched queries per country"
WITH ranked_queries AS (
SELECT
country,
query,
search_count,
DENSE_RANK() OVER (PARTITION BY country ORDER BY search_count DESC) AS rnk
FROM (
SELECT country, query, COUNT(*) AS search_count
FROM search_logs
WHERE log_date = CURRENT_DATE - 1
GROUP BY country, query
) agg
)
SELECT country, query, search_count, rnk
FROM ranked_queries
WHERE rnk <= 3
ORDER BY country, rnk;
-- DENSE_RANK vs RANK vs ROW_NUMBER:
-- RANK: 1,1,3 (ties get same rank, skips next)
-- DENSE_RANK: 1,1,2 (ties same rank, no skip)
-- ROW_NUMBER: 1,2,3 (always unique, arbitrary for ties)
| Join Type | Returns | When to Use |
|---|---|---|
| INNER JOIN | Only matching rows in BOTH tables | When you need records present in both |
| LEFT JOIN | All from left + matches from right (NULL if no match) | All left records, optional right data |
| RIGHT JOIN | All from right + matches from left | Rarely used; prefer LEFT JOIN |
| FULL OUTER JOIN | All rows from both, NULLs where no match | Compare two complete datasets |
| CROSS JOIN | Cartesian product (every combination) | Generate all combinations (A/B tests) |
| SELF JOIN | Table joined with itself | Hierarchical or comparative data |
customers(customer_id, name, email, city, registration_date)
orders(order_id, customer_id, order_date, total_amount, status)
order_items(item_id, order_id, product_id, quantity, price)
products(product_id, name, category, seller_id, stock)
sellers(seller_id, name, rating, country)
-- Find customers and their orders (only customers WITH orders)
SELECT c.name, c.city, o.order_id, o.total_amount, o.order_date
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01'
ORDER BY o.total_amount DESC;
-- Result: ONLY customers who placed orders in 2024
-- Customers with no orders are EXCLUDED
-- Amazon use: Revenue analysis for active buyers
-- Find all customers, including those who NEVER ordered
-- Critical for churn analysis!
SELECT
c.customer_id, c.name, c.email, c.registration_date,
COUNT(o.order_id) AS total_orders,
COALESCE(SUM(o.total_amount), 0) AS lifetime_value,
MAX(o.order_date) AS last_order_date,
  CASE
    WHEN COUNT(o.order_id) = 0 THEN 'Never Ordered'
    WHEN MAX(o.order_date) < CURRENT_DATE - 90 THEN 'Lapsed (90+ days)'
    ELSE 'Active'
  END AS customer_status
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name, c.email, c.registration_date
ORDER BY lifetime_value DESC;
-- Amazon insight:
-- customer_status = 'Never Ordered' → Show signup incentive
-- customer_status = 'Lapsed' → Send re-engagement email
-- customer_status = 'Active' → Prime upgrade campaign
-- Products with no orders AND orders with deleted products (data quality check)
SELECT
p.product_id AS product_product_id,
p.name AS product_name,
oi.product_id AS ordered_product_id,
oi.order_id,
oi.quantity
FROM products p
FULL OUTER JOIN order_items oi ON p.product_id = oi.product_id
-- Where product has no orders:
WHERE oi.order_id IS NULL -- Dead inventory!
-- OR Where order has no matching product:
-- OR p.product_id IS NULL -- Ghost orders — data integrity issue!
-- Amazon use: Inventory management, data quality audits
-- Complete sales analysis: customer, product, seller, order
SELECT
c.name AS customer_name,
c.city,
p.name AS product_name,
p.category,
s.name AS seller_name,
s.country AS seller_country,
o.order_date,
oi.quantity,
oi.price,
oi.quantity * oi.price AS line_total,
o.status AS order_status
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id
LEFT JOIN sellers s ON p.seller_id = s.seller_id -- Left: some products = Amazon direct
WHERE o.order_date BETWEEN '2024-01-01' AND '2024-12-31'
  AND o.status = 'completed'
ORDER BY line_total DESC;
-- Amazon use: Daily sales dashboard, seller performance reports
-- Customers who referred other customers (referral program)
SELECT
referee.customer_id AS referral_customer_id,
referee.name AS referral_name,
referred.customer_id AS new_customer_id,
referred.name AS new_customer_name,
referred.registration_date AS joined_date
FROM customers referee
INNER JOIN customers referred
ON referee.customer_id = referred.referred_by_customer_id
ORDER BY referred.registration_date DESC;
-- Also useful for: Product hierarchy, employee-manager, comment threads
-- Generate all combinations of test parameters for A/B testing
SELECT
v.variant_name,
p.page_location,
d.device_type,
d.device_type || "_" || v.variant_name || "_" || p.page_location AS experiment_id
FROM (VALUES ("control"), ("treatment_A"), ("treatment_B")) v(variant_name)
CROSS JOIN (VALUES ("homepage"), ("checkout"), ("product_page")) p(page_location)
CROSS JOIN (VALUES ("mobile"), ("desktop"), ("tablet")) d(device_type);
-- Generates 3 variants × 3 pages × 3 devices = 27 experiment configs
-- Amazon uses this for multi-variate testing configuration
Q1: "Find the second highest order value per customer"
SELECT customer_id, order_amount
FROM (
SELECT customer_id, order_amount,
DENSE_RANK() OVER (PARTITION BY customer_id ORDER BY order_amount DESC) AS rnk
FROM orders
) ranked
WHERE rnk = 2;
Q2: "Find customers who ordered every month in 2024"
SELECT customer_id
FROM orders
WHERE YEAR(order_date) = 2024
GROUP BY customer_id
HAVING COUNT(DISTINCT MONTH(order_date)) = 12;
Q3: "Running total of revenue per day"
SELECT order_date, daily_revenue,
SUM(daily_revenue) OVER (ORDER BY order_date ROWS UNBOUNDED PRECEDING) AS running_total
FROM (
SELECT order_date, SUM(total_amount) AS daily_revenue
FROM orders GROUP BY order_date
) daily;
PDF: f(x) = (1 / (σ√(2π))) × exp(-(x-μ)² / (2σ²))
Parameters: μ (mean), σ (standard deviation)
Shape: Bell curve, symmetric
When it appears:
→ Errors/noise in measurements
→ Height, weight, test scores in a population
→ Residuals in linear regression (a model assumption!)
→ Returns of financial instruments (approximately)
68-95-99.7 Rule: 68% of data within 1σ, 95% within 2σ, 99.7% within 3σ
Real MNC Examples:
→ Google: Search latency (mostly ~200ms, bell-shaped)
→ Amazon: Product rating distribution per category
→ LinkedIn: Salary distribution in an industry
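A quick check of the 68-95-99.7 rule with SciPy (generic values, not tied to any company's data):
from scipy import stats
for k in (1, 2, 3):
    p = stats.norm.cdf(k) - stats.norm.cdf(-k)  # probability mass within ±kσ of the mean
    print(f"Within {k}σ: {p:.4f}")              # ≈ 0.6827, 0.9545, 0.9973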
P(X=k) = C(n,k) × p^k × (1-p)^(n-k)
Parameters: n (trials), p (success probability)
Shape: Symmetric if p=0.5, skewed otherwise
When it appears:
→ Binary outcomes (click/no-click, buy/no-buy)
→ A/B test results (n users, each clicks with prob p)
→ Defect detection (n products, each defective with prob p)
Real MNC Examples:
→ Google Ads: Will user click this ad? (1 trial, p=CTR)
→ Amazon: Will customer buy? (Bernoulli if single item)
→ Netflix: Will user subscribe after trial? (Binomial over 1000 trial users)
Example: Amazon sends 10,000 promotional emails, 3% conversion rate
Expected sales = np = 10,000 × 0.03 = 300
Std Dev = √(np(1-p)) = √(10,000 × 0.03 × 0.97) ≈ 17
95% CI: [266, 334] sales
P(X=k) = (λ^k × e^(-λ)) / k!
Parameter: λ (average rate of events per time/space interval)
Shape: Right-skewed, approaches Normal for large λ
When it appears:
→ Count of events in a fixed time/space interval
→ Number of arrivals per unit time
→ Rare event modeling
Real MNC Examples:
→ Uber: Number of ride requests per minute per zone (λ = 12 requests/min, P(X≥20) = rare surge event)
→ Netflix: Server errors per minute (λ = 0.5 errors/min → P(X>3 errors) = alert threshold)
→ Amazon: Customer service calls per hour (λ = 150 calls/hour → staffing planning)
→ Google: Malicious requests per second (λ = 2/sec → P(X>10) triggers DDoS protection)
Poisson Process Assumption: Events occur independently, at a constant average rate
PDF: f(x) = λ × e^(-λx)
Parameter: λ (rate), Mean = 1/λ
Companion to Poisson: Time BETWEEN Poisson events is Exponential
When it appears:
→ Time between events (arrivals, failures)
→ Service times, wait times
→ Survival analysis (time until event)
Real MNC Examples:
→ Uber: Time between ride requests in a zone
→ Amazon: Time between customer complaints (customer service SLA)
→ Netflix: Time until a server fails (reliability engineering)
→ Google Cloud: Time between network packet arrivals
Example — Uber Driver Wait Time:
Average ride requests arrive every 5 minutes (λ = 0.2/min)
Time between requests ~ Exponential(λ = 0.2)
P(wait > 10 minutes) = e^(-0.2×10) = e^(-2) ≈ 13.5%
PDF: f(x; α, β) = x^(α-1) (1-x)^(β-1) / B(α, β)
Domain: [0, 1] — perfect for probabilities!
Parameters: α, β control the shape
When it appears:
→ Prior distribution for probabilities (Bayesian)
→ Modeling conversion rates, CTR
→ A/B testing (Bayesian approach)
Real MNC Examples:
→ Google Ads: Model CTR probability for a new ad (Beta prior + Binomial data)
→ Amazon: Model product defect rate
→ Netflix: Model subscription conversion probability
→ Uber: Model driver acceptance rate
Bayesian Update with Beta:
Prior: Beta(α, β) — your initial belief about the conversion rate
Observe: k conversions out of n trials
Posterior: Beta(α+k, β+n-k) — updated belief!
This is the conjugate prior for the Binomial!
The 80-20 Rule:
→ 20% of products generate 80% of revenue
→ 20% of users create 80% of content
→ A few words are very frequent, most are rare
Real MNC Examples:
→ Netflix: Top 10% of shows get 60%+ of viewing
→ Amazon: Top 20% of products = 80% of revenue
→ Twitter: Top 1% of users create 50%+ of content
→ Google Search: A few queries are searched millions of times, most only once
Implication for ML:
→ Word frequency in NLP follows a power law (Zipf's law)
→ Social network degree distribution follows a power law
→ Need special handling: class imbalance, rare-item recommendations
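The 80-20 pattern can be reproduced with a heavy-tailed Pareto sample in NumPy (purely synthetic revenue numbers):
import numpy as np
rng = np.random.default_rng(0)
# Pareto with shape ≈ 1.16 is the classic parameter that yields roughly an 80/20 split
revenue = rng.pareto(1.16, size=100_000) + 1.0   # synthetic per-product revenue
revenue.sort()
top20_share = revenue[int(0.8 * revenue.size):].sum() / revenue.sum()
print(f"Revenue share of the top 20% of products: {top20_share:.0%}")   # close to 80%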
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# 1. Uber ride requests — Poisson
lambda_rate = 12 # requests per minute
x = np.arange(0, 30)
poisson_pmf = stats.poisson.pmf(x, lambda_rate)
print(f"P(>20 requests in 1 min): {1 - stats.poisson.cdf(20, lambda_rate):.4f}")
# 2. Netflix A/B Test — Binomial
n_users = 50000
p_convert = 0.032 # 3.2% trial-to-subscription
binomial_rv = stats.binom(n=n_users, p=p_convert)
expected_subs = n_users * p_convert
print(f"Expected new subscriptions: {expected_subs:.0f}")
print(f"P(less than 1400 subs): {binomial_rv.cdf(1400):.4f}")
# 3. Google Ads CTR — Beta Distribution
# Prior: Alpha=10, Beta=1000 (historic ~1% CTR)
# After seeing 50 clicks in 3000 impressions:
alpha_prior = 10; beta_prior = 1000
clicks = 50; impressions = 3000
alpha_post = alpha_prior + clicks
beta_post = beta_prior + (impressions - clicks)
posterior = stats.beta(alpha_post, beta_post)
print(f"
Google Ads CTR:")
print(f"Posterior mean: {posterior.mean():.4f} ({posterior.mean()*100:.2f}%)")
ci = posterior.interval(0.95)
print(f"95% Credible Interval: [{ci[0]*100:.2f}%, {ci[1]*100:.2f}%]")
# 4. Amazon time between complaints — Exponential
lambda_service = 1/8 # avg 1 complaint per 8 hours
exp_rv = stats.expon(scale=1/lambda_service)
print(f"
Amazon Service SLA:")
print(f"P(next complaint within 2 hours): {exp_rv.cdf(2):.3f}")
print(f"P(no complaint for 24 hours): {1-exp_rv.cdf(24):.3f}")
# Test if data follows Normal distribution
from scipy.stats import normaltest
data = np.random.normal(0, 1, 1000)
stat, p_value = normaltest(data)
print(f"
Normality Test p-value: {p_value:.4f}")
print("Normal distribution" if p_value > 0.05 else "NOT normal — transform or use non-parametric!")
Definition: An unsupervised dimensionality reduction technique that finds the directions of maximum variance in data and projects data onto these directions.
Netflix has 5000+ shows, each described by 500+ features (genre tags, actor embeddings, director style, language, mood, etc.)
Problem 1: High dimensionality → curse of dimensionality
Problem 2: Many features are correlated (action + explosions, romance + drama)
Problem 3: Visualization is impossible in 500D
PCA Solution:
→ Find 50 directions that explain 95% of the variance
→ Reduce 500 features to 50 "principal components"
→ Each PC = linear combination of original features
→ PCs are uncorrelated (orthogonal)
1. Standardize data: X_std = (X - mean) / std
   → Each feature has mean = 0, variance = 1
2. Compute Covariance Matrix: C = X_std^T × X_std / (n-1)
   → Captures relationships between features
3. Eigenvalue Decomposition: C = V × Λ × V^T
   V = Eigenvectors (directions of variance)
   Λ = Eigenvalues (amount of variance in each direction)
4. Sort by eigenvalue (largest first)
   → First eigenvector = direction of MOST variance (PC1)
   → Second eigenvector = next most variance (PC2), etc.
5. Choose k components: Explained Variance Ratio = λ_i / Σλ_j
   Select k where cumulative explained variance ≥ 95%
6. Transform data: X_reduced = X_std × V_k (k = number of components kept)
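A from-scratch NumPy sketch of these six steps on random data (illustrative only; the sklearn version used in practice follows further below):
import numpy as np
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))                     # 200 samples, 5 features
# 1. Standardize
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
# 2. Covariance matrix
C = X_std.T @ X_std / (X_std.shape[0] - 1)
# 3. Eigen-decomposition (eigh because C is symmetric)
eigvals, eigvecs = np.linalg.eigh(C)
# 4. Sort by eigenvalue, largest first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 5. Explained variance ratio → smallest k reaching 95%
evr = eigvals / eigvals.sum()
k = int(np.searchsorted(np.cumsum(evr), 0.95)) + 1
print(f"Components for 95% variance: {k}")
# 6. Project onto the top-k principal components
X_reduced = X_std @ eigvecs[:, :k]
print(X_reduced.shape)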
Imagine describing a person in 3D (height, weight, age):
PC1: "Overall body size" (explains 70% variance)
→ Combination of height + weight
PC2: "Age factor" (explains 20% variance)
→ Mostly age, some weight
PC3: "Shape" (explains 10% variance)
→ Height minus weight
We could keep just PC1 + PC2 and explain 90%!
Netflix Content-Based Matrix:
Action Romance Thriller Sci-Fi Comedy ... (500 genre tags)
Inception 0.9 0.1 0.8 0.9 0.0 ...
Friends 0.0 0.7 0.0 0.0 0.9 ...
Avengers 0.95 0.2 0.5 0.7 0.1 ...
...
5000 shows × 500 features = 2,500,000 numbers to process!
PCA(n_components=50): Each show is now represented by 50 numbers instead of 500
PC1 might capture: "Dark & Intense" (action + thriller + sci-fi load high)
PC2 might capture: "Feel-Good" (comedy + romance load high)
PC3 might capture: "Intellectual" (documentary + drama load high)
Now: 5000 shows × 50 PCs (vs 500 original features)
Much faster to compute similarities!
User rating matrix:
Movie1 Movie2 Movie3 ... (5000 movies)
User1: 4.0 - 3.0 ...
User2: - 5.0 4.5 ...
PCA on users: Reduce to "user taste vectors" in same 50D space
User1 vector: [0.8, 0.2, 0.7, ...] in PC space
"User likes dark, intense, intellectual content"
Find shows closest to the user's taste vector (cosine similarity)
→ Much faster search in 50D vs 500D
→ 10x speedup in recommendation generation
Netflix processes 250M user × 5000 show recommendations hourly
PCA makes this computationally feasible!
Netflix's actual system uses Matrix Factorization (closely related to PCA), via SVD (Singular Value Decomposition):
Rating Matrix R ≈ U × S × V^T
U = User latent factors (250M × 50)
S = Strength of each factor
V = Item latent factors (5000 × 50)
Recommendation: predicted_rating = U[user] · V[movie]
Find the top-k movies with the highest predicted rating! (A small SVD sketch follows after the Python example below.)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
# Netflix-style content feature matrix
# 1000 shows × 100 genre/mood tags
# Placeholder so the demo runs: a low-rank random matrix standing in for real content features
content_matrix = np.random.rand(1000, 20) @ np.random.rand(20, 100)  # (1000 shows, 100 tags)
# Step 1: Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(content_matrix)
# Step 2: Apply PCA
pca = PCA()
pca.fit(X_scaled)
# Step 3: Choose number of components (explain 95% variance)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cumulative_variance >= 0.95) + 1
print(f"Components for 95% variance: {n_components}") # ~30-40 components
# Visualize explained variance
plt.plot(cumulative_variance)
plt.axhline(y=0.95, color="red", linestyle="--", label="95% threshold")
plt.xlabel("Number of Components"); plt.ylabel("Cumulative Explained Variance")
plt.title("Scree Plot — Netflix Content PCA")
# Step 4: Transform
pca_final = PCA(n_components=n_components)
X_reduced = pca_final.fit_transform(X_scaled)
print(f"Original: {X_scaled.shape}, Reduced: {X_reduced.shape}")
# Step 5: Find similar content (Netflix recommendation)
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(X_reduced)
def get_recommendations(show_idx, top_k=5):
similarities = similarity_matrix[show_idx]
similar_indices = np.argsort(similarities)[::-1][1:top_k+1]
return [(show_names[i], similarities[i]) for i in similar_indices]
# Recommend shows similar to "Inception" (show_names: list of titles aligned with content_matrix rows, assumed to exist)
inception_idx = show_names.index("Inception")
recommendations = get_recommendations(inception_idx, top_k=5)
print("
Shows similar to Inception:")
for show, score in recommendations:
print(f" {show}: {score:.3f} similarity")
# What do PCA components mean?
print("
Top features for PC1 (first principal component):")
pc1_loadings = pd.Series(pca_final.components_[0], index=feature_names)  # feature_names: the 100 tag names (assumed to exist)
print(pc1_loadings.abs().nlargest(5)) # Top 5 contributing features
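And a tiny sketch of the SVD-style factorization described earlier, on a made-up dense rating matrix (real systems use sparse solvers such as ALS over billions of ratings):
import numpy as np
# Toy user × movie rating matrix (0 = not yet rated; treated as 0 for simplicity)
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)
U, S, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                                             # number of latent factors
R_hat = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]     # predicted ratings: R ≈ U_k S_k V_k^T
user = 0
preds = R_hat[user].copy()
preds[R[user] > 0] = -np.inf                      # mask movies the user already rated
print("Recommend movie index:", int(np.argmax(preds)))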
| Step | Description | Amazon Example |
|---|---|---|
| 1. State Hypotheses | H₀ (null) and H₁ (alternative) | H₀: New checkout has no effect on conversion. H₁: New checkout increases conversion |
| 2. Choose α | Significance level (false positive rate) | α = 0.05 (5% false alarm risk) |
| 3. Collect Data | Run experiment, gather observations | 2 weeks A/B test, 100K users each |
| 4. Compute Test Statistic | Summarize data into single number | z-statistic, t-statistic, chi-square |
| 5. Get p-value | Probability under H₀ | p = 0.02 |
| 6. Make Decision | p < α → Reject H₀ | p=0.02 < 0.05 → Launch! |
| Decision \ True Reality | H₀ True (no real effect) | H₁ True (real effect exists) |
|---|---|---|
| Reject H₀ (say effect exists) | Type I Error (α): False Positive, "False Alarm" | Correct! True Positive, Power = 1-β |
| Fail to Reject H₀ (say no effect) | Correct! True Negative, Specificity | Type II Error (β): False Negative, "Missed Discovery" |
Type I (α) = False POSITIVE = "Cry Wolf" — claim effect when none exists Type II (β) = False NEGATIVE = "Miss the Wolf" — miss real effect
H₀: One-click button doesn't change purchase conversion rate (3%)
H₁: One-click button INCREASES conversion rate
Type I Error (α = 0.05):
What it means: Launch the button, but it actually doesn't help
Consequence:
→ Engineering cost of permanent feature ($500K)
→ Potential UX complexity for no benefit
→ 5% chance of this mistake with α=0.05
Type II Error (β = 0.20 for 80% power):
What it means: Decide the button doesn't help, but it actually does!
Consequence:
→ Lose potential $5M/year in additional revenue
→ Customers face extra friction in checkout
→ 20% chance of missing real improvement
Amazon's decision: Lower α (0.01) when feature cost is high
Run longer test to reduce β
H₀: Batch of products is defect-free
H₁: Batch has more than 2% defect rate
Type I Error: REJECT good batch → costly recall, destroy good products
Cost: $200K product waste, supplier relationship damage
Type II Error: ACCEPT bad batch → defective products reach customers
Cost: $2M in returns, reviews damage, Amazon reputation
Decision: For safety-critical products, minimize Type II Error (β)
Use α = 0.01 (very strict) — rather over-detect than miss
Power = 95% (very high) to catch real defects
Testing new recommendation algorithm for books:
Type I (False Positive — say the algorithm is better, it's not):
→ Deploy suboptimal algorithm
→ Users see slightly worse recommendations
→ Minor revenue impact: -$10K/day
→ Acceptable risk!
Type II (False Negative — say the algorithm is worse, it's actually better):
→ Discard a genuinely better algorithm
→ Lose: +$50K/day additional revenue opportunity
→ High opportunity cost!
Decision: For revenue-positive features, minimize Type II Error
Use higher α (0.10) — more willing to accept false positive
Ensures we don't miss improvements!
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestPower
# Amazon Checkout Conversion A/B Test
# Control: 3% conversion, Treatment: 3.5% conversion (goal)
n = 10000 # per group
alpha = 0.05
# Simulate data
np.random.seed(42)
control = np.random.binomial(1, 0.03, n) # 3% conversion
treatment = np.random.binomial(1, 0.035, n) # 3.5% conversion
# Z-test for proportions
from statsmodels.stats.proportion import proportions_ztest
count = np.array([treatment.sum(), control.sum()])
nobs = np.array([n, n])
z_stat, p_value = proportions_ztest(count, nobs, alternative="larger")
print(f"Control conversion: {control.mean():.3f} ({control.mean()*100:.1f}%)")
print(f"Treatment conversion: {treatment.mean():.3f} ({treatment.mean()*100:.1f}%)")
print(f"Z-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Decision: {"Launch ✅" if p_value < alpha else "Do Not Launch ❌"}")
# Power Analysis — minimize Type II Error
from statsmodels.stats.power import NormalIndPower
analysis = NormalIndPower()
# Given: current power (1-β)
effect_size = 0.005 / (0.03 * 0.97)**0.5 # Standardized effect
current_power = analysis.power(effect_size=effect_size, nobs1=n, alpha=alpha)
print(f"
Current Power (1-β): {current_power:.3f}")
print(f"Type II Error Rate (β): {1-current_power:.3f}")
# How many users to achieve 90% power?
n_required = analysis.solve_power(effect_size=effect_size, power=0.9, alpha=alpha)
print(f"Users needed for 90% power: {int(n_required*2):,} total (each group: {int(n_required):,})")
Statement: The sampling distribution of the sample mean approaches a normal distribution as sample size increases, regardless of the population distribution.
If X₁, X₂, ..., Xₙ are i.i.d. samples from ANY distribution with:
- Mean: μ
- Variance: σ²
Then as n → ∞: X̄ ~ N(μ, σ²/n)
Standard Error (SE) = σ/√n
Key insight: With n ≥ 30, sample means are approximately normal even if the original data is skewed, bimodal, etc.!
Population: All Netflix users worldwide (250M+)
Unknown distribution of daily watch time (very skewed, not normal!)
→ Many users watch 0 min (didn't open the app)
→ Some watch 1-2 hours
→ Few watch 8+ hours (binge-watch days)
Question: What is the average daily watch time?
Sample: Take 1000 random users
Sample Mean: X̄ = 87 minutes
Sample Std Dev: s = 52 minutes
Standard Error: SE = 52/√1000 = 1.64 minutes
By CLT: The sample mean is approximately Normal!
→ Can use the normal distribution for hypothesis testing
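A quick NumPy simulation of this idea with synthetic, skewed "watch time" data (not real Netflix numbers):
import numpy as np
rng = np.random.default_rng(0)
# Skewed population: many zeros (didn't open the app) plus a long right tail of minutes watched
population = np.concatenate([
    np.zeros(400_000),
    rng.exponential(scale=90, size=600_000),
])
# Distribution of the sample mean for n = 1000, repeated 2000 times
samples = rng.choice(population, size=(2000, 1000))
sample_means = samples.mean(axis=1)
print(f"Population mean: {population.mean():.1f} min")
print(f"Mean of sample means: {np.mean(sample_means):.1f} min")
print(f"SE (theory, σ/√n): {population.std() / np.sqrt(1000):.2f}")
print(f"SE (empirical):    {np.std(sample_means):.2f}")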
Correct Definition: The probability of observing a test statistic as extreme as (or more extreme than) the one calculated, ASSUMING the null hypothesis is true.
P-value = P(observing data at least this extreme | H₀ is true)
NOT:
❌ "Probability that H₀ is true" (wrong!)
❌ "Probability that our result is due to chance" (wrong!)
❌ "Probability that we made an error" (wrong!)
Correct interpretation: "If there were truly no effect, we would see data this extreme only X% of the time."
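A small simulation of that interpretation: when H₀ is true (two groups drawn from the same distribution), p < 0.05 shows up in roughly 5% of experiments (synthetic data):
import numpy as np
from scipy import stats
rng = np.random.default_rng(1)
false_alarms = 0
n_experiments = 2000
for _ in range(n_experiments):
    a = rng.normal(87, 52, 1000)   # both groups drawn from the SAME distribution
    b = rng.normal(87, 52, 1000)
    _, p = stats.ttest_ind(a, b)
    false_alarms += p < 0.05
print(f"Fraction of A/A tests with p < 0.05: {false_alarms / n_experiments:.3f}")  # ≈ 0.05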
H₀: New "Skip Intro" button has no effect on viewing duration H₁: Skip Intro button increases viewing duration Data: Control group (no button): Mean = 87 min, n=10000 Treatment group (with button): Mean = 91 min, n=10000, SD=52 t-statistic = (91 - 87) / (52/√10000) = 4 / 0.52 = 7.69 P-value = P(t > 7.69) ≈ 0.000001 Interpretation: "If Skip Intro had NO effect, we would see a 4-minute difference this large in only 0.0001% of experiments" → Very strong evidence against H₀ → Launch the feature!
Definition: A range of values that we are X% confident contains the true population parameter.
95% CI Formula: CI = X̄ ± Z_(α/2) × (σ/√n) = X̄ ± 1.96 × SE
Correct interpretation: "If we repeated this experiment 100 times, about 95 of those CIs would contain the true mean"
Netflix Example: X̄ = 91 minutes, SE = 0.52, n = 10,000
95% CI = 91 ± 1.96 × 0.52 = 91 ± 1.02 = [89.98, 92.02] minutes
Meaning: "We are 95% confident the true avg viewing time with the Skip Intro button is between 89.98 and 92.02 minutes per day"
| Aspect | P-value | Confidence Interval |
|---|---|---|
| Tells you | Is effect significant? | How BIG is the effect? |
| Business value | Should we launch? | Is it worth launching? |
| Netflix example | p<0.05 → launch | [+1.5, +6.5] min → is 4min valuable? |
| Recommendation | Always report both! | CI is more informative |
Scenario: Should Netflix launch the Auto-Play next episode feature?
Groups:
Control: No auto-play (n = 50,000 users, 2 weeks)
Treatment: Auto-play after 5 seconds (n = 50,000 users, 2 weeks)
Results:
Control: Mean daily viewing = 85 min, SD = 60 min
Treatment: Mean daily viewing = 93 min, SD = 65 min
Analysis:
Effect size = 8 minutes (+9.4% increase)
SE = √((60²/50000) + (65²/50000)) = √0.1565 ≈ 0.40 min
t-stat = 8 / 0.40 ≈ 20
p-value ≈ 0.0000001 (extremely significant!)
95% CI: [7.2 min, 8.8 min] increase
Business Decision:
→ 8 extra min/user/day × 250M users = 33M+ extra viewing hours/day
→ More viewing → lower churn → +$XXM revenue
→ Counter-consideration: Does it annoy users? (check NPS)
→ DECISION: Launch with an opt-out option
Reality: Netflix launched auto-play in 2016. It became one of their most impactful features.
import numpy as np
from scipy import stats
# Netflix auto-play experiment
np.random.seed(42)
control = np.random.normal(85, 60, 50000) # no auto-play
treatment = np.random.normal(93, 65, 50000) # with auto-play
# Hypothesis test
t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"T-statistic: {t_stat:.2f}")
print(f"P-value: {p_value:.8f}")
print(f"Significant: {p_value 0 else "includes zero"}")
# Effect size (Cohen d)
pooled_std = np.sqrt((control.std()**2 + treatment.std()**2) / 2)
cohens_d = diff / pooled_std
print(f"
Effect Size (Cohen d): {cohens_d:.3f}")
print(f"Interpretation: {"Small" if cohens_d < 0.2 else "Medium" if cohens_d < 0.5 else "Large"}")
Goal: Compress data into a small latent space (encoder), then reconstruct it (decoder). Learns the most important features automatically.
Architecture:
Input (784 dims) → [Encoder] → Latent Space (32 dims) → [Decoder] → Output (784 dims)
Encoder: Compresses 784 → 512 → 256 → 128 → 32
Decoder: Reconstructs 32 → 128 → 256 → 512 → 784
Loss = Reconstruction Loss (how close is output to input?)
= MSE(input, reconstructed_output)
The 32-dimensional latent vector = compressed representation!
It captures the ESSENCE of the input data.
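A minimal PyTorch sketch of this plain autoencoder (layer sizes follow the 784 → 512 → 256 → 128 → 32 description above; MSE reconstruction loss):
import torch
import torch.nn as nn
class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 32),               # 32-dim latent code
        )
        self.decoder = nn.Sequential(
            nn.Linear(32, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, 784), nn.Sigmoid(),
        )
    def forward(self, x):
        z = self.encoder(x)                   # compressed representation
        return self.decoder(z)
model = AutoEncoder()
x = torch.rand(16, 784)                       # dummy batch of flattened inputs
loss = nn.functional.mse_loss(model(x), x)    # reconstruction loss
print(loss.item())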
| Application | How | Company |
|---|---|---|
| Anomaly Detection | High reconstruction error = anomaly | Amazon (fraud) |
| Dimensionality Reduction | Use latent space instead of PCA | Netflix (content embedding) |
| Image Denoising | Noisy input → clean output | Adobe Photoshop AI |
| Data Compression | Latent code = compressed data | General use |
| Feature Learning | Latent space = meaningful features | Spotify (music DNA) |
Problem with regular AE: Latent space is not smooth — can't generate new samples by sampling from it.
VAE Solution: Encode to a probability distribution (mean + variance), not a single point. Enables generation!
Regular AE: Input → [Encoder] → z (single point) → [Decoder]
VAE: Input → [Encoder] → μ, σ → Sample z ~ N(μ, σ²) → [Decoder]
VAE Loss = Reconstruction Loss + KL Divergence
KL Divergence: Forces the latent distribution close to N(0,1)
→ Smooth, continuous latent space
→ Can sample ANY point and generate realistic output!
Reparameterization Trick: z = μ + σ × ε, where ε ~ N(0,1)
→ Makes sampling differentiable for backprop!
Step 1: Audio Feature Extraction
Each song → analyze a 30-second clip
→ Extract: tempo (BPM), key, loudness, energy, danceability, valence
→ Mel Spectrogram: visual representation of audio frequencies
Step 2: CNN Autoencoder on Spectrograms
Mel Spectrogram (128×128 image) → CNN Encoder → 128-dim music embedding
→ Similar sounds → similar embeddings (close in latent space)
Step 3: User Listening History
User listens to 50 songs → average their embeddings
→ "User music taste vector" in the same 128-dim space
Step 4: Recommendation = Nearest Neighbor Search
Find songs whose embedding is closest to the user vector!
Each song is mapped to the latent space:
→ Classical music: clusters in one region
→ Hip-hop: different cluster
→ But overlapping regions = fusion genres!
VAE advantage: Can GENERATE new music in the latent space
→ Interpolate between two songs
→ "Find music halfway between jazz and electronic"
→ Spotify's "Blend" playlist uses this concept!
1. Collaborative Filtering: "Users like you listened to X"
   → ALS (Alternating Least Squares) matrix factorization
   → User matrix (500K features) × Item matrix (500K features)
2. Content-Based (CNN Autoencoder):
   → Works for NEW songs with 0 listeners!
   → "Cold start problem" solved by audio features
3. NLP on Playlist Names:
   → "chill sunday morning" → Word2Vec embeddings
   → Connect semantic meaning to audio features
4. Blend all signals → Final recommendation ranking
import torch
import torch.nn as nn
import torch.nn.functional as F
class VAE(nn.Module):
def __init__(self, input_dim=784, hidden_dim=512, latent_dim=32):
super().__init__()
# Encoder
self.fc1 = nn.Linear(input_dim, hidden_dim)
self.fc_mu = nn.Linear(hidden_dim, latent_dim) # Mean
        self.fc_log_var = nn.Linear(hidden_dim, latent_dim)  # log(σ²)
# Decoder
self.fc3 = nn.Linear(latent_dim, hidden_dim)
self.fc4 = nn.Linear(hidden_dim, input_dim)
def encode(self, x):
h = F.relu(self.fc1(x))
        return self.fc_mu(h), self.fc_log_var(h)  # μ and log(σ²)
def reparameterize(self, mu, log_var):
std = torch.exp(0.5 * log_var)
eps = torch.randn_like(std) # ε ~ N(0,1)
return mu + eps * std # z = μ + σε
def decode(self, z):
h = F.relu(self.fc3(z))
return torch.sigmoid(self.fc4(h))
def forward(self, x):
mu, log_var = self.encode(x)
z = self.reparameterize(mu, log_var)
return self.decode(z), mu, log_var
def vae_loss(recon_x, x, mu, log_var):
# Reconstruction loss
recon_loss = F.binary_cross_entropy(recon_x, x, reduction="sum")
# KL divergence: force latent space to be N(0,1)
kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
return recon_loss + kl_loss # β-VAE: use β * kl_loss for control
# Spotify-style music embedding
# After training, use ONLY the encoder
vae = VAE(input_dim=16384, latent_dim=128) # Mel spectrogram → 128 dim
def get_music_embedding(audio_spectrogram):
    with torch.no_grad():               # inference only, no gradients needed
        mu, _ = vae.encode(audio_spectrogram)
    return mu.numpy()                   # use the mean as the song embedding
# Find similar songs
import numpy as np
from sklearn.neighbors import NearestNeighbors
song_embeddings = np.array([get_music_embedding(spec) for spec in all_songs])  # all_songs: spectrogram tensors (assumed available)
knn = NearestNeighbors(n_neighbors=10, metric="cosine")
knn.fit(song_embeddings)
# Recommend for user
user_vector = song_embeddings[user_liked_songs].mean(axis=0)  # user_liked_songs: indices of songs the user liked (assumed)
distances, indices = knn.kneighbors([user_vector])
recommended_songs = [all_songs[i] for i in indices[0]]
As training progresses, each layer receives inputs from the previous layer.
When the previous layer's weights change → the next layer's input distribution changes too!
→ "Internal Covariate Shift"
Effect: Each layer must continuously adapt to a shifting input distribution
→ Very slow training
→ Need small learning rates
→ Requires careful weight initialization
→ Hard to train networks deeper than 5-10 layers
Solution: Normalize layer inputs to have zero mean and unit variance using the current mini-batch statistics.
For a mini-batch B = {x_1, x_2, ..., x_m}:
Step 1: Batch mean: μ_B = (1/m) Σ x_i
Step 2: Batch variance: σ²_B = (1/m) Σ (x_i - μ_B)²
Step 3: Normalize: x̂_i = (x_i - μ_B) / √(σ²_B + ε)
Step 4: Scale & shift: y_i = γ × x̂_i + β
Where:
γ (gamma) = learned scale parameter
β (beta) = learned shift parameter
ε = small constant (1e-5) for numerical stability
These γ and β let the network UNDO normalization if needed!
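A toy numeric walk-through of the four steps (values chosen purely for illustration):
import torch

x = torch.tensor([2.0, 4.0, 6.0, 8.0])          # one activation across a mini-batch of 4
eps = 1e-5

mu_B = x.mean()                                  # Step 1: batch mean = 5.0
var_B = x.var(unbiased=False)                    # Step 2: batch variance = 5.0
x_hat = (x - mu_B) / torch.sqrt(var_B + eps)     # Step 3: normalize → zero mean, unit variance
gamma, beta = 1.0, 0.0                           # Step 4: learned scale & shift (initial values)
y = gamma * x_hat + beta
print(x_hat)                                     # ≈ [-1.342, -0.447, 0.447, 1.342]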
Conv → [BatchNorm] → ReLU → Pool → ... (for CNNs)
Linear → [BatchNorm] → ReLU → ... (for fully connected)
Important: applied BEFORE the activation (most common placement)
| Benefit | Impact | Production Relevance |
|---|---|---|
| Faster training | 10-14x faster convergence | Reduces GPU cost by 70%+ |
| Higher learning rates | 10x larger LR possible | Even faster training |
| Regularization effect | Often replaces Dropout | Simpler architecture |
| Less initialization sensitivity | Works with basic init | Easier to set up new models |
| Enables very deep networks | 100+ layer networks | ResNet (152 layers) uses BN! |
Without BN: only 20-30 layers trainable, 65% accuracy
With BN: 152 layers trainable, 96%+ accuracy on ImageNet
Google Photos, Google Lens — all use ResNet with BN
Training time: 2 weeks → 3 days with BN on the same hardware
DLRM (Deep Learning Recommendation Model): Feature processing layers use BN → 40% faster training → Trains on 1TB+ data daily → Updates model every few hours for real-time recommendations
DeepETA model (2022): Predicts trip duration Uses BN in all dense layers → Stabilizes training across different city scales → Mumbai trips: 1-60 min, New York: 5-120 min → BN normalizes these different scales automatically → Single model works globally!
| Variant | Normalizes Over | Best For |
|---|---|---|
| Batch Norm | Mini-batch dimension | CNN, large batch training |
| Layer Norm | Feature dimension per sample | Transformers, NLP, small batches |
| Instance Norm | Spatial, per channel per sample | Style transfer, image gen |
| Group Norm | Groups of channels | Detection, small batches |
Key Note: Transformers (BERT, GPT) use Layer Normalization — normalizes over feature dim per sample, works with batch size = 1!
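A quick way to see the practical difference (a toy sketch, assuming a single-sample batch):
import torch
import torch.nn as nn

x = torch.randn(1, 768)                 # batch size = 1, feature dim = 768 (BERT-sized)

layer_norm = nn.LayerNorm(768)
print(layer_norm(x).shape)              # works: normalizes across the 768 features of each sample

batch_norm = nn.BatchNorm1d(768)
batch_norm.train()
try:
    batch_norm(x)                       # needs batch statistics — a single training sample is not enough
except Exception as e:
    print("BatchNorm with batch size 1:", e)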
import torch
import torch.nn as nn
# Batch Normalization in CNN (Google ResNet style)
class ConvBNReLU(nn.Module):
def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
super().__init__()
self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding=1, bias=False)
self.bn = nn.BatchNorm2d(out_channels) # BN after Conv
self.relu = nn.ReLU(inplace=True)
def forward(self, x):
return self.relu(self.bn(self.conv(x))) # Conv → BN → ReLU
# Layer Normalization in Transformer (BERT style)
class TransformerBlock(nn.Module):
def __init__(self, d_model=768, num_heads=12, ff_dim=3072):
super().__init__()
self.attention = nn.MultiheadAttention(d_model, num_heads)
self.norm1 = nn.LayerNorm(d_model) # Layer Norm after attention
self.ff = nn.Sequential(
nn.Linear(d_model, ff_dim), nn.GELU(),
nn.Linear(ff_dim, d_model)
)
self.norm2 = nn.LayerNorm(d_model) # Layer Norm after feed-forward
def forward(self, x):
# Pre-norm or Post-norm (Post-norm shown here)
attn_out, _ = self.attention(x, x, x)
x = self.norm1(x + attn_out) # Residual + LayerNorm
ff_out = self.ff(x)
return self.norm2(x + ff_out) # Residual + LayerNorm
# Manual Batch Norm implementation
class ManualBatchNorm(nn.Module):
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_features))   # Scale (γ)
        self.beta = nn.Parameter(torch.zeros(num_features))   # Shift (β)
        self.eps = eps
        self.momentum = momentum
        # Running statistics, used instead of batch statistics at inference time
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))
    def forward(self, x):
        if self.training:
            mean = x.mean(dim=0)
            var = x.var(dim=0, unbiased=False)
            # Track running statistics for inference
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean.detach()
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var.detach()
        else:
            mean, var = self.running_mean, self.running_var
        x_norm = (x - mean) / torch.sqrt(var + self.eps)       # Normalize
        return self.gamma * x_norm + self.beta                 # Scale & shift
Core Idea: An Agent learns to take Actions in an Environment to maximize cumulative Reward over time — through trial and error.
| Component | Description | Netflix Example |
|---|---|---|
| Agent | The learner/decision maker | Recommendation system |
| Environment | What agent interacts with | Netflix platform + users |
| State (s) | Current situation | User watch history, time of day |
| Action (a) | What agent can do | Which show to recommend |
| Reward (r) | Feedback signal | +1 if user watches 30min, -1 if skip |
| Policy (π) | Agent strategy | Recommendation algorithm |
Maximize: G_t = r_t + γ*r_{t+1} + γ²*r_{t+2} + ...
γ (gamma) = discount factor (0-1)
→ Immediate rewards > future rewards
→ γ=0.95: future reward at step 20 = 0.95^20 ≈ 0.36 of immediate
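A minimal sketch of computing G_t for a short reward sequence with γ=0.95:
def discounted_return(rewards, gamma=0.95):
    # G_t = r_t + γ*r_{t+1} + γ²*r_{t+2} + ..., computed backwards for efficiency
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

print(discounted_return([1, 0, 0, 2]))           # 1 + 0.95**3 * 2 ≈ 2.71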
Q-Learning: Q(s,a) ← r + γ · max_a' Q(s', a')
DQN: Deep Q-Network — a neural net approximates the Q-function
Policy Gradient: directly optimize the policy π
PPO: Proximal Policy Optimization — stable, most widely used today
A3C: Asynchronous Advantage Actor-Critic — parallel training
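A minimal tabular Q-learning sketch of the update above (a learning rate α is added, as in the standard update rule; state/action counts are toy values). DQN, used in the Netflix example below, replaces this table with a neural network:
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))              # the Q-table
alpha, gamma = 0.1, 0.95                         # learning rate, discount factor

def q_update(s, a, r, s_next):
    # Q(s,a) ← Q(s,a) + α [ r + γ·max_a' Q(s',a') − Q(s,a) ]
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

q_update(s=0, a=2, r=1.0, s_next=3)
print(Q[0, 2])                                   # 0.1 — one step toward the true value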
Environment: Go board (~10^170 possible positions)
State: board position (19×19 grid)
Actions: where to place the next stone
Reward: +1 win, -1 loss
Training: self-play (played millions of games against itself)
Result: defeated world champion Lee Sedol 4-1 (2016)
Environment: Google server farms worldwide
State: 20 sensor readings (temperature, power, pumps)
Actions: Adjust cooling systems (120 possible actions)
Reward: Minimize energy used per compute unit (PUE)
Before RL: Human engineers optimized manually
After RL: 40% reduction in cooling energy cost
15% reduction in overall PUE
Saves hundreds of millions of dollars annually!
Deep learning to predict 3D protein structure from the amino acid sequence
200M+ protein structures predicted
A 50-year-old grand challenge effectively solved in 2020
Impact: drug discovery, disease research
Supervised: learns from past data — static
RL: learns from ongoing interactions — adaptive!
Problem with supervised learning:
- A user watches a bad recommendation → it gets labeled as "liked" (they watched it)
- Doesn't capture long-term satisfaction
RL solution — optimize rewards at multiple horizons:
- Short-term: did the user click?
- Medium-term: did the user watch 70%+?
- Long-term: did the user remain subscribed this month?
- Longer-term: did recommendation diversity keep the user engaged over 6 months?
State: [watch history, search queries, ratings, time of day, device]
Action: select 1 show from 5000+ to fill the "Top Pick" slot
Reward function:
+3 points: user watches 80%+ of the show
+1 point: user watches 20-80%
0 points: user watches <20%
-1 point: user closes the app immediately
-2 points: user cancels the subscription
Policy: Deep Q-Network (DQN)
Training: continuous A/B testing — half of users get RL recommendations, half get the baseline
import numpy as np
import torch
import torch.nn as nn
# Deep Q-Network (DQN)
class DQN(nn.Module):
def __init__(self, state_size, action_size):
super().__init__()
self.network = nn.Sequential(
nn.Linear(state_size, 128), nn.ReLU(),
nn.Linear(128, 128), nn.ReLU(),
nn.Linear(128, action_size)
)
def forward(self, x): return self.network(x)
# Netflix-style recommendation RL
class RecommendationRL:
def __init__(self, state_size=50, n_shows=1000, lr=0.001, gamma=0.95):
self.gamma = gamma
self.epsilon = 1.0 # exploration rate
self.model = DQN(state_size, n_shows)
self.optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)
    def select_action(self, state):
        # ε-greedy: explore with probability ε, otherwise pick the highest-Q show
        with torch.no_grad():
            q_values = self.model(torch.FloatTensor(state))
        if np.random.random() < self.epsilon:
            action = np.random.randint(len(q_values))   # explore: random show
        else:
            action = int(q_values.argmax())             # exploit: best predicted show
        if self.epsilon > 0.01:
            self.epsilon *= 0.995                       # decay exploration over time
        return action
Invented by: Ian Goodfellow (2014) — one of the most important AI papers ever.
| Player | Role | Analogy |
|---|---|---|
| Generator (G) | Creates fake data to fool Discriminator | Art forger — makes fake paintings |
| Discriminator (D) | Tells real from fake | Art expert — spots fakes |
1. Generator creates a fake image from random noise
2. Discriminator sees real + fake images
3. Discriminator tries to classify correctly: Real=1, Fake=0
4. Generator improves to fool the Discriminator
5. Repeat → both keep improving!
Equilibrium: the Generator creates images the Discriminator can't distinguish from real = Nash Equilibrium
Discriminator objective: max V(D) = E[log D(x)] + E[log(1 − D(G(z)))]
→ Maximize: correctly classify real AND fake
Generator objective: min V(G) = E[log(1 − D(G(z)))]
→ Minimize: make the Discriminator think fake is real
| GAN Type | Innovation | Use Case |
|---|---|---|
| DCGAN | Deep Conv layers in GAN | High-quality image gen |
| StyleGAN2 | Control style at each layer | Photo-realistic faces |
| CycleGAN | Unpaired image translation | Horse→Zebra, Summer→Winter |
| Conditional GAN | Generate specific class | "Generate a cat image" |
| WGAN | Wasserstein distance — stable training | Better training stability |
| Pix2Pix | Paired image-to-image translation | Sketch→Photo |
import torch
import torch.nn as nn
# Generator
class Generator(nn.Module):
def __init__(self, noise_dim=100, img_dim=784):
super().__init__()
self.gen = nn.Sequential(
nn.Linear(noise_dim, 256), nn.LeakyReLU(0.2),
nn.Linear(256, 512), nn.LeakyReLU(0.2),
nn.Linear(512, 1024), nn.LeakyReLU(0.2),
nn.Linear(1024, img_dim), nn.Tanh()
)
def forward(self, x): return self.gen(x)
# Discriminator
class Discriminator(nn.Module):
def __init__(self, img_dim=784):
super().__init__()
self.disc = nn.Sequential(
nn.Linear(img_dim, 512), nn.LeakyReLU(0.2),
nn.Linear(512, 256), nn.LeakyReLU(0.2),
nn.Linear(256, 1), nn.Sigmoid()
)
def forward(self, x): return self.disc(x)
# Training
gen = Generator(); disc = Discriminator()
opt_gen = torch.optim.Adam(gen.parameters(), lr=3e-4)
opt_disc = torch.optim.Adam(disc.parameters(), lr=3e-4)
criterion = nn.BCELoss()
num_epochs, batch_size = 50, 64          # example hyperparameters
for epoch in range(num_epochs):
    # real_images: the current mini-batch from your dataloader, flattened to (batch_size, 784)
    noise = torch.randn(batch_size, 100)
    fake = gen(noise)
    ones = torch.ones(batch_size)        # target labels for real images
    zeros = torch.zeros(batch_size)      # target labels for fake images
    # Train Discriminator
    disc_real = disc(real_images).view(-1)
    disc_fake = disc(fake.detach()).view(-1)
    loss_disc = criterion(disc_real, ones) + criterion(disc_fake, zeros)
opt_disc.zero_grad(); loss_disc.backward(); opt_disc.step()
# Train Generator
output = disc(fake).view(-1)
loss_gen = criterion(output, ones) # Want disc to think fake is real
opt_gen.zero_grad(); loss_gen.backward(); opt_gen.step()
Purpose: Process sequential data by maintaining a hidden state that carries information from previous steps.
Standard NN: Input → Output (no memory)
RNN: Input(t) + Hidden(t-1) → Output(t) + Hidden(t)
Unrolled RNN:
x1 → [h1] → x2 → [h2] → x3 → [h3] → Output
↑_________↑_________↑
hidden state flows forward
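A minimal NumPy sketch of the recurrence (toy dimensions, for illustration only):
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 3)) * 0.1             # input → hidden weights (3 features, 4 hidden units)
W_hh = rng.normal(size=(4, 4)) * 0.1             # hidden → hidden weights (the "memory" path)
b_h = np.zeros(4)

def rnn_step(x_t, h_prev):
    # The new hidden state depends on the current input AND the previous hidden state
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(4)
for x_t in rng.normal(size=(5, 3)):              # a sequence of 5 time steps
    h = rnn_step(x_t, h)
print(h.shape)                                   # (4,) — the final hidden state summarizes the sequence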
For long sequences (e.g., 100 steps): the gradient gets multiplied by the recurrent weights ~100 times
→ Gradient → 0 (vanishes) or → ∞ (explodes)
→ The RNN "forgets" information from early steps
→ Can't learn long-range dependencies!
Solution: LSTMs add gates to control what to remember, forget, and output.
| Gate | Function | Analogy |
|---|---|---|
| Forget Gate | What to erase from memory | "Delete old irrelevant info" |
| Input Gate | What new info to store | "Write new important info" |
| Cell State | Long-term memory (conveyor belt) | "The main memory storage" |
| Output Gate | What to output now | "Share relevant info" |
f_t = σ(W_f · [h_{t-1}, x_t] + b_f) # Forget Gate
i_t = σ(W_i · [h_{t-1}, x_t] + b_i) # Input Gate
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C) # Candidate values
C_t = f_t × C_{t-1} + i_t × C̃_t # Cell State update
o_t = σ(W_o · [h_{t-1}, x_t] + b_o) # Output Gate
h_t = o_t × tanh(C_t) # Hidden State output
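These equations translate almost line-for-line into code. A minimal NumPy sketch of one LSTM step (toy dimensions; real implementations fuse the four weight matrices for speed):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    concat = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ concat + b_f)            # Forget gate
    i_t = sigmoid(W_i @ concat + b_i)            # Input gate
    C_tilde = np.tanh(W_C @ concat + b_C)        # Candidate values
    C_t = f_t * C_prev + i_t * C_tilde           # Cell state update
    o_t = sigmoid(W_o @ concat + b_o)            # Output gate
    h_t = o_t * np.tanh(C_t)                     # Hidden state output
    return h_t, C_t

# Toy dimensions: 3 input features, 4 hidden units
hidden, inp = 4, 3
rng = np.random.default_rng(0)
W = lambda: rng.normal(size=(hidden, hidden + inp)) * 0.1
b = np.zeros(hidden)
h, C = np.zeros(hidden), np.zeros(hidden)
h, C = lstm_step(rng.normal(size=inp), h, C, W(), W(), W(), W(), b, b, b, b)
print(h.shape, C.shape)                          # (4,) (4,)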
Ride demand at 5pm depends on: yesterday at 5pm, last Friday at 5pm, the last hour's demand, current events, weather trends — ALL sequential patterns!
For each 5-minute window, the LSTM receives:
- Ride requests in the past 60 minutes (12 time steps)
- Driver availability count
- Weather (rain, temperature)
- Day of week, hour of day
- Special events (concerts, sports games)
- Historical demand at the same time last week
Time Series Input (60 min × 12 features)
↓
LSTM Layer 1 (128 units) — learns hourly patterns
↓
LSTM Layer 2 (64 units) — learns daily patterns
↓
Dense Layer (32 units, ReLU)
↓
Output: Surge Multiplier (1.0x to 5.0x)
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
# Prepare time series data
# X shape: (samples, timesteps=12, features=8)
# y: surge multiplier
def create_sequences(data, window=12):
X, y = [], []
for i in range(len(data) - window):
X.append(data[i:i+window])
y.append(data[i+window, 0]) # surge multiplier
return np.array(X), np.array(y)
X_train, y_train = create_sequences(uber_data)
# LSTM Model
model = Sequential([
LSTM(128, input_shape=(12, 8), return_sequences=True),
Dropout(0.2),
LSTM(64, return_sequences=False),
Dropout(0.2),
Dense(32, activation="relu"),
Dense(1, activation="linear") # surge multiplier output
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.fit(X_train, y_train, epochs=50, batch_size=64, validation_split=0.2)
# Predict next 30 min surge
current_window = X_test[-1:] # last 12 time steps
predicted_surge = model.predict(current_window)
print(f"Predicted Surge: {predicted_surge[0][0]:.1f}x")
| Model | Memory | Speed | Best For |
|---|---|---|---|
| RNN | Short-term only | Fastest | Very short sequences |
| LSTM | Long + Short term | Slower | Long sequences, complex patterns |
| GRU | Better than RNN | Faster than LSTM | Medium sequences, faster training |
Definition: A technique to evaluate machine learning models by training on different subsets of data and testing on remaining portions to ensure the model generalizes well.
Why It's Important:
Simple train-test split (70-30):
→ The model might perform well by luck on that specific test set
→ Results vary significantly with different random splits
→ Wastes data (only 70% used for training)
→ High variance in accuracy estimates
How It Works:
Visual Representation (K=5):
Iteration 1: [Test]  [Train] [Train] [Train] [Train]
Iteration 2: [Train] [Test]  [Train] [Train] [Train]
Iteration 3: [Train] [Train] [Test]  [Train] [Train]
Iteration 4: [Train] [Train] [Train] [Test]  [Train]
Iteration 5: [Train] [Train] [Train] [Train] [Test]
Final Score = Average of the 5 test scores
Dataset:
500,000 ad impressions
Features: ad text, user demographics, time of day, device type, location
Target: Click (1) or No Click (0)
Problem: Need to predict which ads users will click to optimize ad placement
Step 1: Split Data
Total: 500,000 impressions → K=5 folds of 100,000 impressions each
Step 2: Train and Evaluate
Fold 1: Train on impressions 100K-500K, Test on 0-100K
Accuracy: 87.2%, CTR Prediction Error: 0.12
Fold 2: Train on impressions 0-100K + 200K-500K, Test on 100K-200K
Accuracy: 88.1%, CTR Prediction Error: 0.11
Fold 3: Train on impressions 0-200K + 300K-500K, Test on 200K-300K
Accuracy: 86.8%, CTR Prediction Error: 0.13
Fold 4: Train on impressions 0-300K + 400K-500K, Test on 300K-400K
Accuracy: 87.5%, CTR Prediction Error: 0.12
Fold 5: Train on impressions 0-400K, Test on 400K-500K
Accuracy: 88.3%, CTR Prediction Error: 0.11
Step 3: Calculate Final Metrics
Average Accuracy = (87.2 + 88.1 + 86.8 + 87.5 + 88.3) / 5 = 87.58%
Standard Deviation ≈ 0.56%
Confidence: "The model achieves 87.58% ± 0.56% accuracy"
K typically = 5 or 10
Use when: Standard datasets, balanced classes
Maintains class distribution in each fold
Example: if the data is 30% clicks and 70% no-clicks, each fold keeps the same 30-70 ratio
Use when: Imbalanced classes (like click data - mostly no-clicks)
K = number of samples (extreme case)
Use when: Very small datasets (<1000 samples)
Don't use: Large datasets (too computationally expensive)
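A minimal LOOCV sketch (the small iris dataset stands in for a <1000-sample dataset):
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X_small, y_small = load_iris(return_X_y=True)    # 150 samples → 150 folds
loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X_small, y_small, cv=loo)
print(f"LOOCV accuracy: {scores.mean():.3f} over {len(scores)} folds")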
Train: [1-100] → Test: [101-120]
Train: [1-120] → Test: [121-140]
Train: [1-140] → Test: [141-160]
Use when: Time-dependent data (stock prices, user behavior over time)
Google Example: Search query trends prediction
Ensures samples from same group are in same fold
Google Example: Keep all ads from same advertiser in same fold
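A minimal Group K-Fold sketch with hypothetical advertiser IDs as the group labels (every advertiser's impressions stay together in one fold):
from sklearn.model_selection import GroupKFold
import numpy as np

X_ads = np.random.rand(12, 5)                    # 12 impressions, 5 features (illustrative)
y_ads = np.random.randint(0, 2, size=12)         # click / no-click
advertiser_ids = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])

gkf = GroupKFold(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X_ads, y_ads, groups=advertiser_ids), 1):
    print(f"Fold {fold}: test advertisers = {np.unique(advertiser_ids[test_idx])}")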
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
import numpy as np
# Google Ads CTR Prediction
X = ad_features # user, ad, context features
y = clicks # 0 or 1
model = RandomForestClassifier(n_estimators=100)
# K-Fold CV
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
print(f"Fold scores: {scores}")
# Output:
# Accuracy: 0.876 (+/- 0.006)
# Fold scores: [0.872 0.881 0.868 0.875 0.883]
from sklearn.model_selection import StratifiedKFold
# For imbalanced click data (90% no-click, 10% click)
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
model, X, y,
cv=stratified_kfold,
scoring='f1' # Better for imbalanced data
)
print(f"F1 Score: {scores.mean():.3f} (+/- {scores.std():.3f})")
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X), 1):
# Split data
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
# Train model
model.fit(X_train, y_train)
# Evaluate
score = model.score(X_test, y_test)
fold_scores.append(score)
print(f"Fold {fold}: Accuracy = {score:.3f}")
print(f"\nMean Accuracy: {np.mean(fold_scores):.3f}")
print(f"Std Dev: {np.std(fold_scores):.3f}")
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(f"Score: {score:.3f}")
| Situation | Recommended K | Reason |
|---|---|---|
| Small dataset (<1000) | K=10 or LOOCV | Maximize training data usage |
| Medium dataset (1K-100K) | K=5 or K=10 | Good balance of bias-variance |
| Large dataset (100K+) | K=3 or K=5 | Computational efficiency |
| Imbalanced classes | Stratified K-Fold | Maintain class distribution |
| Time series data | TimeSeriesSplit | Respect temporal order |
Definition: When a model learns the training data too well, including noise and random fluctuations, causing poor performance on new, unseen data.
Analogy: Like a student who memorizes answers instead of understanding concepts - they ace practice tests but fail real exams.
Use K-Fold Cross-Validation (K=5 or 10)
Split the data into K parts, train on K-1, test on the remaining 1
Rotate and average the results
Meta Example: Testing News Feed algorithm across different user segments and time periods
L1 (Lasso): Cost = Loss + λ Σ|weights|
L2 (Ridge): Cost = Loss + λ Σ(weights²)
Meta Example: Adding L2 penalty to prevent extremely large weights in ad click prediction models
Randomly drop 20-50% of neurons during training
Forces the network to learn robust features
Meta Example: Instagram's image classification model uses 0.5 dropout rate
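A minimal Keras sketch of dropout between dense layers (the architecture and sizes are illustrative, not Instagram's actual model):
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128,)),                  # e.g. a 128-dim image embedding
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),                         # randomly zero 50% of activations during training only
    layers.Dense(10, activation="softmax")
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])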
Monitor validation loss during training
Stop when validation loss stops improving
Save the best model checkpoint
Meta Example: WhatsApp spam detection stops training after 3 epochs with no validation improvement
Artificially increase training data by:
- Images: rotation, flipping, cropping, color jittering
- Text: synonym replacement, back-translation
Meta Example: Facebook photo tagging augments images with rotation, zoom, brightness changes
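A minimal torchvision sketch of these image augmentations (parameter values are illustrative):
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                   # rotation
    transforms.RandomHorizontalFlip(p=0.5),                  # flipping
    transforms.RandomResizedCrop(size=224),                  # cropping / zoom
    transforms.ColorJitter(brightness=0.2, contrast=0.2),    # color jittering
    transforms.ToTensor(),
])
# augmented_tensor = augment(pil_image)  # applied to each PIL image at load time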
Reduce model complexity: use fewer layers/parameters when data is limited
Meta Example: Messenger chatbot uses a simpler LSTM with 2 layers instead of 5
Combine multiple models (see the sketch below):
- Bagging: Random Forest
- Boosting: XGBoost, LightGBM
- Stacking: combine predictions from different models
Meta Example: Facebook ad ranking uses ensemble of 20+ models
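A minimal scikit-learn sketch of the three ensemble styles (illustrative models, not Facebook's actual ranking stack):
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

bagging = RandomForestClassifier(n_estimators=100)           # Bagging
boosting = GradientBoostingClassifier(n_estimators=100)      # Boosting (XGBoost/LightGBM are drop-in upgrades)
stacking = StackingClassifier(                               # Stacking: a meta-model combines base predictions
    estimators=[("rf", bagging), ("gb", boosting)],
    final_estimator=LogisticRegression()
)
# stacking.fit(X_train, y_train); stacking.predict(X_test)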
More diverse data helps model generalize better
Meta Example: Instagram Explore uses billions of user interactions to train recommendation models
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
# 1. Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# 2. Use Regularization
model = LogisticRegression(penalty='l2', C=0.1) # C is inverse of λ
# 3. Limit Model Complexity
rf = RandomForestClassifier(
max_depth=5, # Prevent deep trees
min_samples_split=20, # Require minimum samples
n_estimators=100
)
# 4. Cross-Validation
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV Accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
# 5. Early Stopping (for neural networks)
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(
monitor='val_loss',
patience=3,
restore_best_weights=True
)
model.fit(X_train, y_train,
validation_split=0.2,
callbacks=[early_stop],
epochs=100)
Definition: Training a model using labeled data where the correct output is known.
Real-World Examples:
Common Algorithms: Linear Regression, Logistic Regression, Decision Trees, Random Forest, SVM, Neural Networks
Definition: Finding hidden patterns in unlabeled data without predefined categories.
Real-World Examples:
Common Algorithms: K-Means Clustering, Hierarchical Clustering, DBSCAN, PCA, Autoencoders
Supervised: "Learn from examples with answers" (like studying with an answer key)
Unsupervised: "Find patterns on your own" (like grouping items without instructions)
Master these concepts and walk into your next interview with confidence!