🤖 GenAI from Scratch | NLP Foundations

📋 Table of Contents

The Problem Word2Vec Solved
What is Word2Vec? (Google, 2013)
Neural Network Basics — The Engine Behind Word2Vec
Manual Feature Engineering vs Learned Embeddings
The KING − MAN + WOMAN = QUEEN Analogy — Explained
CBOW vs Skip-gram — The Two Word2Vec Architectures
The Embedding Landscape: Classical → SOTA
FastText — The Improved Word2Vec
Python: Train Your Own Word2Vec Model
Python: Load a Pre-trained Word2Vec and Explore Analogies
Python: Cosine Similarity — Find Similar Words
Why This Matters for DBAs Building RAG Systems
Common Errors and Fixes
Key Takeaways
What’s Next

In the last post we saw that OHE, Bag of Words, and TF-IDF all fail at the same thing: they treat words as symbols with no relationship to each other. “like” and “love” are completely unrelated numbers. “Oracle” and “PostgreSQL” — same thing. No similarity, no context, no meaning.

In 2013, a Google research team led by Tomas Mikolov published a paper that changed everything: “Efficient Estimation of Word Representations in Vector Space.” This is the Word2Vec paper — and it’s one of the most important papers in the history of NLP. It proved that you could train a neural network to learn word meaning automatically, and the resulting vectors had remarkable mathematical properties.

This is Post 4 of the GenAI from Scratch series. We’ll go through the theory, the key ideas from the original research paper, and write Python code you can run in VS Code today.

What you’ll learn:

What Word2Vec is and why Google built it in 2013
How neural networks learn word embeddings (single perceptron → multilayer)
The famous KING − MAN + WOMAN = QUEEN analogy — the actual math behind it
CBOW vs Skip-gram — the two Word2Vec architectures from the original paper
The complete embedding landscape: Word2Vec → FastText → Transformers
Python code to train Word2Vec, explore analogies, and compute similarity
Why this matters directly for building RAG systems as a DBA

🔬 Lab Validated: All Python code tested in VS Code with Python 3.12, gensim 4.3+, and numpy 1.26+. Install with: uv add gensim numpy scikit-learn

Prerequisites

☑ Posts 1–3 completed — UV, VS Code, encoding concepts understood
☑ Install the required packages:

uv add gensim numpy scikit-learn

1. The Problem Word2Vec Solved

The original Word2Vec paper opens with this exact statement about the state of NLP before 2013:

“Many current NLP systems and techniques treat words as atomic units — there is no notion of similarity between words, as these are represented as indices in a vocabulary.”

— Mikolov et al., Google, 2013 (Word2Vec paper)

That description is exactly OHE, BoW, and TF-IDF. Words were indices. “dog” = index 4521. “puppy” = index 8833. No mathematical connection between them whatsoever.

The following diagram summarizes the problem:

OHE  → Presence (0 or 1)
BoW  → Count (frequency)
TF-IDF → Count + Importance
                ↓
All three:  Data → Vector (Numbers)
            But NO semantic meaning, NO context, NO relationships

The gap: Two synonyms "like" and "love" have zero mathematical relationship
TF-IDF was used by Google (2013–2014) before Neural Networks took over
BM25 (an improved TF-IDF) is still used in some RAG systems today

The need: Word → Number  AND  similar words → similar numbers
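The gap is easy to verify yourself. Here is a minimal sketch, assuming a made-up four-word vocabulary (the word list and the `cosine` helper are illustrative, not from the class notes), showing that one-hot vectors give every pair of different words a similarity of exactly zero:

```python
import numpy as np

# Hypothetical toy vocabulary, for illustration only
vocab = ["like", "love", "oracle", "postgresql"]

# One-hot encoding: each word is one row of the identity matrix
one_hot = np.eye(len(vocab))

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

like, love = one_hot[0], one_hot[1]
print(cosine(like, love))   # 0.0: the two synonyms look completely unrelated
print(cosine(like, like))   # 1.0: a word is only ever similar to itself
```

However many words you add, every pair of distinct one-hot vectors stays perpendicular, so no amount of data can bring "like" and "love" closer together.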

Word2Vec solved this by changing the fundamental approach: instead of engineering features manually, let a neural network learn the features from billions of words of text.

2. What is Word2Vec? (Google, 2013)

Word2Vec is a neural network model trained to convert words into dense vectors of numbers, where words used in similar contexts get similar vectors.

Key facts from the class notes and research paper:

Fact	Detail
Created by	Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean — Google Inc., Mountain View, CA
Published	2013 (arXiv:1301.3781)
What it does	Trains a neural network on massive text data; the hidden layer weights become the word embeddings
Output	Dense vector per word; 300 dimensions in Google’s published Google News vectors
Training data	Google News corpus: ~6 billion tokens, 1 million word vocabulary
Training time	Less than a day for a 1.6B-word dataset, as reported in the paper (on the hardware of the time)
Key breakthrough	Similar words get numerically close vectors. Vector math on words produces meaningful results.
Legacy	Foundation for all modern embeddings: FastText → BERT → GPT → all LLMs use this concept

🗄️ DBA Analogy — Word2Vec = Statistics Gathered by ANALYZE

When you run DBMS_STATS.GATHER_TABLE_STATS in Oracle, the database doesn’t just count rows — it learns the distribution of values, correlations between columns, and selectivity patterns. Word2Vec does the equivalent for words: it trains on massive text and learns the “statistics” of word co-occurrence, which it then encodes as vectors. The end result is a model that understands which words tend to appear in similar contexts — exactly how column statistics help the optimizer understand which values are similar.

3. Neural Network Basics — The Engine Behind Word2Vec

The class notes spend time on neural network fundamentals before explaining Word2Vec, because Word2Vec is a neural network. You need to understand the machine to understand the output.

From Single Perceptron to Multilayer Network

The notes trace the progression from a single neuron to the full network used in Word2Vec:

── SINGLE PERCEPTRON ──────────────────────────────────────────
Input (features) → Weights (w) → Σ (sum) → Activation → Output

Example: Home Buy Prediction
  F1: Size (1250 sqft)  ┐
  F2: Location (city)   ├── w₁, w₂, w₃ → Σ → Act(Wx + b) → Y/N
  F3: Bedrooms (3)      ┘
  Output: Buy? Yes or No (binary classification)

── NEURAL NETWORK ─────────────────────────────────────────────
Layer 1: Input layer
Layer 2: Hidden layer (learns representations)
Layer 3: Output layer

Multilayer Perceptron → Multiple hidden layers
Each hidden layer learns increasingly abstract features

── TRAINING LOOP ──────────────────────────────────────────────
1. Initialize weights randomly
2. Forward pass: Input → prediction
3. Calculate loss: (Prediction − Actual)²
4. Backward propagation (BP) → Optimizer → Gradient Descent
5. Adjust weights to reduce loss
6. Repeat (one full pass = 1 epoch)
7. After many epochs → weights converge to best possible values
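To make the training loop concrete, here is a minimal sketch of a single perceptron trained with plain gradient descent in numpy. The toy data, the learning rate, and the sigmoid activation are assumptions chosen for illustration; this is the generic loop from the diagram, not Word2Vec's actual training objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 2 features, label is 1 when x1 + x2 > 1
X = rng.random((200, 2))
y = (X.sum(axis=1) > 1.0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 1. Initialise weights randomly
w = rng.normal(scale=0.1, size=2)
b = 0.0
lr = 0.5          # learning rate (assumed value)
losses = []

for epoch in range(500):                  # 6. one full pass over the data = 1 epoch
    pred = sigmoid(X @ w + b)             # 2. forward pass: Act(Wx + b)
    loss = np.mean((pred - y) ** 2)       # 3. loss: (Prediction - Actual)^2, averaged
    losses.append(loss)
    # 4. backward pass: gradient of the loss with respect to w and b
    grad_z = 2 * (pred - y) * pred * (1 - pred) / len(y)
    grad_w = X.T @ grad_z
    grad_b = grad_z.sum()
    # 5. gradient descent step: move the weights against the gradient
    w -= lr * grad_w
    b -= lr * grad_b

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The printed loss should be lower after training than before it, which is all "learning" means at this level: the weights moved to values that make the predictions less wrong.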

🗄️ DBA Analogy — Training = Query Plan Optimisation Over Time

The first time Oracle runs a query with no statistics, it picks a bad plan. Then ANALYZE TABLE runs, gathers statistics, the optimizer adjusts, and the next execution is better. Neural network training is the same loop: random start → measure how wrong the prediction is → adjust weights → repeat until the loss is minimised. In Oracle, the optimizer converges to a better plan. In a neural network, the weights converge to the best learned representation.

Why the Hidden Layer Weights Become the Embedding

This is the key insight that makes Word2Vec work. When you train a neural network to predict a word from its context, the network must compress the meaning of a word into a compact internal representation inside the hidden layer. That internal representation — the weight matrix connecting the input to the hidden layer — is the word embedding.

Input (one-hot word vector)
         ↓
  [Hidden Layer Weights W]  ← THIS IS THE EMBEDDING MATRIX
         ↓
     Hidden Layer
     (300 neurons)
         ↓
   Output: Predict context words
         ↓
  Loss: was prediction correct?
         ↓
  Backpropagation updates W

After millions of iterations:
  W[word_index] = the embedding for that word
  Words used in similar contexts → similar rows in W
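The reason a row of W can be read off as "the embedding" is simple linear algebra: multiplying a one-hot vector by W just selects one row. A small sketch, assuming a four-word toy vocabulary and 3 dimensions instead of 300 (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = ["oracle", "postgres", "index", "query"]    # hypothetical toy vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}

vocab_size, embedding_dim = len(vocab), 3           # 3 dimensions instead of 300
W = rng.normal(size=(vocab_size, embedding_dim))    # input-to-hidden weight matrix

# One-hot input vector for the word "index"
one_hot = np.zeros(vocab_size)
one_hot[word_to_index["index"]] = 1.0

hidden = one_hot @ W                  # what the hidden layer computes
lookup = W[word_to_index["index"]]    # a plain row lookup

print(np.allclose(hidden, lookup))    # True
```

This is why libraries store embeddings as a lookup table: once training is done, the matrix multiplication is never needed again.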

4. Manual Feature Engineering vs Learned Embeddings

The notes show a brilliant illustration of the difference between manual feature assignment and learned embeddings. This is worth understanding deeply because it shows exactly what “learning” means in this context.

Manual Feature Assignment (Before Word2Vec)

Imagine you tried to manually assign features to words. You’d create a table like this.

Word	Gender (0–1)	Wealth (0–1)	Power (0–1)	Weight (0–1)	Speaks (0–1)
KING	1	1	0.99	0.7	1
QUEEN	1	0.8	0.7	0.8	1
MAN	1	0.5	0.4	0.7	1
WOMAN	1	0.4	0.3	0.8	1
MONKEY	1	0	0	0.1	0

So each word becomes a vector:

KING   → [1,  1,    0.99, 0.7, 1]
MAN    → [1,  0.5,  0.4,  0.7, 1]
WOMAN  → [1,  0.4,  0.3,  0.8, 1]

⚠️ Why manual features don’t scale: This approach works fine for 5 words with 5 features. But real language has 1,000,000+ words and infinitely many possible features (royalty, emotion, speed, colour, country, profession…). You cannot manually assign features to a million words. Word2Vec lets the neural network learn the features automatically from the data.

Learned Features (Word2Vec)

Word2Vec learned 300 features automatically from Google News. You never tell it “feature 47 = royalty” or “feature 128 = gender.” The neural network figures out whatever features help it predict word context best. The notes give a 5-dimension example to show the concept:

# 5-dimensional Word2Vec vectors (simplified example from class notes)
Sunny → [0, 1, 0.6, 0.3, 1]   ← 5 features, learned by model
TIGER → [0, 0, 0.9, 0,   0]

# Google's actual Word2Vec: 300 dimensions
# Each word → a 1×300 vector
KING → [0.21, -0.45, 0.83, 0.12, ..., -0.67]  # 300 numbers

The visualization from the notes plots words in 3D space using features like “strong”, “human”, and “hardworking”:

Visualization (3 features: strong, human, hardworking):
men   → [5, 6, 4]
women → [6, 6, 6]
child → [2, 6, 3]

Plotting these in 3D space:
- men and women are close together (both "human", similar strength)
- child is further (less strong, similar "human" score)
→ The vector distance captures real-world relationships
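You can check the "closer" and "further" claims without plotting anything. A minimal sketch using the three vectors from the notes and Euclidean distance (the `distance` helper is just for illustration):

```python
import numpy as np

# Vectors from the example above: [strong, human, hardworking]
vectors = {
    "men":   np.array([5, 6, 4]),
    "women": np.array([6, 6, 6]),
    "child": np.array([2, 6, 3]),
}

def distance(a, b):
    """Euclidean distance between two named vectors."""
    return float(np.linalg.norm(vectors[a] - vectors[b]))

for a, b in [("men", "women"), ("men", "child"), ("women", "child")]:
    print(f"{a:<6} {b:<6} {distance(a, b):.3f}")
```

The men/women pair comes out with the smallest distance, and both pairs involving child are larger, which matches the picture described in the notes.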

5. The KING − MAN + WOMAN = QUEEN Analogy — Explained

This is the most famous example in the history of NLP. When the Word2Vec authors demonstrated this in the 2013 paper, it shocked the research community. Let’s understand exactly what’s happening mathematically — because this is the foundation of every embedding search and RAG system you will build.

The Intuition

Question: What word is to WOMAN as KING is to MAN?
Answer:   QUEEN

Word2Vec proves this with vector arithmetic:
  vector("KING") − vector("MAN") + vector("WOMAN") ≈ vector("QUEEN")

The Actual Math (from class notes, Page 18)

Using the manual 5-feature example from the class notes:

           Gender  Wealth  Power  Weight  Speak
KING    →  [ 1,    1,     0.99,  0.7,    1 ]
MAN     →  [ 1,    0.5,   0.4,   0.7,    1 ]
WOMAN   →  [ 1,    0.4,   0.3,   0.8,    1 ]

KING − MAN + WOMAN =
  [ 1,    1,     0.99,  0.7,    1 ]
- [ 1,    0.5,   0.4,   0.7,    1 ]
+ [ 1,    0.4,   0.3,   0.8,    1 ]
= [ 1,    0.9,   0.89,  0.8,    1 ]

Expected QUEEN =
  [ 1,    0.8,   0.7,   0.8,    1 ]

Result is close to QUEEN! ✅

What’s happening conceptually:

KING − MAN = removes the “man” concept, keeping “royalty + power + wealth”
+ WOMAN = adds the “woman” concept back
Result ≈ “royalty + power + wealth + woman” = QUEEN
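Here is the same arithmetic as a short numpy sketch, using the five manual feature vectors from the table. Two details are worth noticing. First, the Gender column is 1 for every word in this toy table, so the shift actually comes from the wealth, power and weight columns. Second, the query words are excluded before picking the nearest neighbour, as gensim's most_similar does; in this table KING itself would otherwise score slightly higher than QUEEN. The `cosine` helper and the exclusion step are illustrative choices, not part of the class notes.

```python
import numpy as np

# Manual features from the table: [gender, wealth, power, weight, speaks]
words = {
    "KING":   np.array([1, 1.0, 0.99, 0.7, 1]),
    "QUEEN":  np.array([1, 0.8, 0.70, 0.8, 1]),
    "MAN":    np.array([1, 0.5, 0.40, 0.7, 1]),
    "WOMAN":  np.array([1, 0.4, 0.30, 0.8, 1]),
    "MONKEY": np.array([1, 0.0, 0.00, 0.1, 0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

target = words["KING"] - words["MAN"] + words["WOMAN"]

# Leave the three query words out of the candidate list
query_words = {"KING", "MAN", "WOMAN"}
candidates = {w: v for w, v in words.items() if w not in query_words}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))

print(np.round(target, 2))   # approximately [1, 0.9, 0.89, 0.8, 1]
print(best)                  # QUEEN
```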

💡 Why this matters for RAG systems you build as a DBA:

When a user asks your RAG system: “Show me documents about database performance issues” — the embedding model converts that query to a vector. Then the vector database finds documents whose vectors are closest to the query vector. This works because “performance issues”, “slow queries”, “execution plan problems”, and “index missing” all end up with nearby vectors in embedding space. That proximity is exactly the KING-QUEEN relationship at scale across your entire document collection.

From the Original Research Paper

The Word2Vec paper demonstrates even more complex relationships — all solved by the same vector arithmetic:

Relationship	Example 1	Example 2	Example 3
Capital cities	France → Paris	Italy → Rome	Japan → Tokyo
Currency	Angola → kwanza	Iran → rial	Germany → euro
Man → Woman	brother → sister	grandson → granddaughter	king → queen
Comparative	big → bigger	cold → colder	quick → quicker
Company → Product	Microsoft → Windows	Google → Android	Apple → iPhone
Country → Cuisine	Japan → sushi	Germany → bratwurst	France → tapas

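All of these use the same subtraction-addition vector arithmetic. The model was never told about capitals, currencies, or cuisines; it learned these relationships purely from reading text. The answers are not always right, though: France → tapas comes from the paper’s own results table, even though tapas is a Spanish dish.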

6. CBOW vs Skip-gram — The Two Word2Vec Architectures

The Word2Vec paper introduces two training architectures. Understanding the difference explains why embeddings work differently for different tasks.

CBOW — Continuous Bag of Words

Input:   Context words (surrounding window)
Output:  Predict the TARGET word in the middle

Example (window size = 2):
Sentence: "The DBA optimised the slow query"

Given:  ["The", "DBA", "the", "slow"]  (4 context words)
Predict: "optimised"  (the middle word)

Architecture: INPUT → SUM → PROJECTION → OUTPUT
              Context words averaged together → predict center word

Skip-gram

Input:   A SINGLE word (the target)
Output:  Predict the CONTEXT words around it

Example (window size = 2):
Sentence: "The DBA optimised the slow query"

Given:  "optimised"  (single center word)
Predict: ["The", "DBA", "the", "slow"]  (surrounding context)

Architecture: INPUT → PROJECTION → multiple OUTPUT nodes
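The difference between the two architectures is easiest to see in the training examples each one generates from the same sentence. The sketch below is plain Python with hypothetical helper names (`context_window`, `cbow_examples`, `skipgram_examples`); it only builds the input/output pairs and does not train anything.

```python
def context_window(tokens, center, window):
    """Words within `window` positions of tokens[center], excluding the center."""
    left = tokens[max(0, center - window):center]
    right = tokens[center + 1:center + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: (context words, target word), many inputs and one output."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """Skip-gram: (center word, context word), one pair per context word."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]

sentence = ["The", "DBA", "optimised", "the", "slow", "query"]

print(cbow_examples(sentence)[2])
# (['The', 'DBA', 'the', 'slow'], 'optimised')

print([pair for pair in skipgram_examples(sentence) if pair[0] == "optimised"])
# [('optimised', 'The'), ('optimised', 'DBA'), ('optimised', 'the'), ('optimised', 'slow')]
```

CBOW produces one example per word position, while Skip-gram produces one per (center, context) pair. That is why Skip-gram makes more predictions per word and trains more slowly.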

Aspect	CBOW	Skip-gram
Task	Context → predict center word	Center word → predict context
Speed	Faster to train	Slower (more predictions per word)
Best for	Frequent words, larger datasets	Rare words, smaller datasets
Analogy accuracy	Good syntactic accuracy	Better semantic accuracy (from paper: 55% vs 24% for CBOW)
Training time	Google News: ~1 day	Google News: ~3 days
Use in practice	gensim default (sg=0)	Often preferred for quality (sg=1)

🗄️ DBA Analogy — CBOW vs Skip-gram = Two Index Strategies

CBOW is like a composite index — it uses multiple columns together to identify a single row. Skip-gram is like a function-based index — it takes one value and projects what related values look like. Both serve different query patterns. In practice, Skip-gram produces richer semantic relationships, especially for rare words, just as function-based indexes excel for specific selective queries.

7. The Embedding Landscape: Classical → SOTA

The class notes (Page 14) show the full embedding family tree. Here it is as a structured reference — this is exactly what you need for interviews and production decisions:

Category	Model	Year	Type	Use today
Classical (Fundamental, Interview)	Word2Vec	2013	Word-level, static	✅ Learning & understanding
Classical (Fundamental, Interview)	GloVe	2014	Word-level, static	✅ Some legacy systems
Improved Word2Vec	FastText	2016	Subword-level, static	✅ Handling rare/new words
SOTA (State of the Art), Transformer-based	BERT (Google; distributed via Hugging Face)	2018	Contextual, bidirectional	✅ Classification, NER
	Sentence Transformers	2019+	Sentence-level, contextual	✅ RAG, semantic search
	OpenAI Embeddings	2022+	Sentence/doc-level, API	✅ Production RAG apps
	Gemini Embeddings	2023+	Multimodal, contextual	✅ Google ecosystem RAG

💡 Which to use when (practical decision guide):

Learning the concept: Word2Vec (gensim) — controllable, transparent, easy to debug
Custom model on your own data: Word2Vec or FastText trained on your domain text
Production RAG systems: Sentence Transformers (free, local) or OpenAI Embeddings API (managed)
Enterprise + data residency requirements: Sentence Transformers on-premise or AWS Bedrock / Azure AI embeddings
Interviews: Know Word2Vec theory + KING-QUEEN analogy cold — it’s asked constantly

8. FastText — The Improved Word2Vec

The class notes (Page 19) explicitly mention FastText as the improved version of Word2Vec. Here’s why it was invented and what problem it solved:

Word2Vec’s remaining weakness: It still had the OOV (Out Of Vocabulary) problem. If a word wasn’t in the training vocabulary, Word2Vec had no vector for it. A new product name, a misspelling, a technical acronym — all returned “unknown.”

FastText’s fix: Instead of learning one vector per word, FastText learns vectors for character n-grams (subword pieces). With the boundary markers < and > added, the 3-grams of “database” are “<da”, “dat”, “ata”, “tab”, “aba”, “bas”, “ase”, “se>”. FastText uses n-grams of several lengths (3 to 6 characters by default) and sums their vectors to get the word vector. A new word like “datastore” can be represented as a combination of known subwords even if the full word was never seen.
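To see where the subword pieces come from, here is a small sketch of character n-gram extraction. The `char_ngrams` helper is hypothetical and only shows 3-grams; it follows the FastText convention of wrapping the word in < and > boundary markers, which is why the list has two more items than the bare trigrams of “database”.

```python
def char_ngrams(word, n=3):
    """Character n-grams of `word`, with < and > marking the word boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("database"))
# ['<da', 'dat', 'ata', 'tab', 'aba', 'bas', 'ase', 'se>']

# Subwords that an unseen word shares with a known word
shared = set(char_ngrams("database")) & set(char_ngrams("datastore"))
print(sorted(shared))
# ['<da', 'ata', 'dat']
```

Those shared pieces are what give an unseen word like “datastore” a usable starting vector: it inherits whatever the model learned about “<da”, “dat” and “ata” from words it did see.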

Aspect	Word2Vec	FastText
Unit of learning	Whole words	Character n-grams (subwords)
OOV handling	❌ Unknown word = no vector	✅ Builds from subword components
Morphology	❌ “run” and “running” share no parameters; any link must be learned from context	✅ Shares subwords → related
Speed	Fast	Slightly slower
Creator	Google (Mikolov)	Facebook AI (Bojanowski et al.)
Best for	Standard vocabulary, clean text	Morphologically rich languages, technical text with abbreviations

9. Python: Train Your Own Word2Vec Model

Now let’s write code. We’ll start by training a small Word2Vec model from scratch on sample sentences — the same workflow used in the bootcamp practicals.

# word2vec_train.py
# Train a Word2Vec model from scratch using gensim
# Run: uv run word2vec_train.py

from gensim.models import Word2Vec
import numpy as np

# ── Step 1: Prepare training sentences ────────────────────────
# In production: these come from your documents, logs, runbooks
# For learning: we use DBA/DB-themed sentences
sentences = [
    # Database sentences
    ["oracle",  "database",  "index",   "query",    "performance"],
    ["postgres","database",  "index",   "query",    "performance"],
    ["oracle",  "dba",       "tuning",  "execution","plan"],
    ["postgres","dba",       "tuning",  "execution","plan"],
    ["slow",    "query",     "missing", "index",    "performance"],
    ["slow",    "query",     "high",    "cpu",      "usage"],
    ["database","server",    "memory",  "cpu",      "disk"],
    ["table",   "column",    "index",   "constraint","key"],
    ["primary", "key",       "foreign", "key",      "constraint"],
    ["backup",  "restore",   "recovery","archive",  "database"],
    # AI/ML sentences
    ["machine", "learning",  "model",   "training", "data"],
    ["deep",    "learning",  "neural",  "network",  "model"],
    ["word2vec","embedding", "vector",  "semantic", "meaning"],
    ["neural",  "network",   "weights", "training", "epoch"],
    ["embedding","vector",   "cosine",  "similarity","search"],
    ["rag",     "retrieval", "vector",  "database", "semantic"],
    ["llm",     "model",     "training","fine",     "tuning"],
    ["bert",    "transformer","embedding","context","language"],
]

# ── Step 2: Train the Word2Vec model ──────────────────────────
model = Word2Vec(
    sentences   = sentences,
    vector_size = 50,       # Embedding dimensions (50 for demo; Google used 300)
    window      = 3,        # How many words to look left/right for context
    min_count   = 1,        # Include words that appear at least once
    workers     = 4,        # Parallel threads for training
    sg          = 1,        # 0=CBOW (gensim's default), 1=Skip-gram (chosen here for quality)
    epochs      = 100       # Training iterations — more = better for small datasets
)

# ── Step 3: Inspect what was learned ──────────────────────────
print("=== Vocabulary ===")
print(f"Total words in vocabulary: {len(model.wv.key_to_index)}")
print(f"All words: {list(model.wv.key_to_index.keys())}")

# ── Step 4: Get a word vector ──────────────────────────────────
print("\n=== Word Vector for 'oracle' ===")
oracle_vector = model.wv['oracle']
print(f"Vector shape: {oracle_vector.shape}")  # (50,) — 50 dimensions
print(f"First 10 values: {oracle_vector[:10].round(4)}")

# ── Step 5: Find most similar words ───────────────────────────
print("\n=== Words most similar to 'oracle' ===")
similar_to_oracle = model.wv.most_similar('oracle', topn=5)
for word, score in similar_to_oracle:
    print(f"  {word:<15} similarity: {score:.4f}")

print("\n=== Words most similar to 'index' ===")
similar_to_index = model.wv.most_similar('index', topn=5)
for word, score in similar_to_index:
    print(f"  {word:<15} similarity: {score:.4f}")

# ── Step 6: Word arithmetic (simplified KING-QUEEN analogy) ───
print("\n=== Vector Arithmetic: oracle - database + model ===")
result = model.wv.most_similar(
    positive=['oracle', 'model'],  # add these
    negative=['database'],          # subtract this
    topn=3
)
print("Result:", result)

# ── Step 7: Direct cosine similarity between two words ────────
print("\n=== Cosine Similarity ===")
pairs = [
    ('oracle',   'postgres'),
    ('oracle',   'model'),
    ('index',    'query'),
    ('embedding','vector'),
]
for w1, w2 in pairs:
    score = model.wv.similarity(w1, w2)
    print(f"  similarity('{w1}', '{w2}') = {score:.4f}")

# ── Step 8: Save and load the model ───────────────────────────
model.save("my_word2vec.model")
print("\n✅ Model saved to my_word2vec.model")

# Load later:
# loaded_model = Word2Vec.load("my_word2vec.model")

Run it:

uv run word2vec_train.py

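Expected output (the vocabulary size is fixed, but vectors and similarity scores vary from run to run because weights start random and training is multi-threaded; the Step 6 result is not shown):

=== Vocabulary ===
Total words in vocabulary: 54
All words: ['database', ...]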

=== Word Vector for 'oracle' ===
Vector shape: (50,)
First 10 values: [ 0.0312  0.0187 -0.0423  0.0891 -0.0234  0.0567 -0.0123  0.0789 -0.0345  0.0234]

=== Words most similar to 'oracle' ===
  postgres        similarity: 0.9823
  database        similarity: 0.9156
  dba             similarity: 0.8934
  tuning          similarity: 0.8712
  execution       similarity: 0.8445

=== Words most similar to 'index' ===
  query           similarity: 0.9567
  performance     similarity: 0.9234
  missing         similarity: 0.8891
  slow            similarity: 0.8734
  column          similarity: 0.8523

=== Cosine Similarity ===
  similarity('oracle',    'postgres')   = 0.9823
  similarity('oracle',    'model')      = 0.3421
  similarity('index',     'query')      = 0.9567
  similarity('embedding', 'vector')     = 0.9712

✅ Model saved to my_word2vec.model

💡 Reading the output: Notice that “oracle” and “postgres” have similarity ~0.98 — the model correctly learned they are related databases, even though we never told it that. It learned this purely from the fact that they appear in similar sentence contexts. This is the power of distributed representations.

10. Python: Load a Pre-trained Word2Vec and Explore Analogies

Training on a small dataset gives limited results. The real power comes from using models pre-trained on billions of words. Gensim provides direct access to Google’s original Word2Vec vectors via its downloader.

# word2vec_pretrained.py
# Load Google's pre-trained Word2Vec and test the KING-QUEEN analogy
# Run: uv run word2vec_pretrained.py
# Note: First run downloads ~1.6GB — takes a few minutes

import gensim.downloader as api

print("Loading pre-trained Word2Vec (Google News, 300d)...")
print("Note: ~1.6GB download on first run...")
model = api.load("word2vec-google-news-300")
print(f"✅ Loaded! Vocabulary: {len(model.key_to_index):,} words, 300 dimensions\n")

# ── THE FAMOUS KING − MAN + WOMAN = QUEEN ─────────────────────
print("=== The KING − MAN + WOMAN = QUEEN Test ===")
result = model.most_similar(
    positive=['king', 'woman'],  # king + woman
    negative=['man'],             # minus man
    topn=5
)
print("king - man + woman = ?")
for word, score in result:
    print(f"  {word:<15} score: {score:.4f}")

# ── MORE ANALOGIES FROM THE RESEARCH PAPER ────────────────────
# The Google News vocabulary is case-sensitive, so proper nouns
# are queried in their capitalised form ('Paris', not 'paris').
print("\n=== Capital Cities: France → Paris, so Germany → ? ===")
result = model.most_similar(
    positive=['Germany', 'Paris'],
    negative=['France'],
    topn=3
)
for word, score in result:
    print(f"  {word:<15} score: {score:.4f}")

print("\n=== Company → Product: Microsoft → Windows, so Google → ? ===")
result = model.most_similar(
    positive=['Google', 'Windows'],
    negative=['Microsoft'],
    topn=3
)
for word, score in result:
    print(f"  {word:<15} score: {score:.4f}")

# ── DBA-RELEVANT SIMILARITY TESTS ─────────────────────────────
print("\n=== DBA Similarity Tests ===")
dba_pairs = [
    ('oracle',      'postgresql'),
    ('index',       'performance'),
    ('backup',      'restore'),
    ('database',    'schema'),
    ('query',       'sql'),
]
for w1, w2 in dba_pairs:
    try:
        score = model.similarity(w1, w2)
        print(f"  similarity('{w1}', '{w2}') = {score:.4f}")
    except KeyError as e:
        print(f"  skipped ('{w1}', '{w2}'): {e}")

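Expected output (from Google’s 300-dimensional model; apart from the well-known king/queen result, treat the scores as indicative, and expect returned tokens in the casing stored in the model’s vocabulary):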

=== The KING − MAN + WOMAN = QUEEN Test ===
king - man + woman = ?
  queen           score: 0.7118    ← ✅ QUEEN is #1!
  monarch         score: 0.6190
  princess        score: 0.5902
  crown_prince    score: 0.5499
  prince          score: 0.5377

=== Capital Cities: France → Paris, so Germany → ? ===
  berlin          score: 0.8045    ← ✅ Berlin is #1!
  munich          score: 0.6821
  frankfurt       score: 0.6234

=== Company → Product: Microsoft → Windows, so Google → ? ===
  android         score: 0.6892    ← ✅ Android is #1!
  chrome          score: 0.6234
  gmail           score: 0.5891

=== DBA Similarity Tests ===
  similarity('oracle',      'postgresql') = 0.5823
  similarity('index',       'performance') = 0.4234
  similarity('backup',      'restore')     = 0.6891
  similarity('database',    'schema')      = 0.6123
  similarity('query',       'sql')         = 0.7234

11. Python: Cosine Similarity — How Vector Distance Is Measured

Most similarity operations in Word2Vec and Sentence Transformers use cosine similarity, and vector databases offer it alongside Euclidean distance and inner product. Understanding it is essential for building RAG systems.

The Concept

Cosine Similarity = cos(θ) between two vectors

Range: -1 to +1
  +1.0  = identical direction  → same meaning
  0.0   = perpendicular        → unrelated
  -1.0  = opposite direction   → opposite meaning

Formula:
  cosine_similarity(A, B) = (A · B) / (|A| × |B|)
  where A · B is the dot product, |A| is the vector magnitude

🗄️ DBA Analogy — Cosine Similarity = Selectivity Estimate

When Oracle’s optimizer calculates how many rows will match a query condition, it uses selectivity — how “close” the query value is to the indexed values. Cosine similarity is the equivalent for embeddings: it measures how “close” a query vector is to a document vector. A score near 1.0 means “this document closely matches your query” — just like a highly selective index condition returns few, highly relevant rows.

# cosine_similarity_demo.py
# Understand cosine similarity — the math behind all vector search
# Run: uv run cosine_similarity_demo.py

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# ── Manual cosine similarity from scratch ─────────────────────
def cosine_sim_manual(vec_a, vec_b):
    """Cosine similarity from first principles"""
    dot_product  = np.dot(vec_a, vec_b)
    magnitude_a  = np.linalg.norm(vec_a)
    magnitude_b  = np.linalg.norm(vec_b)
    return dot_product / (magnitude_a * magnitude_b)

# ── Example vectors (simplified 3D for visualization) ─────────
# These are simplified — real embeddings have 300-3072 dimensions
oracle_vec     = np.array([5, 3, 1])   # database, performance, ai
postgres_vec   = np.array([5, 3, 1])   # very similar to oracle
mongodb_vec    = np.array([4, 2, 2])   # database, less performance, slightly AI
numpy_vec      = np.array([1, 1, 5])   # less database, more AI/data science
unrelated_vec  = np.array([0, 0, 1])   # completely different domain

print("=== Manual Cosine Similarity ===")
pairs = [
    ("oracle",  "postgres",  oracle_vec,    postgres_vec),
    ("oracle",  "mongodb",   oracle_vec,    mongodb_vec),
    ("oracle",  "numpy",     oracle_vec,    numpy_vec),
    ("oracle",  "unrelated", oracle_vec,    unrelated_vec),
]
for name_a, name_b, vec_a, vec_b in pairs:
    score = cosine_sim_manual(vec_a, vec_b)
    print(f"  cosine_sim('{name_a}', '{name_b}') = {score:.4f}")

# ── Semantic search simulation ────────────────────────────────
print("\n=== Semantic Search Simulation ===")
print("User query: 'slow query performance issue'")
print()

# Pretend these are real embedding vectors (simplified to 5D for demo)
documents = {
    "Oracle query optimization guide":         np.array([0.9, 0.8, 0.7, 0.1, 0.2]),
    "Database index missing detection":        np.array([0.8, 0.9, 0.6, 0.1, 0.1]),
    "Execution plan analysis tutorial":        np.array([0.7, 0.8, 0.8, 0.2, 0.1]),
    "Python machine learning introduction":    np.array([0.1, 0.1, 0.2, 0.9, 0.8]),
    "LangChain RAG pipeline tutorial":         np.array([0.2, 0.1, 0.1, 0.8, 0.9]),
}

# Query embedding (similar to performance + query + slow documents)
query_vector = np.array([0.85, 0.85, 0.75, 0.15, 0.15])

print(f"{'Document':<45} {'Similarity':>12}")
print("-" * 60)
results = []
for doc_name, doc_vec in documents.items():
    score = cosine_sim_manual(query_vector, doc_vec)
    results.append((doc_name, score))

# Sort by similarity descending
results.sort(key=lambda x: x[1], reverse=True)
for doc_name, score in results:
    bar = "█" * int(score * 20)
    print(f"  {doc_name:<43} {score:.4f}  {bar}")

print()
print(f"Top result: '{results[0][0]}'")
print("This is exactly how pgvector / ChromaDB / Pinecone searches work.")

Run it:

uv run cosine_similarity_demo.py

Expected output:

=== Manual Cosine Similarity ===
  cosine_sim('oracle',   'postgres')  = 1.0000  ← identical direction
  cosine_sim('oracle',   'mongodb')   = 0.9661  ← very similar
  cosine_sim('oracle',   'numpy')     = 0.4229  ← weakly related
  cosine_sim('oracle',   'unrelated') = 0.1690  ← barely related

=== Semantic Search Simulation ===
User query: 'slow query performance issue'

Document                                       Similarity
------------------------------------------------------------
  Oracle query optimization guide             0.9970  ███████████████████
  Execution plan analysis tutorial            0.9934  ███████████████████
  Database index missing detection            0.9933  ███████████████████
  LangChain RAG pipeline tutorial             0.3323  ██████
  Python machine learning introduction        0.3266  ██████

Top result: 'Oracle query optimization guide'
This is exactly how pgvector / ChromaDB / Pinecone searches work.

12. Why This Matters for DBAs Building RAG Systems

Everything in this post connects directly to something you’ll build in the RAG modules of this series. Here’s the complete picture:

What you learned today	Where it shows up in RAG / GenAI production
Words → dense vectors (embeddings)	Every document in your RAG knowledge base is stored as an embedding vector in a vector DB (pgvector, ChromaDB, Pinecone)
Similar words → similar vectors	A user query about “slow query” retrieves documents about “execution plan” and “missing index” — because their vectors are similar
Cosine similarity	The math behind `SELECT * FROM embeddings ORDER BY embedding <=> query_vector LIMIT 5` in pgvector (`<=>` is cosine distance; `<->` is Euclidean distance)
Pre-trained models	You use OpenAI `text-embedding-3-small` or Sentence Transformers — same concept as Word2Vec, much more powerful
CBOW / Skip-gram	Architecture understanding helps you choose the right embedding model for your domain
Training your own model	For domain-specific text (Oracle error logs, custom SQL dialects), a fine-tuned embedding model can outperform generic ones

The full RAG data flow, from a DBA perspective:

Your documents (runbooks, alert logs, SQL files)
         ↓
  Embedding model (Word2Vec concept → production: OpenAI / SentenceTransformer)
         ↓
  Dense vectors  → stored in Vector DB (pgvector in PostgreSQL, or ChromaDB)
         ↓
User query: "Why is this query slow?"
         ↓
  Embedding model converts query → query vector
         ↓
  Cosine similarity search: find top-K most similar document vectors
         ↓
  Retrieved documents → passed to LLM as context
         ↓
  LLM generates answer grounded in YOUR data

13. Common Errors and Fixes

Error 1: KeyError — word not in vocabulary

KeyError: "word 'dbms_stats' not in vocabulary"

Cause: The word wasn’t in the training data, so no vector was learned for it. (The exact wording of the message depends on the gensim version.)

Fix:

# Check before accessing
if 'dbms_stats' in model.wv:
    vector = model.wv['dbms_stats']
else:
    print("Word not in vocabulary")

# OR use FastText — it handles OOV through subwords
from gensim.models import FastText
ft_model = FastText(sentences=sentences, vector_size=50, window=3, min_count=1)
vector = ft_model.wv['dbms_stats']  # Works even if not in training data
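To see why FastText can produce a vector for a word it never saw, it helps to look at the character n-grams themselves. The sketch below is a simplified illustration, not gensim's implementation: it wraps a word in `<` and `>` boundary markers and extracts n-grams of length 3 to 6 (FastText's default range), then shows which pieces an unseen word shares with a seen one. The two example words are assumptions chosen for the demo.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.add(wrapped[start:start + n])
    return grams

# Assumption for the demo: 'dbms_output' was in the training data,
# 'dbms_stats' was not.
seen_word = "dbms_output"
unseen_word = "dbms_stats"

shared = char_ngrams(seen_word) & char_ngrams(unseen_word)
print(sorted(shared))
```

FastText builds the vector for an unseen word from the vectors of its n-grams, so the pieces shared with `dbms_output` pull `dbms_stats` toward the same neighbourhood. The real library also hashes n-grams into a fixed number of buckets, which this sketch leaves out.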

Error 2: Poor similarity results from small training data

Symptom: Similarity scores are unexpected — words you know are related score low.

Cause: Word2Vec needs a large corpus to learn meaningful relationships. With only 18 training sentences (like our demo), results will be weak.

Fix:

# Option 1: Use a pre-trained model (best for production)
import gensim.downloader as api
model = api.load("word2vec-google-news-300")

# Option 2: Train on more domain data: your actual docs
# For a DBA: feed all your runbooks, alert logs, SQL files
# More data = better representations

# Option 3: Increase epochs for small datasets
# min_count=1 keeps rare words; the default (5) would drop most of a tiny corpus
from gensim.models import Word2Vec
model = Word2Vec(sentences=sentences, vector_size=100, min_count=1, epochs=500)

Error 3: MemoryError loading word2vec-google-news-300

Cause: The full Google News Word2Vec model is ~1.6GB and requires ~4GB RAM to load.
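The RAM figure follows from simple arithmetic. The Google News model holds 3 million words and phrases at 300 dimensions each; the sketch below assumes each number is stored as a 4-byte float32 and counts only the vector matrix, not the vocabulary dictionary or Python overhead.

```python
vocab_size = 3_000_000    # words and phrases in the Google News model
dimensions = 300          # vector size
bytes_per_float32 = 4     # assumption: vectors held as float32

raw_bytes = vocab_size * dimensions * bytes_per_float32
raw_gb = raw_bytes / 1e9
raw_gib = raw_bytes / 2**30

print(f"{raw_gb:.1f} GB ({raw_gib:.2f} GiB) for the vector matrix alone")
# → 3.6 GB (3.35 GiB) for the vector matrix alone
```

Add the vocabulary lookup and temporary buffers during loading and you land near the ~4GB mark, which is why machines with 4GB of RAM or less tend to fail here.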

Fix:

# Use a lighter alternative model (GloVe vectors, same KeyedVectors interface)
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-100")  # Only 128MB, 100 dimensions

# Or use sentence-transformers for production (smaller download, sentence-level embeddings)
# uv add sentence-transformers
from sentence_transformers import SentenceTransformer
st_model = SentenceTransformer('all-MiniLM-L6-v2')  # Only ~80MB
# Different API from gensim: call st_model.encode("some text") to get a vector

14. Key Takeaways

✅ What you learned in this post:

Word2Vec (Google, 2013) was the breakthrough that proved word meaning could be encoded mathematically. It trains a neural network; the hidden layer weights become the word embeddings.
A neural network learns by: random weights → forward pass → measure loss → backpropagation → adjust weights → repeat until the loss stops improving (1 epoch = 1 full pass over the training data).
Manual feature engineering (assigning numbers to word properties) doesn’t scale. Word2Vec learns features automatically from massive text data.
The KING − MAN + WOMAN = QUEEN analogy works because embeddings encode semantic relationships as directions in vector space. Subtracting “man-ness” and adding “woman-ness” shifts the KING vector toward QUEEN.
CBOW predicts the center word from context — faster, better for frequent words. Skip-gram predicts context from the center word — slower, better semantic accuracy, preferred for quality.
The embedding landscape: Word2Vec (classical, interviews) → FastText (handles OOV via subwords) → Sentence Transformers / OpenAI Embeddings (production RAG).
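Cosine similarity is the math behind most embedding-based vector search. Range −1 to +1. In pgvector it appears as cosine distance (1 − cosine similarity) via the <=> operator, as in ORDER BY embedding <=> query_vector; the <-> operator is Euclidean (L2) distance.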
For building RAG systems: documents → embeddings → vector DB → cosine similarity search → retrieved context → LLM. Word2Vec is the conceptual foundation of the embedding and similarity-search steps.

15. What’s Next

You now understand the theory and have trained your own Word2Vec model. Post 5 takes it to production level:

Post 5 — Sentence Transformers and OpenAI Embeddings API — Production-Grade Embeddings
Why Word2Vec isn’t enough for production · Sentence-level embeddings · HuggingFace Sentence Transformers with Python · OpenAI text-embedding-3-small API · Build a real semantic search system for your DBA runbooks · Compare keyword search vs semantic search head-to-head

#	Post	Status
1	What is GenAI? + UV Setup	✅ Published
2	AI Roadmap + 30 Tools + GitHub Copilot Setup	✅ Published
3	OHE, Bag of Words and TF-IDF with Python	✅ Published
4	Word2Vec and Embeddings — this post	📍 You are here
5	Sentence Transformers + OpenAI Embeddings API	⬜ Next Friday
6	Prompt Engineering — Zero to Advanced (DBA Edition)	⬜ Coming soon

👉 Next Post: Sentence Transformers and OpenAI Embeddings API — Build a Semantic Search System for Your DBA Runbooks

References

Mikolov, T., Chen, K., Corrado, G., Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. Google Inc. arXiv:1301.3781
Gensim Word2Vec Documentation
Original Word2Vec Code (Google)

Part of the GenAI from Scratch series for DBAs and Infrastructure Engineers. Published every Friday at gradeupnow.in/genai-blog/
