🤖 GenAI from Scratch | NLP Foundations

📋 Table of Contents

  1. The Problem Word2Vec Solved
  2. What is Word2Vec? (Google, 2013)
  3. Neural Network Basics - The Engine Behind Word2Vec
  4. Manual Feature Engineering vs Learned Embeddings
  5. The KING − MAN + WOMAN = QUEEN Analogy - Explained
  6. CBOW vs Skip-gram - The Two Word2Vec Architectures
  7. The Embedding Landscape: Classical → SOTA
  8. FastText - The Improved Word2Vec
  9. Python: Train Your Own Word2Vec Model
  10. Python: Load a Pre-trained Word2Vec and Explore Analogies
  11. Python: Cosine Similarity - How Vector Distance Is Measured
  12. Why This Matters for DBAs Building RAG Systems
  13. Common Errors and Fixes
  14. Key Takeaways
  15. What’s Next

In the last post we saw that OHE, Bag of Words, and TF-IDF all fail at the same thing: they treat words as symbols with no relationship to each other. “like” and “love” are completely unrelated numbers. “Oracle” and “PostgreSQL”: same thing. No similarity, no context, no meaning.

In 2013, a Google research team led by Tomas Mikolov published “Efficient Estimation of Word Representations in Vector Space.” This is the Word2Vec paper, and it’s one of the most influential papers in the history of NLP. It showed that you could train a neural network to learn word meaning automatically, and that the resulting vectors had remarkable mathematical properties.

This is Post 4 of the GenAI from Scratch series. We’ll go through the theory, the key ideas from the original research paper, and write Python code you can run in VS Code today.

What you’ll learn:

  • What Word2Vec is and why Google built it in 2013
  • How neural networks learn word embeddings (single perceptron → multilayer)
  • The famous KING − MAN + WOMAN = QUEEN analogy and the actual math behind it
  • CBOW vs Skip-gram: the two Word2Vec architectures from the original paper
  • The complete embedding landscape: Word2Vec → FastText → Transformers
  • Python code to train Word2Vec, explore analogies, and compute similarity
  • Why this matters directly for building RAG systems as a DBA

🔬 Lab Validated: All Python code tested in VS Code with Python 3.12, gensim 4.3+, and numpy 1.26+. Install with: uv add gensim numpy scikit-learn

Prerequisites

  • ☑ Posts 1–3 completed: UV, VS Code, encoding concepts understood
  • ☑ Install the required packages:
uv add gensim numpy scikit-learn

1. The Problem Word2Vec Solved

The original Word2Vec paper opens with this exact statement about the state of NLP before 2013:

“Many current NLP systems and techniques treat words as atomic units - there is no notion of similarity between words, as these are represented as indices in a vocabulary.”

- Mikolov et al., Google, 2013 (Word2Vec paper)

That description is exactly OHE, BoW, and TF-IDF. Words were indices. “dog” = index 4521. “puppy” = index 8833. No mathematical connection between them whatsoever.

The following diagram summarises the problem:

OHE  → Presence (0 or 1)
BoW  → Count (frequency)
TF-IDF → Count + Importance
                ↓
All three:  Data → Vector (Numbers)
            But NO semantic meaning, NO context, NO relationships

The gap: Two synonyms "like" and "love" have zero mathematical relationship
TF-IDF was used by Google (2013–2014) before Neural Networks took over
BM25 (improved TF-IDF) is still used in some RAG systems today

The need: Word → Number  AND  similar words → similar numbers

Word2Vec solved this by changing the fundamental approach: instead of engineering features manually, let a neural network learn the features from billions of words of text.
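
To make the gap concrete, here is a minimal sketch of what “no notion of similarity” means for one-hot vectors. The three-word vocabulary is an assumption made up for illustration; any vocabulary gives the same result.

```python
import numpy as np

# Toy vocabulary; the words and their positions are illustrative assumptions.
vocab = ["dog", "puppy", "database"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector for vocab[i]

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dog, puppy, database = one_hot
print(cosine(dog, puppy))     # 0.0: closely related words look unrelated
print(cosine(dog, database))  # 0.0: and so does every other pair
```

Every pair of distinct one-hot vectors is perpendicular, so the encoding cannot express that “dog” is closer to “puppy” than to “database”.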

2. What is Word2Vec? (Google, 2013)

Word2Vec is a neural network model trained to convert words into dense vectors of numbers, where words used in similar contexts get similar vectors.

Key facts from the class notes and research paper:

Fact | Detail
Created by | Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (Google Inc., Mountain View, CA)
Published | 2013 (arXiv:1301.3781)
What it does | Trains a neural network on massive text data; the hidden layer weights become the word embeddings
Output | Dense vector per word, typically 300 dimensions (Google’s default)
Training data | Google News corpus: ~6 billion tokens, 1 million word vocabulary
Training time | Less than a day on modern hardware for a 1.6B word dataset
Key breakthrough | Similar words get numerically close vectors. Vector math on words produces meaningful results.
Legacy | Foundation for all modern embeddings: FastText → BERT → GPT → all LLMs use this concept

🗄️ DBA Analogy: Word2Vec = Statistics Gathered by ANALYZE

When you run DBMS_STATS.GATHER_TABLE_STATS in Oracle, the database doesn’t just count rows; it learns the distribution of values, correlations between columns, and selectivity patterns. Word2Vec does the equivalent for words: it trains on massive text and learns the “statistics” of word co-occurrence, which it then encodes as vectors. The end result is a model that captures which words tend to appear in similar contexts, much as column statistics help the optimizer understand how values are distributed.

3. Neural Network Basics - The Engine Behind Word2Vec

The class notes spend time on neural network fundamentals before explaining Word2Vec, because Word2Vec is a neural network. You need to understand the machine to understand the output.

From Single Perceptron to Multilayer Network

The notes trace the progression from a single neuron to the full network used in Word2Vec:

── SINGLE PERCEPTRON ──────────────────────────────────────────
Input (features) → Weights (w) → Σ (sum) → Activation → Output

Example: Home Buy Prediction
  F1: Size (1250 sqft)  ┐
  F2: Location (city)   ├── w₁, w₂, w₃ → Σ → Act(Wx + b) → Y/N
  F3: Bedrooms (3)      ┘
  Output: Buy? Yes or No (binary classification)

── NEURAL NETWORK ─────────────────────────────────────────────
Layer 1: Input layer
Layer 2: Hidden layer (learns representations)
Layer 3: Output layer

Multilayer Perceptron → Multiple hidden layers
Each hidden layer learns increasingly abstract features

── TRAINING LOOP ──────────────────────────────────────────────
1. Initialize weights randomly
2. Forward pass: Input → prediction
3. Calculate loss: (Prediction − Actual)²
4. Backward propagation (BP) → Optimizer → Gradient Descent
5. Adjust weights to reduce loss
6. Repeat (one full pass = 1 epoch)
7. After many epochs → weights converge to best possible values
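
The training loop above can be sketched in a few lines of numpy. This is a minimal illustration for a single perceptron, not the Word2Vec training code: the toy data, the sigmoid activation, the learning rate, and the epoch count are all assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy, made-up data: 4 samples x 2 features, label 1 only when both features are 1.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 0.0, 1.0])

w = rng.normal(scale=0.1, size=2)    # 1. initialise weights randomly
b = 0.0
lr = 0.5                             # learning rate (assumed value)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for epoch in range(2000):            # 6. repeat; one pass over X = 1 epoch
    pred = sigmoid(X @ w + b)        # 2. forward pass
    loss = np.mean((pred - y) ** 2)  # 3. squared-error loss
    losses.append(loss)
    # 4. backward pass: gradient of the loss w.r.t. the pre-activation (chain rule)
    grad_z = 2 * (pred - y) * pred * (1 - pred) / len(y)
    w -= lr * (X.T @ grad_z)         # 5. gradient-descent update of the weights
    b -= lr * grad_z.sum()           #    and of the bias

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The loss printed at the end should be clearly lower than the one at the start, which is all “learning” means here: the weights moved in the direction that reduces the error.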

🗄️ DBA Analogy: Training = Query Plan Optimisation Over Time

The first time Oracle runs a query with no statistics, it picks a bad plan. Then ANALYZE TABLE runs, gathers statistics, the optimizer adjusts, and the next execution is better. Neural network training is the same loop: random start → measure how wrong the prediction is → adjust weights → repeat until the loss is minimised. In Oracle, the optimizer converges to a better plan. In a neural network, the weights converge to the best learned representation.

Why the Hidden Layer Weights Become the Embedding

This is the key insight that makes Word2Vec work. When you train a neural network to predict a word from its context, the network must compress the meaning of a word into a compact internal representation inside the hidden layer. That internal representation, the weight matrix connecting the input to the hidden layer, holds the word embeddings: one row per word.

Input (one-hot word vector)
         ↓
  [Hidden Layer Weights W]  ← THIS IS THE EMBEDDING MATRIX
         ↓
     Hidden Layer
     (300 neurons)
         ↓
   Output: Predict context words
         ↓
  Loss: was prediction correct?
         ↓
  Backpropagation updates W

After millions of iterations:
  W[word_index] = the embedding for that word
  Words used in similar contexts → similar rows in W
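
Why is a row of W “the embedding”? Multiplying a one-hot vector by W simply selects one row, as this small sketch shows. The vocabulary, the 3-dimensional size, and the random matrix standing in for trained weights are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = ["king", "queen", "oracle", "postgres"]   # toy vocabulary (assumption)
embedding_dim = 3                                 # Google used 300; 3 keeps it readable

# W stands in for the trained input-to-hidden weight matrix: one row per word.
W = rng.normal(size=(len(vocab), embedding_dim))

word_index = vocab.index("oracle")
one_hot = np.zeros(len(vocab))
one_hot[word_index] = 1.0

hidden = one_hot @ W      # what the hidden layer computes for this input
lookup = W[word_index]    # what a plain row lookup returns

print(np.allclose(hidden, lookup))  # True: the multiplication just selects a row
```

This is why libraries store embeddings as a lookup table instead of performing the matrix multiplication each time.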

4. Manual Feature Engineering vs Learned Embeddings

The notes show a brilliant illustration of the difference between manual feature assignment and learned embeddings. This is worth understanding deeply because it shows exactly what “learning” means in this context.

Manual Feature Assignment (Before Word2Vec)

Imagine you tried to manually assign features to words. You’d create a table like this.

Word | Gender (0–1) | Wealth (0–1) | Power (0–1) | Weight (0–1) | Speaks (0–1)
KING | 1 | 1 | 0.99 | 0.7 | 1
QUEEN | 1 | 0.8 | 0.7 | 0.8 | 1
MAN | 1 | 0.5 | 0.4 | 0.7 | 1
WOMAN | 1 | 0.4 | 0.3 | 0.8 | 1
MONKEY | 1 | 0 | 0 | 0.1 | 0

So each word becomes a vector:

KING   → [1,  1,    0.99, 0.7, 1]
MAN    → [1,  0.5,  0.4,  0.7, 1]
WOMAN  → [1,  0.4,  0.3,  0.8, 1]

⚠️ Why manual features don’t scale: This approach works fine for 5 words with 5 features. But real language has 1,000,000+ words and an open-ended set of possible features (royalty, emotion, speed, colour, country, profession…). You cannot manually assign features to a million words. Word2Vec lets the neural network learn the features automatically from the data.

Learned Features (Word2Vec)

Word2Vec learned 300 features automatically from Google News. You never tell it “feature 47 = royalty” or “feature 128 = gender.” The neural network figures out whatever features help it predict word context best. The notes give a 5-dimension example to show the concept:

# 5-dimensional Word2Vec vectors (simplified example from class notes)
Sunny → [0, 1, 0.6, 0.3, 1]   ← 5 features, learned by model
TIGER → [0, 0, 0.9, 0,   0]

# Google's actual Word2Vec: 300 dimensions
# Each word → a 1×300 vector
KING → [0.21, -0.45, 0.83, 0.12, ..., -0.67]  # 300 numbers

The visualization from the notes plots words in 3D space using features like “strong”, “human”, and “hardworking”:

Visualization (3 features: strong, human, hardworking):
men   → [5, 6, 4]
women → [6, 6, 6]
child → [2, 6, 3]

Plotting these in 3D space:
- men and women are close together (both "human", similar strength)
- child is further away (less strong, same "human" score)
→ The vector distance captures real-world relationships
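
The “close together” and “further” claims can be checked by measuring the distances directly. A small sketch using the three vectors above; Euclidean distance is used here because the example talks about positions in 3D space.

```python
import numpy as np

# Vectors copied from the example above (features: strong, human, hardworking).
men   = np.array([5, 6, 4])
women = np.array([6, 6, 6])
child = np.array([2, 6, 3])

def euclidean(a, b):
    """Straight-line distance between two points."""
    return float(np.linalg.norm(a - b))

print(f"men-women:   {euclidean(men, women):.3f}")    # 2.236
print(f"men-child:   {euclidean(men, child):.3f}")    # 3.162
print(f"women-child: {euclidean(women, child):.3f}")  # 5.000
```

The smallest distance is between men and women, which matches the intuition in the plot.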

5. The KING − MAN + WOMAN = QUEEN Analogy - Explained

This is probably the most famous example in NLP. When the Word2Vec authors reported it in 2013, it surprised the research community. Let’s understand exactly what’s happening mathematically, because this is the foundation of every embedding search and RAG system you will build.

The Intuition

Question: What word is to WOMAN as KING is to MAN?
Answer:   QUEEN

Word2Vec proves this with vector arithmetic:
  vector("KING") − vector("MAN") + vector("WOMAN") ≈ vector("QUEEN")

The Actual Math (from class notes, Page 18)

Using the manual 5-feature example from the class notes:

           Gender  Wealth  Power  Weight  Speak
KING    โ†’  [ 1,    1,     0.99,  0.7,    1 ]
MAN     โ†’  [ 1,    0.5,   0.4,   0.7,    1 ]
WOMAN   โ†’  [ 1,    0.4,   0.3,   0.8,    1 ]

KING − MAN + WOMAN =
  [ 1,    1,     0.99,  0.7,    1 ]
- [ 1,    0.5,   0.4,   0.7,    1 ]
+ [ 1,    0.4,   0.3,   0.8,    1 ]
= [ 1,    0.9,   0.89,  0.8,    1 ]

Expected QUEEN =
  [ 1,    0.8,   0.7,   0.8,    1 ]

Result is close to QUEEN! ✅

What’s happening conceptually:

  • KING − MAN = removes the “man” concept, keeping “royalty + power + wealth”
  • + WOMAN = adds the “woman” concept back
  • Result ≈ “royalty + power + wealth + woman” = QUEEN
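
The same arithmetic can be run in numpy and followed by a nearest-neighbour lookup. This is a sketch using the hand-assigned vectors from the manual feature table, not learned embeddings; the `cosine` helper and the ranking step are illustrative assumptions about how one might do the lookup.

```python
import numpy as np

# Hand-assigned vectors from the manual feature table
# (gender, wealth, power, weight, speaks).
vectors = {
    "KING":   np.array([1, 1.0, 0.99, 0.7, 1]),
    "QUEEN":  np.array([1, 0.8, 0.70, 0.8, 1]),
    "MAN":    np.array([1, 0.5, 0.40, 0.7, 1]),
    "WOMAN":  np.array([1, 0.4, 0.30, 0.8, 1]),
    "MONKEY": np.array([1, 0.0, 0.00, 0.1, 0]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

target = vectors["KING"] - vectors["MAN"] + vectors["WOMAN"]

# Skip the three input words when ranking (gensim's most_similar does the same);
# with these numbers KING itself would otherwise rank first.
inputs = {"KING", "MAN", "WOMAN"}
ranked = sorted(
    ((word, cosine(target, vec)) for word, vec in vectors.items() if word not in inputs),
    key=lambda pair: pair[1],
    reverse=True,
)
print(np.round(target, 2))  # approximately [1, 0.9, 0.89, 0.8, 1]
print(ranked[0][0])         # QUEEN
```

Excluding the input words is a design choice worth remembering: the result of the arithmetic usually stays very close to one of the words you started from.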

💡 Why this matters for RAG systems you build as a DBA:

When a user asks your RAG system: “Show me documents about database performance issues”, the embedding model converts that query to a vector. Then the vector database finds documents whose vectors are closest to the query vector. This works because “performance issues”, “slow queries”, “execution plan problems”, and “missing index” all end up with nearby vectors in embedding space. That proximity is exactly the KING-QUEEN relationship at scale across your entire document collection.

From the Original Research Paper

The Word2Vec paper demonstrates even more complex relationships, all solved by the same vector arithmetic:

Relationship | Example 1 | Example 2 | Example 3
Capital cities | France → Paris | Italy → Rome | Japan → Tokyo
Currency | Angola → kwanza | Iran → rial | Germany → euro
Man → Woman | brother → sister | grandson → granddaughter | king → queen
Comparative | big → bigger | cold → colder | quick → quicker
Company → Product | Microsoft → Windows | Google → Android | Apple → iPhone
Country → Cuisine | Japan → sushi | Germany → bratwurst | France → tapas

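All of these come from the same subtraction-addition vector arithmetic. Not every answer is right: tapas is a Spanish dish, so the France entry, which is what the model in the paper returned, shows a plausible but wrong guess. The model was never told about capitals, currencies, or cuisines. It learned all of these relationships purely from reading text.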

6. CBOW vs Skip-gram - The Two Word2Vec Architectures

The Word2Vec paper introduces two training architectures. Understanding the difference explains why embeddings work differently for different tasks.

CBOW - Continuous Bag of Words

Input:   Context words (surrounding window)
Output:  Predict the TARGET word in the middle

Example (window size = 2):
Sentence: "The DBA optimised the slow query"

Given:  ["The", "DBA", "the", "slow"]  (4 context words)
Predict: "optimised"  (the middle word)

Architecture: INPUT → SUM → PROJECTION → OUTPUT
              Context words averaged together → predict center word

Skip-gram

Input:   A SINGLE word (the target)
Output:  Predict the CONTEXT words around it

Example (window size = 2):
Sentence: "The DBA optimised the slow query"

Given:  "optimised"  (single center word)
Predict: ["The", "DBA", "the", "slow"]  (surrounding context)

Architecture: INPUT → PROJECTION → multiple OUTPUT nodes

Aspect | CBOW | Skip-gram
Task | Context → predict center word | Center word → predict context
Speed | Faster to train | Slower (more predictions per word)
Best for | Frequent words; large corpora where training speed matters | Rare words; works well even with smaller datasets
Accuracy | Good syntactic accuracy | Better semantic accuracy (from paper: 55% vs 24%)
Training time (from paper) | Google News: ~1 day | Google News: ~3 days
Use in practice | gensim default option | Often preferred for quality
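
The difference between the two architectures is easiest to see in the training examples each one generates from the same sentence. The sketch below only builds those examples; it does not train anything. The helper name `training_examples` is hypothetical, and real implementations add details such as a randomly shrunk window and subsampling of frequent words that are left out here.

```python
def training_examples(tokens, window=2):
    """Yield (center, context_words) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

sentence = "The DBA optimised the slow query".split()

cbow_examples = []       # (all context words -> one target word)
skipgram_examples = []   # (one center word -> one context word), one pair each
for center, context in training_examples(sentence, window=2):
    cbow_examples.append((context, center))
    skipgram_examples.extend((center, ctx) for ctx in context)

print(cbow_examples[2])
# (['The', 'DBA', 'the', 'slow'], 'optimised')
print([pair for pair in skipgram_examples if pair[0] == "optimised"])
# [('optimised', 'The'), ('optimised', 'DBA'), ('optimised', 'the'), ('optimised', 'slow')]
```

For this six-word sentence CBOW produces 6 examples while Skip-gram produces 18 pairs, which is one way to see why Skip-gram is slower to train.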

🗄️ DBA Analogy: CBOW vs Skip-gram = Two Index Strategies

CBOW is like a composite index: it uses multiple columns together to identify a single row. Skip-gram is like a function-based index: it takes one value and projects what related values look like. Both serve different query patterns. In practice, Skip-gram produces richer semantic relationships, especially for rare words, just as function-based indexes excel for specific selective queries.

7. The Embedding Landscape: Classical → SOTA

The class notes (Page 14) show the full embedding family tree. Here it is as a structured reference for interviews and production decisions:

Category | Model | Year | Type | Use today
Classical (Fundamental, Interview) | Word2Vec | 2013 | Word-level, static | ✅ Learning & understanding
Classical (Fundamental, Interview) | GloVe | 2014 | Word-level, static | ✅ Some legacy systems
Improved Word2Vec | FastText | 2016 | Subword-level, static | ✅ Handling rare/new words
SOTA (State of the Art), Transformer-based | BERT (HuggingFace) | 2018 | Contextual, bidirectional | ✅ Classification, NER
SOTA (State of the Art), Transformer-based | Sentence Transformers | 2019+ | Sentence-level, contextual | ✅ RAG, semantic search
SOTA (State of the Art), Transformer-based | OpenAI Embeddings | 2022+ | Sentence/doc-level, API | ✅ Production RAG apps
SOTA (State of the Art), Transformer-based | Gemini Embeddings | 2023+ | Multimodal, contextual | ✅ Google ecosystem RAG

💡 Which to use when (practical decision guide):

Learning the concept: Word2Vec (gensim), which is controllable, transparent, and easy to debug
Custom model on your own data: Word2Vec or FastText trained on your domain text
Production RAG systems: Sentence Transformers (free, local) or OpenAI Embeddings API (managed)
Enterprise + data residency requirements: Sentence Transformers on-premise or AWS Bedrock / Azure AI embeddings
Interviews: Know Word2Vec theory + KING-QUEEN analogy cold; it’s asked constantly

8. FastText - The Improved Word2Vec

The class notes (Page 19) explicitly mention FastText as the improved version of Word2Vec. Here’s why it was invented and what problem it solved:

Word2Vec’s remaining weakness: It still had the OOV (Out Of Vocabulary) problem. If a word wasn’t in the training vocabulary, Word2Vec had no vector for it. A new product name, a misspelling, a technical acronym: all of them came back as “unknown.”

FastText’s fix: Instead of learning one vector per word, FastText learns vectors for character n-grams (subword pieces). The word is first wrapped in boundary markers, so the 3-grams of “database” are: “<da”, “dat”, “ata”, “tab”, “aba”, “bas”, “ase”, “se>”. By default FastText uses n-grams of length 3 to 6, plus the whole word. A new word like “datastore” can be represented as a combination of known subwords even if the full word was never seen.
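
A short sketch of how a word can be sliced into character n-grams. It follows the description in the FastText paper (Bojanowski et al.), which wraps each word in `<` and `>` boundary markers before slicing; the function name `char_ngrams` is hypothetical, and the real library additionally hashes the n-grams into a fixed number of buckets, which is not shown here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

trigrams = char_ngrams("database", min_n=3, max_n=3)
print(trigrams)
# ['<da', 'dat', 'ata', 'tab', 'aba', 'bas', 'ase', 'se>']

# Subword pieces that an unseen word shares with a known one:
shared = set(char_ngrams("datastore")) & set(char_ngrams("database"))
print(sorted(shared))
```

The shared pieces are what allow a vector for “datastore” to be assembled even when the full word never appeared in training.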

Aspect | Word2Vec | FastText
Unit of learning | Whole words | Character n-grams (subwords)
OOV handling | ❌ Unknown word = no vector | ✅ Builds from subword components
Morphology | ❌ “run” and “running” are separate tokens with no shared parameters | ✅ Shares subwords → related
Speed | Fast | Slightly slower
Creator | Google (Mikolov) | Facebook AI (Bojanowski et al.)
Best for | Standard vocabulary, clean text | Morphologically rich languages, technical text with abbreviations

9. Python: Train Your Own Word2Vec Model

Now let’s write code. We’ll start by training a small Word2Vec model from scratch on sample sentences, the same workflow used in the bootcamp practicals.

# word2vec_train.py
# Train a Word2Vec model from scratch using gensim
# Run: uv run word2vec_train.py

from gensim.models import Word2Vec
import numpy as np

# ── Step 1: Prepare training sentences ────────────────────────
# In production: these come from your documents, logs, runbooks
# For learning: we use DBA/DB-themed sentences
sentences = [
    # Database sentences
    ["oracle",  "database",  "index",   "query",    "performance"],
    ["postgres","database",  "index",   "query",    "performance"],
    ["oracle",  "dba",       "tuning",  "execution","plan"],
    ["postgres","dba",       "tuning",  "execution","plan"],
    ["slow",    "query",     "missing", "index",    "performance"],
    ["slow",    "query",     "high",    "cpu",      "usage"],
    ["database","server",    "memory",  "cpu",      "disk"],
    ["table",   "column",    "index",   "constraint","key"],
    ["primary", "key",       "foreign", "key",      "constraint"],
    ["backup",  "restore",   "recovery","archive",  "database"],
    # AI/ML sentences
    ["machine", "learning",  "model",   "training", "data"],
    ["deep",    "learning",  "neural",  "network",  "model"],
    ["word2vec","embedding", "vector",  "semantic", "meaning"],
    ["neural",  "network",   "weights", "training", "epoch"],
    ["embedding","vector",   "cosine",  "similarity","search"],
    ["rag",     "retrieval", "vector",  "database", "semantic"],
    ["llm",     "model",     "training","fine",     "tuning"],
    ["bert",    "transformer","embedding","context","language"],
]

# ── Step 2: Train the Word2Vec model ──────────────────────────
model = Word2Vec(
    sentences   = sentences,
    vector_size = 50,       # Embedding dimensions (50 for demo; Google used 300)
    window      = 3,        # How many words to look left/right for context
    min_count   = 1,        # Include words that appear at least once
    workers     = 4,        # Parallel threads for training
    sg          = 1,        # 0=CBOW (gensim's default), 1=Skip-gram (chosen here for quality)
    epochs      = 100       # Training iterations; more helps on small datasets
)

# ── Step 3: Inspect what was learned ──────────────────────────
print("=== Vocabulary ===")
print(f"Total words in vocabulary: {len(model.wv.key_to_index)}")
print(f"All words: {list(model.wv.key_to_index.keys())}")

# ── Step 4: Get a word vector ─────────────────────────────────
print("\n=== Word Vector for 'oracle' ===")
oracle_vector = model.wv['oracle']
print(f"Vector shape: {oracle_vector.shape}")  # (50,), i.e. 50 dimensions
print(f"First 10 values: {oracle_vector[:10].round(4)}")

# ── Step 5: Find most similar words ───────────────────────────
print("\n=== Words most similar to 'oracle' ===")
similar_to_oracle = model.wv.most_similar('oracle', topn=5)
for word, score in similar_to_oracle:
    print(f"  {word:<15} similarity: {score:.4f}")

print("\n=== Words most similar to 'index' ===")
similar_to_index = model.wv.most_similar('index', topn=5)
for word, score in similar_to_index:
    print(f"  {word:<15} similarity: {score:.4f}")

# ── Step 6: Word arithmetic (simplified KING-QUEEN analogy) ───
print("\n=== Vector Arithmetic: oracle - database + model ===")
result = model.wv.most_similar(
    positive=['oracle', 'model'],  # add these
    negative=['database'],          # subtract this
    topn=3
)
print("Result:", result)

# ── Step 7: Direct cosine similarity between two words ────────
print("\n=== Cosine Similarity ===")
pairs = [
    ('oracle',   'postgres'),
    ('oracle',   'model'),
    ('index',    'query'),
    ('embedding','vector'),
]
for w1, w2 in pairs:
    score = model.wv.similarity(w1, w2)
    print(f"  similarity('{w1}', '{w2}') = {score:.4f}")

# ── Step 8: Save and load the model ───────────────────────────
model.save("my_word2vec.model")
print("\n✅ Model saved to my_word2vec.model")

# Load later:
# loaded_model = Word2Vec.load("my_word2vec.model")

Run it:

uv run word2vec_train.py

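Expected output (illustrative: exact scores differ between runs because training starts from random weights and uses several threads; the vector-arithmetic result is omitted here):

=== Vocabulary ===
Total words in vocabulary: 54
All words: ['database', ...]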

=== Word Vector for 'oracle' ===
Vector shape: (50,)
First 10 values: [ 0.0312  0.0187 -0.0423  0.0891 -0.0234  0.0567 -0.0123  0.0789 -0.0345  0.0234]

=== Words most similar to 'oracle' ===
  postgres        similarity: 0.9823
  database        similarity: 0.9156
  dba             similarity: 0.8934
  tuning          similarity: 0.8712
  execution       similarity: 0.8445

=== Words most similar to 'index' ===
  query           similarity: 0.9567
  performance     similarity: 0.9234
  missing         similarity: 0.8891
  slow            similarity: 0.8734
  column          similarity: 0.8523

=== Cosine Similarity ===
  similarity('oracle',    'postgres')   = 0.9823
  similarity('oracle',    'model')      = 0.3421
  similarity('index',     'query')      = 0.9567
  similarity('embedding', 'vector')     = 0.9712

✅ Model saved to my_word2vec.model

💡 Reading the output: Notice that “oracle” and “postgres” score close to 1.0. The model picked up that they are related databases, even though we never told it that. It learned this purely from the fact that they appear in similar sentence contexts. This is the power of distributed representations.

10. Python: Load a Pre-trained Word2Vec and Explore Analogies

Training on a small dataset gives limited results. The real power comes from using models pre-trained on billions of words. Gensim provides direct access to Google’s original Word2Vec vectors via its downloader.

# word2vec_pretrained.py
# Load Google's pre-trained Word2Vec and test the KING-QUEEN analogy
# Run: uv run word2vec_pretrained.py
# Note: First run downloads ~1.6GB, which takes a few minutes

import gensim.downloader as api

print("Loading pre-trained Word2Vec (Google News, 300d)...")
print("Note: ~1.6GB download on first run...")
model = api.load("word2vec-google-news-300")
print(f"✅ Loaded! Vocabulary: {len(model.key_to_index):,} words, 300 dimensions\n")

# ── THE FAMOUS KING − MAN + WOMAN = QUEEN ─────────────────────
print("=== The KING − MAN + WOMAN = QUEEN Test ===")
result = model.most_similar(
    positive=['king', 'woman'],  # king + woman
    negative=['man'],             # minus man
    topn=5
)
print("king - man + woman = ?")
for word, score in result:
    print(f"  {word:<15} score: {score:.4f}")

# ── MORE ANALOGIES FROM THE RESEARCH PAPER ────────────────────
print("\n=== Capital Cities: France → Paris, so Germany → ? ===")
result = model.most_similar(
    positive=['germany', 'paris'],
    negative=['france'],
    topn=3
)
for word, score in result:
    print(f"  {word:<15} score: {score:.4f}")

print("\n=== Company → Product: Microsoft → Windows, so Google → ? ===")
result = model.most_similar(
    positive=['google', 'windows'],
    negative=['microsoft'],
    topn=3
)
for word, score in result:
    print(f"  {word:<15} score: {score:.4f}")

# ── DBA-RELEVANT SIMILARITY TESTS ─────────────────────────────
print("\n=== DBA Similarity Tests ===")
dba_pairs = [
    ('oracle',      'postgresql'),
    ('index',       'performance'),
    ('backup',      'restore'),
    ('database',    'schema'),
    ('query',       'sql'),
]
for w1, w2 in dba_pairs:
    try:
        score = model.similarity(w1, w2)
        print(f"  similarity('{w1}', '{w2}') = {score:.4f}")
    except KeyError as e:
        print(f"  {e} not in vocabulary")

Expected output (from Google’s 300-dimensional model; treat the scores as indicative):

=== The KING − MAN + WOMAN = QUEEN Test ===
king - man + woman = ?
  queen           score: 0.7118    ← ✅ QUEEN is #1!
  monarch         score: 0.6190
  princess        score: 0.5902
  crown_prince    score: 0.5499
  prince          score: 0.5377

=== Capital Cities: France → Paris, so Germany → ? ===
  berlin          score: 0.8045    ← ✅ Berlin is #1!
  munich          score: 0.6821
  frankfurt       score: 0.6234

=== Company → Product: Microsoft → Windows, so Google → ? ===
  android         score: 0.6892    ← ✅ Android is #1!
  chrome          score: 0.6234
  gmail           score: 0.5891

=== DBA Similarity Tests ===
  similarity('oracle',      'postgresql') = 0.5823
  similarity('index',       'performance') = 0.4234
  similarity('backup',      'restore')     = 0.6891
  similarity('database',    'schema')      = 0.6123
  similarity('query',       'sql')         = 0.7234

11. Python: Cosine Similarity - How Vector Distance Is Measured

Similarity operations in Word2Vec, Sentence Transformers, and vector databases are most commonly based on cosine similarity (vector databases usually offer Euclidean distance and inner product as alternatives). Understanding it is essential for building RAG systems.

The Concept

Cosine Similarity = cos(θ) between two vectors

Range: -1 to +1
  +1.0  = identical direction  → same meaning
  0.0   = perpendicular        → unrelated
  -1.0  = opposite direction   → opposite meaning

Formula:
  cosine_similarity(A, B) = (A · B) / (|A| × |B|)
  where A · B is the dot product, |A| is the vector magnitude
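
The three landmark values in the range can be reproduced with tiny hand-made vectors. This is only a sanity check of the formula; the vectors are arbitrary choices for illustration.

```python
import numpy as np

def cosine(a, b):
    """cos(theta) = (A . B) / (|A| * |B|)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([5.0, 3.0, 1.0])

same_direction = cosine(a, 2 * a)   # longer vector, same direction
perpendicular  = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite       = cosine(a, -a)

print(round(same_direction, 4), round(perpendicular, 4), round(opposite, 4))
# 1.0 0.0 -1.0
```

Note that doubling the length of a vector does not change the score: cosine similarity looks only at direction, not magnitude.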

🗄️ DBA Analogy: Cosine Similarity = Selectivity Estimate

When Oracle’s optimizer calculates how many rows will match a query condition, it uses selectivity: how “close” the query value is to the indexed values. Cosine similarity is the equivalent for embeddings: it measures how “close” a query vector is to a document vector. A score near 1.0 means “this document closely matches your query”, just like a highly selective index condition returns few, highly relevant rows.

# cosine_similarity_demo.py
# Understand cosine similarity: the math behind all vector search
# Run: uv run cosine_similarity_demo.py

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# ── Manual cosine similarity from scratch ─────────────────────
def cosine_sim_manual(vec_a, vec_b):
    """Cosine similarity from first principles"""
    dot_product  = np.dot(vec_a, vec_b)
    magnitude_a  = np.linalg.norm(vec_a)
    magnitude_b  = np.linalg.norm(vec_b)
    return dot_product / (magnitude_a * magnitude_b)

# ── Example vectors (simplified 3D for visualization) ─────────
# These are simplified; real embeddings have 300-3072 dimensions
oracle_vec     = np.array([5, 3, 1])   # database, performance, ai
postgres_vec   = np.array([5, 3, 1])   # very similar to oracle
mongodb_vec    = np.array([4, 2, 2])   # database, less performance, slightly AI
numpy_vec      = np.array([1, 1, 5])   # less database, more AI/data science
unrelated_vec  = np.array([0, 0, 1])   # completely different domain

print("=== Manual Cosine Similarity ===")
pairs = [
    ("oracle",  "postgres",  oracle_vec,    postgres_vec),
    ("oracle",  "mongodb",   oracle_vec,    mongodb_vec),
    ("oracle",  "numpy",     oracle_vec,    numpy_vec),
    ("oracle",  "unrelated", oracle_vec,    unrelated_vec),
]
for name_a, name_b, vec_a, vec_b in pairs:
    score = cosine_sim_manual(vec_a, vec_b)
    print(f"  cosine_sim('{name_a}', '{name_b}') = {score:.4f}")

# ── Semantic search simulation ────────────────────────────────
print("\n=== Semantic Search Simulation ===")
print("User query: 'slow query performance issue'")
print()

# Pretend these are real embedding vectors (simplified to 5D for demo)
documents = {
    "Oracle query optimization guide":         np.array([0.9, 0.8, 0.7, 0.1, 0.2]),
    "Database index missing detection":        np.array([0.8, 0.9, 0.6, 0.1, 0.1]),
    "Execution plan analysis tutorial":        np.array([0.7, 0.8, 0.8, 0.2, 0.1]),
    "Python machine learning introduction":    np.array([0.1, 0.1, 0.2, 0.9, 0.8]),
    "LangChain RAG pipeline tutorial":         np.array([0.2, 0.1, 0.1, 0.8, 0.9]),
}

# Query embedding (similar to performance + query + slow documents)
query_vector = np.array([0.85, 0.85, 0.75, 0.15, 0.15])

print(f"{'Document':<45} {'Similarity':>12}")
print("-" * 60)
results = []
for doc_name, doc_vec in documents.items():
    score = cosine_sim_manual(query_vector, doc_vec)
    results.append((doc_name, score))

# Sort by similarity descending
results.sort(key=lambda x: x[1], reverse=True)
for doc_name, score in results:
    bar = "█" * int(score * 20)
    print(f"  {doc_name:<43} {score:.4f}  {bar}")

print()
print(f"Top result: '{results[0][0]}'")
print("This is exactly how pgvector / ChromaDB / Pinecone searches work.")

Run it:

uv run cosine_similarity_demo.py

Expected output:

=== Manual Cosine Similarity ===
  cosine_sim('oracle', 'postgres') = 1.0000  ← identical direction
  cosine_sim('oracle', 'mongodb') = 0.9661  ← very similar
  cosine_sim('oracle', 'numpy') = 0.4229  ← only partly related
  cosine_sim('oracle', 'unrelated') = 0.1690  ← barely related

=== Semantic Search Simulation ===
User query: 'slow query performance issue'

Document                                        Similarity
------------------------------------------------------------
  Oracle query optimization guide             0.9970  ███████████████████
  Execution plan analysis tutorial            0.9934  ███████████████████
  Database index missing detection            0.9933  ███████████████████
  LangChain RAG pipeline tutorial             0.3323  ██████
  Python machine learning introduction        0.3266  ██████

Top result: 'Oracle query optimization guide'
This is exactly how pgvector / ChromaDB / Pinecone searches work.

12. Why This Matters for DBAs Building RAG Systems

Everything in this post connects directly to something you’ll build in the RAG modules of this series. Here’s the complete picture:

| What you learned today | Where it shows up in RAG / GenAI production |
| --- | --- |
| Words → dense vectors (embeddings) | Every document in your RAG knowledge base is stored as an embedding vector in a vector DB (pgvector, ChromaDB, Pinecone) |
| Similar words → similar vectors | A user query about “slow query” retrieves documents about “execution plan” and “missing index” because their vectors are similar |
| Cosine similarity | The math behind SELECT * FROM embeddings ORDER BY embedding <=> query_vector LIMIT 5 in pgvector (<=> is cosine distance; <-> is Euclidean distance) |
| Pre-trained models | You use OpenAI text-embedding-3-small or Sentence Transformers: same concept as Word2Vec, much more powerful |
| CBOW / Skip-gram | Knowing how an embedding model is trained helps you choose the right one for your domain |
| Training your own model | For domain-specific text (Oracle error logs, custom SQL dialects), a fine-tuned embedding model can outperform generic ones |
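
To make the cosine similarity row concrete, here is a minimal NumPy sketch of the two distance measures behind pgvector's `<=>` (cosine distance, i.e. 1 − cosine similarity) and `<->` (Euclidean / L2 distance) operators. The vectors and document labels are invented for illustration; real embeddings have hundreds of dimensions.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: the quantity pgvector's <=> operator orders by."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def l2_distance(a, b):
    """Euclidean distance: the quantity pgvector's <-> operator orders by."""
    return float(np.linalg.norm(a - b))

# Hypothetical 3-dimensional embeddings, chosen so the two measures disagree
query = np.array([1.0, 1.0, 0.0])
docs = {
    "same direction, longer": np.array([5.0, 5.0, 0.0]),
    "nearby, different direction": np.array([1.0, 0.0, 0.5]),
}

# Smaller distance = better match, so sort ascending (like ORDER BY ... LIMIT k)
by_cosine = sorted(docs, key=lambda name: cosine_distance(query, docs[name]))
by_l2 = sorted(docs, key=lambda name: l2_distance(query, docs[name]))

print("Ranked by cosine distance:", by_cosine)
print("Ranked by L2 distance:    ", by_l2)
```

Cosine distance ignores vector length and only compares direction, so the two rankings can differ unless the embeddings are normalised to unit length.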

The full RAG data flow, from a DBA perspective:

Your documents (runbooks, alert logs, SQL files)
         ↓
  Embedding model (Word2Vec concept → production: OpenAI / SentenceTransformer)
         ↓
  Dense vectors  → stored in Vector DB (pgvector in PostgreSQL, or ChromaDB)
         ↓
User query: "Why is this query slow?"
         ↓
  Embedding model converts query → query vector
         ↓
  Cosine similarity search: find top-K most similar document vectors
         ↓
  Retrieved documents → passed to LLM as context
         ↓
  LLM generates answer grounded in YOUR data
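
The flow above can be sketched end to end in a few lines of Python. This is a toy under loud assumptions: `embed` is a hypothetical stand-in that just counts words from a tiny fixed vocabulary (a real pipeline would call an embedding model), the "vector DB" is a plain list, and the LLM call is left out, so the sketch stops at building the prompt.

```python
import numpy as np

# Hypothetical stand-in for an embedding model: counts words from a tiny vocabulary
VOCAB = ["query", "slow", "index", "plan", "backup", "restore"]

def embed(text):
    words = text.lower().replace("?", "").split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)

def cosine_sim(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

# Steps 1-3: documents -> embedding model -> vectors stored in a "vector DB"
documents = [
    "Slow query checklist: check the execution plan and index usage",
    "Backup and restore runbook for the nightly backup job",
    "Index rebuild procedure",
]
doc_vectors = [embed(d) for d in documents]

def retrieve(question, k=2):
    """Steps 4-6: embed the query, rank documents by cosine similarity, keep top-k."""
    q = embed(question)
    scored = sorted(zip(documents, doc_vectors),
                    key=lambda pair: cosine_sim(q, pair[1]), reverse=True)
    return [doc for doc, _ in scored[:k]]

# Step 7: retrieved documents become the context handed to the LLM
question = "Why is this query slow?"
context = retrieve(question, k=2)
prompt = ("Answer using only this context:\n" + "\n".join(context)
          + f"\n\nQuestion: {question}")
print(prompt)
```

Swapping `embed` for a real model and the list for pgvector or ChromaDB changes the plumbing, not the shape of the pipeline.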

13. Common Errors and Fixes

Error 1: KeyError (word not in vocabulary)

KeyError: "word 'dbms_stats' not in vocabulary"

Cause: The word wasn’t in the training data, so no vector was learned for it. The exact wording of the message depends on your gensim version.

Fix:

# Check before accessing
if 'dbms_stats' in model.wv:
    vector = model.wv['dbms_stats']
else:
    print("Word not in vocabulary")

# OR use FastText: it handles out-of-vocabulary (OOV) words through subwords
from gensim.models import FastText
ft_model = FastText(sentences=sentences, vector_size=50, window=3, min_count=1)
vector = ft_model.wv['dbms_stats']  # Built from character n-grams, even if the word was never seen
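
Why FastText can do this: it represents each word as a bag of character n-grams, so an unseen word still shares pieces with words it did see. The sketch below only illustrates the n-gram extraction (lengths 3 to 6 with `<` and `>` boundary markers, matching FastText's defaults); the real model also hashes the n-grams into buckets and combines their learned vectors. The helper name `char_ngrams` and the example words are illustrative.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word, with < and > marking its boundaries."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# Pretend 'dbms_output' appeared in the training data and 'dbms_stats' did not
seen = set(char_ngrams("dbms_output"))
unseen = char_ngrams("dbms_stats")

# n-grams the unseen word shares with the seen one; these already have learned vectors
shared = [g for g in unseen if g in seen]
print(shared)
```

The shared n-grams all come from the common `dbms_` prefix, which is why the FastText vector for `dbms_stats` lands near other `dbms_` package names.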

Error 2: Poor similarity results from small training data

Symptom: Similarity scores are unexpected: words you know are related score low.

Cause: Word2Vec needs a large corpus to learn meaningful relationships. With only 18 training sentences (like our demo), results will be weak.

Fix:

import gensim.downloader as api
from gensim.models import Word2Vec

# Option 1: Use pre-trained model (best for production)
model = api.load("word2vec-google-news-300")

# Option 2: Train on more domain data, i.e. your actual docs
# For a DBA: feed all your runbooks, alert logs, SQL files
# More data = better representations

# Option 3: Increase epochs for small datasets
# min_count=1 keeps rare words; the default (5) would drop most of a tiny corpus
model = Word2Vec(sentences=sentences, epochs=500, vector_size=100, min_count=1)

Error 3: MemoryError loading word2vec-google-news-300

Cause: The full Google News Word2Vec model is ~1.6GB and requires ~4GB RAM to load.

Fix:

# Use a lighter alternative model
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-100")  # Only 128MB, 100 dimensions

# Or use sentence-transformers for production (more memory-efficient, better quality)
# uv add sentence-transformers
from sentence_transformers import SentenceTransformer
st_model = SentenceTransformer('all-MiniLM-L6-v2')  # Only 80MB
embedding = st_model.encode("slow query performance issue")  # 384-dimensional vector
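
The ~4GB RAM figure for the Google News model can be sanity-checked with back-of-envelope arithmetic: a dense embedding matrix needs roughly vocabulary size × dimensions × 4 bytes (float32). The helper name below is illustrative, and the estimate ignores the vocabulary dictionary and other overhead, so treat it as a lower bound.

```python
def embedding_matrix_gb(vocab_size, dimensions, bytes_per_value=4):
    """Approximate RAM for a dense float32 embedding matrix, in GB (10**9 bytes)."""
    return vocab_size * dimensions * bytes_per_value / 1e9

# Google News Word2Vec: about 3 million words and phrases, 300 dimensions
print(f"word2vec-google-news-300: {embedding_matrix_gb(3_000_000, 300):.1f} GB")

# GloVe wiki-gigaword-100: about 400,000 words, 100 dimensions
print(f"glove-wiki-gigaword-100:  {embedding_matrix_gb(400_000, 100):.2f} GB")
```

The same arithmetic is useful later for sizing a vector column in pgvector: rows × dimensions × 4 bytes gives a first estimate before indexes.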

14. Key Takeaways

✅ What you learned in this post:

  • Word2Vec (Google, 2013) was the breakthrough that showed word meaning can be learned as dense vectors from raw text. It trains a neural network; the hidden layer weights become the word embeddings.
  • A neural network learns by: random weights → forward pass → measure loss → backpropagation → adjust weights → repeat until the loss is minimised (1 epoch = 1 full pass over the training data).
  • Manual feature engineering (assigning numbers to word properties) doesn’t scale. Word2Vec learns features automatically from massive text data.
  • The KING − MAN + WOMAN = QUEEN analogy works because embeddings encode semantic relationships as directions in vector space. Subtracting “man-ness” and adding “woman-ness” shifts the KING vector toward QUEEN.
  • CBOW predicts the center word from context: faster, better for frequent words. Skip-gram predicts context from the center word: slower, better semantic accuracy, preferred for quality.
  • The embedding landscape: Word2Vec (classical, interviews) → FastText (handles OOV via subwords) → Sentence Transformers / OpenAI Embeddings (production RAG).
  • Cosine similarity is the math behind most vector search. Range −1 to +1. In pgvector it corresponds to ORDER BY embedding <=> query_vector (<=> is cosine distance, 1 − cosine similarity; <-> is Euclidean distance).
  • For building RAG systems: documents → embeddings → vector DB → cosine similarity search → retrieved context → LLM. Word2Vec is the conceptual foundation of the embedding and retrieval steps.
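
As a quick recap of the analogy arithmetic, here is a sketch with hand-made 2-dimensional vectors (axis 0 is a made-up "royalty" score, axis 1 a made-up "gender" score). Real Word2Vec vectors are learned, have hundreds of dimensions, and only land approximately on QUEEN; these toy values are chosen so the result is exact.

```python
import numpy as np

# Hand-made toy vectors: [royalty, gender], with +0.9 = male and -0.9 = female
vectors = {
    "king":  np.array([0.9,  0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1,  0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.5, 0.0]),
}

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# KING - MAN + WOMAN: remove the "male" direction, add the "female" direction
target = vectors["king"] - vectors["man"] + vectors["woman"]

# Exclude the input words from the candidates, as gensim's most_similar does
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine_sim(target, candidates[w]))
print(best, target)
```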

15. What’s Next

You now understand the theory and have trained your own Word2Vec model. Post 5 takes it to production level:

Post 5: Sentence Transformers and OpenAI Embeddings API (Production-Grade Embeddings)
Why Word2Vec isn’t enough for production · Sentence-level embeddings · HuggingFace Sentence Transformers with Python · OpenAI text-embedding-3-small API · Build a real semantic search system for your DBA runbooks · Compare keyword search vs semantic search head-to-head

| # | Post | Status |
| --- | --- | --- |
| 1 | What is GenAI? + UV Setup | ✅ Published |
| 2 | AI Roadmap + 30 Tools + GitHub Copilot Setup | ✅ Published |
| 3 | OHE, Bag of Words and TF-IDF with Python | ✅ Published |
| 4 | Word2Vec and Embeddings (this post) | 📍 You are here |
| 5 | Sentence Transformers + OpenAI Embeddings API | ⬜ Next Friday |
| 6 | Prompt Engineering: Zero to Advanced (DBA Edition) | ⬜ Coming soon |

👉 Next Post: Sentence Transformers and OpenAI Embeddings API: Build a Semantic Search System for Your DBA Runbooks


Part of the GenAI from Scratch series for DBAs and Infrastructure Engineers. Published every Friday at gradeupnow.in/genai-blog/
