Table of Contents
- The Problem Word2Vec Solved
- What is Word2Vec? (Google, 2013)
- Neural Network Basics: The Engine Behind Word2Vec
- Manual Feature Engineering vs Learned Embeddings
- The KING - MAN + WOMAN = QUEEN Analogy, Explained
- CBOW vs Skip-gram: The Two Word2Vec Architectures
- The Embedding Landscape: Classical → SOTA
- FastText: The Improved Word2Vec
- Python: Train Your Own Word2Vec Model
- Python: Load a Pre-trained Word2Vec and Explore Analogies
- Python: Cosine Similarity and How Vector Distance Is Measured
- Why This Matters for DBAs Building RAG Systems
- Common Errors and Fixes
- Key Takeaways
- What’s Next
In the last post we saw that OHE, Bag of Words, and TF-IDF all fail at the same thing: they treat words as symbols with no relationship to each other. “like” and “love” are completely unrelated numbers. “Oracle” and “PostgreSQL”: same thing. No similarity, no context, no meaning.
In 2013, a Google research team led by Tomas Mikolov published “Efficient Estimation of Word Representations in Vector Space.” This is the Word2Vec paper, one of the most influential papers in the history of NLP. It showed that a neural network can learn word meaning automatically from raw text, and that the resulting vectors have remarkable mathematical properties.
This is Post 4 of the GenAI from Scratch series. We’ll go through the theory, the key ideas from the original research paper, and write Python code you can run in VS Code today.
What you’ll learn:
- What Word2Vec is and why Google built it in 2013
- How neural networks learn word embeddings (single perceptron → multilayer)
- The famous KING - MAN + WOMAN = QUEEN analogy and the actual math behind it
- CBOW vs Skip-gram: the two Word2Vec architectures from the original paper
- The complete embedding landscape: Word2Vec → FastText → Transformers
- Python code to train Word2Vec, explore analogies, and compute similarity
- Why this matters directly for building RAG systems as a DBA
Lab Validated: All Python code tested in VS Code with Python 3.12, gensim 4.3+, and numpy 1.26+. Install with: uv add gensim numpy scikit-learn
Prerequisites
- Posts 1-3 completed: UV, VS Code, encoding concepts understood
- Install the required packages:
uv add gensim numpy scikit-learn
1. The Problem Word2Vec Solved
The original Word2Vec paper opens with this exact statement about the state of NLP before 2013:
“Many current NLP systems and techniques treat words as atomic units - there is no notion of similarity between words, as these are represented as indices in a vocabulary.”
(Mikolov et al., Google, 2013, Word2Vec paper)
That description is exactly OHE, BoW, and TF-IDF. Words were indices. “dog” = index 4521. “puppy” = index 8833. No mathematical connection between them whatsoever.
The following diagram summarises the problem:
OHE → Presence (0 or 1)
BoW → Count (frequency)
TF-IDF → Count + Importance
↓
All three: Data → Vector (Numbers)
But NO semantic meaning, NO context, NO relationships
The gap: Two synonyms "like" and "love" have zero mathematical relationship
TF-IDF was used by Google (2013-2014) before Neural Networks took over
BM25 (improved TF-IDF) still used in some RAG systems today
The need: Word → Number AND similar words → similar numbers
Word2Vec solved this by changing the fundamental approach: instead of engineering features manually, let a neural network learn the features from billions of words of text.
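To make the gap concrete, here is a minimal sketch using a hypothetical four-word vocabulary (the word list is an assumption chosen for illustration). With one-hot encoding, any two different words have a cosine similarity of exactly zero, synonyms included.

```python
import numpy as np

# Hypothetical toy vocabulary, chosen for illustration only
vocab = ["like", "love", "oracle", "postgresql"]

# One-hot encoding: row i of the identity matrix represents word i
identity = np.eye(len(vocab))
one_hot = {word: identity[i] for i, word in enumerate(vocab)}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


like_love = cosine(one_hot["like"], one_hot["love"])
like_like = cosine(one_hot["like"], one_hot["like"])

print(f"like vs love: {like_love:.1f}")  # synonyms, yet no overlap at all
print(f"like vs like: {like_like:.1f}")  # a word only matches itself
```

Whatever the vocabulary size, the result is the same: every one-hot vector is perpendicular to every other one, so there is nothing for a similarity search to work with.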
2. What is Word2Vec? (Google, 2013)
Word2Vec is a neural network model trained to convert words into dense vectors of numbers, where words used in similar contexts get similar vectors.
Key facts from the class notes and research paper:
| Fact | Detail |
|---|---|
| Created by | Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (Google Inc., Mountain View, CA) |
| Published | 2013 (arXiv:1301.3781) |
| What it does | Trains a neural network on massive text data; the hidden layer weights become the word embeddings |
| Output | Dense vector per word, typically 300 dimensions (the size of Google’s published pre-trained vectors) |
| Training data | Google News corpus: ~6 billion tokens, 1 million word vocabulary |
| Training time | Less than a day for a 1.6B-word dataset, as reported in the paper |
| Key breakthrough | Similar words get numerically close vectors. Vector math on words produces meaningful results. |
| Legacy | Conceptual foundation for later embeddings: FastText → BERT → GPT; modern LLMs still represent tokens as learned dense vectors |
DBA Analogy: Word2Vec = Statistics Gathered by ANALYZE
When you run DBMS_STATS.GATHER_TABLE_STATS in Oracle, the database doesn’t just count rows; it learns the distribution of values, correlations between columns, and selectivity patterns. Word2Vec does the equivalent for words: it trains on massive text and learns the “statistics” of word co-occurrence, which it then encodes as vectors. The end result is a model that captures which words tend to appear in similar contexts, much as column statistics help the optimizer understand which values are similar.
3. Neural Network Basics: The Engine Behind Word2Vec
The class notes spend time on neural network fundamentals before explaining Word2Vec, because Word2Vec is a neural network. You need to understand the machine to understand the output.
From Single Perceptron to Multilayer Network
The notes trace the progression from a single neuron to the full network used in Word2Vec:
── SINGLE PERCEPTRON ──
Input (features) → Weights (w) → Σ (sum) → Activation → Output
Example: Home Buy Prediction
F1: Size (1250 sqft)  ─┐
F2: Location (city)   ─┼─ w₁, w₂, w₃ → Σ → Act(Wx + b) → Y/N
F3: Bedrooms (3)      ─┘
Output: Buy? Yes or No (binary classification)
── NEURAL NETWORK ──
Layer 1: Input layer
Layer 2: Hidden layer (learns representations)
Layer 3: Output layer
Multilayer Perceptron → Multiple hidden layers
Each hidden layer learns increasingly abstract features
── TRAINING LOOP ──
1. Initialize weights randomly
2. Forward pass: Input → prediction
3. Calculate loss: (Prediction − Actual)²
4. Backward propagation (BP) → Optimizer → Gradient Descent
5. Adjust weights to reduce loss
6. Repeat (one full pass = 1 epoch)
7. After many epochs → weights converge toward values that minimise the loss
DBA Analogy: Training = Query Plan Optimisation Over Time
The first time Oracle runs a query with no statistics, it picks a bad plan. Then ANALYZE TABLE runs, gathers statistics, the optimizer adjusts, and the next execution is better. Neural network training is the same loop: random start → measure how wrong the prediction is → adjust weights → repeat until the loss stops improving. In Oracle, the optimizer converges to a better plan. In a neural network, the weights converge to a better learned representation.
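The training loop can be sketched in a few lines of NumPy. This is a minimal illustration, not the Word2Vec training code: it fits a single linear neuron (no activation function, to keep the gradients readable) to hypothetical toy data following y = 2x + 1. The data, learning rate, and epoch count are all assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: the neuron should learn y = 2*x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# 1. Initialize weights randomly
w, b = rng.normal(), rng.normal()
learning_rate = 0.05

losses = []
for epoch in range(2000):                 # 6. one full pass over the data = 1 epoch
    pred = w * x + b                      # 2. forward pass
    error = pred - y
    losses.append(np.mean(error ** 2))    # 3. loss: mean of (prediction - actual)^2
    grad_w = 2 * np.mean(error * x)       # 4. gradient of the loss w.r.t. w
    grad_b = 2 * np.mean(error)           #    gradient of the loss w.r.t. b
    w -= learning_rate * grad_w           # 5. adjust weights against the gradient
    b -= learning_rate * grad_b

print(f"learned w={w:.3f}, b={b:.3f}")
print(f"first loss={losses[0]:.4f}, last loss={losses[-1]:.2e}")
```

In a real network the gradients are not written by hand; a framework computes them through backpropagation. The loop structure is the same.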
Why the Hidden Layer Weights Become the Embedding
This is the key insight that makes Word2Vec work. When you train a neural network to predict a word from its context, the network must compress the meaning of a word into a compact internal representation inside the hidden layer. That internal representation, the weight matrix connecting the input to the hidden layer, is the word embedding.
Input (one-hot word vector)
↓
[Hidden Layer Weights W] ← THIS IS THE EMBEDDING MATRIX
↓
Hidden Layer
(300 neurons)
↓
Output: Predict context words
↓
Loss: was prediction correct?
↓
Backpropagation updates W
After millions of iterations:
W[word_index] = the embedding for that word
Words used in similar contexts → similar rows in W
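Why is a row of W the embedding? Multiplying a one-hot vector by W simply selects one row. The sketch below shows this with a hypothetical four-word vocabulary and random numbers standing in for trained weights (both are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = ["king", "queen", "man", "woman"]  # hypothetical toy vocabulary
embedding_dim = 3                           # real models use far more, e.g. 300

# One row per vocabulary word; random values stand in for trained weights
W = rng.normal(size=(len(vocab), embedding_dim))

word_index = vocab.index("queen")
one_hot = np.zeros(len(vocab))
one_hot[word_index] = 1.0

# Forward pass into the hidden layer: one-hot vector times weight matrix
hidden = one_hot @ W

# The result is exactly row `word_index` of W
print(np.allclose(hidden, W[word_index]))
```

This is also why libraries store embeddings as a lookup table: the matrix multiplication is never needed in practice, only the row lookup.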
4. Manual Feature Engineering vs Learned Embeddings
The notes show a brilliant illustration of the difference between manual feature assignment and learned embeddings. This is worth understanding deeply because it shows exactly what “learning” means in this context.
Manual Feature Assignment (Before Word2Vec)
Imagine you tried to manually assign features to words. You’d create a table like this.
| Word | Gender (0-1) | Wealth (0-1) | Power (0-1) | Weight (0-1) | Speaks (0-1) |
|---|---|---|---|---|---|
| KING | 1 | 1 | 0.99 | 0.7 | 1 |
| QUEEN | 1 | 0.8 | 0.7 | 0.8 | 1 |
| MAN | 1 | 0.5 | 0.4 | 0.7 | 1 |
| WOMAN | 1 | 0.4 | 0.3 | 0.8 | 1 |
| MONKEY | 1 | 0 | 0 | 0.1 | 0 |
So each word becomes a vector:
KING → [1, 1, 0.99, 0.7, 1]
MAN → [1, 0.5, 0.4, 0.7, 1]
WOMAN → [1, 0.4, 0.3, 0.8, 1]
Why manual features don’t scale: This approach works fine for 5 words with 5 features. But real language has 1,000,000+ words and an open-ended set of possible features (royalty, emotion, speed, colour, country, profession…). You cannot manually assign features to a million words. Word2Vec lets the neural network learn the features automatically from the data.
Learned Features (Word2Vec)
Word2Vec learned 300 features automatically from Google News. You never tell it “feature 47 = royalty” or “feature 128 = gender.” The neural network figures out whatever features help it predict word context best. The notes give a 5-dimension example to show the concept:
# 5-dimensional Word2Vec vectors (simplified example from class notes)
Sunny → [0, 1, 0.6, 0.3, 1] ← 5 features, learned by model
TIGER → [0, 0, 0.9, 0, 0]
# Google's actual Word2Vec: 300 dimensions
# Each word → a 1×300 vector
KING → [0.21, -0.45, 0.83, 0.12, ..., -0.67] # 300 numbers
The visualization from the notes plots words in 3D space using features like “strong”, “human”, and “hardworking”:
Visualization (3 features: strong, human, hardworking):
men → [5, 6, 4]
women → [6, 6, 6]
child → [2, 6, 3]
Plotting these in 3D space:
- men and women are close together (both "human", similar strength)
- child is further (less strong, similar "human" score)
→ The vector distance captures real-world relationships
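The “close together” claim can be checked numerically. The sketch below computes straight-line distances between the three example vectors; using Euclidean distance here (rather than cosine similarity) is an assumption made to match the 3D-plot intuition.

```python
import numpy as np

# Feature order: [strong, human, hardworking], values from the example above
words = {
    "men": np.array([5, 6, 4]),
    "women": np.array([6, 6, 6]),
    "child": np.array([2, 6, 3]),
}


def distance(a, b):
    """Euclidean (straight-line) distance between two word vectors."""
    return float(np.linalg.norm(words[a] - words[b]))


d_men_women = distance("men", "women")
d_men_child = distance("men", "child")
d_women_child = distance("women", "child")

print(f"men   <-> women: {d_men_women:.3f}")
print(f"men   <-> child: {d_men_child:.3f}")
print(f"women <-> child: {d_women_child:.3f}")
```

The men/women pair comes out closest, and child sits further from both, which is the pattern the plot shows.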
5. The KING - MAN + WOMAN = QUEEN Analogy, Explained
This is probably the most famous example in NLP. When the Word2Vec authors reported it in 2013, it caught the attention of the whole research community. Let’s understand exactly what’s happening mathematically, because the same idea underlies every embedding search and RAG system you will build.
The Intuition
Question: What word is to WOMAN as KING is to MAN?
Answer: QUEEN
Word2Vec proves this with vector arithmetic:
vector("KING") - vector("MAN") + vector("WOMAN") ≈ vector("QUEEN")
The Actual Math (from class notes, Page 18)
Using the manual 5-feature example from the class notes:
Gender Wealth Power Weight Speak
KING  → [ 1, 1,   0.99, 0.7, 1 ]
MAN   → [ 1, 0.5, 0.4,  0.7, 1 ]
WOMAN → [ 1, 0.4, 0.3,  0.8, 1 ]
KING - MAN + WOMAN =
  [ 1, 1,   0.99, 0.7, 1 ]
- [ 1, 0.5, 0.4,  0.7, 1 ]
+ [ 1, 0.4, 0.3,  0.8, 1 ]
= [ 1, 0.9, 0.89, 0.8, 1 ]
Expected QUEEN =
  [ 1, 0.8, 0.7,  0.8, 1 ]
Result is close to QUEEN!
What’s happening conceptually:
- KING - MAN = removes the “man” concept, keeping “royalty + power + wealth”
- + WOMAN = adds the “woman” concept back
- Result ≈ “royalty + power + wealth + woman” = QUEEN
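The arithmetic is easy to reproduce. The sketch below uses the five manual feature vectors from the table in section 4, computes KING - MAN + WOMAN, and then looks for the nearest remaining word by cosine similarity. Excluding the three input words from the candidates is a deliberate choice here; gensim’s most_similar applies the same exclusion.

```python
import numpy as np

# Manual feature vectors: [gender, wealth, power, weight, speaks]
vectors = {
    "KING":   np.array([1, 1.0, 0.99, 0.7, 1]),
    "QUEEN":  np.array([1, 0.8, 0.70, 0.8, 1]),
    "MAN":    np.array([1, 0.5, 0.40, 0.7, 1]),
    "WOMAN":  np.array([1, 0.4, 0.30, 0.8, 1]),
    "MONKEY": np.array([1, 0.0, 0.00, 0.1, 0]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


target = vectors["KING"] - vectors["MAN"] + vectors["WOMAN"]
print("KING - MAN + WOMAN =", target.round(2))

# Candidates: every word except the three used in the arithmetic
inputs = {"KING", "MAN", "WOMAN"}
scores = {w: cosine(target, v) for w, v in vectors.items() if w not in inputs}
best = max(scores, key=scores.get)

for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"  {word:<8} {score:.4f}")
print("Nearest word:", best)
```

The exclusion matters: in this toy table the result vector is still marginally closer to KING than to QUEEN, so without it the “answer” would be one of the question words.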
Why this matters for RAG systems you build as a DBA:
When a user asks your RAG system: “Show me documents about database performance issues”, the embedding model converts that query to a vector. Then the vector database finds documents whose vectors are closest to the query vector. This works because “performance issues”, “slow queries”, “execution plan problems”, and “index missing” all end up with nearby vectors in embedding space. That proximity is the same property that places KING near QUEEN, applied at scale across your entire document collection.
From the Original Research Paper
The Word2Vec paper reports many more relationships, all recovered by the same vector arithmetic:
| Relationship | Example 1 | Example 2 | Example 3 |
|---|---|---|---|
| Capital cities | France → Paris | Italy → Rome | Japan → Tokyo |
| Currency | Angola → kwanza | Iran → rial | Germany → euro |
| Man → Woman | brother → sister | grandson → granddaughter | king → queen |
| Comparative | big → bigger | cold → colder | quick → quicker |
| Company → Product | Microsoft → Windows | Google → Android | Apple → iPhone |
| Country → Cuisine | Japan → sushi | Germany → bratwurst | France → tapas |
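All of these come from the same subtraction-addition vector arithmetic. The model was never told about capitals, currencies, or cuisines; it picked up these relationships purely from reading text. The entries are model outputs rather than verified facts, so a few are off: France → tapas, for example, names a Spanish dish.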
6. CBOW vs Skip-gram: The Two Word2Vec Architectures
The Word2Vec paper introduces two training architectures. Understanding the difference explains why embeddings work differently for different tasks.
CBOW: Continuous Bag of Words
Input: Context words (surrounding window)
Output: Predict the TARGET word in the middle
Example (window size = 2):
Sentence: "The DBA optimised the slow query"
Given: ["The", "DBA", "the", "slow"] (4 context words)
Predict: "optimised" (the middle word)
Architecture: INPUT → SUM → PROJECTION → OUTPUT
Context words averaged together → predict center word
Skip-gram
Input: A SINGLE word (the target)
Output: Predict the CONTEXT words around it
Example (window size = 2):
Sentence: "The DBA optimised the slow query"
Given: "optimised" (single center word)
Predict: ["The", "DBA", "the", "slow"] (surrounding context)
Architecture: INPUT → PROJECTION → multiple OUTPUT nodes
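The two examples above differ only in how training pairs are cut from the same sentence. The sketch below generates both kinds of pairs for a fixed window of 2. The helper names are hypothetical, and real implementations add refinements (such as randomly shrinking the window and subsampling frequent words) that are left out here.

```python
def context_window(tokens, i, window):
    """Words within `window` positions to the left and right of tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """CBOW: (list of context words) -> center word."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_pairs(tokens, window):
    """Skip-gram: one (center word, context word) pair per context word."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


tokens = "The DBA optimised the slow query".split()
cbow = cbow_examples(tokens, window=2)
pairs = skipgram_pairs(tokens, window=2)

print("CBOW example for 'optimised':", cbow[2])
print("Skip-gram pairs for 'optimised':",
      [p for p in pairs if p[0] == "optimised"])
print(f"{len(cbow)} CBOW examples vs {len(pairs)} skip-gram pairs")
```

The counts in the last line show why Skip-gram trains more slowly: each position in the sentence yields one CBOW example but several skip-gram pairs.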
| Aspect | CBOW | Skip-gram |
|---|---|---|
| Task | Context → predict center word | Center word → predict context |
| Speed | Faster to train | Slower (more predictions per word) |
| Best for | Frequent words; large corpora where training speed matters | Rare words; also works well with smaller training sets |
| Accuracy (paper) | Slightly stronger on syntactic analogies | Stronger on semantic analogies (55% vs 24% for CBOW) |
| Training time (paper, Google News) | ~1 day | ~3 days |
| Use in practice | gensim default (sg=0) | Often preferred for quality (sg=1) |
DBA Analogy: CBOW vs Skip-gram = Two Index Strategies
CBOW is like a composite index: it uses multiple columns together to identify a single row. Skip-gram is like a function-based index: it takes one value and projects what related values look like. Both serve different query patterns. In practice, Skip-gram tends to produce richer semantic relationships, especially for rare words, just as function-based indexes excel for specific selective queries.
7. The Embedding Landscape: Classical → SOTA
The class notes (Page 14) show the full embedding family tree. Here it is as a structured reference for interviews and production decisions:
| Category | Model | Year | Type | Use today |
|---|---|---|---|---|
| Classical (Fundamental, Interview) | Word2Vec | 2013 | Word-level, static | Learning & understanding |
| Classical (Fundamental, Interview) | GloVe | 2014 | Word-level, static | Some legacy systems |
| Improved Word2Vec | FastText | 2016 | Subword-level, static | Handling rare/new words |
| SOTA (State of the Art), Transformer-based | BERT (HuggingFace) | 2018 | Contextual, bidirectional | Classification, NER |
| SOTA (State of the Art), Transformer-based | Sentence Transformers | 2019+ | Sentence-level, contextual | RAG, semantic search |
| SOTA (State of the Art), Transformer-based | OpenAI Embeddings | 2022+ | Sentence/doc-level, API | Production RAG apps |
| SOTA (State of the Art), Transformer-based | Gemini Embeddings | 2023+ | Multimodal, contextual | Google ecosystem RAG |
Which to use when (practical decision guide):
- Learning the concept: Word2Vec (gensim), which is controllable, transparent, and easy to debug
- Custom model on your own data: Word2Vec or FastText trained on your domain text
- Production RAG systems: Sentence Transformers (free, local) or OpenAI Embeddings API (managed)
- Enterprise + data residency requirements: Sentence Transformers on-premise or AWS Bedrock / Azure AI embeddings
- Interviews: Know Word2Vec theory + KING-QUEEN analogy cold; it’s asked constantly
8. FastText: The Improved Word2Vec
The class notes (Page 19) explicitly mention FastText as the improved version of Word2Vec. Here’s why it was invented and what problem it solved:
Word2Vec’s remaining weakness: It still had the OOV (Out Of Vocabulary) problem. If a word wasn’t in the training vocabulary, Word2Vec had no vector for it. A new product name, a misspelling, a technical acronym: none of them could be looked up.
FastText’s fix: In addition to a vector for each whole word, FastText learns vectors for character n-grams (subword pieces). The word “database” contributes trigrams such as “dat”, “ata”, “tab”, “aba”, “bas”, “ase”; FastText also marks word boundaries (“<da”, “se>”) and by default uses n-grams of 3 to 6 characters. A new word like “datastore” can be represented as a combination of known subwords even if the full word was never seen.
| Aspect | Word2Vec | FastText |
|---|---|---|
| Unit of learning | Whole words | Character n-grams (subwords) |
| OOV handling | No: unknown word = no vector | Yes: builds a vector from subword components |
| Morphology | “run” and “running” are separate tokens with no shared parameters | Shared subwords tie related forms together |
| Speed | Fast | Slightly slower |
| Creator | Google (Mikolov) | Facebook AI (Bojanowski et al.) |
| Best for | Standard vocabulary, clean text | Morphologically rich languages, technical text with abbreviations |
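The subword idea can be sketched without any library. The function below is a simplified illustration, not FastText’s implementation: it wraps the word in boundary markers and lists its character n-grams, leaving out the whole-word vector and the hashing of n-grams into buckets that the real library uses. The function name and defaults are assumptions.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word`, with '<' and '>' marking word boundaries."""
    marked = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


# Trigrams only, to keep the output short
database_grams = char_ngrams("database", min_n=3, max_n=3)
datastore_grams = char_ngrams("datastore", min_n=3, max_n=3)
shared = sorted(set(database_grams) & set(datastore_grams))

print("database :", database_grams)
print("datastore:", datastore_grams)
print("shared   :", shared)
```

The shared trigrams are what let an unseen word such as “datastore” borrow part of its meaning from “database”: its vector is assembled from subword vectors that were already trained.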
9. Python: Train Your Own Word2Vec Model
Now let’s write code. We’ll start by training a small Word2Vec model from scratch on sample sentences, the same workflow used in the bootcamp practicals.
# word2vec_train.py
# Train a Word2Vec model from scratch using gensim
# Run: uv run word2vec_train.py
from gensim.models import Word2Vec

# -- Step 1: Prepare training sentences --
# In production: these come from your documents, logs, runbooks
# For learning: we use DBA/DB-themed sentences
sentences = [
    # Database sentences
    ["oracle", "database", "index", "query", "performance"],
    ["postgres", "database", "index", "query", "performance"],
    ["oracle", "dba", "tuning", "execution", "plan"],
    ["postgres", "dba", "tuning", "execution", "plan"],
    ["slow", "query", "missing", "index", "performance"],
    ["slow", "query", "high", "cpu", "usage"],
    ["database", "server", "memory", "cpu", "disk"],
    ["table", "column", "index", "constraint", "key"],
    ["primary", "key", "foreign", "key", "constraint"],
    ["backup", "restore", "recovery", "archive", "database"],
    # AI/ML sentences
    ["machine", "learning", "model", "training", "data"],
    ["deep", "learning", "neural", "network", "model"],
    ["word2vec", "embedding", "vector", "semantic", "meaning"],
    ["neural", "network", "weights", "training", "epoch"],
    ["embedding", "vector", "cosine", "similarity", "search"],
    ["rag", "retrieval", "vector", "database", "semantic"],
    ["llm", "model", "training", "fine", "tuning"],
    ["bert", "transformer", "embedding", "context", "language"],
]

# -- Step 2: Train the Word2Vec model --
model = Word2Vec(
    sentences=sentences,
    vector_size=50,  # Embedding dimensions (50 for demo; Google used 300)
    window=3,        # How many words to look left/right for context
    min_count=1,     # Include words that appear at least once
    workers=4,       # Parallel threads for training
    sg=1,            # 0=CBOW (gensim default), 1=Skip-gram (chosen here for quality)
    epochs=100,      # Training iterations; more helps on small datasets
)

# -- Step 3: Inspect what was learned --
print("=== Vocabulary ===")
print(f"Total words in vocabulary: {len(model.wv.key_to_index)}")
print(f"All words: {list(model.wv.key_to_index.keys())}")

# -- Step 4: Get a word vector --
print("\n=== Word Vector for 'oracle' ===")
oracle_vector = model.wv['oracle']
print(f"Vector shape: {oracle_vector.shape}")  # (50,) = 50 dimensions
print(f"First 10 values: {oracle_vector[:10].round(4)}")

# -- Step 5: Find most similar words --
print("\n=== Words most similar to 'oracle' ===")
similar_to_oracle = model.wv.most_similar('oracle', topn=5)
for word, score in similar_to_oracle:
    print(f" {word:<15} similarity: {score:.4f}")

print("\n=== Words most similar to 'index' ===")
similar_to_index = model.wv.most_similar('index', topn=5)
for word, score in similar_to_index:
    print(f" {word:<15} similarity: {score:.4f}")

# -- Step 6: Word arithmetic (simplified KING-QUEEN analogy) --
print("\n=== Vector Arithmetic: oracle - database + model ===")
result = model.wv.most_similar(
    positive=['oracle', 'model'],  # add these
    negative=['database'],         # subtract this
    topn=3,
)
print("Result:", result)

# -- Step 7: Direct cosine similarity between two words --
print("\n=== Cosine Similarity ===")
pairs = [
    ('oracle', 'postgres'),
    ('oracle', 'model'),
    ('index', 'query'),
    ('embedding', 'vector'),
]
for w1, w2 in pairs:
    score = model.wv.similarity(w1, w2)
    print(f" similarity('{w1}', '{w2}') = {score:.4f}")

# -- Step 8: Save and load the model --
model.save("my_word2vec.model")
print("\nModel saved to my_word2vec.model")
# Load later:
# loaded_model = Word2Vec.load("my_word2vec.model")
Run it:
uv run word2vec_train.py
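Expected output (similarity scores and vector values are illustrative; they change from run to run because training starts from random weights and uses multiple threads):
=== Vocabulary ===
Total words in vocabulary: 54
All words: ['database', ...]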
=== Word Vector for 'oracle' ===
Vector shape: (50,)
First 10 values: [ 0.0312 0.0187 -0.0423 0.0891 -0.0234 0.0567 -0.0123 0.0789 -0.0345 0.0234]
=== Words most similar to 'oracle' ===
postgres similarity: 0.9823
database similarity: 0.9156
dba similarity: 0.8934
tuning similarity: 0.8712
execution similarity: 0.8445
=== Words most similar to 'index' ===
query similarity: 0.9567
performance similarity: 0.9234
missing similarity: 0.8891
slow similarity: 0.8734
column similarity: 0.8523
=== Cosine Similarity ===
similarity('oracle', 'postgres') = 0.9823
similarity('oracle', 'model') = 0.3421
similarity('index', 'query') = 0.9567
similarity('embedding', 'vector') = 0.9712
Model saved to my_word2vec.model
Reading the output: In a run like the one above, “oracle” and “postgres” score close to 1.0. The model has picked up that they are related, even though we never told it so, purely because they appear in near-identical sentence contexts. This is the power of distributed representations. With a corpus this small, your exact numbers will differ.
10. Python: Load a Pre-trained Word2Vec and Explore Analogies
Training on a small dataset gives limited results. The real power comes from using models pre-trained on billions of words. Gensim provides direct access to Google’s original Word2Vec vectors via its downloader.
# word2vec_pretrained.py
# Load Google's pre-trained Word2Vec and test the KING-QUEEN analogy
# Run: uv run word2vec_pretrained.py
# Note: First run downloads ~1.6GB and takes a few minutes
import gensim.downloader as api

print("Loading pre-trained Word2Vec (Google News, 300d)...")
print("Note: ~1.6GB download on first run...")
model = api.load("word2vec-google-news-300")
print(f"Loaded! Vocabulary: {len(model.key_to_index):,} words, 300 dimensions\n")

# -- THE FAMOUS KING - MAN + WOMAN = QUEEN --
print("=== The KING - MAN + WOMAN = QUEEN Test ===")
result = model.most_similar(
    positive=['king', 'woman'],  # king + woman
    negative=['man'],            # minus man
    topn=5,
)
print("king - man + woman = ?")
for word, score in result:
    print(f" {word:<15} score: {score:.4f}")

# -- MORE ANALOGIES FROM THE RESEARCH PAPER --
print("\n=== Capital Cities: France → Paris, so Germany → ? ===")
result = model.most_similar(
    positive=['germany', 'paris'],
    negative=['france'],
    topn=3,
)
for word, score in result:
    print(f" {word:<15} score: {score:.4f}")

print("\n=== Company → Product: Microsoft → Windows, so Google → ? ===")
result = model.most_similar(
    positive=['google', 'windows'],
    negative=['microsoft'],
    topn=3,
)
for word, score in result:
    print(f" {word:<15} score: {score:.4f}")

# -- DBA-RELEVANT SIMILARITY TESTS --
print("\n=== DBA Similarity Tests ===")
dba_pairs = [
    ('oracle', 'postgresql'),
    ('index', 'performance'),
    ('backup', 'restore'),
    ('database', 'schema'),
    ('query', 'sql'),
]
for w1, w2 in dba_pairs:
    try:
        score = model.similarity(w1, w2)
        print(f" similarity('{w1}', '{w2}') = {score:.4f}")
    except KeyError as e:
        # Raised when one of the two words is missing from the vocabulary
        print(f" skipped ('{w1}', '{w2}'): {e}")
Expected output (from Google’s 300-dimensional model; exact scores may differ slightly on your machine):
=== The KING - MAN + WOMAN = QUEEN Test ===
king - man + woman = ?
queen score: 0.7118 ← QUEEN is #1!
monarch score: 0.6190
princess score: 0.5902
crown_prince score: 0.5499
prince score: 0.5377
=== Capital Cities: France → Paris, so Germany → ? ===
berlin score: 0.8045 ← Berlin is #1!
munich score: 0.6821
frankfurt score: 0.6234
=== Company → Product: Microsoft → Windows, so Google → ? ===
android score: 0.6892 ← Android is #1!
chrome score: 0.6234
gmail score: 0.5891
=== DBA Similarity Tests ===
similarity('oracle', 'postgresql') = 0.5823
similarity('index', 'performance') = 0.4234
similarity('backup', 'restore') = 0.6891
similarity('database', 'schema') = 0.6123
similarity('query', 'sql') = 0.7234
11. Python: Cosine Similarity and How Vector Distance Is Measured
Similarity lookups in Word2Vec and Sentence Transformers are built on cosine similarity, and vector databases offer it alongside closely related measures such as dot product and Euclidean distance. Understanding it is essential for building RAG systems.
The Concept
Cosine Similarity = cos(θ) between two vectors
Range: -1 to +1
+1.0 = identical direction → same meaning
0.0 = perpendicular → unrelated
-1.0 = opposite direction → opposite meaning
Formula:
cosine_similarity(A, B) = (A · B) / (|A| × |B|)
where A · B is the dot product, |A| is the vector magnitude
DBA Analogy: Cosine Similarity = Selectivity Estimate
When Oracle’s optimizer calculates how many rows will match a query condition, it uses selectivity: how “close” the query value is to the indexed values. Cosine similarity is the equivalent for embeddings: it measures how “close” a query vector is to a document vector. A score near 1.0 means “this document closely matches your query”, just like a highly selective index condition returns few, highly relevant rows.
# cosine_similarity_demo.py
# Understand cosine similarity, the math behind vector search
# Run: uv run cosine_similarity_demo.py
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# -- Manual cosine similarity from scratch --
def cosine_sim_manual(vec_a, vec_b):
    """Cosine similarity from first principles"""
    dot_product = np.dot(vec_a, vec_b)
    magnitude_a = np.linalg.norm(vec_a)
    magnitude_b = np.linalg.norm(vec_b)
    return dot_product / (magnitude_a * magnitude_b)

# -- Example vectors (simplified 3D for visualization) --
# These are simplified; real embeddings have 300-3072 dimensions
oracle_vec = np.array([5, 3, 1])     # database, performance, ai
postgres_vec = np.array([5, 3, 1])   # identical to oracle in this toy example
mongodb_vec = np.array([4, 2, 2])    # database, less performance, slightly AI
numpy_vec = np.array([1, 1, 5])      # less database, more AI/data science
unrelated_vec = np.array([0, 0, 1])  # completely different domain

# Sanity check: the manual formula agrees with scikit-learn's implementation
assert np.isclose(
    cosine_sim_manual(oracle_vec, mongodb_vec),
    cosine_similarity([oracle_vec], [mongodb_vec])[0, 0],
)

print("=== Manual Cosine Similarity ===")
pairs = [
    ("oracle", "postgres", oracle_vec, postgres_vec),
    ("oracle", "mongodb", oracle_vec, mongodb_vec),
    ("oracle", "numpy", oracle_vec, numpy_vec),
    ("oracle", "unrelated", oracle_vec, unrelated_vec),
]
for name_a, name_b, vec_a, vec_b in pairs:
    score = cosine_sim_manual(vec_a, vec_b)
    print(f" cosine_sim('{name_a}', '{name_b}') = {score:.4f}")

# -- Semantic search simulation --
print("\n=== Semantic Search Simulation ===")
print("User query: 'slow query performance issue'")
print()
# Pretend these are real embedding vectors (simplified to 5D for demo)
documents = {
    "Oracle query optimization guide": np.array([0.9, 0.8, 0.7, 0.1, 0.2]),
    "Database index missing detection": np.array([0.8, 0.9, 0.6, 0.1, 0.1]),
    "Execution plan analysis tutorial": np.array([0.7, 0.8, 0.8, 0.2, 0.1]),
    "Python machine learning introduction": np.array([0.1, 0.1, 0.2, 0.9, 0.8]),
    "LangChain RAG pipeline tutorial": np.array([0.2, 0.1, 0.1, 0.8, 0.9]),
}
# Query embedding (similar to performance + query + slow documents)
query_vector = np.array([0.85, 0.85, 0.75, 0.15, 0.15])

print(f"{'Document':<45} {'Similarity':>12}")
print("-" * 60)
results = []
for doc_name, doc_vec in documents.items():
    score = cosine_sim_manual(query_vector, doc_vec)
    results.append((doc_name, score))

# Sort by similarity descending
results.sort(key=lambda x: x[1], reverse=True)
for doc_name, score in results:
    bar = "█" * int(score * 20)
    print(f" {doc_name:<43} {score:.4f} {bar}")
print()
print(f"Top result: '{results[0][0]}'")
print("This is exactly how pgvector / ChromaDB / Pinecone searches work.")
Run it:
uv run cosine_similarity_demo.py
Expected output:
=== Manual Cosine Similarity ===
cosine_sim('oracle', 'postgres') = 1.0000 ← identical direction
cosine_sim('oracle', 'mongodb') = 0.9661 ← very similar
cosine_sim('oracle', 'numpy') = 0.4229 ← weakly related
cosine_sim('oracle', 'unrelated') = 0.1690 ← barely related
=== Semantic Search Simulation ===
User query: 'slow query performance issue'
Document Similarity
------------------------------------------------------------
Oracle query optimization guide 0.9970 ███████████████████
Execution plan analysis tutorial 0.9934 ███████████████████
Database index missing detection 0.9933 ███████████████████
LangChain RAG pipeline tutorial 0.3323 ██████
Python machine learning introduction 0.3266 ██████
Top result: 'Oracle query optimization guide'
This is exactly how pgvector / ChromaDB / Pinecone searches work.
12. Why This Matters for DBAs Building RAG Systems
Everything in this post connects directly to something you’ll build in the RAG modules of this series. Here’s the complete picture:
| What you learned today | Where it shows up in RAG / GenAI production |
|---|---|
| Words → dense vectors (embeddings) | Every document in your RAG knowledge base is stored as an embedding vector in a vector DB (pgvector, ChromaDB, Pinecone) |
| Similar words → similar vectors | A user query about “slow query” retrieves documents about “execution plan” and “missing index” because their vectors are similar |
| Cosine similarity | The math behind SELECT * FROM embeddings ORDER BY embedding <=> query_vector LIMIT 5 in pgvector (<=> is cosine distance; <-> is Euclidean distance) |
| Pre-trained models | You use OpenAI text-embedding-3-small or Sentence Transformers: same concept as Word2Vec, much more powerful |
| CBOW / Skip-gram | Architecture understanding helps you choose the right embedding model for your domain |
| Training your own model | For domain-specific text (Oracle error logs, custom SQL dialects), a fine-tuned embedding model can outperform generic ones |
The full RAG data flow, from a DBA perspective:
Your documents (runbooks, alert logs, SQL files)
↓
Embedding model (Word2Vec concept → production: OpenAI / SentenceTransformer)
↓
Dense vectors → stored in Vector DB (pgvector in PostgreSQL, or ChromaDB)
↓
User query: "Why is this query slow?"
↓
Embedding model converts query → query vector
↓
Cosine similarity search: find top-K most similar document vectors
↓
Retrieved documents → passed to LLM as context
↓
LLM generates answer grounded in YOUR data
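The “find top-K” step in the flow above can be sketched with plain NumPy. This is a brute-force illustration under stated assumptions: the document titles and 3-dimensional vectors are made up, and the top_k helper is hypothetical. Production vector databases typically add an index so they do not have to compare the query against every row.

```python
import numpy as np


def top_k(query_vec, doc_matrix, k):
    """Return (row index, cosine score) for the k rows closest to query_vec."""
    # Normalise rows and query to unit length, so a dot product is the cosine
    doc_unit = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = doc_unit @ query_unit
    order = np.argsort(-scores)[:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]


# Hypothetical documents and embeddings, for illustration only
doc_titles = ["Index tuning runbook", "Backup policy", "Slow query checklist"]
doc_matrix = np.array([
    [0.9, 0.1, 0.2],
    [0.1, 0.9, 0.1],
    [0.8, 0.2, 0.3],
])
query_vec = np.array([0.85, 0.1, 0.25])  # stands in for an embedded user query

hits = top_k(query_vec, doc_matrix, k=2)
for idx, score in hits:
    print(f"{doc_titles[idx]:<22} {score:.4f}")
```

The titles returned by this step are what get passed to the LLM as context; the unrelated backup document never reaches the prompt.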
13. Common Errors and Fixes
Error 1: KeyError (word not in vocabulary)
KeyError: "word 'dbms_stats' not in vocabulary"
Cause: The word wasn’t in the training data so no vector was learned for it. The exact wording of the message depends on the gensim version.
Fix:
# Check before accessing
if 'dbms_stats' in model.wv:
    vector = model.wv['dbms_stats']
else:
    print("Word not in vocabulary")

# OR use FastText, which handles OOV words through subwords
from gensim.models import FastText
ft_model = FastText(sentences=sentences, vector_size=50, window=3, min_count=1)
vector = ft_model.wv['dbms_stats']  # Works even if the word was not in the training data
Error 2: Poor similarity results from small training data
Symptom: Similarity scores are unexpected: words you know are related score low.
Cause: Word2Vec needs a large corpus to learn meaningful relationships. With only 18 training sentences (like our demo), results will be weak.
Fix:
# Option 1: Use pre-trained model (best for production)
import gensim.downloader as api
model = api.load("word2vec-google-news-300")
# Option 2: Train on more domain data, i.e. your actual docs
# For a DBA: feed all your runbooks, alert logs, SQL files
# More data = better representations
# Option 3: Increase epochs for small datasets
from gensim.models import Word2Vec
model = Word2Vec(sentences=sentences, epochs=500, vector_size=100)
Error 3: MemoryError loading word2vec-google-news-300
Cause: The full Google News Word2Vec model is ~1.6GB and requires ~4GB RAM to load.
Fix:
# Use a lighter alternative model
model = api.load("glove-wiki-gigaword-100") # Only 128MB, 100 dimensions
# Or use sentence-transformers for production (more memory-efficient, better quality)
# uv add sentence-transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2') # Only 80MB
14. Key Takeaways
What you learned in this post:
- Word2Vec (Google, 2013) was the breakthrough that showed word meaning could be encoded mathematically. It trains a neural network; the hidden layer weights become the word embeddings.
- A neural network learns by: random weights → forward pass → measure loss → backpropagation → adjust weights → repeat until loss minimised (1 epoch = 1 full pass).
- Manual feature engineering (assigning numbers to word properties) doesn’t scale. Word2Vec learns features automatically from massive text data.
- The KING - MAN + WOMAN = QUEEN analogy works because embeddings encode semantic relationships as directions in vector space. Subtracting “man-ness” and adding “woman-ness” shifts the KING vector toward QUEEN.
- CBOW predicts the center word from context: faster, better for frequent words. Skip-gram predicts context from the center word: slower, better semantic accuracy, preferred for quality.
- The embedding landscape: Word2Vec (classical, interviews) → FastText (handles OOV via subwords) → Sentence Transformers / OpenAI Embeddings (production RAG).
- Cosine similarity is the math that powers vector search. Range -1 to +1. It’s the formula behind every ORDER BY embedding <=> query_vector query in pgvector (<=> is the cosine distance operator).
- For building RAG systems: documents → embeddings → vector DB → cosine similarity search → retrieved context → LLM. Word2Vec is the conceptual foundation of every step.
15. What’s Next
You now understand the theory and have trained your own Word2Vec model. Post 5 takes it to production level:
Post 5: Sentence Transformers and OpenAI Embeddings API (Production-Grade Embeddings)
Why Word2Vec isn’t enough for production · Sentence-level embeddings · HuggingFace Sentence Transformers with Python · OpenAI text-embedding-3-small API · Build a real semantic search system for your DBA runbooks · Compare keyword search vs semantic search head-to-head
| # | Post | Status |
|---|---|---|
| 1 | What is GenAI? + UV Setup | Published |
| 2 | AI Roadmap + 30 Tools + GitHub Copilot Setup | Published |
| 3 | OHE, Bag of Words and TF-IDF with Python | Published |
| 4 | Word2Vec and Embeddings (this post) | You are here |
| 5 | Sentence Transformers + OpenAI Embeddings API | Next Friday |
| 6 | Prompt Engineering: Zero to Advanced (DBA Edition) | Coming soon |
Next Post: Sentence Transformers and OpenAI Embeddings API: Build a Semantic Search System for Your DBA Runbooks
References
- Mikolov, T., Chen, K., Corrado, G., Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. Google Inc. arXiv:1301.3781
- Gensim Word2Vec Documentation
- Original Word2Vec Code (Google)
Part of the GenAI from Scratch series for DBAs and Infrastructure Engineers. Published every Friday at gradeupnow.in/genai-blog/