OHE, BoW, TF-IDF and Embeddings with hands-on Python code

🤖GenAI from Scratch — Post 3 of 24

📋 Table of Contents

Why AI Can’t Read Text Directly
The 4 Text Encoding Techniques — Overview
One Hot Encoding (OHE) — Theory + Python Code
Bag of Words (BoW) — Theory + Python Code
TF-IDF — Theory + Python Code
Why OHE / BoW / TF-IDF All Fail for GenAI
Embeddings — The Fix That Powers LLMs
Full Comparison Table
Install and Setup
Common Errors and Fixes
Key Takeaways
What’s Next

Before an LLM can answer your question about a slow Oracle query, it has to do something that seems impossible: convert your English text into numbers. Computers don’t understand words. They understand numbers. So the first problem every NLP and GenAI system must solve is: how do you turn text into numbers — and do it in a way that preserves meaning?

This is Post 3 of the GenAI from Scratch series. We cover the evolution of text encoding, from the simplest approach (One Hot Encoding) through to the modern approach (Embeddings) that powers ChatGPT. Every concept includes working Python code, which you can run in VS Code right now.

What you’ll learn:

Why text must be converted to numbers before AI can process it
One Hot Encoding (OHE) — what it is, how to code it, pros and cons
Bag of Words (BoW) — how word counting works in ML, with Python
TF-IDF — how to measure word importance, not just word count
Why all three fail for GenAI — and what Embeddings solve
The complete comparison table for interviews and production decisions

🔬 Lab Validated: All code in this post is taken directly from encoding.ipynb — the actual bootcamp notebook. Tested in VS Code with Python 3.12 and scikit-learn 1.4+.

Prerequisites

☑ Posts 1 and 2 completed — UV installed, VS Code set up
☑ Your genai-bootcamp project folder from Post 1
☑ Install the required package (one command):

uv add scikit-learn numpy

Lab Environment

Component	Version
Python	3.12
scikit-learn	1.4+
numpy	1.26+
IDE	VS Code 1.88+
Notebook	encoding.ipynb

1. Why AI Can’t Read Text Directly

This is the foundational question. Every ML and AI model, whether it’s a simple Naive Bayes classifier or a billion-parameter LLM, operates entirely on numbers. Text is meaningless to a computer unless it’s been converted to a numerical representation first.

Text (what humans write)
        ↓
  Must be encoded
        ↓
Numbers / Vectors (what AI processes)
        ↓
   ML / DL / LLM
        ↓
    Prediction / Generation

🗄️ DBA Analogy — Text Encoding = Character Set Conversion

You already deal with this concept every day. Oracle stores every VARCHAR2 as bytes using a character set (AL32UTF8, WE8ISO8859P1, etc.). The word “INDEX” is not stored as letters — it’s stored as byte values: 73, 78, 68, 69, 88. Text encoding for AI is the same principle at a higher level: convert human language into a numerical format that the model can do math on.

The challenge is not just converting text to numbers. The challenge is doing it in a way that preserves meaning. A character set doesn’t care that “database” and “datastore” are related concepts. AI encoding has to.

There are four main techniques, and they represent a historical evolution of increasing sophistication:

2. The 4 Text Encoding Techniques — Overview

#	Technique	The question it answers	Era	Understands meaning?
1	One Hot Encoding (OHE)	“Is the word present or not?” → 0 or 1	2012–2015	❌ No
2	Bag of Words (BoW)	“How many times does the word appear?”	2014–2016	❌ No
3	TF-IDF	“How important is this word in this document?”	2014–2016	❌ No (importance only)
4	Embeddings	“What is the meaning of this text in context?”	Word2Vec → Transformers	✅ Yes

The first three are classical NLP encoding techniques. They’re still important to understand because they explain exactly why modern embeddings were invented and what problems they solved. They’re also heavily tested in interviews — the handwritten notes explicitly mark them as “Ask Interview.”

3. One Hot Encoding (OHE) — Theory + Python Code

The Concept

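One Hot Encoding answers exactly one question: “Is this word present, or not?” For a single word, the output is a binary vector: all zeros except for a single 1 at that word’s position in the vocabulary. For a whole document, the per-word vectors are combined so that each position is 1 if the word occurs in the document and 0 otherwise, which is the form shown in the table below.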

Here’s the example:

Data (4 documents):

D1 → people watch movie
D2 → people watch cricket
D3 → people like movie
D4 → people like cricket

Step 1 — Build vocabulary (unique words from the data; the order below is arbitrary, whereas sklearn sorts its vocabulary alphabetically):

Vocabulary = {people, watch, like, movie, cricket}
Vector dimension = 5  (one position per unique word)

Step 2 — Encode each document (1 = word present, 0 = word absent):

Document	people	watch	like	movie	cricket	OHE Vector
D1: people watch movie	1	1	0	1	0	[1, 1, 0, 1, 0]
D2: people watch cricket	1	1	0	0	1	[1, 1, 0, 0, 1]
D3: people like movie	1	0	1	1	0	[1, 0, 1, 1, 0]
D4: people like cricket	1	0	1	0	1	[1, 0, 1, 0, 1]
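The table can be reproduced with a few lines of plain Python. This is a minimal sketch, not the notebook's code; the helper name presence_vector is made up for illustration, and the vocabulary is kept in the same column order as the table rather than sorted:

```python
# Vocabulary in the same column order as the table (an arbitrary order).
vocabulary = ["people", "watch", "like", "movie", "cricket"]

documents = {
    "D1": "people watch movie",
    "D2": "people watch cricket",
    "D3": "people like movie",
    "D4": "people like cricket",
}

def presence_vector(text, vocab):
    """Return 1 for each vocabulary word present in the text, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocab]

ohe_vectors = {name: presence_vector(text, vocabulary)
               for name, text in documents.items()}

for name, vector in ohe_vectors.items():
    print(name, vector)
# D1 [1, 1, 0, 1, 0]
# D2 [1, 1, 0, 0, 1]
# D3 [1, 0, 1, 1, 0]
# D4 [1, 0, 1, 0, 1]
```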

🗄️ DBA Analogy — OHE = Bitmap Index

A bitmap index in Oracle stores exactly this — for each unique value in a column, a bit vector showing which rows contain that value (1) and which don’t (0). One Hot Encoding is a bitmap index for words in a document. The vocabulary is your indexed column, the documents are your rows, and the 0s and 1s are the bitmap.

Python Code — OHE

# ohe_demo.py
# One Hot Encoding — From encoding.ipynb (Class 04)
# Run: uv run ohe_demo.py

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# ── Step 1: Define your document ──────────────────────────────────
document = ["my name is sunny and I love AI"]

# ── Step 2: Tokenise — split into individual words ─────────────────
# lower() ensures "AI" and "ai" are treated as the same word
tokens = [sentence.lower().split() for sentence in document]
print("Tokens:", tokens)
# Output: [['my', 'name', 'is', 'sunny', 'and', 'i', 'love', 'ai']]

# ── Step 3: Reshape for sklearn — needs [[word], [word], ...] ─────
# Each word must be its own row for the encoder
all_words = [[word] for sentence in tokens for word in sentence]
print("Words formatted for encoder:", all_words)
# Output: [['my'], ['name'], ['is'], ['sunny'], ['and'], ['i'], ['love'], ['ai']]

# ── Step 4: Create and FIT the encoder ────────────────────────────
# sparse_output=False → returns a NumPy array instead of a sparse matrix
# Makes it easier to read and print
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(all_words)   # Learn the vocabulary from the data

# ── Step 5: Check the vocabulary learned ──────────────────────────
print("\nVocabulary learned:", encoder.categories_[0])
# Output: ['ai' 'and' 'i' 'is' 'love' 'my' 'name' 'sunny']
# Note: sorted alphabetically — this is sklearn's default behaviour

# ── Step 6: Encode each sentence ─────────────────────────────────
for sentence in tokens:
    encoded = encoder.transform([[word] for word in sentence])
    print(f"\nSentence: {sentence}")
    print(f"OHE matrix shape: {encoded.shape}")  # (num_words, vocab_size)
    print(encoded)

Run it:

uv run ohe_demo.py

Expected output:

Tokens: [['my', 'name', 'is', 'sunny', 'and', 'i', 'love', 'ai']]
Words formatted for encoder: [['my'], ['name'], ['is'], ...]

Vocabulary learned: ['ai' 'and' 'i' 'is' 'love' 'my' 'name' 'sunny']

Sentence: ['my', 'name', 'is', 'sunny', 'and', 'i', 'love', 'ai']
OHE matrix shape: (8, 8)
[[0. 0. 0. 0. 0. 1. 0. 0.]   ← 'my'    → position 5 is 1
 [0. 0. 0. 0. 0. 0. 1. 0.]   ← 'name'  → position 6 is 1
 [0. 0. 0. 1. 0. 0. 0. 0.]   ← 'is'    → position 3 is 1
 [0. 0. 0. 0. 0. 0. 0. 1.]   ← 'sunny' → position 7 is 1
 [0. 1. 0. 0. 0. 0. 0. 0.]   ← 'and'   → position 1 is 1
 [0. 0. 1. 0. 0. 0. 0. 0.]   ← 'i'     → position 2 is 1
 [0. 0. 0. 0. 1. 0. 0. 0.]   ← 'love'  → position 4 is 1
 [1. 0. 0. 0. 0. 0. 0. 0.]]  ← 'ai'    → position 0 is 1

OHE Pros and Cons (Interview Notes)

✅ Pros	❌ Cons
Easy to implement	Sparse matrix: mostly zeros, huge memory waste. With a vocabulary of 10,000 words, a 3-word document becomes a vector with 9,997 zeros
Simple binary representation (0 and 1)	High dimensionality — vector size grows with vocabulary size
No training required — direct mapping	No semantic understanding — “like” and “love” are completely unrelated in OHE
Direct mapping from words to numbers	OOV Problem — if a new word “enjoy” appears in D5 but wasn’t in the training vocabulary, OHE cannot handle it. The word is unknown.

⚠️ The OOV Problem — Critical to Understand:

If your vocabulary was built from D1–D4 and a new document D5 contains the word “enjoy”, OHE has no idea what to do with it: “enjoy” is not in the vocabulary. This is called Out Of Vocabulary (OOV). It’s a fundamental limitation of all classical encoding methods. Modern embedding models largely solve this by splitting unseen words into known subword pieces; classic Word2Vec, by contrast, still has a fixed word-level vocabulary.

Training vocab: {people, watch, like, movie, cricket}
D5 = "people enjoy movie"
"enjoy" → NOT in vocabulary → OHE cannot encode it

4. Bag of Words (BoW) — Theory + Python Code

The Concept

Bag of Words answers: “How many times does each word appear?” Instead of a binary 0/1, it stores the actual word count. The underlying assumption: the more often a word repeats in a document, the more important it is.

Example from the class notes:

D1 → people watch movie and watch movie again
D2 → people watch cricket and watch cricket
D3 → people like movie and like movie a lot
D4 → people like cricket

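Vocabulary (8 words, sorted alphabetically; the one-letter word “a” in D3 is dropped because CountVectorizer’s default tokenizer only keeps tokens of two or more characters):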

['again', 'and', 'cricket', 'like', 'lot', 'movie', 'people', 'watch']
 Vector dimension = 8

BoW matrix (count of each word per document):

Doc	again	and	cricket	like	lot	movie	people	watch
D1	1	1	0	0	0	2	1	2
D2	0	1	2	0	0	0	1	2
D3	0	1	0	2	1	2	1	0
D4	0	0	1	1	0	0	1	0

D1 vector: [1, 1, 0, 0, 0, 2, 1, 2] — “movie” and “watch” appear twice, so they score 2.

🗄️ DBA Analogy — BoW = COUNT(*) GROUP BY word

BoW is literally a SELECT word, COUNT(*) FROM tokenized_document GROUP BY word. It counts word frequency and turns the results into a fixed-length vector. The vocabulary is your lookup table, and each document becomes one row with a count for each word. No magic — just counting.
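The GROUP BY analogy maps almost one-to-one onto collections.Counter from the standard library. A minimal sketch for D1, assuming the 8-word vocabulary listed above and simple whitespace tokenisation (the variable names are illustrative):

```python
from collections import Counter

vocabulary = ["again", "and", "cricket", "like", "lot", "movie", "people", "watch"]
d1 = "people watch movie and watch movie again"

# The "GROUP BY word, COUNT(*)" step: one count per distinct word.
counts = Counter(d1.split())

# Project the counts onto the fixed vocabulary; a missing word counts as 0.
d1_vector = [counts[term] for term in vocabulary]
print(d1_vector)  # [1, 1, 0, 0, 0, 2, 1, 2]
```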

Python Code — BoW from the Bootcamp Notebook

# bow_demo.py
# Bag of Words — from encoding.ipynb (Class 04)
# sklearn docs: scikit-learn.org/0.15/modules/generated/
#               sklearn.feature_extraction.text.CountVectorizer.html
# Run: uv run bow_demo.py

from sklearn.feature_extraction.text import CountVectorizer

# ── Dataset — same as class notes ─────────────────────────────────
documents = [
    "people watch movie and watch movie again",
    "people watch cricket and watch cricket",
    "people like movie and like movie a lot",
    "people like cricket"
]

# ── Step 1: Create the CountVectorizer (BoW) ──────────────────────
bow = CountVectorizer()

# ── Step 2: fit_transform — learn vocabulary AND encode in one step
# fit()      → learns the vocabulary from all documents
# transform()→ converts documents to count vectors
bow_matrix = bow.fit_transform(documents)

# ── Step 3: Inspect the vocabulary ────────────────────────────────
print("Vocabulary:", bow.get_feature_names_out())
# Output: ['again' 'and' 'cricket' 'like' 'lot' 'movie' 'people' 'watch']

# ── Step 4: View the full BoW matrix ──────────────────────────────
print("\nBoW Matrix (full):")
print(bow_matrix.toarray())

# ── Step 5: View per document ─────────────────────────────────────
print("\nPer-document vectors:")
vocab = bow.get_feature_names_out()
for i, doc in enumerate(documents):
    print(f"\nD{i+1}: '{doc}'")
    print(f"Vector: {bow_matrix.toarray()[i]}")

# ── Step 6: OOV demo — what happens with an unknown word ──────────
# "lion" and "king" are NOT in the training vocabulary
new_doc = ["lion is the king of jungle"]
new_vector = bow.transform(new_doc)
print("\n--- OOV Demo ---")
print(f"New doc: {new_doc[0]}")
print(f"Vector:  {new_vector.toarray()}")
# Output: [[0 0 0 0 0 0 0 0]]
# ALL zeros — no words in this sentence existed in the vocabulary
print("Result: ALL zeros — BoW has no idea what this sentence means!")

Run it:

uv run bow_demo.py

Expected output:

Vocabulary: ['again' 'and' 'cricket' 'like' 'lot' 'movie' 'people' 'watch']

BoW Matrix (full):
[[1 1 0 0 0 2 1 2]
 [0 1 2 0 0 0 1 2]
 [0 1 0 2 1 2 1 0]
 [0 0 1 1 0 0 1 0]]

Per-document vectors:

D1: 'people watch movie and watch movie again'
Vector: [1 1 0 0 0 2 1 2]

D2: 'people watch cricket and watch cricket'
Vector: [0 1 2 0 0 0 1 2]

D3: 'people like movie and like movie a lot'
Vector: [0 1 0 2 1 2 1 0]

D4: 'people like cricket'
Vector: [0 0 1 1 0 0 1 0]

--- OOV Demo ---
New doc: lion is the king of jungle
Vector:  [[0 0 0 0 0 0 0 0]]
Result: ALL zeros — BoW has no idea what this sentence means!

BoW Pros and Cons

✅ Pros	❌ Cons
Easy to implement, very simple logic	Ignores word order — “dog bites man” and “man bites dog” produce identical vectors: {dog, bites, man} → [1,1,1]
Captures word frequency — more occurrences = higher importance	No semantic understanding — “like” and “love” are treated as completely different words with zero relation
Works well for classical NLP: text classification, spam detection, sentiment analysis (basic)	High dimensionality — vocabulary of 50,000 words → 50,000-dimension vectors, mostly zeros
No training required — direct conversion text → numbers	Sparse representation — memory-inefficient, computation-wasteful
Used successfully 2014–2016 with Naive Bayes and RNNs/LSTMs	OOV problem — new words not in training vocabulary are silently ignored
	Overemphasizes frequent words — “movie movie movie” gets high count even if it’s meaningless repetition
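The word-order limitation in the first row is easy to confirm. As a small sketch using only the standard library, counting the words of both sentences gives exactly the same bag:

```python
from collections import Counter

bag_a = Counter("dog bites man".split())
bag_b = Counter("man bites dog".split())

# Counting discards position, so the two opposite sentences become indistinguishable.
same_bag = (bag_a == bag_b)
print(same_bag)  # True
```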

5. TF-IDF — Theory + Python Code

The Concept

BoW has a critical problem: common words like “and”, “the”, “people” appear in every document and get high counts — but they carry no meaningful signal. TF-IDF fixes this by weighting words by their importance, not just their count.

TF-IDF stands for Term Frequency × Inverse Document Frequency.

TF-IDF(word, document) = TF(word, document) × IDF(word)

TF(word, D)  = (occurrences of word in D) / (total words in D)

IDF(word)    = log( (total number of documents) / (number of documents containing word) )

Here log is the natural logarithm, which is what the worked numbers below use.

Why TF? More occurrences in a document = more important in that document.
Why IDF? Common words appear in many documents — they should get lower weight.
Why multiply TF × IDF? TF = importance within document. IDF = importance across corpus.
Why use log in IDF? Without log, a rare word in 1 out of 1000 documents gets IDF=1000. A common word in 500 documents gets IDF=2. The difference (1000 vs 2) is extreme and makes the model unstable. Log compresses this to 6.9 vs 0.69 — reasonable and balanced.
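The effect of the logarithm can be seen with the same two cases, a word in 1 of 1000 documents and a word in 500 of 1000. A short sketch using the natural log from the standard library:

```python
import math

total_docs = 1000

rare_ratio = total_docs / 1      # word appears in 1 document
common_ratio = total_docs / 500  # word appears in 500 documents

rare_idf = math.log(rare_ratio)
common_idf = math.log(common_ratio)

print(rare_ratio, common_ratio)                    # 1000.0 2.0
print(round(rare_idf, 2), round(common_idf, 2))    # 6.91 0.69
```

Without the log the rare word outweighs the common one by a factor of 500; with it the factor is about 10.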

TF-IDF Worked Example

Documents:
D1 → people watch cricket
D2 → cricket watch cricket
D3 → people give comment
D4 → cricket give comment

Vocabulary: ['comment', 'cricket', 'give', 'people', 'watch']
Total documents (N) = 4

Computing TF-IDF for “cricket” in D1:

TF("cricket", D1)  = 1/3  (appears once, 3 total words)
IDF("cricket")     = log(4/3)  (appears in 3 of 4 documents)
                   ≈ 0.288

TF-IDF = (1/3) × log(4/3) ≈ 0.096

The final TF-IDF matrix from the class notes (computed values):

Doc	comment	cricket	give	people	watch
D1	0	0.096	0	0.231	0.231
D2	0	0.192	0	0	0.231
D3	0.231	0	0.231	0.231	0
D4	0.231	0.096	0.231	0	0

Notice: “cricket” appears in 3 documents (D1, D2, D4) so it gets lower weight (0.096, 0.192) compared to “comment” which only appears in 2 documents (0.231). Rarer words get higher TF-IDF scores.
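To check the worked example by hand, the textbook formula can be coded directly. This is a sketch of the plain TF × IDF definition given above (natural log, no smoothing, no normalisation), not of what sklearn computes; the helper names tf and idf are illustrative:

```python
import math

docs = {
    "D1": "people watch cricket",
    "D2": "cricket watch cricket",
    "D3": "people give comment",
    "D4": "cricket give comment",
}
vocab = ["comment", "cricket", "give", "people", "watch"]

tokenized = {name: text.split() for name, text in docs.items()}
n_docs = len(tokenized)

def tf(term, words):
    """Share of the document's words that are this term."""
    return words.count(term) / len(words)

def idf(term):
    """Natural log of (total documents / documents containing the term)."""
    doc_freq = sum(1 for words in tokenized.values() if term in words)
    return math.log(n_docs / doc_freq)

tfidf_table = {
    name: [round(tf(term, words) * idf(term), 3) for term in vocab]
    for name, words in tokenized.items()
}

for name, row in tfidf_table.items():
    print(name, row)
```

Each printed row is in vocabulary order (comment, cricket, give, people, watch), rounded to three decimals.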

Python Code — TF-IDF from the Bootcamp Notebook

# tfidf_demo.py
# TF-IDF — from encoding.ipynb (Class 04)
# Run: uv run tfidf_demo.py

from sklearn.feature_extraction.text import TfidfVectorizer

# ── Dataset ──────────────────────────────────────────────────────
documents = [
    "people watch movie and watch movie again",
    "people watch cricket and watch cricket",
    "people like movie and like movie a lot",
    "people like cricket"
]

# ── Step 1: Create TF-IDF vectorizer ─────────────────────────────
tf_idf = TfidfVectorizer()

# ── Step 2: Fit and transform in one step ─────────────────────────
# This learns vocabulary + computes TF-IDF for every word in every document
tf_idf_vector = tf_idf.fit_transform(documents)

# ── Step 3: Check what was returned ───────────────────────────────
print("Type:", type(tf_idf_vector))
print("Shape:", tf_idf_vector.shape)
# Output: Shape: (4, 8), i.e. 4 documents and 8 unique words

# ── Step 4: View the vocabulary ───────────────────────────────────
print("\nVocabulary:", tf_idf.get_feature_names_out())
# Output: ['again' 'and' 'cricket' 'like' 'lot' 'movie' 'people' 'watch']

# ── Step 5: View TF-IDF matrix as array ───────────────────────────
print("\nTF-IDF Matrix (full):")
print(tf_idf_vector.toarray())

# ── Step 6: View per document — easier to read ────────────────────
print("\nPer-document TF-IDF vectors:")
vocab = tf_idf.get_feature_names_out()
print(f"{'Vocab':<10}", "  ".join(f"{w:<8}" for w in vocab))
print("-" * 70)
for i, doc in enumerate(documents):
    values = tf_idf_vector.toarray()[i]
    print(f"D{i+1:<9}", "  ".join(f"{v:<8.3f}" for v in values))

# ── Step 7: Interpretation ────────────────────────────────────────
print("\n--- Interpretation ---")
print("Higher TF-IDF = word is important in this doc but rare across all docs")
print("Lower TF-IDF  = word appears in many docs (less distinctive)")
print("Zero TF-IDF   = word not present in this document")

Run it:

uv run tfidf_demo.py

Expected output:

Type: <class 'scipy.sparse._csr.csr_matrix'>
Shape: (4, 8)

Vocabulary: ['again' 'and' 'cricket' 'like' 'lot' 'movie' 'people' 'watch']

TF-IDF Matrix (full):
[[0.388 0.247 0.     0.     0.     0.611 0.202 0.611]
 [0.     0.268 0.663 0.     0.     0.     0.219 0.663]
 [0.     0.247 0.     0.611 0.388 0.611 0.202 0.    ]
 [0.     0.     0.640 0.640 0.     0.     0.424 0.    ]]

Per-document TF-IDF vectors:
Vocab      again     and       cricket   like      lot       movie     people    watch
----------------------------------------------------------------------
D1         0.388     0.247     0.000     0.000     0.000     0.611     0.202     0.611
D2         0.000     0.268     0.663     0.000     0.000     0.000     0.219     0.663
D3         0.000     0.247     0.000     0.611     0.388     0.611     0.202     0.000
D4         0.000     0.000     0.640     0.640     0.000     0.000     0.424     0.000

--- Interpretation ---
Higher TF-IDF = word is important in this doc but rare across all docs
Lower TF-IDF  = word appears in many docs (less distinctive)
Zero TF-IDF   = word not present in this document

💡 Reading the TF-IDF output: Notice that “people” scores 0.202–0.424 across all documents because it appears in every document (low IDF). Meanwhile “again” scores 0.388 in D1 only, because it appears in just that one document (high IDF = high distinctiveness). “people” is like a stop word — present everywhere, low signal. “again” is a distinctive word — only in D1.
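One detail worth knowing: the numbers from TfidfVectorizer do not follow the textbook formula exactly. With its default settings, scikit-learn uses the raw count as TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and then scales each document vector to unit length (L2 norm). The sketch below rebuilds the matrix by hand under those assumptions and compares it with the library output:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "people watch movie and watch movie again",
    "people watch cricket and watch cricket",
    "people like movie and like movie a lot",
    "people like cricket",
]

# Raw term counts per document (this is the TF that sklearn uses by default).
counts = CountVectorizer().fit_transform(documents).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each word occurs.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF, matching the default smooth_idf=True.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then L2-normalise each row (default norm="l2").
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_result = TfidfVectorizer().fit_transform(documents).toarray()
print(np.allclose(manual, sklearn_result))  # True
print(manual.round(3))
```

The "+ 1" in the IDF is why “people”, which appears in every document, still gets a non-zero score instead of log(1) = 0.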

TF-IDF Pros and Cons

✅ Pros	❌ Cons
Captures word importance, not just count. Rare words → higher weight; common words → lower weight	Still ignores word order — same problem as BoW
Reduces impact of common words (stop words) automatically via IDF	No semantic understanding — “car” and “automobile” are still unrelated
Better than BoW for search engines, information retrieval, document ranking	Still sparse and high-dimensional (vocabulary size = vector size)
No training required — direct calculation	Doesn’t handle context — “bank” (river vs finance) gets the same vector everywhere
Simple and interpretable — TF × IDF = clear logic anyone can verify	OOV problem — new words not in training vocabulary are ignored

The final verdict on TF-IDF from the class notes: “TF-IDF improves by adding importance but still fails to understand meaning and context.” This limitation is what motivated the shift to embeddings.

6. Why OHE / BoW / TF-IDF All Fail for GenAI

We close with a clean demonstration of the fundamental failure of all three classical techniques. This is the “aha” moment that explains why embeddings were invented:

Sentence 1: "I like this movie"
Sentence 2: "I love this film"

These two sentences mean almost the same thing.

OHE / BoW / TF-IDF result:
  "like" ≠ "love"   → treated as completely different
  "movie" ≠ "film"  → treated as completely different
  → Model thinks Sentence 1 and Sentence 2 are UNRELATED

Embeddings result:
  "like" ≈ "love"   → numerically close vectors
  "movie" ≈ "film"  → numerically close vectors
  → Model understands these sentences are SIMILAR

All three classical techniques have the same fundamental problem: they are statistical representations — they count and weight words, but they have no understanding of what words mean or how they relate to each other. The number they assign to “like” has no mathematical relationship to the number for “love.”
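To put a number on this, here is a small sketch that builds count vectors for the two sentences and measures their cosine similarity. It uses plain whitespace tokenisation (an assumption; sklearn's tokenizer would also drop the one-letter “i”), and the helper names are illustrative:

```python
import math
from collections import Counter

s1 = "I like this movie".lower().split()
s2 = "I love this film".lower().split()
vocab = sorted(set(s1) | set(s2))

def count_vector(words):
    """Bag-of-words count vector over the shared vocabulary."""
    counts = Counter(words)
    return [counts[term] for term in vocab]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

v1, v2 = count_vector(s1), count_vector(s2)
bow_similarity = cosine(v1, v2)

print(vocab)                     # ['film', 'i', 'like', 'love', 'movie', 'this']
print(v1, v2)                    # [0, 1, 1, 0, 1, 1] [1, 1, 0, 1, 0, 1]
print(round(bow_similarity, 2))  # 0.5
```

The only overlap comes from the filler words “i” and “this”. The words that carry the meaning (like/love, movie/film) contribute nothing to the score.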

Here’s the summary of all their shared failures:

Failure	OHE	BoW	TF-IDF	Embeddings
No semantic understanding (like ≠ love)	❌	❌	❌	✅ Fixed
Ignores word order (dog bites man = man bites dog)	❌	❌	❌	✅ Fixed (Transformers)
High dimensionality (50K vocab → 50K vector)	❌	❌	❌	✅ Fixed (dense, compact)
Sparse (mostly zeros)	❌	❌	❌	✅ Fixed (dense vectors)
OOV — can’t handle new words	❌	❌	❌	✅ Largely fixed
No context (“bank” = river or finance?)	❌	❌	❌	✅ Fixed (contextual embeddings)

7. Embeddings — The Fix That Powers LLMs

Embeddings answer the question OHE/BoW/TF-IDF could never answer: “What is the meaning of this text in context?”

Instead of a sparse binary or count vector, an embedding is a dense vector of decimal numbers (typically a few hundred to a few thousand dimensions; 768 to 3072 is common for Transformer-based models) learned by training a neural network on massive amounts of text. Words with similar meanings end up with numerically similar vectors.

# Conceptual illustration — actual values differ
# Classical encoding:
"like"   → [0, 0, 0, 1, 0, 0, 0, 0]  ← OHE, one position only
"love"   → [0, 0, 0, 0, 1, 0, 0, 0]  ← completely different position
# Math distance between them: far apart → model sees them as unrelated

# Embeddings:
"like"   → [0.21, -0.45, 0.83, 0.12, ...]  ← dense, 768+ numbers
"love"   → [0.19, -0.43, 0.87, 0.14, ...]  ← very similar numbers!
# Math distance (cosine similarity): very close → model knows they're related

# Even more powerful:
"Oracle" → [0.55, 0.12, -0.33, ...]   ← database context
"PostgreSQL" → [0.53, 0.14, -0.31, ...] ← numerically close!
# The model learned that Oracle and PostgreSQL are related concepts
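The "math distance" in the illustration is usually cosine similarity. As a sketch, here it is applied to the numbers shown above. The embedding values are the illustrative four-number prefixes from this post, not the output of a real model:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot vectors for "like" and "love" (different positions).
ohe_like = np.array([0, 0, 0, 1, 0, 0, 0, 0])
ohe_love = np.array([0, 0, 0, 0, 1, 0, 0, 0])

# Illustrative embedding values copied from the conceptual example.
emb_like = np.array([0.21, -0.45, 0.83, 0.12])
emb_love = np.array([0.19, -0.43, 0.87, 0.14])

ohe_sim = cosine_similarity(ohe_like, ohe_love)
emb_sim = cosine_similarity(emb_like, emb_love)

print(ohe_sim)            # 0.0
print(round(emb_sim, 3))  # close to 1
```

Any two different one-hot vectors always score 0, no matter how related the words are, because they never share a non-zero position.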

The evolution path from class notes:

OHE / BoW / TF-IDF  →  Word2Vec  →  Transformer-based models
(2012–2015)             (first real    (BERT, GPT, all modern LLMs)
                        embeddings)

Word2Vec was the breakthrough — it proved that word meaning could be 
captured mathematically. Transformers then made embeddings contextual:
the same word gets different embeddings depending on the surrounding text.

✅ Why this matters for DBAs building RAG systems: When you store documents in a vector database like ChromaDB, pgvector, or Pinecone, you are storing embedding vectors — not OHE or BoW vectors. The search (“find me documents similar to this query”) works by finding vectors that are mathematically close to the query vector. This is called semantic search — and it only works because embeddings capture meaning. We’ll build this in Post 4.
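Conceptually, the search step is a ranking by similarity. The sketch below shows the idea with tiny 3-dimensional vectors that are invented purely for illustration; a real system would get its vectors from an embedding model and use a vector database rather than a Python dict:

```python
import numpy as np

# Toy vectors, made up for this example only (real embeddings have hundreds of dimensions).
doc_vectors = {
    "Tuning a slow Oracle query": np.array([0.9, 0.1, 0.0]),
    "PostgreSQL index maintenance": np.array([0.8, 0.3, 0.1]),
    "Best cricket matches of the year": np.array([0.0, 0.2, 0.9]),
}

# Stands in for the embedding of a database-related query.
query_vector = np.array([0.85, 0.2, 0.05])

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rank documents from most to least similar to the query.
ranked = sorted(
    doc_vectors,
    key=lambda title: cosine(query_vector, doc_vectors[title]),
    reverse=True,
)

for title in ranked:
    print(round(cosine(query_vector, doc_vectors[title]), 3), title)
```

With these toy values the two database documents score far above the cricket one, even though the ranking never compares a single keyword.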

8. Full Comparison Table (Interview Reference)

Feature	OHE	Bag of Words	TF-IDF	Embeddings
What it captures	Word presence (0/1)	Word frequency (count)	Word importance (TF×IDF score)	Word meaning + context (dense vector)
Vector type	Sparse binary	Sparse integer	Sparse float	Dense float (768–3072 dims)
Vector size	= vocabulary size	= vocabulary size	= vocabulary size	Fixed (e.g. 768) — not vocabulary-dependent
Training required?	No	No	No	Yes (pre-trained models available)
Word order?	Ignored	Ignored	Ignored	Captured (Transformers)
Semantic understanding?	❌ None	❌ None	❌ None	✅ Full
OOV problem?	❌ Yes	❌ Yes	❌ Yes	✅ Largely solved
Library / class	OneHotEncoder	CountVectorizer	TfidfVectorizer	sentence-transformers / OpenAI API
Best used for	Simple categorical features, interview demos	Text classification, spam detection (classical)	Search engines, document ranking, keyword relevance	Semantic search, RAG, LLMs — everything modern
Era	2012–2015	2014–2016	2014–2016	2013 (Word2Vec) → 2017+ (Transformers)

9. Common Errors and Fixes

Error 1: NotFittedError on OneHotEncoder

Error (actual error from the bootcamp notebook):

NotFittedError: This OneHotEncoder instance is not fitted yet.
Call 'fit' with appropriate arguments before using this estimator.

Cause: You called encoder.transform() before calling encoder.fit(). This is the most common sklearn mistake — every encoder needs to see the data first (fit) before it can convert new data (transform).

Fix:

# WRONG — transform before fit
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.transform(all_words)  # ❌ NotFittedError

# RIGHT — fit first, then transform
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(all_words)           # ✅ Learn vocabulary first
encoded = encoder.transform(all_words)  # ✅ Now encode

# OR use fit_transform() which does both in one call
encoded = encoder.fit_transform(all_words)  # ✅ Equivalent

DBA Analogy: fit() is like running ANALYZE TABLE to gather statistics. transform() is using those statistics. You can’t use statistics you haven’t gathered yet.

Error 2: ValueError — unknown categories in OneHotEncoder

ValueError: Found unknown categories ['enjoy'] in column 0 during transform

Cause: You’re trying to encode a word that wasn’t in the training vocabulary (the OOV problem).

Fix:

# Tell encoder to ignore unknown words instead of crashing
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(all_words)
# Now unknown words → all-zero row instead of error

Error 3: sparse_output parameter name changed in sklearn

TypeError: OneHotEncoder.__init__() got an unexpected keyword argument 'sparse'

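Cause: The parameter was renamed from sparse to sparse_output in sklearn 1.2, and the old name was removed in 1.4. On sklearn 1.4+ (the version used in this lab), passing sparse=False raises this TypeError.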

Fix:

# Old sklearn (before 1.2):
encoder = OneHotEncoder(sparse=False)

# New sklearn 1.2+ (use this):
encoder = OneHotEncoder(sparse_output=False)
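If the same script has to run on machines with different scikit-learn versions, one possible workaround is to try the new parameter name first and fall back to the old one. This is a sketch, not something from the bootcamp notebook, and the helper name make_dense_encoder is hypothetical:

```python
from sklearn.preprocessing import OneHotEncoder

def make_dense_encoder(**kwargs):
    """Build a OneHotEncoder that returns dense arrays on old and new scikit-learn."""
    try:
        # scikit-learn 1.2 and later
        return OneHotEncoder(sparse_output=False, **kwargs)
    except TypeError:
        # scikit-learn before 1.2 only knows the old parameter name
        return OneHotEncoder(sparse=False, **kwargs)

encoder = make_dense_encoder()
dense = encoder.fit_transform([["ai"], ["love"], ["ai"]])
print(type(dense).__name__, dense.shape)  # ndarray (3, 2)
```

Pinning the scikit-learn version in the project (as the uv setup from Post 1 does) is usually the simpler option; the fallback is only useful when that is not possible.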

Error 4: BoW transform on new document returns all zeros

Symptom: You transform a new document and get a zero vector. Not an error — but confusing.

Cause: This is expected OOV behaviour — none of the words in the new document exist in the training vocabulary. This is a design limitation of BoW, not a bug.

new_doc = ["lion is the king of jungle"]
# "lion", "king", "jungle" → not in training vocab
# Result: [[0 0 0 0 0 0 0 0]]  — all zeros, silent failure

# Fix: Use embeddings instead of BoW for production systems
# Embeddings handle unseen words through subword tokenization

10. Key Takeaways

✅ What you learned in this post:

All ML and AI models require text to be converted to numbers before processing. This is called text encoding.
OHE answers “is the word present?” — binary 0/1 vector per word in vocabulary. Simple but very sparse and no meaning.
Bag of Words answers “how many times?” — word count vectors. Better than OHE but still ignores order and meaning.
TF-IDF answers “how important is this word?” — weights by rarity across the corpus. Better than BoW for search and ranking, but still no semantic understanding.
All three share the same 6 critical failures: no word order, no semantics, sparse, high-dimensional, OOV problem, no context.
Embeddings solve all 6 problems — dense vectors of decimal numbers learned from training data, where similar words are numerically close. This is what powers every modern LLM and vector search system.
The sklearn pattern is always: create → fit() → transform(). Or fit_transform() which does both.
The NotFittedError is the most common sklearn mistake — always call fit() before transform().

11. What’s Next

Now that you understand why classical encoding fails, Post 4 builds the solution. We go hands-on with embeddings:

Post 4 — Word2Vec, Embeddings and Semantic Search — Build it in Python
What Word2Vec is · How embeddings capture meaning · Cosine similarity · Build a semantic search system from scratch · Compare keyword search vs semantic search · Use OpenAI embeddings API

#	Post	Status
1	What is GenAI? + UV Setup	✅ Published
2	AI Roadmap + 30 Tools + GitHub Copilot Setup	✅ Published
3	OHE, BoW, TF-IDF and Embeddings — this post	📍 You are here
4	Word2Vec, Embeddings and Semantic Search — hands-on Python	⬜ Next Friday
5	Prompt Engineering — Zero to Advanced (DBA Edition)	⬜ Coming soon

👉 Next Post: Word2Vec, Embeddings and Semantic Search — Build it in Python

Part of the GenAI from Scratch series for DBAs and Infrastructure Engineers. Published every Friday at gradeupnow.in/genai-blog/