GenAI from Scratch: Post 4 of 24
Table of Contents
- Why AI Can’t Read Text Directly
- The 4 Text Encoding Techniques: Overview
- One Hot Encoding (OHE): Theory + Python Code
- Bag of Words (BoW): Theory + Python Code
- TF-IDF: Theory + Python Code
- Why OHE / BoW / TF-IDF All Fail for GenAI
- Embeddings: The Fix That Powers LLMs
- Full Comparison Table
- Install and Setup
- Common Errors and Fixes
- Key Takeaways
- What’s Next
Before an LLM can answer your question about a slow Oracle query, it has to do something that seems impossible: convert your English text into numbers. Computers don’t understand words. They understand numbers. So the first problem every NLP and GenAI system must solve is: how do you turn text into numbers, and do it in a way that preserves meaning?
This is Post 4 of the GenAI from Scratch series. We cover the evolution of text encoding, from the simplest approach (One Hot Encoding) through to the modern approach (Embeddings) that powers ChatGPT. Every concept includes working Python code that you can run in VS Code right now.
What you’ll learn:
- Why text must be converted to numbers before AI can process it
- One Hot Encoding (OHE): what it is, how to code it, pros and cons
- Bag of Words (BoW): how word counting works in ML, with Python
- TF-IDF: how to measure word importance, not just word count
- Why all three fail for GenAI, and what Embeddings solve
- The complete comparison table for interviews and production decisions
Lab Validated: All code in this post is taken directly from encoding.ipynb, the actual bootcamp notebook. Tested in VS Code with Python 3.12 and scikit-learn 1.4+.
Prerequisites
- Posts 1 and 2 completed: UV installed, VS Code set up
- Your genai-bootcamp project folder from Post 1
- Install the required packages (one command):
uv add scikit-learn numpy
Lab Environment
| Component | Version |
|---|---|
| Python | 3.12 |
| scikit-learn | 1.4+ |
| numpy | 1.26+ |
| IDE | VS Code 1.88+ |
| Notebook | encoding.ipynb |
1. Why AI Can’t Read Text Directly
This is the foundational question. Every ML and AI model, whether it’s a simple Naive Bayes classifier or a billion-parameter LLM, operates entirely on numbers. Text is meaningless to a computer unless it has first been converted to a numerical representation.
Text (what humans write)
↓
Must be encoded
↓
Numbers / Vectors (what AI processes)
↓
ML / DL / LLM
↓
Prediction / Generation
DBA Analogy: Text Encoding = Character Set Conversion
You already deal with this concept every day. Oracle stores every VARCHAR2 as bytes using a character set (AL32UTF8, WE8ISO8859P1, etc.). The word “INDEX” is not stored as letters; it’s stored as byte values: 73, 78, 68, 69, 88. Text encoding for AI is the same principle at a higher level: convert human language into a numerical format that the model can do math on.
The challenge is not just converting text to numbers. The challenge is doing it in a way that preserves meaning. A character set doesn’t care that “database” and “datastore” are related concepts. AI encoding has to.
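The byte values in the analogy are easy to check yourself. This is a minimal sketch in plain Python (no Oracle involved); the second pair of words is just an illustrative choice to show that byte values follow spelling, not meaning.

```python
# Encode the word from the analogy and list its byte values.
word = "INDEX"
byte_values = list(word.encode("utf-8"))
print(byte_values)  # [73, 78, 68, 69, 88]

# Byte values reflect spelling only. Two words with the same meaning
# get completely different byte sequences.
print(list("car".encode("utf-8")))
print(list("automobile".encode("utf-8")))
```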
There are four main techniques, and they represent a historical evolution of increasing sophistication:
2. The 4 Text Encoding Techniques: Overview
| # | Technique | The question it answers | Era | Understands meaning? |
|---|---|---|---|---|
| 1 | One Hot Encoding (OHE) | “Is the word present or not?” (0 or 1) | 2012-2015 | No |
| 2 | Bag of Words (BoW) | “How many times does the word appear?” | 2014-2016 | No |
| 3 | TF-IDF | “How important is this word in this document?” | 2014-2016 | No (importance only) |
| 4 | Embeddings | “What is the meaning of this text in context?” | Word2Vec → Transformers | Yes |
The first three are classical NLP encoding techniques. They’re still important to understand because they explain exactly why modern embeddings were invented and what problems they solved. They’re also heavily tested in interviews: the handwritten notes explicitly mark them as “Ask Interview.”
3. One Hot Encoding (OHE): Theory + Python Code
The Concept
One Hot Encoding answers exactly one question: “Is this word present in the document, or not?” For a single word, the output is a binary vector: all zeros except for a single 1 at that word’s position in the vocabulary. For a whole document, the word vectors are combined into one vector with a 1 for every vocabulary word the document contains.
Here’s the exact example:
Data (4 documents):
D1: people watch movie
D2: people watch cricket
D3: people like movie
D4: people like cricket
Step 1: Build the vocabulary (the unique words in the data; the order below fixes each word’s position in the vector):
Vocabulary = {people, watch, like, movie, cricket}
Vector dimension = 5 (one position per unique word)
Step 2: Encode each document (1 = word present, 0 = word absent):
| Document | people | watch | like | movie | cricket | OHE Vector |
|---|---|---|---|---|---|---|
| D1: people watch movie | 1 | 1 | 0 | 1 | 0 | [1, 1, 0, 1, 0] |
| D2: people watch cricket | 1 | 1 | 0 | 0 | 1 | [1, 1, 0, 0, 1] |
| D3: people like movie | 1 | 0 | 1 | 1 | 0 | [1, 0, 1, 1, 0] |
| D4: people like cricket | 1 | 0 | 1 | 0 | 1 | [1, 0, 1, 0, 1] |
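The table above can be reproduced in a few lines of plain Python. This is a sketch, not the notebook's code: the vocabulary order is hard-coded to match the table columns, and `presence_vectors` is a name chosen here for illustration.

```python
documents = [
    "people watch movie",
    "people watch cricket",
    "people like movie",
    "people like cricket",
]

# Vocabulary in the same column order as the table.
vocabulary = ["people", "watch", "like", "movie", "cricket"]

# For each document: 1 if the vocabulary word occurs in it, else 0.
presence_vectors = [
    [1 if word in doc.split() else 0 for word in vocabulary]
    for doc in documents
]

for i, vector in enumerate(presence_vectors, start=1):
    print(f"D{i}: {vector}")
```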
DBA Analogy: OHE = Bitmap Index
A bitmap index in Oracle stores exactly this: for each unique value in a column, a bit vector showing which rows contain that value (1) and which don’t (0). One Hot Encoding is a bitmap index for words in a document. The vocabulary is your indexed column, the documents are your rows, and the 0s and 1s are the bitmap.
Python Code: OHE
# ohe_demo.py
# One Hot Encoding, from encoding.ipynb (Class 04)
# Run: uv run ohe_demo.py
from sklearn.preprocessing import OneHotEncoder

# --- Step 1: Define your document ---
document = ["my name is sunny and I love AI"]

# --- Step 2: Tokenise (split into individual words) ---
# lower() ensures "AI" and "ai" are treated as the same word
tokens = [sentence.lower().split() for sentence in document]
print("Tokens:", tokens)
# Output: [['my', 'name', 'is', 'sunny', 'and', 'i', 'love', 'ai']]

# --- Step 3: Reshape for sklearn, which needs [[word], [word], ...] ---
# Each word must be its own row for the encoder
all_words = [[word] for sentence in tokens for word in sentence]
print("Words formatted for encoder:", all_words)
# Output: [['my'], ['name'], ['is'], ['sunny'], ['and'], ['i'], ['love'], ['ai']]

# --- Step 4: Create and FIT the encoder ---
# sparse_output=False returns a NumPy array instead of a sparse matrix,
# which is easier to read and print
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(all_words)  # Learn the vocabulary from the data

# --- Step 5: Check the vocabulary learned ---
print("\nVocabulary learned:", encoder.categories_[0])
# Output: ['ai' 'and' 'i' 'is' 'love' 'my' 'name' 'sunny']
# Note: sorted alphabetically, which is sklearn's default behaviour

# --- Step 6: Encode each sentence ---
for sentence in tokens:
    encoded = encoder.transform([[word] for word in sentence])
    print(f"\nSentence: {sentence}")
    print(f"OHE matrix shape: {encoded.shape}")  # (num_words, vocab_size)
    print(encoded)
Run it:
uv run ohe_demo.py
Expected output:
Tokens: [['my', 'name', 'is', 'sunny', 'and', 'i', 'love', 'ai']]
Words formatted for encoder: [['my'], ['name'], ['is'], ...]
Vocabulary learned: ['ai' 'and' 'i' 'is' 'love' 'my' 'name' 'sunny']
Sentence: ['my', 'name', 'is', 'sunny', 'and', 'i', 'love', 'ai']
OHE matrix shape: (8, 8)
[[0. 0. 0. 0. 0. 1. 0. 0.]   ← 'my': position 5 is 1
[0. 0. 0. 0. 0. 0. 1. 0.]   ← 'name': position 6 is 1
[0. 0. 0. 1. 0. 0. 0. 0.]   ← 'is': position 3 is 1
[0. 0. 0. 0. 0. 0. 0. 1.]   ← 'sunny': position 7 is 1
[0. 1. 0. 0. 0. 0. 0. 0.]   ← 'and': position 1 is 1
[0. 0. 1. 0. 0. 0. 0. 0.]   ← 'i': position 2 is 1
[0. 0. 0. 0. 1. 0. 0. 0.]   ← 'love': position 4 is 1
[1. 0. 0. 0. 0. 0. 0. 0.]]  ← 'ai': position 0 is 1
OHE Pros and Cons (Interview Notes)
| Pros | Cons |
|---|---|
| Easy to implement | Sparse matrix: mostly zeros, huge memory waste. With a vocabulary of 10,000 words, a three-word document becomes a vector with 9,997 zeros |
| Simple binary representation (0 and 1) | High dimensionality: vector size grows with vocabulary size |
| No training required, direct mapping | No semantic understanding: “like” and “love” are completely unrelated in OHE |
| Direct mapping from words to numbers | OOV problem: if a new word “enjoy” appears in D5 but wasn’t in the training vocabulary, OHE cannot handle it. The word is unknown. |
The OOV Problem (Critical to Understand):
If your vocabulary was built from D1-D4 and a new document D5 contains the word “enjoy”, OHE has no idea what to do with it. “enjoy” is not in the vocabulary. This is called Out Of Vocabulary (OOV). It’s a fundamental limitation of all classical encoding methods. Modern embedding models largely solve this by splitting unseen words into known subword pieces.
Training vocab: {people, watch, like, movie, cricket}
D5 = "people enjoy movie"
"enjoy" → NOT in vocabulary → OHE cannot encode it
4. Bag of Words (BoW): Theory + Python Code
The Concept
Bag of Words answers: “How many times does each word appear?” Instead of a binary 0/1, it stores the actual word count. The assumption: if a word repeats more often in a sentence, it is more important.
Example from the class notes:
D1: people watch movie and watch movie again
D2: people watch cricket and watch cricket
D3: people like movie and like movie a lot
D4: people like cricket
Vocabulary (8 unique words, sorted; the single-letter word “a” in D3 is dropped because CountVectorizer’s default tokenizer keeps only tokens of two or more characters):
['again', 'and', 'cricket', 'like', 'lot', 'movie', 'people', 'watch']
Vector dimension = 8
BoW matrix (count of each word per document):
| Doc | again | and | cricket | like | lot | movie | people | watch |
|---|---|---|---|---|---|---|---|---|
| D1 | 1 | 1 | 0 | 0 | 0 | 2 | 1 | 2 |
| D2 | 0 | 1 | 2 | 0 | 0 | 0 | 1 | 2 |
| D3 | 0 | 1 | 0 | 2 | 1 | 2 | 1 | 0 |
| D4 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
D1 vector: [1, 1, 0, 0, 0, 2, 1, 2]. “movie” and “watch” appear twice, so they score 2.
DBA Analogy: BoW = COUNT(*) GROUP BY word
BoW is literally a SELECT word, COUNT(*) FROM tokenized_document GROUP BY word. It counts word frequency and turns the results into a fixed-length vector. The vocabulary is your lookup table, and each document becomes one row with a count for each word. No magic, just counting.
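To make the "just counting" point concrete, here is a sketch of BoW using only the standard library. It is not how scikit-learn implements it internally; the `tokenize` helper is a hypothetical name, and its regex is an assumption chosen to mimic CountVectorizer's default behaviour (lowercase, tokens of two or more word characters).

```python
import re
from collections import Counter

documents = [
    "people watch movie and watch movie again",
    "people watch cricket and watch cricket",
    "people like movie and like movie a lot",
    "people like cricket",
]

def tokenize(text):
    # Lowercase, then keep tokens with at least two word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Sorted set of all tokens across the corpus.
vocabulary = sorted({token for doc in documents for token in tokenize(doc)})

# One count vector per document: the GROUP BY / COUNT(*) step.
bow_matrix = []
for doc in documents:
    counts = Counter(tokenize(doc))
    bow_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```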
Python Code: BoW from the Bootcamp Notebook
# bow_demo.py
# Bag of Words, from encoding.ipynb (Class 04)
# sklearn docs: scikit-learn.org/0.15/modules/generated/
# sklearn.feature_extraction.text.CountVectorizer.html
# Run: uv run bow_demo.py
from sklearn.feature_extraction.text import CountVectorizer

# --- Dataset (same as class notes) ---
documents = [
    "people watch movie and watch movie again",
    "people watch cricket and watch cricket",
    "people like movie and like movie a lot",
    "people like cricket",
]

# --- Step 1: Create the CountVectorizer (BoW) ---
bow = CountVectorizer()

# --- Step 2: fit_transform learns the vocabulary AND encodes in one step ---
# fit()       learns the vocabulary from all documents
# transform() converts documents to count vectors
bow_matrix = bow.fit_transform(documents)

# --- Step 3: Inspect the vocabulary ---
print("Vocabulary:", bow.get_feature_names_out())
# Output: ['again' 'and' 'cricket' 'like' 'lot' 'movie' 'people' 'watch']

# --- Step 4: View the full BoW matrix ---
print("\nBoW Matrix (full):")
print(bow_matrix.toarray())

# --- Step 5: View per document ---
print("\nPer-document vectors:")
vocab = bow.get_feature_names_out()
for i, doc in enumerate(documents):
    print(f"\nD{i+1}: '{doc}'")
    print(f"Vector: {bow_matrix.toarray()[i]}")

# --- Step 6: OOV demo (what happens with unknown words) ---
# None of the words in this sentence are in the training vocabulary
new_doc = ["lion is the king of jungle"]
new_vector = bow.transform(new_doc)
print("\n--- OOV Demo ---")
print(f"New doc: {new_doc[0]}")
print(f"Vector: {new_vector.toarray()}")
# Output: [[0 0 0 0 0 0 0 0]]
# ALL zeros: no words in this sentence existed in the vocabulary
print("Result: ALL zeros - BoW has no idea what this sentence means!")
Run it:
uv run bow_demo.py
Expected output:
Vocabulary: ['again' 'and' 'cricket' 'like' 'lot' 'movie' 'people' 'watch']
BoW Matrix (full):
[[1 1 0 0 0 2 1 2]
[0 1 2 0 0 0 1 2]
[0 1 0 2 1 2 1 0]
[0 0 1 1 0 0 1 0]]
Per-document vectors:
D1: 'people watch movie and watch movie again'
Vector: [1 1 0 0 0 2 1 2]
D2: 'people watch cricket and watch cricket'
Vector: [0 1 2 0 0 0 1 2]
D3: 'people like movie and like movie a lot'
Vector: [0 1 0 2 1 2 1 0]
D4: 'people like cricket'
Vector: [0 0 1 1 0 0 1 0]
--- OOV Demo ---
New doc: lion is the king of jungle
Vector: [[0 0 0 0 0 0 0 0]]
Result: ALL zeros - BoW has no idea what this sentence means!
BoW Pros and Cons
| Pros | Cons |
|---|---|
| Easy to implement, very simple logic | Ignores word order: “dog bites man” and “man bites dog” produce identical vectors: {dog, bites, man} → [1,1,1] |
| Captures word frequency: more occurrences = higher importance | No semantic understanding: “like” and “love” are treated as completely different words with zero relation |
| Works well for classical NLP: text classification, spam detection, sentiment analysis (basic) | High dimensionality: a vocabulary of 50,000 words means 50,000-dimension vectors, mostly zeros |
| No training required: direct conversion from text to numbers | Sparse representation: memory-inefficient, computation-wasteful |
| Used successfully 2014-2016 with Naive Bayes and RNNs/LSTMs | OOV problem: new words not in training vocabulary are silently ignored |
| | Overemphasizes frequent words: “movie movie movie” gets a high count even if it’s meaningless repetition |
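The word-order limitation from the table is worth seeing once in code. This is a small standard-library sketch; `bow_vector` is a hypothetical helper name and the vocabulary order is the one used in the table row.

```python
from collections import Counter

def bow_vector(sentence, vocabulary):
    # Count each vocabulary word in the sentence; position in the
    # sentence plays no role.
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["dog", "bites", "man"]
v1 = bow_vector("dog bites man", vocabulary)
v2 = bow_vector("man bites dog", vocabulary)

print(v1, v2, v1 == v2)
```

Both sentences map to the same vector, so any model fed these vectors cannot tell who bit whom.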
5. TF-IDF: Theory + Python Code
The Concept
BoW has a critical problem: common words like “and”, “the”, “people” appear in every document and get high counts, but they carry no meaningful signal. TF-IDF fixes this by weighting words by their importance, not just their count.
TF-IDF stands for Term Frequency × Inverse Document Frequency.
$$\text{TF-IDF}(\text{word}, D) = \text{TF}(\text{word}, D) \times \text{IDF}(\text{word})$$
$$\text{TF}(\text{word}, D) = \frac{\text{occurrences of word in } D}{\text{total words in } D}$$
$$\text{IDF}(\text{word}) = \log\left(\frac{\text{total number of documents}}{\text{number of documents containing word}}\right)$$
Why TF? More occurrences in a document = more important in that document.
Why IDF? Common words appear in many documents, so they should get lower weight.
Why multiply TF × IDF? TF = importance within document. IDF = importance across corpus.
Why use log in IDF? Without log, a rare word in 1 out of 1000 documents gets IDF=1000. A common word in 500 documents gets IDF=2. The difference (1000 vs 2) is extreme and makes the model unstable. The natural log compresses this to 6.9 vs 0.69, which is reasonable and balanced.
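The compression effect of the log is quick to verify. A minimal sketch using the natural log (`math.log`), with the 1000-document corpus from the paragraph above; `idf_values` is just an illustrative name.

```python
import math

total_docs = 1000

# Map "number of documents containing the word" to (raw ratio, log ratio).
idf_values = {}
for docs_with_word in (1, 500):
    ratio = total_docs / docs_with_word
    idf_values[docs_with_word] = (ratio, round(math.log(ratio), 2))

for docs_with_word, (ratio, log_ratio) in idf_values.items():
    print(f"in {docs_with_word} docs: ratio={ratio:.0f}, log={log_ratio}")
```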
TF-IDF Worked Example
Documents:
D1 β people watch cricket
D2 β cricket watch cricket
D3 β people give comment
D4 β cricket give comment
Vocabulary: ['comment', 'cricket', 'give', 'people', 'watch']
Total documents (N) = 4
Computing TF-IDF for “cricket” in D1:
TF("cricket", D1) = 1/3 (appears once, 3 total words)
IDF("cricket") = log(4/3) (appears in 3 of 4 documents)
≈ 0.288
TF-IDF = (1/3) × log(4/3) ≈ 0.096
The final TF-IDF matrix from the class notes (computed values):
| Doc | comment | cricket | give | people | watch |
|---|---|---|---|---|---|
| D1 | 0 | 0.096 | 0 | 0.231 | 0.231 |
| D2 | 0 | 0.192 | 0 | 0 | 0.231 |
| D3 | 0.231 | 0 | 0.231 | 0.231 | 0 |
| D4 | 0.231 | 0.096 | 0.231 | 0 | 0 |
Notice: “cricket” appears in 3 documents (D1, D2, D4) so it gets lower weight (0.096, 0.192) compared to “comment”, which appears in only 2 documents (0.231). Rarer words get higher TF-IDF scores.
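The worked example can be recomputed directly from the TF and IDF definitions. This is a sketch of the textbook formula with the natural log, not scikit-learn's variant; the `tf` and `idf` helper names are chosen here for illustration.

```python
import math

documents = {
    "D1": "people watch cricket",
    "D2": "cricket watch cricket",
    "D3": "people give comment",
    "D4": "cricket give comment",
}

tokenized = {name: text.split() for name, text in documents.items()}
vocabulary = sorted({word for tokens in tokenized.values() for word in tokens})
n_docs = len(tokenized)

def tf(word, tokens):
    # Share of the document's words that are this word.
    return tokens.count(word) / len(tokens)

def idf(word):
    # log(total documents / documents containing the word)
    doc_freq = sum(word in tokens for tokens in tokenized.values())
    return math.log(n_docs / doc_freq)

tfidf = {
    name: {word: tf(word, tokens) * idf(word) for word in vocabulary}
    for name, tokens in tokenized.items()
}

for name, scores in tfidf.items():
    print(name, {word: round(score, 3) for word, score in scores.items()})
```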
Python Code: TF-IDF from the Bootcamp Notebook
# tfidf_demo.py
# TF-IDF, from encoding.ipynb (Class 04)
# Run: uv run tfidf_demo.py
from sklearn.feature_extraction.text import TfidfVectorizer

# --- Dataset ---
documents = [
    "people watch movie and watch movie again",
    "people watch cricket and watch cricket",
    "people like movie and like movie a lot",
    "people like cricket",
]

# --- Step 1: Create TF-IDF vectorizer ---
tf_idf = TfidfVectorizer()

# --- Step 2: Fit and transform in one step ---
# This learns vocabulary + computes TF-IDF for every word in every document
tf_idf_vector = tf_idf.fit_transform(documents)

# --- Step 3: Check what was returned ---
print("Type:", type(tf_idf_vector))
print("Shape:", tf_idf_vector.shape)
# Output: a sparse CSR matrix of shape (4, 8): 4 documents, 8 unique words

# --- Step 4: View the vocabulary ---
print("\nVocabulary:", tf_idf.get_feature_names_out())
# Output: ['again' 'and' 'cricket' 'like' 'lot' 'movie' 'people' 'watch']

# --- Step 5: View TF-IDF matrix as array ---
print("\nTF-IDF Matrix (full):")
print(tf_idf_vector.toarray())

# --- Step 6: View per document (easier to read) ---
print("\nPer-document TF-IDF vectors:")
vocab = tf_idf.get_feature_names_out()
print(f"{'Vocab':<10}", " ".join(f"{w:<8}" for w in vocab))
print("-" * 70)
for i, doc in enumerate(documents):
    values = tf_idf_vector.toarray()[i]
    print(f"D{i+1:<9}", " ".join(f"{v:<8.3f}" for v in values))

# --- Step 7: Interpretation ---
print("\n--- Interpretation ---")
print("Higher TF-IDF = word is important in this doc but rare across all docs")
print("Lower TF-IDF = word appears in many docs (less distinctive)")
print("Zero TF-IDF = word not present in this document")
Run it:
uv run tfidf_demo.py
Expected output (matrix values shown rounded to three decimals):
Type: <class 'scipy.sparse._csr.csr_matrix'>
Shape: (4, 8)
Vocabulary: ['again' 'and' 'cricket' 'like' 'lot' 'movie' 'people' 'watch']
TF-IDF Matrix (full):
[[0.388 0.247 0. 0. 0. 0.611 0.202 0.611]
[0. 0.268 0.663 0. 0. 0. 0.219 0.663]
[0. 0.247 0. 0.611 0.388 0.611 0.202 0. ]
[0. 0. 0.640 0.640 0. 0. 0.424 0. ]]
Per-document TF-IDF vectors:
Vocab again and cricket like lot movie people watch
----------------------------------------------------------------------
D1 0.388 0.247 0.000 0.000 0.000 0.611 0.202 0.611
D2 0.000 0.268 0.663 0.000 0.000 0.000 0.219 0.663
D3 0.000 0.247 0.000 0.611 0.388 0.611 0.202 0.000
D4 0.000 0.000 0.640 0.640 0.000 0.000 0.424 0.000
--- Interpretation ---
Higher TF-IDF = word is important in this doc but rare across all docs
Lower TF-IDF = word appears in many docs (less distinctive)
Zero TF-IDF = word not present in this document
Reading the TF-IDF output: Notice that “people” scores 0.202-0.424 across all documents because it appears in every document (low IDF). Meanwhile “again” scores 0.388 in D1 only, because it appears in just that one document (high IDF = high distinctiveness). “people” is like a stop word: present everywhere, low signal. “again” is a distinctive word: only in D1.
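The numbers in this output do not come from the plain TF × IDF formula in the worked example. With its default settings, TfidfVectorizer uses the raw count as TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and then scales each document vector to unit length (L2 norm). The sketch below reproduces that calculation with the standard library only, so the values can be checked by hand; the `tokenize` regex is an assumption that mimics the vectorizer's default tokenizer, and the variable names are illustrative.

```python
import math
import re
from collections import Counter

documents = [
    "people watch movie and watch movie again",
    "people watch cricket and watch cricket",
    "people like movie and like movie a lot",
    "people like cricket",
]

def tokenize(text):
    # Lowercase, keep tokens of two or more word characters (drops "a").
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(documents)

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
doc_freq = {w: sum(w in tokens for tokens in tokenized) for w in vocabulary}
smooth_idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    weights = [counts[w] * smooth_idf[w] for w in vocabulary]  # raw count x IDF
    norm = math.sqrt(sum(x * x for x in weights))              # L2 length
    rows.append([round(x / norm, 3) for x in weights])

for i, row in enumerate(rows, start=1):
    print(f"D{i}: {row}")
```

The +1 inside the smoothed IDF is why “people” still gets a non-zero score even though it appears in every document.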
TF-IDF Pros and Cons
| Pros | Cons |
|---|---|
| Captures word importance, not just count. Rare words get higher weight; common words get lower weight | Still ignores word order, the same problem as BoW |
| Reduces impact of common words (stop words) automatically via IDF | No semantic understanding: “car” and “automobile” are still unrelated |
| Better than BoW for search engines, information retrieval, document ranking | Still sparse and high-dimensional (vocabulary size = vector size) |
| No training required, direct calculation | Doesn’t handle context: “bank” (river vs finance) gets the same vector everywhere |
| Simple and interpretable: TF × IDF = clear logic anyone can verify | OOV problem: new words not in training vocabulary are ignored |
The final verdict on TF-IDF from the class notes: “TF-IDF improves by adding importance but still fails to understand meaning and context.” This limitation is what motivated the shift to embeddings.
6. Why OHE / BoW / TF-IDF All Fail for GenAI
This section closes with a clean demonstration of the fundamental failure of all three classical techniques. This is the “aha” moment that explains why embeddings were invented:
Sentence 1: "I like this movie"
Sentence 2: "I love this film"
These two sentences mean almost the same thing.
OHE / BoW / TF-IDF result:
"like" ≠ "love" → treated as completely different
"movie" ≠ "film" → treated as completely different
→ Model thinks Sentence 1 and Sentence 2 are UNRELATED
Embeddings result:
"like" ≈ "love" → numerically close vectors
"movie" ≈ "film" → numerically close vectors
→ Model understands these sentences are SIMILAR
All three classical techniques have the same fundamental problem: they are statistical representations. They count and weight words, but they have no understanding of what words mean or how they relate to each other. The number they assign to “like” has no mathematical relationship to the number for “love.”
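A small experiment makes this failure measurable. The sketch below builds count vectors by hand (standard library only) and compares the two sentences with cosine similarity. The third sentence is an assumption added here as a control: it has an unrelated meaning but shares the same filler words.

```python
import math
from collections import Counter

def count_vector(sentence, vocabulary):
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

s1 = "I like this movie"
s2 = "I love this film"   # near-synonym of s1
s3 = "I hate this soup"   # control: unrelated meaning, same sentence frame

vocabulary = sorted({w for s in (s1, s2, s3) for w in s.lower().split()})

sim_synonym = cosine(count_vector(s1, vocabulary), count_vector(s2, vocabulary))
sim_unrelated = cosine(count_vector(s1, vocabulary), count_vector(s3, vocabulary))

print(f"s1 vs s2 (same meaning):      {sim_synonym:.2f}")
print(f"s1 vs s3 (different meaning): {sim_unrelated:.2f}")
```

Both pairs score the same, because the only overlap a count vector can see is the shared words “i” and “this”. The pairs like/love and movie/film contribute nothing.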
Here’s the summary of all their shared failures:
| Failure | OHE | BoW | TF-IDF | Embeddings |
|---|---|---|---|---|
| No semantic understanding (like ≠ love) | Yes | Yes | Yes | Fixed |
| Ignores word order (dog bites man = man bites dog) | Yes | Yes | Yes | Fixed (Transformers) |
| High dimensionality (50K vocab → 50K vector) | Yes | Yes | Yes | Fixed (dense, compact) |
| Sparse (mostly zeros) | Yes | Yes | Yes | Fixed (dense vectors) |
| OOV: can’t handle new words | Yes | Yes | Yes | Largely fixed |
| No context (“bank” = river or finance?) | Yes | Yes | Yes | Fixed (contextual embeddings) |
7. Embeddings: The Fix That Powers LLMs
Embeddings answer the question OHE/BoW/TF-IDF could never answer: “What is the meaning of this text in context?”
Instead of a sparse binary or count vector, an embedding is a dense vector of decimal numbers (typically 768 to 3072 dimensions) learned by training a neural network on massive amounts of text. Words with similar meanings end up with numerically similar vectors.
# Conceptual illustration: actual values differ
# Classical encoding:
"like" → [0, 0, 0, 1, 0, 0, 0, 0]   # OHE, one position only
"love" → [0, 0, 0, 0, 1, 0, 0, 0]   # completely different position
# Math distance between them: far apart → model sees them as unrelated
# Embeddings:
"like" → [0.21, -0.45, 0.83, 0.12, ...]   # dense, 768+ numbers
"love" → [0.19, -0.43, 0.87, 0.14, ...]   # very similar numbers!
# Math distance (cosine similarity): very close → model knows they're related
# Even more powerful:
"Oracle" → [0.55, 0.12, -0.33, ...]       # database context
"PostgreSQL" → [0.53, 0.14, -0.31, ...]   # numerically close!
# The model learned that Oracle and PostgreSQL are related concepts
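To put a number on "close" versus "far apart", here is a sketch that computes cosine similarity for the vectors in the illustration. The embedding values are the made-up four-number snippets from above, not output from a real model, so only the comparison matters, not the exact score.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# One-hot vectors: each word owns a different position.
ohe_like = [0, 0, 0, 1, 0, 0, 0, 0]
ohe_love = [0, 0, 0, 0, 1, 0, 0, 0]

# Illustrative (not real) embedding values from the example above.
emb_like = [0.21, -0.45, 0.83, 0.12]
emb_love = [0.19, -0.43, 0.87, 0.14]

ohe_similarity = cosine_similarity(ohe_like, ohe_love)
embedding_similarity = cosine_similarity(emb_like, emb_love)

print(f"OHE similarity:       {ohe_similarity:.3f}")
print(f"Embedding similarity: {embedding_similarity:.3f}")
```

Any two different one-hot vectors have a cosine similarity of exactly 0, no matter how related the words are. Dense vectors can land anywhere between -1 and 1, which is what lets them express degrees of similarity.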
The evolution path from class notes:
OHE / BoW / TF-IDF (2012-2015) → Word2Vec (first real embeddings) → Transformer-based models (BERT, GPT, all modern LLMs)
Word2Vec was the breakthrough: it proved that word meaning could be captured mathematically. Transformers then made embeddings contextual: the same word gets different embeddings depending on the surrounding text.
Why this matters for DBAs building RAG systems: When you store documents in a vector database like ChromaDB, pgvector, or Pinecone, you are storing embedding vectors, not OHE or BoW vectors. The search (“find me documents similar to this query”) works by finding vectors that are mathematically close to the query vector. This is called semantic search, and it only works because embeddings capture meaning. We’ll build this in the next post.
8. Full Comparison Table (Interview Reference)
| Feature | OHE | Bag of Words | TF-IDF | Embeddings |
|---|---|---|---|---|
| What it captures | Word presence (0/1) | Word frequency (count) | Word importance (TF×IDF score) | Word meaning + context (dense vector) |
| Vector type | Sparse binary | Sparse integer | Sparse float | Dense float (768-3072 dims) |
| Vector size | = vocabulary size | = vocabulary size | = vocabulary size | Fixed (e.g. 768), not vocabulary-dependent |
| Training required? | No | No | No | Yes (pre-trained models available) |
| Word order? | Ignored | Ignored | Ignored | Captured (Transformers) |
| Semantic understanding? | None | None | None | Full |
| OOV problem? | Yes | Yes | Yes | Largely solved |
| Library / class | OneHotEncoder (sklearn) | CountVectorizer (sklearn) | TfidfVectorizer (sklearn) | sentence-transformers / OpenAI API |
| Best used for | Simple categorical features, interview demos | Text classification, spam detection (classical) | Search engines, document ranking, keyword relevance | Semantic search, RAG, LLMs: everything modern |
| Era | 2012-2015 | 2014-2016 | 2014-2016 | 2013 (Word2Vec) → 2017+ (Transformers) |
9. Common Errors and Fixes
Error 1: NotFittedError on OneHotEncoder
Error (actual error from the bootcamp notebook):
NotFittedError: This OneHotEncoder instance is not fitted yet.
Call 'fit' with appropriate arguments before using this estimator.
Cause: You called encoder.transform() before calling encoder.fit(). This is the most common sklearn mistake: every encoder needs to see the data first (fit) before it can convert new data (transform).
Fix:
# WRONG: transform before fit
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.transform(all_words)      # raises NotFittedError
# RIGHT: fit first, then transform
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(all_words)                      # learn vocabulary first
encoded = encoder.transform(all_words)      # now encode
# OR use fit_transform(), which does both in one call
encoded = encoder.fit_transform(all_words)  # equivalent
DBA Analogy: fit() is like running ANALYZE TABLE to gather statistics. transform() is using those statistics. You can’t use statistics you haven’t gathered yet.
Error 2: ValueError (unknown categories in OneHotEncoder)
ValueError: Found unknown categories ['enjoy'] in column 0 during transform
Cause: You’re trying to encode a word that wasn’t in the training vocabulary (the OOV problem).
Fix:
# Tell encoder to ignore unknown words instead of crashing
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(all_words)
# Now unknown words become an all-zero row instead of raising an error
Error 3: sparse_output parameter name changed in sklearn
TypeError: OneHotEncoder.__init__() got an unexpected keyword argument 'sparse'
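Cause: In sklearn 1.2, the parameter was renamed from sparse to sparse_output. The old name was deprecated at that point and removed in 1.4, so passing sparse=False on a current version raises this TypeError.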
Fix:
# Old sklearn (before 1.2):
encoder = OneHotEncoder(sparse=False)
# New sklearn 1.2+ (use this):
encoder = OneHotEncoder(sparse_output=False)
Error 4: BoW transform on new document returns all zeros
Symptom: You transform a new document and get a zero vector. Not an error, but confusing.
Cause: This is expected OOV behaviour: none of the words in the new document exist in the training vocabulary. This is a design limitation of BoW, not a bug.
new_doc = ["lion is the king of jungle"]
# "lion", "king", "jungle": not in training vocab
# Result: [[0 0 0 0 0 0 0 0]] (all zeros, silent failure)
# Fix: Use embeddings instead of BoW for production systems
# Embeddings handle unseen words through subword tokenization
10. Key Takeaways
What you learned in this post:
- All ML and AI models require text to be converted to numbers before processing. This is called text encoding.
- OHE answers “is the word present?” with a binary 0/1 vector over the vocabulary. Simple but very sparse and no meaning.
- Bag of Words answers “how many times?” with word count vectors. Better than OHE but still ignores order and meaning.
- TF-IDF answers “how important is this word?” by weighting words by rarity across the corpus. Better than BoW for search and ranking, but still no semantic understanding.
- All three share the same 6 critical failures: no word order, no semantics, sparse, high-dimensional, OOV problem, no context.
- Embeddings address all 6 problems: they are dense vectors of decimal numbers learned from training data, where similar words are numerically close. This is what powers every modern LLM and vector search system.
- The sklearn pattern is always: create → fit() → transform(). Or fit_transform(), which does both.
- The NotFittedError is the most common sklearn mistake: always call fit() before transform().
11. What’s Next
Now that you understand why classical encoding fails, the next post builds the solution. We go hands-on with embeddings:
Post 4: Word2Vec, Embeddings and Semantic Search (Build it in Python)
What Word2Vec is · How embeddings capture meaning · Cosine similarity · Build a semantic search system from scratch · Compare keyword search vs semantic search · Use OpenAI embeddings API
| # | Post | Status |
|---|---|---|
| 1 | What is GenAI? + UV Setup | Published |
| 2 | AI Roadmap + 30 Tools + GitHub Copilot Setup | Published |
| 3 | OHE, BoW, TF-IDF and Embeddings (this post) | You are here |
| 4 | Word2Vec, Embeddings and Semantic Search: hands-on Python | Next Friday |
| 5 | Prompt Engineering: Zero to Advanced (DBA Edition) | Coming soon |
References
- sklearn OneHotEncoder Documentation
- sklearn CountVectorizer Documentation
- sklearn TfidfVectorizer Documentation
Part of the GenAI from Scratch series for DBAs and Infrastructure Engineers. Published every Friday at gradeupnow.in/genai-blog/