BERT
⚡ Interactive Visualization of BERT Under the Hood

Before you finish typing,
how does an AI model
guess the next word?

BERT (an AI model) reads an entire sentence at once rather than word-by-word, allowing it to understand the complete context. By looking left and right simultaneously, it easily guesses missing words and catches subtle meanings.

🎮 An Interactive & Transparent Illustration

Step through a live visual of how BERT processes a sentence — from raw tokens to attention patterns to final word predictions.

No coding needed · Runs in browser · 5–8 min tour
Live AI · No server · Runs in your browser

Watch real BERT predict the missing word

This is the actual bert-base-uncased model, the one Google released publicly in 2018. It downloads once (~90 MB) and runs entirely in your browser.

✅  BERT is ready  ·  110M parameters loaded
The cat sat on the [MASK].
BERT will predict what word fills [MASK]
🎯 Top predictions — real BERT output

💡 This is exactly what the next slides explain step-by-step — how BERT went from a sentence to these probabilities.

How does AI know what word comes next?

The cat sat on the ___

An AI will predict the missing word. But first — what’s your guess?

Step 1 of 6

First, the AI slices your sentence into pieces

Tokenization: Converting words and sub-words into unique numbered IDs that the AI can mathematically understand.

The cat sat on the ___
↓ slicing…
➕ Positional Embedding — Where does each word sit?
The (pos 0) · cat (pos 1) · sat (pos 2) · on (pos 3) · the (pos 4) · [???] (pos 5)
⚡ BERT reads ALL words at once
BERT processes every token in parallel, and each token can draw on context from both its left and its right.
📍 So we stamp each word with its position
A position number is added to each token so BERT knows word order.
🔑 Two things happen here: first, each word is sliced into a token and given a unique ID (like a library catalogue number); then each token is stamped with its position (0, 1, 2…) so that even though BERT reads everything at once, it still knows cat comes before sat.
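
To make those two steps concrete, here is a minimal sketch of slicing a sentence into tokens, looking up an ID for each, and stamping a position. The vocabulary `TOY_VOCAB`, its IDs, and the `encode` helper are invented for illustration. Real BERT uses a 30,522-entry vocabulary, splits rare words into sub-words, and adds special [CLS] and [SEP] tokens, all of which this sketch leaves out.

```python
# Toy vocabulary: these IDs are made up for illustration, not BERT's real IDs.
TOY_VOCAB = {"[UNK]": 0, "[MASK]": 1, "the": 2, "cat": 3, "sat": 4, "on": 5}


def encode(sentence, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, and pair each token with an ID and a position."""
    encoded = []
    for position, token in enumerate(sentence.lower().split()):
        if token == "[mask]":
            token = "[MASK]"  # restore the special token after lowercasing
        token_id = vocab.get(token, vocab["[UNK]"])  # unknown words map to [UNK]
        encoded.append((token, token_id, position))
    return encoded


rows = encode("The cat sat on the [MASK]")
for token, token_id, position in rows:
    print(f"pos {position}: {token!r} -> id {token_id}")
```

Note that both occurrences of "the" share one ID but get different positions, which is how word order survives parallel processing.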
Step 2 of 6

Each token finds its neighbourhood

The AI converts every token into a location in meaning space. Words that belong together end up close together.

Diagram legend: Context tokens · Prediction space · Top predictions · Positional cluster · Semantic neighbors

Notice: cat lives in Animal Alley, sat is in Action Street, and ??? floats in Furniture Lane — right next to mat, floor and chair. The AI already feels it should be something you sit on.
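
"Close together" can be measured. One common choice is cosine similarity, sketched below on hypothetical 3-number vectors. The numbers in `TOY_VECTORS` are invented so that animals and sit-on-able things form two groups; real BERT vectors have 768 numbers and are learned from data.

```python
import math

# Hypothetical 3-number "meaning" vectors; real BERT vectors have 768 numbers.
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "mat": [0.1, 0.1, 0.9],
    "floor": [0.0, 0.2, 0.8],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word, vectors=TOY_VECTORS):
    """Return the other word whose vector points in the most similar direction."""
    others = [w for w in vectors if w != word]
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


print(nearest("mat"))
print(nearest("cat"))
```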

Hover over the words in the diagram
See what each word means to the AI — and why it matters for prediction
Step 3 of 6

Words listen to each other

Before guessing the blank, the AI asks every word: “Who should I pay attention to?”

⚡ Which word do you think matters most for predicting the blank?
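
Under the hood, "paying attention" is arithmetic: the blank's query vector is compared with every word's key vector, and a softmax turns the scores into weights that sum to 1. The sketch below shows scaled dot-product attention weights under that assumption. The 2-number vectors are invented, so the resulting ranking is not BERT's real answer.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention: score each key against the query, then softmax."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


tokens = ["the", "cat", "sat", "on", "the", "[MASK]"]
# Hypothetical key vectors, one per token (real BERT heads use 64 numbers each).
keys = [[0.1, 0.0], [0.5, 0.2], [1.0, 0.8], [0.9, 1.0], [0.1, 0.0], [0.3, 0.3]]
query_for_mask = [1.0, 1.0]  # hypothetical query vector for the blank

weights = attention_weights(query_for_mask, keys)
for token, weight in zip(tokens, weights):
    print(f"{token:>7}: {weight:.1%}")
```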
Step 4 of 6

Not one listener — twelve

In every layer, BERT runs 12 attention heads at the same time. Each head can learn to track a different kind of relationship (subject, verb, grammar), then all 12 answers are merged back into one rich representation.

768-dim Meaning Vector (one per token)
Head 1 · 64 dims · Subjects
Head 2 · 64 dims · Verbs
··· Heads 3–11
Head 12 · 64 dims · Grammar
Combined Contextual Vector: 12 × 64 = 768 dims, concatenated
Tap the head cards above

Each card specialises in a different feature. By splitting 768 dims into 12 channels of 64, BERT doesn't have to choose between tracking the noun or the verb — it tracks both simultaneously.
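
The 12 × 64 bookkeeping can be sketched in a few lines. This shows only the split and the concatenation; in real BERT each head also applies its own learned projections and runs attention on its slice, and another learned layer follows the merge. The helper names `split_heads` and `merge_heads` are illustrative.

```python
HIDDEN_SIZE = 768
NUM_HEADS = 12
HEAD_SIZE = HIDDEN_SIZE // NUM_HEADS  # 64 numbers per head


def split_heads(vector, num_heads=NUM_HEADS):
    """Cut one token vector into equal, consecutive slices, one per head."""
    size = len(vector) // num_heads
    return [vector[i * size:(i + 1) * size] for i in range(num_heads)]


def merge_heads(heads):
    """Concatenate the per-head slices back into one vector."""
    return [value for head in heads for value in head]


token_vector = [float(i) for i in range(HIDDEN_SIZE)]  # placeholder values
heads = split_heads(token_vector)
merged = merge_heads(heads)

print(len(heads), "heads of", len(heads[0]), "dims ->", len(merged), "dims")
```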

Step 5 of 6

Run BERT live in your browser

Here is the actual code that predicts the missing word. Click Run to execute it right here — no server needed.

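# Python code (runs as JavaScript inside your browser)
from transformers import pipeline

# Build a fill-mask pipeline around bert-base-uncased (downloads the model on first use)
model = pipeline("fill-mask", model="bert-base-uncased")
# Ask for the most likely fill-ins for the [MASK] token
results = model("The cat sat on the [MASK].")

# Each result holds a predicted token and its probability
for r in results:
    print(f"{r['token_str']}: {r['score']:.1%}")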
Step 6 of 6

The AI scores every one of its 30,522 known words — highest score wins

Every entry in the vocabulary gets a probability score, and the scores add up to 100%. But before we reveal the winner…

🎯 How confident do you think the AI is in its top answer?
Make your guess above
Tap a percentage — then see how confident BERT really is
🏅 Top predictions — probability scores
🧠 Inside BERT

How BERT actually reads your sentence

Our demo showed you the key ideas — here's how BERT puts them all together in one pass.

🔐
Step 1 — Tokenize
Your sentence is split into word-pieces. Common words stay whole, while rarer ones are broken up: "embeddings" → "em" + "##bed" + "##ding" + "##s". Each piece gets a unique ID from BERT's 30,522-entry vocabulary.
🌎
Step 2 — Embed into meaning space
Each token becomes a 768-number vector. Similar words land near each other. Position info is added so BERT knows word order.
👁️
Step 3 — Self-Attention × 12 layers
Each of BERT's 12 layers asks every word: "which other words matter most for understanding me?" The answer changes at every layer — building from simple patterns (layer 1) to rich meaning (layer 12). Crucially, each word can look left and right at the same time.
💡 This is why BERT is called Bidirectional. Unlike GPT, which reads strictly left-to-right and can't “see ahead”, BERT sees the whole sentence at once, which makes it well suited to understanding context and filling in blanks.
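
One way to picture the difference is as a visibility grid: for each word (row), which other words (columns) is it allowed to look at? The sketch below builds both grids for the example sentence. The function names are illustrative, and real models apply such masks to attention scores rather than to token lists.

```python
def bidirectional_mask(n):
    """BERT-style: every token may look at every token."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """GPT-style: a token may look only at itself and tokens to its left."""
    return [[col <= row for col in range(n)] for row in range(n)]


def visible_tokens(mask, row, tokens):
    """List the tokens that the token at index `row` is allowed to see."""
    return [token for token, allowed in zip(tokens, mask[row]) if allowed]


tokens = ["the", "cat", "sat", "on", "the", "[MASK]"]
sat_index = 2

print("bidirectional:", visible_tokens(bidirectional_mask(len(tokens)), sat_index, tokens))
print("left-to-right:", visible_tokens(causal_mask(len(tokens)), sat_index, tokens))
```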
📈
Step 4 — Predict the masked word
The [MASK] token's final 768-number vector gets projected onto all 30,522 vocabulary words. A softmax turns those scores into probabilities. Highest wins.
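
The softmax step can be sketched on a handful of candidates. The raw scores in `TOY_LOGITS` are invented for illustration; real BERT produces one score for each of its 30,522 vocabulary entries.

```python
import math

# Hypothetical raw scores for a few candidate words (not real BERT output).
TOY_LOGITS = {"mat": 4.0, "floor": 3.5, "bed": 3.0, "couch": 2.5, "banana": -1.0}


def softmax(logits):
    """Convert a word -> score mapping into a word -> probability mapping."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {word: math.exp(score - peak) for word, score in logits.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}


probs = softmax(TOY_LOGITS)
best = max(probs, key=probs.get)

for word, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
    print(f"{word}: {p:.1%}")
print("winner:", best)
```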
🏗️ Architecture 👆 Hover any block to learn its role
Inputs → Input Embedding + Positional Encoding → ×12 [ Multi-Head Attention → Add & Norm → Feed Forward → Add & Norm ] → Linear Projection → Softmax → Output Probabilities
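
The "Add & Norm" block that appears twice per layer can be sketched as follows: add the sub-layer's output back onto its input (a residual connection), then normalise the result. This is a simplified version that omits the learned scale and shift that real layer normalisation includes, and the input numbers are placeholders.

```python
import math


def layer_norm(vector, eps=1e-12):
    """Rescale a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]


def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalisation."""
    return layer_norm([a + b for a, b in zip(x, sublayer_output)])


x = [1.0, 2.0, 3.0, 4.0]                # placeholder token vector
sublayer_output = [0.5, -0.5, 0.5, -0.5]  # placeholder attention or feed-forward output
y = add_and_norm(x, sublayer_output)
print(y)
```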

You already use BERT every day

BERT and its descendants power these tools you use right now:

🔍
Google Search
BERT reads your full query in both directions to understand your true intent, not just isolated keywords. Google has said it is used on nearly every English-language search query.
🛒
Amazon / E-commerce Search
When you search "running shoes for flat feet", BERT computes semantic similarities instead of just matching exact words — serving products that match your physical need, not just your keywords.
🎉
You just watched how BERT works!
Tokens → Meaning → Attention → Prediction.
The same attention mechanism underlies ChatGPT, Gemini, and most modern language models.
🧪 Data Lab

Change the training data — change the answer

Toggle sentence groups on or off. Watch the AI change its prediction live. Same model. Same architecture. Different data = different answer.

⚡ Live prediction

AI now predicts
floor
8 of 8 training matches
🤯 Same AI. Different data. Different answer!
You just saw that AI predictions are not absolute truth: they reflect whatever patterns existed in the training data. This is why researchers obsess over data quality. Biased data in, biased answers out. The model didn't change. Only the examples did.
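
The same effect can be reproduced with a deliberately crude stand-in: a model that just counts which word followed the prompt in its training sentences. This is far simpler than how BERT learns, and the sentence groups in `TRAINING_GROUPS` are invented, but it shows the same principle of unchanged code giving different answers on different data.

```python
from collections import Counter

# Hypothetical sentence groups that can be switched on or off.
TRAINING_GROUPS = {
    "floor sentences": [
        "the cat sat on the floor",
        "the cat sat on the floor",
        "the cat sat on the floor",
    ],
    "mat sentences": [
        "the cat sat on the mat",
        "the cat sat on the mat",
    ],
}


def predict_last_word(prompt, enabled_groups, groups=TRAINING_GROUPS):
    """Return the word that most often completed the prompt in the enabled groups."""
    prefix = prompt.lower().split()
    counts = Counter()
    for name in enabled_groups:
        for sentence in groups[name]:
            words = sentence.split()
            if words[:-1] == prefix:  # sentence starts with the prompt, plus one word
                counts[words[-1]] += 1
    return counts.most_common(1)[0][0] if counts else None


prompt = "The cat sat on the"
print(predict_last_word(prompt, ["floor sentences", "mat sentences"]))
print(predict_last_word(prompt, ["mat sentences"]))
```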