Demystifying BERT Under the Hood

A linguistic idea, decades before AI

“You shall know a word by the company it keeps!”

— John Rupert Firth, 1957

e.g. “We went to the bank” vs. “the river bank”
Same word, different meaning, decided entirely by context.

BERT (Bidirectional Encoder Representations from Transformers) reads text in both directions simultaneously to capture deep contextual meaning.

A real Google search use case

Google

✕

Before BERT

2019 brazil traveler to usa need a visa

Read as

US citizens travelling to Brazil. The exact opposite direction.

vs

✓

After BERT (25 Oct 2019)

2019 brazil traveler to usa need a visa

Correctly grasped

Visa requirements for Brazilian citizens visiting the United States.

One word, two opposite meanings

Semantic Search — Understanding Intent Over Keywords

Google estimated that this update would affect about 1 in 10 English-language searches in the US.

Read → Visualize → Do → Remember

Part 1 — Theory & Deck

Visual Walkthrough

An intuitive, visual walkthrough of BERT — from raw text all the way down to self-attention layers.

Part 2 — Try it Out (link in comments)

Live DataLab Sandbox

Input any sentence of your own and watch it move through every architectural layer, live.

Transformer

Encoder-Only

Reads text bidirectionally in one pass, building contextual meaning.

BERT is Encoder-only.

Decoder-Only

Generates text sequentially one token at a time using masked attention.

GPT, the model family behind ChatGPT, is Decoder-only.

Encoder-Decoder

Combines both: encodes input context and decodes it to output text.

T5 is Encoder-Decoder.

How BERT Differs from GPT Models

Encoder · the reader

It listens & understands

Reads the whole sentence at once
Looks both ways — left & right context
Fills the blank from surrounding words

bert · understanding

The cat [MASK] on the mat.

BERT — Encoder-only, understands the context

Decoder · the writer

It speaks it out loud

Writes one word at a time
Each word builds on the ones before
Only looks backward, never ahead

gpt · generating

GPT — Decoder-only, generates new text
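The "looks both ways" versus "only looks backward" contrast can be sketched as two attention masks. This is an illustrative toy, not code from either model; the function names are made up for the example, and 1 means "may look at", 0 means "hidden".

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["the", "cat", "[MASK]", "on", "the", "mat"]
for name, mask in [("encoder", bidirectional_mask(len(tokens))),
                   ("decoder", causal_mask(len(tokens)))]:
    print(name)
    for tok, row in zip(tokens, mask):
        # Each row shows which positions this token is allowed to see.
        print(f"  {tok:>6}  {row}")
```

In the encoder rows, [MASK] can see "on the mat" to its right; in the decoder rows, everything to the right of a token is zeroed out.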

Masked Language Modeling

How BERT is trained

1. Hide a word
2. Read both sides
3. Predict the [MASK]

Hide 15% of words behind a [MASK]
BERT guesses each from words on both sides
Learns grammar, facts & context on its own
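The hiding step above can be sketched in a few lines. This is a minimal illustration, not BERT's training code: `mask_tokens` and the toy vocabulary are hypothetical names, and the 80/10/10 split (mostly [MASK], sometimes a random token, sometimes left unchanged) follows the recipe described in the original BERT paper.

```python
import random


def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Pick about mask_rate of the positions and corrupt them for prediction."""
    rng = random.Random(seed)
    n_pick = max(1, round(len(tokens) * mask_rate))
    picked = sorted(rng.sample(range(len(tokens)), n_pick))
    corrupted = list(tokens)
    labels = {}  # position -> original token the model must recover
    for i in picked:
        labels[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # usual case: hide the token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # sometimes: swap in a random token
        # otherwise: leave the token visible but still ask for a prediction
    return corrupted, labels


sentence = "he went to the bank to transfer money".split()
toy_vocab = ["river", "cat", "office", "atm"]  # hypothetical stand-in vocabulary
corrupted, labels = mask_tokens(sentence, toy_vocab)
print(corrupted)
print(labels)
```

The model only receives `corrupted`; `labels` is the answer key used to score its guesses.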

Transformer Architecture is used for deep bidirectional understanding

Now — under the hood

BERT, step by step

Tokenize → Embeddings → Attention → Add & Norm → Feed-Forward → LM Head → Predict

Seven steps — from words to a prediction

01 · Tokenize · like Lego bricks

Words snap into reusable pieces

30,522 pieces are enough to build every word

30,522 is the total size of BERT's WordPiece vocabulary.

A few hundred Lego shapes build anything — you never need a custom brick. (Here hug·##ging splits into two pieces, while his and favorite are common enough to be single pieces — the ## means “glue me to the piece before.”)
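The "snap into pieces" idea can be sketched as greedy longest-match-first splitting, which is how WordPiece-style tokenizers are usually described. The five-entry vocabulary below is a toy assumption chosen to reproduce the split in the slide; the real vocabulary has 30,522 entries and may split these words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the glue marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter piece
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


TOY_VOCAB = {"his", "favorite", "hug", "##ging", "##s"}  # hypothetical vocabulary
for word in ["his", "favorite", "hugging", "hugs"]:
    print(word, "->", wordpiece(word, TOY_VOCAB))
```

Because pieces are reused, "hug" serves both "hugging" and "hugs" without either word needing its own entry.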

02 · Embeddings · like a colour code

Each piece becomes 768 numbers

Why not just a couple of numbers?

Flat — 2 numbers

Two very different movies land almost on top of each other — the computer can’t tell them apart.

→

Add one axis — 3 numbers

One extra dimension sends The Dark Knight into the dark, gritty corner while The Avengers stays bright and comic-book.

The BERT reality — 768 dimensions at once

the dark knight →

dim 214 · Philosophical villain

dim 402 · Orchestral score

dim 588 · CGI explosions

orange

→

colour code #412

→

R 240 · G 90 · B 20 · sits near other oranges

Humans can’t picture a 768-D graph, but the computer reads it like a colour code — “#412” tells you nothing, while R240·G90·B20 says orange. BERT builds one of these 768-number fingerprints for every single token, so similar words land near each other.
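"Land near each other" is usually measured with cosine similarity. The sketch below uses made-up 3-number vectors purely for illustration; real BERT vectors have 768 numbers, and these values are not taken from any model.

```python
import math


def cosine(a, b):
    """Cosine similarity: near 1.0 means same direction, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, invented for this example.
toy = {
    "orange": [0.9, 0.4, 0.1],
    "tangerine": [0.8, 0.5, 0.1],
    "bank": [0.1, 0.2, 0.9],
}

print("orange vs tangerine:", round(cosine(toy["orange"], toy["tangerine"]), 3))
print("orange vs bank:     ", round(cosine(toy["orange"], toy["bank"]), 3))
```

The two fruit vectors point almost the same way, so their similarity is close to 1, while "bank" points elsewhere and scores much lower.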

03 · Multi-head Attention · a panel of experts

Every word asks “who matters to me?”

Attention · every word looks at every other

Every word weighs every other — the blank leans hardest on “transfer” & “money”.
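The weighing step can be sketched as scaled dot-product attention. This is a simplified assumption: real BERT first passes each token through learned query, key and value projections, while this toy feeds the same 2-number vectors in for all three roles.

```python
import math


def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query, score every key, then blend the values by those weights."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors (made up for illustration).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's answer to "who matters to me?": a set of weights over all tokens that sums to 1.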

Multi-head Attention · a panel of experts

Head 1 · subject

Head 2 · meaning

Head 3 · verb

Other heads, up to Head 12

Like twelve people proofreading one page — the editor watches grammar, the lawyer watches risk, the designer watches rhythm. Each head notices a different link, then they pool their views.
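The arithmetic behind the panel: in BERT-base, the 768 numbers are divided among 12 heads, so each head works on a 64-number slice. The sketch below shows only that split and the pooling back together; it assumes plain slicing and omits the learned projections and per-head attention that real implementations apply.

```python
HIDDEN, HEADS = 768, 12
HEAD_DIM = HIDDEN // HEADS  # 64 numbers per head


def split_heads(vector, heads):
    """Cut one token vector into equal slices, one per head."""
    size = len(vector) // heads
    return [vector[h * size:(h + 1) * size] for h in range(heads)]


def merge_heads(chunks):
    """Concatenate the per-head slices back into one vector."""
    return [x for chunk in chunks for x in chunk]


token_vector = [float(i) for i in range(HIDDEN)]  # placeholder values
chunks = split_heads(token_vector, HEADS)
print(len(chunks), "heads x", len(chunks[0]), "numbers each")
print("merged length:", len(merge_heads(chunks)))
```

Splitting keeps the total work the same as one big attention pass while letting each slice specialise in a different kind of link.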

04 · Add & Layer Norm · the safety net

Keep the signal strong & balanced

Add · a shortcut for the original

Deep layers can wash the original out — like a photocopy of a photocopy. So the signal takes a shortcut and is added back, so nothing important is lost.

Layer Norm · the sound leveller

wild, uneven → calm, balanced

First add the original back so nothing is lost — then normalise, like a sound engineer riding the faders: nothing too loud, nothing gone. Every layer stays stable.
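Both halves of this step fit in a short sketch: add the original vector back, then rescale to zero mean and unit variance. This is a simplified assumption of the real layer, which also applies a learned scale and shift after normalising; the `eps` value is likewise an assumed small constant to avoid division by zero.

```python
import math


def layer_norm(x, eps=1e-12):
    """Rescale a vector to mean 0 and variance 1 (learned scale/shift omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(original, sublayer_output):
    """Residual shortcut first, then normalisation."""
    summed = [a + b for a, b in zip(original, sublayer_output)]
    return layer_norm(summed)


original = [1.0, 2.0, 3.0, 4.0]      # toy input to the sub-layer
wild = [10.0, -20.0, 30.0, -5.0]     # toy, uneven sub-layer output
balanced = add_and_norm(original, wild)
print([round(v, 3) for v in balanced])
```

Whatever the scale of the incoming numbers, the output always comes back centred and evenly spread, which is what keeps stacked layers stable.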

05 · Feed-Forward Network · a private workshop

Each word thinks it over

768 (contextual vector) → Zoom in → 3072 (room to think) → Filter → 3072 (weak signals dropped) → Zoom out → 768 (sharper word)

Having heard the room, each word steps aside on its own — expanding to think it over, then compressing back down, refining what it just learned.
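The expand, filter, compress pattern can be sketched with toy sizes that keep the same 4x ratio as 768 to 3072. The random weights below are placeholders, not trained values, and GELU is used as the "filter" since that is the activation BERT is known to use.

```python
import math
import random

HIDDEN, INNER = 4, 16  # toy stand-ins for BERT-base's 768 and 3072


def gelu(x):
    """Smooth filter: strongly negative inputs go to about 0, positives pass through."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weights, bias):
    """One dense layer; weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def feed_forward(x, w1, b1, w2, b2):
    hidden = [gelu(h) for h in linear(x, w1, b1)]  # zoom in, then filter
    return linear(hidden, w2, b2)                  # zoom back out


rng = random.Random(0)
w1 = [[rng.gauss(0, 0.5) for _ in range(HIDDEN)] for _ in range(INNER)]
b1 = [0.0] * INNER
w2 = [[rng.gauss(0, 0.5) for _ in range(INNER)] for _ in range(HIDDEN)]
b2 = [0.0] * HIDDEN

token_vector = [0.2, -1.0, 0.7, 0.1]
refined = feed_forward(token_vector, w1, b1, w2, b2)
print(len(token_vector), "->", INNER, "->", len(refined))
```

Note that the function takes a single token's vector: unlike attention, this step never looks at the other words.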

06 · Language Model Head · to get the output

From vector to word-scores

BERT hands over the blank’s 768-number vector. The LM (Language Model) Head compares it against every word BERT knows.

Why multiply? The blank has a meaning-vector and so does every word. Their dot product measures how much two meanings point the same way: big for bank, tiny for cat.

[MASK] vector (1 × 768) · every word (768 × 30,522) = scores (1 × 30,522 logits)

One dot product per word → a raw logit (score) for all 30,522 words.
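The "one dot product per word" idea, sketched with invented 3-number vectors in place of 768-number ones. The vectors and the three-word vocabulary are hypothetical, and the extra transform and bias that real implementations typically apply before this step are left out.

```python
def lm_head_logits(mask_vector, word_vectors):
    """Score every vocabulary word by its dot product with the blank's vector."""
    return {
        word: sum(m * w for m, w in zip(mask_vector, vec))
        for word, vec in word_vectors.items()
    }


# Hypothetical toy vectors, invented for this example.
toy_words = {
    "bank": [0.9, 0.8, 0.1],
    "atm": [0.7, 0.6, 0.2],
    "cat": [-0.5, 0.1, 0.9],
}
mask_vector = [1.0, 1.0, 0.0]

logits = lm_head_logits(mask_vector, toy_words)
for word, score in sorted(logits.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:>5}  {score:+.2f}")
```

Words whose vectors point the same way as the blank's vector get the largest scores; in the full model this loop is a single matrix multiplication over all 30,522 words.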

07 · Score & Predict · the final vote

The best-fit word wins

He went to the [MASK] to transfer money

raw score → softmax (eˣ) → probability

bank · 8.9 → eˣ → 71%

atm · 7.1 → eˣ → 12%

office · 6.2 → eˣ → 5%

Softmax turns the logits into probabilities that add to 100%. Winner bank, 71%.
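Softmax itself is only a few lines. The sketch below applies it to just the three logits shown on the slide, which is an assumption made for brevity; the real calculation runs over all 30,522 words.

```python
import math


def softmax(logits):
    """Exponentiate each score, then divide by the total so they sum to 1."""
    peak = max(logits.values())  # subtracting the max keeps exp() numerically safe
    exps = {word: math.exp(score - peak) for word, score in logits.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}


slide_logits = {"bank": 8.9, "atm": 7.1, "office": 6.2}
probs = softmax(slide_logits)
for word, p in probs.items():
    print(f"{word:>6}  {p:.1%}")
```

With only three candidates, bank comes out at roughly 81% rather than 71%. The slide's lower figures are consistent with the rest of the vocabulary sharing the remaining probability, and the ordering and the winner are the same either way.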