BERT Tutorial

BERT

Bidirectional Encoder Representations from Transformers.

BERT (Bidirectional Encoder Representations from Transformers), released by Google in 2018, changed everything. It was the first model to learn deeply bidirectional representations, conditioning on a word's left and right neighbors simultaneously at every layer of its Transformer encoder.

Level 1 — Pre-training & Fine-tuning

BERT isn't just one model; it's a two-step process:

  1. Pre-training: The model learns how language works from a huge unlabeled corpus (for the original BERT, BooksCorpus plus English Wikipedia, roughly 3.3 billion words).
  2. Fine-tuning: You take that pre-trained model and teach it a specific task (like detecting spam) with a small labeled dataset, typically in minutes to a few hours on a single GPU.

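The fine-tuning step above can be sketched with the Hugging Face transformers library. This is a minimal illustration, not a full training loop: the labels and texts are made-up examples, and a real run would iterate over a labeled dataset with an optimizer.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load pre-trained BERT plus a fresh (randomly initialized) classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g. 0 = ham, 1 = spam
)

# A tiny illustrative batch (hypothetical data, for demonstration only).
texts = ["Win a FREE prize now!!!", "Meeting moved to 3pm."]
labels = torch.tensor([1, 0])

inputs = tokenizer(texts, padding=True, return_tensors="pt")
outputs = model(**inputs, labels=labels)

# One gradient step of fine-tuning: the loss flows back through the new
# head AND all of BERT's pre-trained weights.
outputs.loss.backward()
```

Passing `labels` makes the model compute a cross-entropy loss for you; in practice you would wrap this in an optimizer loop (or use the `Trainer` API).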
Level 2 — MLM and NSP

BERT was trained using two clever unsupervised tasks:

  • Masked Language Modeling (MLM): Randomly selecting 15% of input tokens and training the model to predict them (of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged).
  • Next Sentence Prediction (NSP): Guessing if Sentence B follows Sentence A.

Level 3 — Feature Extraction vs Fine-tuning

Advanced users can use BERT as a Feature Extractor (getting word vectors for other models) or through Full Fine-tuning (updating all BERT weights). Fine-tuning is generally superior for specialized accuracy.
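The practical difference between the two modes comes down to which parameters receive gradients. A minimal sketch of the feature-extractor setup, assuming the transformers library: freeze BERT's body so only the small classification head trains.

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Feature-extractor mode: freeze every BERT weight so only the new
# classification head is updated during training.
for param in model.bert.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,}")
```

Removing the freezing loop gives you full fine-tuning, where all ~110M parameters of bert-base update; freezing trades some accuracy for much cheaper training.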

BERT Sentence Embeddings
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()  # inference mode: disables dropout

text = "BERT understands context bidirectionally."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():  # no gradients needed when only extracting features
    outputs = model(**inputs)

# 'last_hidden_state' holds one contextual vector per token:
# shape (batch_size, sequence_length, 768) for bert-base.
embeddings = outputs.last_hidden_state

# Mean-pool over tokens to get a single fixed-size sentence vector.
sentence_embedding = embeddings.mean(dim=1)  # shape (1, 768)