FastText
Learn how Facebook's FastText improves upon Word2Vec by utilizing subword information (character n-grams) to handle Out-Of-Vocabulary words.
FastText: Subword Embeddings
Created by Facebook's AI Research (FAIR) lab in 2016, FastText is an extension of the Word2Vec model. While Word2Vec and GloVe treat every word as a distinct, atomic entity, FastText breaks words down into smaller pieces called character n-grams.
The Subword Breakdown Example
How does FastText view the word "apple" using an n-gram size of n=3?
FastText adds special boundary characters < and > to denote the beginning and end of a word.
Word: <apple>
N-grams (n=3): [ "<ap", "app", "ppl", "ple", "le>" ]
The final embedding for "apple" is the sum of the embeddings of all these little n-grams (plus the embedding for the whole word itself)!
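The breakdown above is easy to reproduce. Here is a minimal plain-Python sketch (not gensim's internal implementation) of extracting character n-grams with the `<` and `>` boundary markers:

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word, including boundary markers."""
    token = f"<{word}>"
    return [token[i:i + n] for i in range(len(token) - n + 1)]

print(char_ngrams("apple"))
# ['<ap', 'app', 'ppl', 'ple', 'le>']
```

Note that real FastText uses a *range* of n-gram sizes (by default 3 to 6) and hashes the n-grams into a fixed-size table; this sketch shows a single size for clarity.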
Why is this Revolutionary?
- Handles Typos: If a user types "appple", Word2Vec has no vector for it at all (in gensim, looking it up raises a KeyError). FastText still produces a sensible embedding because "appple" shares most of its character n-grams with "apple".
- Solves the OOV Problem: It can generate embeddings for Out-Of-Vocabulary (OOV) words it has never seen before, by summing their character parts.
- Great for Morphological Languages: Highly effective for languages like Turkish or Finnish where words are formed by gluing together many suffixes.
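The typo claim above is easy to verify. This plain-Python check (again a sketch, not gensim's code) counts how many trigrams the misspelling "appple" shares with "apple":

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word, including boundary markers."""
    token = f"<{word}>"
    return [token[i:i + n] for i in range(len(token) - n + 1)]

a = set(char_ngrams("apple"))   # {'<ap', 'app', 'ppl', 'ple', 'le>'}
b = set(char_ngrams("appple"))  # same set plus the extra 'ppp'
shared = a & b

print(f"{len(shared)} of {len(b)} trigrams shared")
# 5 of 6 trigrams shared
```

Since the two words' summed n-gram vectors are built mostly from the same pieces, their final embeddings end up close together.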
Gensim FastText Implementation
from gensim.models import FastText

corpus = [["hello", "world", "this", "is", "nlp"],
          ["machine", "learning", "is", "awesome"]]

# Train FastText; min_n and max_n control the character n-gram sizes
model = FastText(sentences=corpus, vector_size=10,
                 window=3, min_count=1, min_n=3, max_n=6)

# The model has never seen "learnings", but it can still build a
# vector from the character n-grams it shares with "learning"
# (e.g. "lea", "ear", "arn", ...)
oov_word = "learnings"

# Unlike Word2Vec, this lookup does not raise a KeyError for OOV words
vector = model.wv[oov_word]
print(f"Vector for {oov_word} generated successfully!")