FastText Tutorial Section


Learn how Facebook's FastText improves upon Word2Vec by utilizing subword information (character n-grams) to handle Out-Of-Vocabulary words.

FastText: Subword Embeddings

Created by Facebook's AI Research (FAIR) lab in 2016, FastText is an extension of the Word2Vec model. While Word2Vec and GloVe treat every word as a distinct, atomic entity, FastText breaks words down into smaller pieces called character n-grams.

The Subword Breakdown Example

How does FastText view the word "apple" using an n-gram size of n=3?

FastText adds special boundary characters < and > to denote the beginning and end of a word.

Word: <apple>
N-grams (n=3): [ "<ap", "app", "ppl", "ple", "le>" ]

The final embedding for "apple" is the sum of the vectors of all of these n-grams, plus a vector for the whole word itself.
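The n-gram breakdown above is easy to reproduce. Here is a minimal sketch (the helper name `char_ngrams` is ours, not a FastText API); note that real FastText collects n-grams for every n between min_n and max_n, while this sketch fixes a single n to match the example:

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word, FastText-style,
    after wrapping it in the boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("apple"))  # ['<ap', 'app', 'ppl', 'ple', 'le>']
```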

Why is this Revolutionary?

  • Handles Typos: If a user types "appple", Word2Vec has no vector for it at all (in Gensim it raises a KeyError), because that exact token never appeared in training. FastText still produces a sensible vector, because "appple" shares most of its character n-grams with "apple".
  • Solves the OOV Problem: It can generate embeddings for Out-Of-Vocabulary (OOV) words it has never seen before, by summing their character parts.
  • Great for Morphological Languages: Highly effective for languages like Turkish or Finnish where words are formed by gluing together many suffixes.
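To make the typo claim concrete, we can count shared trigrams directly. This is an illustrative sketch using a throwaway `char_ngrams` helper (not part of any library):

```python
def char_ngrams(word, n=3):
    # Wrap the word in FastText's boundary markers, then slide a window of size n.
    wrapped = f"<{word}>"
    return {wrapped[i:i + n] for i in range(len(wrapped) - n + 1)}

a = char_ngrams("apple")    # {'<ap', 'app', 'ppl', 'ple', 'le>'}
b = char_ngrams("appple")   # {'<ap', 'app', 'ppp', 'ppl', 'ple', 'le>'}
shared = a & b
print(f"{len(shared)} of {len(b)} trigrams in 'appple' also appear in 'apple'")
```

Five of the six trigrams of "appple" also occur in "apple", so the two words are forced to land close together in embedding space.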

Gensim FastText Implementation
from gensim.models import FastText

corpus = [["hello", "world", "this", "is", "nlp"], 
          ["machine", "learning", "is", "awesome"]]

# Train FastText
# min_n and max_n control the character n-gram sizes
model = FastText(sentences=corpus, vector_size=10, 
                 window=3, min_count=1, min_n=3, max_n=6)

# The model has never seen "learnings", but it can still
# build a vector for it from its character n-grams, most of
# which were learned while training on "learning".
oov_word = "learnings"

# This works, whereas Word2Vec would raise a KeyError here.
vector = model.wv[oov_word] 
print(f"Vector for {oov_word} generated successfully!")