Low-Resource NLP
Techniques for languages and domains with very little training data.
Leveraging Small Datasets
NLP traditionally relies on massive amounts of data (billions of tokens). However, most of the world's 7,000+ languages are "low-resource." This section covers how we build AI when data is scarce.
Level 1 — Learning Transfer
Instead of training from scratch, we use Pre-trained Models. A model trained on 100GB of English text can often be fine-tuned to handle Swahili after seeing only a few thousand examples, because much of what it learned about linguistic structure transfers across languages.
Few-Shot Learning: Providing just 2-5 examples of a task directly in the prompt is often enough for modern LLMs to perform reasonably well.
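The few-shot idea above is just careful prompt construction. Below is a minimal sketch of building such a prompt; the sentiment task, example sentences, and helper name are illustrative, not tied to any particular model or API.

```python
# Hypothetical few-shot prompt builder: 2-5 labeled examples are placed
# directly in the prompt, followed by the unlabeled query.
examples = [
    ("I loved this film.", "positive"),
    ("The plot was a mess.", "negative"),
    ("Best meal I've had in years.", "positive"),
]

def build_few_shot_prompt(examples, query):
    # One instruction line, then each example as a sentence/label pair.
    lines = ["Classify the sentiment of each sentence."]
    for text, label in examples:
        lines.append(f"Sentence: {text}\nSentiment: {label}")
    # The query is left unlabeled; the model is expected to complete it.
    lines.append(f"Sentence: {query}\nSentiment:")
    return "\n\n".join(lines)

prompt = build_few_shot_prompt(examples, "The service was painfully slow.")
print(prompt)
```

The resulting string would be sent as-is to an LLM, which continues the pattern by emitting a label for the final sentence.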
Level 2 — Data Augmentation
When you don't have data, you make it. Common techniques include:
- Back-translation: Translate English to French and back to English to create a slightly different (augmented) sentence.
- Thesaurus Substitution: Replace words with synonyms.
- Self-Training: Use a model to label unlabeled data, then retrain the model on its own best guesses.
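Thesaurus substitution is the simplest of these to sketch. The tiny hand-written thesaurus below is a stand-in for a real resource like WordNet, and the function is an illustrative sketch, not a production augmenter (it drops punctuation on words it replaces).

```python
import random

# Hypothetical miniature thesaurus; a real system would use WordNet
# or a learned paraphrase model.
THESAURUS = {
    "crop": ["harvest"],
    "very": ["extremely"],
    "low": ["poor", "meagre"],
}

def augment(sentence, thesaurus, rng):
    # Replace each word that has an entry with a randomly chosen synonym.
    out = []
    for word in sentence.split():
        key = word.lower().strip(".,")
        if key in thesaurus:
            out.append(rng.choice(thesaurus[key]))
        else:
            out.append(word)
    return " ".join(out)

rng = random.Random(0)
augmented = augment("The crop yield was very low this year.", THESAURUS, rng)
print(augmented)
```

Each call with a fresh random seed can yield a different paraphrase, multiplying a small labeled set.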
Level 3 — Cross-Lingual Projection
Researchers use Parallel Corpora (translations of the same text) to "project" knowledge from a high-resource language like English onto a low-resource one. This involves aligning the two vector spaces so that translation pairs, such as "Apple" and its Catalan equivalent "Poma", end up close together in the shared space.
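One standard way to align two embedding spaces is orthogonal Procrustes: given embeddings for a small seed dictionary of translation pairs, solve for the rotation that maps one space onto the other. The sketch below uses random matrices as stand-ins for real pre-trained embeddings; the setup is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in English embeddings for 50 seed-dictionary words (8 dims).
en = rng.normal(size=(50, 8))

# Simulate the low-resource language's space as a rotated copy of English.
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
other = en @ q

# Orthogonal Procrustes: find orthogonal W minimizing ||other @ W - en||_F.
# Closed-form solution: W = U @ Vt where U, S, Vt = svd(other.T @ en).
u, _, vt = np.linalg.svd(other.T @ en)
w = u @ vt

aligned = other @ w
print(np.allclose(aligned, en, atol=1e-6))  # True: the spaces now coincide
```

With real embeddings the recovery is only approximate, but nearest-neighbor lookup in the aligned space is enough to induce a bilingual dictionary.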
# Example workflow for augmenting a low-resource dataset via back-translation.
# `model` stands in for any machine-translation system with a translate() method.
original = "The crop yield was very low this year."
en_to_fr = model.translate(original, target="fr")  # "La récolte a été très faible cette année."
fr_to_en = model.translate(en_to_fr, target="en")  # "The harvest was very poor this year."
# The round trip yields a second, paraphrased sentence for training.