GaroVec v1.0 — Hybrid English↔Garo Embedding Model

GaroVec v1.0 is the first publicly documented Latin-script Garo embedding model, developed by MWire Labs to support linguistic equity and low-resource NLP for Northeast India. It combines FastText-style subword embeddings with bilingual alignment techniques to create a hybrid English↔Garo vector space, enabling cross-lingual applications such as lexicon building, translation support, and semantic search.
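As a rough illustration of the two ideas named above, the sketch below shows FastText-style character n-gram extraction (FastText wraps each word in `<` and `>` boundary markers and, by default, uses n-grams of length 3 to 6) and a cosine-similarity lookup in a shared vector space. It is not GaroVec's actual code or API: the function names, the placeholder keys `garo_word_a` and `garo_word_b`, and the three-dimensional vectors are all hypothetical and chosen only to keep the example self-contained.

```python
import math


def char_ngrams(word, n_min=3, n_max=6):
    """Return FastText-style character n-grams of a word with boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query_vec, vocab):
    """Return the key in vocab whose vector is most similar to query_vec."""
    return max(vocab, key=lambda w: cosine(query_vec, vocab[w]))


# Toy, made-up vectors standing in for an aligned English/Garo space.
english_query = [0.9, 0.1, 0.0]
garo_vocab = {
    "garo_word_a": [0.8, 0.2, 0.1],
    "garo_word_b": [0.0, 0.1, 0.9],
}

print(char_ngrams("garo", 3, 4))
print(nearest(english_query, garo_vocab))
```

In a subword model of this kind, a word's vector is built from the vectors of its n-grams, which is why spelling variants and unseen word forms can still receive usable embeddings. Once both languages share one space, lexicon building and semantic search reduce to nearest-neighbour lookups like the one above.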

This model was built in collaboration with native Garo speakers and is part of a broader initiative to create reproducible, timestamped language resources for endangered and underrepresented languages. GaroVec is optimized for Latin-script Garo, which is commonly used in digital communication and educational contexts, and is designed to be modular and extensible for future dialectal or phonetic variants.

Released under the CC BY-SA 4.0 license, which allows reuse and adaptation with attribution under share-alike terms, GaroVec v1.0 is intended for public use in research, education, and civic technology. It is hosted on Hugging Face with full documentation, including training methodology, evaluation notes, and usage examples. This submission aims to make GaroVec discoverable to linguists, educators, and technologists working to preserve and revitalize the Garo language.

ELP Language

Garo

ELP Categories

Language and Technology; Language Revitalization, Education, and Learning

Resource Types

App/Software

Country

India

Tag

Computational Linguistics and NLP; Creating Digital Materials; Technology