MizoOCR
The first OCR model for the Mizo language, developed by MWire Labs.
Model Description
MizoOCR is a fine-tuned TrOCR model for recognizing printed Mizo text, including its unique diacritical characters (â, ê, î, ô, û). It is built on microsoft/trocr-base-printed and trained on 70,000 deduplicated image-text pairs, a mix of curated and synthetic samples drawn from a 200k dataset generated by MWire Labs.
Performance
| Split | Character Accuracy |
|---|---|
| Validation | 89.61% |
| Test | 90.68% |
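The card does not say how character accuracy was computed. One common definition is 1 minus the character error rate, that is, total edit distance divided by total reference characters. The sketch below assumes that definition; `levenshtein` and `character_accuracy` are hypothetical helper names, not part of the released model or its evaluation code. It normalizes strings to NFC first, so that a precomposed â and an `a` followed by a combining circumflex count as the same character.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertions, deletions and substitutions each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # drop a character of a
                current[j - 1] + 1,      # drop a character of b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_accuracy(predictions, references):
    """Assumed metric: 1 - (total edits / total reference characters), floored at 0."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    # NFC normalization keeps composed and decomposed diacritics comparable.
    preds = [unicodedata.normalize("NFC", p) for p in predictions]
    refs = [unicodedata.normalize("NFC", r) for r in references]
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(levenshtein(p, r) for p, r in zip(preds, refs))
    return max(0.0, 1.0 - total_edits / total_chars)
```

Under this definition, a single wrong character in a three-character reference gives an accuracy of about 66.7%. The figures in the table may have been produced with a different formula.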
Training Data
- Total unique samples after deduplication: 102,171
- Training samples: 70,000
- Validation samples: 5,000
- Test samples: 5,000
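The counts above imply a deduplicate-then-split pipeline, though the card does not describe how it was done. The following is a minimal sketch under stated assumptions: duplicates are taken to be pairs with identical transcription text, and the splits are carved from a seeded shuffle. `dedup_and_split` and the `(image_path, transcription)` tuple layout are hypothetical, not MWire Labs' actual tooling.

```python
import random
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (image_path, transcription), an assumed layout


def dedup_and_split(
    pairs: Iterable[Pair],
    n_train: int,
    n_val: int,
    n_test: int,
    seed: int = 0,
) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Drop pairs whose transcription was already seen, then cut three disjoint splits."""
    seen = set()
    unique: List[Pair] = []
    for image_path, text in pairs:
        if text in seen:  # assumption: a duplicate means identical text
            continue
        seen.add(text)
        unique.append((image_path, text))

    needed = n_train + n_val + n_test
    if len(unique) < needed:
        raise ValueError(f"need {needed} unique samples, found {len(unique)}")

    # A seeded shuffle makes the split reproducible.
    random.Random(seed).shuffle(unique)
    train = unique[:n_train]
    val = unique[n_train:n_train + n_val]
    test = unique[n_train + n_val:needed]
    return train, val, test
```

With the counts listed above, a call such as `dedup_and_split(pairs, 70_000, 5_000, 5_000)` would use 80,000 of the 102,171 unique samples.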
Usage
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

# Load the processor (image preprocessing + tokenizer) and the fine-tuned model
processor = TrOCRProcessor.from_pretrained("MWirelabs/mizo-ocr")
model = VisionEncoderDecoderModel.from_pretrained("MWirelabs/mizo-ocr")

# TrOCR expects a 3-channel RGB image, ideally cropped to a single line of text
image = Image.open("mizo_text.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Generate token IDs and decode them into a string
generated = model.generate(pixel_values)
text = processor.tokenizer.decode(generated[0], skip_special_tokens=True)
print(text)
Limitations
- Trained primarily on synthetic data with a small curated dataset; accuracy on real scanned documents may vary
- Optimized for printed text, not handwritten
- Performance may vary on heavily degraded or low-quality images
Citation
If you use this model, please cite:
@misc{mwirelabs2026mizoocr,
title={MizoOCR: First OCR Model for the Mizo Language},
author={MWire Labs},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/MWirelabs/mizo-ocr}
}
About MWire Labs
MWire Labs is an AI research organization based in Shillong, Meghalaya, India, specializing in language technology for Northeast India's indigenous languages.