MizoOCR

The first OCR model for the Mizo language, developed by MWire Labs.

Model Description

MizoOCR is a fine-tuned TrOCR model for recognizing printed Mizo text, including its diacritical characters (â, ê, î, ô, û). It is built on microsoft/trocr-base-printed and trained on 70,000 image-text pairs, a deduplicated mix of curated and synthetic samples drawn from a 200k-pair dataset generated by MWire Labs.

Performance

Split	Character Accuracy
Validation	89.61%
Test	90.68%
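
The card does not state how character accuracy was computed. A common definition is 1 minus the character error rate, that is, total edit distance divided by total reference characters. The sketch below assumes that definition and also assumes both strings are normalized to Unicode NFC first, so that a precomposed "â" and "a" plus a combining circumflex count as the same character. The function names are illustrative and not part of the model's released code.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_accuracy(predictions, references) -> float:
    """Corpus-level accuracy: 1 - (total edits / total reference characters).

    Assumption: this mirrors a standard CER-based metric; the metric used
    for the reported numbers may differ.
    """
    total_edits = 0
    total_chars = 0
    for pred, ref in zip(predictions, references):
        pred = unicodedata.normalize("NFC", pred)
        ref = unicodedata.normalize("NFC", ref)
        total_edits += levenshtein(pred, ref)
        total_chars += len(ref)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return 1.0 - total_edits / total_chars
```

Under this definition, a prediction that drops every circumflex is penalized once per affected character, which is why diacritic handling matters for the reported score.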

Training Data

Total unique samples after deduplication: 102,171
Training samples: 70,000
Validation samples: 5,000
Test samples: 5,000
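
The three splits sum to 80,000 samples, so 22,171 of the 102,171 unique samples are not assigned to any split listed here. As a rough illustration of how deduplication followed by a fixed-size split can be done, the sketch below drops pairs with repeated text and then cuts disjoint splits from a seeded shuffle. The exact-text-match criterion, the seed, and the function name are assumptions; the card does not describe the actual procedure.

```python
import random


def dedupe_and_split(pairs, n_train, n_val, n_test, seed=0):
    """Drop repeated texts, shuffle, and cut disjoint train/val/test splits.

    `pairs` is an iterable of (image_path, text) tuples.
    """
    seen = set()
    unique = []
    for image_path, text in pairs:
        # Assumption: two samples are duplicates when their text is identical.
        if text not in seen:
            seen.add(text)
            unique.append((image_path, text))

    needed = n_train + n_val + n_test
    if needed > len(unique):
        raise ValueError(
            f"requested {needed} samples but only {len(unique)} are unique"
        )

    # A seeded shuffle keeps the split reproducible across runs.
    random.Random(seed).shuffle(unique)
    train = unique[:n_train]
    val = unique[n_train:n_train + n_val]
    test = unique[n_train + n_val:needed]
    return train, val, test
```

With the numbers reported above, the call would be `dedupe_and_split(pairs, 70_000, 5_000, 5_000)`. Deduplicating before splitting is what keeps identical text from appearing in both the training and test sets.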

Usage

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

# Load the fine-tuned processor and model from the Hugging Face Hub
processor = TrOCRProcessor.from_pretrained("MWirelabs/mizo-ocr")
model = VisionEncoderDecoderModel.from_pretrained("MWirelabs/mizo-ocr")

# The processor expects a 3-channel image, so convert to RGB
image = Image.open("mizo_text.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Generate token IDs, then decode them to text
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
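
For more than one image, the same calls can be wrapped in a small batch helper. This is a minimal sketch, not an official API: `load_mizo_ocr` and `recognize_batch` are hypothetical names, and the `max_new_tokens=64` cap is an assumed value rather than a recommended setting. The `transformers` import is deferred into the loader so that the helper itself only depends on the objects passed to it.

```python
def load_mizo_ocr(model_id="MWirelabs/mizo-ocr"):
    """Load the processor and model from the Hugging Face Hub."""
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(model_id)
    model = VisionEncoderDecoderModel.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return processor, model


def recognize_batch(images, processor, model, max_new_tokens=64):
    """Return one decoded string per RGB PIL image, in input order."""
    if not images:
        return []
    # The processor resizes and normalizes all images into one tensor batch.
    pixel_values = processor(images=images, return_tensors="pt").pixel_values
    # Assumption: 64 new tokens is enough for one line of printed text.
    generated_ids = model.generate(pixel_values, max_new_tokens=max_new_tokens)
    texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
    return [text.strip() for text in texts]
```

A typical call would be `processor, model = load_mizo_ocr()` followed by `recognize_batch([Image.open(p).convert("RGB") for p in paths], processor, model)`, where `paths` is a list of image file paths.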

Limitations

Trained primarily on synthetic data, with only a small curated set, so accuracy on real scanned documents may vary
Optimized for printed text, not handwritten
Performance may vary on heavily degraded or low-quality images

Citation

If you use this model, please cite:

@misc{mwirelabs2026mizoocr,
  title={MizoOCR: First OCR Model for the Mizo Language},
  author={MWire Labs},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/MWirelabs/mizo-ocr}
}

About MWire Labs

MWire Labs is an AI research organization based in Shillong, Meghalaya, India, specializing in language technology for Northeast India's indigenous languages.

Model Details

Format: Safetensors
Model size: 0.3B params
Tensor type: F32
Base model: microsoft/trocr-base-printed