DistilBERT multilingual fine-tuned for Tatar Morphological Analysis
This model is a fine-tuned version of distilbert-base-multilingual-cased for morphological analysis of the Tatar language. It was trained on a subset of 80,000 sentences from the Tatar Morphological Corpus. The model predicts fine-grained morphological tags (e.g., N+Sg+Nom, V+PRES(Й)+3SG).
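The example tags above appear to be composite strings joined by `+`, with a leading category followed by morphological features. A minimal sketch of splitting such a tag, assuming that reading of the format holds for the whole tag set (the helper name `split_tag` is hypothetical and not part of the model):

```python
def split_tag(tag: str) -> tuple[str, list[str]]:
    """Split a composite tag such as 'N+Sg+Nom' into (category, features).

    Assumption: '+' separates components and the first component is the
    category. This is inferred from the example tags, not from a spec.
    """
    parts = tag.split("+")
    return parts[0], parts[1:]


for tag in ["N+Sg+Nom", "V+PRES(Й)+3SG", "Adv"]:
    category, features = split_tag(tag)
    print(category, features)
```

Splitting tags this way can help when grouping predictions by category or comparing individual features rather than whole tag strings.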
Performance on Test Set
| Metric | Value | 95% CI |
|---|---|---|
| Token Accuracy | 0.9850 | [0.9841, 0.9860] |
| Micro F1 | 0.9851 | [0.9841, 0.9860] |
| Macro F1 | 0.4324 | [0.4744, 0.5093]* |
*Note: the macro F1 confidence interval is given as reported in the paper; it does not contain the point estimate listed in this table.
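The large gap between micro and macro F1 is typical of a fine-grained tag set: micro F1 pools all tokens, while macro F1 averages per-tag F1 scores with equal weight, so rare tags that are never predicted correctly pull it down sharply. The sketch below illustrates the effect on invented toy data; the function name `micro_macro_f1` and the data are assumptions for illustration, not the evaluation code behind the table.

```python
from collections import Counter


def micro_macro_f1(gold, pred):
    """Return (micro F1, macro F1) for single-label token tagging."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag gets a false positive
            fn[g] += 1  # gold tag gets a false negative

    # Micro: pool the counts over all tags, then compute one F1.
    tp_all, fp_all, fn_all = sum(tp.values()), sum(fp.values()), sum(fn.values())
    denom = 2 * tp_all + fp_all + fn_all
    micro = 2 * tp_all / denom if denom else 0.0

    # Macro: compute F1 per tag, then take the unweighted mean.
    scores = []
    for label in set(gold) | set(pred):
        d = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / d if d else 0.0)
    macro = sum(scores) / len(scores) if scores else 0.0
    return micro, macro


# Toy data: one frequent tag predicted everywhere, two rare tags always missed.
gold = ["N+Sg+Nom"] * 98 + ["V+PRES(Й)+3SG", "Adv"]
pred = ["N+Sg+Nom"] * 100
micro, macro = micro_macro_f1(gold, pred)
print(f"micro={micro:.4f} macro={macro:.4f}")  # micro=0.9800 macro=0.3300
```

In this toy case 98% of tokens are correct, yet two of the three tags score zero, which is enough to drag the macro average to about a third.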
Accuracy by Part of Speech (Top 10)
| POS | Accuracy |
|---|---|
| PUNCT | 1.0000 |
| NOUN | 0.9836 |
| VERB | 0.9535 |
| ADJ | 0.9626 |
| PRON | 0.9896 |
| PART | 0.9973 |
| PROPN | 0.9754 |
| ADP | 1.0000 |
| CCONJ | 1.0000 |
| ADV | 0.9845 |
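A per-POS breakdown like the one above can be computed by grouping tokens by part of speech and counting exact tag matches in each group. The following is a minimal sketch, assuming each token comes with a gold POS label in addition to its gold and predicted morphological tags; the function name `accuracy_by_pos` and the toy data are hypothetical.

```python
from collections import defaultdict


def accuracy_by_pos(pos_labels, gold_tags, pred_tags):
    """Return {POS: share of tokens whose predicted tag equals the gold tag}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pos, gold, pred in zip(pos_labels, gold_tags, pred_tags):
        total[pos] += 1
        correct[pos] += int(gold == pred)
    return {pos: correct[pos] / total[pos] for pos in total}


# Toy data for illustration only.
pos_labels = ["NOUN", "NOUN", "VERB", "VERB", "PUNCT"]
gold_tags = ["N+Sg+Nom", "N+Sg+Nom", "V+PRES(Й)+3SG", "V+PRES(Й)+3SG", "PUNCT"]
pred_tags = ["N+Sg+Nom", "N+Sg+Nom", "V+PRES(Й)+3SG", "N+Sg+Nom", "PUNCT"]
print(accuracy_by_pos(pos_labels, gold_tags, pred_tags))
```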
Usage
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "TatarNLPWorld/distilbert-tatar-morph"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Input is a pre-tokenized sentence: one list item per word
tokens = ["Татар", "теле", "бик", "бай", "."]
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", truncation=True)

# Inference only, so gradients are not needed
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Get tag mapping from model config
id2tag = model.config.id2label

# Report one tag per word, taken from the word's first sub-token
word_ids = inputs.word_ids()
prev_word = None
for idx, word_idx in enumerate(word_ids):
    # Skip special tokens (None) and non-initial sub-tokens of a word
    if word_idx is not None and word_idx != prev_word:
        tag_id = predictions[0][idx].item()
        if isinstance(id2tag, dict):
            # Keys may be ints or strings depending on how the config was loaded
            tag = id2tag.get(str(tag_id), id2tag.get(tag_id, "UNK"))
        else:
            tag = id2tag[tag_id] if tag_id < len(id2tag) else "UNK"
        print(tokens[word_idx], "->", tag)
    prev_word = word_idx
Expected output (approximately):
Татар -> N+Sg+Nom
теле -> N+Sg+POSS_3(СЫ)+Nom
бик -> Adv
бай -> Adj
. -> PUNCT
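The usage code reports one tag per word by keeping the prediction at each word's first sub-token and skipping special tokens. That alignment step can be isolated into a small function and checked without downloading the model. This is a sketch only: the name `first_subword_tags`, the label map, and the toy inputs below are hypothetical stand-ins for `inputs.word_ids()`, the argmax predictions, and `model.config.id2label`.

```python
def first_subword_tags(word_ids, pred_ids, id2label):
    """Return one tag per word, read from the word's first sub-token.

    word_ids: per-sub-token word index, None for special tokens
    pred_ids: per-sub-token predicted label id (same length as word_ids)
    id2label: mapping from label id to tag string
    """
    tags = []
    prev_word = None
    for position, word_idx in enumerate(word_ids):
        if word_idx is not None and word_idx != prev_word:
            tags.append(id2label.get(pred_ids[position], "UNK"))
        prev_word = word_idx
    return tags


# Toy example: word 0 is split into two sub-tokens, the rest are single tokens.
word_ids = [None, 0, 0, 1, 2, None]
pred_ids = [0, 1, 3, 2, 0, 0]
id2label = {0: "PUNCT", 1: "N+Sg+Nom", 2: "Adv", 3: "Adj"}
print(first_subword_tags(word_ids, pred_ids, id2label))
```

The prediction at the second sub-token of word 0 is ignored, so the result has exactly one tag per input word.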
Citation
If you use this model, please cite it as:
@misc{arabov-distilbert-tatar-morph-2026,
title = {DistilBERT multilingual fine-tuned for Tatar Morphological Analysis},
author = {Arabov Mullosharaf Kurbonovich},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/TatarNLPWorld/distilbert-tatar-morph}
}
License
Apache 2.0