
DistilBERT multilingual fine-tuned for Tatar Morphological Analysis

This model is a fine-tuned version of distilbert-base-multilingual-cased for morphological analysis of the Tatar language. It was trained on a subset of 80,000 sentences from the Tatar Morphological Corpus. The model predicts fine-grained morphological tags (e.g., N+Sg+Nom, V+PRES(Й)+3SG).
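The tag strings in the examples above look like "+"-separated sequences. A minimal sketch for splitting one into its parts follows. It assumes that the first component is the part of speech, that the remaining components are features, and that a parenthesized suffix such as `(Й)` stays attached to its feature. None of this is confirmed by the model card, and `ParsedTag` and `parse_tag` are hypothetical helper names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParsedTag:
    pos: str             # first component, assumed to be the part of speech
    features: List[str]  # remaining components, e.g. ["Sg", "Nom"]


def parse_tag(tag: str) -> ParsedTag:
    """Split a '+'-separated tag into its first component and the rest.

    Assumption: '+' never occurs inside a single component.
    """
    parts = tag.split("+")
    return ParsedTag(pos=parts[0], features=parts[1:])


for tag in ["N+Sg+Nom", "V+PRES(Й)+3SG", "PUNCT"]:
    parsed = parse_tag(tag)
    print(tag, "->", parsed.pos, parsed.features)
```

Keeping the raw string as the prediction target and parsing it only afterwards means the model's label set stays unchanged.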

Performance on Test Set

| Metric | Value | 95% CI |
|---|---|---|
| Token Accuracy | 0.9850 | [0.9841, 0.9860] |
| Micro F1 | 0.9851 | [0.9841, 0.9860] |
| Macro F1 | 0.4324 | [0.4744, 0.5093]* |

*Note: the macro F1 confidence interval is given as reported in the paper; it does not contain the point estimate listed here.
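The large gap between micro and macro F1 is typical for tag sets with many rare labels. Micro F1 counts every token equally, so frequent tags dominate it. Macro F1 averages per-tag scores, so each rare, poorly predicted tag pulls it down. The sketch below illustrates this on made-up toy data. It is not the evaluation code used for this model, and the set of labels averaged over in the paper's macro F1 is unknown.

```python
from collections import Counter
from typing import List, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold: List[str], pred: List[str]) -> Tuple[float, float]:
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag gets a false positive
            fn[g] += 1  # gold tag gets a false negative
    labels = set(gold) | set(pred)
    # Micro: pool counts over all tags, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per tag, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Toy data (invented): one frequent tag predicted well, two rare tags missed.
gold = ["N+Sg+Nom"] * 8 + ["V+PRES(Й)+3SG", "Adv"]
pred = ["N+Sg+Nom"] * 10
micro, macro = micro_macro_f1(gold, pred)
print(f"micro F1 = {micro:.4f}, macro F1 = {macro:.4f}")
```

With one label per token, micro F1 equals token accuracy, which is consistent with the two values being nearly identical in the table.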

Accuracy by Part of Speech (Top 10)

| POS | Accuracy |
|---|---|
| PUNCT | 1.0000 |
| NOUN | 0.9836 |
| VERB | 0.9535 |
| ADJ | 0.9626 |
| PRON | 0.9896 |
| PART | 0.9973 |
| PROPN | 0.9754 |
| ADP | 1.0000 |
| CCONJ | 1.0000 |
| ADV | 0.9845 |
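A breakdown like the one above can be computed by grouping tokens by part of speech and counting exact tag matches within each group. The sketch below is an assumption about the procedure, not the authors' evaluation script. It takes the POS label as a separate field per token, because the model card does not say how POS labels are derived from the morphological tags. The rows are invented toy data, and `accuracy_by_pos` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_pos(rows: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """rows holds (pos, gold_tag, predicted_tag) triples, one per token."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for pos, gold_tag, pred_tag in rows:
        total[pos] += 1
        # A token counts as correct only if the full tag string matches.
        correct[pos] += int(gold_tag == pred_tag)
    return {pos: correct[pos] / total[pos] for pos in total}


# Toy rows (invented for illustration).
rows = [
    ("NOUN", "N+Sg+Nom", "N+Sg+Nom"),
    ("NOUN", "N+Sg+POSS_3(СЫ)+Nom", "N+Sg+Nom"),
    ("VERB", "V+PRES(Й)+3SG", "V+PRES(Й)+3SG"),
    ("PUNCT", "PUNCT", "PUNCT"),
]
for pos, acc in accuracy_by_pos(rows).items():
    print(f"{pos}\t{acc:.4f}")
```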

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "TatarNLPWorld/distilbert-tatar-morph"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

tokens = ["Татар", "теле", "бик", "бай", "."]
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", truncation=True)
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)

# Get tag mapping from model config
id2tag = model.config.id2label

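# word_ids maps each subword position to its source word (None for special tokens)
word_ids = inputs.word_ids()
prev_word = None
for idx, word_idx in enumerate(word_ids):
    # Use the prediction at the first subword of each word
    if word_idx is not None and word_idx != prev_word:
        tag_id = predictions[0][idx].item()
        if isinstance(id2tag, dict):
            # Keys may be ints or strings depending on how the config was loaded
            tag = id2tag.get(str(tag_id), id2tag.get(tag_id, "UNK"))
        else:
            tag = id2tag[tag_id] if tag_id < len(id2tag) else "UNK"
        print(tokens[word_idx], "->", tag)
    prev_word = word_idx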

Expected output (approximately):

Татар -> N+Sg+Nom
теле -> N+Sg+POSS_3(СЫ)+Nom
бик -> Adv
бай -> Adj
. -> PUNCT
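The loop in the usage example reports one tag per word by reading the prediction at the first subword of each word. That alignment step can be isolated into a small function and checked without loading the model. This is a sketch: `first_subword_positions` is a hypothetical helper name, and the `word_ids` list below is made up to resemble tokenizer output rather than taken from a real run.

```python
from typing import List, Optional


def first_subword_positions(word_ids: List[Optional[int]]) -> List[int]:
    """Return the position of the first subword of every word.

    word_ids follows the convention of the tokenizer's word_ids():
    None for special tokens, otherwise the index of the source word.
    """
    positions = []
    prev_word = None
    for idx, word_idx in enumerate(word_ids):
        if word_idx is not None and word_idx != prev_word:
            positions.append(idx)
        prev_word = word_idx
    return positions


# Made-up example: 5 words, where words 0 and 3 are split into two subwords.
word_ids = [None, 0, 0, 1, 2, 3, 3, 4, None]
positions = first_subword_positions(word_ids)
print(positions)
```

Because the usage example passes `truncation=True`, a result shorter than the input word list means the words at the end were cut off and received no prediction.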

Citation

If you use this model, please cite it as:

@misc{arabov-distilbert-tatar-morph-2026,
  title = {DistilBERT multilingual fine-tuned for Tatar Morphological Analysis},
  author = {Arabov Mullosharaf Kurbonovich},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/TatarNLPWorld/distilbert-tatar-morph}
}

License

Apache 2.0
