🏟️ Arabic Tokenizer Arena Pro
Advanced research & production platform for Arabic tokenization analysis
Select Tokenizer
Choose a tokenizer to analyze
Sample Texts
Select a sample or enter custom text
🏆 Arabic Tokenizer Leaderboard
All tokenizers are evaluated on all 8 Arabic datasets from HuggingFace (36,395 samples in total).
⏳ Loading cached results...
📊 Leaderboard Results
📈 Per-Dataset Breakdown
📚 Evaluation Datasets
| Dataset | Category | Samples |
|---|---|---|
| ArabicMMLU | MSA Benchmark | 5,000 |
| ASTD | Egyptian Dialect | 5,000 |
| ATHAR | Classical Arabic | 5,000 |
| ARCD | QA Dataset | 1,395 |
| Ashaar | Poetry | 5,000 |
| Hadith | Religious | 5,000 |
| Arabic Sentiment | Social Media | 5,000 |
| SANAD | News | 5,000 |
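For reference, a minimal sketch of pulling one of these corpora from the Hub with the `datasets` library. The Hub ID (`arcd`) and its SQuAD-style field names are assumptions about the public ARCD copy; the other datasets live under their own Hub IDs:

```python
from datasets import load_dataset

# Assumed Hub ID: "arcd" (Arabic Reading Comprehension Dataset).
# Its train + validation splits total 1,395 samples, matching the table above.
arcd = load_dataset("arcd", split="train")

# Flatten each QA pair into one evaluation string.
texts = [row["question"] + " " + row["context"] for row in arcd]
print(f"Loaded {len(texts)} Arabic samples for tokenizer evaluation")
```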
Tokenization Evaluation Metrics Guide
Efficiency Metrics
| Metric | Description | Ideal Value | Why It Matters |
|---|---|---|---|
| Fertility | Tokens per word | 1.0 | Lower fertility = fewer tokens = faster inference & lower cost |
| Compression Ratio | Bytes per token | Higher is better | Better compression = more efficient encoding |
| Chars/Token | Characters per token | Higher is better | More characters per token = better vocabulary utilization |
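To make these definitions concrete, here is a minimal sketch of computing all three efficiency metrics with `transformers`. The AraBERT v2 model ID is taken from the tokenizer list below; the Arena's exact implementation may differ, e.g. in how word boundaries are defined:

```python
from transformers import AutoTokenizer

def efficiency_metrics(text: str, model_id: str = "aubmindlab/bert-base-arabertv2") -> dict:
    """Fertility, compression ratio, and chars/token for a single text."""
    tok = AutoTokenizer.from_pretrained(model_id)
    tokens = tok.tokenize(text)
    words = text.split()  # whitespace words; a simplifying assumption
    n_tok = max(len(tokens), 1)
    return {
        "fertility": len(tokens) / max(len(words), 1),     # tokens per word (ideal: 1.0)
        "compression": len(text.encode("utf-8")) / n_tok,  # bytes per token (higher is better)
        "chars_per_token": len(text) / n_tok,              # characters per token
    }

print(efficiency_metrics("اللغة العربية لغة جميلة"))
```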
Coverage Metrics
| Metric | Description | Ideal Value | Why It Matters |
|---|---|---|---|
| OOV Rate | Out-of-vocabulary percentage | 0% | Lower OOV = better vocabulary coverage |
| STRR | Single Token Retention Rate | Higher is better | More words preserved as single tokens = better semantic boundaries |
| Continued Words Ratio | Words split into multiple tokens | Lower is better | Fewer splits = better word boundary preservation |
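A companion sketch for the coverage metrics, tokenizing word by word. Treating whitespace-delimited strings as "words" is again an assumption, not necessarily the Arena's exact segmentation:

```python
from transformers import AutoTokenizer

def coverage_metrics(text: str, model_id: str = "aubmindlab/bert-base-arabertv2") -> dict:
    """OOV rate, STRR, and continued-words ratio over whitespace words."""
    tok = AutoTokenizer.from_pretrained(model_id)
    words = text.split()
    single = oov = 0
    for word in words:
        pieces = tok.tokenize(word)
        if len(pieces) == 1:
            single += 1  # word survives as a single token
        if tok.unk_token is not None and tok.unk_token in pieces:
            oov += 1     # word maps (at least partly) to the unknown token
    n = max(len(words), 1)
    return {
        "oov_rate": oov / n,                     # ideal: 0%
        "strr": single / n,                      # higher is better
        "continued_words_ratio": 1 - single / n, # lower is better
    }
```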
Arabic-Specific Metrics
| Metric | Description | Why It Matters |
|---|---|---|
| Arabic Fertility | Tokens per Arabic word | Arabic-specific efficiency measure |
| Diacritic Preservation | Whether tashkeel is preserved | Important for religious & educational texts |
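Diacritic preservation can be checked with a round trip: encode, decode, and compare the tashkeel marks (U+064B–U+0652). A minimal sketch, with the model ID once more just an example:

```python
import re
from transformers import AutoTokenizer

TASHKEEL = re.compile(r"[\u064B-\u0652]")  # fathatan .. sukun

def preserves_diacritics(text: str, model_id: str) -> bool:
    """True if encode/decode round-trips every tashkeel mark in the text."""
    tok = AutoTokenizer.from_pretrained(model_id)
    decoded = tok.decode(tok.encode(text, add_special_tokens=False))
    return TASHKEEL.findall(decoded) == TASHKEEL.findall(text)

print(preserves_diacritics("كَتَبَ الوَلَدُ", "aubmindlab/bert-base-arabertv2"))
```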
Scoring Formula (Leaderboard)
Score = [(Fertility Score × 0.45) + (Compression Score × 0.35) + (UNK Score × 0.20)] × 100
Where:
- Fertility Score = 2.0 / fertility (capped to 0-1; inverted, so lower fertility = higher score)
- Compression Score = compression / 6 (capped to 0-1)
- UNK Score = 1 - (unk_ratio × 20) (capped to 0-1; inverted, so more UNK tokens = lower score)
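In code, the formula reads as follows (a direct transcription of the definitions above, not necessarily the Arena's actual source):

```python
def leaderboard_score(fertility: float, compression: float, unk_ratio: float) -> float:
    """Composite 0-100 leaderboard score from the three sub-scores above."""
    clamp = lambda x: min(max(x, 0.0), 1.0)
    fertility_score = clamp(2.0 / fertility)      # lower fertility -> higher score
    compression_score = clamp(compression / 6.0)
    unk_score = clamp(1.0 - unk_ratio * 20.0)     # even small UNK ratios cost points
    return (fertility_score * 0.45 + compression_score * 0.35 + unk_score * 0.20) * 100.0

print(leaderboard_score(fertility=1.8, compression=4.2, unk_ratio=0.001))  # ≈ 89.1
```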
Research Background
These metrics are based on recent research including:
- "A Comprehensive Analysis of Various Tokenizers for Arabic LLMs" (2024)
- "Evaluating Various Tokenizers for Arabic Text Classification" (Alyafeai et al.)
- "Beyond Fertility: STRR as a Metric for Multilingual Tokenization" (2025)
- "Arabic Stable LM: Adapting Stable LM to Arabic" (2024)
🚀 Submit Your Tokenizer
Evaluate any HuggingFace tokenizer on all 8 Arabic datasets and see how it compares.
Model Information
Model Type
Evaluation Results
📋 Submission Guidelines
- Model ID: Must be a valid HuggingFace model ID (e.g., `organization/model-name`)
- Tokenizer: The model must have a tokenizer that can be loaded with `AutoTokenizer` (see the sketch after this list)
- Public Models: Only public models on the HuggingFace Hub are supported
- Evaluation: Your tokenizer will be evaluated on all 8 Arabic datasets (36,395 samples)
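Before submitting, it helps to verify locally that the tokenizer loads the same way the Arena will load it. A quick check (the model ID is just an example from the list below):

```python
from transformers import AutoTokenizer

model_id = "aubmindlab/bert-base-arabertv2"  # replace with your model ID
tok = AutoTokenizer.from_pretrained(model_id)
print(tok.tokenize("مرحبا بالعالم"))  # if this prints tokens, the Arena can load it too
```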
💡 Tips
- Lower fertility scores indicate better Arabic tokenization efficiency
- Compare your results with the leaderboard to see how your tokenizer ranks
🏟️ Arabic Tokenizer Arena Pro
A comprehensive platform for evaluating Arabic tokenizers across multiple dimensions
- 27 Available Tokenizers
- 8 Evaluation Datasets
- 8+ Metrics
🔠 Available Tokenizers
Arabic BERT Models
- AraBERT v2 (AUB MIND Lab)
- AraBERT v2 Large (AUB MIND Lab)
- CAMeLBERT Mix (CAMeL Lab NYU Abu Dhabi)
- CAMeLBERT MSA (CAMeL Lab NYU Abu Dhabi)
- CAMeLBERT DA (CAMeL Lab NYU Abu Dhabi)
- CAMeLBERT CA (CAMeL Lab NYU Abu Dhabi)
- MARBERT (UBC NLP)
- ARBERT (UBC NLP)
- Arabic BERT (Safaya)
Arabic Tokenizers
- Aranizer PBE 86K (RIOTU Lab)
- Aranizer SP 86K (RIOTU Lab)
- Aranizer PBE 32K (RIOTU Lab)
- Aranizer SP 32K (RIOTU Lab)
Arabic LLMs
- AceGPT 13B Chat (Freedom Intelligence)
- SILMA 9B Instruct (SILMA AI)
- SILMA Kashif 2B (RAG) (SILMA AI)
- Fanar 9B Instruct (QCRI, Qatar)
- Arabic StableLM 2 Chat (Stability AI)
- Atlas-Chat 9B (Darija) (MBZUAI Paris)
- Atlas-Chat 2B (Darija) (MBZUAI Paris)
Multilingual Models
- Qwen 2.5 7B (Alibaba Qwen)
- Gemma 2 9B (Google)
- Mistral 7B v0.3 (Mistral AI)
- Mistral Nemo (Mistral AI + NVIDIA)
- XLM-RoBERTa Base (Facebook AI)
- mBERT (Google)
- Falcon 7B (Technology Innovation Institute)
✨ Features
- 📊 Comprehensive efficiency metrics (fertility, compression, STRR)
- 🔍 Arabic-specific analysis (dialect support, diacritic preservation)
- ⚖️ Side-by-side tokenizer comparison
- 🎨 Beautiful token visualization
- 🏆 Leaderboard with real HuggingFace datasets
- 🌐 Support for MSA, dialectal, and Classical Arabic
🎯 Use Cases
- 🔬 Research: Compare tokenizers for Arabic NLP experiments
- 🏭 Production: Select the optimal tokenizer for deployment
- 🎓 Education: Understand how different algorithms handle Arabic
- 💰 Optimization: Identify cost-efficient tokenizers for API usage
Built with ❤️ for the Arabic NLP community