PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing
Introduction
PaddleOCR-VL-1.5 is the next-generation model in the PaddleOCR-VL series, achieving a new state-of-the-art accuracy of 94.5% on OmniDocBench v1.5. To rigorously evaluate robustness against real-world physical distortions—including scanning artifacts, skew, warping, screen photography, and illumination—we propose the Real5-OmniDocBench benchmark. Experimental results demonstrate that the enhanced model attains SOTA performance on this newly curated benchmark. Furthermore, we extend the model's capabilities with seal recognition and text spotting tasks, while keeping it an ultra-compact, highly efficient 0.9B VLM. This repository contains the official transformers weights for PaddleOCR-VL-1.5.
Key Capabilities of PaddleOCR-VL-1.5
With only 0.9B parameters, PaddleOCR-VL-1.5 achieves 94.5% accuracy on OmniDocBench v1.5, surpassing the previous SOTA model, PaddleOCR-VL. Significant improvements are observed in table, formula, and text recognition.
It introduces an innovative approach to document parsing by supporting irregular-shaped localization, enabling accurate polygonal detection under skewed and warped document conditions. Evaluations across five real-world scenarios—scanning, skew, warping, screen-photography, and illumination—demonstrate superior performance over mainstream open-source and proprietary models.
The model introduces text spotting (text-line localization and recognition) along with seal recognition, setting new SOTA results on both tasks.
PaddleOCR-VL-1.5 further strengthens its capabilities in specialized scenarios and multilingual recognition. Recognition performance is improved for rare characters, ancient texts, multilingual tables, underlines, and checkboxes, and language coverage is extended to include Tibetan and Bengali.
The model supports automatic cross-page table merging and cross-page paragraph heading recognition, effectively mitigating content fragmentation issues in long-document parsing.
Model Architecture
News
2026.01.29 🚀 We release PaddleOCR-VL-1.5, a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing.
Usage
This model is designed to be used together with PPDocLayoutV3 for layout detection. An end-to-end inference example is available in this notebook.
Make sure you have transformers v5.0.0 or later installed:
python -m pip install "transformers>=5.0.0"
You can load the model as follows. Since inference requires detected layout regions, please refer to the notebook for the complete pipeline.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "PaddlePaddle/PaddleOCR-VL-1.5-hf"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to(device)
def ocr_region(crop, prompt):
    """Run PaddleOCR-VL-1.5 on a single cropped region."""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": crop},
                {"type": "text", "text": prompt},
            ],
        }
    ]
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=1024)
    # Drop the prompt tokens and decode only the newly generated ones.
    trimmed = generated_ids[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(trimmed, skip_special_tokens=True)
parsed_regions = []
# assuming you have detected regions from PPDocLayoutV3 in `detections`
for det in detections:
    label = det["label"]
    prompt = LABEL_TO_PROMPT.get(label)
    x1, y1, x2, y2 = det["box"]
    crop = image.crop((x1, y1, x2, y2))
    text = ocr_region(crop, prompt)
    parsed_regions.append({**det, "prompt": prompt, "text": text})
    print(f"[{det['order']}] {label} prompt={prompt}")
    print(text)
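The loop above assumes a `LABEL_TO_PROMPT` dictionary that maps each layout label to a task-specific prompt. The real mapping is defined in the accompanying notebook; the label names and prompt strings below are illustrative placeholders only, sketching how such a lookup with a plain-OCR fallback might be structured.

```python
# Hypothetical mapping from layout labels to task prompts.
# These keys and prompt strings are placeholders; use the actual
# mapping from the notebook in real inference.
LABEL_TO_PROMPT = {
    "text": "OCR:",
    "table": "Table Recognition:",
    "formula": "Formula Recognition:",
    "chart": "Chart Recognition:",
}

def prompt_for(label, default="OCR:"):
    """Look up the task prompt for a layout label, falling back to plain OCR."""
    return LABEL_TO_PROMPT.get(label, default)
```

Falling back to a default prompt keeps the loop robust when the layout model emits a label with no dedicated task.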
Performance
Document Parsing
1. OmniDocBench v1.5
PaddleOCR-VL-1.5 achieves SOTA performance on the overall, text, formula, table, and reading-order metrics of OmniDocBench v1.5.
Notes:
- Performance metrics are cited from the OmniDocBench official leaderboard, except for Gemini-3 Pro, Qwen3-VL-235B-A22B-Instruct and our model, which were evaluated independently.
2. Real5-OmniDocBench
Across all five diverse and challenging scenarios—scanning, warping, screen-photography, illumination, and skew—PaddleOCR-VL-1.5 consistently sets new SOTA records.
Notes:
- Real5-OmniDocBench is a new benchmark oriented toward real-world scenarios, constructed from the OmniDocBench v1.5 dataset. It comprises five distinct scenarios: Scanning, Warping, Screen-photography, Illumination, and Skew. For further details, please refer to Real5-OmniDocBench.
Inference Performance
Notes:
- End-to-End Inference Performance Comparison on OmniDocBench v1.5. PDF documents were processed in batches of 512 on a single NVIDIA A100 GPU. The reported end-to-end runtime includes both PDF rendering and Markdown generation. All methods rely on their built-in PDF parsing modules and default DPI settings to reflect out-of-the-box performance.
Visualization
Real-world Document Parsing
Illumination
Skew
Screen Photography
Scanning
Warping
Text Spotting
Seal Recognition
Acknowledgments
We would like to thank PaddleFormers, Keye, MinerU, and OmniDocBench for providing valuable code, model weights, and benchmarks. We also appreciate everyone's contributions to this open-source project!
Citation
If you find PaddleOCR-VL-1.5 helpful, feel free to give us a star and a citation.
@misc{cui2026paddleocrvl15multitask09bvlm,
title={PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing},
author={Cheng Cui and Ting Sun and Suyin Liang and Tingquan Gao and Zelun Zhang and Jiaxuan Liu and Xueqing Wang and Changda Zhou and Hongen Liu and Manhui Lin and Yue Zhang and Yubo Zhang and Yi Liu and Dianhai Yu and Yanjun Ma},
year={2026},
eprint={2601.21957},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.21957},
}
Base model: baidu/ERNIE-4.5-0.3B-Paddle