Introduction
Multimodal LLMs create new opportunities for extracting text from difficult images. But what are the pros and cons? How do Deepseek, Qwen, Gemini, and ChatGPT compare to traditional OCR packages?
This post compares different LLMs and traditional Python OCR tools using Jiwer’s WER and CER metrics to assess accuracy. Lessons from running large-scale text extraction using Gemini are also discussed.
Full evaluation code, tables, and workflows can be found in this GitHub repository.
Key Findings
- LLMs offer high-accuracy OCR, but are costly, slow, and require powerful hardware.
- Traditional OCR tools are lightweight and fast, but generally less accurate.
OCR Packages (EasyOCR, Tesseract, PaddleOCR)
Archival OCR is important for NLP tasks like topic modelling. Python packages like EasyOCR, PyTesseract, and PaddleOCR are commonly used. This comparison focuses on practical performance, not theoretical strengths.
All evaluations use Jiwer’s Word Error Rate (WER) and Character Error Rate (CER).
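To make the setup concrete, here is a minimal sketch of one evaluation step, with placeholder file paths: extract text with Tesseract via pytesseract (EasyOCR and PaddleOCR slot in the same way) and score it against a manual transcription with Jiwer.

```python
import jiwer
import pytesseract
from PIL import Image

# Placeholder paths: one scanned page and its manual transcription
image_path = "page_001.png"
with open("page_001_ground_truth.txt", encoding="utf-8") as f:
    reference = f.read()

# Run Tesseract on the image (EasyOCR/PaddleOCR would slot in here instead)
hypothesis = pytesseract.image_to_string(Image.open(image_path))

# Jiwer's word- and character-level error rates (lower is better)
print("WER:", jiwer.wer(reference, hypothesis))
print("CER:", jiwer.cer(reference, hypothesis))
```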
Traditional OCR Results
| Engine | WER | CER |
| --- | --- | --- |
| EasyOCR | 0.89 | 0.67 |
| Tesseract | 0.69 | 0.43 |
| PaddleOCR | 0.79 | 0.76 |
Tesseract performs best overall. EasyOCR has a lower CER than PaddleOCR, but higher WER, suggesting it identifies characters well but struggles with correct word segmentation.
Preprocessing Impact
| Step | WER | CER |
| --- | --- | --- |
| Before Preprocessing | 0.77 | 0.60 |
| After Preprocessing | 0.67 | 0.43 |
Preprocessing significantly improves accuracy. This step is recommended for all OCR pipelines.
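The exact steps will depend on the collection; the sketch below shows a typical pipeline (grayscale, light denoising, adaptive thresholding with OpenCV), assumed here for illustration rather than taken from the experiments above.

```python
import cv2
import numpy as np

def preprocess(image_path: str) -> np.ndarray:
    # Load and convert to grayscale
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Light denoise, then binarise with an adaptive threshold
    denoised = cv2.medianBlur(gray, 3)
    binary = cv2.adaptiveThreshold(
        denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 31, 10,
    )
    return binary

cv2.imwrite("page_001_clean.png", preprocess("page_001.png"))
```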
Post-Processing with LLMs
Post-correction with LLMs can fix some OCR issues (e.g., word splits), but won't recover text not detected by the OCR engine. Also, token limits and prompt misinterpretations pose risks at scale.
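Here is a minimal sketch of prompt-based post-correction, assuming the google-generativeai SDK, a placeholder API key, and an arbitrary chunk size chosen to stay under token limits. Note the model can only repair detected text, never recover missed text.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-2.0-flash-lite")

PROMPT = (
    "Correct obvious OCR errors (split words, confused characters) in the "
    "text below. Do not add, remove, or paraphrase content.\n\n"
)
CHUNK_CHARS = 4000  # assumed chunk size to stay under token limits

def post_correct(ocr_text: str) -> str:
    # Split the OCR output into chunks and correct each one independently
    chunks = [ocr_text[i:i + CHUNK_CHARS]
              for i in range(0, len(ocr_text), CHUNK_CHARS)]
    return "".join(model.generate_content(PROMPT + c).text for c in chunks)
```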
LLMs as an OCR Solution
The results speak for themselves:
| Engine | WER | CER | LLM |
| --- | --- | --- | --- |
| Gemini | 0.04 | 0.02 | Yes |
| Qwen | 0.06 | 0.03 | Yes |
| Deepseek | 0.10 | 0.06 | Yes |
| ChatGPT | 0.58 | 0.45 | Yes |
| Tesseract | 0.69 | 0.43 | No |
| PaddleOCR | 0.79 | 0.76 | No |
| EasyOCR | 0.89 | 0.67 | No |
Multimodal LLMs outperform traditional OCR significantly. However, they likely include some internal correction pipelines, making the comparison imperfect.
Word Mismatch Counts
| Model | Word Mismatch |
| --- | --- |
| Gemini | 30 |
| Qwen | 26 |
| Deepseek | 276 |
| ChatGPT | 108 |
Gemini performs best overall (Qwen has marginally fewer word mismatches, but Gemini leads on WER and CER) and has a usable Python wrapper. It does, however, introduce new challenges.
Deployment Considerations: Gemini
Gemini’s main issues are:
- Rate limits:
  - Gemini 2.0 Flash-Lite: 30 requests/min, 1,500/day
  - See full limits
- Copyright flags:
  - Gemini may falsely flag archival material.
  - Handling involves rerouting failed items to alternative models (a sketch of both mitigations follows this list).
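As a concrete sketch of both mitigations, here is a minimal loop using the google-generativeai SDK: requests are throttled to the free-tier quota, and items that fail (e.g. copyright/safety blocks, whose exact exception type varies by SDK version) are queued for a fallback model. The paths, prompt, and API key are placeholders.

```python
import time
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-2.0-flash-lite")

REQUESTS_PER_MIN = 30  # free-tier limit quoted above
fallback_queue = []    # items rerouted to an alternative model

def transcribe(image_path: str) -> str | None:
    try:
        response = model.generate_content(
            ["Transcribe all text in this image verbatim.",
             Image.open(image_path)]
        )
        return response.text
    except Exception as exc:  # copyright/safety blocks land here; reroute
        fallback_queue.append(image_path)
        print(f"{image_path} flagged ({exc}); queued for fallback model")
        return None

for path in ["page_001.png", "page_002.png"]:  # placeholder batch
    text = transcribe(path)
    time.sleep(60 / REQUESTS_PER_MIN)  # stay under the per-minute quota
```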
Deployment Considerations: Qwen via Ollama
- Qwen via Ollama is an option for local runs
Negatives
- Requires sufficient hardware.
- Accuracy may drop slightly due to quantisation.
Positives
- Free
- No reliance on internet during runs
- Data Privacy
Qwen2.5-VL (7B) performed as follows:
- WER: 0.22
- CER: 0.15
This is notably worse than Qwen3 through the browser interface. Results can be improved with a larger model, but that demands better hardware than I have, and even then it may not rival online usage, so the choice comes down to the user's priorities. Hugging Face can also be used for this task, again as a matter of preference. It is worth noting, though, that Ollama is discussing its multimodal optimisation strategy, which should turn heads.
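For reference, here is a minimal local-run sketch using the ollama Python client; the qwen2.5vl:7b tag, prompt, and image path are assumptions, so check the Ollama library for the exact model name.

```python
import ollama

response = ollama.chat(
    model="qwen2.5vl:7b",  # assumed tag; fetch it first with `ollama pull`
    messages=[{
        "role": "user",
        "content": "Transcribe all text in this image verbatim.",
        "images": ["page_001.png"],  # local path; nothing leaves the machine
    }],
)
print(response["message"]["content"])
```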
Issues with LLMs
LLMs are often criticised for stochasticity: the same input can yield different outputs across runs. This was tested using Deepseek.
Deepseek Consistency Results
| Run | WER | CER |
| --- | --- | --- |
| Run 1 | 0.10 | 0.06 |
| Run 2 | 0.10 | 0.06 |
| Run 3 | 0.10 | 0.06 |
| Run 4 | 0.10 | 0.06 |
At two decimal places, Deepseek's scores were identical across runs; finer-grained pairwise comparison reveals only minor variation:
| Comparison | WER Difference |
| --- | --- |
| Run 2 vs Run 1 | 0.0128 |
| Run 3 vs Run 1 | 0.0000 |
| Run 4 vs Run 1 | 0.0118 |
Still, live monitoring of results (e.g., mean character count) is recommended for production pipelines.
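One lightweight check, sketched below with illustrative thresholds: track the mean character count of transcriptions per batch and flag batches that drift sharply from a baseline.

```python
from statistics import mean

def check_batch(transcripts: list[str], baseline_mean: float,
                tolerance: float = 0.3) -> bool:
    """Flag a batch whose mean transcript length drifts beyond tolerance."""
    batch_mean = mean(len(t) for t in transcripts)
    drift = abs(batch_mean - baseline_mean) / baseline_mean
    if drift > tolerance:
        print(f"WARNING: mean length {batch_mean:.0f} vs baseline "
              f"{baseline_mean:.0f} (drift {drift:.0%}); inspect outputs")
        return False
    return True
```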
Conclusion
This post compared LLMs and traditional OCR tools for image-to-text pipelines:
- Traditional OCR tools like Tesseract are stable and lightweight, but less accurate.
- LLMs like Gemini and Deepseek outperform on accuracy, but introduce complexity, cost, and deployment challenges.
The best choice depends on your goals:
- For cost-effective large-scale work: Tesseract with preprocessing
- For highest accuracy and smaller datasets: Gemini or Deepseek
For full code and data:
👉 GitHub Repository
References
- Amrhein, C. & Clematide, S. (2018). Supervised OCR Error Detection and Correction. JLCL, 33(1).
- Compton, T. (2025). OCR Evaluation. GitHub
- Hemmer, A. et al. (2024). Confidence-Aware OCR Error Detection. Document Analysis Systems, Springer.
- Kim, S. et al. (2025). LLMs and OCR for Historical Records. arXiv. doi:10.48550/arXiv.2501.11623
- Warwick Modern Record Centre, archival material