Introduction
Multimodal LLMs create new opportunities for extracting text from difficult images. But what are the pros and cons? How do Deepseek, Qwen, Gemini, and ChatGPT compare to traditional OCR packages?
This post compares different LLMs and traditional Python OCR tools using Jiwer’s WER and CER metrics to assess accuracy. Lessons from running large-scale text extraction using Gemini are also discussed.
Full evaluation code, tables, and workflows can be found in this GitHub repository.
Key Findings
- LLMs offer high-accuracy OCR, but are costly, slow, and require powerful hardware.
- Traditional OCR tools are lightweight and fast, but generally less accurate.
OCR Packages (EasyOCR, Tesseract, PaddleOCR)
Archival OCR is important for NLP tasks like topic modelling. Python packages like EasyOCR, PyTesseract, and PaddleOCR are commonly used. This comparison focuses on practical performance, not theoretical strengths.
All evaluations use Jiwer’s Word Error Rate (WER) and Character Error Rate (CER).
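To make the setup concrete, here is a minimal sketch of one evaluation step, with placeholder file paths: extract text with Tesseract via pytesseract (EasyOCR and PaddleOCR slot in the same way) and score it against a manual transcription with Jiwer.

```python
import jiwer
import pytesseract
from PIL import Image

# Placeholder paths: one scanned page and its manual transcription
image_path = "page_001.png"
with open("page_001_ground_truth.txt", encoding="utf-8") as f:
    reference = f.read()

# Run Tesseract on the image (EasyOCR/PaddleOCR would slot in here instead)
hypothesis = pytesseract.image_to_string(Image.open(image_path))

# Jiwer's word- and character-level error rates (lower is better)
print("WER:", jiwer.wer(reference, hypothesis))
print("CER:", jiwer.cer(reference, hypothesis))
```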
Traditional OCR Results
| Engine | WER | CER |
| --- | --- | --- |
| EasyOCR | 0.89 | 0.67 |
| Tesseract | 0.69 | 0.43 |
| PaddleOCR | 0.79 | 0.76 |
Tesseract performs best overall. EasyOCR has a lower CER than PaddleOCR, but higher WER, suggesting it identifies characters well but struggles with correct word segmentation.
Preprocessing Impact
| Step | WER | CER |
| --- | --- | --- |
| Before Preprocessing | 0.77 | 0.60 |
| After Preprocessing | 0.67 | 0.43 |
Preprocessing significantly improves accuracy. This step is recommended for all OCR pipelines.
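The exact steps will depend on the collection; the sketch below shows a typical pipeline (grayscale, light denoising, adaptive thresholding with OpenCV), assumed here for illustration rather than taken from the experiments above.

```python
import cv2
import numpy as np

def preprocess(image_path: str) -> np.ndarray:
    # Load and convert to grayscale
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Light denoise, then binarise with an adaptive threshold
    denoised = cv2.medianBlur(gray, 3)
    binary = cv2.adaptiveThreshold(
        denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 31, 10,
    )
    return binary

cv2.imwrite("page_001_clean.png", preprocess("page_001.png"))
```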
Post-Processing with LLMs
Post-correction with LLMs can fix some OCR issues (e.g., word splits), but won't recover text not detected by the OCR engine. Also, token limits and prompt misinterpretations pose risks at scale.
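Here is a minimal sketch of prompt-based post-correction, assuming the google-generativeai SDK, a placeholder API key, and an arbitrary chunk size chosen to stay under token limits. Note the model can only repair detected text, never recover missed text.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-2.0-flash-lite")

PROMPT = (
    "Correct obvious OCR errors (split words, confused characters) in the "
    "text below. Do not add, remove, or paraphrase content.\n\n"
)
CHUNK_CHARS = 4000  # assumed chunk size to stay under token limits

def post_correct(ocr_text: str) -> str:
    # Split the OCR output into chunks and correct each one independently
    chunks = [ocr_text[i:i + CHUNK_CHARS]
              for i in range(0, len(ocr_text), CHUNK_CHARS)]
    return "".join(model.generate_content(PROMPT + c).text for c in chunks)
```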
LLMs as an OCR Solution
The results speak for themselves:
| Engine | WER | CER | LLM |
| --- | --- | --- | --- |
| Gemini | 0.04 | 0.02 | Yes |
| Qwen | 0.06 | 0.03 | Yes |
| Deepseek | 0.10 | 0.06 | Yes |
| ChatGPT | 0.58 | 0.45 | Yes |
| Tesseract | 0.69 | 0.43 | No |
| PaddleOCR | 0.79 | 0.76 | No |
| EasyOCR | 0.89 | 0.67 | No |
Multimodal LLMs outperform traditional OCR significantly. However, they likely include some internal correction pipelines, making the comparison imperfect.
Word Mismatch Counts
| Model | Word Mismatch |
| --- | --- |
| Gemini | 30 |
| Qwen | 26 |
| Deepseek | 276 |
| ChatGPT | 108 |
Gemini performs best overall (Qwen has marginally fewer word mismatches, but Gemini leads on WER and CER) and has a usable Python wrapper. It does, however, introduce new challenges.
Deployment Considerations: Gemini
Gemini’s main issues are:
- Rate limits:
  - Gemini 2.0 Flash-Lite: 30 requests/min, 1,500/day
  - See full limits
- Copyright flags:
  - Gemini may falsely flag archival material.
  - Handling involves rerouting failed items to alternative models (a sketch of both mitigations follows this list).
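As a concrete sketch of both mitigations, here is a minimal loop using the google-generativeai SDK: requests are throttled to the free-tier quota, and items that fail (e.g. copyright/safety blocks, whose exact exception type varies by SDK version) are queued for a fallback model. The paths, prompt, and API key are placeholders.

```python
import time
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-2.0-flash-lite")

REQUESTS_PER_MIN = 30  # free-tier limit quoted above
fallback_queue = []    # items rerouted to an alternative model

def transcribe(image_path: str) -> str | None:
    try:
        response = model.generate_content(
            ["Transcribe all text in this image verbatim.",
             Image.open(image_path)]
        )
        return response.text
    except Exception as exc:  # copyright/safety blocks land here; reroute
        fallback_queue.append(image_path)
        print(f"{image_path} flagged ({exc}); queued for fallback model")
        return None

for path in ["page_001.png", "page_002.png"]:  # placeholder batch
    text = transcribe(path)
    time.sleep(60 / REQUESTS_PER_MIN)  # stay under the per-minute quota
```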
Deployment Considerations: Qwen via Ollama
- Qwen via Ollama is an option for local runs
Negatives
- Requires sufficient hardware.
- Accuracy may drop slightly due to quantisation.
Positives
- Free
- No reliance on internet during runs
- Data Privacy
Qwen2.5-VL (7B) performed as follows:
- WER: 0.22
- CER: 0.15
This is notably worse than Qwen3 through the browser interface. Results can be improved with a larger model, but that demands better hardware than I have, and even then it may not rival online usage, so the choice comes down to the user's priorities. Hugging Face can also be used for this task, again as a matter of preference. It is worth noting, though, that Ollama is discussing its multimodal optimisation strategy, which should turn heads.
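For reference, here is a minimal local-run sketch using the ollama Python client; the qwen2.5vl:7b tag, prompt, and image path are assumptions, so check the Ollama library for the exact model name.

```python
import ollama

response = ollama.chat(
    model="qwen2.5vl:7b",  # assumed tag; fetch it first with `ollama pull`
    messages=[{
        "role": "user",
        "content": "Transcribe all text in this image verbatim.",
        "images": ["page_001.png"],  # local path; nothing leaves the machine
    }],
)
print(response["message"]["content"])
```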
Issues with LLMs
LLMs are often criticised for stochasticity: the same input can yield different outputs across runs. This was tested using Deepseek.
Deepseek Consistency Results
| Run | WER | CER |
| --- | --- | --- |
| Run 1 | 0.10 | 0.06 |
| Run 2 | 0.10 | 0.06 |
| Run 3 | 0.10 | 0.06 |
| Run 4 | 0.10 | 0.06 |
At two decimal places, Deepseek's scores were identical across runs; finer-grained pairwise comparison reveals only minor variation:
| Comparison | WER Difference |
| --- | --- |
| Run 2 vs Run 1 | 0.0128 |
| Run 3 vs Run 1 | 0.0000 |
| Run 4 vs Run 1 | 0.0118 |
Still, live monitoring of results (e.g., mean character count) is recommended for production pipelines.
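One lightweight check, sketched below with illustrative thresholds: track the mean character count of transcriptions per batch and flag batches that drift sharply from a baseline.

```python
from statistics import mean

def check_batch(transcripts: list[str], baseline_mean: float,
                tolerance: float = 0.3) -> bool:
    """Flag a batch whose mean transcript length drifts beyond tolerance."""
    batch_mean = mean(len(t) for t in transcripts)
    drift = abs(batch_mean - baseline_mean) / baseline_mean
    if drift > tolerance:
        print(f"WARNING: mean length {batch_mean:.0f} vs baseline "
              f"{baseline_mean:.0f} (drift {drift:.0%}); inspect outputs")
        return False
    return True
```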
Conclusion
This post compared LLMs and traditional OCR tools for image-to-text pipelines:
- Traditional OCR tools like Tesseract are stable and lightweight, but less accurate.
- LLMs like Gemini and Deepseek outperform on accuracy, but introduce complexity, cost, and deployment challenges.
The best choice depends on your goals:
- For cost-effective large-scale work: Tesseract with preprocessing
- For highest accuracy and smaller datasets: Gemini or Deepseek
For full code and data:
👉 GitHub Repository
References
- Amrhein, C. & Clematide, S. (2018). Supervised OCR Error Detection and Correction. JLCL, 33(1).
- Compton, T. (2025). OCR Evaluation. GitHub
- Hemmer, A. et al. (2024). Confidence-Aware OCR Error Detection. Document Analysis Systems, Springer.
- Kim, S. et al. (2025). LLMs and OCR for Historical Records. arXiv. doi:10.48550/arXiv.2501.11623
- Warwick Modern Record Centre, archival material