Thomas Compton

Comparing LLMs and Python OCR Packages: Opportunities and Challenges in OCR Accuracy

Introduction

Multimodal LLMs create new opportunities for extracting text from difficult images. But what are the pros and cons? How do Deepseek, Qwen, Gemini, and ChatGPT compare to traditional OCR packages?

This post compares different LLMs and traditional Python OCR tools using Jiwer’s WER and CER metrics to assess accuracy. Lessons from running large-scale text extraction using Gemini are also discussed.

Full evaluation code, tables, and workflows can be found in this GitHub repository.

Key Findings

  • LLMs offer high-accuracy OCR, but are costly, slow, and require powerful hardware.
  • Traditional OCR tools are lightweight and fast, but generally less accurate.

OCR Packages (EasyOCR, Tesseract, PaddleOCR)

Archival OCR is important for NLP tasks like topic modelling. Python packages like EasyOCR, PyTesseract, and PaddleOCR are commonly used. This comparison focuses on practical performance, not theoretical strengths.
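
For reference, a minimal pytesseract sketch (the filename is hypothetical, and the Tesseract binary must be installed separately):

```python
# pip install pytesseract pillow
from PIL import Image
import pytesseract

# Open a scanned page and extract its text with Tesseract.
image = Image.open("page.png")  # hypothetical input scan
text = pytesseract.image_to_string(image)
print(text)
```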

All evaluations use Jiwer’s Word Error Rate (WER) and Character Error Rate (CER).
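
As a quick illustration of the metrics (with made-up strings, not the evaluation data):

```python
# pip install jiwer
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumps ove the lazy d0g"

# WER counts word-level substitutions/insertions/deletions;
# CER does the same at character level.
print(f"WER: {jiwer.wer(reference, hypothesis):.2f}")
print(f"CER: {jiwer.cer(reference, hypothesis):.2f}")
```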

Traditional OCR Results

| Engine | WER | CER |
| --- | --- | --- |
| EasyOCR | 0.89 | 0.67 |
| Tesseract | 0.69 | 0.43 |
| PaddleOCR | 0.79 | 0.76 |

Tesseract performs best overall. EasyOCR has a lower CER than PaddleOCR, but higher WER, suggesting it identifies characters well but struggles with correct word segmentation.

Preprocessing Impact

| Step | WER | CER |
| --- | --- | --- |
| Before preprocessing | 0.77 | 0.60 |
| After preprocessing | 0.67 | 0.43 |

Preprocessing significantly improves accuracy. This step is recommended for all OCR pipelines.
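
The repository contains the exact pipeline; a common minimal pass with OpenCV looks something like this (filenames hypothetical):

```python
# pip install opencv-python
import cv2

# Greyscale, denoise, and binarise a scan before OCR.
img = cv2.imread("page.png")                  # hypothetical input scan
grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # drop colour information
grey = cv2.medianBlur(grey, 3)                # remove speckle noise
_, binary = cv2.threshold(grey, 0, 255,
                          cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # Otsu binarisation
cv2.imwrite("page_clean.png", binary)
```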

Post-Processing with LLMs

Post-correction with LLMs can fix some OCR issues (e.g., word splits), but won't recover text not detected by the OCR engine. Also, token limits and prompt misinterpretations pose risks at scale.
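
A generic post-correction sketch, assuming the OpenAI Python client (model name and prompt are illustrative, not the setup used here):

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ocr_text = "Th e qu ick br own fox"  # raw OCR output with word splits

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system",
         "content": "Fix OCR errors. Return only the corrected text; "
                    "do not add or remove content."},
        {"role": "user", "content": ocr_text},
    ],
)
print(response.choices[0].message.content)
```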

LLMs as an OCR Solution

The results speak for themselves:

| Engine | WER | CER | LLM |
| --- | --- | --- | --- |
| Gemini | 0.04 | 0.02 | Yes |
| Qwen | 0.06 | 0.03 | Yes |
| Deepseek | 0.10 | 0.06 | Yes |
| ChatGPT | 0.58 | 0.45 | Yes |
| Tesseract | 0.69 | 0.43 | No |
| PaddleOCR | 0.79 | 0.76 | No |
| EasyOCR | 0.89 | 0.67 | No |

Multimodal LLMs outperform traditional OCR significantly. However, they likely include some internal correction pipelines, making the comparison imperfect.

Word Mismatch Counts

| Model | Word mismatches |
| --- | --- |
| Gemini | 30 |
| Qwen | 26 |
| Deepseek | 276 |
| ChatGPT | 108 |

Gemini performs best and has a usable Python wrapper, but it introduces new challenges of its own.

Deployment Considerations: Gemini

Gemini’s main issues are:

  • Rate Limits:

    • The Gemini API enforces per-minute and per-day request quotas, so large batches need throttling and retry logic (see the sketch below).

  • Copyright Flags:

    • Gemini may falsely flag archival material and refuse to return text.
    • Handling this involves rerouting failed items to alternative models.
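
A minimal throttling sketch, assuming the google-generativeai SDK (model name and prompt are illustrative):

```python
# pip install google-generativeai pillow
import time

import google.generativeai as genai
from google.api_core import exceptions
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model choice

def ocr_with_backoff(image_path: str, retries: int = 5) -> str:
    """Transcribe one image, backing off exponentially on rate-limit errors."""
    image = Image.open(image_path)
    for attempt in range(retries):
        try:
            response = model.generate_content(
                ["Transcribe all text in this image exactly.", image]
            )
            return response.text
        except exceptions.ResourceExhausted:  # HTTP 429: quota exceeded
            time.sleep(2 ** attempt)          # 1s, 2s, 4s, ...
    raise RuntimeError(f"{image_path}: still rate-limited after {retries} attempts")
```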

Deployment Considerations: Qwen via Ollama

  • Qwen via Ollama is an option for local runs (a minimal sketch follows the lists below).

Negatives

  • Requires sufficient hardware.
  • Accuracy may drop slightly due to quantisation.

Positives

  • Free.
  • No reliance on the internet during runs.
  • Data privacy.
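
A minimal local-inference sketch, assuming the ollama Python client and a Qwen2.5-VL tag from the Ollama library (filename hypothetical):

```python
# pip install ollama  (requires a running Ollama server with the model pulled)
import ollama

response = ollama.chat(
    model="qwen2.5vl:7b",  # illustrative model tag
    messages=[
        {
            "role": "user",
            "content": "Transcribe all text in this image exactly.",
            "images": ["page.png"],  # hypothetical input scan
        }
    ],
)
print(response["message"]["content"])
```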

Qwen2.5-VL 7B (local, via Ollama) performed:

| Engine | WER | CER |
| --- | --- | --- |
| Qwen2.5-VL 7B | 0.22 | 0.15 |

This is notably worse than Qwen3 through the browser interface. A larger model would improve results, but that requires better hardware than I have, and even then it may not rival hosted usage, so the trade-off comes down to the user's priorities. Hugging Face is an alternative route for local inference; which to use is a matter of preference. It is also worth noting that Ollama has been discussing its multimodal optimisation strategy, which is worth watching.

Issues with LLMs

LLMs are often criticised for stochasticity. This was tested using Deepseek.

Deepseek Consistency Results

| Run | WER | CER |
| --- | --- | --- |
| Run 1 | 0.10 | 0.06 |
| Run 2 | 0.10 | 0.06 |
| Run 3 | 0.10 | 0.06 |
| Run 4 | 0.10 | 0.06 |

Deepseek showed consistent results across multiple runs.

| Comparison | WER difference |
| --- | --- |
| Run 2 vs Run 1 | 0.0128 |
| Run 3 vs Run 1 | 0.0000 |
| Run 4 vs Run 1 | 0.0118 |

(The per-run scores above are rounded to two decimal places, which is why small pairwise differences remain despite the identical-looking WERs.)

Still, live monitoring of results (e.g., mean character count) is recommended for production pipelines.
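
A minimal sketch of that monitoring idea (the threshold and strings are illustrative):

```python
import statistics

def flag_anomalies(outputs: list[str], tolerance: float = 0.5) -> list[int]:
    """Return indices of outputs whose character count deviates sharply
    from the batch mean (e.g., refusals or truncated transcriptions)."""
    lengths = [len(text) for text in outputs]
    mean_length = statistics.mean(lengths)
    return [
        i for i, n in enumerate(lengths)
        if abs(n - mean_length) > tolerance * mean_length
    ]

# The third "transcription" is suspiciously short, so it gets flagged.
batch = ["a full page of text " * 50, "more page text here " * 50, "I cannot help with that."]
print(flag_anomalies(batch))  # -> [2]
```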

Conclusion

This post compared LLMs and traditional OCR tools for image-to-text pipelines:

  • Traditional OCR tools like Tesseract are stable and lightweight, but less accurate.
  • LLMs like Gemini and Deepseek outperform on accuracy, but introduce complexity, cost, and deployment challenges.

The best choice depends on your goals:

  • For cost-effective large-scale work: Tesseract with preprocessing
  • For highest accuracy and smaller datasets: Gemini or Deepseek

For full code and data:

👉 GitHub Repository

References

  • Amrhein, C. & Clematide, S. (2018). Supervised OCR Error Detection and Correction. JLCL, 33(1).
  • Compton, T. (2025). OCR Evaluation. GitHub.
  • Hemmer, A. et al. (2024). Confidence-Aware OCR Error Detection. Document Analysis Systems, Springer.
  • Kim, S. et al. (2025). LLMs and OCR for Historical Records. arXiv. doi:10.48550/arXiv.2501.11623.
  • Warwick Modern Record Centre, archival material.
