Why Most OCR Tools Stop at English
A "searchable PDF" is a PDF where the text content is invisible-but-extractable. Visually it looks like the original scanned page; under the hood there's a text layer that lets you Ctrl+F to find words.
For English-only documents, modern OCR is essentially a solved problem. Tools like Adobe Acrobat, our PDF OCR tool, and command-line Tesseract produce 95%+ accuracy on clean scans.
The trouble starts with non-Latin scripts. Arabic, Chinese, Japanese, Korean, Thai, Hebrew, Hindi, and many other languages need different OCR engines or specific configurations. Mixed-script documents (English citations in a Chinese paper, French quotes in an Arabic article) confuse most engines.
This post covers the multilingual OCR pipeline: which engines handle which scripts, how to handle mixed-script pages, and how to tune accuracy for your specific document quality.
OCR Engine Landscape in 2026
| Engine | Open source? | Strong scripts | Weak scripts |
|---|---|---|---|
| Tesseract 5 | Yes (Apache) | Latin, Cyrillic, Greek | Arabic, complex Asian |
| Adobe Acrobat OCR | No | All major | Few weak spots |
| ABBYY FineReader | No | Most major scripts | Rarely-used scripts |
| Google Cloud Vision | No (paid API) | All including Arabic, CJK | Very specialized |
| Azure Document Intelligence | No (paid API) | All major scripts | Rarely-used scripts |
| AWS Textract | No (paid API) | English-focused | Most non-Latin |
| EasyOCR | Yes | 80+ languages | Variable accuracy |
| PaddleOCR | Yes | Chinese-strong, others | Some Latin scripts |
For multilingual production: cloud APIs (Google Cloud Vision, Azure) produce the highest accuracy across scripts. Open-source options (Tesseract, EasyOCR, PaddleOCR) are good enough for many use cases at zero per-page cost.
Tesseract Configuration
Tesseract 5 ships language data files for 100+ languages. Install:
# Ubuntu / Debian
sudo apt install tesseract-ocr tesseract-ocr-eng tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-ara tesseract-ocr-chi-sim
# Mac
brew install tesseract tesseract-lang
# Windows: install via UB Mannheim builds
Basic OCR:
tesseract input.png output -l eng pdf
# Tesseract appends the extension itself, so this writes output.pdf
For multilingual pages, specify multiple languages with +:
tesseract input.png output -l eng+fra+deu pdf
For Asian scripts, use specific files:
- chi_sim for Chinese Simplified
- chi_tra for Chinese Traditional
- jpn for Japanese
- jpn_vert for vertical Japanese
- kor for Korean
- tha for Thai
For Arabic and Hebrew (RTL):
tesseract input.png output -l ara pdf
Tesseract handles Arabic OCR but accuracy is significantly lower than Google Cloud Vision or ABBYY for the same document.
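When these commands are generated from a script, it can help to build the -l argument from readable names and fail early on a typo. The following is a minimal sketch, not part of Tesseract itself: the dictionary keys and the function name are made up for illustration, while the language codes are the standard traineddata names (heb for Hebrew is the one code added here that the list above does not show).

```python
# Hypothetical helper: readable names -> Tesseract traineddata codes.
TESSERACT_LANGS = {
    "english": "eng",
    "french": "fra",
    "german": "deu",
    "arabic": "ara",
    "hebrew": "heb",
    "chinese_simplified": "chi_sim",
    "chinese_traditional": "chi_tra",
    "japanese": "jpn",
    "japanese_vertical": "jpn_vert",
    "korean": "kor",
    "thai": "tha",
}


def build_tesseract_cmd(image_path, output_base, languages):
    """Return an argv list for a searchable-PDF run; raise on unknown names."""
    codes = []
    for name in languages:
        if name not in TESSERACT_LANGS:
            raise ValueError(f"no language code known for {name!r}")
        codes.append(TESSERACT_LANGS[name])
    # output_base has no extension: Tesseract appends ".pdf" itself.
    return ["tesseract", image_path, output_base, "-l", "+".join(codes), "pdf"]
```

The returned list can be passed straight to subprocess.run(cmd, check=True), which avoids shell quoting problems with unusual file names.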
Cloud API Workflow
For high-accuracy multilingual OCR, Google Cloud Vision is a common choice:
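from google.cloud import vision

def ocr_image(image_path):
    """Return the full text Google Cloud Vision detects in one image."""
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        content = f.read()
    image = vision.Image(content=content)
    # document_text_detection is the variant tuned for dense document text
    response = client.document_text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text

text = ocr_image("input.png")
print(text)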
Pricing: about $1.50 per 1,000 pages at the time of writing (check the current price list). For a typical project of 100-500 pages, that works out to $0.15-$0.75, so total cost is under $1.
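The output text needs to be packaged back into a PDF text layer. Use reportlab to draw each word invisibly (PDF text render mode 3) at the bounding box Vision returns for it, then merge that page over the original scan with a library such as pypdf. Note that pdfplumber is an extraction library: it reads text and coordinates from PDFs but cannot write them.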
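Placing a text layer correctly comes down to a coordinate conversion: OCR engines report boxes in image pixels with the origin at the top-left, while PDF pages are measured in points (72 per inch) from the bottom-left. Here is a sketch of that conversion. The function name is hypothetical, and it assumes the PDF page is exactly the scanned image with no margins or cropping.

```python
def pixel_box_to_pdf_points(box_px, image_height_px, dpi=300):
    """Convert (left, top, right, bottom) in pixels to PDF (x, y, width, height).

    Image space: origin top-left, y grows downward.
    PDF space:   origin bottom-left, y grows upward, 72 points per inch.
    """
    left, top, right, bottom = box_px
    x = left * 72.0 / dpi
    # Flip the vertical axis: the PDF y of a box is measured from its bottom edge.
    y = (image_height_px - bottom) * 72.0 / dpi
    width = (right - left) * 72.0 / dpi
    height = (bottom - top) * 72.0 / dpi
    return (x, y, width, height)
```

For a 300 DPI letter-size scan (2550 × 3300 pixels), a word box one inch from the top and left edges lands one inch from the left and roughly 9.8 inches up from the bottom of the PDF page.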
For batch processing of many documents, see our PDF OCR tool or the Batch Processing Files Guide.
Mixed-Script Pages
A page with English captions, Chinese body text, and Arabic notes is challenging. Strategies:
Strategy 1: Multi-language Tesseract
tesseract input.png output -l eng+chi_sim+ara pdf
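Tesseract runs the listed language models together and picks the best-scoring result per word. Output is usually correct within clearly separated sections and sometimes wrong where scripts meet on the same line. Each added language also slows recognition, so list only the ones actually on the page.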
Strategy 2: Pre-segmentation
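Use layout analysis tools (LayoutLM, PaddleOCR's PP-Structure, or layoutparser as below) to identify text regions. OCR each region with the appropriate language. Reassemble.
import layoutparser as lp
import numpy as np
from PIL import Image

# Load the page as an RGB array; the model expects an image, not a path
image = np.asarray(Image.open("input.png").convert("RGB"))
model = lp.Detectron2LayoutModel("lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config")
layout = model.detect(image)
for block in layout:
    region_image = block.crop_image(image)
    # OCR each region with the appropriate language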
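The "reassemble" step is easy to overlook: once each region has been recognized separately, the pieces need to go back into reading order. One possible sketch follows, using only the standard library. The Region class and the row_tolerance value are assumptions for illustration, and the left-to-right ordering inside a row suits Latin and CJK layouts; a predominantly right-to-left page would sort each row the other way.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """One OCR'd layout block (hypothetical container for this sketch)."""
    x1: float
    y1: float
    lang: str
    text: str


def reading_order(regions, row_tolerance=20.0):
    """Order regions top-to-bottom, then left-to-right within a row."""
    rows = []
    for region in sorted(regions, key=lambda r: r.y1):
        # Blocks whose top edges are within row_tolerance pixels share a row.
        if rows and abs(region.y1 - rows[-1][0].y1) <= row_tolerance:
            rows[-1].append(region)
        else:
            rows.append([region])
    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda r: r.x1))
    return ordered


def join_text(regions):
    """Concatenate region text in reading order, one region per paragraph."""
    return "\n\n".join(r.text for r in reading_order(regions))
```

Multi-column layouts need more than this (columns should be read fully before moving right), so treat the row grouping as a starting point.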
Strategy 3: Cloud API with language hints
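Google Cloud Vision auto-detects language with high accuracy on mixed-script pages, so a plain request is often the simplest answer. If detection goes wrong and you know which languages are present, you can pass them as language hints in the request's image context (for example "en", "zh", "ar"); leave the hints out otherwise, since a wrong hint can hurt results.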
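Whichever strategy you use, a quick sanity check on the result is to count how many letters of each script came back. If a page that visibly contains Arabic notes returns no Arabic characters, the engine or language setting missed them. The sketch below relies only on Python's unicodedata module; the list of script prefixes is an assumption covering the scripts discussed in this post, not a complete script database.

```python
import unicodedata
from collections import Counter

# Unicode character names begin with the script name for these scripts.
SCRIPT_PREFIXES = (
    "LATIN", "CJK", "HIRAGANA", "KATAKANA", "HANGUL",
    "ARABIC", "HEBREW", "THAI", "DEVANAGARI", "CYRILLIC", "GREEK",
)


def count_scripts(text):
    """Count letters per script in OCR output; digits and punctuation are skipped."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        for prefix in SCRIPT_PREFIXES:
            if name.startswith(prefix):
                counts[prefix] += 1
                break
        else:
            counts["OTHER"] += 1
    return counts
```

Calling count_scripts on the text of each page and comparing against what you expect gives a cheap way to flag pages for a second pass with a different engine.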
Accuracy Tuning
OCR accuracy depends heavily on input quality. Improve with preprocessing:
| Preprocessing | Effect |
|---|---|
| Deskew (correct slight rotation) | Improves accuracy 5-10% on tilted scans |
| Binarization (black/white) | Better than grayscale for Tesseract |
| Noise removal | Critical for noisy scans |
| Resolution to 300 DPI | Below 300 DPI, accuracy drops sharply |
| Contrast enhancement | Critical for faded scans |
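Pre-processing pipeline with ImageMagick. Note that -density only tags the file as 300 DPI; it does not add pixels, so the scan itself must already have enough resolution:
# On ImageMagick 7, use "magick" in place of "convert"
convert input.png \
  -deskew 40% \
  -units PixelsPerInch -density 300 \
  -threshold 50% \
  -despeckle \
  cleaned.png
tesseract cleaned.png output -l eng pdf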
For severely degraded scans (faded carbon copies, low-contrast handwriting on lined paper), even cloud APIs struggle. Manual transcription is sometimes the only reliable option.
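To tell whether a preprocessing change actually helps, measure it: transcribe a few sample pages by hand and compare OCR output against them before and after. A common way to do that is character accuracy based on edit distance, sketched below. This is one reasonable definition of "accuracy" and not necessarily the one behind the percentages quoted in this post.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def character_accuracy(reference, ocr_output):
    """Return 1 - character error rate, floored at 0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    cer = edit_distance(reference, ocr_output) / len(reference)
    return max(0.0, 1.0 - cer)
```

For example, reading "Invoice 1001" as "Invoice l00l" is two substitutions in twelve characters, or about 83% character accuracy. Run the same comparison on the raw scan and the cleaned scan to see which settings earn their place in the pipeline.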
Pro Tip: OCR accuracy rises with scan resolution up to about 300 DPI and then plateaus. Below 300 DPI, similar characters get confused (lowercase l vs digit 1, lowercase o vs digit 0). Always scan at 300 DPI or higher.
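Image files do not always carry trustworthy DPI metadata, but the effective resolution can be worked out from the pixel dimensions and the physical page size. A small sketch, assuming US Letter pages by default (the function names are illustrative):

```python
def effective_dpi(width_px, height_px, page_width_in=8.5, page_height_in=11.0):
    """Estimate scan resolution from pixel size and physical page size."""
    # Use the smaller of the two axes so a stretched scan is not overrated.
    return min(width_px / page_width_in, height_px / page_height_in)


def upscale_factor(width_px, height_px, target_dpi=300.0,
                   page_width_in=8.5, page_height_in=11.0):
    """Return the resize factor needed to reach target_dpi (1.0 if already there)."""
    dpi = effective_dpi(width_px, height_px, page_width_in, page_height_in)
    return max(1.0, target_dpi / dpi)
```

A 1275 × 1650 pixel letter page is only 150 DPI and would need 2× upscaling to reach 300. Upscaling gives the engine larger glyphs to work with but cannot restore detail the scanner never captured, so rescanning is the better fix when the paper original is available.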
Searchable PDF Output Formats
After OCR, the output PDF can have text layered in different ways:
Type 1: Text-only
OCR text replaces the original image. Useful for clean text recovery but loses the original visual layout.
Type 2: Image + invisible text overlay (preferred)
Original scan image is preserved. Invisible text layer added on top. Visually identical to original; searchable via Ctrl+F.
Type 3: Image + image-text overlay
Hybrid approach where extracted text is placed visibly alongside or over the image. Useful for accessibility tools that read the text aloud.
For most use cases: Type 2. Tesseract's pdf output mode produces this by default.
Searchable PDF Specifications
The PDF standards you are most likely to encounter, only the first of which is about accessibility:
- PDF/UA (ISO 14289): Universal Accessibility, requires structured tags
- PDF/A (ISO 19005): archival format, preserves text and visual layout
- PDF/X (ISO 15930): print-oriented, color managed
For accessibility (PDF/UA), the OCR text layer must be properly tagged. Adobe Acrobat Pro produces tagged PDFs by default. Tesseract's output is text-layered but not tagged; additional tools like pdf-uaccess add tags.
For US/EU regulatory compliance, PDF/UA is increasingly required for public-facing documents.
File Size Considerations
Searchable PDFs are larger than image-only PDFs:
| Source | Image-only size | Searchable PDF size |
|---|---|---|
| 300 DPI 8.5×11 page | 1.5 MB | ~1.6 MB (text adds 5-10%) |
| 300 DPI 8.5×11 with image compression | 250 KB | 280 KB |
| 600 DPI 8.5×11 page | 6 MB | 6.2 MB |
The text layer adds modest size. Compression of the image layer matters more for total size.
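The overhead percentages follow directly from the table. The sketch below recomputes them from the sizes listed above; the function name is illustrative.

```python
def text_layer_overhead_pct(image_only_size, searchable_size):
    """Percentage by which the searchable PDF exceeds the image-only PDF."""
    extra = searchable_size - image_only_size
    return round(extra / image_only_size * 100, 1)


# Sizes taken from the table above (units cancel out within each row).
table_rows = {
    "300 DPI uncompressed (MB)": (1.5, 1.6),
    "300 DPI compressed (KB)": (250, 280),
    "600 DPI uncompressed (MB)": (6.0, 6.2),
}

for label, (image_only, searchable) in table_rows.items():
    print(f"{label}: +{text_layer_overhead_pct(image_only, searchable)}%")
```

The compressed row comes out highest in relative terms because the text layer is about the same absolute size while the image it sits on is much smaller. That is why image compression, not the text layer, is the lever for total size.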
For aggressive compression, see How to Reduce PDF File Size.
Handwriting OCR
Tesseract has limited handwriting OCR support. For handwritten text:
- Google Cloud Vision: better than Tesseract, mediocre on cursive
- Azure Document Intelligence (formerly Form Recognizer): handles printed handwriting reasonably
- AWS Textract: focused on form-like documents
- Tesseract: ineffective on cursive, OK on neat printing
For most handwriting OCR work in 2026, cloud APIs are the only viable option. Even those struggle with cursive, abbreviations, and specialized notation (mathematical formulas, music notation, chemical structures).
For specialized handwriting (vintage manuscripts, doctor's notes), human transcription remains the only reliable answer.
Common Issues
Text shifted left from original: text layer alignment off. Tesseract sometimes mis-positions text relative to the image. Check with Adobe Acrobat's "Recognize Text" verification tool.
Diacritics missing or wrong: language model didn't include diacritics. Use the specific language file (e.g., fra not eng+fra for accented French).
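Mixed orientation pages: scan has portrait and landscape pages. Tesseract can detect page orientation, but only in its OSD modes (--psm 0 to report orientation, --psm 1 to detect it and then OCR), which need the osd data file; the default mode assumes upright text. Many PDF tools do no detection at all. Re-orient before OCR.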
OCR ran but file says "no text": your PDF reader is using the image not the text layer. Most readers can extract OCR text; verify in Adobe Acrobat.
Accuracy doesn't improve on a second pass: re-running OCR on the same image gives the same result, because each pass starts fresh. Don't expect cumulative improvement; change the preprocessing, language setting, or engine instead.
For broader PDF tools, see our PDF converter.
Frequently Asked Questions
Why is OCR slow on long documents?
Tesseract works through pages one at a time and makes limited use of multiple cores. For multi-page work, run several instances in parallel with GNU parallel:
parallel tesseract {} {.} -l eng pdf ::: *.png
This processes files in parallel, scaling with CPU cores.
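Where GNU parallel is not available (on Windows, for example), the same fan-out can be done from Python. This is a sketch under the assumption that the tesseract binary is on PATH; the function names are illustrative. Threads are sufficient because each worker only waits on an external process.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def tesseract_cmd(image_path, lang="eng"):
    """Build the argv for one page; output base is the image path minus suffix."""
    image_path = Path(image_path)
    output_base = image_path.with_suffix("")  # Tesseract appends ".pdf"
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def ocr_all(image_paths, lang="eng", workers=None):
    """Run one Tesseract process per image, at most `workers` at a time."""
    workers = workers or os.cpu_count() or 1

    def run(path):
        # check=True raises CalledProcessError if Tesseract exits non-zero.
        return subprocess.run(tesseract_cmd(path, lang), check=True)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces evaluation so any worker exception surfaces here.
        list(pool.map(run, image_paths))
```

A typical call would be ocr_all(sorted(Path(".").glob("*.png"))).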
Can I OCR a PDF without converting to images first?
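For text-PDFs (already searchable): no need, text is already there. For scanned PDFs (image-based): not with Tesseract alone, which reads image files and does not accept PDF input. Wrapper tools such as OCRmyPDF take a PDF in and return a searchable PDF, rasterizing the pages and calling Tesseract for you. Most production workflows split to images first anyway because it gives cleaner control over resolution and preprocessing.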
How do I OCR a multi-page PDF?
# Split PDF to images
pdftoppm input.pdf page -png -r 300
# OCR each page
for img in page-*.png; do
tesseract "$img" "${img%.png}" -l eng pdf
done
# Combine PDFs
pdftk page-*.pdf cat output combined.pdf
Or use our PDF OCR tool which handles the pipeline automatically.
What's the accuracy of OCR on receipts?
Variable. Clean printed receipts: 95%+. Crumpled or low-contrast receipts: 60-80%. Specialized OCR products (Receipt Bank, Expensify) tune for receipts and produce higher accuracy than general OCR.
Can OCR extract tables?
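Yes, but with caveats. Tesseract's TSV output mode gives one row per word with its bounding box and confidence, not a table structure; you have to group words into rows and columns yourself. PaddleOCR's PP-Structure and LayoutLMv3 are more specialized for table OCR, with reasonable accuracy.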
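As a starting point for table recovery, the word rows in Tesseract's TSV output can be grouped back into text lines. The sketch below uses the standard TSV columns (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text); the function name and the min_conf parameter are choices made for this example.

```python
import csv
import io
from collections import defaultdict


def tsv_to_lines(tsv_text, min_conf=0.0):
    """Group word rows from Tesseract TSV into lines of words, left to right."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    lines = defaultdict(list)
    for row in reader:
        if row["level"] != "5":  # level 5 rows are individual words
            continue
        text = (row["text"] or "").strip()
        if not text or float(row["conf"]) < min_conf:
            continue
        key = (
            int(row["page_num"]), int(row["block_num"]),
            int(row["par_num"]), int(row["line_num"]),
        )
        lines[key].append((int(row["left"]), text))
    # Sort lines by their position keys, and words by their left edge.
    return [[word for _, word in sorted(words)] for _, words in sorted(lines.items())]
```

Splitting each line into columns is the harder part: it usually means clustering the left coordinates across lines, which is where the specialized table models earn their keep. Raising min_conf drops low-confidence words, which is useful for flagging cells that need manual review.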
Does OCR work on screenshots?
Yes, often better than scans. Screenshots have clean rendering, no skew, no noise. Most OCR tools achieve 99%+ accuracy on application UI screenshots.
Bottom Line
For multilingual OCR: Tesseract 5 with the right language files for most cases, Google Cloud Vision or Azure for mixed-script and high-accuracy work. Pre-process scans to 300 DPI, deskew, and binarize before OCR. Use Type 2 (image + invisible text overlay) for searchable output. Our PDF OCR tool and PDF converter handle the pipeline if you want to skip the manual setup.



