Searchable PDF With OCR: Multilingual, Mixed-Script, and Accuracy Tuning

Why Most OCR Tools Stop at English

A "searchable PDF" is a PDF where the text content is invisible-but-extractable. Visually it looks like the original scanned page; under the hood there's a text layer that lets you Ctrl+F to find words.

For English-only documents, modern OCR is essentially a solved problem. Tools like Adobe Acrobat, our PDF OCR tool, and command-line Tesseract produce 95%+ accuracy on clean scans.

The trouble starts with non-Latin scripts. Arabic, Chinese, Japanese, Korean, Thai, Hebrew, Hindi, and many other languages need different OCR engines or specific configurations. Mixed-script documents (English citations in a Chinese paper, French quotes in an Arabic article) confuse most engines.

This post covers the multilingual OCR pipeline: which engines handle which scripts, how to handle mixed-script pages, and how to tune accuracy for your specific document quality.

OCR Engine Landscape in 2026

Engine	Open source?	Strong scripts	Weak scripts
Tesseract 5	Yes (Apache)	Latin, Cyrillic, Greek	Arabic, complex Asian
Adobe Acrobat OCR	No	All major	Few weak spots
ABBYY FineReader	No	Most major scripts	Rarely-used scripts
Google Cloud Vision	No (paid API)	All including Arabic, CJK	Very specialized
Azure Document Intelligence	No (paid API)	All major scripts	Rarely-used scripts
AWS Textract	No (paid API)	English-focused	Most non-Latin
EasyOCR	Yes	80+ languages	Variable accuracy
PaddleOCR	Yes	Chinese-strong, others	Some Latin scripts

For multilingual production: cloud APIs (Google Cloud Vision, Azure) produce the highest accuracy across scripts. Open-source options (Tesseract, EasyOCR, PaddleOCR) are good enough for many use cases at zero per-page cost.

Tesseract Configuration

Tesseract 5 ships language data files for 100+ languages. Install:

# Ubuntu / Debian
sudo apt install tesseract-ocr tesseract-ocr-eng tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-ara tesseract-ocr-chi-sim

# Mac
brew install tesseract tesseract-lang

# Windows: install via UB Mannheim builds

Basic OCR (Tesseract appends .pdf to the output base name itself, so this writes output.pdf):

tesseract input.png output -l eng pdf

For multilingual pages, specify multiple languages with +:

tesseract input.png output -l eng+fra+deu pdf

For Asian scripts, use specific files:

chi_sim for Chinese Simplified
chi_tra for Chinese Traditional
jpn for Japanese
jpn_vert for vertical Japanese
kor for Korean
tha for Thai

For Arabic (ara) and Hebrew (heb), both right-to-left scripts:

tesseract input.png output -l ara pdf

Tesseract handles Arabic OCR but accuracy is significantly lower than Google Cloud Vision or ABBYY for the same document.
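
The language codes above can be wrapped in a small helper. This is a minimal sketch, not part of Tesseract: the name build_tesseract_cmd and the SCRIPT_TO_LANG mapping are hypothetical, and the function only assembles the command line (running it still needs the tesseract binary and the matching language data). Because Tesseract appends .pdf to the output name itself, the helper passes a base name without the extension.

```python
from pathlib import Path

# Tesseract language codes mentioned above (heb is Tesseract's Hebrew code).
# The keys are hypothetical, human-readable labels chosen for this sketch.
SCRIPT_TO_LANG = {
    "english": "eng",
    "french": "fra",
    "german": "deu",
    "arabic": "ara",
    "hebrew": "heb",
    "chinese_simplified": "chi_sim",
    "chinese_traditional": "chi_tra",
    "japanese": "jpn",
    "japanese_vertical": "jpn_vert",
    "korean": "kor",
    "thai": "tha",
}

def build_tesseract_cmd(image_path, output_pdf, languages):
    """Return the argv list for a searchable-PDF run; does not execute it."""
    codes = [SCRIPT_TO_LANG[name] for name in languages]
    # Tesseract adds ".pdf" on its own, so strip it from the output base.
    output_base = str(Path(output_pdf).with_suffix(""))
    return ["tesseract", str(image_path), output_base, "-l", "+".join(codes), "pdf"]

cmd = build_tesseract_cmd("input.png", "output.pdf", ["english", "french", "german"])
print(" ".join(cmd))  # tesseract input.png output -l eng+fra+deu pdf
```

The returned list can be handed to subprocess.run(cmd, check=True) once the language packs are installed.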

Cloud API Workflow

For high-accuracy multilingual OCR, Google Cloud Vision is a common choice:

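from google.cloud import vision

def ocr_image(image_path):
    """Return the full text Cloud Vision detects in one image file."""
    # Credentials come from the environment (GOOGLE_APPLICATION_CREDENTIALS).
    client = vision.ImageAnnotatorClient()

    with open(image_path, "rb") as f:
        content = f.read()

    image = vision.Image(content=content)
    # document_text_detection is the dense-text mode meant for documents.
    response = client.document_text_detection(image=image)

    # API-level failures are reported in the response, not raised.
    if response.error.message:
        raise RuntimeError(response.error.message)

    return response.full_text_annotation.text

text = ocr_image("input.png")
print(text)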

Pricing: about $1.50 per 1,000 pages at the time of writing (check the current price list). For a typical project of 100-500 pages, that works out to $0.15-$0.75 in total.
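
The arithmetic behind that estimate, as a small sketch. The default rate is the figure quoted above; the function name and the free_pages parameter are hypothetical and only there so you can plug in whatever your current price list says.

```python
def vision_cost_usd(pages, price_per_1000=1.50, free_pages=0):
    """Estimate OCR cost in USD; rate and free allowance are assumptions."""
    billable = max(0, pages - free_pages)
    return billable * price_per_1000 / 1000

for pages in (100, 500, 10_000):
    print(f"{pages} pages -> ${vision_cost_usd(pages):.2f}")
```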

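The API returns text plus per-word bounding boxes (inside full_text_annotation), not a PDF. To get a searchable PDF you still have to draw that text invisibly over the original page image, for example with reportlab. Note that pdfplumber only reads PDFs: it is useful for checking the finished text layer, not for writing it.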
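
Placing OCR text over the page means converting word boxes from image pixels to PDF units. The sketch below shows only that conversion, under two assumptions: boxes arrive as (left, top, right, bottom) in pixels with the origin at the top-left, and the target uses PDF user space (72 points per inch, origin at the bottom-left, as reportlab's canvas does). The function name is hypothetical.

```python
def pixel_box_to_pdf_points(box, dpi, page_height_px):
    """Convert (left, top, right, bottom) in pixels to PDF points.

    Image coordinates start at the top-left corner; PDF user space starts
    at the bottom-left, with 72 points per inch.
    """
    left, top, right, bottom = box
    x0 = left * 72.0 / dpi
    x1 = right * 72.0 / dpi
    # Flip the y axis: the bottom edge of the box becomes the lower y value.
    y0 = (page_height_px - bottom) * 72.0 / dpi
    y1 = (page_height_px - top) * 72.0 / dpi
    return (x0, y0, x1, y1)

# A word box on an 8.5x11 inch page scanned at 300 DPI (2550 x 3300 px).
print(pixel_box_to_pdf_points((300, 300, 600, 360), dpi=300, page_height_px=3300))
```

Libraries that measure from the top-left of the page need only the scaling step, not the flip.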

For batch processing of many documents, see our PDF OCR tool or the Batch Processing Files Guide.

Mixed-Script Pages

A page with English captions, Chinese body text, and Arabic notes is challenging. Strategies:

Strategy 1: Multi-language Tesseract

tesseract input.png output -l eng+chi_sim+ara pdf

Tesseract tries each language model; usually correct for clear sections, sometimes wrong at boundaries.

Strategy 2: Pre-segmentation

Use layout analysis tools (LayoutLM-based models, PaddleOCR's PP-Structure) to identify text regions. OCR each region with the language that matches its script, then reassemble the results in reading order.

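import layoutparser as lp
from PIL import Image

image = Image.open("input.png").convert("RGB")

# PubLayNet-trained detector; needs Detectron2 installed alongside layoutparser
model = lp.Detectron2LayoutModel("lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config")
layout = model.detect(image)
for block in layout:
    # block.coordinates is (x1, y1, x2, y2), the box format PIL's crop() expects
    region_image = image.crop(block.coordinates)
    # OCR each region with appropriate language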
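
One way to decide which language is "appropriate" for a region is to run a quick first pass with a combined language set, look at which script dominates the recognized text, and then re-run that region with a single language. The sketch below covers only the script guess, using Unicode character names from the standard library. It is a heuristic with a hypothetical mapping: for example, it sends all CJK ideographs to chi_sim, which is wrong for kanji-heavy Japanese.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from the first word of a Unicode character name to a
# Tesseract language code. CJK -> chi_sim is only an example default.
PREFIX_TO_LANG = {
    "LATIN": "eng",
    "ARABIC": "ara",
    "HEBREW": "heb",
    "CJK": "chi_sim",
    "HIRAGANA": "jpn",
    "KATAKANA": "jpn",
    "HANGUL": "kor",
    "THAI": "tha",
}

def guess_tesseract_lang(text, default="eng"):
    """Pick a language code from the most common script among the letters."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        prefix = unicodedata.name(ch, "").split(" ")[0]
        if prefix in PREFIX_TO_LANG:
            counts[PREFIX_TO_LANG[prefix]] += 1
    return counts.most_common(1)[0][0] if counts else default

print(guess_tesseract_lang("Figure 3: results"))  # eng
print(guess_tesseract_lang("مرحبا بالعالم"))       # ara
print(guess_tesseract_lang("你好，世界"))           # chi_sim
```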

Strategy 3: Cloud API with language hints

Google Cloud Vision auto-detects language with high accuracy on mixed-script pages, so language hints are optional: pass them only when detection picks the wrong script. Often the simplest answer.
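
If detection does go wrong, hints can be passed through the request's image_context. The sketch below assumes the google-cloud-vision client library and configured credentials; the helper names are hypothetical, and the hint codes ("en", "zh", "ar") are only an example set for the page described above.

```python
def build_image_context(language_hints):
    """Build the image_context dict that carries optional language hints."""
    return {"language_hints": list(language_hints)}

def ocr_with_hints(image_path, language_hints):
    # Imported lazily so the helper above works without the client installed.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.document_text_detection(
        image=image,
        image_context=build_image_context(language_hints),
    )
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text

print(build_image_context(["en", "zh", "ar"]))
```

It is worth comparing the hinted and unhinted results on a few pages before adopting hints everywhere, since a wrong hint can hurt as much as a right one helps.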

Accuracy Tuning

OCR accuracy depends heavily on input quality. Improve with preprocessing:

Preprocessing	Effect
Deskew (correct slight rotation)	Improves accuracy 5-10% on tilted scans
Binarization (black/white)	Better than grayscale for Tesseract
Noise removal	Critical for noisy scans
Resolution to 300 DPI	Below 300 DPI, accuracy drops sharply
Contrast enhancement	Critical for faded scans

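Pre-processing pipeline with ImageMagick (on ImageMagick 7, use magick instead of convert):

# -density only records 300 DPI in the file's metadata; it does not add
# detail to a low-resolution scan. Despeckle runs before thresholding.
convert input.png \
  -colorspace Gray \
  -deskew 40% \
  -despeckle \
  -threshold 50% \
  -units PixelsPerInch -density 300 \
  cleaned.png

tesseract cleaned.png output -l eng pdf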
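
A fixed 50% threshold can wipe out faint text or keep a grey background. A common alternative, shown here as a swapped-in technique rather than what ImageMagick's -threshold does, is Otsu's method, which picks the cut-off from the image's own histogram. The sketch is pure Python and assumes you already have a 256-bin grayscale histogram (for instance from Pillow's Image.histogram() on an "L" mode image, which is an assumption about your toolchain).

```python
def otsu_threshold(hist):
    """Return the gray level t that best separates values <= t from values > t.

    hist is a list of 256 pixel counts. The chosen t maximizes the
    between-class variance of the two groups (Otsu's method).
    """
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    weight_bg = 0
    sum_bg = 0
    best_t, best_var = 0, -1.0
    for t, count in enumerate(hist):
        weight_bg += count
        sum_bg += t * count
        weight_fg = total - weight_bg
        if weight_bg == 0:
            continue  # no pixels at or below t yet
        if weight_fg == 0:
            break     # every pixel is already at or below t
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t

def binarize(pixels, t):
    """Map gray values to pure black (0) or white (255) around threshold t."""
    return [255 if p > t else 0 for p in pixels]

# Toy histogram: dark ink around level 50, light paper around level 200.
hist = [0] * 256
hist[50] = 100
hist[200] = 100
t = otsu_threshold(hist)
print(t, binarize([40, 50, 180, 220], t))
```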

For severely degraded scans (faded carbon copies, low-contrast handwriting on lined paper), even cloud APIs struggle. Manual transcription is sometimes the only reliable option.

Pro Tip: OCR accuracy rises with scan resolution up to about 300 DPI, then plateaus. Below 300 DPI, similar-looking characters get confused (lowercase l vs digit 1, lowercase o vs digit 0). Always scan at 300 DPI or higher.
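
If a scan arrives without reliable DPI metadata, the effective resolution can be estimated from the pixel dimensions and the physical page size. A minimal sketch, assuming a US Letter page unless told otherwise; the function names are hypothetical.

```python
def effective_dpi(width_px, height_px, width_in=8.5, height_in=11.0):
    """Return the lower of the horizontal and vertical pixels-per-inch."""
    return min(width_px / width_in, height_px / height_in)

def needs_rescan(width_px, height_px, min_dpi=300, width_in=8.5, height_in=11.0):
    """True when the scan falls below the recommended resolution."""
    return effective_dpi(width_px, height_px, width_in, height_in) < min_dpi

print(effective_dpi(2550, 3300))  # 300.0
print(needs_rescan(1275, 1650))   # True (this is a 150 DPI scan)
```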

Searchable PDF Output Formats

After OCR, the output PDF can have text layered in different ways:

Type 1: Text-only

OCR text replaces the original image. Useful for clean text recovery but loses the original visual layout.

Type 2: Image + invisible text overlay (preferred)

Original scan image is preserved. Invisible text layer added on top. Visually identical to original; searchable via Ctrl+F.

Type 3: Image + visible text overlay

Hybrid approach where extracted text is placed visibly alongside or over the image. Useful for accessibility tools that read the text aloud.

For most use cases: Type 2. Tesseract's pdf output mode produces this by default.

Searchable PDF Specifications

Three PDF standards come up when producing searchable PDFs, only the first of which is about accessibility:

PDF/UA (ISO 14289): PDF/Universal Accessibility, requires structured tags
PDF/A (ISO 19005): archival format, preserves text and visual layout
PDF/X (ISO 15930): print-oriented, color managed

For accessibility (PDF/UA), the OCR text layer must be properly tagged. Adobe Acrobat Pro can add tags to an OCR'd PDF, though automatically generated tags usually need a manual review. Tesseract's output has a text layer but no tags; additional tools like pdf-uaccess add tags.

For US/EU regulatory compliance, PDF/UA is increasingly required for public-facing documents.

File Size Considerations

Searchable PDFs are slightly larger than image-only PDFs:

Source	Image-only size	Searchable PDF size
300 DPI 8.5×11 page	1.5 MB	~1.6 MB (text adds roughly 7%)
300 DPI 8.5×11 with image compression	250 KB	280 KB
600 DPI 8.5×11 page	6 MB	6.2 MB

The text layer adds modest size. Compression of the image layer matters more for total size.
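
Working the table through shows how the overhead varies: the text layer is a similar order of magnitude in each case, so its share grows as the image gets smaller. The sketch below uses the table's figures, treating 1 MB as 1000 KB for simplicity.

```python
# (image-only KB, searchable KB), taken from the table above.
SIZES_KB = {
    "300 DPI, uncompressed": (1500, 1600),
    "300 DPI, compressed image": (250, 280),
    "600 DPI, uncompressed": (6000, 6200),
}

def overhead_percent(image_only, searchable):
    """Extra size of the searchable file relative to the image-only file."""
    return (searchable - image_only) / image_only * 100

for label, (before, after) in SIZES_KB.items():
    print(f"{label}: +{after - before} KB ({overhead_percent(before, after):.1f}%)")
```

The three rows come out at roughly 7%, 12%, and 3%.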

For aggressive compression, see How to Reduce PDF File Size.

Handwriting OCR

Tesseract has limited handwriting OCR support. For handwritten text:

Google Cloud Vision: better than Tesseract, mediocre on cursive
Azure Document Intelligence (formerly Form Recognizer): handles hand-printed (block letter) text reasonably
AWS Textract: focused on form-like documents
Tesseract: ineffective on cursive, OK on neat printing

For most handwriting OCR work in 2026, cloud APIs are the only viable option. Even those struggle with cursive, abbreviations, and specialized notation (mathematical formulas, music notation, chemical structures).

For specialized handwriting (vintage manuscripts, doctor's notes), human transcription remains the only reliable answer.

Common Issues

Text shifted left from original: the text layer is misaligned. Tesseract sometimes positions text slightly off relative to the image. To check, drag a text selection in Adobe Acrobat (or another viewer) and see whether the highlight sits over the printed words.

Diacritics missing or wrong: the language model that was applied doesn't cover the accented characters. Use the specific language file (e.g., fra not eng+fra for accented French).

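Mixed orientation pages: the scan has both portrait and landscape pages. Tesseract only detects orientation when orientation and script detection is enabled (--psm 0 or --psm 1, with the osd data file installed); its default page segmentation mode does not, and some PDF tools don't either. Re-orient before OCR.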

OCR ran but file says "no text": your PDF reader may be looking only at the image, or the file was saved without its text layer. Most readers can extract OCR text; try selecting or searching text in Adobe Acrobat to confirm the layer exists.

Text accuracy drops on second pass: re-running OCR doesn't improve accuracy. Each pass starts fresh. Don't expect cumulative improvement.

For broader PDF tools, see our PDF converter.

Frequently Asked Questions

Why is OCR slow on long documents?

Tesseract works through pages one after another and makes limited use of multiple cores. For multi-page work, run several processes side by side with GNU parallel:

parallel tesseract {} {.} -l eng pdf ::: *.png

This processes files in parallel, scaling with CPU cores.
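
The same idea can be sketched in Python with a thread pool, which is enough here because each worker only waits on an external tesseract process. The function names and the run parameter are hypothetical; run exists so the sketch can be dry-run without Tesseract installed, and defaults to subprocess.run for real use.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def ocr_one(image_path, lang="eng", run=subprocess.run):
    """OCR one image into a PDF next to it; returns the expected PDF path."""
    image_path = Path(image_path)
    output_base = image_path.with_suffix("")  # Tesseract appends ".pdf"
    run(["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"], check=True)
    return Path(str(output_base) + ".pdf")

def ocr_many(image_paths, lang="eng", workers=4, run=subprocess.run):
    """OCR several images concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: ocr_one(p, lang, run), image_paths))

def dry_run(cmd, check=True):
    """Stand-in for subprocess.run that only prints the command."""
    print(" ".join(cmd))

pdfs = ocr_many(["page-1.png", "page-2.png"], run=dry_run)
print([str(p) for p in pdfs])
```

For a real run, drop the run argument, for example ocr_many(sorted(Path(".").glob("*.png")), workers=4).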

Can I OCR a PDF without converting to images first?

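For text PDFs (already searchable): no need, the text is already there. For scanned (image-based) PDFs: not with Tesseract alone, which reads image files rather than PDFs (its --oem flag selects the recognition engine, not the input format). Wrappers such as OCRmyPDF rasterize the pages and call Tesseract for you. Most production workflows split to images first anyway because it gives cleaner control.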

How do I OCR a multi-page PDF?

# Split PDF to images
pdftoppm input.pdf page -png -r 300

# OCR each page
for img in page-*.png; do
  tesseract "$img" "${img%.png}" -l eng pdf
done

# Combine PDFs
pdftk page-*.pdf cat output combined.pdf
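
The combine step depends on page order. A shell glob sorts names as text, so page-10 lands before page-2 whenever page numbers are not zero-padded. pdftoppm normally pads its numbers, but files from other tools may not, so a natural sort is a cheap safeguard. A minimal sketch with a hypothetical natural_key helper; it builds the pdftk command without running it.

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so 2 sorts before 10."""
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", name)]

pages = ["page-10.pdf", "page-2.pdf", "page-1.pdf"]
ordered = sorted(pages, key=natural_key)
print(ordered)  # ['page-1.pdf', 'page-2.pdf', 'page-10.pdf']

# Argument list that could be passed to subprocess.run(..., check=True).
pdftk_cmd = ["pdftk", *ordered, "cat", "output", "combined.pdf"]
print(" ".join(pdftk_cmd))
```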

Or use our PDF OCR tool which handles the pipeline automatically.

What's the accuracy of OCR on receipts?

Variable. Clean printed receipts: 95%+. Crumpled or low-contrast receipts: 60-80%. Specialized OCR products (Receipt Bank, Expensify) tune for receipts and produce higher accuracy than general OCR.
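
Percentages like these are easiest to compare when you measure them yourself on a few samples. One common definition, assumed here, is character-level accuracy: one minus the edit distance between a hand-typed reference and the OCR output, divided by the reference length. The sketch is a plain Levenshtein implementation; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

def char_accuracy(reference, ocr_output):
    """Fraction of reference characters recovered, clamped to [0, 1]."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    return max(0.0, 1.0 - edit_distance(reference, ocr_output) / len(reference))

# Two classic confusions on a receipt line: O read as 0, and 0 read as O.
print(f"{char_accuracy('TOTAL 10.00', 'T0TAL 1O.00'):.1%}")  # 81.8%
```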

Can OCR extract tables?

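Yes, but with caveats. Tesseract's TSV output mode lists every word with its bounding box and confidence, which you can group into rows and columns yourself; it does not detect table structure. PaddleStructure and LayoutLMv3 are more specialized for table extraction, with reasonable accuracy.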
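
A rough sketch of the do-it-yourself route: read Tesseract's TSV columns (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text) and group word entries by line, ordered left to right. The sample below is hand-written in that layout, not real OCR output, and grouping by line_num is a simplification: real tables usually also need column boundaries inferred from the left coordinates.

```python
import csv
import io
from collections import defaultdict

# Hand-written sample in Tesseract's TSV column layout (not real OCR output).
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "5\t1\t1\t1\t1\t1\t100\t50\t80\t20\t96.1\tItem\n"
    "5\t1\t1\t1\t1\t2\t400\t50\t60\t20\t95.4\tPrice\n"
    "5\t1\t1\t1\t2\t1\t100\t90\t90\t20\t93.0\tPaper\n"
    "5\t1\t1\t1\t2\t2\t400\t90\t50\t20\t91.7\t4.50\n"
)

def tsv_to_rows(tsv_text):
    """Group word-level entries (level 5) into lines, ordered left to right."""
    lines = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for rec in reader:
        text = (rec["text"] or "").strip()
        if rec["level"] != "5" or not text:
            continue  # skip page/block/line summary rows and empty words
        key = (int(rec["page_num"]), int(rec["block_num"]),
               int(rec["par_num"]), int(rec["line_num"]))
        lines[key].append((int(rec["left"]), text))
    return [[word for _, word in sorted(words)] for _, words in sorted(lines.items())]

print(tsv_to_rows(SAMPLE_TSV))  # [['Item', 'Price'], ['Paper', '4.50']]
```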

Does OCR work on screenshots?

Yes, often better than scans. Screenshots have clean rendering, no skew, no noise. Most OCR tools achieve 99%+ accuracy on application UI screenshots.

Bottom Line

For multilingual OCR: Tesseract 5 with the right language files for most cases, Google Cloud Vision or Azure for mixed-script and high-accuracy work. Pre-process scans to 300 DPI, deskew, and binarize before OCR. Use Type 2 (image + invisible text overlay) for searchable output. Our PDF OCR tool and PDF converter handle the pipeline if you want to skip the manual setup.