When a PDF Isn't Really a PDF
Not all PDFs contain text — many are just images of text inside a PDF wrapper. Scanned documents, photographed contracts, faxes, and export-to-PDF functions from older software all produce "image PDFs" where the text exists only as pixels. You can read them, but you can't search, copy, or edit the text.
OCR (Optical Character Recognition) analyzes these images and extracts the text, letting you convert a scanned invoice into a Word document, make a 20-year archive of contracts searchable, or extract data from a photographed form for processing.
This guide covers how OCR works, what affects accuracy, and the practical workflows for converting scanned documents — from single pages to thousands of files.
How OCR Actually Works
Modern OCR uses neural networks trained on millions of document images. The process:
- Image preprocessing — Straighten skewed or rotated pages (deskew), remove noise, and enhance contrast. This step dramatically affects accuracy on poor scans.
- Page segmentation — Identify regions: text blocks, tables, images, headers, footers. Multi-column layouts require correct segmentation to preserve reading order.
- Character recognition — Analyze each character against trained models. Neural networks score confidence for each character.
- Language models — Apply language context to improve accuracy. "The quick brown f0x" — the model knows "fox" is more likely than "f0x" in English text.
- Output formatting — Reconstruct layout for Word/PDF output, or extract plain text.
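The language-model step can be illustrated with a toy post-correction pass. This is only a sketch of the idea, not how Tesseract or any specific engine implements it: the `VOCAB` set and `CONFUSIONS` map are made-up illustrative data, and real engines score candidates with statistical or neural models rather than a fixed word list.

```python
from itertools import product

# Hypothetical data for illustration only: a tiny vocabulary and a map of
# characters that OCR output commonly confuses with each other.
VOCAB = {"the", "quick", "brown", "fox", "oil"}
CONFUSIONS = {"0": "o", "1": "l", "5": "s", "o": "0", "l": "1"}


def candidates(token: str) -> list[str]:
    """Return every spelling reachable by swapping confusable characters."""
    options = [(ch, CONFUSIONS[ch]) if ch in CONFUSIONS else (ch,) for ch in token]
    return ["".join(combo) for combo in product(*options)]


def correct_token(token: str) -> str:
    """Replace a token with a vocabulary word if a confusable swap yields one."""
    lowered = token.lower()
    if lowered in VOCAB:
        return token  # already a known word; keep original casing
    for candidate in candidates(lowered):
        if candidate in VOCAB:
            return candidate
    return token  # no plausible correction; keep the OCR output as-is


def correct_line(line: str) -> str:
    """Apply token-level correction to a whitespace-separated line."""
    return " ".join(correct_token(tok) for tok in line.split())


print(correct_line("The quick brown f0x"))  # The quick brown fox
```

The candidate list grows exponentially with the number of confusable characters in a token, which is acceptable for a demonstration but is one reason production systems use probabilistic decoding instead.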
Accuracy figures you'll see cited (such as 99%) are typically measured on clean, straight, high-contrast printed text. Real-world scans of handwritten notes, old typewritten pages, or damaged documents perform significantly worse.
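To make an accuracy percentage concrete, one common way to measure it is character accuracy: one minus the character error rate (CER), where CER is the edit distance between the OCR output and a hand-checked ground truth, divided by the ground-truth length. The sketch below assumes you have such a ground-truth transcription; the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Character accuracy as 1 - CER, floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    cer = levenshtein(ocr_text, ground_truth) / len(ground_truth)
    return max(0.0, 1.0 - cer)


truth = "The quick brown fox"
print(f"{character_accuracy('The quick brown f0x', truth):.3f}")  # 0.947
```

By this measure, 99% accuracy on a page of 2,000 characters still leaves about 20 wrong characters, which is why review matters for critical documents.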
Factors That Affect OCR Accuracy
| Factor | Impact | What to Do |
|---|---|---|
| Image resolution | Very high | Use 300 DPI minimum; 400 DPI for small text |
| Skew/rotation | High | Correct before OCR or use auto-deskew |
| Contrast | High | Black text on white background is best |
| Font type | Medium | Serif/sans-serif printed text works best |
| Noise/stains | Medium | Clean up with image processing first |
| Handwriting | Very high | Most OCR is poor at cursive/unusual handwriting |
| Tables | High | Specialized table extraction needed |
| Language | High | Match OCR language to document language |
| Compression artifacts | Medium | Use lossless or high-quality source scans |
For the best OCR results, scan at 300 DPI in black and white (not grayscale) for text-only documents, or 300 DPI in color for documents with images or forms.
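If you receive image files without reliable DPI metadata, you can estimate the effective resolution from the pixel dimensions and the physical page size. The sketch below assumes US Letter pages (8.5 × 11 inches) by default; pass different dimensions for A4 or other sizes. Both helper names are illustrative.

```python
def effective_dpi(width_px: int, height_px: int,
                  page_width_in: float = 8.5,
                  page_height_in: float = 11.0) -> float:
    """Estimate scan resolution from pixel size and physical page size."""
    # Use the smaller of the two axes so a cropped edge does not inflate the result.
    return min(width_px / page_width_in, height_px / page_height_in)


def pixels_needed(dpi: int,
                  page_width_in: float = 8.5,
                  page_height_in: float = 11.0) -> tuple[int, int]:
    """Pixel dimensions a full-page scan should have at the given DPI."""
    return round(dpi * page_width_in), round(dpi * page_height_in)


print(pixels_needed(300))         # (2550, 3300)
print(effective_dpi(1275, 1650))  # 150.0
```

A Letter page that arrives at 1275 × 1650 pixels is effectively a 150 DPI scan, which falls below the 300 DPI minimum recommended above.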
Online OCR: Browser-Based Tools
ConvertIntoMP4 PDF OCR
ConvertIntoMP4's PDF OCR tool converts scanned PDFs directly in your browser. Upload a scanned PDF, and it returns a searchable PDF with the recognized text embedded as a text layer — the original appearance is preserved, but now you can search and copy text.
This is the fastest option for occasional OCR work without software installation.
Image to Text
For images (PNG, JPG, TIFF) containing text, the image to text tool extracts the text content directly — useful for photographed documents, screenshots, or scanned pages saved as images rather than PDFs.
Tesseract: The Open-Source OCR Engine
Tesseract is the industry-standard open-source OCR engine, used by many commercial products as their underlying technology. It's free, highly accurate for printed text, and supports 100+ languages.
Installation
# macOS
brew install tesseract tesseract-lang
# Ubuntu/Debian
sudo apt install tesseract-ocr tesseract-ocr-eng tesseract-ocr-fra # add language packs as needed
# Windows — download from: https://github.com/UB-Mannheim/tesseract/wiki
Basic Usage
# Convert image to text file (Tesseract appends .txt to the output base name)
tesseract document.png output
# Convert with a specific language
tesseract document.png output -l eng
# Convert to searchable PDF (text layer over image)
tesseract document.png output -l eng pdf
# Convert to hOCR (HTML with position data)
tesseract document.png output hocr
# Multiple languages (e.g., English + French)
tesseract document.png output -l eng+fra
Preprocessing for Better Accuracy
# Improve contrast before OCR with ImageMagick.
# Options, in order: gamma correction, mild sharpening, straighten skewed text.
# (A comment cannot follow a trailing backslash, so they are listed here.)
convert input.jpg \
  -level 0%,100%,0.5 \
  -sharpen 0x1 \
  -deskew 40% \
  preprocessed.png
# Then OCR the preprocessed image
tesseract preprocessed.png output pdf
For heavily skewed or rotated scans, deskewing before OCR can improve accuracy by 20–40%.
Python OCR Workflows
Basic OCR with pytesseract
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter


def ocr_image(image_path: str, language: str = "eng") -> str:
    """Extract text from an image file."""
    image = Image.open(image_path)
    return pytesseract.image_to_string(image, lang=language)


def ocr_with_preprocessing(image_path: str, language: str = "eng") -> str:
    """OCR with contrast enhancement and sharpening."""
    image = Image.open(image_path).convert("L")  # Convert to grayscale
    image = ImageEnhance.Contrast(image).enhance(2.0)  # Enhance contrast
    image = image.filter(ImageFilter.SHARPEN)  # Sharpen edges
    return pytesseract.image_to_string(image, lang=language)


def ocr_with_confidence(image_path: str, min_conf: float = 60) -> str:
    """Return only the words recognized with confidence above min_conf."""
    image = Image.open(image_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    words = []
    for word, conf in zip(data["text"], data["conf"]):
        # conf is -1 for non-word boxes and may arrive as int, float, or string
        if word.strip() and float(conf) > min_conf:
            words.append(word)
    return " ".join(words)
Batch OCR Scanned PDFs
import io
from pathlib import Path

import fitz  # PyMuPDF
import pytesseract
from PIL import Image


def page_to_image(page: fitz.Page, dpi: int = 300) -> Image.Image:
    """Render a PDF page as a PIL image."""
    zoom = dpi / 72  # PDF user space is 72 units per inch
    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)
    return Image.open(io.BytesIO(pix.tobytes("png")))


def ocr_pdf(input_pdf: str, output_pdf: str, language: str = "eng") -> None:
    """Convert a scanned PDF to a searchable PDF."""
    doc = fitz.open(input_pdf)
    for page_num, page in enumerate(doc, start=1):
        print(f"Processing page {page_num}/{len(doc)}...")
        # Skip pages that already have text
        if len(page.get_text()) > 50:
            print(f"  Page {page_num} already has text, skipping")
            continue
        # Render the already-open page and OCR it
        text = pytesseract.image_to_string(page_to_image(page), lang=language)
        # Add an invisible text layer. The text becomes searchable but is not
        # aligned with the words in the image.
        page.insert_text(
            (0, 1), text,  # start one point down so the first line is on the page
            fontsize=1,
            render_mode=3,  # 3 = invisible text
            overlay=False,
        )
    doc.save(output_pdf)
    doc.close()
    print(f"Saved searchable PDF: {output_pdf}")


def batch_ocr(input_folder: str, output_folder: str) -> None:
    """Process a folder of scanned PDFs."""
    input_path = Path(input_folder)
    output_path = Path(output_folder)
    output_path.mkdir(parents=True, exist_ok=True)
    pdfs = sorted(input_path.glob("*.pdf"))
    print(f"Found {len(pdfs)} PDFs to process")
    for pdf in pdfs:
        print(f"\nProcessing: {pdf.name}")
        ocr_pdf(str(pdf), str(output_path / pdf.name))


batch_ocr("./scanned_docs", "./searchable_docs")
Table Extraction from Scanned Documents
Regular text OCR doesn't handle tables well — it reads cells in the wrong order or merges columns. For tables, specialized tools work better:
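To see why reading order matters, here is a minimal sketch of rebuilding table rows from word positions. It assumes you already have each word's pixel coordinates (for example, the `left` and `top` values returned by `pytesseract.image_to_data`); the `Word` type, the sample coordinates, and the `row_tolerance` value are hypothetical and chosen for illustration.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    left: int  # x coordinate of the word's bounding box, in pixels
    top: int   # y coordinate of the word's bounding box, in pixels


def group_rows(words: list[Word], row_tolerance: int = 10) -> list[list[str]]:
    """Cluster words into rows by vertical position, then order each row left to right."""
    rows: list[list[Word]] = []
    for word in sorted(words, key=lambda w: w.top):
        # A word joins the current row if it sits close to that row's first word.
        if rows and abs(word.top - rows[-1][0].top) <= row_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in sorted(row, key=lambda w: w.left)] for row in rows]


# Words arrive in arbitrary order, as they might from a confused segmenter.
words = [
    Word("Qty", 300, 52), Word("Item", 40, 50),
    Word("Widget", 40, 90), Word("3", 300, 91),
    Word("Gadget", 40, 130), Word("12", 300, 128),
]
for row in group_rows(words):
    print(row)
```

This handles simple grids only. Merged cells, wrapped text within a cell, and skewed scans need the more specialized tools discussed in this section.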
# Using camelot for table extraction. Camelot reads only text-based PDFs,
# so run OCR first on scanned files.
import camelot

tables = camelot.read_pdf("document_with_tables.pdf", pages="all")
tables.export("output.csv", f="csv")  # writes one CSV file per table
# Check accuracy
for i, table in enumerate(tables):
    print(f"Table {i+1} accuracy: {table.parsing_report['accuracy']:.1f}%")
    print(table.df.head())
For images with tables (not PDFs), consider processing with OpenCV to detect grid lines before OCR, or use Google Document AI or Azure AI Document Intelligence (formerly Form Recognizer) for complex table extraction.
Document Conversion After OCR
Once the text is extracted, you can convert it to editable formats.
PDF to Word (After OCR Makes It Searchable)
A truly scanned PDF (image-only) can't be converted to Word until OCR creates a text layer. The workflow:
- OCR the scanned PDF → searchable PDF
- Convert searchable PDF → Word document
The PDF to Word converter handles PDFs that already have text layers. For scanned-only PDFs, use PDF OCR first to create the text layer.
Saving OCR Results
import json

from docx import Document  # python-docx

# extracted_text and page_count are assumed to come from your OCR step
# Save as plain text
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(extracted_text)
# Save as Word document
doc = Document()
doc.add_paragraph(extracted_text)
doc.save("output.docx")
# Save structured data as JSON
with open("output.json", "w", encoding="utf-8") as f:
    json.dump({"text": extracted_text, "pages": page_count}, f, ensure_ascii=False)
Accuracy Benchmarks by Document Type
| Document Type | Expected Accuracy | Notes |
|---|---|---|
| Clean printed text (300 DPI) | 99%+ | Best case |
| Old typewriter text | 95–98% | Minor character confusion (l/1, 0/O) |
| Newspaper/magazine columns | 92–96% | Multi-column layout challenges |
| Handwritten print text | 70–85% | Highly variable by handwriting |
| Cursive handwriting | 40–70% | Significantly worse |
| Mathematical formulas | 60–80% | Specialized OCR better |
| Tables and forms | 80–90% | Structural extraction needed |
| Low-quality scan (under 150 DPI) | 70–85% | Rescan if possible |
| Skewed document (>5°) | Lower by 10–20% | Deskew first |
For documents where accuracy is critical (legal, medical, financial), always review OCR output. Build a review workflow into your automation.
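One simple way to build that review step is to flag pages whose average word confidence falls below a threshold. The sketch below assumes you have collected per-word confidence values for each page (Tesseract reports these on a 0 to 100 scale, with -1 for boxes that are not words); the 85.0 threshold and the function names are illustrative choices, not recommended values.

```python
from statistics import mean


def page_confidence(confidences: list[float]) -> float:
    """Mean word confidence for a page, ignoring -1 entries (non-word boxes)."""
    valid = [c for c in confidences if c >= 0]
    # A page with no recognized words scores 0 so that it is always reviewed.
    return mean(valid) if valid else 0.0


def pages_needing_review(pages: dict[int, list[float]],
                         threshold: float = 85.0) -> list[int]:
    """Return page numbers whose mean confidence is below the threshold."""
    return sorted(
        page for page, confs in pages.items()
        if page_confidence(confs) < threshold
    )


# Hypothetical confidence values keyed by page number.
pages = {
    1: [96, 94, -1, 91],
    2: [62, -1, 70, 88],
    3: [-1, -1],
}
print(pages_needing_review(pages))  # [2, 3]
```

Page 3 is flagged because it produced no words at all, which usually means a blank page or a scan the engine could not read.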
Language Support
Tesseract supports 100+ languages. Common language codes:
| Language | Code | Language | Code |
|---|---|---|---|
| English | eng | French | fra |
| Spanish | spa | German | deu |
| Japanese | jpn | Chinese (Simplified) | chi_sim |
| Arabic | ara | Russian | rus |
| Korean | kor | Portuguese | por |
| Italian | ita | Dutch | nld |
For mixed-language documents:
tesseract document.png output -l eng+fra+deu
Right-to-left languages (Arabic, Hebrew) may need additional configuration:
tesseract arabic_doc.png output -l ara --dpi 300 -c preserve_interword_spaces=1
Commercial OCR APIs
For production workflows requiring higher accuracy or specific features:
| Service | Strength | Pricing |
|---|---|---|
| Google Document AI | Forms, tables, receipts | Per page |
| Azure AI Document Intelligence | Enterprise documents | Per page |
| Amazon Textract | Forms, tables, signatures | Per page |
| Adobe PDF Services | PDF-native integration | Per document |
| ABBYY Cloud OCR | Accuracy-focused | Per page |
Tesseract is adequate for general printed text. Commercial services shine for forms, tables, receipts, and handwriting.
Frequently Asked Questions
Can OCR handle handwritten notes?
Modern OCR handles printed handwriting reasonably well, but cursive is still challenging. For personal handwriting digitization, dedicated apps like Microsoft OneNote, Google Lens, or specialized handwriting recognition tools perform better than general-purpose OCR.
Why does my OCR output have random characters mixed in?
Random characters typically indicate either: 1) the image resolution is too low (below 200 DPI), 2) there's significant noise or staining in the scan, or 3) the OCR language doesn't match the document language. Try scanning at a higher resolution and setting the correct language.
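A rough way to detect this problem automatically is to measure how much of the output consists of unexpected symbols. The heuristic below is an assumption-laden sketch: the `EXPECTED_PUNCTUATION` set is an arbitrary choice suited to ordinary prose, and documents with code, math, or unusual symbols would need a different set.

```python
# Assumption for this sketch: letters, digits, and this punctuation set
# count as "expected" characters in ordinary prose.
EXPECTED_PUNCTUATION = set(".,;:!?'\"()-/&%$#@")


def suspicious_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like OCR noise."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    noise = [
        ch for ch in visible
        if not (ch.isalnum() or ch in EXPECTED_PUNCTUATION)
    ]
    return len(noise) / len(visible)


print(suspicious_ratio("Invoice total: $1,250.00"))  # 0.0
print(suspicious_ratio("~|^ He||o"))                 # 0.625
```

A high ratio on a page is a hint to rescan at a higher resolution or check the language setting before trusting the text.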
Can OCR recover text from a damaged or wet document?
OCR can only work with what the camera or scanner captures. If text is blurred, faded, or obscured, preprocessing (contrast enhancement, sharpening) helps somewhat. For severely damaged documents, forensic imaging techniques exist but are outside standard OCR tools.
Does OCR work on PDF forms with fillable fields?
PDFs with interactive form fields already contain machine-readable text — no OCR needed. OCR is only required for image-based content. The PDF form filling guide covers processing interactive PDF forms.
How do I improve OCR on a two-column document?
Set Tesseract's page segmentation mode: --psm 3 (automatic) handles most multi-column layouts. If columns are being merged, try --psm 1 (automatic with orientation and script detection) or manually crop each column and OCR separately.
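For the manual cropping approach, the crop boxes can be computed from the page dimensions. This sketch assumes equal-width columns separated by a fixed gutter, which will not hold for every layout; `column_boxes` is an illustrative helper, and the boxes use the `(left, upper, right, lower)` order that Pillow's `Image.crop` expects.

```python
def column_boxes(width: int, height: int, columns: int = 2,
                 gutter: int = 0) -> list[tuple[int, int, int, int]]:
    """Split a page into equal-width column boxes as (left, upper, right, lower)."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    col_width = width / columns
    half_gutter = gutter // 2
    boxes = []
    for i in range(columns):
        # Trim half the gutter from each inner edge, never from the page edges.
        left = round(i * col_width) + (half_gutter if i > 0 else 0)
        right = round((i + 1) * col_width) - (half_gutter if i < columns - 1 else 0)
        boxes.append((left, 0, right, height))
    return boxes


# A Letter page scanned at 300 DPI, two columns, 50-pixel gutter.
print(column_boxes(2550, 3300, columns=2, gutter=50))
# [(0, 0, 1250, 3300), (1300, 0, 2550, 3300)]
```

Each box can then be passed to `image.crop(box)` and the cropped column sent to `pytesseract.image_to_string(column, config="--psm 6")`, where `--psm 6` tells Tesseract to assume a single uniform block of text. Concatenating the results in column order restores the reading order.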
Building a Document Digitization Pipeline
For organizations with large archives to digitize, the workflow typically looks like this:
- Scan at 300 DPI, black and white for text, color for forms with color coding
- Preprocess — auto-deskew, auto-crop to document boundaries, remove blank pages
- OCR — Tesseract for general text, commercial API for forms/tables
- Quality check — automated confidence scoring, flag pages below threshold for review
- Output — searchable PDFs, extracted text files, structured data (JSON/CSV for forms)
- Index — feed text into search engine (Elasticsearch, Solr, or even SQLite FTS)
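As an illustration of the indexing step with SQLite, the sketch below stores one row per OCR'd page in an FTS5 full-text table and runs a keyword query against it. It assumes your Python's bundled SQLite was built with the FTS5 extension, which is common but not guaranteed; the table layout, function names, and sample rows are hypothetical.

```python
import sqlite3


def build_index(pages: list[tuple[str, int, str]]) -> sqlite3.Connection:
    """Create an in-memory full-text index from (filename, page, text) rows."""
    conn = sqlite3.connect(":memory:")
    # Only the body column is searchable; filename and page are stored as-is.
    conn.execute(
        "CREATE VIRTUAL TABLE pages USING fts5("
        "filename UNINDEXED, page UNINDEXED, body)"
    )
    conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", pages)
    conn.commit()
    return conn


def search(conn: sqlite3.Connection, query: str) -> list[tuple[str, int]]:
    """Return (filename, page) for pages matching the query, best match first."""
    rows = conn.execute(
        "SELECT filename, page FROM pages WHERE pages MATCH ? ORDER BY rank",
        (query,),
    )
    return [(filename, int(page)) for filename, page in rows]


# Hypothetical OCR output for three pages.
conn = build_index([
    ("contract_1999.pdf", 1, "This lease agreement is made between the parties"),
    ("contract_1999.pdf", 2, "Rent is payable monthly in advance"),
    ("invoice_2004.pdf", 1, "Invoice total payable within thirty days"),
])
print(search(conn, "lease"))
print(sorted(search(conn, "payable")))
```

For a persistent archive, pass a file path to `sqlite3.connect` instead of `":memory:"`. Larger deployments would move to Elasticsearch or Solr, as listed above.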
The scanned PDF to searchable guide covers the practical steps for making archive PDFs searchable, and the PDF tools overview covers the full range of PDF processing options available for document workflows.



