OCR Document Conversion: Turn Scanned Files Into Editable Text

When a PDF Isn't Really a PDF

Not all PDFs contain text — many are just images of text inside a PDF wrapper. Scanned documents, photographed contracts, faxes, and export-to-PDF functions from older software all produce "image PDFs" where the text exists only as pixels. You can read them, but you can't search, copy, or edit the text.

OCR (Optical Character Recognition) analyzes these images and extracts the text, letting you convert a scanned invoice into a Word document, make a 20-year archive of contracts searchable, or extract data from a photographed form for processing.

This guide covers how OCR works, what affects accuracy, and the practical workflows for converting scanned documents — from single pages to thousands of files.

How OCR Actually Works

Modern OCR uses neural networks trained on millions of document images. The process:

Image preprocessing — Correct skew (straighten rotated pages), denoise, and enhance contrast. This step dramatically affects accuracy on poor scans.
Page segmentation — Identify regions: text blocks, tables, images, headers, footers. Multi-column layouts require correct segmentation to preserve reading order.
Character recognition — Analyze each character against trained models. Neural networks score confidence for each character.
Language models — Apply language context to improve accuracy. "The quick brown f0x" — the model knows "fox" is more likely than "f0x" in English text.
Output formatting — Reconstruct layout for Word/PDF output, or extract plain text.
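The language-model step can be illustrated with a toy example. The sketch below is not how Tesseract's neural language model works. It assumes a tiny hand-written vocabulary and a hypothetical table of look-alike characters (both invented for illustration), and simply tries substitutions until a token matches a known word.

```python
from itertools import product

# Hypothetical look-alike table: digits that are easily mistaken for letters.
CONFUSABLE = {"0": "o", "1": "l", "5": "s", "8": "b"}

# Tiny illustrative vocabulary; a real system would use a full lexicon
# or a statistical language model.
VOCAB = {"the", "quick", "brown", "fox", "lazy", "dog"}


def correct_token(token: str, vocab: set = VOCAB) -> str:
    """Return a vocabulary word reachable by swapping look-alike characters."""
    lowered = token.lower()
    if lowered in vocab:
        return token
    # For each character, consider the original and its look-alike (if any).
    options = [(ch, CONFUSABLE[ch]) if ch in CONFUSABLE else (ch,) for ch in lowered]
    for candidate in product(*options):
        word = "".join(candidate)
        if word in vocab:
            return word
    return token  # no plausible correction found; leave unchanged


def correct_text(text: str) -> str:
    """Apply the token-level correction to every whitespace-separated token."""
    return " ".join(correct_token(tok) for tok in text.split())


print(correct_text("The quick brown f0x"))  # The quick brown fox
```

Real engines weigh many candidates by probability rather than taking the first dictionary hit, but the principle is the same: context decides between visually similar characters.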

Accuracy numbers you'll see cited (99% accuracy) are measured on clean, straight, high-contrast printed text. Real-world scans of handwritten notes, old typewriters, or damaged documents perform significantly worse.
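To make a figure like "99% accuracy" concrete: character accuracy is commonly computed as one minus the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The sketch below assumes that definition (vendors may measure differently) and uses made-up sample strings.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters recovered correctly, from 0.0 to 1.0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1 - errors / len(reference))


reference = "Invoice total: 1,250.00"
ocr_output = "lnvoice tota1: 1,25O.00"  # three confusions: I/l, l/1, 0/O
print(f"{character_accuracy(reference, ocr_output):.1%}")  # 87.0%
```

By the same arithmetic, a 2,000-character page at 99% character accuracy still contains about 20 wrong characters, which is why review matters for critical documents.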

Factors That Affect OCR Accuracy

Factor	Impact	What to Do
Image resolution	Very high	Use 300 DPI minimum; 400 DPI for small text
Skew/rotation	High	Correct before OCR or use auto-deskew
Contrast	High	Black text on white background is best
Font type	Medium	Serif/sans-serif printed text works best
Noise/stains	Medium	Clean up with image processing first
Handwriting	Very high	Most OCR is poor at cursive/unusual handwriting
Tables	High	Specialized table extraction needed
Language	High	Match OCR language to document language
Compression artifacts	Medium	Use lossless or high-quality source scans

For the best OCR results, scan at 300 DPI in black and white (not grayscale) for text-only documents, or 300 DPI in color for documents with images or forms.
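The DPI recommendation translates directly into pixel dimensions, which is a quick way to check whether an existing scan is good enough. A minimal sketch, assuming you know the physical page size (the function names are illustrative, not from any library):

```python
def pixels_at_dpi(width_in: float, height_in: float, dpi: int) -> tuple:
    """Pixel dimensions of a page scanned at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(pixel_width: int, page_width_in: float) -> float:
    """Estimate scan resolution from image width and physical page width."""
    return pixel_width / page_width_in


# US Letter is 8.5 x 11 inches.
print(pixels_at_dpi(8.5, 11, 300))  # (2550, 3300)
print(pixels_at_dpi(8.5, 11, 400))  # (3400, 4400)

# A 1275-pixel-wide image of a Letter page was scanned at about 150 DPI,
# well below the 300 DPI minimum recommended above.
print(effective_dpi(1275, 8.5))     # 150.0
```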

Online OCR: Browser-Based Tools

ConvertIntoMP4 PDF OCR

ConvertIntoMP4's PDF OCR tool converts scanned PDFs directly in your browser. Upload a scanned PDF, and it returns a searchable PDF with the recognized text embedded as a text layer — the original appearance is preserved, but now you can search and copy text.

This is the fastest option for occasional OCR work without software installation.

Image to Text

For images (PNG, JPG, TIFF) containing text, the image to text tool extracts the text content directly — useful for photographed documents, screenshots, or scanned pages saved as images rather than PDFs.

Tesseract: The Open-Source OCR Engine

Tesseract is the industry-standard open-source OCR engine, used by many commercial products as their underlying technology. It's free, highly accurate for printed text, and supports 100+ languages.

Installation

# macOS
brew install tesseract tesseract-lang

# Ubuntu/Debian
sudo apt install tesseract-ocr tesseract-ocr-eng tesseract-ocr-fra  # add language packs as needed

# Windows — download from: https://github.com/UB-Mannheim/tesseract/wiki

Basic Usage

# Convert image to text file (writes output.txt; the extension is added automatically)
tesseract document.png output

# Convert with specific language
tesseract document.png output -l eng

# Convert to searchable PDF (text layer over image)
tesseract document.png output -l eng pdf

# Convert to hOCR (HTML with position data)
tesseract document.png output hocr

# Multiple languages (e.g., English + French)
tesseract document.png output -l eng+fra
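When these commands are driven from a script, it helps to assemble the argument list in one place. The helper below is a hypothetical sketch (its name and parameters are not part of Tesseract). It only builds the command and relies on two Tesseract CLI options: `-l` for the language and `--psm` for the page segmentation mode.

```python
import shlex


def build_tesseract_command(image, output_base, languages=("eng",),
                            output_format=None, psm=None):
    """Build an argument list for the tesseract CLI (does not run it)."""
    cmd = ["tesseract", str(image), str(output_base)]
    cmd += ["-l", "+".join(languages)]  # e.g. eng+fra
    if psm is not None:
        cmd += ["--psm", str(psm)]      # page segmentation mode
    if output_format:
        cmd.append(output_format)       # output config name such as pdf or hocr
    return cmd


cmd = build_tesseract_command("document.png", "output", ("eng", "fra"), "pdf")
print(shlex.join(cmd))
# tesseract document.png output -l eng+fra pdf

# To actually run it (requires tesseract on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with file names that contain spaces.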

Preprocessing for Better Accuracy

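# Improve contrast before OCR with ImageMagick
# (use `magick` instead of `convert` on ImageMagick 7):
#   -level 0%,100%,0.5   gamma correction
#   -sharpen 0x1         mild sharpening
#   -deskew 40%          straighten skewed text
convert input.jpg \
  -level 0%,100%,0.5 \
  -sharpen 0x1 \
  -deskew 40% \
  preprocessed.png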

# Then OCR the preprocessed image
tesseract preprocessed.png output pdf

For heavily skewed or rotated scans, deskewing before OCR can improve accuracy by 20–40%.

Python OCR Workflows

Basic OCR with pytesseract

import pytesseract
from PIL import Image
from pathlib import Path

def ocr_image(image_path: str, language: str = "eng") -> str:
    """Extract text from an image file."""
    image = Image.open(image_path)
    text = pytesseract.image_to_string(image, lang=language)
    return text

def ocr_with_preprocessing(image_path: str) -> str:
    """OCR with contrast enhancement."""
    from PIL import ImageFilter, ImageEnhance

    image = Image.open(image_path).convert("L")  # Convert to grayscale

    # Enhance contrast
    enhancer = ImageEnhance.Contrast(image)
    image = enhancer.enhance(2.0)

    # Sharpen
    image = image.filter(ImageFilter.SHARPEN)

    return pytesseract.image_to_string(image, lang="eng")

# Get confidence data
def ocr_with_confidence(image_path: str):
    image = Image.open(image_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

    words = []
    for i, word in enumerate(data["text"]):
        # conf is -1 for non-word rows and may arrive as a float or numeric string
        if word.strip() and float(data["conf"][i]) > 60:  # Only high-confidence words
            words.append(word)

    return " ".join(words)
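The same confidence values can be aggregated per page to decide which pages need a human look. The sketch below works on a hand-written dictionary in the shape that `image_to_data(..., output_type=Output.DICT)` returns (only the `text` and `conf` keys are used), so it runs without Tesseract installed. The threshold of 80 is an arbitrary assumption to tune on your own documents.

```python
from statistics import mean


def page_confidence(data: dict) -> float:
    """Mean confidence over recognized words; 0.0 if the page has none."""
    scores = [
        float(conf)
        for word, conf in zip(data["text"], data["conf"])
        if str(word).strip() and float(conf) >= 0  # conf of -1 marks non-word rows
    ]
    return mean(scores) if scores else 0.0


def needs_review(data: dict, threshold: float = 80.0) -> bool:
    """Flag a page whose average word confidence falls below the threshold."""
    return page_confidence(data) < threshold


# Made-up sample data for illustration.
sample = {
    "text": ["", "Invoice", "total:", "1,25O.00", ""],
    "conf": [-1, 96.2, 91.0, 41.5, -1],
}
print(round(page_confidence(sample), 1))  # 76.2
print(needs_review(sample))               # True
```

Averaging keeps every word in the output while still surfacing weak pages, whereas the filter above silently drops low-confidence words.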

Batch OCR Scanned PDFs

import fitz  # PyMuPDF
import pytesseract
from PIL import Image
from pathlib import Path
import io

def pdf_page_to_image(pdf_path: str, page_num: int, dpi: int = 300) -> Image.Image:
    """Render a PDF page as an image."""
    # Open with a context manager so the file is closed after each call
    with fitz.open(pdf_path) as doc:
        page = doc[page_num]

        # Render at specified DPI
        zoom = dpi / 72  # PDF user space is 72 units per inch
        mat = fitz.Matrix(zoom, zoom)
        pix = page.get_pixmap(matrix=mat, alpha=False)
        img_data = pix.tobytes("png")

    return Image.open(io.BytesIO(img_data))

def ocr_pdf(input_pdf: str, output_pdf: str, language: str = "eng"):
    """Convert scanned PDF to searchable PDF."""
    doc = fitz.open(input_pdf)

    for page_num in range(len(doc)):
        print(f"Processing page {page_num + 1}/{len(doc)}...")

        page = doc[page_num]

        # Skip pages that already have text
        if len(page.get_text()) > 50:
            print(f"  Page {page_num + 1} already has text, skipping")
            continue

        # Render page to image
        image = pdf_page_to_image(input_pdf, page_num)

        # OCR the image
        text = pytesseract.image_to_string(image, lang=language)

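        # Add an invisible text layer so the page becomes searchable.
        # The text is placed as one block at the top-left corner; it is not
        # aligned with the words in the scan. The built-in default font is
        # intended for Latin text, so other scripts need a suitable fontfile.
        page.insert_text(
            (0, 1), text,
            fontsize=1,
            render_mode=3,  # 3 = invisible (neither filled nor stroked)
        )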

    doc.save(output_pdf)
    print(f"Saved searchable PDF: {output_pdf}")

# Process a folder of scanned PDFs
def batch_ocr(input_folder: str, output_folder: str):
    input_path = Path(input_folder)
    output_path = Path(output_folder)
    output_path.mkdir(exist_ok=True)

    pdfs = list(input_path.glob("*.pdf"))
    print(f"Found {len(pdfs)} PDFs to process")

    for pdf in pdfs:
        out_file = output_path / pdf.name
        print(f"\nProcessing: {pdf.name}")
        ocr_pdf(str(pdf), str(out_file))

batch_ocr("./scanned_docs", "./searchable_docs")

Table Extraction from Scanned Documents

Regular text OCR doesn't handle tables well — it reads cells in the wrong order or merges columns. For tables, specialized tools work better:

# Using camelot for table extraction from PDFs that have a text layer
# (run OCR first on scanned PDFs; camelot does not read image-only pages)
import camelot

tables = camelot.read_pdf("document_with_tables.pdf", pages="all")
tables.export("output.csv", f="csv")  # one CSV per table, named by page and table number

# Check accuracy
for i, table in enumerate(tables):
    print(f"Table {i+1} accuracy: {table.parsing_report['accuracy']:.1f}%")
    print(table.df.head())

For images with tables (not PDFs), consider processing with OpenCV to detect grid lines before OCR, or use Google Document AI or Azure AI Document Intelligence (formerly Form Recognizer) for complex table extraction.
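To see why position data matters for tables, here is a minimal sketch that rebuilds rows from word bounding boxes, such as the `left` and `top` values in Tesseract's TSV or hOCR output. The input tuples and the 10-pixel row tolerance are assumptions for illustration. Real tables also need column detection and handling for multi-word cells.

```python
def words_to_rows(words, row_tolerance=10):
    """Group word boxes into rows by vertical position, then sort left to right.

    `words` is a list of (text, left, top) tuples in pixels.
    """
    rows = []
    for text, left, top in sorted(words, key=lambda w: w[2]):
        # Stay on the current row while the word is within the tolerance
        # of that row's first word; otherwise start a new row.
        if rows and abs(top - rows[-1]["top"]) <= row_tolerance:
            rows[-1]["cells"].append((left, text))
        else:
            rows.append({"top": top, "cells": [(left, text)]})
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# Made-up word boxes, deliberately out of reading order.
words = [
    ("Qty", 300, 52), ("Item", 40, 50), ("Price", 520, 51),
    ("Widget", 40, 98), ("4.50", 520, 101), ("2", 300, 100),
]
for row in words_to_rows(words):
    print(row)
# ['Item', 'Qty', 'Price']
# ['Widget', '2', '4.50']
```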

Document Conversion After OCR

Once text is extracted, you can convert it to editable formats:

PDF to Word (After OCR Makes It Searchable)

A truly scanned PDF (image-only) can't be converted to Word until OCR creates a text layer. The workflow:

OCR the scanned PDF → searchable PDF
Convert searchable PDF → Word document

The PDF to Word converter handles PDFs that already have text layers. For scanned-only PDFs, use PDF OCR first to create the text layer.

Saving OCR Results

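# Assumes extracted_text (str) and page_count (int) come from the OCR step
import json
from docx import Document  # provided by the python-docx package

# Save as plain text
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(extracted_text)

# Save as Word document
doc = Document()
doc.add_paragraph(extracted_text)
doc.save("output.docx")

# Save structured data as JSON (keep non-ASCII characters readable)
with open("output.json", "w", encoding="utf-8") as f:
    json.dump({"text": extracted_text, "pages": page_count}, f, ensure_ascii=False)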

Accuracy Benchmarks by Document Type

Document Type	Expected Accuracy	Notes
Clean printed text (300 DPI)	99%+	Best case
Old typewriter text	95–98%	Minor character confusion (l/1, 0/O)
Newspaper/magazine columns	92–96%	Multi-column layout challenges
Handwritten print text	70–85%	Highly variable by handwriting
Cursive handwriting	40–70%	Significantly worse
Mathematical formulas	60–80%	Specialized OCR better
Tables and forms	80–90%	Structural extraction needed
Low-quality scan (under 150 DPI)	70–85%	Rescan if possible
Skewed document (>5°)	Lower by 10–20%	Deskew first

For documents where accuracy is critical (legal, medical, financial), always review OCR output. Build a review workflow into your automation.

Language Support

Tesseract supports 100+ languages. Common language codes:

Language	Code	Language	Code
English	eng	French	fra
Spanish	spa	German	deu
Japanese	jpn	Chinese (Simplified)	chi_sim
Arabic	ara	Russian	rus
Korean	kor	Portuguese	por
Italian	ita	Dutch	nld

For mixed-language documents:

tesseract document.png output -l eng+fra+deu

Right-to-left languages (Arabic, Hebrew) require additional configuration:

tesseract arabic_doc.png output -l ara --dpi 300 -c preserve_interword_spaces=1

Commercial OCR APIs

For production workflows requiring higher accuracy or specific features:

Service	Strength	Pricing
Google Document AI	Forms, tables, receipts	Per page
Azure AI Document Intelligence	Enterprise documents	Per page
Amazon Textract	Forms, tables, signatures	Per page
Adobe PDF Services	PDF-native integration	Per document
ABBYY Cloud OCR	Accuracy-focused	Per page

Tesseract is adequate for general printed text. Commercial services shine for forms, tables, receipts, and handwriting.

Frequently Asked Questions

Can OCR handle handwritten notes?

Modern OCR handles neatly printed (block-letter) handwriting reasonably well, but cursive is still challenging. For personal handwriting digitization, dedicated apps like Microsoft OneNote, Google Lens, or specialized handwriting recognition tools perform better than general-purpose OCR.

Why does my OCR output have random characters mixed in?

Random characters typically indicate either: 1) the image resolution is too low (below 200 DPI), 2) there's significant noise or staining in the scan, or 3) the OCR language doesn't match the document language. Try scanning at a higher resolution and setting the correct language.
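A rough way to spot this kind of garbage automatically is to measure how many tokens look like ordinary words or numbers. The heuristic below is an illustrative assumption, not a standard metric. It only understands unaccented Latin letters, so it suits English text and would need adapting for other languages.

```python
import re

# A token counts as "word-like" if it is letters (optionally hyphenated or
# with an apostrophe) or a number with optional separators.
WORD_LIKE = re.compile(r"^[A-Za-z]+(?:[-'][A-Za-z]+)*$|^\d[\d.,]*$")


def word_like_ratio(text: str) -> float:
    """Share of whitespace-separated tokens that look like words or numbers."""
    tokens = [tok.strip(".,;:!?()\"'") for tok in text.split()]
    tokens = [tok for tok in tokens if tok]
    if not tokens:
        return 0.0
    return sum(bool(WORD_LIKE.match(tok)) for tok in tokens) / len(tokens)


clean = "Payment is due within 30 days of the invoice date."
noisy = "P@ym3nt i$ du~ w1thin 30 d|ys"  # made-up example of a bad scan

print(word_like_ratio(clean))            # 1.0
print(round(word_like_ratio(noisy), 2))  # 0.17
```

Pages with a low ratio are good candidates for rescanning at a higher resolution or rerunning with a different language setting.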

Can OCR recover text from a damaged or wet document?

OCR can only work with what the camera or scanner captures. If text is blurred, faded, or obscured, preprocessing (contrast enhancement, sharpening) helps somewhat. For severely damaged documents, forensic imaging techniques exist but are outside standard OCR tools.

Does OCR work on PDF forms with fillable fields?

PDFs with interactive form fields already contain machine-readable text — no OCR needed. OCR is only required for image-based content. The PDF form filling guide covers processing interactive PDF forms.

How do I improve OCR on a two-column document?

Set Tesseract's page segmentation mode with --psm. Mode 3 (fully automatic, and the default) handles most multi-column layouts. If columns are being merged, try --psm 1 (automatic with orientation and script detection), or manually crop each column and OCR it separately.
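For the manual-crop route, the crop boxes are simple arithmetic. This sketch assumes equal-width columns separated by a fixed gutter, which real scans often violate, and returns boxes in the (left, upper, right, lower) order that Pillow's `Image.crop` expects. The function name is illustrative.

```python
def column_boxes(page_width, page_height, columns=2, gutter=0):
    """Return (left, upper, right, lower) crop boxes, one per column."""
    usable = page_width - gutter * (columns - 1)
    col_width = usable // columns
    boxes = []
    for i in range(columns):
        left = i * (col_width + gutter)
        boxes.append((left, 0, left + col_width, page_height))
    return boxes


# A Letter page at 300 DPI (2550 x 3300 px) with a 50 px gutter.
print(column_boxes(2550, 3300, columns=2, gutter=50))
# [(0, 0, 1250, 3300), (1300, 0, 2550, 3300)]
```

Each box can be passed to `image.crop(box)`, and the crops OCR'd left to right so the text comes out in reading order.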

Building a Document Digitization Pipeline

For organizations with large archives to digitize, a typical workflow is:

Scan at 300 DPI, black and white for text, color for forms with color coding
Preprocess — auto-deskew, auto-crop to document boundaries, remove blank pages
OCR — Tesseract for general text, commercial API for forms/tables
Quality check — automated confidence scoring, flag pages below threshold for review
Output — searchable PDFs, extracted text files, structured data (JSON/CSV for forms)
Index — feed text into search engine (Elasticsearch, Solr, or even SQLite FTS)
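The indexing step can be prototyped with nothing more than Python's built-in sqlite3 module. The sketch below assumes your SQLite build includes the FTS5 extension (most do). The table layout and sample rows are invented for illustration.

```python
import sqlite3


def build_index(pages, db_path=":memory:"):
    """Create an FTS5 table and load (filename, page_number, text) rows."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS pages "
        "USING fts5(filename UNINDEXED, page UNINDEXED, body)"
    )
    conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", pages)
    conn.commit()
    return conn


def search(conn, query):
    """Return (filename, page) pairs whose text matches the FTS query."""
    rows = conn.execute(
        "SELECT filename, page FROM pages WHERE pages MATCH ? ORDER BY rank",
        (query,),
    )
    return rows.fetchall()


conn = build_index([
    ("contract_1998.pdf", 1, "This lease agreement is made between the parties"),
    ("contract_1998.pdf", 2, "The tenant shall pay rent monthly"),
    ("invoice_2004.pdf", 1, "Invoice total due within thirty days"),
])
for filename, page in search(conn, "rent"):
    print(filename, page)  # contract_1998.pdf 2
```

Storing one row per page rather than per document lets search results point straight to the page that needs opening.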

The scanned PDF to searchable guide covers the practical steps for making archive PDFs searchable, and the PDF tools overview covers the full range of PDF processing options available for document workflows.