The Problem with Scanned PDFs
A scanned PDF is essentially a collection of photographs. When you scan a paper document, the scanner captures an image of each page -- not the text on it. The result is a PDF file where every page is a flat image. You cannot search for words, select and copy text, or use screen readers for accessibility. The text is visible to your eyes but invisible to your computer.
This creates real problems. You cannot find a specific clause in a 200-page scanned contract by pressing Ctrl+F. You cannot copy an address or phone number from a scanned letter. You cannot extract data from a scanned invoice into a spreadsheet. And for anyone using assistive technology, scanned PDFs are completely inaccessible.
OCR (Optical Character Recognition) solves this by analyzing the images in a scanned PDF, recognizing the text characters, and adding an invisible text layer to each page. The result is a searchable PDF -- visually identical to the original scan, but with fully selectable, searchable, copyable text.

How OCR Works
Modern OCR engines use a multi-stage process to convert images to text:
- Preprocessing -- The image is cleaned up: deskewed (straightened), denoised (speckles removed), binarized (converted to black and white for better contrast), and scaled to an optimal resolution for recognition.
- Layout analysis -- The engine identifies text regions, separating them from images, tables, headers, footers, and margins. It determines reading order, column structure, and paragraph boundaries.
- Character recognition -- Each text region is analyzed character by character. Modern engines like Tesseract 5 use LSTM (Long Short-Term Memory) neural networks trained on millions of text samples. The network outputs a probability for each possible character and selects the most likely sequence.
- Post-processing -- The recognized text is checked against language dictionaries and grammar rules to correct likely errors. For example, "rn" misrecognized as "m" (a common OCR error) can be corrected by checking if the resulting word exists in the dictionary.
- Text layer creation -- The final recognized text is placed into an invisible layer that precisely overlays the original image. Each word is positioned to match its visual location on the page.
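To make the post-processing stage more concrete, here is a minimal Python sketch of a dictionary check. It is not how Tesseract or any particular engine implements correction; the `CONFUSIONS` list, the tiny `DICTIONARY`, and the function names are all hypothetical and chosen only for illustration.

```python
# Hypothetical confusion pairs: (what the OCR produced, what was likely printed).
CONFUSIONS = [("m", "rn"), ("rn", "m"), ("1", "l"), ("0", "o")]


def candidates(word):
    """Yield variants of `word` with one confusable substring swapped."""
    for seen, printed in CONFUSIONS:
        start = word.find(seen)
        while start != -1:
            yield word[:start] + printed + word[start + len(seen):]
            start = word.find(seen, start + 1)


def correct_word(word, dictionary):
    """Keep known words; otherwise return the first known variant, else the word as-is."""
    if word.lower() in dictionary:
        return word
    for variant in candidates(word.lower()):
        if variant in dictionary:
            return variant
    return word


# A toy dictionary; a real one would hold tens of thousands of words.
DICTIONARY = {"the", "corner", "office", "modern", "bill", "total"}

print(correct_word("comer", DICTIONARY))  # corner
print(correct_word("tota1", DICTIONARY))  # total
print(correct_word("xyzzy", DICTIONARY))  # xyzzy (no known variant, left unchanged)
```

Note that this sketch lowercases corrected words and only tries one swap at a time. It also cannot catch errors that produce another valid word, which is one reason manual verification still matters.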
Method 1: Convert Scanned PDFs Online
Our online OCR tool processes scanned PDFs entirely in your browser with no software to install.
Step-by-Step Instructions
- Open the ConvertIntoMP4 PDF OCR tool
- Upload your scanned PDF (supports files up to 50 MB)
- Select the document language (or multiple languages for multilingual documents)
- Choose the output type:
  - Searchable PDF -- Original images with invisible text layer (most common)
  - Text-only PDF -- Recognized text reformatted as a text document
  - Text file (.txt) -- Plain text extraction only
- Click Process and wait for OCR to complete
- Download your searchable PDF
Our OCR engine supports 17 languages: English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Chinese, Japanese, Korean, Arabic, Hindi, Turkish, Polish, Swedish, and Norwegian.
What You Get
The output is a PDF that looks exactly like your original scan but with a hidden text layer that enables:
- Ctrl+F search -- Find any word or phrase instantly
- Text selection -- Click and drag to select text, then copy to clipboard
- Text-to-speech -- Screen readers can read the document aloud
- Data extraction -- Copy text into other documents, spreadsheets, or databases
Method 2: Adobe Acrobat OCR
Adobe Acrobat Pro includes powerful OCR functionality:
- Open the scanned PDF in Acrobat Pro
- Click Scan & OCR in the right panel
- Click Recognize Text > In This File
- Set language and output settings
- Click Recognize Text
Acrobat's OCR is highly accurate and handles complex layouts well, but it requires a paid subscription (listed here at $22.99/month; Adobe's pricing varies by plan, billing term, and region).
Method 3: Free Desktop Tools
Tesseract OCR (Open Source)
Tesseract is the most widely used open-source OCR engine. It was originally developed at Hewlett-Packard, was later sponsored by Google, and is now maintained as a community open-source project. It powers many commercial OCR products.
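```bash
# Install Tesseract and Poppler (Poppler provides pdftoppm)
# macOS: brew install tesseract poppler
# Ubuntu: sudo apt install tesseract-ocr poppler-utils

# Render each page of the scanned PDF to a 300 DPI image.
# Files are named page-1.ppm, page-2.ppm, ... (numbers are zero-padded for longer PDFs).
pdftoppm -r 300 scanned.pdf page

# OCR one page image; this writes output-1.pdf with a searchable text layer
tesseract page-1.ppm output-1 pdf
```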
For multi-page PDFs, tools like ocrmypdf wrap Tesseract with convenient batch processing:
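```bash
# Install ocrmypdf (Tesseract and Ghostscript must already be installed;
# the --clean option additionally needs the unpaper utility)
pip install ocrmypdf

# One command to OCR an entire PDF
ocrmypdf input.pdf output.pdf --language eng --deskew --clean
```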
NAPS2 (Free)
NAPS2 (Not Another PDF Scanner 2) is a free, open-source application that combines scanning and OCR. It started as a Windows program, and recent versions are also available for macOS and Linux:
- Open NAPS2
- Import your scanned PDF (File > Import)
- Click OCR and select the language
- Export as searchable PDF
Optimizing Scan Quality for Better OCR
The accuracy of OCR depends heavily on the quality of the input scan. A clean, high-resolution scan can achieve 99%+ accuracy, while a poor scan might drop to 70-80%.
| Factor | Optimal Setting | Impact on Accuracy | Common Mistakes |
|---|---|---|---|
| Resolution (DPI) | 300 DPI (600 for small text) | Critical -- below 200 DPI accuracy drops sharply | Scanning at 72-150 DPI for smaller files |
| Color mode | Grayscale (not color, not B&W) | High -- grayscale preserves edge detail | Using pure B&W threshold which destroys antialiasing |
| Contrast | High contrast between text and background | High -- low contrast causes character confusion | Scanning faded documents without contrast adjustment |
| Skew angle | 0 degrees (perfectly straight) | Medium -- most OCR engines auto-deskew | Placing pages crooked on the scanner bed |
| Page cleanliness | No stains, marks, or folds across text | Medium -- noise can be mistaken for characters | Scanning coffee-stained or creased pages |
| Font size | 10 pt or larger | Medium -- very small text needs higher DPI | Footnotes and fine print at low resolution |
| Font type | Standard printed fonts (serif or sans-serif) | Low-Medium -- unusual fonts reduce accuracy | Decorative, script, or damaged typefaces |
Pro Tip: If you have the option to re-scan, choose 300 DPI in grayscale. Uncompressed, a color scan holds three times as much data as a grayscale scan, with no gain in OCR accuracy (the engine converts to grayscale internally anyway). The extra file size slows processing and takes up storage without any recognition benefit. Reserve 600 DPI for documents with very small text (6-8 pt), such as footnotes, legal fine print, or technical specifications.
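The size difference is easy to check with a little arithmetic. The Python sketch below assumes a US Letter page (8.5 x 11 inches) and 8 bits per channel, and it computes uncompressed pixel data only; actual PDF sizes depend on the compression your scanner applies. The function name is made up for this example.

```python
def raw_scan_bytes(width_in, height_in, dpi, channels):
    """Uncompressed size in bytes of one scanned page at 8 bits per channel."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels


LETTER = (8.5, 11.0)  # page size in inches (assumption for this example)

gray_300 = raw_scan_bytes(*LETTER, dpi=300, channels=1)
color_300 = raw_scan_bytes(*LETTER, dpi=300, channels=3)
gray_600 = raw_scan_bytes(*LETTER, dpi=600, channels=1)

print(f"300 DPI grayscale: {gray_300 / 1e6:.1f} MB")  # 8.4 MB
print(f"300 DPI color:     {color_300 / 1e6:.1f} MB")  # 25.2 MB
print(f"600 DPI grayscale: {gray_600 / 1e6:.1f} MB")  # 33.7 MB
```

Color triples the data at the same resolution, while doubling the DPI quadruples it, which is why 600 DPI is best kept for pages that really need it.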
Handling Different Document Types
Standard Business Documents
Letters, memos, reports, and contracts with standard fonts and clean layouts convert with 95-99% accuracy. These are the easiest documents to OCR.
Workflow: Scan at 300 DPI grayscale, run OCR with English language, verify a few key sections.
Financial Documents
Bank statements, invoices, and receipts contain numbers, currency symbols, and tabular data. OCR accuracy for numbers is critical because a single wrong digit can have real consequences.
Workflow: Scan at 300 DPI, run OCR, then verify all numeric values against the original. For extracting table data into spreadsheets after OCR, see our guide on how to convert PDF to Excel.
Historical and Degraded Documents
Old documents with yellowed paper, faded ink, and aged typefaces present the greatest OCR challenge. The text may be partially illegible even to human readers.
Workflow: Scan at 600 DPI, preprocess to enhance contrast, use OCR with post-correction. Accept that accuracy may be 80-90% and plan for manual review.
Multilingual Documents
Documents containing text in multiple languages (common in international contracts, academic papers, and government documents) require multi-language OCR.
Workflow: Select all relevant languages in the OCR settings. Most engines support specifying 2-3 languages simultaneously. Accuracy may be slightly lower than single-language processing because the dictionary lookup has more candidates.
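With Tesseract-based tools, multiple languages are passed as a single value joined by plus signs, for example `eng+fra`. The Python sketch below builds that value from language names. The helper name and the small lookup table are assumptions for illustration (the codes shown are Tesseract's standard ones for these four languages), and each language's data file must be installed for the engine to use it.

```python
# Tesseract language codes for a few languages (illustrative subset).
TESSERACT_CODES = {
    "English": "eng",
    "French": "fra",
    "German": "deu",
    "Spanish": "spa",
}


def language_argument(languages):
    """Build a Tesseract-style language value such as 'eng+fra'.

    Duplicates are dropped and the original order is kept.
    Raises KeyError for a language that is not in the lookup table.
    """
    codes = []
    for name in languages:
        code = TESSERACT_CODES[name]
        if code not in codes:
            codes.append(code)
    return "+".join(codes)


print(language_argument(["English", "French", "English"]))  # eng+fra
```

The result can be passed to the language option, for example `tesseract page-1.ppm output-1 -l eng+fra pdf`. List the dominant language first and include only the languages that actually appear in the document.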
Handwritten Documents
Handwriting recognition is significantly harder than printed text recognition. Modern OCR engines can handle neat, consistent handwriting with moderate accuracy (70-85%) but struggle with cursive, messy, or inconsistent handwriting.
Workflow: Use an OCR engine with handwriting support (not all engines include this). Expect extensive manual correction. For critical documents, manual transcription may be more efficient.

Post-OCR Quality Assurance
OCR is not perfect. Even the best engines make mistakes, and uncorrected OCR errors can cause problems -- especially in legal, financial, and medical documents. Here is a systematic quality assurance process.
Common OCR Errors
| Error Type | Example | Cause | Detection Method |
|---|---|---|---|
| Character confusion | "rn" read as "m", "l" read as "1" | Visual similarity between characters | Spell-check, manual review |
| Merged words | "the report" read as "thereport" | Tight character spacing | Spell-check highlights unknown words |
| Split words | "information" read as "infor mation" | Worn or uneven ink, damaged print | Search for unusual spaces |
| Wrong numbers | "5" read as "6", "0" read as "O" | Similar shapes, low resolution | Manual verification of critical numbers |
| Missing characters | "document" read as "docment" | Faded ink, low contrast | Spell-check, word count comparison |
| Extra characters | Specks recognized as periods or commas | Paper noise, scanner artifacts | Manual review of punctuation |
Verification Steps
- Search for common problem words -- Search the OCR'd text for known error-prone words in your document type
- Check every page -- Confirm the page count matches the original and the text layer has content on every page
- Spot-check paragraphs -- Read 2-3 random paragraphs against the original image
- Verify critical data -- Manually confirm names, dates, amounts, and identifiers
- Run spell-check -- Copy the recognized text into Word or a text editor and run spell-check to catch recognition errors
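Some of these checks can be partly automated once you have the recognized text for each page. The Python sketch below assumes you have already extracted that text into a list of strings, one per page (how you extract it depends on your PDF tooling). The function names and the toy dictionary are hypothetical.

```python
import re


def empty_pages(page_texts):
    """Return 1-based numbers of pages whose text layer is empty or whitespace."""
    return [number for number, text in enumerate(page_texts, start=1)
            if not text.strip()]


def suspicious_tokens(text, dictionary):
    """Flag tokens that mix letters and digits, or alphabetic tokens not in the dictionary."""
    flagged = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        has_digit = any(ch.isdigit() for ch in token)
        has_alpha = any(ch.isalpha() for ch in token)
        if has_digit and has_alpha:
            flagged.append(token)  # e.g. "tota1": likely "l" read as "1"
        elif has_alpha and token.lower() not in dictionary:
            flagged.append(token)  # e.g. "thereport": likely merged words
    return flagged


pages = ["Invoice tota1 due", "   ", "thereport is final"]
words = {"invoice", "total", "due", "the", "report", "is", "final"}

print(empty_pages(pages))                         # [2]
print(suspicious_tokens(" ".join(pages), words))  # ['tota1', 'thereport']
```

A script like this only narrows down where to look. Pure numbers are not flagged at all, so amounts, dates, and identifiers still need to be compared against the original image by hand.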
Batch Processing Scanned PDFs
Organizations with large archives of scanned documents need batch OCR capabilities.
Online Batch OCR
Upload multiple scanned PDFs to our PDF OCR tool simultaneously. Each file is processed independently, and you can download all results as a ZIP archive.
Automated Pipeline
For ongoing document digitization:
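```bash
# Watch a folder for new scanned PDFs and OCR them automatically
# (requires inotify-tools; close_write fires only once a file is fully written)
inotifywait -m /incoming -e close_write -e moved_to --format '%f' |
while IFS= read -r file; do
  # Skip anything that is not a PDF
  case "$file" in
    *.pdf|*.PDF) ;;
    *) continue ;;
  esac
  ocrmypdf "/incoming/$file" "/processed/$file" \
    --language eng --deskew --clean --optimize 1
done
```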
Pro Tip: When batch processing hundreds of scanned PDFs, prioritize by business need rather than processing everything at once. OCR the documents people search for most frequently first (contracts, policies, financial records), then work through the archive chronologically. This delivers immediate value while the full archive is still being processed.
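One simple way to apply this ordering is to sort the queue by category first and scan date second. The Python sketch below is only an illustration: the category names, the `PRIORITY` table, and the document records are all made up, and your own archive will need its own categories.

```python
from datetime import date

# Hypothetical categories that people search for most often; everything else comes later.
PRIORITY = {"contract": 0, "policy": 0, "financial": 0}


def ocr_order(documents):
    """Sort documents so high-demand categories come first, then oldest first."""
    return sorted(
        documents,
        key=lambda doc: (PRIORITY.get(doc["category"], 1), doc["scanned_on"]),
    )


queue = [
    {"name": "memo-2019.pdf", "category": "memo", "scanned_on": date(2019, 3, 1)},
    {"name": "lease.pdf", "category": "contract", "scanned_on": date(2021, 6, 15)},
    {"name": "newsletter-2015.pdf", "category": "newsletter", "scanned_on": date(2015, 1, 10)},
    {"name": "budget.pdf", "category": "financial", "scanned_on": date(2020, 2, 2)},
]

print([doc["name"] for doc in ocr_order(queue)])
# ['budget.pdf', 'lease.pdf', 'newsletter-2015.pdf', 'memo-2019.pdf']
```

The sorted list can then be fed to whatever runs the OCR step, so the documents in highest demand become searchable first.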
OCR and PDF Accessibility
Making scanned PDFs searchable is a critical step toward document accessibility, but it is not the complete solution. A truly accessible PDF also needs:
- Document structure tags -- Headings, paragraphs, lists, and tables tagged for screen reader navigation
- Reading order -- Content ordered logically, not just visually
- Alt text for images -- Descriptions for non-text visual elements
- Language specification -- Document language declared for correct pronunciation
OCR creates the text layer that makes other accessibility improvements possible. Without OCR, a scanned PDF is a complete barrier to anyone using a screen reader. With OCR, you have a foundation to build on.
For a comprehensive guide to PDF accessibility requirements and implementation, see our PDF accessibility guide.
Combining OCR with Other PDF Operations
OCR is often one step in a larger document workflow:
- Scan paper documents to create image-based PDFs
- Rotate any misoriented pages using our rotate PDF tool
- OCR to add searchable text layers
- Merge related documents with our merge PDF tool
- Add page numbers for navigation (see our guide on how to add page numbers to PDF)
- Compress the final document with our PDF compressor -- OCR'd PDFs are often larger due to the added text layer
- Password-protect sensitive documents using our password protection tool
For extracting specific pages from a large scanned document before OCR processing, our extract pages tool lets you pull out only the pages you need, reducing OCR processing time.

OCR Output Formats
After OCR, you may want the content in a format other than searchable PDF:
- Searchable PDF -- The standard output; original images with invisible text layer
- Word document -- Convert the OCR'd PDF to Word for editing. See our guide on how to convert PDF to Word
- Excel spreadsheet -- Extract tables from OCR'd documents. See how to convert PDF to Excel
- Plain text -- Extract all recognized text as a .txt file for data processing
- PowerPoint -- Convert OCR'd presentation scans back to editable slides. See how to convert PDF to PowerPoint
Cost and Speed Considerations
OCR processing time depends on the number of pages, scan resolution, document complexity, and the OCR engine used:
- Simple documents (1-10 pages): 5-30 seconds
- Medium documents (10-50 pages): 30 seconds to 3 minutes
- Large documents (50-200 pages): 3-10 minutes
- Archive batches (200+ pages): 10-30 minutes
Higher resolution scans take longer to process but produce better results. The processing time investment is almost always worth the accuracy improvement.
Our online tool processes files for free up to the size limit, with no page count restriction. For high-volume processing, our Pro plan offers faster processing, larger file sizes, and API access for automated workflows.
Best Practices Summary
- Scan at 300 DPI minimum -- Resolution is the single biggest factor in OCR accuracy
- Use grayscale, not color -- Same accuracy, roughly one third the file size, faster processing
- Deskew before OCR -- Straighten crooked scans for better recognition
- Select the correct language -- The dictionary and character set must match the document
- Verify critical data manually -- Never trust OCR blindly for numbers, names, or legal text
- Keep the original scans -- Always retain the source images in case you need to re-OCR with better settings
- Process at the right time -- Fix page orientation first, then OCR before merging, reordering, or other PDF operations
- Compress after OCR -- The text layer adds file size; compression offsets it
Converting scanned PDFs to searchable documents transforms dead images into living, usable text. Whether you are digitizing a paper archive, processing incoming scanned mail, or making documents accessible, OCR is the essential technology that bridges the gap between physical paper and digital searchability.
For deeper OCR guidance including language-specific tips and troubleshooting, see our comprehensive tutorial on how to OCR scanned documents.



