Why Convert PDF to Markdown?
PDF is the universal format for finished documents. It looks the same on every device, preserves fonts and layout, and is the default output of nearly every word processor and typesetting system. But PDF was designed for reading, not for editing, versioning, or reusing content.
Markdown, on the other hand, is a lightweight plain-text format that is native to the modern developer and documentation workflow. It renders beautifully in GitHub, GitLab, Notion, Obsidian, Hugo, Jekyll, Docusaurus, and dozens of other platforms. It diffs cleanly in version control. It is easy to parse, easy to transform, and easy to maintain.
Converting PDF to Markdown bridges the gap between polished output and editable source. Common scenarios include:
- Migrating documentation to a static site generator like Hugo, Astro, or Docusaurus
- Importing research papers or specifications into a knowledge base (Obsidian, Logseq, Notion)
- Making legacy documents Git-friendly so teams can track changes, review diffs, and collaborate
- Extracting content from PDF reports for reuse in blog posts, wikis, or internal dashboards
- Accessibility improvements: Markdown converts to semantic HTML, which screen readers handle better than most PDF viewers
The challenge is that PDF is a page-description language. It stores glyphs at absolute coordinates on a page, not semantic paragraphs, headings, or table cells. A good PDF-to-Markdown converter must reconstruct the document's logical structure from visual layout cues, which is a non-trivial problem.
Text-Layer Extraction vs. OCR
Before choosing a tool, you need to understand what kind of PDF you are working with.
Text-Layer PDFs (Digitally Created)
PDFs created by word processors and typesetting systems (Word, Google Docs, LaTeX) or export functions (Chrome "Print to PDF") contain an embedded text layer. The characters are stored as character codes that the embedded fonts map to Unicode, along with position data. Extracting text from these PDFs is fast and accurate because the tool reads the text directly rather than interpreting pixels.
Image-Only PDFs (Scanned Documents)
PDFs created by scanning physical documents contain only raster images — one image per page. There is no text layer. To extract text, you need Optical Character Recognition (OCR), which analyzes the image and attempts to identify characters. OCR accuracy depends on scan quality, font type, and language.
Hybrid PDFs
Some PDFs have both text and images. For example, a report might have text-layer body content but include scanned appendices or embedded screenshots with text. The best tools handle both in a single pass.
How to Tell Which Type You Have
Open the PDF in any viewer and try selecting text with your cursor. If you can highlight individual words, it has a text layer. If the entire page selects as one image block, it is image-only and requires OCR.
| PDF Type | Text Selectable | Requires OCR | Accuracy | Speed |
|---|---|---|---|---|
| Digital | Yes | No | Near 100% | Fast |
| Scanned | No | Yes | 85-99% | Slow |
| Hybrid | Partial | Partial | Varies | Moderate |
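The manual check can be automated once you have a character count for each page (for example, by running pdftotext one page at a time). The sketch below is a minimal illustration: the function name `classify_pdf` and the threshold `min_chars_per_page` are assumptions made for this example, not part of any tool discussed here.

```python
def classify_pdf(page_char_counts, min_chars_per_page=25):
    """Classify a PDF as 'digital', 'scanned', or 'hybrid'.

    page_char_counts: extractable characters found on each page.
    min_chars_per_page: illustrative threshold; pages below it are
    treated as image-only (stray characters can come from stamps or
    page numbers added by a scanner).
    """
    if not page_char_counts:
        raise ValueError("need at least one page")
    text_pages = sum(1 for n in page_char_counts if n >= min_chars_per_page)
    if text_pages == len(page_char_counts):
        return "digital"
    if text_pages == 0:
        return "scanned"
    return "hybrid"


print(classify_pdf([1200, 980, 1430]))  # every page has text
print(classify_pdf([1500, 0, 12]))      # some pages are image-only
```

Tune the threshold to your collection; a cover page with only a logo will look "scanned" even in a digital PDF.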
Key Challenges in PDF-to-Markdown Conversion
Heading Detection
PDF does not store heading levels (H1, H2, H3). A heading is just text rendered in a larger or bolder font at a certain position. Tools must infer heading hierarchy from font size, weight, and position — and they often get it wrong when the document uses inconsistent styling.
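To make the inference concrete, here is a minimal sketch of a font-size heuristic. It assumes you already have each line's text and font size from an extractor; the function names and the `(text, font_size)` input shape are hypothetical, and real converters combine this with weight, position, and learned layout models.

```python
from collections import Counter


def infer_heading_levels(lines, max_level=3):
    """lines: list of (text, font_size). Returns (level, text) pairs; level 0 is body text."""
    if not lines:
        return []
    # Weight each font size by how much text uses it; the dominant size is the body font.
    weight = Counter()
    for text, size in lines:
        weight[size] += len(text)
    body_size = weight.most_common(1)[0][0]
    # Sizes larger than the body font, biggest first, become H1, H2, ...
    heading_sizes = sorted((s for s in weight if s > body_size), reverse=True)
    level_of = {s: min(i + 1, max_level) for i, s in enumerate(heading_sizes)}
    return [(level_of.get(size, 0), text) for text, size in lines]


def to_markdown(lines):
    """Render the inferred structure as Markdown headings and paragraphs."""
    out = []
    for level, text in infer_heading_levels(lines):
        out.append(f"{'#' * level} {text}" if level else text)
    return "\n".join(out)


sample = [
    ("Report", 24.0),
    ("Intro", 16.0),
    ("Body text that runs long enough to dominate.", 11.0),
    ("Methods", 16.0),
    ("More body text here as well.", 11.0),
]
print(to_markdown(sample))
```

The weakness described above is visible in the design: if a document uses 15.5 pt and 16 pt for the same heading level, this heuristic splits them into two levels.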
Table Reconstruction
Tables in PDF are the single hardest element to convert. PDF draws tables as lines and positioned text — there is no <table> equivalent. Extracting structured table data requires detecting row and column boundaries from line positions and cell content alignment.
Simple tables with visible gridlines convert reasonably well. Complex tables with merged cells, nested headers, or no visible borders are where most tools struggle.
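The boundary detection described above can be sketched with simple coordinate clustering. This example assumes one text fragment per cell, left-aligned columns, and coordinates where y grows downward; the helper names and tolerances are illustrative assumptions, and merged cells or borderless layouts would defeat it.

```python
def cluster(values, tolerance):
    """Group 1-D coordinates; a new group starts when the gap exceeds tolerance."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tolerance:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [sum(g) / len(g) for g in groups]


def nearest(centers, v):
    """Index of the cluster center closest to v."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - v))


def cells_to_markdown_table(cells, x_tol=5.0, y_tol=3.0):
    """cells: list of (x, y, text) where x is the left edge and y the baseline."""
    if not cells:
        return ""
    cols = cluster([x for x, _, _ in cells], x_tol)
    rows = cluster([y for _, y, _ in cells], y_tol)
    grid = [["" for _ in cols] for _ in rows]
    for x, y, text in sorted(cells, key=lambda c: (c[1], c[0])):
        r, c = nearest(rows, y), nearest(cols, x)
        grid[r][c] = (grid[r][c] + " " + text).strip()
    lines = ["| " + " | ".join(row) + " |" for row in grid]
    lines.insert(1, "|" + "---|" * len(cols))  # header separator row
    return "\n".join(lines)


cells = [
    (50, 100, "Type"), (150, 100, "OCR"),
    (50, 115, "Digital"), (151, 115, "No"),
    (49, 130, "Scanned"), (150, 130, "Yes"),
]
print(cells_to_markdown_table(cells))
```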
Image Handling
Embedded images in a PDF need to be extracted as separate files and referenced in the Markdown output. The tool must identify image boundaries, extract the raster data, save it to a file (PNG or JPEG), and insert the appropriate Markdown image syntax, such as ![alt text](images/figure1.png).
Multi-Column Layouts
Magazine-style two-column or three-column layouts confuse most extractors. The tool may interleave text from different columns, producing garbled output. Documents with sidebars, pull quotes, or floating text boxes have the same problem.
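A rough sketch of why this happens and how a converter can avoid it: sorting text blocks purely top to bottom interleaves the columns, while assigning each block to a column first restores the reading order. The block format `(x0, y0, x1, text)` and the equal-width column assumption are hypothetical simplifications; full-width titles, sidebars, and pull quotes need extra handling.

```python
def reading_order(blocks, page_width, n_columns=2):
    """Return block texts in reading order for an equal-width column layout.

    blocks: list of (x0, y0, x1, text) with y growing downward.
    Each block is assigned to a column by its horizontal midpoint, then
    columns are read left to right and blocks within a column top to bottom.
    """
    col_width = page_width / n_columns

    def sort_key(block):
        x0, y0, x1, _ = block
        col = min(int(((x0 + x1) / 2) // col_width), n_columns - 1)
        return (col, y0)

    return [b[3] for b in sorted(blocks, key=sort_key)]


blocks = [
    (320, 50, 560, "C"), (40, 50, 280, "A"),
    (40, 200, 280, "B"), (320, 200, 560, "D"),
]
print(reading_order(blocks, page_width=600))
```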
Mathematical Notation
Academic papers and technical documents often contain mathematical equations rendered with LaTeX or MathML. Converting these to Markdown requires outputting either LaTeX math syntax ($E = mc^2$) or plain-text approximations. Few tools handle this well out of the box.
Tool Comparison
Pandoc
Pandoc is the Swiss Army knife of document conversion. It reads dozens of formats and writes dozens more. One important caveat: Pandoc can write PDF but has no PDF reader, so running it directly on input.pdf fails with an error. The usual workaround is to extract the text with pdftotext (from Poppler) and hand the result to Pandoc, which gives you raw text without much structural analysis.
# Pandoc cannot read PDF, so extract the text layer with pdftotext first
pdftotext input.pdf input.txt
# Then let Pandoc normalize the text into Markdown
pandoc -f markdown -t markdown --wrap=none input.txt -o output.md
# Extract embedded images to a folder with pdfimages (also from Poppler)
mkdir -p media && pdfimages -png input.pdf media/image
Strengths: Widely available, handles text-layer PDFs once pdftotext has extracted the text, excellent for piping into other Pandoc output formats.
Weaknesses: Cannot read PDF directly. No OCR support. Limited table detection. Headings are often missed because pdftotext does not analyze font sizes. Multi-column layouts produce garbled text.
Marker
Marker is a newer open-source tool specifically designed for high-quality PDF-to-Markdown conversion. It uses a combination of deep learning models for layout detection, OCR, and table recognition.
# Install Marker
pip install marker-pdf
# Convert a single PDF
marker_single input.pdf output_dir
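# Batch convert a directory (input directory first; newer releases take the output location via --output_dir, so check marker --help)
marker input_dir output_dir --workers 4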
Strengths: Excellent table detection, heading recognition, and multi-column handling. Works on both text-layer and scanned PDFs. Handles math notation. Open source and actively maintained.
Weaknesses: Requires Python and significant dependencies (PyTorch). First run downloads several GB of models. Slower than text-extraction-only tools. GPU recommended for reasonable performance.
Mathpix
Mathpix is a commercial API service that excels at converting documents with mathematical notation, scientific figures, and complex tables. It uses proprietary OCR models trained on academic content.
# Using the Mathpix CLI
mpx convert input.pdf -o output.md
# Via API (curl)
curl -X POST https://api.mathpix.com/v3/pdf \
  -H "app_id: YOUR_APP_ID" \
  -H "app_key: YOUR_APP_KEY" \
  -F "file=@input.pdf" \
  -F "options_json={\"conversion_formats\": {\"md\": true}}"
Strengths: Best-in-class math and equation handling. Excellent table recognition. Fast cloud processing. Handles complex academic layouts.
Weaknesses: Paid service (free tier limited to 100 pages/month). Requires sending documents to a third-party cloud. Not open source.
pdf2md / pdftomd
Several lightweight tools under variations of this name exist in the npm and Python ecosystems. They typically use pdfjs-dist (Mozilla's PDF.js) or pdfminer to extract text and apply heuristics for structure detection.
# Node.js variant
npx pdf2md input.pdf > output.md
# Python variant (pdfminer-based)
pip install pdftomd
pdftomd input.pdf output.md
Strengths: Lightweight, no GPU required, fast for text-layer PDFs. Good enough for simple, single-column documents.
Weaknesses: Poor table handling. Minimal heading detection. No OCR. Breaks down on complex layouts.
Docling (IBM)
Docling is IBM's open-source document understanding library. It uses layout analysis models to segment pages into regions (text, tables, figures, headers) and then converts each region appropriately.
pip install docling
# Python usage
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("input.pdf")
print(result.document.export_to_markdown())
Strengths: Strong layout analysis. Good table extraction. Handles multi-column layouts. Active development from IBM Research.
Weaknesses: Python-only. Requires model downloads. Newer project with a smaller community than Pandoc or Marker.
Feature Comparison Table
| Feature | Pandoc | Marker | Mathpix | pdf2md | Docling |
|---|---|---|---|---|---|
| Text-layer PDFs | Good | Great | Great | Good | Great |
| Scanned PDFs (OCR) | No | Yes | Yes | No | Yes |
| Heading detection | Poor | Great | Great | Fair | Good |
| Table extraction | Poor | Great | Great | Poor | Great |
| Image extraction | Good | Good | Good | No | Good |
| Math notation | Poor | Good | Best | No | Fair |
| Multi-column | Poor | Good | Good | Poor | Good |
| Speed | Fast | Moderate | Fast | Fast | Moderate |
| Offline | Yes | Yes | No | Yes | Yes |
| Free / Open Source | Yes | Yes | No | Yes | Yes |
| GPU recommended | No | Yes | N/A | No | Yes |
Practical Workflow: PDF to Markdown for a Static Site
Here is a step-by-step workflow for converting a collection of PDF documents into Markdown files suitable for a static site generator like Hugo, Astro, or Docusaurus.
Step 1: Audit Your PDFs
Before batch-converting, check what you are dealing with:
# Check if PDFs have a text layer (using pdftotext from Poppler)
for f in *.pdf; do
  chars=$(pdftotext "$f" - | wc -c)
  echo "$f: $chars characters"
done
Files with zero or very few characters are scanned/image-only and need OCR.
Step 2: Convert with Marker (Recommended)
For a mixed collection of digital and scanned PDFs, Marker handles both:
# Convert all PDFs in a directory (input directory first, then output directory)
marker pdf_collection/ output_markdown --workers 4
# Output structure:
# output_markdown/
#   document1/
#     document1.md
#     images/
#       figure1.png
#       figure2.png
Step 3: Add Frontmatter
Static site generators need YAML frontmatter. Add it with a script:
#!/bin/bash
for f in output_markdown/*/*.md; do
  filename=$(basename "$f" .md)
  # Use the first line as the title, stripping any leading # markers
  title=$(head -1 "$f" | sed -E 's/^#+[[:space:]]*//')
  # Prepend frontmatter
  tmpfile=$(mktemp)
  cat > "$tmpfile" << EOF
---
title: "$title"
date: $(date +%Y-%m-%d)
source: "$filename.pdf"
---
EOF
  cat "$f" >> "$tmpfile"
  mv "$tmpfile" "$f"
done
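If your titles may contain double quotes or backslashes, interpolating them straight into YAML can produce invalid frontmatter. The following Python sketch shows one way to escape them; the function name `add_frontmatter` and its parameters are illustrative assumptions, and it relies on the fact that a JSON string is also a valid double-quoted YAML scalar.

```python
import json
from datetime import date


def add_frontmatter(markdown, source_name, today=None):
    """Prepend YAML frontmatter; the title comes from the first non-empty line."""
    today = today or date.today()
    first = next((ln for ln in markdown.splitlines() if ln.strip()), "")
    # Fall back to the file name when the document has no usable first line.
    title = first.lstrip("#").strip() or source_name
    header = [
        "---",
        # json.dumps escapes quotes and backslashes inside the value.
        f"title: {json.dumps(title, ensure_ascii=False)}",
        f"date: {today.isoformat()}",
        f"source: {json.dumps(source_name + '.pdf', ensure_ascii=False)}",
        "---",
        "",
    ]
    return "\n".join(header) + markdown


print(add_frontmatter('# The "Q3" Report\n\nBody text.', "q3-report"))
```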
Step 4: Fix Image Paths
Adjust image references to match your static site's asset directory:
# Move images to static directory
mkdir -p static/images/docs
cp output_markdown/*/images/* static/images/docs/
# Update references in Markdown files (match only link targets, not every mention of images/)
sed -i 's|](images/|](/images/docs/|g' output_markdown/*/*.md
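For pipelines that already post-process the Markdown in Python, the same path rewrite can be done with a regular expression that only touches image link targets and leaves other mentions of images/ in the text alone. The function name and default prefix below are assumptions chosen to match the directories used in this step.

```python
import re

# Matches the opening of an image link whose target starts with "images/",
# e.g. "![Figure 1](images/". The alt text may be empty.
IMAGE_LINK = re.compile(r"(!\[[^\]]*\]\()images/")


def rewrite_image_paths(markdown, new_prefix="/images/docs/"):
    """Point relative image links at the static site's asset directory."""
    return IMAGE_LINK.sub(lambda m: m.group(1) + new_prefix, markdown)


text = "See the images/ folder.\n![Fig 1](images/figure1.png)"
print(rewrite_image_paths(text))
```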
Step 5: Review and Clean Up
Automated conversion is never perfect. Review the output for:
- Incorrect heading levels (promote or demote as needed)
- Broken tables (may need manual reformatting)
- Missing image alt text
- Garbled text from multi-column sections
- Leftover page numbers, headers, or footers
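Part of this review can be scripted. As one example, broken tables often show up as rows whose cell count differs from the header row. The sketch below flags those rows; it is a simplified check written for this article (the function name is hypothetical), and it does not account for escaped pipes inside cells.

```python
def find_ragged_tables(markdown):
    """Return (line_number, expected_cells, actual_cells) for table rows
    whose cell count differs from the first row of the same table."""
    problems = []
    expected = None
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not (len(stripped) >= 2 and stripped.startswith("|") and stripped.endswith("|")):
            expected = None  # any non-table line ends the current table
            continue
        cells = len(stripped[1:-1].split("|"))
        if expected is None:
            expected = cells  # first row of a new table sets the width
        elif cells != expected:
            problems.append((lineno, expected, cells))
    return problems


doc = "| A | B |\n|---|---|\n| 1 | 2 |\n| 3 |\n\nSome text\n"
for lineno, want, got in find_ragged_tables(doc):
    print(f"line {lineno}: expected {want} cells, found {got}")
```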
Handling Tables in PDF-to-Markdown
Tables deserve special attention because they are the most common failure point.
When Tables Convert Well
- Single header row
- Visible gridlines or borders
- No merged cells
- Consistent column alignment
- No nested tables
When Tables Break
- Merged header cells spanning multiple columns
- Cells containing line breaks or paragraphs
- Tables split across pages
- Borderless tables relying on whitespace alignment
- Tables wider than the page (rotated or scaled)
Manual Table Fix Strategy
When automated extraction fails, the fastest approach is often:
- Open the PDF and take a screenshot of the table
- Use an AI-assisted tool to convert the screenshot to Markdown
- Paste the result and verify
For large documents with many tables, consider extracting the PDF pages as images first and then using a table-specific tool:
# Extract page 5 as a high-res PNG
pdftoppm -png -r 300 -f 5 -l 5 input.pdf page
# Then feed page-5.png to a table extraction tool
If your documents are already in PDF format and you need to convert them for editing in other applications, our PDF Converter handles conversions to Word, Excel, PowerPoint, and other editable formats while preserving layout.
Tips for Better Conversion Results
Pre-Processing
- Linearize the PDF before conversion. Some tools handle linearized PDFs faster:
qpdf --linearize input.pdf linearized.pdf
- Remove password protection if the PDF is encrypted. You cannot extract text from an encrypted PDF without first decrypting it. Use our Unlock PDF tool for quick decryption.
- Split large PDFs into chapters or sections before converting. Most tools produce better results on shorter documents because layout context is easier to maintain:
# Split a PDF into individual pages
qpdf --split-pages input.pdf page-%d.pdf
Post-Processing
- Normalize heading levels. Many converters produce inconsistent headings. Use a linter like markdownlint to catch issues:
npx markdownlint-cli output.md
- Validate Markdown tables render correctly by previewing in VS Code or a Markdown viewer before committing.
- Strip page artifacts. Headers, footers, and page numbers from the PDF often appear inline in the Markdown. Search for repeating patterns and remove them.
- Check link integrity. PDF hyperlinks may or may not survive conversion. Verify that any URLs in the output are correct and not truncated.
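Stripping page artifacts is the easiest of these to automate when you still have the text split by page. The sketch below removes lines that repeat across most pages (running headers and footers) and lines that are only a page number. The function name, the `min_share` threshold, and the per-page input are assumptions for illustration; a line that is legitimately just a number would also be removed, so review the result.

```python
import re
from collections import Counter

# Lines such as "7", "Page 7", or "Page 7 of 30".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE)


def strip_page_artifacts(pages, min_share=0.6):
    """pages: list of per-page text. Returns the pages with repeated
    header/footer lines and bare page-number lines removed."""
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    threshold = max(2, min_share * len(pages))
    repeated = {ln for ln, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [
            ln for ln in page.splitlines()
            if ln.strip() not in repeated and not PAGE_NUMBER.match(ln)
        ]
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "ACME Annual Report\nIntro text\n1",
    "ACME Annual Report\nMore text\nPage 2 of 3",
    "ACME Annual Report\nFinal text\n3",
]
print(strip_page_artifacts(pages))
```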
Converting the Other Direction
Sometimes you need to go the other way — from Markdown to PDF. Pandoc excels at this:
# Markdown to PDF via LaTeX
pandoc input.md -o output.pdf
# Markdown to PDF via HTML (no LaTeX needed)
pandoc input.md -o output.pdf --pdf-engine=weasyprint
For document workflows that involve multiple format conversions, our Document Converter supports a wide range of input and output formats, including PDF, DOCX, and more.
Choosing the Right Tool for Your Use Case
| Use Case | Recommended Tool | Why |
|---|---|---|
| Simple text-layer PDFs | Pandoc | Fast, no dependencies, good enough for clean docs |
| Academic papers with math | Mathpix | Best equation handling, worth the cost for accuracy |
| Scanned documents | Marker | Free, handles OCR, good table detection |
| Large batch conversion | Marker or Docling | Both support batch processing with parallelism |
| Quick one-off conversion | pdf2md | Lightweight, instant, no setup |
| Enterprise document pipeline | Docling | IBM-backed, strong layout analysis, API-friendly |
Frequently Asked Questions
Can I convert a scanned PDF to Markdown without OCR?
No. Scanned PDFs contain only images. Without OCR, there is no text to extract. Tools like Marker and Mathpix include built-in OCR. For Pandoc, you would need to run Tesseract OCR separately first and then convert.
How accurate is PDF-to-Markdown conversion?
For clean, single-column, text-layer PDFs, accuracy is typically 95-99%. For scanned documents, expect 85-99% depending on scan quality, font, and language. Tables and multi-column layouts are the main sources of errors regardless of PDF type.
Does the conversion preserve hyperlinks?
It depends on the tool. Pandoc and Marker both attempt to preserve hyperlinks as Markdown link syntax. Image-only (scanned) PDFs never have hyperlinks — the links exist only as visible text, not as actual link objects.
What about PDF forms?
PDF forms (fillable fields) are not preserved in Markdown. The filled-in values may be extracted as text, but the form structure (checkboxes, dropdowns, text fields) does not have a Markdown equivalent. For form data extraction, you may want to convert to a structured format first. Our PDF to Excel conversion can help with tabular form data.
Can I automate this in a CI/CD pipeline?
Yes. All the command-line tools listed here (Pandoc, Marker, Docling) can be integrated into CI/CD pipelines. A common pattern is to store PDFs in a repository, convert them to Markdown on push, and deploy the Markdown to a static site.
# Example GitHub Actions step
- name: Convert PDFs to Markdown
  run: |
    pip install marker-pdf
    marker docs/pdfs/ output_markdown --workers 2
Conclusion
Converting PDF to Markdown is not a one-click operation for complex documents, but the tooling has improved dramatically. For most use cases, Marker offers the best balance of quality and accessibility — it handles OCR, tables, headings, and images in a single open-source package. For documents heavy on mathematical notation, Mathpix remains the gold standard. And for simple text extraction from clean PDFs, Pandoc is fast and reliable.
The key is matching your tool to your document type. Audit your PDFs first, pick the right converter, and always plan for a post-processing review pass. The initial conversion gets you 80-95% of the way there; a focused cleanup pass handles the rest.
For other document conversion needs, explore the full Document Converter on ConvertIntoMP4, which supports PDF, Word, Excel, PowerPoint, and dozens of other formats through an intuitive browser-based interface.



