How to Convert PDF to Markdown: Tools and Techniques

Why Convert PDF to Markdown?

PDF is the universal format for finished documents. It looks the same on every device, preserves fonts and layout, and is the default output of nearly every word processor and typesetting system. But PDF was designed for reading, not for editing, versioning, or reusing content.

Markdown, on the other hand, is a lightweight plain-text format that is native to the modern developer and documentation workflow. It renders beautifully in GitHub, GitLab, Notion, Obsidian, Hugo, Jekyll, Docusaurus, and dozens of other platforms. It diffs cleanly in version control. It is easy to parse, easy to transform, and easy to maintain.

Converting PDF to Markdown bridges the gap between polished output and editable source. Common scenarios include:

Migrating documentation to a static site generator like Hugo, Astro, or Docusaurus
Importing research papers or specifications into a knowledge base (Obsidian, Logseq, Notion)
Making legacy documents Git-friendly so teams can track changes, review diffs, and collaborate
Extracting content from PDF reports for reuse in blog posts, wikis, or internal dashboards
Accessibility improvements — Markdown renders semantically in HTML, which screen readers handle better than most PDF viewers

The challenge is that PDF is a page-description language. It stores glyphs at absolute coordinates on a page, not semantic paragraphs, headings, or table cells. A good PDF-to-Markdown converter must reconstruct the document's logical structure from visual layout cues, which is a non-trivial problem.

Text-Layer Extraction vs. OCR

Before choosing a tool, you need to understand what kind of PDF you are working with.

Text-Layer PDFs (Digitally Created)

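PDFs created by word processors (Word, Google Docs, LaTeX) or export functions (Chrome "Print to PDF") contain an embedded text layer. The characters are stored as font-specific character codes with position data, and the font's encoding tables map those codes back to Unicode. Extracting text from these PDFs is fast and usually accurate because the tool reads the text directly rather than interpreting pixels. The main exception is a PDF whose fonts lack a usable Unicode mapping, which can yield garbled characters even though the page looks fine.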

Image-Only PDFs (Scanned Documents)

PDFs created by scanning physical documents contain only raster images — one image per page. There is no text layer. To extract text, you need Optical Character Recognition (OCR), which analyzes the image and attempts to identify characters. OCR accuracy depends on scan quality, font type, and language.

Hybrid PDFs

Some PDFs have both text and images. For example, a report might have text-layer body content but include scanned appendices or embedded screenshots with text. The best tools handle both in a single pass.

How to Tell Which Type You Have

Open the PDF in any viewer and try selecting text with your cursor. If you can highlight individual words, it has a text layer. If the entire page selects as one image block, it is image-only and requires OCR.

PDF Type	Text Selectable	Requires OCR	Accuracy	Speed
Digital	Yes	No	Near 100%	Fast
Scanned	No	Yes	85-99%	Slow
Hybrid	Partial	Partial	Varies	Moderate
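
The table above can be turned into a rough triage rule. The following is a minimal sketch, assuming you already have the number of extractable characters per page (for example from a text extractor); the function name `classify_pdf` and the 50-character threshold are illustrative assumptions, not part of any tool discussed here.

```python
from typing import List


def classify_pdf(chars_per_page: List[int], min_chars: int = 50) -> str:
    """Classify a PDF as 'digital', 'scanned', or 'hybrid'.

    chars_per_page holds the number of extractable characters on each page.
    min_chars is an assumed threshold below which a page counts as image-only.
    """
    if not chars_per_page:
        raise ValueError("PDF has no pages")
    text_pages = sum(1 for n in chars_per_page if n >= min_chars)
    if text_pages == len(chars_per_page):
        return "digital"
    if text_pages == 0:
        return "scanned"
    return "hybrid"


print(classify_pdf([1800, 2100, 1950]))  # digital
print(classify_pdf([0, 3, 0]))           # scanned
print(classify_pdf([1800, 0, 2100]))     # hybrid
```

A per-page check matters because a single total can hide scanned appendices inside an otherwise digital report.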

Key Challenges in PDF-to-Markdown Conversion

Heading Detection

Most PDFs do not store heading levels (H1, H2, H3). Only tagged PDFs carry that structure, and not every converter reads the tags. Otherwise a heading is just text rendered in a larger or bolder font at a certain position. Tools must infer heading hierarchy from font size, weight, and position, and they often get it wrong when the document uses inconsistent styling.
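
To make the font-size heuristic concrete, here is a minimal sketch. It assumes the extractor has already produced `(text, font_size)` pairs, treats the most common size as body text, and maps each larger size to a heading level. The function name `infer_headings` and the input shape are assumptions for illustration; real converters also weigh font weight and position.

```python
from collections import Counter
from typing import List, Tuple


def infer_headings(spans: List[Tuple[str, float]], max_level: int = 3) -> List[str]:
    """Turn (text, font_size) spans into Markdown lines.

    The most common font size is assumed to be body text. Each distinct
    larger size becomes a heading level, largest first (H1, H2, ...).
    """
    if not spans:
        return []
    body_size = Counter(size for _, size in spans).most_common(1)[0][0]
    heading_sizes = sorted({size for _, size in spans if size > body_size}, reverse=True)
    level_for = {size: min(i + 1, max_level) for i, size in enumerate(heading_sizes)}

    lines = []
    for text, size in spans:
        if size in level_for:
            lines.append("#" * level_for[size] + " " + text)
        else:
            lines.append(text)
    return lines


spans = [
    ("User Guide", 24.0),
    ("Installation", 18.0),
    ("Run the installer and follow the prompts.", 11.0),
    ("Configuration", 18.0),
    ("Edit the settings file.", 11.0),
    ("Restart the service afterwards.", 11.0),
]
print("\n".join(infer_headings(spans)))
```

The sketch also shows why inconsistent styling hurts: a single oversized caption would be promoted to a heading just like a real section title.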

Table Reconstruction

Tables in PDF are the single hardest element to convert. PDF draws tables as lines and positioned text — there is no <table> equivalent. Extracting structured table data requires detecting row and column boundaries from line positions and cell content alignment.

Simple tables with visible gridlines convert reasonably well. Complex tables with merged cells, nested headers, or no visible borders are where most tools struggle.
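
The idea of recovering rows and columns from positions can be sketched as follows. This is a simplified illustration, assuming each cell arrives as one `(x, y, text)` chunk with `x` as the cell's left edge and `y` growing downward; the names `cluster` and `cells_to_markdown_table` and the tolerance values are hypothetical. It covers only the easy case of a regular grid with a single header row.

```python
from typing import List, Tuple

Cell = Tuple[float, float, str]  # (x, y, text): left edge, vertical position, content


def cluster(values: List[float], tol: float) -> List[float]:
    """Collapse values that lie within tol of a group's first value."""
    centers: List[float] = []
    for v in sorted(values):
        if not centers or v - centers[-1] > tol:
            centers.append(v)
    return centers


def nearest(centers: List[float], v: float) -> int:
    """Index of the center closest to v."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - v))


def cells_to_markdown_table(cells: List[Cell], x_tol: float = 10.0, y_tol: float = 3.0) -> str:
    """Rebuild a Markdown table, treating the topmost row as the header."""
    if not cells:
        return ""
    cols = cluster([x for x, _, _ in cells], x_tol)
    rows = cluster([y for _, y, _ in cells], y_tol)
    grid = [["" for _ in cols] for _ in rows]
    for x, y, text in sorted(cells, key=lambda c: (c[1], c[0])):
        r, c = nearest(rows, y), nearest(cols, x)
        grid[r][c] = (grid[r][c] + " " + text).strip()
    lines = ["| " + " | ".join(row) + " |" for row in grid]
    lines.insert(1, "| " + " | ".join("---" for _ in cols) + " |")
    return "\n".join(lines)


cells = [
    (50, 100, "Tool"), (200, 100, "OCR"),
    (50, 120, "Pandoc"), (200, 120, "No"),
    (50, 140, "Marker"), (200, 140, "Yes"),
]
print(cells_to_markdown_table(cells))
```

Merged cells and borderless tables break the assumption that every row shares the same column edges, which is exactly where this kind of clustering fails.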

Image Handling

Embedded images in a PDF need to be extracted as separate files and referenced in the Markdown output. The tool must identify image boundaries, extract the raster data, save it to a file (PNG or JPEG), and insert the appropriate Markdown image syntax ![alt](path).

Multi-Column Layouts

Magazine-style two-column or three-column layouts confuse most extractors. The tool may interleave text from different columns, producing garbled output. Documents with sidebars, pull quotes, or floating text boxes have the same problem.
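
The interleaving problem comes from sorting text purely top to bottom. A minimal sketch of a column-aware ordering, assuming equal-width columns, blocks given as `(x, y, text)` with `y` growing downward, and a known page width (all names here are illustrative):

```python
from typing import List, Tuple

Block = Tuple[float, float, str]  # (x, y, text): left edge, top edge, content


def reading_order(blocks: List[Block], page_width: float, columns: int = 2) -> List[str]:
    """Return block texts column by column, top to bottom within each column."""
    column_width = page_width / columns

    def sort_key(block: Block) -> Tuple[int, float]:
        x, y, _ = block
        column = min(int(x // column_width), columns - 1)
        return (column, y)

    return [text for _, _, text in sorted(blocks, key=sort_key)]


blocks = [
    (40, 100, "A1"), (320, 100, "B1"),
    (40, 200, "A2"), (320, 200, "B2"),
]
print(reading_order(blocks, page_width=600))  # ['A1', 'A2', 'B1', 'B2']
```

A naive sort by `y` alone would yield A1, B1, A2, B2, which is the garbled output described above. Sidebars and pull quotes defeat the equal-width assumption, so layout-model tools detect regions instead of using fixed boundaries.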

Mathematical Notation

Academic papers and technical documents often contain mathematical equations, usually typeset with LaTeX. In the PDF they survive only as positioned glyphs, so a converter has to reconstruct each expression and output either LaTeX math syntax ($E = mc^2$) or a plain-text approximation. Few tools handle this well out of the box.

Tool Comparison

Pandoc

Pandoc is the Swiss Army knife of document conversion. It reads dozens of formats and writes dozens more, but PDF is an output format only: Pandoc has no PDF reader, and running pandoc input.pdf -o output.md exits with an error. The usual workaround is to extract the content with a Poppler utility (pdftohtml or pdftotext) and hand the result to Pandoc. That pipeline extracts text without much structural analysis.

```
# Pandoc cannot read PDF, so extract HTML with Poppler first
pdftohtml -noframes input.pdf extracted.html

# Convert the extracted HTML to Markdown
pandoc -f html -t markdown --wrap=none extracted.html -o output.md

# Text-only alternative: skip images and pipe straight into Pandoc
pdftohtml -i -noframes -stdout input.pdf | pandoc -f html -t markdown -o output.md
```

Strengths: Widely available, fast on text-layer PDFs, and the intermediate HTML can be piped into any other Pandoc output format.

Weaknesses: No direct PDF input and no OCR support. Limited table detection. Headings are often missed because the extracted HTML carries positioned text rather than heading tags. Multi-column layouts produce garbled text.

Marker

Marker is a newer open-source tool specifically designed for high-quality PDF-to-Markdown conversion. It uses a combination of deep learning models for layout detection, OCR, and table recognition.

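```
# Install Marker
pip install marker-pdf

# Convert a single PDF (argument syntax has changed between releases; check marker_single --help)
marker_single input.pdf --output_dir output_dir

# Batch convert a directory (input folder comes first)
marker input_dir --output_dir output_dir --workers 4
```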

Strengths: Excellent table detection, heading recognition, and multi-column handling. Works on both text-layer and scanned PDFs. Handles math notation. Open source and actively maintained.

Weaknesses: Requires Python and significant dependencies (PyTorch). First run downloads several GB of models. Slower than text-extraction-only tools. GPU recommended for reasonable performance.

Mathpix

Mathpix is a commercial API service that excels at converting documents with mathematical notation, scientific figures, and complex tables. It uses proprietary OCR models trained on academic content.

```
# Using the Mathpix CLI
mpx convert input.pdf -o output.md

# Via API (curl); the request is asynchronous and returns a pdf_id to poll for the result
curl -X POST https://api.mathpix.com/v3/pdf \
  -H "app_id: YOUR_APP_ID" \
  -H "app_key: YOUR_APP_KEY" \
  -F "file=@input.pdf" \
  -F "options_json={\"conversion_formats\": {\"md\": true}}"
```

Strengths: Best-in-class math and equation handling. Excellent table recognition. Fast cloud processing. Handles complex academic layouts.

Weaknesses: Paid service (free tier limited to 100 pages/month). Requires sending documents to a third-party cloud. Not open source.

pdf2md / pdftomd

Several lightweight tools under variations of this name exist in the npm and Python ecosystems. They typically use pdfjs-dist (Mozilla's PDF.js) or pdfminer to extract text and apply heuristics for structure detection. Package names and command-line flags differ between variants, so treat the commands below as typical invocations and check the README of the one you install.

```
# Node.js variant
npx pdf2md input.pdf > output.md

# Python variant (pdfminer-based)
pip install pdftomd
pdftomd input.pdf output.md
```

Strengths: Lightweight, no GPU required, fast for text-layer PDFs. Good enough for simple, single-column documents.

Weaknesses: Poor table handling. Minimal heading detection. No OCR. Breaks down on complex layouts.

Docling (IBM)

Docling is IBM's open-source document understanding library. It uses layout analysis models to segment pages into regions (text, tables, figures, headers) and then converts each region appropriately.

```
pip install docling
```

```
# Python usage
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("input.pdf")
print(result.document.export_to_markdown())
```

Strengths: Strong layout analysis. Good table extraction. Handles multi-column layouts. Active development from IBM Research.

Weaknesses: Python-only. Requires model downloads. Newer project with a smaller community than Pandoc or Marker.

Feature Comparison Table

Feature	Pandoc	Marker	Mathpix	pdf2md	Docling
Text-layer PDFs	Good	Great	Great	Good	Great
Scanned PDFs (OCR)	No	Yes	Yes	No	Yes
Heading detection	Poor	Great	Great	Fair	Good
Table extraction	Poor	Great	Great	Poor	Great
Image extraction	Good	Good	Good	No	Good
Math notation	Poor	Good	Best	No	Fair
Multi-column	Poor	Good	Good	Poor	Good
Speed	Fast	Moderate	Fast	Fast	Moderate
Offline	Yes	Yes	No	Yes	Yes
Free / Open Source	Yes	Yes	No	Yes	Yes
GPU recommended	No	Yes	N/A	No	Yes

Practical Workflow: PDF to Markdown for a Static Site

Here is a step-by-step workflow for converting a collection of PDF documents into Markdown files suitable for a static site generator like Hugo, Astro, or Docusaurus.

Step 1: Audit Your PDFs

Before batch-converting, check what you are dealing with:

```
# Check if PDFs have a text layer (using pdftotext from Poppler)
for f in *.pdf; do
  chars=$(pdftotext "$f" - | wc -c)
  echo "$f: $chars characters"
done
```

Files with zero or very few characters are scanned/image-only and need OCR.

Step 2: Convert with Marker (Recommended)

For a mixed collection of digital and scanned PDFs, Marker handles both:

```
# Convert all PDFs in a directory (input folder comes first)
marker pdf_collection/ --output_dir output_markdown --workers 4

# Typical output structure (exact layout varies by Marker version):
# output_markdown/
#   document1/
#     document1.md
#     images/
#       figure1.png
#       figure2.png
```

Step 3: Add Frontmatter

Static site generators need YAML frontmatter. Add it with a script:

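```
#!/bin/bash
for f in output_markdown/*/*.md; do
  filename=$(basename "$f" .md)
  # Use the first heading as the title; fall back to the filename
  title=$(grep -m1 '^#' "$f" | sed -E 's/^#+[[:space:]]*//')
  title=${title:-$filename}
  # Escape double quotes so the YAML stays valid
  title=${title//\"/\\\"}

  # Prepend frontmatter
  tmpfile=$(mktemp)
  cat > "$tmpfile" << EOF
---
title: "$title"
date: $(date +%Y-%m-%d)
source: "$filename.pdf"
---

EOF
  cat "$f" >> "$tmpfile"
  mv "$tmpfile" "$f"
done
```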
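
If the frontmatter step needs to run inside a larger Python pipeline, the same logic can be sketched as a pure function. This is an illustrative version, not part of any tool above: the name `add_frontmatter` is hypothetical, the title comes from the first line starting with `#`, and `json.dumps` is used for quoting on the assumption that a JSON string is also a valid YAML double-quoted string.

```python
import json
from datetime import date


def add_frontmatter(markdown: str, source_pdf: str, today: date) -> str:
    """Prepend YAML frontmatter, taking the title from the first heading."""
    title = source_pdf.rsplit(".", 1)[0]  # fallback when no heading is found
    for line in markdown.splitlines():
        if line.startswith("#"):
            title = line.lstrip("#").strip()
            break
    # json.dumps produces a double-quoted, escaped string that YAML accepts
    frontmatter = (
        "---\n"
        f"title: {json.dumps(title, ensure_ascii=False)}\n"
        f"date: {today.isoformat()}\n"
        f"source: {json.dumps(source_pdf, ensure_ascii=False)}\n"
        "---\n\n"
    )
    return frontmatter + markdown


doc = '# The "Big" Report\n\nBody text.\n'
print(add_frontmatter(doc, "report.pdf", date(2024, 1, 15)))
```

Passing the date in as an argument keeps the function deterministic, which makes it easy to test and to rerun without rewriting every file's date.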

Step 4: Fix Image Paths

Adjust image references to match your static site's asset directory:

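```
# Copy images into one folder per document to avoid filename collisions
for d in output_markdown/*/; do
  name=$(basename "$d")
  mkdir -p "static/images/docs/$name"
  cp "$d"images/* "static/images/docs/$name/"
  # Rewrite only Markdown image targets, not every "images/" string (GNU sed)
  sed -i "s|](images/|](/images/docs/$name/|g" "$d"*.md
done
```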
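
A blanket search-and-replace on `images/` also rewrites prose and URLs that happen to contain that string. The sketch below restricts the rewrite to Markdown image syntax. The function name `rewrite_image_paths` and the default prefix are assumptions chosen to match the example paths in this step.

```python
import re

# Matches the opening of an image link whose target starts with images/
IMAGE_LINK = re.compile(r"(!\[[^\]]*\]\()images/")


def rewrite_image_paths(markdown: str, new_prefix: str = "/images/docs/") -> str:
    """Point relative images/ links at new_prefix, leaving other text alone."""
    return IMAGE_LINK.sub(lambda m: m.group(1) + new_prefix, markdown)


text = "See ![Figure 1](images/figure1.png) and the images/ folder."
print(rewrite_image_paths(text))
# See ![Figure 1](/images/docs/figure1.png) and the images/ folder.
```

Because already-rewritten links start with `/images/docs/` rather than `images/`, running the function twice leaves the text unchanged.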

Step 5: Review and Clean Up

Automated conversion is never perfect. Review the output for:

Incorrect heading levels (promote or demote as needed)
Broken tables (may need manual reformatting)
Missing image alt text
Garbled text from multi-column sections
Leftover page numbers, headers, or footers
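
The first item on this list can be partly automated. The following is a minimal sketch that removes jumps in heading depth, for example an H1 followed directly by an H3. The function name `normalize_headings` is hypothetical, and the approach (depth equals the number of currently open ancestor headings) is one possible heuristic, so the result still deserves a manual look.

```python
import re
from typing import List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def normalize_headings(lines: List[str]) -> List[str]:
    """Renumber headings so depth never jumps by more than one level."""
    result: List[str] = []
    open_levels: List[int] = []  # original levels of the currently open headings
    in_fence = False
    for line in lines:
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # do not touch '#' comments inside code blocks
        match = None if in_fence else HEADING.match(line)
        if not match:
            result.append(line)
            continue
        level = len(match.group(1))
        # Close headings that are at the same depth or deeper than this one
        while open_levels and open_levels[-1] >= level:
            open_levels.pop()
        open_levels.append(level)
        result.append("#" * len(open_levels) + " " + match.group(2))
    return result


doc = ["# Guide", "### Install", "Run the installer.", "#### Options", "### Usage"]
print("\n".join(normalize_headings(doc)))
```

Here `### Install` becomes `## Install`, `#### Options` becomes `### Options`, and `### Usage` returns to `## Usage`.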

Handling Tables in PDF-to-Markdown

Tables deserve special attention because they are the most common failure point.

When Tables Convert Well

Single header row
Visible gridlines or borders
No merged cells
Consistent column alignment
No nested tables

When Tables Break

Merged header cells spanning multiple columns
Cells containing line breaks or paragraphs
Tables split across pages
Borderless tables relying on whitespace alignment
Tables wider than the page (rotated or scaled)
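
Several of these failure modes leave rows with missing or extra cells, which is easy to detect mechanically. Below is a small sketch that reports rows whose cell count differs from the header row. The names `split_row` and `find_ragged_rows` are illustrative, and the splitting is deliberately naive: it does not handle escaped pipes inside cells.

```python
from typing import List, Tuple


def split_row(line: str) -> List[str]:
    """Split a Markdown table row into cells, ignoring the outer pipes."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def find_ragged_rows(table: str) -> List[Tuple[int, int]]:
    """Return (row_number, cell_count) for rows whose width differs from the header."""
    rows = [line for line in table.splitlines() if line.strip()]
    if not rows:
        return []
    expected = len(split_row(rows[0]))
    return [
        (number, len(split_row(line)))
        for number, line in enumerate(rows, start=1)
        if len(split_row(line)) != expected
    ]


table = """| Tool | OCR | Tables |
| --- | --- | --- |
| Pandoc | No | Poor |
| Marker | Yes |
"""
print(find_ragged_rows(table))  # [(4, 2)]
```

A check like this catches structural damage only; a table can have consistent widths and still hold values in the wrong columns, so a visual comparison against the PDF remains necessary.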

Manual Table Fix Strategy

When automated extraction fails, the fastest approach is often:

Open the PDF and take a screenshot of the table
Use an AI-assisted tool to convert the screenshot to Markdown
Paste the result and verify

For large documents with many tables, consider extracting the PDF pages as images first and then using a table-specific tool:

```
# Extract page 5 as a high-res PNG; -singlefile writes exactly page-5.png
pdftoppm -png -r 300 -f 5 -l 5 -singlefile input.pdf page-5

# Then feed page-5.png to a table extraction tool
```

If your documents are already in PDF format and you need to convert them for editing in other applications, our PDF Converter handles conversions to Word, Excel, PowerPoint, and other editable formats while preserving layout.

Tips for Better Conversion Results

Pre-Processing

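Rewrite the PDF with qpdf before conversion. Linearization itself mainly speeds up loading over a network, but the rewrite normalizes the file structure, which can help tools that stumble on malformed PDFs: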
```
qpdf --linearize input.pdf linearized.pdf
```
Remove password protection if the PDF is encrypted. You cannot extract text from an encrypted PDF without first decrypting it. Use our Unlock PDF tool for quick decryption.
Split large PDFs into chapters or sections before converting. Most tools produce better results on shorter documents because layout context is easier to maintain:
```
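# Split a PDF into individual pages
qpdf --split-pages input.pdf page-%d.pdf
# Or extract one chapter as a page range (pages 1-20 here)
qpdf input.pdf --pages . 1-20 -- chapter1.pdf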
```

Post-Processing

Normalize heading levels. Many converters produce inconsistent headings. Use a linter like markdownlint to catch issues:
```
npx markdownlint-cli output.md
```
Validate Markdown tables render correctly by previewing in VS Code or a Markdown viewer before committing.
Strip page artifacts. Headers, footers, and page numbers from the PDF often appear inline in the Markdown. Search for repeating patterns and remove them.
Check link integrity. PDF hyperlinks may or may not survive conversion. Verify that any URLs in the output are correct and not truncated.
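
The "search for repeating patterns" step can be sketched in a few lines. This example assumes the converted text is still available page by page, which not every tool provides; the name `strip_page_artifacts` and the 60% threshold are illustrative choices. It removes bare page numbers and any line that repeats on most pages.

```python
import re
from collections import Counter
from typing import List

# Matches lines such as "7", "Page 7", or "page 7 of 12"
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE)


def strip_page_artifacts(pages: List[str], min_share: float = 0.6) -> List[str]:
    """Drop bare page numbers and lines repeated on most pages."""
    counts: Counter = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, int(len(pages) * min_share))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [
            line for line in page.splitlines()
            if line.strip() not in repeated and not PAGE_NUMBER.match(line)
        ]
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "ACME Annual Report\nRevenue grew in Q1.\n1",
    "ACME Annual Report\nCosts fell in Q2.\n2",
    "ACME Annual Report\nOutlook is stable.\nPage 3 of 3",
]
print(strip_page_artifacts(pages))
```

The page-number pattern also matches a line that contains only a number, such as a lone table value, so it is worth diffing the result against the input before committing.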

Converting the Other Direction

Sometimes you need to go the other way — from Markdown to PDF. Pandoc excels at this:

```
# Markdown to PDF via LaTeX
pandoc input.md -o output.pdf

# Markdown to PDF via HTML (no LaTeX needed)
pandoc input.md -o output.pdf --pdf-engine=weasyprint
```

For document workflows that involve multiple format conversions, our Document Converter supports a wide range of input and output formats, including PDF, DOCX, and more.

Choosing the Right Tool for Your Use Case

Use Case	Recommended Tool	Why
Simple text-layer PDFs	Pandoc (with Poppler)	Fast, lightweight dependencies, good enough for clean docs
Academic papers with math	Mathpix	Best equation handling, worth the cost for accuracy
Scanned documents	Marker	Free, handles OCR, good table detection
Large batch conversion	Marker or Docling	Both support batch processing with parallelism
Quick one-off conversion	pdf2md	Lightweight, instant, no setup
Enterprise document pipeline	Docling	IBM-backed, strong layout analysis, API-friendly

Frequently Asked Questions

Can I convert a scanned PDF to Markdown without OCR?

No. Scanned PDFs contain only images. Without OCR, there is no text to extract. Tools like Marker and Mathpix include built-in OCR. For Pandoc, you would need to run Tesseract OCR separately first and then convert.

How accurate is PDF-to-Markdown conversion?

For clean, single-column, text-layer PDFs, accuracy is typically 95-99%. For scanned documents, it ranges from 85% to 99% depending on scan quality. Tables and multi-column layouts are the main sources of errors regardless of PDF type.
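
Figures like these are easier to interpret when you measure them on your own documents. One rough way, sketched below, is to compare the converted text against a hand-checked reference using Python's standard `difflib`. The function name `character_accuracy` is hypothetical, and the similarity ratio is only a proxy: it ignores structure such as headings and tables.

```python
from difflib import SequenceMatcher


def character_accuracy(reference: str, converted: str) -> float:
    """Similarity ratio between 0 and 1 after collapsing whitespace."""
    ref = " ".join(reference.split())
    out = " ".join(converted.split())
    return SequenceMatcher(None, ref, out, autojunk=False).ratio()


reference = "Total: 100 units"
print(character_accuracy(reference, "Total:  100 units"))  # whitespace-only difference
print(character_accuracy(reference, "Total: l00 units"))   # typical OCR confusion of 1 and l
```

Checking a handful of representative pages this way gives a more useful number than any general estimate, especially for scanned material.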

Does the conversion preserve hyperlinks?

It depends on the tool. Pandoc-based pipelines and Marker both attempt to preserve hyperlinks as Markdown link syntax. Image-only (scanned) PDFs usually have no hyperlinks to preserve: the links exist only as visible text, not as actual link objects.

What about PDF forms?

PDF forms (fillable fields) are not preserved in Markdown. The filled-in values may be extracted as text, but the form structure (checkboxes, dropdowns, text fields) does not have a Markdown equivalent. For form data extraction, you may want to convert to a structured format first. Our PDF to Excel conversion can help with tabular form data.

Can I automate this in a CI/CD pipeline?

Yes. All the command-line tools listed here (Pandoc, Marker, Docling) can be integrated into CI/CD pipelines. A common pattern is to store PDFs in a repository, convert them to Markdown on push, and deploy the Markdown to a static site.

```
# Example GitHub Actions step
- name: Convert PDFs to Markdown
  run: |
    pip install marker-pdf
    marker docs/pdfs/ --output_dir output_markdown --workers 2
```

Conclusion

Converting PDF to Markdown is not a one-click operation for complex documents, but the tooling has improved dramatically. For most use cases, Marker offers the best balance of quality and accessibility: it handles OCR, tables, headings, and images in a single open-source package. For documents heavy on mathematical notation, Mathpix remains the gold standard. And for simple text extraction from clean PDFs, Pandoc paired with a Poppler extractor is fast and reliable.

The key is matching your tool to your document type. Audit your PDFs first, pick the right converter, and always plan for a post-processing review pass. The initial conversion gets you 80-95% of the way there; a focused cleanup pass handles the rest.

For other document conversion needs, explore the full Document Converter on ConvertIntoMP4, which supports PDF, Word, Excel, PowerPoint, and dozens of other formats through an intuitive browser-based interface.