MHT/MHTML Archived Web Pages: Conversion to PDF, Single HTML, and Modern Archives

What MHT Is For

MHT (also called MHTML, short for "MIME HTML") packages an entire web page, with its images, CSS, and other resources, into a single .mht or .mhtml file. The format is described in RFC 2557, and Internet Explorer offered it as a "Save Page As" format for many years.

In 2026, MHT is mostly historical:

Internet Explorer is retired (Microsoft ended support in 2022)
Firefox and Safari have no built-in MHT support; Chromium-based browsers (Chrome, Edge) can still open and save it
Modern web archiving uses WARC (Web Archive)

But organizations still have MHT archives from 2005-2015. This post covers conversion to modern formats.

For broader document conversion, see our document converter.

What's in an MHT File

An MHT file is a MIME multipart message:

MIME-Version: 1.0
Content-Type: multipart/related; boundary="----=_NextPart_000_0001"

------=_NextPart_000_0001
Content-Type: text/html; charset=utf-8
Content-Location: index.html

<html>...</html>

------=_NextPart_000_0001
Content-Type: image/jpeg
Content-Location: image.jpg
Content-Transfer-Encoding: base64

[base64 image data]

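The HTML and resources are embedded in one file. Text parts are often quoted-printable encoded and binary parts base64 encoded. This is useful for preservation, but the encoding inflates file size.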
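
To see this structure in practice, the parts can be listed with Python's standard email package. The following is a minimal sketch: the sample bytes are a made-up two-part archive, and `list_parts` is an illustrative helper name rather than part of any library.

```python
import email
from email import policy

# A tiny, made-up MHT document used only for illustration.
SAMPLE_MHT = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="BOUNDARY"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Location: index.html\r\n"
    b"\r\n"
    b'<html><body><img src="image.jpg"></body></html>\r\n'
    b"--BOUNDARY\r\n"
    b"Content-Type: image/jpeg\r\n"
    b"Content-Location: image.jpg\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"/9j/4AAQ\r\n"
    b"--BOUNDARY--\r\n"
)


def list_parts(mht_bytes):
    """Return (content_type, location, transfer_encoding, size) per leaf part."""
    msg = email.message_from_bytes(mht_bytes, policy=policy.default)
    rows = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # the multipart container has no payload of its own
        payload = part.get_payload(decode=True) or b""
        rows.append((
            part.get_content_type(),
            str(part.get("Content-Location", "")),
            str(part.get("Content-Transfer-Encoding", "7bit")),
            len(payload),  # size after base64/quoted-printable decoding
        ))
    return rows


for row in list_parts(SAMPLE_MHT):
    print(row)
```

Running the same function over a real file (read with `open(path, "rb").read()`) gives a quick inventory of what an archive contains before any conversion.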

Modern Browser Support

Browser	MHT support
Internet Explorer	Yes (native)
Microsoft Edge	Yes (Chromium-based Edge opens MHT; IE mode also available)
Chrome	Yes (opens .mhtml; older versions needed the `save-page-as-mhtml` flag to save)
Firefox	No (no built-in support)
Safari	No

For viewing legacy MHT: use a Chromium-based browser such as Edge or Chrome (Edge also offers IE mode for pages that render poorly), or convert to a modern format.

Conversion to Single HTML

To view the page in browsers without MHT support, extract the HTML part:

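# Python script: extract the main HTML part from an MHT file
import email

with open("input.mht", "rb") as f:
    msg = email.message_from_bytes(f.read())

for part in msg.walk():
    if part.get_content_type() == "text/html":
        # decode=True undoes base64 / quoted-printable and returns bytes.
        # Writing the bytes unchanged keeps the page's own declared charset
        # instead of assuming UTF-8.
        with open("output.html", "wb") as out:
            out.write(part.get_payload(decode=True))
        break  # the first text/html part is normally the main page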

The HTML retains its structure, but its references to embedded resources (images, CSS) still point at parts of the archive, so those resources need extracting separately.

For batch processing, see Batch Processing Files Guide.

Conversion to PDF

For archival PDF:

Option 1: Open in a browser, Print to PDF

Open the MHT in Microsoft Edge (use IE mode if the page renders poorly), or in Internet Explorer on older systems
Open the Print dialog (Ctrl+P)
Select "Microsoft Print to PDF"
Save

This preserves visual layout. It is limited to platforms with a browser that opens MHT.

Option 2: Convert via wkhtmltopdf

# Convert HTML extracted from MHT
wkhtmltopdf extracted.html output.pdf

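wkhtmltopdf is a standalone command-line tool that renders HTML to PDF using an older WebKit engine. Output quality is generally good for pages from the MHT era, but the project is no longer actively maintained.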

Option 3: Use online converter

Various online services convert MHT to PDF. Uploading a file shares its contents with a third party, so avoid this route for sensitive content.

For PDF context, see our PDF converter.

Conversion to Single-File HTML

For "self-contained HTML" (similar to MHT but modern):

# Use SingleFile (browser extension or CLI)
single-file https://example.com saved-page.html

# For an existing MHT, extract first then re-save

SingleFile produces an HTML file with all resources embedded as data URLs. Modern equivalent of MHT.
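
For an existing MHT, the "extract first then re-save" step can be approximated offline by replacing each resource reference in the HTML with a data URL. The sketch below uses only the standard library. `inline_resources` is a hypothetical helper, and it assumes the HTML refers to each resource by exactly the string in that part's Content-Location header; `cid:` references, unquoted `url(...)` values in CSS, and references inside embedded stylesheets are not handled.

```python
import base64
import email
from email import policy


def inline_resources(mht_bytes):
    """Return the archive's HTML with embedded resources inlined as data URLs.

    Assumption: the HTML refers to each resource by exactly the string in
    that part's Content-Location header, inside a quoted attribute value.
    """
    msg = email.message_from_bytes(mht_bytes, policy=policy.default)
    html = None
    data_urls = {}
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip the multipart container itself
        payload = part.get_payload(decode=True) or b""
        ctype = part.get_content_type()
        if ctype == "text/html" and html is None:
            # Treat the first HTML part as the main page.
            charset = part.get_content_charset() or "utf-8"
            html = payload.decode(charset, errors="replace")
            continue
        location = part.get("Content-Location")
        if location:
            encoded = base64.b64encode(payload).decode("ascii")
            data_urls[str(location)] = f"data:{ctype};base64,{encoded}"
    if html is None:
        raise ValueError("no text/html part found")
    for location, data_url in data_urls.items():
        # Replace only quoted occurrences so visible page text is untouched.
        html = html.replace(f'"{location}"', f'"{data_url}"')
        html = html.replace(f"'{location}'", f"'{data_url}'")
    return html
```

The returned string can be written to a .html file. Because the text is re-encoded on write, the page's charset declaration may need updating to match the encoding used when saving.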

Conversion to WARC

WARC (Web Archive) is the international standard for web preservation:

Format: ISO 28500
Tools: wget, Heritrix, Browsertrix
Used by: Internet Archive, national libraries
Advantage: industry standard, broad tool support

For batch MHT to WARC conversion: extract the HTML and resources, then re-package them as WARC records, either by hand or with a script.
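
The re-packaging step might look like the sketch below, which writes WARC/1.0 "resource" records (payload only, without HTTP headers, which an MHT does not preserve anyway). This is a hand-written serializer for illustration under several assumptions: the function names are hypothetical, target URIs are assumed to be absolute URLs, and compatibility with any particular replay tool has not been verified. For production use, a maintained WARC library would be a safer choice.

```python
import uuid
from datetime import datetime, timezone


def warc_resource_record(target_uri, content_type, payload, when=None):
    """Serialize one WARC/1.0 'resource' record as bytes.

    `when` should be a timezone-aware datetime, ideally the original
    capture time; it defaults to the current time.
    """
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers)
    # Record layout: header lines, blank line, payload, then two CRLFs.
    return head.encode("utf-8") + b"\r\n" + payload + b"\r\n\r\n"


def write_warc(path, parts, when=None):
    """Write (target_uri, content_type, payload) tuples into one WARC file."""
    with open(path, "wb") as f:
        for target_uri, content_type, payload in parts:
            f.write(warc_resource_record(target_uri, content_type, payload, when))
```

Each extracted MHT part maps to one tuple: its Content-Location resolved to an absolute URL, its content type, and its decoded bytes.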

Conversion Pipeline

A typical legacy MHT conversion workflow:

Extract MHT contents: Python email module or specialized tools
Parse HTML: BeautifulSoup or similar
Resolve resource references: extract images, CSS, scripts
Reassemble: as single HTML, PDF, or WARC

Code example:

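import email
import os
from email import policy
from urllib.parse import urlsplit

# Parse MHT
with open("input.mht", "rb") as f:
    msg = email.message_from_bytes(f.read(), policy=policy.default)

# Extract resources, keyed by Content-Location (Content-ID as a fallback)
resources = {}
html_content = None

for part in msg.walk():
    if part.is_multipart():
        continue  # container parts have no payload of their own
    content_id = part.get("Content-Location") or part.get("Content-ID")
    if part.get_content_type() == "text/html" and html_content is None:
        # Decode with the charset the part declares rather than assuming UTF-8
        charset = part.get_content_charset() or "utf-8"
        html_content = part.get_payload(decode=True).decode(charset, errors="replace")
    elif content_id:
        resources[str(content_id)] = part.get_payload(decode=True)

# Save resources to disk
os.makedirs("resources", exist_ok=True)
for cid, data in resources.items():
    # Use the last path segment, without query string or <> wrappers, as the name.
    # Note: two resources with the same base name will overwrite each other.
    filename = os.path.basename(urlsplit(cid.strip("<>")).path) or "resource"
    with open(os.path.join("resources", filename), "wb") as f:
        f.write(data)

# Save HTML (with re-mapped resource paths)
# ... process html_content to reference local files ...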
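
The re-mapping step left open at the end of the example could be sketched as follows. Both function names are hypothetical, and the approach is a plain string replacement on quoted attribute values: it assumes the HTML refers to each resource by exactly its Content-Location string, and it does not resolve `cid:` references or relative paths.

```python
import os
import re
from urllib.parse import urlsplit


def local_name(location, used):
    """Derive a unique, filesystem-safe file name from a Content-Location."""
    base = os.path.basename(urlsplit(location).path) or "resource"
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base)
    stem, ext = os.path.splitext(base)
    name, n = base, 1
    while name in used:
        # Disambiguate clashes such as two different a.png files.
        name = f"{stem}_{n}{ext}"
        n += 1
    used.add(name)
    return name


def remap_html(html, locations, folder="resources"):
    """Point quoted references in html at files under folder.

    Returns the rewritten HTML and a {location: relative_path} mapping.
    """
    used, mapping = set(), {}
    for location in locations:
        path = f"{folder}/{local_name(location, used)}"
        mapping[location] = path
        for quote in ('"', "'"):
            html = html.replace(f"{quote}{location}{quote}", f"{quote}{path}{quote}")
    return html, mapping
```

Calling `remap_html(html_content, resources)` returns the rewritten page plus the mapping. Saving each resource under the path given in the mapping keeps the file names on disk and in the HTML in agreement, and avoids overwriting when two resources share a base name.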

Common Issues

Images not displaying: resource references not preserved. Re-map paths in HTML to local files.

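Encoding issues: MHT parts use MIME transfer encodings (base64 or quoted-printable). Decode them explicitly, then decode text using the charset each part declares rather than assuming UTF-8.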

Layout broken: CSS not extracted or modified. Verify CSS files are present.

Forms don't work: JavaScript may have been altered. MHT preserves at point-of-save; live functionality (forms, dynamic content) is frozen.

Large file size: base64 encoding makes each binary resource about one third larger. Single HTML with data URLs is similar in size, because data URLs also use base64.
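
The size overhead follows directly from how base64 works: every 3 input bytes become 4 output characters. A small sketch of the arithmetic, where `base64_size` is an illustrative helper and the 76-character line length is the usual MIME convention:

```python
import base64
import math


def base64_size(raw_size, line_length=76):
    """Estimate the encoded size in bytes for raw_size bytes of binary data.

    Base64 emits 4 characters per 3 input bytes (with padding). MIME bodies
    also wrap lines, conventionally at 76 characters, adding CRLF per line.
    Pass line_length=0 for unwrapped output such as a data URL.
    """
    chars = 4 * math.ceil(raw_size / 3)
    lines = math.ceil(chars / line_length) if line_length else 0
    return chars + 2 * lines


raw = 1_000_000  # a hypothetical 1 MB image
print(base64_size(raw, line_length=0))     # data-URL style, no line breaks
print(base64_size(raw))                    # MIME style with CRLF line breaks
print(len(base64.b64encode(bytes(raw))))   # actual encoding of 1 MB of zero bytes
```

The unwrapped estimate matches the real encoder, and the MIME line breaks add a few more percent on top of the one-third increase.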

When to Just Re-Capture

For some workflows, re-capturing the page is easier than converting old MHT:

# Save page as PDF
wkhtmltopdf https://example.com page.pdf

# Save page as single HTML (with all resources embedded)
single-file https://example.com page.html

# Capture as WARC
wget --warc-file=archive --recursive --level=1 https://example.com

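For sites that are still online, a re-capture gives a fresh copy, but it shows the page as it is today rather than as it was when archived. For sites that are gone, the MHT may be the only record.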

Web Archive Standards

Format	Year	Status
MHT/MHTML	1999	Legacy, IE-era
WARC	2009	Current standard
Wayback HTML	n/a	Internet Archive's format
HAR (HTTP Archive)	2012	Network-level capture
SingleFile HTML	2018	Modern alternative

For new web archiving: WARC (institutional) or SingleFile (personal). For converting old MHT: extract to modern format.

For archival considerations, see FFV1 Archival Codec (video equivalent).

Privacy and Copyright

MHT files preserve a snapshot of a website at point-of-save:

Personal data may be included (logged-in views, user content)
Site terms of service may restrict redistribution
Copyright applies to the captured content

For organizational archives: review what's in MHT files before sharing or processing.

Other Common Issues

Large size doesn't compress much: the embedded images (JPEG, PNG) are already compressed. General-purpose compression such as ZIP or gzip can recover most of the base64 overhead, but it cannot shrink the image data itself.

Cannot open password-protected MHT: rare but possible (some MHT writers added password protection). The MHTML format itself defines no such feature, so there is no standard way to recover the content.

Different MHT readers render the same file differently. Check important pages in more than one tool.

For broader file format conversions, see Searchable PDF With OCR.

Frequently Asked Questions

Should I convert my MHT archive?

For active research: yes, modern formats (WARC, single HTML) are easier to use. For passive archive: keep MHT until needed.

Can I read MHT on a Mac?

Safari does not open MHT, but Chromium-based browsers such as Chrome generally do, and some viewer apps support it. Otherwise, convert to PDF or HTML for native viewing.

Is MHT a security risk?

Old MHT files can contain stale or malicious JavaScript. For untrusted sources: convert to PDF or static HTML to neutralize scripts.
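
As a rough illustration of the "static HTML" route, the sketch below re-emits extracted HTML while dropping `<script>` elements, inline `on*` event handlers, and `javascript:` URLs, using only Python's standard library. `ScriptStripper` and `strip_scripts` are hypothetical names, and this is not a complete sanitizer: it does not deal with iframes, embedded objects, meta refreshes, or script-like CSS. For genuinely untrusted files, printing to PDF remains the more conservative choice.

```python
from html import escape
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML while dropping <script> elements and on* attributes."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; intact.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a <script> element

    def _emit_tag(self, tag, attrs, closing=""):
        kept = []
        for name, value in attrs:
            if name.startswith("on"):
                continue  # inline event handlers such as onclick
            if value is not None and value.strip().lower().startswith("javascript:"):
                continue  # javascript: URLs in href/src
            kept.append(name if value is None else f'{name}="{escape(value)}"')
        self.out.append("<" + " ".join([tag] + kept) + closing + ">")

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.skip_depth += 1
        elif not self.skip_depth:
            self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        if tag != "script" and not self.skip_depth:
            self._emit_tag(tag, attrs, closing=" /")

    def handle_endtag(self, tag):
        if tag == "script":
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")  # keep the doctype; comments are dropped


def strip_scripts(html):
    """Return html with scripts and inline handlers removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The output is rebuilt from parsed tags, so attribute quoting and whitespace inside tags may differ from the input even where nothing was removed.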

What about Edge's legacy mode?

Microsoft Edge can open MHT files, and its IE mode renders them with the old Internet Explorer engine. For Windows users, this is the easiest viewing path.

How big are MHT files?

A typical news article: 200-500 KB. A page with many images: 2-10 MB. Larger pages with video: 50+ MB.

Can I extract just the text?

Use Python's email module to extract HTML, then BeautifulSoup or html2text to get plain text:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")
text = soup.get_text()

Bottom Line

For MHT/MHTML in 2026: convert to PDF (for viewing), WARC (for institutional archive), or SingleFile HTML (for personal modern equivalent). Open MHT in Edge, using IE mode if needed, for occasional viewing. For new web archiving: WARC or SingleFile, not MHT. Our document converter handles related document conversions.
