Task 396: .MHT File Format

Find the file format specifications for the .MHT file format.

The .MHT (MIME HTML) file format, also known as MHTML, is defined in RFC 2557 as a way to encapsulate aggregate HTML documents and their resources (such as images, stylesheets, and scripts) into a single MIME-structured file. It uses MIME multipart/related encoding to bundle the root HTML and subsidiary content, allowing web pages to be archived or transmitted as a self-contained unit.

  1. Make a list of all the properties of this file format intrinsic to its file system.

Based on the RFC 2557 specification and common implementations in .MHT files, the key intrinsic properties (headers and structural elements) are:

  • From: The sender or saver identifier (e.g., a placeholder such as "<Saved by Blink>", or a user name or email address).
  • Date: The timestamp of when the file was created or saved.
  • Subject: The title or subject of the archived page.
  • MIME-Version: The MIME protocol version (typically "1.0").
  • Content-Type (top-level): The overall MIME type, usually "multipart/related" with parameters like type (e.g., "text/html"), boundary (the delimiter string), and optionally start (a Content-ID reference).
  • Boundary: The boundary string extracted from the Content-Type header, used to separate multipart sections.
  • Content-Base: An optional base URI for resolving relative links. It was defined in the earlier RFC 2110 and removed in RFC 2557, but some files still carry it.
  • Snapshot-Content-Location: An optional browser-specific header indicating the original URL of the snapshot.
  • Number of parts: The count of multipart sections in the file (root + resources).
  • For each part (multipart section):
      • Content-Type: The MIME type of the part (e.g., "text/html", "image/gif").
      • Charset: The character encoding if specified (e.g., "UTF-8", part of Content-Type).
      • Content-Transfer-Encoding: The encoding method (e.g., "quoted-printable", "base64").
      • Content-Location: The URI location of the resource (absolute or relative).
      • Content-ID: A unique identifier for the part (e.g., "unique@domain").

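These properties come from the MIME structure itself. MIME-Version and a top-level Content-Type carrying a boundary parameter are needed for the file to parse as multipart; the other headers are optional, and which ones appear depends on the application that saved the file.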
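
To make the layout concrete, here is a minimal sketch of a two-part archive and of how the properties listed above map onto it. The sample bytes, the example.com URLs, and the `summarize` helper are all made up for illustration (the base64 payload is a placeholder, not a real image); only the standard-library `email` package is used.

```python
import email
import email.policy

# Hypothetical two-part archive with CRLF line endings, as MIME expects.
# The delimiter lines are "--" followed by the boundary string.
SAMPLE = (
    b'Subject: Example page\r\n'
    b'MIME-Version: 1.0\r\n'
    b'Content-Type: multipart/related; type="text/html"; boundary="----sample_boundary"\r\n'
    b'\r\n'
    b'------sample_boundary\r\n'
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'Content-Location: http://example.com/\r\n'
    b'\r\n'
    b'<html><body><img src=3D"logo.gif"></body></html>\r\n'
    b'------sample_boundary\r\n'
    b'Content-Type: image/gif\r\n'
    b'Content-Transfer-Encoding: base64\r\n'
    b'Content-Location: http://example.com/logo.gif\r\n'
    b'\r\n'
    b'R0lGODlhAQABAAAAACw=\r\n'
    b'------sample_boundary--\r\n'
)


def summarize(raw: bytes) -> dict:
    """Collect a few of the listed properties from raw archive bytes."""
    msg = email.message_from_bytes(raw, policy=email.policy.default)
    parts = list(msg.iter_parts()) if msg.is_multipart() else []
    return {
        'boundary': msg.get_boundary(),          # from the top-level Content-Type
        'root_type': msg.get_param('type'),      # the "type" parameter
        'part_types': [p.get_content_type() for p in parts],
        'locations': [str(p['Content-Location']) for p in parts],
    }
```

Calling `summarize(SAMPLE)` returns the boundary string, the declared root type, and one content type and location per part, which is the same information the handlers below print.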

  2. Find two direct download links for files of format .MHT.
  3. Write a Ghost blog embedded HTML/JavaScript snippet that allows a user to drag and drop a file of format .MHT and dumps all these properties to the screen.

Here's an embeddable HTML snippet with JavaScript (suitable for a Ghost blog post or any HTML context) that creates a drag-and-drop area. It reads the .MHT file as text, manually parses the MIME structure (splitting on boundaries and extracting headers), and dumps the properties to the screen.

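[Embedded snippet not preserved in this copy. Rendered output: a panel titled "MHT Properties Dumper" with a drop area labeled "Drag and drop .MHT file here".]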

Note: The parser is basic and assumes standard MIME formatting. It may need adjustments for complex files.
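
The manual approach described above (split on the boundary, then read each section's headers) can be sketched in a few lines. This is an illustrative Python version, not the snippet's own code; `split_headers` and `parse_mht` are hypothetical names, and the sketch shares the same limits: it ignores part bodies and assumes the boundary string never appears at the start of a body line.

```python
import re


def split_headers(block: str) -> dict:
    """Parse a header block into a dict, unfolding continuation lines."""
    headers = {}
    name = None
    for line in block.splitlines():
        if line[:1] in (' ', '\t') and name:
            # Folded continuation of the previous header.
            headers[name] += ' ' + line.strip()
        elif ':' in line:
            # Split at the first colon only, so URLs in values stay intact.
            name, value = line.split(':', 1)
            name = name.strip().lower()
            headers[name] = value.strip()
    return headers


def parse_mht(text: str) -> dict:
    """Return top-level headers, the boundary, and each part's headers."""
    text = text.replace('\r\n', '\n')
    head, _, body = text.partition('\n\n')
    top = split_headers(head)
    match = re.search(r'boundary="?([^";]+)"?', top.get('content-type', ''), re.I)
    boundary = match.group(1).strip() if match else None
    parts = []
    if boundary:
        # Chunk 0 is the preamble; a chunk starting with "--" is the
        # remainder of the closing "--boundary--" delimiter.
        chunks = ('\n' + body).split('\n--' + boundary)
        for chunk in chunks[1:]:
            if chunk.startswith('--'):
                break
            _, _, rest = chunk.partition('\n')   # drop the rest of the delimiter line
            part_head, _, _ = rest.partition('\n\n')
            parts.append(split_headers(part_head))
    return {'headers': top, 'boundary': boundary, 'parts': parts}
```

Header names are lowercased here because MIME header names are case-insensitive; a lookup such as `result['parts'][0].get('content-location')` then works regardless of how the saving application capitalized the header.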

  4. Write a Python class that can open any file of format .MHT and decode, read, write, and print to console all the properties from the above list.

import email
import email.policy
from email.message import EmailMessage
import sys

class MHTHandler:
    def __init__(self, filepath=None):
        self.message = None
        if filepath:
            self.read(filepath)

    def read(self, filepath):
        # Parse from bytes so the MIME parser, rather than a lossy text decode,
        # handles each part's declared charset and transfer encoding.
        with open(filepath, 'rb') as f:
            self.message = email.message_from_binary_file(f, policy=email.policy.default)

    def print_properties(self):
        if not self.message:
            print("No file loaded.")
            return

        print("From:", self.message.get('From', 'N/A'))
        print("Date:", self.message.get('Date', 'N/A'))
        print("Subject:", self.message.get('Subject', 'N/A'))
        print("MIME-Version:", self.message.get('MIME-Version', 'N/A'))
        print("Content-Type (top-level):", self.message.get('Content-Type', 'N/A'))

        boundary = self.message.get_param('boundary', None, 'Content-Type')
        print("Boundary:", boundary or 'N/A')

        print("Content-Base:", self.message.get('Content-Base', 'N/A'))
        print("Snapshot-Content-Location:", self.message.get('Snapshot-Content-Location', 'N/A'))

        parts = [p for p in self.message.iter_parts()] if self.message.is_multipart() else []
        print("Number of parts:", len(parts))

        for idx, part in enumerate(parts, 1):
            print(f"\nPart {idx}:")
            print("  Content-Type:", part.get('Content-Type', 'N/A'))
            print("  Charset:", part.get_param('charset', 'N/A'))
            print("  Content-Transfer-Encoding:", part.get('Content-Transfer-Encoding', 'N/A'))
            print("  Content-Location:", part.get('Content-Location', 'N/A'))
            print("  Content-ID:", part.get('Content-ID', 'N/A'))

    def write(self, filepath, root_html='<html><body>Test</body></html>'):
        msg = EmailMessage()
        msg['From'] = '<Saved by Python>'
        msg['Date'] = 'Tue, 30 Sep 2025 00:00:00 +0000'
        msg['Subject'] = 'Test MHT'
        msg['MIME-Version'] = '1.0'
        msg['Content-Type'] = 'multipart/related; type="text/html"; boundary="----boundary_test"'
        msg['Content-Base'] = 'http://example.com'
        msg['Snapshot-Content-Location'] = 'http://example.com/test'

        # Add root part. set_content performs the quoted-printable encoding
        # and sets the matching Content-Type and Content-Transfer-Encoding headers.
        root_part = EmailMessage()
        root_part.set_content(root_html, subtype='html', charset='utf-8',
                              cte='quoted-printable')
        root_part['Content-Location'] = '/'
        msg.attach(root_part)

        # Write bytes so the output does not depend on the platform's text encoding.
        with open(filepath, 'wb') as f:
            f.write(msg.as_bytes())

# Example usage
if __name__ == '__main__':
    if len(sys.argv) > 1:
        handler = MHTHandler(sys.argv[1])
        handler.print_properties()
    else:
        handler = MHTHandler()
        handler.write('test.mht')
        print("Wrote test.mht")
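
Printing headers does not yet decode any content. As a hedged sketch of that step, the helper below picks the root part and returns its decoded HTML. It assumes the usual multipart/related convention that the root is the part whose Content-ID matches the optional `start` parameter, or the first part when `start` is absent; `root_html` is a hypothetical helper name, not part of the class above.

```python
import email
import email.policy


def root_html(raw: bytes) -> str:
    """Return the decoded text of the archive's root part."""
    msg = email.message_from_bytes(raw, policy=email.policy.default)
    if not msg.is_multipart():
        return msg.get_content()

    parts = list(msg.iter_parts())
    root = parts[0]                       # default: first part is the root
    start = msg.get_param('start')        # optional Content-ID reference
    if start:
        wanted = start.strip().strip('<>')
        for part in parts:
            cid = part['Content-ID']
            if cid is not None and str(cid).strip().strip('<>') == wanted:
                root = part
                break

    # get_content() undoes the transfer encoding (quoted-printable or base64)
    # and decodes the bytes with the part's declared charset.
    return root.get_content()
```

With an `MHTHandler` instance, the same `get_content()` call can be applied to any text part returned by `self.message.iter_parts()`.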

  5. Write a Java class that can open any file of format .MHT and decode, read, write, and print to console all the properties from the above list.

import javax.mail.*;
import javax.mail.internet.*;
import java.io.*;
import java.util.Properties;

public class MHTHandler {
    private MimeMessage message;

    public MHTHandler(String filepath) throws Exception {
        read(filepath);
    }

    public MHTHandler() {}

    public void read(String filepath) throws Exception {
        Properties props = new Properties();
        Session session = Session.getDefaultInstance(props, null);
        // try-with-resources closes the stream even if parsing fails
        try (InputStream is = new FileInputStream(filepath)) {
            message = new MimeMessage(session, is);
        }
    }

    public void printProperties() throws Exception {
        if (message == null) {
            System.out.println("No file loaded.");
            return;
        }

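        // Read the raw header: getFrom() returns null when From is absent and
        // may reject non-address values such as "<Saved by ...>"
        System.out.println("From: " + message.getHeader("From", null));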
        System.out.println("Date: " + message.getSentDate());
        System.out.println("Subject: " + message.getSubject());
        System.out.println("MIME-Version: " + message.getHeader("MIME-Version", null));
        System.out.println("Content-Type (top-level): " + message.getContentType());

        // ContentType handles both quoted and unquoted parameter values
        String boundary = new ContentType(message.getContentType()).getParameter("boundary");
        System.out.println("Boundary: " + (boundary != null ? boundary : "N/A"));

        System.out.println("Content-Base: " + message.getHeader("Content-Base", null));
        System.out.println("Snapshot-Content-Location: " + message.getHeader("Snapshot-Content-Location", null));

        if (message.getContent() instanceof MimeMultipart) {
            MimeMultipart multipart = (MimeMultipart) message.getContent();
            System.out.println("Number of parts: " + multipart.getCount());

            for (int i = 0; i < multipart.getCount(); i++) {
                // The two-argument getHeader(name, delimiter) is declared on MimePart,
                // so work with MimeBodyPart rather than the BodyPart base class
                MimeBodyPart part = (MimeBodyPart) multipart.getBodyPart(i);
                System.out.println("\nPart " + (i + 1) + ":");
                System.out.println("  Content-Type: " + part.getContentType());
                // Let ContentType extract the parameter so surrounding quotes are removed
                String charset = new ContentType(part.getContentType()).getParameter("charset");
                System.out.println("  Charset: " + (charset != null ? charset : "N/A"));
                System.out.println("  Content-Transfer-Encoding: " + part.getHeader("Content-Transfer-Encoding", null));
                System.out.println("  Content-Location: " + part.getHeader("Content-Location", null));
                System.out.println("  Content-ID: " + part.getHeader("Content-ID", null));
            }
        } else {
            System.out.println("Number of parts: 0");
        }
    }

    public void write(String filepath, String rootHtml) throws Exception {
        Properties props = new Properties();
        Session session = Session.getDefaultInstance(props, null);

        MimeMessage msg = new MimeMessage(session);
        // Set the raw header; "<Saved by Java>" is not a valid address for InternetAddress
        msg.setHeader("From", "<Saved by Java>");
        msg.setSentDate(new java.util.Date());
        msg.setSubject("Test MHT");
        msg.setHeader("MIME-Version", "1.0");
        msg.setHeader("Content-Base", "http://example.com");
        msg.setHeader("Snapshot-Content-Location", "http://example.com/test");

        MimeMultipart multipart = new MimeMultipart("related");
        multipart.setSubType("related; type=\"text/html\"");

        MimeBodyPart rootPart = new MimeBodyPart();
        rootPart.setContent(rootHtml, "text/html; charset=UTF-8");
        rootPart.setHeader("Content-Transfer-Encoding", "quoted-printable");
        rootPart.setHeader("Content-Location", "/");
        multipart.addBodyPart(rootPart);

        msg.setContent(multipart);

        // try-with-resources flushes and closes the file even if writeTo fails
        try (OutputStream os = new FileOutputStream(filepath)) {
            msg.writeTo(os);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            MHTHandler handler = new MHTHandler(args[0]);
            handler.printProperties();
        } else {
            MHTHandler handler = new MHTHandler();
            handler.write("test.mht", "<html><body>Test</body></html>");
            System.out.println("Wrote test.mht");
        }
    }
}

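Note: This requires the JavaMail API (javax.mail), which is not part of the JDK and must be added as a dependency. Its successor, Jakarta Mail, provides the same classes under the jakarta.mail package.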

  6. Write a JavaScript class that can open any file of format .MHT and decode, read, write, and print to console all the properties from the above list.

This is for Node.js (uses fs for file I/O).

const fs = require('fs');

class MHTHandler {
    constructor(filepath = null) {
        this.text = null;
        this.properties = {};
        if (filepath) this.read(filepath);
    }

    read(filepath) {
        this.text = fs.readFileSync(filepath, 'utf8');
        this.parse();
    }

    parse() {
        if (!this.text) return;

        // Store one header line, splitting at the first colon only so that
        // values containing colons (URLs, dates) stay intact.
        const addHeader = (target, line, current) => {
            if (/^\s/.test(line) && current) {
                // Folded continuation of the previous header.
                target[current] += ' ' + line.trim();
                return current;
            }
            const colon = line.indexOf(':');
            if (colon === -1) return current;
            const name = line.slice(0, colon).trim();
            target[name] = line.slice(colon + 1).trim();
            return name;
        };
        // Extract a Content-Type parameter, quoted or unquoted.
        const param = (value, name) => {
            const m = (value || '').match(new RegExp(`${name}="?([^";]+)"?`, 'i'));
            return m ? m[1].trim() : null;
        };

        const lines = this.text.split(/\r?\n/);
        const headers = {};
        const parts = [];
        let current = '';
        let i = 0;

        // Top-level headers end at the first blank line.
        for (; i < lines.length && lines[i].trim() !== ''; i++) {
            current = addHeader(headers, lines[i], current);
        }
        // The boundary must be known before the body is scanned.
        const boundary = param(headers['Content-Type'], 'boundary');

        // "--boundary" starts a part; "--boundary--" ends the archive.
        let part = null;
        let inPartHeaders = false;
        for (; boundary && i < lines.length; i++) {
            const line = lines[i];
            if (line.startsWith(`--${boundary}`)) {
                if (part) parts.push(part);
                part = null;
                if (line.startsWith(`--${boundary}--`)) break;
                part = { headers: {} };
                inPartHeaders = true;
                current = '';
            } else if (part && inPartHeaders) {
                if (line.trim() === '') inPartHeaders = false;
                else current = addHeader(part.headers, line, current);
            }
        }
        if (part) parts.push(part);

        this.properties = {
            From: headers['From'] || 'N/A',
            Date: headers['Date'] || 'N/A',
            Subject: headers['Subject'] || 'N/A',
            'MIME-Version': headers['MIME-Version'] || 'N/A',
            'Content-Type (top-level)': headers['Content-Type'] || 'N/A',
            Boundary: boundary || 'N/A',
            'Content-Base': headers['Content-Base'] || 'N/A',
            'Snapshot-Content-Location': headers['Snapshot-Content-Location'] || 'N/A',
            'Number of parts': parts.length,
            parts: parts.map(p => ({
                'Content-Type': p.headers['Content-Type'] || 'N/A',
                Charset: param(p.headers['Content-Type'], 'charset') || 'N/A',
                'Content-Transfer-Encoding': p.headers['Content-Transfer-Encoding'] || 'N/A',
                'Content-Location': p.headers['Content-Location'] || 'N/A',
                'Content-ID': p.headers['Content-ID'] || 'N/A'
            }))
        };
    }

    printProperties() {
        if (!this.properties.From) {
            console.log('No file loaded.');
            return;
        }
        Object.keys(this.properties).forEach(key => {
            if (key === 'parts') {
                this.properties.parts.forEach((part, idx) => {
                    console.log(`\nPart ${idx + 1}:`);
                    Object.keys(part).forEach(pKey => {
                        console.log(`  ${pKey}: ${part[pKey]}`);
                    });
                });
            } else {
                console.log(`${key}: ${this.properties[key]}`);
            }
        });
    }

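    write(filepath, rootHtml = '<html><body>Test</body></html>') {
        const boundary = '----boundary_test';
        // Quoted-printable encode the body so it matches the declared
        // Content-Transfer-Encoding: "=" and bytes outside printable ASCII
        // become =XX. Soft line breaks for lines over 76 characters are not added.
        const encoded = Array.from(Buffer.from(rootHtml, 'utf8'), b =>
            (b === 0x3d || b > 0x7e || (b < 0x20 && b !== 0x0a && b !== 0x0d && b !== 0x09))
                ? '=' + b.toString(16).toUpperCase().padStart(2, '0')
                : String.fromCharCode(b)
        ).join('');

        let content = `From: <Saved by JavaScript>\n`;
        content += `Date: Tue, 30 Sep 2025 00:00:00 +0000\n`;
        content += `Subject: Test MHT\n`;
        content += `MIME-Version: 1.0\n`;
        content += `Content-Type: multipart/related; type="text/html"; boundary="${boundary}"\n`;
        content += `Content-Base: http://example.com\n`;
        content += `Snapshot-Content-Location: http://example.com/test\n\n`;
        content += `This is a multi-part message in MIME format.\n\n`;
        content += `--${boundary}\n`;
        content += `Content-Type: text/html; charset="UTF-8"\n`;
        content += `Content-Transfer-Encoding: quoted-printable\n`;
        content += `Content-Location: /\n\n`;
        content += encoded + '\n';
        content += `--${boundary}--\n`;

        fs.writeFileSync(filepath, content, 'utf8');
    }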
}

// Example usage
if (process.argv.length > 2) {
    const handler = new MHTHandler(process.argv[2]);
    handler.printProperties();
} else {
    const handler = new MHTHandler();
    handler.write('test.mht');
    console.log('Wrote test.mht');
}

  7. Write a C class that can open any file of format .MHT and decode, read, write, and print to console all the properties from the above list.

This is written in C++, since C itself has no classes. The file is parsed manually because the C++ standard library has no MIME parser. A C++11 or later compiler is required.

#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <map>
#include <string>
#include <regex>

class MHTHandler {
private:
    std::map<std::string, std::string> headers;
    std::string boundary;
    std::vector<std::map<std::string, std::string>> parts;

    // Remove leading and trailing whitespace, including a stray CR.
    static std::string trim(const std::string& s) {
        const char* ws = " \t\r\n";
        size_t b = s.find_first_not_of(ws);
        if (b == std::string::npos) return "";
        return s.substr(b, s.find_last_not_of(ws) - b + 1);
    }

    // Store one non-empty header line, appending folded continuation lines
    // to the header that precedes them.
    static void addHeader(std::map<std::string, std::string>& target,
                          const std::string& line, std::string& current) {
        if ((line[0] == ' ' || line[0] == '\t') && !current.empty()) {
            target[current] += " " + trim(line);
            return;
        }
        size_t colon = line.find(':');
        if (colon == std::string::npos) return;
        current = trim(line.substr(0, colon));
        target[current] = trim(line.substr(colon + 1));
    }

    void parse(const std::string& text) {
        std::istringstream iss(text);
        std::string line;
        std::string currentHeader;
        bool inTopHeaders = true;
        bool inPartHeaders = false;

        while (std::getline(iss, line)) {
            // MHT files normally use CRLF line endings; drop the trailing CR.
            if (!line.empty() && line.back() == '\r') line.pop_back();

            if (inTopHeaders) {
                if (line.empty()) {
                    inTopHeaders = false;
                    // The boundary parameter may be quoted or unquoted. A custom
                    // raw-string delimiter is needed because the pattern contains )".
                    std::regex boundaryRegex(R"re(boundary="?([^";]+))re", std::regex::icase);
                    std::smatch match;
                    if (std::regex_search(headers["Content-Type"], match, boundaryRegex)) {
                        boundary = trim(match[1].str());
                    }
                } else {
                    addHeader(headers, line, currentHeader);
                }
            } else if (!boundary.empty() &&
                       line.compare(0, boundary.size() + 2, "--" + boundary) == 0) {
                // "--boundary--" closes the archive; "--boundary" opens a new part.
                if (line.compare(boundary.size() + 2, 2, "--") == 0) break;
                parts.push_back({});
                inPartHeaders = true;
                currentHeader.clear();
            } else if (inPartHeaders) {
                if (line.empty()) inPartHeaders = false;
                else addHeader(parts.back(), line, currentHeader);
            }
        }
    }

public:
    MHTHandler(const std::string& filepath = "") {
        if (!filepath.empty()) read(filepath);
    }

    void read(const std::string& filepath) {
        std::ifstream file(filepath);
        if (!file) {
            std::cerr << "Failed to open file." << std::endl;
            return;
        }
        std::stringstream buffer;
        buffer << file.rdbuf();
        parse(buffer.str());
    }

    void printProperties() {
        if (headers.empty()) {
            std::cout << "No file loaded." << std::endl;
            return;
        }

        std::cout << "From: " << headers["From"] << std::endl;
        std::cout << "Date: " << headers["Date"] << std::endl;
        std::cout << "Subject: " << headers["Subject"] << std::endl;
        std::cout << "MIME-Version: " << headers["MIME-Version"] << std::endl;
        std::cout << "Content-Type (top-level): " << headers["Content-Type"] << std::endl;
        std::cout << "Boundary: " << (boundary.empty() ? "N/A" : boundary) << std::endl;
        std::cout << "Content-Base: " << headers["Content-Base"] << std::endl;
        std::cout << "Snapshot-Content-Location: " << headers["Snapshot-Content-Location"] << std::endl;
        std::cout << "Number of parts: " << parts.size() << std::endl;

        for (size_t i = 0; i < parts.size(); ++i) {
            auto& part = parts[i];
            std::cout << "\nPart " << (i + 1) << ":" << std::endl;
            std::cout << "  Content-Type: " << part["Content-Type"] << std::endl;
            std::string charset = "N/A";
            // Skip an optional opening quote so the printed value has no quotes.
            std::regex charsetRegex(R"re(charset="?([^"; \t]+))re", std::regex::icase);
            std::smatch match;
            if (std::regex_search(part["Content-Type"], match, charsetRegex)) {
                charset = match[1];
            }
            std::cout << "  Charset: " << charset << std::endl;
            std::cout << "  Content-Transfer-Encoding: " << part["Content-Transfer-Encoding"] << std::endl;
            std::cout << "  Content-Location: " << part["Content-Location"] << std::endl;
            std::cout << "  Content-ID: " << part["Content-ID"] << std::endl;
        }
    }

    void write(const std::string& filepath, const std::string& rootHtml = "<html><body>Test</body></html>") {
        std::string boundaryStr = "----boundary_test";
        std::ofstream file(filepath);
        if (!file) {
            std::cerr << "Failed to write file." << std::endl;
            return;
        }

        file << "From: <Saved by C++>\n";
        file << "Date: Tue, 30 Sep 2025 00:00:00 +0000\n";
        file << "Subject: Test MHT\n";
        file << "MIME-Version: 1.0\n";
        file << "Content-Type: multipart/related; type=\"text/html\"; boundary=\"" << boundaryStr << "\"\n";
        file << "Content-Base: http://example.com\n";
        file << "Snapshot-Content-Location: http://example.com/test\n\n";
        file << "This is a multi-part message in MIME format.\n\n";
        file << "--" << boundaryStr << "\n";
        file << "Content-Type: text/html; charset=\"UTF-8\"\n";
        file << "Content-Transfer-Encoding: quoted-printable\n";
        file << "Content-Location: /\n\n";
        // Quoted-printable encode the body to match the declared encoding:
        // "=" and bytes outside printable ASCII become =XX. Soft line breaks
        // for lines over 76 characters are not inserted.
        static const char* hex = "0123456789ABCDEF";
        for (unsigned char c : rootHtml) {
            if (c == '=' || c > 126 || (c < 32 && c != '\n' && c != '\r' && c != '\t')) {
                file << '=' << hex[c >> 4] << hex[c & 15];
            } else {
                file << c;
            }
        }
        file << "\n";
        file << "--" << boundaryStr << "--\n";
    }
};

int main(int argc, char* argv[]) {
    if (argc > 1) {
        MHTHandler handler(argv[1]);
        handler.printProperties();
    } else {
        MHTHandler handler;
        handler.write("test.mht");
        std::cout << "Wrote test.mht" << std::endl;
    }
    return 0;
}
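
Because the JavaScript and C++ writers assemble the file text by hand, it can help to run their output through an independent MIME parser. The following is a minimal sketch of such a check using Python's standard `email` package; `check_archive` and the specific checks it performs are assumptions chosen for illustration, not a validation procedure defined by RFC 2557.

```python
import email
import email.policy


def check_archive(raw: bytes) -> list:
    """Return a list of structural problems; an empty list means none were found."""
    problems = []
    msg = email.message_from_bytes(raw, policy=email.policy.default)

    if msg.get_content_type() != 'multipart/related':
        problems.append('top-level Content-Type is not multipart/related')
    if not msg.get_boundary():
        problems.append('missing boundary parameter')

    parts = list(msg.iter_parts()) if msg.is_multipart() else []
    if not parts:
        problems.append('no parts found')

    for index, part in enumerate(parts, 1):
        # Other parts can only reference a resource through one of these headers.
        if part['Content-Location'] is None and part['Content-ID'] is None:
            problems.append(f'part {index} has neither Content-Location nor Content-ID')
    return problems
```

For example, `check_archive(open('test.mht', 'rb').read())` can be run on the file produced by any of the `write` methods above.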