Task 526: .PDF File Format

thinkmelt@protonmail.com

Oct 29, 2025 • 9 min read

PDF File Format Specifications

The Portable Document Format (PDF) is standardized as ISO 32000; PDF 1.7 is specified in ISO 32000-1:2008. That standard is based on Adobe's PDF Reference and defines a device-independent representation of electronic documents. It covers syntax, objects, filters, encryption, interactive features, and more. Adobe hosts a copy of the full text (the "PDF 32000" document).

1. List of Properties Intrinsic to the PDF File Format

Based on the PDF specification (ISO 32000-1:2008), here is a comprehensive list of intrinsic properties and structural elements of the PDF file format. These are core to the format's structure and not dependent on external file system attributes (e.g., size or timestamps, which vary per instance). They define how the file is organized, parsed, and rendered. I've grouped them logically for clarity.

Header Properties

Magic Signature: The file starts with %PDF- (ASCII bytes in decimal: 37 80 68 70 45; in hex: 25 50 44 46 2D), identifying it as a PDF.
Version Number: Follows the signature (e.g., 1.7 for PDF 1.7), specifying the conformance level (major.minor, from 1.0 to 1.7; can be overridden in the catalog).
Binary Indicator: Optional comment line starting with % followed by at least four bytes with values ≥128 (e.g., %âãÏÓ) to indicate binary content.
Line Termination: Uses EOL markers (CR, LF, or CR LF) after the header.
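
As a rough illustration of the header properties above, the following Python sketch reads the signature, the version, and the optional binary marker from the start of a byte string. The function name parse_header and the sample bytes are made up for this example; a real parser would also tolerate leading junk before %PDF-, which this sketch does not.

```python
import re

def parse_header(data: bytes) -> dict:
    """Read the version and detect the binary marker from the first two lines."""
    # The file must begin with %PDF-M.m
    m = re.match(rb"%PDF-(\d)\.(\d)", data)
    if m is None:
        raise ValueError("not a PDF: missing %PDF- signature")
    # Split on any of the three EOL forms (CR LF, CR, LF) to isolate line two.
    lines = re.split(rb"\r\n|\r|\n", data[:64], maxsplit=2)
    second = lines[1] if len(lines) > 1 else b""
    # The binary marker is a comment containing at least four bytes >= 128.
    high_bytes = sum(1 for b in second[1:] if b >= 128)
    return {
        "version": f"{m.group(1).decode()}.{m.group(2).decode()}",
        "binary_marker": second.startswith(b"%") and high_bytes >= 4,
    }

# Hypothetical sample: a 1.7 header followed by the common binary comment.
sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
print(parse_header(sample))
```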

Body Properties

Indirect Objects: Core building blocks; format: obj_num gen_num obj ... endobj. Object numbers are positive integers (object 0 is reserved for the free list); gen_num starts at 0 and is incremented only when a freed object number is reused (max 65535).
Direct Objects: Values written inline rather than referenced; types include boolean, integer, real, string (literal or hex), name (prefixed with /), array ([...]), dictionary (<< ... >>), and null. Streams (a dictionary followed by stream ... endstream) must always be indirect objects.
Object Streams (PDF 1.5+): Compressed collections of indirect objects; dictionary with /Type /ObjStm, /N (count), /First (offset), /Extends (chain reference).
Incremental Updates: Appended sections for changes, preserving original content; multiple bodies possible.
Character Set Restrictions: Regular characters, delimiters (e.g., ( ) < > [ ] { } / %), white-space (SPACE, TAB, CR, LF, FF, NUL).
Comments: Begin with % outside strings and streams and run to the end of the line; ignored by parsers except for the header line and the %%EOF marker.
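
To make the "obj_num gen_num obj" layout concrete, here is a small Python sketch that scans a byte string for indirect-object headers. It is a heuristic only: the pattern can also match bytes that happen to look like a header inside a stream, so a real reader should use the cross-reference data instead. The names OBJ_RE and find_indirect_objects and the sample body are assumptions made for this example.

```python
import re

# "N G obj" separated by PDF white-space (SPACE, TAB, CR, LF, FF, NUL).
OBJ_RE = re.compile(
    rb"(?<![0-9])(\d+)[ \t\r\n\x0c\x00]+(\d+)[ \t\r\n\x0c\x00]+obj\b"
)

def find_indirect_objects(data: bytes):
    """Return (object number, generation, byte offset) for each header found."""
    return [
        (int(m.group(1)), int(m.group(2)), m.start())
        for m in OBJ_RE.finditer(data)
    ]

# Hypothetical two-object body, just enough to exercise the scan.
body = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Count 0 /Kids [] >>\nendobj\n"
)
for num, gen, offset in find_indirect_objects(body):
    print(num, gen, offset)
```

Note that the reference "2 0 R" is not reported, because the pattern requires the obj keyword after the two integers.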

Cross-Reference (Xref) Properties

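Xref Table: Starts with xref; each subsection begins with start_obj count, followed by 20-byte entries: a 10-digit byte offset (for free entries, the next free object number), a 5-digit gen_num, and n (in use) or f (free), terminated by a two-byte EOL (SP CR, SP LF, or CR LF).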
Xref Stream (PDF 1.5+): Alternative to table; indirect object with /Type /XRef, stream containing compressed entries; fields defined by /W array (widths for type/offset/gen).
Hybrid Xref (PDF 1.5+): Combines table and stream for compatibility.
Free List: Object 0 as head of free objects chain.
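
The fixed-width table entries described above are simple to read. The following Python sketch parses one classic xref section into a dictionary; it assumes a well-formed table and does not handle xref streams or truncated input. The function name parse_xref_table and the sample bytes are illustrative, not part of any library.

```python
def parse_xref_table(data: bytes, offset: int) -> dict:
    """Parse one classic xref section starting at `offset`.

    Returns {object number: (first field, generation, in_use)}. For in-use
    entries the first field is a byte offset; for free entries it is the
    next free object number.
    """
    lines = data[offset:].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("no xref keyword at the given offset")
    entries = {}
    i = 1
    while i < len(lines):
        parts = lines[i].split()
        # A subsection header is exactly two integers: start_obj and count.
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            break  # reached "trailer" or other content
        first, count = int(parts[0]), int(parts[1])
        for k in range(count):
            field1, gen, kind = lines[i + 1 + k].split()
            entries[first + k] = (int(field1), int(gen), kind == b"n")
        i += 1 + count
    return entries

# Hypothetical three-entry table: object 0 heads the free list.
sample = (
    b"xref\n0 3\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"0000000058 00000 n \n"
    b"trailer\n<< /Size 3 >>\n"
)
for num, entry in parse_xref_table(sample, 0).items():
    print(num, entry)
```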

Trailer Properties

Trailer Dictionary: Starts with trailer << ... >>; key entries:
/Size: One greater than the highest object number in the file (so object 0 is counted).
/Root: Reference to document catalog.
/Info: Reference to document info dictionary (e.g., /Title, /Author, /Subject, /Keywords, /Creator, /Producer, /CreationDate, /ModDate, /Trapped).
/ID: Array of two hex strings (file identifiers; first unchanged, second changes on update).
/Prev: Offset to previous xref (for incremental files).
/Encrypt: Reference to encryption dictionary (if encrypted).
Startxref: Line with startxref followed by byte offset to xref.
EOF Marker: %%EOF at the end.
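
Readers usually start from the end of the file: they look for the last startxref / %%EOF pair and jump to the offset it names. A minimal Python sketch of that step follows; the function name find_startxref and the 1024-byte tail size are assumptions chosen for this example.

```python
import re

def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset that follows the last startxref keyword."""
    tail = data[-tail_size:]
    last = None
    # Incremental updates append further startxref/%%EOF pairs; keep the last.
    for last in re.finditer(rb"startxref\s+(\d+)\s+%%EOF", tail):
        pass
    if last is None:
        raise ValueError("no startxref / %%EOF pair near the end of the file")
    return int(last.group(1))

# Hypothetical file tail after one incremental update.
tail = b"startxref\n116\n%%EOF\n...appended objects...\nstartxref\n250\n%%EOF\n"
print(find_startxref(tail))
```

The returned offset points at either an xref keyword (classic table) or an indirect object of /Type /XRef (xref stream).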

Other Structural/Metadata Properties

Document Catalog (/Type /Catalog): Root object; properties like /Pages (page tree), /Version (override), /PageLayout (display mode), /Outlines (bookmarks), /Metadata (XMP stream), /StructTreeRoot (accessibility structure), /OCProperties (optional content).
Page Tree and Pages: Hierarchical (/Type /Pages or /Page); properties like /MediaBox (rectangle), /Resources (fonts, etc.), /Contents (stream), /Annots (annotations).
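Filters and Compression: Stream filters (e.g., /FlateDecode, /ASCIIHexDecode, /LZWDecode, /JBIG2Decode); several filters can be chained by giving /Filter an array.
Encryption: Standard (PDF 1.1+) or public-key (PDF 1.3+); dictionary with /Filter /Standard, /V (algorithm version; 1–4 in ISO 32000-1, 5 in later extensions), /R (revision), /O and /U (owner and user password validation strings, not the passwords themselves), /P (permissions), /EncryptMetadata.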
Metadata (XMP) (PDF 1.4+): XML stream in /Metadata; namespaces like pdf, xmp, dc.
Linearization (Optimized for web): Hint streams, primary/overflow hints, specific object ordering.
Signatures and Security: Digital signatures (PDF 1.3+); /Sig fields with /ByteRange, /Contents (PKCS#7).
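Limits: Implementation-dependent rather than part of the syntax; commonly cited values from Adobe's implementation notes are graphics-state (q/Q) nesting of 28 levels, array size 8191, string length 65,535 bytes, and file size of about 10 GB (the largest offset a 10-digit xref table entry can hold).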

These properties ensure random access, portability, and extensibility.
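
As an example of the filter mechanism listed above, /FlateDecode is zlib (deflate) compression, so a stream that uses only that filter can be decoded with Python's standard zlib module. The sketch below assumes a direct /Length value and a single filter; the function name decode_flate_stream and the sample object are made up for illustration.

```python
import re
import zlib

def decode_flate_stream(obj: bytes) -> bytes:
    """Extract and inflate the payload of one /FlateDecode stream object."""
    m = re.search(rb"/Length\s+(\d+)", obj)
    if m is None:
        raise ValueError("this sketch only handles a direct /Length value")
    length = int(m.group(1))
    # The payload begins right after the stream keyword and its EOL.
    start = re.search(rb"stream\r?\n", obj).end()
    return zlib.decompress(obj[start:start + length])

# Build a hypothetical stream object around a short content stream.
payload = zlib.compress(b"BT /F1 12 Tf (Hello) Tj ET")
obj = (
    b"4 0 obj\n<< /Length " + str(len(payload)).encode()
    + b" /Filter /FlateDecode >>\nstream\n"
    + payload
    + b"\nendstream\nendobj\n"
)
print(decode_flate_stream(obj))
```

Chained filters, predictors (/DecodeParms), and indirect /Length references would need extra handling that is left out here.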

2. Two Direct Download Links for PDF Files

Here are two direct download links to sample PDF files:

https://www.rd.usda.gov/sites/default/files/pdf-sample_0.pdf (A simple dummy PDF)
https://icseindia.org/document/sample.pdf (A sample document PDF)

3. Ghost Blog Embedded HTML/JavaScript for Drag-and-Drop PDF Property Dump

This is a self-contained HTML page with embedded JavaScript that allows drag-and-drop of a PDF file. It parses the file in the browser (using ArrayBuffer) and dumps the properties from the list above to the screen. Note: This is a basic parser; it extracts header, trailer, xref offset, and key trailer values but doesn't fully decode complex objects or handle all edge cases (e.g., compressed streams require additional logic).

[Embedded page "PDF Property Dumper": a drag-and-drop zone labeled "Drop PDF here"]

4. Python Class for PDF Handling

This Python class opens a PDF file, decodes/reads the structure, prints the properties, and supports writing (saving a modified version, e.g., updating version). It's a basic pure-Python parser without external libraries.

import re

class PDFHandler:
    def __init__(self, filepath):
        self.filepath = filepath
        self.data = None
        self.properties = {}
        self.load()

    def load(self):
        with open(self.filepath, 'rb') as f:
            self.data = f.read()
        self.parse_properties()

    def parse_properties(self):
        text = self.data.decode('latin1', errors='ignore')
        # Header
        # Raw bytes pattern avoids invalid-escape warnings for \d
        header_match = re.match(rb'%PDF-(\d\.\d)', self.data)
        self.properties['Magic Signature'] = '%PDF-' if header_match else 'Missing'
        self.properties['Version'] = header_match.group(1).decode() if header_match else 'Unknown'
        # Startxref
        startxref_pos = text.rfind('startxref')
        if startxref_pos != -1:
            # Guard against a startxref keyword with no number after it
            offset_match = re.search(r'\d+', text[startxref_pos + 9:])
            if offset_match:
                self.properties['Startxref'] = int(offset_match.group())
        # Trailer
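        # Classic xref tables only; files that use xref streams have no "trailer" keyword
        trailer_start = text.rfind('trailer')
        eof_pos = text.rfind('%%EOF')
        if trailer_start != -1 and eof_pos > trailer_start:
            trailer_text = text[trailer_start + 7:eof_pos]
            # Greedy match so a nested << ... >> does not end the dictionary early
            dict_match = re.search(r'<<(.*)>>', trailer_text, re.DOTALL)
            if dict_match:
                # Capture the whole value (e.g. "1 0 R"), not just its first token
                entries = re.findall(r'/(\w+)\s*(/\w+|[^/]*)', dict_match.group(1))
                self.properties['Trailer Properties'] = {k: v.strip() for k, v in entries}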
        # Xref type
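        # Inspect the bytes at the startxref offset. A plain substring test for
        # b'xref' always succeeds because "startxref" contains it.
        offset = self.properties.get('Startxref')
        if isinstance(offset, int) and self.data[offset:offset + 4] == b'xref':
            self.properties['Xref Type'] = 'Table'
        elif b'/XRef' in self.data:
            self.properties['Xref Type'] = 'Stream'
        else:
            self.properties['Xref Type'] = 'Unknown'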

    def print_properties(self):
        for key, value in self.properties.items():
            print(f"{key}: {value}")

    def write(self, new_filepath, updates=None):
        data = self.data
        if updates:
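            # Example: update version. The new value must keep the "M.m" form (e.g. '1.5')
            # so the header length, and therefore every xref byte offset, is unchanged.
            if 'version' in updates:
                data = re.sub(rb'%PDF-\d\.\d', b'%PDF-' + updates['version'].encode(), data, count=1)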
        with open(new_filepath, 'wb') as f:
            f.write(data)

# Example usage:
# handler = PDFHandler('sample.pdf')
# handler.print_properties()
# handler.write('modified.pdf', {'version': '1.5'})

5. Java Class for PDF Handling

This Java class opens a PDF, decodes/reads, prints properties, and writes (e.g., modifies and saves). Basic parser using byte arrays.

import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.regex.*;

public class PDFHandler {
    private String filepath;
    private byte[] data;
    private Map<String, Object> properties = new HashMap<>();

    public PDFHandler(String filepath) {
        this.filepath = filepath;
        load();
    }

    private void load() {
        try {
            data = Files.readAllBytes(Paths.get(filepath));
            parseProperties();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void parseProperties() {
        String text = new String(data, java.nio.charset.StandardCharsets.ISO_8859_1);
        // Header
        Matcher headerMatcher = Pattern.compile("^%PDF-(\\d\\.\\d)").matcher(text);
        properties.put("Magic Signature", "%PDF-");
        properties.put("Version", headerMatcher.find() ? headerMatcher.group(1) : "Unknown");
        // Startxref
        int startxrefPos = text.lastIndexOf("startxref");
        if (startxrefPos != -1) {
            Matcher numMatcher = Pattern.compile("\\d+").matcher(text.substring(startxrefPos + 9));
            if (numMatcher.find()) {
                properties.put("Startxref", Integer.parseInt(numMatcher.group()));
            }
        }
        // Trailer
        int trailerStart = text.lastIndexOf("trailer");
        int eofPos = text.lastIndexOf("%%EOF");
        if (trailerStart != -1 && eofPos > trailerStart) {
            String trailerText = text.substring(trailerStart + 7, eofPos).trim();
            // Greedy match so a nested << ... >> does not end the dictionary early
            Matcher dictMatcher = Pattern.compile("<<(.*)>>", Pattern.DOTALL).matcher(trailerText);
            if (dictMatcher.find()) {
                Map<String, String> trailerProps = new HashMap<>();
                // Capture the whole value (e.g. "1 0 R"), not just its first token
                Matcher entryMatcher = Pattern.compile("/(\\w+)\\s*(/\\w+|[^/]*)").matcher(dictMatcher.group(1));
                while (entryMatcher.find()) {
                    trailerProps.put(entryMatcher.group(1), entryMatcher.group(2).trim());
                }
                properties.put("Trailer Properties", trailerProps);
            }
        }
        // Xref type
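        // Inspect the bytes at the startxref offset. A plain contains("xref") test
        // always succeeds because "startxref" contains it.
        Object sx = properties.get("Startxref");
        long off = (sx instanceof Number) ? ((Number) sx).longValue() : -1;
        boolean isTable = off >= 0 && off < text.length() && text.startsWith("xref", (int) off);
        properties.put("Xref Type", isTable ? "Table" : (text.contains("/XRef") ? "Stream" : "Unknown"));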
    }

    public void printProperties() {
        properties.forEach((key, value) -> System.out.println(key + ": " + value));
    }

    public void write(String newFilepath, Map<String, String> updates) throws IOException {
        byte[] modifiedData = data.clone();
        if (updates != null && updates.containsKey("version")) {
            String newVersion = "%PDF-" + updates.get("version");
            modifiedData = new String(modifiedData, java.nio.charset.StandardCharsets.ISO_8859_1)
                    .replaceFirst("%PDF-\\d\\.\\d", newVersion)
                    .getBytes(java.nio.charset.StandardCharsets.ISO_8859_1);
        }
        Files.write(Paths.get(newFilepath), modifiedData);
    }

    // Example usage:
    // public static void main(String[] args) throws IOException {
    //     PDFHandler handler = new PDFHandler("sample.pdf");
    //     handler.printProperties();
    //     handler.write("modified.pdf", Map.of("version", "1.5"));
    // }
}

6. JavaScript Class for PDF Handling

This Node.js-compatible class opens a PDF (using fs), decodes/reads, prints properties to console, and writes modifications. Requires Node.js.

const fs = require('fs');

class PDFHandler {
    constructor(filepath) {
        this.filepath = filepath;
        this.data = null;
        this.properties = {};
        this.load();
    }

    load() {
        this.data = fs.readFileSync(this.filepath);
        this.parseProperties();
    }

    parseProperties() {
        const text = this.data.toString('latin1');
        // Header
        const headerMatch = text.match(/^%PDF-(\d\.\d)/);
        this.properties['Magic Signature'] = '%PDF-';
        this.properties['Version'] = headerMatch ? headerMatch[1] : 'Unknown';
        // Startxref
        const startxrefPos = text.lastIndexOf('startxref');
        if (startxrefPos !== -1) {
            const startxrefMatch = text.slice(startxrefPos + 9).match(/\d+/);
            this.properties['Startxref'] = startxrefMatch ? parseInt(startxrefMatch[0]) : 'Unknown';
        }
        // Trailer
        const trailerStart = text.lastIndexOf('trailer');
        const eofPos = text.lastIndexOf('%%EOF');
        if (trailerStart !== -1 && eofPos !== -1) {
            const trailerText = text.slice(trailerStart + 7, eofPos).trim();
            // Greedy match so a nested << ... >> does not end the dictionary early
            const dictMatch = trailerText.match(/<<(.*)>>/s);
            if (dictMatch) {
                const trailerProps = {};
                // Capture the whole value (e.g. "1 0 R"), not just its first token
                for (const [, key, value] of dictMatch[1].matchAll(/\/(\w+)\s*(\/\w+|[^/]*)/g)) {
                    trailerProps[key] = value.trim();
                }
                this.properties['Trailer Properties'] = trailerProps;
            }
        }
        // Xref type
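        // Inspect the bytes at the startxref offset. A plain includes('xref') test
        // always succeeds because "startxref" contains it.
        const offset = this.properties['Startxref'];
        const atOffset = Number.isInteger(offset) ? text.slice(offset, offset + 4) : '';
        this.properties['Xref Type'] = atOffset === 'xref' ? 'Table' : (text.includes('/XRef') ? 'Stream' : 'Unknown');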
    }

    printProperties() {
        for (const [key, value] of Object.entries(this.properties)) {
            console.log(`${key}: ${JSON.stringify(value, null, 2)}`);
        }
    }

    write(newFilepath, updates = {}) {
        let modifiedData = this.data;
        if (updates.version) {
            const newVersion = `%PDF-${updates.version}`;
            modifiedData = Buffer.from(this.data.toString('latin1').replace(/^%PDF-\d\.\d/, newVersion), 'latin1');
        }
        fs.writeFileSync(newFilepath, modifiedData);
    }
}

// Example usage:
// const handler = new PDFHandler('sample.pdf');
// handler.printProperties();
// handler.write('modified.pdf', { version: '1.5' });

7. C++ Class for PDF Handling

This C++ class opens a PDF, decodes/reads, prints properties to console, and writes modifications. Uses std::regex for parsing.

#include <iostream>
#include <fstream>
#include <sstream>
#include <regex>
#include <map>
#include <string>

class PDFHandler {
private:
    std::string filepath;
    std::string data;
    std::map<std::string, std::string> properties; // Simplified to string values for demo

public:
    PDFHandler(const std::string& fp) : filepath(fp) {
        load();
    }

    void load() {
        std::ifstream file(filepath, std::ios::binary);
        if (file) {
            std::ostringstream oss;
            oss << file.rdbuf();
            data = oss.str();
            parseProperties();
        }
    }

    void parseProperties() {
        // Header
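        std::regex headerRegex(R"(%PDF-(\d\.\d))");
        std::smatch match;
        // match_continuous anchors the match at byte 0 so the whole file is not scanned
        if (std::regex_search(data, match, headerRegex, std::regex_constants::match_continuous)) {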
            properties["Magic Signature"] = "%PDF-";
            properties["Version"] = match[1].str();
        } else {
            properties["Version"] = "Unknown";
        }
        // Startxref
        size_t startxrefPos = data.rfind("startxref");
        if (startxrefPos != std::string::npos) {
            std::regex numRegex(R"(\d+)");
            std::sregex_iterator iter(data.begin() + startxrefPos + 9, data.end(), numRegex);
            if (iter != std::sregex_iterator()) {
                properties["Startxref"] = (*iter)[0].str();
            }
        }
        // Trailer
        size_t trailerStart = data.rfind("trailer");
        size_t eofPos = data.rfind("%%EOF");
        if (trailerStart != std::string::npos && eofPos != std::string::npos) {
            std::string trailerText = data.substr(trailerStart + 7, eofPos - trailerStart - 7);
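            // std::regex has no dotall flag; [\s\S] matches any character, including newlines
            std::regex dictRegex(R"(<<([\s\S]*)>>)");
            if (std::regex_search(trailerText, match, dictRegex)) {
                std::string dictContent = match[1].str();
                // Capture the whole value (e.g. "1 0 R") without trailing white-space
                std::regex entryRegex(R"(/(\w+)\s*(/\w+|[^/]*[^/\s]))");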
                std::sregex_iterator entryIter(dictContent.begin(), dictContent.end(), entryRegex);
                std::string trailerProps;
                for (; entryIter != std::sregex_iterator(); ++entryIter) {
                    trailerProps += (*entryIter)[1].str() + ": " + (*entryIter)[2].str() + ", ";
                }
                properties["Trailer Properties"] = trailerProps;
            }
        }
        // Xref type
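        // Inspect the bytes at the startxref offset. A plain search for "xref"
        // always succeeds because "startxref" contains it.
        bool isTable = false;
        auto sx = properties.find("Startxref");
        if (sx != properties.end() && sx->second.size() <= 18) {
            size_t off = std::stoull(sx->second);
            isTable = off < data.size() && data.compare(off, 4, "xref") == 0;
        }
        properties["Xref Type"] = isTable ? "Table" : ((data.find("/XRef") != std::string::npos) ? "Stream" : "Unknown");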
    }

    void printProperties() {
        for (const auto& prop : properties) {
            std::cout << prop.first << ": " << prop.second << std::endl;
        }
    }

    void write(const std::string& newFilepath, const std::map<std::string, std::string>& updates) {
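        std::string modifiedData = data;
        auto it = updates.find("version");
        // Overwrite the three version bytes in place instead of running a regex over the
        // whole binary file; a same-length "M.m" value keeps every xref offset valid.
        if (it != updates.end() && it->second.size() == 3 &&
            modifiedData.size() >= 8 && modifiedData.compare(0, 5, "%PDF-") == 0) {
            modifiedData.replace(5, 3, it->second);
        }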
        std::ofstream outFile(newFilepath, std::ios::binary);
        outFile << modifiedData;
    }
};

// Example usage:
// int main() {
//     PDFHandler handler("sample.pdf");
//     handler.printProperties();
//     std::map<std::string, std::string> updates = {{"version", "1.5"}};
//     handler.write("modified.pdf", updates);
//     return 0;
// }