Task 526: .PDF File Format
PDF File Format Specifications
The Portable Document Format (PDF) is standardized as ISO 32000, with the core specification for PDF 1.7 available in the document ISO 32000-1:2008. This is based on Adobe's PDF Reference and defines the format for representing electronic documents in a device-independent manner. The specification covers syntax, objects, filters, encryption, interactive features, and more. The full official specification can be found in resources like the Adobe-hosted PDF 32000 document.
1. List of Properties Intrinsic to the PDF File Format
Based on the PDF specification (ISO 32000-1:2008), here is a comprehensive list of intrinsic properties and structural elements of the PDF file format. These are core to the format's structure and not dependent on external file system attributes (e.g., size or timestamps, which vary per instance). They define how the file is organized, parsed, and rendered. I've grouped them logically for clarity.
Header Properties
- Magic Signature: The file starts with %PDF- (ASCII bytes in decimal: 37 80 68 70 45; in hex: 25 50 44 46 2D), identifying it as a PDF.
- Version Number: Follows the signature (e.g., 1.7 for PDF 1.7), specifying the version the file conforms to (major.minor, from 1.0 to 1.7; can be overridden by /Version in the catalog).
- Binary Indicator: Optional comment line starting with % followed by at least four bytes with values ≥128 (e.g., %âãÏÓ) to indicate binary content.
- Line Termination: Uses EOL markers (CR, LF, or CR LF) after the header.
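The header rules above can be checked with a few lines of Python. This is a minimal sketch, not a full validator: the function name parse_pdf_header and the 1024-byte window are illustrative choices, and it assumes the binary marker, if present, sits on the second line.

```python
import re

def parse_pdf_header(data: bytes) -> dict:
    """Return the version and binary-marker status found in a PDF header."""
    # The signature must be the first bytes of the file: %PDF-<major>.<minor>
    m = re.match(rb"%PDF-(\d)\.(\d)", data)
    if m is None:
        raise ValueError("not a PDF: missing %PDF- signature")
    # Split the start of the file into lines on CR LF, CR, or LF.
    lines = re.split(rb"\r\n|\r|\n", data[:1024], maxsplit=2)
    second = lines[1] if len(lines) > 1 else b""
    # A binary marker is a comment whose first four bytes after '%' are all >= 128.
    has_marker = (
        second.startswith(b"%")
        and len(second) >= 5
        and all(b >= 128 for b in second[1:5])
    )
    return {
        "version": f"{m.group(1).decode()}.{m.group(2).decode()}",
        "binary_marker": has_marker,
    }

sample = b"%PDF-1.7\r\n%\xe2\xe3\xcf\xd3\r\n1 0 obj\n"
print(parse_pdf_header(sample))  # {'version': '1.7', 'binary_marker': True}
```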
Body Properties
- Indirect Objects: Core building blocks; format: obj_num gen_num obj ... endobj. Obj_num is a positive integer (object 0 is reserved for the free list); gen_num starts at 0 and increments when a freed object number is reused (max 65535).
- Direct Objects: Embedded within indirect objects; types include boolean, integer, real, string (literal or hex), name (prefixed with /), array ([...]), dictionary (<< ... >>), and null. Streams (a dictionary followed by stream ... endstream) must always be indirect objects.
- Object Streams (PDF 1.5+): Compressed collections of indirect objects; dictionary with /Type /ObjStm, /N (count), /First (offset), /Extends (chain reference).
- Incremental Updates: Appended sections for changes, preserving original content; multiple bodies possible.
- Character Set Restrictions: Regular characters, delimiters (e.g., ( ) < > [ ] { } / %), white-space (SPACE, TAB, CR, LF, FF, NUL).
- Comments: Begin with % and run to the end of the line (ignored except for the header and the %%EOF marker).
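To make the indirect-object syntax concrete, the following sketch scans a byte string for "obj_num gen_num obj" headers. The pattern and the function name find_indirect_objects are illustrative assumptions; a naive scan like this can also match bytes inside binary stream data, so real parsers locate objects through the cross-reference data instead.

```python
import re

# "<obj_num> <gen_num> obj" at a token boundary. Hits are candidates only,
# because the same byte sequence may occur inside stream data.
OBJ_HEADER = re.compile(rb"(?<![0-9A-Za-z])(\d+)\s+(\d+)\s+obj\b")

def find_indirect_objects(data: bytes) -> list[tuple[int, int, int]]:
    """Return (object number, generation number, byte offset) per header found."""
    return [
        (int(m.group(1)), int(m.group(2)), m.start())
        for m in OBJ_HEADER.finditer(data)
    ]

sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
)
for num, gen, offset in find_indirect_objects(sample):
    print(f"object {num} generation {gen} at byte {offset}")
```

Note that the reference 2 0 R inside the first dictionary is not reported, since it ends in R rather than obj.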
Cross-Reference (Xref) Properties
- Xref Table: Starts with xref; subsections with start_obj count followed by 20-byte entries: 10-digit offset, 5-digit gen_num, n (in use) or f (free), ended by space and EOL.
- Xref Stream (PDF 1.5+): Alternative to table; indirect object with /Type /XRef, stream containing compressed entries; fields defined by /W array (widths for type/offset/gen).
- Hybrid Xref (PDF 1.5+): Combines table and stream for compatibility.
- Free List: Object 0 as head of free objects chain.
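The classic table layout described above is simple enough to parse by hand. The sketch below assumes a well-formed table and a caller-supplied byte offset (normally the startxref value); parse_xref_table is a hypothetical helper name, and xref streams are not handled.

```python
def parse_xref_table(data: bytes, offset: int) -> dict[int, tuple[int, int, bool]]:
    """Parse a classic xref section starting at `offset`.

    Returns {object number: (byte offset or next free object, generation, in_use)}.
    """
    lines = data[offset:].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("no xref keyword at the given offset")
    entries = {}
    i = 1
    while i < len(lines):
        header = lines[i].split()
        # A subsection header is exactly two integers: first object number, count.
        if len(header) != 2 or not all(tok.isdigit() for tok in header):
            break  # reached 'trailer' or something unexpected
        first, count = int(header[0]), int(header[1])
        for k in range(count):
            field1, gen, kind = lines[i + 1 + k].split()
            entries[first + k] = (int(field1), int(gen), kind == b"n")
        i += 1 + count
    return entries

sample = (
    b"xref\n"
    b"0 3\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"0000000058 00000 n \n"
    b"trailer\n<< /Size 3 /Root 1 0 R >>\n"
)
print(parse_xref_table(sample, 0))
# {0: (0, 65535, False), 1: (9, 0, True), 2: (58, 0, True)}
```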
Trailer Properties
- Trailer Dictionary: Starts with trailer << ... >>; key entries:
  - /Size: Total objects (including 0).
  - /Root: Reference to document catalog.
  - /Info: Reference to document info dictionary (e.g., /Title, /Author, /Subject, /Keywords, /Creator, /Producer, /CreationDate, /ModDate, /Trapped).
  - /ID: Array of two hex strings (file identifiers; first unchanged, second changes on update).
  - /Prev: Offset to previous xref (for incremental files).
  - /Encrypt: Reference to encryption dictionary (if encrypted).
- Startxref: Line with startxref followed by byte offset to xref.
- EOF Marker: %%EOF at the end.
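Because a reader starts from the end of the file, the first practical step is locating startxref. The following sketch searches only the file's tail; the function name find_startxref and the 1024-byte default window are assumptions made for illustration.

```python
import re

def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset recorded after the last 'startxref' keyword."""
    tail = data[-tail_size:]
    # Use the last match so that incrementally updated files resolve to the
    # most recent cross-reference section.
    matches = list(re.finditer(rb"startxref\s+(\d+)\s+%%EOF", tail))
    if not matches:
        raise ValueError("startxref / %%EOF not found near end of file")
    return int(matches[-1].group(1))

sample = b"trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n107\n%%EOF\n"
print(find_startxref(sample))  # 107
```

The returned offset points at either an xref keyword (classic table) or an indirect object holding an xref stream; following /Prev from there walks back through earlier updates.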
Other Structural/Metadata Properties
- Document Catalog (/Type /Catalog): Root object; properties like /Pages (page tree), /Version (override), /PageLayout (display mode), /Outlines (bookmarks), /Metadata (XMP stream), /StructTreeRoot (accessibility structure), /OCProperties (optional content).
- Page Tree and Pages: Hierarchical (/Type /Pages or /Page); properties like /MediaBox (rectangle), /Resources (fonts, etc.), /Contents (stream), /Annots (annotations).
- Filters and Compression: Stream filters (e.g., /FlateDecode, /ASCIIHexDecode, /LZWDecode, /JBIG2Decode); filters can be chained and are applied in the order listed.
- Encryption: Standard (PDF 1.1+) or public-key (PDF 1.3+); dictionary with /Filter /Standard, /V (version 1–5), /R (revision), /O and /U (owner/user password entries), /P (permissions), /EncryptMetadata.
- Metadata (XMP) (PDF 1.4+): XML stream in /Metadata; namespaces like pdf, xmp, dc.
- Linearization (Optimized for web): Hint streams, primary/overflow hints, specific object ordering.
- Signatures and Security: Digital signatures (PDF 1.3+); /Sig fields with /ByteRange, /Contents (PKCS#7).
- Limits: Max nesting 28 levels, array size 8191, string 65,535 bytes, file size ~10 GB (implementation-dependent).
These properties ensure random access, portability, and extensibility.
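As an illustration of chained stream filters, the sketch below decodes data that was Flate-compressed and then hex-encoded. Only two filters are covered, /DecodeParms (such as predictors) are ignored, and the names ascii_hex_decode, DECODERS, and decode_stream are hypothetical.

```python
import binascii
import zlib

def ascii_hex_decode(data: bytes) -> bytes:
    # Keep everything before the '>' end-of-data marker, drop white space,
    # and pad an odd number of digits with a trailing 0.
    hexdigits = b"".join(data.split(b">")[0].split())
    if len(hexdigits) % 2:
        hexdigits += b"0"
    return binascii.unhexlify(hexdigits)

DECODERS = {
    "ASCIIHexDecode": ascii_hex_decode,
    "FlateDecode": zlib.decompress,
}

def decode_stream(raw: bytes, filters: list[str]) -> bytes:
    """Apply each filter in the order listed in the stream's /Filter array."""
    for name in filters:
        if name not in DECODERS:
            raise NotImplementedError(f"filter {name} is not handled in this sketch")
        raw = DECODERS[name](raw)
    return raw

content = b"BT /F1 12 Tf (Hello) Tj ET"
encoded = binascii.hexlify(zlib.compress(content)) + b">"
print(decode_stream(encoded, ["ASCIIHexDecode", "FlateDecode"]))
# b'BT /F1 12 Tf (Hello) Tj ET'
```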
2. Two Direct Download Links for PDF Files
Here are two direct download links to sample PDF files:
- https://www.rd.usda.gov/sites/default/files/pdf-sample_0.pdf (A simple dummy PDF)
- https://icseindia.org/document/sample.pdf (A sample document PDF)
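If neither link is reachable, a tiny test file can be generated locally. The following is a minimal sketch that assembles a one-page, blank PDF with a classic xref table; build_minimal_pdf is a hypothetical helper, and the page size (US Letter, 612 x 792 points) is an arbitrary choice.

```python
def build_minimal_pdf() -> bytes:
    """Assemble a one-page, blank PDF with a classic xref table."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
    ]
    out = bytearray(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj"
        out += b"%d 0 obj\n%s\nendobj\n" % (num, body)
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # object 0 heads the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)

pdf = build_minimal_pdf()
print(len(pdf), "bytes")
```

Writing the result with open("minimal.pdf", "wb") gives a file that exercises the header, xref table, trailer, and startxref handling described in section 1.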
3. Ghost Blog Embedded HTML/JavaScript for Drag-and-Drop PDF Property Dump
This is a self-contained HTML page with embedded JavaScript that allows drag-and-drop of a PDF file. It parses the file in the browser (using ArrayBuffer) and dumps the properties from the list above to the screen. Note: This is a basic parser; it extracts header, trailer, xref offset, and key trailer values but doesn't fully decode complex objects or handle all edge cases (e.g., compressed streams require additional logic).
Drag and Drop PDF File
4. Python Class for PDF Handling
This Python class opens a PDF file, decodes/reads the structure, prints the properties, and supports writing (saving a modified version, e.g., updating version). It's a basic pure-Python parser without external libraries.
import re


class PDFHandler:
    """Minimal pure-Python reader for the top-level structure of a PDF."""

    def __init__(self, filepath):
        self.filepath = filepath
        self.data = None
        self.properties = {}
        self.load()

    def load(self):
        with open(self.filepath, 'rb') as f:
            self.data = f.read()
        self.parse_properties()

    def parse_properties(self):
        # latin1 maps each byte to one character, so string indices equal byte offsets
        text = self.data.decode('latin1')
        # Header
        header_match = re.match(rb'%PDF-(\d\.\d)', self.data)
        self.properties['Magic Signature'] = '%PDF-' if header_match else 'Missing'
        self.properties['Version'] = header_match.group(1).decode() if header_match else 'Unknown'
        # Startxref
        startxref_pos = text.rfind('startxref')
        if startxref_pos != -1:
            num_match = re.search(r'\d+', text[startxref_pos + 9:])
            if num_match:
                self.properties['Startxref'] = int(num_match.group())
        # Trailer (last one in the file; nested dictionaries are not handled)
        trailer_start = text.rfind('trailer')
        eof_pos = text.rfind('%%EOF')
        if trailer_start != -1 and eof_pos > trailer_start:
            trailer_text = text[trailer_start + 7:eof_pos]
            dict_match = re.search(r'<<\s*(.*?)>>', trailer_text, re.DOTALL)
            if dict_match:
                # Capture up to the next key so references such as "1 0 R" stay whole
                entries = re.findall(r'/(\w+)\s*([^/]*)', dict_match.group(1))
                self.properties['Trailer Properties'] = {k: v.strip() for k, v in entries}
        # Xref type: look for a standalone "xref" keyword, not the tail of "startxref"
        if re.search(rb'(?<!start)xref\b', self.data):
            self.properties['Xref Type'] = 'Table'
        elif b'/XRef' in self.data:
            self.properties['Xref Type'] = 'Stream'
        else:
            self.properties['Xref Type'] = 'Unknown'

    def print_properties(self):
        for key, value in self.properties.items():
            print(f"{key}: {value}")

    def write(self, new_filepath, updates=None):
        data = self.data
        if updates and 'version' in updates:
            # Example: update the header version. A "d.d" replacement keeps the
            # file length unchanged, so the xref byte offsets remain valid.
            new_header = b'%PDF-' + updates['version'].encode('ascii')
            data = re.sub(rb'^%PDF-\d\.\d', new_header, data, count=1)
        with open(new_filepath, 'wb') as f:
            f.write(data)


# Example usage:
# handler = PDFHandler('sample.pdf')
# handler.print_properties()
# handler.write('modified.pdf', {'version': '1.5'})
5. Java Class for PDF Handling
This Java class opens a PDF, decodes/reads, prints properties, and writes (e.g., modifies and saves). Basic parser using byte arrays.
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;
import java.util.regex.*;

public class PDFHandler {
    private final String filepath;
    private byte[] data;
    private final Map<String, Object> properties = new HashMap<>();

    public PDFHandler(String filepath) {
        this.filepath = filepath;
        load();
    }

    private void load() {
        try {
            data = Files.readAllBytes(Paths.get(filepath));
            parseProperties();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void parseProperties() {
        // ISO-8859-1 maps each byte to one char, so string indices equal byte offsets
        String text = new String(data, StandardCharsets.ISO_8859_1);
        // Header
        Matcher headerMatcher = Pattern.compile("^%PDF-(\\d\\.\\d)").matcher(text);
        boolean hasHeader = headerMatcher.find();
        properties.put("Magic Signature", hasHeader ? "%PDF-" : "Missing");
        properties.put("Version", hasHeader ? headerMatcher.group(1) : "Unknown");
        // Startxref (a long, since offsets can exceed the int range)
        int startxrefPos = text.lastIndexOf("startxref");
        if (startxrefPos != -1) {
            Matcher numMatcher = Pattern.compile("\\d+").matcher(text.substring(startxrefPos + 9));
            if (numMatcher.find()) {
                properties.put("Startxref", Long.parseLong(numMatcher.group()));
            }
        }
        // Trailer (last one in the file; nested dictionaries are not handled)
        int trailerStart = text.lastIndexOf("trailer");
        int eofPos = text.lastIndexOf("%%EOF");
        if (trailerStart != -1 && eofPos > trailerStart) {
            String trailerText = text.substring(trailerStart + 7, eofPos).trim();
            Matcher dictMatcher = Pattern.compile("<<\\s*(.*?)>>", Pattern.DOTALL).matcher(trailerText);
            if (dictMatcher.find()) {
                Map<String, String> trailerProps = new HashMap<>();
                // Capture up to the next key so references such as "1 0 R" stay whole
                Matcher entryMatcher = Pattern.compile("/(\\w+)\\s*([^/]*)").matcher(dictMatcher.group(1));
                while (entryMatcher.find()) {
                    trailerProps.put(entryMatcher.group(1), entryMatcher.group(2).trim());
                }
                properties.put("Trailer Properties", trailerProps);
            }
        }
        // Xref type: look for a standalone "xref" keyword, not the tail of "startxref"
        boolean hasTable = Pattern.compile("(?<!start)xref\\b").matcher(text).find();
        properties.put("Xref Type", hasTable ? "Table" : (text.contains("/XRef") ? "Stream" : "Unknown"));
    }

    public void printProperties() {
        properties.forEach((key, value) -> System.out.println(key + ": " + value));
    }

    public void write(String newFilepath, Map<String, String> updates) throws IOException {
        byte[] modifiedData = data.clone();
        if (updates != null && updates.containsKey("version")) {
            // A "d.d" replacement keeps the file length, so xref offsets stay valid
            String newVersion = Matcher.quoteReplacement("%PDF-" + updates.get("version"));
            modifiedData = new String(modifiedData, StandardCharsets.ISO_8859_1)
                    .replaceFirst("^%PDF-\\d\\.\\d", newVersion)
                    .getBytes(StandardCharsets.ISO_8859_1);
        }
        Files.write(Paths.get(newFilepath), modifiedData);
    }

    // Example usage:
    // public static void main(String[] args) throws IOException {
    //     PDFHandler handler = new PDFHandler("sample.pdf");
    //     handler.printProperties();
    //     handler.write("modified.pdf", Map.of("version", "1.5"));
    // }
}
6. JavaScript Class for PDF Handling
This Node.js-compatible class opens a PDF (using fs), decodes/reads, prints properties to console, and writes modifications. Requires Node.js.
const fs = require('fs');

class PDFHandler {
  constructor(filepath) {
    this.filepath = filepath;
    this.data = null;
    this.properties = {};
    this.load();
  }

  load() {
    this.data = fs.readFileSync(this.filepath);
    this.parseProperties();
  }

  parseProperties() {
    // latin1 maps each byte to one character, so string indices equal byte offsets
    const text = this.data.toString('latin1');
    // Header
    const headerMatch = text.match(/^%PDF-(\d\.\d)/);
    this.properties['Magic Signature'] = headerMatch ? '%PDF-' : 'Missing';
    this.properties['Version'] = headerMatch ? headerMatch[1] : 'Unknown';
    // Startxref
    const startxrefPos = text.lastIndexOf('startxref');
    if (startxrefPos !== -1) {
      const startxrefMatch = text.slice(startxrefPos + 9).match(/\d+/);
      this.properties['Startxref'] = startxrefMatch ? parseInt(startxrefMatch[0], 10) : 'Unknown';
    }
    // Trailer (last one in the file; nested dictionaries are not handled)
    const trailerStart = text.lastIndexOf('trailer');
    const eofPos = text.lastIndexOf('%%EOF');
    if (trailerStart !== -1 && eofPos > trailerStart) {
      const trailerText = text.slice(trailerStart + 7, eofPos).trim();
      const dictMatch = trailerText.match(/<<\s*(.*?)>>/s);
      if (dictMatch) {
        const trailerProps = {};
        // Capture up to the next key so references such as "1 0 R" stay whole
        for (const [, key, value] of dictMatch[1].matchAll(/\/(\w+)\s*([^/]*)/g)) {
          trailerProps[key] = value.trim();
        }
        this.properties['Trailer Properties'] = trailerProps;
      }
    }
    // Xref type: look for a standalone "xref" keyword, not the tail of "startxref"
    const hasTable = /(?<!start)xref\b/.test(text);
    this.properties['Xref Type'] = hasTable ? 'Table' : (text.includes('/XRef') ? 'Stream' : 'Unknown');
  }

  printProperties() {
    for (const [key, value] of Object.entries(this.properties)) {
      console.log(`${key}: ${JSON.stringify(value, null, 2)}`);
    }
  }

  write(newFilepath, updates = {}) {
    let modifiedData = this.data;
    if (updates.version) {
      // A "d.d" replacement keeps the file length, so xref offsets stay valid
      const newVersion = `%PDF-${updates.version}`;
      modifiedData = Buffer.from(this.data.toString('latin1').replace(/^%PDF-\d\.\d/, newVersion), 'latin1');
    }
    fs.writeFileSync(newFilepath, modifiedData);
  }
}

// Example usage:
// const handler = new PDFHandler('sample.pdf');
// handler.printProperties();
// handler.write('modified.pdf', { version: '1.5' });
7. C++ Class for PDF Handling
This C++ class opens a PDF, decodes/reads, prints properties to console, and writes modifications. Uses std::regex for parsing.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <map>
#include <regex>
#include <sstream>
#include <string>

class PDFHandler {
private:
    std::string filepath;
    std::string data;
    std::map<std::string, std::string> properties; // Simplified to string values for demo

    // True if a standalone "xref" keyword exists (not the tail of "startxref")
    bool hasXrefTable() const {
        for (size_t pos = data.find("xref"); pos != std::string::npos; pos = data.find("xref", pos + 1)) {
            if (pos < 5 || data.compare(pos - 5, 5, "start") != 0) return true;
        }
        return false;
    }

public:
    explicit PDFHandler(const std::string& fp) : filepath(fp) {
        load();
    }

    void load() {
        std::ifstream file(filepath, std::ios::binary);
        if (file) {
            std::ostringstream oss;
            oss << file.rdbuf();
            data = oss.str();
            parseProperties();
        }
    }

    void parseProperties() {
        std::smatch match;
        // Header: only the first bytes are searched, which keeps std::regex off the full file
        const std::string head = data.substr(0, 16);
        std::regex headerRegex(R"(^%PDF-(\d\.\d))");
        if (std::regex_search(head, match, headerRegex)) {
            properties["Magic Signature"] = "%PDF-";
            properties["Version"] = match[1].str();
        } else {
            properties["Magic Signature"] = "Missing";
            properties["Version"] = "Unknown";
        }
        // Startxref
        size_t startxrefPos = data.rfind("startxref");
        if (startxrefPos != std::string::npos) {
            std::regex numRegex(R"(\d+)");
            std::sregex_iterator iter(data.cbegin() + startxrefPos + 9, data.cend(), numRegex);
            if (iter != std::sregex_iterator()) {
                properties["Startxref"] = (*iter)[0].str();
            }
        }
        // Trailer (last one in the file; nested dictionaries are not handled)
        size_t trailerStart = data.rfind("trailer");
        size_t eofPos = data.rfind("%%EOF");
        if (trailerStart != std::string::npos && eofPos != std::string::npos && eofPos > trailerStart) {
            std::string trailerText = data.substr(trailerStart + 7, eofPos - trailerStart - 7);
            // std::regex has no dotall flag; [\s\S] matches any character including newlines
            std::regex dictRegex(R"(<<\s*([\s\S]*?)>>)");
            if (std::regex_search(trailerText, match, dictRegex)) {
                std::string dictContent = match[1].str();
                // Capture up to the next key so references such as "1 0 R" stay whole
                std::regex entryRegex(R"(/(\w+)\s*([^/]*))");
                std::sregex_iterator entryIter(dictContent.cbegin(), dictContent.cend(), entryRegex);
                std::string trailerProps;
                for (; entryIter != std::sregex_iterator(); ++entryIter) {
                    std::string value = (*entryIter)[2].str();
                    value.erase(value.find_last_not_of(" \t\r\n") + 1); // trim trailing white space
                    trailerProps += (*entryIter)[1].str() + ": " + value + ", ";
                }
                properties["Trailer Properties"] = trailerProps;
            }
        }
        // Xref type
        properties["Xref Type"] = hasXrefTable() ? "Table" : ((data.find("/XRef") != std::string::npos) ? "Stream" : "Unknown");
    }

    void printProperties() const {
        for (const auto& prop : properties) {
            std::cout << prop.first << ": " << prop.second << std::endl;
        }
    }

    void write(const std::string& newFilepath, const std::map<std::string, std::string>& updates) {
        std::string modifiedData = data;
        auto it = updates.find("version");
        if (it != updates.end()) {
            // Rewrite only the header; a "d.d" replacement keeps the file length,
            // so xref offsets stay valid
            const size_t headLen = std::min<size_t>(16, modifiedData.size());
            std::regex versionRegex(R"(^%PDF-\d\.\d)");
            std::string head = std::regex_replace(modifiedData.substr(0, headLen), versionRegex,
                                                  "%PDF-" + it->second, std::regex_constants::format_first_only);
            modifiedData.replace(0, headLen, head);
        }
        std::ofstream outFile(newFilepath, std::ios::binary);
        outFile << modifiedData;
    }
};

// Example usage:
// int main() {
//     PDFHandler handler("sample.pdf");
//     handler.printProperties();
//     std::map<std::string, std::string> updates = {{"version", "1.5"}};
//     handler.write("modified.pdf", updates);
//     return 0;
// }