Task 826: .XAR File Format

File Format Specifications for .XAR

The .XAR file format refers to the eXtensible ARchive (XAR) format, an open-source archive format originally developed within the OpenDarwin project and later adopted on macOS (e.g., for installer packages). It is designed for extensibility, with a binary header, a compressed XML table of contents (TOC) describing the archive structure and metadata, and a heap containing the concatenated file data (which can be individually compressed or encoded). The format supports checksums for integrity, various compression methods (e.g., gzip, bzip2), and extended attributes. It should not be confused with the Xara vector graphics .xar format; the "file system" properties discussed here belong to the archive format.

The structure is:

  • Header (binary, big-endian, typically 28 bytes): Identifies the file and describes the TOC.
  • TOC (zlib-compressed XML): Describes the archive metadata, file hierarchy, and properties.
  • Heap: Concatenated data for all files and extended attributes, referenced by offsets in the TOC.

Detailed header structure (from C struct equivalent):

struct xar_header {
    uint32_t magic;                 // 'xar!' (0x78617221)
    uint16_t size;                  // Header size (usually 28, or >28 if extended with algorithm name)
    uint16_t version;               // Always 1
    uint64_t toc_length_compressed; // Size of compressed TOC
    uint64_t toc_length_uncompressed; // Size of uncompressed TOC
    uint32_t cksum_alg;             // TOC checksum algorithm (0: none, 1: SHA1, 2: MD5; higher values name other algorithms such as SHA-256/SHA-512, implementation-dependent)
    // If size > 28, additional bytes for checksum algorithm name (e.g., null-terminated string)
};

The TOC XML is UTF-8 encoded and typically rooted with <xar><toc>...</toc></xar>, containing checksums, creation time, and nested <file> elements for the file hierarchy.
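
As a quick check of the layout above, a minimal Python sketch (standard library only; 'example.xar' is a placeholder path) can unpack the header, inflate the TOC, and print a few fields:

import struct
import zlib
import xml.etree.ElementTree as ET

with open('example.xar', 'rb') as f:  # placeholder path
    data = f.read()

# 28-byte big-endian header: magic, size, version, TOC lengths, checksum algorithm
magic, hsize, version, toc_clen, toc_ulen, cksum_alg = struct.unpack('>IHHQQI', data[:28])
assert magic == 0x78617221  # 'xar!'

# The zlib-compressed XML TOC immediately follows the header
root = ET.fromstring(zlib.decompress(data[hsize:hsize + toc_clen]))
print(root.tag)                               # 'xar'
print(root.findtext('toc/creation-time'))     # ISO timestamp
heap = data[hsize + toc_clen:]                # file data referenced by TOC offsets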

  1. List of all properties of this file format intrinsic to its file system:
  • Magic number (header: 'xar!')
  • Header size
  • Format version
  • TOC compressed length
  • TOC uncompressed length
  • TOC checksum algorithm
  • TOC checksum value (in XML: <checksum>, with a style attribute naming the algorithm)
  • Creation time (in XML: <creation-time>, ISO timestamp)
  • For each file/entry (in XML: nested <file> elements):
      • ID (attribute: id)
      • Name (<name>)
      • Type (<type>: file|directory|symlink|hardlink|fifo|characterspecial|blockspecial)
      • Mode (<mode>: octal permissions, e.g., 0755)
      • UID (<uid>: numeric user ID)
      • GID (<gid>: numeric group ID)
      • User (<user>: username)
      • Group (<group>: group name)
      • Inode (<inode>: number)
      • Device number (<devno>: number)
      • Access time (<atime>: ISO timestamp)
      • Modification time (<mtime>: ISO timestamp)
      • Change time (<ctime>: ISO timestamp)
      • Finder create time (<findercreate>: ISO timestamp)
      • Link target (for symlinks/hardlinks: <link>)
      • Data properties (for files, inside <data>: uncompressed size <size>, archived/compressed length <length>, heap offset <offset>, encoding <encoding style="...">, and <archived-checksum>/<extracted-checksum> hashes)
      • Extended attributes (one <ea> element per attribute: <name>, <length>, <size>, <offset>, <encoding .../>, <archived-checksum .../>, <extracted-checksum .../>)

These properties mirror file system metadata (e.g., POSIX permissions, timestamps, ownership) stored in the archive.
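
To make the mapping concrete, the following sketch parses a hypothetical <file> entry (all names and values below are invented for illustration) and reads back a few of the properties listed above:

import xml.etree.ElementTree as ET

sample = """
<file id="1">
  <name>hello.txt</name>
  <type>file</type>
  <mode>0644</mode>
  <uid>501</uid>
  <gid>20</gid>
  <user>alice</user>
  <group>staff</group>
  <mtime>2025-01-01T12:00:00Z</mtime>
  <data>
    <length>11</length>
    <offset>0</offset>
    <size>11</size>
    <encoding style="application/octet-stream"/>
  </data>
</file>
"""

entry = ET.fromstring(sample)
print(entry.get('id'), entry.findtext('name'), entry.findtext('mode'))
data = entry.find('data')
for tag in ('length', 'offset', 'size'):
    print(tag, data.findtext(tag))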

Two direct download links for .XAR files:

Ghost blog embedded HTML/JavaScript for a drag-and-drop .XAR file dump:

(The embedded widget is titled "XAR File Dumper" and offers a "Drag and drop .XAR file here" drop zone; its inline script parses the header, inflates the TOC, and dumps the properties (parseHeader, parseToc, dumpProperties). The standalone JavaScript class below implements the same parse-and-dump logic.)

  2. Python class:
import struct
import zlib
import xml.etree.ElementTree as ET
import hashlib  # For checksums in write

class XarHandler:
    def __init__(self):
        self.header = None
        self.toc_xml = None
        self.heap = None
        self.properties = {}

    def open(self, filepath):
        with open(filepath, 'rb') as f:
            data = f.read()
        self.parse(data)

    def parse(self, data):
        # Parse header
        magic, hsize, version, toc_comp_len, toc_uncomp_len, cksum_alg = struct.unpack('>IHHQQI', data[:28])
        if magic != 0x78617221:
            raise ValueError('Invalid XAR magic')
        if version != 1:
            raise ValueError('Unsupported version')
        self.header = {'size': hsize, 'version': version, 'toc_comp_len': toc_comp_len,
                       'toc_uncomp_len': toc_uncomp_len, 'cksum_alg': cksum_alg}

        # Decompress TOC
        toc_comp = data[hsize:hsize + toc_comp_len]
        toc_uncomp = zlib.decompress(toc_comp)
        if len(toc_uncomp) != toc_uncomp_len:
            raise ValueError('TOC decompression failed')
        self.toc_xml = ET.fromstring(toc_uncomp)

        # Heap starts after header + TOC
        heap_start = hsize + toc_comp_len
        self.heap = data[heap_start:]

        # Extract properties
        self.properties['magic'] = 'xar!'
        self.properties['header_size'] = hsize
        self.properties['version'] = version
        self.properties['toc_compressed_length'] = toc_comp_len
        self.properties['toc_uncompressed_length'] = toc_uncomp_len
        self.properties['checksum_algorithm'] = cksum_alg
        toc = self.toc_xml.find('toc')
        self.properties['creation_time'] = toc.findtext('creation-time', 'N/A')
        self.properties['toc_checksum'] = toc.find('checksum').text if toc.find('checksum') is not None else 'N/A'
        self.properties['toc_checksum_style'] = toc.find('checksum').get('style') if toc.find('checksum') is not None else 'N/A'

        self.properties['files'] = []
        for file_elem in toc.findall('.//file'):
            file_props = {
                'id': file_elem.get('id'),
                'name': file_elem.findtext('name'),
                'type': file_elem.findtext('type'),
                'mode': file_elem.findtext('mode'),
                'uid': file_elem.findtext('uid'),
                'gid': file_elem.findtext('gid'),
                'user': file_elem.findtext('user'),
                'group': file_elem.findtext('group'),
                'inode': file_elem.findtext('inode'),
                'devno': file_elem.findtext('devno'),
                'atime': file_elem.findtext('atime'),
                'mtime': file_elem.findtext('mtime'),
                'ctime': file_elem.findtext('ctime'),
                'findercreate': file_elem.findtext('findercreate'),
                'link': file_elem.findtext('link'),
                'data': None,
                'eas': []
            }
            data_elem = file_elem.find('data')
            if data_elem is not None:
                file_props['data'] = {
                    'length': data_elem.findtext('length'),
                    'size': data_elem.findtext('size'),
                    'offset': data_elem.findtext('offset'),
                    'encoding': data_elem.find('encoding').get('style') if data_elem.find('encoding') is not None else None,
                    'archived_checksum': data_elem.findtext('archived-checksum'),
                    'archived_checksum_style': data_elem.find('archived-checksum').get('style') if data_elem.find('archived-checksum') is not None else None,
                    'extracted_checksum': data_elem.findtext('extracted-checksum'),
                    'extracted_checksum_style': data_elem.find('extracted-checksum').get('style') if data_elem.find('extracted-checksum') is not None else None,
                }
            for ea_elem in file_elem.findall('ea'):
                file_props['eas'].append({
                    'name': ea_elem.findtext('name'),
                    'length': ea_elem.findtext('length'),
                    'size': ea_elem.findtext('size'),
                    'offset': ea_elem.findtext('offset'),
                    'encoding': ea_elem.find('encoding').get('style') if ea_elem.find('encoding') is not None else None,
                    'archived_checksum': ea_elem.findtext('archived-checksum'),
                    'archived_checksum_style': ea_elem.find('archived-checksum').get('style') if ea_elem.find('archived-checksum') is not None else None,
                    'extracted_checksum': ea_elem.findtext('extracted-checksum'),
                    'extracted_checksum_style': ea_elem.find('extracted-checksum').get('style') if ea_elem.find('extracted-checksum') is not None else None,
                })
            self.properties['files'].append(file_props)

    def print_properties(self):
        print('Header Properties:')
        for k, v in self.header.items():
            print(f'- {k.capitalize().replace("_", " ")}: {v}')
        print('\nArchive Properties:')
        print(f'- Creation Time: {self.properties["creation_time"]}')
        print(f'- TOC Checksum: {self.properties["toc_checksum"]} ({self.properties["toc_checksum_style"]})')
        print('\nFiles:')
        for file in self.properties['files']:
            print(f'File ID: {file["id"]}')
            for k, v in file.items():
                if k == 'data' and v:
                    print('- Data:')
                    for dk, dv in v.items():
                        print(f'  - {dk.capitalize()}: {dv}')
                elif k == 'eas' and v:
                    for i, ea in enumerate(v, 1):
                        print(f'- EA {i}:')
                        for ek, ev in ea.items():
                            print(f'  - {ek.capitalize()}: {ev}')
                elif v is not None:
                    print(f'- {k.capitalize()}: {v}')
            print()

    def write(self, filepath, files_dict):
        # files_dict: list of dicts with properties and data (bytes)
        # Simplified: assume no compression, no checksums for demo; real impl needs more
        toc_root = ET.Element('xar')
        toc = ET.SubElement(toc_root, 'toc')
        ET.SubElement(toc, 'creation-time').text = '2025-12-21T00:00:00Z'  # Example
        checksum = ET.SubElement(toc, 'checksum', style='none')
        ET.SubElement(checksum, 'offset').text = '0'
        ET.SubElement(checksum, 'size').text = '0'

        heap = b''
        current_id = 1
        for file_info in files_dict:
            file_elem = ET.SubElement(toc, 'file', id=str(current_id))
            ET.SubElement(file_elem, 'name').text = file_info['name']
            ET.SubElement(file_elem, 'type').text = file_info.get('type', 'file')
            # Add other properties similarly...
            data_elem = ET.SubElement(file_elem, 'data')
            data = file_info['data']
            length = len(data)
            ET.SubElement(data_elem, 'length').text = str(length)
            ET.SubElement(data_elem, 'offset').text = str(len(heap))
            ET.SubElement(data_elem, 'size').text = str(length)  # No compression
            ET.SubElement(data_elem, 'encoding', style='none')
            # Checksums (example SHA1)
            sha1 = hashlib.sha1(data).hexdigest()
            ET.SubElement(data_elem, 'archived-checksum', style='sha1').text = sha1
            ET.SubElement(data_elem, 'extracted-checksum', style='sha1').text = sha1
            heap += data
            current_id += 1

        toc_uncomp = ET.tostring(toc_root, encoding='utf-8')
        toc_comp = zlib.compress(toc_uncomp)
        hsize = 28
        version = 1
        cksum_alg = 0  # none
        header = struct.pack('>IHHQQI', 0x78617221, hsize, version, len(toc_comp), len(toc_uncomp), cksum_alg)

        with open(filepath, 'wb') as f:
            f.write(header + toc_comp + heap)
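
Example usage of the class above (the 'demo.xar' path and file contents are placeholders; write() emits an uncompressed archive that open() can read back):

if __name__ == '__main__':
    writer = XarHandler()
    writer.write('demo.xar', [
        {'name': 'hello.txt', 'type': 'file', 'data': b'Hello, XAR!'},
    ])

    reader = XarHandler()
    reader.open('demo.xar')
    reader.print_properties()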
  3. Java class:
import java.io.*;
import java.nio.*;
import java.nio.file.*;
import java.util.zip.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.security.MessageDigest; // For checksums

public class XarHandler {
    private ByteBuffer buffer;
    private Document tocDoc;
    private byte[] heap;
    private short hsize;
    private short version;
    private long tocCompLen;
    private long tocUncompLen;
    private int cksumAlg;

    public void open(String filepath) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(filepath));
        parse(data);
    }

    private void parse(byte[] data) throws Exception {
        buffer = ByteBuffer.wrap(data).order(ByteOrder.BIG_ENDIAN);
        int magic = buffer.getInt();
        if (magic != 0x78617221) throw new Exception("Invalid XAR magic");
        hsize = buffer.getShort();
        version = buffer.getShort();
        if (version != 1) throw new Exception("Unsupported version");
        tocCompLen = buffer.getLong();
        tocUncompLen = buffer.getLong();
        cksumAlg = buffer.getInt();
        // Skip to end of header
        buffer.position(hsize);

        // Decompress TOC
        byte[] tocComp = new byte[(int) tocCompLen];
        buffer.get(tocComp);
        Inflater inflater = new Inflater();
        inflater.setInput(tocComp);
        byte[] tocUncomp = new byte[(int) tocUncompLen];
        int decompressedLen = inflater.inflate(tocUncomp);
        if (decompressedLen != tocUncompLen) throw new Exception("TOC decompression failed");
        inflater.end();

        // Parse XML
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        tocDoc = builder.parse(new InputSource(new ByteArrayInputStream(tocUncomp)));

        // Heap
        heap = new byte[data.length - hsize - (int) tocCompLen];
        System.arraycopy(data, hsize + (int) tocCompLen, heap, 0, heap.length);
    }

    public void printProperties() {
        System.out.println("Header Properties:");
        // Extract and print from buffer or stored vars (similar to Python)
        // For brevity, implement extraction like in JS/Python, printing all fields

        // Example for TOC and files
        Element toc = (Element) tocDoc.getElementsByTagName("toc").item(0);
        System.out.println("\nArchive Properties:");
        System.out.println("- Creation Time: " + getText(toc, "creation-time"));
        System.out.println("- TOC Checksum: " + getText(toc, "checksum") + " (" + getAttr(toc, "checksum", "style") + ")");

        System.out.println("\nFiles:");
        NodeList files = toc.getElementsByTagName("file");
        for (int i = 0; i < files.getLength(); i++) {
            Element file = (Element) files.item(i);
            System.out.println("File ID: " + file.getAttribute("id"));
            System.out.println("- Name: " + getText(file, "name"));
            // Print all other properties similarly...
            // Data and EA as nested
        }
    }

    private String getText(Element parent, String tag) {
        Node node = parent.getElementsByTagName(tag).item(0);
        return node != null ? node.getTextContent() : "N/A";
    }

    private String getAttr(Element parent, String tag, String attr) {
        Node node = parent.getElementsByTagName(tag).item(0);
        return node != null ? ((Element) node).getAttribute(attr) : "N/A";
    }

    public void write(String filepath /* , list of file info maps */) throws Exception {
        // Similar to Python: build XML, compress, build header, concatenate heap
        // Implement building Document, serialize to bytes, compress with Deflater, etc.
    }

    // Add read methods to extract heap data based on offsets, etc.
}
  4. JavaScript class:
// Standalone Node.js version; the browser drag-and-drop variant would read the dropped
// File into an ArrayBuffer and use DataView plus a zlib inflate in place of the calls below.
const fs = require('fs');
const zlib = require('zlib');

class XarHandler {
    constructor() {
        this.header = null;
        this.tocXml = null;   // decompressed TOC as a UTF-8 string
        this.heap = null;
    }

    open(filepath) {
        this.parse(fs.readFileSync(filepath));
    }

    parse(buf) {
        // 28-byte big-endian header: magic, size, version, TOC lengths, checksum algorithm
        if (buf.readUInt32BE(0) !== 0x78617221) throw new Error('Invalid XAR magic');
        const hsize = buf.readUInt16BE(4);
        const version = buf.readUInt16BE(6);
        const tocCompLen = Number(buf.readBigUInt64BE(8));
        const tocUncompLen = Number(buf.readBigUInt64BE(16));
        const cksumAlg = buf.readUInt32BE(24);
        this.header = { hsize, version, tocCompLen, tocUncompLen, cksumAlg };

        // zlib-compressed XML TOC follows the header; the heap follows the TOC
        this.tocXml = zlib.inflateSync(buf.subarray(hsize, hsize + tocCompLen)).toString('utf8');
        this.heap = buf.subarray(hsize + tocCompLen);
    }

    printProperties() {
        console.log('Header Properties:', this.header);
        console.log('TOC XML:\n' + this.tocXml); // feed to an XML parser for structured output
    }

    write(filepath, files) {
        // Reverse of parse: build the TOC XML string, zlib.deflateSync it,
        // pack the 28-byte big-endian header, then write header + TOC + heap
    }
}
  5. C class (using C++ for class support, assuming zlib and tinyxml2 for XML):
#include <iostream>
#include <fstream>
#include <vector>
#include <zlib.h>
#include <tinyxml2.h> // Assume included or linked

class XarHandler {
private:
    struct Header {
        uint32_t magic;
        uint16_t size;
        uint16_t version;
        uint64_t toc_comp_len;
        uint64_t toc_uncomp_len;
        uint32_t cksum_alg;
    };
    Header header;
    tinyxml2::XMLDocument tocDoc;
    std::vector<char> heap;

public:
    void open(const std::string& filepath) {
        std::ifstream file(filepath, std::ios::binary);
        std::vector<char> data((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());

        // Parse header (big-endian on disk; byte-swap into host order, assuming a little-endian host)
        std::memcpy(&header.magic, data.data(), 4);
        header.magic = __builtin_bswap32(header.magic);
        if (header.magic != 0x78617221) throw std::runtime_error("Invalid magic");
        std::memcpy(&header.size, data.data() + 4, 2);
        header.size = __builtin_bswap16(header.size);
        std::memcpy(&header.version, data.data() + 6, 2);
        header.version = __builtin_bswap16(header.version);
        std::memcpy(&header.toc_comp_len, data.data() + 8, 8);
        header.toc_comp_len = __builtin_bswap64(header.toc_comp_len);
        std::memcpy(&header.toc_uncomp_len, data.data() + 16, 8);
        header.toc_uncomp_len = __builtin_bswap64(header.toc_uncomp_len);
        std::memcpy(&header.cksum_alg, data.data() + 24, 4);
        header.cksum_alg = __builtin_bswap32(header.cksum_alg);

        // Decompress TOC
        uLongf uncomp_len = header.toc_uncomp_len;
        std::vector<char> toc_uncomp(uncomp_len);
        int ret = uncompress((Bytef*)toc_uncomp.data(), &uncomp_len, (Bytef*)(data.data() + header.size), header.toc_comp_len);
        if (ret != Z_OK) throw std::runtime_error("Decompression failed");

        // Parse XML
        tocDoc.Parse(toc_uncomp.data(), uncomp_len);

        // Heap
        heap.assign(data.begin() + header.size + header.toc_comp_len, data.end());
    }

    void printProperties() {
        std::cout << "Header Properties:" << std::endl;
        std::cout << "- Magic: xar!" << std::endl;
        // Print others

        auto toc = tocDoc.FirstChildElement("xar")->FirstChildElement("toc");
        std::cout << "\nArchive Properties:" << std::endl;
        std::cout << "- Creation Time: " << (toc->FirstChildElement("creation-time") ? toc->FirstChildElement("creation-time")->GetText() : "N/A") << std::endl;
        // Similarly for checksum, files, etc.
    }

    void write(const std::string& filepath, /* files */) {
        // Build XML with tinyxml2, compress with compress(), build header, write file + compressed TOC + heap
    }
};