Task 826: .XAR File Format
File Format Specifications for .XAR
The .XAR file format is the eXtensible ARchive (XAR) format, an open-source archive format originally developed for macOS and OpenDarwin. It is designed for extensibility and consists of a binary header, a compressed XML table of contents (TOC) describing the archive structure and metadata, and a heap containing the concatenated file data (each item can be individually compressed or encoded). The format supports checksums for integrity, several compression methods (e.g., gzip, bzip2), and extended attributes. It should not be confused with the Xara vector graphics format, which also uses the .xar extension; the file system properties discussed here apply only to the archive format.
The structure is:
- Header (binary, big-endian, typically 28 bytes): Identifies the file and describes the TOC.
- TOC (zlib-compressed XML): Describes the archive metadata, file hierarchy, and properties.
- Heap: Concatenated data for all files and extended attributes, referenced by offsets in the TOC.
Detailed header structure (equivalent C struct):
struct xar_header {
    uint32_t magic;                   // 'xar!' (0x78617221)
    uint16_t size;                    // Header size (usually 28, or >28 if extended with algorithm name)
    uint16_t version;                 // Always 1
    uint64_t toc_length_compressed;   // Size of compressed TOC
    uint64_t toc_length_uncompressed; // Size of uncompressed TOC
    uint32_t cksum_alg;               // TOC checksum algorithm (0: none, 1: SHA1, 2: MD5; values from 3 up depend on the implementation, e.g. SHA-256/SHA-512 or a named algorithm)
    // If size > 28, additional bytes for checksum algorithm name (e.g., null-terminated string)
};
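The header layout above maps onto a single fixed-width, big-endian unpack. The following is a minimal sketch in Python, assuming the 28-byte base layout shown in the struct. `XarHeader` and `parse_xar_header` are hypothetical names used only for illustration, and any bytes past offset 28 are returned raw rather than interpreted.

```python
import struct
from dataclasses import dataclass

XAR_MAGIC = 0x78617221                  # ASCII 'xar!'
BASE_HEADER = struct.Struct(">IHHQQI")  # big-endian, 28 bytes, no padding


@dataclass
class XarHeader:
    size: int
    version: int
    toc_length_compressed: int
    toc_length_uncompressed: int
    cksum_alg: int
    extra: bytes  # bytes past offset 28, e.g. a checksum algorithm name


def parse_xar_header(data: bytes) -> XarHeader:
    """Decode the fixed header fields and keep any extension bytes as-is."""
    if len(data) < BASE_HEADER.size:
        raise ValueError("too short for a XAR header")
    magic, size, version, comp, uncomp, alg = BASE_HEADER.unpack_from(data, 0)
    if magic != XAR_MAGIC:
        raise ValueError("bad magic")
    if size < BASE_HEADER.size or size > len(data):
        raise ValueError("implausible header size")
    return XarHeader(size, version, comp, uncomp, alg, data[BASE_HEADER.size:size])


# Usage: a synthetic 28-byte header announcing a 100-byte compressed TOC.
sample = BASE_HEADER.pack(XAR_MAGIC, 28, 1, 100, 250, 1)
header = parse_xar_header(sample)
print(header.toc_length_compressed, header.extra)
```

The compressed TOC then starts at offset `header.size`, and the heap starts at `header.size + header.toc_length_compressed`.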
The TOC XML is UTF-8 encoded and typically rooted with <xar><toc>...</toc></xar>, containing checksums, creation time, and nested <file> elements for the file hierarchy.
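Because `<file>` elements nest to express the directory hierarchy, full paths have to be rebuilt by walking the tree. The sketch below shows one way to do that in Python. The sample TOC is hand-written for illustration (not taken from a real archive), and `walk_files` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical TOC, reduced to the elements needed for the walk.
SAMPLE_TOC = """<xar>
  <toc>
    <creation-time>2025-01-01T00:00:00</creation-time>
    <file id="1">
      <name>docs</name>
      <type>directory</type>
      <file id="2">
        <name>readme.txt</name>
        <type>file</type>
      </file>
    </file>
    <file id="3">
      <name>run.sh</name>
      <type>file</type>
    </file>
  </toc>
</xar>"""


def walk_files(parent, prefix=""):
    """Yield (path, type) for each <file>, recursing into nested entries."""
    for elem in parent.findall("file"):  # direct children only
        path = prefix + elem.findtext("name", default="")
        yield path, elem.findtext("type")
        yield from walk_files(elem, path + "/")


toc = ET.fromstring(SAMPLE_TOC).find("toc")
entries = list(walk_files(toc))
for path, kind in entries:
    print(kind, path)
```

Using `findall("file")` rather than `findall(".//file")` keeps each level separate, so the parent path is known when a nested entry is visited.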
- List of all properties of this file format intrinsic to its file system:
  - Magic number (header: 'xar!')
  - Header size
  - Format version
  - TOC compressed length
  - TOC uncompressed length
  - TOC checksum algorithm
  - TOC checksum value (in XML: <checksum>value</checksum>)
  - Creation time (in XML: <creation-time>, ISO timestamp)
  - For each file/entry (in XML: <file> elements):
    - ID (attribute: id)
    - Name (<name>...</name>)
    - Type (<type>: file|directory|symlink|hardlink|fifo|characterspecial|blockspecial)
    - Mode (<mode>: octal permissions, e.g., 0755)
    - UID (<uid>: numeric user ID)
    - GID (<gid>: numeric group ID)
    - User (<user>: username)
    - Group (<group>: groupname)
    - Inode (<inode>: number)
    - Device number (<devno>: number)
    - Access time (<atime>: ISO timestamp)
    - Modification time (<mtime>: ISO timestamp)
    - Change time (<ctime>: ISO timestamp)
    - Finder create time (<findercreate>: ISO timestamp)
    - Link target (for symlinks/hardlinks: <link>target</link>)
    - Data properties (for files: <data> containing <size> uncompressed size, <length> compressed size, <offset> heap offset, <encoding .../>, <archived-checksum ...>hash</archived-checksum>, <extracted-checksum ...>hash</extracted-checksum>)
    - Extended attributes (<ea> containing <name> attribute name, <length>, <size>, <offset>, <encoding .../>, <archived-checksum .../>, <extracted-checksum .../>)
These properties mirror file system metadata (e.g., POSIX permissions, timestamps, ownership) stored in the archive.
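Since the TOC stores these values as text, a reader usually has to convert them back into native types before applying them to a file system. The following Python sketch illustrates the idea for the mode, type, UID, and modification time. The `entry` values are hypothetical, the timestamp is assumed to use the `YYYY-MM-DDTHH:MM:SSZ` form, and `decode_entry` is an illustrative name.

```python
import stat
from datetime import datetime, timezone

# Map TOC type names to POSIX file-type bits.
# A hardlink is treated as a regular file here (an assumption of this sketch).
TYPE_BITS = {
    "file": stat.S_IFREG,
    "hardlink": stat.S_IFREG,
    "directory": stat.S_IFDIR,
    "symlink": stat.S_IFLNK,
    "fifo": stat.S_IFIFO,
    "characterspecial": stat.S_IFCHR,
    "blockspecial": stat.S_IFBLK,
}


def decode_entry(entry):
    """Convert the text values of one TOC entry into native Python types."""
    perm = int(entry["mode"], 8)  # mode is an octal string such as "0755"
    type_bits = TYPE_BITS.get(entry["type"], 0)
    mtime = datetime.strptime(entry["mtime"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "mode_string": stat.filemode(type_bits | perm),
        "uid": int(entry["uid"]),
        "mtime": mtime.replace(tzinfo=timezone.utc),
    }


# Hypothetical values as they might appear in a <file> entry.
entry = {"type": "file", "mode": "0755", "uid": "501",
         "mtime": "2024-03-01T12:30:00Z"}
decoded = decode_entry(entry)
print(decoded["mode_string"], decoded["uid"], decoded["mtime"].isoformat())
```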
Two direct download links for .XAR files:
- https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/xar/xar-1.5.1.xar
- https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/xar/xar-1.5.2.xar
Ghost blog embedded HTML JavaScript for drag-and-drop .XAR file dump:
- Python class:
import struct
import zlib
import xml.etree.ElementTree as ET
import os
import hashlib  # For checksums in write


class XarHandler:
    def __init__(self):
        self.header = None
        self.toc_xml = None
        self.heap = None
        self.properties = {}

    def open(self, filepath):
        with open(filepath, 'rb') as f:
            data = f.read()
        self.parse(data)

    def parse(self, data):
        # Parse header
        magic, hsize, version, toc_comp_len, toc_uncomp_len, cksum_alg = struct.unpack('>IHHQQI', data[:28])
        if magic != 0x78617221:
            raise ValueError('Invalid XAR magic')
        if version != 1:
            raise ValueError('Unsupported version')
        self.header = {'size': hsize, 'version': version, 'toc_comp_len': toc_comp_len,
                       'toc_uncomp_len': toc_uncomp_len, 'cksum_alg': cksum_alg}
        # Decompress TOC
        toc_comp = data[hsize:hsize + toc_comp_len]
        toc_uncomp = zlib.decompress(toc_comp)
        if len(toc_uncomp) != toc_uncomp_len:
            raise ValueError('TOC decompression failed')
        self.toc_xml = ET.fromstring(toc_uncomp)
        # Heap starts after header + TOC
        heap_start = hsize + toc_comp_len
        self.heap = data[heap_start:]
        # Extract properties
        self.properties['magic'] = 'xar!'
        self.properties['header_size'] = hsize
        self.properties['version'] = version
        self.properties['toc_compressed_length'] = toc_comp_len
        self.properties['toc_uncompressed_length'] = toc_uncomp_len
        self.properties['checksum_algorithm'] = cksum_alg
        toc = self.toc_xml.find('toc')
        self.properties['creation_time'] = toc.findtext('creation-time', 'N/A')
        self.properties['toc_checksum'] = toc.find('checksum').text if toc.find('checksum') is not None else 'N/A'
        self.properties['toc_checksum_style'] = toc.find('checksum').get('style') if toc.find('checksum') is not None else 'N/A'
        self.properties['files'] = []
        for file_elem in toc.findall('.//file'):
            file_props = {
                'id': file_elem.get('id'),
                'name': file_elem.findtext('name'),
                'type': file_elem.findtext('type'),
                'mode': file_elem.findtext('mode'),
                'uid': file_elem.findtext('uid'),
                'gid': file_elem.findtext('gid'),
                'user': file_elem.findtext('user'),
                'group': file_elem.findtext('group'),
                'inode': file_elem.findtext('inode'),
                'devno': file_elem.findtext('devno'),
                'atime': file_elem.findtext('atime'),
                'mtime': file_elem.findtext('mtime'),
                'ctime': file_elem.findtext('ctime'),
                'findercreate': file_elem.findtext('findercreate'),
                'link': file_elem.findtext('link'),
                'data': None,
                'eas': []
            }
            data_elem = file_elem.find('data')
            if data_elem is not None:
                file_props['data'] = {
                    'length': data_elem.findtext('length'),
                    'size': data_elem.findtext('size'),
                    'offset': data_elem.findtext('offset'),
                    'encoding': data_elem.find('encoding').get('style') if data_elem.find('encoding') is not None else None,
                    'archived_checksum': data_elem.findtext('archived-checksum'),
                    'archived_checksum_style': data_elem.find('archived-checksum').get('style') if data_elem.find('archived-checksum') is not None else None,
                    'extracted_checksum': data_elem.findtext('extracted-checksum'),
                    'extracted_checksum_style': data_elem.find('extracted-checksum').get('style') if data_elem.find('extracted-checksum') is not None else None,
                }
            for ea_elem in file_elem.findall('ea'):
                file_props['eas'].append({
                    'name': ea_elem.findtext('name'),
                    'length': ea_elem.findtext('length'),
                    'size': ea_elem.findtext('size'),
                    'offset': ea_elem.findtext('offset'),
                    'encoding': ea_elem.find('encoding').get('style') if ea_elem.find('encoding') is not None else None,
                    'archived_checksum': ea_elem.findtext('archived-checksum'),
                    'archived_checksum_style': ea_elem.find('archived-checksum').get('style') if ea_elem.find('archived-checksum') is not None else None,
                    'extracted_checksum': ea_elem.findtext('extracted-checksum'),
                    'extracted_checksum_style': ea_elem.find('extracted-checksum').get('style') if ea_elem.find('extracted-checksum') is not None else None,
                })
            self.properties['files'].append(file_props)

    def print_properties(self):
        print('Header Properties:')
        for k, v in self.header.items():
            print(f'- {k.capitalize().replace("_", " ")}: {v}')
        print('\nArchive Properties:')
        print(f'- Creation Time: {self.properties["creation_time"]}')
        print(f'- TOC Checksum: {self.properties["toc_checksum"]} ({self.properties["toc_checksum_style"]})')
        print('\nFiles:')
        for file in self.properties['files']:
            print(f'File ID: {file["id"]}')
            for k, v in file.items():
                if k == 'data' and v:
                    print('- Data:')
                    for dk, dv in v.items():
                        print(f'  - {dk.capitalize()}: {dv}')
                elif k == 'eas':
                    # An empty list prints nothing instead of "- Eas: []"
                    for i, ea in enumerate(v, 1):
                        print(f'- EA {i}:')
                        for ek, ev in ea.items():
                            print(f'  - {ek.capitalize()}: {ev}')
                elif v is not None:
                    print(f'- {k.capitalize()}: {v}')
            print()

    def write(self, filepath, files_dict):
        # files_dict: list of dicts with properties and data (bytes)
        # Simplified: no compression and no TOC checksum for demo; real impl needs more
        toc_root = ET.Element('xar')
        toc = ET.SubElement(toc_root, 'toc')
        ET.SubElement(toc, 'creation-time').text = '2025-12-21T00:00:00Z'  # Example
        checksum = ET.SubElement(toc, 'checksum', style='none')
        ET.SubElement(checksum, 'offset').text = '0'
        ET.SubElement(checksum, 'size').text = '0'
        heap = b''
        current_id = 1
        for file_info in files_dict:
            file_elem = ET.SubElement(toc, 'file', id=str(current_id))
            ET.SubElement(file_elem, 'name').text = file_info['name']
            ET.SubElement(file_elem, 'type').text = file_info.get('type', 'file')
            # Add other properties similarly...
            data_elem = ET.SubElement(file_elem, 'data')
            data = file_info['data']
            length = len(data)
            ET.SubElement(data_elem, 'length').text = str(length)
            ET.SubElement(data_elem, 'offset').text = str(len(heap))
            ET.SubElement(data_elem, 'size').text = str(length)  # No compression
            ET.SubElement(data_elem, 'encoding', style='none')
            # Per-file checksums (example SHA1)
            sha1 = hashlib.sha1(data).hexdigest()
            ET.SubElement(data_elem, 'archived-checksum', style='sha1').text = sha1
            ET.SubElement(data_elem, 'extracted-checksum', style='sha1').text = sha1
            heap += data
            current_id += 1
        toc_uncomp = ET.tostring(toc_root, encoding='utf-8')
        toc_comp = zlib.compress(toc_uncomp)
        hsize = 28
        version = 1
        cksum_alg = 0  # none
        header = struct.pack('>IHHQQI', 0x78617221, hsize, version, len(toc_comp), len(toc_uncomp), cksum_alg)
        with open(filepath, 'wb') as f:
            f.write(header + toc_comp + heap)
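The class above records each file's offset, length, and size but does not pull the bytes out of the heap. The following is a minimal sketch of that step, operating on a `<data>` element such as the one returned by `file_elem.find('data')`. It assumes that `length` is the number of bytes stored in the heap and `size` is the decoded size, and that an `application/x-gzip` encoding style denotes a zlib stream; other styles are rejected rather than guessed. `extract_data` is a hypothetical helper name.

```python
import hashlib
import zlib
import xml.etree.ElementTree as ET


def extract_data(data_elem, heap):
    """Return the decoded bytes described by one <data> element."""
    offset = int(data_elem.findtext("offset"))
    length = int(data_elem.findtext("length"))  # bytes stored in the heap
    size = int(data_elem.findtext("size"))      # bytes after decoding
    stored = heap[offset:offset + length]
    if len(stored) != length:
        raise ValueError("heap is shorter than the TOC claims")

    enc = data_elem.find("encoding")
    style = enc.get("style") if enc is not None else "none"
    # Assumption: gzip-style entries hold a zlib stream.
    if style == "application/x-gzip":
        payload = zlib.decompress(stored)
    elif style in ("none", "application/octet-stream"):
        payload = stored
    else:
        raise NotImplementedError(f"unsupported encoding: {style}")
    if len(payload) != size:
        raise ValueError("decoded size does not match <size>")

    # Verify the extracted checksum when hashlib knows the algorithm.
    cksum = data_elem.find("extracted-checksum")
    if cksum is not None and cksum.get("style") in hashlib.algorithms_available:
        expected = (cksum.text or "").strip().lower()
        if hashlib.new(cksum.get("style"), payload).hexdigest() != expected:
            raise ValueError("extracted-checksum mismatch")
    return payload


# Usage with a synthetic heap: 4 bytes of unrelated data, then one entry.
body = b"hello xar\n" * 10
packed = zlib.compress(body)
elem = ET.fromstring(
    f"<data><length>{len(packed)}</length><size>{len(body)}</size>"
    f"<offset>4</offset><encoding style='application/x-gzip'/>"
    f"<extracted-checksum style='sha1'>{hashlib.sha1(body).hexdigest()}"
    f"</extracted-checksum></data>"
)
heap = b"\x00" * 4 + packed
restored = extract_data(elem, heap)
print(restored == body)
```

Raising on an unknown encoding keeps the sketch from silently returning compressed bytes as if they were file content.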
- Java class:
import java.io.*;
import java.nio.*;
import java.nio.file.*;
import java.util.zip.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.security.MessageDigest; // For checksums

public class XarHandler {
    private ByteBuffer buffer;
    private Document tocDoc;
    private byte[] heap;

    public void open(String filepath) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(filepath));
        parse(data);
    }

    private void parse(byte[] data) throws Exception {
        buffer = ByteBuffer.wrap(data).order(ByteOrder.BIG_ENDIAN);
        int magic = buffer.getInt();
        if (magic != 0x78617221) throw new Exception("Invalid XAR magic");
        // Both fields are uint16_t in the file; mask to avoid sign extension
        int hsize = buffer.getShort() & 0xFFFF;
        int version = buffer.getShort() & 0xFFFF;
        if (version != 1) throw new Exception("Unsupported version");
        long tocCompLen = buffer.getLong();
        long tocUncompLen = buffer.getLong();
        int cksumAlg = buffer.getInt();
        // Skip to end of header
        buffer.position(hsize);
        // Decompress TOC
        byte[] tocComp = new byte[(int) tocCompLen];
        buffer.get(tocComp);
        Inflater inflater = new Inflater();
        inflater.setInput(tocComp);
        byte[] tocUncomp = new byte[(int) tocUncompLen];
        int decompressedLen = inflater.inflate(tocUncomp);
        if (decompressedLen != tocUncompLen) throw new Exception("TOC decompression failed");
        inflater.end();
        // Parse XML
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        tocDoc = builder.parse(new InputSource(new ByteArrayInputStream(tocUncomp)));
        // Heap
        heap = new byte[data.length - hsize - (int) tocCompLen];
        System.arraycopy(data, hsize + (int) tocCompLen, heap, 0, heap.length);
    }

    public void printProperties() {
        System.out.println("Header Properties:");
        // Extract and print from buffer or stored vars (similar to Python)
        // For brevity, implement extraction like in JS/Python, printing all fields
        // Example for TOC and files
        Element toc = (Element) tocDoc.getElementsByTagName("toc").item(0);
        System.out.println("\nArchive Properties:");
        System.out.println("- Creation Time: " + getText(toc, "creation-time"));
        System.out.println("- TOC Checksum: " + getText(toc, "checksum") + " (" + getAttr(toc, "checksum", "style") + ")");
        System.out.println("\nFiles:");
        NodeList files = toc.getElementsByTagName("file");
        for (int i = 0; i < files.getLength(); i++) {
            Element file = (Element) files.item(i);
            System.out.println("File ID: " + file.getAttribute("id"));
            System.out.println("- Name: " + getText(file, "name"));
            // Print all other properties similarly...
            // Data and EA as nested
        }
    }

    // Returns the text of the first descendant with the given tag, or "N/A"
    private String getText(Element parent, String tag) {
        Node node = parent.getElementsByTagName(tag).item(0);
        return node != null ? node.getTextContent() : "N/A";
    }

    private String getAttr(Element parent, String tag, String attr) {
        Node node = parent.getElementsByTagName(tag).item(0);
        return node != null ? ((Element) node).getAttribute(attr) : "N/A";
    }

    public void write(String filepath /* , Map or list of file info */) throws Exception {
        // Similar to Python: build XML, compress, build header, concatenate heap
        // Implement building Document, serialize to bytes, compress with Deflater, etc.
    }

    // Add read methods to extract heap data based on offsets, etc.
}
- JavaScript class:
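// Similar to the one in HTML, but standalone class (Node.js)
const fs = require('fs');
const zlib = require('zlib');

class XarHandler {
  constructor() {
    this.header = null;
    this.tocXml = null;
    this.heap = null;
  }
  async open(filepath) {
    // readFileSync returns a Node.js Buffer, which parse() reads directly
    this.parse(fs.readFileSync(filepath));
  }
  parse(buf) {
    if (buf.readUInt32BE(0) !== 0x78617221) throw new Error('Invalid XAR magic');
    this.header = {
      size: buf.readUInt16BE(4),
      version: buf.readUInt16BE(6),
      tocCompLen: Number(buf.readBigUInt64BE(8)),
      tocUncompLen: Number(buf.readBigUInt64BE(16)),
      cksumAlg: buf.readUInt32BE(24),
    };
    const tocEnd = this.header.size + this.header.tocCompLen;
    // The TOC is a zlib stream; it is kept as a string because Node.js has no built-in DOM parser
    this.tocXml = zlib.inflateSync(buf.subarray(this.header.size, tocEnd)).toString('utf8');
    this.heap = buf.subarray(tocEnd);
  }
  printProperties() {
    console.log('Header Properties:', this.header);
    console.log(this.tocXml);
  }
  write(filepath, files) {
    // Implement similar to Python: build the XML string, compress it with zlib.deflateSync,
    // build the 28-byte header, then append the heap
  }
}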
- C class (using C++ for class support, assuming zlib and tinyxml2 for XML):
#include <cstdint>
#include <cstring>
#include <fstream>
#include <iostream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>
#include <zlib.h>
#include <tinyxml2.h> // Assume included or linked

class XarHandler {
private:
    struct Header {
        uint32_t magic;
        uint16_t size;
        uint16_t version;
        uint64_t toc_comp_len;
        uint64_t toc_uncomp_len;
        uint32_t cksum_alg;
    };
    Header header;
    tinyxml2::XMLDocument tocDoc;
    std::vector<char> heap;

public:
    void open(const std::string& filepath) {
        std::ifstream file(filepath, std::ios::binary);
        std::vector<char> data((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
        if (data.size() < 28) throw std::runtime_error("File too short");
        // Parse header (big-endian). The byte swaps assume a little-endian host
        // and the GCC/Clang builtins.
        memcpy(&header.magic, data.data(), 4);
        header.magic = __builtin_bswap32(header.magic);
        if (header.magic != 0x78617221) throw std::runtime_error("Invalid magic");
        memcpy(&header.size, data.data() + 4, 2);
        header.size = __builtin_bswap16(header.size);
        memcpy(&header.version, data.data() + 6, 2);
        header.version = __builtin_bswap16(header.version);
        memcpy(&header.toc_comp_len, data.data() + 8, 8);
        header.toc_comp_len = __builtin_bswap64(header.toc_comp_len);
        memcpy(&header.toc_uncomp_len, data.data() + 16, 8);
        header.toc_uncomp_len = __builtin_bswap64(header.toc_uncomp_len);
        memcpy(&header.cksum_alg, data.data() + 24, 4);
        header.cksum_alg = __builtin_bswap32(header.cksum_alg);
        if (header.size + header.toc_comp_len > data.size()) throw std::runtime_error("Truncated TOC");
        // Decompress TOC
        uLongf uncomp_len = header.toc_uncomp_len;
        std::vector<char> toc_uncomp(uncomp_len);
        int ret = uncompress((Bytef*)toc_uncomp.data(), &uncomp_len, (Bytef*)(data.data() + header.size), header.toc_comp_len);
        if (ret != Z_OK) throw std::runtime_error("Decompression failed");
        // Parse XML; pass the length because the buffer is not null-terminated
        tocDoc.Parse(toc_uncomp.data(), uncomp_len);
        // Heap
        heap.assign(data.begin() + header.size + header.toc_comp_len, data.end());
    }

    void printProperties() {
        std::cout << "Header Properties:" << std::endl;
        std::cout << "- Magic: xar!" << std::endl;
        // Print others
        auto toc = tocDoc.FirstChildElement("xar")->FirstChildElement("toc");
        std::cout << "\nArchive Properties:" << std::endl;
        std::cout << "- Creation Time: " << (toc->FirstChildElement("creation-time") ? toc->FirstChildElement("creation-time")->GetText() : "N/A") << std::endl;
        // Similarly for checksum, files, etc.
    }

    void write(const std::string& filepath /* , files */) {
        // Build XML with tinyxml2, compress with compress(), build header, write header + compressed TOC + heap
    }
};