Task 083: .CDXML File Format
File Format Specifications for .CDXML
The .CDXML file format is an XML-based representation of chemical structures, reactions, and graphical elements created by ChemDraw (now part of PerkinElmer Informatics). It is the text-based counterpart to the binary .CDX format, allowing for human-readable storage of molecules (atoms/nodes, bonds), layouts, fonts, colors, and more. The format is hierarchical, using XML elements (referred to as "objects" in the spec) and attributes (referred to as "properties"). It conforms to XML 1.0 standards and uses a specific DTD for validation.
Key aspects from the official specification (sourced from CambridgeSoft/PerkinElmer documentation):
- Header: Starts with an XML declaration (<?xml version="1.0" encoding="UTF-8"?>) followed by a DOCTYPE declaration (<!DOCTYPE CDXML SYSTEM "http://www.cambridgesoft.com/xml/cdxml.dtd">).
- Root Element: <CDXML> (case-sensitive), which may have global attributes like BondLength (default bond length in points), LabelFont (font ID for atom labels), CaptionFont (font ID for captions).
- Structure: A tree of nested elements representing a document. It includes tables for fonts and colors, followed by pages containing chemical and graphical objects. Elements can reference each other via id attributes.
- Encoding: UTF-8 text. Coordinates are in points (1 point = 1/72 inch), typically 2D but extensible to 3D.
- Validation: References a DTD for structure, though modern parsers may ignore it.
- Objects (Elements): There are approximately 38 predefined object types. Common ones include:
  - CDXML (root document).
  - page (drawing canvas, with attributes like HeightPages, WidthPages).
  - group (logical grouping of objects).
  - fragment (molecular subgraph with nodes and bonds).
  - n (node, typically an atom; attributes: Element, the atomic number, e.g., "7" for nitrogen, with carbon assumed when omitted; NumHydrogens; p for 2D position as "x y").
  - b (bond; attributes: B and E for the IDs of the begin and end nodes, Order e.g., 1=single, 2=double).
  - t (text block; contains <s> sub-elements for styled runs with font, size, face).
  - graphic (shapes like lines, arrows; attributes: GraphicType e.g., "Line", BoundingBox).
  - scheme and step (for reactions).
  - fonttable (list of <font> elements with id, name e.g., "Arial").
  - colortable (list of <color> elements with r, g, b values 0-1).
  - Others: curve, embeddedobject, table, altgroup (for queries), spectrum, tlcplate, etc.
- Properties (Attributes): Over 260 predefined, optional unless noted. They are key-value pairs on elements (e.g., id="5"). Types include strings (CDXString), integers (INT16/INT32), booleans (CDXBoolean), points (CDXPoint2D/3D as "x y" or "x y z"), dates (CDXDate as YYYY-MM-DD HH:MM:SS). Common ones:
  - Global/Document: CreationDate, CreationProgram, ModificationDate, Name, Comment, Z (drawing layer), Visible (boolean), RegistryNumber, IgnoreWarnings.
  - Positioning: p (2D point, e.g., "148.5 164.25"), xyz (3D point), extent (width height).
  - Chemical: Element (atomic number, e.g., "6" for carbon), NumHydrogens, Order (bond order), Stereo (bond stereo), RepresentsProperty.
  - Visual: Font (ID ref to fonttable), Color (ID ref to colortable), BoundingBox (minX minY maxX maxY).
  - References: id (unique string ID), B/E (bond endpoints), SupersededBy (ID ref).
  - Others: charset (for fonts), ArrowheadType (for graphics), FillType.
  The full list includes specialized ones, named here by their CDX property constants, like kCDXProp_ChemicalWarning (string for issues) and kCDXProp_FontTable (embedded table).
- Examples: Files represent 2D chemical drawings but can include reactions (e.g., reactants → products via <arrow> and <step>), spectra, and TLC plates. IDs must be unique within the document. The format mixes semantic chemical data with vector graphics.
The format is extensible but proprietary; parsing requires handling references (e.g., bonds linking nodes via IDs) and optional sub-objects.
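The reference handling mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not ChemDraw's own logic: the sample document is hand-written, the function name resolve_bonds is hypothetical, and it assumes that an omitted Order means a single bond and that Element holds an atomic number.

```python
import xml.etree.ElementTree as ET

# Hand-written sample (an assumption, not taken from a real ChemDraw export):
# one nitrogen node bonded to two carbon nodes.
SAMPLE = """<CDXML>
  <page>
    <fragment>
      <n id="1" p="100 100" Element="7" NumHydrogens="1"/>
      <n id="2" p="120 110"/>
      <n id="3" p="80 110"/>
      <b id="4" B="1" E="2"/>
      <b id="5" B="1" E="3" Order="1"/>
    </fragment>
  </page>
</CDXML>"""


def resolve_bonds(root):
    """Return (begin_id, end_id, order) for every bond, checking each reference."""
    # Index nodes by id so the B/E attributes can be looked up directly.
    nodes = {n.get("id"): n for n in root.iter("n")}
    bonds = []
    for bond in root.iter("b"):
        begin_id, end_id = bond.get("B"), bond.get("E")
        if begin_id not in nodes or end_id not in nodes:
            raise ValueError(f"bond {bond.get('id')} references a missing node")
        # Assumption: a bond without an Order attribute is a single bond.
        bonds.append((begin_id, end_id, bond.get("Order", "1")))
    return bonds


root = ET.fromstring(SAMPLE)
for entry in resolve_bonds(root):
    print(entry)
# ('1', '2', '1')
# ('1', '3', '1')
```

Building the id index once keeps the lookup independent of where a node sits in the tree, which matters because bonds and nodes may be nested inside groups or fragments.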
List of All Properties Intrinsic to the File Format
These are the inherent structural and semantic properties defining the format's "file system" (i.e., its hierarchical XML structure, without external dependencies). Based on the spec, they include:
- Text-based XML Structure: Hierarchical tree with opening/closing tags; self-closing for leaf nodes (e.g., <b ... />).
- Encoding and Declaration: UTF-8; mandatory XML prolog and DOCTYPE referencing the CDXML DTD.
- Root Hierarchy: Single <CDXML> root containing global tables (fonttable, colortable) and pages; no multiple roots.
- Object Containment: Nested ownership (e.g., page > fragment > n/b); references via string IDs (no pointers).
- Attribute System: All properties as quoted string attributes (e.g., id="5"); case-sensitive; optional except core like id for referenced objects.
- Coordinate System: 2D points in points (floating-point strings, e.g., "x y"); origin at top-left; right-handed for 3D.
- ID Uniqueness: String IDs unique per document for cross-referencing (e.g., bonds to nodes).
- Visibility and Ordering: Boolean Visible; integer Z for layering (higher = front).
- Chemical Semantics: Nodes imply atoms (default hydrogens calculated); bonds imply connectivity; fragments imply molecules.
- Graphical Extensibility: Supports vectors (lines, curves, arrows) with bounding boxes; colors/fonts via indexed tables.
- Validation Rules: DTD enforces order (e.g., fonttable before pages); warnings for invalid chemistry (e.g., valence errors).
- File Closure: Ends with </CDXML>; no trailing data.
- Size/Performance: Textual, so larger than binary CDX; parsable with standard XML libraries.
- Versioning: Implicit via CreationProgram (e.g., "ChemDraw 20.0"); backward-compatible with older objects.
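Two of the properties listed above, ID uniqueness and space-separated coordinate strings, are easy to check mechanically. The following is a rough sketch under stated assumptions: the helper names duplicate_ids and parse_numbers are hypothetical, the sample document is hand-written, and font entries are skipped on the assumption that font ids are numbered separately from object ids.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(root):
    """Return the sorted list of id values that occur on more than one object."""
    counts = Counter(
        elem.get("id")
        for elem in root.iter()
        # Assumption: <font> ids live in their own numbering, so skip them.
        if elem.tag != "font" and elem.get("id") is not None
    )
    return sorted(value for value, count in counts.items() if count > 1)


def parse_numbers(value):
    """Split a coordinate string such as "148.5 164.25" into a tuple of floats."""
    return tuple(float(part) for part in value.split())


# Hand-written sample with a deliberate id clash between the two nodes.
doc = ET.fromstring(
    '<CDXML><page>'
    '<n id="1" p="148.5 164.25"/>'
    '<n id="1" p="10 20"/>'
    '<graphic id="2" BoundingBox="0 0 50 25"/>'
    '</page></CDXML>'
)

print(duplicate_ids(doc))                                        # ['1']
print(parse_numbers(doc.find(".//n").get("p")))                  # (148.5, 164.25)
print(parse_numbers(doc.find(".//graphic").get("BoundingBox")))  # (0.0, 0.0, 50.0, 25.0)
```

The same parse_numbers helper covers p, xyz, and BoundingBox because all three are whitespace-separated lists of numbers; only the expected tuple length differs.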
Two Direct Download Links for .CDXML Files
Direct downloads are scarce in public sources due to the proprietary nature, but samples from documentation can be saved as .cdxml. Here are two:
- Simple dimethylamine molecule: https://chemapps.stolaf.edu/iupac/cdx/sdk/dimethylaminesimple.cdxml (save the embedded XML from the page as .cdxml; represents NH(CH3)2).
- Reaction example (NH2Cl + NH3 → NH4Cl + NH2): https://biotech.fyicenter.com/1004815_CDXML_ChemDraw_XML_for_Reactions.html (save the embedded XML as .cdxml; includes fragments, arrow, text).
Ghost Blog Embedded HTML JavaScript for Drag-and-Drop .CDXML Viewer
Embed this in a Ghost blog post (use HTML card). It allows drag-and-drop of a .CDXML file, parses it as XML, and dumps all elements/tags and their properties/attributes to the screen in a block.
Python Class for .CDXML Handling
Uses xml.etree.ElementTree (standard library). Reads/parses file, prints all elements and attributes recursively, writes back to file. ElementTree does not retain the DOCTYPE declaration or comments, so the written file is not identical to the input.
import xml.etree.ElementTree as ET


class CDXMLHandler:
    """Load a .cdxml file, list its elements and attributes, and write it back."""

    def __init__(self, file_path):
        self.tree = ET.parse(file_path)
        self.root = self.tree.getroot()

    def read(self):
        return self.tree

    def write(self, output_path):
        self.tree.write(output_path, encoding='utf-8', xml_declaration=True)

    def print_properties(self):
        self._traverse(self.root)

    def _traverse(self, element):
        # Print the tag and its attributes, then recurse into the children.
        print(f"{element.tag}: {dict(element.attrib)}")
        for child in element:
            self._traverse(child)


# Usage example:
# handler = CDXMLHandler('example.cdxml')
# handler.print_properties()
# handler.write('output.cdxml')
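One limitation of writing through ElementTree is that the DOCTYPE line is lost, because the parser does not keep it in the tree. A possible workaround, sketched here with the hypothetical helpers serialize_with_doctype and write_cdxml, is to serialize the root to a string and prepend the prolog by hand. The DOCTYPE text is the one quoted in the specification section above; whether a given ChemDraw version requires it is not verified here.

```python
import xml.etree.ElementTree as ET

# DOCTYPE as quoted in the format description; re-added manually on output.
CDXML_DOCTYPE = '<!DOCTYPE CDXML SYSTEM "http://www.cambridgesoft.com/xml/cdxml.dtd">'


def serialize_with_doctype(root):
    """Return the document as text with an XML declaration and the CDXML DOCTYPE."""
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + CDXML_DOCTYPE + "\n" + body + "\n"


def write_cdxml(root, output_path):
    """Write the serialized document as UTF-8 with Unix line endings."""
    with open(output_path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(serialize_with_doctype(root))


if __name__ == "__main__":
    sample_root = ET.fromstring('<CDXML><page><n id="1" p="10 20"/></page></CDXML>')
    print(serialize_with_doctype(sample_root))
```

With the handler class above, this would be called as write_cdxml(handler.root, 'output.cdxml') in place of handler.write.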
Java Class for .CDXML Handling
Uses javax.xml.parsers.DocumentBuilder and javax.xml.transform (standard JDK). Reads/parses file, prints all elements and attributes recursively to console, writes back using transformer.
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import java.io.File;

public class CDXMLHandler {
    private Document doc;

    public CDXMLHandler(String filePath) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // The DOCTYPE points at a remote DTD; without this the parser tries to download it.
        // This feature is supported by the JDK's default Xerces-based parser.
        factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        doc = builder.parse(new File(filePath));
    }

    public void read() {
        // Document is loaded in constructor
    }

    public void write(String outputPath) throws Exception {
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        DOMSource source = new DOMSource(doc);
        StreamResult result = new StreamResult(new File(outputPath));
        transformer.transform(source, result);
    }

    public void printProperties() {
        traverse(doc.getDocumentElement());
    }

    private void traverse(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element elem = (Element) node;
            NamedNodeMap attrs = elem.getAttributes();
            StringBuilder attrStr = new StringBuilder("{");
            for (int i = 0; i < attrs.getLength(); i++) {
                attrStr.append(attrs.item(i).getNodeName()).append("=").append(attrs.item(i).getNodeValue());
                if (i < attrs.getLength() - 1) attrStr.append(", ");
            }
            attrStr.append("}");
            System.out.println(elem.getTagName() + ": " + attrStr);
        }
        Node child = node.getFirstChild();
        while (child != null) {
            traverse(child);
            child = child.getNextSibling();
        }
    }

    // Usage: new CDXMLHandler("example.cdxml").printProperties();
}
JavaScript Class for .CDXML Handling
Uses DOMParser, XMLSerializer, and FileReader, which are browser APIs; under Node.js a DOM implementation such as jsdom would be needed instead. Assumes browser context; reads from File object, prints to console, serializes for write.
class CDXMLHandler {
  constructor(file) {
    // FileReader is asynchronous: this.doc stays undefined until onload fires.
    const reader = new FileReader();
    reader.onload = (e) => {
      const parser = new DOMParser();
      this.doc = parser.parseFromString(e.target.result, 'text/xml');
      this.printProperties();
    };
    reader.readAsText(file);
  }

  read() {
    return this.doc;
  }

  write(outputFileName) {
    const serializer = new XMLSerializer();
    const xmlStr = '<?xml version="1.0" encoding="UTF-8"?>\n' + serializer.serializeToString(this.doc);
    const blob = new Blob([xmlStr], { type: 'text/xml' });
    const url = URL.createObjectURL(blob);
    // Trigger a download through a temporary link element.
    const a = document.createElement('a');
    a.href = url;
    a.download = outputFileName;
    a.click();
    URL.revokeObjectURL(url);
  }

  printProperties() {
    this._traverse(this.doc.documentElement);
  }

  _traverse(node) {
    if (node.nodeType === Node.ELEMENT_NODE) {
      const attrs = {};
      for (const attr of node.attributes) {
        attrs[attr.nodeName] = attr.nodeValue;
      }
      console.log(`${node.nodeName}:`, attrs);
    }
    for (const child of node.children) {
      this._traverse(child);
    }
  }
}

// Usage example (in browser): new CDXMLHandler(fileInput.files[0]);
C Class for .CDXML Handling
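C lacks built-in XML support, so this uses a simple recursive string-based parser (no external libs like libxml2 for portability). It reads the file as text, scans tags and their raw attribute text into a tree, and prints each element to the console. Write copies the loaded text back out unchanged. It assumes well-formed XML and does not handle CDATA sections, '<' inside comments, or '>' inside attribute values; for production, use libxml2.
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char* content;
    size_t size;
} CDXMLFile;

typedef struct Node {
    char tag[256];
    char attrs[1024];       /* raw attribute text, e.g. id="5" p="1 2" */
    struct Node* children;  /* first child */
    struct Node* next;      /* next sibling */
} Node;

/* C has no classes, so the handler is a struct plus functions that take it. */

/* Loads the whole file into memory; returns 0 on success, -1 on failure. */
int cdxml_init(CDXMLFile* file, const char* filePath) {
    file->content = NULL;
    file->size = 0;
    FILE* f = fopen(filePath, "rb");
    if (!f) return -1;
    fseek(f, 0, SEEK_END);
    long len = ftell(f);
    fseek(f, 0, SEEK_SET);
    if (len < 0) { fclose(f); return -1; }
    file->content = malloc((size_t)len + 1);
    if (!file->content) { fclose(f); return -1; }
    file->size = fread(file->content, 1, (size_t)len, f);
    file->content[file->size] = '\0';
    fclose(f);
    return 0;
}

/* Writes the loaded text back out unchanged; returns 0 on success. */
int cdxml_write(const CDXMLFile* file, const char* outputPath) {
    if (!file->content) return -1;
    FILE* out = fopen(outputPath, "wb");
    if (!out) return -1;
    fwrite(file->content, 1, file->size, out);
    fclose(out);
    return 0;
}

void cdxml_free(CDXMLFile* file) {
    free(file->content);
    file->content = NULL;
    file->size = 0;
}

/* Copies len bytes into dst, truncating to the buffer capacity. */
static void copy_span(char* dst, size_t cap, const char* src, size_t len) {
    if (len >= cap) len = cap - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

static void add_child(Node* parent, Node* child) {
    Node** slot = &parent->children;
    while (*slot) slot = &(*slot)->next;
    *slot = child;
}

/* Parses the next element at or after *cursor and advances *cursor past it.
   Returns NULL (leaving *cursor on the "</") when a closing tag comes first,
   or NULL at end of input. */
static Node* parse_element(const char** cursor) {
    const char* p = *cursor;
    for (;;) {
        p = strchr(p, '<');
        if (!p) { *cursor += strlen(*cursor); return NULL; }
        if (p[1] == '/') { *cursor = p; return NULL; }
        if (p[1] != '?' && p[1] != '!') break;  /* a real element */
        p = strchr(p, '>');                     /* skip <?xml ...?>, <!DOCTYPE ...>, comments */
        if (!p) { *cursor += strlen(*cursor); return NULL; }
        p++;
    }
    const char* end = strchr(p, '>');
    if (!end) { *cursor += strlen(*cursor); return NULL; }
    int selfClosing = (end[-1] == '/');

    /* Tag name runs up to the first space, '/' or '>'. */
    const char* nameEnd = p + 1;
    while (nameEnd < end && !isspace((unsigned char)*nameEnd) && *nameEnd != '/') nameEnd++;
    /* Attribute text is whatever remains, trimmed of whitespace and the trailing '/'. */
    const char* attrStart = nameEnd;
    while (attrStart < end && isspace((unsigned char)*attrStart)) attrStart++;
    const char* attrEnd = selfClosing ? end - 1 : end;
    while (attrEnd > attrStart && isspace((unsigned char)attrEnd[-1])) attrEnd--;

    Node* node = calloc(1, sizeof *node);
    if (!node) { *cursor += strlen(*cursor); return NULL; }
    copy_span(node->tag, sizeof node->tag, p + 1, (size_t)(nameEnd - (p + 1)));
    copy_span(node->attrs, sizeof node->attrs, attrStart,
              attrEnd > attrStart ? (size_t)(attrEnd - attrStart) : 0);

    p = end + 1;
    if (!selfClosing) {
        Node* child;
        while ((child = parse_element(&p)) != NULL) add_child(node, child);
        /* p now sits on this element's closing tag (or end of input); step past it. */
        const char* close = strchr(p, '>');
        p = close ? close + 1 : p + strlen(p);
    }
    *cursor = p;
    return node;
}

static void traverse_and_print(const Node* node) {
    for (; node; node = node->next) {
        printf("%s: %s\n", node->tag, node->attrs);
        traverse_and_print(node->children);
    }
}

static void free_tree(Node* node) {
    while (node) {
        Node* next = node->next;
        free_tree(node->children);
        free(node);
        node = next;
    }
}

void cdxml_print_properties(const CDXMLFile* file) {
    if (!file->content) return;
    const char* cursor = file->content;
    Node* root = parse_element(&cursor);
    traverse_and_print(root);
    free_tree(root);
}

/* Usage (compile with gcc -o handler handler.c), then: ./handler example.cdxml */
int main(int argc, char** argv) {
    CDXMLFile file;
    const char* path = argc > 1 ? argv[1] : "example.cdxml";
    if (cdxml_init(&file, path) != 0) {
        fprintf(stderr, "cannot read %s\n", path);
        return 1;
    }
    cdxml_print_properties(&file);
    cdxml_write(&file, "output.cdxml");
    cdxml_free(&file);
    return 0;
}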