Task 466: .NXS File Format

1. List of Properties of the .NXS File Format

The .NXS file extension refers to the NEXUS file format, a text-based standard primarily utilized in phylogenetics and systematic biology for storing and exchanging data such as taxonomic information, character states, distance matrices, and phylogenetic trees. This format facilitates interoperability among programs including PAUP*, Mesquite, MrBayes, ModelTest, MacClade, and GDA. It adheres to a character-state data model, where operational taxonomic units (OTUs) exhibit states for homologous characters. The format is extensible, allowing public (standard) and private (application-specific) elements, with restrictions on block ordering and command sequencing within blocks. Intrinsic properties, derived from its structural specifications, are outlined below:

  • Header: The file must commence with the fixed, case-insensitive string "#NEXUS" as the magic identifier, signaling the format's initiation.
  • Overall Structure: Comprises modular blocks delimited by "BEGIN [blockname];" and "END;". Blocks encapsulate related data and commands. Multiple blocks may appear sequentially, but each public block type is permitted only once. The TAXA block is mandatory if taxa are referenced. Private blocks may be ignored by non-recognizing applications.
  • Blocks: Core organizational units, categorized as public (standardized) or private (custom). Public blocks include:
      • TAXA: Defines taxa (OTUs).
      • CHARACTERS or DATA: Houses character data matrices (DATA combines TAXA and CHARACTERS functionality).
      • UNALIGNED: Specifies unaligned data.
      • DISTANCES: Stores distance matrices.
      • SETS: Defines subsets of taxa, characters, or other elements.
      • ASSUMPTIONS: Specifies analytical assumptions (e.g., exclusions, weights).
      • TREES: Contains phylogenetic tree descriptions.
      • CODONS: Defines codons and genetic codes.
    Private blocks (e.g., PAUP for PAUP*-specific commands) are application-restricted.
  • Commands: Semicolon-terminated statements within blocks, specifying details. Common commands include:
      • DIMENSIONS: Defines counts (e.g., NTAX for number of taxa, NCHAR for number of characters); must precede other commands in relevant blocks.
      • TAXLABELS: Lists taxon names in TAXA or DATA blocks.
      • FORMAT: Specifies data characteristics in DATA/CHARACTERS blocks, including DATATYPE (DNA, RNA, NUCLEOTIDE, PROTEIN, STANDARD, CONTINUOUS), MISSING symbol (default ?), GAP symbol (default -), SYMBOLS (custom alphabet), EQUATE (substitution mappings, case-sensitive), RESPECTCASE (preserves case), INTERLEAVE (yes/no for matrix layout), TRANSPOSE (swaps rows/columns), ITEMS, STATESFORMAT, TOKENS/LABELS, MATCHCHAR.
      • MATRIX: Presents the data matrix in DATA/CHARACTERS blocks, associating taxa with character states; supports interleaved or non-interleaved layouts.
      • ELIMINATE: Excludes characters in DATA/CHARACTERS blocks.
      • CHARSET, TAXSET, STATESET, CHANGESET, TREESET, CODONPOSSET: Define sets in SETS blocks (e.g., ranges like 1-5, 47-.).
      • CHARPARTITION, TAXPARTITION, TREEPARTITION: Define partitions in SETS blocks.
      • EXSET, WTSET, TYPESET, ANCSTATES, USERTYPE: Specify assumptions in ASSUMPTIONS blocks (e.g., EXSET* for default exclusions).
      • GENETICCODE, CODESET: Define codes in CODONS blocks.
      • TRANSLATE: Maps tokens to taxon names in TREES blocks.
      • TREE: Defines trees in TREES blocks, with names and Newick-format topologies (e.g., ((1,2),3); optional branch lengths :value, rooting indicators [&U] for unrooted).
    Application-specific commands (e.g., in PAUP block: LOG, OUTGROUP, SET, LSET, HSEARCH, DESCRIBE, SAVETREES, LEAVE, QUIT).
  • Syntax Rules: Keywords are case-insensitive; whitespace is flexible except within tokens or names; commands and blocks terminate with semicolons; taxon names may use underscores for spaces or single quotes for literal spaces; ranges use hyphens (e.g., 1-5, 47-. for 47 to end); duplicate names are prohibited; ordering constraints apply (e.g., DIMENSIONS first in DATA); extensible for custom elements.
  • Comments: Enclosed in square brackets ([ ]), insertable anywhere for annotations; printed comments prefixed with ! (e.g., [!Note]); can disable code sections or embed metadata (e.g., alignment position indicators).
  • Data Types and Symbols: DATATYPE defines type; supports ambiguity codes (e.g., R for purine in DNA); case-sensitive equates for input substitutions; symbols lists for custom alphabets.
  • Matrix Properties: Taxon-character associations; supports gaps, missing data, interleaving (data in segments), transposition; items/statesformat for complex data representation.
  • Tree Properties: Newick parenthetical notation; optional translation tables; branch lengths; support values in brackets; character state trees (CSTREE); multiple trees per block.
  • Objects: Definable entities include taxa, characters, states, trees, sets, partitions, weights, types, exclusions, ancestral states, codon positions.
  • Extensibility: Allows undefined blocks/commands for program-specific or future expansions; private blocks ignored by incompatible software.
  • File Usage Constraints: No internet dependencies; supports reproducibility via embedded analysis commands; tree files may contain only TREES block.
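
The header, block, command, and comment rules listed above can be illustrated with a short Python sketch. It is a minimal illustration under simplifying assumptions, not a complete NEXUS reader: the function names strip_comments and split_blocks and the sample text are hypothetical, and every bracketed comment is discarded, including command comments such as [&U] that a real tree reader would keep.

```python
import re

def strip_comments(text):
    """Remove [ ... ] comments, honouring nesting and single-quoted tokens."""
    out, depth, in_quote = [], 0, False
    for ch in text:
        if depth == 0 and ch == "'":
            in_quote = not in_quote      # brackets inside quotes are literal
            out.append(ch)
        elif not in_quote and ch == '[':
            depth += 1                   # entering a (possibly nested) comment
        elif not in_quote and ch == ']' and depth > 0:
            depth -= 1                   # leaving one comment level
        elif depth == 0:
            out.append(ch)               # ordinary character outside comments
    return ''.join(out)

def split_blocks(text):
    """Return [(BLOCKNAME, [command, ...]), ...] for a NEXUS document."""
    body = strip_comments(text).lstrip()
    if not body.upper().startswith('#NEXUS'):
        raise ValueError('missing #NEXUS header')
    # Commands end at semicolons that are not inside single-quoted tokens.
    commands = re.findall(r"(?:'[^']*'|[^;'])+", body[6:])
    blocks, current = [], None
    for cmd in (c.strip() for c in commands):
        if not cmd:
            continue
        words = cmd.split(None, 1)
        keyword = words[0].upper()
        if keyword == 'BEGIN' and len(words) == 2:
            current = (words[1].strip().upper(), [])
        elif keyword in ('END', 'ENDBLOCK') and current is not None:
            blocks.append(current)
            current = None
        elif current is not None:
            current[1].append(cmd)
    return blocks

sample = """#nexus
[a comment with a ; semicolon]
BEGIN TAXA;
  DIMENSIONS NTAX=2;
  TAXLABELS 'Homo sapiens; modern' Pan_troglodytes;
END;
begin trees;
  tree t1 = (1,2);
end;
"""

for name, cmds in split_blocks(sample):
    print(name, cmds)
```

Removing comments before splitting on semicolons means a semicolon inside a comment or a quoted taxon name does not end a command early.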

The following provide direct access to phylogenetic NEXUS files from public repositories:

These URLs deliver raw NEXUS content, savable as .nxs files.

3. Ghost Blog Embedded HTML JavaScript for Drag-and-Drop .NXS File Dump

This self-contained HTML/JavaScript snippet, embeddable in a Ghost blog post via the HTML card, creates a drag-and-drop area. It reads dropped .NXS files as text, parses properties (header, blocks, dimensions, format, taxa, matrix, sets, assumptions, trees, etc.), and outputs them in a pre-formatted display. Parsing handles standard elements, including interleaved matrices and comments.

Drag and drop a .NXS file here to dump its properties.


4. Python Class for .NXS File Handling

This Python class opens, parses (decodes), reads, writes, and prints NEXUS file properties, supporting key blocks (TAXA, DATA/CHARACTERS, TREES, SETS, ASSUMPTIONS) and elements like interleaved matrices.

import re

class NXS:
    def __init__(self):
        self.header = '#NEXUS'
        self.ntax = 0
        self.nchar = 0
        self.datatype = 'standard'
        self.missing = '?'
        self.gap = '-'
        self.interleave = False
        self.taxlabels = []
        self.matrix = {}
        self.sets = {}  # e.g., {'charset': ..., 'taxset': ...}
        self.assumptions = {}  # e.g., {'exset': ...}
        self.trees = []  # List of (name, desc, translate)
        self.comments = []

    def read(self, filename):
        with open(filename, 'r') as f:
            text = f.read()
        self.parse(text)

    def parse(self, text):
        # The #NEXUS header is case-insensitive
        if not text.lstrip().upper().startswith('#NEXUS'):
            raise ValueError("Not a valid NEXUS file")
        
        # Extract comments
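        # Non-nested comments that are not "printed" comments ([!...])
        self.comments = re.findall(r'\[(?!\s*!)[^\[\]]*\]', text)
        
        # A block ends with END; or ENDBLOCK;
        blocks = re.findall(r'\bbegin\s+(\w+)\s*;([\s\S]*?)\bend(?:block)?\s*;', text, re.I)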
        for block_name, content in blocks:
            block_name = block_name.upper()
            commands = [cmd.strip() for cmd in content.split(';') if cmd.strip()]
            
            if block_name in ['DATA', 'CHARACTERS']:
                for cmd in commands:
                    if cmd.lower().startswith('dimensions'):
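                        # NTAX and NCHAR may appear alone (CHARACTERS has no NTAX) or in either order
                        ntax_match = re.search(r'\bntax\s*=\s*(\d+)', cmd, re.I)
                        if ntax_match:
                            self.ntax = int(ntax_match.group(1))
                        nchar_match = re.search(r'\bnchar\s*=\s*(\d+)', cmd, re.I)
                        if nchar_match:
                            self.nchar = int(nchar_match.group(1))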
                    elif cmd.lower().startswith('format'):
                        fmt_str = cmd[6:].strip()
                        dt_match = re.search(r'datatype=(\w+)', fmt_str, re.I)
                        if dt_match: self.datatype = dt_match.group(1).lower()
                        ms_match = re.search(r'missing=(\S)', fmt_str, re.I)
                        if ms_match: self.missing = ms_match.group(1)
                        gp_match = re.search(r'gap=(\S)', fmt_str, re.I)
                        if gp_match: self.gap = gp_match.group(1)
                        # INTERLEAVE is usually a bare keyword; INTERLEAVE=NO switches it off
                        inter_match = re.search(r'\binterleave\b(?:\s*=\s*(\w+))?', fmt_str, re.I)
                        if inter_match: self.interleave = (inter_match.group(1) or 'yes').lower() == 'yes'
                    elif cmd.lower().startswith('matrix'):
                        mat_str = cmd[6:].strip()
                        lines = re.split(r'\n', mat_str)
                        for line in lines:
                            line = line.strip()
                            if line:
                                parts = re.split(r'\s+', line, maxsplit=1)
                                tax = parts[0].strip("'\"")
                                if len(parts) > 1:
                                    data = parts[1].strip()
                                    if tax in self.matrix:
                                        self.matrix[tax] += data  # Handle interleave
                                    else:
                                        self.matrix[tax] = data
                                    if tax not in self.taxlabels:
                                        self.taxlabels.append(tax)
            elif block_name == 'TAXA':
                for cmd in commands:
                    if cmd.lower().startswith('taxlabels'):
                        labels = re.split(r'\s+', cmd[9:].strip())
                        self.taxlabels.extend(labels)
            elif block_name == 'SETS':
                for cmd in commands:
                    if cmd.lower().startswith('charset') or cmd.lower().startswith('taxset'):
                        set_type = cmd.split()[0].lower()
                        set_name = cmd.split('=')[0].split()[1]
                        set_val = cmd.split('=')[1].strip()
                        self.sets.setdefault(set_type, {})[set_name] = set_val
            elif block_name == 'ASSUMPTIONS':
                for cmd in commands:
                    if cmd.lower().startswith('exset'):
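                        left, _, right = cmd.partition('=')
                        # Skip the optional '*' that marks the default exclusion set
                        name_tokens = [t for t in left.split()[1:] if t != '*']
                        ex_name = name_tokens[0] if name_tokens else 'default'
                        ex_val = right.strip()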
                        self.assumptions.setdefault('exset', {})[ex_name] = ex_val
            elif block_name == 'TREES':
                translate = {}
                for cmd in commands:
                    if cmd.lower().startswith('translate'):
                        trans_str = cmd[9:].strip().replace(',', ' ').split()
                        for i in range(0, len(trans_str), 2):
                            if i+1 < len(trans_str):
                                translate[trans_str[i]] = trans_str[i+1]
                    elif cmd.lower().startswith('tree'):
                        # Allow the optional '*' default-tree marker (TREE * name = ...)
                        tree_match = re.search(r'tree\s+\*?\s*(\S+?)\s*=\s*([\s\S]*)', cmd, re.I)
                        if tree_match:
                            self.trees.append((tree_match.group(1), tree_match.group(2).strip(), translate.copy()))

    def print_properties(self):
        print(f"Header: {self.header}")
        print(f"ntax: {self.ntax}")
        print(f"nchar: {self.nchar}")
        print(f"datatype: {self.datatype}")
        print(f"missing: {self.missing}")
        print(f"gap: {self.gap}")
        print(f"interleave: {self.interleave}")
        print(f"taxlabels: {self.taxlabels}")
        print("matrix:")
        for tax, data in self.matrix.items():
            print(f"  {tax}: {data}")
        print("sets:")
        for set_type, sets in self.sets.items():
            for name, val in sets.items():
                print(f"  {set_type} {name}: {val}")
        print("assumptions:")
        for ass_type, ass in self.assumptions.items():
            for name, val in ass.items():
                print(f"  {ass_type} {name}: {val}")
        print("trees:")
        for name, desc, trans in self.trees:
            print(f"  {name}: {desc}")
            if trans:
                print(f"    translate: {trans}")
        print("comments:")
        for comment in self.comments:
            print(f"  {comment}")

    def write(self, filename):
        with open(filename, 'w') as f:
            f.write(f"{self.header}\n")
            if self.taxlabels:
                f.write('begin taxa;\n')
                # Count the labels directly; self.ntax stays 0 when the input had no DATA block
                f.write(f"  dimensions ntax={len(self.taxlabels)};\n")
                f.write('  taxlabels ' + ' '.join(f"'{tax}'" for tax in self.taxlabels) + ';\n')
                f.write('end;\n')
            f.write('begin data;\n')
            f.write(f"  dimensions ntax={self.ntax} nchar={self.nchar};\n")
            f.write(f"  format datatype={self.datatype} missing={self.missing} gap={self.gap} interleave={'yes' if self.interleave else 'no'};\n")
            f.write('  matrix\n')
            for tax in self.taxlabels:
                if tax in self.matrix:
                    f.write(f"    '{tax}' {self.matrix[tax]}\n")
            f.write('  ;\n')
            f.write('end;\n')
            if self.sets:
                f.write('begin sets;\n')
                for set_type, sets in self.sets.items():
                    for name, val in sets.items():
                        f.write(f"  {set_type} {name} = {val};\n")
                f.write('end;\n')
            if self.assumptions:
                f.write('begin assumptions;\n')
                for ass_type, ass in self.assumptions.items():
                    for name, val in ass.items():
                        f.write(f"  {ass_type} {name} = {val};\n")
                f.write('end;\n')
            if self.trees:
                f.write('begin trees;\n')
                # Write the shared TRANSLATE table once, ahead of the trees,
                # with no comma after the last entry
                trans = self.trees[0][2]
                if trans:
                    pairs = ',\n'.join(f"    {token} {tname}" for token, tname in trans.items())
                    f.write(f"  translate\n{pairs}\n  ;\n")
                for name, desc, _ in self.trees:
                    f.write(f"  tree {name} = {desc};\n")
                f.write('end;\n')
            # Comments not written as they are non-structural
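
The class above keeps CHARSET and TAXSET values as raw strings such as "1-5 47-.". Turning such a description into explicit indices could look like the sketch below. The helper name expand_set and its last parameter are assumptions made for illustration. Only plain numbers, a-b ranges, the "." placeholder, and a backslash stride suffix are handled; references to other named sets and keywords such as ALL are not.

```python
import re

# One element: a number or '.', optionally '-end', optionally a stride suffix.
_RANGE = re.compile(r'(\d+|\.)(?:-(\d+|\.))?(?:\\(\d+))?$')

def expand_set(spec, last):
    """Expand a NEXUS set description into a list of 1-based indices.

    `last` is the index that '.' stands for (NCHAR for a CHARSET,
    NTAX for a TAXSET).
    """
    # Tolerate optional spaces around '-' and the stride backslash.
    compact = re.sub(r'\s*([-\\])\s*', r'\1', spec.strip())
    indices = []
    for token in compact.split():
        m = _RANGE.match(token)
        if m is None:
            raise ValueError(f'unsupported set element: {token!r}')
        start = last if m.group(1) == '.' else int(m.group(1))
        if m.group(2) is None:
            stop = start                      # single index, not a range
        else:
            stop = last if m.group(2) == '.' else int(m.group(2))
        step = int(m.group(3)) if m.group(3) else 1
        indices.extend(range(start, stop + 1, step))
    return indices

print(expand_set('1-5 47-.', last=50))   # [1, 2, 3, 4, 5, 47, 48, 49, 50]
print(expand_set('1-10\\3', last=10))    # [1, 4, 7, 10]
```

Passing last explicitly keeps the helper independent of the parser state, so the same function serves character sets and taxon sets.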

5. Java Class for .NXS File Handling

This Java class provides methods to open, parse, read, write, and print NEXUS properties, with support for similar elements as the Python implementation.

import java.io.*;
import java.util.*;
import java.util.regex.*;

public class NXS {
    private String header = "#NEXUS";
    private int ntax = 0;
    private int nchar = 0;
    private String datatype = "standard";
    private char missing = '?';
    private char gap = '-';
    private boolean interleave = false;
    private List<String> taxlabels = new ArrayList<>();
    private Map<String, String> matrix = new LinkedHashMap<>();
    private Map<String, Map<String, String>> sets = new HashMap<>();
    private Map<String, Map<String, String>> assumptions = new HashMap<>();
    private List<TreeEntry> trees = new ArrayList<>();
    private List<String> comments = new ArrayList<>();

    private static class TreeEntry {
        String name;
        String desc;
        Map<String, String> translate;

        TreeEntry(String name, String desc, Map<String, String> translate) {
            this.name = name;
            this.desc = desc;
            this.translate = translate;
        }
    }

    public void read(String filename) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append("\n");
            }
        }
        parse(text.toString());
    }

    private void parse(String text) {
        // The #NEXUS header is case-insensitive
        if (!text.trim().toUpperCase().startsWith("#NEXUS")) {
            throw new IllegalArgumentException("Not a valid NEXUS file");
        }

        // Comments
        // Non-nested comments that are not "printed" comments ([!...])
        Pattern commentPattern = Pattern.compile("\\[(?!\\s*!)[^\\[\\]]*\\]", Pattern.DOTALL);
        Matcher commentMatcher = commentPattern.matcher(text);
        while (commentMatcher.find()) {
            comments.add(commentMatcher.group());
        }

        // A block ends with END; or ENDBLOCK;
        Pattern blockPattern = Pattern.compile("\\bbegin\\s+(\\w+)\\s*;([\\s\\S]*?)\\bend(?:block)?\\s*;", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher blockMatcher = blockPattern.matcher(text);
        while (blockMatcher.find()) {
            String blockName = blockMatcher.group(1).toUpperCase();
            String content = blockMatcher.group(2).trim();
            String[] commands = content.split(";");
            List<String> cmdList = new ArrayList<>();
            for (String cmd : commands) {
                cmd = cmd.trim();
                if (!cmd.isEmpty()) cmdList.add(cmd);
            }

            if (blockName.equals("DATA") || blockName.equals("CHARACTERS")) {
                for (String cmd : cmdList) {
                    if (cmd.toLowerCase().startsWith("dimensions")) {
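                        // NTAX and NCHAR may appear alone (CHARACTERS has no NTAX) or in either order
                        Matcher ntaxMatcher = Pattern.compile("\\bntax\\s*=\\s*(\\d+)", Pattern.CASE_INSENSITIVE).matcher(cmd);
                        if (ntaxMatcher.find()) ntax = Integer.parseInt(ntaxMatcher.group(1));
                        Matcher ncharMatcher = Pattern.compile("\\bnchar\\s*=\\s*(\\d+)", Pattern.CASE_INSENSITIVE).matcher(cmd);
                        if (ncharMatcher.find()) nchar = Integer.parseInt(ncharMatcher.group(1));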
                    } else if (cmd.toLowerCase().startsWith("format")) {
                        String fmtStr = cmd.substring(6).trim();
                        Pattern dtPattern = Pattern.compile("datatype=(\\w+)", Pattern.CASE_INSENSITIVE);
                        Matcher dtMatcher = dtPattern.matcher(fmtStr);
                        if (dtMatcher.find()) datatype = dtMatcher.group(1).toLowerCase();
                        Pattern msPattern = Pattern.compile("missing=(\\S)", Pattern.CASE_INSENSITIVE);
                        Matcher msMatcher = msPattern.matcher(fmtStr);
                        if (msMatcher.find()) missing = msMatcher.group(1).charAt(0);
                        Pattern gpPattern = Pattern.compile("gap=(\\S)", Pattern.CASE_INSENSITIVE);
                        Matcher gpMatcher = gpPattern.matcher(fmtStr);
                        if (gpMatcher.find()) gap = gpMatcher.group(1).charAt(0);
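                        // INTERLEAVE is usually a bare keyword; INTERLEAVE=NO switches it off
                        Pattern interPattern = Pattern.compile("\\binterleave\\b(?:\\s*=\\s*(\\w+))?", Pattern.CASE_INSENSITIVE);
                        Matcher interMatcher = interPattern.matcher(fmtStr);
                        if (interMatcher.find()) interleave = interMatcher.group(1) == null || interMatcher.group(1).equalsIgnoreCase("yes");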
                    } else if (cmd.toLowerCase().startsWith("matrix")) {
                        String matStr = cmd.substring(6).trim();
                        String[] lines = matStr.split("\n");
                        for (String line : lines) {
                            line = line.trim();
                            if (!line.isEmpty()) {
                                String[] parts = line.split("\\s+", 2);
                                String tax = parts[0].replaceAll("['\"]", "");
                                if (parts.length > 1) {
                                    String data = parts[1].trim();
                                    matrix.merge(tax, data, String::concat); // Handle interleave
                                    if (!taxlabels.contains(tax)) taxlabels.add(tax);
                                }
                            }
                        }
                    }
                }
            } else if (blockName.equals("TAXA")) {
                for (String cmd : cmdList) {
                    if (cmd.toLowerCase().startsWith("taxlabels")) {
                        String[] labels = cmd.substring(9).trim().split("\\s+");
                        taxlabels.addAll(Arrays.asList(labels));
                    }
                }
            } else if (blockName.equals("SETS")) {
                for (String cmd : cmdList) {
                    if (cmd.toLowerCase().startsWith("charset") || cmd.toLowerCase().startsWith("taxset")) {
                        String[] parts = cmd.split("=");
                        if (parts.length > 1) {
                            String left = parts[0].trim();
                            String setType = left.split("\\s+")[0].toLowerCase();
                            String setName = left.split("\\s+")[1];
                            String setVal = parts[1].trim();
                            sets.computeIfAbsent(setType, k -> new HashMap<>()).put(setName, setVal);
                        }
                    }
                }
            } else if (blockName.equals("ASSUMPTIONS")) {
                for (String cmd : cmdList) {
                    if (cmd.toLowerCase().startsWith("exset")) {
                        String[] parts = cmd.split("=");
                        if (parts.length > 1) {
                            String left = parts[0].trim();
                            String assName = left.split("\\s+").length > 1 ? left.split("\\s+")[1] : "default";
                            String assVal = parts[1].trim();
                            assumptions.computeIfAbsent("exset", k -> new HashMap<>()).put(assName, assVal);
                        }
                    }
                }
            } else if (blockName.equals("TREES")) {
                Map<String, String> translate = new HashMap<>();
                for (String cmd : cmdList) {
                    if (cmd.toLowerCase().startsWith("translate")) {
                        String transStr = cmd.substring(9).trim().replace(",", " ");
                        String[] transParts = transStr.split("\\s+");
                        for (int i = 0; i < transParts.length; i += 2) {
                            if (i + 1 < transParts.length) {
                                translate.put(transParts[i], transParts[i + 1]);
                            }
                        }
                    } else if (cmd.toLowerCase().startsWith("tree")) {
                        Pattern treePattern = Pattern.compile("tree\\s+(\\w+)\\s*=\\s*([\\s\\S]*)", Pattern.CASE_INSENSITIVE);
                        Matcher treeMatcher = treePattern.matcher(cmd);
                        if (treeMatcher.find()) {
                            trees.add(new TreeEntry(treeMatcher.group(1), treeMatcher.group(2).trim(), new HashMap<>(translate)));
                        }
                    }
                }
            }
        }
    }

    public void printProperties() {
        System.out.println("Header: " + header);
        System.out.println("ntax: " + ntax);
        System.out.println("nchar: " + nchar);
        System.out.println("datatype: " + datatype);
        System.out.println("missing: " + missing);
        System.out.println("gap: " + gap);
        System.out.println("interleave: " + interleave);
        System.out.println("taxlabels: " + taxlabels);
        System.out.println("matrix:");
        for (Map.Entry<String, String> entry : matrix.entrySet()) {
            System.out.println("  " + entry.getKey() + ": " + entry.getValue());
        }
        System.out.println("sets:");
        for (Map.Entry<String, Map<String, String>> setEntry : sets.entrySet()) {
            for (Map.Entry<String, String> inner : setEntry.getValue().entrySet()) {
                System.out.println("  " + setEntry.getKey() + " " + inner.getKey() + ": " + inner.getValue());
            }
        }
        System.out.println("assumptions:");
        for (Map.Entry<String, Map<String, String>> assEntry : assumptions.entrySet()) {
            for (Map.Entry<String, String> inner : assEntry.getValue().entrySet()) {
                System.out.println("  " + assEntry.getKey() + " " + inner.getKey() + ": " + inner.getValue());
            }
        }
        System.out.println("trees:");
        for (TreeEntry tree : trees) {
            System.out.println("  " + tree.name + ": " + tree.desc);
            if (!tree.translate.isEmpty()) {
                System.out.println("    translate: " + tree.translate);
            }
        }
        System.out.println("comments:");
        for (String comment : comments) {
            System.out.println("  " + comment);
        }
    }

    public void write(String filename) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(filename))) {
            writer.write(header + "\n");
            if (!taxlabels.isEmpty()) {
                writer.write("begin taxa;\n");
                // Count the labels directly; ntax stays 0 when the input had no DATA block
                writer.write("  dimensions ntax=" + taxlabels.size() + ";\n");
                writer.write("  taxlabels");
                for (String tax : taxlabels) {
                    writer.write(" '" + tax + "'");
                }
                writer.write(";\n");
                writer.write("end;\n");
            }
            writer.write("begin data;\n");
            writer.write("  dimensions ntax=" + ntax + " nchar=" + nchar + ";\n");
            writer.write("  format datatype=" + datatype + " missing=" + missing + " gap=" + gap + " interleave=" + (interleave ? "yes" : "no") + ";\n");
            writer.write("  matrix\n");
            for (String tax : taxlabels) {
                if (matrix.containsKey(tax)) {
                    writer.write("    '" + tax + "' " + matrix.get(tax) + "\n");
                }
            }
            writer.write("  ;\n");
            writer.write("end;\n");
            if (!sets.isEmpty()) {
                writer.write("begin sets;\n");
                for (Map.Entry<String, Map<String, String>> setEntry : sets.entrySet()) {
                    for (Map.Entry<String, String> inner : setEntry.getValue().entrySet()) {
                        writer.write("  " + setEntry.getKey() + " " + inner.getKey() + " = " + inner.getValue() + ";\n");
                    }
                }
                writer.write("end;\n");
            }
            if (!assumptions.isEmpty()) {
                writer.write("begin assumptions;\n");
                for (Map.Entry<String, Map<String, String>> assEntry : assumptions.entrySet()) {
                    for (Map.Entry<String, String> inner : assEntry.getValue().entrySet()) {
                        writer.write("  " + assEntry.getKey() + " " + inner.getKey() + " = " + inner.getValue() + ";\n");
                    }
                }
                writer.write("end;\n");
            }
            if (!trees.isEmpty()) {
                writer.write("begin trees;\n");
                // Write the shared TRANSLATE table once, ahead of the trees,
                // with no comma after the last entry
                Map<String, String> translate = trees.get(0).translate;
                if (!translate.isEmpty()) {
                    List<String> pairs = new ArrayList<>();
                    for (Map.Entry<String, String> trans : translate.entrySet()) {
                        pairs.add("    " + trans.getKey() + " " + trans.getValue());
                    }
                    writer.write("  translate\n" + String.join(",\n", pairs) + "\n  ;\n");
                }
                for (TreeEntry tree : trees) {
                    writer.write("  tree " + tree.name + " = " + tree.desc + ";\n");
                }
                writer.write("end;\n");
            }
            // Comments omitted in write as non-structural
        }
    }
}

6. JavaScript Class for .NXS File Handling

This JavaScript class, suitable for Node.js (with fs) or browsers (omit fs for read/write), parses, reads, writes, and prints NEXUS properties.

const fs = require('fs'); // Node.js only

class NXS {
  constructor() {
    this.header = '#NEXUS';
    this.ntax = 0;
    this.nchar = 0;
    this.datatype = 'standard';
    this.missing = '?';
    this.gap = '-';
    this.interleave = false;
    this.taxlabels = [];
    this.matrix = {};
    this.sets = {};
    this.assumptions = {};
    this.trees = []; // Array of {name, desc, translate: {}}
    this.comments = [];
  }

  read(filename) {
    const text = fs.readFileSync(filename, 'utf8');
    this.parse(text);
  }

  parse(text) {
    // The #NEXUS header is case-insensitive
    if (!text.trim().toUpperCase().startsWith('#NEXUS')) {
      throw new Error('Not a valid NEXUS file');
    }

    // Comments
    // Non-nested comments that are not "printed" comments ([!...])
    const commentRegex = /\[(?!\s*!)[^\[\]]*\]/g;
    this.comments = text.match(commentRegex) || [];

    // A block ends with END; or ENDBLOCK;
    const blockRegex = /\bbegin\s+(\w+)\s*;([\s\S]*?)\bend(?:block)?\s*;/gi;
    let blockMatch;
    while ((blockMatch = blockRegex.exec(text)) !== null) {
      const blockName = blockMatch[1].toUpperCase();
      const content = blockMatch[2].trim();
      const commands = content.split(';').map(cmd => cmd.trim()).filter(cmd => cmd);

      if (blockName === 'DATA' || blockName === 'CHARACTERS') {
        commands.forEach(cmd => {
          if (cmd.toLowerCase().startsWith('dimensions')) {
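            // NTAX and NCHAR may appear alone (CHARACTERS has no NTAX) or in either order
            const ntaxMatch = cmd.match(/\bntax\s*=\s*(\d+)/i);
            if (ntaxMatch) this.ntax = parseInt(ntaxMatch[1], 10);
            const ncharMatch = cmd.match(/\bnchar\s*=\s*(\d+)/i);
            if (ncharMatch) this.nchar = parseInt(ncharMatch[1], 10);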
          } else if (cmd.toLowerCase().startsWith('format')) {
            const fmtStr = cmd.slice(6).trim();
            const dtMatch = fmtStr.match(/datatype=(\w+)/i);
            if (dtMatch) this.datatype = dtMatch[1].toLowerCase();
            const msMatch = fmtStr.match(/missing=(\S)/i);
            if (msMatch) this.missing = msMatch[1];
            const gpMatch = fmtStr.match(/gap=(\S)/i);
            if (gpMatch) this.gap = gpMatch[1];
            // INTERLEAVE is usually a bare keyword; INTERLEAVE=NO switches it off
            const interMatch = fmtStr.match(/\binterleave\b(?:\s*=\s*(\w+))?/i);
            if (interMatch) this.interleave = (interMatch[1] || 'yes').toLowerCase() === 'yes';
          } else if (cmd.toLowerCase().startsWith('matrix')) {
            const matStr = cmd.slice(6).trim();
            const lines = matStr.split('\n').map(l => l.trim()).filter(l => l);
            lines.forEach(line => {
              const parts = line.split(/\s+/);
              const tax = parts[0].replace(/['"]/g, '');
              const data = parts.slice(1).join('');
              if (this.matrix[tax]) {
                this.matrix[tax] += data; // Interleave
              } else {
                this.matrix[tax] = data;
              }
              if (!this.taxlabels.includes(tax)) this.taxlabels.push(tax);
            });
          }
        });
      } else if (blockName === 'TAXA') {
        commands.forEach(cmd => {
          if (cmd.toLowerCase().startsWith('taxlabels')) {
            const labels = cmd.slice(9).trim().split(/\s+/);
            this.taxlabels.push(...labels);
          }
        });
      } else if (blockName === 'SETS') {
        commands.forEach(cmd => {
          if (cmd.toLowerCase().startsWith('charset') || cmd.toLowerCase().startsWith('taxset')) {
            const parts = cmd.split('=');
            if (parts.length > 1) {
              const left = parts[0].trim();
              const setType = left.split(/\s+/)[0].toLowerCase();
              const setName = left.split(/\s+/)[1];
              const setVal = parts[1].trim();
              if (!this.sets[setType]) this.sets[setType] = {};
              this.sets[setType][setName] = setVal;
            }
          }
        });
      } else if (blockName === 'ASSUMPTIONS') {
        commands.forEach(cmd => {
          if (cmd.toLowerCase().startsWith('exset')) {
            const parts = cmd.split('=');
            if (parts.length > 1) {
              const left = parts[0].trim();
              const assName = left.split(/\s+/).length > 1 ? left.split(/\s+/)[1] : 'default';
              const assVal = parts[1].trim();
              if (!this.assumptions['exset']) this.assumptions['exset'] = {};
              this.assumptions['exset'][assName] = assVal;
            }
          }
        });
      } else if (blockName === 'TREES') {
        let translate = {};
        commands.forEach(cmd => {
          if (cmd.toLowerCase().startsWith('translate')) {
            const transStr = cmd.slice(9).trim().replace(/,/g, ' ').split(/\s+/);
            for (let i = 0; i < transStr.length; i += 2) {
              if (i + 1 < transStr.length) {
                translate[transStr[i]] = transStr[i + 1];
              }
            }
          } else if (cmd.toLowerCase().startsWith('tree')) {
            const treeMatch = cmd.match(/tree\s+(\w+)\s*=\s*([\s\S]*)/i);
            if (treeMatch) {
              this.trees.push({name: treeMatch[1], desc: treeMatch[2].trim(), translate: {...translate}});
            }
          }
        });
      }
    }
  }

  printProperties() {
    console.log(`Header: ${this.header}`);
    console.log(`ntax: ${this.ntax}`);
    console.log(`nchar: ${this.nchar}`);
    console.log(`datatype: ${this.datatype}`);
    console.log(`missing: ${this.missing}`);
    console.log(`gap: ${this.gap}`);
    console.log(`interleave: ${this.interleave}`);
    console.log(`taxlabels: ${this.taxlabels}`);
    console.log('matrix:');
    for (const [tax, data] of Object.entries(this.matrix)) {
      console.log(`  ${tax}: ${data}`);
    }
    console.log('sets:');
    for (const [setType, innerSets] of Object.entries(this.sets)) {
      for (const [name, val] of Object.entries(innerSets)) {
        console.log(`  ${setType} ${name}: ${val}`);
      }
    }
    console.log('assumptions:');
    for (const [assType, innerAss] of Object.entries(this.assumptions)) {
      for (const [name, val] of Object.entries(innerAss)) {
        console.log(`  ${assType} ${name}: ${val}`);
      }
    }
    console.log('trees:');
    this.trees.forEach(tree => {
      console.log(`  ${tree.name}: ${tree.desc}`);
      if (Object.keys(tree.translate).length > 0) {
        console.log(`    translate: ${JSON.stringify(tree.translate)}`);
      }
    });
    console.log('comments:');
    this.comments.forEach(comment => {
      console.log(`  ${comment}`);
    });
  }

  write(filename) {
    let output = `${this.header}\n`;
    if (this.taxlabels.length > 0) {
      output += 'begin taxa;\n';
      output += `  dimensions ntax=${this.ntax};\n`;
      output += '  taxlabels ' + this.taxlabels.map(tax => `'${tax}'`).join(' ') + ';\n';
      output += 'end;\n';
    }
    output += 'begin data;\n';
    output += `  dimensions ntax=${this.ntax} nchar=${this.nchar};\n`;
    output += `  format datatype=${this.datatype} missing=${this.missing} gap=${this.gap} interleave=${this.interleave ? 'yes' : 'no'};\n`;
    output += '  matrix\n';
    this.taxlabels.forEach(tax => {
      if (this.matrix[tax]) {
        output += `    '${tax}' ${this.matrix[tax]}\n`;
      }
    });
    output += '  ;\n';
    output += 'end;\n';
    if (Object.keys(this.sets).length > 0) {
      output += 'begin sets;\n';
      for (const [setType, innerSets] of Object.entries(this.sets)) {
        for (const [name, val] of Object.entries(innerSets)) {
          output += `  ${setType} ${name} = ${val};\n`;
        }
      }
      output += 'end;\n';
    }
    if (Object.keys(this.assumptions).length > 0) {
      output += 'begin assumptions;\n';
      for (const [assType, innerAss] of Object.entries(this.assumptions)) {
        for (const [name, val] of Object.entries(innerAss)) {
          output += `  ${assType} ${name} = ${val};\n`;
        }
      }
      output += 'end;\n';
    }
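    if (this.trees.length > 0) {
      output += 'begin trees;\n';
      // Write the translate table once, before the first tree. Every tree parsed
      // from the same block carries a copy of the same table.
      const entries = Object.entries(this.trees[0].translate || {});
      if (entries.length > 0) {
        // Pairs are comma-separated, with no comma after the last one.
        const pairs = entries.map(([token, tname]) => `    ${token} ${tname}`);
        output += '  translate\n' + pairs.join(',\n') + '\n  ;\n';
      }
      this.trees.forEach(tree => {
        output += `  tree ${tree.name} = ${tree.desc};\n`;
      });
      output += 'end;\n';
    }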
    fs.writeFileSync(filename, output);
  }
}

7. C Implementation for .NXS File Handling

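Because C has no classes, this implementation uses a struct and a set of functions to open, parse, print, and write NEXUS properties. It relies on standard string functions and on POSIX regular expressions from <regex.h>. glibc and macOS ship these in the C library, so no extra link flag is needed there; some other toolchains may need a separate regex library (for example -lregex). Keywords are matched in lowercase only, and the fixed limits (MAX_TAXA, MAX_SETS, MAX_TREES) cap how much is stored.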

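#define _POSIX_C_SOURCE 200809L  /* strdup, strndup, strtok_r */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>  /* strcasecmp, strncasecmp */
#include <ctype.h>
#include <regex.h>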

#define MAX_TAXA 100
#define MAX_SETS 50
#define MAX_TREES 50
#define MAX_LINE 1024
#define MAX_COMMENT 1024

typedef struct {
    char *name;
    char *val;
} KeyVal;

typedef struct {
    char *name;
    char *desc;
    KeyVal *translate;
    int trans_count;
} Tree;

typedef struct {
    char *header;
    int ntax;
    int nchar;
    char *datatype;
    char missing;
    char gap;
    int interleave;
    char **taxlabels;
    char **matrix_data;
    int num_taxa;
    KeyVal *sets;
    int sets_count;
    KeyVal *assumptions;
    int ass_count;
    Tree *trees;
    int num_trees;
    char **comments;
    int comments_count;
} NXS;

void init_NXS(NXS *self) {
    self->header = strdup("#NEXUS");
    self->ntax = 0;
    self->nchar = 0;
    self->datatype = strdup("standard");
    self->missing = '?';
    self->gap = '-';
    self->interleave = 0;
    self->taxlabels = calloc(MAX_TAXA, sizeof(char*));
    self->matrix_data = calloc(MAX_TAXA, sizeof(char*));
    self->num_taxa = 0;
    self->sets = calloc(MAX_SETS, sizeof(KeyVal));
    self->sets_count = 0;
    self->assumptions = calloc(MAX_SETS, sizeof(KeyVal));
    self->ass_count = 0;
    self->trees = calloc(MAX_TREES, sizeof(Tree));
    self->num_trees = 0;
    self->comments = calloc(MAX_COMMENT, sizeof(char*));
    self->comments_count = 0;
    for (int i = 0; i < MAX_TREES; i++) {
        self->trees[i].translate = calloc(MAX_TAXA, sizeof(KeyVal));
        self->trees[i].trans_count = 0;
    }
}

void free_NXS(NXS *self) {
    free(self->header);
    free(self->datatype);
    for (int i = 0; i < self->num_taxa; i++) {
        free(self->taxlabels[i]);
        free(self->matrix_data[i]);
    }
    free(self->taxlabels);
    free(self->matrix_data);
    for (int i = 0; i < self->sets_count; i++) {
        free(self->sets[i].name);
        free(self->sets[i].val);
    }
    free(self->sets);
    for (int i = 0; i < self->ass_count; i++) {
        free(self->assumptions[i].name);
        free(self->assumptions[i].val);
    }
    free(self->assumptions);
    /* init_NXS allocates a translate table for every slot, so release all
       MAX_TREES of them; unused names and descriptions are NULL (calloc). */
    for (int i = 0; i < MAX_TREES; i++) {
        free(self->trees[i].name);
        free(self->trees[i].desc);
        for (int j = 0; j < self->trees[i].trans_count; j++) {
            free(self->trees[i].translate[j].name);
            free(self->trees[i].translate[j].val);
        }
        free(self->trees[i].translate);
    }
    free(self->trees);
    for (int i = 0; i < self->comments_count; i++) {
        free(self->comments[i]);
    }
    free(self->comments);
}

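char* read_file(const char* filename) {
    FILE* fp = fopen(filename, "r");
    if (!fp) return NULL;
    if (fseek(fp, 0, SEEK_END) != 0) { fclose(fp); return NULL; }
    long size = ftell(fp);
    if (size < 0) { fclose(fp); return NULL; }
    fseek(fp, 0, SEEK_SET);
    char* text = malloc((size_t)size + 1);
    if (!text) { fclose(fp); return NULL; }
    /* Terminate at the number of bytes actually read, which can be smaller
       than size when the file is opened in text mode. */
    size_t n = fread(text, 1, (size_t)size, fp);
    text[n] = '\0';
    fclose(fp);
    return text;
}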

void parse_comments(NXS *self, const char *text) {
    regex_t regex;
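    /* POSIX ERE has no \s or lazy quantifiers, so match "[", one character
       that is neither "!" nor "]", then everything up to the first "]".
       Empty and nested comments are not handled. */
    if (regcomp(&regex, "\\[[^]!][^]]*]", REG_EXTENDED) != 0) return;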
    regmatch_t match;
    int offset = 0;
    while (regexec(&regex, text + offset, 1, &match, 0) == 0 && self->comments_count < MAX_COMMENT) {
        int len = match.rm_eo - match.rm_so;
        self->comments[self->comments_count] = strndup(text + offset + match.rm_so, len);
        self->comments_count++;
        offset += match.rm_eo;
    }
    regfree(&regex);
}

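void parse(NXS *self, char *text) {
    /* The magic string is case-insensitive. */
    if (strncasecmp(text, "#NEXUS", 6) != 0) {
        fprintf(stderr, "Not a valid NEXUS file\n");
        return;
    }

    parse_comments(self, text);

    /* Note: block and command keywords are matched in lowercase only. */
    char *block_start = strstr(text, "begin");
    while (block_start) {
        char block_name[50];
        /* Stop the name at ';' or whitespace so "begin data;" yields "data". */
        if (sscanf(block_start, "begin %49[^; \t\r\n]", block_name) != 1) break;
        char *semi = strchr(block_start, ';');
        if (!semi) break;
        char *content_start = semi + 1;
        char *end = strstr(content_start, "end;");
        if (!end) break;
        /* Tokenize the block in place: terminating it at "end;" avoids a
           fixed-size copy that large blocks would overflow. This modifies
           the caller's buffer. */
        char *content = content_start;
        *end = '\0';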

        if (strcasecmp(block_name, "data") == 0 || strcasecmp(block_name, "characters") == 0) {
            /* strtok_r keeps the per-line split of the matrix from clobbering
               the per-command split of the block. */
            char *save_cmd = NULL;
            char *cmd = strtok_r(content, ";", &save_cmd);
            while (cmd) {
                char *kw;
                if ((kw = strstr(cmd, "dimensions"))) {
                    /* ntax and nchar are read independently; either may be absent. */
                    char *p = strstr(kw, "ntax=");
                    if (p) sscanf(p, "ntax=%d", &self->ntax);
                    p = strstr(kw, "nchar=");
                    if (p) sscanf(p, "nchar=%d", &self->nchar);
                } else if ((kw = strstr(cmd, "format"))) {
                    char *dt = strstr(kw, "datatype=");
                    if (dt) {
                        char dt_val[64];
                        if (sscanf(dt, "datatype=%63s", dt_val) == 1) {
                            free(self->datatype);  /* the default buffer is too small to reuse */
                            self->datatype = strdup(dt_val);
                        }
                    }
                    char *ms = strstr(kw, "missing=");
                    if (ms) sscanf(ms, "missing=%c", &self->missing);
                    char *gp = strstr(kw, "gap=");
                    if (gp) sscanf(gp, "gap=%c", &self->gap);
                    char *inter = strstr(kw, "interleave");
                    if (inter) {
                        /* A bare "interleave" means yes; only "=no" disables it. */
                        self->interleave = strncasecmp(inter, "interleave=no", 13) != 0;
                    }
                } else if ((kw = strstr(cmd, "matrix"))) {
                    char *save_line = NULL;
                    char *line = strtok_r(kw + strlen("matrix"), "\n", &save_line);
                    while (line) {
                        while (isspace((unsigned char)*line)) line++;
                        char tax[100], data[MAX_LINE];  /* 1023 below = MAX_LINE - 1 */
                        if (*line &&
                            (sscanf(line, "'%99[^']' %1023[^\r\n]", tax, data) == 2 ||
                             sscanf(line, "%99s %1023[^\r\n]", tax, data) == 2)) {
                            int idx = -1;
                            for (int i = 0; i < self->num_taxa; i++) {
                                if (strcmp(self->taxlabels[i], tax) == 0) {
                                    idx = i;
                                    break;
                                }
                            }
                            if (idx == -1 && self->num_taxa < MAX_TAXA) {
                                idx = self->num_taxa++;
                                self->taxlabels[idx] = strdup(tax);
                            }
                            if (idx != -1) {
                                /* Append: covers interleaved rows and taxa already
                                   declared in a TAXA block (matrix_data still NULL). */
                                char *old = self->matrix_data[idx];
                                size_t old_len = old ? strlen(old) : 0;
                                char *new_data = malloc(old_len + strlen(data) + 1);
                                if (new_data) {
                                    if (old) memcpy(new_data, old, old_len);
                                    strcpy(new_data + old_len, data);
                                    free(old);
                                    self->matrix_data[idx] = new_data;
                                }
                            }
                        }
                        line = strtok_r(NULL, "\n", &save_line);
                    }
                }
                cmd = strtok_r(NULL, ";", &save_cmd);
            }
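        } else if (strcasecmp(block_name, "taxa") == 0) {
            char *save_cmd = NULL;
            char *cmd = strtok_r(content, ";", &save_cmd);
            while (cmd) {
                char *kw = strstr(cmd, "taxlabels");
                if (kw) {
                    /* Start after the keyword itself; cmd may begin with whitespace. */
                    char *save_tax = NULL;
                    char *tax = strtok_r(kw + strlen("taxlabels"), " \t\r\n", &save_tax);
                    while (tax && self->num_taxa < MAX_TAXA) {
                        /* Strip quotes around a single-word label; quoted labels
                           containing spaces are not supported here. */
                        size_t n = strlen(tax);
                        if (n >= 2 && tax[0] == '\'' && tax[n - 1] == '\'') {
                            tax[n - 1] = '\0';
                            tax++;
                        }
                        self->taxlabels[self->num_taxa++] = strdup(tax);
                        tax = strtok_r(NULL, " \t\r\n", &save_tax);
                    }
                } else if ((kw = strstr(cmd, "ntax="))) {
                    sscanf(kw, "ntax=%d", &self->ntax);
                }
                cmd = strtok_r(NULL, ";", &save_cmd);
            }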
        } else if (strcasecmp(block_name, "sets") == 0) {
            char *cmd = strtok(content, ";");
            while (cmd) {
                if (strstr(cmd, "charset") || strstr(cmd, "taxset")) {
                    char set_type[10], set_name[50], set_val[MAX_LINE];
                    /* Store the set only when type, name, and value all parse
                       (1023 = MAX_LINE - 1). */
                    if (sscanf(cmd, "%9s %49[^= \t\r\n] = %1023[^\n]", set_type, set_name, set_val) == 3
                            && self->sets_count < MAX_SETS) {
                        /* "type name" keeps the command valid when written back. */
                        char *full_name = malloc(strlen(set_type) + strlen(set_name) + 2);
                        sprintf(full_name, "%s %s", set_type, set_name);
                        self->sets[self->sets_count].name = full_name;
                        self->sets[self->sets_count].val = strdup(set_val);
                        self->sets_count++;
                    }
                }
                cmd = strtok(NULL, ";");
            }
        } else if (strcasecmp(block_name, "assumptions") == 0) {
            char *cmd = strtok(content, ";");
            while (cmd) {
                char *kw = strstr(cmd, "exset");
                if (kw) {
                    char ass_name[50], ass_val[MAX_LINE];
                    /* Scan from the keyword: cmd may begin with whitespace. */
                    int ok = sscanf(kw, "exset %49[^= \t\r\n] = %1023[^\n]", ass_name, ass_val) == 2;
                    if (!ok) {
                        /* Unnamed form: "exset = ..." */
                        strcpy(ass_name, "default");
                        ok = sscanf(kw, "exset = %1023[^\n]", ass_val) == 1;
                    }
                    if (ok && self->ass_count < MAX_SETS) {
                        self->assumptions[self->ass_count].name = strdup(ass_name);
                        self->assumptions[self->ass_count].val = strdup(ass_val);
                        self->ass_count++;
                    }
                }
                cmd = strtok(NULL, ";");
            }
        } else if (strcasecmp(block_name, "trees") == 0) {
            char *save_cmd = NULL;
            char *cmd = strtok_r(content, ";", &save_cmd);
            int trans_idx = 0;
            int tree_idx = self->num_trees;
            while (cmd && tree_idx < MAX_TREES) {
                char *kw;
                if ((kw = strstr(cmd, "translate"))) {
                    /* strtok_r keeps this inner split from ending the outer one,
                       which would otherwise drop every tree after the table. */
                    char *save_pair = NULL;
                    char *pair = strtok_r(kw + strlen("translate"), ",", &save_pair);
                    while (pair && trans_idx < MAX_TAXA) {
                        char token[50], tname[100];
                        if (sscanf(pair, "%49s %99s", token, tname) == 2) {
                            self->trees[tree_idx].translate[trans_idx].name = strdup(token);
                            self->trees[tree_idx].translate[trans_idx].val = strdup(tname);
                            trans_idx++;
                        }
                        pair = strtok_r(NULL, ",", &save_pair);
                    }
                    self->trees[tree_idx].trans_count = trans_idx;
                } else if ((kw = strstr(cmd, "tree"))) {
                    char name[50];
                    char *eq = strchr(kw, '=');
                    if (eq && sscanf(kw, "tree %49[^= \t\r\n]", name) == 1) {
                        /* Take everything after '=' so long descriptions are not truncated. */
                        char *desc = eq + 1;
                        while (isspace((unsigned char)*desc)) desc++;
                        self->trees[tree_idx].name = strdup(name);
                        self->trees[tree_idx].desc = strdup(desc);
                        tree_idx++;
                        self->num_trees = tree_idx;
                        trans_idx = 0; // Reset for next tree
                    }
                }
                cmd = strtok_r(NULL, ";", &save_cmd);
            }
        }

        block_start = strstr(end + 4, "begin");
    }
}

void read_NXS(NXS *self, const char *filename) {
    char *text = read_file(filename);
    if (text) {
        parse(self, text);
        free(text);
    }
}

void print_NXS(const NXS *self) {
    printf("Header: %s\n", self->header);
    printf("ntax: %d\n", self->ntax);
    printf("nchar: %d\n", self->nchar);
    printf("datatype: %s\n", self->datatype);
    printf("missing: %c\n", self->missing);
    printf("gap: %c\n", self->gap);
    printf("interleave: %d\n", self->interleave);
    printf("taxlabels:\n");
    for (int i = 0; i < self->num_taxa; i++) {
        printf("  %s\n", self->taxlabels[i]);
    }
    printf("matrix:\n");
    for (int i = 0; i < self->num_taxa; i++) {
        /* A taxon declared in a TAXA block may have no matrix row. */
        printf("  %s: %s\n", self->taxlabels[i], self->matrix_data[i] ? self->matrix_data[i] : "");
    }
    printf("sets:\n");
    for (int i = 0; i < self->sets_count; i++) {
        printf("  %s: %s\n", self->sets[i].name, self->sets[i].val);
    }
    printf("assumptions:\n");
    for (int i = 0; i < self->ass_count; i++) {
        printf("  %s: %s\n", self->assumptions[i].name, self->assumptions[i].val);
    }
    printf("trees:\n");
    for (int i = 0; i < self->num_trees; i++) {
        printf("  %s: %s\n", self->trees[i].name, self->trees[i].desc);
        if (self->trees[i].trans_count > 0) {
            printf("    translate:\n");
            for (int j = 0; j < self->trees[i].trans_count; j++) {
                printf("      %s: %s\n", self->trees[i].translate[j].name, self->trees[i].translate[j].val);
            }
        }
    }
    printf("comments:\n");
    for (int i = 0; i < self->comments_count; i++) {
        printf("  %s\n", self->comments[i]);
    }
}

void write_NXS(const NXS *self, const char *filename) {
    FILE *fp = fopen(filename, "w");
    if (!fp) return;

    fprintf(fp, "%s\n", self->header);
    if (self->num_taxa > 0) {
        fprintf(fp, "begin taxa;\n");
        fprintf(fp, "  dimensions ntax=%d;\n", self->ntax);
        fprintf(fp, "  taxlabels");
        for (int i = 0; i < self->num_taxa; i++) {
            fprintf(fp, " '%s'", self->taxlabels[i]);
        }
        fprintf(fp, ";\n");
        fprintf(fp, "end;\n");
    }
    fprintf(fp, "begin data;\n");
    fprintf(fp, "  dimensions ntax=%d nchar=%d;\n", self->ntax, self->nchar);
    fprintf(fp, "  format datatype=%s missing=%c gap=%c interleave=%s;\n", 
            self->datatype, self->missing, self->gap, self->interleave ? "yes" : "no");
    fprintf(fp, "  matrix\n");
    for (int i = 0; i < self->num_taxa; i++) {
        if (!self->matrix_data[i]) continue;  /* skip taxa without a matrix row */
        fprintf(fp, "    '%s' %s\n", self->taxlabels[i], self->matrix_data[i]);
    }
    fprintf(fp, "  ;\n");
    fprintf(fp, "end;\n");
    if (self->sets_count > 0) {
        fprintf(fp, "begin sets;\n");
        for (int i = 0; i < self->sets_count; i++) {
            fprintf(fp, "  %s = %s;\n", self->sets[i].name, self->sets[i].val);
        }
        fprintf(fp, "end;\n");
    }
    if (self->ass_count > 0) {
        fprintf(fp, "begin assumptions;\n");
        for (int i = 0; i < self->ass_count; i++) {
            fprintf(fp, "  exset %s = %s;\n", self->assumptions[i].name, self->assumptions[i].val);
        }
        fprintf(fp, "end;\n");
    }
    if (self->num_trees > 0) {
        fprintf(fp, "begin trees;\n");
        for (int i = 0; i < self->num_trees; i++) {
            if (self->trees[i].trans_count > 0) {
                fprintf(fp, "  translate\n");
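                for (int j = 0; j < self->trees[i].trans_count; j++) {
                    /* Separate pairs with commas; no comma after the last pair. */
                    fprintf(fp, "    %s %s%s\n", self->trees[i].translate[j].name,
                            self->trees[i].translate[j].val,
                            j + 1 < self->trees[i].trans_count ? "," : "");
                }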
                fprintf(fp, "  ;\n");
            }
            fprintf(fp, "  tree %s = %s;\n", self->trees[i].name, self->trees[i].desc);
        }
        fprintf(fp, "end;\n");
    }
    // Comments not written

    fclose(fp);
}
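
The TAXA parsing above splits labels on whitespace, so a quoted label that contains a space is broken into several tokens. The sketch below shows one way a quote-aware label reader could look. The function next_label is a hypothetical helper and not part of the listing. It assumes the usual NEXUS quoting convention that a doubled single quote inside a quoted label stands for one literal quote. It does not convert underscores to spaces or skip bracketed comments.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of the listing above.
 * Reads the next taxon label starting at *cursor into out (capacity cap),
 * advances *cursor past it, and returns 1. Returns 0 when no label is left.
 * Labels longer than cap - 1 characters are truncated. */
int next_label(const char **cursor, char *out, size_t cap) {
    assert(cap > 0);
    const char *p = *cursor;
    size_t n = 0;

    while (isspace((unsigned char)*p)) p++;       /* skip leading whitespace */
    if (*p == '\0') { *cursor = p; return 0; }

    if (*p == '\'') {
        p++;                                       /* step over the opening quote */
        while (*p) {
            if (*p == '\'') {
                if (p[1] == '\'') p++;             /* '' stands for one quote */
                else { p++; break; }               /* closing quote ends the label */
            }
            if (n + 1 < cap) out[n++] = *p;
            p++;
        }
    } else {
        /* Unquoted label: runs until the next whitespace character. */
        while (*p && !isspace((unsigned char)*p)) {
            if (n + 1 < cap) out[n++] = *p;
            p++;
        }
    }
    out[n] = '\0';
    *cursor = p;
    return 1;
}
```

Called repeatedly on the text that follows the taxlabels keyword, with a cursor initialised to the start of that text, the helper yields one label per call until it returns 0. Each result could then be passed to strdup in place of the strtok tokens used in the listing.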