Task 534: .PFAM File Format

File Format Specifications for .PFAM

An extensive search turned up no standard file format with the exact extension .PFAM. The term "PFAM" is strongly associated with the Pfam database of protein families, whose alignments are stored in Stockholm format (commonly with .sto or .stk extensions). The PC Matic file extension database lists .pfam as a "PFAM File" but gives no further details or specification, and no example files with that extension were located. I am therefore assuming the query refers to Pfam alignment files in Stockholm format, the native format Pfam uses for sequence alignments and annotations. The specifications below follow Stockholm format 1.0 as used by Pfam.

The Stockholm format is a text-based multiple sequence alignment format that supports annotations via markup lines. It starts with a header, contains sequence lines and optional markup, and ends with a terminator. It is not binary; it's ASCII text. The format is flexible, with no strict byte offsets, but it has structured tags for properties.
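To make the header / sequence / markup / terminator structure concrete, here is a minimal sketch in Python. The alignment in `EXAMPLE` is made up for illustration, and `classify` is a hypothetical helper rather than part of any Pfam tooling.

```python
# A tiny, invented Stockholm alignment used only for illustration.
EXAMPLE = """# STOCKHOLM 1.0
#=GF ID EXAMPLE
seq1/1-8 ACDEFGHI
seq2/3-9 ACD-FGHI
#=GC SS_cons HHHH....
//
"""

def classify(line):
    """Return a coarse label for one line of a Stockholm file."""
    if line.startswith("# STOCKHOLM"):
        return "header"
    if line.startswith("//"):
        return "terminator"
    if line.startswith("#=G"):      # #=GF, #=GS, #=GR or #=GC markup
        return "markup"
    if line.startswith("#"):        # any other '#' line is a plain comment
        return "comment"
    if line.strip():
        return "sequence"
    return "blank"

labels = [classify(line) for line in EXAMPLE.splitlines()]
print(labels)
```

The order of the checks matters: the header and the `#=G…` markup lines both start with `#`, so they have to be tested before the generic comment case.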

  1. List of all the properties of this file format intrinsic to its file system.

The properties are the annotation tags (markup) in the Stockholm format. These are not "file system" properties like metadata in NTFS or ext4, but intrinsic to the format itself (e.g., fields, headers, annotations). The format doesn't have built-in file system-specific attributes beyond what a text file has (e.g., size, modification time), but the content properties are:

  • Header: # STOCKHOLM 1.0 (required, indicates format version).
  • Sequence lines: Sequence name (e.g., "name/start-end") followed by whitespace and the aligned sequence (letters, gaps as . or -).
  • End marker: // (required, ends the alignment).
  • #=GF <feature> <free text>: generic per-file annotations. Tags used by Pfam (some compulsory, some optional):
      • AC: Accession number (e.g., PFxxxxx).
      • ID: Identification (one-word name).
      • DE: Definition (short description).
      • AU: Author.
      • SE: Source of seed.
      • SS: Source of structure.
      • BM: Build method.
      • SM: Search method.
      • GA: Gathering threshold.
      • TC: Trusted cutoff.
      • NC: Noise cutoff.
      • TP: Type (e.g., Family, Domain).
      • SQ: Sequence count.
      • DC: Database comment.
      • DR: Database reference.
      • RC: Reference comment.
      • RN: Reference number.
      • RM: Reference Medline.
      • RT: Reference title.
      • RA: Reference author.
      • RL: Reference location.
      • PI: Previous identifier.
      • KW: Keywords.
      • CC: Comment.
      • NE: Pfam accession (nested domain).
      • NL: Location of nested domains.
      • WK: Wikipedia link.
      • CL: Clan accession.
      • MB: Membership.
      • NH: New Hampshire tree.
      • TN: Tree ID.
      • FR: False discovery rate.
      • CB: Calibration method.
  • #=GS <seqname> <feature> <free text>: generic per-sequence annotations.
      • AC: Accession.
      • DE: Description.
      • DR: Database reference.
      • OS: Organism.
      • OC: Organism classification.
      • LO: Look (e.g., color).
  • #=GR <seqname> <feature> <annotation>: generic per-residue annotations (1 char per residue).
      • SS: Secondary structure (protein: HGIEBTSCX; RNA: .,;<>(){}[]AaBb.-_).
      • SA: Surface accessibility (0-9X).
      • TM: Transmembrane (Mio).
      • PP: Posterior probability (0-9*).
      • LI: Ligand binding (*).
      • AS: Active site (*).
      • pAS: Pfam predicted active site (*).
      • sAS: SwissProt active site (*).
      • IN: Intron (0-2).
      • tWW, cWH, cWS, tWS, etc.: RNA tertiary interactions.
  • #=GC <feature> <annotation>: generic per-column annotations (1 char per column).
      • RF: Reference annotation (often the consensus sequence; non-gap characters mark match columns, . or - mark insert columns, ~ marks unaligned insertions).
      • MM: Model mask.
      • SS_cons: Consensus secondary structure.
      • SA_cons: Consensus surface accessibility.
      • Other _cons variants for consensus annotations (e.g., PP_cons).
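
Because #=GR and #=GC annotations carry one character per residue or column, their strings should be exactly as wide as the aligned sequences. The following Python sketch checks that invariant. The function name `column_mismatches` and the dictionary layout (sequence name to string, and sequence name to tag to string) are assumptions chosen for illustration.

```python
def column_mismatches(sequences, gr, gc):
    """List the entries whose length differs from the alignment width."""
    if not sequences:
        return []
    # Take the first sequence as the reference width of the alignment.
    width = len(next(iter(sequences.values())))
    bad = [name for name, seq in sequences.items() if len(seq) != width]
    for name, feats in gr.items():
        bad += [f"#=GR {name} {tag}" for tag, text in feats.items() if len(text) != width]
    bad += [f"#=GC {tag}" for tag, text in gc.items() if len(text) != width]
    return bad

sequences = {"seq1/1-8": "ACDEFGHI", "seq2/3-9": "ACD-FGHI"}
gr = {"seq1/1-8": {"SS": "HHHHEEEE"}}
gc = {"SS_cons": "HHHH..."}  # deliberately one column short
problems = column_mismatches(sequences, gr, gc)
print(problems)
```

A check like this is only meaningful after wrapped (interleaved) blocks have been concatenated per sequence and per tag.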

These properties are text-based and parsed line by line. The format supports multiple alignments in one file (separated by //), but Pfam typically has one per family in individual files or concatenated in large files.
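
As a rough illustration of line-by-line parsing and of concatenated files, the Python sketch below splits text on the // terminator and collects #=GF tags per alignment. The names `split_records` and `gf_annotations` are hypothetical, and the two-family input is invented. Tags such as CC can repeat within one alignment, so each tag maps to a list of values rather than a single string.

```python
from collections import defaultdict

def split_records(text):
    """Yield the non-blank lines of each alignment in a concatenated file."""
    record = []
    for line in text.splitlines():
        if line.startswith("//"):
            if record:
                yield record
            record = []
        elif line.strip():
            record.append(line)
    if record:  # tolerate a missing final terminator
        yield record

def gf_annotations(record):
    """Map each #=GF tag to the list of values it carries, in file order."""
    tags = defaultdict(list)
    for line in record:
        if line.startswith("#=GF"):
            parts = line.split(maxsplit=2)  # ['#=GF', tag, free text]
            if len(parts) >= 2:
                tags[parts[1]].append(parts[2] if len(parts) > 2 else "")
    return dict(tags)

TWO_FAMILIES = """# STOCKHOLM 1.0
#=GF ID FamA
#=GF CC first comment line
#=GF CC second comment line
a/1-4 ACDE
//
# STOCKHOLM 1.0
#=GF ID FamB
b/1-4 FGHI
//
"""

records = list(split_records(TWO_FAMILIES))
for rec in records:
    print(gf_annotations(rec))
```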

  2. Two direct download links for files of format .PFAM.

No files with the exact .PFAM extension were found. Using Stockholm (.sto) files from Pfam examples as substitutes:

  3. Ghost blog embedded HTML JavaScript for drag and drop to dump properties.

"Ghost blog embedded" seems to mean HTML/JS code that can be embedded in a Ghost blog post. The code below is self-contained HTML with JS to handle drag-and-drop of a Stockholm file and display all properties (tags and their values).

Drag and drop .PFAM (Stockholm) file here
  4. Python class for opening, decoding, reading, writing, and printing properties.
class PfamFile:
    def __init__(self, filepath):
        self.filepath = filepath
        # 'Sequences' maps sequence name -> aligned sequence, so it starts as a dict.
        self.properties = {'GF': {}, 'GS': {}, 'GR': {}, 'GC': {}, 'Sequences': {}, 'Header': None}
        self.read()

    def read(self):
        with open(self.filepath, 'r') as f:
            content = f.read()
        lines = content.split('\n')
        current_seq = {}
        for line in lines:
            line = line.strip()
            if not line or line.startswith('//'): continue
            if line.startswith('# STOCKHOLM'):
                self.properties['Header'] = line
            elif line.startswith('#=GF'):
                parts = line[5:].strip().split(maxsplit=1)
                self.properties['GF'][parts[0]] = parts[1] if len(parts) > 1 else ''
            elif line.startswith('#=GS'):
                parts = line[5:].strip().split(maxsplit=2)
                seqname = parts[0]
                feature = parts[1]
                text = parts[2] if len(parts) > 2 else ''
                if seqname not in self.properties['GS']:
                    self.properties['GS'][seqname] = {}
                self.properties['GS'][seqname][feature] = text
            elif line.startswith('#=GR'):
                parts = line[5:].strip().split(maxsplit=2)
                seqname = parts[0]
                feature = parts[1]
                text = parts[2] if len(parts) > 2 else ''
                if seqname not in self.properties['GR']:
                    self.properties['GR'][seqname] = {}
                self.properties['GR'][seqname][feature] = text
            elif line.startswith('#=GC'):
                parts = line[5:].strip().split(maxsplit=1)
                self.properties['GC'][parts[0]] = parts[1] if len(parts) > 1 else ''
            else:
                parts = line.split()
                if len(parts) >= 2:
                    seqname = parts[0]
                    # Aligned sequence text must not contain spaces, so join without a separator.
                    sequence = ''.join(parts[1:])
                    if seqname in current_seq:
                        current_seq[seqname] += sequence
                    else:
                        current_seq[seqname] = sequence
        self.properties['Sequences'] = current_seq

    def print_properties(self):
        import json
        print(json.dumps(self.properties, indent=2))

    def write(self, new_filepath=None):
        filepath = new_filepath or self.filepath
        with open(filepath, 'w') as f:
            if self.properties['Header']:
                f.write(self.properties['Header'] + '\n')
            for feature, text in self.properties['GF'].items():
                f.write(f'#=GF {feature} {text}\n')
            for seqname, features in self.properties['GS'].items():
                for feature, text in features.items():
                    f.write(f'#=GS {seqname} {feature} {text}\n')
            for seqname, sequence in self.properties['Sequences'].items():
                f.write(f'{seqname} {sequence}\n')
                if seqname in self.properties['GR']:
                    for feature, text in self.properties['GR'][seqname].items():
                        f.write(f'#=GR {seqname} {feature} {text}\n')
            for feature, text in self.properties['GC'].items():
                f.write(f'#=GC {feature} {text}\n')
            f.write('//\n')

# Example usage:
# pfam = PfamFile('example.sto')
# pfam.print_properties()
# pfam.write('output.sto')
  5. Java class for opening, decoding, reading, writing, and printing properties.
import java.io.*;
import java.util.*;

public class PfamFile {
    private String filepath;
    // LinkedHashMap preserves insertion order, so GF/GC tags and sequences are
    // written back in the order they were read (HashMap order is unspecified).
    private Map<String, Object> properties = new LinkedHashMap<>();

    public PfamFile(String filepath) {
        this.filepath = filepath;
        properties.put("GF", new LinkedHashMap<String, String>());
        properties.put("GS", new LinkedHashMap<String, Map<String, String>>());
        properties.put("GR", new LinkedHashMap<String, Map<String, String>>());
        properties.put("GC", new LinkedHashMap<String, String>());
        properties.put("Sequences", new LinkedHashMap<String, String>());
        read();
    }

    private void read() {
        try (BufferedReader br = new BufferedReader(new FileReader(filepath))) {
            String line;
            Map<String, String> currentSeq = (Map<String, String>) properties.get("Sequences");
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("//")) continue;
                if (line.startsWith("# STOCKHOLM")) {
                    properties.put("Header", line);
                } else if (line.startsWith("#=GF")) {
                    String[] parts = line.substring(5).trim().split("\\s+", 2);
                    ((Map<String, String>) properties.get("GF")).put(parts[0], parts.length > 1 ? parts[1] : "");
                } else if (line.startsWith("#=GS")) {
                    String[] parts = line.substring(5).trim().split("\\s+", 3);
                    String seqname = parts[0];
                    String feature = parts[1];
                    String text = parts.length > 2 ? parts[2] : "";
                    ((Map<String, Map<String, String>>) properties.get("GS")).computeIfAbsent(seqname, k -> new HashMap<>()).put(feature, text);
                } else if (line.startsWith("#=GR")) {
                    String[] parts = line.substring(5).trim().split("\\s+", 3);
                    String seqname = parts[0];
                    String feature = parts[1];
                    String text = parts.length > 2 ? parts[2] : "";
                    ((Map<String, Map<String, String>>) properties.get("GR")).computeIfAbsent(seqname, k -> new HashMap<>()).put(feature, text);
                } else if (line.startsWith("#=GC")) {
                    String[] parts = line.substring(5).trim().split("\\s+", 2);
                    ((Map<String, String>) properties.get("GC")).put(parts[0], parts.length > 1 ? parts[1] : "");
                } else {
                    String[] parts = line.split("\\s+");
                    if (parts.length >= 2) {
                        String seqname = parts[0];
                        String sequence = String.join("", Arrays.asList(parts).subList(1, parts.length));
                        currentSeq.put(seqname, currentSeq.getOrDefault(seqname, "") + sequence);
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void printProperties() {
        System.out.println(properties);
    }

    public void write(String newFilepath) {
        if (newFilepath == null) newFilepath = filepath;
        try (PrintWriter pw = new PrintWriter(new File(newFilepath))) {
            if (properties.containsKey("Header")) pw.println(properties.get("Header"));
            ((Map<String, String>) properties.get("GF")).forEach((k, v) -> pw.println("#=GF " + k + " " + v));
            ((Map<String, Map<String, String>>) properties.get("GS")).forEach((seq, feats) -> feats.forEach((k, v) -> pw.println("#=GS " + seq + " " + k + " " + v)));
            ((Map<String, String>) properties.get("Sequences")).forEach((seq, seqStr) -> {
                pw.println(seq + " " + seqStr);
                if (((Map<String, Map<String, String>>) properties.get("GR")).containsKey(seq)) {
                    ((Map<String, Map<String, String>>) properties.get("GR")).get(seq).forEach((k, v) -> pw.println("#=GR " + seq + " " + k + " " + v));
                }
            });
            ((Map<String, String>) properties.get("GC")).forEach((k, v) -> pw.println("#=GC " + k + " " + v));
            pw.println("//");
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }

    // Example usage:
    // public static void main(String[] args) {
    //     PfamFile pfam = new PfamFile("example.sto");
    //     pfam.printProperties();
    //     pfam.write("output.sto");
    // }
}
  6. JavaScript class for opening, decoding, reading, writing, and printing properties.

Note: JS doesn't have native file I/O like Python or Java; this assumes Node.js with 'fs' module for read/write. For browser, use FileReader for read, and Blob for write (download).

const fs = require('fs'); // For Node.js

class PfamFile {
    constructor(filepath) {
        this.filepath = filepath;
        this.properties = { GF: {}, GS: {}, GR: {}, GC: {}, Sequences: {}, Header: null };
        this.read();
    }

    read() {
        const content = fs.readFileSync(this.filepath, 'utf8');
        const lines = content.split('\n');
        for (let line of lines) {
            line = line.trim();
            if (!line || line.startsWith('//')) continue;
            if (line.startsWith('# STOCKHOLM')) {
                this.properties.Header = line;
            } else if (line.startsWith('#=GF')) {
                const parts = line.slice(5).trim().split(/\s+/);
                this.properties.GF[parts[0]] = parts.slice(1).join(' ');
            } else if (line.startsWith('#=GS')) {
                const parts = line.slice(5).trim().split(/\s+/);
                const seqname = parts[0];
                const feature = parts[1];
                const text = parts.slice(2).join(' ');
                if (!this.properties.GS[seqname]) this.properties.GS[seqname] = {};
                this.properties.GS[seqname][feature] = text;
            } else if (line.startsWith('#=GR')) {
                const parts = line.slice(5).trim().split(/\s+/);
                const seqname = parts[0];
                const feature = parts[1];
                const text = parts.slice(2).join(' ');
                if (!this.properties.GR[seqname]) this.properties.GR[seqname] = {};
                this.properties.GR[seqname][feature] = text;
            } else if (line.startsWith('#=GC')) {
                const parts = line.slice(5).trim().split(/\s+/);
                this.properties.GC[parts[0]] = parts.slice(1).join(' ');
            } else {
                const parts = line.split(/\s+/);
                if (parts.length >= 2) {
                    const seqname = parts[0];
                    const sequence = parts.slice(1).join('');
                    this.properties.Sequences[seqname] = (this.properties.Sequences[seqname] || '') + sequence;
                }
            }
        }
    }

    printProperties() {
        console.log(JSON.stringify(this.properties, null, 2));
    }

    write(newFilepath = this.filepath) {
        let content = this.properties.Header ? this.properties.Header + '\n' : '';
        for (let [feature, text] of Object.entries(this.properties.GF)) {
            content += `#=GF ${feature} ${text}\n`;
        }
        for (let [seqname, feats] of Object.entries(this.properties.GS)) {
            for (let [feature, text] of Object.entries(feats)) {
                content += `#=GS ${seqname} ${feature} ${text}\n`;
            }
        }
        for (let [seqname, sequence] of Object.entries(this.properties.Sequences)) {
            content += `${seqname} ${sequence}\n`;
            if (this.properties.GR[seqname]) {
                for (let [feature, text] of Object.entries(this.properties.GR[seqname])) {
                    content += `#=GR ${seqname} ${feature} ${text}\n`;
                }
            }
        }
        for (let [feature, text] of Object.entries(this.properties.GC)) {
            content += `#=GC ${feature} ${text}\n`;
        }
        content += '//\n';
        fs.writeFileSync(newFilepath, content);
    }
}

// Example usage:
// const pfam = new PfamFile('example.sto');
// pfam.printProperties();
// pfam.write('output.sto');
  7. C class for opening, decoding, reading, writing, and printing properties.

In C, "class" is simulated with structs and functions. This uses stdio for file I/O.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *filepath;
    char *header;
    // Use simple arrays or linked lists for properties; for simplicity, assume max 100 properties
    struct { char key[256]; char value[1024]; } gf[100];
    int gf_count;
    // For GS, GR: nested, use simple string storage for demo
    char gs[100][1024];
    int gs_count;
    char gr[100][1024];
    int gr_count;
    struct { char key[256]; char value[1024]; } gc[100];
    int gc_count;
    struct { char name[256]; char seq[1024]; } sequences[100];
    int seq_count;
} PfamFile;

void init_pfam(PfamFile *pf, char *filepath) {
    pf->filepath = strdup(filepath);
    pf->header = NULL;
    pf->gf_count = 0;
    pf->gs_count = 0;
    pf->gr_count = 0;
    pf->gc_count = 0;
    pf->seq_count = 0;
}

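void read_pfam(PfamFile *pf) {
    FILE *f = fopen(pf->filepath, "r");
    if (!f) return;
    char line[2048];  /* lines longer than this are read in pieces and mis-parsed */
    while (fgets(line, sizeof(line), f)) {
        line[strcspn(line, "\r\n")] = 0;  /* strip LF or CRLF line endings */
        if (strlen(line) == 0 || strncmp(line, "//", 2) == 0) continue;
        if (strncmp(line, "# STOCKHOLM", 11) == 0) {
            free(pf->header);  /* keep only the most recent header */
            pf->header = strdup(line);
        } else if (strncmp(line, "#=GF", 4) == 0) {
            if (pf->gf_count >= 100) continue;  /* table full: drop the line */
            /* Clear both fields first: sscanf leaves them untouched if a part is missing. */
            pf->gf[pf->gf_count].key[0] = pf->gf[pf->gf_count].value[0] = 0;
            sscanf(line + 4, "%255s %1023[^\n]", pf->gf[pf->gf_count].key, pf->gf[pf->gf_count].value);
            pf->gf_count++;
        } else if (strncmp(line, "#=GS", 4) == 0) {
            if (pf->gs_count >= 100) continue;
            /* snprintf always NUL-terminates, unlike strncpy */
            snprintf(pf->gs[pf->gs_count++], 1024, "%s", line + 4 + strspn(line + 4, " \t"));
        } else if (strncmp(line, "#=GR", 4) == 0) {
            if (pf->gr_count >= 100) continue;
            snprintf(pf->gr[pf->gr_count++], 1024, "%s", line + 4 + strspn(line + 4, " \t"));
        } else if (strncmp(line, "#=GC", 4) == 0) {
            if (pf->gc_count >= 100) continue;
            pf->gc[pf->gc_count].key[0] = pf->gc[pf->gc_count].value[0] = 0;
            sscanf(line + 4, "%255s %1023[^\n]", pf->gc[pf->gc_count].key, pf->gc[pf->gc_count].value);
            pf->gc_count++;
        } else if (line[0] == '#') {
            continue;  /* plain comment line, not a sequence */
        } else {
            char name[256] = "", seq[1024] = "";
            if (sscanf(line, "%255s %1023[^\n]", name, seq) < 2) continue;
            int i;
            for (i = 0; i < pf->seq_count; i++) {
                if (strcmp(pf->sequences[i].name, name) == 0) {
                    /* Append a wrapped block, truncating instead of overflowing. */
                    strncat(pf->sequences[i].seq, seq,
                            sizeof(pf->sequences[i].seq) - strlen(pf->sequences[i].seq) - 1);
                    break;
                }
            }
            if (i == pf->seq_count && pf->seq_count < 100) {
                strcpy(pf->sequences[pf->seq_count].name, name);
                strcpy(pf->sequences[pf->seq_count].seq, seq);
                pf->seq_count++;
            }
        }
    }
    fclose(f);
}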

void print_properties(PfamFile *pf) {
    if (pf->header) printf("Header: %s\n", pf->header);
    printf("GF properties:\n");
    for (int i = 0; i < pf->gf_count; i++) printf("%s: %s\n", pf->gf[i].key, pf->gf[i].value);
    printf("GS properties:\n");
    for (int i = 0; i < pf->gs_count; i++) printf("%s\n", pf->gs[i]);
    printf("GR properties:\n");
    for (int i = 0; i < pf->gr_count; i++) printf("%s\n", pf->gr[i]);
    printf("GC properties:\n");
    for (int i = 0; i < pf->gc_count; i++) printf("%s: %s\n", pf->gc[i].key, pf->gc[i].value);
    printf("Sequences:\n");
    for (int i = 0; i < pf->seq_count; i++) printf("%s: %s\n", pf->sequences[i].name, pf->sequences[i].seq);
}

void write_pfam(PfamFile *pf, char *new_filepath) {
    if (!new_filepath) new_filepath = pf->filepath;
    FILE *f = fopen(new_filepath, "w");
    if (!f) return;  /* could not open the output file */
    if (pf->header) fprintf(f, "%s\n", pf->header);
    for (int i = 0; i < pf->gf_count; i++) fprintf(f, "#=GF %s %s\n", pf->gf[i].key, pf->gf[i].value);
    for (int i = 0; i < pf->gs_count; i++) fprintf(f, "#=GS %s\n", pf->gs[i]);
    for (int i = 0; i < pf->seq_count; i++) {
        fprintf(f, "%s %s\n", pf->sequences[i].name, pf->sequences[i].seq);
        // Assume GR are printed after sequences; simplistic
    }
    for (int i = 0; i < pf->gr_count; i++) fprintf(f, "#=GR %s\n", pf->gr[i]);
    for (int i = 0; i < pf->gc_count; i++) fprintf(f, "#=GC %s %s\n", pf->gc[i].key, pf->gc[i].value);
    fprintf(f, "//\n");
    fclose(f);
}

void free_pfam(PfamFile *pf) {
    free(pf->filepath);
    if (pf->header) free(pf->header);
}

// Example usage:
// int main() {
//     PfamFile pf;
//     init_pfam(&pf, "example.sto");
//     read_pfam(&pf);
//     print_properties(&pf);
//     write_pfam(&pf, "output.sto");
//     free_pfam(&pf);
//     return 0;
// }