Task 550: .PIR File Format

File Format Specifications for .PIR

The .PIR file format (also known as the NBRF or NBRF/PIR format) is a text-based format for storing biological sequences, both protein and nucleotide. It originated at the National Biomedical Research Foundation (NBRF), home of the Protein Information Resource (PIR), and is read or written by bioinformatics tools such as Modeller, Biopython, and EMBOSS. A single file can hold multiple sequences; the format is line-based, and specific marker characters delimit each entry.

The structure is as follows:

  • Header Line: Starts with ">" followed by a two-letter type code: "P1" for a complete protein, "F1" for a protein fragment, "DL" for linear DNA, "DC" for circular DNA, "RL" for linear RNA, "RC" for circular RNA, or "XX" for unknown/other. The type code is followed by a semicolon ";" and a unique sequence ID (typically 4-12 characters).
  • Description Line: A single line with a textual description of the sequence (e.g., protein name or function).
  • Sequence Data: One or more lines of sequence characters (e.g., A, C, G, T for nucleotides, or the standard one-letter amino acid codes for proteins). Lines may contain leading or embedded spaces for alignment, which parsers discard. The sequence ends with a "*" asterisk after the last residue.
  • Optional Lines: After the "*", additional lines may describe features or annotations, but these are ignored by most sequence parsers.
  • Files are ASCII text, with no strict line length limits, but sequences are often wrapped for readability.
  • Multiple sequences can be concatenated in one file, each starting with its own header.
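
The structure above can be sketched on a single record. This is a minimal illustration, not a full parser: the helper name `parse_header`, the sample ID `EXAMPLE1`, and the sample residues are all made up for the example.

```python
def parse_header(line):
    """Split a header such as '>P1;EXAMPLE1' into (type_code, seq_id)."""
    if not line.startswith(">"):
        raise ValueError("header must start with '>'")
    # partition() splits on the first ';' only
    type_code, sep, seq_id = line[1:].partition(";")
    if not sep:
        raise ValueError("header is missing the ';' separator")
    return type_code.strip(), seq_id.strip()


# Hypothetical record: header, description, then wrapped sequence ending in '*'
SAMPLE = """>P1;EXAMPLE1
Hypothetical example protein
MKTAYIAKQR QISFVKSHFS
RQLEERLGLI*
"""

lines = SAMPLE.splitlines()
type_code, seq_id = parse_header(lines[0])
description = lines[1]
# Join the sequence lines, drop all whitespace, then drop the '*' terminator
sequence = "".join("".join(lines[2:]).split()).rstrip("*")

print(type_code, seq_id)
print(description)
print(sequence)
```

Splitting on the first ";" only keeps any later semicolons inside the ID rather than silently discarding them.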

Variations exist. Alignment tools such as Modeller use an aligned PIR variant in which sequences are padded with gap characters ("-") so that all entries have equal length. In that variant the ID often encodes a structure and chain (e.g., >P1;1bdmA), and the description line carries colon-separated fields such as the structure code, chain IDs, and residue numbers.
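
For the Modeller variant, the extra fields can be pulled apart with a simple split. The sketch below assumes the convention described in the Modeller manual, in which the line following `>P1;code` holds ten colon-separated fields. The field labels in `MODELLER_FIELDS` are illustrative names chosen here, not official identifiers, and the residue numbers, resolution, and R-factor in the sample line are invented.

```python
# Illustrative labels for the ten colon-separated fields (an assumption of this sketch)
MODELLER_FIELDS = (
    "kind", "pdb_code", "start_residue", "start_chain",
    "end_residue", "end_chain", "protein_name", "source",
    "resolution", "r_factor",
)


def parse_modeller_description(line):
    """Map a Modeller-style description line onto named fields."""
    parts = [p.strip() for p in line.split(":")]
    # Pad short lines so every field name is present in the result
    parts += [""] * (len(MODELLER_FIELDS) - len(parts))
    return dict(zip(MODELLER_FIELDS, parts))


# Made-up values, shown only to illustrate the layout
fields = parse_modeller_description(
    "structureX:1bdm:1:A:100:A:example protein:example organism:2.00:0.20"
)
print(fields["kind"], fields["pdb_code"], fields["start_chain"])
```

Padding with empty strings means a short line such as `sequence:target` still yields a dictionary with all ten keys, which keeps downstream code free of key checks.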

Properties intrinsic to the .PIR file format:

  • File extension: .pir (case-insensitive, often lowercase).
  • MIME type: text/plain (as it's ASCII text).
  • Encoding: ASCII/UTF-8.
  • Multi-sequence support: Yes, multiple entries per file.
  • Header marker: ">" character at the start of each entry.
  • Type code: Two-letter code (e.g., P1, DL) indicating sequence type.
  • Separator: Semicolon ";" between type code and ID.
  • Sequence ID: Unique alphanumeric identifier (4-12 characters typically).
  • Description: Single-line text string describing the sequence.
  • Sequence data: String of one-letter codes, possibly multi-line, with optional spaces or gaps ("-").
  • Terminator: "*" asterisk at the end of the sequence data.
  • Optional annotation lines: Text lines after "*", not part of the core sequence.
  • Line endings: Platform-independent (CRLF or LF).
  • No binary data; fully text-based.
  • No fixed size or checksum; variable length based on sequence.
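
Several of these properties can be checked mechanically. The following is a rough sketch of such a check for one entry; the function name `check_pir_entry` is hypothetical, and the accepted residue alphabet (letters and "-") is an assumption that real files, particularly Modeller alignments, may exceed.

```python
import re

# '>' + two-character type code + ';' + a non-empty ID without spaces
HEADER_RE = re.compile(r"^>([A-Za-z0-9]{2});(\S+)\s*$")


def check_pir_entry(text):
    """Return a list of problems found in a single PIR entry (empty if none)."""
    problems = []
    lines = text.splitlines()  # handles both LF and CRLF line endings
    if not lines or not HEADER_RE.match(lines[0]):
        problems.append("header must look like '>XX;ID'")
    if len(lines) < 2:
        problems.append("missing description line")
    body = "".join(lines[2:])
    if "*" not in body:
        problems.append("sequence is not terminated by '*'")
    else:
        # Keep only what precedes the terminator and drop whitespace
        residues = "".join(body.split("*", 1)[0].split())
        if not residues:
            problems.append("sequence is empty")
        elif not re.fullmatch(r"[A-Za-z\-]+", residues):
            problems.append("sequence contains unexpected characters")
    return problems


print(check_pir_entry(">P1;ID1\nDemo entry\nACDE\nFGH*\n"))
print(check_pir_entry(">P1;ID1\nDemo entry\nACDE\n"))
```

Returning a list rather than raising on the first problem lets a caller report every issue in an entry at once.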

Python class that reads a .PIR file, writes it back out, and prints to the console all the properties listed above:
class PirFile:
    def __init__(self, filepath=None):
        self.entries = []
        if filepath:
            self.read(filepath)

    def read(self, filepath):
        with open(filepath, 'r') as f:
            content = f.read()
        lines = content.splitlines()
        current = None
        need_description = False
        in_sequence = False
        for line in lines:
            if line.startswith('>'):
                if current:
                    self.entries.append(current)
                # Split on the first ';' only, so an ID containing ';' stays intact
                parts = line[1:].split(';', 1)
                current = {
                    'type_code': parts[0].strip(),
                    'id': parts[1].strip() if len(parts) > 1 else '',
                    'description': '',
                    'sequence': '',
                    'optional_lines': []
                }
                need_description = True
                in_sequence = False
            elif current and need_description:
                # The line after the header is the description, even when blank
                current['description'] = line.strip()
                need_description = False
                in_sequence = True
            elif current and in_sequence:
                # Drop all whitespace; '*' terminates the sequence
                stripped = ''.join(line.split())
                if '*' in stripped:
                    current['sequence'] += stripped.split('*', 1)[0]
                    in_sequence = False
                else:
                    current['sequence'] += stripped
            elif current and line.strip():
                # Non-blank lines after '*' are kept as optional annotations
                current['optional_lines'].append(line.strip())
        if current:
            self.entries.append(current)

    def print_properties(self):
        for idx, entry in enumerate(self.entries, 1):
            print(f"Entry {idx}:")
            print(f"  Type Code: {entry['type_code']}")
            print(f"  Sequence ID: {entry['id']}")
            print(f"  Description: {entry['description']}")
            print(f"  Sequence: {entry['sequence']}")
            print(f"  Optional Lines: {', '.join(entry['optional_lines'])}")
            print()

    def write(self, filepath):
        with open(filepath, 'w') as f:
            for entry in self.entries:
                f.write(f">{entry['type_code']};{entry['id']}\n")
                f.write(f"{entry['description']}\n")
                # Wrap sequence for readability (60 chars per line)
                seq = entry['sequence']
                for i in range(0, len(seq), 60):
                    f.write(f"{seq[i:i+60]}\n")
                f.write("*\n")
                for opt in entry['optional_lines']:
                    f.write(f"{opt}\n")

# Example usage:
# pir = PirFile('example.pir')
# pir.print_properties()
# pir.write('output.pir')
Java class that reads a .PIR file, writes it back out, and prints to the console all the properties listed above:
import java.io.*;
import java.util.ArrayList;
import java.util.List;

public class PirFile {
    private List<Entry> entries = new ArrayList<>();

    static class Entry {
        String typeCode;
        String id;
        String description;
        String sequence;
        List<String> optionalLines = new ArrayList<>();
    }

    public void read(String filepath) throws IOException {
        entries.clear();
        try (BufferedReader reader = new BufferedReader(new FileReader(filepath))) {
            String line;
            Entry current = null;
            boolean inSequence = false;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(">")) {
                    if (current != null) {
                        entries.add(current);
                    }
                    // Limit the split to 2 so an ID containing ';' stays intact
                    String[] parts = line.substring(1).split(";", 2);
                    current = new Entry();
                    current.typeCode = parts[0].trim();
                    current.id = (parts.length > 1) ? parts[1].trim() : "";
                    current.sequence = "";
                    inSequence = false;
                } else if (current != null && current.description == null) {
                    current.description = line.trim();
                    inSequence = true;
                } else if (current != null && inSequence) {
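                    // Drop all whitespace, including tabs
                    String stripped = line.replaceAll("\\s+", "");
                    int star = stripped.indexOf('*');
                    if (star >= 0) {
                        // '*' terminates the sequence; ignore anything after it
                        current.sequence += stripped.substring(0, star);
                        inSequence = false;
                    } else {
                        current.sequence += stripped;
                    }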
                } else if (current != null && !inSequence) {
                    current.optionalLines.add(line.trim());
                }
            }
            if (current != null) {
                entries.add(current);
            }
        }
    }

    public void printProperties() {
        for (int i = 0; i < entries.size(); i++) {
            Entry entry = entries.get(i);
            System.out.println("Entry " + (i + 1) + ":");
            System.out.println("  Type Code: " + entry.typeCode);
            System.out.println("  Sequence ID: " + entry.id);
            System.out.println("  Description: " + entry.description);
            System.out.println("  Sequence: " + entry.sequence);
            System.out.println("  Optional Lines: " + String.join(", ", entry.optionalLines));
            System.out.println();
        }
    }

    public void write(String filepath) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(filepath))) {
            for (Entry entry : entries) {
                writer.write(">" + entry.typeCode + ";" + entry.id + "\n");
                writer.write(entry.description + "\n");
                String seq = entry.sequence;
                for (int j = 0; j < seq.length(); j += 60) {
                    writer.write(seq.substring(j, Math.min(j + 60, seq.length())) + "\n");
                }
                writer.write("*\n");
                for (String opt : entry.optionalLines) {
                    writer.write(opt + "\n");
                }
            }
        }
    }

    // Example usage:
    // public static void main(String[] args) throws IOException {
    //     PirFile pir = new PirFile();
    //     pir.read("example.pir");
    //     pir.printProperties();
    //     pir.write("output.pir");
    // }
}
JavaScript class (Node.js version) that reads a .PIR file, writes it back out, and prints to the console all the properties listed above:
const fs = require('fs');

class PirFile {
  constructor(filepath = null) {
    this.entries = [];
    if (filepath) {
      this.read(filepath);
    }
  }

  read(filepath) {
    const content = fs.readFileSync(filepath, 'utf8');
    const lines = content.split(/\r?\n/);
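    let current = null;
    let needDescription = false;
    let inSequence = false;
    this.entries = [];

    for (const line of lines) {
      if (line.startsWith('>')) {
        if (current) this.entries.push(current);
        // Split on the first ';' only, so an ID containing ';' stays intact
        const header = line.slice(1);
        const semi = header.indexOf(';');
        current = {
          typeCode: (semi >= 0 ? header.slice(0, semi) : header).trim(),
          id: semi >= 0 ? header.slice(semi + 1).trim() : '',
          description: '',
          sequence: '',
          optionalLines: []
        };
        needDescription = true;
        inSequence = false;
      } else if (current && needDescription) {
        // The line after the header is the description, even when blank
        current.description = line.trim();
        needDescription = false;
        inSequence = true;
      } else if (current && inSequence) {
        const stripped = line.replace(/\s+/g, '');
        const star = stripped.indexOf('*');
        if (star >= 0) {
          // '*' terminates the sequence; ignore anything after it
          current.sequence += stripped.slice(0, star);
          inSequence = false;
        } else {
          current.sequence += stripped;
        }
      } else if (current && line.trim()) {
        // Skip blank lines, including the empty string after a final newline
        current.optionalLines.push(line.trim());
      }
    }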
    }
    if (current) this.entries.push(current);
  }

  printProperties() {
    this.entries.forEach((entry, index) => {
      console.log(`Entry ${index + 1}:`);
      console.log(`  Type Code: ${entry.typeCode}`);
      console.log(`  Sequence ID: ${entry.id}`);
      console.log(`  Description: ${entry.description}`);
      console.log(`  Sequence: ${entry.sequence}`);
      console.log(`  Optional Lines: ${entry.optionalLines.join(', ')}`);
      console.log();
    });
  }

  write(filepath) {
    let output = '';
    this.entries.forEach(entry => {
      output += `>${entry.typeCode};${entry.id}\n`;
      output += `${entry.description}\n`;
      const seq = entry.sequence;
      for (let i = 0; i < seq.length; i += 60) {
        output += `${seq.slice(i, i + 60)}\n`;
      }
      output += '*\n';
      entry.optionalLines.forEach(opt => {
        output += `${opt}\n`;
      });
    });
    fs.writeFileSync(filepath, output);
  }
}

// Example usage:
// const pir = new PirFile('example.pir');
// pir.printProperties();
// pir.write('output.pir');
C++ class (C has no classes, so C++ is used) that reads a .PIR file, writes it back out, and prints to the console all the properties listed above:
#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

struct Entry {
    std::string typeCode;
    std::string id;
    std::string description;
    std::string sequence;
    std::vector<std::string> optionalLines;
};

class PirFile {
private:
    std::vector<Entry> entries;

public:
    void read(const std::string& filepath) {
        std::ifstream file(filepath);
        if (!file.is_open()) {
            std::cerr << "Error opening file" << std::endl;
            return;
        }
        entries.clear();
        std::string line;
        Entry current;
        bool haveEntry = false;        // true once a header has been seen
        bool needDescription = false;  // true until the line after the header is read
        bool inSequence = false;
        const char* ws = " \t\r";      // '\r' covers CRLF files read on Unix
        auto trim = [ws](std::string s) -> std::string {
            s.erase(0, s.find_first_not_of(ws));
            s.erase(s.find_last_not_of(ws) + 1);
            return s;
        };
        while (std::getline(file, line)) {
            if (!line.empty() && line[0] == '>') {
                if (haveEntry) {
                    entries.push_back(current);
                }
                current = Entry();
                haveEntry = true;
                std::string header = line.substr(1);
                size_t semiPos = header.find(';');
                if (semiPos != std::string::npos) {
                    current.typeCode = trim(header.substr(0, semiPos));
                    current.id = trim(header.substr(semiPos + 1));
                } else {
                    current.typeCode = trim(header);
                }
                needDescription = true;
                inSequence = false;
            } else if (!haveEntry) {
                continue;  // ignore anything before the first header
            } else if (needDescription) {
                // The line after the header is the description, even when blank
                current.description = trim(line);
                needDescription = false;
                inSequence = true;
            } else if (inSequence) {
                // Copy everything up to '*', dropping whitespace
                for (char c : line) {
                    if (c == '*') { inSequence = false; break; }
                    if (c != ' ' && c != '\t' && c != '\r') current.sequence += c;
                }
            } else {
                std::string opt = trim(line);
                if (!opt.empty()) {
                    current.optionalLines.push_back(opt);
                }
            }
        }
        if (haveEntry) {
            entries.push_back(current);
        }
        file.close();
    }

    void printProperties() const {
        for (size_t i = 0; i < entries.size(); ++i) {
            const Entry& entry = entries[i];
            std::cout << "Entry " << (i + 1) << ":" << std::endl;
            std::cout << "  Type Code: " << entry.typeCode << std::endl;
            std::cout << "  Sequence ID: " << entry.id << std::endl;
            std::cout << "  Description: " << entry.description << std::endl;
            std::cout << "  Sequence: " << entry.sequence << std::endl;
            std::cout << "  Optional Lines: ";
            for (size_t j = 0; j < entry.optionalLines.size(); ++j) {
                std::cout << entry.optionalLines[j];
                if (j < entry.optionalLines.size() - 1) std::cout << ", ";
            }
            std::cout << std::endl << std::endl;
        }
    }

    void write(const std::string& filepath) const {
        std::ofstream file(filepath);
        if (!file.is_open()) {
            std::cerr << "Error opening file for write" << std::endl;
            return;
        }
        for (const auto& entry : entries) {
            file << ">" << entry.typeCode << ";" << entry.id << "\n";
            file << entry.description << "\n";
            std::string seq = entry.sequence;
            for (size_t i = 0; i < seq.length(); i += 60) {
                file << seq.substr(i, 60) << "\n";
            }
            file << "*\n";
            for (const auto& opt : entry.optionalLines) {
                file << opt << "\n";
            }
        }
        file.close();
    }
};

// Example usage:
// int main() {
//     PirFile pir;
//     pir.read("example.pir");
//     pir.printProperties();
//     pir.write("output.pir");
//     return 0;
// }