Task 214: .FASTA File Format

1. Properties of the .FASTA File Format Intrinsic to Its File System

The .FASTA file format is a plain-text representation of biological sequences, either nucleotide or amino acid. It has no magic number or fixed binary header of the kind found in structured file formats; its structure is conveyed entirely by textual conventions. Its key intrinsic properties are:

  • Text Encoding: ASCII or UTF-8, ensuring compatibility with plain text editors.
  • File Extensions: Commonly .fasta, .fa, .fas, .fna (for nucleic acid sequences), .faa (for amino acid sequences), .ffn, .frn, or .fsa.
  • MIME Type: text/plain or application/x-fasta, indicating its plain text nature.
  • Structure: Composed of one or more sequence records, with no fixed overall header or footer.
  • Record Delimiter: Each sequence record begins with a header line starting with the character '>', and ends at the next '>' or the end of the file.
  • Header Format: The header line follows the pattern >[identifier][space][optional description], where the identifier is a unique string without spaces (e.g., an accession number), and the description provides contextual information.
  • Sequence Data: Follows the header on subsequent lines, consisting of single-letter codes for bases or amino acids, typically wrapped at 60-80 characters per line for readability, though not strictly enforced.
  • Allowed Characters (Nucleotides): A, C, G, T, U, N (unknown), R, Y, S, W, K, M, B, D, H, V (IUPAC ambiguity codes), - (gap). Parsers commonly treat letters case-insensitively.
  • Allowed Characters (Amino Acids): A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y, B (D or N), J (I or L), O (pyrrolysine), U (selenocysteine), X (unknown), Z (E or Q), * (stop), - (gap).
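  • Line Constraints: Blank lines within sequence data should be avoided, since parsers handle them inconsistently; whitespace inside sequence lines is generally ignored, whereas it is significant in header lines.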
  • Multi-Sequence Support: Permits multiple records in a single file, making it suitable for databases or alignments.
  • Versioning: No formal version number; evolves through community conventions without a governing standard.

These properties define the format's simplicity and flexibility, with no dependencies on specific file system attributes beyond standard text file handling.
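
The character sets listed above can be checked mechanically. The following is a minimal sketch, assuming the two alphabets exactly as given in the list; the names `NUCLEOTIDE_CODES`, `AMINO_ACID_CODES`, `invalid_characters`, and `guess_alphabet` are illustrative and not part of any standard library.

```python
# Alphabets copied from the property list above ('-' is a gap, '*' a stop).
NUCLEOTIDE_CODES = set("ACGTUNRYSWKMBDHV-")
AMINO_ACID_CODES = set("ACDEFGHIKLMNPQRSTVWYBJOUXZ*-")


def invalid_characters(sequence, alphabet):
    """Return the sorted list of characters in `sequence` not in `alphabet`.

    The comparison is case-insensitive.
    """
    return sorted(set(sequence.upper()) - alphabet)


def guess_alphabet(sequence):
    """Heuristically label a sequence 'nucleotide', 'protein', or 'unknown'.

    Every nucleotide code is also a valid amino acid code, so a protein
    made only of such letters would be labelled 'nucleotide'.
    """
    if not sequence:
        return "unknown"
    if not invalid_characters(sequence, NUCLEOTIDE_CODES):
        return "nucleotide"
    if not invalid_characters(sequence, AMINO_ACID_CODES):
        return "protein"
    return "unknown"
```

Because the nucleotide alphabet is a subset of the amino acid alphabet, the guess is only a heuristic; the file extension (.fna versus .faa) is often a more reliable hint.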

The following are two direct download links to sample .FASTA files, sourced from public repositories for illustrative purposes:

These links point to raw file content, enabling direct download via web browsers or tools like wget.

3. Ghost Blog Embedded HTML JavaScript for Drag-and-Drop .FASTA File Processing

The following is a self-contained HTML snippet with embedded JavaScript suitable for embedding in a Ghost blog post (or any HTML-compatible platform). It creates a drop zone where users can drag and drop a .FASTA file. Upon dropping, the script reads the file, parses it according to the format specifications, extracts the properties (e.g., identifiers, descriptions, sequence lengths, and full sequences), and displays them on the screen in a structured format. Error handling is included for invalid files.

Drag and drop a .FASTA file here.

This script handles parsing by splitting lines, identifying headers, and accumulating sequences. It displays truncated sequences to avoid overwhelming the output.
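
The parsing and truncation steps described above can be sketched in Python, the language of the class in the next section. This is an illustration of the logic only, not the drop-zone script itself; the function names `parse_fasta_text` and `preview` and the 50-character display limit are assumptions.

```python
def parse_fasta_text(text):
    """Parse FASTA text into a list of (identifier, description, sequence)."""
    records = []
    header = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            if header is not None:
                records.append(_make_record(header, chunks))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    if not records:
        raise ValueError("no FASTA records found (missing '>' header line)")
    return records


def _make_record(header, chunks):
    # Identifier is everything up to the first space; the rest is description.
    identifier, _, description = header.partition(" ")
    return identifier, description.strip(), "".join(chunks)


def preview(sequence, limit=50):
    """Truncate a sequence for display, marking the cut with '...'."""
    if len(sequence) <= limit:
        return sequence
    return sequence[:limit] + "..."
```

Raising `ValueError` when no header line is found is one way to surface an invalid file to the user; a browser script would typically show a message in the page instead.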

4. Python Class for .FASTA File Handling

The following Python class can open a .FASTA file, decode (parse) its content, read and print the properties to the console, and write a new or modified .FASTA file. It uses standard libraries for file I/O.

import os

class FastaHandler:
    def __init__(self, filepath):
        self.filepath = filepath
        self.sequences = []
        self.properties = {}

    def read_and_decode(self):
        """Opens and decodes the .FASTA file, parsing into sequences."""
        if not os.path.exists(self.filepath):
            raise FileNotFoundError(f"File not found: {self.filepath}")
        with open(self.filepath, 'r', encoding='utf-8') as f:
            content = f.read()
        lines = content.split('\n')
        self.sequences = []  # reset so repeated calls do not duplicate records
        current_seq = None
        for line in lines:
            line = line.strip()
            if line.startswith('>'):
                if current_seq:
                    self.sequences.append(current_seq)
                header = line[1:].strip()
                identifier, *desc = header.split(' ', 1)
                current_seq = {
                    'identifier': identifier,
                    'description': desc[0] if desc else '',
                    'sequence': ''
                }
            elif current_seq:
                current_seq['sequence'] += line
        if current_seq:
            self.sequences.append(current_seq)
        self.properties = {
            'file_extension': os.path.splitext(self.filepath)[1],
            'encoding': 'ASCII/UTF-8',
            'structure': 'Multi-sequence records',
            'num_sequences': len(self.sequences),
            # Per-record metadata only: the sequence is replaced by its length.
            'sequences': [
                {
                    'identifier': seq['identifier'],
                    'description': seq['description'],
                    'sequence_length': len(seq['sequence']),
                }
                for seq in self.sequences
            ],
        }

    def print_properties(self):
        """Prints all properties to the console."""
        if not self.properties:
            print("No properties decoded. Call read_and_decode first.")
            return
        print(f"File Extension: {self.properties['file_extension']}")
        print(f"Encoding: {self.properties['encoding']}")
        print(f"Structure: {self.properties['structure']}")
        print(f"Number of Sequences: {self.properties['num_sequences']}")
        for i, seq in enumerate(self.properties['sequences'], 1):
            print(f"\nSequence {i}:")
            print(f"  Identifier: {seq['identifier']}")
            print(f"  Description: {seq['description']}")
            print(f"  Sequence Length: {seq['sequence_length']}")

    def write(self, output_path, wrap_length=80):
        """Writes the parsed sequences to a new .FASTA file."""
        with open(output_path, 'w', encoding='utf-8') as f:
            for seq in self.sequences:
                # rstrip() drops the separating space when the description is empty.
                header = f">{seq['identifier']} {seq['description']}".rstrip()
                f.write(header + '\n')
                sequence = seq['sequence']
                for i in range(0, len(sequence), wrap_length):
                    f.write(sequence[i:i + wrap_length] + '\n')

# Example usage:
# handler = FastaHandler('example.fasta')
# handler.read_and_decode()
# handler.print_properties()
# handler.write('output.fasta')
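
The class above reads the whole file into memory, which may be impractical for genome-scale inputs. As a hedged alternative, the sketch below yields one record at a time; `iter_fasta` and `_split_header` are hypothetical helper names, not part of the class.

```python
def iter_fasta(path):
    """Yield (identifier, description, sequence) one record at a time."""
    header = None
    chunks = []
    with open(path, "r", encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    yield _split_header(header) + ("".join(chunks),)
                header = line[1:].strip()
                chunks = []
            elif header is not None and line:
                chunks.append(line)
    # The last record ends at end of file rather than at a '>' line.
    if header is not None:
        yield _split_header(header) + ("".join(chunks),)


def _split_header(header):
    identifier, _, description = header.partition(" ")
    return identifier, description.strip()
```

Only the current record is held in memory, so a caller can, for example, sum sequence lengths with `sum(len(seq) for _, _, seq in iter_fasta(path))` without loading every record at once.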

5. Java Class for .FASTA File Handling

The following Java class performs similar operations: opening, decoding, reading, printing properties, and writing .FASTA files. It uses java.io for file handling.

import java.io.*;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FastaHandler {
    private String filepath;
    private List<Map<String, Object>> sequences;
    private Map<String, Object> properties;

    public FastaHandler(String filepath) {
        this.filepath = filepath;
        this.sequences = new ArrayList<>();
        this.properties = new HashMap<>();
    }

    public void readAndDecode() throws IOException {
        File file = new File(filepath);
        if (!file.exists()) {
            throw new FileNotFoundException("File not found: " + filepath);
        }
        sequences.clear();  // allow readAndDecode() to be called more than once
        // try-with-resources closes the reader even if parsing throws.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(file), java.nio.charset.StandardCharsets.UTF_8))) {
            String line;
            Map<String, Object> currentSeq = null;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.startsWith(">")) {
                    if (currentSeq != null) {
                        sequences.add(currentSeq);
                    }
                    String header = line.substring(1).trim();
                    String[] parts = header.split(" ", 2);
                    String identifier = parts[0];
                    String description = parts.length > 1 ? parts[1] : "";
                    currentSeq = new HashMap<>();
                    currentSeq.put("identifier", identifier);
                    currentSeq.put("description", description);
                    currentSeq.put("sequence", new StringBuilder());
                } else if (currentSeq != null) {
                    ((StringBuilder) currentSeq.get("sequence")).append(line);
                }
            }
            if (currentSeq != null) {
                sequences.add(currentSeq);
            }
        }

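        // lastIndexOf returns -1 when there is no extension; substring(-1) would throw.
        String name = file.getName();
        int dot = name.lastIndexOf('.');
        properties.put("file_extension", dot >= 0 ? name.substring(dot) : "");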
        properties.put("encoding", "ASCII/UTF-8");
        properties.put("structure", "Multi-sequence records");
        properties.put("num_sequences", sequences.size());
        List<Map<String, Object>> seqProps = new ArrayList<>();
        for (Map<String, Object> seq : sequences) {
            Map<String, Object> prop = new HashMap<>();
            prop.put("identifier", seq.get("identifier"));
            prop.put("description", seq.get("description"));
            prop.put("sequence_length", ((StringBuilder) seq.get("sequence")).length());
            seqProps.add(prop);
        }
        properties.put("sequences", seqProps);
    }

    public void printProperties() {
        if (properties.isEmpty()) {
            System.out.println("No properties decoded. Call readAndDecode first.");
            return;
        }
        System.out.println("File Extension: " + properties.get("file_extension"));
        System.out.println("Encoding: " + properties.get("encoding"));
        System.out.println("Structure: " + properties.get("structure"));
        System.out.println("Number of Sequences: " + properties.get("num_sequences"));
        @SuppressWarnings("unchecked")
        List<Map<String, Object>> seqs = (List<Map<String, Object>>) properties.get("sequences");
        for (int i = 0; i < seqs.size(); i++) {
            Map<String, Object> seq = seqs.get(i);
            System.out.println("\nSequence " + (i + 1) + ":");
            System.out.println("  Identifier: " + seq.get("identifier"));
            System.out.println("  Description: " + seq.get("description"));
            System.out.println("  Sequence Length: " + seq.get("sequence_length"));
        }
    }

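    public void write(String outputPath, int wrapLength) throws IOException {
        if (wrapLength <= 0) {
            // A non-positive wrap length would make the wrapping loop never terminate.
            throw new IllegalArgumentException("wrapLength must be positive");
        }
        // try-with-resources flushes and closes the writer even if a write fails.
        try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(outputPath), java.nio.charset.StandardCharsets.UTF_8))) {
            for (Map<String, Object> seq : sequences) {
                // trim() drops the separating space when the description is empty.
                String header = (">" + seq.get("identifier") + " " + seq.get("description")).trim();
                writer.write(header + "\n");
                String sequence = ((StringBuilder) seq.get("sequence")).toString();
                for (int i = 0; i < sequence.length(); i += wrapLength) {
                    writer.write(sequence.substring(i, Math.min(i + wrapLength, sequence.length())) + "\n");
                }
            }
        }
    }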

    // Example usage:
    // public static void main(String[] args) throws IOException {
    //     FastaHandler handler = new FastaHandler("example.fasta");
    //     handler.readAndDecode();
    //     handler.printProperties();
    //     handler.write("output.fasta", 80);
    // }
}

6. JavaScript Class for .FASTA File Handling

The following JavaScript class (ES6 syntax) can handle .FASTA files in a Node.js environment (using fs module). It opens, decodes, reads, prints properties to the console, and writes files.

const fs = require('fs');

class FastaHandler {
  constructor(filepath) {
    this.filepath = filepath;
    this.sequences = [];
    this.properties = {};
  }

  readAndDecode() {
    if (!fs.existsSync(this.filepath)) {
      throw new Error(`File not found: ${this.filepath}`);
    }
    const content = fs.readFileSync(this.filepath, 'utf-8');
    const lines = content.split('\n');
    this.sequences = []; // reset so repeated calls do not duplicate records
    let currentSeq = null;
    lines.forEach(line => {
      line = line.trim();
      if (line.startsWith('>')) {
        if (currentSeq) this.sequences.push(currentSeq);
        const header = line.slice(1).trim();
        const [identifier, ...descParts] = header.split(' ');
        currentSeq = {
          identifier,
          description: descParts.join(' '),
          sequence: ''
        };
      } else if (currentSeq) {
        currentSeq.sequence += line;
      }
    });
    if (currentSeq) this.sequences.push(currentSeq);
    this.properties = {
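      // path.extname returns '' when there is no extension; slice(1) drops the leading dot.
      file_extension: require('path').extname(this.filepath).slice(1),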
      encoding: 'ASCII/UTF-8',
      structure: 'Multi-sequence records',
      num_sequences: this.sequences.length,
      sequences: this.sequences.map(seq => ({
        identifier: seq.identifier,
        description: seq.description,
        sequence_length: seq.sequence.length
      }))
    };
  }

  printProperties() {
    if (!Object.keys(this.properties).length) {
      console.log('No properties decoded. Call readAndDecode first.');
      return;
    }
    console.log(`File Extension: .${this.properties.file_extension}`);
    console.log(`Encoding: ${this.properties.encoding}`);
    console.log(`Structure: ${this.properties.structure}`);
    console.log(`Number of Sequences: ${this.properties.num_sequences}`);
    this.properties.sequences.forEach((seq, index) => {
      console.log(`\nSequence ${index + 1}:`);
      console.log(`  Identifier: ${seq.identifier}`);
      console.log(`  Description: ${seq.description}`);
      console.log(`  Sequence Length: ${seq.sequence_length}`);
    });
  }

  write(outputPath, wrapLength = 80) {
    let output = '';
    this.sequences.forEach(seq => {
      // trimEnd() drops the separating space when the description is empty.
      output += `>${seq.identifier} ${seq.description}`.trimEnd() + '\n';
      for (let i = 0; i < seq.sequence.length; i += wrapLength) {
        output += seq.sequence.slice(i, i + wrapLength) + '\n';
      }
    });
    fs.writeFileSync(outputPath, output, 'utf-8');
  }
}

// Example usage:
// const handler = new FastaHandler('example.fasta');
// handler.readAndDecode();
// handler.printProperties();
// handler.write('output.fasta');

7. C Implementation for .FASTA File Handling

Since C does not support classes natively, the following implementation uses a struct and associated functions to achieve equivalent functionality. It opens, decodes, reads, prints properties to the console, and writes .FASTA files. Compile with a standard C compiler (e.g., gcc).

#define _POSIX_C_SOURCE 200809L  // strdup and strndup are POSIX, not ISO C
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE_LEN 1024
#define MAX_SEQ_LEN 1000000  // Arbitrary max for sequence

typedef struct {
    char *identifier;
    char *description;
    char *sequence;
} Sequence;

typedef struct {
    char *filepath;
    Sequence *sequences;
    int num_sequences;
    char *file_extension;
    char *encoding;
    char *structure;
} FastaHandler;

FastaHandler* create_fasta_handler(const char *filepath) {
    FastaHandler *handler = malloc(sizeof(FastaHandler));
    if (!handler) return NULL;  // caller must check for allocation failure
    handler->filepath = strdup(filepath);
    handler->sequences = NULL;
    handler->num_sequences = 0;
    handler->file_extension = NULL;
    handler->encoding = strdup("ASCII/UTF-8");
    handler->structure = strdup("Multi-sequence records");
    return handler;
}

void free_fasta_handler(FastaHandler *handler) {
    if (handler) {
        free(handler->filepath);
        free(handler->encoding);
        free(handler->structure);
        free(handler->file_extension);
        for (int i = 0; i < handler->num_sequences; i++) {
            free(handler->sequences[i].identifier);
            free(handler->sequences[i].description);
            free(handler->sequences[i].sequence);
        }
        free(handler->sequences);
        free(handler);
    }
}

int read_and_decode(FastaHandler *handler) {
    FILE *file = fopen(handler->filepath, "r");
    if (!file) {
        perror("File not found");
        return 1;
    }
    char line[MAX_LINE_LEN];
    Sequence *temp_sequences = NULL;
    int capacity = 0;
    int count = 0;
    Sequence current = {NULL, NULL, NULL};
    char *seq_buffer = malloc(MAX_SEQ_LEN);
    if (!seq_buffer) {
        fclose(file);
        return 1;
    }
    int seq_len = 0;
    while (fgets(line, MAX_LINE_LEN, file)) {
        line[strcspn(line, "\r\n")] = 0;  // Trim newline (LF or CRLF)
        if (line[0] == '>') {
            if (current.identifier) {
                current.sequence = strndup(seq_buffer, seq_len);
                if (count >= capacity) {
                    capacity = capacity ? capacity * 2 : 1;
                    temp_sequences = realloc(temp_sequences, capacity * sizeof(Sequence));
                }
                temp_sequences[count++] = current;
                current = (Sequence){NULL, NULL, NULL};
                seq_len = 0;
            }
            char *header = line + 1;
            char *space_pos = strchr(header, ' ');
            if (space_pos) {
                current.identifier = strndup(header, space_pos - header);
                current.description = strdup(space_pos + 1);
            } else {
                current.identifier = strdup(header);
                current.description = strdup("");
            }
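        } else if (current.identifier) {
            size_t n = strlen(line);
            // Bounds-check before copying: strncpy with MAX_LINE_LEN pads with
            // zeros and could write past the end of seq_buffer.
            if ((size_t)seq_len + n > MAX_SEQ_LEN) {
                fprintf(stderr, "Sequence exceeds MAX_SEQ_LEN\n");
                free(current.identifier);
                free(current.description);
                // Hand over records parsed so far so free_fasta_handler releases them.
                handler->sequences = temp_sequences;
                handler->num_sequences = count;
                free(seq_buffer);
                fclose(file);
                return 1;
            }
            memcpy(seq_buffer + seq_len, line, n);
            seq_len += (int)n;
        }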
    }
    if (current.identifier) {
        current.sequence = strndup(seq_buffer, seq_len);
        if (count >= capacity) {
            temp_sequences = realloc(temp_sequences, (count + 1) * sizeof(Sequence));
        }
        temp_sequences[count++] = current;
    }
    fclose(file);
    free(seq_buffer);
    handler->sequences = temp_sequences;
    handler->num_sequences = count;
    char *dot = strrchr(handler->filepath, '.');
    if (dot) handler->file_extension = strdup(dot);
    return 0;
}

void print_properties(const FastaHandler *handler) {
    if (handler->num_sequences == 0) {
        printf("No properties decoded. Call read_and_decode first.\n");
        return;
    }
    printf("File Extension: %s\n", handler->file_extension ? handler->file_extension : "Unknown");
    printf("Encoding: %s\n", handler->encoding);
    printf("Structure: %s\n", handler->structure);
    printf("Number of Sequences: %d\n", handler->num_sequences);
    for (int i = 0; i < handler->num_sequences; i++) {
        const Sequence *seq = &handler->sequences[i];
        printf("\nSequence %d:\n", i + 1);
        printf("  Identifier: %s\n", seq->identifier);
        printf("  Description: %s\n", seq->description);
        printf("  Sequence Length: %zu\n", strlen(seq->sequence));  // %zu matches size_t
    }
}

int write_fasta(const FastaHandler *handler, const char *output_path, int wrap_length) {
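    if (wrap_length <= 0) {
        // A non-positive wrap length would make the wrapping loop never terminate.
        fprintf(stderr, "wrap_length must be positive\n");
        return 1;
    }
    FILE *file = fopen(output_path, "w");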
    if (!file) {
        perror("Cannot open output file");
        return 1;
    }
    for (int i = 0; i < handler->num_sequences; i++) {
        const Sequence *seq = &handler->sequences[i];
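        // Omit the separating space when the description is empty.
        if (seq->description[0] != '\0') {
            fprintf(file, ">%s %s\n", seq->identifier, seq->description);
        } else {
            fprintf(file, ">%s\n", seq->identifier);
        }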
        const char *sequence = seq->sequence;
        size_t len = strlen(sequence);
        for (size_t j = 0; j < len; j += wrap_length) {
            size_t chunk = (j + wrap_length < len) ? wrap_length : len - j;
            fprintf(file, "%.*s\n", (int)chunk, sequence + j);
        }
    }
    fclose(file);
    return 0;
}

// Example usage:
// int main() {
//     FastaHandler *handler = create_fasta_handler("example.fasta");
//     if (read_and_decode(handler) == 0) {
//         print_properties(handler);
//         write_fasta(handler, "output.fasta", 80);
//     }
//     free_fasta_handler(handler);
//     return 0;
// }