Task 215: .FASTQ File Format
1. Properties of the .FASTQ File Format Intrinsic to Its Structure
The .FASTQ format is a plain-text, line-based file format used primarily in bioinformatics to store nucleotide sequences alongside their per-base quality scores. Below is a list of its intrinsic properties, following the widely used Sanger and Illumina conventions (the format has no formal standards-body specification). These define the core structure and constraints of the format itself, independent of specific file contents or external compression:
- Text-Based Encoding: The format uses ASCII characters exclusively (no binary data), with printable ASCII codes 33–126 for quality scores. Sequences use standard IUPAC nucleotide codes (A, C, G, T, N, etc., case-insensitive, no whitespace allowed).
- Record-Based Structure: Files consist of one or more variable-length records, each representing a single sequencing read. Each record spans exactly four logical sections (lines or wrapped lines).
- Four-Line Record Layout:
- Line 1: Begins with '@' (ASCII 64), followed by a sequence identifier (e.g., instrument/run details) and optional free-text description (no length limit).
- Line 2: The nucleotide sequence string (wrapped across multiple lines if needed, but often single-line for simplicity).
- Line 3: Begins with '+' (ASCII 43), optionally followed by a repeat of the Line 1 identifier (conventionally omitted to save space).
- Line 4: Quality scores as an ASCII string (wrapped if needed), exactly matching the length of the sequence after unwrapping.
- Sequence Constraints: No spaces, tabs, or other whitespace in the sequence; supports IUPAC ambiguity codes and gaps ('-'); upper-case conventional but not required.
- Quality Score Encoding: Phred-style scores (error probability estimates) encoded as ASCII offsets. Supports three variants:
- Sanger (offset +33, Phred scores 0–93).
- Illumina 1.3+ (offset +64, Phred scores 0–62).
- Solexa (legacy, offset +64, Solexa scores -5–62).
- Length Matching: Quality string length must equal sequence length (post-unwrapping).
- Line Termination: Uses standard newline conventions (LF on Unix, CRLF on Windows); parsers should handle both.
- Line Wrapping Support: Sequence and quality lines can wrap (like FASTA), but many tools output without wrapping for easier parsing.
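- No Fixed Record Size: Records are variable-length and each begins with an '@' header line. Because '@' (ASCII 64) is also a valid quality character, a line starting with '@' is not by itself a reliable record boundary; parsers should read quality lines until their combined length matches the sequence length. Files end after the last quality line.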
- File Extension Convention: Typically .fastq or .fq (case-insensitive); no intrinsic header/footer.
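- Variant Detection: Encoding variant must be known or inferred from the quality ASCII range (e.g., any character below ASCII 59 indicates Sanger, since Solexa starts at 59 and Illumina 1.3+ at 64; characters in the range 59–63 with none lower indicate Solexa).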
- Error Handling: Invalid records (e.g., mismatched lengths) are format violations; no built-in checksums or metadata blocks.
These properties ensure compactness and readability while embedding quality metadata directly with sequences.
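The variant ranges listed above can be turned into a small range check. The following is a minimal sketch rather than a reference implementation: the function names `guess_variant`, `decode_scores`, and `error_probability` are hypothetical, and the thresholds (below 59 means Sanger, 59 to 63 means Solexa) follow directly from the offsets in the list. A file whose characters all sit at 64 or above is only probably Illumina 1.3+, because a uniformly high-quality Sanger file can occupy the same range.

```python
def guess_variant(quality_strings):
    """Guess the encoding variant from the lowest ASCII code observed."""
    lowest = min((ord(c) for q in quality_strings for c in q), default=None)
    if lowest is None:
        return "Unknown"        # no quality characters seen
    if lowest < 59:
        return "Sanger"         # only the +33 offset reaches below 59
    if lowest < 64:
        return "Solexa"         # Solexa scores -5..-1 map to 59..63
    return "Illumina 1.3+"      # ambiguous with high-quality Sanger data


def decode_scores(quality, offset=33):
    """Convert a quality string into integer scores for a given offset."""
    return [ord(c) - offset for c in quality]


def error_probability(phred):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)


print(guess_variant(["II5!"]))   # Sanger
print(decode_scores("II5!"))     # [40, 40, 20, 0]
```

The error-probability conversion applies to Phred scores only; Solexa scores use a different (odds-based) definition and would need converting first.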
2. Two Direct Download Links for .FASTQ Files
Here are two direct download links to sample .FASTQ files (small, public domain examples from Zenodo repository for testing):
- Sample 1: 1_control_18S_2019_minq7.fastq (18S rRNA control sequence data, ~1.2 MB).
- Sample 2: 1_control_18S_2019_minq7 - Copy.fastq (Duplicate copy of the above for paired-end simulation, ~1.2 MB).
3. Ghost Blog Embedded HTML JavaScript for Drag-and-Drop .FASTQ Parsing
This is a self-contained HTML snippet with embedded JavaScript, suitable for embedding in a Ghost blog post (e.g., via the HTML card). It creates a drag-and-drop zone for a .FASTQ file. Upon drop, it parses the file using FileReader, detects basic properties (e.g., number of reads, total bases, quality range, encoding variant, sample record details), and dumps them to a results <div> on the screen. It assumes Sanger encoding by default but attempts variant detection based on quality ASCII range. For write, it generates a downloadable copy of the parsed file.
Drag and drop a .FASTQ file here to analyze its properties.
Embed this directly in your Ghost post. It will render a drop zone and output properties upon file drop.
4. Python Class for .FASTQ Handling
This FastqHandler class opens a .FASTQ file, parses records (wrapped sequence and quality lines are concatenated), infers the quality encoding variant, prints format and file-specific properties to the console, and supports writing a copy.
import sys


class FastqHandler:
    # Format-level properties, independent of any particular file.
    PROPERTIES = {
        'fileType': 'Text-based',
        'recordStructure': '4 lines per record (@seq, sequence, +sep, quality)',
        'sequenceConstraints': 'No whitespace, IUPAC nucleotides',
        'qualityEncoding': 'ASCII offsets (variants: Sanger +33, Illumina +64 Phred, Solexa +64)',
        'lengthMatching': 'Quality length == sequence length',
        'lineTermination': 'LF or CRLF',
        'lineWrapping': 'Supported (but often single-line)',
        'fixedRecordSize': 'No (variable)',
        'fileExtension': '.fastq or .fq',
        'variantDetection': 'Inferred from quality ASCII range'
    }

    def __init__(self, filename):
        self.filename = filename
        self.num_reads = 0
        self.total_bases = 0
        self.min_qual = float('inf')
        self.max_qual = float('-inf')
        self.detected_variant = 'Unknown'
        self.sample_id = ''
        self.sample_seq = ''
        self.sample_qual = ''
        self.records = []

    def read_and_decode(self):
        with open(self.filename, 'r') as f:
            lines = [line.strip() for line in f]
        min_ascii = max_ascii = None
        i = 0
        while i < len(lines):
            if not lines[i].startswith('@'):
                i += 1  # Skip blank or unexpected lines between records
                continue
            read_id = lines[i][1:]
            i += 1
            # Sequence lines run until the '+' separator.
            seq = ''
            while i < len(lines) and not lines[i].startswith('+'):
                seq += lines[i]
                i += 1
            i += 1  # Skip the '+' separator line
            # Quality lines are read until they cover the sequence, because
            # '@' is a valid quality character and may start a quality line.
            qual = ''
            while i < len(lines) and len(qual) < len(seq):
                qual += lines[i]
                i += 1
            if len(qual) != len(seq):
                raise ValueError(
                    f"Record '{read_id}': quality length {len(qual)} "
                    f"!= sequence length {len(seq)}")
            self.num_reads += 1
            self.total_bases += len(seq)
            if self.num_reads == 1:
                # Keep the first record as the sample shown in the report.
                self.sample_id, self.sample_seq, self.sample_qual = read_id, seq, qual
            for q_char in qual:
                code = ord(q_char)
                min_ascii = code if min_ascii is None else min(min_ascii, code)
                max_ascii = code if max_ascii is None else max(max_ascii, code)
            self.records.append((read_id, seq, qual))
        if min_ascii is not None:
            offset = self._detect_offset(min_ascii)
            self.min_qual = min_ascii - offset
            self.max_qual = max_ascii - offset

    def _detect_offset(self, min_ascii):
        # Infer the variant from the lowest quality character in the file.
        if min_ascii < 59:
            self.detected_variant = 'Sanger'
            return 33
        if min_ascii < 64:
            self.detected_variant = 'Solexa'
            return 64
        # All characters >= 64: most likely Illumina 1.3+, although a
        # uniformly high-quality Sanger file could fall in the same range.
        self.detected_variant = 'Illumina'
        return 64

    def print_properties(self):
        print("FASTQ Format Properties:")
        for key, value in self.PROPERTIES.items():
            print(f"{key}: {value}")
        print("\nFile-Specific Properties:")
        print(f"Number of Reads: {self.num_reads}")
        print(f"Total Bases: {self.total_bases}")
        print(f"Min/Max Phred Quality: {self.min_qual}/{self.max_qual}")
        print(f"Detected Variant: {self.detected_variant}")
        print(f"Sample Identifier: {self.sample_id}")
        print(f"Sample Sequence (first 100 chars): {self.sample_seq[:100]}...")
        print(f"Sample Quality String (first 100 chars): {self.sample_qual[:100]}...")

    def write(self, output_filename=None):
        if not output_filename:
            output_filename = self.filename + '.copy'
        with open(output_filename, 'w') as f:
            # Records are written unwrapped, one line per section.
            for id_, seq, qual in self.records:
                f.write(f"@{id_}\n")
                f.write(f"{seq}\n")
                f.write("+\n")
                f.write(f"{qual}\n")
        print(f"\nWritten copy to: {output_filename}")


# Usage
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python fastq_handler.py <fastq_file>")
        sys.exit(1)
    handler = FastqHandler(sys.argv[1])
    handler.read_and_decode()
    handler.print_properties()
    handler.write()
Run with python fastq_handler.py example.fastq.
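For files too large to hold in memory, the length-matching rule from section 1 can drive a generator that yields one record at a time. This is a sketch under the assumption that the input is well-formed apart from optional line wrapping; `iter_fastq` is a hypothetical name, not part of any library.

```python
import io


def iter_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a text handle."""
    lines = (line.rstrip("\r\n") for line in handle)
    for header in lines:
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        # Sequence lines run until the '+' separator.
        seq_parts = []
        for line in lines:
            if line.startswith("+"):
                break
            seq_parts.append(line)
        else:
            raise ValueError("file ended before '+' separator")
        seq = "".join(seq_parts)
        # Quality lines are consumed until they cover the whole sequence,
        # so a quality line that starts with '@' is not taken for a header.
        qual = ""
        while len(qual) < len(seq):
            try:
                qual += next(lines)
            except StopIteration:
                raise ValueError("file ended inside quality string") from None
        if len(qual) != len(seq):
            raise ValueError("quality length does not match sequence length")
        yield header[1:], seq, qual


# The quality line below starts with '@' and is still parsed correctly.
sample = io.StringIO("@read1\nACGT\n+\n@II!\n")
for name, seq, qual in iter_fastq(sample):
    print(name, seq, qual)  # prints: read1 ACGT @II!
```

Any open text file can be passed in place of the `io.StringIO` object; only one record is held in memory at a time.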
5. Java Class for .FASTQ Handling
This FastqHandler class uses BufferedReader to open and parse a .FASTQ file, decodes properties, prints them to the console, and writes a copy. Wrapped lines are concatenated; requires Java 9+ (for Map.of).
import java.io.*;
import java.util.*;

public class FastqHandler {
    // Format-level properties, independent of any particular file.
    private static final Map<String, String> PROPERTIES = Map.of(
        "fileType", "Text-based",
        "recordStructure", "4 lines per record (@seq, sequence, +sep, quality)",
        "sequenceConstraints", "No whitespace, IUPAC nucleotides",
        "qualityEncoding", "ASCII offsets (variants: Sanger +33, Illumina +64 Phred, Solexa +64)",
        "lengthMatching", "Quality length == sequence length",
        "lineTermination", "LF or CRLF",
        "lineWrapping", "Supported (but often single-line)",
        "fixedRecordSize", "No (variable)",
        "fileExtension", ".fastq or .fq",
        "variantDetection", "Inferred from quality ASCII range"
    );

    private final String filename;
    private int numReads = 0;
    private long totalBases = 0;
    private int minQual = Integer.MAX_VALUE;
    private int maxQual = Integer.MIN_VALUE;
    private String detectedVariant = "Unknown";
    private String sampleId = "";
    private String sampleSeq = "";
    private String sampleQual = "";
    private final List<String[]> records = new ArrayList<>();

    public FastqHandler(String filename) {
        this.filename = filename;
    }

    public void readAndDecode() throws IOException {
        int minAscii = Integer.MAX_VALUE;
        int maxAscii = Integer.MIN_VALUE;
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (!line.startsWith("@")) continue; // Skip blank or unexpected lines
                String id = line.substring(1).trim();
                // Sequence lines run until the '+' separator.
                StringBuilder seq = new StringBuilder();
                while ((line = br.readLine()) != null && !line.startsWith("+")) {
                    seq.append(line.trim());
                }
                if (line == null) {
                    throw new IOException("Record '" + id + "' has no '+' separator");
                }
                // Read quality lines until they cover the sequence. Stopping on a
                // leading '@' would be wrong: '@' is a valid quality character, and
                // doing so would also consume the next record's header line.
                StringBuilder qual = new StringBuilder();
                while (qual.length() < seq.length() && (line = br.readLine()) != null) {
                    qual.append(line.trim());
                }
                if (qual.length() != seq.length()) {
                    throw new IOException("Record '" + id + "': quality and sequence lengths differ");
                }
                String sequence = seq.toString();
                String quality = qual.toString();
                numReads++;
                totalBases += sequence.length();
                if (numReads == 1) {
                    // Keep the first record as the sample shown in the report.
                    sampleId = id;
                    sampleSeq = sequence;
                    sampleQual = quality;
                }
                for (char q : quality.toCharArray()) {
                    minAscii = Math.min(minAscii, q);
                    maxAscii = Math.max(maxAscii, q);
                }
                records.add(new String[]{id, sequence, quality});
            }
        }
        if (minAscii != Integer.MAX_VALUE) {
            int offset = detectOffset(minAscii);
            minQual = minAscii - offset;
            maxQual = maxAscii - offset;
        }
    }

    // Infer the variant from the lowest quality character in the file.
    private int detectOffset(int minAscii) {
        if (minAscii < 59) {
            detectedVariant = "Sanger";
            return 33;
        }
        if (minAscii < 64) {
            detectedVariant = "Solexa";
            return 64;
        }
        // All characters >= 64: most likely Illumina 1.3+, although a
        // uniformly high-quality Sanger file could fall in the same range.
        detectedVariant = "Illumina";
        return 64;
    }

    public void printProperties() {
        System.out.println("FASTQ Format Properties:");
        PROPERTIES.forEach((k, v) -> System.out.println(k + ": " + v));
        System.out.println("\nFile-Specific Properties:");
        System.out.println("Number of Reads: " + numReads);
        System.out.println("Total Bases: " + totalBases);
        System.out.println("Min/Max Phred Quality: " + minQual + "/" + maxQual);
        System.out.println("Detected Variant: " + detectedVariant);
        System.out.println("Sample Identifier: " + sampleId);
        System.out.println("Sample Sequence (first 100 chars): " + sampleSeq.substring(0, Math.min(100, sampleSeq.length())) + "...");
        System.out.println("Sample Quality String (first 100 chars): " + sampleQual.substring(0, Math.min(100, sampleQual.length())) + "...");
    }

    public void write(String outputFilename) throws IOException {
        if (outputFilename == null) outputFilename = filename + ".copy";
        try (PrintWriter pw = new PrintWriter(new FileWriter(outputFilename))) {
            // Records are written unwrapped, one line per section.
            for (String[] rec : records) {
                pw.println("@" + rec[0]);
                pw.println(rec[1]);
                pw.println("+");
                pw.println(rec[2]);
            }
        }
        System.out.println("\nWritten copy to: " + outputFilename);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("Usage: java FastqHandler <fastq_file>");
            return;
        }
        FastqHandler handler = new FastqHandler(args[0]);
        handler.readAndDecode();
        handler.printProperties();
        handler.write(null);
    }
}
Compile and run: javac FastqHandler.java && java FastqHandler example.fastq.
6. JavaScript Class for .FASTQ Handling (Node.js)
This FastqHandler class works in Node.js (using the built-in fs module). It reads and parses a file, decodes properties, prints them to the console, and writes a copy. Install Node.js; no external deps.
const fs = require('fs');

class FastqHandler {
  // Format-level properties, independent of any particular file.
  static PROPERTIES = {
    fileType: 'Text-based',
    recordStructure: '4 lines per record (@seq, sequence, +sep, quality)',
    sequenceConstraints: 'No whitespace, IUPAC nucleotides',
    qualityEncoding: 'ASCII offsets (variants: Sanger +33, Illumina +64 Phred, Solexa +64)',
    lengthMatching: 'Quality length == sequence length',
    lineTermination: 'LF or CRLF',
    lineWrapping: 'Supported (but often single-line)',
    fixedRecordSize: 'No (variable)',
    fileExtension: '.fastq or .fq',
    variantDetection: 'Inferred from quality ASCII range'
  };

  constructor(filename) {
    this.filename = filename;
    this.numReads = 0;
    this.totalBases = 0;
    this.minQual = Infinity;
    this.maxQual = -Infinity;
    this.detectedVariant = 'Unknown';
    this.sampleId = '';
    this.sampleSeq = '';
    this.sampleQual = '';
    this.records = [];
  }

  readAndDecode() {
    const text = fs.readFileSync(this.filename, 'utf8');
    const lines = text.split(/\r?\n/).map((l) => l.trim());
    let minAscii = Infinity;
    let maxAscii = -Infinity;
    let i = 0;
    while (i < lines.length) {
      if (!lines[i].startsWith('@')) {
        i++; // Skip blank or unexpected lines between records
        continue;
      }
      const id = lines[i].slice(1);
      i++;
      // Sequence lines run until the '+' separator.
      let seq = '';
      while (i < lines.length && !lines[i].startsWith('+')) {
        seq += lines[i];
        i++;
      }
      i++; // Skip the '+' separator line
      // Quality lines are read until they cover the sequence, because
      // '@' is a valid quality character and may start a quality line.
      let qual = '';
      while (i < lines.length && qual.length < seq.length) {
        qual += lines[i];
        i++;
      }
      if (qual.length !== seq.length) {
        throw new Error(`Record '${id}': quality and sequence lengths differ`);
      }
      this.numReads++;
      this.totalBases += seq.length;
      if (this.numReads === 1) {
        // Keep the first record as the sample shown in the report.
        this.sampleId = id;
        this.sampleSeq = seq;
        this.sampleQual = qual;
      }
      for (let j = 0; j < qual.length; j++) {
        const code = qual.charCodeAt(j);
        minAscii = Math.min(minAscii, code);
        maxAscii = Math.max(maxAscii, code);
      }
      this.records.push([id, seq, qual]);
    }
    if (minAscii !== Infinity) {
      const offset = this._detectOffset(minAscii);
      this.minQual = minAscii - offset;
      this.maxQual = maxAscii - offset;
    }
  }

  // Infer the variant from the lowest quality character in the file.
  _detectOffset(minAscii) {
    if (minAscii < 59) {
      this.detectedVariant = 'Sanger';
      return 33;
    }
    if (minAscii < 64) {
      this.detectedVariant = 'Solexa';
      return 64;
    }
    // All characters >= 64: most likely Illumina 1.3+, although a
    // uniformly high-quality Sanger file could fall in the same range.
    this.detectedVariant = 'Illumina';
    return 64;
  }

  printProperties() {
    console.log('FASTQ Format Properties:');
    for (const [key, value] of Object.entries(FastqHandler.PROPERTIES)) {
      console.log(`${key}: ${value}`);
    }
    console.log('\nFile-Specific Properties:');
    console.log(`Number of Reads: ${this.numReads}`);
    console.log(`Total Bases: ${this.totalBases}`);
    console.log(`Min/Max Phred Quality: ${this.minQual}/${this.maxQual}`);
    console.log(`Detected Variant: ${this.detectedVariant}`);
    console.log(`Sample Identifier: ${this.sampleId}`);
    console.log(`Sample Sequence (first 100 chars): ${this.sampleSeq.substring(0, 100)}...`);
    console.log(`Sample Quality String (first 100 chars): ${this.sampleQual.substring(0, 100)}...`);
  }

  write(outputFilename = null) {
    if (!outputFilename) outputFilename = this.filename + '.copy';
    let content = '';
    // Records are written unwrapped, one line per section.
    for (const rec of this.records) {
      content += `@${rec[0]}\n`;
      content += `${rec[1]}\n`;
      content += `+\n`;
      content += `${rec[2]}\n`;
    }
    fs.writeFileSync(outputFilename, content);
    console.log(`\nWritten copy to: ${outputFilename}`);
  }
}

// Usage
const args = process.argv.slice(2);
if (args.length !== 1) {
  console.log('Usage: node fastq_handler.js <fastq_file>');
  process.exit(1);
}
const handler = new FastqHandler(args[0]);
handler.readAndDecode();
handler.printProperties();
handler.write();
Run with node fastq_handler.js example.fastq.
7. C Class (Struct with Functions) for .FASTQ Handling
This is a simple C implementation using the C standard library plus the POSIX getline function (no external libs). Compile with gcc fastq_handler.c -o fastq_handler. It reads and parses the file, prints properties, and writes a copy of the first record. Memory management is included.
#define _POSIX_C_SOURCE 200809L /* for getline() and strdup() */
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

typedef struct {
    char* filename;
    int num_reads;
    long total_bases;
    int min_qual;
    int max_qual;
    const char* detected_variant; /* Points at a string literal; never freed */
    char* sample_id;
    char* sample_seq;
    char* sample_qual;
} FastqHandler;

/* Format-level properties, independent of any particular file. */
const char* PROPERTIES[] = {
    "fileType: Text-based",
    "recordStructure: 4 lines per record (@seq, sequence, +sep, quality)",
    "sequenceConstraints: No whitespace, IUPAC nucleotides",
    "qualityEncoding: ASCII offsets (variants: Sanger +33, Illumina +64 Phred, Solexa +64)",
    "lengthMatching: Quality length == sequence length",
    "lineTermination: LF or CRLF",
    "lineWrapping: Supported (but often single-line)",
    "fixedRecordSize: No (variable)",
    "fileExtension: .fastq or .fq",
    "variantDetection: Inferred from quality ASCII range",
    NULL
};

FastqHandler* fastq_handler_create(const char* filename) {
    FastqHandler* h = malloc(sizeof(FastqHandler));
    if (!h) return NULL;
    h->filename = strdup(filename);
    h->num_reads = 0;
    h->total_bases = 0;
    h->min_qual = INT_MAX;
    h->max_qual = INT_MIN;
    h->detected_variant = "Unknown";
    h->sample_id = NULL;
    h->sample_seq = NULL;
    h->sample_qual = NULL;
    return h;
}

void fastq_handler_destroy(FastqHandler* h) {
    free(h->filename);
    free(h->sample_id); /* free(NULL) is a no-op */
    free(h->sample_seq);
    free(h->sample_qual);
    free(h);
}

/* Strip trailing LF/CR in place and return the new length. */
static size_t chomp(char* s, ssize_t n) {
    while (n > 0 && (s[n - 1] == '\n' || s[n - 1] == '\r')) s[--n] = '\0';
    return (size_t)n;
}

/* Append n bytes of s to a growable, NUL-terminated buffer. */
static void append(char** buf, size_t* len, size_t* cap, const char* s, size_t n) {
    if (*len + n + 1 > *cap) {
        *cap = (*len + n + 1) * 2;
        *buf = realloc(*buf, *cap);
        if (!*buf) { perror("realloc"); exit(1); }
    }
    memcpy(*buf + *len, s, n);
    *len += n;
    (*buf)[*len] = '\0';
}

/* Infer the variant from the lowest quality character in the file. */
static int detect_offset(FastqHandler* h, int min_ascii) {
    if (min_ascii < 59) { h->detected_variant = "Sanger"; return 33; }
    if (min_ascii < 64) { h->detected_variant = "Solexa"; return 64; }
    /* All characters >= 64: most likely Illumina 1.3+, although a
       uniformly high-quality Sanger file could fall in the same range. */
    h->detected_variant = "Illumina";
    return 64;
}

int read_and_decode(FastqHandler* h) {
    FILE* f = fopen(h->filename, "r");
    if (!f) { perror(h->filename); return -1; }
    char* line = NULL;
    size_t cap = 0;
    ssize_t n;
    char *seq = NULL, *qual = NULL;
    size_t seq_len = 0, seq_cap = 0, qual_len = 0, qual_cap = 0;
    int min_ascii = INT_MAX, max_ascii = INT_MIN;
    int status = 0;
    while ((n = getline(&line, &cap, f)) != -1) {
        if (line[0] != '@') continue; /* Skip blank or unexpected lines */
        chomp(line, n);
        char* id = strdup(line + 1);
        /* Sequence lines run until the '+' separator. */
        seq_len = 0;
        append(&seq, &seq_len, &seq_cap, "", 0);
        while ((n = getline(&line, &cap, f)) != -1 && line[0] != '+') {
            append(&seq, &seq_len, &seq_cap, line, chomp(line, n));
        }
        /* Read quality lines until they cover the sequence. Stopping on a
           leading '@' would be wrong: '@' is a valid quality character, and
           doing so would also consume the next record's header line. */
        qual_len = 0;
        append(&qual, &qual_len, &qual_cap, "", 0);
        while (n != -1 && qual_len < seq_len && (n = getline(&line, &cap, f)) != -1) {
            append(&qual, &qual_len, &qual_cap, line, chomp(line, n));
        }
        if (n == -1 || qual_len != seq_len) {
            fprintf(stderr, "Record '%s' is truncated or has mismatched lengths\n", id);
            free(id);
            status = -1;
            break;
        }
        h->num_reads++;
        h->total_bases += (long)seq_len;
        if (h->num_reads == 1) {
            /* Keep the first record as the sample shown in the report. */
            h->sample_id = id;
            h->sample_seq = strdup(seq);
            h->sample_qual = strdup(qual);
        } else {
            free(id);
        }
        for (size_t j = 0; j < qual_len; j++) {
            int c = (unsigned char)qual[j];
            if (c < min_ascii) min_ascii = c;
            if (c > max_ascii) max_ascii = c;
        }
    }
    if (min_ascii != INT_MAX) {
        int offset = detect_offset(h, min_ascii);
        h->min_qual = min_ascii - offset;
        h->max_qual = max_ascii - offset;
    }
    free(seq);
    free(qual);
    free(line);
    fclose(f);
    return status;
}

void print_properties(FastqHandler* h) {
    printf("FASTQ Format Properties:\n");
    for (int k = 0; PROPERTIES[k]; k++) {
        printf("%s\n", PROPERTIES[k]);
    }
    printf("\nFile-Specific Properties:\n");
    printf("Number of Reads: %d\n", h->num_reads);
    printf("Total Bases: %ld\n", h->total_bases);
    printf("Min/Max Phred Quality: %d/%d\n", h->min_qual, h->max_qual);
    printf("Detected Variant: %s\n", h->detected_variant);
    printf("Sample Identifier: %s\n", h->sample_id ? h->sample_id : "N/A");
    printf("Sample Sequence (first 100 chars): %.*s...\n", 100, h->sample_seq ? h->sample_seq : "");
    printf("Sample Quality String (first 100 chars): %.*s...\n", 100, h->sample_qual ? h->sample_qual : "");
}

int write_copy(FastqHandler* h, const char* output) {
    char* default_name = NULL; /* Owned here; a caller-supplied name is never freed */
    if (!output) {
        size_t size = strlen(h->filename) + sizeof(".copy");
        default_name = malloc(size);
        if (!default_name) return -1;
        snprintf(default_name, size, "%s.copy", h->filename);
        output = default_name;
    }
    FILE* out = fopen(output, "w");
    if (!out) { perror(output); free(default_name); return -1; }
    /* Only the first record is retained in memory, so only it is written. */
    if (h->sample_id && h->sample_seq && h->sample_qual) {
        fprintf(out, "@%s\n%s\n+\n%s\n", h->sample_id, h->sample_seq, h->sample_qual);
    }
    fclose(out);
    printf("\nWritten copy to: %s\n", output);
    free(default_name);
    return 0;
}

int main(int argc, char** argv) {
    if (argc != 2) {
        printf("Usage: ./fastq_handler <fastq_file>\n");
        return 1;
    }
    FastqHandler* h = fastq_handler_create(argv[1]);
    if (!h) return 1;
    int status = read_and_decode(h);
    if (status == 0) {
        print_properties(h);
        status = write_copy(h, NULL);
    }
    fastq_handler_destroy(h);
    return status == 0 ? 0 : 1;
}
Compile: gcc fastq_handler.c -o fastq_handler. Run: ./fastq_handler example.fastq. (Note: simplified for brevity; only the first record is retained and written to the copy.)