Task 157: .DVC File Format

1. List of Properties of the .DVC File Format

The .DVC file format (specifically the .dvc files written by Data Version Control) is a text-based YAML 1.2 format that stores metadata for data versioning. Because it is plain text, it has no binary header or magic number; its "properties" are the YAML keys and structures that make up its schema, and they determine how DVC interprets the file. The lists below cover the main properties (keys/fields) with their types, descriptions, and whether they are required, grouped into root-level fields, output entries (under outs), dependency entries (under deps), and their sub-objects.

Root-Level Properties

  • outs: List of objects - List of output entries (files/directories tracked by DVC). Required for most .dvc files.
  • deps: List of objects - List of dependency entries (external data sources or imports). Optional.
  • wdir: String - Working directory relative to the .dvc file's location (defaults to "."). Optional.
  • md5: String - MD5 hash of the .dvc file itself (present for imports). Optional.

Properties in Output Entries (Each Item in outs)

  • path: String - Path to the file/directory (relative to wdir). Required.
  • hash: String - Hash algorithm (currently only "md5" supported). Optional.
  • md5: String - MD5 hash value (for local/SSH). Optional.
  • etag: String - ETag hash value (for HTTP/S3/GCS/Azure). Optional.
  • checksum: String - Checksum value (for HDFS/WebHDFS). Optional.
  • version_id: String - Cloud provider version ID (if versioning enabled). Optional.
  • size: Integer - Size in bytes (sum for directories). Optional.
  • nfiles: Integer - Number of files in a directory (recursive). Optional.
  • isexec: Boolean - Whether the file is executable (preserved on checkout/pull; no effect on directories/Windows). Optional.
  • cache: Boolean - Whether to cache the file/directory (defaults to true). Optional.
  • remote: String - Name of the DVC remote for push/fetch. Optional.
  • persist: Boolean - Whether the output remains during reproduction (defaults to false). Optional.
  • push: Boolean - Whether to upload to remote on dvc push (defaults to true). Optional.
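
The defaults and path rules listed above can be sketched in a few lines of Python. This is a minimal illustration, not DVC's own code: the helper names with_defaults and resolve_out_path are hypothetical, the md5 value is a placeholder (the MD5 of an empty input), and paths are assumed to use forward slashes.

```python
import posixpath

# Defaults taken from the list above; the helper names are hypothetical.
OUT_DEFAULTS = {"cache": True, "persist": False, "push": True}


def with_defaults(out):
    """Return a copy of an output entry with the documented defaults filled in."""
    if "path" not in out:
        raise ValueError("output entry is missing the required 'path' key")
    # Keys present in the entry override the defaults.
    return {**OUT_DEFAULTS, **out}


def resolve_out_path(dvc_file, out_path, wdir="."):
    """Resolve an output path: wdir is relative to the .dvc file, path to wdir."""
    base = posixpath.dirname(dvc_file)
    return posixpath.normpath(posixpath.join(base, wdir, out_path))


# Placeholder entry: the md5 below is the hash of an empty input.
entry = with_defaults({"path": "data.xml", "md5": "d41d8cd98f00b204e9800998ecf8427e"})
print(entry["cache"], entry["persist"], entry["push"])
print(resolve_out_path("data/raw.dvc", "data.xml"))
print(resolve_out_path("stages/raw.dvc", "raw/data.xml", wdir=".."))
```

Merging the defaults into a copy leaves the parsed entry untouched, so the file can still be written back without the implied keys.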

Properties in Dependency Entries (Each Item in deps)

  • path: String - Path to the dependency (relative to wdir). Required.
  • hash: String - Hash algorithm (currently only "md5" supported). Optional.
  • md5: String - MD5 hash value (for local/SSH). Optional.
  • etag: String - ETag hash value (for HTTP/S3/GCS/Azure). Optional.
  • checksum: String - Checksum value (for HDFS/WebHDFS). Optional.
  • size: Integer - Size in bytes (sum for directories). Optional.
  • nfiles: Integer - Number of files in a directory (recursive). Optional.
  • repo: Object - Details for external DVC project dependencies. Optional. Sub-properties:
      • url: String - URL of the Git repository with the source DVC project.
      • rev: String - Git revision (commit hash, branch, or tag).
      • rev_lock: String - Locked Git commit hash at import time.
      • config: String - Path to config file or config options.
      • remote: String - Name of the DVC remote.
  • db: Object - Details for database dependencies. Optional. Sub-properties:
      • connection: String - Database connection name.
      • query: String - SQL query for snapshot.
      • table: String - Database table name.
      • file_format: String - Export format ("csv" or "json").
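
As a rough illustration of how the repo and db sub-objects distinguish dependency entries, the sketch below classifies an already parsed entry. The function names dependency_kind and is_pinned are hypothetical, and the repository URL and commit hash are placeholders rather than real values.

```python
def dependency_kind(dep):
    """Classify a dependency entry by the optional sub-object it carries."""
    if "repo" in dep:
        return "repo"
    if "db" in dep:
        return "db"
    return "plain"


def is_pinned(dep):
    """True when a repo dependency records the commit it was imported from."""
    return bool(dep.get("repo", {}).get("rev_lock"))


# Placeholder entries; the URL and hash are made up for illustration.
imported = {
    "path": "data/data.xml",
    "repo": {
        "url": "https://example.com/org/project.git",
        "rev": "main",
        "rev_lock": "0" * 40,
    },
}
snapshot = {"path": "users.csv", "db": {"connection": "warehouse", "table": "users", "file_format": "csv"}}
local = {"path": "raw.bin", "md5": "d41d8cd98f00b204e9800998ecf8427e"}

for dep in (imported, snapshot, local):
    print(dep["path"], dependency_kind(dep), is_pinned(dep))
```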

Comments can be added using # syntax. The file may include additional custom keys (e.g., desc for descriptions), but they are not part of the core schema.
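
One way to spot such custom keys is to compare a parsed file against the key lists in this section. The sketch below assumes the YAML has already been parsed into a dict; the key sets mirror only the lists above (not DVC's full schema), and the function name custom_keys and the sample keys owner and desc are purely illustrative.

```python
# Key sets copied from the lists in this section; not an official schema.
ROOT_KEYS = {"outs", "deps", "wdir", "md5"}
OUT_KEYS = {
    "path", "hash", "md5", "etag", "checksum", "version_id", "size",
    "nfiles", "isexec", "cache", "remote", "persist", "push",
}
DEP_KEYS = {"path", "hash", "md5", "etag", "checksum", "size", "nfiles", "repo", "db"}


def custom_keys(data):
    """Return sorted key paths that do not appear in the lists above."""
    found = [key for key in data if key not in ROOT_KEYS]
    for section, allowed in (("outs", OUT_KEYS), ("deps", DEP_KEYS)):
        for index, entry in enumerate(data.get(section) or []):
            found += [f"{section}[{index}].{key}" for key in entry if key not in allowed]
    return sorted(found)


parsed = {"outs": [{"path": "data.xml", "desc": "raw export"}], "owner": "data-team"}
print(custom_keys(parsed))
```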

3. Ghost Blog Embedded HTML JavaScript for Drag-and-Drop .DVC File Dump

This is a self-contained HTML snippet with embedded JavaScript that can be embedded in a Ghost blog post (or any HTML page). It creates a drag-and-drop area where a user can drop a .DVC file. The script reads the file as text, parses it as YAML using the js-yaml library (included via CDN), extracts all properties recursively, and dumps them to the screen in a readable key-value format.

[Embedded widget: ".DVC File Properties Dumper" heading with a drop zone labeled "Drag and drop a .DVC file here"]
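
The recursive extraction step described above can be sketched independently of the browser. The following is a Python illustration of that idea, not the embedded JavaScript itself: it assumes the YAML has already been parsed into nested dicts and lists, and the function name flatten and the sample data are hypothetical.

```python
def flatten(obj, prefix=""):
    """Yield (key_path, value) pairs for every scalar in nested dicts and lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            yield from flatten(item, f"{prefix}[{index}]")
    else:
        # Scalars end the recursion and are reported with their full path.
        yield prefix, obj


# Sample data standing in for a parsed .dvc file.
parsed = {"outs": [{"path": "data.xml", "size": 1024, "cache": True}], "wdir": "."}
for key_path, value in flatten(parsed):
    print(f"{key_path}: {value}")
```

Reporting a full key path per value keeps the dump unambiguous when several output entries share the same key names.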

4. Python Class for .DVC File Handling

import yaml
import os

class DVCFileHandler:
    def __init__(self, filepath):
        self.filepath = filepath
        self.data = None

    def read(self):
        """Read and decode the .DVC file as YAML."""
        if not os.path.exists(self.filepath):
            raise FileNotFoundError(f"File {self.filepath} not found.")
        with open(self.filepath, 'r') as f:
            self.data = yaml.safe_load(f)
        return self.data

    def write(self, new_data=None):
        """Write the current data or new data to the .DVC file as YAML."""
        # Compare against None so an explicitly passed empty mapping is honored.
        data_to_write = new_data if new_data is not None else self.data
        if data_to_write is None:
            raise ValueError("No data to write.")
        with open(self.filepath, 'w') as f:
            yaml.safe_dump(data_to_write, f, sort_keys=False)

    def print_properties(self):
        """Print all properties to console in a readable format."""
        if self.data is None:
            print("No data loaded. Call read() first.")
            return
        def dump(obj, prefix=''):
            # Handle dicts, lists, and scalars; list items are not always mappings.
            if isinstance(obj, dict):
                for key, value in obj.items():
                    if isinstance(value, (dict, list)):
                        print(f"{prefix}{key}:")
                        dump(value, prefix + '  ')
                    else:
                        print(f"{prefix}{key}: {value}")
            elif isinstance(obj, list):
                for i, item in enumerate(obj):
                    print(f"{prefix}[{i}]:")
                    dump(item, prefix + '  ')
            else:
                print(f"{prefix}{obj}")
        dump(self.data)

# Example usage:
# handler = DVCFileHandler('example.dvc')
# handler.read()
# handler.print_properties()
# handler.write({'outs': [{'path': 'new.xml', 'md5': 'newhash'}]})

5. Java Class for .DVC File Handling

import org.yaml.snakeyaml.Yaml;
import java.io.*;
import java.util.Map;

public class DVCFileHandler {
    private String filepath;
    private Map<String, Object> data;

    public DVCFileHandler(String filepath) {
        this.filepath = filepath;
    }

    public Map<String, Object> read() throws IOException {
        File file = new File(filepath);
        if (!file.exists()) {
            throw new FileNotFoundException("File " + filepath + " not found.");
        }
        try (FileInputStream fis = new FileInputStream(file)) {
            Yaml yaml = new Yaml();
            this.data = yaml.load(fis);
        }
        return this.data;
    }

    public void write(Map<String, Object> newData) throws IOException {
        Map<String, Object> dataToWrite = (newData != null) ? newData : this.data;
        if (dataToWrite == null) {
            throw new IllegalArgumentException("No data to write.");
        }
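        // Block style matches the layout of .dvc files; the default style may emit inline {...} maps.
        org.yaml.snakeyaml.DumperOptions options = new org.yaml.snakeyaml.DumperOptions();
        options.setDefaultFlowStyle(org.yaml.snakeyaml.DumperOptions.FlowStyle.BLOCK);
        try (FileWriter fw = new FileWriter(filepath)) {
            Yaml yaml = new Yaml(options);
            yaml.dump(dataToWrite, fw);
        }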
    }

    public void printProperties() {
        if (this.data == null) {
            System.out.println("No data loaded. Call read() first.");
            return;
        }
        dump(this.data, "");
    }

    private void dump(Object obj, String prefix) {
        if (obj instanceof Map) {
            // Wildcard types avoid an unchecked cast; YAML keys are not guaranteed to be strings.
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) obj).entrySet()) {
                Object value = entry.getValue();
                if (value instanceof Map || value instanceof Iterable) {
                    System.out.println(prefix + entry.getKey() + ":");
                    dump(value, prefix + "  ");
                } else {
                    // Scalars are printed on the same line as their key.
                    System.out.println(prefix + entry.getKey() + ": " + value);
                }
            }
        } else if (obj instanceof Iterable) {
            int i = 0;
            for (Object item : (Iterable<?>) obj) {
                System.out.println(prefix + "[" + i++ + "]:");
                dump(item, prefix + "  ");
            }
        } else {
            System.out.println(prefix + obj);
        }
    }

    // Example usage:
    // public static void main(String[] args) throws IOException {
    //     DVCFileHandler handler = new DVCFileHandler("example.dvc");
    //     handler.read();
    //     handler.printProperties();
    //     // handler.write(new HashMap<>() {{ put("outs", Arrays.asList(new HashMap<>() {{ put("path", "new.xml"); }})); }});
    // }
}

6. JavaScript Class for .DVC File Handling

const fs = require('fs'); // For Node.js environment
const yaml = require('js-yaml'); // Requires js-yaml package: npm install js-yaml

class DVCFileHandler {
    constructor(filepath) {
        this.filepath = filepath;
        this.data = null;
    }

    read() {
        if (!fs.existsSync(this.filepath)) {
            throw new Error(`File ${this.filepath} not found.`);
        }
        const fileContent = fs.readFileSync(this.filepath, 'utf8');
        this.data = yaml.load(fileContent);
        return this.data;
    }

    write(newData = null) {
        const dataToWrite = newData || this.data;
        if (!dataToWrite) {
            throw new Error('No data to write.');
        }
        fs.writeFileSync(this.filepath, yaml.dump(dataToWrite));
    }

    printProperties() {
        if (!this.data) {
            console.log('No data loaded. Call read() first.');
            return;
        }
        const dump = (obj, prefix = '') => {
            // Print scalars directly; for...in on a string would walk its characters.
            if (typeof obj !== 'object' || obj === null) {
                console.log(`${prefix}${obj}`);
                return;
            }
            for (const key in obj) {
                if (typeof obj[key] === 'object' && obj[key] !== null) {
                    if (Array.isArray(obj[key])) {
                        console.log(`${prefix}${key}:`);
                        obj[key].forEach((item, i) => {
                            console.log(`${prefix}  [${i}]:`);
                            dump(item, prefix + '    ');
                        });
                    } else {
                        console.log(`${prefix}${key}:`);
                        dump(obj[key], prefix + '  ');
                    }
                } else {
                    console.log(`${prefix}${key}: ${obj[key]}`);
                }
            }
        };
        dump(this.data);
    }
}

// Example usage:
// const handler = new DVCFileHandler('example.dvc');
// handler.read();
// handler.printProperties();
// handler.write({ outs: [{ path: 'new.xml', md5: 'newhash' }] });

7. C Class for .DVC File Handling

(Note: C does not have native "classes," so this is implemented in C++ for object-oriented structure. It uses the libyaml library for YAML parsing, which must be installed and linked (-lyaml). read() loads the file into a libyaml document and printProperties() walks that document recursively; write() does not serialize anything itself and simply stores an already formatted YAML string.)

#include <iostream>
#include <fstream>
#include <iterator>
#include <string>
#include <yaml.h> // Requires libyaml: apt install libyaml-dev or similar

class DVCFileHandler {
private:
    std::string filepath;
    yaml_document_t document{};
    bool loaded = false; // True only while `document` holds a parsed file

public:
    DVCFileHandler(const std::string& fp) : filepath(fp) {}
    // The document is owned by this object, so copying is disabled.
    DVCFileHandler(const DVCFileHandler&) = delete;
    DVCFileHandler& operator=(const DVCFileHandler&) = delete;
    // Release the document only if read() actually populated it.
    ~DVCFileHandler() { if (loaded) yaml_document_delete(&document); }

    bool read() {
        std::ifstream file(filepath);
        if (!file.is_open()) {
            std::cerr << "File " << filepath << " not found." << std::endl;
            return false;
        }
        std::string content((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
        if (loaded) { // Free the previous document before loading again
            yaml_document_delete(&document);
            loaded = false;
        }
        yaml_parser_t parser;
        if (!yaml_parser_initialize(&parser)) {
            std::cerr << "Cannot initialize YAML parser." << std::endl;
            return false;
        }
        yaml_parser_set_input_string(&parser, reinterpret_cast<const unsigned char*>(content.c_str()), content.size());
        if (!yaml_parser_load(&parser, &document)) {
            std::cerr << "Error parsing YAML." << std::endl;
            yaml_parser_delete(&parser);
            return false;
        }
        yaml_parser_delete(&parser);
        loaded = true;
        return true;
    }

    bool write(const std::string& yamlString) {
        std::ofstream file(filepath);
        if (!file.is_open()) {
            std::cerr << "Cannot open file for writing." << std::endl;
            return false;
        }
        file << yamlString;
        return true;
    }

    void printProperties() {
        // An empty file parses successfully but has no root node.
        yaml_node_t* root = loaded ? yaml_document_get_root_node(&document) : nullptr;
        if (root == nullptr) {
            std::cout << "No data loaded. Call read() first." << std::endl;
            return;
        }
        if (root->type == YAML_MAPPING_NODE) {
            dump(root, "");
        } else {
            std::cout << "Invalid root node type." << std::endl;
        }
    }

private:
    void dump(yaml_node_t* node, const std::string& prefix) {
        if (node->type == YAML_SCALAR_NODE) {
            std::cout << prefix << reinterpret_cast<const char*>(node->data.scalar.value) << std::endl;
        } else if (node->type == YAML_MAPPING_NODE) {
            for (yaml_node_pair_t* pair = node->data.mapping.pairs.start; pair < node->data.mapping.pairs.top; ++pair) {
                yaml_node_t* key = yaml_document_get_node(&document, pair->key);
                yaml_node_t* value = yaml_document_get_node(&document, pair->value);
                std::cout << prefix << reinterpret_cast<const char*>(key->data.scalar.value) << ": " << std::endl;
                dump(value, prefix + "  ");
            }
        } else if (node->type == YAML_SEQUENCE_NODE) {
            int i = 0;
            for (yaml_node_item_t* item = node->data.sequence.items.start; item < node->data.sequence.items.top; ++item) {
                yaml_node_t* value = yaml_document_get_node(&document, *item);
                std::cout << prefix << "[" << i++ << "]:" << std::endl;
                dump(value, prefix + "  ");
            }
        }
    }
};

// Example usage:
// int main() {
//     DVCFileHandler handler("example.dvc");
//     if (handler.read()) {
//         handler.printProperties();
//     }
//     // handler.write("outs:\n  - path: new.xml\n    md5: newhash\n");
//     return 0;
// }