Task 157: .DVC File Format
1. List of Properties of the .DVC File Format
The .DVC file format (specifically, the .dvc files used by Data Version Control) is a text-based YAML 1.2 format that stores metadata for data versioning. As a plain-text format it has no binary header or magic number; its "properties" are the YAML keys and structures that make up its schema and determine how DVC interprets the file. The list below covers the documented properties (keys/fields) with their types, descriptions, and whether they are required. Properties are organized by root-level fields, output entries (under outs), dependency entries (under deps), and sub-objects.
Root-Level Properties
- outs: List of objects - List of output entries (files/directories tracked by DVC). Required for most .dvc files.
- deps: List of objects - List of dependency entries (external data sources or imports). Optional.
- wdir: String - Working directory relative to the .dvc file's location (defaults to "."). Optional.
- md5: String - MD5 hash of the .dvc file itself (present for imports). Optional.
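To illustrate how wdir interacts with entry paths, here is a minimal Python sketch. It assumes POSIX-style separators and that an entry's path is resolved against wdir, which is itself relative to the directory holding the .dvc file. The function name resolve_out_path is hypothetical and is not part of DVC.

```python
import posixpath
from pathlib import PurePosixPath


def resolve_out_path(dvc_file: str, entry_path: str, wdir: str = ".") -> str:
    """Resolve an entry path against wdir, relative to the .dvc file's directory.

    Hypothetical helper for illustration only; not a DVC API.
    """
    base = PurePosixPath(dvc_file).parent
    # normpath collapses "." and ".." segments left over from joining.
    return posixpath.normpath(str(base / wdir / entry_path))


print(resolve_out_path("data/raw.dvc", "images"))                    # data/images
print(resolve_out_path("stages/train.dvc", "model.pkl", wdir=".."))  # model.pkl
```

With the default wdir of ".", paths are simply relative to the folder containing the .dvc file.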
Properties in Output Entries (Each Item in outs)
- path: String - Path to the file/directory (relative to wdir). Required.
- hash: String - Hash algorithm (currently only "md5" supported). Optional.
- md5: String - MD5 hash value (for local/SSH). Optional.
- etag: String - ETag hash value (for HTTP/S3/Azure). Optional.
- checksum: String - Checksum value (for HDFS/WebHDFS). Optional.
- version_id: String - Cloud provider version ID (if versioning enabled). Optional.
- size: Integer - Size in bytes (sum for directories). Optional.
- nfiles: Integer - Number of files in a directory (recursive). Optional.
- isexec: Boolean - Whether the file is executable (preserved on checkout/pull; no effect on directories/Windows). Optional.
- cache: Boolean - Whether to cache the file/directory (defaults to true). Optional.
- remote: String - Name of the DVC remote for push/fetch. Optional.
- persist: Boolean - Whether the output remains during reproduction (defaults to false). Optional.
- push: Boolean - Whether to upload to remote on dvc push (defaults to true). Optional.
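As a rough illustration of how the path, hash, md5, and size fields relate to a tracked file, the following Python sketch builds an output entry for a single file. It assumes the md5 value is the plain MD5 digest of the file's bytes and that the entry path is just the file name; DVC's own hashing may differ by version and settings. build_out_entry is a hypothetical helper, not a DVC function.

```python
import hashlib
import os
import tempfile


def build_out_entry(file_path: str, chunk_size: int = 1 << 20) -> dict:
    """Build an outs-style entry for one file (illustrative sketch only)."""
    digest = hashlib.md5()
    with open(file_path, "rb") as fh:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "path": os.path.basename(file_path),  # assumption: file sits in wdir
        "hash": "md5",
        "md5": digest.hexdigest(),
        "size": os.path.getsize(file_path),
    }


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "data.txt")
    with open(sample, "wb") as fh:
        fh.write(b"hello\n")
    entry = build_out_entry(sample)
    print(entry["path"], entry["md5"], entry["size"])
```

Directory outputs, nfiles, and remote-specific hashes such as etag are outside the scope of this sketch.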
Properties in Dependency Entries (Each Item in deps)
- path: String - Path to the dependency (relative to wdir). Required.
- hash: String - Hash algorithm (currently only "md5" supported). Optional.
- md5: String - MD5 hash value (for local/SSH). Optional.
- etag: String - ETag hash value (for HTTP/S3/GCS/Azure). Optional.
- checksum: String - Checksum value (for HDFS/WebHDFS). Optional.
- size: Integer - Size in bytes (sum for directories). Optional.
- nfiles: Integer - Number of files in a directory (recursive). Optional.
- repo: Object - Details for external DVC project dependencies. Optional. Sub-properties:
  - url: String - URL of the Git repository with the source DVC project.
  - rev: String - Git revision (commit hash, branch, or tag).
  - rev_lock: String - Locked Git commit hash at import time.
  - config: String - Path to config file or config options.
  - remote: String - Name of the DVC remote.
- db: Object - Details for database dependencies. Optional. Sub-properties:
  - connection: String - Database connection name.
  - query: String - SQL query for snapshot.
  - table: String - Database table name.
  - file_format: String - Export format ("csv" or "json").
Comments can be added using # syntax. The file may include additional custom keys (e.g., desc for descriptions), but they are not part of the core schema.
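The required/optional rules listed above can be checked mechanically once a file has been parsed into a mapping. The sketch below is a minimal Python illustration that works on an already-parsed dict (for example, the result of a YAML loader). The function name check_dvc_mapping and the message wording are assumptions made for this example, and the check covers only the root keys and the required path field.

```python
KNOWN_ROOT_KEYS = {"outs", "deps", "wdir", "md5"}


def check_dvc_mapping(data: dict) -> list:
    """Return a list of human-readable findings for a parsed .dvc mapping."""
    findings = []
    for key in data:
        if key not in KNOWN_ROOT_KEYS:
            # Custom keys are allowed, so this is a note rather than an error.
            findings.append(f"non-core root key: {key}")
    for section in ("outs", "deps"):
        entries = data.get(section, [])
        if not isinstance(entries, list):
            findings.append(f"{section} must be a list")
            continue
        for i, entry in enumerate(entries):
            if not isinstance(entry, dict) or "path" not in entry:
                findings.append(f"{section}[{i}] is missing required 'path'")
    return findings


# Placeholder values only; these are not real hashes.
ok = {"outs": [{"path": "data.xml", "md5": "placeholder", "size": 123}]}
bad = {"outs": [{"md5": "placeholder"}], "desc": "demo"}
print(check_dvc_mapping(ok))
print(check_dvc_mapping(bad))
```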
2. Two Direct Download Links for .DVC Files
- https://raw.githubusercontent.com/iterative/example-get-started/main/data/data.xml.dvc
- https://raw.githubusercontent.com/iterative/example-get-started-experiments/main/data/pool_data.dvc
3. Ghost Blog Embedded HTML JavaScript for Drag-and-Drop .DVC File Dump
This is a self-contained HTML snippet with embedded JavaScript that can be embedded in a Ghost blog post (or any HTML page). It creates a drag-and-drop area where a user can drop a .DVC file. The script reads the file as text, parses it as YAML using the js-yaml library (included via CDN), extracts all properties recursively, and dumps them to the screen in a readable key-value format.
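The recursive extraction step described above is language-independent. As a hedged illustration, written in Python rather than the snippet's JavaScript, the sketch below flattens an already-parsed mapping into (key path, value) pairs. The flatten name and the dotted/bracketed path notation are choices made for this example only.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into a list of (key_path, value) pairs."""
    rows = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            rows.extend(flatten(value, child))
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            rows.extend(flatten(item, f"{prefix}[{i}]"))
    else:
        # Scalars (strings, numbers, booleans, None) end the recursion.
        rows.append((prefix, obj))
    return rows


parsed = {"outs": [{"path": "data.xml", "size": 10}], "wdir": "."}
for key_path, value in flatten(parsed):
    print(f"{key_path}: {value}")
```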
4. Python Class for .DVC File Handling
import os

import yaml  # PyYAML: pip install pyyaml


class DVCFileHandler:
    def __init__(self, filepath):
        self.filepath = filepath
        self.data = None

    def read(self):
        """Read and decode the .DVC file as YAML."""
        if not os.path.exists(self.filepath):
            raise FileNotFoundError(f"File {self.filepath} not found.")
        with open(self.filepath, 'r', encoding='utf-8') as f:
            self.data = yaml.safe_load(f)
        return self.data

    def write(self, new_data=None):
        """Write the current data or new data to the .DVC file as YAML."""
        # Compare against None so an explicitly passed empty mapping is honored.
        data_to_write = new_data if new_data is not None else self.data
        if data_to_write is None:
            raise ValueError("No data to write.")
        with open(self.filepath, 'w', encoding='utf-8') as f:
            yaml.safe_dump(data_to_write, f, sort_keys=False)

    def print_properties(self):
        """Print all properties to console in a readable format."""
        if self.data is None:
            print("No data loaded. Call read() first.")
            return

        def dump(obj, prefix=''):
            if isinstance(obj, dict):
                for key, value in obj.items():
                    if isinstance(value, (dict, list)):
                        print(f"{prefix}{key}:")
                        dump(value, prefix + '  ')
                    else:
                        print(f"{prefix}{key}: {value}")
            elif isinstance(obj, list):
                for i, item in enumerate(obj):
                    print(f"{prefix}[{i}]:")
                    dump(item, prefix + '  ')
            else:
                # Scalar list items have no keys to iterate over.
                print(f"{prefix}{obj}")

        dump(self.data)


# Example usage:
# handler = DVCFileHandler('example.dvc')
# handler.read()
# handler.print_properties()
# handler.write({'outs': [{'path': 'new.xml', 'md5': 'newhash'}]})
5. Java Class for .DVC File Handling
import org.yaml.snakeyaml.Yaml;
import java.io.*;
import java.util.Map;

public class DVCFileHandler {
    private String filepath;
    private Map<String, Object> data;

    public DVCFileHandler(String filepath) {
        this.filepath = filepath;
    }

    public Map<String, Object> read() throws IOException {
        File file = new File(filepath);
        if (!file.exists()) {
            throw new FileNotFoundException("File " + filepath + " not found.");
        }
        try (FileInputStream fis = new FileInputStream(file)) {
            Yaml yaml = new Yaml();
            this.data = yaml.load(fis);
        }
        return this.data;
    }

    public void write(Map<String, Object> newData) throws IOException {
        Map<String, Object> dataToWrite = (newData != null) ? newData : this.data;
        if (dataToWrite == null) {
            throw new IllegalArgumentException("No data to write.");
        }
        try (FileWriter fw = new FileWriter(filepath)) {
            Yaml yaml = new Yaml();
            yaml.dump(dataToWrite, fw);
        }
    }

    public void printProperties() {
        if (this.data == null) {
            System.out.println("No data loaded. Call read() first.");
            return;
        }
        dump(this.data, "");
    }

    // Recursively prints maps, lists, and scalars with increasing indentation.
    private void dump(Object obj, String prefix) {
        if (obj instanceof Map) {
            @SuppressWarnings("unchecked")
            Map<String, Object> map = (Map<String, Object>) obj;
            for (Map.Entry<String, Object> entry : map.entrySet()) {
                System.out.println(prefix + entry.getKey() + ":");
                dump(entry.getValue(), prefix + "  ");
            }
        } else if (obj instanceof Iterable) {
            @SuppressWarnings("unchecked")
            Iterable<Object> list = (Iterable<Object>) obj;
            int i = 0;
            for (Object item : list) {
                System.out.println(prefix + "[" + i++ + "]:");
                dump(item, prefix + "  ");
            }
        } else {
            System.out.println(prefix + obj);
        }
    }

    // Example usage:
    // public static void main(String[] args) throws IOException {
    //     DVCFileHandler handler = new DVCFileHandler("example.dvc");
    //     handler.read();
    //     handler.printProperties();
    //     // handler.write(new HashMap<>() {{ put("outs", Arrays.asList(new HashMap<>() {{ put("path", "new.xml"); }})); }});
    // }
}
6. JavaScript Class for .DVC File Handling
const fs = require('fs'); // For Node.js environment
const yaml = require('js-yaml'); // Requires js-yaml package: npm install js-yaml

class DVCFileHandler {
  constructor(filepath) {
    this.filepath = filepath;
    this.data = null;
  }

  read() {
    if (!fs.existsSync(this.filepath)) {
      throw new Error(`File ${this.filepath} not found.`);
    }
    const fileContent = fs.readFileSync(this.filepath, 'utf8');
    this.data = yaml.load(fileContent);
    return this.data;
  }

  write(newData = null) {
    const dataToWrite = newData || this.data;
    if (!dataToWrite) {
      throw new Error('No data to write.');
    }
    fs.writeFileSync(this.filepath, yaml.dump(dataToWrite));
  }

  printProperties() {
    if (!this.data) {
      console.log('No data loaded. Call read() first.');
      return;
    }
    const dump = (obj, prefix = '') => {
      if (Array.isArray(obj)) {
        obj.forEach((item, i) => {
          console.log(`${prefix}[${i}]:`);
          dump(item, prefix + '  ');
        });
      } else if (typeof obj === 'object' && obj !== null) {
        for (const [key, value] of Object.entries(obj)) {
          if (typeof value === 'object' && value !== null) {
            console.log(`${prefix}${key}:`);
            dump(value, prefix + '  ');
          } else {
            console.log(`${prefix}${key}: ${value}`);
          }
        }
      } else {
        // Scalar array items: print the value itself instead of iterating it.
        console.log(`${prefix}${obj}`);
      }
    };
    dump(this.data);
  }
}

// Example usage:
// const handler = new DVCFileHandler('example.dvc');
// handler.read();
// handler.printProperties();
// handler.write({ outs: [{ path: 'new.xml', md5: 'newhash' }] });
7. C Class for .DVC File Handling
(Note: C does not have native "classes," so this is implemented in C++ for object-oriented structure. It uses the libyaml library for YAML parsing, which must be installed and linked (-lyaml). Printing walks the parsed document tree recursively; writing simply stores a caller-supplied YAML string and does not serialize the parsed document.)
#include <iostream>
#include <fstream>
#include <iterator>
#include <string>
#include <yaml.h> // Requires libyaml: apt install libyaml-dev or similar

class DVCFileHandler {
private:
    std::string filepath;
    yaml_document_t document;
    bool loaded = false; // True only while `document` holds a parsed document

public:
    DVCFileHandler(const std::string& fp) : filepath(fp) {}

    ~DVCFileHandler() {
        // Free the document only if yaml_parser_load() actually initialized it.
        if (loaded) yaml_document_delete(&document);
    }

    bool read() {
        std::ifstream file(filepath);
        if (!file.is_open()) {
            std::cerr << "File " << filepath << " not found." << std::endl;
            return false;
        }
        std::string content((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
        // Release a previously loaded document before parsing again.
        if (loaded) {
            yaml_document_delete(&document);
            loaded = false;
        }
        yaml_parser_t parser;
        if (!yaml_parser_initialize(&parser)) {
            std::cerr << "Cannot initialize YAML parser." << std::endl;
            return false;
        }
        yaml_parser_set_input_string(&parser, reinterpret_cast<const unsigned char*>(content.c_str()), content.size());
        if (!yaml_parser_load(&parser, &document)) {
            std::cerr << "Error parsing YAML." << std::endl;
            yaml_parser_delete(&parser);
            return false;
        }
        yaml_parser_delete(&parser);
        loaded = true;
        return true;
    }

    bool write(const std::string& yamlString) {
        std::ofstream file(filepath);
        if (!file.is_open()) {
            std::cerr << "Cannot open file for writing." << std::endl;
            return false;
        }
        file << yamlString;
        return true;
    }

    void printProperties() {
        if (!loaded) {
            std::cout << "No data loaded. Call read() first." << std::endl;
            return;
        }
        // The root node is null when the input stream contained no document.
        yaml_node_t* root = yaml_document_get_root_node(&document);
        if (root != nullptr && root->type == YAML_MAPPING_NODE) {
            dump(root, "");
        } else {
            std::cout << "Invalid root node type." << std::endl;
        }
    }

private:
    void dump(yaml_node_t* node, const std::string& prefix) {
        if (node->type == YAML_SCALAR_NODE) {
            std::cout << prefix << reinterpret_cast<const char*>(node->data.scalar.value) << std::endl;
        } else if (node->type == YAML_MAPPING_NODE) {
            for (yaml_node_pair_t* pair = node->data.mapping.pairs.start; pair < node->data.mapping.pairs.top; ++pair) {
                yaml_node_t* key = yaml_document_get_node(&document, pair->key);
                yaml_node_t* value = yaml_document_get_node(&document, pair->value);
                std::cout << prefix << reinterpret_cast<const char*>(key->data.scalar.value) << ": " << std::endl;
                dump(value, prefix + "  ");
            }
        } else if (node->type == YAML_SEQUENCE_NODE) {
            int i = 0;
            for (yaml_node_item_t* item = node->data.sequence.items.start; item < node->data.sequence.items.top; ++item) {
                yaml_node_t* value = yaml_document_get_node(&document, *item);
                std::cout << prefix << "[" << i++ << "]:" << std::endl;
                dump(value, prefix + "  ");
            }
        }
    }
};

// Example usage:
// int main() {
//     DVCFileHandler handler("example.dvc");
//     if (handler.read()) {
//         handler.printProperties();
//     }
//     // handler.write("outs:\n  - path: new.xml\n    md5: newhash\n");
//     return 0;
// }