Working on the codebase with AI? Start with AGENTS.md for global setup and package-specific guides.

Zippy Data System

A human-readable, schema-flexible document store built for modern ML and data engineering workflows. Store JSON documents with the simplicity of files and the speed of databases.

20x faster writes · 3.5x faster access · 100% human-readable · Zero lock-in

Why ZDS?

Modern ML and data workflows need flexibility that traditional formats struggle to provide. Parquet and Arrow enforce rigid schemas. SQLite requires SQL. Plain JSON has no indexing. ZDS bridges this gap.

πŸ“„

Human-Readable

Debug with cat. Edit with vim. Version control with git. Your data is always accessible with standard tools.

πŸ”€

Schema-Flexible

Each document defines its own shape. No migrations needed. Perfect for iterative development and heterogeneous data.

⚑

High Performance

Rust core with mmap, simd-json, and FxHashMap. O(1) random access. Writes at 4.6M records/second.

🌐

Multi-Language

Native bindings for Python, Node.js, and Rust. Query with DuckDB SQL. One format, every platform.

πŸ”“

Zero Lock-in

ZIP container + JSONL documents. If this library disappears, your data remains fully accessible forever.

πŸ€–

ML-Ready

HuggingFace Dataset API compatible. Streaming iteration. Shuffle buffers. Built for training pipelines.
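The "shuffle buffers" mentioned above are a standard streaming technique: instead of materializing the whole dataset, keep a fixed-size buffer and emit a random element as each new one arrives. A minimal sketch of the idea in pure Python (illustrative only; `shuffle_buffer` and its parameters are not the library's actual API):

```python
import random
from typing import Any, Iterable, Iterator

def shuffle_buffer(stream: Iterable[Any], buffer_size: int = 1000, seed: int = 42) -> Iterator[Any]:
    """Approximate shuffle over a stream using O(buffer_size) memory."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            # Swap a random element to the end, then emit it.
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            yield buf.pop()
    rng.shuffle(buf)  # drain the remainder in random order
    yield from buf

docs = ({"id": i} for i in range(10))
shuffled = list(shuffle_buffer(docs, buffer_size=4))
```

The trade-off is the usual one: a larger buffer gives a more uniform shuffle at the cost of memory, which is why training pipelines expose the buffer size as a knob.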


Quick Start

Python

pip install zippy-data
from zippy import ZDSStore, ZDSRoot, ZDataset

# Legacy helper: single-collection store (still supported)
store = ZDSStore.open("./my_dataset", collection="train")

# Add documents
store.put("doc_001", {"text": "Hello world", "label": 1})
store.put("doc_002", {"text": "Goodbye", "label": 0, "extra": [1, 2, 3]})

# Random access
print(store["doc_001"])  # {"text": "Hello world", "label": 1}

# Preferred: open a root once, then grab multiple collections
root = ZDSRoot.open("./my_dataset", native=True)
train = root.collection("train")
test = root.collection("test")

# Iterate like HuggingFace
dataset = ZDataset(train)
for doc in dataset.shuffle(seed=42):
    print(doc["text"])

print(root.list_collections())  # ['test', 'train']

Node.js

npm install @zippydata/core
const { ZdsStore } = require('@zippydata/core');

const store = ZdsStore.open('./my_dataset', 'train');

store.put('doc_001', { text: 'Hello world', label: 1 });
console.log(store.get('doc_001'));

for (const doc of store.scan()) {
    console.log(doc.text);
}

CLI

# Initialize a store
zippy init ./my_dataset -c train

# Add documents
zippy put ./my_dataset -c train doc_001 --data '{"text": "Hello"}'

# Query
zippy scan ./my_dataset -c train --fields text,label

How It Compares

| Feature            | ZDS | Parquet | SQLite | Plain JSON |
|--------------------|-----|---------|--------|------------|
| Human-readable     | βœ…  | ❌      | ❌     | βœ…         |
| Schema-flexible    | βœ…  | ❌      | ⚠️     | βœ…         |
| Fast random access | βœ…  | ❌      | βœ…     | ❌         |
| Indexed lookups    | βœ…  | ❌      | βœ…     | ❌         |
| Git-friendly       | βœ…  | ❌      | ❌     | βœ…         |
| No special tools   | βœ…  | ❌      | ⚠️     | βœ…         |
| ML dataset API     | βœ…  | ⚠️      | ❌     | ❌         |

The Philosophy

β€œThe best format is one you can understand in 5 minutes and debug with cat.”

ZDS follows a proven pattern: a ZIP container wrapping human-readable JSONL documents, enhanced with binary indexes for performance, just as DOCX wraps XML and EPUB wraps HTML.

my_dataset/
└── collections/
    └── train/
        β”œβ”€β”€ meta/
        β”‚   └── data.jsonl     # Your data (one JSON per line)
        └── index.bin          # Optional: O(1) lookups

This is intentionally unoriginal. Novelty in file formats creates lock-in; we chose boring technologies that will outlast any single library.
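Because the payload is plain JSONL, the idea behind the optional index.bin can be demonstrated with nothing but the standard library: scan the file once to record each line's byte offset, then seek straight to any document. This is a sketch of the technique, not ZDS's actual index format:

```python
import io
import json

def build_offset_index(f):
    """One scan over a JSONL stream: position -> byte offset of that line."""
    offsets = []
    pos = f.tell()
    for line in f:
        offsets.append(pos)
        pos += len(line)
    return offsets

def get(f, offsets, i):
    """Random access: seek to the recorded offset and parse a single line."""
    f.seek(offsets[i])
    return json.loads(f.readline())

# Stand-in for collections/train/meta/data.jsonl
data = b'{"text": "Hello world", "label": 1}\n{"text": "Goodbye", "label": 0}\n'
f = io.BytesIO(data)
offsets = build_offset_index(f)
doc = get(f, offsets, 1)  # parses only the second line
```

If the index is ever lost or stale, the JSONL is still fully usable on its own, which is the whole point of keeping the index optional.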

Read the full paper β†’


Use Cases

Evaluation Pipelines

Run experiment β†’ Generate 10,000 results
β”œβ”€β”€ Each result has: metrics, predictions, metadata
β”œβ”€β”€ Some results have additional debug info
β”œβ”€β”€ Need to inspect failures manually
└── Want to version control changes
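For a workflow like the one above, failure triage is plain list filtering over JSONL records. A sketch assuming each result carries a `passed` flag and only some carry a `debug` field (both names are illustrative, not a ZDS convention):

```python
import json

# Stand-in for a results collection: heterogeneous records, one JSON per line.
results_jsonl = """\
{"id": "r1", "passed": true, "score": 0.91}
{"id": "r2", "passed": false, "score": 0.12, "debug": {"trace": "timeout"}}
{"id": "r3", "passed": false, "score": 0.40}
"""

results = [json.loads(line) for line in results_jsonl.splitlines()]

# Only some records have debug info, and nothing breaks because of it.
failures = [r for r in results if not r["passed"]]
with_debug = [r for r in failures if "debug" in r]
```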

Synthetic Data Generation

Generate training examples with LLM
β”œβ”€β”€ Each example has variable structure  
β”œβ”€β”€ Tool calls, function schemas, nested conversations
β”œβ”€β”€ Need to filter, edit, regenerate subsets
└── Feed directly into training pipeline

Dataset Distribution

# Pack for sharing
zippy pack ./my_dataset dataset.zds

# Recipients can inspect without any library
unzip dataset.zds -d extracted/
cat extracted/collections/train/meta/data.jsonl | head -5 | jq .
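The same inspection works programmatically: since the packed file is an ordinary ZIP, Python's stdlib `zipfile` can read documents without installing anything. The snippet below builds a small archive in memory as a stand-in for `zippy pack` output, mirroring the directory tree shown earlier:

```python
import io
import json
import zipfile

# Build a tiny .zds-style archive in memory (stand-in for a packed dataset).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    lines = [json.dumps({"text": "Hello", "label": 1}),
             json.dumps({"text": "Goodbye", "label": 0})]
    z.writestr("collections/train/meta/data.jsonl", "\n".join(lines) + "\n")

# A recipient needs only the standard library to read it back.
with zipfile.ZipFile(buf) as z:
    raw = z.read("collections/train/meta/data.jsonl").decode()
    docs = [json.loads(line) for line in raw.splitlines()]
```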

Performance

Benchmarked on Apple M3 Max with 100,000 records:

| Operation       | ZDS         | SQLite | Pandas CSV | HF Datasets |
|-----------------|-------------|--------|------------|-------------|
| Write           | 4.66M rec/s | 237k   | 205k       | 633k        |
| Read all (warm) | 510k rec/s  | 263k   | 8.18M*     | 40k         |
| Random access   | 308k rec/s  | 88k    | 227k       | 30k         |

*Pandas warm = in-memory DataFrame

See full benchmarks β†’


Get Started

πŸ“– Documentation

Complete API reference and guides for Python, Node.js, Rust, and CLI.

Read Docs

πŸš€ Getting Started

Install ZDS and build your first dataset in under 5 minutes.

Quick Start

πŸ“Š Examples

Real-world examples for ML training, data pipelines, and more.

View Examples