Working on the codebase with AI? Start with AGENTS.md for global setup and package-specific guides.

Zippy Data System

A human-readable, schema-flexible document store built for modern ML and data engineering workflows. Store JSON documents with the simplicity of files and the speed of databases.

20x faster writes · 3.5x faster access · 100% human-readable · Zero lock-in

Why ZDS?

Modern ML and data workflows need flexibility that traditional formats struggle to provide. Parquet and Arrow enforce rigid schemas. SQLite requires SQL. Plain JSON has no indexing. ZDS bridges this gap.

πŸ“„

Human-Readable

Debug with cat. Edit with vim. Version control with git. Your data is always accessible with standard tools.

πŸ”€

Schema-Flexible

Each document defines its own shape. No migrations needed. Perfect for iterative development and heterogeneous data.

⚑

High Performance

Rust core with mmap, simd-json, and FxHashMap. O(1) random access. Writes at 4.6M records/second.

🌐

Multi-Language

Native bindings for Python, Node.js, and Rust. Query with DuckDB SQL. One format, every platform.

πŸ”“

Zero Lock-in

ZIP container + JSONL documents. If this library disappears, your data remains fully accessible forever.

πŸ€–

ML-Ready

HuggingFace Dataset API compatible. Streaming iteration. Shuffle buffers. Built for training pipelines.
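The "shuffle buffers" mentioned above are a standard streaming technique: instead of materializing the whole dataset, keep a fixed-size buffer and emit a random element as each new one arrives. A minimal sketch of the idea in pure Python (illustrative only; `shuffle_buffer` and its parameters are not the library's actual API):

```python
import random
from typing import Any, Iterable, Iterator

def shuffle_buffer(stream: Iterable[Any], buffer_size: int = 1000, seed: int = 42) -> Iterator[Any]:
    """Approximate shuffle over a stream using O(buffer_size) memory."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            # Swap a random element to the end, then emit it.
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            yield buf.pop()
    rng.shuffle(buf)  # drain the remainder in random order
    yield from buf

docs = ({"id": i} for i in range(10))
shuffled = list(shuffle_buffer(docs, buffer_size=4))
```

The trade-off is the usual one: a larger buffer gives a more uniform shuffle at the cost of memory, which is why training pipelines expose the buffer size as a knob.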


Quick Start

Python

pip install zippy-data
from zippy import ZDSStore, ZDSRoot, ZDataset

# Legacy helper: single-collection store (still supported)
store = ZDSStore.open("./my_dataset", collection="train")

# Add documents
store.put("doc_001", {"text": "Hello world", "label": 1})
store.put("doc_002", {"text": "Goodbye", "label": 0, "extra": [1, 2, 3]})

# Random access
print(store["doc_001"])  # {"text": "Hello world", "label": 1}

# Preferred: open a root once, then grab multiple collections
root = ZDSRoot.open("./my_dataset", native=True)
train = root.collection("train")
test = root.collection("test")

# Iterate like HuggingFace
dataset = ZDataset(train)
for doc in dataset.shuffle(seed=42):
    print(doc["text"])

print(root.list_collections())  # ['test', 'train']

Node.js

npm install @zippydata/core
const { ZdsStore } = require('@zippydata/core');

const store = ZdsStore.open('./my_dataset', 'train');

store.put('doc_001', { text: 'Hello world', label: 1 });
console.log(store.get('doc_001'));

for (const doc of store.scan()) {
    console.log(doc.text);
}

CLI

# Initialize a store
zippy init ./my_dataset -c train

# Add documents
zippy put ./my_dataset -c train doc_001 --data '{"text": "Hello"}'

# Query
zippy scan ./my_dataset -c train --fields text,label

How It Compares

| Feature            | ZDS | Parquet | SQLite | Plain JSON |
|--------------------|-----|---------|--------|------------|
| Human-readable     | βœ…  | ❌      | ❌     | βœ…         |
| Schema-flexible    | βœ…  | ❌      | ⚠️     | βœ…         |
| Fast random access | βœ…  | ❌      | βœ…     | ❌         |
| Indexed lookups    | βœ…  | ❌      | βœ…     | ❌         |
| Git-friendly       | βœ…  | ❌      | ❌     | βœ…         |
| No special tools   | βœ…  | ❌      | ⚠️     | βœ…         |
| ML dataset API     | βœ…  | ⚠️      | ❌     | ❌         |

The Philosophy

β€œThe best format is one you can understand in 5 minutes and debug with cat.”

ZDS follows a proven pattern: a ZIP container wrapping human-readable JSONL documents, enhanced with binary indexes for performance, just as DOCX wraps XML and EPUB wraps HTML.

my_dataset/
└── collections/
    └── train/
        β”œβ”€β”€ meta/
        β”‚   └── data.jsonl     # Your data (one JSON per line)
        └── index.bin          # Optional: O(1) lookups

This is intentionally unoriginal. Novelty in file formats creates lock-in; we chose boring technologies that will outlast any single library.
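Because the payload is plain JSONL, the idea behind the optional index.bin can be demonstrated with nothing but the standard library: scan the file once to record each line's byte offset, then seek straight to any document. This is a sketch of the technique, not ZDS's actual index format:

```python
import io
import json

def build_offset_index(f):
    """One scan over a JSONL stream: position -> byte offset of that line."""
    offsets = []
    pos = f.tell()
    for line in f:
        offsets.append(pos)
        pos += len(line)
    return offsets

def get(f, offsets, i):
    """Random access: seek to the recorded offset and parse a single line."""
    f.seek(offsets[i])
    return json.loads(f.readline())

# Stand-in for collections/train/meta/data.jsonl
data = b'{"text": "Hello world", "label": 1}\n{"text": "Goodbye", "label": 0}\n'
f = io.BytesIO(data)
offsets = build_offset_index(f)
doc = get(f, offsets, 1)  # parses only the second line
```

If the index is ever lost or stale, the JSONL is still fully usable on its own, which is the whole point of keeping the index optional.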

Read the full paper β†’


Use Cases

Evaluation Pipelines

Run experiment β†’ Generate 10,000 results
β”œβ”€β”€ Each result has: metrics, predictions, metadata
β”œβ”€β”€ Some results have additional debug info
β”œβ”€β”€ Need to inspect failures manually
└── Want to version control changes
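For a workflow like the one above, failure triage is plain list filtering over JSONL records. A sketch assuming each result carries a `passed` flag and only some carry a `debug` field (both names are illustrative, not a ZDS convention):

```python
import json

# Stand-in for a results collection: heterogeneous records, one JSON per line.
results_jsonl = """\
{"id": "r1", "passed": true, "score": 0.91}
{"id": "r2", "passed": false, "score": 0.12, "debug": {"trace": "timeout"}}
{"id": "r3", "passed": false, "score": 0.40}
"""

results = [json.loads(line) for line in results_jsonl.splitlines()]

# Only some records have debug info, and nothing breaks because of it.
failures = [r for r in results if not r["passed"]]
with_debug = [r for r in failures if "debug" in r]
```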

Synthetic Data Generation

Generate training examples with LLM
β”œβ”€β”€ Each example has variable structure  
β”œβ”€β”€ Tool calls, function schemas, nested conversations
β”œβ”€β”€ Need to filter, edit, regenerate subsets
└── Feed directly into training pipeline

Dataset Distribution

# Pack for sharing
zippy pack ./my_dataset dataset.zds

# Recipients can inspect without any library
unzip dataset.zds -d extracted/
cat extracted/collections/train/meta/data.jsonl | head -5 | jq .
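The same inspection works programmatically: since the packed file is an ordinary ZIP, Python's stdlib `zipfile` can read documents without installing anything. The snippet below builds a small archive in memory as a stand-in for `zippy pack` output, mirroring the directory tree shown earlier:

```python
import io
import json
import zipfile

# Build a tiny .zds-style archive in memory (stand-in for a packed dataset).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    lines = [json.dumps({"text": "Hello", "label": 1}),
             json.dumps({"text": "Goodbye", "label": 0})]
    z.writestr("collections/train/meta/data.jsonl", "\n".join(lines) + "\n")

# A recipient needs only the standard library to read it back.
with zipfile.ZipFile(buf) as z:
    raw = z.read("collections/train/meta/data.jsonl").decode()
    docs = [json.loads(line) for line in raw.splitlines()]
```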

Performance

Benchmarked on Apple M3 Max with 100,000 records:

| Operation       | ZDS         | SQLite | Pandas CSV | HF Datasets |
|-----------------|-------------|--------|------------|-------------|
| Write           | 4.66M rec/s | 237k   | 205k       | 633k        |
| Read all (warm) | 510k rec/s  | 263k   | 8.18M*     | 40k         |
| Random access   | 308k rec/s  | 88k    | 227k       | 30k         |

*Pandas warm = in-memory DataFrame

See full benchmarks β†’


Get Started

πŸ“– Documentation

Complete API reference and guides for Python, Node.js, Rust, and CLI.

Read Docs

πŸš€ Getting Started

Install ZDS and build your first dataset in under 5 minutes.

Quick Start

πŸ“Š Examples

Real-world examples for ML training, data pipelines, and more.

View Examples