What's New in DataJoint 2.0

DataJoint 2.0 is a major release that establishes DataJoint as a mature framework for scientific data pipelines. The version jump from 0.14 to 2.0 reflects the significance of these changes.

📘 Upgrading from legacy DataJoint (pre-2.0)?

This page summarizes new features and concepts. For step-by-step migration instructions, see the Migration Guide.

Citation: Yatsenko D, Nguyen TT. DataJoint 2.0: A Computational Substrate for Agentic Scientific Workflows. arXiv:2602.16585. 2026. doi:10.48550/arXiv.2602.16585

Overview

DataJoint 2.0 introduces fundamental improvements to type handling, job coordination, and object storage while maintaining compatibility with your existing pipelines during migration. Key themes:

  • Explicit over implicit: All type conversions are now explicit through the codec system
  • Better distributed computing: Per-table job coordination with improved error handling
  • Object storage integration: Native support for large arrays and files
  • Multi-backend support: Portable types enable the PostgreSQL backend (added in 2.1)

Breaking Changes at a Glance

If you're upgrading from legacy DataJoint, these changes require code updates:

Area        | Legacy                             | 2.0
----------- | ---------------------------------- | --------------------------------------
Fetch API   | table.fetch()                      | table.to_dicts() or .to_arrays()
Update      | (table & key)._update('attr', val) | table.update1({**key, 'attr': val})
Join        | table1 @ table2                    | table1 * table2 (with semantic check)
Type syntax | longblob, int unsigned             | <blob>, int64
Jobs        | ~jobs table                        | Per-table ~~table_name

See the Migration Guide for complete upgrade steps.

Object-Augmented Schema (OAS)

DataJoint 2.0 unifies relational tables with object storage into a single coherent system. The relational database stores metadata and references while large objects (arrays, files, Zarr datasets) are stored in object storage, with full referential integrity maintained across both layers.

→ Type System Specification

Three storage sections:

Section        | Addressing              | Use Case
-------------- | ----------------------- | ------------------------------
Internal       | Row-based (in database) | Small objects (< 1 MB)
Hash-addressed | Content hash            | Arrays, files (deduplication)
Path-addressed | Primary key path        | Zarr, HDF5, streaming access

New syntax:

definition = """
recording_id : uuid
---
metadata : <blob>           # Internal storage
raw_data : <blob@store>     # Hash-addressed object storage
zarr_array : <object@store> # Path-addressed for Zarr/HDF5
"""

Explicit Type System

Breaking change: DataJoint 2.0 makes all type conversions explicit through a three-tier architecture.

→ Type System Specification · Codec API Specification

What Changed

Legacy DataJoint overloaded MySQL types with implicit conversions:

  • longblob could be blob serialization OR in-table attachment
  • attach was implicitly converted to longblob
  • uuid was used internally for external storage

DataJoint 2.0 makes everything explicit:

Legacy (Implicit) | 2.0 (Explicit)
----------------- | --------------
longblob          | <blob>
attach            | <attach>
blob@store        | <blob@store>
int unsigned      | int64

Three-Tier Architecture

  1. Native types: MySQL types (INT, VARCHAR, LONGBLOB)
  2. Core types: Portable aliases (int32, float64, varchar, uuid, json)
  3. Codecs: Serialization for Python objects (<blob>, <attach>, <object@>); all three tiers appear in the sketch below
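
A single definition can mix the tiers; a brief sketch (the attribute names are invented for illustration):

definition = """
scan_id : int32          # core type, portable across backends
---
notes : varchar(255)     # core type mapping to native VARCHAR
pixels : <blob>          # codec tier: serializes a Python object
"""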

Custom Codecs

Replace legacy AdaptedTypes with the new codec API:

import datajoint as dj
import networkx as nx

class GraphCodec(dj.Codec):
    name = "graph"  # referenced in table definitions as <graph>

    def encode(self, value, **kwargs):
        # Store the NetworkX graph as a plain edge list
        return list(value.edges)

    def decode(self, stored, **kwargs):
        # Rebuild the graph from the stored edge list on fetch
        return nx.Graph(stored)
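
Once defined, the codec is referenced by its name in angle brackets; a hedged usage sketch (the table and attribute names are hypothetical):

definition = """
network_id : int32
---
connectome : <graph>   # encoded by GraphCodec on insert, decoded on fetch
"""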

Jobs 2.0

Breaking change: Redesigned job coordination with per-table job management.

→ AutoPopulate Specification · Job Metadata Specification

What Changed

Legacy (Schema-level)             | 2.0 (Per-table)
--------------------------------- | ---------------------------------------------
One ~jobs table per schema        | One ~~table_name per Computed/Imported table
Opaque hashed keys                | Native primary keys (readable)
Statuses: reserved, error, ignore | Added: pending, success
No priority support               | Priority column (lower = more urgent)

New Features

  • Automatic refresh: Job queue synchronized with pending work automatically
  • Better coordination: Multiple workers coordinate via database without conflicts
  • Error tracking: Built-in error table (Table.jobs.errors) with full stack traces
  • Priority support: Control computation order with priority values
# Distributed mode with coordination
Analysis.populate(reserve_jobs=True, processes=4)

# Monitor progress
Analysis.jobs.progress()  # {'pending': 10, 'reserved': 2, 'error': 0}

# Handle errors
Analysis.jobs.errors.to_dicts()

# Set priorities
Analysis.jobs.update({'session_id': 123}, priority=1)  # High priority

Semantic Matching

Breaking change: Query operations now use lineage-based matching by default.

→ Semantic Matching Specification

What Changed

Legacy DataJoint used SQL-style natural joins: attributes matched if they had the same name, regardless of meaning.

DataJoint 2.0 validates semantic lineage: attributes must share a common origin through foreign key chains, not just coincidentally matching names.

# 2.0: Semantic join (default) - validates lineage
result = TableA * TableB  # Only matches attributes with shared origin

# Legacy behavior (if needed)
result = TableA.join(TableB, semantic_check=False)

Why this matters: Prevents accidental matches between attributes like session_id that happen to share a name but refer to different entities in different parts of your schema.

During migration: If semantic matching fails, it often indicates a malformed join that should be reviewed rather than forced.
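
When two attributes merely share a name, the usual fix is to rename one side with a projection so the join is intentional; a sketch with hypothetical tables:

# Rename the unrelated attribute so it no longer collides on name
result = TableA * TableB.proj(b_session='session_id')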

Configuration System

A cleaner configuration approach with separation of concerns.

→ Configuration Reference

  • datajoint.json: Non-sensitive settings (commit to version control)
  • .secrets/: Credentials (never commit)
  • Environment variables: For CI/CD and production
export DJ_HOST=db.example.com
export DJ_USER=myuser
export DJ_PASS=mypassword
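
For illustration, a hypothetical datajoint.json is shown below; the key names are assumptions modeled on the legacy dj_local_conf.json keys, so consult the Configuration Reference for the authoritative schema:

{
    "database.host": "db.example.com",
    "database.user": "myuser"
}

The password itself belongs in .secrets/ or the DJ_PASS environment variable, never in datajoint.json.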

ObjectRef API (New)

New feature: Path-addressed storage returns ObjectRef handles that support streaming access without downloading entire datasets.

ref = (Dataset & key).fetch1('zarr_array')

# Direct fsspec access for Zarr/xarray
z = zarr.open(ref.fsmap, mode='r')

# Or download locally
local_path = ref.download('/tmp/data')

# Stream chunks without full download
with ref.open('rb') as f:
    chunk = f.read(1024)

This enables efficient access to large datasets stored in Zarr, HDF5, or custom formats.
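
The same fsmap handle plugs into the wider fsspec ecosystem; for example, a hedged sketch with xarray (assuming the stored object is a Zarr dataset and xarray is installed):

import xarray as xr

# Lazily open the dataset; chunks are read only when accessed
ds = xr.open_zarr(ref.fsmap)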

Deprecated and Removed

Removed APIs

  • .fetch() method: Replaced with .to_dicts(), .to_arrays(), or .to_pandas()
  • ._update() method: Replaced with .update1()
  • @ operator (natural join): Use * with semantic matching or .join(semantic_check=False)
  • dj.U() * table pattern: Use just table (universal set is implicit); this and the @ rewrite are sketched below
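
A brief before/after for the two operator removals (the table names are hypothetical):

# Legacy
# result = dj.U() * Session
# joined = TableA @ TableB

# 2.0
result = Session                                    # universal set is implicit
joined = TableA.join(TableB, semantic_check=False)  # only if name-based matching is truly intended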

Deprecated Features

  • AdaptedTypes: Replaced by codec system (still works but migration recommended)
  • Native type syntax: int unsigned → int64 (warnings on new tables)
  • Legacy external storage (blob@store): Replaced by <blob@store> codec syntax

Legacy Support

During migration (Phases 1-3), both legacy and 2.0 APIs can coexist:

  • Legacy clients can still access data
  • 2.0 clients understand legacy column types
  • Dual attributes enable cross-testing

After finalization (Phase 4+), only 2.0 clients are supported.

License Change

DataJoint 2.0 is licensed under the Apache License 2.0 (previously LGPL-2.1). This change brings:

  • More permissive terms for commercial and academic use
  • Clearer patent grant provisions
  • Better compatibility with the broader open-source ecosystem

Migration Path

→ Complete Migration Guide

Upgrading from DataJoint 0.x is a phased process designed to minimize risk:

Phase 1: Code Updates (Reversible)

  • Update Python code to 2.0 API patterns (.fetch() → .to_dicts(), etc.; see the sketch after this list)
  • Update configuration files (dj_local_conf.json → datajoint.json + .secrets/)
  • No database changes; legacy clients still work
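
A typical Phase 1 change, following the fetch and update mappings above (the Session table is hypothetical):

# Legacy
# rows = Session.fetch()
# (Session & key)._update('status', 'done')

# 2.0
rows = Session.to_dicts()
Session.update1({**key, 'status': 'done'})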

Phase 2: Type Migration (Reversible)

  • Update database column comments to use core types (:int64:, :<blob>:)
  • Rebuild ~lineage tables for semantic matching
  • Update Python table definitions (see the sketch after this list)
  • Legacy clients still work; only metadata changed
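
For instance, a definition updated in Phase 2 swaps native spellings for core types (the attribute names are hypothetical):

definition = """
scan_id : int64    # was: int unsigned
---
frames : <blob>    # was: longblob
"""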

Phase 3: External Storage Dual Attributes (Reversible)

  • Create *_v2 attributes alongside legacy external storage columns
  • Both APIs can access data during transition
  • Enables cross-testing between legacy and 2.0
  • Legacy clients still work

Phase 4: Finalize (Point of No Return)

  • Remove legacy external storage columns
  • Drop old ~jobs and ~external_* tables
  • Legacy clients stop working; database backup required

Phase 5: Adopt New Features (Optional)

  • Use new codecs (<object@>, <npy@>)
  • Leverage Jobs 2.0 features (priority, better errors)
  • Implement custom codecs for domain-specific types

Migration Support

The migration guide includes:

  • AI agent prompts for automated migration steps
  • Validation commands to check migration status
  • Rollback procedures for each phase
  • Dry-run modes for all database changes

Most users complete Phases 1-2 in a single session. Phases 3-4 only apply if you use legacy external storage.

See Also