Principles¶
Theoretical Foundations¶
DataJoint Core implements a systematic framework for the joint management of structured scientific data and its associated computations. The framework builds on the theoretical foundations of the Relational Model and the Entity-Relationship Model, introducing a number of critical clarifications for the effective use of databases as scientific data pipelines. Notably, DataJoint introduces the concept of computational dependencies as a first-class citizen of the data model. This integration of data structure and computation into a single model defines a new class of computational scientific databases.
This page defines the key principles of this model without attachment to a specific implementation. A more complete description of the model can be found in Yatsenko et al., 2018.
DataJoint developers are working to turn these principles into an open standard that allows multiple alternative implementations.
Data Representation¶
Tables = Entity Sets¶
DataJoint uses only one data structure in all its operations—the entity set.
- All data are represented in the form of entity sets, i.e. an ordered collection of entities.
- All entities of an entity set belong to the same well-defined entity class and have the same set of named attributes.
- Each attribute in an entity set has a data type (or domain), representing the set of its valid values.
- Each entity in an entity set provides the attribute values for all of the attributes of its entity class.
- Each entity set has a primary key, i.e. a subset of attributes that, jointly, uniquely identify any entity in the set.
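The properties listed above can be sketched in plain Python. This is an illustration only, not how DataJoint is implemented: the `EntitySet` class, its methods, and the `full_name` attribute are hypothetical names chosen for the example, and Python types stand in for attribute domains.

```python
class EntitySet:
    """Toy entity set: fixed attributes, typed values, and a primary key."""

    def __init__(self, attributes, primary_key):
        # attributes maps each attribute name to its data type (domain).
        self.attributes = dict(attributes)
        self.primary_key = tuple(primary_key)
        missing = [name for name in self.primary_key if name not in self.attributes]
        if missing:
            raise ValueError(f"primary key attributes not defined: {missing}")
        self._entities = {}  # primary key values -> entity

    def insert(self, entity):
        # Every entity must supply exactly the attributes of its entity class.
        if set(entity) != set(self.attributes):
            raise ValueError("entity must provide a value for every attribute")
        # Every value must belong to the domain of its attribute.
        for name, value in entity.items():
            if not isinstance(value, self.attributes[name]):
                raise TypeError(f"{name!r} is outside its domain")
        # The primary key must identify at most one entity in the set.
        key = tuple(entity[name] for name in self.primary_key)
        if key in self._entities:
            raise KeyError(f"duplicate primary key {key}")
        self._entities[key] = dict(entity)

    def __len__(self):
        return len(self._entities)


employees = EntitySet({"employee_id": int, "full_name": str}, primary_key=["employee_id"])
employees.insert({"employee_id": 1, "full_name": "Ada"})
employees.insert({"employee_id": 2, "full_name": "Grace"})
```

Inserting a second entity with `employee_id` 1, or an entity that omits `full_name`, raises an error in this sketch, which mirrors the uniqueness and completeness requirements in the list.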
These formal terms have more common (even if less precise) variants:
formal | common |
---|---|
entity set | table |
attribute | column |
attribute value | field |
A collection of stored tables makes up a database. Derived tables are formed through query expressions.
Table Definition¶
DataJoint introduces a streamlined syntax for defining a stored table.
Each line in the definition defines an attribute with its name, data type, an optional default value, and an optional comment in the format:
name [=default] : type [# comment]
Primary attributes come first and are separated from the rest of the attributes with the divider `---`.
For example, the following code defines the entity set for entities of class `Employee`:
employee_id : int
---
ssn = null : int # optional social security number
date_of_birth : date
gender : enum('male', 'female', 'other')
home_address="" : varchar(1000)
primary_phone="" : varchar(12)
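To make the line format concrete, here is a minimal sketch of how such a definition could be split into its parts. It is not DataJoint's own parser: `parse_definition` is a hypothetical helper, and it assumes that `#` appears only before a comment and that `:` and `=` do not occur inside default values.

```python
def parse_definition(definition):
    """Split a table definition into primary and secondary attribute lists."""
    primary, secondary = [], []
    target = primary  # attributes above the divider belong to the primary key
    for raw in definition.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("---"):
            target = secondary  # everything after the divider is secondary
            continue
        declaration, _, comment = line.partition("#")
        left, _, type_ = declaration.partition(":")
        name, has_default, default = left.partition("=")
        target.append({
            "name": name.strip(),
            "type": type_.strip(),
            "default": default.strip() if has_default else None,
            "comment": comment.strip(),
        })
    return primary, secondary


EMPLOYEE = """
employee_id : int
---
ssn = null : int # optional social security number
date_of_birth : date
gender : enum('male', 'female', 'other')
home_address="" : varchar(1000)
primary_phone="" : varchar(12)
"""

primary, secondary = parse_definition(EMPLOYEE)
```

Here `primary` holds the single attribute `employee_id`, and `secondary` holds the five remaining attributes with their defaults and comments kept as raw strings.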
Data Tiers¶
Stored tables are designated into one of four tiers indicating how their data originates.
table tier | data origin |
---|---|
lookup | contents are part of the table definition, defined a priori rather than entered externally. Typically stores general facts, parameters, options, etc. |
manual | contents are populated by external mechanisms such as manual entry through web apps or by data ingest scripts |
imported | contents are populated automatically by pipeline computations accessing data from upstream in the pipeline and from external data sources such as raw data stores. |
computed | contents are populated automatically by pipeline computations accessing data from upstream in the pipeline. |
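The difference between externally entered and automatically populated tiers can be illustrated with a small sketch. It uses plain dictionaries in place of tables; `populate`, `recording`, and `mean_signal` are hypothetical names, and the logic is only an assumption about what "populated automatically from upstream" could look like, not DataJoint's mechanism.

```python
def populate(upstream, computed, make):
    """Fill `computed` for every upstream key that has no result yet."""
    for key, entity in upstream.items():
        if key not in computed:
            computed[key] = make(entity)


def mean_of_samples(entity):
    # The computation attached to the computed table.
    samples = entity["samples"]
    return sum(samples) / len(samples)


# Stand-in for a manual table: contents are entered from outside the pipeline.
recording = {1: {"samples": [2.0, 4.0]}, 2: {"samples": [1.0, 2.0, 3.0]}}

# Stand-in for a computed table: contents come only from the computation.
mean_signal = {}
populate(recording, mean_signal, mean_of_samples)

# A new upstream entity appears; populating again computes only the missing result.
recording[3] = {"samples": [10.0]}
populate(recording, mean_signal, mean_of_samples)
```

An imported table would follow the same pattern, except that its computation would also read from an external source such as a raw data file.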
Object Serialization¶
Data Normalization¶
A collection of data is considered normalized when organized into a collection of entity sets, where each entity set represents a well-defined entity class with all its attributes applicable to each entity in the set and the same primary key identifying
The normalization procedure often includes splitting data from one table into several tables, one for each proper entity set.
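As an illustration of such a split, the sketch below starts from rows that mix two entity classes and separates them into two entity sets. The `project` attribute and the `Assignment` entity class are hypothetical, and dictionaries and sets keyed by primary key stand in for tables.

```python
# Denormalized rows: employee attributes are repeated for every project.
rows = [
    {"employee_id": 1, "date_of_birth": "1990-01-01", "project": "alpha"},
    {"employee_id": 1, "date_of_birth": "1990-01-01", "project": "beta"},
    {"employee_id": 2, "date_of_birth": "1985-06-15", "project": "alpha"},
]

# Entity set Employee, primary key: employee_id.
employee = {
    row["employee_id"]: {"date_of_birth": row["date_of_birth"]} for row in rows
}

# Entity set Assignment, primary key: (employee_id, project).
assignment = {(row["employee_id"], row["project"]) for row in rows}
```

After the split, each employee's date of birth is stored once, and every attribute in each set applies to every entity in that set.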
Databases and Schemas¶
Stored tables are named and grouped into namespaces called schemas.
A collection of schemas makes up a database.
A database has a globally unique address or name.
A schema has a unique name within its database.
Within a connection to a particular database, a stored table is identified as `schema.Table`.
A schema typically groups tables that are logically related.
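The naming scheme can be pictured as a two-level lookup. In this sketch the schema names `lab` and `ephys`, the table names, and the `resolve` helper are all hypothetical; nested dictionaries stand in for a database connection.

```python
# One database: schema name -> table name -> stored entities.
database = {
    "lab": {"Employee": [], "Project": []},
    "ephys": {"Recording": []},
}


def resolve(database, qualified_name):
    """Look up a stored table by its schema-qualified name."""
    schema_name, _, table_name = qualified_name.partition(".")
    return database[schema_name][table_name]


employee_table = resolve(database, "lab.Employee")
```

Because a schema name is unique within its database and a table name is unique within its schema, the qualified name identifies exactly one stored table.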
Dependencies¶
Entity sets can form referential dependencies that express and
Diagramming¶
Data integrity¶
Entity integrity¶
Entity integrity is the guarantee made by the data management process of the 1:1 mapping between real-world entities and their digital representations. In practice, entity integrity is ensured when it is made clear