Auto-populate
Auto-populated tables are used to define, execute, and coordinate computations in a DataJoint pipeline.
Tables in the initial portions of the pipeline are populated from outside the pipeline. In subsequent steps, computations are performed automatically by the DataJoint pipeline in auto-populated tables.
Auto-populated tables belong to one of two data tiers: dj.Imported and dj.Computed.
DataJoint does not enforce the distinction between imported and computed tables: the difference is purely semantic, a convention for developers to follow.
If populating a table requires access to external files such as raw storage that is not part of the database, the table is designated as imported.
Otherwise it is computed.
Auto-populated tables are defined and queried exactly as other tables. (See Manual Tables.) Their data definition follows the same definition syntax.
Make
For auto-populated tables, data should never be entered using insert directly.
Instead, these tables must define the callback method make(self, key). The insert method can then only be called on self inside this callback method.
Imagine that there is a table test.Image that contains 2D grayscale images in its image attribute.
Let us define the computed table test.FilteredImage, which filters the image in some way and saves the result in its filtered_image attribute.
The class will be defined as follows.
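%{
# Filtered image
-> test.Image
---
filtered_image : longblob
%}
classdef FilteredImage < dj.Computed
    methods(Access=protected)
        function make(self, key)
            % fetch the source image from the upstream table
            img = fetch1(test.Image & key, 'image');
            % compute the new attribute and add it to the key
            key.filtered_image = myfilter(img);
            % insert the completed entity into this table
            self.insert(key)
        end
    end
end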
Note
Currently, MATLAB uses makeTuples rather than make. This will be fixed in an upcoming release: https://github.com/datajoint/datajoint-matlab/issues/141
The make method receives one argument: the key, of type struct in MATLAB and dict in Python, containing the primary key value of the element of the key source to be worked on.
The key represents the partially filled entity, usually already containing the primary key attributes of the key source.
The make callback does three things:
1. Fetches data from tables upstream in the pipeline using the key for restriction.
2. Computes and adds any missing attributes to the fields already in key.
3. Inserts the entire entity into self.
The make method may populate multiple entities in one call when key does not specify the entire primary key of the populated table.
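The three steps of make can be sketched in plain Python, with dictionaries standing in for tables. This is a toy model and not the DataJoint API: image_table, filtered_table, and my_filter are hypothetical stand-ins for test.Image, test.FilteredImage, and myfilter.

```python
# Toy model of a make callback; plain dicts stand in for DataJoint tables.
# image_table, filtered_table and my_filter are hypothetical stand-ins.

image_table = {
    (1,): {"image_id": 1, "image": [[0, 10], [20, 30]]},
    (2,): {"image_id": 2, "image": [[5, 5], [5, 5]]},
}
filtered_table = {}


def my_filter(img):
    """Placeholder filter: halve every pixel value."""
    return [[pixel / 2 for pixel in row] for row in img]


def make(key):
    # 1. Fetch upstream data, using the key as the restriction.
    img = image_table[(key["image_id"],)]["image"]
    # 2. Compute the missing attribute and add it to a copy of the key.
    entity = dict(key, filtered_image=my_filter(img))
    # 3. Insert the complete entity into the table being populated.
    filtered_table[(entity["image_id"],)] = entity


make({"image_id": 1})
```

After the call, filtered_table holds one entity for image 1, while image 2 remains unprocessed until make is called with its key.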
Populate
The inherited populate method of dj.Imported and dj.Computed automatically calls make for every key for which the auto-populated table is missing data.
The FilteredImage table can be populated as
populate(test.FilteredImage)
Note that it is not necessary to specify which data needs to be computed.
DataJoint will call make, one by one, for every key in Image for which FilteredImage has not yet been computed.
Chains of auto-populated tables form computational pipelines in DataJoint.
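The selection rule described above (call make only for keys that are still missing) can be modeled in a few lines of plain Python. This is a simplified sketch under stated assumptions, not DataJoint's implementation: the populate function, image_ids, and filtered below are hypothetical stand-ins for the real method and tables.

```python
# Toy model of populate: call make once for every key that is in the
# key source but not yet in the populated table. Not the DataJoint API.

def populate(key_source, target, make):
    """Call make(key) for each key of key_source missing from target."""
    called = []
    for key in key_source:
        if key not in target:
            make(key)
            called.append(key)
    return called


image_ids = [1, 2, 3]            # keys of the upstream table
filtered = {1: "already done"}   # key 1 was computed earlier


def make(key):
    filtered[key] = f"filtered image {key}"


first_run = populate(image_ids, filtered, make)   # processes keys 2 and 3
second_run = populate(image_ids, filtered, make)  # nothing left to do
```

Calling populate a second time does no work, which is why it is safe to rerun it whenever new upstream data arrive.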
Populate options
The behavior of the populate method depends on the number of output arguments requested in the function call.
When no output arguments are requested, errors will halt population.
With two output arguments (failedKeys and errors), populate will catch any encountered errors and return them along with the offending keys.
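The two error-handling modes can be illustrated with a small plain-Python model. MATLAB selects the mode from the number of requested outputs; Python cannot do that, so this sketch uses an explicit suppress_errors flag, which is an assumption made for the illustration. The populate function below is a toy, not the DataJoint method.

```python
# Toy model of the two error-handling modes of populate (not the DataJoint API).
# suppress_errors is a flag introduced for this sketch.

def populate(keys, make, suppress_errors=False):
    """Call make(key) for each key; optionally collect errors instead of halting."""
    failed_keys, errors = [], []
    for key in keys:
        try:
            make(key)
        except Exception as exc:
            if not suppress_errors:
                raise  # default mode: the first error halts population
            failed_keys.append(key)
            errors.append(exc)
    return failed_keys, errors


results = {}


def make(key):
    if key == 2:
        raise ValueError("cannot process key 2")
    results[key] = key * 10


# Collecting mode: population continues past the failing key.
failed_keys, errors = populate([1, 2, 3], make, suppress_errors=True)
```

In collecting mode the failing key is reported alongside its error while the remaining keys are still processed; with the flag left at its default, the same call would stop at key 2.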
Progress
The function parpopulate works identically to populate except that it uses a job reservation mechanism to allow multiple processes to populate the same table in parallel without collision.
When running parpopulate for the first time, DataJoint will create a job reservation table and its class <package>.Jobs with the following declaration:
%{
# the job reservation table
table_name : varchar(255) # className of the table
key_hash : char(32) # key hash
---
status : enum('reserved','error','ignore') # if tuple is missing, the job is available
key=null : blob # structure containing the key
error_message="" : varchar(1023) # error message returned if failed
error_stack=null : blob # error stack if failed
host="" : varchar(255) # system hostname
pid=0 : int unsigned # system process id
timestamp=CURRENT_TIMESTAMP : timestamp # automatic timestamp
%}
A job is considered to be available when <package>.Jobs contains no matching entry.
Before each make call, parpopulate reserves the job by setting its status to reserved.
When the job is completed, the record is removed.
If the job results in an error, the job record is left in place with the status set to error and the error message and error stack saved.
Consequently, jobs that ended in error during the last execution will not be attempted again until you delete the corresponding entities from <package>.Jobs.
The primary key of the jobs table comprises the name of the class and a 32-character hash of the job’s primary key. The key itself is also saved in a separate field for error debugging purposes.
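The reservation life cycle can be modeled in plain Python as follows. This is a single-process sketch under stated assumptions, not DataJoint's implementation: the jobs dictionary stands in for <package>.Jobs, run_job is a hypothetical helper, and hashing the key as MD5 of its JSON form is an assumption chosen only because it yields 32 characters. A real reservation must also be atomic across processes, which a shared database table provides and this sketch does not.

```python
import hashlib
import json
import traceback

jobs = {}  # (table_name, key_hash) -> job record; stands in for the jobs table


def key_hash(key):
    """32-character hex digest of the key (the hashing scheme is an assumption)."""
    return hashlib.md5(json.dumps(key, sort_keys=True).encode()).hexdigest()


def run_job(table_name, key, make):
    """Reserve the job, run make, update the record. Returns True if make ran."""
    job_id = (table_name, key_hash(key))
    if job_id in jobs:
        return False  # reserved elsewhere, failed earlier, or ignored: skip
    jobs[job_id] = {"status": "reserved", "key": key}
    try:
        make(key)
    except Exception as exc:
        # Failure: keep the record so the job is not retried automatically.
        jobs[job_id].update(status="error", error_message=str(exc),
                            error_stack=traceback.format_exc())
    else:
        del jobs[job_id]  # success: remove the reservation
    return True


def make(key):
    if key["image_id"] == 2:
        raise RuntimeError("bad image")


ran_ok = run_job("test.FilteredImage", {"image_id": 1}, make)
ran_bad = run_job("test.FilteredImage", {"image_id": 2}, make)
retried = run_job("test.FilteredImage", {"image_id": 2}, make)  # skipped
```

The failed job stays in jobs with status error, so the third call is skipped; deleting that record makes the job available again.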