Filepath Datatype¶
Note: Filepath Datatype is available as a preview feature in DataJoint Python v0.12.
This means that the feature is required to be explicitly enabled. To do so, make sure
to set the environment variable FILEPATH_FEATURE_SWITCH=TRUE
prior to use.
Configuration & Usage¶
Corresponding to issue
#481,
the filepath
attribute type links DataJoint records to files already
managed outside of DataJoint. This can aid in sharing data with
other systems such as allowing an image viewer application to
directly use files from a DataJoint pipeline, or to allow downstream
tables to reference data which reside outside of DataJoint
pipelines.
To define a table using the filepath
datatype, an existing DataJoint
store should be created and then referenced in the
new table definition. For example, given a simple store:
dj.config['stores'] = {
'data': {
'protocol': 'file',
'location': '/data',
'stage': '/data'
}
}
we can define an ScanImages
table as follows:
@schema
class ScanImages(dj.Manual):
definition = """
-> Session
image_id: int
---
image_path: filepath@data
"""
This table can now be used for tracking paths within the /data
local directory.
For example:
>>> ScanImages.insert1((0, 0, '/data/images/image_0.tif'))
>>> (ScanImages() & {'session_id': 0}).fetch1(as_dict=True)
{'session_id': 0, 'image_id': 0, 'image_path': '/data/images/image_0.tif'}
As can be seen from the example, unlike blob records, file paths are managed as path locations to the underlying file.
Integrity Notes¶
Unlike other data in DataJoint, data in filepath
records are
deliberately intended for shared use outside of DataJoint. To help
ensure integrity of filepath
records, DataJoint will record a
checksum of the file data on insert
, and will verify this checksum
on fetch
. However, since the underlying file data may be shared
with other applications, special care should be taken to ensure
records stored in filepath
attributes are not modified outside
of the pipeline, or, if they are, that records in the pipeline are
updated accordingly. A safe method of changing filepath
data is
as follows:
- Delete the
filepath
database record. This will ensure that any downstream records in the pipeline depending on thefilepath
record are purged from the database. - Modify
filepath
data. - Re-insert corresponding the
filepath
record. This will add the record back to DataJoint with an updated file checksum. - Compute any downstream dependencies, if needed.
This will ensure that downstream results dependent on the
filepath
record are updated to reflect the newerfilepath
contents.
Disable Fetch Verification¶
Note: Skipping the checksum is not recommended as it ensures file integrity i.e.
downloaded files are not corrupted. With S3 stores, most of the time to complete a
.fetch()
is from the file download itself as opposed to evaluating the checksum. This
option will primarily benefit filepath
usage connected to a local file
store.
To disable checksums you can set a threshold in bytes for when to stop evaluating checksums like in the example below:
dj.config["filepath_checksum_size_limit"] = 5 * 1024**3 # Skip for all files greater than 5GiB
The default is None
which means it will always verify checksums.