External Data

File Attachment Datatype

Note

External storage is not yet implemented in MATLAB. The feature will be added in an upcoming release: https://github.com/datajoint/datajoint-matlab/issues/143

Configuration & Usage

Corresponding to issue #480, the attach attribute type allows users to attach files to DataJoint schemas as DataJoint-managed files. This is in contrast to traditional blobs, which are encodings of programming language data structures such as arrays.

The functionality is modeled after email attachments, where users attach a file along with a message and message recipients have access to a copy of that file upon retrieval of the message.

For DataJoint attach attributes, DataJoint will copy the input file into a DataJoint store, hash the file contents, and track the input file name. Subsequent fetch operations will transfer a copy of the file to the local directory of the Python process and return a pointer to its location for subsequent client usage. This allows arbitrary files to be uploaded or attached to a DataJoint schema for later use in processing. File integrity is preserved by recording a checksum of the attachment data on insert and verifying the file contents against it during retrieval.

For example, given a localattach store:

dj.config['stores'] = {
  'localattach': {
    'protocol': 'file',
    'location': '/data/attach'
  }
}

A ScanAttachment table can be created:

@schema
class ScanAttachment(dj.Manual):
    definition = """
    -> Session
    ---
    scan_image:    attach@localattach  # attached image scans
    """

Files can be added using an insert pointing to the source file:

>>> ScanAttachment.insert1((0, '/input/image0.tif'))

And then retrieved to the current directory using fetch:

>>> s0 = (ScanAttachment & {'session_id': 0}).fetch1()
>>> s0
{'session_id': 0, 'scan_image': './image0.tif'}
>>> fh = open(s0['scan_image'], 'rb')
>>> fh
<_io.BufferedReader name='./image0.tif'>
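
Attachments can also be retrieved in bulk. A minimal sketch (assuming several sessions have been inserted); as described above, fetch transfers a local copy of each attachment and returns its path:

# Fetch the attachment for every row; each value is a path to a local copy.
paths = ScanAttachment.fetch('scan_image')
for path in paths:
    with open(path, 'rb') as fh:
        data = fh.read()  # process each attached scan as needed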

Filepath Datatype

Note

External storage is not yet implemented in MATLAB. The feature will be added in an upcoming release: https://github.com/datajoint/datajoint-matlab/issues/143

Note

Filepath Datatype is available as a preview feature in DataJoint Python v0.12. This means the feature must be explicitly enabled by setting the environment variable FILEPATH_FEATURE_SWITCH=TRUE prior to use.
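
For instance, the switch can be set from Python before DataJoint is used; a minimal sketch (setting it before importing DataJoint is a conservative choice, since the exact point at which the switch is read is not specified here):

# Enable the filepath preview feature (DataJoint Python v0.12 preview switch).
import os
os.environ['FILEPATH_FEATURE_SWITCH'] = 'TRUE'

import datajoint as dj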

Configuration & Usage

Corresponding to issue #481, the filepath attribute type links DataJoint records to files already managed outside of DataJoint. This can aid in sharing data with other systems, for example allowing an image viewer application to use files directly from a DataJoint pipeline, or allowing downstream tables to reference data that reside outside of DataJoint pipelines.

To define a table using the filepath datatype, a DataJoint store should first be configured and then referenced in the new table definition. For example, given a simple store:

dj.config['stores'] = {
  'data': {
    'protocol': 'file',
    'location': '/data',
    'stage': '/data'
  }
}

we can define a ScanImages table as follows:

@schema
class ScanImages(dj.Manual):
    definition = """
    -> Session
    image_id:    int
    ---
    image_path:  filepath@data
    """

This table can now be used for tracking paths within the /data local directory. For example:

>>> ScanImages.insert1((0, 0, '/data/images/image_0.tif'))
>>> (ScanImages() & {'session_id': 0}).fetch1()
{'session_id': 0, 'image_id': 0, 'image_path': '/data/images/image_0.tif'}

As the example shows, unlike blob records, filepath records store the path location of the underlying file rather than a copy of its contents.
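
Downstream code is therefore responsible for opening the file itself; a brief sketch using the record above:

# The fetched value is a path, not the file contents.
path = (ScanImages & {'session_id': 0, 'image_id': 0}).fetch1('image_path')
with open(path, 'rb') as f:
    image_bytes = f.read()  # read the externally managed file directly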

Integrity Notes

Unlike other data in DataJoint, files referenced by filepath records are deliberately intended for shared use outside of DataJoint. To help ensure integrity of filepath records, DataJoint will record a checksum of the file data on insert and verify this checksum on fetch. However, since the underlying file data may be shared with other applications, special care should be taken to ensure files stored in filepath attributes are not modified outside of the pipeline, or, if they are, that records in the pipeline are updated accordingly. A safe method of changing filepath data is as follows (a code sketch follows the list):

  1. Delete the filepath database record.

    This will ensure that any downstream records in the pipeline depending on the filepath record are purged from the database.

  2. Modify filepath data.

  3. Re-insert the corresponding filepath record.

    This will add the record back to DataJoint with an updated file checksum.

  4. Compute any downstream dependencies, if needed.

    This will ensure that downstream results dependent on the filepath record are updated to reflect the newer filepath contents.
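
The steps above might look as follows in code; a sketch using the ScanImages example (the downstream table and its populate call are hypothetical and depend on your pipeline):

key = {'session_id': 0, 'image_id': 0}

# 1. Delete the filepath record; dependent downstream records are purged.
(ScanImages & key).delete()

# 2. Modify the underlying file outside of DataJoint,
#    e.g. rewrite /data/images/image_0.tif in place.

# 3. Re-insert the record so DataJoint stores an updated checksum.
ScanImages.insert1({**key, 'image_path': '/data/images/image_0.tif'})

# 4. Recompute downstream dependencies, if any (hypothetical table name).
# Downstream.populate(key)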
