2. Write a Custom Manifest
In part 1, we validated an example fileset against a pre-defined manifest, to check if it was valid. In this section, we will write a custom manifest that anyone can use to check if their fileset is valid according to our rules.
You might want to do this if you are a researcher who wants to check the validity of a dataset before uploading it to a repository, or a repository maintainer who wants to check the validity of a dataset before accepting it. Or you might be a lab manager who wants to implement data-sharing best practices and create a set of lab-specific data types that all lab members can use to validate their data.
1.1. Prerequisites
It is recommended that you complete tutorial part 1 before continuing, so that you understand how to validate a fileset using the File Validator.
For this section, we'll recreate the my_fileset fileset that we created in the previous section.
We can now check the contents of the fileset, and save the path to the fileset for later:
my_fileset/my_subdirectory:
subject1.csv subject2.csv subject3.txt
my_fileset/my_subject:

export MY_FILESET_PATH="$(pwd)/my_fileset"
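For readers who prefer to build the fileset from Python rather than the shell, the layout described in this tutorial can be sketched as follows. This is a minimal sketch using only the standard library; the helper name make_fileset is hypothetical, and it assumes every file is empty, as the file sizes reported later in this section suggest.

```python
import tempfile
from pathlib import Path


def make_fileset(root: Path) -> Path:
    """Create the tutorial's my_fileset layout under `root` and return its path."""
    fileset = root / "my_fileset"
    subdir = fileset / "my_subdirectory"
    subdir.mkdir(parents=True)
    (fileset / "my_subject").mkdir()        # empty directory
    (fileset / "observations.txt").touch()  # empty top-level file
    for name in ("subject1.csv", "subject2.csv", "subject3.txt"):
        (subdir / name).touch()
    return fileset


if __name__ == "__main__":
    # Build the fileset in a throwaway directory and list every entry.
    with tempfile.TemporaryDirectory() as tmp:
        fileset = make_fileset(Path(tmp))
        for entry in sorted(fileset.rglob("*")):
            print(entry.relative_to(fileset).as_posix())
```

The listing includes the two directories as well as the four files, which matters later when rules count entries.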
1.2. Create a Manifest
The rest of this tutorial will focus on writing a manifest file for a custom fileset type.
We'll call this type my_type, and we'll create a manifest for it called my_type.yaml.
In your favorite text editor, create a new file called my_type.yaml and add the following contents:
[Listing: my_type.yaml, lines 1-8]
Let's break down this manifest file line-by-line:
- Line 1 defines the id of the manifest, which should be unique across all manifests (including ones contributed by other community members).
- Line 2 defines the version of the manifest, which should follow semantic versioning.
- Lines 3-5 define a human-readable description of the manifest. Since this file is written in YAML, we can use the > character to write a multi-line description. See online documentation about YAML for more details.
- Lines 6-8 describe the author of the manifest, including their name and email.
Notice that we have not yet defined any rules in our manifest; we'll get to this soon.
Note that a manifest file without any rules is still valid, and can therefore be used to validate a fileset.
You can test this yourself by validating my_fileset against my_type.yaml using the CLI command datajoint-file-validator validate $MY_FILESET_PATH my_type.yaml.
Because our manifest has no rules, it will accept any fileset as valid.
Assuming we are in the same directory as the my_type.yaml manifest, this manifest is now discoverable using find_manifest in the Python API or the manifest list CLI command:
┃ ID ┃ Version ┃ Description ┃ Path ┃
┡━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ my_type │ 0.1.0 │ An example fileset type for the DataJoint File │ my_type.yaml │
│ │ │ Validator tutorial. │ │
│ │ │ │ │
└─────────┴─────────┴───────────────────────────────────────────────────┴──────────────┘
1.3. Add a Rule to the Manifest
Now that we have a manifest, we can add rules to it.
Rules are defined using the rules key in the manifest.
Let's start by defining a simple rule: we want to ensure that there are at least 3 files (including subdirectories themselves) anywhere in the fileset.
Append a rules section so that your manifest looks like this:
[Listing: my_type.yaml, lines 1-15]
Like the manifest, this rule has an id (which should be unique within the manifest file), and a human-readable description.
Both of these fields are optional, but recommended, especially if you are writing a manifest that will be used by others.
On line 15, we define the count_min field, which contains the logic that checks whether the fileset satisfies our rule.
Formally, we call this field a constraint, and it is one of several types of constraints that we can use.
If we validate our fileset again, we'll see that it is still valid against our manifest because it has at least 3 files:
We can add another constraint to our rule.
This time, we'll use the count_max constraint to ensure that there are no more than 3 files in the fileset:
[Listing: my_type.yaml, lines 1-16]
Since our fileset has 6 entries (4 files plus the 2 subdirectories), it now fails validation because all constraints in a rule must be satisfied:
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Rule ID ┃ Rule Description ┃ Constraint ID ┃ Constraint Value ┃ Errors ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩
│ my_simple_rule │ A simple rule that │ count_max │ 3 │ constraint │
│ │ checks that there │ │ │ `count_max` failed: 6 │
│ │ are at least 3 files │ │ │ > 3 │
│ │ (including │ │ │ │
│ │ subdirectories │ │ │ │
│ │ themselves) anywhere │ │ │ │
│ │ in the fileset. │ │ │ │
│ │ │ │ │ │
└────────────────┴──────────────────────┴───────────────┴──────────────────┴───────────────────────┘
We can increase count_max to 6 to make our fileset valid again:
[Listing: my_type.yaml, lines 1-16]
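To see where the figure of 6 comes from, the count can be reproduced by hand. The sketch below lists the entries of my_fileset as plain strings and applies the two bounds. It is an illustration of the arithmetic only, not the validator's implementation, and the helper name within_bounds is hypothetical.

```python
# Every entry in my_fileset, relative to its root; directories count too.
ENTRIES = [
    "observations.txt",
    "my_subdirectory",
    "my_subdirectory/subject1.csv",
    "my_subdirectory/subject2.csv",
    "my_subdirectory/subject3.txt",
    "my_subject",
]


def within_bounds(entries, count_min, count_max):
    """Return True when count_min <= len(entries) <= count_max."""
    return count_min <= len(entries) <= count_max


print(len(ENTRIES))                  # 6
print(within_bounds(ENTRIES, 3, 3))  # False, because 6 > 3
print(within_bounds(ENTRIES, 3, 6))  # True
```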
1.4. Queries
Our manifest is already useful with a single rule whose two constraints check the minimum and maximum number of files. Let's add another, more complicated rule, so that our manifest looks like this:
[Listing: my_type.yaml, lines 1-23]
On line 21, we define a query field in the rule.
This field uses a glob pattern to filter the list of files, before they are validated against the constraints: count_min and count_max.
In this case, *.txt matches only files that are at the top level of the fileset (*), and end with the .txt extension.
After filtering, the only file that matches this query is observations.txt, which is the only file that is validated against the constraints.
Note
If query is not defined for a rule, it is automatically set to the default value of **, which matches all files in the fileset.
Suppose that we wanted to check the number of .csv files anywhere in the fileset, not just at the top level.
We can define another rule to check this, now using the **/*.csv query that matches all .csv files anywhere in the fileset.
Append to the manifest rules:
[Listing: my_type.yaml, lines 24-29]
Since the **/*.csv query matches two files, my_subdirectory/subject1.csv and my_subdirectory/subject2.csv (it will also match .csv files at the top level, if there were any), the count_min constraint is satisfied.
For more details on how to write glob-style queries, see online resources such as this article from VS Code and the wcmatch documentation (which is the library that File Validator uses).
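The difference between *.txt and **/*.csv can be checked with a small experiment. The sketch below is a deliberately simplified translation of globs into regular expressions: it handles only * and **, and it is not how wcmatch or File Validator implement matching. The function names glob_to_regex and select are hypothetical.

```python
import re

# Every entry in my_fileset, relative to its root.
ENTRIES = [
    "observations.txt",
    "my_subdirectory",
    "my_subdirectory/subject1.csv",
    "my_subdirectory/subject2.csv",
    "my_subdirectory/subject3.txt",
    "my_subject",
]


def glob_to_regex(pattern):
    """Translate a glob that uses only `*` and `**` into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")  # zero or more leading directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")        # anything, including "/"
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")     # anything except "/"
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def select(pattern, entries=ENTRIES):
    """Return the entries whose whole relative path matches the glob."""
    regex = glob_to_regex(pattern)
    return [entry for entry in entries if regex.fullmatch(entry)]


print(select("*.txt"))     # ['observations.txt']
print(select("**/*.csv"))  # the two .csv files in my_subdirectory
print(len(select("**")))   # 6, every entry in the fileset
```

Because a single * never crosses a directory separator in this sketch, *.txt stays at the top level, while the **/ prefix lets the rest of the pattern match at any depth.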
1.5. Complex Queries
So far, we've only used simple glob-style queries to filter the list of files in the fileset.
We can define more complex queries using the query field, which accepts path and type fields.
For example, we can define a query that matches the glob-style pattern **/*subject* but excludes directories, so that only files are matched:
[Listing: my_type.yaml, lines 30-37]
The path pattern **/*subject* matches my_subdirectory/subject1.csv, my_subdirectory/subject2.csv, my_subdirectory/subject3.txt, and the directory my_subject/.
With the type: file component, my_subject/ is excluded from the query, and only the three files are validated against the constraints.
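The effect of combining path and type can be sketched in a few lines. Since **/*subject* places no restriction on the leading directories, this illustration simply matches *subject* against the final path component. The function name query and the "directory" type label are assumptions made for the example; only the "file" label appears in the validator output shown in this tutorial.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# (relative path, entry type) for every entry in my_fileset.
ENTRIES = [
    ("observations.txt", "file"),
    ("my_subdirectory", "directory"),
    ("my_subdirectory/subject1.csv", "file"),
    ("my_subdirectory/subject2.csv", "file"),
    ("my_subdirectory/subject3.txt", "file"),
    ("my_subject", "directory"),
]


def query(entries, name_pattern, entry_type=None):
    """Keep entries whose final path component matches `name_pattern`,
    optionally restricted to a single entry type."""
    return [
        path
        for path, kind in entries
        if fnmatchcase(PurePosixPath(path).name, name_pattern)
        and (entry_type is None or kind == entry_type)
    ]


print(query(ENTRIES, "*subject*"))                     # 4 entries, including my_subject
print(query(ENTRIES, "*subject*", entry_type="file"))  # the 3 files only
```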
1.6. Regex Constraint
So far, we've only used the count_min and count_max constraints.
We can also use the regex constraint to check if file paths match a regular expression.
For example, we can create a new rule that checks that all files in the my_subdirectory directory end with the .csv or .txt extension:
[Listing: my_type.yaml, lines 38-45]
For details on how to write regular expressions, see online resources such as regexr.com.
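Before committing a regular expression to a manifest, it can help to try it against the paths you expect the query to return. The sketch below uses an assumed pattern that accepts paths ending in .csv or .txt and requires the whole path to match. Both the pattern and the whole-path matching mode are illustrative assumptions, not necessarily what File Validator does internally.

```python
import re

# The three files in my_subdirectory, relative to the fileset root.
FILES_IN_SUBDIRECTORY = [
    "my_subdirectory/subject1.csv",
    "my_subdirectory/subject2.csv",
    "my_subdirectory/subject3.txt",
]

# Assumed pattern: any non-empty prefix followed by .csv or .txt.
PATTERN = re.compile(r".+\.(csv|txt)")


def all_match(paths, pattern):
    """Return True when every path matches the pattern in full."""
    return all(pattern.fullmatch(path) is not None for path in paths)


print(all_match(FILES_IN_SUBDIRECTORY, PATTERN))  # True
print(all_match(FILES_IN_SUBDIRECTORY + ["my_subdirectory/notes.md"], PATTERN))  # False
```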
1.7. Eval Constraint
Although the built-in constraints give us a lot of flexibility in defining rules, sometimes we need to write custom logic to check if a file is valid.
The eval constraint accommodates these use cases by allowing us to write custom Python code to check if a fileset is valid.
The value of eval should be a definition of a Python function that:
- Is defined using the def syntax, as opposed to lambda syntax.
- Takes as its first argument a list of dictionaries, where each dictionary contains information about a file in the fileset.
- Returns a boolean value: True if the fileset is valid, and False otherwise.
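A function meeting these three requirements can be developed and tested on its own before it is pasted into a manifest. The sketch below is one hypothetical example: the function name and the bounds of 1 and 5 are assumptions chosen for illustration, and the sample dictionary uses a subset of the per-file fields that appear in the validator output in this section.

```python
import sys


def check_file_count(files):
    """Eval-style check: `files` is the list of per-file dictionaries
    selected by the rule's query. Returns True when the rule passes."""
    # Diagnostics go to STDERR so that STDOUT stays free for the report.
    print(f"{len(files)} file(s) matched", file=sys.stderr)
    # Illustrative bounds (assumed): between 1 and 5 matching files.
    return 1 <= len(files) <= 5


# A sample input with a subset of the fields each dictionary carries.
sample = [
    {
        "name": "observations.txt",
        "path": "observations.txt",
        "extension": ".txt",
        "size": 0,
        "type": "file",
    }
]
print(check_file_count(sample))  # True
print(check_file_count([]))      # False
```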
For example, we can define a rule that implements the same logic as our top_level_txt_files rule, but uses the eval constraint instead of built-in constraints:
[Listing: my_type.yaml, lines 46-59]
When we validate, we see the STDERR output from the print statement:
'atime_ns': 1704917387733281020,
'ctime_ns': 1704917387125281156,
'extension': '.txt',
'last_modified': '2024-01-10T13:09:47.125281+00:00',
'mtime_ns': 1704917387125281156,
'name': 'observations.txt',
'path': 'observations.txt',
'rel_path': 'observations.txt',
'size': 0,
'type': 'file'}
✔ Validation successful!
For details on the fields available in each file dictionary, see the dataclass attributes of the FileMetadata class.
Tip: Debugging with eval
The eval constraint can be useful for debugging new rules.
For example, we can use it to ensure that the query field works as expected:
[Listing: my_type.yaml, lines 60-67]
When we validate, we see all the files that match our query:
$ datajoint-file-validator validate $MY_FILESET_PATH my_type.yaml
Our query returned files:
[{'abs_path': '/path/to/my_fileset/my_subdirectory/subject2.csv',
'atime_ns': 1705600299673026466,
'ctime_ns': 1705600299673026466,
'extension': '.csv',
'last_modified': '2024-01-18T10:51:39.673027+00:00',
'mtime_ns': 1705600299673026466,
'name': 'subject2.csv',
'path': 'my_subdirectory/subject2.csv',
'rel_path': 'my_subdirectory/subject2.csv',
'size': 0,
'type': 'file'},
{'abs_path': '/path/to/my_fileset/my_subdirectory/subject1.csv',
'atime_ns': 1705600299673026466,
'ctime_ns': 1705600299673026466,
'extension': '.csv',
'last_modified': '2024-01-18T10:51:39.673027+00:00',
'mtime_ns': 1705600299673026466,
'name': 'subject1.csv',
'path': 'my_subdirectory/subject1.csv',
'rel_path': 'my_subdirectory/subject1.csv',
'size': 0,
'type': 'file'}]
✔ Validation successful!
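Output of this shape could be produced by a function along the following lines. This is a sketch consistent with the output above rather than the exact code of the tutorial's listing; the function name debug_query is hypothetical. It always returns True, so it reports what the query selected without ever failing validation.

```python
import sys
from pprint import pprint


def debug_query(files):
    """Print the per-file dictionaries selected by the query, then pass."""
    print("Our query returned files:", file=sys.stderr)
    pprint(files, stream=sys.stderr)  # pretty-printed, one field per line
    return True


if __name__ == "__main__":
    debug_query([{"name": "subject1.csv", "type": "file"}])
```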
1.7.1. Best Practices
With the eval constraint, manifest authors have flexibility to write almost any rule they can think of.
But with great power comes great responsibility, so we recommend that you adhere to the following best practices:
- Use a built-in constraint if possible. Built-in constraints validate faster, and emit more informative error messages when validation fails.
- Avoid running complex or computationally intensive logic in eval functions. Fileset validation should be quick and easy to run. Instead, move complex logic to a separate script and use datajoint-file-validator as a dependency.
- Ensure that the code you write in eval is safe to run. Avoid fetching data from the internet or installing software in the eval function.
- If the function prints anything, ensure that it writes to sys.stderr, not the default sys.stdout buffer. You can achieve this by passing file=sys.stderr to the print function. This ensures that users can redirect validation reports from STDOUT to a file without corrupting the YAML or JSON formatted report.
1.8. Conclusion
In this section, we learned how to write a custom manifest that can be used to validate a fileset. We encourage you to experiment with writing your own manifests, and consult the manifest registry for examples and inspiration. The complete manifest that we wrote in this section is shown below:
my_type.yaml
[Listing: my_type.yaml, lines 1-59]
Next Steps
Now that you've written a custom manifest, you can publish it to the manifest registry so that others can use or extend it. See part 3 of this tutorial for more details.