Conversation
See https://github.com/bede/deacon?tab=readme-ov-file#set-operations - Adds two optional commands: - the first for choosing an index subcommand (build, union, intersect, diff -> defaults to build) - the second is a tuple of meta, genome, where genome is a path to .idx or fasta file(s) to use for the selected set operation (should be omitted for build). - Modifies the main reference input to accept both fasta files (for build) and .idx files (for set operations). - Various validation checks ensure that the correct input options are used based on the subcommand (e.g., diff accepts 1 fa or idx file, union/intersect can accept multiple idx files, build excepts nothing). - The selected subcommand is propagated in the output meta map.
Input and output names weren't updated yet after last changes to main.nf, and wrong indentation was used for meta_set.
Only deacon index build has multi-threading options. Now the --threads argument is omitted when selecting any other subcommand.
This is an alternative to combining all the set operation options into a single deacon/index module.
3e9b4e7 to
0015065
Compare
|
I've added the alternative option of splitting the various If we go with this option, we'd still need to decide how to construct the inputs. In the end it comes down to how the upstream genomes are most likely to be constructed. If these are individual
I'm not sure which assumptions these modules should make in this regard. Additionally, for the union and intersect modules, you could argue that it makes more sense to split their inputs into a main index + a set of indexes, just like diff, but here order of operations does not matter so I combined everything into a single input. |
|
See relevant discussion on Slack: https://nfcore.slack.com/archives/CE6SDBX2A/p1771355056637259 Conclusion: based on the nf-core guidelines (https://nf-co.re/docs/guidelines/components/modules#module-granularity), we'll use separate modules for the different set operations. Names should be changed to For the existing index build module, we can either keep it as is (backwards compatibility + sane default) or rename it to For now, I'll proceed with adding tests to the set modules and leave the existing index module as is. |
Goal
Update deacon index to allow set operations
See https://github.com/bede/deacon?tab=readme-ov-file#set-operations
Some considerations:
More flexibile/complex deacon/index module vs separate deacon/index_diff, deacon/index_union, etc. vs separate deacon/index_set_operations module? (EDIT: I've added an example of the separate modules approach now, see later comments.)
I had originally written a separate index_diff module for my own use in a custom pipeline. If the preferred approach is to split modules based on use case, I can split the current implementation out into multiple modules. On the other hand, if the preference of nf-core is to have every module provide a particular tool (rather than a specific subcommand), we can keep the current approach.
Roughly, it would look like this for the diff command if we were to split it out:
For build, the main input is supplied via
fasta(now renamed toreference). For the others, the expected input is two or more idx files (forintersect/union) or an idx + idx/fa file (fordiff). Either we re-use the main fasta input of the build command for supplying the first index to operate on, and supply the rest via theset_genomes, or we can leavereferenceempty for the set operations and provide all inputs for the set operation viaset_genomes. For the latter approach, they must be supplied in exact order (e.g., fordiffI'd argue it makes sense to provide them separately so that it is clear what is subtracted from what), and it makes it more difficult to provide an idx + fa file. Ergo I opted for option 1.Construction of output command:
Currently I use a ternary operator that constructs a different set of cli arguments depending on the selected subcommand (
${subcommand == 'build' ? "${reference}" : "${reference} ${set_genomes}"} > ${prefix}.idx), but it could be simplified intobecause set_genomes would be empty for the build subcommand anyway.
There are a few extra index subcommands, like dump (creates .fa based on idx) and info, but since these result in different output types than the others, I've omitted them for now.
I still need to add new tests for the set operation commands. For this we need idx files. Are there any in nf-test's dataset yet or can we create them on the fly in chained tests somehow?
PR checklist
Closes #XXX
topic: versions- See version_topicslabelnf-core modules test <MODULE> --profile dockernf-core modules test <MODULE> --profile singularitynf-core modules test <MODULE> --profile condanf-core subworkflows test <SUBWORKFLOW> --profile dockernf-core subworkflows test <SUBWORKFLOW> --profile singularitynf-core subworkflows test <SUBWORKFLOW> --profile conda