Skip to content

Comments

Deacon index set operations#10027

Open
pmoris wants to merge 7 commits intonf-core:masterfrom
pmoris:deacon-index-diff
Open

Deacon index set operations#10027
pmoris wants to merge 7 commits intonf-core:masterfrom
pmoris:deacon-index-diff

Conversation

@pmoris
Copy link
Contributor

@pmoris pmoris commented Feb 13, 2026

Goal

Update deacon index to allow set operations

See https://github.com/bede/deacon?tab=readme-ov-file#set-operations

Some considerations:

  • More flexibile/complex deacon/index module vs separate deacon/index_diff, deacon/index_union, etc. vs separate deacon/index_set_operations module? (EDIT: I've added an example of the separate modules approach now, see later comments.)

    I had originally written a separate index_diff module for my own use in a custom pipeline. If the preferred approach is to split modules based on use case, I can split the current implementation out into multiple modules. On the other hand, if the preference of nf-core is to have every module provide a particular tool (rather than a specific subcommand), we can keep the current approach.

    Roughly, it would look like this for the diff command if we were to split it out:

        input:
      tuple val(meta_fasta), path(fasta)
      tuple val(meta_index), path(index)
    
      output:
      tuple val(meta_index), path("*.idx"), emit: index
      path "versions.yml",                  emit: versions
    
      when:
      task.ext.when == null || task.ext.when
    
      script:
      def args = task.ext.args ?: ''
      def prefix = task.ext.prefix ?: "${meta_index.id}"
      """
      deacon \\
          index \\
          diff \\
          --threads ${task.cpus} \\
          $args \\
          $index \\
          $fasta > ${prefix}.diff.idx
    
  • For build, the main input is supplied via fasta (now renamed to reference). For the others, the expected input is two or more idx files (for intersect/union) or an idx + idx/fa file (for diff). Either we re-use the main fasta input of the build command for supplying the first index to operate on, and supply the rest via the set_genomes, or we can leave reference empty for the set operations and provide all inputs for the set operation via set_genomes. For the latter approach, they must be supplied in exact order (e.g., for diff I'd argue it makes sense to provide them separately so that it is clear what is subtracted from what), and it makes it more difficult to provide an idx + fa file. Ergo I opted for option 1.

  • Construction of output command:
    Currently I use a ternary operator that constructs a different set of cli arguments depending on the selected subcommand (${subcommand == 'build' ? "${reference}" : "${reference} ${set_genomes}"} > ${prefix}.idx), but it could be simplified into

      deacon \\
          index \\
          build \\
          --threads ${task.cpus} \\
          $args \\
          $fasta 
          $set_genomes > ${prefix}.idx
    

    because set_genomes would be empty for the build subcommand anyway.

  • There are a few extra index subcommands, like dump (creates .fa based on idx) and info, but since these result in different output types than the others, I've omitted them for now.

  • I still need to add new tests for the set operation commands. For this we need idx files. Are there any in nf-test's dataset yet or can we create them on the fly in chained tests somehow?

PR checklist

Closes #XXX

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Broadcast software version numbers to topic: versions - See version_topics
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

See https://github.com/bede/deacon?tab=readme-ov-file#set-operations

- Adds two optional commands:
  - the first for choosing an index subcommand
(build, union, intersect, diff -> defaults to build)
  - the second is a tuple of meta, genome, where genome is a path to
   .idx or fasta file(s) to use for the selected set operation
   (should be omitted for build).
- Modifies the main reference input to accept both fasta files (for build)
  and .idx files (for set operations).
- Various validation checks ensure that the correct input options are
used based on the subcommand (e.g., diff accepts 1 fa or idx file,
union/intersect can accept multiple idx files, build excepts nothing).
- The selected subcommand is propagated in the output meta map.
Input and output names weren't updated yet after last changes
to main.nf, and wrong indentation was used for meta_set.
Only deacon index build has multi-threading options.
Now the --threads argument is omitted when selecting any other subcommand.
This is an alternative to combining all the set operation
options into a single deacon/index module.
@pmoris
Copy link
Contributor Author

pmoris commented Feb 16, 2026

I've added the alternative option of splitting the various deacon index subcommands into separate modules. No tests yet until we decide which option to use.

If we go with this option, we'd still need to decide how to construct the inputs.

In the end it comes down to how the upstream genomes are most likely to be constructed. If these are individual [meta, index] tuples in a channel, you can:

  1. naively combine them via index_ch.collect(), and then the input to these modules would have to be a val (list of tuples).
  2. map them into a single element with [meta, path(index_files): then the module only has a single meta.id to deal with.
  3. map them into a tuple consisting of two lists [ meta_list, path(index_files)]`, which retains the individual meta info for each genome requires the input channel to go through a mapping operation.

I'm not sure which assumptions these modules should make in this regard.


Additionally, for the union and intersect modules, you could argue that it makes more sense to split their inputs into a main index + a set of indexes, just like diff, but here order of operations does not matter so I combined everything into a single input.

@pmoris pmoris requested a review from vagkaratzas February 16, 2026 09:49
@pmoris pmoris changed the title Deacon index diff Deacon index set operations Feb 16, 2026
@pmoris
Copy link
Contributor Author

pmoris commented Feb 19, 2026

See relevant discussion on Slack: https://nfcore.slack.com/archives/CE6SDBX2A/p1771355056637259

Conclusion: based on the nf-core guidelines (https://nf-co.re/docs/guidelines/components/modules#module-granularity), we'll use separate modules for the different set operations. Names should be changed to deacon/indexbuildunion.

For the existing index build module, we can either keep it as is (backwards compatibility + sane default) or rename it to indexbuild and deprecate the old one (https://nf-co.re/docs/guidelines/components/module_deprecation).

For now, I'll proceed with adding tests to the set modules and leave the existing index module as is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant