Avoid redundant index steps for shared refs#9983
pmoris wants to merge 3 commits into nf-core:master
Waiting on #9982 before finishing this.
faa24d8 to 7f4dd4b
Rebased the branch on top of the new topic version changes in master. @vagkaratzas: do you prefer the output of the index channel to remain unchanged or do you agree that its meta map should include the fasta filename as its
I think the fasta name makes more sense, since it will also mean that it's unique per fasta.
BUT!!! Without the initial meta.id it won't be possible to join the index with other channels downstream in the pipeline if a user wants to do that. So we also need to keep the old meta.id as another attribute inside the meta.
In that case, I propose to keep
06f87fb to 71431b3
This deacon subworkflow supports per-sample references, but the previous implementation ran the indexing step for every sample, regardless of whether samples shared the same reference. This update runs the index step once per unique reference rather than once per sample, so a reference shared among samples is indexed only once. The output of the previous implementation is preserved, but the index output could still be changed (by using fasta-based meta.id values rather than the sample metadata).
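The deduplication described above can be sketched outside Nextflow. The following is a minimal illustration in plain Python, not the subworkflow's actual code: `build_index`, the `.idx` suffix, and the sample data are hypothetical stand-ins, and each reference is assumed to be identified by its fasta path.

```python
from pathlib import Path


def build_index(fasta: str) -> str:
    """Hypothetical stand-in for the deacon/index step: returns an index name."""
    return Path(fasta).stem + ".idx"  # the ".idx" suffix is an assumption


def index_unique_references(samples):
    """Index each distinct reference once, however many samples share it.

    `samples` is a list of (meta, fasta) pairs. Returns (indexes, n_calls),
    where `indexes` maps each fasta path to its index name.
    """
    indexes = {}
    n_calls = 0
    for _meta, fasta in samples:
        if fasta not in indexes:  # first time this reference is seen
            indexes[fasta] = build_index(fasta)
            n_calls += 1
    return indexes, n_calls


samples = [
    ({"id": "sample1"}, "refs/human.fasta"),
    ({"id": "sample2"}, "refs/human.fasta"),
    ({"id": "sample3"}, "refs/mouse.fasta"),
]
indexes, n_calls = index_unique_references(samples)
print(n_calls)  # 2: three samples, but only two distinct references
```

Counting the calls this way mirrors what a test of the real subworkflow would check: passing the same fasta for several samples should trigger a single indexing run.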
This commit ensures that the sample-level metadata associated with the sample reads is retained in the index output channel. The baseName of the index is added to the meta map as an additional index_id key; it should equal the baseName of the reference fasta used, or the chosen `ext.prefix` for the deacon/index module. Note that the index channel has one element per input sample (i.e. it does not contain only the unique indexes).
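As a rough picture of the resulting channel shape, the sketch below pairs every sample with the index of its reference while keeping the sample meta intact. It is plain Python rather than Nextflow, and `attach_index`, the index file names, and the sample data are hypothetical; only the `index_id` key and the one-element-per-sample behaviour come from the description above.

```python
from pathlib import Path


def attach_index(samples, indexes):
    """Pair every sample with the index built for its reference.

    The sample-level meta (including `id`) is kept unchanged, and the base
    name of the index is added under the extra `index_id` key.
    """
    out = []
    for meta, fasta in samples:
        index = indexes[fasta]
        out.append(({**meta, "index_id": Path(index).stem}, index))
    return out


samples = [
    ({"id": "sample1"}, "refs/human.fasta"),
    ({"id": "sample2"}, "refs/human.fasta"),
    ({"id": "sample3"}, "refs/mouse.fasta"),
]
# One index per unique reference (hypothetical file names).
indexes = {"refs/human.fasta": "human.idx", "refs/mouse.fasta": "mouse.idx"}

index_channel = attach_index(samples, indexes)
print(len(index_channel))  # 3: one element per sample, not per unique index
```

Because the original `id` survives, the output can still be joined with other per-sample channels downstream, while `index_id` records which shared index each sample uses.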
71431b3 to a04fcc7
Linting checks work locally with the dev version of nf-core. All tests pass locally as well. Just rebased on the latest version of master to see if that fixes the failing GH actions.
Snapshots needed to be updated because the index output channel was updated with an additional index_id key in its meta map.
a04fcc7 to da400bc
I forgot to update the snapshots of
vagkaratzas left a comment
Awesome work! I think this will work, but can you add an nf-test that passes the same fasta file twice as input to the index step and checks that it runs only once?
  - "@Baksic-Ivan"
  - "@Omer0191"
contributors:
  = "@pmoris"
Suggested change:
-  = "@pmoris"
+  - "@pmoris"
This deacon subworkflow supports per-sample references, but the previous implementation ran the indexing step for every sample, regardless of whether samples shared the same reference.
This update runs the index step once per unique reference rather than once per sample, so a reference shared among samples is indexed only once.
The output of the previous implementation is preserved, but the index output could still be changed (by using fasta-based meta.id values rather than the sample metadata). See the TODO comment at the bottom.
TODO: I haven't updated the test snapshots yet. If we retain the original emit output for the index channel, no update is needed. If we change the channel to contain the fasta meta.id, the snapshots will have to be updated.
PR checklist
Closes #XXX
- `topic: versions` - See version_topics
- `label`
- `nf-core modules test <MODULE> --profile docker`
- `nf-core modules test <MODULE> --profile singularity`
- `nf-core modules test <MODULE> --profile conda`
- `nf-core subworkflows test <SUBWORKFLOW> --profile docker`
- `nf-core subworkflows test <SUBWORKFLOW> --profile singularity`
- `nf-core subworkflows test <SUBWORKFLOW> --profile conda`