Fix: include sample-selection parameters in PLINK output filename #1121
Open
Mowblow wants to merge 1 commit into malariagen:master from
…prevent collisions
Fix: include sample-selection parameters in PLINK output filename
Fixes #1113
Problem
The filename prefix for `biallelic_snps_to_plink` was built from only a subset of the input parameters. Parameters such as `sample_sets`, `sample_query`, `sample_query_options`, `sample_indices`, `site_mask`, and `random_seed` were excluded, so two calls with different sample cohorts could silently resolve to the same filename and overwrite each other.
Approach
I considered three options:
1. Append raw parameter values to the filename. Rejected because `sample_sets` can be a list and `sample_query` can be a long arbitrary string, making filenames unpredictable and potentially invalid on some operating systems.
2. Include only the most commonly varied parameters. Rejected because it still leaves some parameters out, so the same collision problem could reappear.
3. Hash all sample-selection parameters together. Chosen because it keeps filenames short and OS-safe, makes collisions between different input combinations very unlikely, and handles `None` values and lists cleanly through JSON serialization.
Solution
Bundle all sample-selection parameters into a dictionary, serialize it with `json.dumps` (using `sort_keys=True` so the output does not depend on key order and `default=str` so values that are not natively JSON-serializable are still handled), and append an 8-character MD5 hash to the filename. The resulting filename is short, stable across repeated calls, and distinct per parameter combination barring a hash collision, for example: `2L.1000.2.8.0.a3f7bc12`.
Testing
I verified that:
- Different `sample_sets` produce different filename prefixes
- `None` values are handled without errors
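The approach and the checks above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the helper names `selection_hash` and `plink_prefix`, the five leading fields (their names and order are guessed from the example filename), the choice of the first 8 hex characters of the digest, and the sample set labels are all assumptions made for this example.

```python
import hashlib
import json


def selection_hash(**sample_params):
    """Return an 8-character hex digest of the sample-selection parameters.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    # sort_keys=True makes the serialization independent of keyword order;
    # default=str lets values that json cannot encode natively fall back to str().
    payload = json.dumps(sample_params, sort_keys=True, default=str)
    # Assumption: the 8-character hash is the first 8 hex characters of the digest.
    return hashlib.md5(payload.encode("utf-8")).hexdigest()[:8]


def plink_prefix(region, n_snps, min_minor_ac, max_missing_an, thin_offset,
                 **sample_params):
    """Build a filename prefix: the existing fields plus the selection hash."""
    # Assumption: field names and order are inferred from the example filename.
    base = f"{region}.{n_snps}.{min_minor_ac}.{max_missing_an}.{thin_offset}"
    return f"{base}.{selection_hash(**sample_params)}"


# Two calls that differ only in sample_sets (labels are illustrative).
prefix_a = plink_prefix("2L", 1000, 2, 8, 0,
                        sample_sets=["set-A"], sample_query=None)
prefix_b = plink_prefix("2L", 1000, 2, 8, 0,
                        sample_sets=["set-B"], sample_query=None)

# Repeating the first call, with the keywords in a different order.
prefix_a_again = plink_prefix("2L", 1000, 2, 8, 0,
                              sample_query=None, sample_sets=["set-A"])

print(prefix_a)
print(prefix_b)
print(prefix_a == prefix_a_again)
```

Because only a truncated digest is appended, distinct inputs are very unlikely, rather than guaranteed, to map to distinct prefixes; a longer slice of the digest would reduce that risk further at the cost of longer filenames.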