Prevent PLINK filename collisions by hashing content-affecting selection parameters #1114

Open
rehanxt5 wants to merge 1 commit into malariagen:master from rehanxt5:gh1113-plink-filename-hash

rehanxt5 commented Mar 14, 2026

Issue Summary

biallelic_snps_to_plink builds its filename prefix from only a subset of its arguments (the region and the SNP filtering parameters). Other arguments that also change the exported content are left out of the path. As a result, two semantically different export requests can resolve to the same .bed/.bim/.fam prefix, so one export may overwrite another or stale files may be reused for the wrong request.

Proposed solution

Include a deterministic hash in the PLINK filename prefix derived from the parameters that affect output content but are not suitable to place directly in filenames.

  1. Build a dictionary from the excluded content-affecting parameters (sample_sets, sample_query, etc.).
  2. Hash that dictionary deterministically using the existing _hash_params() utility.
  3. Append the first 8 hex characters of the hash to the filename prefix.

This keeps filenames short while preventing collisions from distinct sample-selection inputs.
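The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: short_param_hash is a hypothetical stand-in for the repository's _hash_params() utility (whose exact signature is not shown here), and the choice of JSON serialisation with SHA-256 is an assumption made only so the example is self-contained.

```python
import hashlib
import json


def short_param_hash(params, length=8):
    """Return the first `length` hex characters of a deterministic hash.

    Hypothetical stand-in for _hash_params(); the real helper may serialise
    and hash differently.
    """
    # sort_keys makes the serialisation independent of dict insertion order;
    # default=str is a fallback for values that JSON cannot encode natively.
    serialised = json.dumps(params, sort_keys=True, default=str)
    return hashlib.sha256(serialised.encode("utf-8")).hexdigest()[:length]


# Same parameters in a different key order hash to the same value.
hash_a = short_param_hash(
    {"sample_sets": ["AG1000G-BF-A"], "sample_query": "taxon == 'gambiae'"}
)
hash_b = short_param_hash(
    {"sample_query": "taxon == 'gambiae'", "sample_sets": ["AG1000G-BF-A"]}
)

# A different sample query hashes to a different value.
hash_c = short_param_hash(
    {"sample_sets": ["AG1000G-BF-A"], "sample_query": "taxon == 'coluzzii'"}
)
```

Truncating to 8 hex characters keeps 32 bits of the digest, which is a trade-off between filename length and the (small) chance of two distinct selections sharing a suffix.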

Implementation

  • Modified biallelic_snps_to_plink() in to_plink.py to hash the excluded parameters and append the 8-character hash to the filename prefix.
  • Updated test_plink_converter.py to use the returned file prefix from biallelic_snps_to_plink() rather than reconstructing the filename pattern, ensuring tests validate the actual behavior.
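The testing approach in the second bullet can be illustrated with a small sketch. Everything here is hypothetical: fake_export stands in for biallelic_snps_to_plink() and only writes empty placeholder files, and the file name it uses is made up. The point is only that the check relies on the returned prefix instead of rebuilding the naming pattern.

```python
import tempfile
from pathlib import Path

PLINK_EXTENSIONS = (".bed", ".bim", ".fam")


def fake_export(output_dir):
    """Hypothetical stand-in for the exporter: writes empty files, returns the prefix."""
    prefix = str(Path(output_dir) / "2L.100.1.0.0.deadbeef")
    for ext in PLINK_EXTENSIONS:
        Path(prefix + ext).touch()
    return prefix


def outputs_exist(prefix):
    """Check that all three PLINK files exist for the given prefix."""
    return all(Path(prefix + ext).exists() for ext in PLINK_EXTENSIONS)


with tempfile.TemporaryDirectory() as tmp:
    # Use whatever prefix the exporter returns; do not reconstruct it here.
    returned_prefix = fake_export(tmp)
    outputs_found = outputs_exist(returned_prefix)
```

A test written this way keeps passing if the naming scheme changes again, because it never encodes the pattern itself.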

Why hashing is preferable

  • sample_sets can be long lists.
  • sample_query and sample_query_options can contain arbitrary expressions and structured values.
  • Directly embedding those values into filenames would be verbose and fragile.
  • The repository already has _hash_params() for deterministic parameter hashing, so this follows an existing internal pattern.

Proposed filename change

Current style:

{output_dir}/{region}.{n_snps}.{min_minor_ac}.{max_missing_an}.{thin_offset}

Proposed style:

{output_dir}/{region}.{n_snps}.{min_minor_ac}.{max_missing_an}.{thin_offset}.{param_hash}
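As a rough illustration of the proposed style, the prefix could be assembled as below. build_plink_prefix is a hypothetical helper written for this example, not a function in the repository, and the parameter values and hash strings are made up.

```python
import os


def build_plink_prefix(
    output_dir, region, n_snps, min_minor_ac, max_missing_an, thin_offset, param_hash
):
    """Assemble a prefix in the proposed style (hypothetical helper)."""
    name = f"{region}.{n_snps}.{min_minor_ac}.{max_missing_an}.{thin_offset}.{param_hash}"
    return os.path.join(output_dir, name)


# Identical region and SNP filters, but different sample selections, which
# are represented here by two made-up 8-character hashes.
common = dict(
    output_dir="plink_out",
    region="2L",
    n_snps=1000,
    min_minor_ac=2,
    max_missing_an=0,
    thin_offset=0,
)
prefix_a = build_plink_prefix(**common, param_hash="1a2b3c4d")
prefix_b = build_plink_prefix(**common, param_hash="9f8e7d6c")
```

Under the current style both requests would share the name 2L.1000.2.0.0; with the hash appended they resolve to distinct prefixes.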

Validation

  • All ruff checks passed on modified files.
  • PLINK converter tests (2 test cases) passed.
  • Full non-integration test suite passed (993 passed, 4 skipped).

Compatibility note

This changes the output filenames generated by biallelic_snps_to_plink(), so files written under the old naming scheme will not be found or reused. That is intentional, because the current naming scheme can alias distinct exports to the same path.

Closes #1113
