Fix: include sample-selection parameters in PLINK output filename #1121

Open
Mowblow wants to merge 1 commit into malariagen:master from Mowblow:fix-plink-filename-params

Mowblow commented Mar 14, 2026

Fix: include sample-selection parameters in PLINK output filename

Fixes #1113

Problem

The filename prefix for biallelic_snps_to_plink was built from only a subset of input parameters:

plink_file_path = f"{output_dir}/{region}.{n_snps}.{min_minor_ac}.{max_missing_an}.{thin_offset}"

Parameters like sample_sets, sample_query, sample_query_options, sample_indices, site_mask, and random_seed were excluded, meaning two calls with different sample cohorts could silently resolve to the same filename and overwrite each other.
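The collision can be reproduced in isolation with a minimal sketch. `old_prefix` below is a hypothetical stand-in that only mirrors the original f-string (it is not the library function), and the sample set names are illustrative:

```python
def old_prefix(output_dir, region, n_snps, min_minor_ac, max_missing_an,
               thin_offset, **ignored):
    # Mirrors the original f-string: keyword arguments such as sample_sets
    # are accepted but never reach the filename.
    return f"{output_dir}/{region}.{n_snps}.{min_minor_ac}.{max_missing_an}.{thin_offset}"


common = dict(output_dir="out", region="2L", n_snps=1000,
              min_minor_ac=2, max_missing_an=8, thin_offset=0)

# Two different cohorts (illustrative names) ...
a = old_prefix(**common, sample_sets=["set-a"])
b = old_prefix(**common, sample_sets=["set-b"])

# ... resolve to the same prefix, so the second export would overwrite the first.
print(a == b)  # True
```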

Approach

I considered three options:

  1. Append raw parameter values to the filename: rejected because sample_sets can be a list and sample_query can be a long arbitrary string, making filenames unpredictable and potentially invalid on some operating systems.

  2. Include only the most commonly varied parameters: rejected because it still leaves some parameters out, so the same collision problem could reappear.

  3. Hash all sample-selection parameters together: chosen because it keeps filenames short and OS-safe, makes collisions between different input combinations very unlikely, and handles None values and lists cleanly through JSON serialization.

Solution

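Bundle all sample-selection parameters into a dictionary, serialize it with json.dumps (sort_keys=True so key order cannot change the result, default=str so values that are not natively JSON-serializable are stringified instead of raising TypeError), and append the first 8 hex characters of the MD5 digest to the filename: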

import hashlib
import json

# Parameters that change the exported data but were missing from the filename.
sample_params = {
    "sample_sets": sample_sets,
    "sample_query": sample_query,
    "sample_query_options": sample_query_options,
    "sample_indices": sample_indices,
    "site_mask": site_mask,
    "random_seed": random_seed,
}
# Serialize deterministically, then keep the first 8 hex characters of the digest.
params_hash = hashlib.md5(
    json.dumps(sample_params, sort_keys=True, default=str).encode()
).hexdigest()[:8]

plink_file_path = f"{output_dir}/{region}.{n_snps}.{min_minor_ac}.{max_missing_an}.{thin_offset}.{params_hash}"

The resulting filename prefix is short, deterministic, and very unlikely to collide across parameter combinations, for example: 2L.1000.2.8.0.a3f7bc12.
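Because only 8 hex characters (32 bits) of the digest are kept, distinct parameter combinations get distinct prefixes with high probability rather than with certainty. A rough birthday-bound estimate, assuming the truncated MD5 values behave like uniformly random 32-bit numbers (an assumption, not something measured in this PR), can be sketched as follows; `collision_probability` is a hypothetical helper written for illustration:

```python
import math


def collision_probability(n_prefixes, hex_chars=8):
    """Birthday-bound estimate that at least two of n_prefixes hashes coincide."""
    space = 16 ** hex_chars                    # 8 hex characters -> 2**32 values
    pairs = n_prefixes * (n_prefixes - 1) / 2  # number of distinct pairs
    # 1 - exp(-pairs / space), computed stably for small arguments.
    return -math.expm1(-pairs / space)


for n in (10, 1_000, 100_000):
    print(n, f"{collision_probability(n):.2e}")
```

In practice only exports that also share the same region, n_snps, min_minor_ac, max_missing_an, and thin_offset compete for the same hash suffix, so the relevant n is usually small.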

Testing

I verified that:

  • Two calls with different sample_sets produce different filename prefixes
  • Two calls with identical parameters produce the same filename prefix (caching still works)
  • None values are handled without errors
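The checks above could be captured as pytest-style tests along these lines. This is a sketch only: `plink_prefix` is a hypothetical stand-in that reproduces the prefix logic from this PR outside the library, and the parameter values are illustrative.

```python
import hashlib
import json


def plink_prefix(output_dir, region, n_snps, min_minor_ac, max_missing_an,
                 thin_offset, **sample_params):
    # Hypothetical stand-in for the prefix logic, not the library function.
    digest = hashlib.md5(
        json.dumps(sample_params, sort_keys=True, default=str).encode()
    ).hexdigest()[:8]
    return (f"{output_dir}/{region}.{n_snps}.{min_minor_ac}."
            f"{max_missing_an}.{thin_offset}.{digest}")


BASE = dict(output_dir="out", region="2L", n_snps=1000,
            min_minor_ac=2, max_missing_an=8, thin_offset=0)


def test_different_sample_sets_give_different_prefixes():
    first = plink_prefix(**BASE, sample_sets=["set-a"])
    second = plink_prefix(**BASE, sample_sets=["set-b"])
    assert first != second


def test_identical_parameters_give_identical_prefix():
    kwargs = dict(sample_sets=["set-a"], sample_query="country == 'Ghana'",
                  random_seed=42)
    assert plink_prefix(**BASE, **kwargs) == plink_prefix(**BASE, **kwargs)


def test_none_values_are_accepted():
    prefix = plink_prefix(**BASE, sample_sets=None, sample_query=None,
                          site_mask=None)
    assert prefix.startswith("out/2L.1000.2.8.0.")
    assert len(prefix.rsplit(".", 1)[1]) == 8
```

Note that list order is part of the serialized value, so sample_sets given in a different order would hash to a different prefix under this scheme.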

Development

Successfully merging this pull request may close these issues.

Hash PLINK export filenames by content-affecting selection parameters to prevent collisions