
Added LD pruning support and PLINK export for ADMIXTURE analysis #1063

Open
rehanxt5 wants to merge 13 commits intomalariagen:masterfrom
rehanxt5:GH1049-add-ld-pruning

Conversation


@rehanxt5 rehanxt5 commented Mar 6, 2026

Solution

I have added LD pruning capabilities to the API, allowing users to inspect and export ADMIXTURE-compatible data in a single call. The implementation uses scikit-allel's locate_unlinked() to perform the LD pruning.

I created a new module, ld_params.py, which holds the parameter descriptions for the LD pruning functions, and ld.py, where the actual methods live:

  • biallelic_snps_ld_pruned(): performs LD pruning on the biallelic SNPs
  • biallelic_snps_ld_pruned_to_plink(): generates the data in PLINK format, ready to use with ADMIXTURE

All the changes in detail are discussed below:

Changes

New Module: ld_params.py

Defines three configurable parameters for LD pruning:

  • size (default 500): the window size, in number of SNPs
  • step (default 200): how many SNPs to advance the window each iteration
  • threshold (default 0.1): the r² threshold for linkage disequilibrium
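To illustrate what these three parameters control, here is a pure-NumPy sketch of sliding-window LD pruning. This is illustrative only; the actual implementation delegates to scikit-allel's locate_unlinked(), which does this efficiently, and the toy data below is invented for the example.

```python
import numpy as np

def ld_prune_sketch(gn, size=500, step=200, threshold=0.1):
    """Illustrative sliding-window LD pruning over a (variants x samples)
    matrix of alt-allele counts. Not the real implementation."""
    n = gn.shape[0]
    keep = np.ones(n, dtype=bool)
    for start in range(0, n, step):
        # Only SNPs still kept participate in the window.
        window = [i for i in range(start, min(start + size, n)) if keep[i]]
        for a, i in enumerate(window):
            if not keep[i]:
                continue
            for j in window[a + 1:]:
                if not keep[j]:
                    continue
                x = gn[i].astype(float)
                y = gn[j].astype(float)
                if x.std() == 0 or y.std() == 0:
                    continue  # constant sites carry no LD signal
                r = np.corrcoef(x, y)[0, 1]
                if r * r > threshold:
                    keep[j] = False  # prune the later SNP of a linked pair
    return keep

# Two perfectly linked SNPs followed by a constant one:
gn = np.array([[0, 1, 2, 0],
               [0, 1, 2, 0],
               [1, 1, 1, 1]])
print(ld_prune_sketch(gn).tolist())  # [True, False, True]
```

The second SNP is pruned because it is perfectly correlated with the first; the constant site is skipped because it carries no LD signal.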

New Mixin: ld.py (AnophelesLdAnalysis)

Public method biallelic_snps_ld_pruned()

  • Performs LD pruning on biallelic SNPs for a given region and sample set
  • Supports all standard filtering: min_minor_ac, max_missing_an, n_snps, cohort downsampling
  • Returns an xarray Dataset with the pruned variants
  • Caches results efficiently (only the pruning mask is stored, not the full genotypes)
  • Warns users about memory usage before computing

Public method biallelic_snps_ld_pruned_to_plink()

  • Combines LD pruning with PLINK binary export in one call
  • Outputs .bed, .bim and .fam files compatible with ADMIXTURE
  • Supports file overwrite control
  • Uses cached results to avoid redundant computation

Updated anopheles.py

  • Added AnophelesLdAnalysis to the AnophelesDataResource class MRO (between AnophelesPca and PlinkConverter)
  • Both methods are now available on all Anopheles species API objects

Documentation

  • Updated RST docs for all 4 species (Ag3, Af1, Adir1, Amin1)
  • Added both new methods to API autosummary

Tests

  • test_biallelic_snps_ld_pruned: Basic functionality, caching, result validation
  • test_biallelic_snps_ld_pruned_with_n_snps: Pre-filtering with SNP count limit
  • test_biallelic_snps_ld_pruned_with_cohort_size: Individual downsampling
  • test_biallelic_snps_ld_pruned_to_plink: PLINK export validation

Usage Example

import malariagen_data

ag3 = malariagen_data.Ag3()

# Export LD-pruned PLINK files ready for ADMIXTURE
plink_path = ag3.biallelic_snps_ld_pruned_to_plink(
    output_dir="./admixture_input",
    region="3L",
    size=500,
    step=200,
    threshold=0.1
)

# Then run: admixture ./admixture_input/3L.ld_pruned*.bed K

Technical Details

  • The LD pruning uses scikit-allel's locate_unlinked() method
  • Caching stores only the boolean pruning mask (efficient).
  • Memory warning before compute step alerts users to potential RAM usage.
  • Lazy evaluation with dask arrays until final compute.
  • Supports all existing SNP filtering and sample selection parameters.
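The mask-only caching idea can be sketched as follows; the file name and layout here are illustrative, not the package's actual cache format:

```python
import os
import tempfile
import numpy as np

cache_dir = tempfile.mkdtemp()
mask_path = os.path.join(cache_dir, "ld_prune_mask.npy")  # illustrative name

# First call: compute the pruning mask (the expensive step) and persist
# only the mask, not the full genotype data.
mask = np.array([True, False, True, True])
np.save(mask_path, mask)

# A later call with the same parameters reloads the mask instead of
# re-running the LD computation, then re-applies it to the lazily
# loaded genotypes.
cached = np.load(mask_path)
print(f"{int(cached.sum())} of {cached.size} variants kept")  # 3 of 4 variants kept
```

Persisting only a boolean array keeps the cache tiny regardless of cohort size, since the genotypes themselves can always be reloaded lazily.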

Testing

  • All 8 new tests pass
  • No regression in existing PLINK converter tests
  • Ruff linting passes (code style compliant)
  • Tested on ag3_sim and af1_sim fixtures

Closes #1049

@rehanxt5
Copy link
Contributor Author

rehanxt5 commented Mar 6, 2026

Hey @jonbrenas, the draft PR is ready. I built this out as a standalone biallelic_snps_ld_pruned utility, with memory safety, caching and pre-filtering applied. Happy to make any changes or tweaks to the implementation. Let me know what you think!

@rehanxt5
Copy link
Contributor Author

rehanxt5 commented Mar 7, 2026

@jonbrenas, could you look into this when you have time? Thanks!


@saadte saadte left a comment


loc_var = np.any(gn_ref != gn_ref[:, 0, np.newaxis], axis=1)
Can you explain what this does?

inline_array: base_params.inline_array = base_params.inline_array_default,
chunks: base_params.chunks = base_params.native_chunks,
):
# Define output file path using parameters that affect the content.

Can I ask how this is determined?

Define output file path using parameters that affect the content.

@rehanxt5 (Contributor Author)

I assume you are asking why this specific set of parameters was chosen for the filename.

The filename is determined by the parameters output_dir, region, n_snps, min_minor_ac, max_missing_an, thin_offset, size, step and threshold. I excluded the other selection parameters (such as sample sets or sample queries), since those can be very long strings or complex queries. My approach also mirrors the biallelic_snps_to_plink convention in to_plink.py, where sample selection is expected to be differentiated via output_dir, consistent with how the existing PLINK export works.


@saadte saadte Mar 7, 2026


Have you considered the possibility of a file collision if the parameters you skip when naming the file change the content? Different parameter combinations could then produce different content yet write to the exact same file, causing silent downstream problems.


rehanxt5 commented Mar 7, 2026

@saadte
loc_var = np.any(gn_ref != gn_ref[:, 0, np.newaxis], axis=1)

Basically, gn_ref will contain many sites that are constant. For PLINK we require varying (polymorphic) sites. There are two approaches: the first is to compare each column to every other column, which would be compute-heavy; what I implemented instead compares sample 0 against every other column and keeps only the rows that vary. For example:

gn_ref:

| Site | Person 0 | Person 1 | Person 2 | Person 3 | Note |
|------|----------|----------|----------|----------|------|
| Site A | 2 | 2 | 2 | 2 | Everyone is identical |
| Site B | 0 | 1 | 0 | 2 | People are different |
| Site C | 1 | 1 | 1 | 1 | Everyone is identical |
| Site D | 2 | 2 | 0 | 1 | People are different |

Now we take Person 0, compare them against every other sample, and get a boolean mask:

| Site | Person 0 | Person 1 | Person 2 | Person 3 | Note |
|------|----------|----------|----------|----------|------|
| Site A | False | False | False | False | All False |
| Site B | False | True | False | True | Found True |
| Site C | False | False | False | False | All False |
| Site D | False | False | True | True | Found True |

After this, np.any() checks horizontally whether any value is True; if so, the site is marked True. The final result is:
[False, True, False, True]

This mask is then used to filter the dataset in the very next lines:

ds_final = _dask_compress_dataset(ds_pruned, loc_var, dim="variants")
gn_ref_final = gn_ref[loc_var]
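For reference, the worked example above runs as-is on a toy array (data invented for illustration):

```python
import numpy as np

# Toy (sites x samples) matrix of ref-allele counts: sites A-D from the
# tables above.
gn_ref = np.array([
    [2, 2, 2, 2],  # Site A: constant
    [0, 1, 0, 2],  # Site B: varying
    [1, 1, 1, 1],  # Site C: constant
    [2, 2, 0, 1],  # Site D: varying
])

# Compare every sample against sample 0; any mismatch marks the site as
# varying.
loc_var = np.any(gn_ref != gn_ref[:, 0, np.newaxis], axis=1)
print(loc_var.tolist())  # [False, True, False, True]
```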


@jonbrenas (Collaborator)

Thanks @saadte, @rehanxt5,

My questions are a bit more fundamental:

  1. Why would you need to select polymorphic sites when you only extracted biallelic sites?
  2. You are aware of the existence of biallelic_snps_to_plink. Why is another function converting biallelic SNPs to plink necessary? Why is it in a class that is (from its name) not connected to plink at all?


saadte commented Mar 7, 2026

Just upstream, you do

gn_ref = allel.GenotypeDaskArray(gt_asc).to_n_ref(fill=-127)
which I think encodes missing calls as -127.

Then,

loc_var = np.any(gn_ref != gn_ref[:, 0, np.newaxis], axis=1)
will result in something like [-127,0,0,0] being treated as variant, whereas the calls are constant with one missing value.

I might be wrong with my understanding, and will be happy if we're able to clear this.
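This scenario can be reproduced directly with a toy array (illustrative data only):

```python
import numpy as np

# One site where every call is 0 except a missing call encoded as -127.
gn_ref = np.array([[-127, 0, 0, 0]])

# The sample-0 comparison treats the missing value as a genuine difference.
loc_var = np.any(gn_ref != gn_ref[:, 0, np.newaxis], axis=1)
print(loc_var.tolist())  # [True] -- a constant site wrongly flagged as variant
```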


rehanxt5 commented Mar 7, 2026

@saadte @jonbrenas

  1. You are right, polymorphism is already guaranteed with biallelic sites. I carried it over from the existing pattern in biallelic_snps_to_plink in to_plink.py, but I verified that loc_var is not required, since biallelic_snp_calls() already handles this.

However, on examining loc_var in biallelic_snps_to_plink, that method takes a ds that has already been through biallelic_snp_calls(), which carries the same is_biallelic guarantee, so from what I can see loc_var is redundant there too. I could be wrong; I'd be happy if you could clarify this for me.

@saadte you're correct, loc_var has a real bug with -127 fill values. However, the bug is masked in practice because is_biallelic() in biallelic_snp_calls() filters out non-biallelic sites before loc_var runs.

  2. biallelic_snps_to_plink() cannot be reused here directly because it hardcodes a call to biallelic_snp_calls() internally. However, I can create a private helper writer function that can be used by both methods. I'll move biallelic_snps_ld_pruned_to_plink to its correct location in to_plink.py.
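A toy sketch of the shared-writer shape being proposed; the names, signatures and data here are illustrative stand-ins, not the real malariagen_data API:

```python
class PlinkConverter:
    """Toy sketch of the shared-writer refactor (illustrative only)."""

    def _write_plink(self, ds, output_path):
        # Shared private writer: in the real code this would emit the
        # .bed/.bim/.fam files for an already-filtered dataset.
        return f"{output_path}: wrote {len(ds)} variants"

    def biallelic_snp_calls(self):
        return ["snp1", "snp2", "snp3", "snp4"]  # stand-in dataset

    def biallelic_snps_ld_pruned(self):
        return ["snp1", "snp3"]  # pruning keeps a subset

    def biallelic_snps_to_plink(self, output_path):
        # Existing method: hardcodes the unpruned upstream call.
        return self._write_plink(self.biallelic_snp_calls(), output_path)

    def biallelic_snps_ld_pruned_to_plink(self, output_path):
        # New method: same writer, LD-pruned upstream data.
        return self._write_plink(self.biallelic_snps_ld_pruned(), output_path)

pc = PlinkConverter()
print(pc.biallelic_snps_to_plink("all"))               # all: wrote 4 variants
print(pc.biallelic_snps_ld_pruned_to_plink("pruned"))  # pruned: wrote 2 variants
```

The design point is that only the upstream data-preparation step differs between the two public methods, so the file-writing logic lives in one place.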

rehanxt5 added 5 commits March 8, 2026 17:37
…nkConverter

- Extract shared _write_plink() helper in PlinkConverter to reduce
  duplication between biallelic_snps_to_plink and
  biallelic_snps_ld_pruned_to_plink
- Move biallelic_snps_ld_pruned_to_plink() from AnophelesLdAnalysis
  to PlinkConverter (proper class ownership)
- Clean up ld.py imports (remove os, bed_reader, numpy, plink_params)
- Update test fixtures to use combined _LdPlinkTestApi class
- Preserve loc_var in _write_plink (legacy behaviour)
…ps_to_plink and ld pruned to plink, moved the biallelic_snps_ld_pruned_to_plink to its correct position in to_plink.py

rehanxt5 commented Mar 8, 2026

@jonbrenas @saadte I have refactored the code. I created a new private PLINK writer function, _write_plink, which can be used by both biallelic_snps_ld_pruned_to_plink and biallelic_snps_to_plink. I didn't remove loc_var, as it was part of biallelic_snps_to_plink. I also moved the function to to_plink.py.

@jonbrenas (Collaborator)

Hi @rehanxt5. I think this is better. That said:

I didn't remove loc_var, as it was part of biallelic_snps_to_plink.

Is that your only reason? The majority of the codebase was not written by computer scientists or computer engineers: you are expected to be the expert here and be able to say "I think this design is a mistake, I can do better." If you think having 'loc_var' is the right decision, you need to be able to say why YOU would make that decision, independently of what other people have decided. The same if you think it was not the right decision.


rehanxt5 commented Mar 8, 2026

However, on examining loc_var in biallelic_snps_to_plink, that method takes a ds that has already been through biallelic_snp_calls(), which carries the same is_biallelic guarantee, so from what I can see loc_var is redundant there too. I could be wrong; I'd be happy if you could clarify this for me.

Is that your only reason? The majority of the codebase was not written by computer scientists or computer engineers: you are expected to be the expert here and be able to say "I think this design is a mistake, I can do better." If you think having 'loc_var' is the right decision, you need to be able to say why YOU would make that decision, independently of what other people have decided. The same if you think it was not the right decision.

@jonbrenas I completely agree with you, thanks for the push. To answer your question:
Yes, I believe loc_var is not required: the is_biallelic() filtering already guarantees the data is polymorphic, so loc_var is redundant.
I have now removed loc_var. I initially left it in because I didn't want to risk the backward compatibility of the biallelic_snps_to_plink function.

@rehanxt5 (Contributor Author)

Hey @jonbrenas, can you take a look at the changes?

@jonbrenas (Collaborator)

Hi @rehanxt5.

I initially left it in, because I didn't want to risk the backward compatibility of the biallelic_snps_to_plink function.

What kind of backward compatibility issues were you afraid of? Are you now convinced that there are no backward compatibility issues? What changed your mind?


saadte commented Mar 10, 2026

Hi @rehanxt5,

Previously, I had asked how you determine, as you have commented in your code:

Define output file path using parameters that affect the content.

and had asked for some understanding on:

Have you considered the possibility of a file collision if the parameters you skip for naming the file change the content? That might result in files with different content pointing to the exact same file, causing silent downstream problems.

This can prove to be critical if not addressed. Can you please have a look and help me understand how exactly you determine the relevant parameters and prevent collisions?

@rehanxt5 (Contributor Author)

@saadte @jonbrenas

@saadte You're right to flag this. I had assumed output_dir was the intended differentiator; however, I don't see any explicit documentation confirming that. This pattern exists in the pre-existing biallelic_snps_to_plink() method too.

My recommendation would be:
Since sample_sets and sample_query can be very long and complex strings, we can hash these two, truncate the hash to its first 8 characters, and add that to the filename. This can be applied to both biallelic_snps_to_plink() and biallelic_snps_ld_pruned_to_plink(), and will remove the file collision risk. There is already a _hash_params function which can be used for this.

I will implement this for biallelic_snps_ld_pruned_to_plink() within this PR. However, fixing biallelic_snps_to_plink() in this PR would be scope creep, so I will open a new issue and fix biallelic_snps_to_plink() there.
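A minimal sketch of the idea, using hashlib directly for illustration; the actual _hash_params helper in the codebase may serialise and hash differently, and the parameter values and naming scheme below are invented:

```python
import hashlib
import json

# Hypothetical sample-selection parameters that are too long to embed
# verbatim in a filename.
selection = {
    "sample_sets": ["AG1000G-BF-A", "AG1000G-BF-B"],
    "sample_query": "taxon == 'gambiae'",
}

# Serialise deterministically, hash, and keep the first 8 hex characters.
digest = hashlib.md5(
    json.dumps(selection, sort_keys=True).encode()
).hexdigest()[:8]

filename = f"3L.ld_pruned.{digest}"  # illustrative naming scheme
print(len(digest))  # 8
```

Because the serialisation is deterministic, the same selection always maps to the same suffix, while any change to the selection changes the filename.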

@jonbrenas I was afraid of breaking the filtering in the legacy function by removing loc_var, as at the time I was not so clear about it.
What changed my mind was tracing the data pipeline properly: biallelic_snp_calls() already applies min_minor_ac=2 by default, which guarantees that every site reaching _write_plink() has at least 2 minor allele observations across the sample subset, so it is by definition varying.
The tests confirm this empirically: all 10 tests pass identically with loc_var removed.


rehanxt5 commented Mar 15, 2026

I will implement this for biallelic_snps_ld_pruned_to_plink() within this PR. However, fixing biallelic_snps_to_plink() in this PR would be scope creep, so I will open a new issue and fix biallelic_snps_to_plink() there.

Hey @jonbrenas @saadte, I have addressed the file collision issue for biallelic_snps_ld_pruned_to_plink in this PR. I used the _hash_params utility to generate a hash from sample_sets, sample_query, sample_query_options etc., and appended the first 8 characters of the hash as a filename suffix.

In #1114 I have done the same for biallelic_snps_to_plink.

@jonbrenas (Collaborator)

Thanks, @rehanxt5. I have 2 main comments:

  • Why does it matter for the function writing the plink file whether the biallelic SNPs are LD-pruned or not? Will we need a different function every time we apply a different filter to our SNP data?
  • It seems to me that the simplest way to allow the user to avoid collision is to let them choose the name of the output file they want to generate. The main issue I see with the collision risk is that the user might not be aware that they are not computing a new analysis when "overwrite == False" as it silently uses the pre-existing file. Your solution avoids the risk of collision, but it doesn't increase the chances of the user knowing which of two similarly named files, generated with different sample sets but every other parameter being the same for example, contains the one they are looking for for a given analysis. Do you think it is an issue?

Development

Successfully merging this pull request may close these issues:

  • Adding Admixture functionalities to the API