
Added LD pruning support and PLINK export for ADMIXTURE analysis #1063

Open
rehanxt5 wants to merge 13 commits intomalariagen:masterfrom
rehanxt5:GH1049-add-ld-pruning

Conversation


@rehanxt5 rehanxt5 commented Mar 6, 2026

Solution

I have added LD pruning capabilities to the API, allowing users to inspect and export ADMIXTURE-compatible data in a single call. The implementation uses scikit-allel's locate_unlinked() to perform the LD pruning.

I created a new module, ld_params.py, which holds the parameter descriptions for the LD pruning functions, and ld.py, where the actual methods live:

  • biallelic_snps_ld_pruned(): performs LD pruning on the biallelic SNPs
  • biallelic_snps_ld_pruned_to_plink(): generates the data in PLINK format, ready to use with ADMIXTURE

All the changes in detail are discussed below:

Changes

New Module: ld_params.py

Defines three configurable parameters for LD pruning:

  • size (default 500): the window size, in number of SNPs
  • step (default 200): how many SNPs to advance the window each iteration
  • threshold (default 0.1): the r² threshold for linkage disequilibrium
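To illustrate what these three parameters control, here is a pure-NumPy sketch of sliding-window LD pruning. This is illustrative only; the actual implementation delegates to scikit-allel's locate_unlinked(), which does this efficiently, and the toy data below is invented for the example.

```python
import numpy as np

def ld_prune_sketch(gn, size=500, step=200, threshold=0.1):
    """Illustrative sliding-window LD pruning over a (variants x samples)
    matrix of alt-allele counts. Not the real implementation."""
    n = gn.shape[0]
    keep = np.ones(n, dtype=bool)
    for start in range(0, n, step):
        # Only SNPs still kept participate in the window.
        window = [i for i in range(start, min(start + size, n)) if keep[i]]
        for a, i in enumerate(window):
            if not keep[i]:
                continue
            for j in window[a + 1:]:
                if not keep[j]:
                    continue
                x = gn[i].astype(float)
                y = gn[j].astype(float)
                if x.std() == 0 or y.std() == 0:
                    continue  # constant sites carry no LD signal
                r = np.corrcoef(x, y)[0, 1]
                if r * r > threshold:
                    keep[j] = False  # prune the later SNP of a linked pair
    return keep

# Two perfectly linked SNPs followed by a constant one:
gn = np.array([[0, 1, 2, 0],
               [0, 1, 2, 0],
               [1, 1, 1, 1]])
print(ld_prune_sketch(gn).tolist())  # [True, False, True]
```

The second SNP is pruned because it is perfectly correlated with the first; the constant site is skipped because it carries no LD signal.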

New Mixin: ld.py (AnophelesLdAnalysis)

Public method biallelic_snps_ld_pruned()

  • Performs LD pruning on biallelic SNPs for a given region and sample set
  • Supports all standard filtering: min_minor_ac, max_missing_an, n_snps, cohort downsampling
  • Returns an xarray Dataset with the pruned variants
  • Caches results efficiently (only the pruning mask is stored, not the full genotypes)
  • Warns users about memory usage before computing

Public method biallelic_snps_ld_pruned_to_plink()

  • Combines LD pruning with PLINK binary export in one call
  • Outputs .bed, .bim and .fam files compatible with ADMIXTURE
  • Supports file overwrite control
  • Uses cached results to avoid redundant computation

Updated anopheles.py

  • Added AnophelesLdAnalysis to the AnophelesDataResource class MRO (between AnophelesPca and PlinkConverter)
  • Both methods are now available on all Anopheles species API objects

Documentation

  • Updated RST docs for all 4 species (Ag3, Af1, Adir1, Amin1)
  • Added both new methods to API autosummary

Tests

  • test_biallelic_snps_ld_pruned: Basic functionality, caching, result validation
  • test_biallelic_snps_ld_pruned_with_n_snps: Pre-filtering with SNP count limit
  • test_biallelic_snps_ld_pruned_with_cohort_size: Individual downsampling
  • test_biallelic_snps_ld_pruned_to_plink: PLINK export validation

Usage Example

import malariagen_data

ag3 = malariagen_data.Ag3()

# Export LD-pruned PLINK files ready for ADMIXTURE
plink_path = ag3.biallelic_snps_ld_pruned_to_plink(
    output_dir="./admixture_input",
    region="3L",
    size=500,
    step=200,
    threshold=0.1
)

# Then run: admixture ./admixture_input/3L.ld_pruned*.bed K

Technical Details

  • The LD pruning uses scikit-allel's locate_unlinked() method
  • Caching stores only the boolean pruning mask (efficient).
  • Memory warning before compute step alerts users to potential RAM usage.
  • Lazy evaluation with dask arrays until final compute.
  • Supports all existing SNP filtering and sample selection parameters.
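The mask-only caching idea can be sketched as follows; the file name and layout here are illustrative, not the package's actual cache format:

```python
import os
import tempfile
import numpy as np

cache_dir = tempfile.mkdtemp()
mask_path = os.path.join(cache_dir, "ld_prune_mask.npy")  # illustrative name

# First call: compute the pruning mask (the expensive step) and persist
# only the mask, not the full genotype data.
mask = np.array([True, False, True, True])
np.save(mask_path, mask)

# A later call with the same parameters reloads the mask instead of
# re-running the LD computation, then re-applies it to the lazily
# loaded genotypes.
cached = np.load(mask_path)
print(f"{int(cached.sum())} of {cached.size} variants kept")  # 3 of 4 variants kept
```

Persisting only a boolean array keeps the cache tiny regardless of cohort size, since the genotypes themselves can always be reloaded lazily.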

Testing

  • All 8 new tests pass
  • No regression in existing PLINK converter tests
  • Ruff linting passes (code style compliant)
  • Tested on ag3_sim and af1_sim fixtures

Closes #1049

@rehanxt5
Copy link
Contributor Author

rehanxt5 commented Mar 6, 2026

Hey @jonbrenas, the draft PR is ready. I built this out as a standalone biallelic_snps_ld_pruned utility, with memory safety, caching and pre-filtering applied. Happy to make any changes or tweaks to the implementation. Let me know what you think!

@rehanxt5
Copy link
Contributor Author

rehanxt5 commented Mar 7, 2026

@jonbrenas, could you look into this when you have time? Thanks!


@saadte saadte left a comment


loc_var = np.any(gn_ref != gn_ref[:, 0, np.newaxis], axis=1)
Can you explain what this does?

inline_array: base_params.inline_array = base_params.inline_array_default,
chunks: base_params.chunks = base_params.native_chunks,
):
# Define output file path using parameters that affect the content.

Can I ask how this is determined?

Define output file path using parameters that affect the content.

@rehanxt5 (Contributor Author)

I assume you are asking why this specific set of parameters was chosen for the filename.

The filename is determined by the parameters output_dir, region, n_snps, min_minor_ac, max_missing_an, thin_offset, size, step and threshold. I excluded the other selection parameters (such as sample sets or sample queries), since those can be very long strings or complex queries. My approach also mirrors the biallelic_snps_to_plink convention in to_plink.py, where sample selection is expected to be differentiated via output_dir, consistent with how the existing PLINK export works.


@saadte saadte Mar 7, 2026


Have you considered the possibility of a file collision if the parameters you skip when naming the file change the content? Different parameter combinations could then produce different content yet write to the exact same file, causing silent downstream problems.


rehanxt5 commented Mar 7, 2026

@saadte
loc_var = np.any(gn_ref != gn_ref[:, 0, np.newaxis], axis=1)

Basically, gn_ref will contain many sites that are constant. For PLINK we require varying (polymorphic) sites. There are two approaches: the first is to compare each column to every other column, which would be compute-heavy; what I implemented instead compares sample 0 against every other column and keeps only the rows that vary. For example:

gn_ref:

| Site | Person 0 | Person 1 | Person 2 | Person 3 | Note |
|------|----------|----------|----------|----------|------|
| Site A | 2 | 2 | 2 | 2 | Everyone is identical |
| Site B | 0 | 1 | 0 | 2 | People are different |
| Site C | 1 | 1 | 1 | 1 | Everyone is identical |
| Site D | 2 | 2 | 0 | 1 | People are different |

Now we take Person 0, compare them against every other sample, and get a boolean mask:

| Site | Person 0 | Person 1 | Person 2 | Person 3 | Note |
|------|----------|----------|----------|----------|------|
| Site A | False | False | False | False | All False |
| Site B | False | True | False | True | Found True |
| Site C | False | False | False | False | All False |
| Site D | False | False | True | True | Found True |

After this, np.any() checks horizontally whether any value is True; if so, the site is marked True. The final result is:
[False, True, False, True]

This mask is then used to filter the dataset in the very next lines:

ds_final = _dask_compress_dataset(ds_pruned, loc_var, dim="variants")
gn_ref_final = gn_ref[loc_var]
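For reference, the worked example above runs as-is on a toy array (data invented for illustration):

```python
import numpy as np

# Toy (sites x samples) matrix of ref-allele counts: sites A-D from the
# tables above.
gn_ref = np.array([
    [2, 2, 2, 2],  # Site A: constant
    [0, 1, 0, 2],  # Site B: varying
    [1, 1, 1, 1],  # Site C: constant
    [2, 2, 0, 1],  # Site D: varying
])

# Compare every sample against sample 0; any mismatch marks the site as
# varying.
loc_var = np.any(gn_ref != gn_ref[:, 0, np.newaxis], axis=1)
print(loc_var.tolist())  # [False, True, False, True]
```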


@jonbrenas (Collaborator)

Thanks @saadte, @rehanxt5,

My questions are a bit more fundamental:

  1. Why would you need to select polymorphic sites when you only extracted biallelic sites?
  2. You are aware of the existence of biallelic_snps_to_plink. Why is another function converting biallelic SNPs to plink necessary? Why is it in a class that is (from its name) not connected to plink at all?


saadte commented Mar 7, 2026

Just upstream, you do

gn_ref = allel.GenotypeDaskArray(gt_asc).to_n_ref(fill=-127)
which I think encodes missing calls as -127.

Then,

loc_var = np.any(gn_ref != gn_ref[:, 0, np.newaxis], axis=1)
will result in something like [-127,0,0,0] being treated as variant, whereas the calls are constant with one missing value.

I might be wrong with my understanding, and will be happy if we're able to clear this.
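This scenario can be reproduced directly with a toy array (illustrative data only):

```python
import numpy as np

# One site where every call is 0 except a missing call encoded as -127.
gn_ref = np.array([[-127, 0, 0, 0]])

# The sample-0 comparison treats the missing value as a genuine difference.
loc_var = np.any(gn_ref != gn_ref[:, 0, np.newaxis], axis=1)
print(loc_var.tolist())  # [True] -- a constant site wrongly flagged as variant
```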


rehanxt5 commented Mar 7, 2026

@saadte @jonbrenas

  1. You are right, polymorphism is already guaranteed with biallelic sites. I carried it over from the existing pattern in biallelic_snps_to_plink in to_plink.py, but I verified that loc_var is not required, since biallelic_snp_calls() already handles this.

However, on examining loc_var in biallelic_snps_to_plink, that method takes a ds that has already been through biallelic_snp_calls(), which carries the same is_biallelic guarantee, so from what I can see loc_var is redundant there too. I could be wrong; I'd be happy if you could clarify this for me.

@saadte you're correct, loc_var has a real bug with -127 fill values. However, the bug is masked in practice because is_biallelic() in biallelic_snp_calls() filters out non-biallelic sites before loc_var runs.

  2. biallelic_snps_to_plink() cannot be reused here directly because it hardcodes a call to biallelic_snp_calls() internally. However, I can create a private helper writer function that can be used by both methods. I'll move biallelic_snps_ld_pruned_to_plink to its correct location in to_plink.py.
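A toy sketch of the shared-writer shape being proposed; the names, signatures and data here are illustrative stand-ins, not the real malariagen_data API:

```python
class PlinkConverter:
    """Toy sketch of the shared-writer refactor (illustrative only)."""

    def _write_plink(self, ds, output_path):
        # Shared private writer: in the real code this would emit the
        # .bed/.bim/.fam files for an already-filtered dataset.
        return f"{output_path}: wrote {len(ds)} variants"

    def biallelic_snp_calls(self):
        return ["snp1", "snp2", "snp3", "snp4"]  # stand-in dataset

    def biallelic_snps_ld_pruned(self):
        return ["snp1", "snp3"]  # pruning keeps a subset

    def biallelic_snps_to_plink(self, output_path):
        # Existing method: hardcodes the unpruned upstream call.
        return self._write_plink(self.biallelic_snp_calls(), output_path)

    def biallelic_snps_ld_pruned_to_plink(self, output_path):
        # New method: same writer, LD-pruned upstream data.
        return self._write_plink(self.biallelic_snps_ld_pruned(), output_path)

pc = PlinkConverter()
print(pc.biallelic_snps_to_plink("all"))               # all: wrote 4 variants
print(pc.biallelic_snps_ld_pruned_to_plink("pruned"))  # pruned: wrote 2 variants
```

The design point is that only the upstream data-preparation step differs between the two public methods, so the file-writing logic lives in one place.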

rehanxt5 added 5 commits March 8, 2026 17:37
…nkConverter

- Extract shared _write_plink() helper in PlinkConverter to reduce
  duplication between biallelic_snps_to_plink and
  biallelic_snps_ld_pruned_to_plink
- Move biallelic_snps_ld_pruned_to_plink() from AnophelesLdAnalysis
  to PlinkConverter (proper class ownership)
- Clean up ld.py imports (remove os, bed_reader, numpy, plink_params)
- Update test fixtures to use combined _LdPlinkTestApi class
- Preserve loc_var in _write_plink (legacy behaviour)
…ps_to_plink and ld pruned to plink, moved the biallelic_snps_ld_pruned_to_plink to its correct position in to_plink.py

rehanxt5 commented Mar 8, 2026

@jonbrenas @saadte I have refactored the code. I created a new private PLINK writer function, _write_plink, which can be used by both biallelic_snps_ld_pruned_to_plink and biallelic_snps_to_plink. I didn't remove loc_var, as it was part of biallelic_snps_to_plink. I also moved the function to to_plink.py.

@jonbrenas (Collaborator)

Hi @rehanxt5. I think this is better. That said:

I didn't remove loc_var, as it was part of biallelic_snps_to_plink.

Is that your only reason? The majority of the codebase was not written by computer scientists or computer engineers: you are expected to be the expert here and be able to say "I think this design is a mistake, I can do better." If you think having 'loc_var' is the right decision, you need to be able to say why YOU would make that decision, independently of what other people have decided. The same if you think it was not the right decision.


rehanxt5 commented Mar 8, 2026

However, on examining loc_var in biallelic_snps_to_plink, that method takes a ds that has already been through biallelic_snp_calls(), which carries the same is_biallelic guarantee, so from what I can see loc_var is redundant there too. I could be wrong; I'd be happy if you could clarify this for me.

Is that your only reason? The majority of the codebase was not written by computer scientists or computer engineers: you are expected to be the expert here and be able to say "I think this design is a mistake, I can do better." If you think having 'loc_var' is the right decision, you need to be able to say why YOU would make that decision, independently of what other people have decided. The same if you think it was not the right decision.

@jonbrenas I completely agree with you, thanks for the push. To answer your question:
Yes, I believe loc_var is not required: the is_biallelic() filtering already guarantees the data is polymorphic, so loc_var is redundant.
I have now removed loc_var. I initially left it in because I didn't want to risk the backward compatibility of the biallelic_snps_to_plink function.

@rehanxt5 (Contributor Author)

Hey @jonbrenas, can you take a look at the changes?

@jonbrenas (Collaborator)

Hi @rehanxt5.

I initially left it in, because I didn't want to risk the backward compatibility of the biallelic_snps_to_plink function.

What kind of backward compatibility issues were you afraid of? Are you now convinced that there are no backward compatibility issues? What changed your mind?


saadte commented Mar 10, 2026

Hi @rehanxt5,

Previously, I had asked how you determine, as you have commented in your code:

Define output file path using parameters that affect the content.

and had asked for some understanding on:

Have you considered the possibility of a file collision if the parameters you skip for naming the file change the content? That might result in files with different content pointing to the exact same file, causing silent downstream problems.

This can prove to be critical if not addressed. Can you please have a look and help me understand how exactly you determine the relevant parameters and prevent collisions?

@rehanxt5 (Contributor Author)

@saadte @jonbrenas

@saadte You're right to flag this. I had assumed output_dir was the intended differentiator; however, I don't see any explicit documentation confirming that. This pattern exists in the pre-existing biallelic_snps_to_plink() method too.

My recommendation would be:
Since sample_sets and sample_query can be very long and complex strings, we can hash these two, truncate the hash to its first 8 characters, and add that to the filename. This can be applied to both biallelic_snps_to_plink() and biallelic_snps_ld_pruned_to_plink(), and will remove the file collision risk. There is already a _hash_params function which can be used for this.

I will implement this for biallelic_snps_ld_pruned_to_plink() within this PR. However, fixing biallelic_snps_to_plink() in this PR would be scope creep, so I will open a new issue and fix biallelic_snps_to_plink() there.
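A minimal sketch of the idea, using hashlib directly for illustration; the actual _hash_params helper in the codebase may serialise and hash differently, and the parameter values and naming scheme below are invented:

```python
import hashlib
import json

# Hypothetical sample-selection parameters that are too long to embed
# verbatim in a filename.
selection = {
    "sample_sets": ["AG1000G-BF-A", "AG1000G-BF-B"],
    "sample_query": "taxon == 'gambiae'",
}

# Serialise deterministically, hash, and keep the first 8 hex characters.
digest = hashlib.md5(
    json.dumps(selection, sort_keys=True).encode()
).hexdigest()[:8]

filename = f"3L.ld_pruned.{digest}"  # illustrative naming scheme
print(len(digest))  # 8
```

Because the serialisation is deterministic, the same selection always maps to the same suffix, while any change to the selection changes the filename.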

@jonbrenas I was afraid of breaking the filtering in the legacy function by removing loc_var, as at the time I was not so clear about it.
What changed my mind was tracing the data pipeline properly: biallelic_snp_calls() already applies min_minor_ac=2 by default, which guarantees that every site reaching _write_plink() has at least 2 minor allele observations across the sample subset, so it is by definition varying.
The tests confirm this empirically: all 10 tests pass identically with loc_var removed.


rehanxt5 commented Mar 15, 2026

I will implement this for biallelic_snps_ld_pruned_to_plink() within this PR. However, fixing biallelic_snps_to_plink() in this PR would be scope creep, so I will open a new issue and fix biallelic_snps_to_plink() there.

Hey @jonbrenas @saadte, I have addressed the file collision issue for biallelic_snps_ld_pruned_to_plink in this PR. I used the _hash_params utility to generate a hash from sample_sets, sample_query, sample_query_options etc., and appended the first 8 characters of the hash as a filename suffix.

In #1114 I have done the same for biallelic_snps_to_plink.

@jonbrenas (Collaborator)

Thanks, @rehanxt5. I have 2 main comments:

  • Why does it matter for the function writing the plink file whether the biallelic SNPs are LD-pruned or not? Will we need a different function every time we apply a different filter to our SNP data?
  • It seems to me that the simplest way to allow the user to avoid collision is to let them choose the name of the output file they want to generate. The main issue I see with the collision risk is that the user might not be aware that they are not computing a new analysis when "overwrite == False" as it silently uses the pre-existing file. Your solution avoids the risk of collision, but it doesn't increase the chances of the user knowing which of two similarly named files, generated with different sample sets but every other parameter being the same for example, contains the one they are looking for for a given analysis. Do you think it is an issue?

Development

Successfully merging this pull request may close these issues:

  • Adding Admixture functionalities to the API