Added LD pruning support and PLINK export for ADMIXTURE analysis #1063
rehanxt5 wants to merge 13 commits into malariagen:master
|
Hey @jonbrenas, the draft PR is ready. I built this out as a standalone |
|
@jonbrenas hey, can you look into this when you have time? Thanks! |
saadte
left a comment
loc_var = np.any(gn_ref != gn_ref[:, 0, np.newaxis], axis=1)
Can you explain what this does?
malariagen_data/anoph/ld.py
Outdated
    inline_array: base_params.inline_array = base_params.inline_array_default,
    chunks: base_params.chunks = base_params.native_chunks,
):
    # Define output file path using parameters that affect the content.
Can I know how this is determined?
Define output file path using parameters that affect the content.
I assume you are asking why this specific set of parameters was chosen for the filename.
The filename is determined by the parameters output_dir, region, n_snps, min_minor_ac, max_missing_an, thin_offset, size, step and threshold. I excluded the other selection parameters (like sample sets or sample query), since those can be very long strings or complex queries. My approach also mirrors the biallelic_snps_to_plink convention in to_plink.py, where sample selection is expected to be differentiated via output_dir, consistent with how the existing PLINK export works.
Have you considered the possibility of a file collision if the parameters you skip for naming the file change the content? That might result in files with different content pointing to the exact same file, causing silent downstream problems.
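One way to avoid the collision described above is to derive the filename from a hash of every parameter that affects the content, including the sample selection. The following is a minimal sketch under that assumption; `plink_output_prefix`, the `ld_pruned_` prefix and the parameter values are hypothetical and are not taken from malariagen_data or from this PR.

```python
import hashlib
import json
import os


def plink_output_prefix(output_dir, **params):
    """Build a file prefix from a hash of every content-affecting parameter.

    Hypothetical helper for illustration; not part of malariagen_data.
    """
    # Serialise deterministically: sorted keys, and default=str so that
    # objects json cannot encode natively still get a stable text form.
    payload = json.dumps(params, sort_keys=True, default=str)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return os.path.join(output_dir, f"ld_pruned_{digest}")


# Illustrative values only.
common = dict(
    region="3L", n_snps=1000, min_minor_ac=2, max_missing_an=0,
    thin_offset=0, size=500, step=200, threshold=0.1,
)

# Same pruning parameters, different sample selection: different files.
a = plink_output_prefix("out", sample_query="country == 'Ghana'", **common)
b = plink_output_prefix("out", sample_query="country == 'Kenya'", **common)
# Identical parameters: same file, so a cached result can be reused.
c = plink_output_prefix("out", sample_query="country == 'Ghana'", **common)
```

Hashing keeps the filename short no matter how long the sample query is, at the cost of a name that is no longer human-readable.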
|
@saadte basically what it does is, the
Now what we do is take Person0, compare it against every other sample, and get a boolean mask:
After this, np.any() checks horizontally whether any value is True; if so, it marks that variant as True, and finally returns:
This mask is then used to reduce the dataset in the very next lines: ds_final = _dask_compress_dataset(ds_pruned, loc_var, dim="variants")
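The comparison described above can be reproduced on a toy array. This sketch assumes gn_ref is a 2-D (variants × samples) array of alternate-allele counts, which the thread does not state explicitly; the values are invented for illustration, and plain boolean indexing stands in for _dask_compress_dataset.

```python
import numpy as np

# Toy genotype matrix: rows are variants, columns are samples.
# Values are alt-allele counts (0, 1 or 2), invented for illustration.
gn_ref = np.array(
    [
        [0, 0, 0, 0],  # invariant: every sample matches sample 0
        [0, 1, 2, 0],  # variable
        [2, 2, 2, 2],  # invariant
        [1, 1, 0, 1],  # variable
    ]
)

# gn_ref[:, 0, np.newaxis] has shape (n_variants, 1), so it broadcasts
# across the columns: a cell is True where a sample differs from sample 0.
differs_from_first = gn_ref != gn_ref[:, 0, np.newaxis]

# Collapse across samples: True where at least one sample differs.
loc_var = np.any(differs_from_first, axis=1)

# Keep only the variable sites.
gn_var = gn_ref[loc_var]
```

Because a site is kept whenever any sample differs from the first one, the result depends on how missing calls are encoded in the array.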
|
|
My questions are a bit more fundamental:
|
|
Just upstream, you do
Then,
I might be wrong with my understanding, and will be happy if we're able to clear this. |
However, on examining the @saadte you're correct, 2. |
…nkConverter
- Extract shared _write_plink() helper in PlinkConverter to reduce duplication between biallelic_snps_to_plink and biallelic_snps_ld_pruned_to_plink
- Move biallelic_snps_ld_pruned_to_plink() from AnophelesLdAnalysis to PlinkConverter (proper class ownership)
- Clean up ld.py imports (remove os, bed_reader, numpy, plink_params)
- Update test fixtures to use combined _LdPlinkTestApi class
- Preserve loc_var in _write_plink (legacy behaviour)
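The shape of this refactor can be sketched roughly as follows. This is a toy stand-in, not the code in the commit: the method names echo the commit message, but the class name, the signatures and the returned dictionary are invented for illustration, and no PLINK files are actually written.

```python
import numpy as np


class PlinkConverterSketch:
    """Toy stand-in for the refactored class; signatures are hypothetical."""

    def _write_plink(self, gn, pos, prefix):
        # Shared tail of both export paths: drop invariant sites
        # (the loc_var filter), then hand the result to the writer.
        loc_var = np.any(gn != gn[:, 0, np.newaxis], axis=1)
        # A real implementation would write .bed/.bim/.fam files here;
        # this sketch only returns what would be written.
        return {"prefix": prefix, "pos": pos[loc_var], "gn": gn[loc_var]}

    def biallelic_snps_to_plink(self, gn, pos, prefix):
        return self._write_plink(gn, pos, prefix)

    def biallelic_snps_ld_pruned_to_plink(self, gn, pos, keep, prefix):
        # Only the selection step differs; the writing logic is shared.
        return self._write_plink(gn[keep], pos[keep], prefix)


# Invented toy data: 4 variants x 3 samples, plus an LD-pruning mask.
gn = np.array([[0, 1, 2], [1, 1, 1], [0, 0, 2], [2, 0, 1]])
pos = np.array([10, 20, 30, 40])
keep = np.array([True, True, False, True])

conv = PlinkConverterSketch()
full = conv.biallelic_snps_to_plink(gn, pos, "all")
pruned = conv.biallelic_snps_ld_pruned_to_plink(gn, pos, keep, "pruned")
```

With the filter living in one helper, any later decision to keep or drop loc_var changes both export paths at once.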
…ps_to_plink and ld pruned to plink, moved the biallelic_snps_ld_pruned_to_plink to its correct position in to_plink.py
|
@jonbrenas @saadte I have refactored the code. I created a new private PLINK writer function |
|
Hi @rehanxt5. I think this is better. That said:
Is that your only reason? The majority of the codebase was not written by computer scientists or computer engineers: you are expected to be the expert here and be able to say "I think this design is a mistake, I can do better." If you think having 'loc_var' is the right decision, you need to be able to say why YOU would make that decision, independently of what other people have decided. The same if you think it was not the right decision. |
@jonbrenas I completely agree with you, thanks for the push. To answer your question. |
|
Hey @jonbrenas, can you take a look at the changes? |
|
Hi @rehanxt5.
What kind of backward compatibility issues were you afraid of? Are you now convinced that there are no backward compatibility issues? What changed your mind? |
|
Hi @rehanxt5, Previously, I had asked how you determine, as you have commented in your code:
and had asked for some understanding on:
This can prove to be critical if not addressed. Can you please have a look and help me understand how exactly do you determine the relevant parameters and prevent collisions? |
|
@saadte You're right to flag this. I had assumed
My recommendation would be:
I will implement this into
@jonbrenas I was afraid of breaking any filtering in the legacy function by removing |
Hey @jonbrenas @saadte, I have addressed the file collision issue in the
In #1114 I have done the same for |
|
Thanks, @rehanxt5. I have 2 main comments:
|
Solution
I have added LD pruning capabilities to the API, allowing users to inspect and export ADMIXTURE-compatible data in a single call. The implementation uses scikit-allel's locate_unlinked() to do the LD pruning.
I created a new module, ld_params.py, which holds the parameter descriptions for the LD pruning function, and ld.py, where the actual functions/methods live:
- biallelic_snps_ld_pruned(): performs LD pruning on the biallelic SNPs
- biallelic_snps_ld_pruned_to_plink(): generates the data in PLINK format, ready to use with ADMIXTURE
All the changes are discussed in detail below:
Changes
New Module: ld_params.py
Defines 3 configurable parameters for LD pruning:
- size (default 500): window size in number of SNPs
- step (default 200): how many SNPs to advance the window each iteration
- threshold (default 0.1): r² threshold for linkage disequilibrium
New Mixin: ld.py (AnophelesLdAnalysis)
Public method biallelic_snps_ld_pruned()
- min_minor_ac, max_missing_an, n_snps, cohort downsampling
Public method biallelic_snps_ld_pruned_to_plink()
Updated anopheles.py
- AnophelesLdAnalysis to the AnophelesDataResource class MRO (between AnophelesPca and PlinkConverter)
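To illustrate how the size, step and threshold parameters interact, here is a simplified windowed pruning pass in plain NumPy. It is a sketch, not scikit-allel's locate_unlinked(): it uses the ordinary Pearson correlation between alt-allele counts, assumes no missing calls and no invariant sites, and the `prune_ld` name and toy data are invented for illustration.

```python
import numpy as np


def prune_ld(gn, size, step, threshold):
    """Return a boolean mask of variants kept after windowed r² pruning.

    gn: 2-D array (variants x samples) of alt-allele counts.
    Simplified illustration; assumes every variant is variable.
    """
    n_variants = gn.shape[0]
    keep = np.ones(n_variants, dtype=bool)
    # Slide a window of `size` variants forward by `step` each iteration.
    for start in range(0, n_variants, step):
        stop = min(start + size, n_variants)
        for i in range(start, stop):
            if not keep[i]:
                continue
            for j in range(i + 1, stop):
                if not keep[j]:
                    continue
                r = np.corrcoef(gn[i], gn[j])[0, 1]
                # Drop the later variant of any pair in strong LD.
                if r**2 > threshold:
                    keep[j] = False
    return keep


# Invented toy data: 4 variants x 6 samples.
gn = np.array(
    [
        [0, 1, 2, 0, 1, 2],  # v0
        [0, 1, 2, 0, 1, 2],  # v1: identical to v0, r² = 1
        [2, 1, 0, 2, 1, 0],  # v2: mirror image of v0, r² = 1
        [1, 0, 1, 1, 2, 1],  # v3: uncorrelated with v0, r² = 0
    ]
)

keep = prune_ld(gn, size=4, step=2, threshold=0.1)
```

A step smaller than the window size makes consecutive windows overlap, so variants near a window edge are still compared with their neighbours on both sides.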
Tests
- test_biallelic_snps_ld_pruned: Basic functionality, caching, result validation
- test_biallelic_snps_ld_pruned_with_n_snps: Pre-filtering with SNP count limit
- test_biallelic_snps_ld_pruned_with_cohort_size: Individual downsampling
- test_biallelic_snps_ld_pruned_to_plink: PLINK export validation
Usage Example
Technical Details
- locate_unlinked() method
Testing
Closes #1049