pgscatalog / pgs-harmonizer

A pipeline to format and harmonise Polygenic Score (PGS) Catalog Scoring Files within and between different genome builds.

License: Apache License 2.0

Python 94.30% Perl 2.18% Nextflow 3.52%

pgs-harmonizer's People

Contributors

dependabot[bot], ens-lgil, smlmbrt

Forkers

ens-lgil

pgs-harmonizer's Issues

Problem variants?

Some reports concern variants that have a valid rsID but an "Unknown" source reference tag or an "NR" genome_build tag, though the problem also occurs in some novel variants. For example, in PGS003753_hmPOS_GRCh38.txt.gz, rs147937400 (T>C) seems to correspond to an original GRCh37 position (chr6 29068745) but appears unmapped in the PGS Catalog's GRCh38 harmonized score file, even though UCSC reports it mapped to chr6 29100968 in GRCh38.

Related to PGScatalog/pgsc_calc#137

Replace HmVCF with code based on pgscatalog_utils/match_variants

The current code is very slow - using match_variants would be much faster and more consistent with what we're doing in pgsc_calc.

This would use a variant of this script: https://github.com/PGScatalog/pgscatalog_utils/blob/main/pgscatalog_utils/match/match_variants.py

Currently, match_variants writes all possible matches to the log file as a CSV (along with which one was best), for example:

row_nr accession chr_name chr_position effect_allele other_allele effect_weight effect_type ID REF ALT matched_effect_allele match_type is_multiallelic ambiguous match_flipped best_match exclude duplicate_best_match duplicate_ID match_IDs match_status dataset
0 PGS000018_hmPOS_GRCh37 1 2245570 G C -0.0276009 additive 1:2245570:C:G C G G altref false true false true true false false false excluded 1000G-chr2
0 PGS000018_hmPOS_GRCh37 1 2245570 G C -0.0276009 additive 1:2245570:C:G C G C refalt_flip false true true false true false false false not_best 1000G-chr2
1 PGS000018_hmPOS_GRCh37 1 22132518 G A 0.023934 additive 1:22132518:G:A G A G refalt false false false true true false false false excluded 1000G-chr2
2 PGS000018_hmPOS_GRCh37 1 38386727 G A -0.0174935 additive 1:38386727:G:A G A G refalt false false false true true false false false excluded 1000G-chr2
3 PGS000018_hmPOS_GRCh37 1 55496039 T C 0.0293005 additive 1:55496039:T:C T C T refalt false false false true true false false false excluded 1000G-chr2
8 PGS000018_hmPOS_GRCh37 1 110298166 G C 0.0245969 additive 1:110298166:G:C G C G refalt false true false true true false false false excluded 1000G-chr2
8 PGS000018_hmPOS_GRCh37 1 110298166 G C 0.0245969 additive 1:110298166:G:C G C C altref_flip false true true false true false false false not_best 1000G-chr2
9 PGS000018_hmPOS_GRCh37 1 151762308 G C 0.0209215 additive 1:151762308:C:G C G G altref false true false true true false false false excluded 1000G-chr2
9 PGS000018_hmPOS_GRCh37 1 151762308 G C 0.0209215 additive 1:151762308:C:G C G C refalt_flip false true true false true false false false not_best 1000G-chr2
10 PGS000018_hmPOS_GRCh37 1 154395946 G A -0.0197906 additive 1:154395946:A:G A G G altref false false false true true false false false excluded 1000G-chr2
28 PGS000018_hmPOS_GRCh37 2 164945044 G C 0.0213456 additive 2:164945044:G:C G C G refalt false true false true true false false true excluded 1000G-chr2
28 PGS000018_hmPOS_GRCh37 2 164945044 G C 0.0213456 additive 2:164945044:G:C G C C altref_flip false true true false true false false true not_best 1000G-chr2
29 PGS000018_hmPOS_GRCh37 2 202799924 C T -0.0226885 additive 2:202799924:T:C T C C altref false false false true false false false true matched 1000G-chr2
30 PGS000018_hmPOS_GRCh37 2 203829225 A C -0.0526925 additive 2:203829225:A:C A C A refalt false false false true false false false true matched 1000G-chr2

The rows of this file could then be processed into the output of the current HmVCF, e.g. a single row per scoring file variant with information about how it was matched or excluded (the harmonisation code).
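As a rough illustration of that post-processing step, the sketch below collapses the candidate rows into one summary per scoring file variant. It assumes the log is whitespace-delimited with the header shown above; the function names (`parse_match_log`, `collapse_candidates`) and the summary fields are hypothetical, and it does not assign actual harmonisation codes, since that mapping is not specified here.

```python
# Three rows copied from the example log above (two candidates for row 0, one for row 29).
SAMPLE_LOG = """\
row_nr accession chr_name chr_position effect_allele other_allele effect_weight effect_type ID REF ALT matched_effect_allele match_type is_multiallelic ambiguous match_flipped best_match exclude duplicate_best_match duplicate_ID match_IDs match_status dataset
0 PGS000018_hmPOS_GRCh37 1 2245570 G C -0.0276009 additive 1:2245570:C:G C G G altref false true false true true false false false excluded 1000G-chr2
0 PGS000018_hmPOS_GRCh37 1 2245570 G C -0.0276009 additive 1:2245570:C:G C G C refalt_flip false true true false true false false false not_best 1000G-chr2
29 PGS000018_hmPOS_GRCh37 2 202799924 C T -0.0226885 additive 2:202799924:T:C T C C altref false false false true false false false true matched 1000G-chr2
"""


def parse_match_log(text):
    """Parse a whitespace-delimited match log into a list of dicts (all values kept as strings)."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    records = []
    for fields in rows:
        if len(fields) != len(header):
            raise ValueError(f"expected {len(header)} fields, got {len(fields)}")
        records.append(dict(zip(header, fields)))
    return records


def collapse_candidates(records):
    """Return one summary dict per (accession, row_nr), preferring the best_match candidate."""
    chosen = {}   # (accession, row_nr) -> candidate kept for the summary
    counts = {}   # (accession, row_nr) -> number of candidate matches seen
    for rec in records:
        key = (rec["accession"], int(rec["row_nr"]))
        counts[key] = counts.get(key, 0) + 1
        current = chosen.get(key)
        # Keep the first candidate seen, but let a best_match row replace a non-best one.
        if current is None or (rec["best_match"] == "true" and current["best_match"] != "true"):
            chosen[key] = rec
    return [
        {
            "accession": accession,
            "row_nr": row_nr,
            "ID": rec["ID"],
            "match_type": rec["match_type"],
            "match_status": rec["match_status"],
            "n_candidates": counts[(accession, row_nr)],
        }
        for (accession, row_nr), rec in chosen.items()
    ]


collapsed = collapse_candidates(parse_match_log(SAMPLE_LOG))
for row in collapsed:
    print(row)
```

A real implementation would still need to translate `match_type` and `match_status` into the harmonisation code that HmVCF currently reports, and to decide how variants with no candidate rows at all should be represented.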
