pgscatalog / pgs_catalog

An open database of polygenic scores and relevant metadata needed to apply and evaluate them correctly.

License: Apache License 2.0

Python 58.90% JavaScript 10.04% HTML 25.42% Shell 0.10% SCSS 5.52% CSS 0.03%

pgs-catalog django website polygenic-scores rest-api

pgs_catalog's People

Contributors

ens-lgil smlmbrt hdruk akhilpampana fyvon

pgs_catalog's Issues

REST API: Feature request: allow to include trait children with /rest/trait/all

It would be nice to provide the parameter include_children to the /rest/trait/all endpoint.

Right now what I do, as an alternative, is to get all EFO identifiers with /rest/trait/all first, and then run one request to /rest/trait/{trait_id} for each EFO identifier; not ideal.
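The two-step workaround described above can be sketched as follows. This is only an illustration: it builds the per-trait URLs without performing any network request, and it assumes that a page returned by /rest/trait/all has a "results" list whose entries carry an "id" key (the page shown in the code is made up).

```python
from urllib.parse import urlencode

BASE_URL = "https://www.pgscatalog.org/rest"


def trait_ids(page):
    """Collect trait identifiers from one page of /rest/trait/all.

    Assumption: each entry in page["results"] has an "id" key.
    """
    return [entry["id"] for entry in page.get("results", [])]


def trait_url(trait_id, include_children=True):
    """Build the /rest/trait/{trait_id} URL for a single trait."""
    query = urlencode({"include_children": int(include_children), "format": "json"})
    return f"{BASE_URL}/trait/{trait_id}?{query}"


# Made-up page standing in for a real /rest/trait/all response.
page = {"results": [{"id": "EFO_0004339"}, {"id": "EFO_0004340"}]}
urls = [trait_url(trait_id) for trait_id in trait_ids(page)]
```

Each URL in `urls` then needs its own request, which is the cost that an include_children parameter on /rest/trait/all would remove.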

Recode "NR" as null in method_params

curl -s -X GET "https://www.pgscatalog.org/rest/score/PGS000737?format=json" -H  "accept: application/json" | jq '.' | grep method_params

"method_params": "NR",

/rest/trait/{trait_id} does not seem to include a `child_traits` element

The endpoint /rest/trait/{trait_id} should return an EFOTrait_OntologyChild element as per the schema. This object type should include a child_traits element at the top level.

I have just played with the examples here: https://www.pgscatalog.org/rest/#/Trait%20endpoints/getTrait but it seems that the element child_traits is never included in the response object, resembling the response of type EFOTrait_Ontology.

BTW: Would you consider making both endpoints return objects of type EFOTrait_OntologyChild?

Highlight when publication is a preprint

We should have a flag, badge, or something that marks a Publication as a preprint. Maybe a button with hover text that explains what a preprint is: "manuscript has not undergone peer review"

Add 'autocomplete' to search

Would be nice to have, but not high priority.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html#completion-suggester

Recode 'NR' as null in JSON responses

Oftentimes there are missing values which come coded as "NR" (for "Not Reported", I believe).

For the REST API user it would be best if these came as null in the JSON responses. I already have to re-code null values to NA (Not Available) anyway. I don't know how easy it would be for you to do this, but if you could do this re-coding server-side it would simplify and improve efficiency of parsing on the client side.

Some examples of object keys whose values I found to be "NR":

"ancestry_free": "NR"
"ancestry_country": "Australia, U.K., NR, U.S."
"variants_genomebuild": "NR"
"ancestry_free": "NR"
"ancestry_broad": "NR"
"ancestry_country": "NR"
"method_params": "NR"
"ancestry_country": "U.S., Australia, Canada, NR"

Not sure about those cases where the value is not a single "NR", e.g., "Australia, U.K., NR, U.S." or "U.S., Australia, Canada, NR". It is probably best to leave these cases as they are.
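For reference, the client-side re-coding that this request would make unnecessary might look like the sketch below. It only replaces values that are exactly "NR" and leaves composite strings untouched; the function name `recode_nr` is made up for illustration.

```python
def recode_nr(value):
    """Recursively replace values that are exactly "NR" with None.

    Composite strings such as "Australia, U.K., NR, U.S." are left as they are.
    """
    if isinstance(value, dict):
        return {key: recode_nr(item) for key, item in value.items()}
    if isinstance(value, list):
        return [recode_nr(item) for item in value]
    if value == "NR":
        return None
    return value


sample = {
    "ancestry_free": "NR",
    "ancestry_country": "Australia, U.K., NR, U.S.",
    "method_params": "NR",
}
cleaned = recode_nr(sample)
```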

500 Internal Server Error with `https://www.pgscatalog.org/rest/score/search?pmid=PGP000003&format=json`

https://www.pgscatalog.org/rest/score/search?pmid=PGP000003&format=json triggers a 500 Internal Server Error.

NB: While I do understand that PGP000003 is not a possible value for pmid, the error shouldn't be an internal server error nevertheless.
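A framework-free sketch of the kind of input check meant here is given below. The handler shape, the function name `validate_pmid`, the 400 status and the error message are all assumptions for illustration, not the catalog's actual code.

```python
import re


def validate_pmid(raw):
    """Return (status_code, payload) for a hypothetical pmid search handler.

    A pmid made of anything other than ASCII digits gets a 400 answer
    instead of propagating as an unhandled exception (500).
    """
    if not re.fullmatch(r"[0-9]+", raw or ""):
        return 400, {"error": f"Invalid pmid {raw!r}: expected digits only"}
    return 200, {"pmid": int(raw)}


bad_status, bad_payload = validate_pmid("PGP000003")
ok_status, ok_payload = validate_pmid("12345678")  # arbitrary numeric value
```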

Tracking GCSTs that need updating/merging

Sometimes we index GWAS studies which are pre-release and don’t have curated metadata (e.g. missing sample numbers that are reported as NR). We should check which can be updated periodically. A current example is: https://www.ebi.ac.uk/gwas/studies/GCST90137411

Discussion: scoring file formats

List of things to check/consider:

missing rsIDs can cause problems for some pipelines. Should we replace with . as is given to us in some files? Need to check how many PGS this affects.

Replacement for the "Browse Scores" page

The web page is currently broken because of the large amount of data and charts to display.
The idea is to add a search form to limit the scores displayed (for a given trait, for instance).

#275

Potential bug with annotation trait URI : Ischemic stroke

The PGS traits dataset (pgs_traits_data.csv) entry for ischemic stroke (http://purl.obolibrary.org/obo/HP_0002140) is, I think, malformed.

The current value is "https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http://purl.obolibrary.org/obo/HP_0002140"

Whereas it should be "http://purl.obolibrary.org/obo/HP_0002140"
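As a stopgap on the consumer side, the expected value can be recovered from the current one by reading its iri query parameter. A minimal sketch, where `canonical_trait_uri` is a made-up helper name:

```python
from urllib.parse import parse_qs, urlparse


def canonical_trait_uri(url):
    """Return the `iri` query parameter if the URL has one, else the URL unchanged."""
    iri = parse_qs(urlparse(url).query).get("iri")
    return iri[0] if iri else url


current = (
    "https://www.ebi.ac.uk/ols/ontologies/efo/terms"
    "?iri=http://purl.obolibrary.org/obo/HP_0002140"
)
fixed = canonical_trait_uri(current)
```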

New Web Display

Restart the new web display, as quite a lot of code has changed since the first PR.

/rest/cohort/NICCC returns two hits -- name_full contains "center"/"centre" with American and British spellings

Possible mistake here?

curl -X GET "https://www.pgscatalog.org/rest/cohort/NICCC" -H  "accept: application/json" | jq '.'

{
  "size": 2,
  "count": 2,
  "next": null,
  "previous": null,
  "results": [
    {
      "name_short": "NICCC",
      "name_full": "National Israeli Cancer Control Centre",
      "associated_pgs_ids": {
        "development": [
          "PGS000721"
        ],
        "evaluation": []
      }
    },
    {
      "name_short": "NICCC",
      "name_full": "National Israeli Cancer Control Center",
      "associated_pgs_ids": {
        "development": [],
        "evaluation": [
          "PGS000004",
          "PGS000005",
          "PGS000006",
          "PGS000351",
          "PGS000352"
        ]
      }
    }
  ]
}
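Until the two records are reconciled, a client can merge them by name_short. The sketch below uses the response shown above as input; `merge_by_short_name` is a made-up helper, and merging on name_short alone is an assumption that would be wrong if two distinct cohorts ever shared an acronym.

```python
def merge_by_short_name(results):
    """Group cohort records by name_short and pool their associated PGS IDs."""
    merged = {}
    for cohort in results:
        entry = merged.setdefault(
            cohort["name_short"],
            {"name_full": set(), "development": [], "evaluation": []},
        )
        entry["name_full"].add(cohort["name_full"])
        for stage in ("development", "evaluation"):
            entry[stage].extend(cohort["associated_pgs_ids"][stage])
    return merged


results = [
    {
        "name_short": "NICCC",
        "name_full": "National Israeli Cancer Control Centre",
        "associated_pgs_ids": {"development": ["PGS000721"], "evaluation": []},
    },
    {
        "name_short": "NICCC",
        "name_full": "National Israeli Cancer Control Center",
        "associated_pgs_ids": {
            "development": [],
            "evaluation": ["PGS000004", "PGS000005", "PGS000006", "PGS000351", "PGS000352"],
        },
    },
]
merged = merge_by_short_name(results)
```

A cohort whose merged name_full set has more than one element is a candidate duplicate worth reporting.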

`count` field in `ancestry_distribution` in `scores`

Hi PGS Catalog Team

Another question here about the count field in ancestry_distribution in scores.

With count you are conflating two different concepts: sample size and number of sample sets. Have you considered splitting them in two different fields?

Here's how I'm representing this data on the client side, e.g., PGS000018:

$ancestries
# A tibble: 3 x 7
  pgs_id    stage sample_size n_sample_sets ..resource                                                  ..timestamp         ..page
  <chr>     <chr>       <dbl>         <dbl> <chr>                                                       <dttm>               <int>
1 PGS000018 gwas       382026             0 https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1
2 PGS000018 dev          3000             0 https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1
3 PGS000018 eval           NA            16 https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1

$ancestry_frequencies
# A tibble: 12 x 7
   pgs_id    stage ancestry_class_symbol frequency ..resource                                                  ..timestamp         ..page
   <chr>     <chr> <chr>                     <dbl> <chr>                                                       <dttm>               <int>
 1 PGS000018 gwas  AFR                         0.8 https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1
 2 PGS000018 gwas  AMR                         1.1 https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1
 3 PGS000018 gwas  EAS                         3   https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1
 4 PGS000018 gwas  EUR                        37   https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1
 5 PGS000018 gwas  GME                         0.6 https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1
 6 PGS000018 gwas  MAE                        50.9 https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1
 7 PGS000018 gwas  SAS                         6.7 https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1
 8 PGS000018 dev   MAE                       100   https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1
 9 PGS000018 eval  AFR                        12.5 https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1
10 PGS000018 eval  AMR                        12.5 https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1
11 PGS000018 eval  EUR                        68.8 https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1
12 PGS000018 eval  MAE                         6.2 https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1

$multi_ancestry_composition
# A tibble: 6 x 7
  pgs_id    stage multi_ancestry_class_symbol ancestry_class_symbol ..resource                                                  ..timestamp         ..page
  <chr>     <chr> <chr>                       <chr>                 <chr>                                                       <dttm>               <int>
1 PGS000018 gwas  MAE                         EUR                   https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1
2 PGS000018 gwas  MAE                         SAS                   https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1
3 PGS000018 dev   MAE                         EUR                   https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1
4 PGS000018 dev   MAE                         NR                    https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1
5 PGS000018 eval  MAE                         EUR                   https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1
6 PGS000018 eval  MAE                         NR                    https://www.pgscatalog.org/rest/score/PGS000018?format=json 2021-05-07 13:08:04      1

So I split count into sample_size and n_sample_sets, and then I set n_sample_sets to zero for stages gwas and dev, and make sample_size NA (Not Available) for stage eval. Perhaps you could provide the total sample size for stage eval too?
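The rule just described can be written down as the following sketch. The input shape (one "count" per stage inside ancestry_distribution) is an assumption about the JSON layout; the numbers are those shown for PGS000018 above.

```python
def split_count(stage, count):
    """Split `count` into sample_size and n_sample_sets depending on the stage.

    gwas/dev: count is a number of individuals.
    eval: count is a number of sample sets; the sample size is unknown (None).
    """
    if stage in ("gwas", "dev"):
        return {"sample_size": count, "n_sample_sets": 0}
    if stage == "eval":
        return {"sample_size": None, "n_sample_sets": count}
    raise ValueError(f"Unknown stage: {stage!r}")


# Assumed layout of ancestry_distribution, reduced to the count field.
ancestry_distribution = {
    "gwas": {"count": 382026},
    "dev": {"count": 3000},
    "eval": {"count": 16},
}
ancestries = {
    stage: split_count(stage, info["count"])
    for stage, info in ancestry_distribution.items()
}
```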

ancestries' ontology and /rest/ancestry_categories

Hi PGS Catalog Team

Now that you have updated your ancestries' ontology by including the multi-ancestry categories, namely, Multi-Ancestry (including Europeans) and Multi-Ancestry (excluding Europeans), and also added the display categories, I would like to take the opportunity to ascertain if my assumptions are correct about how each relate to one another, and whether some renaming would be appropriate.

As I see it, we have now three levels of ancestry description (using my own wording here to refer to these levels):

1. Ancestry (ancestry_free in Sample). This corresponds to the most basic level of description. Examples include Chinese, Japanese, Korean, Spanish, Swedish, Brazilian, Mexican, etc. Essentially, the concepts given as examples in the third column of Table 1 of Morales et al.
2. Ancestry category (ancestry_broad in Sample). This is the ancestry category from the NHGRI-EBI GWAS Catalog framework, first column of Table 1 of Morales et al., plus a couple of new categories, i.e., the multi-ancestry categories defined by you guys: Multi-Ancestry (including Europeans) and Multi-Ancestry (excluding Europeans).
3. Ancestry class. This is a grouping of ancestry categories for visualisation purposes. You refer to this level by several names, such as: Display category (for charts) and Display categories in /docs/ancestry/ and Sample Ancestry in /docs/#desc_smpl.

As you can see from my bullet point no. 3, I am planning to refer to display categor(y/ies) as ancestry class(es) in quincunx. The reasons are:

the term display categories has categories in it, which, in my opinion, might give the wrong impression that these are on the same hierarchical level as ancestry categories;
having the word display reflects your original intention of using these new groupings for more effective visualisation in the web GUI, but I think it'd be better if the term were agnostic of its original end application;
finally, the word class is already used to mean a group of categories in the context of Thematic Framework analysis, so it would be a nice co-opt.

Regarding the /rest/ancestry_categories that provides the mapping of ancestry symbols and their name, I think there are two issues here:

the mapping is actually between ancestry class symbols and ancestry class names, not between ancestry category symbols and ancestry category names, so the endpoint name might be a bit of a misnomer; but again, I know you're not calling it classes...
the ancestry class names returned by /rest/ancestry_categories are in some cases simplified, i.e., not including the terms in brackets, e.g., "Additional Asian Ancestries" instead of "Additional Asian Ancestries (including Central, and South East Asian)". This is fine if all one is doing is consultation, but for programmatic analysis, where these values might be matched across the database, this can be a problem.

I think it would be more useful if a full table of all these ancestries were returned. Currently, in quincunx, I have a saved dataset of ancestries that simultaneously provides the mapping between ancestry categories and classes, the mapping between ancestry classes and their symbol (as provided by /rest/ancestry_categories), the hexadecimal colour code of the class, and the ancestry category definition as given in your documentation. Perhaps you could provide these data in /rest/ancestry_categories instead?

# A tibble: 19 x 6
   ancestry_category                       ancestry_class                           ancestry_class_sy… ancestry_class_co… definition                                                                                                                        examples            
   <chr>                                   <chr>                                    <chr>              <chr>              <chr>                                                                                                                             <chr>               
 1 Aboriginal Australian                   Additional Diverse Ancestries            OTH                #999999            "Includes individuals who either self-report or have been described by authors as Australian Aboriginal. These are expected to b… Martu Australian Ab…
 2 African American or Afro-Caribbean      African                                  AFR                #FFD900            "Includes individuals who either self-report or have been described by authors as African American or Afro-Caribbean. This categ… African American, A…
 3 African unspecified                     African                                  AFR                #FFD900            "Includes individuals that either self-report or have been described as African, but there was not sufficient information to all… African, non-Hispan…
 4 Asian unspecified                       Additional Asian Ancestries (including … ASN                #B15928            "Includes individuals that either self-report or have been described as Asian but there was not sufficient information to allow … Asian, Asian Americ…
 5 Central Asian                           Additional Asian Ancestries (including … ASN                #B15928            "Includes individuals who either self-report or have been described by authors as Central Asian. We note that there does not app… Silk Road (founder/…
 6 East Asian                              East Asian                               EAS                #4DAF4A            "Includes individuals who either self-report or have been described by authors as East Asian or one of the sub-populations from … Chinese, Japanese, …
 7 European                                European                                 EUR                #377EB8            "Includes individuals who either self-report or have been described by authors as European, Caucasian, white, or one of the sub-… Spanish, Swedish    
 8 Greater Middle Eastern (Middle Eastern… Greater Middle Eastern (Middle Eastern,… GME                #00CED1            "Includes individuals who self-report or were described by authors as Middle Eastern, North African, Persian, or one of the sub-… Tunisian, Arab, Ira…
 9 Hispanic or Latin American              Hispanic or Latin American               AMR                #E41A1C            "Includes individuals who either self-report or are described by authors as Hispanic, Latino, Latin American, or one of the sub-… Brazilian, Mexican  
10 Native American                         Additional Diverse Ancestries            OTH                #999999            "Includes indigenous individuals of North, Central, and South America, descended from the original human migration into the Amer… Pima Indian, Plains…
11 Not reported                            Ancestry Not Reported                    NR                 #BBBBBB            "Includes individuals for which no ancestry or country of recruitment information is available"                                   NA                  
12 Oceanian                                Additional Diverse Ancestries            OTH                #999999            "Includes individuals that either self-report or have been described by authors as Oceanian or one of the sub-populations from t… Solomon Islander, M…
13 Other                                   Additional Diverse Ancestries            OTH                #999999            "Includes individuals where an ancestry descriptor is known but insufficient information is available to allow assignment to one… Surinamese, Russian 
14 Other admixed ancestry                  Additional Diverse Ancestries            OTH                #999999            "Includes individuals who either self-report or have been described by authors as admixed and do not fit the definition of the o… NA                  
15 South Asian                             South Asian                              SAS                #984EA3            "Includes individuals who either self-report or have been described by authors as South Asian or one of the sub-populations from… Bangladeshi, Sri La…
16 South East Asian                        Additional Asian Ancestries (including … ASN                #B15928            "Includes individuals who either self-report or have been described by authors as South East Asian or one of the sub-populations… Thai, Malay         
17 Sub-Saharan African                     African                                  AFR                #FFD900            "Includes individuals who either self-report or have been described by authors as Sub-Saharan African or one of the sub-populati… Yoruban, Gambian    
18 Multi-Ancestry (including Europeans)    Multi-Ancestry (including Europeans)     MAE                #A6CEE3            "Combined sample of multiple ancestries that includes European ancestry individuals. Used when ancestry-specific sample sizes ar… NA                  
19 Multi-Ancestry (excluding Europeans)    Multi-Ancestry (excluding Europeans)     MAO                #FF7F00            "Combined sample of multiple ancestries that does not include any European ancestry individuals. Used when ancestry-specific sam… NA

I am happy to hear your thoughts on this.

API Issues Re: GWAS Catalog linkage

See EBISPOT/goci#435 for more information re: CORS. Also it appears the GCST endpoint they need is also not working (e.g. https://www.pgscatalog.org/rest/gwas/get_score_ids/GCST004988), despite the matched landing page working (https://www.pgscatalog.org/gwas/GCST004988/).

Provide population reference calculations and distributions

Hi PGS Catalog team

This is a feature request.

While answering one of the comments to your medRxiv preprint, @smlmbrt mentioned that you were "(...) exploring ways to to add population reference calculations and distributions (e.g. percentiles) to aid end-user applications and interpretation , and hope to add those features in the future." (from a reply to Charles Warden).

So, I am just adding this request here to GitHub issues to keep track of its progress.

EFO trait "Orphanet_130" (Brugada syndrome) is not found

Hi PGS Catalog team,

On the website, searching for the trait "Orphanet_130" shows two results:

but clicking on the Brugada syndrome trait gives an error 404.

On the REST API the response to https://www.pgscatalog.org/rest/trait/Orphanet_130?include_children=0&format=json is 200 but empty.

multiple effect and reference allele definitions in PGS001405

Dear PGScatalog developers,

Thanks for this very nice and comprehensive resource.

We want to calculate PGS for height and we tried using https://www.pgscatalog.org/score/PGS001405/. However, we noticed that on some rows there are multiple nucleotides listed for effect and/or reference allele. How should this be interpreted? Here are some examples:

rs71579661      1       160151624       GC      G       2.117951e-02
rs35769739      12      106500158       G       GT      -2.205375e-02
rs200768101     17      42981237        GGGA    G       2.630668e-02
rs138731997     19      41888850        A       AGGGGACTGGGC    -4.669412e-02

Also on some rows the rsID is missing:

        10      26518418        T       C       1.159232e-02
        19      49244218        CAA     C       4.904787e-02
        19      52004795        GT      G       5.072118e-02
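One way a pipeline could cope with such rows is sketched below. It assumes that the file is tab-delimited with the six columns in the order shown, and that alleles of different lengths denote insertions or deletions, as is conventional in VCF-style notation; which of the two allele columns is the effect allele is deliberately left open. The `chrom:pos:allele_1:allele_2` fallback identifier is made up for illustration.

```python
def variant_type(allele_1, allele_2):
    """Classify a variant from the lengths of its two alleles."""
    if len(allele_1) == 1 and len(allele_2) == 1:
        return "SNV"
    if len(allele_1) != len(allele_2):
        return "indel"
    return "MNV"  # same length, more than one base


def parse_row(line):
    """Parse one tab-delimited row; fall back to a positional key if rsID is empty."""
    rsid, chrom, pos, allele_1, allele_2, weight = line.rstrip("\n").split("\t")
    rsid = rsid.strip()
    return {
        "key": rsid or f"{chrom}:{pos}:{allele_1}:{allele_2}",
        "type": variant_type(allele_1, allele_2),
        "weight": float(weight),
    }


with_rsid = parse_row("rs71579661\t1\t160151624\tGC\tG\t2.117951e-02")
without_rsid = parse_row("\t19\t49244218\tCAA\tC\t4.904787e-02")
```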

Many thanks for your help!

sample size for PGS000737 does not match source publication's values

The score https://www.pgscatalog.org/score/PGS000737/ indicates a sample size of 1,427 individuals.

However, in the source publication 10.1093/eurheartj/ehz435, I only find the numbers 1,400 and 1,368 after filtering.

Mistake here?

Reformat the PGS Scoring File headers

The current headers are hard to parse and work with in the harmonisation pipeline. Propose the following fix:

Current:

### PGS CATALOG SCORING FILE - see www.pgscatalog.org/downloads/#dl_ftp for additional information
## POLYGENIC SCORE (PGS) INFORMATION
# PGS ID = PGS000001
# Reported Trait = Breast Cancer
# Original Genome Build = NR
# Number of Variants = 77
## SOURCE INFORMATION
# PGP ID = PGP000001
# Citation = Mavaddat N et al. J Natl Cancer Inst (2015). doi:10.1093/jnci/djv036

Potentially:

###PGS CATALOG SCORING FILE - see www.pgscatalog.org/downloads/#dl_ftp for additional information
##POLYGENIC SCORE (PGS) INFORMATION
#pgs_id=PGS000001
#trait_reported=Breast Cancer
#genome_build=NR
#variants_number=77
##SOURCE INFORMATION
#pgp_id=PGP000001
#citation=Mavaddat N et al. J Natl Cancer Inst (2015). doi:10.1093/jnci/djv036

This would change code in the initial loading of the PGS files from the raw/source files.
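To illustrate why the proposed layout is easier to work with, here is a minimal parser sketch for it. It assumes that metadata lines start with a single # and contain key=value, and that lines starting with ## or ### are section banners to skip; `parse_header` is a made-up name.

```python
def parse_header(lines):
    """Collect key=value metadata from the proposed scoring file header."""
    meta = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("#") or line.startswith("##"):
            continue  # data row, or a ## / ### section banner
        key, sep, value = line[1:].partition("=")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


header = """\
###PGS CATALOG SCORING FILE - see www.pgscatalog.org/downloads/#dl_ftp for additional information
##POLYGENIC SCORE (PGS) INFORMATION
#pgs_id=PGS000001
#trait_reported=Breast Cancer
#genome_build=NR
#variants_number=77
##SOURCE INFORMATION
#pgp_id=PGP000001
#citation=Mavaddat N et al. J Natl Cancer Inst (2015). doi:10.1093/jnci/djv036
""".splitlines()
meta = parse_header(header)
```

With the current layout the same parser would need to normalise free-text keys such as "Original Genome Build" and tolerate spaces around the equals sign.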

EFO traits' description with extra punctuation

The descriptions of some of the EFO traits are enclosed in ['...'] or ["..."].

Examples:

    efo_id       trait                                                                                   description                                              
    <chr>        <chr>                                                                                   <chr>                                                    
  1 EFO_0009460  ACPA-negative rheumatoid arthritis                                                      "['A subtype of rheumatoid arthritis defined by the abse…
  2 EFO_0009459  ACPA-positive rheumatoid arthritis                                                      "['A subtype of rheumatoid arthritis defined by the pres…
  3 EFO_0005056  age at death                                                                            "['The age at which death occurs.']"                     
  4 EFO_0007878  alcohol consumption measurement                                                         "['quantification of some aspect of alcohol consumption …
  5 EFO_0007835  alcohol dependence measurement                                                          "['quantification of some aspect of alcohol dependence o…
  6 EFO_0004533  alkaline phosphatase measurement                                                        "['Alkaline phosphatase measurement is a quantification …
  7 EFO_1001870  late-onset Alzheimers disease                                                           "['This is the most common form of the disease, which ha…
  8 EFO_0003913  angina pectoris                                                                         "['The symptom of paroxysmal pain consequent to MYOCARDI…
  9 EFO_0004614  apolipoprotein A 1 measurement                                                          "['Is a quantification of serum lipoprotein A. Apolipopr…
 10 EFO_0004615  apolipoprotein B measurement                                                            "['The measurement of ApoB in blood. Apolipoprotein B is…
 11 EFO_0004736  aspartate aminotransferase measurement                                                  "['Is a quantification of aspartate aminotransferase, an…
 12 EFO_0010934  aspartate aminotransferase to alanine aminotransferase ratio                            "['The ratio between the levels of aspartate aminotransf…
 13 EFO_0005090  basophil count                                                                          "['quantification of basophils in the blood', 'The numbe…
 14 EFO_0007992  basophil percentage of leukocytes                                                       "['A calculated measurement in which the number of basop…
 15 EFO_0007992  basophil percentage of leukocytes                                                       "['A calculated measurement in which the number of basop…
 16 EFO_0004570  bilirubin measurement                                                                   "['A bilirubin measurement is a quantification of biliru…
 17 EFO_0007937  blood protein measurement                                                               "['quantification of the levels of some protein in a blo…
 18 EFO_0008036  BMI-adjusted fasting blood glucose measurement                                          "[\"fasting blood glucose measurement that has been adju…
 19 EFO_0008037  BMI-adjusted fasting blood insulin measurement                                          "[\"fasting insulin measurement that has been adjusted f…
 20 EFO_0007788  BMI-adjusted waist-hip ratio                                                            "['waist-hip ratio that has been adjusted by subjects’ b…
 21 EFO_0004339  body height                                                                             "['The distance from the sole to the crown of the head w…
 22 EFO_0004340  body mass index                                                                         "['An indicator of body density as determined by the rel…
 23 EFO_0003923  bone density                                                                            "['The amount of mineral per square centimeter of BONE. …
 24 EFO_0007772  calcaneal bone quantitative ultrasound measurement                                      "['bone quantitiave ultrasound of the main bone in the h…
 25 EFO_0004838  calcium measurement                                                                     "['Is a quantification of calcium, typically in serum. C…
 26 EFO_1001958  high grade ovarian serous adenocarcinoma                                                "['A rapidly growing serous adenocarcinoma that arises f…
 27 EFO_1001516  ovarian serous carcinoma                                                                "['serous carcinoma located in the ovary']"              
 28 EFO_0008328  chronotype measurement                                                                  "['quantification of some aspect of chronotype such as e…
 29 EFO_0007710  cognitive decline measurement                                                           "[\"quantification of some aspect of cognitive decline s…
 30 EFO_0009819  comparative body size at age 10, self-reported                                          "[\"Description of an individual's body size at age 10 c…
 31 EFO_0009518  complication                                                                            "['Any disease or disorder that occurs during the course…
 32 EFO_0004458  C-reactive protein measurement                                                          "['C-reactive protein (CRP) measurement is a measurement…
 33 EFO_0007934  creatinine clearance measurement                                                        "['The clearance rate of creatinine, that is, the volume…
 34 EFO_0004518  creatinine measurement                                                                  "['A creatinine measurement is a measure of the metaboli…
 35 EFO_0004617  cystatin C measurement                                                                  "['is a quantification of serum cystatin C C (formerly g…
 36 EFO_0007006  depressive symptom measurement                                                          "['quantification of the existence and severity of depre…
 37 EFO_0006336  diastolic blood pressure                                                                "['The blood pressure after the contraction of the heart…
 38 EFO_0004842  eosinophil count                                                                        "['Is a quantification of eosinphils in blood.', 'The nu…
 39 EFO_0007991  eosinophil percentage of leukocytes                                                     "['A calculated measurement in which the number of eosin…
 40 EFO_0007991  eosinophil percentage of leukocytes                                                     "['A calculated measurement in which the number of eosin…
 41 EFO_0004305  erythrocyte count                                                                       "['The number of red blood cells\\xa0per unit volume in …
 42 EFO_0004465  fasting blood glucose measurement                                                       "['An fasting blood glucose measurement is a measurement…
 43 EFO_0008036  BMI-adjusted fasting blood glucose measurement                                          "[\"fasting blood glucose measurement that has been adju…
 44 EFO_0004466  fasting blood insulin measurement                                                       "['A fasting blood insulin measurement is a measurement …
 45 EFO_0008037  BMI-adjusted fasting blood insulin measurement                                          "[\"fasting insulin measurement that has been adjusted f…
 46 PATO_0000383 female                                                                                  "['A biological sex quality inhering in an individual or…
 47 EFO_1001958  high grade ovarian serous adenocarcinoma                                                "['A rapidly growing serous adenocarcinoma that arises f…
 48 EFO_1001516  ovarian serous carcinoma                                                                "['serous carcinoma located in the ovary']"              
 49 EFO_0006829  GFR change measurement                                                                  "[\"A quantification of the variation in an individual's…
 50 EFO_0005208  glomerular filtration rate                                                              "['measurement of the flow rate of filtered fluid throug…
 51 EFO_0006829  GFR change measurement                                                                  "[\"A quantification of the variation in an individual's…
 52 EFO_0004468  glucose measurement                                                                     "['Is any quantification of glucose.']"                  
 53 EFO_0004465  fasting blood glucose measurement                                                       "['An fasting blood glucose measurement is a measurement…
 54 EFO_0008036  BMI-adjusted fasting blood glucose measurement                                          "[\"fasting blood glucose measurement that has been adju…
 55 EFO_0004541  HbA1c measurement                                                                       "['A quantification of glycated A1c hemoglobin in blood …
 56 EFO_0004541  HbA1c measurement                                                                       "['A quantification of glycated A1c hemoglobin in blood …
 57 EFO_0004348  hematocrit                                                                              "['The volume of packed RED BLOOD CELLS in a blood speci…
 58 EFO_0004509  hemoglobin measurement                                                                  "['hemoglobin levels', 'Hemoglobin measurement is a meas…
 59 EFO_0004541  HbA1c measurement                                                                       "['A quantification of glycated A1c hemoglobin in blood …
 60 EFO_0004528  mean corpuscular hemoglobin concentration                                               "['The mean corpuscular hemoglobin concentration is a me…
 61 EFO_0004612  high density lipoprotein cholesterol measurement                                        "['The measurement of HDL cholesterol in blood used as a…
 62 EFO_1001958  high grade ovarian serous adenocarcinoma                                                "['A rapidly growing serous adenocarcinoma that arises f…
 63 EFO_0004627  IGF-1 measurement                                                                       "['Is the quantification of Insulin-like growth factor 1…
 64 EFO_0002614  insulin resistance                                                                      "['diminished effectiveness of insulin in lowering plasm…
 65 EFO_0008001  insulin secretion measurement                                                           "['Measurement of compounds, generally C-peptide or matu…
 66 EFO_0004695  intraocular pressure measurement                                                        "['Is a quantification of intraocular pressure. Increase…
 67 EFO_1001870  late-onset Alzheimers disease                                                           "['This is the most common form of the disease, which ha…
 68 EFO_0008206  left ventricular systolic function measurement                                          "['quantification of some aspect of the systolic functio…
 69 EFO_0004308  leukocyte count                                                                         "['The number of\\xa0WHITE BLOOD CELLS\\xa0per unit volu…
 70 EFO_0007992  basophil percentage of leukocytes                                                       "['A calculated measurement in which the number of basop…
 71 EFO_0007990  neutrophil percentage of leukocytes                                                     "['A calculated measurement in which the number of neutr…
 72 EFO_0007989  monocyte percentage of leukocytes                                                       "['A calculated measurement in which the number of monoc…
 73 EFO_0005091  monocyte count                                                                          "['quantification of monocytes in the blood']"           
 74 EFO_0005090  basophil count                                                                          "['quantification of basophils in the blood', 'The numbe…
 75 EFO_0004587  lymphocyte count                                                                        "['A quantification of lymphocytes in blood.']"          
 76 EFO_0004842  eosinophil count                                                                        "['Is a quantification of eosinphils in blood.', 'The nu…
 77 EFO_0004833  neutrophil count                                                                        "['Is a quantification of neutrophils in blood.', 'The n…
 78 EFO_0007993  lymphocyte percentage of leukocytes                                                     "['A calculated measurement in which the number of lymph…
 79 EFO_0007991  eosinophil percentage of leukocytes                                                     "['A calculated measurement in which the number of eosin…
 80 EFO_0006925  lipoprotein A measurement                                                               "['quantification of some lipoprotein A in a sample']"   
 81 EFO_0010821  liver fat measurement                                                                   "['A quantification of the fat content of the liver such…
 82 EFO_0004300  longevity                                                                               "[\"The  length of time of an organism's life.\"]"       
 83 EFO_0004611  low density lipoprotein cholesterol measurement                                         "['The measurement of LDL cholesterol in blood used as a…
 84 EFO_0004587  lymphocyte count                                                                        "['A quantification of lymphocytes in blood.']"          
 85 EFO_0007993  lymphocyte percentage of leukocytes                                                     "['A calculated measurement in which the number of lymph…
 86 EFO_0007993  lymphocyte percentage of leukocytes                                                     "['A calculated measurement in which the number of lymph…
 87 PATO_0000384 male                                                                                    "['A biological sex quality inhering in an individual or…
 88 EFO_0004527  mean corpuscular hemoglobin                                                             "['The MCH is  the average mass of hemoglobin per red bl…
 89 EFO_0004528  mean corpuscular hemoglobin concentration                                               "['The mean corpuscular hemoglobin concentration is a me…
 90 EFO_0004526  mean corpuscular volume                                                                 "['A mean corpuscular volume is the result of calculatio…
 91 EFO_0004584  mean platelet volume                                                                    "['A measurement of mean platelet volume is a machine-ca…
 92 EFO_0010701  mean reticulocyte volume                                                                "['Mean volume of reticulocyte cells']"                  
 93 EFO_0005091  monocyte count                                                                          "['quantification of monocytes in the blood']"           
 94 EFO_0007989  monocyte percentage of leukocytes                                                       "['A calculated measurement in which the number of monoc…
 95 EFO_0007989  monocyte percentage of leukocytes                                                       "['A calculated measurement in which the number of monoc…
 96 EFO_0004833  neutrophil count                                                                        "['Is a quantification of neutrophils in blood.', 'The n…
 97 EFO_0007990  neutrophil percentage of leukocytes                                                     "['A calculated measurement in which the number of neutr…
 98 EFO_0007990  neutrophil percentage of leukocytes                                                     "['A calculated measurement in which the number of neutr…
 99 EFO_0008421  non-alcoholic fatty liver disease severity measurement                                  "['Quantification of the severity of non-alcoholic fatty…
100 EFO_1001516  ovarian serous carcinoma                                                                "['serous carcinoma located in the ovary']"              
101 EFO_1001958  high grade ovarian serous adenocarcinoma                                                "['A rapidly growing serous adenocarcinoma that arises f…
102 EFO_1001516  ovarian serous carcinoma                                                                "['serous carcinoma located in the ovary']"              
103 EFO_1001958  high grade ovarian serous adenocarcinoma                                                "['A rapidly growing serous adenocarcinoma that arises f…
104 EFO_1001516  ovarian serous carcinoma                                                                "['serous carcinoma located in the ovary']"              
105 EFO_0010968  phosphate measurement                                                                   "['Quantification of phosphate levels in a sample.']"    
106 EFO_0007984  platelet component distribution width                                                   "['The determination of the amount of platelet shape cha…
107 EFO_0004309  platelet count                                                                          "['The number of\\xa0PLATELETS\\xa0per unit volume in a …
108 EFO_0007985  platelet crit                                                                           "['The proportion of blood volume that is occupied by pl…
109 EFO_0004462  PR interval                                                                             "[\"A PR interval is an  electrocardiography measurement…
110 EFO_0005055  QRS duration                                                                            "[\"QRS duration is a measurement of the combined durati…
111 EFO_0004682  QT interval                                                                             "[\"The QT interval is a measure of the time between the…
112 EFO_0010246  recurrent                                                                               "['Episodes of disease that occur in individuals who hav…
113 EFO_0005192  red blood cell distribution width                                                       "['measure of the variation of red blood cell (RBC) volu…
114 EFO_0007766  response to beta blocker                                                                "['Any process that results in a change in state or acti…
115 GO_0097366   response to bronchodilator                                                              "['Any process that results in a change in state or acti…
116 EFO_0004351  resting heart rate                                                                      "['quantification of the number of times the heart beats…
117 EFO_0007986  reticulocyte count                                                                      "['The number of reticulocytes per unit volume of blood.…
118 EFO_0008579  risk-taking behaviour                                                                   "['The tendency to take risks. Risk-taking behaviour is …
119 EFO_0009820  seeing a general practitioner for nerves, anxiety, tension or depression, self-reported "['Seeing a general practitioner for nerves, anxiety, te…
120 EFO_0009821  seeing a psychiatrist for nerves, anxiety, tension or depression, self-reported         "['Seeing a psychiatrist for nerves, anxiety, tension or…
121 EFO_0009799  self-reported trait                                                                     "['Characteristics of an individual that are reported by…
122 EFO_0009821  seeing a psychiatrist for nerves, anxiety, tension or depression, self-reported         "['Seeing a psychiatrist for nerves, anxiety, tension or…
123 EFO_0009820  seeing a general practitioner for nerves, anxiety, tension or depression, self-reported "['Seeing a general practitioner for nerves, anxiety, te…
124 EFO_0009819  comparative body size at age 10, self-reported                                          "[\"Description of an individual's body size at age 10 c…
125 EFO_0004735  serum alanine aminotransferase measurement                                              "['Is a quantification of serum alanine aminotransferase…
126 EFO_0004535  serum albumin measurement                                                               "['An albumin measurement is a quantification of albumin…
127 EFO_0004532  serum gamma-glutamyl transferase measurement                                            "['Serum gamma-glutamyl transferase level measurement is…
128 EFO_0004579  serum IgE measurement                                                                   "[\"A serum immunoglobulin E measurement is the measurem…
129 EFO_0004568  serum non-albumin protein measurement                                                   "['The measurement of the non-albumin portion of blood p…
130 EFO_0009795  serum urea measurement                                                                  "['Quantification of the amount of urea in serum.']"     
131 EFO_0004696  sex hormone-binding globulin measurement                                                "['Is a quantification of sex hormone binding globulin. …
132 EFO_0009282  sodium measurement                                                                      "['A quantitative measurement of the amount of sodium pr…
133 EFO_0006335  systolic blood pressure                                                                 "['The blood pressure during the contraction of the left…
134 EFO_0004908  testosterone measurement                                                                "['is a quantification of testosterone, typically in ser…
135 EFO_0009933  Thyroid preparation use measurement                                                     "['Quantification of some aspect of the use of thyroid p…
136 EFO_0004536  total blood protein measurement                                                         "['A total blood protein measurement is a quantification…
137 EFO_0004574  total cholesterol measurement                                                           "['A total cholesterol measurement is the quantification…
138 EFO_0004530  triglyceride measurement                                                                "['A triglyceride  measurement is a quantification of tr…
139 EFO_0003761  unipolar depression                                                                     "['A mood disorder having a clinical course involving on…
140 EFO_0004531  urate measurement                                                                       "['A urate measurement is the quantification of some ura…
141 EFO_0004761  uric acid measurement                                                                   "['Is a quantification of uric acid, typically in blood.…
142 EFO_0007778  urinary albumin to creatinine ratio                                                     "['quantification of the ratio of albumin to creatinine …
143 EFO_0005116  urinary metabolite measurement                                                          "['quantification of some metabolite in urine']"         
144 EFO_0010952  urinary potassium measurement                                                           "['A quantitative measurement of the total amount of pot…
145 EFO_0010967  urinary microalbumin measurement                                                        "['The quantification of microalbumin in urine.']"       
146 EFO_0007778  urinary albumin to creatinine ratio                                                     "['quantification of the ratio of albumin to creatinine …
147 EFO_0010967  urinary microalbumin measurement                                                        "['The quantification of microalbumin in urine.']"       
148 EFO_0010952  urinary potassium measurement                                                           "['A quantitative measurement of the total amount of pot…
149 EFO_0004631  vitamin D measurement                                                                   "['A quantification of Vitamin D levels, typically in bl…
150 EFO_0004342  waist circumference                                                                     "['The measurement around the body at the level of the\\…

Expose release dates

Add the PGS Catalog release date information for Publication and Score models in the API and downloads.

Inconsistencies associated with recent changes to /rest/publication/ -- split into "development" and "evaluation"

One of your latest changes to the REST API split the associated_pgs_ids array (from the /rest/publication/ endpoint) into two arrays, development and evaluation. I believe this change should be accompanied by other changes to keep things consistent, or else be partly reverted.

Take this example snippet of PGP000013 from a response to https://www.pgscatalog.org/rest/publication/PGP000013:

{
    "id": "PGP000013",
    "title": "Type 1 Diabetes Risk in African-Ancestry Participants and Utility of an Ancestry-Specific Genetic Risk Score.",
    "doi": "10.2337/dc18-1727",
    "PMID": 30659077,
    "journal": "Diabetes Care",
    "firstauthor": "Onengut-Gumuscu S",
    "date_publication": "2019-01-18",
    "authors": "Onengut-Gumuscu S, Chen WM, Robertson CC, Bonnie JK, Farber E, Zhu Z, Oksenberg JR, Brant SR, Bridges SL, Edberg JC, Kimberly RP, Gregersen PK, Rewers MJ, Steck AK, Black MH, Dabelea D, Pihoker C, Atkinson MA, Wagenknecht LE, Divers J, Bell RA, SEARCH for Diabetes in Youth, Type 1 Diabetes Genetics Consortium, Erlich HA, Concannon P, Rich SS.",
    "associated_pgs_ids": {
        "development": [
            "PGS000023"
        ],
        "evaluation": [
            "PGS000021",
            "PGS000023"
        ]
    }
}

Given that PGS000021 is associated with PGP000013, I would expect a query for score PGS000021 to also list PGP000013 as an associated publication. However, the response to https://www.pgscatalog.org/rest/score/PGS000021?format=json only shows PGP000011:

{
  "id": "PGS000021",
  "name": "GRS1",
  "ftp_scoring_file": "http://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS000021/ScoringFiles/PGS000021.txt.gz",
  "publication": {
    "id": "PGP000011",
    "title": "A Type 1 Diabetes Genetic Risk Score Can Aid Discrimination Between Type 1 and Type 2 Diabetes in Young Adults.",
    "doi": "10.2337/dc15-1111",
    "PMID": 26577414,
    "journal": "Diabetes Care",
    "firstauthor": "Oram RA",
    "date_publication": "2015-11-17"
  },
...

I am guessing that it would now be important to also split the publication field into development and evaluation, and have arrays of publications inside them. Alternatively, include a new field in each publication, e.g., stage, and simply have an array of publications with this extra variable.

Nevertheless, my understanding was that the /rest/score endpoints returned information associated only with the development of a PGS, and not its evaluation. So the publications returned in this context would only be related to development. I think that would be a good idea, as you already have the PPM concept that lists associations with the respective PGSes and whose publications are also listed in the objects returned by /rest/performance/.

So right now we have a hybrid situation: PGPs map to both "development" and "evaluation" PGSes, but PGSes only map to "development" PGPs. Finally, PPMs only map to "evaluation" PGPs.
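The "stage" alternative suggested above could look roughly like the following on the client side. This is only a sketch: the function name and the record keys (pgs_id, pgp_id, stage) are hypothetical and not part of the REST API; the input mirrors the PGP000013 snippet quoted earlier.

```python
from typing import Dict, List


def flatten_associated_pgs_ids(publication: Dict) -> List[Dict[str, str]]:
    """Flatten the development/evaluation split into records with a 'stage' key."""
    records = []
    associated = publication.get("associated_pgs_ids", {})
    for stage in ("development", "evaluation"):
        for pgs_id in associated.get(stage, []):
            # One record per (score, publication, stage) association.
            records.append(
                {"pgs_id": pgs_id, "pgp_id": publication["id"], "stage": stage}
            )
    return records


# Trimmed copy of the PGP000013 response shown above.
pgp000013 = {
    "id": "PGP000013",
    "associated_pgs_ids": {
        "development": ["PGS000023"],
        "evaluation": ["PGS000021", "PGS000023"],
    },
}

records = flatten_associated_pgs_ids(pgp000013)
for record in records:
    print(record)
```

With a flat list like this, the same score can appear once per stage, which makes the PGS000021 "evaluation only" case explicit.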

REST API: documentation for the Cohort_extended schema resulting from /rest/cohort/{cohort_symbol} different from actual response

The schema for the response from /rest/cohort/{cohort_symbol} is a Cohort_extended object.

According to the documentation, the object associated_pgs_ids should be an array. Both the schema description and the cached response example given agree with this. However, an actual response returns associated_pgs_ids as an object of two elements: development and evaluation. I believe the documentation needs an update.

Documentation of Cohort_extended schema

Example value

Actual response from /rest/cohort/ABCFS

PGS scoring file columns not following format for PGS000662

PGS000662.txt.gz has the following columns:

rsID
chr_name
chr_position
effect_allele
reference_allele
effect_weight
weight_type
allelefrequency_effect_European
allelefrequency_effect_African
allelefrequency_effect_Asian
allelefrequency_effect_Hispanic

allelefrequency_effect is part of the documented columns (10.1101/2020.05.20.20108217v1):

However these variations are not:

allelefrequency_effect_European
allelefrequency_effect_African
allelefrequency_effect_Asian
allelefrequency_effect_Hispanic

curl -sX GET 'http://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS000662/ScoringFiles/PGS000662.txt.gz' | gunzip | grep rsID

rsID	chr_name	chr_position	effect_allele	reference_allele	effect_weight	weight_type	allelefrequency_effect_European	allelefrequency_effect_African	allelefrequency_effect_Asian	allelefrequency_effect_Hispanic
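A check like this can be scripted. The sketch below compares a header line against a set of expected column names. ASSUMED_DOCUMENTED is an illustrative subset inferred from this issue, not the official scoring file schema, and the function name is hypothetical.

```python
# Illustrative subset only; the authoritative list is the scoring file schema.
ASSUMED_DOCUMENTED = {
    "rsID",
    "chr_name",
    "chr_position",
    "effect_allele",
    "reference_allele",
    "effect_weight",
    "weight_type",
    "allelefrequency_effect",
}


def undocumented_columns(header_line, documented=ASSUMED_DOCUMENTED):
    """Return the tab-separated column names that are not in `documented`."""
    columns = header_line.rstrip("\n").split("\t")
    return [column for column in columns if column not in documented]


# Header of PGS000662.txt.gz as reported above.
header = "\t".join(
    [
        "rsID",
        "chr_name",
        "chr_position",
        "effect_allele",
        "reference_allele",
        "effect_weight",
        "weight_type",
        "allelefrequency_effect_European",
        "allelefrequency_effect_African",
        "allelefrequency_effect_Asian",
        "allelefrequency_effect_Hispanic",
    ]
)

extra = undocumented_columns(header)
print(extra)
```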

REST API: Values that are empty strings should be replaced with null

In JSON responses some of the values are empty; it would be best to replace them with null.

Example:

curl -X GET "https://www.pgscatalog.org/rest/performance/search?pgs_id=PGS000004&offset=0&limit=20&format=json" -H  "accept: application/json" | jq '.' | grep -n "\"\""

189:            "source_GWAS_catalog": "",
190:            "source_PMID": "",
240:      "covariates": "",
241:      "performance_comments": ""
271:            "source_GWAS_catalog": "",
272:            "source_PMID": "",
322:      "covariates": "",
323:      "performance_comments": ""
353:            "source_GWAS_catalog": "",
354:            "source_PMID": "",
404:      "covariates": "",
405:      "performance_comments": ""
435:            "source_GWAS_catalog": "",
436:            "source_PMID": "",
486:      "covariates": "",
487:      "performance_comments": ""
517:            "source_GWAS_catalog": "",
518:            "source_PMID": "",
568:      "covariates": "",
569:      "performance_comments": ""
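Until this is changed server-side, a client can apply the conversion itself. The following is a minimal sketch, assuming the response has already been parsed into Python dicts and lists; the function name and the nesting of the sample object (the "samples" key) are made up for illustration.

```python
def empty_to_none(value):
    """Recursively replace empty strings with None in parsed JSON."""
    if isinstance(value, dict):
        return {key: empty_to_none(child) for key, child in value.items()}
    if isinstance(value, list):
        return [empty_to_none(child) for child in value]
    if value == "":
        return None
    return value


# Hand-built fragment using field names from the grep output above.
fragment = {
    "covariates": "",
    "performance_comments": "",
    "samples": [{"source_GWAS_catalog": "", "source_PMID": ""}],
}

cleaned = empty_to_none(fragment)
print(cleaned)
```

Serialising `cleaned` with `json.dumps` then emits `null` for those fields.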

Suggestion: create a unique identifier column for variants in the PGS scoring files

As per the PGS scoring file schema, each row in the PGS scoring file pertains to a variant.

Currently, each variant is identified either by its "rsID" or by the combination of "chr_name" and "chr_position". For analyses involving various scoring files it would be nice to have one single identifier column. So here is a suggestion:

Create a new column (to be the first one), named e.g. "variant_id", whose value is preferably the "rsID" if it exists; otherwise, it is the concatenation of the chromosome name and the position (e.g., "1:757640"). This way the reader of PGS scoring files would always know to look at this column for the identifier. I usually find myself applying this logic on my side.
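The logic described above can be sketched as follows. The function name is hypothetical, rows are assumed to be dicts keyed by column name (for example from csv.DictReader), and "rs123" is a made-up identifier used only for the example.

```python
def variant_id(row):
    """Prefer the rsID; otherwise fall back to 'chr_name:chr_position'."""
    rsid = (row.get("rsID") or "").strip()
    if rsid:
        return rsid
    return f"{row['chr_name']}:{row['chr_position']}"


with_rsid = {"rsID": "rs123", "chr_name": "1", "chr_position": "757640"}
without_rsid = {"rsID": "", "chr_name": "1", "chr_position": "757640"}

print(variant_id(with_rsid))
print(variant_id(without_rsid))
```

Files that lack an rsID column altogether are handled by the same fallback, since `row.get("rsID")` returns None in that case.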

Is the `reference_allele` field in scoring files mandatory or optional?

In your medRxiv preprint, page 14, Supplemental Note 1, it reads Suggested.

Does that mean optional?

Question about `/rest/trait/search` with `include_children=1`?

Do I understand correctly that, although the endpoint /rest/trait/search allows for inclusion of child traits with include_children=1, in practice it is not possible to tell which traits are direct matches of the query and which ones are children thereof?

Instead, would it not be possible to make /rest/trait/search return EFOTrait_OntologyChild responses, like the endpoint /rest/trait/{trait_id} does?

interval type: value is not case consistent, sometimes is "IQR", other times "iqr"

According to the schema, the Demographic object may contain an interval object, whose type value must be one of: range, iqr or ci. It would be nice if this were followed strictly, including letter case.

Example response containing "iqr"

curl -X GET "https://www.pgscatalog.org/rest/performance/all?offset=1260&limit=20&format=json" -H  "accept: application/json" | jq '.' | grep -n iqr

1158:                "type": "iqr",
1223:                "type": "iqr",
1288:                "type": "iqr",
1353:                "type": "iqr",

Example response containing "IQR"

curl -X GET "https://www.pgscatalog.org/rest/performance/all?offset=380&limit=20&format=json" -H  "accept: application/json" | jq '.' | grep -n IQR

260:                "type": "IQR",
325:                "type": "IQR",
390:                "type": "IQR",
455:                "type": "IQR",
520:                "type": "IQR",
585:                "type": "IQR",
650:                "type": "IQR",
715:                "type": "IQR",
780:                "type": "IQR",
845:                "type": "IQR",
910:                "type": "IQR",
975:                "type": "IQR",
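As a client-side workaround, the type value can be lower-cased and validated before use. This is a sketch only: the allowed values are the three quoted from the schema above, and the function name is hypothetical.

```python
ALLOWED_INTERVAL_TYPES = {"range", "iqr", "ci"}


def normalize_interval_type(raw):
    """Lower-case an interval type and check it against the schema values."""
    normalized = raw.strip().lower()
    if normalized not in ALLOWED_INTERVAL_TYPES:
        raise ValueError(f"unexpected interval type: {raw!r}")
    return normalized


print(normalize_interval_type("IQR"))
print(normalize_interval_type("iqr"))
```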

Refactoring of /rest/info and /rest/releases?

Hi PGS Catalog team,

This is not an issue/bug but more of a question or feature request.

I noticed you have included a few new endpoints, some of which were requested by me. Thank you so much, really appreciated!

Regarding the new /rest/info may I leave a few suggestions?

Suggestion 1

You've probably thought about this, but still here are my five cents. I think it would be nice to leave the /rest/info endpoint only with details about the software side of the REST API, and the /rest/release/ endpoint reserved only for data related info.

So this would imply removing this JSON element from the /rest/info response:

"latest_release": {
    "date": "2021-04-28",
    "scores": 761,
    "publications": 167,
    "traits": 204
  },

and adding extra fields to the /rest/release response, namely the number of traits and perhaps the number of sample sets. I understand that in /rest/release you are providing the increments in new entities, whereas in /rest/info you are giving the totals. I think it would be nice to stick to increments, as we can always add them together to get the total at a given point in time.

In the R package quincunx, the main table resulting from a request to /rest/release/all looks like:

# A tibble: 27 x 5
   date       n_pgs n_ppm n_pgp notes                                                                                         
   <date>     <int> <int> <int> <chr>                                                                                         
 1 2021-04-28     4    25     7 This release contains 4 new Score(s), 7 new Publication(s) and 25 new Performance metric(s)   
 2 2021-04-07     6    22     5 This release contains 6 new Score(s), 5 new Publication(s) and 22 new Performance metric(s)   
 3 2021-03-22    13   144    11 This release contains 13 new Score(s), 11 new Publication(s) and 144 new Performance metric(s)
 4 2021-02-23    17   118    11 This release contains 17 new Score(s), 11 new Publication(s) and 118 new Performance metric(s)
 5 2021-02-03    58   265     8 This release contains 58 new Score(s), 8 new Publication(s) and 265 new Performance metric(s) 
 6 2021-01-07     6    31     6 This release contains 6 new Score(s), 6 new Publication(s) and 31 new Performance metric(s)   
 7 2020-12-15   306   313     4 This release contains 306 new Score(s), 4 new Publication(s) and 313 new Performance metric(s)
 8 2020-12-08     9    65     6 This release contains 9 new Score(s), 6 new Publication(s) and 65 new Performance metric(s)   
 9 2020-11-20     4    50     5 This release contains 4 new Score(s), 5 new Publication(s) and 50 new Performance metric(s)   
10 2020-11-05     4    19     4 This release contains 4 new Score(s), 4 new Publication(s) and 19 new Performance metric(s)   
# … with 17 more rows

So it would be nice to have in addition, as I said, the n_efo (number of new traits) and n_pss (number of new sample sets) migrated from the response from /rest/info.
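To illustrate why increments are sufficient, the sketch below turns per-release increments into running totals. The three (date, n_pgs) pairs are copied from the table above; the function name and the baseline parameter are hypothetical, and the resulting sums cover only these three releases, so they are not the Catalog totals.

```python
from itertools import accumulate

# (date, n_pgs) increments from three rows of the table above, oldest first.
increments = [
    ("2021-03-22", 13),
    ("2021-04-07", 6),
    ("2021-04-28", 4),
]


def running_totals(rows, baseline=0):
    """Cumulative sum of per-release increments, offset by `baseline`."""
    dates = [date for date, _ in rows]
    sums = accumulate(count for _, count in rows)
    return [(date, baseline + total) for date, total in zip(dates, sums)]


totals = running_totals(increments)
for date, total in totals:
    print(date, total)
```

Passing the total before the first listed release as `baseline` would give absolute counts.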

Suggestion 2

Also move the citation and terms_of_use to the response from /rest/release. Don't you think they belong there more than in /rest/info/?

"citation": {
    "title": "The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation",
    "doi": "10.1038/s41588-021-00783-5",
    "PMID": 33692568,
    "authors": "Samuel A. Lambert, Laurent Gil, Simon Jupp, Scott C. Ritchie, Yu Xu, Annalisa Buniello, Aoife McMahon, Gad Abraham, Michael Chapman, Helen Parkinson, John Danesh, Jacqueline A. L. MacArthur and Michael Inouye.",
    "journal": "Nature Genetics",
    "year": 2021
  },
  "terms_of_use": "https://www.ebi.ac.uk/about/terms-of-use"

Suggestion 3

It would be nice to provide a few extra endpoints under /rest/info, namely:

/rest/info/all for all REST API versions (this would be the most useful of the endpoints suggested here because, as of now, I can only see the latest changes to the API, and oftentimes it would be nice to review the changelog over a longer period of time; otherwise, I am left with no alternative but to check the GitHub repository and review the commit history, which is not very efficient.)
/rest/info/{release_date}, analogous to /rest/release/{release_date}
/rest/info/{version}, e.g., /rest/info/1.7

So in its final form, the JSON from /rest/info would be simply an array of:

    "date": "2021-04-28",
    "version": 1.7,
    "changelog": [
      "New data 'ancestry_distribution' in the `/rest/score` endpoints, providing information about ancestry distribution on each stage of the PGS",
      "New endpoint `/rest/ancestry_categories` providing the list of ancestry symbols and names."
    ]

Again, all in all, thanks for the terrific work! These are just some ideas, and they concern aspects of the REST API that are not really that important.

REST API: trailing space in `name_full`

curl -X GET "https://www.pgscatalog.org/rest/performance/search?pgs_id=PGS000004&offset=0&limit=20&format=json" -H  "accept: application/json" | jq '.' | grep -n "\b \""

40:                "name_full": "Agricultural Health Study "
44:                "name_full": "Breakthrough Generations Study "
48:                "name_full": "European Prospective Investigation into Cancer "
60:                "name_full": "Nurses Health Study "
64:                "name_full": "Nurses Health Studies II "
etc.

New overall web display

New homepage + minor display changes in most pages (page title for instance).
Also update the REST API web pages

#273
#276

BUG: REST API: https://www.pgscatalog.org/rest/cohort/CARE_b?format=json is empty

The endpoint /rest/cohort/ for cohort CARE_b returns 200 but is empty. However, this cohort exists, e.g., it is associated with PGS000010, PPM000015 and PSS000009.

Additional variant information

Ideas:

Subdividing the number of variants to provide flags (or counts) of: SNPs, INDELs (> 1bp), HLA alleles, etc.

Cleanup Reported Traits

The reported traits (in Score models) need some cleanup in order to improve the display of the Trait entries in the Search results.

Here are a few examples:

Potential typos

cardiovascular measurement:
- Heart rate
- Heart rate (AR)
- Heat rate <==== Typo ?

Same trait but reported slightly differently

cardiovascular measurement:
- LDL
- LDL Cholesterol
- LDL cholesterol
...
- QT interval
- QT-interval

Will need to merge similar reported traits and fix typos

Uploader of Risk Score is Unclear

PGS000116 is anti-correlated to all of the other scores in EFO_0001645. Moreover, the journal article associated with it never mentions PGS000116, not even in supplementary text. So, it makes me think that the creator didn't upload it but someone else did. Was it mistakenly multiplied by -1 and is therefore a resilience rather than disease risk score? I also wonder about PGS003727 ...

Could the score overview web page have additional detail about precisely who uploaded the score?

PGS scoring files column names: undocumented column in the schema: 'OR'

Some PGS scoring files include an undocumented column in the schema: 'OR'.

Is it a mistake in these files or is it a missing schema documentation entry in here? In any case, it is not consistent across PGS scoring files. Here I provide two inconsistent examples, but this issue seems to happen throughout many of the PGS scoring files provided in the ftp server.

Example: PGS000001.txt.gz contains the "OR" column

curl -s 'http://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS000001/ScoringFiles/PGS000001.txt.gz' | gunzip -c | grep -P '\bOR\b'

rsID	chr_name	effect_allele	reference_allele	effect_weight	locus_name	OR

Example: PGS000004.txt.gz does not contain the "OR" column

curl -s 'http://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS000004/ScoringFiles/PGS000004.txt.gz' | gunzip -c | grep -P '\bchr_name\b'

chr_name	chr_position	effect_allele	reference_allele	effect_weight	allelefrequency_effect

REST API: extra trailing space

Related to #152.

Trailing spaces are still present in other fields:

curl -X GET "https://www.pgscatalog.org/rest/performance/search?pgs_id=PGS000004&offset=0&limit=20&format=json" -H  "accept: application/json" | jq '.' | grep -n "\b \""

1478:      "phenotyping_reported": "Contralateral breast cancer ",
1776:      "covariates": "Age, country ",
2080:      "covariates": "Age, country ",
2086:      "phenotyping_reported": "Contralateral breast cancer ",
2694:      "phenotyping_reported": "Contralateral breast cancer ",
2772:      "covariates": "Age, country ",
2856:      "covariates": "Age, country ",

curl -X GET "https://www.pgscatalog.org/rest/score/PGS000014?format=json" -H  "accept: application/json" | jq '.' | grep -n "\b \""

111:  "method_name": "LDPred ",
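
A structure-aware alternative to the grep, sketched in Python: walk the parsed JSON and report every string value that has leading or trailing whitespace, together with its path. The payload below is hand-written from the values in the output above; its nesting (the `results` key) is illustrative rather than a copy of a real response.

```python
import json


def find_padded_strings(node, path=""):
    """Yield (path, value) for every string with leading or trailing whitespace."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_padded_strings(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_padded_strings(value, f"{path}[{index}]")
    elif isinstance(node, str) and node != node.strip():
        yield path, node


response = json.loads(
    '{"method_name": "LDPred ", '
    '"results": [{"covariates": "Age, country "}, {"covariates": "Age, sex"}]}'
)
hits = list(find_padded_strings(response))
for path, value in hits:
    print(path, repr(value))
```

Reporting the path rather than a line number makes the result independent of how the JSON happens to be pretty-printed.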

Incorrect Categorisation

PGS002777 and PGS002778 should be in EFO_0000537 and not in EFO_0001645, shouldn't they? Correlation of these two is about 0 with all other EFO_0001645 members but all other scores are highly positively correlated with each other.

FR: REST API: /rest/release/* endpoints -- add more info

Hi PGS Catalog team,

Perhaps it would be nice to have a bit more information about releases:

Right now I use the release date as a kind of key but this won't work if you make more than one release in one day. So perhaps returning a timestamp with resolution to the second would be best.
It would also be nice to return the versioning of the Catalog database and the REST API itself. Then I could add two more columns to my first table (see below) in R: e.g., catalog_version and rest_server_version.

Currently this is how I parse your releases endpoints in R:

An object of class "releases"                                                                                                                                              
Slot "releases":
# A tibble: 23 x 5
   date       n_pgs n_ppm n_pgp notes                                                                                         
   <date>     <int> <int> <int> <chr>                                                                                         
 1 2021-02-03    58   265     8 This release contains 58 new Score(s), 8 new Publication(s) and 265 new Performance metric(s) 
 2 2021-01-07     6    31     6 This release contains 6 new Score(s), 6 new Publication(s) and 31 new Performance metric(s)   
 3 2020-12-15   306   313     4 This release contains 306 new Score(s), 4 new Publication(s) and 313 new Performance metric(s)
 4 2020-12-08     9    65     6 This release contains 9 new Score(s), 6 new Publication(s) and 65 new Performance metric(s)   
 5 2020-11-20     4    50     5 This release contains 4 new Score(s), 5 new Publication(s) and 50 new Performance metric(s)   
 6 2020-11-05     4    19     4 This release contains 4 new Score(s), 4 new Publication(s) and 19 new Performance metric(s)   
 7 2020-10-19    79    79     1 This release contains 79 new Score(s), 1 new Publication(s) and 79 new Performance metric(s)  
 8 2020-10-16     1     1     1 This release contains 1 new Score(s), 1 new Publication(s) and 1 new Performance metric(s)    
 9 2020-09-18    10    34     3 This release contains 10 new Score(s), 3 new Publication(s) and 34 new Performance metric(s)  
10 2020-09-04     6    17     3 This release contains 6 new Score(s), 3 new Publication(s) and 17 new Performance metric(s)   
# … with 13 more rows

Slot "pgs_ids":
# A tibble: 721 x 2
   date       pgs_id   
   <date>     <chr>    
 1 2021-02-03 PGS000668
 2 2021-02-03 PGS000669
 3 2021-02-03 PGS000670
 4 2021-02-03 PGS000671
 5 2021-02-03 PGS000672
 6 2021-02-03 PGS000673
 7 2021-02-03 PGS000674
 8 2021-02-03 PGS000675
 9 2021-02-03 PGS000676
10 2021-02-03 PGS000677
# … with 711 more rows

Slot "ppm_ids":
# A tibble: 1,533 x 2
   date       ppm_id   
   <date>     <chr>    
 1 2021-02-03 PPM001396
 2 2021-02-03 PPM001397
 3 2021-02-03 PPM001398
 4 2021-02-03 PPM001399
 5 2021-02-03 PPM001400
 6 2021-02-03 PPM001401
 7 2021-02-03 PPM001402
 8 2021-02-03 PPM001403
 9 2021-02-03 PPM001404
10 2021-02-03 PPM001405
# … with 1,523 more rows

Slot "pgp_ids":
# A tibble: 133 x 2
   date       pgp_id   
   <date>     <chr>    
 1 2021-02-03 PGP000128
 2 2021-02-03 PGP000129
 3 2021-02-03 PGP000130
 4 2021-02-03 PGP000132
 5 2021-02-03 PGP000133
 6 2021-02-03 PGP000134
 7 2021-02-03 PGP000135
 8 2021-02-03 PGP000136
 9 2021-01-07 PGP000122
10 2021-01-07 PGP000123
# … with 123 more rows

FR: Allow request all with /rest/cohort and /rest/sample_set

For debugging reasons and for the sake of completeness it would be nice to have these endpoints:

/rest/cohort/all
/rest/sample_set/all

Curation tracker

Create django admin infrastructure to track and annotate PGS Catalog publication eligibility and curation status. Requirements:

Allow input from multiple sources (LitSuggest, manual entry of studies)
Should let us know the curation queue (awaiting L1/L2)
Should allow management/release of studies and embargoed data to the curation DB and live server

REST API: values that are "UNKNOWN" should (probably) be replaced with null

I am guessing this is something that needs to be corrected at the database level, and not at the REST API server level.

Example:

curl -X GET "https://www.pgscatalog.org/rest/performance/search?pgs_id=PGS000004&offset=0&limit=20&format=json" -H  "accept: application/json" | jq '.' | grep -in "unknown"

56:                "name_full": "UNKNOWN"
76:                "name_full": "UNKNOWN"
676:                "name_full": "UNKNOWN"
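
Until this is fixed at the source, a client can normalise the placeholder itself. The sketch below is a minimal client-side workaround, not a description of how the server or database works. The placeholder set and the sample object are assumptions made for illustration.

```python
import json

# Assumed placeholder set; extend it if other sentinel strings turn up.
PLACEHOLDERS = {"UNKNOWN"}


def placeholders_to_none(node, placeholders=PLACEHOLDERS):
    """Return a copy of a parsed JSON value with placeholder strings set to None."""
    if isinstance(node, dict):
        return {key: placeholders_to_none(value, placeholders) for key, value in node.items()}
    if isinstance(node, list):
        return [placeholders_to_none(value, placeholders) for value in node]
    if isinstance(node, str) and node.strip().upper() in placeholders:
        return None
    return node


# Illustrative object; only the "name_full" key comes from the output above.
raw = {"cohorts": [{"name_short": "EXAMPLE", "name_full": "UNKNOWN"}]}
cleaned = placeholders_to_none(raw)
print(json.dumps(cleaned))
```

The match is case-insensitive, like the `grep -i` above, so a legitimate value that happens to read "unknown" would also be nulled.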

Consider making the "stage" annotation a data value of json elements

In the PGS Catalog objects returned by the REST API endpoints, the "stage" annotation of pgs_ids and samples is always implicit, i.e., encoded in the key names of parent JSON elements, e.g.:

objects returned by /rest/score include the objects samples_variants and samples_training, which in turn contain the actual samples. The stage annotation of those samples has to be read from the parent elements: samples_variants corresponds to samples whose stage is "gwas", "gwas variants", or perhaps "discovery", and samples_training to samples whose stage is "training". It would be better if these annotations were actual values rather than keys of parent elements, as they are now.
the same happens with associated_pgs_ids, which is split into development and evaluation: again two stage annotations of pgs_ids that would be better moved out of the parent elements' names into a new key:value pair ("stage": "development" or "stage": "evaluation").

Would it not be better to settle on a new categorical variable, named "stage", whose possible levels would be:

"discovery" (or other term you might prefer)
"training"
"development" (catch-all term for either "discovery" or "training")
"evaluation"

and then annotate PGSes and samples with "stage": .

This will decrease the current nesting level where this info is needed and make the parsing more straightforward on the user end. Here's a scribble indicating where I think changes would be required: out.pdf.
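
To make the proposal concrete, here is a sketch of the reshaping a client currently has to do, and which the suggested "stage" value would make unnecessary. The mapping from key names to stage levels follows the proposal above and is not an existing API feature. The sample contents and numbers are invented for illustration.

```python
# Proposed mapping from the current parent keys to explicit stage values.
STAGE_BY_KEY = {
    "samples_variants": "discovery",
    "samples_training": "training",
}


def flatten_samples(score):
    """Return one flat list of samples, each carrying an explicit "stage" value."""
    flat = []
    for key, stage in STAGE_BY_KEY.items():
        for sample in score.get(key, []):
            flat.append({**sample, "stage": stage})
    return flat


# Invented score object in the current, nested shape.
score = {
    "id": "PGS000001",
    "samples_variants": [{"sample_number": 100}],
    "samples_training": [{"sample_number": 50}, {"sample_number": 25}],
}
flat = flatten_samples(score)
for sample in flat:
    print(sample)
```

With "stage" as a value, the samples form a single table that can be filtered or grouped by stage, without inspecting key names.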

Retrieving and querying variants from the REST API

Hi PGS Catalog Team

This is a feature request.

It would be really nice to have a way of retrieving scores by variant identifiers (variant_id), and the reverse, to get variant identifiers by score ids (pgs_id):

/rest/variant/search?pgs_id=PGS000001
/rest/score/search?variant_id=rs123213

As I understand it, currently, there is no way of finding these relationships between variants and scores in the metadata. So, to find this information, we have to download all PGS scoring files and look in there.

I understand though that you'd probably need to fix #150 before implementing such a feature.
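
For reference, the local workaround amounts to building an inverted index from the scoring files. The sketch below assumes the file contents have already been downloaded and decompressed into strings, that variants are keyed by an `rsID` column, and that metadata lines start with `#`. The score identifiers and rows are invented. Files that only carry chr_name and chr_position would need a different key.

```python
import csv
from collections import defaultdict


def variant_ids(scoring_text, column="rsID"):
    """Yield the non-empty values of the ID column from tab-separated text."""
    lines = (
        line for line in scoring_text.splitlines()
        if line.strip() and not line.startswith("#")
    )
    for row in csv.DictReader(lines, delimiter="\t"):
        value = row.get(column)
        if value:
            yield value


def build_index(files):
    """Map each variant identifier to the set of score ids that contain it."""
    index = defaultdict(set)
    for pgs_id, text in files.items():
        for variant in variant_ids(text):
            index[variant].add(pgs_id)
    return index


# Invented file contents keyed by invented score ids.
files = {
    "PGS_A": "rsID\teffect_allele\teffect_weight\nrs1\tA\t0.1\nrs2\tT\t0.2\n",
    "PGS_B": "rsID\teffect_allele\teffect_weight\nrs2\tT\t0.3\n",
}
index = build_index(files)
scores_for_rs2 = sorted(index["rs2"])
variants_for_a = sorted(v for v, ids in index.items() if "PGS_A" in ids)
print(scores_for_rs2, variants_for_a)
```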

Replace EFOTrait_Base variable type for synonyms and mapped_terms

Investigate the replacement of the | separated TextFields of synonyms and mapped_terms (from the model EFOTrait_Base) by JSON fields.

e.g.: for breast cancer

synonyms = [ {'name': 'cancer of breast'}, {'name': 'mammary cancer'}, ... ]
mapped_terms = [ {'name': 'DOID:1612'}, {'name': 'ICD10CM:C50'}, ...]

This will require changes in:

catalog/model.py (including the functions synonyms_list and mapped_terms_list)
search/search.py
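
A minimal sketch of the data conversion itself, independent of Django: split the existing `|`-separated text into the list-of-objects shape shown above, and read names back out of that shape. The function names are hypothetical. Stripping whitespace around each part is an assumption about how the separator is stored.

```python
def pipe_to_json(value):
    """Convert a '|'-separated string into a list of {'name': ...} objects."""
    if not value:
        return []
    return [{"name": part.strip()} for part in value.split("|") if part.strip()]


def json_to_names(items):
    """Return the plain list of names from the JSON-style structure."""
    return [item["name"] for item in items]


synonyms = pipe_to_json("cancer of breast | mammary cancer")
mapped_terms = pipe_to_json("DOID:1612|ICD10CM:C50")
print(synonyms)
print(json_to_names(mapped_terms))
```

A helper like `json_to_names` could stand in for what synonyms_list and mapped_terms_list return, so that callers keep receiving a plain list of strings.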

Add link to validator on curation admin?

Would it be possible to add the PGS Catalog header at the top of the tracker? That would make finding the validator easier.

What are "Covariates Included in the Model?"

If we go here, we see "age, sex and the first ten principal components of genetic ancestry" are controlled for in performance metrics. Other PGSs list more or fewer covariates.

Are the covariates (as reported in the PGS Catalog):

A: Variables which are controlled for in the performance calculation (for example, so as not to include genetic sex differences in the model, so that the performance doesn't reflect the effect of sex). This means including sex as a covariate could decrease the reported performance of the model, and having this covariate might make us relatively more confident in the robustness of the model. The final model removes the effect of the covariates so that ~only the PGS is used to assess performance.
B: Variables which are added to the model, such that the performance calculation includes the effects of these variables. That is, the performance metric might increase if one included sex (if sex is relevant to the phenotype prediction). Including covariates in this way would make us less certain of the accuracy of the polygenic score alone. The final model uses the effect of both the covariates and the PGS to achieve performance.
C: Indeterminate. Authors choose either A or B, and one must try to glean the information from the individual paper.

An example of the question that A could answer is "how well does PRS predict without the effect of X Y Z known risk factors?"
An example of the question that B could answer is "how well does a model with both PRS and X Y Z known risk factors predict?"

I believe it is A, but I wanted to verify that it is not C. Thank you.
