activedriverdb's People

Contributors

dependabot[bot], krassowski

activedriverdb's Issues

adding stats to website

Hi Michal,

how about adding some numbers to the front page of the web site as an overview of the data we have

  • number of mutations in cancer, population, inherited disease
  • number of PTM sites considered
  • number of mutation annotations (all DNA>protein table + MIMP annotations)
  • number of mutations in PTM sites
  • number of mutations with MIMP effect
  • .. something else you remember

this will help viewers understand the impressive scope and amount of data we analyse.

selecting protein isoforms

We should create a list of preferred isoforms (RefSeq IDs), one per gene. This isoform is shown as the primary isoform when a target gene or kinase is shown. Other isoforms are available to view via links.

Let's use the following strategy, per gene

  1. RefSeq with longest protein sequence
  2. if multiple longest RefSeqs exist, choose the one with the smallest numeric ID in RefSeq (after removing NM_ and padding zeroes).

The last step is a heuristic but might represent older and more confident sequences. Let's investigate this a little further.
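A minimal Python sketch of the two-step strategy above, to make the tie-break concrete. The function names, the input shape (pairs of RefSeq ID and protein sequence) and the example IDs and sequences are made up for illustration.

```python
def refseq_number(refseq_id):
    """Numeric part of a RefSeq ID, e.g. 'NM_000546' -> 546 (version suffix ignored)."""
    return int(refseq_id.split("_", 1)[1].split(".")[0])


def preferred_isoform(isoforms):
    """Pick one RefSeq ID from a list of (refseq_id, protein_sequence) pairs."""
    # Step 1: longest protein sequence; step 2: smaller numeric RefSeq ID on ties.
    best = min(isoforms, key=lambda iso: (-len(iso[1]), refseq_number(iso[0])))
    return best[0]


# Hypothetical isoforms of one gene (IDs and sequences are placeholders).
isoforms = [
    ("NM_001000002", "MKTAYIAK"),
    ("NM_000123", "MKTAYIAK"),
    ("NM_000456", "MKTA"),
]
print(preferred_isoform(isoforms))
```

Here the two longest isoforms tie on length, so the one with the smaller numeric ID is chosen.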

network view

Let's create a network view of a protein and the kinases that interact with it. The size of a protein correlates with its number of mutations. Clicking on a protein opens its protein view.

more visualisation ideas coming later.

stability of network view

we should reduce the sensitivity of network view and make it fixed after shorter time of optimisation.

searching for proteins

we should improve search for proteins.

First, the current search by name appears to be search-as-you-type, which is quite slow for me. It may be easier to search once the user has finished typing and entered the keyword.

Second, search results should be ranked by increasing edit distance, that is, the number of differences between the search term and the result. For example, currently when searching for TP53 the first result is TP53AIP1 and the 12th result is TP53. TP53 should occur first, and the next should be the gene that needs the fewest edits (for example TP53I3, two edits needed).
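To illustrate the ranking, here is a small Python sketch using the standard Levenshtein distance (insertions, deletions, substitutions). The function names are made up, and how candidates are fetched from the database is left out.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def rank_results(query, names):
    """Sort names by increasing edit distance to the query, then alphabetically."""
    query = query.upper()
    return sorted(names, key=lambda name: (edit_distance(query, name.upper()), name))


print(rank_results("TP53", ["TP53AIP1", "TP53I3", "TP53"]))
```

With this ordering TP53 (0 edits) comes first, then TP53I3 (2 edits), then TP53AIP1 (4 edits).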

mutation counts not represented correctly in needle plots?

Hello, it looks like there might be a bug in how mutations are represented in needle plots. Whatever protein I look at, the maximum mutation count is almost never more than 19.

For example, TP53 R175H should be present 56 times, as also seen in the table, but in the needle plot it is shown with only 5 copies (?).

Last column in mutation table

We should update the last column(s) in the mutation table so that they show minimal information. This information will be moved to the record that opens when (+) is clicked.

TCGA:
2 cancer types: GBM, KIRC

ClinVar:
2 disease annotations: Noonan syndrome, Inherited cancer

1000G, ESP6500:
MAF: 0.1234

network visualisation #1 (reduce the size of nodes)

we should improve the network visualisation. It can become very messy now if there are many interactors.

  1. we should reduce the size of nodes. A fixed-size font with maximum gene length 7 would be a good size of a node I think.

REST service for exporting mutations

We should have a web service interface to get mutations from the database. The output should contain protein mutations, status relative to PTM sites, MIMP status, mutation counts (cancer) or allele frequencies (population), further meta-information (cancer type, disease name, population subgroup name).

potential input queries to discuss

  1. protein mutation - NM_000123 A12R
  2. DNA mutation - chr1 123456 A T
  3. all mutations in gene - NM_000123
  4. all mutations in gene and dataset - NM_000123 ESP6500
  5. all mutations in range of DNA coordinates - chr1 123456 223456
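As a starting point for the discussion, here is a rough Python sketch that tells the five query shapes above apart. The type labels, dictionary keys and regular expressions are assumptions for illustration only, not an agreed API.

```python
import re

PROTEIN_MUTATION = re.compile(r"([A-Z])(\d+)([A-Z])")  # e.g. A12R
REFSEQ_ID = re.compile(r"NM_\d+")                      # e.g. NM_000123


def parse_query(text):
    """Classify a whitespace-separated query into one of the five proposed types."""
    parts = text.split()
    if not parts:
        raise ValueError("empty query")
    first = parts[0]
    if first.startswith("chr") and len(parts) == 4 and parts[1].isdigit():
        return {"type": "dna_mutation", "chrom": first, "pos": int(parts[1]),
                "ref": parts[2], "alt": parts[3]}
    if (first.startswith("chr") and len(parts) == 3
            and parts[1].isdigit() and parts[2].isdigit()):
        return {"type": "dna_range", "chrom": first,
                "start": int(parts[1]), "end": int(parts[2])}
    if REFSEQ_ID.fullmatch(first):
        if len(parts) == 1:
            return {"type": "gene", "refseq": first}
        if len(parts) == 2:
            match = PROTEIN_MUTATION.fullmatch(parts[1])
            if match:
                ref, pos, alt = match.groups()
                return {"type": "protein_mutation", "refseq": first,
                        "pos": int(pos), "ref": ref, "alt": alt}
            # Anything else in second place is treated as a dataset name.
            return {"type": "gene_dataset", "refseq": first, "dataset": parts[1]}
    raise ValueError(f"unrecognised query: {text!r}")


for example in ["NM_000123 A12R", "chr1 123456 A T", "NM_000123",
                "NM_000123 ESP6500", "chr1 123456 223456"]:
    print(example, "->", parse_query(example)["type"])
```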

website organisation

We should design a page structure that includes information pages in addition to the actual database. For example: about, help & FAQ, contact us. The front page might also have some information. Can we set up a simple content management system to add pages and edit these interactively?

showing records in protein search

Let's make the search results a little more compact and add some info:

CURRENT VERSION:

TP53

17
+
NM_000546

Coding region: start 7572926 end 7579912 Length: 6986 Transcription positions: start 7571719 end 7590868
protein network (56)

PROPOSED VERSION

TP53
NM_000546 | Protein: 392 residues | CDS: 7572926-7579912 | Transcript: 7571719-7590868 | Chr17 (+)
VIEWS: Protein sequence (XX PTM sites, YY mutations) | Site-specific PTM interaction network (56 proteins)

PTM sites to merge

in the protein view, we should merge PTM flanking regions of individual PTM sites.

For example:

site1 at S13
site2 at S15

region1 is 13-7 until 13+7
region2 is 15-7 until 15+7
region_merged is 13-7 until 15+7
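The merging can be sketched in Python as a standard interval merge. The function name and the default flank of 7 residues (taken from the example above) are illustrative, and clipping regions to the protein ends is left out.

```python
def merge_flanks(site_positions, flank=7):
    """Merge the +/- flank regions of PTM sites when they overlap."""
    regions = []
    for pos in sorted(site_positions):
        start, end = pos - flank, pos + flank
        if regions and start <= regions[-1][1]:
            # Overlaps the previous region: extend it instead of adding a new one.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(region) for region in regions]


print(merge_flanks([13, 15]))  # sites S13 and S15 from the example
```

For S13 and S15 this gives one region from 6 (13-7) to 22 (15+7).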

headers and metainfo in mutation table

New header names:

  • Pos
  • Ref
  • Mut
  • PTM impact
  • PTMs \n affected
  • Summary

Tooltips for header names

  • Position of mutation in protein
  • Reference amino acid residue in protein
  • Mutated amino acid residue in protein
  • Impact on closest PTM site
  • Number of adjacent PTMs affected
  • Mutation summary (click on + for more info)

PTM sites by color

We have four types of PTM sites. These are now shown as green blocks in needleplot. We should use colors to distinguish the different types of PTM sites.

dark blue - phosphorylation
dark green - ubiquitination
light blue - acetylation
light green - methylation

click+labels on needles

when clicking on needles, it would be great to open a small window that shows information on mutations:

  • what mutation
  • what cancer types + counts (or population groups or diseases)
  • if PTM mutation, what PTM sites are affected + impacts (direct, proximal, distal, MIMP)
  • if MIMP mutation, show logo(s) affected

if this information is too much, then we could do the following

  1. show minimal info in popup
  2. have link [show more], which scrolls the page down to the table where these mutations are shown. Rows with that mutation will be highlighted with a different color.

Also, the needle heads should be even larger.

site/mutation visual by color

Let's visualise PTM sites and mutations by their impact.

site with mutation - dark blue
site without mutation - light blue

non-PTM mutation - gray
PTM mutation, direct - red
PTM mutation, flank proximal - orange
PTM mutation, flank distal - yellow

mutation info in table and tooltip

we should have a link in the mutation table (and mutation tooltip) that will expand the row in the mutation table and provide more information about the mutation. The information is already shown in the tooltip. When user clicks on tooltip, the page will scroll down and expand the information in the table row.

Alternatively, we can open a new page but this will make the page slower, especially when the user wants to go back and forth between many mutations.

kinase family

let's remove the kinases of the kinase family that have no binding sites in the protein in the view. I know that I previously suggested the opposite, but some families are quite large and it simplifies the view by not showing those.

PTM mutation switch

we should have a global switch to show all mutations or only PTM mutations. This should be reflected in the protein view, network view and mutations table.

mapping DNA to protein

We should start with mapping DNA-level coordinates to protein coordinates. I have precomputed a table with all single nucleotide variants that correspond to PTM site mutations available below

http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004919
S1 File and S2 File.

I will compute a similar larger table for all SNVs that change protein sequence (work in progress).

The goal is the following - person comes in with set of variants at DNA level (chr1 12345678 A C). We will look this up in the table and return PROT1 345 S Q (corresponding protein variant) and say whether this is a PTM variant or non PTM variant.

This table needs rapid indexing because it is quite large. Also, people may have a large number of variants to test. We want this to be as interactive as possible, but we may need to implement an offline calculation in the future.
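One possible way to get fast lookups is an indexed table keyed on (chromosome, position, ref, alt). The sketch below uses Python's built-in sqlite3 module with an in-memory database. The table and column names are invented, the single row is the placeholder example from above, and its PTM flag is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The composite primary key doubles as the index used for lookups.
conn.execute("""
    CREATE TABLE snv_to_protein (
        chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,
        protein TEXT, aa_pos INTEGER, aa_ref TEXT, aa_alt TEXT, is_ptm INTEGER,
        PRIMARY KEY (chrom, pos, ref, alt)
    )
""")
conn.execute(
    "INSERT INTO snv_to_protein VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("chr1", 12345678, "A", "C", "PROT1", 345, "S", "Q", 1),  # placeholder row
)


def lookup(chrom, pos, ref, alt):
    """Return (protein, aa_pos, aa_ref, aa_alt, is_ptm) or None if nothing maps."""
    row = conn.execute(
        "SELECT protein, aa_pos, aa_ref, aa_alt, is_ptm FROM snv_to_protein "
        "WHERE chrom = ? AND pos = ? AND ref = ? AND alt = ?",
        (chrom, pos, ref, alt),
    ).fetchone()
    if row is None:
        return None
    return row[:4] + (bool(row[4]),)


print(lookup("chr1", 12345678, "A", "C"))
print(lookup("chr2", 1, "G", "T"))
```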

filtering of mutations

we should be able to filter mutations by cancer type and population subgroups.

First, a user selects the dataset of mutations that are used (four datasets - TCGA cancer, ESP6500 population, 1000G population, ClinVar).

Second, a new filtering panel becomes available that shows cancer types or population subgroups. This could be a multiple choice form. By default, all types/subgroups are active.

I think ClinVar currently has no such grouping, but I will look at the data and see if it makes sense to add one option like this. For example, maybe there are high-quality and low-quality mutations in clinvar.

MIMP multiple logos

As discussed, we currently only show the first MIMP logo. We instead should show the best logo per gain and the best logo per loss. If only one type is present, then we should show the best for that type. We should choose the logo(s) with the highest probability score.

Under the loss logo, we could show a list of other kinases that also had losses with lower probability scores. Same for the gain logo.

In other places of the protein view, like the table and the tooltip, we should indicate if both loss and gain are predicted in the network-rewiring mutation.
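A rough Python sketch of picking the best logo per effect. The record layout mirrors the pwm, prob and effect columns of the MIMP annotations, but the kinase names and scores below are placeholders, not real predictions.

```python
def summarise_mimp(predictions):
    """Group MIMP predictions by effect; best = highest prob, others = the rest."""
    summary = {}
    for effect in ("gain", "loss"):
        ranked = sorted(
            (p for p in predictions if p["effect"] == effect),
            key=lambda p: p["prob"],
            reverse=True,
        )
        if ranked:  # only report effect types that are present
            summary[effect] = {"best": ranked[0], "others": ranked[1:]}
    return summary


# Placeholder predictions for a single mutation.
predictions = [
    {"pwm": "KINASE_A", "effect": "gain", "prob": 0.91},
    {"pwm": "KINASE_B", "effect": "gain", "prob": 0.99},
    {"pwm": "KINASE_C", "effect": "loss", "prob": 0.95},
]
summary = summarise_mimp(predictions)
print({effect: entry["best"]["pwm"] for effect, entry in summary.items()})
print("both gain and loss predicted:", {"gain", "loss"} <= summary.keys())
```

The last line is the flag that the table and tooltip could use to indicate that both loss and gain are predicted.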

search form

In the current version, we have two tabs for searching. Instead, let's have one page with two fields:
a. a smaller one-line field for gene/protein (grayed-out value "TP53 or NM_000546")
b. a larger multi-line field for mutations (grayed-out value to include a few DNA and protein level mutations).

there should be a button or link to "run example" that the user can run.

visualising MIMP mutations

Here's an example with MIMP mutations.

In R, this is how we see the top MIMP-mutation in TP53 (they are ordered by prob)
all_mimp_annotations[all_mimp_annotations[,1]=="NM_000546",][1,]

             gene               mut         psite_pos          mut_dist
      "NM_000546"           "N310R"             "313"              "-3"
               wt                mt          score_wt          score_mt
"RALPNNTSSSPQPKK" "RALPRNTSSSPQPKK"      "0.12003083"      "0.92115458"
        log_ratio               pwm           pwm_fam             nseqs
      " 2.940038"           "CHEK1"      "CAMKL_CHK1"             "104"
             prob            effect
      "0.9971076"            "gain"

Let's imagine that a user uploads the mutation N310R. In column psite_pos, there is the PTM site that gets affected (313) and the kinase whose binding site is affected is shown in column pwm (CHEK1). Below is the logo for CHEK1.
[screenshot: sequence logo for CHEK1]

You can see that introducing an R residue next to the site creates all the necessary information for the kinase to find its site. The column mut_dist also shows that the mutation N310R is -3 steps away from the site at 313.

So if R is added, we have a gain of motif. We could indicate this GAIN in the logo with a light green box around R.

So looking at the second example, we have a loss of motif.

all_mimp_annotations[all_mimp_annotations[,1]=="NM_000546",][2,]
             gene               mut         psite_pos          mut_dist
      "NM_000546"           "Q100F"             " 99"              " 1"
               wt                mt          score_wt          score_mt
"PLSSSVPSQKTYQGS" "PLSSSVPSFKTYQGS"      "0.94418516"      "0.11160907"
        log_ratio               pwm           pwm_fam             nseqs
      "-3.080616"           "PRKDC"      "PIKK_DNAPK"             " 67"
             prob            effect
      "0.9970976"            "loss"

[screenshot: sequence logo for the second example]

the mutation Q100F would be very important in destroying the motif, because Q is a key residue in the motif (+1 steps away from the central residue at position 99).

We could indicate LOSS in the logo by drawing a red box around position +1 and drawing a top-left-to-bottom-right diagonal through the box.

"select all" in filtering the cancer types

In the list of checkboxes to select cancer types (and population types), we should have an additional box at the top that selects all types. This is active by default. If the box is clicked, all types become deselected. Someone interested in one type of cancer can then get the info quickly without unselecting every other type.

Also, let's add the short names of cancer types at the end of the cancer names. We can do the same with population types if we have that information. For example:
Glioblastoma multiforme (GBM)

table rows colored according to mutation impact

Let's change the colors of table rows according to the impact of mutations shown in needle plots: yellow, orange, red, dark red, gray. We should probably tune all the colors to be a little lighter so that the table remains easy to read.

protein view - mutation and site colors

We should simplify the coloring of mutations in the protein view because there are too many colors and more will come (more cancer types, inherited mutations, etc).

Let's color in five categories:

  1. gray - mutation in nonPTM site
  2. yellow - distal mutation (3-7 amino acids away from closest PTM site)
  3. orange - proximal mutation (1-2 amino acids away from closest PTM site)
  4. red - direct mutation (on PTM site)
  5. light blue - mutation affecting PTM sequence motif. These mutations could be either category (1), (2), or (3) but category (5) overrides categories (2-3) but not category (1) (more on this soon!)

Site coloring should change too. First idea is to color them by mutation impact - non-mutated PTM sites are gray. Mutated sites are colored with the above color scheme, according to which mutations are the most frequent. If a site has 8 distal mutations and 5 proximal mutations, the site should be colored yellow.
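The site-coloring idea could look roughly like this in Python. The function name and the input (a dictionary of mutation counts per impact for one site) are assumptions, and how to break ties between equally frequent impacts is not decided above, so the sketch simply takes the first maximum.

```python
IMPACT_COLORS = {"distal": "yellow", "proximal": "orange", "direct": "red"}


def site_color(impact_counts):
    """Color a PTM site by its most frequent mutation impact; gray if unmutated."""
    counts = {impact: n for impact, n in impact_counts.items() if n > 0}
    if not counts:
        return "gray"
    most_frequent = max(counts, key=counts.get)  # ties: first maximum wins
    return IMPACT_COLORS[most_frequent]


print(site_color({"distal": 8, "proximal": 5}))
print(site_color({}))
```

The first call reproduces the example above: 8 distal and 5 proximal mutations give a yellow site.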

tooltips should be dataset specific

Currently tooltips display all information about a specific mutation. While this may be useful in some cases, the tooltips tend to get very long and complicated. We should only show information about the selected dataset (cancers in TCGA, 1KG in thousand genomes, etc).

cancer types of mutations

We should update the cancer type of mutations (the table column title is misleading). The cancer type is actually shown in the sample_ID column field.

For example, the first line is shown below:
comments: NA blca TCGA-BL-A0C8-01A-11D-A10S-08 1 1

BLCA is the cancer type of the first mutation (third element when separated by space).
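A small Python sketch of the extraction. It assumes the line is split on whitespace with the leading "comments:" label counted as the first element, which is how BLCA ends up third. The function name is made up, and the lookup table only holds the two descriptions mentioned in these issues.

```python
# Descriptions taken from the issues here; the full list lives in cancer_types.txt.
CANCER_NAMES = {
    "BLCA": "Bladder urothelial carcinoma",
    "GBM": "Glioblastoma multiforme",
}


def cancer_type_from_comment(line):
    """Return the upper-cased cancer code (third whitespace-separated element)."""
    fields = line.split()
    return fields[2].upper()


line = "comments: NA blca TCGA-BL-A0C8-01A-11D-A10S-08 1 1"
code = cancer_type_from_comment(line)
print(f"{CANCER_NAMES.get(code, 'unknown')} ({code})")
```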

this is probably an enhancement of the database as well. generally cancer types are indicated by 4-letter codes and also have descriptions (BLCA is bladder urothelial carcinoma). I've copied cancer_types.txt to Dropbox for your reference.

TCGA mutations duplicated due to semicolons

Some TCGA mutations are currently duplicated. This is because Annovar aggregates the effects of duplicated mutations in the input and pastes these together with semicolons.

To solve this issue, we need to separate impacts by semicolon and take only the first string.

Unique patient-mutation pairs will remain the same when considering the comments field with TCGA barcode, as far as I understand.
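In Python the fix might look like the sketch below. The record layout (TCGA barcode plus an impact string) and the impact values are hypothetical; only the split-on-semicolon step comes from the description above.

```python
def first_impact(impact):
    """Keep only the first of several semicolon-joined impact strings."""
    return impact.split(";", 1)[0]


# Hypothetical (barcode, impact) records; the second impact was pasted twice.
records = [
    ("TCGA-BL-A0C8-01A-11D-A10S-08", "A12R"),
    ("TCGA-BL-A0C8-01A-11D-A10S-08", "A12R;A12R"),
]
unique_pairs = {(barcode, first_impact(impact)) for barcode, impact in records}
print(unique_pairs)
```

After the split, both records collapse to the same patient-mutation pair.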

mutation table

We should build a mutation table below the protein view. the following info for every mutation:

  1. coordinate / position in sequence
  2. reference AA
  3. mutated AA
  4. count observed in data
  5. mutation type - PTM or non-PTM
  6. impact on PTM - direct (on central site); flanking proximal (1-2 AA); flanking distal (3-7 AA). This is related to the closest PTM.
  7. number of PTMs this mutation affects

(6-7) only shown if (5) is PTM.

We should be able to sort tables by columns and filter by value of (5).
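Columns (5) to (7) can be derived from the mutation position and the list of PTM site positions. Below is a minimal Python sketch using the distance bins above (0 = direct, 1-2 = proximal, 3-7 = distal). The function name and return format are illustrative.

```python
def classify_mutation(position, ptm_sites, flank=7):
    """Derive mutation type, impact on the closest PTM, and number of PTMs affected."""
    distances = [abs(position - site) for site in ptm_sites]
    affected = sum(1 for d in distances if d <= flank)
    if affected == 0:
        return {"type": "non-PTM", "impact": None, "ptms_affected": 0}
    closest = min(distances)
    if closest == 0:
        impact = "direct"
    elif closest <= 2:
        impact = "proximal"
    else:
        impact = "distal"
    return {"type": "PTM", "impact": impact, "ptms_affected": affected}


# PTM site positions 99 and 313 are the TP53 sites from the MIMP examples.
sites = [99, 313]
print(classify_mutation(310, sites))  # 3 residues from site 313
print(classify_mutation(100, sites))  # 1 residue from site 99
print(classify_mutation(200, sites))  # no site within 7 residues
```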

filtering of PTM sites

we should be able to filter which PTM sites are considered in the analysis of mutations. By default, all PTMs are included, but the user can choose one PTM or several PTMs to be included.

This filtering should change needleplot, table and any other aspects of the website.

user input as collection of variants

One necessary way of inputting data is by chromosome coordinates of genome variants (chr, pos, ref, alt). For example, VCF and MAF files have that information. The user can have one or many (MANY) variants from a genome sequencing study he or she performed. Some of these variants can be in the dataset more than once, in which case a count should be shown.

We should convert these DNA coordinates to protein coordinates and show protein-level mutations as a table with links to protein views. Non-mapping coordinates should be excluded and a line of text will say how many were excluded. There is a filter that allows showing only PTM-site mutations (default=ON).
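A rough Python sketch of the counting and exclusion step. The dictionary stands in for the real DNA-to-protein table, the input lines are placeholders, and whether "how many were excluded" should count input lines or distinct variants is an open choice (the sketch counts input lines).

```python
from collections import Counter


def parse_variant(line):
    """Parse 'chr pos ref alt' into a hashable tuple."""
    chrom, pos, ref, alt = line.split()
    return (chrom, int(pos), ref, alt)


def summarise(lines, dna_to_protein):
    """Return (protein mutation, count) rows plus the number of excluded inputs."""
    counts = Counter(parse_variant(line) for line in lines if line.strip())
    rows = [(dna_to_protein[v], n) for v, n in counts.items() if v in dna_to_protein]
    excluded = sum(n for v, n in counts.items() if v not in dna_to_protein)
    return rows, excluded


# Placeholder mapping and user input.
dna_to_protein = {("chr1", 12345678, "A", "C"): "PROT1 345 S Q"}
lines = ["chr1 12345678 A C", "chr1 12345678 A C", "chr2 555 G T"]
rows, excluded = summarise(lines, dna_to_protein)
print(rows)
print(f"{excluded} variant(s) did not map to a protein and were excluded")
```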

As the number of variants is large, we may need to store the information on the server side and allow the user to retrieve the session for a limited time.
