reimandlab / activedriverdb Goto Github PK

ActiveDriverDB

License: GNU Lesser General Public License v2.1

Python 73.75% JavaScript 9.15% HTML 12.31% Shell 1.48% Sass 2.40% Nunjucks 0.91%

activedriverdb's People

wenmm kailicn krassowski

activedriverdb's Issues

adding stats to website

Hi Michal,

how about adding some numbers to the front page of the web site as an overview of the data we have

number of mutations in cancer, population, inherited disease
number of PTM sites considered
number of mutation annotations (all DNA>protein table + MIMP annotations)
number of mutations in PTM sites
number of mutations with MIMP effect
.. something else you remember

this will help viewers understand the impressive scope and amount of data we analyse.

ESP6500 global MAF is missing

This should be added to the last metadata column of the table

I have just discovered a typo in the repository name: "Visualistion" should be "Visualisation". I cannot access the settings, but it should be easy to change: on the bar under the title there should be a "Settings" tab (the last one). @reimand0, could you help me with that one?

selecting protein isoforms

We should create a list of preferred isoforms (refseq IDs), one per every gene. This isoform is shown as primary isoform when a target gene or kinase is shown. Other isoforms are available to view with links.

Let's use the following strategy, per gene

RefSeq with longest protein sequence
If multiple longest RefSeqs exist, choose the one with the smallest numeric ID in RefSeq (after removing NM_ and padding zeroes).

The last step is a heuristic but might represent older and more confident sequences. Let's investigate this a little further.
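A minimal sketch of the strategy above, assuming the isoforms of one gene are available as a mapping from RefSeq ID to protein sequence. The function names and the toy sequences are hypothetical and not taken from the repository.

```python
def refseq_numeric_id(refseq_id: str) -> int:
    """Return the numeric part of a RefSeq ID, e.g. 'NM_000546' -> 546."""
    accession = refseq_id.split(".")[0]      # drop a version suffix if present
    return int(accession.split("_", 1)[1])   # int() discards the padding zeroes


def preferred_isoform(isoforms: dict[str, str]) -> str:
    """Pick the preferred isoform from {refseq_id: protein_sequence}."""
    # Longest sequence wins; ties are broken by the smallest numeric ID.
    return min(
        isoforms,
        key=lambda rid: (-len(isoforms[rid]), refseq_numeric_id(rid)),
    )


# Toy example (made-up sequences, only the lengths matter here).
toy_isoforms = {
    "NM_001126112": "MEEPQSDPSV",
    "NM_000546": "MEEPQSDPSV",
    "NM_001126114": "MEEPQ",
}
print(preferred_isoform(toy_isoforms))
```

Running this once per gene would give the list of primary isoforms; the other isoforms stay available as links.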

network view

Let's create a network view of a protein and the kinases that interact with it. The size of each protein node correlates with its number of mutations. Clicking on a protein opens its protein view.

more visualisation ideas coming later.

needles overlapping with filters box not accessible

I would like to click on the mutation V600E in the BRAF protein. The filters menu pops up instead. We should probably move this menu elsewhere.

It also needs to be changed - add filtering by cancer types and PTM types. These changes are documented in other Issues.

https://rl-db.oicr.on.ca/protein/show/NM_004333

Thanks!

stability of network view

we should reduce the sensitivity of network view and make it fixed after shorter time of optimisation.

searching for proteins

we should improve search for proteins.

First, the current search by name appears to be search-as-you-type, which feels quite slow to me. It may be easier to search once the user has finished typing and submitted the keyword.

Second, search results should be ranked by increasing edit distance, that is, the number of single-character edits needed to turn the search term into the result. For example, currently when searching for TP53 the first result is TP53AIP1 and the 12th result is TP53. TP53 should come first, followed by the gene that needs the fewest edits (for example TP53I3 - two edits needed).
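The ranking could look roughly like the sketch below. It uses the standard Levenshtein distance (insertions, deletions, substitutions); the function names are hypothetical, and breaking ties alphabetically is an assumption that is not part of the request above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def rank_results(query: str, names: list[str]) -> list[str]:
    """Sort gene names by increasing edit distance to the query."""
    query = query.upper()
    return sorted(names, key=lambda name: (edit_distance(query, name.upper()), name))


print(rank_results("TP53", ["TP53AIP1", "TP53I3", "TP53"]))
```

With this ordering TP53 (0 edits) comes before TP53I3 (2 edits), which comes before TP53AIP1 (4 edits).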

Mutation counts in statistics do not sum up to the total mutation count

Current statistics indicate there might be some bug in counting or import script:
All: 2559719
ClinVar: 181457
TCGA: 478498
ESP 6500: 1319413
1K Genomes: 1071901

2559719 != 181457 + 478498 + 1319413 + 1071901 (= 3051269)

mutation counts not represented correctly in needle plots?

Hello, it looks like there might be a bug in how mutations are represented in needle plots. Whatever protein I look at, the maximum mutation count is almost never more than 19.

For example, TP53 R175H should be present 56 times, as also seen in the table, but in the needle plot it is shown with only 5 copies (?).

network visualisation #2 (add zoom buttons)

we should add zoom buttons (zoom in, zoom out, zoom to fit window)

zero mutation counts at ClinVar view

https://rl-db.oicr.on.ca/protein/show/NM_002834?filter%5Bsources%5D%5Bcmp%5D=in&filter%5Bsources%5D=ClinVar&filter%5Bis_ptm%5D=None&fallback=True

For example, N18S has count zero.

Should we keep or remove these?

Last column in mutation table

We should update the last column(s) in the mutation table so that they show minimal information. This information will be moved to the record that opens when (+) is clicked.

TCGA:
2 cancer types: GBM, KIRC

ClinVar:
2 disease annotations: Noonan syndrome, Inherited cancer

G1000, ESP6500:
MAF: 0.1234

Migrate to OICR server

network visualisation #1 (reduce the size of nodes)

we should improve the network visualisation. It can become very messy now if there are many interactors.

we should reduce the size of nodes. A fixed-size font with maximum gene length 7 would be a good size of a node I think.

REST service for exporting mutations

We should have a web service interface to get mutations from the database. The output should contain protein mutations, status relative to PTM sites, MIMP status, mutation counts (cancer) or allele frequencies (population), further meta-information (cancer type, disease name, population subgroup name).

potential input queries to discuss

protein mutation - NM_000123 A12R
DNA mutation - chr1 123456 A T
all mutations in gene - NM_000123
all mutations in gene and dataset - NM_000123 ESP6500
all mutations in range of DNA coordinates - chr1 123456 223456
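To help the discussion, here is a rough sketch of how the five query shapes listed above could be told apart before hitting the database. The function name, the returned keys and the "type" labels are all hypothetical; the real service may well use separate URL routes instead of one free-text parser.

```python
import re


def parse_query(query: str) -> dict:
    """Classify a whitespace-separated query string into one of five kinds."""
    tokens = query.split()
    if not tokens:
        raise ValueError("empty query")
    first = tokens[0]
    if first.startswith("chr"):
        if len(tokens) == 4:   # chr1 123456 A T
            return {"type": "dna_mutation", "chrom": first, "pos": int(tokens[1]),
                    "ref": tokens[2], "alt": tokens[3]}
        if len(tokens) == 3:   # chr1 123456 223456
            return {"type": "dna_range", "chrom": first,
                    "start": int(tokens[1]), "end": int(tokens[2])}
    elif re.fullmatch(r"NM_\d+", first):
        if len(tokens) == 1:   # NM_000123
            return {"type": "gene", "refseq": first}
        if len(tokens) == 2:
            match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", tokens[1])
            if match:          # NM_000123 A12R
                ref, pos, alt = match.groups()
                return {"type": "protein_mutation", "refseq": first,
                        "pos": int(pos), "ref": ref, "alt": alt}
            # NM_000123 ESP6500
            return {"type": "gene_dataset", "refseq": first, "dataset": tokens[1]}
    raise ValueError(f"unrecognised query: {query!r}")


print(parse_query("NM_000123 A12R"))
```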

website organisation

We should design a page structure that would include information pages in addition to the actual database. For example - about, help & FAQ, contact us. Also the front page might have some information. Can we set up a simple content management system to add pages and edit these interactively?

showing records in protein search

Let's make the search results a little more compact and add some info:

CURRENT VERSION:

TP53

17
+
NM_000546

Coding region: start 7572926 end 7579912 Length: 6986 Transcription positions: start 7571719 end 7590868
protein network (56)

PROPOSED VERSION

PTM sites to merge

in the protein view, we should merge PTM flanking regions of individual PTM sites.

For example:

site1 at S13
site2 at S15

region1 is 13-7 until 13+7
region2 is 15-7 until 15+7
region_merged is 13-7 until 15+7
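The merging in the example can be sketched as a plain interval merge. This is only an illustration: the function name is hypothetical, the flank of 7 comes from the example above, and clipping regions to the protein boundaries is left out.

```python
def merge_flanking_regions(site_positions, flank=7):
    """Merge overlapping [pos - flank, pos + flank] regions of PTM sites."""
    regions = sorted((pos - flank, pos + flank) for pos in site_positions)
    merged = []
    for start, end in regions:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(merge_flanking_regions([13, 15]))
```

For sites at S13 and S15 this gives one region from 13-7 to 15+7, i.e. (6, 22). Sites that are far apart keep separate regions.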

headers and metainfo in mutation table

New header names:

Pos,
Ref,
Mut,
PTM impact,
PTMs \n affected,
Summary

Tooltips for header names

Position of mutation in protein
Reference amino acid residue in protein
Mutated amino acid residue in protein
Impact on closest PTM site
Number of adjacent PTMs affected
Mutation summary (click on + for more info)

MIMP is not a mutation dataset

MIMP is a type of mutation label instead - so it should not be under the data source filter in the top right corner.

PTM sites by color

We have four types of PTM sites. These are now shown as green blocks in needleplot. We should use colors to distinguish the different types of PTM sites.

dark blue - phosphorylation
dark green - ubiquitination
light blue - acetylation
light green - methylation

click+labels on needles

when clicking on needles, it would be great to open a small window that shows information on mutations:

what mutation
what cancer types + counts (or population groups or diseases)
if PTM mutation, what PTM sites are affected + impacts (direct, proximal, distal, MIMP)
if MIMP mutation, show logo(s) affected

if this information is too much, then we could do the following

show minimal info in popup
have link [show more], which scrolls the page down to the table where these mutations are shown. Rows with that mutation will be highlighted with a different color.

Also, the needle heads should be even larger.

site/mutation visual by color

Let's visualise PTM sites and mutations by their impact.

site with mutation - dark blue
site without mutation - light blue

non-PTM mutation - gray
PTM mutation, direct - red
PTM mutation, flank proximal - orange
PTM mutation, flank distal - yellow

needleplots in cancer type filtering

needles should change their height when cancer types are selected and deselected.

mutation info in table and tooltip

we should have a link in the mutation table (and mutation tooltip) that will expand the row in the mutation table and provide more information about the mutation. The information is already shown in the tooltip. When user clicks on tooltip, the page will scroll down and expand the information in the table row.

Alternatively, we can open a new page but this will make the page slower, especially when the user wants to go back and forth between many mutations.

table records with MIMP results not showing

Looks like a bug in showing the mutation table. Network/rewiring mutations show no records

for example, TP53 P47S in ESP6500

https://rl-db.oicr.on.ca/protein/show/NM_000546?filter%5Bsources%5D%5Bcmp%5D=in&filter%5Bsources%5D=ESP6500&filter%5Bis_ptm%5D=None&fallback=True

kinase family

let's remove the kinases of the kinase family that have no binding sites in the protein in the view. I know that I previously suggested the opposite, but some families are quite large and it simplifies the view by not showing those.

distal mutation color in needleplot

These mutations should be yellow, but are currently shown as black

database code

Can we please add database code to the repository.

network search does not work

the search bar at https://rl-db.oicr.on.ca/network/ does not work

PTM mutation switch

we should have a global switch to show all mutations or only PTM mutations. This should be reflected in the protein view, network view and mutations table.

mapping DNA to protein

We should start with mapping DNA-level coordinates to protein coordinates. I have precomputed a table with all single nucleotide variants that correspond to PTM site mutations available below

http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004919
S1 File and S2 File.

I will compute a similar larger table for all SNVs that change protein sequence (work in progress).

The goal is the following: a person comes in with a set of variants at the DNA level (chr1 12345678 A C). We will look each one up in the table, return PROT1 345 S Q (the corresponding protein variant), and say whether it is a PTM variant or a non-PTM variant.

This table needs fast indexing because it is quite large, and people may have a large number of variants to test. We want this to be as interactive as possible, but we may need to implement an offline calculation in the future.
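As a small illustration of the lookup, the sketch below keeps the precomputed table in an in-memory dictionary keyed by (chromosome, position, ref, alt). The names and the toy row are hypothetical. For the real table a database index on the same four columns would play the role of the dictionary, and one DNA variant may map to several isoforms, which this sketch ignores.

```python
from typing import NamedTuple, Optional


class ProteinVariant(NamedTuple):
    protein: str
    position: int
    ref_aa: str
    alt_aa: str
    is_ptm: bool


def build_index(rows):
    """Index rows of (chrom, pos, ref, alt, ProteinVariant) by the DNA-level key."""
    return {(chrom, pos, ref, alt): variant for chrom, pos, ref, alt, variant in rows}


def lookup(index, query: str) -> Optional[ProteinVariant]:
    """Look up a query such as 'chr1 12345678 A C'; None means it does not map."""
    chrom, pos, ref, alt = query.split()
    return index.get((chrom, int(pos), ref, alt))


# Toy table with the single example row from the text above.
index = build_index([
    ("chr1", 12345678, "A", "C", ProteinVariant("PROT1", 345, "S", "Q", True)),
])
print(lookup(index, "chr1 12345678 A C"))
```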

filtering of mutations

we should be able to filter mutations by cancer type and population subgroups.

First, a user selects the dataset of mutations that are used (four datasets - TCGA cancer, ESP6500 population, 1000G population, ClinVar).

Second, a new filtering panel becomes available that shows cancer types or population subgroups. This could be a multiple choice form. By default, all types/subgroups are active.

I think ClinVar currently has no such grouping, but I will look at the data and see if it makes sense to add one option like this. For example, maybe there are high-quality and low-quality mutations in clinvar.

MIMP multiple logos

As discussed, we currently only show the first MIMP logo. We instead should show the best logo per gain and the best logo per loss. If only one type is present, then we should show the best for that type. We should choose the logo(s) with the highest probability score.

Under the loss logo, we could show a list of other kinases that also had losses with lower probability scores. Same for the gain logo.

In other places of the protein view, like the table and the tooltip, we should indicate if both loss and gain are predicted in the network-rewiring mutation.
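The selection could be sketched as below, assuming each MIMP annotation is a record with the fields pwm, prob and effect that appear in the R output of the MIMP table. The function name and the toy values are hypothetical.

```python
def best_logos(annotations):
    """Return the highest-probability annotation per effect ('gain' / 'loss')."""
    best = {}
    for annotation in annotations:
        effect = annotation["effect"]
        if effect not in best or annotation["prob"] > best[effect]["prob"]:
            best[effect] = annotation
    return best


# Toy annotations for one mutation (made-up kinases and probabilities).
toy = [
    {"pwm": "KINASE_A", "prob": 0.91, "effect": "gain"},
    {"pwm": "KINASE_B", "prob": 0.99, "effect": "gain"},
    {"pwm": "KINASE_C", "prob": 0.95, "effect": "loss"},
]
chosen = best_logos(toy)
print({effect: record["pwm"] for effect, record in chosen.items()})
```

If the result has both a gain and a loss key, the table and the tooltip can flag the mutation as having both effects; the remaining kinases of each effect can be listed under the corresponding logo.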

search form

In the current version, we have two tabs for searching. Instead, let's have one page with two fields:
a. a smaller one-line field for gene/protein (grayed-out value "TP53 or NM_000546")
b. a larger multi-line field for mutations (grayed-out value to include a few DNA- and protein-level mutations).

There should be a "run example" button or link that the user can click.

visualising MIMP mutations

Here's an example with MIMP mutations.

In R, this is how we see the top MIMP-mutation in TP53 (they are ordered by prob)
all_mimp_annotations[all_mimp_annotations[,1]=="NM_000546",][1,]

gene mut psite_pos mut_dist
"NM_000546" "N310R" "313" "-3"
wt mt score_wt score_mt
"RALPNNTSSSPQPKK" "RALPRNTSSSPQPKK" "0.12003083" "0.92115458"
log_ratio pwm pwm_fam nseqs
" 2.940038" "CHEK1" "CAMKL_CHK1" "104"
prob effect
"0.9971076" "gain"

Let's imagine that a user uploads the mutation N310R. In column psite_pos, there is the PTM site that gets affected (313) and the kinase whose binding site is affected is shown in column pwm (CHEK1). Below is the logo for CHEK1.

You can see that introducing an R residue at this position supplies the information the kinase needs to recognise its site. The column mut_dist also shows that the mutation N310R is -3 steps away from the site at 313.

So if R is added, we have a gain of motif. We could indicate this GAIN in the logo with a light green box around R.

So looking at the second example, we have a loss of motif.

all_mimp_annotations[all_mimp_annotations[,1]=="NM_000546",][2,]
gene mut psite_pos mut_dist
"NM_000546" "Q100F" " 99" " 1"
wt mt score_wt score_mt
"PLSSSVPSQKTYQGS" "PLSSSVPSFKTYQGS" "0.94418516" "0.11160907"
log_ratio pwm pwm_fam nseqs
"-3.080616" "PRKDC" "PIKK_DNAPK" " 67"
prob effect
"0.9970976" "loss"

the mutation Q100F would be very important in destroying the motif, because Q is a key residue in the motif (+1 steps away from the central residue at position 99).

We could indicate LOSS in the logo by drawing a red box around position +1 and drawing a top-left-to-bottom-right diagonal through the box.
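For reference, the numbers in the two examples can be re-derived with a few lines of Python. The reported log_ratio values are consistent with log2(score_mt / score_wt), and mut_dist with mutation position minus site position. Deriving the gain/loss label from the sign of the log ratio is only an assumption made for this sketch; the table itself carries the effect column.

```python
import math


def mimp_summary(mutation_pos, psite_pos, score_wt, score_mt):
    """Recompute mut_dist and log_ratio, and guess the effect from the sign."""
    mut_dist = mutation_pos - psite_pos
    log_ratio = math.log2(score_mt / score_wt)
    effect = "gain" if log_ratio > 0 else "loss"   # assumption, see note above
    return mut_dist, log_ratio, effect


# N310R near the site at 313, and Q100F near the site at 99.
print(mimp_summary(310, 313, 0.12003083, 0.92115458))
print(mimp_summary(100, 99, 0.94418516, 0.11160907))
```

The mut_dist value also tells us which logo position to box: -3 for the gain example and +1 for the loss example.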

mutations in table should reflect mutations on needleplot

currently ClinVar view seems to reflect TCGA mutations

sites missing for network-rewiring mutations in needleplot tooltips (protein view)

note "sites: object [Object]" in screenshot attached. thanks!

"select all" in filtering the cancer types

In the list of checkboxes to select cancer types (and population types), we should have an additional box on the top that selects all types. this is active by default. If the box is clicked, all types become deselected. If someone is interested in one type of cancer, they can get the info quickly and don't need to unselect every other type.

Also, let's add the short names of cancer types at the end of the cancer names. We can do the same with population types if we have that information. For example:
Glioblastoma multiforme (GBM)

table rows colored according to mutation impact

Let's change the colors of table rows according to the impact of mutations shown in needle plots - yellow-orange-red-darkred-gray. We should probably tune all the colors to be lighter so that the table remains easy to read.

protein view - mutation and site colors

We should simplify the coloring of mutations in the protein view because there are too many colors and more will come (more cancer types, inherited mutations, etc).

Let's color in four categories:

(1) gray - mutation in nonPTM site
(2) yellow - distal mutation (3-7 amino acids away from closest PTM site)
(3) orange - proximal mutation (1-2 amino acids away from closest PTM site)
(4) red - direct mutation (on PTM site)
(5) light blue - mutation affecting PTM sequence motif. These mutations could be either category (1), (2), or (3) but category (5) overrides categories (2-3) but not category (1) (more on this soon!)

Site coloring should change too. First idea is to color them by mutation impact - non-mutated PTM sites are gray. Mutated sites are colored with the above color scheme, according to which mutations are the most frequent. If a site has 8 distal mutations and 5 proximal mutations, the site should be colored yellow.
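The site-coloring idea could be sketched as follows, assuming each site comes with a list of impact labels of the mutations that affect it. The names are hypothetical, the motif (light blue) category is left out, and how to break ties between equally frequent impacts is not specified above (here the impact seen first wins).

```python
from collections import Counter

# Colors of the distance-based mutation categories from the list above.
IMPACT_COLORS = {"distal": "yellow", "proximal": "orange", "direct": "red"}


def site_color(impacts):
    """Color a PTM site by the most frequent impact among its mutations."""
    if not impacts:
        return "gray"   # non-mutated site
    most_common_impact, _count = Counter(impacts).most_common(1)[0]
    return IMPACT_COLORS[most_common_impact]


# The example from the text: 8 distal and 5 proximal mutations.
print(site_color(["distal"] * 8 + ["proximal"] * 5))
```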

tooltips should be dataset specific

currently tooltips display all information about a specific mutation. while this may be useful in some cases, the tooltips are prone to get very long and complicated. we should only show information about the selected dataset (cancers in TCGA, 1KG in thousand genomes, etc).

cancer types of mutations

We should update cancer type of mutations (the table column title is misleading). cancer type is actually shown in the sample_ID column field.

For example, the first line is shown below:
comments: NA blca TCGA-BL-A0C8-01A-11D-A10S-08 1 1

BLCA is the cancer type of the first mutation (third element when separated by space).

This is probably an enhancement of the database as well. Cancer types are generally indicated by short codes of up to four letters (for example GBM, BLCA) and also have descriptions (BLCA is bladder urothelial carcinoma). I've copied cancer_types.txt to Dropbox for your reference.

tutorial to update input data for database

we need a tutorial to update the database with new information or sites. This should be part of developer documentation, perhaps the github wiki.

TCGA mutations duplicated due to semicolons

Some TCGA mutations are currently duplicated. This is because Annovar aggregates the effects of mutations that are duplicated in the input and pastes them together with semicolons.

To solve this issue, we need to separate impacts by semicolon and take only the first string.

Unique patient-mutation pairs will remain the same when considering the comments field with TCGA barcode, as far as I understand.
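A possible shape of the fix in the import script, as a sketch only. It assumes each record is an (impact string, TCGA barcode) pair; the function names and the toy impact strings are hypothetical and do not reproduce the real Annovar output format.

```python
def first_impact(impact_field: str) -> str:
    """Keep only the first of several semicolon-separated impacts."""
    return impact_field.split(";")[0].strip()


def unique_patient_mutations(records):
    """Drop repeated (impact, barcode) pairs while keeping the input order."""
    seen = set()
    unique = []
    for impact_field, barcode in records:
        key = (first_impact(impact_field), barcode)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Toy records: the first one has its impact pasted together twice.
toy_records = [
    ("NM_000546:R175H;NM_000546:R175H", "TCGA-BL-A0C8-01A-11D-A10S-08"),
    ("NM_000546:R175H", "TCGA-BL-A0C8-01A-11D-A10S-08"),
]
print(unique_patient_mutations(toy_records))
```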

mutation table

We should build a mutation table below the protein view. the following info for every mutation:

(1) coordinate / position in sequence
(2) reference AA
(3) mutated AA
(4) count observed in data
(5) mutation type - PTM or non-PTM
(6) impact on PTM - direct (on central site); flanking proximal (1-2 AA); flanking distal (3-7 AA). This is related to the closest PTM.
(7) number of PTMs this mutation affects

(6-7) only shown if (5) is PTM.

We should be able to sort tables by columns and filter by value of (5).
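Columns (5) to (7) could be computed along these lines. This is a sketch under the distance thresholds given above (0 = direct, 1-2 = proximal, 3-7 = distal); the function names are hypothetical, and a return value of None stands for a non-PTM mutation.

```python
def ptm_impact(mutation_pos, site_positions):
    """Impact relative to the closest PTM site, or None for a non-PTM mutation."""
    if not site_positions:
        return None
    distance = min(abs(mutation_pos - site) for site in site_positions)
    if distance == 0:
        return "direct"
    if distance <= 2:
        return "proximal"
    if distance <= 7:
        return "distal"
    return None


def affected_site_count(mutation_pos, site_positions, flank=7):
    """Number of PTM sites whose flanking region contains the mutation."""
    return sum(abs(mutation_pos - site) <= flank for site in site_positions)


print(ptm_impact(14, [13, 15, 40]), affected_site_count(14, [13, 15, 40]))
```

Filtering by the value of (5) then amounts to checking whether ptm_impact returned None.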

filtering of PTM sites

we should be able to filter which PTM sites are considered in the analysis of mutations. By default, all PTMs are included, but the user can choose one PTM or several PTMs to be included.

This filtering should change needleplot, table and any other aspects of the website.

user input as collection of variants

One necessary way of inputting data is by chromosome coordinates of genome variants (chr, pos, ref, alt). For example, VCF and MAF files have that information. The user can have one or many (MANY) variants from a genome sequencing study they performed. Some of these variants can occur in the dataset more than once, in which case a count should be shown.

We should convert these DNA coordinates to protein coordinates and show protein-level mutations as a table with links to protein views. Non-mapping coordinates should be excluded, and a line of text will say how many were excluded. A filter allows showing only PTM-site mutations (default=ON).

As the number of variants is large, we may need to store the information on the server side and allow the user to retrieve the session for a limited time.

zooming in protein and network view

We should be able to zoom in and out in protein view
