poonlab / ovrf-viz

Review article on overlapping reading frames in viruses

License: MIT License

Python 43.91% R 40.50% TeX 8.90% HTML 1.39% CSS 1.00% JavaScript 3.44% Shell 0.85%

ovrf-viz's Issues

Paper

Figures of general distribution: draw Trend-Line

Create one plot for at least one family in every Baltimore class

Baltimore class	Family	Done
ss DNA	Parvoviridae
ds DNA	Adenoviridae, Herpesviridae, Picovirinae (from Caudovirales)	Adenoviridae (X)
(-) RNA	Rhabdoviridae, Mononegavirales	X, X
(+) RNA	Coronaviridae, Flaviviridae	X, X
ds RNA	Reoviridae (rotavirus)	X
Retrovirus	Retroviridae	X

Inconsistent labeling of virus family or order

Currently the Family field contains Order terms, or another taxonomic level altogether. For example, Riboviria is a "realm" and currently the largest category in Family.

We are seeing a large number of virus genomes with a single protein. This appears to be a misannotation in the NCBI database. For example, Bovine mastadenovirus C is listed as a virus genome with accession number NC_043093.1, but this record is labeled "Bovine adenovirus 10 isolate Ma268 hexon gene, partial cds".

In addition, this record has far too many entries in find_ovrfs.csv:

> overlaps[overlaps$accn=='NC_043093',]
            accn                          prod1   loc1 dir1
106352 NC_043093           hypothetical protein   2722   -1
106353 NC_043093           hypothetical protein   3021    1
106354 NC_043093                 transactivator   9623   -1
106355 NC_043093           hypothetical protein  15602    1
106356 NC_043093           hypothetical protein  18178   -1
106357 NC_043093       antigenic virion protein  19957   -1
106358 NC_043093                   glycoprotein  34850   -1
106359 NC_043093 Polymerase processivity factor  38918   -1
106360 NC_043093        capsid assembly protein  42998    1
106361 NC_043093        putative virion protein  54205   -1
106362 NC_043093                 virion protein  55371    1
106363 NC_043093                 Glycoprotein B  60707   -1
106364 NC_043093      transport/capsid assembly  63153   -1
106365 NC_043093               putative dUTPase  75142   -1
106366 NC_043093                  viron protein  82042    1
106367 NC_043093           hypothetical protein  97527    1
106368 NC_043093             Putative terminase  98576   -1
106369 NC_043093           hypothetical protein  99920    1
106370 NC_043093               tegument protein 100554    1
106371 NC_043093               tegument protein 101839    1
106372 NC_043093           hypothetical protein 103755    1
106373 NC_043093           hypothetical protein 104816    1
106374 NC_043093    Myristylated virion protein 108265    1
106375 NC_043093       helicase/primase complex 111950    1
106376 NC_043093         putative viron protein 114631   -1
106377 NC_043093       Helicase/primase complex 116414    1
106378 NC_043093           hypothetical protein 156043   -1
106379 NC_043093           hypothetical protein 156342    1

Get proportion of overlapping edges over total edges

Overlap lengths not being computed correctly

> which.max(virus$len.overlaps)
[1] 5223
> virus[5223,]
              Family                      Genome
5223 Closteroviridae Diodia vein chlorosis virus
       Source.information                  Accession Date.completed
5223 isolate:Fayetteville ['NC_038787', 'NC_038786']     08/24/2018
     Date.updated Genome.length Number.of.proteins   Host first.acc
5223   08/24/2018         16230                 10 plants NC_038787
     n.overlaps len.overlaps
5223          6      18017.5
> overlaps[overlaps$accn=='NC_038787',]
            accn                              prod1 loc1 dir1
119451 NC_038787 heat shock protein 70-like protein  935    1
119452 NC_038787                                 p6 2608    1
119453 NC_038787                                p60 2769    1
119454 NC_038787                                p10 4304    1
119455 NC_038787                                 CP 4545    1
119456 NC_038787                                p77 5269    1
                                    prod2 loc2 dir2 seqlen1 seqlen2
119451                                 p5  865    1    1674     132
119452 heat shock protein 70-like protein  935    1     168    1674
119453                                 p6 2608    1    1557     168
119454                                p60 2769    1     255    1557
119455                                p10 4304    1     756     255
119456                                 CP 4545    1    2025     756
       overlap shift
119451      62     1
119452       1     2
119453       7     2
119454      22     2
119455      14     1
119456      32     1
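
As a quick sanity check, the overlap column of the six rows listed above can be summed by hand and compared with the reported len.overlaps. The sketch below only uses numbers copied from the output above; note that the record lists two accessions, so the second segment (NC_038786) could add to the total, but the reported value is not an integer and is larger than the listed Genome.length.

```python
# Overlap lengths copied from the six rows shown above for NC_038787.
overlaps_nt = [62, 1, 7, 22, 14, 32]

# Values copied from virus[5223, ] above.
reported_len_overlaps = 18017.5
genome_length = 16230

total = sum(overlaps_nt)
print(total)                                   # 138
print(reported_len_overlaps > genome_length)   # True
```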

Self edge formation in visualization method

Apply clustering method to distance matrix from #12

Ideally the clustering method should be able to:

operate on a pairwise distance matrix,
automatically select the optimal number of clusters,
be consistent with Genbank annotations
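
A minimal sketch of the first two requirements, assuming average-linkage agglomerative clustering and the mean silhouette width as the criterion for choosing the number of clusters. Both choices are illustrative assumptions rather than the method adopted in this repository, and the function names are hypothetical. Consistency with GenBank annotations is not addressed here and would need a separate comparison of cluster labels against product names.

```python
def average_linkage(dist, k):
    """Agglomerative clustering (average linkage) on a square distance
    matrix. Returns one cluster label per item."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)   # b > a, so index a stays valid
    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def mean_silhouette(dist, labels):
    """Mean silhouette width; singleton clusters contribute 0."""
    n = len(labels)
    scores = []
    for i in range(n):
        same = [dist[i][j] for j in range(n)
                if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n


def choose_k(dist, k_max):
    """Try k = 2..k_max and keep the clustering with the best silhouette."""
    best_k, best_labels, best_score = None, None, -1.0
    for k in range(2, k_max + 1):
        labels = average_linkage(dist, k)
        score = mean_silhouette(dist, labels)
        if score > best_score:
            best_k, best_labels, best_score = k, labels, score
    return best_k, best_labels


# Toy distance matrix with two tight pairs of sequences.
toy = [[0, 1, 9, 9],
       [1, 0, 9, 9],
       [9, 9, 0, 1],
       [9, 9, 1, 0]]
k, labels = choose_k(toy, k_max=3)
print(k, labels)   # 2 [0, 0, 1, 1]
```

This pure-Python version is cubic or worse in the number of sequences, so it is only meant to make the idea concrete on small inputs.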

Retrovirus clustering

The retrovirus genome plot has some accession numbers where all proteins appear to be assigned to the same cluster. Why?

Developing new visualization method

The current idea is to use a graph where each node represents a cluster of protein sequences (presumably homologous with respect to sequence and function), and each edge represents the prevalence of overlaps between the respective protein clusters (genes).

We might also want to incorporate information about the adjacency of genes into the graph (for example, genes that tend to be close together or adjacent in genome sequences could be connected by another type of edge).
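
The graph described above could be assembled roughly as follows. This is a sketch under stated assumptions: each gene is given as a hypothetical (cluster, start, end) tuple with inclusive coordinates, edge weights are simple counts across genomes, and the function name build_edges is made up for illustration.

```python
from collections import Counter
from itertools import combinations


def build_edges(genomes):
    """genomes: list of genomes, each a list of (cluster, start, end).

    Returns two Counters keyed by a sorted pair of cluster names:
    how often genes of the two clusters overlap, and how often they
    are neighbours along the genome.
    """
    overlap_edges = Counter()
    adjacent_edges = Counter()
    for genes in genomes:
        genes = sorted(genes, key=lambda g: g[1])
        # Overlap edges: any pair of genes sharing at least one position.
        for (c1, s1, e1), (c2, s2, e2) in combinations(genes, 2):
            if max(s1, s2) <= min(e1, e2):
                overlap_edges[tuple(sorted((c1, c2)))] += 1
        # Adjacency edges: consecutive genes when ordered by start.
        for (c1, _, _), (c2, _, _) in zip(genes, genes[1:]):
            adjacent_edges[tuple(sorted((c1, c2)))] += 1
    return overlap_edges, adjacent_edges


# Two made-up genomes with illustrative coordinates.
genomes = [
    [("gag", 1, 100), ("pol", 90, 300), ("env", 310, 400)],
    [("gag", 1, 120), ("pol", 100, 320), ("env", 300, 420)],
]
overlap_edges, adjacent_edges = build_edges(genomes)
print(overlap_edges[("gag", "pol")])    # 2
print(overlap_edges[("env", "pol")])    # 1
print(adjacent_edges[("env", "pol")])   # 2
```

If two overlapping genes in one genome fall in the same cluster, the key has the same name twice, which shows up as a self edge in the graph.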

Myoviridae phage with no overlapping genes?

> virus[virus$Family=='Myoviridae',][285,]
         Family                 Genome Source.information
1218 Myoviridae Klebsiella phage K64-1                  -
     Accession Date.completed Date.updated Genome.length
1218 NC_027399     06/24/2015   08/13/2018        346602
     Number.of.proteins     Host first.acc n.overlaps
1218                 64 bacteria NC_027399          0
     len.overlaps
1218           NA

find_ovrfs: determine type of overlap (+0, +1, ...)

Needs to use strand and coords info
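
One possible way to derive the overlap type from strand and coordinates is sketched below. The convention used (offset between the first codon positions modulo 3, with a "-" prefix for genes on opposite strands) is an assumption, and overlap_type is a hypothetical name. For the first two NC_038787 rows shown in the overlap-length issue above, it reproduces the shift values 1 and 2 when end coordinates are taken as loc + seqlen - 1, which is also an assumption.

```python
def overlap_type(start1, end1, strand1, start2, end2, strand2):
    """Classify an overlap as '+0', '+1', '+2' (same strand) or
    '-0', '-1', '-2' (opposite strands).

    Coordinates are inclusive; strand is 1 or -1. The offset is the
    distance between the first codon positions of the two genes,
    modulo 3. This convention is an assumption, not necessarily the
    one used in find_ovrfs.py.
    """
    # The first base of the first codon is the start coordinate on the
    # forward strand and the end coordinate on the reverse strand.
    anchor1 = start1 if strand1 == 1 else end1
    anchor2 = start2 if strand2 == 1 else end2
    sign = "+" if strand1 == strand2 else "-"
    return f"{sign}{abs(anchor1 - anchor2) % 3}"


# HSP70-like protein (935..2608) versus p5 (865..996), both forward.
print(overlap_type(935, 2608, 1, 865, 996, 1))     # +1
# p6 (2608..2775) versus HSP70-like protein (935..2608), both forward.
print(overlap_type(2608, 2775, 1, 935, 2608, 1))   # +2
```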

HTTP Error

When trying to retrieve the new table with virus information, I get:

  File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: Internal Server Error
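
If the 500 response turns out to be transient on the server side, one workaround is to retry the request a few times before giving up. The helper below is a hypothetical sketch (fetch_with_retry and its parameters are not part of the repository) and it will not help if the server rejects the request consistently.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, retries=3, delay=1.0,
                     opener=urllib.request.urlopen):
    """Return the response body, retrying on HTTP 5xx errors.

    `opener` is injectable so the function can be exercised without
    network access; by default it is urllib.request.urlopen.
    """
    for attempt in range(1, retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Client errors (4xx) are not retried; neither is the
            # last failed attempt.
            if err.code < 500 or attempt == retries:
                raise
            time.sleep(delay * attempt)   # simple linear back-off
```

Calling fetch_with_retry(url) with the table URL would make up to three attempts, waiting 1 s and then 2 s between them.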

Visualization bugs

Sometimes the adjacency graphs are displayed packed into the top-left corner
Some genome plots, such as Geminiviridae, are not being displayed

k-mer distance matrix

Evaluating the use of k-mers to calculate a dot-product (kernel) matrix for comparing protein sequences without an alignment step.
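
A minimal sketch of such a kernel, assuming a plain k-mer spectrum (counts of all overlapping k-mers) and a cosine-style normalization to turn the dot product into a distance. The value k=3 and the function names are illustrative assumptions.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k=3):
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_kernel(seq1, seq2, k=3):
    """Dot product of the two k-mer count vectors."""
    c1, c2 = kmer_counts(seq1, k), kmer_counts(seq2, k)
    return sum(n * c2[kmer] for kmer, n in c1.items())


def kmer_distance(seq1, seq2, k=3):
    """1 - normalized kernel; 0 for identical k-mer profiles and 1 when
    no k-mer is shared. Both sequences must be at least k long."""
    k12 = kmer_kernel(seq1, seq2, k)
    k11 = kmer_kernel(seq1, seq1, k)
    k22 = kmer_kernel(seq2, seq2, k)
    return 1.0 - k12 / sqrt(k11 * k22)


print(kmer_kernel("MKVLA", "MKVLA"))     # 3
print(kmer_distance("MKVLA", "MKVLA"))   # 0.0
print(kmer_distance("MKVLA", "GGGGG"))   # 1.0
```

Applying kmer_distance to every pair of protein sequences gives the pairwise distance matrix without an alignment step.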

Re-analysing the entire NCBI database

The new table that contains all accession numbers is now organized as:

## Neighbors data for complete genomes: Viruses (taxid 10239)
## Columns:     "Representative"        "Neighbor"      "Host"  "Selected lineage"      "Taxonomy name" "Segment name"
NC_003663       HQ420896        human,vertebrates       Poxviridae,Orthopoxvirus,Cowpox virus   Cowpox virus    segment  
NC_003663       KY463519        human,vertebrates       Poxviridae,Orthopoxvirus,Cowpox virus   Cowpox virus    segment  
NC_003663       HQ420897        human,vertebrates       Poxviridae,Orthopoxvirus,Cowpox virus   Cowpox virus    segment  
NC_003663       MK035759        human,vertebrates       Poxviridae,Orthopoxvirus,Cowpox virus   Cowpox virus    segment  
NC_003663       KY569019        human,vertebrates       Poxviridae,Orthopoxvirus,Cowpox virus   Cowpox virus    segment  
NC_003663       KY549145        human,vertebrates       Poxviridae,Orthopoxvirus,Cowpox virus   Cowpox virus    segment

The source of this table states that the "neighbors" dataset was created to fill the diversity gaps that the RefSeq dataset does not cover, and that its entries are also complete or nearly complete viral genomes. Should we use "neighbors" or continue with the RefSeq sequences?
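
Whichever dataset is chosen, the table layout shown above can be read with the standard csv module. The sketch below assumes the file is tab-delimited and that header lines start with "##"; the short column names and the function read_neighbors are hypothetical.

```python
import csv
import io

COLUMNS = ["representative", "neighbor", "host",
           "lineage", "taxonomy_name", "segment_name"]


def read_neighbors(handle):
    """Yield one dict per data row, skipping '##' header lines."""
    rows = (line for line in handle if not line.startswith("##"))
    for fields in csv.reader(rows, delimiter="\t"):
        if len(fields) < len(COLUMNS):
            continue   # blank or truncated line
        record = dict(zip(COLUMNS, (f.strip() for f in fields)))
        # Host and lineage are comma-separated lists within one column.
        record["host"] = record["host"].split(",")
        record["lineage"] = record["lineage"].split(",")
        yield record


sample = (
    "## Neighbors data for complete genomes: Viruses (taxid 10239)\n"
    "NC_003663\tHQ420896\thuman,vertebrates\t"
    "Poxviridae,Orthopoxvirus,Cowpox virus\tCowpox virus\tsegment  \n"
)
records = list(read_neighbors(io.StringIO(sample)))
print(records[0]["neighbor"], records[0]["lineage"][0])
# HQ420896 Poxviridae
```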

Error on overlap calculation

Detected when re-creating this plot:

Wrong labels on JS graph

I was working on making a button so that we can save the SVG generated by d3 for graphs, but I noticed a problem with the labels:

Compare the node numbering, color and labels to this DOT viz:

I'm pretty sure that 1 is gag, 2 is pol and 4 is env.

find_ovrfs.py: error at calculating total overlap

In find_ovrfs.py

https://github.com/PoonLab/ovrf-review/blob/3107965a082029c5b83594abe2305ab89521a94a/scripts/find_ovrfs.py#L60-L62

This calculation miscounts the overlap by one nucleotide. Instead we should use:

left = max(l1, l2)
right = min(r1, r2)
overlap = (right - left) + 1
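
Wrapped as a function, the corrected calculation can be checked against small cases. This sketch assumes inclusive coordinates; the clamp to zero for non-overlapping genes and the name overlap_length are additions for illustration, not part of find_ovrfs.py.

```python
def overlap_length(l1, r1, l2, r2):
    """Number of positions shared by the inclusive intervals
    [l1, r1] and [l2, r2]."""
    left = max(l1, l2)
    right = min(r1, r2)
    # Without the +1, two genes sharing a single nucleotide would be
    # reported as having an overlap of 0.
    return max(0, right - left + 1)


print(overlap_length(1, 10, 10, 20))   # 1
print(overlap_length(1, 10, 11, 20))   # 0
print(overlap_length(1, 10, 5, 8))     # 4
```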

Improving adjacency graphs

Cluster subgraphs: What would they look like?

Bigger picture: Mononegavirales

In order to connect this study with our previous analysis for the ovrf review, I wanted to try plotting a bigger group of viruses. The Mononegavirales (negative-strand ssRNA) group includes families like Rhabdoviridae and Filoviridae (Ebola). It has 327 complete genomes. The plots look like this:

Cluster formation
Genome plot
Adjacency plot (edge_count == 3)
WordCloud

I wonder if it would be worth running this analysis on an entire Baltimore class, so that comparisons between networks would be more straightforward.

Summary statistics for weighted graphs

Adjust for non-independence of virus genomes

We are making comparisons among genomes with respect to summary statistics such as total overlap length and genome length.
There is abundant variation among virus families in the number of genomes (sample size) and we cannot treat these genomes as independent observations - some genomes may be closely related, leading to pseudo-replication. (Suppose there is a dense cluster of points in one region of the plot - those may be re-sampling closely related genomes.)

To adjust for this effect, we would need some way of quantifying the evolutionary divergence between genomes that applies across virus families. This is challenging because the measure would have to be somewhat clock-like at the among-family level.

For example, we could use a k-mer distance to get a crude measure of divergence between genomes within a family, but the rate of evolution with respect to this distance could vary among families.

Plotting information

Calculate the total number of nucleotides involved in an overlap for each genome
Calculate average overlap length using number of genes as denominator
Instead of families as pie charts, plot them as stacked bar plots. Group them by Baltimore classification.
Plot the number of overlaps as ridgeplot (using ggrich from ggfree)
For the visualization of overlap length, compare genomes between families using earth mover's distance.
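
The first two items in the list above can be sketched as follows. The assumptions here are that overlap regions are given as inclusive (start, end) pairs and that a nucleotide covered by several overlaps should be counted once; merged_length is a hypothetical helper name and the coordinates are made up.

```python
def merged_length(intervals):
    """Total number of positions covered by a list of inclusive
    (start, end) intervals, counting shared positions once."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous block and start a new one.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


# Three overlap regions in one genome; the first two intersect.
regions = [(10, 20), (15, 30), (40, 45)]
n_genes = 9   # illustrative gene count

total_nt = merged_length(regions)
print(total_nt)             # 27
print(total_nt / n_genes)   # 3.0
```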

Virus families not being parsed correctly

For example, we currently have 481 genomes categorized as "Waterbird 1 orthobornavirus", which is definitely not correct.
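
One way to make the family assignment more robust is to pick the family by its taxonomic suffix rather than by position in the lineage, since virus family names end in "-viridae". The sketch below assumes the lineage is available as a list of taxon names; family_from_lineage is a hypothetical helper and the second lineage is written out only for illustration.

```python
def family_from_lineage(lineage):
    """Return the first taxon ending in 'viridae', or None if the
    lineage has no family-level name."""
    for taxon in lineage:
        if taxon.endswith("viridae"):
            return taxon
    return None


print(family_from_lineage(["Poxviridae", "Orthopoxvirus", "Cowpox virus"]))
# Poxviridae
print(family_from_lineage(["Riboviria", "Mononegavirales", "Bornaviridae",
                           "Orthobornavirus", "Waterbird 1 orthobornavirus"]))
# Bornaviridae
print(family_from_lineage(["Riboviria"]))
# None
```

Returning None for lineages without a family-level name keeps realm or order terms such as Riboviria out of the Family field, so those records can be reviewed separately.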
