
ovrf-viz's People

Contributors

artpoon, horaciobam

ovrf-viz's Issues

Paper

Figures of general distribution: draw a trend line

Error in computing overlaps

We are seeing a large number of virus genomes with a single protein. This appears to be a misannotation in the NCBI database. For example, Bovine mastadenovirus C is listed as a virus genome with accession number NC_043093.1, but this record is labeled "Bovine adenovirus 10 isolate Ma268 hexon gene, partial cds".

In addition, this record has far too many entries in find_ovrfs.csv:

> overlaps[overlaps$accn=='NC_043093',]
            accn                          prod1   loc1 dir1
106352 NC_043093           hypothetical protein   2722   -1
106353 NC_043093           hypothetical protein   3021    1
106354 NC_043093                 transactivator   9623   -1
106355 NC_043093           hypothetical protein  15602    1
106356 NC_043093           hypothetical protein  18178   -1
106357 NC_043093       antigenic virion protein  19957   -1
106358 NC_043093                   glycoprotein  34850   -1
106359 NC_043093 Polymerase processivity factor  38918   -1
106360 NC_043093        capsid assembly protein  42998    1
106361 NC_043093        putative virion protein  54205   -1
106362 NC_043093                 virion protein  55371    1
106363 NC_043093                 Glycoprotein B  60707   -1
106364 NC_043093      transport/capsid assembly  63153   -1
106365 NC_043093               putative dUTPase  75142   -1
106366 NC_043093                  viron protein  82042    1
106367 NC_043093           hypothetical protein  97527    1
106368 NC_043093             Putative terminase  98576   -1
106369 NC_043093           hypothetical protein  99920    1
106370 NC_043093               tegument protein 100554    1
106371 NC_043093               tegument protein 101839    1
106372 NC_043093           hypothetical protein 103755    1
106373 NC_043093           hypothetical protein 104816    1
106374 NC_043093    Myristylated virion protein 108265    1
106375 NC_043093       helicase/primase complex 111950    1
106376 NC_043093         putative viron protein 114631   -1
106377 NC_043093       Helicase/primase complex 116414    1
106378 NC_043093           hypothetical protein 156043   -1
106379 NC_043093           hypothetical protein 156342    1

Overlap lengths not being computed correctly

> which.max(virus$len.overlaps)
[1] 5223
> virus[5223,]
              Family                      Genome
5223 Closteroviridae Diodia vein chlorosis virus
       Source.information                  Accession Date.completed
5223 isolate:Fayetteville ['NC_038787', 'NC_038786']     08/24/2018
     Date.updated Genome.length Number.of.proteins   Host first.acc
5223   08/24/2018         16230                 10 plants NC_038787
     n.overlaps len.overlaps
5223          6      18017.5
> overlaps[overlaps$accn=='NC_038787',]
            accn                              prod1 loc1 dir1
119451 NC_038787 heat shock protein 70-like protein  935    1
119452 NC_038787                                 p6 2608    1
119453 NC_038787                                p60 2769    1
119454 NC_038787                                p10 4304    1
119455 NC_038787                                 CP 4545    1
119456 NC_038787                                p77 5269    1
                                    prod2 loc2 dir2 seqlen1 seqlen2
119451                                 p5  865    1    1674     132
119452 heat shock protein 70-like protein  935    1     168    1674
119453                                 p6 2608    1    1557     168
119454                                p60 2769    1     255    1557
119455                                p10 4304    1     756     255
119456                                 CP 4545    1    2025     756
       overlap shift
119451      62     1
119452       1     2
119453       7     2
119454      22     2
119455      14     1
119456      32     1
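
A quick sanity check on the rows above, as a minimal Python sketch. It assumes that `loc` is the start coordinate of each gene and `seqlen` is its length in nucleotides (an interpretation, not confirmed here); the function names are made up for illustration. Under that assumption, the per-row `overlap` values are reproduced exactly and sum to 138 nt for NC_038787, so the 18017.5 in `len.overlaps` (larger than the 16230 nt genome) more likely comes from the aggregation step than from the pairwise computation.

```python
# Rows copied from the console output above:
# (prod1, loc1, seqlen1, prod2, loc2, seqlen2, reported overlap)
ROWS = [
    ("heat shock protein 70-like protein", 935, 1674, "p5", 865, 132, 62),
    ("p6", 2608, 168, "heat shock protein 70-like protein", 935, 1674, 1),
    ("p60", 2769, 1557, "p6", 2608, 168, 7),
    ("p10", 4304, 255, "p60", 2769, 1557, 22),
    ("CP", 4545, 756, "p10", 4304, 255, 14),
    ("p77", 5269, 2025, "CP", 4545, 756, 32),
]


def interval_overlap(loc1, len1, loc2, len2):
    """Nucleotides shared by the half-open intervals [loc, loc + len).

    Assumption: loc is a start coordinate and len a nucleotide length.
    """
    return max(0, min(loc1 + len1, loc2 + len2) - max(loc1, loc2))


def recompute(rows):
    """Recompute the overlap of every row from its coordinates."""
    return [interval_overlap(r[1], r[2], r[4], r[5]) for r in rows]


def total_overlap(rows):
    """Total overlap length over all rows of one accession."""
    return sum(recompute(rows))


def exceeds_genome(total, genome_length):
    """Flag totals that cannot be right because they exceed the genome."""
    return total > genome_length
```

The same `exceeds_genome` check could be applied to every row of the `virus` table to list other genomes with implausible totals.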

Retrovirus clustering

The retrovirus genome plot has some accession numbers where all proteins appear to be assigned to the same cluster. Why?


Developing new visualization method

The current idea is to use a graph where each node represents a cluster of protein sequences (presumably homologous with respect to sequence and function), and each edge represents the prevalence of overlaps between the respective protein clusters (genes).

We might also want to incorporate information about the adjacency of genes into the graph (for example, genes that tend to be close together or adjacent in genome sequences could be connected by another type of edge).
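
A rough sketch of how the two edge types could be tallied, in Python. The input layout (a list of dicts with `clusters` and `overlaps` keys), the function names and the toy genomes are all made up for illustration; they are not the project's actual data structures.

```python
from collections import Counter

# Toy input (hypothetical): each genome lists the cluster label of every
# gene in genome order, plus index pairs of genes that overlap.
GENOMES = [
    {"clusters": ["gag", "pol", "env"], "overlaps": [(0, 1)]},
    {"clusters": ["gag", "pol", "env"], "overlaps": [(0, 1), (1, 2)]},
]


def edge_key(a, b):
    """Undirected edge: order of the two cluster labels does not matter."""
    return tuple(sorted((a, b)))


def build_edges(genomes):
    """Count overlap edges and adjacency edges between protein clusters."""
    overlap_edges = Counter()
    adjacency_edges = Counter()
    for genome in genomes:
        clusters = genome["clusters"]
        # One overlap edge per pair of overlapping genes.
        for i, j in genome["overlaps"]:
            overlap_edges[edge_key(clusters[i], clusters[j])] += 1
        # One adjacency edge per pair of neighbouring genes.
        for a, b in zip(clusters, clusters[1:]):
            adjacency_edges[edge_key(a, b)] += 1
    return overlap_edges, adjacency_edges


def filter_edges(edges, min_count):
    """Keep only edges seen at least min_count times."""
    return {edge: n for edge, n in edges.items() if n >= min_count}
```

Keeping the two counters separate means the overlap and adjacency edges can be drawn in different styles, or the adjacency edges left out entirely.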

Myoviridae phage with no overlapping genes?

> virus[virus$Family=='Myoviridae',][285,]
         Family                 Genome Source.information
1218 Myoviridae Klebsiella phage K64-1                  -
     Accession Date.completed Date.updated Genome.length
1218 NC_027399     06/24/2015   08/13/2018        346602
     Number.of.proteins     Host first.acc n.overlaps
1218                 64 bacteria NC_027399          0
     len.overlaps
1218           NA

HTTP Error

When trying to retrieve the new table of virus information, I get:

  File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: Internal Server Error
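
If the 500 is transient on the server side, one possible workaround is to retry the request with a growing delay. This is only a sketch: `fetch_with_retry` and its parameters are hypothetical names, not part of the existing script, and it uses only `urllib.request.urlopen` and `urllib.error.HTTPError` from the standard library.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, retries=3, delay=1.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the response body as bytes, retrying on HTTP 5xx errors.

    opener and sleep are injectable so the function can be tested
    without touching the network.
    """
    for attempt in range(retries):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # 4xx errors will not fix themselves, and the last attempt
            # has nothing left to retry: re-raise in both cases.
            if err.code < 500 or attempt == retries - 1:
                raise
            # Exponential backoff: delay, 2 * delay, 4 * delay, ...
            sleep(delay * 2 ** attempt)
```

If the error persists after several attempts, the problem is probably the request itself (for example, a changed URL or query format) rather than server load.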

Visualization bugs

  1. Sometimes the adjacency graphs are displayed packed into the top-left corner
  2. Some genome plots, such as Geminiviridae, are not displayed

k-mer distance matrix

Evaluating the use of k-mers to calculate a dot-product (kernel) matrix for comparing protein sequences without an alignment step.
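
As a minimal sketch of the idea in Python: count the k-mers of each protein sequence and take the dot product of the count vectors (often called a spectrum kernel). The choice of k = 3, the cosine normalisation and the function names are assumptions made for illustration, not settled choices.

```python
import math
from collections import Counter


def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in seq (empty if len(seq) < k)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kernel_matrix(seqs, k=3):
    """Dot products between k-mer count vectors, with no alignment step."""
    counts = [kmer_counts(s, k) for s in seqs]
    n = len(seqs)
    K = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Counter returns 0 for k-mers missing from the other sequence.
            value = sum(c * counts[j][kmer] for kmer, c in counts[i].items())
            K[i][j] = K[j][i] = value
    return K


def normalize(K):
    """Cosine-normalise so that sequences of different length compare fairly."""
    n = len(K)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = math.sqrt(K[i][i] * K[j][j])
            out[i][j] = K[i][j] / denom if denom else 0.0
    return out
```

A distance can then be taken as one minus the normalised kernel value, which is 0 for identical k-mer profiles and 1 for profiles with no k-mer in common.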

Re-analysing the entire NCBI database

The new table that contains all accession numbers is now organized as:

## Neighbors data for complete genomes: Viruses (taxid 10239)
## Columns:     "Representative"        "Neighbor"      "Host"  "Selected lineage"      "Taxonomy name" "Segment name"
NC_003663       HQ420896        human,vertebrates       Poxviridae,Orthopoxvirus,Cowpox virus   Cowpox virus    segment  
NC_003663       KY463519        human,vertebrates       Poxviridae,Orthopoxvirus,Cowpox virus   Cowpox virus    segment  
NC_003663       HQ420897        human,vertebrates       Poxviridae,Orthopoxvirus,Cowpox virus   Cowpox virus    segment  
NC_003663       MK035759        human,vertebrates       Poxviridae,Orthopoxvirus,Cowpox virus   Cowpox virus    segment  
NC_003663       KY569019        human,vertebrates       Poxviridae,Orthopoxvirus,Cowpox virus   Cowpox virus    segment  
NC_003663       KY549145        human,vertebrates       Poxviridae,Orthopoxvirus,Cowpox virus   Cowpox virus    segment  

Here they state that the "neighbors" dataset was created to fill the diversity gaps left by the RefSeq dataset, and that its entries are also complete or nearly complete viral genomes. Should we use "neighbors" or continue with the RefSeq sequences?
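
Whichever set we choose, the table can be read with a few lines of Python. This sketch assumes the file is tab-delimited with `##` comment lines, as the excerpt above suggests; the function names are made up for illustration.

```python
from collections import defaultdict

# Column names taken from the "## Columns:" header line of the table.
COLUMNS = ["Representative", "Neighbor", "Host",
           "Selected lineage", "Taxonomy name", "Segment name"]


def parse_neighbors(lines):
    """Turn the tab-delimited rows into dicts, skipping ## comment lines."""
    records = []
    for line in lines:
        if line.startswith("##") or not line.strip():
            continue
        fields = [f.strip() for f in line.rstrip("\n").split("\t")]
        records.append(dict(zip(COLUMNS, fields)))
    return records


def group_by_representative(records):
    """Map each RefSeq representative to the list of its neighbor accessions."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["Representative"]].append(rec["Neighbor"])
    return dict(groups)
```

Grouping by representative makes it easy to see how many neighbors each RefSeq genome would add, which is relevant to the question above.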

Wrong labels on JS graph

I was working on making a button so that we can save the SVG generated by d3 for graphs, but I noticed a problem with the labels:
[image: Screen Shot 2021-02-10 at 10 52 37 PM]

Compare the node numbering, color and labels to this DOT viz:
[image: retroviridae dot]

I'm pretty sure that 1 is gag, 2 is pol and 4 is env.

Bigger picture: Mononegavirales

To connect this study with our previous analysis for the ovrf Review, I wanted to try plotting a bigger group of viruses. The Mononegavirales (negative-strand ssRNA) group includes families such as Rhabdoviridae and Filoviridae (Ebola). It has 327 complete genomes. The plots look like this:

  1. Cluster formation
  2. Genome plot
  3. Adjacency plot (edge_count == 3)
  4. WordCloud

I wonder whether it would be worth running this analysis for an entire Baltimore class, so that comparisons between networks would be more straightforward.

Adjust for non-independence of virus genomes

We are making comparisons among genomes with respect to summary statistics such as total overlap length and genome length.
There is abundant variation among virus families in the number of genomes (sample size) and we cannot treat these genomes as independent observations - some genomes may be closely related, leading to pseudo-replication. (Suppose there is a dense cluster of points in one region of the plot - those may be re-sampling closely related genomes.)

To adjust for this effect, we would need some way of quantifying the evolutionary divergence between genomes that applies across virus families. This is challenging because the measure would have to be somewhat clock-like at the among-family level.

For example, we could use a k-mer distance to get a crude measure of divergence between genomes within a family, but the rate of evolution with respect to this distance could vary among families.
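
One crude option, sketched in Python, is to down-weight each genome by the number of genomes within some distance threshold of it, so that a dense cluster of near-identical genomes counts roughly as one observation. The distance matrix is assumed to be precomputed (for instance from k-mers), and both the threshold and the function names are hypothetical. This does not solve the problem that the same distance may mean different amounts of divergence in different families.

```python
def redundancy_weights(dist, threshold):
    """Weight each genome by 1 / (number of genomes within threshold).

    dist is a square matrix of pairwise distances with zeros on the
    diagonal, so every genome counts itself and no weight exceeds 1.
    """
    weights = []
    for row in dist:
        n_close = sum(1 for d in row if d <= threshold)
        weights.append(1.0 / n_close)
    return weights


def effective_sample_size(weights):
    """Sum of weights: roughly the number of distinct clusters of genomes."""
    return sum(weights)
```

For example, two genomes at distance 0.05 from each other and a third at distance 0.9 from both would get weights 0.5, 0.5 and 1.0 at a threshold of 0.1, for an effective sample size of 2.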

Plotting information

  • Calculate the total number of nucleotides involved in an overlap for each genome.
  • Calculate the average overlap length, using the number of genes as the denominator.
  • Instead of plotting families as pie charts, plot them as stacked bar plots, grouped by Baltimore classification.
  • Plot the number of overlaps as a ridge plot (using ggrich from ggfree).
  • For the visualization of overlap length, compare genomes between families using the earth mover's distance.
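
Two of the items above can be sketched in a few lines of Python. The function names are made up for illustration, and the earth mover's distance here is the one-dimensional version, computed as the area between the two empirical cumulative distributions; whether that is the variant we want is still open.

```python
def mean_overlap_per_gene(total_overlap_nt, n_genes):
    """Average overlap length, with the number of genes as denominator."""
    return total_overlap_nt / n_genes if n_genes else float("nan")


def earth_movers_distance(xs, ys):
    """1-D earth mover's distance between two samples.

    Computed as the integral of |CDF_x - CDF_y| over the pooled values.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    distance = 0.0
    i = j = 0
    for left, right in zip(points, points[1:]):
        # Advance to the number of observations <= left in each sample.
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        # CDF difference is constant on [left, right).
        distance += abs(i / len(xs) - j / len(ys)) * (right - left)
    return distance
```

Here `xs` and `ys` would be the overlap lengths (or per-gene averages) of the genomes in two families; the distance is 0 when the two distributions coincide and grows with how far one has to be shifted to match the other.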
