Giter Site home page Giter Site logo

poonlab / covizu Goto Github PK

View Code? Open in Web Editor NEW
45.0 45.0 20.0 34.82 MB

Rapid analysis and visualization of coronavirus genome variation

Home Page: https://filogeneti.ca/CoVizu/

License: MIT License

Python 37.59% Shell 0.01% CSS 2.33% HTML 17.52% JavaScript 41.81% R 0.73%

covizu's People

Contributors

abayomi-olabode avatar artpoon avatar babarlelephant avatar bonnielu avatar dependabot[bot] avatar dpannguyen avatar ebrintn avatar ewong347 avatar gopigugan avatar horaciobam avatar kwade4 avatar momomomoliu avatar nav-mohan avatar rouxcil avatar sandeepthokala avatar syoutono242 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

covizu's Issues

Mouseover/click node in beadplot should display info

  • Display following information:
    • Number of samples represented by node
    • Description/accession numbers of samples
    • Countries/regions of sample collection
    • Sample collection date
  • Mouseover could be annoying because info would switch to another node if user moves mouse
  • Use tooltip or a fixed text box?
  • Ideally, we should also select (highlight) the vertical edge that connects the selected node to its parent node if it is the first sample of the variant. This will help resolve the problem of overlapping vertical edges.

Either fasttree or treetime mangles tip labels

From Treetime NEXUS output:

#NEXUS
Begin Taxa;
 Dimensions NTax=119;
 TaxLabels hCoV-19/HongKong/HKPU45-0502/2020|E_000 hCoV-19/HongKong/HKPU23_2601/2020|E_001 hCoV-19/Scotland/CVR248/2020|EPI_IS_002 

GISAID is updating genome sequences without notification

Sergei Pond gave a talk at Dynamics where he reported that genome sequences were being changed by GISAID without any notification or indication of versioning in the sequence record. (reported by @RouxCil )

This is really problematic because we are currently only downloading the newest records day to day. We may need to maintain an unaligned set of genomes and regularly (weekly?) download the entire database, do an exact match comparison to the unaligned set, and then update the aligned data set if changes are detected.

Internal nodes disappearing

art@Kestrel:~/git/covizu/2020-04-15-0001_treetime$ grep -o NODE timetree.nexus | wc -l1917
art@Kestrel:~/git/covizu/2020-04-15-0001_treetime$ grep -o NODE clean.nwk | wc -l1320

Changing gotoh directory name broke directory path

From nohup/cronlog:

Traceback (most recent call last):
File "/home/covid/gotoh2.py", line 262, in
aligner = Aligner()
File "/home/covid/gotoh2.py", line 25, in init
files = pkgres.resource_listdir('gotoh2', 'models')
File "/usr/lib/python2.7/dist-packages/pkg_resources/init.py", line 1190,$
resource_name
File "/usr/lib/python2.7/dist-packages/pkg_resources/init.py", line 1486,$
return self._listdir(self._fn(self.module_path, resource_name))
File "/usr/lib/python2.7/dist-packages/pkg_resources/init.py", line 1574,$
return os.listdir(path)
OSError: [Errno 2] No such file or directory: '/home/covid/models'

Splitting minimum spanning tree is stochastic

Right now I am splitting the tree on nodes with high out-degree (above 20). However, the number of resulting subtree components varies between 14 and 19 with repeated runs of the same script.

Front end: Page layout

To be implemented via CSS

              |
treetime SVG  |
              |   side
--------------|   bar
              |   (help + 
 cluster SVG  |   info)
              |

Branches adjacent to root not being drawn

Currently time-scaled tree is rendered like this:
image

It should look like this:
image

There's a problem with node id numbering:

> df.length
200
> edgeset.length
394
> df[199]
{parentId: null, parentLabel: null, thisId: 201, thisLabel: "↵", children: Array(2),}
angle: undefined
branchLength: 0
children: (2) [80, 198]
isTip: false
parentId: null
parentLabel: null
thisId: 201
thisLabel: "↵"
x: 0
y: 41.74586639404297
__proto__: Object

thisID should match the node index in the Array df.

I suspect that something is wrong with parsing the Newick tree string:

covizu/js/phylo.js

Lines 12 to 81 in 4695087

function readTree(text) {
// remove whitespace
text = text.replace(/ \t/g, '');
var tokens = text.split(/(;|\(|\)|,)/),
root = {'parent': null, 'children':[]},
curnode = root,
nodeId = 0;
for (const token of tokens) {
if (token == "" || token == ';') {
continue
}
//console.log(token);
if (token == '(') {
// add a child to current node
var child = {
'parent': curnode,
'children': []
};
curnode.children.push(child);
curnode = child; // climb up
}
else if (token == ',') {
// climb down, add another child to parent
curnode = curnode.parent;
var child = {
'parent': curnode,
'children': []
}
curnode.children.push(child);
curnode = child; // climb up
}
else if (token == ')') {
// climb down twice
curnode = curnode.parent;
if (curnode === null) {
break;
}
}
else {
var nodeinfo = token.split(':');
if (nodeinfo.length==1) {
if (token.startsWith(':')) {
curnode.label = "";
curnode.branchLength = parseFloat(nodeinfo[0]);
} else {
curnode.label = nodeinfo[0];
curnode.branchLength = null;
}
}
else if (nodeinfo.length==2) {
curnode.label = nodeinfo[0];
curnode.branchLength = parseFloat(nodeinfo[1]);
}
else {
// TODO: handle edge cases with >1 ":"
console.warn(token, "I don't know what to do with two colons!");
}
curnode.id = nodeId++; // assign then increment
}
}
// if root node is unlabelled
curnode.id = nodeId;
drawtree(root);
return(root);
}

Front-end: Tool tips

Containing object-specific info (e.g., summary composition of cluster with respect to countries of sampling)

write CONTRIBUTING.md

Describe each stage of workflow to sufficient degree for lab members to contribute

Spaces in GISAID labels breaking pipeline

While running pipeline, the following exception was thrown in hclust.R:

Error: all(is.element(variants$cluster, headers)) is not TRUE

Running this script manually, I found the following problem:

> variants <- read.csv('data/variants.csv')
> test <- is.element(variants$cluster, headers)
> table(test)
test
FALSE  TRUE 
    5 20702 
> which(!test)
[1] 12422 12423 12424 12425 12426
> variants[12422:12426,]
                                               cluster
12422 hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08
12423 hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08
12424 hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08
12425 hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08
12426 hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08
                                                 label    coldate region
12422 hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08 2020-04-08 Africa
12423  hCoV-19/DRC/2120/2020|EPI_ISL_435163|2020-04-12 2020-04-12 Africa
12424  hCoV-19/DRC/2121/2020|EPI_ISL_435164|2020-04-12 2020-04-12 Africa
12425  hCoV-19/DRC/2125/2020|EPI_ISL_435166|2020-04-13 2020-04-13 Africa
12426  hCoV-19/DRC/2384/2020|EPI_ISL_437196|2020-04-14 2020-04-14 Africa
      country
12422     DRC
12423     DRC
12424     DRC
12425     DRC
12426     DRC

GISAID record was sampled in the future

art@Kestrel:~/git/covizu$ grep -r hCoV-19/India/NCDC-3175/2020 data
data/clusters.json:          "label1": "hCoV-19/India/NCDC-3175/2020",
data/gisaid-aligned.fa:>hCoV-19/India/NCDC-3175/2020|EPI_ISL_436426|2022-04-05

Record has since been corrected in GISAID.
Need to write a screen for records with collection dates in the future, probably in the filtering.py script.

Updating database

@ewong347 can you please download the GISAID FASTA on a weekly basis to Langley, with a unique filename (e.g., GISAID-2020-05-01.fa), since you are currently the only one with the ability to access the data export?

Front-end: implement walk-through animations

Clicking on a "help" button should start a walk-through that selects & highlights objects on the time-scaled tree and cluster SVGs with tool-tips, and text in the side bar. User should be able to exit the walkthrough at any time (returning to normal interface).

Filter out genomes with large number of N's

GISAID flags genomes with more than 1% N's (uncalled bases), which corresponds to roughly 300 positions.
These histograms summarize the number of N's for 6,018 genome sequences:
Rplot01

where the right histogram is rescaled with respect to the y-axis. Only 1,758 genomes are free of ambiguous base calls. I arbitrarily decided on a cutoff of 10%, which corresponds to about 2,990 positions with N's.

Concatenated sequences in aligned data

On Langley:

>>> import gotoh2
>>> handle = open('gisaid-aligned.fa')
>>> fasta = gotoh2.convert_fasta(handle)
>>> len(fasta)
19371
>>> fd = dict(fasta)
>>> seq = fd["hCoV-19/USA/MI-MDHHS-SC20187/2020|EPI_ISL_436835|2020-03-22"]
>>> len(seq)
86631
>>> len(fasta[0][1])
29903
>>> seq[28000:28100]
'GT>HCOV-19/USA/CA-CZB-1039/2020|EPI_ISL_436680|2020-04-04ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTT'

Some kind of problem with line break characters...

Removing outlier sequences for TreeTime analysis

Currently using a Python script to screen terminal branches for excessive lengths.
This helps but is not sufficient:

769.24	TreeTime: The following tips don't fit the clock model, please remove
      	them from the tree. Their dates have been reset:
769.24	EPI_ISL_418068, input date: 2020.2, apparent date: 2020.42
769.24	EPI_ISL_418184, input date: 2020.211, apparent date: 2020.60
769.24	EPI_ISL_408485, input date: 2020.047, apparent date: 2020.29
769.24	EPI_ISL_418185, input date: 2020.211, apparent date: 2020.49
769.24	EPI_ISL_417205, input date: 2020.167, apparent date: 2020.45
769.24	EPI_ISL_418189, input date: 2020.225, apparent date: 2020.79
769.24	EPI_ISL_418874, input date: 2020.225, apparent date: 2020.67
769.24	EPI_ISL_415618, input date: 2020.186, apparent date: 2020.76
769.24	EPI_ISL_419243, input date: 2020.142, apparent date: 2020.46

Backend: dead Web content processes are accumulating on Langley

art@Langley:~$ top -b -n1 | grep Web | head
28302 ewong347  20   0 2614688 118852  92436 S   5.9  0.2  64:53.86 Web Content
29486 ewong347  20   0 2614688 118860  92248 S   5.9  0.2  64:39.60 Web Content
  791 ewong347  20   0 2654016 113768  78460 S   0.0  0.2  10:32.29 WebExtensi+
  846 ewong347  20   0 2614692 119388  92160 S   0.0  0.2  73:10.07 Web Content
  918 ewong347  20   0 2551104  72908  59024 S   0.0  0.1   0:00.37 Web Content
 1263 ewong347  20   0 2675184 175260  80520 S   0.0  0.3   7:47.22 WebExtensi+
 1308 ewong347  20   0 2642312 136220 101144 S   0.0  0.2  65:55.46 Web Content
 1382 ewong347  20   0 2553028  75780  61372 S   0.0  0.1   0:00.41 Web Content
 1540 ewong347  20   0 2674160 174856  80580 S   0.0  0.3   7:58.89 WebExtensi+
 1591 ewong347  20   0 2642572 136888 101304 S   0.0  0.2  66:16.85 Web Content
art@Langley:~$ top -b -n1 | grep Web | wc -l
116

Constrain treetime with clock rate, reduce number of tips

Currently adding additional genome variants that are observed at least 10 times in the data to augment the cluster sequences (roughly 56) - otherwise we get a bad estimate of the clock rate. This makes the resulting tree a bit cluttered.

Stop adding these extra genomes and instead constrain the clock rate to published estimates.

Use d3 paths to prevent vertical edges from overlapping?

If there are multiple vertical edges corresponding to the same sample collection date, then we can apply some curvature to prevent them from overlapping.
Maybe we can have this be animated, so if the user selects the derived variant bead (child node) then the vertical edge curves to the left.

Aligning sequences

I suspect that running MUSCLE or MAFFT will take too long to align the entire data.
I have written a pairwise alignment script that is running on orolo (start time April 7, 14:57)
It is running on two cores using MPI to split the sequence processing jobs.
Even pairwise alignment is memory intensive, consuming about 8GB (although a lot of this is probably due to poor memory efficiency of gotoh2), so running more than two cores crashes my machine.

Current estimate is about 50 sequences in 10 minutes using 2 cores.
The full data file gisaid_cov2020_sequences.fasta has 3815 sequences.
So I estimate it will take about 13 hours to complete processing the data.

The alignment files look good so far.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.