poonlab / covizu
Rapid analysis and visualization of coronavirus genome variation
Home Page: https://filogeneti.ca/CoVizu/
License: MIT License
This will break the pipeline script covizu.sh until we repair the paths.
From TreeTime NEXUS output:
#NEXUS
Begin Taxa;
Dimensions NTax=119;
TaxLabels hCoV-19/HongKong/HKPU45-0502/2020|E_000 hCoV-19/HongKong/HKPU23_2601/2020|E_001 hCoV-19/Scotland/CVR248/2020|EPI_IS_002
User could enter either a GISAID accession number or description, e.g., BavPat1. Should highlight clusters and/or nodes that match the search string. Multiple matches should be possible.
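A minimal sketch of the matching step (the actual highlighting would live in the front-end; `search_matches` is a hypothetical helper, not existing CoVizu code):

```python
def search_matches(query, labels):
    """Case-insensitive substring search over node/tip labels; a query can be
    either an accession (EPI_ISL_...) or part of the description (BavPat1),
    and every matching label is returned to support multiple hits."""
    q = query.lower()
    return [label for label in labels if q in label.lower()]
```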
For example, hCoV-19/Iran/MHKN-1/2020|EPI_ISL_413517|2020-02-26
is only 322 nt long in the FASTA files. This is screwing up the TN93 calculations.
Sergei Pond gave a talk at Dynamics where he reported that genome sequences were being changed by GISAID without any notification or indication of versioning in the sequence record. (reported by @RouxCil )
This is really problematic because we are currently only downloading the newest records day to day. We may need to maintain an unaligned set of genomes and regularly (weekly?) download the entire database, do an exact match comparison to the unaligned set, and then update the aligned data set if changes are detected.
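The exact-match comparison could be kept cheap by hashing each cached sequence; a rough sketch (helper names are hypothetical):

```python
import hashlib

def digest(seq):
    """Hash an unaligned genome (whitespace stripped, uppercased) so that
    exact-match comparison does not require storing two full copies."""
    return hashlib.sha256(seq.strip().upper().encode()).hexdigest()

def changed_records(local, latest):
    """Given {accession: sequence} dicts for our cached unaligned set and a
    fresh full download, return accessions whose sequence content changed
    without any versioning notice."""
    return [acc for acc, seq in latest.items()
            if acc in local and digest(local[acc]) != digest(seq)]
```

Changed accessions would then trigger re-alignment of just those records.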
art@Kestrel:~/git/covizu/2020-04-15-0001_treetime$ grep -o NODE timetree.nexus | wc -l
1917
art@Kestrel:~/git/covizu/2020-04-15-0001_treetime$ grep -o NODE clean.nwk | wc -l
1320
This should be moved closer to front-end.
They should be identical genome sequences.
Will have to identify internal nodes via landmark tip labels.
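One way to pin internal nodes down, since tools renumber them differently, is to identify each internal node by the set of tip labels below it. A sketch over a toy nested-dict tree representation (the dict layout is an assumption for illustration):

```python
def tip_set(node, index):
    """Post-order traversal: record each internal node under the frozenset of
    its descendant tip labels, which is unique within a tree and stable
    across tools that relabel or renumber internal nodes."""
    if not node['children']:                      # tip
        return frozenset([node['label']])
    tips = frozenset().union(*(tip_set(child, index) for child in node['children']))
    index[tips] = node
    return tips
```

Two trees built from the same taxa can then be matched on shared tip sets.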
From nohup/cronlog:
Traceback (most recent call last):
  File "/home/covid/gotoh2.py", line 262, in <module>
    aligner = Aligner()
  File "/home/covid/gotoh2.py", line 25, in __init__
    files = pkgres.resource_listdir('gotoh2', 'models')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1190,$
    resource_name
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1486,$
    return self._listdir(self._fn(self.module_path, resource_name))
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1574,$
    return os.listdir(path)
OSError: [Errno 2] No such file or directory: '/home/covid/models'
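pkg_resources is resolving `models` relative to where it found the module, i.e. `/home/covid/`, where the `models/` directory does not exist. A hypothetical workaround sketch (the `models_dir` helper is not existing gotoh2 code) would locate the directory next to the module file itself:

```python
import os

def models_dir(module_file):
    """Resolve the 'models' directory relative to the module's own file,
    so the script finds its scoring matrices regardless of where the cron
    job was launched from."""
    return os.path.join(os.path.dirname(os.path.abspath(module_file)), 'models')
```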
Also need to make these searchable.
Right now I am splitting the tree on nodes with high out-degree (above 20). However, the number of resulting subtree components varies between 14 and 19 with repeated runs of the same script.
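If the variation comes from an unstable traversal or iteration order, a fixed traversal with sorted children should make the split reproducible. A sketch over a toy nested-dict tree (the dict layout and `split_nodes` name are assumptions):

```python
def split_nodes(node, cutoff=20, found=None):
    """Collect internal nodes with out-degree above `cutoff` in a fixed
    pre-order traversal; sorting children by label means repeated runs on
    the same tree return the same nodes in the same order."""
    if found is None:
        found = []
    children = node.get('children', [])
    if len(children) > cutoff:
        found.append(node)
    for child in sorted(children, key=lambda c: str(c.get('label'))):
        split_nodes(child, cutoff, found)
    return found
```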
Labeled with year/month/day
To be implemented via CSS
                |
  treetime SVG  |
                |  side
----------------|  bar
                |  (help +
  cluster SVG   |  info)
                |
Currently the time-scaled tree is rendered like this:
There's a problem with node id numbering:
> df.length
200
> edgeset.length
394
> df[199]
{parentId: null, parentLabel: null, thisId: 201, thisLabel: "↵", children: Array(2), …}
angle: undefined
branchLength: 0
children: (2) [80, 198]
isTip: false
parentId: null
parentLabel: null
thisId: 201
thisLabel: "↵"
x: 0
y: 41.74586639404297
__proto__: Object
thisId should match the node index in the array df.
I suspect that something is wrong with parsing the Newick tree string:
(see lines 12 to 81 in commit 4695087)
Containing object-specific info (e.g., summary composition of cluster with respect to countries of sampling)
Describe each stage of the workflow in sufficient detail for lab members to contribute.
The updater script is not working properly.
While running the pipeline, the following exception was thrown in hclust.R:
Error: all(is.element(variants$cluster, headers)) is not TRUE
Running this script manually, I found the following problem:
> variants <- read.csv('data/variants.csv')
> test <- is.element(variants$cluster, headers)
> table(test)
test
FALSE TRUE
5 20702
> which(!test)
[1] 12422 12423 12424 12425 12426
> variants[12422:12426,]
cluster
12422 hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08
12423 hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08
12424 hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08
12425 hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08
12426 hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08
label coldate region
12422 hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08 2020-04-08 Africa
12423 hCoV-19/DRC/2120/2020|EPI_ISL_435163|2020-04-12 2020-04-12 Africa
12424 hCoV-19/DRC/2121/2020|EPI_ISL_435164|2020-04-12 2020-04-12 Africa
12425 hCoV-19/DRC/2125/2020|EPI_ISL_435166|2020-04-13 2020-04-13 Africa
12426 hCoV-19/DRC/2384/2020|EPI_ISL_437196|2020-04-14 2020-04-14 Africa
country
12422 DRC
12423 DRC
12424 DRC
12425 DRC
12426 DRC
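The offending cluster labels contain a stray space before the first `|` (`hCoV-19/DRC/1423/2020 |EPI_ISL_434710|...`), which is why `is.element` fails against the clean headers. A sketch of a normalization step that could be applied to both sides before comparison (`normalize_label` is a hypothetical helper):

```python
def normalize_label(label):
    """Collapse all internal whitespace in a GISAID header so that labels
    like 'hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08' compare equal
    to their space-free form."""
    return ''.join(label.split())
```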
art@Kestrel:~/git/covizu$ grep -r hCoV-19/India/NCDC-3175/2020 data
data/clusters.json: "label1": "hCoV-19/India/NCDC-3175/2020",
data/gisaid-aligned.fa:>hCoV-19/India/NCDC-3175/2020|EPI_ISL_436426|2022-04-05
Record has since been corrected in GISAID.
Need to write a screen for records with collection dates in the future, probably in the filtering.py script.
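A minimal sketch of that screen, assuming the usual `name|accession|YYYY-MM-DD` header layout (the function name is hypothetical):

```python
from datetime import date

def has_future_coldate(header, today=None):
    """Return True if the collection date in a 'name|accession|YYYY-MM-DD'
    header falls after `today`, as with the EPI_ISL_436426 record dated
    2022-04-05. Unparseable dates are left for other filters."""
    today = today or date.today()
    try:
        coldate = date.fromisoformat(header.split('|')[-1])
    except ValueError:
        return False
    return coldate > today
```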
@ewong347 can you please download the GISAID FASTA on a weekly basis to Langley, with a unique filename (e.g., GISAID-2020-05-01.fa), since you are currently the only one with the ability to access the data export?
Shall we use that annoying "we use cookies" web component to briefly explain what CoVizu does on page load?
Clicking on a "help" button should start a walk-through that selects & highlights objects on the time-scaled tree and cluster SVGs with tool-tips, and text in the side bar. User should be able to exit the walkthrough at any time (returning to normal interface).
Somewhere on the page we should display the following information:
GISAID flags genomes with more than 1% N's (uncalled bases), which corresponds to roughly 300 positions.
These histograms summarize the number of N's for 6,018 genome sequences:
where the right histogram is rescaled with respect to the y-axis. Only 1,758 genomes are free of ambiguous base calls. I arbitrarily decided on a cutoff of 10%, which corresponds to about 2,990 positions with N's.
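The 10% cutoff could be applied with a one-liner during filtering; a sketch (the function name is an assumption, not existing filtering.py code):

```python
def passes_n_filter(seq, cutoff=0.1):
    """Keep a genome only if at most `cutoff` of its positions are N
    (uncalled). At 10% of ~29,903 nt that is about 2,990 positions."""
    return seq.upper().count('N') <= cutoff * len(seq)
```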
On Langley:
>>> import gotoh2
>>> handle = open('gisaid-aligned.fa')
>>> fasta = gotoh2.convert_fasta(handle)
>>> len(fasta)
19371
>>> fd = dict(fasta)
>>> seq = fd["hCoV-19/USA/MI-MDHHS-SC20187/2020|EPI_ISL_436835|2020-03-22"]
>>> len(seq)
86631
>>> len(fasta[0][1])
29903
>>> seq[28000:28100]
'GT>HCOV-19/USA/CA-CZB-1039/2020|EPI_ISL_436680|2020-04-04ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTT'
Some kind of problem with line break characters...
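A quick sanity check could flag records where a header got glued into the preceding sequence, as with the 86,631 nt record above; a sketch (`embedded_headers` is a hypothetical helper):

```python
def embedded_headers(fasta):
    """fasta: list of (header, sequence) tuples. A '>' inside a sequence is
    a symptom of broken record parsing (e.g., carriage-return line endings
    fusing the next header onto the previous sequence)."""
    return [header for header, seq in fasta if '>' in seq]
```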
art@Kestrel:~/git/covizu$ grep EPI_ISL_420841 data/gisaid-aligned.fa
hCoV-19/Congo/300/2020|EPI_ISL_420841|2020-03-22
hCoV-19/DRC/300/2020|EPI_ISL_420841|2020-03-22
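Duplicated accessions under different names could be caught automatically when ingesting the FASTA; a rough sketch (the helper name is an assumption):

```python
from collections import defaultdict
import re

def duplicate_accessions(headers):
    """Group FASTA headers by their EPI_ISL accession and report any
    accession that appears under more than one distinct label, like
    EPI_ISL_420841 above (Congo vs. DRC)."""
    by_acc = defaultdict(set)
    for header in headers:
        match = re.search(r'EPI_ISL_\d+', header)
        if match:
            by_acc[match.group(0)].add(header)
    return {acc: labels for acc, labels in by_acc.items() if len(labels) > 1}
```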
Currently using a Python script to screen terminal branches for excessive lengths.
This helps but is not sufficient:
769.24 TreeTime: The following tips don't fit the clock model, please remove
them from the tree. Their dates have been reset:
769.24 EPI_ISL_418068, input date: 2020.2, apparent date: 2020.42
769.24 EPI_ISL_418184, input date: 2020.211, apparent date: 2020.60
769.24 EPI_ISL_408485, input date: 2020.047, apparent date: 2020.29
769.24 EPI_ISL_418185, input date: 2020.211, apparent date: 2020.49
769.24 EPI_ISL_417205, input date: 2020.167, apparent date: 2020.45
769.24 EPI_ISL_418189, input date: 2020.225, apparent date: 2020.79
769.24 EPI_ISL_418874, input date: 2020.225, apparent date: 2020.67
769.24 EPI_ISL_415618, input date: 2020.186, apparent date: 2020.76
769.24 EPI_ISL_419243, input date: 2020.142, apparent date: 2020.46
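One way to tighten the terminal-branch screen is an outlier rule instead of a fixed length cutoff; a sketch, assuming we have already extracted terminal branch lengths per tip (the function and its z-score rule are assumptions, not the current script):

```python
import statistics

def flag_long_tips(tip_lengths, z=1.5):
    """tip_lengths: {tip label: terminal branch length}. Flag tips whose
    branch length exceeds the mean by more than z standard deviations,
    a rough proxy for sequences that will not fit the clock model."""
    values = list(tip_lengths.values())
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [tip for tip, bl in tip_lengths.items() if sd > 0 and bl > mu + z * sd]
```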
A Python script would help us keep the local database up to date; web browsing, including form submission, can be done with the mechanize module (or a more recent replacement).
Refer to PoonLab/disorder repository for reference on web scraping/interface scripts in Python.
art@Langley:~$ top -b -n1 | grep Web | head
28302 ewong347 20 0 2614688 118852 92436 S 5.9 0.2 64:53.86 Web Content
29486 ewong347 20 0 2614688 118860 92248 S 5.9 0.2 64:39.60 Web Content
791 ewong347 20 0 2654016 113768 78460 S 0.0 0.2 10:32.29 WebExtensi+
846 ewong347 20 0 2614692 119388 92160 S 0.0 0.2 73:10.07 Web Content
918 ewong347 20 0 2551104 72908 59024 S 0.0 0.1 0:00.37 Web Content
1263 ewong347 20 0 2675184 175260 80520 S 0.0 0.3 7:47.22 WebExtensi+
1308 ewong347 20 0 2642312 136220 101144 S 0.0 0.2 65:55.46 Web Content
1382 ewong347 20 0 2553028 75780 61372 S 0.0 0.1 0:00.41 Web Content
1540 ewong347 20 0 2674160 174856 80580 S 0.0 0.3 7:58.89 WebExtensi+
1591 ewong347 20 0 2642572 136888 101304 S 0.0 0.2 66:16.85 Web Content
art@Langley:~$ top -b -n1 | grep Web | wc -l
116
Currently adding additional genome variants that are observed at least 10 times in the data to augment the cluster sequences (roughly 56); otherwise we get a bad estimate of the clock rate. This makes the resulting tree a bit cluttered.
Stop adding these extra genomes and instead constrain the clock rate to published estimates.
Since the initial run of TN93 is only for finding effectively identical genomes, we should set the -t flag to limit the output file size (currently 11G!).
If there are multiple vertical edges corresponding to the same sample collection date, then we can apply some curvature to prevent them from overlapping.
Maybe we can have this be animated, so if the user selects the derived variant bead (child node) then the vertical edge curves to the left.
I suspect that running MUSCLE or MAFFT will take too long to align the entire data.
I have written a pairwise alignment script that is running on orolo (start time April 7, 14:57).
It is running on two cores using MPI to split the sequence processing jobs.
Even pairwise alignment is memory intensive, consuming about 8GB (although a lot of this is probably due to poor memory efficiency of gotoh2), so running more than two cores crashes my machine.
Current estimate is about 50 sequences in 10 minutes using 2 cores.
The full data file gisaid_cov2020_sequences.fasta has 3815 sequences.
So I estimate it will take about 13 hours to complete processing the data.
The alignment files look good so far.
Can this be extracted from the JSON?
Basic hierarchical clustering seems to provide some interesting and useful results; see the new script hclust.R. All the required info should be contained in clusters.json.