poonlab / covizu
Rapid analysis and visualization of coronavirus genome variation
Home Page: https://filogeneti.ca/CoVizu/
License: MIT License
This will break the pipeline script covizu.sh until we repair the paths.
From TreeTime NEXUS output:
#NEXUS
Begin Taxa;
Dimensions NTax=119;
TaxLabels hCoV-19/HongKong/HKPU45-0502/2020|E_000 hCoV-19/HongKong/HKPU23_2601/2020|E_001 hCoV-19/Scotland/CVR248/2020|EPI_IS_002
User could enter either a GISAID accession number or description, e.g., BavPat1. Should highlight clusters and/or nodes that match the search string. Multiple matches should be possible.
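A minimal sketch of the matching step (the actual highlighting would live in the front-end; `search_matches` is a hypothetical helper, not existing CoVizu code):

```python
def search_matches(query, labels):
    """Case-insensitive substring search over node/tip labels; a query can be
    either an accession (EPI_ISL_...) or part of the description (BavPat1),
    and every matching label is returned to support multiple hits."""
    q = query.lower()
    return [label for label in labels if q in label.lower()]
```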
For example, hCoV-19/Iran/MHKN-1/2020|EPI_ISL_413517|2020-02-26
is only 322 nt long in the FASTA files. This is screwing up the TN93 calculations.
Sergei Pond gave a talk at Dynamics where he reported that genome sequences were being changed by GISAID without any notification or indication of versioning in the sequence record. (reported by @RouxCil )
This is really problematic because we are currently only downloading the newest records day to day. We may need to maintain an unaligned set of genomes and regularly (weekly?) download the entire database, do an exact match comparison to the unaligned set, and then update the aligned data set if changes are detected.
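The exact-match comparison could be kept cheap by hashing each cached sequence; a rough sketch (helper names are hypothetical):

```python
import hashlib

def digest(seq):
    """Hash an unaligned genome (whitespace stripped, uppercased) so that
    exact-match comparison does not require storing two full copies."""
    return hashlib.sha256(seq.strip().upper().encode()).hexdigest()

def changed_records(local, latest):
    """Given {accession: sequence} dicts for our cached unaligned set and a
    fresh full download, return accessions whose sequence content changed
    without any versioning notice."""
    return [acc for acc, seq in latest.items()
            if acc in local and digest(local[acc]) != digest(seq)]
```

Changed accessions would then trigger re-alignment of just those records.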
art@Kestrel:~/git/covizu/2020-04-15-0001_treetime$ grep -o NODE timetree.nexus | wc -l
1917
art@Kestrel:~/git/covizu/2020-04-15-0001_treetime$ grep -o NODE clean.nwk | wc -l
1320
This should be moved closer to front-end.
They should be identical genome sequences.
Will have to identify internal nodes via landmark tip labels.
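One way to pin internal nodes down, since tools renumber them differently, is to identify each internal node by the set of tip labels below it. A sketch over a toy nested-dict tree representation (the dict layout is an assumption for illustration):

```python
def tip_set(node, index):
    """Post-order traversal: record each internal node under the frozenset of
    its descendant tip labels, which is unique within a tree and stable
    across tools that relabel or renumber internal nodes."""
    if not node['children']:                      # tip
        return frozenset([node['label']])
    tips = frozenset().union(*(tip_set(child, index) for child in node['children']))
    index[tips] = node
    return tips
```

Two trees built from the same taxa can then be matched on shared tip sets.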
From nohup/cronlog:
Traceback (most recent call last):
  File "/home/covid/gotoh2.py", line 262, in <module>
    aligner = Aligner()
  File "/home/covid/gotoh2.py", line 25, in __init__
    files = pkgres.resource_listdir('gotoh2', 'models')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1190,$
    resource_name
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1486,$
    return self._listdir(self._fn(self.module_path, resource_name))
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1574,$
    return os.listdir(path)
OSError: [Errno 2] No such file or directory: '/home/covid/models'
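pkg_resources is resolving `models` relative to where it found the module, i.e. `/home/covid/`, where the `models/` directory does not exist. A hypothetical workaround sketch (the `models_dir` helper is not existing gotoh2 code) would locate the directory next to the module file itself:

```python
import os

def models_dir(module_file):
    """Resolve the 'models' directory relative to the module's own file,
    so the script finds its scoring matrices regardless of where the cron
    job was launched from."""
    return os.path.join(os.path.dirname(os.path.abspath(module_file)), 'models')
```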
Also need to make these searchable.
Right now I am splitting the tree on nodes with high out-degree (above 20). However, the number of resulting subtree components varies between 14 and 19 with repeated runs of the same script.
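If the variation comes from an unstable traversal or iteration order, a fixed traversal with sorted children should make the split reproducible. A sketch over a toy nested-dict tree (the dict layout and `split_nodes` name are assumptions):

```python
def split_nodes(node, cutoff=20, found=None):
    """Collect internal nodes with out-degree above `cutoff` in a fixed
    pre-order traversal; sorting children by label means repeated runs on
    the same tree return the same nodes in the same order."""
    if found is None:
        found = []
    children = node.get('children', [])
    if len(children) > cutoff:
        found.append(node)
    for child in sorted(children, key=lambda c: str(c.get('label'))):
        split_nodes(child, cutoff, found)
    return found
```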
Labeled with year/month/day
To be implemented via CSS
                |
  treetime SVG  |
                |  side
----------------|  bar
                |  (help +
  cluster SVG   |  info)
                |
Currently the time-scaled tree is rendered like this:
There's a problem with node id numbering:
> df.length
200
> edgeset.length
394
> df[199]
{parentId: null, parentLabel: null, thisId: 201, thisLabel: "↵", children: Array(2), …}
angle: undefined
branchLength: 0
children: (2) [80, 198]
isTip: false
parentId: null
parentLabel: null
thisId: 201
thisLabel: "↵"
x: 0
y: 41.74586639404297
__proto__: Object
thisId should match the node index in the array df.
I suspect that something is wrong with parsing the Newick tree string:
(see lines 12 to 81 in commit 4695087)
Containing object-specific info (e.g., summary composition of cluster with respect to countries of sampling)
Describe each stage of the workflow in sufficient detail for lab members to contribute.
The updater script is not working properly.
While running the pipeline, the following exception was thrown in hclust.R:
Error: all(is.element(variants$cluster, headers)) is not TRUE
Running this script manually, I found the following problem:
> variants <- read.csv('data/variants.csv')
> test <- is.element(variants$cluster, headers)
> table(test)
test
FALSE TRUE
5 20702
> which(!test)
[1] 12422 12423 12424 12425 12426
> variants[12422:12426,]
cluster
12422 hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08
12423 hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08
12424 hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08
12425 hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08
12426 hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08
label coldate region
12422 hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08 2020-04-08 Africa
12423 hCoV-19/DRC/2120/2020|EPI_ISL_435163|2020-04-12 2020-04-12 Africa
12424 hCoV-19/DRC/2121/2020|EPI_ISL_435164|2020-04-12 2020-04-12 Africa
12425 hCoV-19/DRC/2125/2020|EPI_ISL_435166|2020-04-13 2020-04-13 Africa
12426 hCoV-19/DRC/2384/2020|EPI_ISL_437196|2020-04-14 2020-04-14 Africa
country
12422 DRC
12423 DRC
12424 DRC
12425 DRC
12426 DRC
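The offending cluster labels contain a stray space before the first `|` (`hCoV-19/DRC/1423/2020 |EPI_ISL_434710|...`), which is why `is.element` fails against the clean headers. A sketch of a normalization step that could be applied to both sides before comparison (`normalize_label` is a hypothetical helper):

```python
def normalize_label(label):
    """Collapse all internal whitespace in a GISAID header so that labels
    like 'hCoV-19/DRC/1423/2020 |EPI_ISL_434710|2020-04-08' compare equal
    to their space-free form."""
    return ''.join(label.split())
```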
art@Kestrel:~/git/covizu$ grep -r hCoV-19/India/NCDC-3175/2020 data
data/clusters.json: "label1": "hCoV-19/India/NCDC-3175/2020",
data/gisaid-aligned.fa:>hCoV-19/India/NCDC-3175/2020|EPI_ISL_436426|2022-04-05
Record has since been corrected in GISAID.
Need to write a screen for records with collection dates in the future, probably in the filtering.py script.
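A minimal sketch of that screen, assuming the usual `name|accession|YYYY-MM-DD` header layout (the function name is hypothetical):

```python
from datetime import date

def has_future_coldate(header, today=None):
    """Return True if the collection date in a 'name|accession|YYYY-MM-DD'
    header falls after `today`, as with the EPI_ISL_436426 record dated
    2022-04-05. Unparseable dates are left for other filters."""
    today = today or date.today()
    try:
        coldate = date.fromisoformat(header.split('|')[-1])
    except ValueError:
        return False
    return coldate > today
```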
@ewong347 can you please download the GISAID FASTA on a weekly basis to Langley, with a unique filename (e.g., GISAID-2020-05-01.fa), since you are currently the only one with the ability to access the data export?
Shall we use that annoying "we use cookies" web component to briefly explain what CoVizu does on page load?
Clicking on a "help" button should start a walk-through that selects & highlights objects on the time-scaled tree and cluster SVGs with tool-tips, and text in the side bar. User should be able to exit the walkthrough at any time (returning to normal interface).
Somewhere on the page we should display the following information:
GISAID flags genomes with more than 1% N's (uncalled bases), which corresponds to roughly 300 positions.
These histograms summarize the number of N's for 6,018 genome sequences:
where the right histogram is rescaled with respect to the y-axis. Only 1,758 genomes are free of ambiguous base calls. I arbitrarily decided on a cutoff of 10%, which corresponds to about 2,990 positions with N's.
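The 10% cutoff could be applied with a one-liner during filtering; a sketch (the function name is an assumption, not existing filtering.py code):

```python
def passes_n_filter(seq, cutoff=0.1):
    """Keep a genome only if at most `cutoff` of its positions are N
    (uncalled). At 10% of ~29,903 nt that is about 2,990 positions."""
    return seq.upper().count('N') <= cutoff * len(seq)
```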
On Langley:
>>> import gotoh2
>>> handle = open('gisaid-aligned.fa')
>>> fasta = gotoh2.convert_fasta(handle)
>>> len(fasta)
19371
>>> fd = dict(fasta)
>>> seq = fd["hCoV-19/USA/MI-MDHHS-SC20187/2020|EPI_ISL_436835|2020-03-22"]
>>> len(seq)
86631
>>> len(fasta[0][1])
29903
>>> seq[28000:28100]
'GT>HCOV-19/USA/CA-CZB-1039/2020|EPI_ISL_436680|2020-04-04ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTT'
Some kind of problem with line break characters...
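A quick sanity check could flag records where a header got glued into the preceding sequence, as with the 86,631 nt record above; a sketch (`embedded_headers` is a hypothetical helper):

```python
def embedded_headers(fasta):
    """fasta: list of (header, sequence) tuples. A '>' inside a sequence is
    a symptom of broken record parsing (e.g., carriage-return line endings
    fusing the next header onto the previous sequence)."""
    return [header for header, seq in fasta if '>' in seq]
```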
art@Kestrel:~/git/covizu$ grep EPI_ISL_420841 data/gisaid-aligned.fa
hCoV-19/Congo/300/2020|EPI_ISL_420841|2020-03-22
hCoV-19/DRC/300/2020|EPI_ISL_420841|2020-03-22
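Duplicated accessions under different names could be caught automatically when ingesting the FASTA; a rough sketch (the helper name is an assumption):

```python
from collections import defaultdict
import re

def duplicate_accessions(headers):
    """Group FASTA headers by their EPI_ISL accession and report any
    accession that appears under more than one distinct label, like
    EPI_ISL_420841 above (Congo vs. DRC)."""
    by_acc = defaultdict(set)
    for header in headers:
        match = re.search(r'EPI_ISL_\d+', header)
        if match:
            by_acc[match.group(0)].add(header)
    return {acc: labels for acc, labels in by_acc.items() if len(labels) > 1}
```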
Currently using a Python script to screen terminal branches for excessive lengths.
This helps but is not sufficient:
769.24 TreeTime: The following tips don't fit the clock model, please remove
them from the tree. Their dates have been reset:
769.24 EPI_ISL_418068, input date: 2020.2, apparent date: 2020.42
769.24 EPI_ISL_418184, input date: 2020.211, apparent date: 2020.60
769.24 EPI_ISL_408485, input date: 2020.047, apparent date: 2020.29
769.24 EPI_ISL_418185, input date: 2020.211, apparent date: 2020.49
769.24 EPI_ISL_417205, input date: 2020.167, apparent date: 2020.45
769.24 EPI_ISL_418189, input date: 2020.225, apparent date: 2020.79
769.24 EPI_ISL_418874, input date: 2020.225, apparent date: 2020.67
769.24 EPI_ISL_415618, input date: 2020.186, apparent date: 2020.76
769.24 EPI_ISL_419243, input date: 2020.142, apparent date: 2020.46
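One way to tighten the terminal-branch screen is an outlier rule instead of a fixed length cutoff; a sketch, assuming we have already extracted terminal branch lengths per tip (the function and its z-score rule are assumptions, not the current script):

```python
import statistics

def flag_long_tips(tip_lengths, z=1.5):
    """tip_lengths: {tip label: terminal branch length}. Flag tips whose
    branch length exceeds the mean by more than z standard deviations,
    a rough proxy for sequences that will not fit the clock model."""
    values = list(tip_lengths.values())
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [tip for tip, bl in tip_lengths.items() if sd > 0 and bl > mu + z * sd]
```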
A Python script would help us keep the local database up to date; web browsing, including form submission, can be done with the mechanize module (or a more recent replacement).
Refer to PoonLab/disorder repository for reference on web scraping/interface scripts in Python.
art@Langley:~$ top -b -n1 | grep Web | head
28302 ewong347 20 0 2614688 118852 92436 S 5.9 0.2 64:53.86 Web Content
29486 ewong347 20 0 2614688 118860 92248 S 5.9 0.2 64:39.60 Web Content
791 ewong347 20 0 2654016 113768 78460 S 0.0 0.2 10:32.29 WebExtensi+
846 ewong347 20 0 2614692 119388 92160 S 0.0 0.2 73:10.07 Web Content
918 ewong347 20 0 2551104 72908 59024 S 0.0 0.1 0:00.37 Web Content
1263 ewong347 20 0 2675184 175260 80520 S 0.0 0.3 7:47.22 WebExtensi+
1308 ewong347 20 0 2642312 136220 101144 S 0.0 0.2 65:55.46 Web Content
1382 ewong347 20 0 2553028 75780 61372 S 0.0 0.1 0:00.41 Web Content
1540 ewong347 20 0 2674160 174856 80580 S 0.0 0.3 7:58.89 WebExtensi+
1591 ewong347 20 0 2642572 136888 101304 S 0.0 0.2 66:16.85 Web Content
art@Langley:~$ top -b -n1 | grep Web | wc -l
116
Currently adding additional genome variants that are observed at least 10 times in the data to augment the cluster sequences (roughly 56); otherwise we get a bad estimate of the clock rate. This makes the resulting tree a bit cluttered.
Stop adding these extra genomes and instead constrain the clock rate to published estimates.
Since the initial run of TN93 is only for finding effectively identical genomes, we should set the -t flag to limit the output file size (currently 11G!).
If there are multiple vertical edges corresponding to the same sample collection date, then we can apply some curvature to prevent them from overlapping.
Maybe we can have this be animated, so if the user selects the derived variant bead (child node) then the vertical edge curves to the left.
I suspect that running MUSCLE or MAFFT will take too long to align the entire data.
I have written a pairwise alignment script that is running on orolo (start time April 7, 14:57).
It is running on two cores using MPI to split the sequence processing jobs.
Even pairwise alignment is memory intensive, consuming about 8GB (although a lot of this is probably due to poor memory efficiency of gotoh2), so running more than two cores crashes my machine.
Current estimate is about 50 sequences in 10 minutes using 2 cores.
The full data file gisaid_cov2020_sequences.fasta has 3815 sequences.
So I estimate it will take about 13 hours to complete processing the data.
The alignment files look good so far.
Can this be extracted from the JSON?
Basic hierarchical clustering seems to provide some interesting and useful results; see the new script hclust.R. All the required info should be contained in clusters.json.