poonlab / plodex

Interactive visualization of SARS-CoV-2 variants

License: GNU Affero General Public License v3.0

Python 100.00%

plodex's Issues

Filter genomes with too many differences

Some genomes have an enormous number of differences from the references and are most likely problematic entries in the database.

I modified the parse_fasta.py script to output a long listing of genome collection dates and numbers of genetic differences:

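    # temporary change: write one "collection date,number of differences" line per genome
    for qname, diffs, missing in encoder:
        region, country, coldate = parse_header(qname, regions, typos)
        args.outfile.write('{},{}\n'.format(coldate, len(diffs)))

    sys.exit()  # stop here so the rest of the script does not run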

and wrote the output to data/clock.csv.

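Next, I plotted the result in R:

    # the loop above writes no header row, so name the columns explicitly
    clock <- read.csv('~/git/plodex/data/clock.csv', header=FALSE,
                      col.names=c('coldate', 'count'))
    clock$coldate <- as.Date(clock$coldate)
    plot(clock$coldate, clock$count, cex=0.5)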
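
As a starting point for the filter itself, here is a minimal Python sketch that flags genomes whose difference count sits far above the bulk of the data. It assumes header-less `coldate,count` lines like those written by the loop above. The names `read_clock` and `flag_outliers`, the median plus `k` times MAD cutoff, and the example rows are all hypothetical and not part of plodex. Because differences accumulate over time, a cutoff that depends on collection date would probably work better; this version ignores dates to stay short.

```python
import csv
import io
from statistics import median


def read_clock(handle):
    """Parse header-less 'coldate,count' lines into (date string, int) tuples."""
    return [(coldate, int(count)) for coldate, count in csv.reader(handle)]


def flag_outliers(rows, k=5.0):
    """Return rows whose count exceeds median + k * MAD (hypothetical rule)."""
    counts = [n for _, n in rows]
    med = median(counts)
    # median absolute deviation: a spread estimate that outliers barely affect
    mad = median(abs(n - med) for n in counts)
    # fall back to 1.0 so the cutoff stays usable when most counts are identical
    cutoff = med + k * max(mad, 1.0)
    return [(coldate, n) for coldate, n in rows if n > cutoff]


# made-up example rows, not real data
example = io.StringIO(
    "2020-03-01,5\n2020-03-02,7\n2020-03-03,6\n2020-03-04,250\n"
)
rows = read_clock(example)
outliers = flag_outliers(rows)
print(outliers)
```

The value of `k` would need tuning against the plot of counts over time before using anything like this to drop genomes.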

Group mutation counts by epi week

Recording the prevalence of mutations by calendar date, as well as by country, is too fine-grained and would result in an enormous data file. Grouping the counts by epi week instead keeps the file manageable.
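
A rough sketch of the grouping in Python, assuming each genome is available as a `(country, coldate, diffs)` tuple with `coldate` in `YYYY-MM-DD` form. The function names and the example records are hypothetical. ISO weeks (Monday start) are used as a stand-in for epi weeks, which some agencies define with a Sunday start.

```python
from collections import Counter
from datetime import date


def week_key(coldate):
    """Map a 'YYYY-MM-DD' string to an (ISO year, ISO week) pair."""
    iso = date.fromisoformat(coldate).isocalendar()
    return iso[0], iso[1]


def count_by_week(records):
    """Count each mutation per (country, year, week) instead of per day."""
    counts = Counter()
    for country, coldate, diffs in records:
        year, week = week_key(coldate)
        for diff in diffs:
            counts[(country, year, week, diff)] += 1
    return counts


# made-up records: (country, collection date, list of (type, position, base))
records = [
    ("China", "2020-01-06", [("~", 241, "T")]),
    ("China", "2020-01-08", [("~", 241, "T"), ("~", 3037, "T")]),
    ("China", "2020-01-13", [("~", 241, "T")]),
]
weekly = count_by_week(records)
for key, n in sorted(weekly.items()):
    print(key, n)
```

The first two genomes fall in the same week, so their shared mutation collapses into a single row with a count of 2.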

Epiweek parser broken?

    > foo[foo$year==2019,]
         region country year week mut.type mut.pos mut.diff count
    5676   Asia   China 2019    1        ~      15        A     1
    5677   Asia   China 2019    1        ~   20581        A     1
    5678   Asia   China 2019    1        ~   20590        A     1
    5679   Asia   China 2019    1        ~   21048        G     1
    5680   Asia   China 2019    1        ~   21227        A     1
    5681   Asia   China 2019    1        ~   21567        A     1
    5682   Asia   China 2019    1        ~      22        C     1
    5683   Asia   China 2019    1        ~      23        G     1
    5684   Asia   China 2019    1        ~   24236        G     3
    5685   Asia   China 2019    1        ~      30        G     1
    5686   Asia   China 2019    1        ~      31        C     1
    5687   Asia   China 2019    1        ~      35        A     1
    5688   Asia   China 2019    1        ~    6907        C     1
    5689   Asia   China 2019    1        ~    6927        A     1
    5690   Asia   China 2019    1        ~    7912        C     1
    5691   Asia   China 2019    1        ~    9445        T     1
    5692   Asia   China 2019   52        ~   11675        A     1
    5693   Asia   China 2019   52        ~    3689        G     1
    5694   Asia   China 2019   52        ~    6879        A     1
    5695   Asia   China 2019   52        ~    8299        G     1
    5696   Asia   China 2019   52        ~    8898        A     1

Week 1 of 2019 would mean these genomes were sampled in January 2019, which shouldn't be possible, so the year or the week is being assigned incorrectly.
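
One possible cause, which is a guess and not confirmed against the parser code, is pairing the calendar year with the ISO week number. Dates at the very end of December can belong to week 1 of the following ISO year, so the mismatch yields "2019, week 1". The sketch below reproduces that with Python's `datetime`; `naive_week` and `iso_week` are hypothetical helper names.

```python
from datetime import date


def naive_week(d):
    """Suspected bug: calendar year combined with the ISO week number."""
    return d.year, d.isocalendar()[1]


def iso_week(d):
    """Take year and week from the same ISO calendar so they stay consistent."""
    iso = d.isocalendar()
    return iso[0], iso[1]


late_december = date(2019, 12, 30)   # a Monday, first day of ISO week 1 of 2020
print(naive_week(late_december))     # (2019, 1): looks like January 2019
print(iso_week(late_december))       # (2020, 1)

mid_december = date(2019, 12, 24)
print(iso_week(mid_december))        # (2019, 52): unaffected by the mismatch
```

If the parser uses a different week convention, the same check applies: the year has to come from the week-numbering scheme, not from the collection date.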

Which framework?

I am leaning towards RMarkdown to generate an HTML document "narrative".

JSON is more efficient but difficult to work with

Should these data be stored as a CSV instead?
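
To get a feel for the trade-off, here is a small Python sketch that flattens a nested JSON object into CSV rows. The nesting (country, then week, then mutation, then count), the `"~|241|T"` key format, and the column names are assumptions made for illustration; the actual JSON layout is not shown here.

```python
import csv
import io
import json


def flatten(nested):
    """Yield one (country, week, mutation, count) row per leaf value."""
    for country, weeks in nested.items():
        for week, mutations in weeks.items():
            for mutation, count in mutations.items():
                yield country, week, mutation, count


# made-up example in the assumed nested layout
nested = json.loads(
    '{"China": {"2020-W02": {"~|241|T": 2, "~|3037|T": 1}}}'
)

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["country", "week", "mutation", "count"])
writer.writerows(flatten(nested))
csv_text = buf.getvalue()
print(csv_text)
```

The CSV repeats the country and week on every row, which is where the size cost comes from, but each row can be read straight into a data frame.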

No denominator

I just realized that storing only mutation counts provides no information on the number of genomes sampled from a given country on a given date, i.e., the denominator of these counts.
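
A minimal sketch of one way to keep the denominator, assuming per-genome records of the form `(country, week, diffs)`. The names `tally` and `prevalence` and the example records are hypothetical. The idea is to count genomes per country and week in the same pass that counts mutations, so that genomes with no differences still contribute to the total.

```python
from collections import Counter


def tally(records):
    """Count genomes and mutations per (country, week) in one pass."""
    genomes = Counter()
    mutations = Counter()
    for country, week, diffs in records:
        genomes[(country, week)] += 1          # the denominator
        for diff in set(diffs):                # count each mutation once per genome
            mutations[(country, week, diff)] += 1
    return genomes, mutations


def prevalence(genomes, mutations):
    """Divide each mutation count by the genomes sampled in that country and week."""
    return {key: n / genomes[key[:2]] for key, n in mutations.items()}


# made-up records; two genomes carry no differences at all
records = [
    ("China", (2020, 2), ["C241T"]),
    ("China", (2020, 2), ["C241T", "C3037T"]),
    ("China", (2020, 2), []),
    ("China", (2020, 2), []),
]
genomes, mutations = tally(records)
freq = prevalence(genomes, mutations)
print(freq)
```

Storing the genome totals as a separate small table, keyed by country and week, would avoid repeating the denominator on every mutation row.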
