plodex's People

Contributors

artpoon

plodex's Issues

Filter genomes with too many differences

Some genomes have an enormous number of differences from the references and are most likely problematic entries in the database.

I modified the parse_fasta.py script to output a long listing of genome collection dates and numbers of genetic differences:

    for qname, diffs, missing in encoder:
        region, country, coldate = parse_header(qname, regions, typos)
        args.outfile.write('{},{}\n'.format(coldate, len(diffs)))

    sys.exit()

and wrote the output to data/clock.csv.

Next, I plotted the result in R:

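# Assumes clock.csv has a header row naming the columns coldate and count
clock <- read.csv('~/git/plodex/data/clock.csv')
# Convert collection dates from strings to Date objects for plotting
clock$coldate <- as.Date(clock$coldate)
# Number of differences against collection date, with small points
plot(clock$coldate, clock$count, cex=0.5)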

[image: plot of clock$count against clock$coldate]
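
A minimal sketch of one way to drop such entries, assuming a fixed cutoff on the number of differences. The function name `filter_by_diffs` and the default cutoff of 100 are hypothetical and do not come from parse_fasta.py; a sensible cutoff would have to be read off the plot. The record layout `(qname, diffs, missing)` follows the loop shown above.

```python
def filter_by_diffs(records, max_diffs=100):
    """Split (qname, diffs, missing) records into kept and rejected.

    max_diffs is a hypothetical cutoff, not a value from the original script.
    """
    kept, rejected = [], []
    for qname, diffs, missing in records:
        if len(diffs) > max_diffs:
            # Too many differences: record the name so it can be reviewed
            rejected.append(qname)
        else:
            kept.append((qname, diffs, missing))
    return kept, rejected
```

Returning the rejected names alongside the kept records makes it possible to inspect what was filtered out instead of dropping it silently.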

Group mutation counts by epi week

Recording the prevalence of mutations by calendar date, as well as by country, is too fine-grained and would result in an enormous data file.
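
A rough sketch of the grouping, assuming each sample is a `(country, collection_date, mutations)` tuple; that layout and the name `count_by_week` are hypothetical. ISO weeks from the standard library stand in for epi weeks here, which may be numbered differently (for example, with weeks starting on Sunday).

```python
from collections import Counter
from datetime import date

def count_by_week(samples):
    """Count mutations per (country, year, week, mutation).

    samples: iterable of (country, collection_date, mutations), where
    collection_date is a datetime.date. This layout is an assumption.
    """
    counts = Counter()
    for country, coldate, mutations in samples:
        # Take year and week together from isocalendar() so they agree
        iso_year, iso_week, _ = coldate.isocalendar()
        for mut in mutations:
            counts[(country, iso_year, iso_week, mut)] += 1
    return counts
```

Each key then covers a whole week, so the output has at most one row per country, week and mutation instead of one per calendar date.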

Epiweek parser broken?

> foo[foo$year==2019,]
     region country year week mut.type mut.pos mut.diff count
5676   Asia   China 2019    1        ~      15        A     1
5677   Asia   China 2019    1        ~   20581        A     1
5678   Asia   China 2019    1        ~   20590        A     1
5679   Asia   China 2019    1        ~   21048        G     1
5680   Asia   China 2019    1        ~   21227        A     1
5681   Asia   China 2019    1        ~   21567        A     1
5682   Asia   China 2019    1        ~      22        C     1
5683   Asia   China 2019    1        ~      23        G     1
5684   Asia   China 2019    1        ~   24236        G     3
5685   Asia   China 2019    1        ~      30        G     1
5686   Asia   China 2019    1        ~      31        C     1
5687   Asia   China 2019    1        ~      35        A     1
5688   Asia   China 2019    1        ~    6907        C     1
5689   Asia   China 2019    1        ~    6927        A     1
5690   Asia   China 2019    1        ~    7912        C     1
5691   Asia   China 2019    1        ~    9445        T     1
5692   Asia   China 2019   52        ~   11675        A     1
5693   Asia   China 2019   52        ~    3689        G     1
5694   Asia   China 2019   52        ~    6879        A     1
5695   Asia   China 2019   52        ~    8299        G     1
5696   Asia   China 2019   52        ~    8898        A     1

It shouldn't be possible that genomes were sampled in January 2019.
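
One possible explanation, not confirmed here, is that the parser pairs the calendar year of the collection date with a week number that already belongs to the following year. The snippet below illustrates this with ISO weeks from the Python standard library; whether the plodex parser uses ISO weeks or another epi week convention is an assumption.

```python
from datetime import date

d = date(2019, 12, 30)
iso_year, iso_week, _ = d.isocalendar()

# Mixing the calendar year with the ISO week gives a misleading pair
naive = (d.year, iso_week)
# Taking both fields from isocalendar() keeps them consistent
consistent = (iso_year, iso_week)

print(naive, consistent)  # (2019, 1) (2020, 1)
```

If this is the cause, the rows labelled 2019 week 1 would be genomes collected in the last days of December 2019.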

Which framework?

I am leaning towards RMarkdown to generate an HTML document "narrative".

No denominator

Just realized that storing mutation counts provides no information on the number of genomes sampled from a given country at a given date, i.e., the denominator of these counts.
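
A minimal sketch of keeping the denominator alongside the counts, assuming one `(country, year, week, mutations)` record per genome. The record layout and the name `mutation_frequencies` are hypothetical.

```python
from collections import Counter

def mutation_frequencies(samples):
    """Return {(country, year, week, mutation): fraction of genomes}.

    samples: iterable of (country, year, week, mutations), one per genome.
    This record layout is an assumption for illustration.
    """
    genomes = Counter()    # denominator: genomes per stratum
    mutations = Counter()  # numerator: genomes carrying each mutation
    for country, year, week, muts in samples:
        stratum = (country, year, week)
        genomes[stratum] += 1
        for mut in set(muts):  # count each mutation once per genome
            mutations[stratum + (mut,)] += 1
    return {key: n / genomes[key[:3]] for key, n in mutations.items()}
```

Genomes with no mutations still increment the denominator, which is exactly the information the mutation counts alone cannot provide.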
