highliner's People

Contributors

artpoon, lisa-monique

Forkers

george-githinji

highliner's Issues

Make print and summary functions for session object

Presently printing a session displays the following:

> ses
<environment: 0x5583ffd49208>
attr(,"class")
[1] "session"     "environment"

and calling summary() raises an error, for example:

> summary(ses)
Error in order(var_counts, decreasing = F) : argument 1 is not a vector
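
A minimal sketch of what S3 print and summary methods for such an object could look like. The internal layout of a session is not shown in this issue, so the sketch assumes a session is an environment holding one entry per imported file; `new_session` and the `n_entries` column are hypothetical names used only for illustration.

```r
# Hypothetical session: an environment tagged with class "session",
# holding one entry per imported file (an assumption for this sketch).
new_session <- function() {
  structure(new.env(), class = c("session", "environment"))
}

# Short, human-readable listing instead of the raw environment address
print.session <- function(x, ...) {
  files <- ls(x)
  cat("<session> with", length(files), "data set(s)\n")
  for (f in files) cat("  ", f, "\n")
  invisible(x)
}

# One row per data set with the number of entries it holds
summary.session <- function(object, ...) {
  files <- ls(object)
  # vapply guarantees an integer vector even when the session is empty
  n <- vapply(files, function(f) length(object[[f]]), integer(1))
  data.frame(file = files, n_entries = unname(n), stringsAsFactors = FALSE)
}

ses <- new_session()
assign("a.fasta", c("ACGT", "ACGA"), envir = ses)
print(ses)
summary(ses)
```

Returning a plain data frame from summary() keeps the result easy to inspect and to test; a dedicated summary class with its own print method would be another option.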

Write FASTA/FASTQ parser

@Lisa-Monique suggested R environments as a possible approach to hash-based key-value data structures.
The parser should also accept the SAM file format as input.
In all cases, we should try to stream the input file and process the data "online" instead of reading the entire contents into memory first.
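
The streaming idea might be sketched as follows: read the FASTA file in chunks of lines through a connection and keep a running count of identical sequences in an environment used as a hash table. `count_fasta_variants` and `chunk_size` are hypothetical names, and this is not the package's actual parser.

```r
# Hypothetical helper: stream a FASTA file and count identical sequences.
count_fasta_variants <- function(path, chunk_size = 1000L) {
  counts <- new.env(hash = TRUE)     # sequence -> number of reads
  con <- file(path, open = "r")
  on.exit(close(con))
  current <- character(0)            # lines of the record being assembled

  # Store the record collected so far, then reset the buffer
  flush_record <- function() {
    if (length(current) == 0L) return(invisible())
    key <- toupper(paste(current, collapse = ""))
    prev <- if (exists(key, envir = counts, inherits = FALSE)) counts[[key]] else 0L
    assign(key, prev + 1L, envir = counts)
    current <<- character(0)
  }

  repeat {
    lines <- readLines(con, n = chunk_size)   # only one chunk is in memory
    if (length(lines) == 0L) break
    for (ln in lines) {
      if (startsWith(ln, ">")) {
        flush_record()                        # a header closes the previous record
      } else if (nzchar(ln)) {
        current <- c(current, ln)
      }
    }
  }
  flush_record()                              # last record in the file
  counts
}

# Toy input: records r1 and r2 carry the same sequence
fa <- tempfile(fileext = ".fa")
writeLines(c(">r1", "ACGT", "ACGT", ">r2", "acgtacgt", ">r3", "TTTT"), fa)
counts <- count_fasta_variants(fa, chunk_size = 2L)
mget(ls(counts), envir = counts)
```

Because the sequences are used as environment keys, this sketch inherits R's limit on the length of variable names, so very long sequences would need a different key.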

Diversity measures

  • nucleotide/amino acid entropy
  • tree length (requires a method to reconstruct tree)
  • nucleotide diversity
  • percent complexity (number of variants / number of sequences)
  • mean genetic distance (use dist.dna in ape package)
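
Three of the measures listed above can be sketched in base R on a toy table of variant counts. The sketch assumes aligned sequences of equal length stored as the names of an integer vector, and it interprets entropy as the Shannon entropy of the variant frequency distribution; both are assumptions, and all function names are hypothetical.

```r
# Toy data: variant sequences (aligned, equal length) and their read counts
variants <- c(AAAA = 6L, AAAT = 3L, TAAT = 1L)

# Shannon entropy (in bits) of the variant frequency distribution
shannon_entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Percent complexity: number of variants / number of sequences
percent_complexity <- function(counts) {
  100 * length(counts) / sum(counts)
}

# Nucleotide diversity: mean proportion of differing sites over all pairs of reads.
# Each pair of distinct variants is compared once and weighted by its read counts.
nucleotide_diversity <- function(counts) {
  seqs <- strsplit(names(counts), "")
  n <- sum(counts)
  total <- 0
  for (i in seq_along(seqs)) {
    for (j in seq_along(seqs)) {
      if (i < j) {
        total <- total + counts[[i]] * counts[[j]] * mean(seqs[[i]] != seqs[[j]])
      }
    }
  }
  total / choose(n, 2)
}

shannon_entropy(variants)
percent_complexity(variants)
nucleotide_diversity(variants)
```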

variable names are limited to 10000 bytes

Thank you for this. I am using the following code to visualize an alignment:

files <- 'data/dta.fasta.aln'
highline(files[1])

and then I get this error

Error in exists(sequence, envir = data$compressed) : 
  variable names are limited to 10000 bytes
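
The error comes from using whole sequences as variable names in an environment. One possible workaround, sketched here with made-up data, is to count sequences with match() and tabulate() so that they stay character data and never become variable names. This is an illustration of the idea, not the fix adopted in the package.

```r
# A toy sequence longer than the 10000-byte limit on R variable names
long_seq <- paste(rep("ACGT", 3000), collapse = "")   # 12000 characters
seqs <- c(long_seq, long_seq, sub("^A", "T", long_seq))

# Using the sequence as a variable name in an environment is expected to fail
e <- new.env()
res <- tryCatch(assign(long_seq, 1L, envir = e),
                error = function(err) conditionMessage(err))
print(res)

# Possible workaround: index each sequence against the unique sequences,
# then count how often each index occurs
uniq <- unique(seqs)
counts <- tabulate(match(seqs, uniq), nbins = length(uniq))
print(counts)
```

Another option would be to key the environment on a short digest of each sequence, which needs a hashing function from an additional package.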

Write INSTALL.md

  • Decide how we want to distribute the code (devtools, bioconductor, cran?)
  • Package the code
  • Installation instructions

Idea: random resampling to adjust for template count

If an NGS data set contains many more reads than available templates (nucleic acids) in the sequencing reaction, then the frequency distribution of variants - and thereby the resulting diversity measures - may be skewed. It might be possible to adjust for this by sampling N reads at random (with replacement?) M times, and then averaging the resulting variant frequencies across the M replicates to yield the expected frequencies.
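
The resampling idea could be sketched like this: draw N reads M times and average the variant frequencies across the replicates. `resample_frequencies` and its arguments are hypothetical names, and sampling with replacement is only the default here because the issue leaves that question open.

```r
# counts: named integer vector of read counts per variant
# n: reads drawn per replicate, m: number of replicates
resample_frequencies <- function(counts, n, m = 100L, replace = TRUE) {
  reads <- rep(names(counts), times = counts)   # one entry per read
  freqs <- replicate(m, {
    draw <- sample(reads, size = n, replace = replace)
    # Fixed factor levels keep variants that were not drawn, with frequency 0
    as.vector(table(factor(draw, levels = names(counts)))) / n
  })
  # Rows are variants, columns are replicates; average across replicates
  freqs <- matrix(freqs, nrow = length(counts))
  setNames(rowMeans(freqs), names(counts))
}

set.seed(1)
variants <- c(A = 900L, B = 90L, C = 10L)
expected <- resample_frequencies(variants, n = 50L, m = 200L)
expected
```

Averaging plain frequencies recovers the original proportions in expectation, so the adjustment would mainly affect measures computed within each replicate, such as the number of variants observed.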

Write unit tests

For example, make a toy example FASTA file to import and compare the parsed result to the expected result.

Tidy up plot

  • too much white space
  • master sequence can be emphasized by drawing a solid line around the bar, for example
  • visual cues for read abundance (squares plotted to the right of each bar)
  • other enhancements?

Installing from github clone

Unlisted package dependency (ggpubr)

art@orolo:~/git/highlineR$ R CMD INSTALL .
* installing to library ‘/home/art/R/x86_64-pc-linux-gnu-library/3.5’
ERROR: dependency ‘ggpubr’ is not available for package ‘highlineR’
* removing ‘/home/art/R/x86_64-pc-linux-gnu-library/3.5/highlineR’

After installing ggpubr the above command executes properly.

Tests fail

art@orolo:~/git/highlineR/tests$ Rscript testthat.R
── 1. Failure: quality score conversion correct (@test_parser.R#48)  ───────────
`convert_quality(intToUtf8(33:73), encoding = "invalid")` threw an error with unexpected message.
Expected match: "'arg' should be one of \"sanger\", \"solexa\", \"illumina1.3\", \"illumina1.5\", \"illumina1.8\""
Actual message: "'arg' should be one of “sanger”, “solexa”, “illumina1.3”, “illumina1.5”, “illumina1.8”"

══ testthat results  ═══════════════════════════════════════════════════════════
[ OK: 38 | SKIPPED: 0 | WARNINGS: 0 | FAILED: 1 ]
1. Failure: quality score conversion correct (@test_parser.R#48) 

Error: testthat unit tests failed
Execution halted
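
The expected and actual messages differ only in the quote characters (straight versus curly), which depend on the R version and locale. One way to make such a test robust, sketched below, is to match only the part of the message that does not contain quotes. `convert_quality_stub` is a hypothetical stand-in for the package's convert_quality function.

```r
# Hypothetical stand-in that validates its argument with match.arg()
convert_quality_stub <- function(encoding = c("sanger", "solexa", "illumina1.3",
                                              "illumina1.5", "illumina1.8")) {
  match.arg(encoding)
}

# Capture the error message raised for an invalid encoding
msg <- tryCatch(convert_quality_stub("invalid"),
                error = function(e) conditionMessage(e))

# Match only the quote-independent part of the message
stopifnot(grepl("should be one of", msg, fixed = TRUE))
```

In a testthat file the same idea would be to pass the shorter pattern "should be one of" to expect_error(). The sketch assumes untranslated (English) error messages.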

Make diversity measures more efficient

There must be a lot of redundant calculation: after computing the distance between sequences A and B, a third sequence C is likely to be almost identical to one of them, so we should be able to reuse much of the first calculation.
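
A different route to the same goal, not the reuse scheme described above, is to avoid pairwise comparisons altogether. For the mean proportion of differing sites, the average over all pairs of reads equals the average over alignment columns of n/(n-1) * (1 - sum(p^2)), where p are the base frequencies in that column. The sketch below assumes aligned variants of equal length with read counts; `site_based_diversity` is a hypothetical name.

```r
# counts: named integer vector; names are aligned variant sequences
site_based_diversity <- function(counts) {
  chars <- do.call(rbind, strsplit(names(counts), ""))   # variants x sites
  n <- sum(counts)
  per_site <- apply(chars, 2, function(col) {
    base_counts <- tapply(counts, col, sum)   # reads carrying each base at this site
    p <- base_counts / n
    (n / (n - 1)) * (1 - sum(p^2))            # chance that two reads differ here
  })
  mean(per_site)
}

variants <- c(AAAA = 6L, AAAT = 3L, TAAT = 1L)
site_based_diversity(variants)
```

The cost grows with the number of variants times the alignment length rather than with the square of the number of variants. Distances that apply a substitution model, such as those from dist.dna, cannot be decomposed this simply.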

Make examples

Place public or simulated data in a folder, together with example scripts for making plots.

Limit to regex in gsub

I ran into the following error when trying to process a dataset of HIV-1 env sequences:

Error in gsub(master, paste(master, "(m)"), seqs) : 
  invalid regular expression 'ATGAGAGTGATGGGGATCAAGACGAATTGTCAGCACTCAGAAAAATTGTGGGTCACAGTCTATTATGGGGTACCTGTGTGGAAAGAAGCAACTACTACTCTATTTTGTGCATCAGATGCTAAAGCATATGACACAGAGATGCATAATGTTTGGGCCACACATGCCTGCGTACCCACAGACCCTAACCCACAAGAAATGCAATTAGTGACAGAAAATTTTAACATGTGGAAAAATGACATGGTAGAACAAATGCATGAGGATATAATCAGTTTATGGGATCAAAGCCTAAAGCCATGTGTAAAATTAACCCCACTCTGTGTTATTTTAAATTGCGAAATAAAAAACTGCTCTTTCAATGTCTATGCACTCTTTAATACAATAGATGTGATACCAATATATATGTTGACAAATTGCAATACCTCAGTCATTACACAGGCCTGTCCAAAGGTATCCTTCGAACCAATTCCCATACATTATTGTACCCCTGCTGGTTTTGCGATTCTAAAGTGTAAGGATGAGAAGTTCAATGGAACAGGTCCATGTAAAAATGTCAGCTCAGTACAATGTACACATGGAATTAGGCCAGTGGTGTCAACTCAACTGTTGTTAAATGGCAGTCTAGCAGAAAAAGAGGTAGTAATTAGATCTGAAAATTTCACCAACAATGCTAAAACCATAATAGTACAGCTGAAGGACGCTGTAAACATTACTTGTATGAGACCCGGCAACAATACAAGAAAAAGTATAACTATAGGACCAGGGAGTGCATTTTATACATCAATAATAGGAGATATAAGACGAGCACATTGTAATGTTAGTGCAACAAAATGGAACAAGACTTTACATCAGGTAGTTGAAAAACTAAGAAAATTTAATCAATCCGGAGGGGACCCAGAAATTACAATGCATACCTTTAATTGTGGAGGGGAATTTTTCTATTGTAATACAACAAAACTGTTTAA

This issue has come up before, where the solution was to set the option perl=TRUE.

I've made this fix in the plot.Data function and it resolves this issue.
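
Since the master sequence is a literal string rather than a pattern, an alternative to perl=TRUE would be to skip regular expressions entirely. The sketch below uses a short made-up sequence as a stand-in for the full-length master sequence and is not the change that was committed to plot.Data.

```r
master <- "ATGAGAGTGATGG"   # stand-in for the full-length master sequence
seqs <- c("ATGAGAGTGATGG", "ATGAGAGTGATGC")

# fixed = TRUE treats the pattern as a plain string, so no regex is compiled
labelled <- gsub(master, paste(master, "(m)"), seqs, fixed = TRUE)

# If only exact matches should be relabelled, a plain comparison avoids gsub
labelled_exact <- ifelse(seqs == master, paste(master, "(m)"), seqs)

print(labelled)
```

The two variants differ when the master sequence occurs inside a longer sequence: gsub would label the substring, while the comparison would leave that sequence unchanged.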

Write README

  • description
  • installation (prerequisites)
  • usage example

Write plot function

The plot function should have the following features:

  • allow user to specify a reference sequence
  • calculate differences between each sequence and the reference
  • draw a figure illustrating these results

And eventually

  • distinguish among different types of nucleotide differences (e.g., nonsynonymous/synonymous)
  • layout multiple data sets on the same plot

Let's try using ggplot for this.
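
The step of calculating differences between each sequence and the reference might look like the sketch below, which returns one row per mismatch in a long format that ggplot can consume. `diff_from_reference` and the column names are hypothetical, and the sequences are assumed to be aligned and of equal length.

```r
# Return a data frame with one row per position where a sequence differs
# from the reference
diff_from_reference <- function(seqs, reference) {
  ref <- strsplit(reference, "")[[1]]
  rows <- lapply(seq_along(seqs), function(i) {
    s <- strsplit(seqs[i], "")[[1]]
    pos <- which(s != ref)
    data.frame(seq_id = rep(i, length(pos)), position = pos,
               base = s[pos], stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

seqs <- c("ACGTAC", "ACGAAC", "TCGTAG")
diffs <- diff_from_reference(seqs, reference = seqs[1])
print(diffs)
```

A data frame of this shape could then be drawn with one horizontal line per sequence and a coloured tick at each mismatch position.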

Rewrite man pages

For example, the current help page for plot.session lists a number of functions, e.g., data_melt, that are not exposed to the user. This help page was generated from inline comments in the source code, which is not adequate.

Warning messages from unit tests

Commit a185d5a

Warning messages:
1: In for (n in impnames) if (!is.null(genImp <- impenv[[n]])) { :
  closing unused connection 5 (/home/art/git/highlineR/tests/testthat/test_data/valid/fastq/test.fq)
2: In for (n in impnames) if (!is.null(genImp <- impenv[[n]])) { :
  closing unused connection 4 (/home/art/git/highlineR/tests/testthat/test_data/valid/fasta/test.fa)
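
Warnings of this kind usually mean that a file connection was opened but never closed explicitly. A common pattern, sketched here with a hypothetical helper that is not the package's own code, is to register the close with on.exit() immediately after opening the connection.

```r
# Hypothetical helper: read the first line of a file and always close the connection
read_first_line <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con), add = TRUE)   # runs even if readLines() fails
  readLines(con, n = 1L)
}

tmp <- tempfile(fileext = ".fa")
writeLines(c(">seq1", "ACGT"), tmp)
read_first_line(tmp)
```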

missing leg

> plot(ses)
Error in get_legend(plots) : object 'leg' not found
