The highliner from poonlab

Make print and summary functions for session object

Presently printing a session displays the following:

> ses
<environment: 0x5583ffd49208>
attr(,"class")
[1] "session"     "environment"

and calling summary() raises an error:

> summary(ses)
Error in order(var_counts, decreasing = F) : argument 1 is not a vector
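
A minimal sketch of what the requested S3 methods could look like. It assumes, hypothetically, that a session is an environment whose entries are data objects, each holding a `compressed` environment that maps a sequence to its count; `new_session` and the returned field names are illustrative and not taken from highlineR.

```r
# Hypothetical constructor, for illustration only.
new_session <- function() {
  structure(new.env(), class = c("session", "environment"))
}

# Compact display instead of the raw environment address.
print.session <- function(x, ...) {
  keys <- sort(ls(x))
  cat("<session> with", length(keys), "data object(s)\n")
  for (k in keys) cat("  ", k, "\n", sep = "")
  invisible(x)
}

# Return a plain list; an empty session yields zero-length results
# rather than an error.
summary.session <- function(object, ...) {
  keys <- sort(ls(object))
  n_variants <- vapply(
    keys,
    function(k) length(ls(object[[k]]$compressed)),
    integer(1)
  )
  list(n_data = length(keys), n_variants = n_variants)
}
```

With these definitions, typing `ses` at the prompt would list the names of the loaded data objects.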

Write FASTA/FASTQ parser

@Lisa-Monique found that R environments are a possible approach to hash-based key-value data structures.
Also allow the user to provide SAM files as input.
In all cases, we should stream the input file and process the data "online" instead of reading the entire contents into memory first.
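
A rough sketch of the streaming idea for FASTA input, using an environment as the hash table of variant counts. The function name `count_fasta_variants` and the `chunk` parameter are assumptions made for this example; SAM and FASTQ handling are not covered.

```r
# Stream a FASTA file in chunks of lines and count identical sequences.
# `count_fasta_variants` is a hypothetical name, not part of highlineR.
count_fasta_variants <- function(path, chunk = 1000L) {
  counts <- new.env(hash = TRUE)
  con <- file(path, open = "r")
  on.exit(close(con))

  current <- character(0)  # sequence lines of the record being read
  flush_record <- function() {
    if (length(current) == 0L) return(invisible(NULL))
    key <- toupper(paste(current, collapse = ""))
    old <- if (exists(key, envir = counts, inherits = FALSE)) counts[[key]] else 0L
    assign(key, old + 1L, envir = counts)
    current <<- character(0)
  }

  repeat {
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0L) break
    for (ln in lines) {
      if (startsWith(ln, ">")) {
        flush_record()            # a header closes the previous record
      } else if (nzchar(ln)) {
        current <- c(current, ln)
      }
    }
  }
  flush_record()                  # last record has no trailing header
  counts
}
```

Only one chunk of lines plus the table of distinct variants is held in memory at a time. Note that environment keys are subject to R's limit on variable name length, so very long sequences would need a different key.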

Diversity measures

nucleotide/amino acid entropy
~~tree length (requires a method to reconstruct tree)~~
nucleotide diversity
percent complexity (number of variants / number of sequences)
mean genetic distance (use dist.dna in ape package)
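
A small base-R sketch of three of the measures listed above, assuming the input is a character vector of aligned sequences of equal length. The function names are illustrative; for real data, `dist.dna` from the ape package (as suggested above) offers proper evolutionary distance models, whereas this sketch only uses the raw proportion of differing sites.

```r
# Shannon entropy (in bits) at each alignment column.
column_entropy <- function(seqs) {
  chars <- do.call(rbind, strsplit(seqs, ""))   # rows = sequences
  apply(chars, 2, function(col) {
    p <- table(col) / length(col)
    -sum(p * log2(p))
  })
}

# Percent complexity: number of distinct variants / number of sequences.
percent_complexity <- function(seqs) {
  100 * length(unique(seqs)) / length(seqs)
}

# Nucleotide diversity: mean proportion of differing sites over all pairs.
# Requires at least two sequences.
nucleotide_diversity <- function(seqs) {
  chars <- do.call(rbind, strsplit(seqs, ""))
  pairs <- combn(nrow(chars), 2)
  mean(apply(pairs, 2, function(ij) mean(chars[ij[1], ] != chars[ij[2], ])))
}
```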

variable names are limited to 10000 bytes

Thank you for this. I have this code trying to visualize an alignment

files <- 'data/dta.fasta.aln'
highline(files[1])

and then I get this error

Error in exists(sequence, envir = data$compressed) : 
  variable names are limited to 10000 bytes
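
The limit applies because each sequence is used as a variable name inside an environment. One possible workaround, sketched here as an assumption rather than the package's actual fix, is to count variants with `table()`, which keeps the sequences as ordinary character strings. `count_variants` is a hypothetical helper.

```r
# A sequence longer than 10,000 bytes cannot be used as an environment key.
long_seq <- strrep("A", 10001)
e <- new.env()
msg <- tryCatch(
  {
    assign(long_seq, 1L, envir = e)
    "no error"
  },
  error = function(err) conditionMessage(err)
)

# Workaround sketch: table() stores sequences as strings, not variable names.
count_variants <- function(seqs) {
  tab <- table(seqs)
  setNames(as.integer(tab), names(tab))
}

counts <- count_variants(c(long_seq, long_seq, "ACGT"))
```

Another option would be to key the environment by a short hash of each sequence, which needs an external hashing package.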

Write summary function for S3 data object

This could return:

the number of sequences
the sequence length
the number of variants
summary measures of variant counts
the most abundant sequence
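
The items above could be computed along these lines. The sketch assumes the variant counts are available as a named integer vector (names are the distinct sequences); `summarize_variants` and the field names are hypothetical, and the function could serve as the body of a `summary` method for the data class.

```r
# `counts`: named integer vector; names are the distinct sequences,
# values are how many times each was observed.
summarize_variants <- function(counts) {
  stopifnot(length(counts) > 0L, !is.null(names(counts)))
  list(
    n_sequences   = sum(counts),
    seq_length    = unique(nchar(names(counts))),  # one value if aligned
    n_variants    = length(counts),
    count_summary = summary(as.numeric(counts)),
    most_abundant = names(counts)[which.max(counts)]
  )
}
```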

Write INSTALL.md

Decide how we want to distribute the code (devtools, Bioconductor, CRAN?)
Package the code
Installation instructions

Idea: random resampling to adjust for template count

If an NGS data set contains many more reads than available templates (nucleic acids) in the sequencing reaction, then the frequency distribution of variants - and thereby the resulting diversity measures - may be skewed. It might be possible to adjust for this by sampling N reads at random (with replacement?) M times, and then averaging the resulting variant frequencies across the M replicates to yield the expected frequencies.
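
A sketch of the resampling idea, assuming the variant counts are held in a named integer vector. `resample_frequencies` and its parameters `n` (reads drawn per replicate), `m` (number of replicates) and `replace` are names chosen for this example; whether sampling should be with replacement is left as an argument because the question is still open.

```r
# counts: named integer vector of read counts per variant
# n: reads drawn per replicate (e.g. an estimate of the template count)
# m: number of replicates to average over
resample_frequencies <- function(counts, n, m = 100L, replace = TRUE) {
  reads <- rep(seq_along(counts), times = counts)  # one variant index per read
  reps <- replicate(m, {
    draw <- reads[sample.int(length(reads), size = n, replace = replace)]
    tabulate(draw, nbins = length(counts)) / n     # variant frequencies
  })
  # one row per variant, one column per replicate
  freqs <- rowMeans(matrix(reps, nrow = length(counts)))
  setNames(freqs, names(counts))
}
```

For example, `resample_frequencies(c(A = 900L, B = 90L, C = 10L), n = 50, m = 200)` returns the averaged frequency of each variant across 200 draws of 50 reads.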

Write unit tests

For example, make a toy example FASTA file to import and compare the parsed result to the expected result.

Tidy up plot

too much white space
master sequence can be emphasized by drawing a solid line around the bar, for example
visual cues for read abundance (squares plotted to the right of each bar)
other enhancements?

Installing from github clone

Unlisted package dependency (ggpubr)

art@orolo:~/git/highlineR$ R CMD INSTALL .
* installing to library ‘/home/art/R/x86_64-pc-linux-gnu-library/3.5’
ERROR: dependency ‘ggpubr’ is not available for package ‘highlineR’
* removing ‘/home/art/R/x86_64-pc-linux-gnu-library/3.5/highlineR’

After installing ggpubr the above command executes properly.

Running test suite requires 'testthat' package

Tests fail

art@orolo:~/git/highlineR/tests$ Rscript testthat.R
── 1. Failure: quality score conversion correct (@test_parser.R#48)  ───────────
`convert_quality(intToUtf8(33:73), encoding = "invalid")` threw an error with unexpected message.
Expected match: "'arg' should be one of \"sanger\", \"solexa\", \"illumina1.3\", \"illumina1.5\", \"illumina1.8\""
Actual message: "'arg' should be one of “sanger”, “solexa”, “illumina1.3”, “illumina1.5”, “illumina1.8”"

══ testthat results  ═══════════════════════════════════════════════════════════
[ OK: 38 | SKIPPED: 0 | WARNINGS: 0 | FAILED: 1 ]
1. Failure: quality score conversion correct (@test_parser.R#48) 

Error: testthat unit tests failed
Execution halted
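
The expected and actual messages differ only in their quote characters: the test expects straight quotes, while the error raised in this locale uses curly quotes. A possible way to make the check independent of quoting is to match the choice names only, as in this sketch. `convert_quality_stub` is a stand-in written for this example and reproduces only the argument check, not the real `convert_quality`.

```r
# Stand-in for the package function; only the argument check is reproduced.
convert_quality_stub <- function(x, encoding = c("sanger", "solexa",
                                                 "illumina1.3", "illumina1.5",
                                                 "illumina1.8")) {
  encoding <- match.arg(encoding)
  encoding
}

msg <- tryCatch(
  convert_quality_stub("II", encoding = "invalid"),
  error = function(e) conditionMessage(e)
)

# Match the choice names but not the quote characters around them.
pattern <- "should be one of .*sanger.*solexa.*illumina1\\.3"
quote_agnostic_ok <- grepl(pattern, msg)
```

The same pattern could be passed as the expected message in the failing test. Setting `options(useFancyQuotes = FALSE)` before running the tests is another option worth trying.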

Make diversity measures more efficient

There must be a lot of redundant calculation: after computing the distance between sequences A and B, a third sequence C is likely to be almost identical to one of them, so we should be able to reuse much of the first calculation.
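
One simple way to cut redundant work, offered here as an assumption and not as the approach the package takes, is to compute each distance once per pair of distinct variants and weight it by the read counts. `weighted_mean_distance` is a hypothetical name, and the input is assumed to be a named integer vector of counts whose names are aligned variants of equal length.

```r
# counts: named integer vector; names are aligned, equal-length variants.
weighted_mean_distance <- function(counts) {
  chars <- do.call(rbind, strsplit(names(counts), ""))
  w <- as.numeric(counts)   # avoid integer overflow in the products below
  k <- nrow(chars)
  total <- 0                # summed distance over all pairs of reads
  for (i in seq_len(k - 1L)) {
    for (j in (i + 1L):k) {
      d <- sum(chars[i, ] != chars[j, ])   # computed once per variant pair
      total <- total + d * w[i] * w[j]
    }
  }
  n <- sum(w)
  total / (n * (n - 1) / 2)  # pairs of identical reads contribute zero
}
```

The cost then scales with the number of distinct variants squared instead of the number of reads squared. Reusing partial results between near-identical variants, as the issue suggests, would be a further step beyond this sketch.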

Write README

Make examples

place public or fake data in folder with example scripts for making plots

Let user turn off title and/or legend

Tick marks should span the height of the band

Code documentation and cleanup

Represent deletions in master sequence

specify master by index

Let user pass named vector to highline() for individual plot titles

Currently displays filenames
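
A possible helper for this, shown as a sketch: take titles from the names of the vector passed in and fall back to the filename where no name is given. `plot_titles` is a hypothetical function and is not part of the current `highline()` interface.

```r
# Use names(files) as plot titles where given, else fall back to filenames.
plot_titles <- function(files) {
  titles <- names(files)
  if (is.null(titles)) titles <- rep("", length(files))
  missing_title <- is.na(titles) | titles == ""
  titles[missing_title] <- basename(files[missing_title])
  titles
}
```

For instance, `plot_titles(c(Patient1 = "data/a.fasta", "data/b.fasta"))` would title the first plot "Patient1" and the second "b.fasta".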

Regex length limit in gsub

I ran into the following error when trying to process a dataset of HIV-1 env sequences:

Error in gsub(master, paste(master, "(m)"), seqs) : 
  invalid regular expression 'ATGAGAGTGATGGGGATCAAGACGAATTGTCAGCACTCAGAAAAATTGTGGGTCACAGTCTATTATGGGGTACCTGTGTGGAAAGAAGCAACTACTACTCTATTTTGTGCATCAGATGCTAAAGCATATGACACAGAGATGCATAATGTTTGGGCCACACATGCCTGCGTACCCACAGACCCTAACCCACAAGAAATGCAATTAGTGACAGAAAATTTTAACATGTGGAAAAATGACATGGTAGAACAAATGCATGAGGATATAATCAGTTTATGGGATCAAAGCCTAAAGCCATGTGTAAAATTAACCCCACTCTGTGTTATTTTAAATTGCGAAATAAAAAACTGCTCTTTCAATGTCTATGCACTCTTTAATACAATAGATGTGATACCAATATATATGTTGACAAATTGCAATACCTCAGTCATTACACAGGCCTGTCCAAAGGTATCCTTCGAACCAATTCCCATACATTATTGTACCCCTGCTGGTTTTGCGATTCTAAAGTGTAAGGATGAGAAGTTCAATGGAACAGGTCCATGTAAAAATGTCAGCTCAGTACAATGTACACATGGAATTAGGCCAGTGGTGTCAACTCAACTGTTGTTAAATGGCAGTCTAGCAGAAAAAGAGGTAGTAATTAGATCTGAAAATTTCACCAACAATGCTAAAACCATAATAGTACAGCTGAAGGACGCTGTAAACATTACTTGTATGAGACCCGGCAACAATACAAGAAAAAGTATAACTATAGGACCAGGGAGTGCATTTTATACATCAATAATAGGAGATATAAGACGAGCACATTGTAATGTTAGTGCAACAAAATGGAACAAGACTTTACATCAGGTAGTTGAAAAACTAAGAAAATTTAATCAATCCGGAGGGGACCCAGAAATTACAATGCATACCTTTAATTGTGGAGGGGAATTTTTCTATTGTAATACAACAAAACTGTTTAA

This issue has come up before, where the solution was to set the option perl=TRUE.

I've made this fix in the plot.Data function and it resolves this issue.
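
For reference, a small reproduction of the fix on a synthetic sequence. The second call uses `fixed = TRUE`, which is not the fix described above but an alternative to consider, since the master sequence is meant to be matched literally and not as a pattern.

```r
master <- strrep("ACGT", 400)   # 1,600-character synthetic sequence
seqs <- c(master, "ACGA")

# perl = TRUE: the fix described above (pattern is still a regex).
labelled_perl <- gsub(master, paste(master, "(m)"), seqs, perl = TRUE)

# fixed = TRUE: alternative that treats the pattern as a literal string.
labelled_fixed <- gsub(master, paste(master, "(m)"), seqs, fixed = TRUE)
```

With `fixed = TRUE`, characters such as `.` or `*` in a sequence would also be matched literally instead of being read as regex operators.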

Analyze HIV data

Implement variant count/compression

Using R environment objects

Cannot plot files with single variant

" Error in order(simil) : argument 1 is not a vector"
The sequences cannot be ordered because there is nothing to compare the single variant against.
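
A guard for this case might look like the following sketch. `order_by_similarity` is a hypothetical stand-in for the ordering step inside the plot function; it assumes aligned variants of equal length and returns early when the master is the only variant.

```r
# Rank variants by the number of sites they share with the master sequence.
order_by_similarity <- function(variants, master) {
  others <- setdiff(variants, master)
  if (length(others) == 0L) return(master)   # single variant: nothing to order
  m <- strsplit(master, "")[[1]]
  simil <- vapply(
    strsplit(others, ""),
    function(s) sum(s == m),
    integer(1)
  )
  c(master, others[order(simil, decreasing = TRUE)])
}
```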

Write README

description
installation (prerequisites)
usage example

Write plot function

The plot function should have the following features:

allow user to specify a reference sequence
calculate differences between each sequence and the reference
draw a figure illustrating these results

And eventually

distinguish among different types of nucleotide differences (e.g., nonsynonymous/synonymous)
layout multiple data sets on the same plot

Let's try using ggplot for this.
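
As a starting point, the comparison against a reference could produce a long data frame with one row per difference, which a ggplot layer could then draw. This is a sketch under the assumption that all sequences are aligned and of equal length; `diff_to_reference` and the column names are illustrative.

```r
# Positions where each sequence differs from a user-specified reference.
diff_to_reference <- function(seqs, reference = seqs[1]) {
  ref <- strsplit(reference, "")[[1]]
  rows <- lapply(seq_along(seqs), function(i) {
    s <- strsplit(seqs[i], "")[[1]]
    pos <- which(s != ref)
    data.frame(
      seq_index = rep(i, length(pos)),  # row of the plot
      position  = pos,                  # alignment column
      ref       = ref[pos],
      alt       = s[pos],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}
```

Each row could then be drawn as a tick at `position` on the band for `seq_index`, coloured by `alt`.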

Warning messages:
1: In for (n in impnames) if (!is.null(genImp <- impenv[[n]])) { :
  closing unused connection 5 (/home/art/git/highlineR/tests/testthat/test_data/valid/fastq/test.fq)
2: In for (n in impnames) if (!is.null(genImp <- impenv[[n]])) { :
  closing unused connection 4 (/home/art/git/highlineR/tests/testthat/test_data/valid/fasta/test.fa)
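
These warnings usually indicate that a file connection was opened but never closed, so it was cleaned up later by the garbage collector. A common pattern to avoid this is to register `close()` with `on.exit()` right after opening the connection, as in this sketch. `read_first_header` is a hypothetical function used only to show the pattern.

```r
# Read the first line of a file and always release the connection.
read_first_header <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con), add = TRUE)   # runs on normal return and on error
  readLines(con, n = 1L)
}
```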

transitions/transversions (nucleotide)
nonsynonymous/synonymous (requires reading frame)
others?
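
The first category can be decided from the two bases alone. A sketch, assuming unambiguous nucleotides A, C, G and T; anything else is returned as `NA`. `classify_substitution` is a hypothetical helper. The nonsynonymous/synonymous distinction is not covered because it needs the reading frame.

```r
# Classify each ref/alt pair as a transition or a transversion.
classify_substitution <- function(ref, alt) {
  purines <- c("A", "G")
  pyrimidines <- c("C", "T")
  ref <- toupper(ref)
  alt <- toupper(alt)
  valid <- ref %in% c(purines, pyrimidines) & alt %in% c(purines, pyrimidines)
  same_class <- (ref %in% purines & alt %in% purines) |
    (ref %in% pyrimidines & alt %in% pyrimidines)
  ifelse(!valid, NA_character_,
    ifelse(ref == alt, "identical",
      ifelse(same_class, "transition", "transversion")))
}
```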

missing leg

> plot(ses)
Error in get_legend(plots) : object 'leg' not found
