mdcao / japsa Goto Github PK

View Code? Open in Web Editor NEW

22.0 22.0 14.0 64.08 MB

Just Another JAva Package for Sequence Analysis

License: BSD 3-Clause "New" or "Revised" License

Java 99.62% Makefile 0.19% Roff 0.17% Shell 0.01%

japsa's People

Contributors

Stargazers

Watchers

Forkers

hsnguyen bhuvansankar biocodings allenday devika1 arnoldbaino fw1121 obsh xlinxlin esteinig lachlancoin

japsa's Issues

XM data corruption

XM compressor still has data corruption issue. Compressing some input and decompressing it back produces corrupted output. I.e., decompressed data is different from original file.

Test data size: 30,244 bytes
Test data link: http://kirill.med.u-tokai.ac.jp/data/temp/xm-repro-4-input.zip

Commands to reproduce:

Compress:
jsa.xm.compress --hashSize=11 --context=15 --limit=200 --threshold=0.15 --chance=20 --real=archive.xm original.fasta

Decompress:
jsa.xm.compress --hashSize=11 --context=15 --limit=200 --threshold=0.15 --chance=20 --decode=archive.xm --output=decompressed.fasta

Compare:
cmp original.fasta decompressed.fasta

Produces: original.fasta decompressed.fasta differ: byte 27512, line 274

The decompressed file has correct size, but corrupted sequence data. It was found during testing for Sequence Compression Benchmark.

Let me know if you need any additional information or help.

option illen doesn't work in capsim

In capsim, option illen doesn't work and I go through the code and find it is not used at anywhere except option parsing. It should be used somewhere in simulation to setup the read length of illumina read length.

Duplicated reads id issue

Hi,
There is potential issue in SimulateCaptureCmd.java. The read name/id might be duplicated. It is easy to fix by appending the numGen variable at the end of the name like this:
seq.setName(ID + "_" + chrList.get(chrIndex).getName() + "_" + (chrPos + 1) + "_" +(chrPos + fragLength) + "_" + Long.toString(numGen));

Feature request - use floats for base-call quality averages for jsa.np.filter params

Hi again,

Integers are used for Phred base-call quality values for individual bases because they can easily be ASCII-encoded. However, when average values are taken over a read for base-call qualities, the use of floats has advantages, given the coarseness of the Phred scale. Currently, the '--qualMin' and '--qualMax' params of jsa.np.filter (which by definition concern average base-call quals) accept integer values, but the use of floats for these params and the related filtering calculations would allow e.g. for the creation of finer-grained error-profiles for a read dataset.

Thanks for your consideration of this request!

Resistance profiling Exception in thread "SSS" java.lang.NullPointerException

Getting an error when trying to run jsa.np.rtResistGenes.

Example script that is being used to run is

jsa.np.npreader --folder=$1 --output=- | \                                      
     bwa mem -t12 -x ont2d -K10000 ${2}/${DB_NAME} - 2> /dev/null | \                                                                                                                                                                                                                                                                                                  
     jsa.np.rtResistGenes --output=$3 --bamFile=- --resDB=$2 --time=180 --tmp=tmp/resTest

This same npReader/bwa mem combination works fine with jsa.np.rtSpeciesTyping.

There are no errors coming from bwa mem or npReader. The resistance profiler seems to start fine but returns an error. Stack trace provided:

[main] INFO japsa.tools.bio.np.NanoporeReaderCmd - Start reading
[main] INFO japsa.seq.nanopore.NanoporeReaderStream - Start reading /DataOnline/Data/Nanopore/GN_003_R9_2D_20160928/reads/downloads/pass/
[main] INFO japsa.seq.nanopore.NanoporeReaderStream - Start reading  1495696853857
[main] INFO japsa.bio.np.RealtimeResistanceGene - geneList = 280
[main] INFO japsa.bio.np.RealtimeResistanceGene - geneMap = 280
[main] INFO japsa.bio.np.RealtimeResistanceGene - gene2Group = 609
[main] INFO japsa.bio.np.RealtimeResistanceGene - gene2GeneName = 609
[main] INFO japsa.bio.np.RealtimeResistanceGene - Resistance identification ready at Thu May 25 07:20:53 UTC 2017
[SSS] INFO japsa.bio.np.RealtimeAnalysis - Start analysing data at Thu May 25 07:20:55 UTC 2017
Exception in thread "SSS" java.lang.NullPointerException
        at japsa.bio.np.ErrorCorrection.consensusSequence(ErrorCorrection.java:94)
        at japsa.bio.np.RealtimeResistanceGene$ResistanceGeneFinder.antiBioticsProfile(RealtimeResistanceGene.java:261)
        at japsa.bio.np.RealtimeResistanceGene$ResistanceGeneFinder.antiBioticAnalysis(RealtimeResistanceGene.java:233)
        at japsa.bio.np.RealtimeResistanceGene$ResistanceGeneFinder.analysis(RealtimeResistanceGene.java:439)
        at japsa.bio.np.RealtimeAnalysis.run(RealtimeAnalysis.java:114)
        at java.lang.Thread.run(Thread.java:745)
[main] INFO japsa.seq.nanopore.NanoporeReaderStream - EXITING
[main] INFO japsa.seq.nanopore.NanoporeReaderStream - npReader closing
[main] INFO japsa.seq.nanopore.NanoporeReaderStream - npReader closed
[main] INFO japsa.bio.np.RealtimeAnalysis - All reads received at Thu May 25 08:07:09 UTC 2017
[main] INFO japsa.bio.np.RealtimeResistanceGene - END : Thu May 25 08:07:09 UTC 2017

Example of how I ran the above script is:

./japsaSTtest.sh /DataOnline/Data/Nanopore/GN_003_R9_2D_20160928/reads/downloads/pass/ /DataOnline/Data/Bacterial_Genome/Eskape/ResistanceGenes/resFinder/ - 2> error.log

introducing new variable starting with word 'read' triggers ambiguity in japsa.util.CommandLine

To avoid this when playing around with development in Japsa, ensure new argument names are completely different from existing argument names. eg. '--read' conflicts with '--readList' as the CommandLine module detects ambiguity between the two.

VNTR.xaf

Where can I find this file?

Installation failed

Hi, I got the following error message during the installation process. The installation command I used is: sudo make install INSTALL_DIR=~/.usr/local MXMEM=7000m SERVER=true JLP=false

src/main/java/japsa/util/deploy/Deploy.java:55: error: cannot find symbol
import japsa.tools.bio.np.FlankSeqsDetectorCmd;
                         ^
  symbol:   class FlankSeqsDetectorCmd
  location: package japsa.tools.bio.np
src/main/java/japsa/util/deploy/Deploy.java:184: error: cannot find symbol
		tools.add(new FlankSeqsDetectorCmd());
		              ^
  symbol:   class FlankSeqsDetectorCmd
  location: class Deploy
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
2 errors
Makefile:88: recipe for target 'target/main-classes/./japsa/bio/misc/alignCompress/AlignCompress.class' failed
make: *** [target/main-classes/./japsa/bio/misc/alignCompress/AlignCompress.class] Error 1

java.lang.OutOfMemoryError: Requested array size exceeds VM limit

XM fails to compress large files (3GB) with the following message:

Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3236)
at japsa.seq.FastaReader.nextSequence(FastaReader.java:159)
at japsa.tools.bio.xm.ExpertModelCmd.main(ExpertModelCmd.java:130)

identity rate in jsa.hts.errorAnalysis

We are using japsa tool to evaluate the accuracy of the basecallers.
We would like to know why identity rate in japsa is defined as 1 - (mismatch + insertion)?

Kind regards

Arda

npreader

Hi!

I want to run npreader on 1Dsquared reads. Do you know if that's also possible?
When I look at the scripts it looks like only the 2D, template and complement are used, but I think 1Dsquared will be stored under a different name in the FAST5 file.

Looking forward to your answer, thanks!

Formula for estimating changes in repeat units

I am trying to understand your code as part of my personal learning exercise, and I noticed that the formula for estimating changes in repeat units is different from what you wrote in Nucl. Acids Res. (2014) 42 (3): e16. Could you please enlighten me on the rationale of such changes?

In line 632 of japsa/src/main/java/japsa/bio/tr/Fragment2TRV.java, you wrote:

double eMean = ((meanP / varP) - (counts[i] * trv.getPeriod() * (mean - gMean) / gVar)) * eVar;

Whereas the original formula was commented out in line 388:

//double eMean = ((meanP / varP) - (counts[i] * (mean - gMean) / gVar)) * eVar;

Looking forward to your reply soon, thanks!

Adjusting prior using external indel predictions and confidence calculations

It's me again :) I am personally very interested in your work and STR variations in general, so I hope you don't mind me asking more about the details of your algorithm.

In your paper, you suggested that indels from tools like Samtools, Dindel etc can be used to adjust the prior distribution of repeat length. I might have overlooked but it seems like you are only adjusting fragment length instead of adjusting the prior.

//line 535-550 of japsa/src/main/java/japsa/bio/tr/Fragment2TRV.java        
                if (indel !=null && refIndexIndel == fragment.getReferenceIndex()){
                    Indel eIndel = indel;
                    int eIdx = indelIdx;
                    while (eIndel.chr.equals(indel.chr) && eIndel.start < fragment.getEnd()){
                        fragment.iSize += eIndel.length;
                        eIdx ++;
                        if (eIdx < indelList.size()) 
                            eIndel = indelList.get(eIdx);
                        else 
                            break;
                    }
                }


                double v = fragment.getISize();

If I have a list of indels predicted from GATK haplotypecaller, and I want to use that to adjust the prior of STRviper calls, how could I best achieve that?

My second question is related to the confidence calculation. I noticed that quality score of the STRviper VCF are low in general (most are below 20), so I want to learn how that was calculated in the paper but to no avail. I looked at the source code and found the relevant code in line 643 of japsa/src/main/java/japsa/bio/tr/Fragment2TRV.java.

trv.setConfidence(d.cumulativeProbability(trv.getVar() - 0.5, trv.getVar() + 0.5));

I don't quite understand why +- 0.5 was chosen as the threshold to calculate the confidence. Could you point me to the rationale behind that? Thank you very much!

Cheers,
Allen

Species binning of fastq

It would be good to have a module which uses the output species to read data to split the input fastq file into separate fastq bins.

VNTRTyper ArrayIndexOutOfBoundsException when handling BAM records with >65535 CIGAR operations

When handling BAM records with >65535 CIGAR operations, VNTRTyper will throw an ArrayIndexOutOfBoundsException.

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 516984
at japsa.tools.bio.hts.VNTRLongReadsCmd.getReadPosition(VNTRLongReadsCmd.java:770)
at japsa.tools.bio.hts.VNTRLongReadsCmd.main(VNTRLongReadsCmd.java:288)

NullPointerException XM compression

I'm trying to compress a DNA sequence file and it throws a nullpointer. Any ideas why?

java -version
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)

$ jsa.xm.compress BuEb
The eXpert-Model (XM) for Compression of DNA Sequences V 3.0
  Minh Duc Cao, T. I. Dix, L. Allison, C. Mears.
  A simple statistical algorithm for biological sequence compression
  IEEE Data Compression Conf., Snowbird, Utah, 2007, [doi:10.1109/DCC.2007.7]

Parameters:
Hash size        : 11
Expert Limit     : 200
Context          : 15
Listen Threshold : 0.15bps
Chances          : 20
BinaryHash       : false
HashType         : Hashtable
Expert Type      : 
 #Reading file(s)
Exception in thread "main" java.lang.NullPointerException
	at japsa.tools.bio.xm.ExpertModelCmd.main(ExpertModelCmd.java:130)

It appears the program doesn't support sequence files? Only Fasta, Fastq.. is that correct?

Understanding of jsa.np.rtSpeciesTyping output

Hi,
We just tested jsa.np.rtSpeciesTyping tool with a sample data and produced a .tsv file. Can you help me to explain what is the meaning of the header (reads, bases, prob, err, tAligned, sAligned)? Thanks a lot!

Refactor `consensusSequence()`

japsa/src/main/java/japsa/bio/np/ErrorCorrection.java

Line 101 in 0f210c8

    
           public static Sequence consensusSequence(List<Sequence> readList, int max, String prefix, String msa) throws IOException, InterruptedException{

Please refactor this method, we need step 4 available in isolation.

It's a good idea while refactoring to isolate all 4 steps, then have a single function that calls down to all 4 in series.

Release v1.7-05b - jsa.np.npreader

Hi,
In my hands the indicated release's version of npreader is failing to yield any output (but no error message), albeit with admittedly old versions of Metrichor fast5 file inputs (2D-MAP-005 workflow). I can provide a small test dataset if necessary to reproduce the issue. Using Poretools is a still-functioning alternative so if you have moved on to focus on more recent file format versions then go ahead and close out this issue.

Thanks!

problem with A dynamic, reusable donut chart for D3.js v4

i followed your steps, but my donut chart is still static, how to make that dynamic? where can i find the new species.tsv through which the data is being changed? please help

jsa.np.rtResistGenes - tmp folder not automatically generated, and _ao.fasta not produced when tmp folder made in advance

Hi,
I'm having an issue running the resistance gene identifier when running it on pre-basecalled reads.
Without making the tmp directory in advance, I get a FileNotFoundException for tmp/resFind_JSA_4_2_ai.fasta
But when I do make the tmp directory in advance, it seems like the tmp/resFind_JSA_4_2_ao.fasta file isn't created and I get another FileNotFoundException when it tries to read in the msa output.

Command looks like this:

cat $fq | bwa mem -t 8 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -Y -K 10000 -a $rdb/DB.fasta $fq 2> /dev/null \ | jsa.np.rtResistGenes -bam - -score=0.0001 -time 300 -read 0 --resDB $rdb -tmp tmp/resFind -o resistance.dat -thread 7 2> resistance.log

logs are here:

without making tmp dir:
[main] INFO japsa.bio.np.RealtimeResistanceGene - geneList = 280 [main] INFO japsa.bio.np.RealtimeResistanceGene - geneMap = 280 [main] INFO japsa.bio.np.RealtimeResistanceGene - gene2Group = 609 [main] INFO japsa.bio.np.RealtimeResistanceGene - gene2GeneName = 609 [main] INFO japsa.bio.np.RealtimeResistanceGene - Resistance identification ready at Mon Mar 01 02:09:00 UTC 2021 [SSS] INFO japsa.bio.np.RealtimeAnalysis - Start analysing data at Mon Mar 01 02:09:00 UTC 2021 [SSS] INFO japsa.bio.np.RealtimeResistanceGene - ===Found 0 vs 280 0 with 0 [SSS] INFO japsa.bio.np.RealtimeAnalysis - RUNTIME Mon Mar 01 02:09:00 UTC 2021 0.001 0 0.085 [SSS] INFO japsa.bio.np.RealtimeAnalysis - Not due time, sleep for 299.885 seconds [main] INFO japsa.bio.np.RealtimeAnalysis - All reads received at Mon Mar 01 02:12:06 UTC 2021 [main] INFO japsa.bio.np.RealtimeResistanceGene - END : Mon Mar 01 02:12:06 UTC 2021 java.io.FileNotFoundException: tmp/resFind_JSA_4_2_ai.fasta (No such file or directory) at java.base/java.io.FileOutputStream.open0(Native Method) at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298) at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:237) at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:126) at japsa.seq.SequenceOutputStream.makeOutputStream(SequenceOutputStream.java:117) at japsa.bio.np.ErrorCorrection.writeAlignmentToFaiFile(ErrorCorrection.java:111) at japsa.bio.np.ErrorCorrection.consensusSequence(ErrorCorrection.java:297) at japsa.bio.np.ErrorCorrection.consensusSequence(ErrorCorrection.java:103) at japsa.bio.np.RealtimeResistanceGene$ResistanceGeneFinder.antiBioticsProfile(RealtimeResistanceGene.java:361) at japsa.bio.np.RealtimeResistanceGene$ResistanceGeneFinder.antiBioticAnalysis(RealtimeResistanceGene.java:335) at japsa.bio.np.RealtimeResistanceGene$ResistanceGeneFinder.analysis(RealtimeResistanceGene.java:463) at japsa.bio.np.RealtimeAnalysis.run(RealtimeAnalysis.java:116) at java.base/java.lang.Thread.run(Thread.java:834) [SSS] INFO japsa.bio.np.RealtimeAnalysis - RUNTIME Mon Mar 01 02:14:00 UTC 2021 300.009 307288 0.007 [SSS] INFO japsa.bio.np.RealtimeAnalysis - Real time analysis done

with making tmp dir:
[main] INFO japsa.bio.np.RealtimeResistanceGene - geneList = 280 [main] INFO japsa.bio.np.RealtimeResistanceGene - geneMap = 280 [main] INFO japsa.bio.np.RealtimeResistanceGene - gene2Group = 609 [main] INFO japsa.bio.np.RealtimeResistanceGene - gene2GeneName = 609 [main] INFO japsa.bio.np.RealtimeResistanceGene - Resistance identification ready at Mon Mar 01 01:02:55 UTC 2021 [SSS] INFO japsa.bio.np.RealtimeAnalysis - Start analysing data at Mon Mar 01 01:02:55 UTC 2021 [SSS] INFO japsa.bio.np.RealtimeResistanceGene - ===Found 0 vs 280 0 with 0 [SSS] INFO japsa.bio.np.RealtimeAnalysis - RUNTIME Mon Mar 01 01:02:55 UTC 2021 0.001 0 0.101 [SSS] INFO japsa.bio.np.RealtimeAnalysis - Not due time, sleep for 299.872 seconds [main] INFO japsa.bio.np.RealtimeAnalysis - All reads received at Mon Mar 01 01:06:02 UTC 2021 [main] INFO japsa.bio.np.RealtimeResistanceGene - END : Mon Mar 01 01:06:02 UTC 2021 [SSS] INFO japsa.bio.np.ErrorCorrection - 8ddcabfa-8fba-4831-8327-2b940bd5f813_r_890_1639 750 [SSS] INFO japsa.bio.np.ErrorCorrection - 36f3366a-35ac-4428-9561-f2ab368976df_803_1619 817 [SSS] INFO japsa.bio.np.ErrorCorrection - Running [kalign, -gpo, 60, -gpe, 10, -tgpe, 0, -bonus, 0, -q, -i, tmp/resFind_JSA_4_2_ai.fasta, -o, tmp/resFind_JSA_4_2_ao.fasta] [SSS] INFO japsa.bio.np.ErrorCorrection - Done [kalign, -gpo, 60, -gpe, 10, -tgpe, 0, -bonus, 0, -q, -i, tmp/resFind_JSA_4_2_ai.fasta, -o, tmp/resFind_JSA_4_2_ao.fasta] java.io.FileNotFoundException: tmp/resFind_JSA_4_2_ao.fasta (No such file or directory) at java.base/java.io.FileInputStream.open0(Native Method) at java.base/java.io.FileInputStream.open(FileInputStream.java:219) at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157) at java.base/java.io.FileInputStream.<init>(FileInputStream.java:112) at japsa.seq.SequenceReader.getReader(SequenceReader.java:299) at japsa.bio.np.ErrorCorrection.readMultipleAlignment(ErrorCorrection.java:235) at japsa.bio.np.ErrorCorrection.consensusSequence(ErrorCorrection.java:314) at japsa.bio.np.ErrorCorrection.consensusSequence(ErrorCorrection.java:103) at japsa.bio.np.RealtimeResistanceGene$ResistanceGeneFinder.antiBioticsProfile(RealtimeResistanceGene.java:361) at japsa.bio.np.RealtimeResistanceGene$ResistanceGeneFinder.antiBioticAnalysis(RealtimeResistanceGene.java:335) at japsa.bio.np.RealtimeResistanceGene$ResistanceGeneFinder.analysis(RealtimeResistanceGene.java:463) at japsa.bio.np.RealtimeAnalysis.run(RealtimeAnalysis.java:116) at java.base/java.lang.Thread.run(Thread.java:834) [SSS] INFO japsa.bio.np.RealtimeAnalysis - RUNTIME Mon Mar 01 01:07:55 UTC 2021 300.009 307288 0.436 [SSS] INFO japsa.bio.np.RealtimeAnalysis - Real time analysis done

BC01 empty

Hi all,

I am running jsa.np.barcode with an input of about 4 million reads. The -bc is fasta file containing ONT barcodes from thier barcoding kit:

BC01
AAGAAAGTTGTCGGTGTCTTTGTG
BC02
TCGATTCCGTTTGTAGTCGTCTGT
BC03
GAGTCTTGTGTCCCAGTTACCAGG
BC04
TTCGGATTCTATCGTGTTTCCCTA
BC05
CTTGTCCAGGGTTTGTGTAACCTT
BC06
TTCTCGCAAAGGCAGAAAGTAGTC
BC07
GTGTTACCGTGGGAATGAATCCTT
BC08
TTCAGGGAACAAACCAAGTTACGT
BC09
AACTAGGCACAGCGAGTCTTGGTT
BC10
AAGCGTTGAAACCTTTGTCCTCTC
BC11
GTTTCATCTATCGGAGGGAATGGA
BC12
CAGGTAGAAAGAAGCAGAATCGGA

The script is running fine, but I just noticed that BC01 is an empty file. I have already demultiplexed this data set with Metrichor, when it was active and I know that I have sequences with BC01.

Is there a reason this could happen?