xinglab / isocirc

isoCirc

License: GNU General Public License v3.0

Shell 2.86% Python 73.98% R 23.15%

isocirc's People

xjyx pycnopodiad

isocirc's Issues

Issues of not generating correct file by using isoCirc

I'm using isoCirc from your lab, and it worked on your test data. However, I then downloaded your data from EBI (https://www.ebi.ac.uk/ena/browser/view/PRJNA594380?show=reads) and ran isoCirc with the following commands, feeding the downloaded data straight into the pipeline without any other steps:

ref=/home/fengyige/data/star/hg19/hg19.fa

gtf=/home/fengyige/data/star/hg19/gencode.v21.annotation.gtf

circ=/home/fengyige/data/circRNA/hsa_hg19_circRNA.bed

input=/home/fengyige/20220826_circrna/fastq/SRR10612056_1.fastq.gz

isocirc -t 20 ${input} ${ref} ${gtf} ${circ} ../SRR10612056/

It generated isocirc.bed and isocirc.out but not isocirc_stats.out, and the *.out file contains only the header line. Does the data downloaded from EBI need some pre-processing, or did I miss a critical step?

Thanks for your help!

Issue of not generating out file

I tried to run isoCirc with the test data, and it worked great! However, when I ran it with my own data, it did not generate isocirc.out, isocirc_stats.out, or isocirc.bed. I downloaded the genome FASTA file from Ensembl (http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz) and the GTF file also from Ensembl (http://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/Homo_sapiens.GRCh38.104.gtf.gz). The circRNA BED file was downloaded from http://circatlas.biols.ac.cn/.

The output directory contains the following files:
cons.fa cons.fa.sam high.bam Homo_sapiens.GRCh38.104.gtf.gene_pred TotalRNAonly.fa.len
cons.fa.fai cons.info Homo_sapiens.GRCh38.104.gtf.bed low.bam trf.out

Thanks for your help!

annotation issue?

Hello,

I ran into the following error:

== 16:16:34-Apr-14-2021 == [gtf2gene] gtf2gene nanopore_circ/isocirc.bed.exon.gtf AT.gtf nanopore_circ/isocirc.bed.ovlp.gene.out
Traceback (most recent call last):
  File "/opt/exp_soft/local/generic/python/3.9.2/bin/isocirc", line 219, in <module>
    main()
  File "/opt/exp_soft/local/generic/python/3.9.2/bin/isocirc", line 216, in main
    isocirc_core(args)
  File "/opt/exp_soft/local/generic/python/3.9.2/bin/isocirc", line 132, in isocirc_core
    hf.hcBSJ_fullIso(high_bam, low_bam, long_len_fn, cons_info, cons_fa,
  File "/opt/exp_soft/local/generic/python/3.9.2/lib/python3.9/site-packages/isocirc/hcBSJ_fullIso.py", line 826, in hcBSJ_fullIso
    itst_out_dict = intersect_with_bed(out_dir, circRNA_bed, all_anno, all_anno_bed, itst_anno_dict, flank_len, bedtools)
  File "/opt/exp_soft/local/generic/python/3.9.2/lib/python3.9/site-packages/isocirc/hcBSJ_fullIso.py", line 414, in intersect_with_bed
    get_ovlp_gene_name_id(ovlp_gene_name_id, gene_id_dict, gene_name_dict, gene_strand_dict)
  File "/opt/exp_soft/local/generic/python/3.9.2/lib/python3.9/site-packages/isocirc/hcBSJ_fullIso.py", line 214, in get_ovlp_gene_name_id
    strand_dict[ele[0]] = ele[3] if strand_dict[ele[0]] == 'NA' else strand_dict[ele[0]] + ',' + ele[3]
IndexError: list index out of range

I assume this is a problem with the GTF file (maybe the chromosome format?). Could you please tell me how to handle it? Also, can I rerun the pipeline from this point without repeating the previous steps, if those completed correctly?
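The chromosome-format guess can be checked independently of isoCirc by comparing the sequence names in the genome FASTA with the first column of the GTF. The following is a minimal sketch, not part of isoCirc: the function names (`fasta_chroms`, `gtf_chroms`, `compare_chroms`) are hypothetical, and a mismatch found this way would not by itself confirm the cause of the `IndexError` above.

```python
def fasta_chroms(lines):
    """Collect sequence names from FASTA header lines (text after '>' up to the first whitespace)."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_chroms(lines):
    """Collect the first (seqname) column of every non-comment, non-empty GTF line."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.rstrip("\n").split("\t", 1)[0])
    return names


def compare_chroms(fasta_lines, gtf_lines):
    """Report which names are shared and which appear in only one of the two files."""
    fa = fasta_chroms(fasta_lines)
    gtf = gtf_chroms(gtf_lines)
    return {
        "shared": sorted(fa & gtf),
        "gtf_only": sorted(gtf - fa),
        "fasta_only": sorted(fa - gtf),
    }


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the real files (made-up demo data).
    demo_fasta = [">Chr1 some description\n", "ACGT\n", ">Chr2\n", "ACGT\n"]
    demo_gtf = [
        "#!comment line\n",
        '1\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n',
        'Chr2\tsrc\tgene\t1\t4\t.\t-\t.\tgene_id "g2";\n',
    ]
    print(compare_chroms(demo_fasta, demo_gtf))
```

For real files, the same functions accept open file handles, for example `compare_chroms(open("genome.fa"), open("annotation.gtf"))`, where both file names are placeholders. A non-empty `gtf_only` list would suggest that the two files use different naming schemes.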

ValueError: file has no sequences defined (mode='r')

I am getting the following error when running isoCirc on my data with the human genome, its GTF file, and a circBase BED file.

== 10:33:39-Nov-06-2021 == [Mapping] Mapping consensus sequence to genome done!
== 10:33:39-Nov-06-2021 == [Classifying] Classifying consensus alignment ...
== 10:33:39-Nov-06-2021 == [classify_bam_core] Processing ./cons.fa.sam ... 
Traceback (most recent call last):
  File "/home/user/anaconda3/bin/isocirc", line 219, in <module>
    main()
  File "/home/user/anaconda3/bin/isocirc", line 216, in main
    isocirc_core(args)
  File "/home/user/anaconda3/bin/isocirc", line 117, in isocirc_core
    bc.bam_classify(cons_all_sam, high_bam, low_bam, args.high_max_ratio, args.high_min_ratio, args.high_iden_ratio, args.high_repeat_ratio, args.low_repeat_ratio)
  File "/home/user/anaconda3/lib/python3.8/site-packages/isocirc/bam_classify.py", line 168, in bam_classify
    with ps.AlignmentFile(in_bam_fn) as in_bam, ps.AlignmentFile(high_bam_fn, 'wb', template=in_bam) as high_bam, \
  File "pysam/libcalignmentfile.pyx", line 742, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 991, in pysam.libcalignmentfile.AlignmentFile._open
ValueError: file has no sequences defined (mode='r') - is it SAM/BAM format? Consider opening with check_sq=False

I am using the latest version of isoCirc. Any idea on how I can resolve the issue?
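As far as I understand, pysam reports this error when the header of the file it opens lists no reference sequences, that is, no `@SQ` lines. A quick way to inspect `cons.fa.sam` without pysam is to count those header lines. This is a minimal sketch with a hypothetical helper name (`count_sq_lines`); it only diagnoses the header and does not explain why the mapping step might have produced an empty or header-less file.

```python
def count_sq_lines(sam_lines):
    """Count @SQ header lines, stopping at the first alignment record.

    SAM header lines start with '@'; read names in alignment records cannot.
    """
    n = 0
    for line in sam_lines:
        if not line.startswith("@"):
            break
        if line.startswith("@SQ\t"):
            n += 1
    return n


if __name__ == "__main__":
    # Made-up demo header; with a real file, pass open("cons.fa.sam") instead.
    demo = [
        "@HD\tVN:1.6\tSO:unsorted\n",
        "@SQ\tSN:chr1\tLN:1000\n",
        "@PG\tID:minimap2\n",
        "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
    ]
    print(count_sq_lines(demo))
```

A count of zero on the real file would be consistent with the error message and would point at the consensus or mapping step (for example, an empty `cons.fa`) rather than at the classification step.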

Short-read error correction feature

Hi there,

I've been able to run isoCirc with isocirc -t 1 $PATH/FAQ07459_pass_a4f2108a.fastq $PATH/all-chrs.fa $PATH/hg38_ref_all.gtf $PATH/annotation.bed $PATH/isocirc_output and get the intended results. However, when I added the --short-read parameter to the run command, either the run time is really long or I'm not running it correctly.

I have two sets of paired-end short-read sequencing data, and here's the command I used:

isocirc -t 1 --short-read $PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq $PATH/FAQ07459_pass_a4f2108a.fastq $PATH/all-chrs.fa $PATH/hg38_ref_all.gtf $PATH/annotation.bed $PATH/isocirc_outputP_short_read

When I look at the log of the run, it shows this (paths edited and similar lines omitted for brevity):

Matplotlib created a temporary config/cache directory at /tmp/matplotlib-f2h0a7nc because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
== 00:47:41-Sep-07-2021 == [check_dependencies] Checking dependencies ...
== 00:47:41-Sep-07-2021 == [check_dependencies] Checking dependencies done!
== 00:47:41-Sep-07-2021 == [Error-correction] Hybrid error correction using $PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq ...
== 00:47:41-Sep-07-2021 == [LoRDEC] lordec-correct -2 $PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq -i $PATH/FAQ07459_pass_a4f2108a.fastq -o $PATH/isocirc_output_short_read/long_corrected.fa -k 21 -s 3 -T 1
-2
$PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq
-i
$PATH/FAQ07459_pass_a4f2108a.fastq
-o
$PATH/long_corrected.fa
-k
21
-s
3
-T
1
illumina: $PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq $PATH/HCT116_Illumina_TruSeq_R1.fastq_multi_k21_s3.h5 pacbioFile: $PATH/FAQ07459_pass_a4f2108a.fastq
kmer_len: 21 solid_kmer_thr: 3
max_trials: 5 max_error_rate: 0.4 max_branch: 200
abundance_max: 2147483647
Cannot access the graph file for reference reads: $PATH/HCT116_Illumina_TruSeq_R1.fastq_multi_k21_s3.h5
bRefGraph: 0
bRefSeq: 1
creating the graph from file(s): $PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq

[DSK: counting kmers                     ]  0    %   elapsed:   0 min 0  sec   remaining:   0 min 0  sec   cpu:  -1.0 %   mem: [  28,   28,   76] MB 
[DSK: Pass 1/1, Step 1: partitioning     ]  0    %   elapsed:   0 min 0  sec   remaining:   0 min 0  sec   cpu:  -1.0 %   mem: [  46,   46,   76] MB 
[DSK: Pass 1/1, Step 1: partitioning     ]  1    %   elapsed:   0 min 20 sec   remaining:  32 min 17 sec   cpu:  99.2 %   mem: [ 494,  494,  494] MB 
[DSK: Pass 1/1, Step 1: partitioning     ]  2    %   elapsed:   0 min 39 sec   remaining:  32 min 8  sec   cpu:  99.4 %   mem: [ 575,  575,  575] MB 
[DSK: Pass 1/1, Step 1: partitioning     ]  3    %   elapsed:   0 min 59 sec   remaining:  31 min 47 sec   cpu:  99.3 %   mem: [ 575,  575,  575] MB 
...
[DSK: Pass 1/1, Step 2: counting kmers   ]  50.4 %   elapsed:  16 min 47 sec   remaining:  16 min 30 sec   cpu:  99.2 %   mem: [  96,  608,  608] MB 
[DSK: Pass 1/1, Step 2: counting kmers   ]  53.5 %   elapsed:  17 min 56 sec   remaining:  15 min 36 sec   cpu:  99.2 %   mem: [4298, 4298, 4328] MB 
[DSK: Pass 1/1, Step 2: counting kmers   ]  53.5 %   elapsed:  17 min 56 sec   remaining:  15 min 36 sec   cpu:  99.2 %   mem: [4298, 4298, 4328] MB 
...
[DSK: nb solid kmers found : 156051682   ]  101  %   elapsed:  36 min 56 sec   remaining:   0 min 0  sec   cpu:  99.4 %   mem: [1378, 5928, 5960] MB 

[Building BooPHF]  0.1  %   elapsed:   0 min 0  sec   remaining:   4 min 34 sec
[Building BooPHF]  0.2  %   elapsed:   0 min 0  sec   remaining:   3 min 32 sec
[Building BooPHF]  0.3  %   elapsed:   0 min 1  sec   remaining:   3 min 57 sec
...

[MPHF: populate                          ]  0    %   elapsed:   0 min 0  sec   remaining:   0 min 0  sec   cpu:  -1.0 %   mem: [1583, 1583, 5960] MB 
[MPHF: populate                          ]  2    %   elapsed:   0 min 1  sec   remaining:   0 min 56 sec   cpu:  99.1 %   mem: [1583, 1583, 5960] MB 
[MPHF: populate                          ]  3    %   elapsed:   0 min 2  sec   remaining:   0 min 55 sec   cpu: 100.0 %   mem: [1583, 1583, 5960] MB 
...

[Bloom: read solid kmers                 ]  0    %   elapsed:   0 min 0  sec   remaining:   0 min 0  sec   cpu:  -1.0 %   mem: [1910, 1910, 5960] MB 
[Bloom: read solid kmers                 ]  2    %   elapsed:   0 min 1  sec   remaining:   1 min 12 sec   cpu: 100.0 %   mem: [1910, 1910, 5960] MB 
[Bloom: read solid kmers                 ]  3    %   elapsed:   0 min 2  sec   remaining:   1 min 6  sec   cpu: 100.0 %   mem: [1910, 1910, 5960] MB 
...

[Debloom: finalization                   ]  0    %   elapsed:   0 min 0  sec   remaining:   0 min 0  sec   cpu:  -1.0 %   mem: [2225, 2225, 5960] MB 
[Debloom: finalization                   ]  2    %   elapsed:   0 min 0  sec   remaining:   0 min 23 sec   cpu: 100.0 %   mem: [2273, 2273, 5960] MB 
[Debloom: finalization                   ]  3    %   elapsed:   0 min 1  sec   remaining:   0 min 22 sec   cpu:  98.6 %   mem: [2298, 2298, 5960] MB 
...

[Debloom: save                           ]  0    %   elapsed:   0 min 0  sec   remaining:   0 min 0  sec   cpu:  -1.0 %   mem: [2241, 2241, 5960] MB 
[Debloom: save                           ]  2    %   elapsed:   0 min 1  sec   remaining:   0 min 58 sec   cpu:  99.2 %   mem: [2241, 2241, 5960] MB 
[Debloom: save                           ]  3    %   elapsed:   0 min 2  sec   remaining:   0 min 57 sec   cpu:  99.4 %   mem: [2241, 2241, 5960] MB 
...

[Graph: build branching nodes            ]  0    %   elapsed:   0 min 0  sec   remaining:   0 min 0  sec   cpu:  -1.0 %   mem: [1980, 1980, 5960] MB 
[Graph: build branching nodes            ]  2    %   elapsed:   0 min 8  sec   remaining:   6 min 56 sec   cpu:  99.8 %   mem: [1980, 1980, 5960] MB 
[Graph: build branching nodes            ]  3    %   elapsed:   0 min 13 sec   remaining:   6 min 51 sec   cpu:  99.8 %   mem: [1980, 1980, 5960] MB 
...

[Graph: nb branching found : 28171957    ]  100  %   elapsed:   7 min 11 sec   remaining:   0 min 0  sec   cpu:  99.8 %   mem: [2410, 2410, 5960] MB 
!!! file present : $PATH/TruSeq_R1.fastq_multi_k21_s3.h5
graph created

It seems to me that only one of the short-read files was used, and no more lines have been printed to the output file since. The job is still reported as running, but I don't see it proceeding to the next step (finding TRFs).

I also have a question. The line illumina: $PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq $PATH/HCT116_Illumina_TruSeq_R1.fastq_multi_k21_s3.h5 pacbioFile: $PATH/FAQ07459_pass_a4f2108a.fastq kmer_len: 21 solid_kmer_thr: 3 makes it look like my long-read data is being treated as PacBio-generated. Mine is actually Nanopore. Is there anywhere I can specify that?

Really appreciate your help! Please advise me on what I should do next.

[E::idx_find_and_load] Could not retrieve index file

Hi Yan and Yi,

I used isoCirc to run the toy example in "test_data". It went through the pipeline and generated all the output files. However, I noticed several warning messages complaining that some program could not retrieve index files for high.bam and low.bam (shown below).

[M::mm_idx_gen::0.033*0.27] collected minimizers
[M::mm_idx_gen::0.040*0.39] sorted minimizers
[M::main::0.040*0.39] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.040*0.40] mid_occ = 34
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.041*0.41] distinct minimizers: 43525 (97.01% are singletons); average occurrences: 1.104; average spacing: 4.161; total length: 200000
[M::worker_pipeline::0.229*0.89] mapped 44 sequences
[M::main] Version: 2.17-r974-dirty
[M::main] CMD: minimap2 -ax splice -ub --MD --eqx -t 1 chr16_toy.fa output/cons.fa
[M::main] Real time: 0.231 sec; CPU: 0.207 sec; Peak RSS: 0.033 GB
[E::idx_find_and_load] Could not retrieve index file for 'output/high.bam'
[E::idx_find_and_load] Could not retrieve index file for 'output/high.bam'
[E::idx_find_and_load] Could not retrieve index file for 'output/high.bam'
[E::idx_find_and_load] Could not retrieve index file for 'output/high.bam'
[E::idx_find_and_load] Could not retrieve index file for 'output/high.bam'
[E::idx_find_and_load] Could not retrieve index file for 'output/low.bam'
[E::idx_find_and_load] Could not retrieve index file for 'output/high.bam'
== 12:58:37-May-24-2021 == [check_dependencies] Checking dependencies ...
== 12:58:37-May-24-2021 == [check_dependencies] Checking dependencies done!
== 12:58:37-May-24-2021 == [Tandem-Repeats-Finder] Finding tandem repeats with TRF ...

I ran my own datasets and got the same warning messages. Where do these warning messages come from? Do they affect the output files?

Thanks,

Qiongyi

Error occurred during isocirc pipeline run

== 11:48:56-Jan-11-2021 == [itst_intergenic] bedtools intersect -v -a output_chr1_control/isocirc.bed.exon.gtf -b /Drive4/nanopore_2nd_experiment/isocirc_nanopore/output_chr1_control/Homo_sapiens.GRCh38.96.gtf.gene.bed > output_chr1_control/isocirc.bed.intergenic.out
== 11:48:56-Jan-11-2021 == [itst_exon] bedtools intersect -a output_chr1_control/isocirc.bed.exon.gtf -b /Drive4/nanopore_2nd_experiment/isocirc_nanopore/output_chr1_control/Homo_sapiens.GRCh38.96.gtf.exon.gtf -wa -wb > output_chr1_control/isocirc.bed.exon.out
== 11:48:59-Jan-11-2021 == [output_isoform_eval] Writing isoform-wise evaluation result to file ...
== 11:48:59-Jan-11-2021 == [output_isoform_eval] Writing isoform-wise evaluation result to file done!
[E::idx_find_and_load] Could not retrieve index file for 'output_chr1_control/high.bam'
Traceback (most recent call last):
  File "/home/aclab/.local/bin/isocirc", line 219, in <module>
    main()
  File "/home/aclab/.local/bin/isocirc", line 216, in main
    isocirc_core(args)
  File "/home/aclab/.local/bin/isocirc", line 135, in isocirc_core
    isoform_out, bed_out, stats_out)
  File "/home/aclab/.local/lib/python3.6/site-packages/isocirc/hcBSJ_fullIso.py", line 829, in hcBSJ_fullIso
    bs.stats_core(long_len, cons_info, high_bam, isoform_out_fn, all_bsj_stats_dict, stats_out_fn)
  File "/home/aclab/.local/lib/python3.6/site-packages/isocirc/basic_stats.py", line 123, in stats_core
    tot_map_read_n, tot_map_cons_n, tot_map_cons_base, error_rate = get_error_rate(cons_bam)
  File "/home/aclab/.local/lib/python3.6/site-packages/isocirc/basic_stats.py", line 37, in get_error_rate
    return tot_mapped_read_n, tot_mapped_cons_n, tot_mapped_base, '{0:.1f}%'.format((tot_ins+tot_del+tot_mis) / (tot_ins+tot_mis+tot_match+0.0) * 100)
ZeroDivisionError: float division by zero

Could you please tell me how to resolve this issue? Thanks in advance.
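Judging from the last frame of the traceback, the division fails when `tot_ins + tot_mis + tot_match` is zero, which would happen if no aligned bases were counted from high.bam (for instance, if it holds no high-confidence alignments for this chromosome). A guarded version of the same formula might look like the sketch below. It is not the authors' fix: the function name `format_error_rate` and the `"NA"` placeholder are assumptions made here for illustration.

```python
def format_error_rate(tot_ins, tot_del, tot_mis, tot_match):
    """Format the error rate as a percentage, using the formula shown in the traceback.

    Returns the placeholder "NA" (an assumption of this sketch) when no
    aligned bases were counted, instead of dividing by zero.
    """
    denom = tot_ins + tot_mis + tot_match
    if denom == 0:
        return "NA"
    return "{0:.1f}%".format((tot_ins + tot_del + tot_mis) / denom * 100)


if __name__ == "__main__":
    print(format_error_rate(0, 0, 0, 0))  # no aligned bases counted
    print(format_error_rate(1, 1, 1, 8))  # 3 errors over 10 counted bases
```

Even with such a guard, an empty high.bam would still mean that no circRNA isoforms could be reported for the run, so it is worth checking whether the input reads and the reference actually match.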

circRNA annotation

Is it possible to run isoCirc without a BED file of known circRNAs?

Regards,
Kasia
