xinglab / isocirc
License: GNU General Public License v3.0
I'm using isoCirc from your lab, and it worked on your test data. But when I downloaded your data from EBI (https://www.ebi.ac.uk/ena/browser/view/PRJNA594380?show=reads) and ran isoCirc with the following command (I downloaded the data and put it straight into the pipeline without any other steps):
ref=/home/fengyige/data/star/hg19/hg19.fa
gtf=/home/fengyige/data/star/hg19/gencode.v21.annotation.gtf
circ=/home/fengyige/data/circRNA/hsa_hg19_circRNA.bed
input=/home/fengyige/20220826_circrna/fastq/SRR10612056_1.fastq.gz
isocirc -t 20
It did generate isocirc.bed and isocirc.out, but it didn't generate isocirc_stats.out, and the *.out file only contains the header line. I wonder whether the data downloaded from EBI needs some pre-processing, or whether I missed some critical steps?
Thanks for your help!
I tried to run isocirc with the test data. It worked great! However, when I tried to run isocirc with my own data, it did not generate isocirc.out, isocirc_stats.out or isocirc.bed. I downloaded the genome FASTA file from Ensembl (http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz) and the GTF file also from Ensembl (http://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/Homo_sapiens.GRCh38.104.gtf.gz). The circRNA BED file was downloaded from http://circatlas.biols.ac.cn/.
The output directory contains the following files:
cons.fa cons.fa.sam high.bam Homo_sapiens.GRCh38.104.gtf.gene_pred TotalRNAonly.fa.len
cons.fa.fai cons.info Homo_sapiens.GRCh38.104.gtf.bed low.bam trf.out
Thanks for your help!
Hello,
I faced this kind of problem:
== 16:16:34-Apr-14-2021 == [gtf2gene] gtf2gene nanopore_circ/isocirc.bed.exon.gtf AT.gtf nanopore_circ/isocirc.bed.ovlp.gene.out
Traceback (most recent call last):
File "/opt/exp_soft/local/generic/python/3.9.2/bin/isocirc", line 219, in <module>
main()
File "/opt/exp_soft/local/generic/python/3.9.2/bin/isocirc", line 216, in main
isocirc_core(args)
File "/opt/exp_soft/local/generic/python/3.9.2/bin/isocirc", line 132, in isocirc_core
hf.hcBSJ_fullIso(high_bam, low_bam, long_len_fn, cons_info, cons_fa,
File "/opt/exp_soft/local/generic/python/3.9.2/lib/python3.9/site-packages/isocirc/hcBSJ_fullIso.py", line 826, in hcBSJ_fullIso
itst_out_dict = intersect_with_bed(out_dir, circRNA_bed, all_anno, all_anno_bed, itst_anno_dict, flank_len, bedtools)
File "/opt/exp_soft/local/generic/python/3.9.2/lib/python3.9/site-packages/isocirc/hcBSJ_fullIso.py", line 414, in intersect_with_bed
get_ovlp_gene_name_id(ovlp_gene_name_id, gene_id_dict, gene_name_dict, gene_strand_dict)
File "/opt/exp_soft/local/generic/python/3.9.2/lib/python3.9/site-packages/isocirc/hcBSJ_fullIso.py", line 214, in get_ovlp_gene_name_id
strand_dict[ele[0]] = ele[3] if strand_dict[ele[0]] == 'NA' else strand_dict[ele[0]] + ',' + ele[3]
IndexError: list index out of range
I assume that this is a problem with the GTF file (maybe the chromosome format?). Could you please tell me how to handle it? Also, can I resume the pipeline from this point without repeating the previous steps if they completed correctly?
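The failing line indexes `ele[3]`, so at least one record reaching `get_ovlp_gene_name_id` has fewer than four fields. Whether a chromosome naming mismatch is the cause is not confirmed by the traceback, but the hypothesis is cheap to test. The following is a minimal sketch, not part of isoCirc, that compares the sequence names in a FASTA file with those in the first column of a GTF file. The function names and the toy inputs are assumptions made for illustration.

```python
import io


def fasta_seq_names(handle):
    """Collect sequence names (first token after '>') from a FASTA handle."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_seq_names(handle):
    """Collect sequence names (first tab-separated column) from a GTF handle."""
    names = set()
    for line in handle:
        # Skip comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def compare_names(fasta_handle, gtf_handle):
    """Report which sequence names are shared and which appear in one file only."""
    fa = fasta_seq_names(fasta_handle)
    gtf = gtf_seq_names(gtf_handle)
    return {"shared": fa & gtf, "only_fasta": fa - gtf, "only_gtf": gtf - fa}


if __name__ == "__main__":
    # Toy inputs (hypothetical): the FASTA uses 'chr1', the GTF uses '1'.
    fasta = io.StringIO(">chr1 toy\nACGT\n")
    gtf = io.StringIO('#!comment\n1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1";\n')
    print(compare_names(fasta, gtf))
```

With real files, the two handles would come from `open(...)` on the genome FASTA and the annotation GTF. An empty `shared` set would point to a naming mismatch such as `chr1` versus `1`.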
I am getting the following error when running my data with the human genome, the GTF file, and the circBase BED file.
== 10:33:39-Nov-06-2021 == [Mapping] Mapping consensus sequence to genome done!
== 10:33:39-Nov-06-2021 == [Classifying] Classifying consensus alignment ...
== 10:33:39-Nov-06-2021 == [classify_bam_core] Processing ./cons.fa.sam ...
Traceback (most recent call last):
File "/home/user/anaconda3/bin/isocirc", line 219, in <module>
main()
File "/home/user/anaconda3/bin/isocirc", line 216, in main
isocirc_core(args)
File "/home/user/anaconda3/bin/isocirc", line 117, in isocirc_core
bc.bam_classify(cons_all_sam, high_bam, low_bam, args.high_max_ratio, args.high_min_ratio, args.high_iden_ratio, args.high_repeat_ratio, args.low_repeat_ratio)
File "/home/user/anaconda3/lib/python3.8/site-packages/isocirc/bam_classify.py", line 168, in bam_classify
with ps.AlignmentFile(in_bam_fn) as in_bam, ps.AlignmentFile(high_bam_fn, 'wb', template=in_bam) as high_bam, \
File "pysam/libcalignmentfile.pyx", line 742, in pysam.libcalignmentfile.AlignmentFile.__cinit__
File "pysam/libcalignmentfile.pyx", line 991, in pysam.libcalignmentfile.AlignmentFile._open
ValueError: file has no sequences defined (mode='r') - is it SAM/BAM format? Consider opening with check_sq=False
I am using the latest version of isoCirc. Any idea on how I can resolve the issue?
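The `ValueError` says the file pysam opened has no reference sequences defined, which typically means the SAM file has no `@SQ` header lines (for example, because it is empty or the mapping step wrote no header). The following is a minimal sketch, not part of isoCirc, for inspecting a SAM file before it is handed to pysam. The function name and the toy input are assumptions made for illustration.

```python
import io


def summarize_sam(handle):
    """Count @SQ header lines and alignment records in a SAM text handle."""
    sq_lines = 0
    records = 0
    for line in handle:
        if line.startswith("@"):
            # Header section: only @SQ lines define reference sequences.
            if line.startswith("@SQ\t"):
                sq_lines += 1
        elif line.strip():
            records += 1
    return sq_lines, records


if __name__ == "__main__":
    # Toy SAM content (hypothetical): one reference, one unmapped record.
    sam = io.StringIO(
        "@HD\tVN:1.6\n"
        "@SQ\tSN:chr16\tLN:200000\n"
        "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
    )
    sq, n = summarize_sam(sam)
    print("@SQ lines:", sq, "records:", n)
```

Running this on `./cons.fa.sam` from the log above would show whether the file has any `@SQ` lines. If it has none, the mapping step is the place to look rather than the classification step.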
Hi there,
I've been able to run isoCirc with isocirc -t 1 $PATH/FAQ07459_pass_a4f2108a.fastq $PATH/all-chrs.fa $PATH/hg38_ref_all.gtf $PATH/annotation.bed $PATH/isocirc_output
and get the intended results. However, when I tried to add the --short-read parameter to the run command, it seems that either the run time is really long or I'm not running it correctly.
I have two sets of paired-end short-read sequencing data, and here's the command I used:
isocirc -t 1 --short-read $PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq $PATH/FAQ07459_pass_a4f2108a.fastq $PATH/all-chrs.fa $PATH/hg38_ref_all.gtf $PATH/annotation.bed $PATH/isocirc_outputP_short_read
When I look into the log of the run, it shows this (paths edited and similar lines omitted for simplicity):
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-f2h0a7nc because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
== 00:47:41-Sep-07-2021 == [check_dependencies] Checking dependencies ...
== 00:47:41-Sep-07-2021 == [check_dependencies] Checking dependencies done!
== 00:47:41-Sep-07-2021 == [Error-correction] Hybrid error correction using $PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq ...
== 00:47:41-Sep-07-2021 == [LoRDEC] lordec-correct -2 $PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq -i $PATH/FAQ07459_pass_a4f2108a.fastq -o $PATH/isocirc_output_short_read/long_corrected.fa -k 21 -s 3 -T 1
-2
$PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq
-i
$PATH/FAQ07459_pass_a4f2108a.fastq
-o
$PATH/long_corrected.fa
-k
21
-s
3
-T
1
illumina: $PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq $PATH/HCT116_Illumina_TruSeq_R1.fastq_multi_k21_s3.h5 pacbioFile: $PATH/FAQ07459_pass_a4f2108a.fastq
kmer_len: 21 solid_kmer_thr: 3
max_trials: 5 max_error_rate: 0.4 max_branch: 200
abundance_max: 2147483647
Cannot access the graph file for reference reads: $PATH/HCT116_Illumina_TruSeq_R1.fastq_multi_k21_s3.h5
bRefGraph: 0
bRefSeq: 1
creating the graph from file(s): $PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq
[DSK: counting kmers ] 0 % elapsed: 0 min 0 sec remaining: 0 min 0 sec cpu: -1.0 % mem: [ 28, 28, 76] MB
[DSK: Pass 1/1, Step 1: partitioning ] 0 % elapsed: 0 min 0 sec remaining: 0 min 0 sec cpu: -1.0 % mem: [ 46, 46, 76] MB
[DSK: Pass 1/1, Step 1: partitioning ] 1 % elapsed: 0 min 20 sec remaining: 32 min 17 sec cpu: 99.2 % mem: [ 494, 494, 494] MB
[DSK: Pass 1/1, Step 1: partitioning ] 2 % elapsed: 0 min 39 sec remaining: 32 min 8 sec cpu: 99.4 % mem: [ 575, 575, 575] MB
[DSK: Pass 1/1, Step 1: partitioning ] 3 % elapsed: 0 min 59 sec remaining: 31 min 47 sec cpu: 99.3 % mem: [ 575, 575, 575] MB
...
[DSK: Pass 1/1, Step 2: counting kmers ] 50.4 % elapsed: 16 min 47 sec remaining: 16 min 30 sec cpu: 99.2 % mem: [ 96, 608, 608] MB
[DSK: Pass 1/1, Step 2: counting kmers ] 53.5 % elapsed: 17 min 56 sec remaining: 15 min 36 sec cpu: 99.2 % mem: [4298, 4298, 4328] MB
[DSK: Pass 1/1, Step 2: counting kmers ] 53.5 % elapsed: 17 min 56 sec remaining: 15 min 36 sec cpu: 99.2 % mem: [4298, 4298, 4328] MB
...
[DSK: nb solid kmers found : 156051682 ] 101 % elapsed: 36 min 56 sec remaining: 0 min 0 sec cpu: 99.4 % mem: [1378, 5928, 5960] MB
[Building BooPHF] 0.1 % elapsed: 0 min 0 sec remaining: 4 min 34 sec
[Building BooPHF] 0.2 % elapsed: 0 min 0 sec remaining: 3 min 32 sec
[Building BooPHF] 0.3 % elapsed: 0 min 1 sec remaining: 3 min 57 sec
...
[MPHF: populate ] 0 % elapsed: 0 min 0 sec remaining: 0 min 0 sec cpu: -1.0 % mem: [1583, 1583, 5960] MB
[MPHF: populate ] 2 % elapsed: 0 min 1 sec remaining: 0 min 56 sec cpu: 99.1 % mem: [1583, 1583, 5960] MB
[MPHF: populate ] 3 % elapsed: 0 min 2 sec remaining: 0 min 55 sec cpu: 100.0 % mem: [1583, 1583, 5960] MB
...
[Bloom: read solid kmers ] 0 % elapsed: 0 min 0 sec remaining: 0 min 0 sec cpu: -1.0 % mem: [1910, 1910, 5960] MB
[Bloom: read solid kmers ] 2 % elapsed: 0 min 1 sec remaining: 1 min 12 sec cpu: 100.0 % mem: [1910, 1910, 5960] MB
[Bloom: read solid kmers ] 3 % elapsed: 0 min 2 sec remaining: 1 min 6 sec cpu: 100.0 % mem: [1910, 1910, 5960] MB
...
[Debloom: finalization ] 0 % elapsed: 0 min 0 sec remaining: 0 min 0 sec cpu: -1.0 % mem: [2225, 2225, 5960] MB
[Debloom: finalization ] 2 % elapsed: 0 min 0 sec remaining: 0 min 23 sec cpu: 100.0 % mem: [2273, 2273, 5960] MB
[Debloom: finalization ] 3 % elapsed: 0 min 1 sec remaining: 0 min 22 sec cpu: 98.6 % mem: [2298, 2298, 5960] MB
...
[Debloom: save ] 0 % elapsed: 0 min 0 sec remaining: 0 min 0 sec cpu: -1.0 % mem: [2241, 2241, 5960] MB
[Debloom: save ] 2 % elapsed: 0 min 1 sec remaining: 0 min 58 sec cpu: 99.2 % mem: [2241, 2241, 5960] MB
[Debloom: save ] 3 % elapsed: 0 min 2 sec remaining: 0 min 57 sec cpu: 99.4 % mem: [2241, 2241, 5960] MB
...
[Graph: build branching nodes ] 0 % elapsed: 0 min 0 sec remaining: 0 min 0 sec cpu: -1.0 % mem: [1980, 1980, 5960] MB
[Graph: build branching nodes ] 2 % elapsed: 0 min 8 sec remaining: 6 min 56 sec cpu: 99.8 % mem: [1980, 1980, 5960] MB
[Graph: build branching nodes ] 3 % elapsed: 0 min 13 sec remaining: 6 min 51 sec cpu: 99.8 % mem: [1980, 1980, 5960] MB
...
[Graph: nb branching found : 28171957 ] 100 % elapsed: 7 min 11 sec remaining: 0 min 0 sec cpu: 99.8 % mem: [2410, 2410, 5960] MB
!!! file present : $PATH/TruSeq_R1.fastq_multi_k21_s3.h5
graph created
It seems to me that only one of the short-read files was used, and no more lines have been printed to the output file. The job still shows as running, but I don't see it proceeding to the next step (finding TRFs).
I also have a question. In this line: illumina: $PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq $PATH/HCT116_Illumina_TruSeq_R1.fastq_multi_k21_s3.h5 pacbioFile: $PATH/FAQ07459_pass_a4f2108a.fastq kmer_len: 21 solid_kmer_thr: 3
it looks like my long-read data is being treated as PacBio-generated. Mine is actually Nanopore. Is there anywhere I can specify that?
Really appreciate your help! Please advise me on what I should do next.
Hi Yan and Yi,
I was using isoCirc to run the toy example in "test_data". It went through the pipeline and generated all the output files. However, I noticed several warning messages complaining that some program could not retrieve index files for high.bam and low.bam (shown below).
[M::mm_idx_gen::0.033*0.27] collected minimizers
[M::mm_idx_gen::0.040*0.39] sorted minimizers
[M::main::0.040*0.39] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.040*0.40] mid_occ = 34
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.041*0.41] distinct minimizers: 43525 (97.01% are singletons); average occurrences: 1.104; average spacing: 4.161; total length: 200000
[M::worker_pipeline::0.229*0.89] mapped 44 sequences
[M::main] Version: 2.17-r974-dirty
[M::main] CMD: minimap2 -ax splice -ub --MD --eqx -t 1 chr16_toy.fa output/cons.fa
[M::main] Real time: 0.231 sec; CPU: 0.207 sec; Peak RSS: 0.033 GB
[E::idx_find_and_load] Could not retrieve index file for 'output/high.bam'
[E::idx_find_and_load] Could not retrieve index file for 'output/high.bam'
[E::idx_find_and_load] Could not retrieve index file for 'output/high.bam'
[E::idx_find_and_load] Could not retrieve index file for 'output/high.bam'
[E::idx_find_and_load] Could not retrieve index file for 'output/high.bam'
[E::idx_find_and_load] Could not retrieve index file for 'output/low.bam'
[E::idx_find_and_load] Could not retrieve index file for 'output/high.bam'
== 12:58:37-May-24-2021 == [check_dependencies] Checking dependencies ...
== 12:58:37-May-24-2021 == [check_dependencies] Checking dependencies done!
== 12:58:37-May-24-2021 == [Tandem-Repeats-Finder] Finding tandem repeats with TRF ...
I tried to run my own datasets and got the same warning messages. I was wondering where these warning messages come from. Do they affect the output files?
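Messages of the form `[E::idx_find_and_load] Could not retrieve index file for ...` are printed by htslib when a BAM file is opened and no index file is found alongside it. Whether they affect isoCirc's results is not something the log shows. The following is a minimal sketch, not part of isoCirc, that checks which index files exist next to a BAM file. The function name and the set of suffixes checked are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Index file suffixes looked for next to the BAM (assumed for this sketch).
INDEX_SUFFIXES = (".bai", ".csi")


def find_bam_indexes(bam_path):
    """Return the index files that exist next to the given BAM path."""
    bam = Path(bam_path)
    # e.g. high.bam.bai and high.bam.csi
    candidates = [Path(str(bam) + suffix) for suffix in INDEX_SUFFIXES]
    # e.g. high.bai (extension replaced rather than appended)
    candidates.append(bam.with_suffix(".bai"))
    return [p for p in candidates if p.exists()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bam = Path(tmp) / "high.bam"
        bam.touch()  # empty placeholder, not a real BAM
        print("before:", find_bam_indexes(bam))
        Path(str(bam) + ".bai").touch()
        print("after:", [p.name for p in find_bam_indexes(bam)])
```

An empty result for `output/high.bam` would be consistent with the messages above. An index can normally be built with `samtools index` once the BAM file is coordinate-sorted.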
Thanks,
Qiongyi
== 11:48:56-Jan-11-2021 == [itst_intergenic] bedtools intersect -v -a output_chr1_control/isocirc.bed.exon.gtf -b /Drive4/nanopore_2nd_experiment/isocirc_nanopore/output_chr1_control/Homo_sapiens.GRCh38.96.gtf.gene.bed > output_chr1_control/isocirc.bed.intergenic.out
== 11:48:56-Jan-11-2021 == [itst_exon] bedtools intersect -a output_chr1_control/isocirc.bed.exon.gtf -b /Drive4/nanopore_2nd_experiment/isocirc_nanopore/output_chr1_control/Homo_sapiens.GRCh38.96.gtf.exon.gtf -wa -wb > output_chr1_control/isocirc.bed.exon.out
== 11:48:59-Jan-11-2021 == [output_isoform_eval] Writing isoform-wise evaluation result to file ...
== 11:48:59-Jan-11-2021 == [output_isoform_eval] Writing isoform-wise evaluation result to file done!
[E::idx_find_and_load] Could not retrieve index file for 'output_chr1_control/high.bam'
Traceback (most recent call last):
File "/home/aclab/.local/bin/isocirc", line 219, in <module>
main()
File "/home/aclab/.local/bin/isocirc", line 216, in main
isocirc_core(args)
File "/home/aclab/.local/bin/isocirc", line 135, in isocirc_core
isoform_out, bed_out, stats_out)
File "/home/aclab/.local/lib/python3.6/site-packages/isocirc/hcBSJ_fullIso.py", line 829, in hcBSJ_fullIso
bs.stats_core(long_len, cons_info, high_bam, isoform_out_fn, all_bsj_stats_dict, stats_out_fn)
File "/home/aclab/.local/lib/python3.6/site-packages/isocirc/basic_stats.py", line 123, in stats_core
tot_map_read_n, tot_map_cons_n, tot_map_cons_base, error_rate = get_error_rate(cons_bam)
File "/home/aclab/.local/lib/python3.6/site-packages/isocirc/basic_stats.py", line 37, in get_error_rate
return tot_mapped_read_n, tot_mapped_cons_n, tot_mapped_base, '{0:.1f}%'.format((tot_ins+tot_del+tot_mis) / (tot_ins+tot_mis+tot_match+0.0) * 100)
ZeroDivisionError: float division by zero
Could you please tell me how to resolve this issue? Thanks in advance.
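The division in `get_error_rate` fails when `tot_ins + tot_mis + tot_match` is zero, which would be consistent with no aligned bases being counted from high.bam, although the log does not confirm that. The following is a minimal sketch of a guarded version of the same formula. It is not the authors' fix; the function name and the `'NA'` fallback are assumptions made for illustration.

```python
def format_error_rate(tot_ins, tot_del, tot_mis, tot_match):
    """Format the error rate as a percentage, or 'NA' when nothing was counted.

    The numerator and denominator follow the expression shown in the traceback.
    """
    denominator = tot_ins + tot_mis + tot_match
    if denominator == 0:
        # No aligned bases were counted, so the rate is undefined.
        return "NA"
    errors = tot_ins + tot_del + tot_mis
    return "{0:.1f}%".format(errors / denominator * 100)


if __name__ == "__main__":
    print(format_error_rate(0, 0, 0, 0))   # NA
    print(format_error_rate(5, 0, 5, 90))  # 10.0%
```

A guard like this only avoids the crash. If the denominator really is zero, the more useful question is why high.bam yielded no counted bases for this input.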
Is it possible to run isoCirc without the BED file of known circRNAs?
Regards,
Kasia