nki-ccb / imfusion
Tool for identifying transposon insertions and their effects from RNA-seq data.
License: MIT License
I have three questions for using IM-fusion:
Best,
Hi,
I've been trying out this package, I think it's a very nice idea!
I've run into the following issue: by default, STAR limits the memory available for genome generation to about 30 GB. If it passes that threshold, it exits with the message:
Apr 01 14:43:41 ..... Started STAR run
Apr 01 14:43:41 ... Starting to generate Genome files
EXITING because of FATAL ERROR: limitGenomeGenerateRAM=31000000000 is too small for your genome
SOLUTION: specify limitGenomeGenerateRAM not less than 35000000000 and make RAM available
However, if I run imfusion with --limitGenomeGenerateRAM, this option causes an error in the Python parser and is not passed on to STAR.
As a workaround, I added this option to /usr/lib64/python3.5/site-packages/imfusion/external/star.py:
--- star.py.backup 2018-05-10 12:04:31.000000000 -0000
+++ star.py 2018-05-09 13:51:38.000000000 -0000
@@ -33,7 +33,8 @@ def star_index(fasta_path,
args = [
'STAR', '--runMode', 'genomeGenerate', '--genomeDir', str(output_dir),
'--genomeFastaFiles', str(fasta_path), '--sjdbGTFfile', str(gtf_path),
- '--sjdbOverhang', str(overhang), '--runThreadN', str(threads)
+ '--sjdbOverhang', str(overhang), '--runThreadN', str(threads),
+ '--limitGenomeGenerateRAM', '50000000000'
]
run_command(args=args, log_path=log_path)
It would be nice if imfusion could instead pass additional command-line parameters through to STAR or TopHat.
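One way such a pass-through could look is sketched below. This is only an illustration under stated assumptions: the function name build_star_index_args and the extra_args parameter are hypothetical and are not part of imfusion; only the STAR flags shown in the diff above are taken from the report.

```python
import shlex


def build_star_index_args(fasta_path, gtf_path, output_dir,
                          overhang=100, threads=1, extra_args=""):
    """Build the STAR genomeGenerate command, appending user-supplied flags.

    Hypothetical helper: `extra_args` is a single string such as
    '--limitGenomeGenerateRAM 50000000000' that is split shell-style
    and appended unchanged to the fixed arguments.
    """
    args = [
        'STAR', '--runMode', 'genomeGenerate', '--genomeDir', str(output_dir),
        '--genomeFastaFiles', str(fasta_path), '--sjdbGTFfile', str(gtf_path),
        '--sjdbOverhang', str(overhang), '--runThreadN', str(threads),
    ]
    # shlex.split handles quoting the same way a POSIX shell would.
    args += shlex.split(extra_args)
    return args


# Example: forward the memory limit without hard-coding it in star.py.
star_args = build_star_index_args(
    'reference.fa', 'reference.gtf', 'star_index',
    extra_args='--limitGenomeGenerateRAM 50000000000')
```

Taking the extra flags as one quoted string keeps the wrapper's own argument parser from having to know every STAR option.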
Hi there,
I am building the augmented reference for identifying transposition activity in a bacterial genome using RNA-seq data. I have the locations and sequences of the transposons in the host genome (name, start, end, strand, length), but I am not sure whether SA and SD features are available.
Here is the error message I got:
[2019-07-07 22:44:04] Copying transposon files
Traceback (most recent call last):
File "/home/hcao/miniconda2/envs/imfusion/bin/imfusion-build", line 11, in <module>
load_entry_point('imfusion==0.3.2', 'console_scripts', 'imfusion-build')()
File "/home/hcao/miniconda2/envs/imfusion/lib/python3.6/site-packages/imfusion/main/build.py", line 35, in main
output_dir=args.output_dir)
File "/home/hcao/miniconda2/envs/imfusion/lib/python3.6/site-packages/imfusion/build/indexers/base.py", line 161, in build
build_util.check_feature_file(transposon_features_path)
File "/home/hcao/miniconda2/envs/imfusion/lib/python3.6/site-packages/imfusion/build/util.py", line 239, in check_feature_file
raise ValueError('No valid features (with type = SD or SA)'
ValueError: No valid features (with type = SD or SA)were found in the transposon feature file.
Is there any way to build the reference without the SA or SD type information?
Thanks,
HC
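The traceback shows that check_feature_file rejects a feature file in which no row has type SA or SD. A rough sketch of that kind of check follows, assuming a tab-separated file with a header row containing a type column. The column layout, the function name count_valid_features and the example row are assumptions for illustration, not imfusion's actual code or data.

```python
import csv
import io

# Feature types named in the error message above.
VALID_TYPES = {"SA", "SD"}


def count_valid_features(handle):
    """Count rows whose 'type' column is SA or SD (assumed file layout)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return sum(
        1 for row in reader
        if (row.get("type") or "").strip() in VALID_TYPES
    )


# Hypothetical feature file with coordinates only and no splice sites.
features_without_splice_sites = io.StringIO(
    "name\tstart\tend\tstrand\ttype\n"
    "IS_element\t1\t1200\t1\tother\n"
)
n_valid = count_valid_features(features_without_splice_sites)
if n_valid == 0:
    print("No SA/SD features: the build step would reject this file.")
```

Running a check like this before imfusion-build makes it clear whether the feature file, rather than the reference, is what triggers the error.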
Hi,
The documentation for imfusion says:
"It is important to blacklist genes or genomic sequences that are part of the transposon sequence using the optional --blacklist_genes and --blacklist_regions arguments. The former can be used to blacklist entire genes (specified by their ID in the GTF file), whilst the latter can be used to blacklist specific regions (specified as chr:start-end). Sequences of blacklisted regions are replaced by 'N' nucleotides in the generated reference."
My regions are listed in a BED file, and I get this error:
ValueError: Unable to parse region 'bl_input.bed'. Ensure that region specifications are in the required format (chromosome:start-end).
The entries in the file follow the required format, as shown below, yet I get this error every time.
chr6:57490374-57490593
chr17:54381437-54381678
chr20:37824173-37827680
chr20:37828801-37829052
chr20:37831379-37834859
chr14:78008337-78008574
chr14:78011440-78011654
chr14:78023152-78023395
chr14:78027211-78027452
chr14:78029936-78030180
chr14:78030535-78030755
chr14:78037975-78039034
Kindly help.
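The error quotes 'bl_input.bed' itself as the region, which suggests that the file name was parsed as a region string rather than opened as a file. A minimal sketch of a workaround follows, assuming --blacklist_regions expects the region strings directly on the command line and accepts several of them (an inference from the error message and from the multi-value --blacklist_genes usage elsewhere on this page, not confirmed behaviour). The helper names are hypothetical.

```python
import re

# chromosome:start-end, e.g. chr6:57490374-57490593
REGION_RE = re.compile(r'^(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)$')


def parse_region(text):
    """Split 'chr:start-end' into (chrom, start, end) or raise ValueError."""
    match = REGION_RE.match(text.strip())
    if match is None:
        raise ValueError("Unable to parse region {!r}".format(text))
    return match.group('chrom'), int(match.group('start')), int(match.group('end'))


def blacklist_args_from_lines(lines):
    """Validate each non-empty line and build the command-line fragment."""
    regions = [line.strip() for line in lines if line.strip()]
    for region in regions:
        parse_region(region)  # fail early on malformed entries
    return ['--blacklist_regions'] + regions


# In practice the lines would come from open('bl_input.bed').
example_lines = ["chr6:57490374-57490593\n", "chr17:54381437-54381678\n"]
blacklist_args = blacklist_args_from_lines(example_lines)
```

The resulting list can be appended to the imfusion-build command, so that each region is passed as its own argument instead of the file path.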
Hello,
While testing this tool I ran into two issues. I can make a pull request for the first one, and I can add a fix for the second one too, but for that I am unsure what behaviour is desired.
The first bug produces the following message:
[2018-08-28 12:06:03] Copying transposon files
[2018-08-28 12:06:03] Building indexed reference gtf
[2018-08-28 12:06:54] Building flattened exon gtf
Traceback (most recent call last):
File "xxx/miniconda3/bin/imfusion-build", line 11, in <module>
load_entry_point('imfusion==0.3.2', 'console_scripts', 'imfusion-build')()
File "xxx/miniconda3/lib/python3.5/site-packages/imfusion/main/build.py", line 35, in main
output_dir=args.output_dir)
File "xxx/miniconda3/lib/python3.5/site-packages/imfusion/build/indexers/base.py", line 178, in build
gtf_frame_flat = tabix.flatten_gtf_frame(gtf_frame)
File "xxx/miniconda3/lib/python3.5/site-packages/imfusion/util/tabix.py", line 239, in flatten_gtf_frame
gtf_frame = _gtf_frame_from_exon_array(exons)
File "xxx/miniconda3/lib/python3.5/site-packages/imfusion/util/tabix.py", line 279, in _gtf_frame_from_exon_array
attrs = flat_exons.groupby('gene_id')['gene_id'].transform(_attribute_strs)
File "xxx/miniconda3/lib/python3.5/site-packages/pandas/core/groupby/groupby.py", line 3663, in transform
result = concat(results).sort_index()
File "xxx/miniconda3/lib/python3.5/site-packages/pandas/core/reshape/concat.py", line 225, in concat
copy=copy, sort=sort)
File "xxx/miniconda3/lib/python3.5/site-packages/pandas/core/reshape/concat.py", line 259, in __init__
raise ValueError('No objects to concatenate')
ValueError: No objects to concatenate
Changing the regex in imfusion/util/tabix.py on line 227 from r'gene_id "(\w+)"' to 'gene_id \"(\w+).+\"' and on line 229 from r'transcript_id "(\w+)"' to 'transcript_id \"(\w+).+\"' fixes this issue.
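The following sketch shows why the original pattern finds nothing on GENCODE-style versioned IDs: \w does not match the dot, so the closing quote is never reached. The VERSION_AWARE pattern is an alternative written for illustration (an assumption, not the project's code and not the exact fix proposed above); it captures the ID and treats a numeric version suffix as optional.

```python
import re

# Pattern as quoted from imfusion/util/tabix.py in the report above.
ORIGINAL = re.compile(r'gene_id "(\w+)"')

# Illustrative alternative: optional '.<digits>' version after the ID.
VERSION_AWARE = re.compile(r'gene_id "(\w+)(?:\.\d+)?"')

versioned = 'gene_id "ENSMUSG00000039095.8"; gene_type "protein_coding";'
unversioned = 'gene_id "ENSMUSG00000039095"; gene_type "protein_coding";'

# The original pattern only matches the unversioned attribute string.
original_on_versioned = ORIGINAL.search(versioned)
original_on_unversioned = ORIGINAL.search(unversioned).group(1)

# The version-aware pattern yields the same bare ID for both.
aware_on_versioned = VERSION_AWARE.search(versioned).group(1)
aware_on_unversioned = VERSION_AWARE.search(unversioned).group(1)
```

When every row fails to match, the flattened exon frame has no gene IDs to group on, which is consistent with the empty-concatenation error in the traceback.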
The second bug (or an error in the documentation, I am unsure what is desired here) arises when attempting to blacklist genes. Genes that carry a version number, e.g. ENSMUSG00000039095(.8), are not found in the GTF reference by imfusion. The example given at http://nki-ccb.github.io/imfusion/usage.html uses the gene ID without the .8 part. When executing the command, the following error is reported:
imfusion-build star --reference_seq xxx/GRCm38.primary_assembly.genome.fa --reference_gtf xxx/gencode.vM18.annotation.gtf --transposon_seq /xxx/packages/imfusion-0.3.2/data/t2onc/t2onc2.sequence.fa --transposon_features xxx/imfusion-0.3.2/data/t2onc/t2onc2.features.txt --output_dir xxx/testingIMF --blacklist_genes ENSMUSG00000039095 ENSMUSG00000038402 --star_threads 4
[2018-08-28 12:15:33] Copying transposon files
[2018-08-28 12:15:33] Building indexed reference gtf
[2018-08-28 12:16:23] Building flattened exon gtf
[2018-08-28 12:17:20] Building augmented reference
> xxx/miniconda3/lib/python3.5/site-packages/imfusion/build/util.py(210)regions_from_genes()
Traceback (most recent call last):
File "xxx/miniconda3/bin/imfusion-build", line 11, in <module>
load_entry_point('imfusion==0.3.2', 'console_scripts', 'imfusion-build')()
File "xxx/miniconda3/lib/python3.5/site-packages/imfusion/main/build.py", line 35, in main
output_dir=args.output_dir)
File "xxx/miniconda3/lib/python3.5/site-packages/imfusion/build/indexers/base.py", line 190, in build
reference.indexed_gtf_path))
File "xxx/miniconda3/lib/python3.5/site-packages/imfusion/build/util.py", line 210, in regions_from_genes
raise ValueError(
ValueError: Gene 'ENSMUSG00000039095' not found in reference gtf. Make sure that the correct gene ID is being used (corresponding with the gene_id annotation in the GTF file).
Grepping for it directly in the file results in:
grep 'ENSMUSG00000039095' xxx/testingIMF/reference.gtf
chr5 HAVANA gene 28165694 28172166 . + . gene_id "ENSMUSG00000039095.8"; gene_type "protein_coding"; gene_name "En2"; level 2; havana_gene "OTTMUSG00000037277.2";
chr5 HAVANA transcript 28165694 28172166 . + .gene_id "ENSMUSG00000039095.8"; transcript_id "ENSMUST00000036177.8"; gene_type "protein_coding"; gene_name "En2"; transcript_type "protein_coding"; transcript_name "RP23-358F24.1-001"; level 2; protein_id "ENSMUSP00000036761.7"; transcript_support_level "1"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS19143.1"; havana_gene "OTTMUSG00000037277.2"; havana_transcript "OTTMUST00000096171.2";
...
With further testing I figured out that the blacklisted gene ID is tested via the in statement (line 108 in tabix.py: if rec['gene_id'] in sought_ids:). Hence, if the desired behaviour is to supply genes without the version suffix, I suggest changing the line to if rec['gene_id'].split('.')[0] in sought_ids:. If, on the other hand, the version suffix is needed, then changing the documentation would be helpful.
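The suggested change can be sketched in isolation as follows. The record dictionaries and the function names are hypothetical stand-ins for whatever tabix.py iterates over; the only part taken from the report is the split('.')[0] comparison. Stripping the version on both sides is an extra assumption, so that IDs match whether or not the user supplies the suffix.

```python
def strip_version(gene_id):
    """Drop everything from the first '.' onwards."""
    return gene_id.split('.')[0]


def select_records(records, sought_ids):
    """Keep records whose gene_id matches a sought ID, ignoring versions."""
    sought = {strip_version(gene_id) for gene_id in sought_ids}
    return [rec for rec in records if strip_version(rec['gene_id']) in sought]


# Hypothetical records mimicking parsed GTF rows.
records = [
    {'gene_id': 'ENSMUSG00000039095.8', 'gene_name': 'En2'},
    {'gene_id': 'ENSMUSG00000000001.4', 'gene_name': 'other'},
]
hits = select_records(records, ['ENSMUSG00000039095'])
```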
Best regards,
Olga
Hi,
I'm using IM-fusion to build my genome. I installed the package and its dependencies and prepared all the files I need. When I submit my job, I get this error. Would you please let me know what is wrong with my file?
Best,
Niusha
[2020-11-17 08:30:34] Copying transposon files
[2020-11-17 08:30:35] Building indexed reference gtf
[2020-11-17 08:31:48] Building flattened exon gtf
Traceback (most recent call last):
File "/scratch/user/ENV/bin/imfusion-build", line 8, in <module>
sys.exit(main())
File "/scratch/user/ENV/lib/python3.8/site-packages/imfusion/main/build.py", line 28, in main
indexer.build(
File "/scratch/user/ENV/lib/python3.8/site-packages/imfusion/build/indexers/base.py", line 178, in build
gtf_frame_flat = tabix.flatten_gtf_frame(gtf_frame)
File "/scratch/user/ENV/lib/python3.8/site-packages/imfusion/util/tabix.py", line 239, in flatten_gtf_frame
gtf_frame = _gtf_frame_from_exon_array(exons)
File "/scratch/user/ENV/lib/python3.8/site-packages/imfusion/util/tabix.py", line 279, in _gtf_frame_from_exon_array
attrs = flat_exons.groupby('gene_id')['gene_id'].transform(_attribute_strs)
File "/scratch/user/ENV/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 493, in transform
return self._transform_general(
File "/scratch/user/ENV/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 562, in _transform_general
result.index = self._selected_obj.index
File "/scratch/user/ENV/lib/python3.8/site-packages/pandas/core/generic.py", line 5152, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/_libs/properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
File "/scratch/user/ENV/lib/python3.8/site-packages/pandas/core/series.py", line 424, in _set_axis
self._mgr.set_axis(axis, labels)
File "/scratch/user/ENV/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 226, in set_axis
raise ValueError(
ValueError: Length mismatch: Expected axis has 0 elements, new values have 335509 elements