nki-ccb / imfusion

Tool for identifying transposon insertions and their effects from RNA-seq data.

License: MIT License

Python 98.93% Shell 0.11% Makefile 0.97%

imfusion's Issues

Input data sets for IM-fusion

I have three questions about using IM-fusion:

Your article identifies transposon insertions from RNA-seq data using the Sleeping Beauty (SB) transposon model. I have several RNA-seq data sets from paired normal vs. tumor samples. Are these data sets suitable for insertion detection with IM-fusion?
Are any other transposable elements suitable for IM-fusion?
And how can I prepare the files needed for these?

Best,

Add option to set STAR memory limit

Hi,

I've been trying out this package, I think it's a very nice idea!

I've run into the following issue: by default, STAR limits the memory used for genome generation to roughly 30 GB. If the genome needs more than that threshold, STAR exits with the message:

Apr 01 14:43:41 ..... Started STAR run
Apr 01 14:43:41 ... Starting to generate Genome files

EXITING because of FATAL ERROR: limitGenomeGenerateRAM=31000000000 is too small for your genome
SOLUTION: specify limitGenomeGenerateRAM not less than 35000000000 and make RAM available

However, if I run imfusion with --limitGenomeGenerateRAM, the option causes an error in the Python argument parser and is not passed on to STAR.

As a workaround, I added this option to /usr/lib64/python3.5/site-packages/imfusion/external/star.py:

--- star.py.backup      2018-05-10 12:04:31.000000000 -0000
+++ star.py     2018-05-09 13:51:38.000000000 -0000
@@ -33,7 +33,8 @@ def star_index(fasta_path,
     args = [
         'STAR', '--runMode', 'genomeGenerate', '--genomeDir', str(output_dir),
         '--genomeFastaFiles', str(fasta_path), '--sjdbGTFfile', str(gtf_path),
-        '--sjdbOverhang', str(overhang), '--runThreadN', str(threads)
+        '--sjdbOverhang', str(overhang), '--runThreadN', str(threads),
+        '--limitGenomeGenerateRAM', '50000000000'
     ]

     run_command(args=args, log_path=log_path)

It would be nice if imfusion could instead pass additional command-line parameters through to STAR or TopHat.
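
One way such pass-through could work is sketched below. This is an illustration under stated assumptions, not imfusion's actual code: the helper build_star_index_args, its extra_args parameter, and the toy parser are hypothetical. Only the STAR flags already shown in the diff above come from the report. The idea is to let argparse collect the options it does not recognise and append them to the STAR command unchanged.

```python
import argparse


def build_star_index_args(fasta_path, gtf_path, output_dir,
                          overhang=100, threads=1, extra_args=None):
    """Assemble a STAR genomeGenerate command (hypothetical helper)."""
    args = [
        'STAR', '--runMode', 'genomeGenerate', '--genomeDir', str(output_dir),
        '--genomeFastaFiles', str(fasta_path), '--sjdbGTFfile', str(gtf_path),
        '--sjdbOverhang', str(overhang), '--runThreadN', str(threads),
    ]
    # Forward any user-supplied options to STAR without interpreting them.
    return args + list(extra_args or [])


# Toy parser (assumption): it only knows --output_dir; everything it does
# not recognise is returned separately instead of raising an error.
parser = argparse.ArgumentParser()
parser.add_argument('--output_dir', required=True)
known, unknown = parser.parse_known_args(
    ['--output_dir', 'ref', '--limitGenomeGenerateRAM', '50000000000'])

cmd = build_star_index_args('genome.fa', 'genes.gtf', known.output_dir,
                            extra_args=unknown)
print(' '.join(cmd))
```

With this design the wrapper never needs to know every STAR option; the trade-off is that typos in imfusion's own options would be forwarded to STAR rather than rejected early.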

SA or SD for building the reference

Hi there,
I am building the augmented reference to identify transposition activity in a bacterial genome from RNA-seq data. I have the locations and sequences of the transposons in the host genome (name, start, end, strand, length), but I am not sure what the SA and SD features are or whether they are available for my transposons.
Here is the error message I got:
[2019-07-07 22:44:04] Copying transposon files
Traceback (most recent call last):
  File "/home/hcao/miniconda2/envs/imfusion/bin/imfusion-build", line 11, in <module>
    load_entry_point('imfusion==0.3.2', 'console_scripts', 'imfusion-build')()
  File "/home/hcao/miniconda2/envs/imfusion/lib/python3.6/site-packages/imfusion/main/build.py", line 35, in main
    output_dir=args.output_dir)
  File "/home/hcao/miniconda2/envs/imfusion/lib/python3.6/site-packages/imfusion/build/indexers/base.py", line 161, in build
    build_util.check_feature_file(transposon_features_path)
  File "/home/hcao/miniconda2/envs/imfusion/lib/python3.6/site-packages/imfusion/build/util.py", line 239, in check_feature_file
    raise ValueError('No valid features (with type = SD or SA)'
ValueError: No valid features (with type = SD or SA)were found in the transposon feature file.

Is there any way to build the reference without the SA or SD type information?

Thanks,
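
The traceback shows that the error is raised by check_feature_file when no feature has type SD or SA. A minimal sketch of that kind of check follows, so a feature file can be inspected before running imfusion-build. The file layout is an assumption (tab-separated, with a header row containing a type column), the function name count_valid_features is hypothetical, and the example rows are made up.

```python
import csv
import io

VALID_TYPES = {'SA', 'SD'}


def count_valid_features(handle):
    """Count rows whose 'type' column is SA or SD (assumed file layout)."""
    reader = csv.DictReader(handle, delimiter='\t')
    return sum(1 for row in reader if row.get('type') in VALID_TYPES)


# Made-up example rows, not taken from a real transposon feature file.
example = io.StringIO(
    'name\tstart\tend\tstrand\ttype\n'
    'IS_element\t1\t1200\t1\t\n'
    'exampleSA\t300\t450\t1\tSA\n'
)
print(count_valid_features(example))
```

A count of zero would correspond to the situation reported in the traceback.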

Error while running blacklisted regions

Hi,

The documentation for imfusion says
"It is important to blacklist genes or genomic sequences that are part of the transposon sequence using the optional --blacklist_genes and --blacklist_regions arguments. The former can be used to blacklist entire genes (specified by their ID in the GTF file), whilst the latter can be used to blacklist specific regions (specified as chr:start-end). Sequences of blacklisted regions are replaced by ‘N’ nucleotides in the generated reference."

My regions are listed in a BED file, and I get this error:
ValueError: Unable to parse region 'bl_input.bed'. Ensure that region specifications are in the required format (chromosome:start-end).

The entries in the file follow the required format, as shown below, yet I get this error every time:
chr6:57490374-57490593
chr17:54381437-54381678
chr20:37824173-37827680
chr20:37828801-37829052
chr20:37831379-37834859
chr14:78008337-78008574
chr14:78011440-78011654
chr14:78023152-78023395
chr14:78027211-78027452
chr14:78029936-78030180
chr14:78030535-78030755
chr14:78037975-78039034

Kindly help.
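
The error message quotes 'bl_input.bed' as the region, which suggests that the file name itself was parsed as a region rather than the lines inside the file. The sketch below illustrates that reading; it is not imfusion's parser. The names parse_region and regions_from_lines are hypothetical, and the regular expression is an assumption based on the chromosome:start-end format named in the error.

```python
import re

# Assumed format: chromosome:start-end, e.g. chr6:57490374-57490593
REGION_RE = re.compile(r'^([^:\s]+):(\d+)-(\d+)$')


def parse_region(text):
    """Split 'chr:start-end' into (chromosome, start, end)."""
    match = REGION_RE.match(text.strip())
    if match is None:
        raise ValueError('Unable to parse region {!r}'.format(text))
    chrom, start, end = match.groups()
    return chrom, int(start), int(end)


def regions_from_lines(lines):
    """Return the non-empty lines of a region list, one region per entry."""
    return [line.strip() for line in lines if line.strip()]


lines = ['chr6:57490374-57490593\n', 'chr17:54381437-54381678\n', '\n']
regions = regions_from_lines(lines)
print([parse_region(region) for region in regions])

# A file name is not a region, so it fails in the same way as the report.
try:
    parse_region('bl_input.bed')
except ValueError as error:
    print(error)
```

If --blacklist_regions takes region strings directly, as the quoted documentation suggests, the entries from the file would have to be supplied as individual arguments rather than as a file path.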

Issues while building the reference

Hello,
while testing this tool I ran into two issues. I can make a pull request for the first one, and I can add a fix for the second as well, but for that one I am unsure what behaviour is desired.

The first bug produces the following message:

[2018-08-28 12:06:03] Copying transposon files
[2018-08-28 12:06:03] Building indexed reference gtf
[2018-08-28 12:06:54] Building flattened exon gtf
Traceback (most recent call last):
  File "xxx/miniconda3/bin/imfusion-build", line 11, in <module>
    load_entry_point('imfusion==0.3.2', 'console_scripts', 'imfusion-build')()
  File "xxx/miniconda3/lib/python3.5/site-packages/imfusion/main/build.py", line 35, in main
    output_dir=args.output_dir)
  File "xxx/miniconda3/lib/python3.5/site-packages/imfusion/build/indexers/base.py", line 178, in build
    gtf_frame_flat = tabix.flatten_gtf_frame(gtf_frame)
  File "xxx/miniconda3/lib/python3.5/site-packages/imfusion/util/tabix.py", line 239, in flatten_gtf_frame
    gtf_frame = _gtf_frame_from_exon_array(exons)
  File "xxx/miniconda3/lib/python3.5/site-packages/imfusion/util/tabix.py", line 279, in _gtf_frame_from_exon_array
    attrs = flat_exons.groupby('gene_id')['gene_id'].transform(_attribute_strs)
  File "xxx/miniconda3/lib/python3.5/site-packages/pandas/core/groupby/groupby.py", line 3663, in transform
    result = concat(results).sort_index()
  File "xxx/miniconda3/lib/python3.5/site-packages/pandas/core/reshape/concat.py", line 225, in concat
    copy=copy, sort=sort)
  File "xxx/miniconda3/lib/python3.5/site-packages/pandas/core/reshape/concat.py", line 259, in __init__
    raise ValueError('No objects to concatenate')
ValueError: No objects to concatenate

Changing the regex in imfusion/util/tabix.py on line 227 from r'gene_id "(\w+)"' to 'gene_id \"(\w+).+\"', and on line 229 from r'transcript_id "(\w+)"' to 'transcript_id \"(\w+).+\"', fixes this issue.
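
The effect of the two patterns can be checked in isolation. The sketch below uses only Python's re module and an attribute string shortened from the GENCODE GTF line quoted later in this issue. The third pattern, which captures everything between the quotes, is an alternative added here for comparison and is not part of the proposed fix.

```python
import re

# Attribute string shortened from a GENCODE GTF record.
attrs = 'gene_id "ENSMUSG00000039095.8"; gene_type "protein_coding";'

original = re.search(r'gene_id "(\w+)"', attrs)
proposed = re.search(r'gene_id "(\w+).+"', attrs)
alternative = re.search(r'gene_id "([^"]+)"', attrs)

# \w does not match '.', so the original pattern finds nothing.
print(original)              # None
# The proposed pattern matches and captures the ID without its version.
print(proposed.group(1))     # ENSMUSG00000039095
# Capturing up to the closing quote keeps the version suffix.
print(alternative.group(1))  # ENSMUSG00000039095.8
```

Which capture is preferable depends on whether gene IDs are meant to be compared with or without their version suffix elsewhere in the tool.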

The second bug (or an error in the documentation; I am unsure what is desired here) arises when attempting to blacklist genes. Genes whose IDs carry a version number, e.g. ENSMUSG00000039095(.8), are not found in the GTF reference by imfusion. The example given at http://nki-ccb.github.io/imfusion/usage.html uses the gene ID without the .8 part. When executing the command, the following error is reported:

imfusion-build star     --reference_seq xxx/GRCm38.primary_assembly.genome.fa   --reference_gtf xxx/gencode.vM18.annotation.gtf     --transposon_seq /xxx/packages/imfusion-0.3.2/data/t2onc/t2onc2.sequence.fa     --transposon_features xxx/imfusion-0.3.2/data/t2onc/t2onc2.features.txt     --output_dir xxx/testingIMF     --blacklist_genes ENSMUSG00000039095 ENSMUSG00000038402     --star_threads 4
[2018-08-28 12:15:33] Copying transposon files
[2018-08-28 12:15:33] Building indexed reference gtf
[2018-08-28 12:16:23] Building flattened exon gtf
[2018-08-28 12:17:20] Building augmented reference
> xxx/miniconda3/lib/python3.5/site-packages/imfusion/build/util.py(210)regions_from_genes()
Traceback (most recent call last):
  File "xxx/miniconda3/bin/imfusion-build", line 11, in <module>
    load_entry_point('imfusion==0.3.2', 'console_scripts', 'imfusion-build')()
  File "xxx/miniconda3/lib/python3.5/site-packages/imfusion/main/build.py", line 35, in main
    output_dir=args.output_dir)
  File "xxx/miniconda3/lib/python3.5/site-packages/imfusion/build/indexers/base.py", line 190, in build
    reference.indexed_gtf_path))
  File "xxx/miniconda3/lib/python3.5/site-packages/imfusion/build/util.py", line 210, in regions_from_genes
    raise ValueError(
ValueError: Gene 'ENSMUSG00000039095' not found in reference gtf. Make sure that the correct gene ID is being used (corresponding with the gene_id annotation in the GTF file).

Grepping for it directly in the file gives:

grep 'ENSMUSG00000039095' xxx/testingIMF/reference.gtf
chr5	HAVANA	gene	28165694	28172166	.	+	.	gene_id "ENSMUSG00000039095.8"; gene_type "protein_coding"; gene_name "En2"; level 2; havana_gene "OTTMUSG00000037277.2";
chr5	HAVANA	transcript	28165694	28172166	.	+	.	gene_id "ENSMUSG00000039095.8"; transcript_id "ENSMUST00000036177.8"; gene_type "protein_coding"; gene_name "En2"; transcript_type "protein_coding"; transcript_name "RP23-358F24.1-001"; level 2; protein_id "ENSMUSP00000036761.7"; transcript_support_level "1"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS19143.1"; havana_gene "OTTMUSG00000037277.2"; havana_transcript "OTTMUST00000096171.2";
...

With further testing I found that the blacklisted gene ID is checked with an in test (line 108 in tabix.py: if rec['gene_id'] in sought_ids:). So if the desired behaviour is to supply gene IDs without the version suffix, I suggest changing the line to if rec['gene_id'].split('.')[0] in sought_ids: . If, on the other hand, the version suffix is required, then updating the documentation would be helpful.

Best regards,
Olga
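
The version-insensitive comparison suggested above can be tried on its own. In this sketch the function name gene_id_matches and the ignore_version switch are hypothetical; only the split('.')[0] expression comes from the suggestion in this issue.

```python
def gene_id_matches(gene_id, sought_ids, ignore_version=True):
    """Check whether a GTF gene_id is among the sought IDs.

    With ignore_version=True, a suffix such as '.8' is dropped from the
    GTF gene_id before the comparison.
    """
    if ignore_version:
        gene_id = gene_id.split('.')[0]
    return gene_id in sought_ids


sought_ids = {'ENSMUSG00000039095', 'ENSMUSG00000038402'}

# Strict comparison fails for versioned IDs; the relaxed one succeeds.
print(gene_id_matches('ENSMUSG00000039095.8', sought_ids, ignore_version=False))
print(gene_id_matches('ENSMUSG00000039095.8', sought_ids))
```

Note that this variant only strips the version from the GTF side, so IDs passed to --blacklist_genes would have to be given without a version.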

Length mismatch

Hi,

I'm using IM-fusion to build my genome. I installed the package and its dependencies and prepared all the files I need. When I submit my job, I get this error. Would you please let me know what is wrong with my file?

Best,
Niusha

[2020-11-17 08:30:34] Copying transposon files
[2020-11-17 08:30:35] Building indexed reference gtf
[2020-11-17 08:31:48] Building flattened exon gtf
Traceback (most recent call last):
  File "/scratch/user/ENV/bin/imfusion-build", line 8, in <module>
    sys.exit(main())
  File "/scratch/user/ENV/lib/python3.8/site-packages/imfusion/main/build.py", line 28, in main
    indexer.build(
  File "/scratch/user/ENV/lib/python3.8/site-packages/imfusion/build/indexers/base.py", line 178, in build
    gtf_frame_flat = tabix.flatten_gtf_frame(gtf_frame)
  File "/scratch/user/ENV/lib/python3.8/site-packages/imfusion/util/tabix.py", line 239, in flatten_gtf_frame
    gtf_frame = _gtf_frame_from_exon_array(exons)
  File "/scratch/user/ENV/lib/python3.8/site-packages/imfusion/util/tabix.py", line 279, in _gtf_frame_from_exon_array
    attrs = flat_exons.groupby('gene_id')['gene_id'].transform(_attribute_strs)
  File "/scratch/user/ENV/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 493, in transform
    return self._transform_general(
  File "/scratch/user/ENV/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 562, in _transform_general
    result.index = self._selected_obj.index
  File "/scratch/user/ENV/lib/python3.8/site-packages/pandas/core/generic.py", line 5152, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
  File "/scratch/user/ENV/lib/python3.8/site-packages/pandas/core/series.py", line 424, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/scratch/user/ENV/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 226, in set_axis
    raise ValueError(
ValueError: Length mismatch: Expected axis has 0 elements, new values have 335509 elements
