jaudoux / simct

A configurable generator of simulated RNA-Seq data that can emulate any specific biological mechanism and provide robust data sets covering cases such as fusion genes (or fusions).

Home Page: http://cractools.gforge.inria.fr/softwares/simct/

Perl 99.12% Perl 6 0.88%

simct's People

Contributors

jaudoux, sacha34540

simct's Issues

found 0 transcripts

Problem:
Sequence names in the FASTA files don't match the sequence names in the GTF file.

Log message:

Looking for FASTA references in genome_dir/References found :
- Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_XIII => genome_dir/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_XIII.fa.gz
- Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_XIV => genome_dir/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_XIV.fa.gz
- Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_III => genome_dir/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_III.fa.gz
- Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_II => genome_dir/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_II.fa.gz
- Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_XII => genome_dir/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_XII.fa.gz
- Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_XVI => genome_dir/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_XVI.fa.gz
- Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_VIII => genome_dir/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_VIII.fa.gz
- Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_VII => genome_dir/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_VII.fa.gz
- Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_Mito => genome_dir/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_Mito.fa.gz
- Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_X => genome_dir/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_X.fa.gz
- Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_V => genome_dir/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_V.fa.gz
- Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_IX => genome_dir/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_IX.fa.gz
- Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_I => genome_dir/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_I.fa.gz
- Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_XI => genome_dir/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_XI.fa.gz
- Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_IV => genome_dir/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_IV.fa.gz
- Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_VI => genome_dir/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_VI.fa.gz
- Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_XV => genome_dir/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.id_XV.fa.gz
Calculating references length
Loading annotations
Building GenomeSimulator (reading annotations)
Generate random mutations (ins,del,sub)
Generate random fusions
Generate the simulated genome as FASTA and GTF
Generate flux simulation
Flux-Simulator v1.2.1 (Flux Library: 1.22)

[INFO] I am collecting information on the run.
initializing profiler

[INFO] Reading error model 76 bases model
[WARN] The error model supports a read length of 76 but
you are trying to create reads of length 100. We are scaling.

[INFO] Checking GTF file
[PROFILING] I am assigning the expression profile
Reading reference annotation OK (00:00:00)
found 0 transcripts

[PROFILING] Parameters
NB_MOLECULES 5000000
EXPRESSION_K -0.6
EXPRESSION_X0 9500.0
EXPRESSION_X1 9.025E7
PRO_FILE_NAME /home/zingo/simulation/dataset/FluxSimulator/fluxSimulator.pro

profiling  OK (00:00:00)
Updating .pro file   OK (00:00:00)
molecules	0

[ERROR] Profiler has no molecules!
java.lang.RuntimeException: Profiler has no molecules!
at barna.flux.simulator.SimulationPipeline.call(SimulationPipeline.java:438)
at barna.flux.simulator.SimulationPipeline.call(SimulationPipeline.java:54)
at barna.commons.launcher.Flux.main(Flux.java:198)

Cannot open dataset/FluxSimulator/fluxSimulator.fastq at /home/zingo/perl5/lib/perl5/CracTools/Utils.pm line 551.
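One way to confirm the mismatch described in this issue before launching a run is to compare the names on both sides. The sketch below is not part of SimCT; the function names are hypothetical, and it assumes that a sequence name is the first word of a FASTA header and the first tab-separated column of a GTF line.

```python
import io


def fasta_names(handle):
    """Collect sequence names: the first word after '>' on each header line."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_seqnames(handle):
    """Collect the first column of every non-comment, non-empty GTF line."""
    names = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def compare_names(fasta_handle, gtf_handle):
    """Report which names are shared and which appear on one side only."""
    fa = fasta_names(fasta_handle)
    gtf = gtf_seqnames(gtf_handle)
    return {
        "shared": sorted(fa & gtf),
        "only_in_fasta": sorted(fa - gtf),
        "only_in_gtf": sorted(gtf - fa),
    }


if __name__ == "__main__":
    # Made-up in-memory data, for illustration only.
    fasta = io.StringIO(">chrI example\nACGT\n")
    gtf = io.StringIO('I\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1";\n')
    print(compare_names(fasta, gtf))
```

An empty "shared" list would be consistent with the "found 0 transcripts" message above, although this check alone does not prove that it is the cause. For gzipped inputs, the handles can be opened with gzip.open(path, "rt").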

FAIL in Building and testing on ubuntu 18.04

I am trying to install SimCT from the tarball on Ubuntu 18.04, but it fails at the "Building and testing" step. The log file, build.log, reports the following errors:

  • Can't call method "start" on an undefined value at t/CracTools-SimCT-Annotations.t line 39.
  • Can't call method "getExon" on an undefined value at t/CracTools-SimCT-GenomeSimulator-fusions.t line 55.
  • Can't call method "sortedExons" on an undefined value at t/CracTools-SimCT-GenomeSimulator.t line 101.

Installation

@zingo:novoSplice$ sudo cpanm ~/Downloads/CracTools-SimCT-0.012.tar.gz
[sudo] password for zingo:
--> Working on /home/zingo/Downloads/CracTools-SimCT-0.012.tar.gz
Fetching file:///home/zingo/Downloads/CracTools-SimCT-0.012.tar.gz ... OK
Configuring CracTools-SimCT-0.012 ... OK
Building and testing CracTools-SimCT-0.012 ... FAIL
! Installing /home/zingo/Downloads/CracTools-SimCT-0.012.tar.gz failed. See /home/zingo/.cpanm/work/1559033762.25327/build.log for details. Retry with --force to force install it.

flux version:

@zingo:novoSplice$ flux-simulator --version
Flux-Simulator v1.2.1 (Flux Library: 1.22)

cpanm version:

@zingo:novoSplice$ cpanm --version
cpanm (App::cpanminus) version 1.7043 (/usr/bin/cpanm)
perl version 5.026001 (/usr/bin/perl)

%Config:
archname=x86_64-linux-gnu-thread-multi
installsitelib=/usr/local/share/perl/5.26.1
installsitebin=/usr/local/bin
installman1dir=/usr/share/man/man1
installman3dir=/usr/share/man/man3
sitearchexp=/usr/local/lib/x86_64-linux-gnu/perl/5.26.1
sitelibexp=/usr/local/share/perl/5.26.1
vendorarch=/usr/lib/x86_64-linux-gnu/perl5/5.26
vendorlibexp=/usr/share/perl5
archlibexp=/usr/lib/x86_64-linux-gnu/perl/5.26
privlibexp=/usr/share/perl/5.26
%ENV:
PERL5LIB=/home/zingo/perl5/lib/perl5
PERL_LOCAL_LIB_ROOT=/home/zingo/perl5
PERL_MB_OPT=--install_base "/home/zingo/perl5"
PERL_MM_OPT=INSTALL_BASE=/home/zingo/perl5
@inc:
/home/zingo/perl5/lib/perl5/5.26.1/x86_64-linux-gnu-thread-multi
/home/zingo/perl5/lib/perl5/5.26.1
/home/zingo/perl5/lib/perl5/x86_64-linux-gnu-thread-multi
/home/zingo/perl5/lib/perl5
/etc/perl
/usr/local/lib/x86_64-linux-gnu/perl/5.26.1
/usr/local/share/perl/5.26.1
/usr/lib/x86_64-linux-gnu/perl5/5.26
/usr/share/perl5
/usr/lib/x86_64-linux-gnu/perl/5.26
/usr/share/perl/5.26
/home/zingo/perl5/lib/perl5/5.26.0
/home/zingo/perl5/lib/perl5/5.26.0/x86_64-linux-gnu-thread-multi
/usr/local/lib/site_perl
/usr/lib/x86_64-linux-gnu/perl-base

query name too long

Hi,

I hope this great suite of tools is still supported!

I found that the read names simulated by SimCT can be longer than 254 characters, which is the maximum query name length. This is not an issue until we try converting SAM to BAM. It would be great if SimCT could be BAM-friendly.

Error when using samtools view to convert SAM to BAM:

[E::sam_parse1] query name too long
[W::sam_read1] Parse error at line 148116670
[main_samview] truncated file.

Problematic reads:

50486975:23,35459525,+,9M2584N17M5106N11M936N13M582N14M1319N10M1664N11M1539N13M137N10M4099N7M1759N5M249N9M2230N11M575N10M;23,35467245,-,7M936N13M582N14M1319N10M1664N11M1539N13M137N10M4099N7M1759N5M249N9M2230N11M575N14M14N6M5642N3M7050N7M3085N10M:AAAAAAAAA	99	23	35459528	60	2S7=2584N17=5106N11=936N13=582N14=1319N10=1664N11=1539N13=137N10=4099N7=1759N5=249N9=2230N11=575N7=3S	=	35467246	38748	CTATTTAAGATTCCCAAACCAGATAGGAAATGGTCAGATGGATTTAGCTGTTTTGCCCTTGTATATGCAAACATTAATCATTTATGAATTTACCCATGGATTCGACAAGATAAATTTGAAGCAAGTTAGGCCGACCTTAGAATACATaTa	@BBBCFFFFFFFDHHHHHHHHJJJJJJIGIIJIJIHGEFIIGIEEFHJIIIJJJJJJJJJJJJHGIJGACGB99B9FFGIIJJJJIGIGIIIICAGIJJIJJI@:FFFDDDDDEDE<AEDBB5:<<?CCCBAAABB>B>A@CCDACC###	AS:i:3353	id:Z:ENSDARG00000116857	bt:Z:protein_coding	tm:i:1812
50486975:23,35459525,+,9M2584N17M5106N11M936N13M582N14M1319N10M1664N11M1539N13M137N10M4099N7M1759N5M249N9M2230N11M575N10M;23,35467245,-,7M936N13M582N14M1319N10M1664N11M1539N13M137N10M4099N7M1759N5M249N9M2230N11M575N14M14N6M5642N3M7050N7M3085N10M:AAAAAAAAA	147	23	35467246	60	7=936N13=582N14=1319N10=1664N11=1539N1X12=137N10=4099N7=1759N5=249N9=2230N11=575N14=14N6=5642N3=7050N7=3085N10=	=	35459528	-38748	TGGTCAGATGGATTTAGCTGTTTTGCCCTTGTATATGCAAACATTAATCATTTATtAATTTACCCATGGATTCGACAAGATAAATTTGAAGCAAGTTAGGCCGACCTTAGAATACATTTCTGAAGATAAACATGTCTGGAAGAAGCTGTG	DBBBDDDB=DDDDCEDDDDDDDDDC?AIHHHHHAAGE?@B=;HHHHHHFFFFF==)>?CDDC=CCCIJJJJJJJJIIGHF@HJJJJJJJIGIJIHE?HEIHJJJJIJJJJJJJIIFCBE@?DA@IIIIJIHHHHHHHFFFFFFFFC@@@@	AS:i:3353	id:Z:ENSDARG00000116857	bt:Z:protein_coding	tm:i:1812

Thanks ~
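Until the simulator handles this itself, a possible workaround is to shorten over-long query names in the SAM file before conversion. The sketch below is only an illustration, not a SimCT feature: the function names are hypothetical, and it assumes the text before the first ":" in the read name is a read identifier worth keeping as a prefix (as the reads above suggest). The shortened name is derived from a hash of the original, so both mates of a pair still receive the same name.

```python
import hashlib

MAX_QNAME = 254  # maximum query name length mentioned in this issue


def shorten_qname(qname, max_len=MAX_QNAME):
    """Return qname unchanged if it fits, else '<prefix>_<16 hex chars>'."""
    if len(qname) <= max_len:
        return qname
    digest = hashlib.sha1(qname.encode("utf-8")).hexdigest()[:16]
    # Assumption: the part before the first ':' identifies the read.
    prefix = qname.split(":", 1)[0][: max_len - len(digest) - 1]
    return f"{prefix}_{digest}"


def fix_sam_line(line, max_len=MAX_QNAME):
    """Shorten the first column of an alignment line; leave headers alone."""
    if line.startswith("@"):
        return line
    qname, sep, rest = line.partition("\t")
    return shorten_qname(qname, max_len) + sep + rest


if __name__ == "__main__":
    import sys

    # Usage: python fix_qnames.py < in.sam > out.sam
    for sam_line in sys.stdin:
        sys.stdout.write(fix_sam_line(sam_line))
```

Note that this discards the position and CIGAR information encoded in the original name, so any tool that reads the truth from the read name would need a separate mapping from the short names back to the originals.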

How are gene-counts and transcript-counts generated?

Hi @jaudoux

I am wondering how gene-counts.tsv.gz and transcript-counts.tsv.gz are generated. Are they raw counts, or are they produced with a feature-counting tool?
In my opinion, it would be very handy if the simulator could generate a SAM-like file containing all the reads, which could then be passed to HTSeq-count, for example, so that we have a more comparable truth set.

Intervals must have positive width

Hi,

I ran into this problem when running BenchCT:

[checker] Reading info file
[checker] 160005484 alignment(s) read
[checker] 129001963 error(s) read
[checker] Reading junction bed file
Intervals must have positive width at /home/zingo/perl5/lib/perl5/CracTools/Interval/Query.pm line 46, <$fh> line 3.

My guess is that BenchCT expects the splices.bed.gz file produced by SimCT to have start < end, which is not the case for junctions on the antisense strand. The following awk command seems to fix the problem. I would appreciate any insight into this issue.

zcat splices.bed.gz | awk '{if ($6 == "+") {print $0} else {printf("%s\t%s\t%s\t%s\t%s\t%s\n", $1,$3,$2,$4,$5,$6)} }' | gzip > splices.modified.bed.gz

Thanks ~
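The same workaround can be sketched in Python. Unlike the awk command above, this version swaps the two coordinates only when start is actually greater than end, regardless of strand, and it keeps any columns beyond the sixth. The function names are hypothetical, and the sketch does not check whether the swapped coordinates need an off-by-one adjustment for BED's half-open convention.

```python
import gzip


def normalize_bed_line(line):
    """Swap columns 2 and 3 of a BED line when start > end."""
    if not line.strip() or line.startswith(("#", "track", "browser")):
        return line  # pass through blank, comment, and header lines
    fields = line.rstrip("\n").split("\t")
    start, end = int(fields[1]), int(fields[2])
    if start > end:
        fields[1], fields[2] = str(end), str(start)
    return "\t".join(fields) + "\n"


def normalize_bed_gz(src_path, dst_path):
    """Rewrite a gzipped BED file so that every interval has start <= end."""
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for line in src:
            dst.write(normalize_bed_line(line))


if __name__ == "__main__":
    normalize_bed_gz("splices.bed.gz", "splices.modified.bed.gz")
```

Intervals where start equals end are left as they are, so they would presumably still trigger the "positive width" error and need separate inspection.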
