tderrien / feelnc

FEELnc : FlExible Extraction of LncRNA

License: GNU General Public License v3.0

R 8.61% Perl 90.27% Shell 1.12%

lncrna rna-seq annotation

feelnc's Issues

lncRNA gene ID Conversion

@tderrien @flegeai @vwucher,
I have successfully used your program to identify lncRNAs from two sets of experiments. For downstream processing I need the identity of the lncRNAs, so that I can BLAST them against known lncRNAs and see whether any are novel. The classes text file and the candidatelncRNA.gtf.lncRNA.gtf file contain TCONS and XLOC IDs. Is there a way to identify the corresponding transcript or gene ID?

isBest	lncRNA_gene	lncRNA_transcript	partnerRNA_gene	partnerRNA_transcript	direction	type	distance	subtype	location
1	XLOC_004434	TCONS_00004781	ABRACL	NM_001302176	sense	genic	0	containing	exonic
1	XLOC_000968	TCONS_00001039	OAZ2	NM_001142862	sense	genic	0	containing	exonic
0	XLOC_000968	TCONS_00001039	ZNF609	NM_001293220	antisense	intergenic	1457	convergent	downstream
1	XLOC_002571	TCONS_00002801	CMTM7	NM_001007894	sense	genic	0	containing	exonic
0	XLOC_002571	TCONS_00002801	CMTM8	NM_001199703	sense	intergenic	4154	same_strand	downstream

Any help provided will be highly appreciated.
Thanks
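
One possible way to look this up, sketched here under assumptions: TCONS/XLOC identifiers are normally assigned by Cuffmerge, and the merged.gtf excerpt quoted in a later issue on this page carries attributes such as gene_name, oId, nearest_ref and class_code. If your merged GTF has the same attributes, a table from TCONS ID to reference ID can be built from it. The function name cuffmerge_id_map is made up for this illustration and is not part of FEELnc.

```python
import re

# Matches GTF attribute pairs of the form: key "value"
ATTR = re.compile(r'(\S+) "([^"]*)"')


def cuffmerge_id_map(gtf_lines):
    """Map transcript_id -> selected reference attributes.

    Assumes Cuffmerge-style attributes (gene_name, oId, nearest_ref,
    class_code); attributes missing from a record are returned as None.
    """
    wanted = ("gene_id", "gene_name", "oId", "nearest_ref", "class_code")
    out = {}
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip comments and malformed lines
        attrs = dict(ATTR.findall(fields[8]))
        tid = attrs.get("transcript_id")
        if tid and tid not in out:
            out[tid] = {key: attrs.get(key) for key in wanted}
    return out


# Usage (file name is a placeholder):
# with open("merged.gtf") as handle:
#     ids = cuffmerge_id_map(handle)
# print(ids.get("TCONS_00004781"))
```

A class_code of "=" in such a file indicates that the assembled transcript matches the reference transcript named in nearest_ref; other class codes need to be interpreted case by case.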

Can't call method "close" on an undefined value at.... Running FEELnc_classifier.pl

Hi,
I get the following error when running the 'classifier' module:
(in cleanup) Can't call method "close" on an undefined value at /home/new/perl5/lib/perl5/Bio/DB/SeqFeature/Store/berkeleydb.pm line 1455.

It shows the line twice on the terminal and doesn't print the output. How do I get it to run all the way to the end and print the .txt?

Your help will be appreciated.

How to select optimal k-mer (lists)?

Hi @tderrien,
I am trying to annotate lncRNAs in the genome of a non-model organism using the shuffle strategy.
Now I was wondering: how do I determine which k-mer frequencies should be preserved in order to maximize the classification accuracy? Can this be judged from the mean accuracy value in the *RF_statsLearn_CrossValidation.txt file? Or do I need to calculate MCC values for each k-mer (list)? If so, can I use the mean TP/TN/FP/FN from the above-mentioned file for this?
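
For reference, the MCC mentioned above can be computed from the four confusion-matrix counts as in the minimal sketch below. The counts in the usage lines are hypothetical, not taken from any FEELnc run. Note also that an MCC computed from mean counts is not necessarily equal to the mean of the per-fold MCC values, and whether MCC is the right criterion here is a question for the maintainers.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when a marginal total is zero
    return (tp * tn - fp * fn) / denom


# Hypothetical counts (TP, TN, FP, FN) for two candidate k-mer lists
candidates = {"list_A": (90, 85, 15, 10), "list_B": (80, 80, 20, 20)}
for name, counts in candidates.items():
    print(name, round(mcc(*counts), 3))
```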

Thanks for your help!

Issue installing FEELnc

Dear all,
I have a problem with the installation of FEELnc.
[wkj@localhost FEELnc-master]$ FEELnc_classifier.pl
Can't locate DB_File.pm in @INC (@INC contains: /home/wkj/wkj_software/FEELnc/FEELnc-master//lib/ /home/wkj/wkj_software/FEELnc/FEELnc-master/lib/ /home/wkj/perl5/lib/perl5/x86_64-linux-thread-multi /home/wkj/perl5/lib/perl5 /home/wkj/perl5/lib/perl5/x86_64-linux-thread-multi /home/wkj/perl5/lib/perl5 /home/wkj/perl5/lib/perl5/x86_64-linux-thread-multi /home/wkj/perl5/lib/perl5 /home/wkj/perl5/lib/perl5/x86_64-linux-thread-multi /home/wkj/perl5/lib/perl5 /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /home/wkj/perl5/lib/perl5/Bio/DB/SeqFeature/Store/LoadHelper.pm line 37.
BEGIN failed--compilation aborted at /home/wkj/perl5/lib/perl5/Bio/DB/SeqFeature/Store/LoadHelper.pm line 37.
Compilation failed in require at /home/wkj/perl5/lib/perl5/Bio/DB/SeqFeature/Store/GFF3Loader.pm line 72.
BEGIN failed--compilation aborted at /home/wkj/perl5/lib/perl5/Bio/DB/SeqFeature/Store/GFF3Loader.pm line 72.
Compilation failed in require at /home/wkj/wkj_software/FEELnc/FEELnc-master//lib//Bio/SeqFeature/database_part.pm line 6.
BEGIN failed--compilation aborted at /home/wkj/wkj_software/FEELnc/FEELnc-master//lib//Bio/SeqFeature/database_part.pm line 6.
Compilation failed in require at /home/wkj/wkj_software/FEELnc/FEELnc-master/scripts/FEELnc_classifier.pl line 12.
BEGIN failed--compilation aborted at /home/wkj/wkj_software/FEELnc/FEELnc-master/scripts/FEELnc_classifier.pl line 12.

I tried a lot of things, but nothing worked. What should I do?
Please help me!
Best regards
Kejun
2018-1-14

Problem in step 2: FEELnc_codpot.pl

All three test steps completed. However, an error occurred when I used my own transcripts in step 2. I have a reference genome, so my command was:
FEELnc_codpot.pl -i step1_PtrLnc.gtf -a Ptr.cds.all.fa -l Ptr.lncrnas.gtf -g Populus_trichocarpa.JGI2.0.dna.toplevel.chr.nocaff.fa -o step2_out --outdir=./FEElnc

I also tried the command: FEELnc_codpot.pl -i step1_PtrLnc.gtf -a Ptr.cds.all.fa -g Populus_trichocarpa.JGI2.0.dna.toplevel.chr.nocaff.fa --mode=intergenic -l Ptr.lncrnas.gtf
Both give the error below:
Your input GTF file 'Populus_trichocarpa_lncrnas.gtf' contains 4322 transcripts

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Each line of the qual file must be less than 65,536 characters. Line 3 is 18835764 chars.
STACK: Error::throw
STACK: Bio::Root::Root::throw /Data/Yexx/feelnc/lib/perl5/site_perl/5.22.0/Bio/Root/Root.pm:449
STACK: Bio::DB::IndexedBase::_check_linelength /Data/Yexx/feelnc/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm:744
STACK: Bio::DB::Fasta::_calculate_offsets /Data/Yexx/feelnc/lib/perl5/site_perl/5.22.0/Bio/DB/Fasta.pm:175
STACK: Bio::DB::IndexedBase::_index_files /Data/Yexx/feelnc/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm:648
STACK: Bio::DB::IndexedBase::index_file /Data/Yexx/feelnc/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm:484
STACK: Bio::DB::IndexedBase::new /Data/Yexx/feelnc/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm:364
STACK: ExtractFromFeature::feature2seq /Data/Yexx/feelnc/lib/perl5/site_perl/5.22.0/ExtractFromFeature.pm:399
STACK: ExtractCdnaOrf::CreateORFcDNAFromGTF /Data/Yexx/feelnc/lib/perl5/site_perl/5.22.0/ExtractCdnaOrf.pm:303
STACK: /Data/Yexx/feelnc/bin/FEELnc_codpot.pl:322

Could you tell me what went wrong?
Thanks for your help!

An issue when using FEELnc_filter.pl

Recently, I tried to analyse the lncRNA profile of porcine tissues using FEELnc.

The requirements were already installed as described at "https://github.com/tderrien/FEELnc".

However, when running FEELnc_filter.pl, an error message appeared in my terminal:
"Child 13 encountered an unknown error at FEELnc_filter.pl line 159"

Can somebody advise me on how to resolve this issue?

FEELnc_codpot

Hi,
I recently installed FEELnc to analyze a transcript assembly of a non-model organism. I tried running the test file and got the following error:

Parsing random forest output: random forest output file './feelnc_codpot_out/candidate_lncRNA.gtf_RF.txt' is empty... exiting

I have installed all the prerequisites and they are working correctly. The first step, filtering the transcripts, executes correctly, but the second crashes with the above-mentioned error. Since this is the test file you provide with the software and I am following your instructions line by line, do you know what could be causing this error?

Best regards,

Juan Montenegro

Error with FEELnc_codpot

Hi!
I am new to FEELnc and I get an error in the second step of the tool. Here is the error I received when running it:

Possible precedence issue with control flow operator at /software/anaconda2/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 845.
You do not have specified a maximum number mRNAs transcripts for the training. Use all the annotation, can be long...
You do not have specified a maximum number lncRNA transcripts for the training. Use all the annotation, can be long...
Error: You should set the environnment variable FEELNCPATH to the dir of installation
export FEELNCPATH=my_dir_of_install/

I know that a lot of people have submitted similar errors, but the bioinformatician on my team and I have been unable to solve it, so we need help!

Thanks in advance!

Iraia

ERROR: line with too many characters

Dear Dr. Derrien: I find your pipeline very attractive for several reasons. I'm trying to run it on transcriptome data from a gymnosperm megagenome which is only partially sequenced. I successfully ran the first module in the pipeline (FILTER), but an error emerged when running the CODPOT module. The error is described in the title, and I paste the first few lines here, although I can send the complete error output if you wish.

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Each line of the file must be less than 65,536 characters. Line 3510498 is 351472 chars.
STACK: Error::throw
STACK: Bio::Root::Root::throw /home/pablo/Apps/Miniconda3/envs/feelnc/lib/site_perl/5.26.2/Bio/Root/Root.pm:447

.
Logically, the error must originate from my genome file (FASTA), from which I extracted several lines around the relevant one with head and tail. I would like to know whether this can be fixed, and how. It strikes me that the previous sequence has 322377 bp but did not trigger the error.
Thank you very much
Pablo

Error in .External2(C_X11, paste0("png::", filename), g$width, g$height, :"unable to start device PNG.

Hi Tderrien,
I ran into a problem when running the 'Coding_Potential' module: "Error in .External2(C_X11, paste0("png::", filename), g$width, g$height, : unable to start device PNG". I tried to generate a PNG with png() but failed. Is there any way to solve this problem, such as skipping the PNG step or writing a PDF instead of a PNG? Could you give me any suggestions?
Your reply will be appreciated!

Using Trinity FASTA file as input for FEELnc_filter.pl

Dear all,

I am trying to run the entire FEELnc pipeline using a Trinity transcript FASTA file as input (instead of a Cufflinks/StringTie transcripts GTF). However, I found that only FEELnc_codpot.pl can be run with this type of input, not FEELnc_filter.pl. Is that correct? Why is the filter step not available for Trinity transcript FASTA input?

Thank you so much in advance,

Cristina Osuna

Possible precedence issue with control flow operator at

a5a8f7582a886432041ce0a4001066.zip

Extract attached file.
run with .command in the directory.

Possible precedence issue with control flow operator at /BiO/BioTools/miniconda3/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm line 791.

Error: no lncRNA sequences have been given in input and the lncRNA sequences simulation mode have not been set. If no lncRNA, please choose a mode between 'shuffle' or 'intergenic'

Usage:
FEELnc_codpot.pl -i transcripts.GTF -a known_mRNA.GTF -g genome.FA -l known_lnc.GTF [options...]

Can you check this, please?

Issues about candidate_lncRNA_classes.txt

Hi, @tderrien @vwucher
I have two questions.
1) How can I extract FASTA sequences using the final output file, candidate_lncRNA_classes.txt, together with my genome.fa? The classifier output seems to contain only the distance; how can I get each lncRNA's start and end positions?
2) Are there any downstream tools for further analysis? I am stuck here. I did find lncRNAs using FEELnc (thanks!), but I need to analyse them further.
Your reply will be appreciated!
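
Regarding question 1, the coordinates are not in the classes file but they are in the lncRNA GTF that was given to the classifier. A minimal sketch, assuming a standard tab-separated GTF with exon records and a transcript_id attribute (the function name transcript_spans is invented for this example and is not part of FEELnc):

```python
import re


def transcript_spans(gtf_lines):
    """Map transcript_id -> (seqname, start, end, strand) from exon records."""
    spans = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if not match:
            continue
        tid, start, end = match.group(1), int(fields[3]), int(fields[4])
        if tid in spans:
            # Widen the span to cover every exon seen so far
            chrom, s, e, strand = spans[tid]
            spans[tid] = (chrom, min(s, start), max(e, end), strand)
        else:
            spans[tid] = (fields[0], start, end, fields[6])
    return spans


# Usage (file name is a placeholder):
# with open("candidate_lncRNA.gtf.lncRNA.gtf") as handle:
#     spans = transcript_spans(handle)
# Look up any ID from the lncRNA_transcript column of the classes file.
```

The span covers introns as well. For spliced transcript sequences, the exons have to be concatenated (and reverse-complemented on the minus strand); a tool such as gffread, with its -w and -g options, can write those sequences directly from a GTF and a genome FASTA.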

Vaney

error msg (in cleanup)

(in cleanup) Can't call method "close" on an undefined value at /path/to/CPAN/5.10.1/lib/perl5/Bio/DB/SeqFeature/Store/berkeleydb.pm line 1455.

I'd be happy to provide some run info, if needed.

Undefined subroutine &Parser::parseGTF when running FEELnc_filter.pl

Hi,
I ran into this error while running the sample code:

FEELnc_filter.pl -i transcript_chr38.gtf -a annotation_chr38.gtf -b transcript_biotype=protein_coding
Filtered transcripts will be available in file: 'transcript_chr38.feelncfilter.log'
Undefined subroutine &Parser::parseGTF called at /home/ke.zhang/Programs/miniconda3/envs/conda_env/bin/FEELnc_filter.pl line 87.

I was able to see the help page when running the script without any arguments.
Could anyone help me identify the problem? Thank you.

Conda install is broken

Hi,
I tried to install FEELnc through conda but FEELnc crashes with these errors:

$ conda create -n env_feelnc feelnc
...
...
$ conda activate env_feelnc
$ FEELnc_classifier.pl
Can't load '/gpfs/home/xxxx/miniconda3/envs/env_feelnc/lib/site_perl/5.26.2/x86_64-linux-thread-multi/auto/DB_File/DB_File.so' for module DB_File: libdb-6.1.so: cannot open shared object file: No such file or directory at /gpfs/home/xxxx/miniconda3/envs/env_feelnc/lib/site_perl/5.26.2/x86_64-linux-thread-multi/XSLoader.pm line 96.
 at /gpfs/home/xxxx/miniconda3/envs/env_feelnc/lib/site_perl/5.26.2/x86_64-linux-thread-multi/DB_File.pm line 257.
Compilation failed in require at /gpfs/home/xxxx/miniconda3/envs/env_feelnc/lib/site_perl/5.26.2/Bio/DB/SeqFeature/Store/LoadHelper.pm line 37.
BEGIN failed--compilation aborted at /gpfs/home/xxxx/miniconda3/envs/env_feelnc/lib/site_perl/5.26.2/Bio/DB/SeqFeature/Store/LoadHelper.pm line 37.
Compilation failed in require at /gpfs/home/xxxx/miniconda3/envs/env_feelnc/lib/site_perl/5.26.2/Bio/DB/SeqFeature/Store/GFF3Loader.pm line 72.
BEGIN failed--compilation aborted at /gpfs/home/xxxx/miniconda3/envs/env_feelnc/lib/site_perl/5.26.2/Bio/DB/SeqFeature/Store/GFF3Loader.pm line 72.
Compilation failed in require at /gpfs/home/xxxx/miniconda3/envs/env_feelnc/lib/site_perl/5.26.2/Bio/SeqFeature/database_part.pm line 6.
BEGIN failed--compilation aborted at /gpfs/home/xxxx/miniconda3/envs/env_feelnc/lib/site_perl/5.26.2/Bio/SeqFeature/database_part.pm line 6.
Compilation failed in require at /home/xxxx/miniconda3/envs/env_feelnc/bin/FEELnc_classifier.pl line 12.
BEGIN failed--compilation aborted at /home/xxxx/miniconda3/envs/env_feelnc/bin/FEELnc_classifier.pl line 12.

It seems related to bioconda/bioconda-recipes#24688

FEELnc_classifier.pl issue when determining antisense transcripts with GTF-file versions

I am trying to classify antisense transcripts using FEELnc_classifier.pl with a GENCODE GTF file as the reference annotation. When I use the GENCODE reference at transcript level (including only transcript and exon entries), I get wildly different results from those obtained with the complete reference file (including gene entries and others).
How is FEELnc_classifier.pl supposed to be used, and what input files are recommended?
Thank you!

"MSG: Failed validation of sequence '1'. Invalid characters"

I came to this page because I had the error "MSG: Failed validation of sequence '1'. Invalid characters", which is discussed in this thread. There was no problem with my files or indices. Introducing line breaks into the FASTA file solved the problem. I realized this when I isolated the sequence on which it was failing and ran the program again on that sequence alone; Perl then reported "MSG: Each line of the file must be less than 65,536 characters. Line 2 is 721461 chars."
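
The fix described above (introducing line breaks so that no FASTA line reaches 65,536 characters) can be sketched as follows. This is an illustration, not a FEELnc utility: the function name wrap_fasta and the 60-column default are choices made for the example.

```python
def wrap_fasta(lines, width=60):
    """Yield FASTA lines with every sequence re-wrapped to `width` columns."""
    seq = []

    def flush():
        # Emit the buffered sequence in fixed-width chunks, then reset
        joined = "".join(seq)
        seq.clear()
        for i in range(0, len(joined), width):
            yield joined[i:i + width]

    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            yield from flush()
            yield line
        elif line:
            seq.append(line)
    yield from flush()


# Usage (file names are placeholders):
# with open("genome.fa") as src, open("genome.wrapped.fa", "w") as dst:
#     dst.writelines(line + "\n" for line in wrap_fasta(src))
```

The sketch buffers one sequence at a time, so memory use is bounded by the longest record. An index file built from the unwrapped FASTA would no longer match the new file and would need to be deleted so that it can be rebuilt.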

Originally posted by @ginac in #10 (comment)

Misclassification of partner RNA genes

Hello,
I am running the FEELnc classifier on my list of lncRNAs with the RefSeq annotation for the mouse genome, using the following format for my GTF file:
4 . exon 61545496 61545496 . + . gene_id "ncRNA_as_c4_3298"; transcript_id "ncRNA_as_c4_3298_7"; exon_number "1";
4 . exon 61622060 61622094 . + . gene_id "ncRNA_as_c4_3298"; transcript_id "ncRNA_as_c4_3298_4"; exon_number "1";
4 . exon 61622060 61622094 . + . gene_id "ncRNA_as_c4_3298"; transcript_id "ncRNA_as_c4_3298_8"; exon_number "1";
4 . exon 61622060 61622094 . + . gene_id "ncRNA_as_c4_3298"; transcript_id "ncRNA_as_c4_3298_1"; exon_number "1";
4 . exon 61622060 61622094 . + . gene_id "ncRNA_as_c4_3298"; transcript_id "ncRNA_as_c4_3298_2"; exon_number "1";

The reference GTF file has a similar format.

The issue is that the FEELnc classification correctly finds the antisense gene partners Mup3 and Mup20 (NR_149826; NM_001012323), but in addition it predicts two other genes that do not overlap the lncRNA, and still classifies them as intronic with an incorrect distance of zero. One of these is NR_002445, which is on chromosome 16, yet I still get a distance of zero and an intronic prediction. Another such instance is a lncRNA at chr5:136,150,706-136,166,245 that is actually antisense to Por. FEELnc outputs that correctly, but it also reports LOC102631757 as the best prediction (with a distance of zero and predicted intronic), although it is located approx. 7 kb away on chr5 and is also found on another chromosome.

I have no idea why this would be happening. Does it have something to do with the format of my reference file? I ask because the test files provided for the human dataset do not give the chromosome number but the version of the assembly.

Cannot run FEELnc_codpot.pl

Hi
I am working on zebrafish lncRNAs and have recently installed your software. The first step of the pipeline runs smoothly and produces the desired output. However, whenever I try to run the next step, i.e. FEELnc_codpot.pl, I get the following error:

Undefined subroutine &Bio::DB::IndexedBase::_strip_crnl called at /usr/local/share/perl/5.26.1/Bio/DB/Fasta.pm line 295.

I also tried running the test with the toy samples and got the same error.

I would appreciate your help very much.

Error when running FEELnc_codpot.pl example

Hello,

I'm assisting a researcher in setting up FEELnc on a cluster.

I've included the installation details below:

We have several of the dependencies already available in modules, so those have been loaded:

# using lmod module system here, so we must load gcc before loading other tools
$ module load gcc/6.2.0
# perl module includes BioPerl 1.007001
$ module load perl/5.24.0
$ module load R/3.4.1

Parallel::ForkManager was installed to a local Perl library, and ROCR and randomForest were installed to a local R package library:

$ eval `perl -Mlocal::lib=~/perl5-O2`
$ cpanm Parallel::ForkManager
# tested, and can successfully use this Perl module

$ mkdir -p ~/R-3.4.1-FEELnc/library
$ echo 'R_LIBS_USER="~/R-3.4.1-FEELnc/library"' >  $HOME/.Renviron
$ export R_LIBS_USER="~/R-3.4.1-FEELnc/library"
$ R
> install.packages("ROCR", repos="http://cran.r-project.org")
> install.packages("randomForest", repos="http://cran.r-project.org")
# tested, and can successfully use both R packages

The provided KmerInShort and fasta_ushuffle Linux executables work as expected, so these were not compiled from source.

The FEELnc installation process was performed as described in the README.

$ git clone https://github.com/tderrien/FEELnc.git
$ cd FEELnc/
$ export FEELNCPATH=${PWD}
$ export PERL5LIB=${FEELNCPATH}/lib/:$PERL5LIB

# need this additional line for BioPerl to be found
$ export PERL5LIB=/n/app/perl/5.24.0/lib/site_perl/5.24.0/:$PERL5LIB

$ export PATH=$PATH:${FEELNCPATH}/scripts/
$ export PATH=$PATH:${FEELNCPATH}/utils/
$ export PATH=$PATH:${FEELNCPATH}/bin/LINUX/

The installation worked without any errors, but testing with the toy example does not. Step 1 (FEELnc_filter.pl) works fine; Step 2 (FEELnc_codpot.pl) ends in a "No locks available" error.

$ cd test
$ FEELnc_filter.pl -i transcript_chr38.gtf -a annotation_chr38.gtf \
> -b transcript_biotype=protein_coding > candidate_lncRNA.gtf
# works as expected
$ FEELnc_codpot.pl -i candidate_lncRNA.gtf -a annotation_chr38.gtf -b transcript_biotype=protein_coding -g genome_chr38.fa --mode=shuffle
Warning: Output directory './feelnc_codpot_out' already exists... files might be overwritten!
You do not have specified a maximum number mRNAs transcripts for the training. Use all the annotation, can be long...
You do not have specified a maximum number lncRNA transcripts for the training. Use all the annotation, can be long...
> Extract ORFs/cDNAs for mRNAs from a GTF file
Parsing file 'annotation_chr38.gtf'...
Parse input file:             [----------------------------------------------------------------------------------------------------]
    Your input GTF file 'annotation_chr38.gtf' contains *254* transcripts

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Could not open index file genome_chr38.fa.index: No locks available
STACK: Error::throw
STACK: Bio::Root::Root::throw /n/app/perl/5.24.0/lib/site_perl/5.24.0//Bio/Root/Root.pm:447
STACK: Bio::DB::IndexedBase::_open_index /n/app/perl/5.24.0/lib/site_perl/5.24.0//Bio/DB/IndexedBase.pm:712
STACK: Bio::DB::IndexedBase::_index_files /n/app/perl/5.24.0/lib/site_perl/5.24.0//Bio/DB/IndexedBase.pm:689
STACK: Bio::DB::IndexedBase::index_file /n/app/perl/5.24.0/lib/site_perl/5.24.0//Bio/DB/IndexedBase.pm:525
STACK: Bio::DB::IndexedBase::new /n/app/perl/5.24.0/lib/site_perl/5.24.0//Bio/DB/IndexedBase.pm:405
STACK: ExtractFromFeature::feature2seq /home/kmk34/tickets/INC0189539_FEELnc/FEELnc/lib//ExtractFromFeature.pm:399
STACK: ExtractCdnaOrf::CreateORFcDNAFromGTF /home/kmk34/tickets/INC0189539_FEELnc/FEELnc/lib//ExtractCdnaOrf.pm:303
STACK: /home/kmk34/tickets/INC0189539_FEELnc/FEELnc/scripts/FEELnc_codpot.pl:298

I found a related issue about another tool that uses BioPerl, ending in the same "No locks available" error. The poster suggested that NFS filesystems (which I'm using) do not work well with multiple processes using the same file: http://lists.ensembl.org/pipermail/dev/2014-December/010636.html

Do you have any suggestions on how to avoid this issue?

Thank you!
Kathleen

exonic_lncRNAs

Dear All

I got a list of candidate lncRNAs from FEELnc, but some of them match exons on the opposite strand (plus/minus, according to the BLAST results). How can I determine that these are antisense exonic lncRNAs and not cleaved or degraded mRNAs?

According to the GTF file of candidate lncRNAs, some lncRNA genes have more than one transcript, differing by only a few bp in length. For example: A.1, A.2, A.3 for lncRNA gene A. Should I consider them as 3 lncRNAs or just take the longest one?
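
Whether to keep every isoform or only one per gene is a biological decision that depends on the downstream analysis. If the longest isoform is what you want, the selection itself can be sketched as below, assuming a tab-separated GTF with exon records carrying gene_id and transcript_id attributes. The function name longest_isoforms is made up for this illustration.

```python
import re
from collections import defaultdict


def longest_isoforms(gtf_lines):
    """Return {gene_id: (transcript_id, exonic_length)} for the longest isoform."""
    length = defaultdict(int)  # transcript_id -> summed exon length
    gene_of = {}               # transcript_id -> gene_id
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        gene = re.search(r'gene_id "([^"]+)"', fields[8])
        tx = re.search(r'transcript_id "([^"]+)"', fields[8])
        if not (gene and tx):
            continue
        gene_of[tx.group(1)] = gene.group(1)
        # GTF coordinates are 1-based and inclusive
        length[tx.group(1)] += int(fields[4]) - int(fields[3]) + 1
    best = {}
    for tid, n in length.items():
        gid = gene_of[tid]
        if gid not in best or n > best[gid][1]:
            best[gid] = (tid, n)
    return best
```

Length here means summed exon length, not genomic span; ties are resolved in favour of the transcript seen first in the file.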

Thanks

Error while running FEELnc

tkm@tkm:~/Software/FEELnc/scripts$ ./FEELnc_filter.pl
Can't locate Parser.pm in @INC (you may need to install the Parser module) (@INC contains: %LIB $/home/tkm/Software/FEELnc/lib/ $/home/tkm/Software/FEELnc/lib/ /home/tkm/Downloads/FEELnc-master/lib/Parser.pm /home/tkm/.cpan/build/Pod-Usage-1.69-0/t/inc/Pod/Parser.pm /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.22.1 /usr/local/share/perl/5.22.1 /usr/lib/x86_64-linux-gnu/perl5/5.22 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.22 /usr/share/perl/5.22 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base .) at ./FEELnc_filter.pl line 14.
BEGIN failed--compilation aborted at ./FEELnc_filter.pl line 14.

tkm@tkm:~/Software/FEELnc/scripts$ ./FEELnc_codpot.pl
Can't locate Parser.pm in @INC (you may need to install the Parser module) (@INC contains: /home/tkm/Software/FEELnc/scripts/_Inline/lib %LIB $/home/tkm/Software/FEELnc/lib/ $/home/tkm/Software/FEELnc/lib/ /home/tkm/Downloads/FEELnc-master/lib/Parser.pm /home/tkm/.cpan/build/Pod-Usage-1.69-0/t/inc/Pod/Parser.pm /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.22.1 /usr/local/share/perl/5.22.1 /usr/lib/x86_64-linux-gnu/perl5/5.22 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.22 /usr/share/perl/5.22 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base .) at ./FEELnc_codpot.pl line 20.
BEGIN failed--compilation aborted at ./FEELnc_codpot.pl line 20.

tkm@tkm:~/Software/FEELnc/scripts$ ./FEELnc_classifier.pl
Can't locate Bio/SeqFeature/database_part.pm in @INC (you may need to install the Bio::SeqFeature::database_part module) (@INC contains: %LIB $/home/tkm/Software/FEELnc/lib/ $/home/tkm/Software/FEELnc/lib/ /home/tkm/Downloads/FEELnc-master/lib/Parser.pm /home/tkm/.cpan/build/Pod-Usage-1.69-0/t/inc/Pod/Parser.pm /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.22.1 /usr/local/share/perl/5.22.1 /usr/lib/x86_64-linux-gnu/perl5/5.22 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.22 /usr/share/perl/5.22 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base .) at ./FEELnc_classifier.pl line 12.
BEGIN failed--compilation aborted at ./FEELnc_classifier.pl line 12.

perl shebang

Hi Thomas and all,

I tried to run FEELnc_classifier.pl on our cluster and bumped into various Perl module loading errors, which occurred even though the modules are available in the location given in PERL5LIB.
It turned out that the three scripts use the default perl, i.e. /usr/bin/perl, instead of the perl I selected, which has the correct modules installed.

Example error message:
Can't load '/apps/perl/5.24.0/lib/site_perl/5.24.0/x86_64-linux/auto/Encode/Encode.so' for module Encode: /apps/perl/5.24.0/lib/site_perl/5.24.0/x86_64-linux/auto/Encode/Encode.so: undefined symbol: PL_stack_sp at /apps/perl/5.24.0/lib/site_perl/5.24.0/x86_64-linux/XSLoader.pm line 96.
at /apps/perl/5.24.0/lib/site_perl/5.24.0/x86_64-linux/Encode.pm line 10.
Compilation failed in require at /usr/share/perl5/vendor_perl/Pod/Text.pm line 32.
BEGIN failed--compilation aborted at /usr/share/perl5/vendor_perl/Pod/Text.pm line 32.
Compilation failed in require at (eval 1) line 2.
BEGIN failed--compilation aborted at /usr/share/perl5/vendor_perl/Pod/Usage.pm line 29.
Compilation failed in require at /scs/groups/lncRNA_orthology/src/FEELnc/scripts/FEELnc_classifier.pl line 6.
BEGIN failed--compilation aborted at /scs/groups/lncRNA_orthology/src/FEELnc/scripts/FEELnc_classifier.pl line 6.

Changing the shebang from:
#!/usr/bin/perl -w
to:
#!/usr/bin/env perl
solves the problem.
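
Until such a change is made upstream, the same edit can be applied to a local copy of the scripts. The sketch below is only an illustration of that edit; the function name fix_shebang is invented here. One caveat: the replacement drops the -w flag, because on many systems env receives "perl -w" as a single argument and fails, so warnings are then only enabled where a script says "use warnings;".

```python
def fix_shebang(text, old="#!/usr/bin/perl -w", new="#!/usr/bin/env perl"):
    """Return `text` with its first line replaced if it equals `old`."""
    first, sep, rest = text.partition("\n")
    if first.strip() == old:
        return new + sep + rest
    return text  # leave scripts with any other first line untouched


# Usage (path is a placeholder for a local copy of a FEELnc script):
# path = "scripts/FEELnc_classifier.pl"
# with open(path) as handle:
#     patched = fix_shebang(handle.read())
# with open(path, "w") as handle:
#     handle.write(patched)
```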

Can this change be made in the three scripts?

Best wishes,
Lel

merged.gtf file not working with FEELnc

Hi @tderrien @vwucher @flegeai
I have used your program with individual gtf files generated using cufflinks and it is working alright. I have used cuffmerge to combine different gtf files (technical replicates) but the merged.gtf file does not produce an output in the filter module. It creates an empty candidate_lncRNA.gtf file.

head merged.gtf

chr1	Cufflinks	exon	70097	70162	.	+	.	gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "HCLS1"; oId "NM_001031401"; nearest_ref "NM_001031401"; class_code "="; tss_id "TSS1"; p_id "P1";
chr1	Cufflinks	exon	72159	72242	.	+	.	gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "HCLS1"; oId "NM_001031401"; nearest_ref "NM_001031401"; class_code "="; tss_id "TSS1"; p_id "P1";
chr1	Cufflinks	exon	73496	73569	.	+	.	gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "3"; gene_name "HCLS1"; oId "NM_001031401"; nearest_ref "NM_001031401"; class_code "="; tss_id "TSS1"; p_id "P1";
chr1	Cufflinks	exon	73697	73826	.	+	.	gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "4"; gene_name "HCLS1"; oId "NM_001031401"; nearest_ref "NM_001031401"; class_code "="; tss_id "TSS1"; p_id "P1";

.....truncated..
What do I need to do to proceed with the merged file?

Thanks.

Can't locate object method "contains" via package "Bio::SeqFeature::Generic"

Hi, I tried running the test data. FEELnc_filter.pl and FEELnc_codpot.pl worked, but FEELnc_classifier.pl encountered an error.

Command:
FEELnc_classifier.pl -i feelnc_codpot_out/candidate_lncRNA.gtf.lncRNA.gtf -a annotation_chr38.gtf > candidate_lncRNA_classes.txt

Error:
window : 10000 - max window : 100000 - lncrna : feelnc_codpot_out/candidate_lncRNA.gtf.lncRNA.gtf - mrna : annotation_chr38.gtf - biotype: 0
Can't locate object method "contains" via package "Bio::SeqFeature::Generic" at /usr/local/share/perl/5.18.2/Bio/SeqFeature/Generic.pm line 865, line 1114.

It seems that the script can load Bio::SeqFeature::Generic but cannot locate the method "contains". The lines around the error:

864 } else {
865 if ( !$self->contains($feat) ) {
866 $self->throw("$feat is not contained within parent feature, and expansion is not valid");
867 }
868 }

$ perl -MBio::Perl -le 'print Bio::Perl->VERSION;'
1.006924

$ perl -version
This is perl 5, version 18, subversion 2 (v5.18.2) built for x86_64-linux-gnu-thread-multi
(with 41 registered patches, see perl -V for more detail)

How can I fix this error? Could you take a look?

Many thanks.

Question about transcript classification

Hello,

I am using the FEELnc_classifier.pl script to classify lncRNAs based on their genomic localization.
I am especially interested in natural antisense transcripts: In this case lncRNAs in antisense direction with at least a partial complementarity to corresponding (protein-coding or non-coding) transcripts in opposite direction.

I am wondering about what subset of the classifier output would suit my needs best:
Should I choose transcripts with an "antisense" direction, a "genic" type and an "exonic" location to get at least partial complementarity?
Can you recommend anything?
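
Whichever criteria turn out to be appropriate, the filter itself is simple to express. A minimal sketch of the selection proposed above, assuming the tab-separated classifier output with the header shown in an earlier issue on this page (isBest, lncRNA_gene, lncRNA_transcript, partnerRNA_gene, partnerRNA_transcript, direction, type, distance, subtype, location); the function names are made up for this example:

```python
import csv


def read_classes(lines):
    """Parse classifier output (tab-separated, with a header row)."""
    return list(csv.DictReader(lines, delimiter="\t"))


def antisense_exonic(rows):
    """Keep rows that are antisense, genic and exonic."""
    return [
        row for row in rows
        if row["direction"] == "antisense"
        and row["type"] == "genic"
        and row["location"] == "exonic"
    ]


# Usage (file name is a placeholder):
# with open("candidate_lncRNA_classes.txt", newline="") as handle:
#     hits = antisense_exonic(read_classes(handle))
# for row in hits:
#     print(row["lncRNA_transcript"], row["partnerRNA_gene"], row["subtype"])
```

Adding a condition on isBest would restrict the result to the best partner per lncRNA; whether that is desirable depends on the question being asked.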

Thank you very much for your help!

Error running FEELnc_codpot.pl

Hi,

I'm trying to find lncRNAs using FEELnc.
My pipeline consists of the following: Trimmomatic --> HISAT2 --> StringTie and StringTie assembly --> FEELnc filter, using the following command:
FEELnc_filter.pl -i new_merged.combined.gtf -a gencode.v38.annotation.gtf > new_candidate_model.gtf
This command ran without any issues. However, the following command gave an error:
FEELnc_codpot.pl -i new_candidate_model.gtf -a gencode.v38.annotation.gtf -b transcript_biotype=protein_coding -l gencode.v38.long_noncoding_RNAs.gtf -g /Users/username/Downloads/gencode.v38.transcripts.fa --mode=shuffle
The output from this last command was:

Multiple replications of the following:

Parse input file: [------------------------------------------------

And multiple different versions of the following:

ExtractFromFeature::feature2seq: your seq chr5:131954211-131954371 returns an empty string!...Check 'chr' prefix between your annotation and genome files or remove your genome index file ('/Users/username/Downloads/gencode.v38.transcripts.fa.index')...

where the chromosome (chr5, chr6, chr7, etc.) and the location on the chromosome differ from message to message.
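
The message itself suggests checking the 'chr' prefix between the annotation and the genome file. One way to run that check, sketched here with an invented function name, is to compare the first column of the GTF with the identifiers in the FASTA headers. Judging only from its name, the file given to -g in the command above (gencode.v38.transcripts.fa) may contain transcript sequences rather than chromosomes; if so, none of the chromosome names would be found in it. That is a guess from the file name, not something the output confirms.

```python
def seqname_mismatch(gtf_lines, fasta_lines):
    """Return sorted GTF sequence names that have no matching FASTA header ID."""
    gtf_names = {
        line.split("\t", 1)[0]
        for line in gtf_lines
        if line.strip() and not line.startswith("#")
    }
    # A FASTA ID is the first whitespace-delimited token after '>'
    fasta_ids = {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and len(line.strip()) > 1
    }
    return sorted(gtf_names - fasta_ids)


# Usage (file names are placeholders):
# with open("annotation.gtf") as gtf, open("genome.fa") as fasta:
#     missing = seqname_mismatch(gtf, fasta)
# print(missing)
```

An empty list means every sequence name used in the annotation exists in the FASTA file; any names listed are the ones that cannot be looked up.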

The output ends with:
The number of complete ORF found with computeORF mode is 0 transcripts... That's not enough to train the program

I also found GitHub issue #3, where it was suggested to remove the FASTA file from the command. However, when I attempt this, FEELnc outputs: Error: Cannot read your genome file '' (-g option)....

Also, it appears that all of the following are mandatory (and not optional, as stated on the GitHub page):
Usage: FEELnc_codpot.pl -i transcripts.GTF -a known_mRNA.GTF -g genome.FA -l known_lnc.GTF [options...]. Running FEELnc_codpot.pl without all of these options throws an error.

Can't locate Parser.pm in @INC

I installed FEELnc with conda, and when I ran the test command:
FEELnc_filter.pl -i transcript_chr38.gtf -a annotation_chr38.gtf -b transcript_biotype=protein_coding > candidate_lncRNA.gtf
an error occurred:

Can't locate Parser.pm in @INC (you may need to install the Parser module) (@INC contains: /home/CYD/perl5/lib/perl5/5.26.0/x86_64-linux-thread-multi /home/CYD/perl5/lib/perl5/5.26.0 /home/CYD/perl5/lib/perl5/x86_64-linux-thread-multi /home/CYD/perl5/lib/perl5 /home/CYD/perl5/lib/perl5/5.26.0/x86_64-linux-thread-multi /home/CYD/perl5/lib/perl5/5.26.0 /home/CYD/perl5/lib/perl5/x86_64-linux-thread-multi /home/CYD/perl5/lib/perl5 /home/CYD/software/anaconda3/lib/site_perl/5.26.0/x86_64-linux-thread-multi /home/CYD/software/anaconda3/lib/site_perl/5.26.0 /home/CYD/software/anaconda3/lib/5.26.0/x86_64-linux-thread-multi /home/CYD/software/anaconda3/lib/5.26.0 .) at /home/CYD/software/anaconda3/bin/FEELnc_filter.pl line 14.
BEGIN failed--compilation aborted at /home/CYD/software/anaconda3/bin/FEELnc_filter.pl line 14.

Which module should I install?

FEELnc_codpot - Illegal division by zero at RandomForest.pm line 481

Hello,

I am trying to run FEELnc_codpot, and I am getting two types of errors. The first error occurs after parsing the genome annotation file. I get the following message:

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet

I think this message is coming up because of short sequences that BioPerl can't decipher as DNA, RNA, or protein. However, I'm not sure what I need to edit to manually input that the alphabet should be set as DNA.
In case this information is relevant, the genome annotation file was converted from gff to gtf using "gffread -T", and I edited the seqname to match the chromosome names of the genome sequence file. However, I noticed that the resulting output file is missing any attribute information other than gene_id, transcript_id, and some gene_name. So, all exon_number and other information were lost.

The second error occurs after parsing the candidate lncRNA file made from FEELnc_filter. I get the following message stating that there was an illegal division by zero at line 481. I thought this error was due to transcripts with length of 0, so I tried deleting such transcripts from the filtered gtf, but this problem still persists.

Run random Forest on '/tmp//5646_candidate_lncRNA.test_rna.fa'
Illegal division by zero at /media/jolly/WD/FEELnc/lib/RandomForest.pm line 481, line 12196.
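
One way to double-check the candidate GTF for very short transcripts is to sum exon lengths per transcript_id. This is a minimal sketch in Python, not part of FEELnc; the function name `transcript_lengths` is hypothetical, and it assumes a standard 9-column GTF with 1-based inclusive coordinates and `exon` records:

```python
import re
from collections import defaultdict

# Matches the transcript_id attribute in the 9th GTF column.
TID_RE = re.compile(r'transcript_id "([^"]+)"')

def transcript_lengths(gtf_lines):
    """Return {transcript_id: summed exon length} from GTF lines."""
    lengths = defaultdict(int)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records contribute to the spliced length
        match = TID_RE.search(fields[8])
        if match:
            # GTF coordinates are 1-based and inclusive
            lengths[match.group(1)] += int(fields[4]) - int(fields[3]) + 1
    return dict(lengths)
```

Calling it as `transcript_lengths(open("candidate_lncRNA.gtf"))` and sorting the result by value would show the shortest transcripts first. Whether short transcripts are what triggers the division by zero is not confirmed here.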

Do you have any suggestions on how to resolve these issues?

Thank you,
HK

Problems with FEELnc_codpot.pl (unknown_transcript_1)

Hi, thanks for this software!

I am getting an error with FEELnc_codpot.pl, as follows:

FEELnc_codpot.pl \
>   -i $ANALYSISDIR/feelnc/candidate_lncRNA.gtf \
>   -a $DATADIR/reference/daphnia_genome.gtf \
>   -b transcript_biotype=protein_coding \
>   -g $DATADIR/reference/reference.fasta \
>   --mode=shuffle
Possible precedence issue with control flow operator at /home/joelnitta/kato_daphnia/bin/feelnc/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
You do not have specified a maximum number mRNAs transcripts for the training. Use all the annotation, can be long...
You do not have specified a maximum number lncRNA transcripts for the training. Use all the annotation, can be long...
> Extract ORFs/cDNAs for mRNAs from a GTF file
Parsing file 'daphnia_genome.gtf'...
Parse input file:             [----------------------------------------------------------------------------------------------------]
	Your input GTF file 'daphnia_genome.gtf' contains *35912* transcripts
	Extracting ORFs/cDNAs 35789/35912...
	Extracted '35789' ORF/cDNAs sequences on '35912'.

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------
> The lncRNA training file is not set. Get ORFs/cDNAs for lncRNAs by shuffling mRNA sequences
Input error: Invalid input file, expecting nucleotide sequence line on line 14910

Failed to run fasta_ushuffle:
/home/joelnitta/kato_daphnia/bin/feelnc/bin/fasta_ushuffle -k 7 -s 1234 -n 3 < /tmp//144030_candidate_lncRNA.gtf.oneLine.fa > /tmp//144030_candidate_lncRNA.gtf.ushuffle.3perm.fa

As you can see, it does not extract all the transcripts successfully.

I tried to find what was causing the error by saving the intermediate output with --keeptmp. I created a slightly smaller GTF file that only contained the 35789 transcripts that were successfully extracted and tried it on that, but still got the "sequence without letters" and "Input error: Invalid input file" errors.

I checked 144030_candidate_lncRNA.gtf.oneLine.fa, and sure enough there is an entry in the fasta at lines 14909 and 14910 like this:

>unknown_transcript_1

There are the same >unknown_transcript_1 followed by an empty line in 85227_candidate_lncRNA.gtf.coding_orf.fa and 85227_candidate_lncRNA.gtf.coding_rna.fa. However, I cannot determine where these are coming from.
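
Records like this (a header followed by no sequence) can be located by scanning the intermediate FASTA files. This is a minimal sketch in Python, not part of FEELnc; the function name `find_empty_records` is hypothetical:

```python
def find_empty_records(lines):
    """Return the IDs of FASTA records that have a header but no sequence."""
    empty = []
    current_id = None      # ID of the record being read, None before the first header
    has_sequence = False   # True once a non-blank sequence line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record
            if current_id is not None and not has_sequence:
                empty.append(current_id)
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            has_sequence = False
        elif line:
            has_sequence = True
    # Close the last record in the file
    if current_id is not None and not has_sequence:
        empty.append(current_id)
    return empty
```

It accepts any iterable of lines, so an open file handle on one of the `--keeptmp` files can be passed directly. It only reports which records are empty; it does not explain where they come from.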

My reference fasta and GTF files are from here: https://www.ncbi.nlm.nih.gov/assembly/GCF_003990815.1/

My input GTF file is the result of an isoseq pipeline. The most recent step that generated the GTF was SQANTI3, using the same reference genome for mapping.

I am using the conda environment described in the README (thanks for that! so nice not to have to install a bunch of things just to try out the pipeline).

I would really appreciate any suggestions! I can also send the data via email if that would help.

Best,

Joel

FEELnc_classifier.pl occured error.

Typo in ReadMe.md

One of the dependencies is listed as "Paralell::ForkManager", which should be "Parallel::ForkManager". It took me 10 minutes to figure out why cpan could not find this module. lol

error when running the classifier

Hi,
I am getting the following error when trying to run the classifier:
Argument "lncrna:" isn't numeric in numeric comparison (<=>) at /apps/BIOPERL/1.6.1/lib/perl5/Bio/DB/SeqFeature/Store/berkeleydb3.pm line 89.
Any idea how I can solve that?
Thank you in advance,
Jèssica

FEELnc_filter.pl -b transcript_biotype=protein_coding

Hello there,

I'm running the FEELnc_filter module in two approaches, as follows:

-i taco-new-assembly.gtf -a gencode.v40.primary_assembly.annotation.gtf -b transcript_biotype=protein_coding --monoex=0 -p 16 -o FEELnc_filter-1stStranded.log > 1st-strandedTFBfiltered.gtf &

and

-i taco-new-assembly.gtf -a gencode.v40.primary_assembly.annotation.gtf --monoex=0 -p 16 -o FEELnc_filter-1stStranded.log > 1st-strandedTFBfiltered.gtf &

When I checked the results, I got the same number of transcripts (35,395) with both approaches. I was expecting a larger number with the second approach, since I did not provide the option -b transcript_biotype=protein_coding. Why is that? I read the documentation here and on the internet and studied my data, but couldn't figure it out.

Could you help me, please?

Thanks in advance

Error when running FEELnc_codpot

When I was testing the installation with the test dataset, I ran:
FEELnc_codpot.pl -i candidate_lncRNA.gtf -a annotation_chr38.gtf -b transcript_biotype=protein_coding -g genome_chr38.fa --mode=shuffle

I have this error:

Undefined subroutine &Bio::DB::IndexedBase::_strip_crnl called at /usr/local/share/perl5/Bio/DB/Fasta.pm line 295

Could you please let me know what's the possible reason for this?

Thank you!

Kevin

segmentation fault

Hello I just ran into this problem while trying to execute codpot:

$ FEELnc_codpot.pl -i merge_evidence.gtf -a busco.genes.gtf -g nv_dovetail_4_gapped_chroms.final.fasta -m shuffle --outdir feelnc_shuffle 
You do not have specified a maximum number mRNAs transcripts for the training. Use all the annotation, can be long...
You do not have specified a maximum number lncRNA transcripts for the training. Use all the annotation, can be long...
> Extract ORFs/cDNAs for mRNAs from a GTF file
Parsing file 'busco.genes.gtf'...
Parse input file:             [----------------------------------------------------------------------------------------------------]
	Your input GTF file 'busco.genes.gtf' contains *1400* transcripts
	Extracting ORFs/cDNAs 1400/1400...
	Extracted '1400' ORF/cDNAs sequences on '1400'.
> The lncRNA training file is not set. Get ORFs/cDNAs for lncRNAs by shuffling mRNA sequences
	Extracting ORFs/cDNAs 4200/4200...
	Extracted '4200' ORF/cDNAs sequences on '4200'.
> Extract ORFs/cDNAs for candidates RNAs from a GTF file
Parsing file 'merge_evidence.gtf'...
Parse input file:             [----------------------------------------------------------------------------------------------------]
	Your input GTF file 'merge_evidence.gtf' contains *35900* transcripts
	Extracting ORFs/cDNAs 35893/35900...
	Extracted '35893' ORF/cDNAs sequences on '35900'.
> Run random Forest on '/tmp//93265_merge_evidence.gtf.test_rna.fa'
	1. Compute the size of each sequence and ORF
	2. Compute the kmer ratio for each kmer and put the output file name in a list
	3. Compute the kmer score for each kmer size on learning and test ORF
sh: line 1: 93359 Segmentation fault      (core dumped) /home/jdmontenegroc/Documents/bin/KmerInShort -file /tmp//93265_merge_evidence.gtf.coding_orf.fa.forRandomForest.fa -kval /tmp//93265_merge_evidence.gtf.kmerScoreValues_size1.tmp -nb-cores 1 -kmer-size 1 -dont-reverse -step 1 >> /tmp//93265_merge_evidence.gtf.coding_sequencesKmer_1_ScoreValues.tmp 2> /dev/null
sh: line 1: 93371 Segmentation fault      (core dumped) /home/jdmontenegroc/Documents/bin/KmerInShort -file /tmp//93265_merge_evidence.gtf.noncoding_orf.fa.forRandomForest.fa -kval /tmp//93265_merge_evidence.gtf.kmerScoreValues_size1.tmp -nb-cores 1 -kmer-size 1 -dont-reverse -step 1 >> /tmp//93265_merge_evidence.gtf.noncoding_sequencesKmer_1_ScoreValues.tmp 2> /dev/null
sh: line 1: 93387 Segmentation fault      (core dumped) /home/jdmontenegroc/Documents/bin/KmerInShort -file /tmp//93265_merge_evidence.gtf.test_orf.fa -kval /tmp//93265_merge_evidence.gtf.kmerScoreValues_size1.tmp -nb-cores 1 -kmer-size 1 -dont-reverse -step 1 >> /tmp//93265_merge_evidence.gtf.test_sequencesKmer_1_ScoreValues.tmp 2> /dev/null
sh: line 1: 93400 Segmentation fault      (core dumped) /home/jdmontenegroc/Documents/bin/KmerInShort -file /tmp//93265_merge_evidence.gtf.coding_orf.fa.forRandomForest.fa -kval /tmp//93265_merge_evidence.gtf.kmerScoreValues_size2.tmp -nb-cores 1 -kmer-size 2 -dont-reverse -step 1 >> /tmp//93265_merge_evidence.gtf.coding_sequencesKmer_2_ScoreValues.tmp 2> /dev/null
sh: line 1: 93411 Segmentation fault      (core dumped) /home/jdmontenegroc/Documents/bin/KmerInShort -file /tmp//93265_merge_evidence.gtf.noncoding_orf.fa.forRandomForest.fa -kval /tmp//93265_merge_evidence.gtf.kmerScoreValues_size2.tmp -nb-cores 1 -kmer-size 2 -dont-reverse -step 1 >> /tmp//93265_merge_evidence.gtf.noncoding_sequencesKmer_2_ScoreValues.tmp 2> /dev/null
sh: line 1: 93423 Segmentation fault      (core dumped) /home/jdmontenegroc/Documents/bin/KmerInShort -file /tmp//93265_merge_evidence.gtf.test_orf.fa -kval /tmp//93265_merge_evidence.gtf.kmerScoreValues_size2.tmp -nb-cores 1 -kmer-size 2 -dont-reverse -step 1 >> /tmp//93265_merge_evidence.gtf.test_sequencesKmer_2_ScoreValues.tmp 2> /dev/null
sh: line 1: 93434 Segmentation fault      (core dumped) /home/jdmontenegroc/Documents/bin/KmerInShort -file /tmp//93265_merge_evidence.gtf.coding_orf.fa.forRandomForest.fa -kval /tmp//93265_merge_evidence.gtf.kmerScoreValues_size3.tmp -nb-cores 1 -kmer-size 3 -dont-reverse -step 3 >> /tmp//93265_merge_evidence.gtf.coding_sequencesKmer_3_ScoreValues.tmp 2> /dev/null
sh: line 1: 93446 Segmentation fault      (core dumped) /home/jdmontenegroc/Documents/bin/KmerInShort -file /tmp//93265_merge_evidence.gtf.noncoding_orf.fa.forRandomForest.fa -kval /tmp//93265_merge_evidence.gtf.kmerScoreValues_size3.tmp -nb-cores 1 -kmer-size 3 -dont-reverse -step 3 >> /tmp//93265_merge_evidence.gtf.noncoding_sequencesKmer_3_ScoreValues.tmp 2> /dev/null
sh: line 1: 93459 Segmentation fault      (core dumped) /home/jdmontenegroc/Documents/bin/KmerInShort -file /tmp//93265_merge_evidence.gtf.test_orf.fa -kval /tmp//93265_merge_evidence.gtf.kmerScoreValues_size3.tmp -nb-cores 1 -kmer-size 3 -dont-reverse -step 3 >> /tmp//93265_merge_evidence.gtf.test_sequencesKmer_3_ScoreValues.tmp 2> /dev/null
sh: line 1: 93471 Segmentation fault      (core dumped) /home/jdmontenegroc/Documents/bin/KmerInShort -file /tmp//93265_merge_evidence.gtf.coding_orf.fa.forRandomForest.fa -kval /tmp//93265_merge_evidence.gtf.kmerScoreValues_size6.tmp -nb-cores 1 -kmer-size 6 -dont-reverse -step 3 >> /tmp//93265_merge_evidence.gtf.coding_sequencesKmer_6_ScoreValues.tmp 2> /dev/null
sh: line 1: 93485 Segmentation fault      (core dumped) /home/jdmontenegroc/Documents/bin/KmerInShort -file /tmp//93265_merge_evidence.gtf.noncoding_orf.fa.forRandomForest.fa -kval /tmp//93265_merge_evidence.gtf.kmerScoreValues_size6.tmp -nb-cores 1 -kmer-size 6 -dont-reverse -step 3 >> /tmp//93265_merge_evidence.gtf.noncoding_sequencesKmer_6_ScoreValues.tmp 2> /dev/null
sh: line 1: 93495 Segmentation fault      (core dumped) /home/jdmontenegroc/Documents/bin/KmerInShort -file /tmp//93265_merge_evidence.gtf.test_orf.fa -kval /tmp//93265_merge_evidence.gtf.kmerScoreValues_size6.tmp -nb-cores 1 -kmer-size 6 -dont-reverse -step 3 >> /tmp//93265_merge_evidence.gtf.test_sequencesKmer_6_ScoreValues.tmp 2> /dev/null
sh: line 1: 93505 Segmentation fault      (core dumped) /home/jdmontenegroc/Documents/bin/KmerInShort -file /tmp//93265_merge_evidence.gtf.coding_orf.fa.forRandomForest.fa -kval /tmp//93265_merge_evidence.gtf.kmerScoreValues_size9.tmp -nb-cores 1 -kmer-size 9 -dont-reverse -step 3 >> /tmp//93265_merge_evidence.gtf.coding_sequencesKmer_9_ScoreValues.tmp 2> /dev/null
sh: line 1: 93516 Segmentation fault      (core dumped) /home/jdmontenegroc/Documents/bin/KmerInShort -file /tmp//93265_merge_evidence.gtf.noncoding_orf.fa.forRandomForest.fa -kval /tmp//93265_merge_evidence.gtf.kmerScoreValues_size9.tmp -nb-cores 1 -kmer-size 9 -dont-reverse -step 3 >> /tmp//93265_merge_evidence.gtf.noncoding_sequencesKmer_9_ScoreValues.tmp 2> /dev/null
sh: line 1: 93527 Segmentation fault      (core dumped) /home/jdmontenegroc/Documents/bin/KmerInShort -file /tmp//93265_merge_evidence.gtf.test_orf.fa -kval /tmp//93265_merge_evidence.gtf.kmerScoreValues_size9.tmp -nb-cores 1 -kmer-size 9 -dont-reverse -step 3 >> /tmp//93265_merge_evidence.gtf.test_sequencesKmer_9_ScoreValues.tmp 2> /dev/null
sh: line 1: 93541 Segmentation fault      (core dumped) /home/jdmontenegroc/Documents/bin/KmerInShort -file /tmp//93265_merge_evidence.gtf.coding_orf.fa.forRandomForest.fa -kval /tmp//93265_merge_evidence.gtf.kmerScoreValues_size12.tmp -nb-cores 1 -kmer-size 12 -dont-reverse -step 3 >> /tmp//93265_merge_evidence.gtf.coding_sequencesKmer_12_ScoreValues.tmp 2> /dev/null
sh: line 1: 93547 Segmentation fault      (core dumped) /home/jdmontenegroc/Documents/bin/KmerInShort -file /tmp//93265_merge_evidence.gtf.noncoding_orf.fa.forRandomForest.fa -kval /tmp//93265_merge_evidence.gtf.kmerScoreValues_size12.tmp -nb-cores 1 -kmer-size 12 -dont-reverse -step 3 >> /tmp//93265_merge_evidence.gtf.noncoding_sequencesKmer_12_ScoreValues.tmp 2> /dev/null

I am getting a ton of segfaults, but I am not sure why. I am running on Manjaro Linux:

NAME="Manjaro Linux"
ID=manjaro
ID_LIKE=arch
BUILD_ID=rolling
PRETTY_NAME="Manjaro Linux"
ANSI_COLOR="32;1;24;144;200"
HOME_URL="https://manjaro.org/"
DOCUMENTATION_URL="https://wiki.manjaro.org/"
SUPPORT_URL="https://manjaro.org/"
BUG_REPORT_URL="https://bugs.manjaro.org/"
LOGO=manjarolinux

with perl5.32 and R4.0.3.

Any help would be more than welcome.

Kind regards,

Juan D. Montenegro

Error during FEELnc_codpot.pl running

Dear FEELnc,

I got an error during my run.
The same error occurs in both shuffle and intergenic mode.
Do you have any idea what causes it?
Thanks!

FEELnc_codpot.pl -i candidate_lncRNA.gtf -a ../../55_pasa/pasa_all2.assemblies.fasta.transdecoder.genome.complete.mrna.fasta -g ../../1_fasta/7_final_chr/iceplant_chr.fasta_sorted --mode=shuffle

FEELnc_codpot.pl -i candidate_lncRNA.gtf -a ../../55_pasa/pasa_all2.pasa_assemblies.gtf -g ../../1_fasta/7_final_chr/iceplant_chr.fasta_sorted --mode=intergenic

Your input GTF file 'pasa_all2.pasa_assemblies.gtf' contains 44844 transcripts

------------- EXCEPTION -------------
MSG: Each line of the qual file must be less than 65,536 characters. Line 18 is 35722859 chars.
STACK Bio::DB::IndexedBase::_check_linelength /home/wyim/scratch/bin/FEELnc/lib/Bio/DB/IndexedBase.pm:744
STACK Bio::DB::Fasta::_calculate_offsets /home/wyim/scratch/bin/perl5/perlbrew/perls/perl-5.18.2/lib/site_perl/5.18.2/Bio/DB/Fasta.pm:227
STACK Bio::DB::IndexedBase::_index_files /home/wyim/scratch/bin/FEELnc/lib/Bio/DB/IndexedBase.pm:648
STACK Bio::DB::IndexedBase::index_file /home/wyim/scratch/bin/FEELnc/lib/Bio/DB/IndexedBase.pm:484
STACK Bio::DB::IndexedBase::new /home/wyim/scratch/bin/FEELnc/lib/Bio/DB/IndexedBase.pm:364
STACK ExtractFromFeature::feature2seq /home/wyim/scratch/bin/FEELnc/lib/ExtractFromFeature.pm:399
STACK ExtractCdnaOrf::CreateORFcDNAFromGTF /home/wyim/scratch/bin/FEELnc/lib/ExtractCdnaOrf.pm:303
STACK toplevel /home/wyim/scratch/bin/FEELnc/scripts/FEELnc_codpot.pl:298
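
The message reports that line 18 of the indexed file is about 35.7 million characters long, which suggests a whole sequence stored on one line. One possible workaround is to re-wrap the genome FASTA to short lines before running FEELnc_codpot.pl. This is a minimal sketch in Python, not part of FEELnc; the function name `wrap_fasta` and the 60-column width are assumptions, and it has not been confirmed that this resolves the exception:

```python
def wrap_fasta(lines, width=60):
    """Yield FASTA lines with each record's sequence re-wrapped to `width` columns."""
    chunks = []  # sequence pieces of the record currently being read

    def flush():
        # Emit the buffered sequence in fixed-width slices, then reset the buffer
        seq = "".join(chunks)
        chunks.clear()
        for i in range(0, len(seq), width):
            yield seq[i:i + width]

    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            yield from flush()   # finish the previous record
            yield line           # headers are passed through unchanged
        elif line:
            chunks.append(line.strip())
    yield from flush()           # finish the last record
```

Each output line would then be written with a trailing newline to a new file, which is passed to `-g` instead of the original. The sketch keeps one record in memory at a time, which should be manageable for chromosome-sized sequences.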

codpot performance

Dear All

In FEELnc_codpot.pl, I used lncRNA sequences to train the algorithm. In the ROC curve plot, the mRNA specificity curve is shifted when I run with the -l flag (training on a lncRNA FASTA file) compared with --mode=shuffle or --mode=intergenic. Why does the mRNA specificity curve start at a coding probability of 0.2? Judging from the curves, which method is better: training with lncRNA.fasta or --mode=shuffle?
Thanks

Many protein coding transcripts in the reference annotation were classified as LncRNA ?

Hi,
I am using FEELnc to profile the lncRNA-ome of porcine tissues. The full set of commands was as follows:

#=========================================================
1)FEELnc_filter.pl -p 1 -i ./merged.gtf -a ./Sus_scrofa.Sscrofa11.1.90.gtf -b transcript_biotype=protein_coding > candidate_lncRNA.gtf

2)FEELnc_codpot.pl -i candidate_lncRNA.gtf -a ./Sus_scrofa.Sscrofa11.1.90.gtf -b transcript_biotype=protein_coding -g ./Sus_11.1.fa --mode=shuffle

3)FEELnc_classifier.pl -i ./feelnc_codpot_out/candidate_lncRNA.gtf.lncRNA.gtf -a ./Sus_scrofa.Sscrofa11.1.90.gtf > lncRNA_classes.txt
#==========================================================
I found that some transcript loci in the merged.gtf file that perfectly match protein-coding transcript models in the reference annotation were retained in the candidate_lncRNA.gtf file. Their class_code was "=" in merged.gtf. After the codpot module, these transcript loci were also present in the candidate_lncRNA.gtf.lncRNA.gtf file.
How does this happen? What are these transcript loci: lncRNA or protein-coding mRNA?

Installation

I think the installation is very difficult for someone new to Linux. I could not install it with either conda or git clone. Could you provide an easier way to install it?

Question about lncRNA transcript classification

Hello,

I'm studying lncRNAs and used the NONCODE database (http://www.noncode.org) for my own study. Unfortunately, this database provides only a FASTA file and no GTF file.
However, classifying lncRNAs with the following command requires a GTF file:

FEELnc_classifier.pl -i lncRNA.gtf -a ref_annotation.GTF > lncRNA_classes.txt

So, how can I use the FASTA file from this database to classify the lncRNAs?
Could an option be added to FEELnc to accept a FASTA file?

Error in Running FEELnc_codpot.pl

$ perl FEELnc_codpot.pl -i Candidate_LNC.gtf -a Oryza_sativa.gtf -b transcript_biotype=protein_coding -g all.fa --mode=shuffle
Possible precedence issue with control flow operator at /usr/local/share/perl/5.20.2/Bio/DB/IndexedBase.pm line 839.
Subroutine Bio::DB::IndexedBase::_strip_crnl redefined at /usr/lib/x86_64-linux-gnu/perl/5.20/DynaLoader.pm line 210.
You do not have specified a maximum number mRNAs transcripts for the training. Use all the annotation, can be long...
You do not have specified a maximum number lncRNA transcripts for the training. Use all the annotation, can be long...
Error: The environnment variable FEELNCPATH does not reach the 'utils/codpot_randomforest.r' script

Can the FEELnc pipeline be used without a reference genome?

Hello,
The README.md file says that this pipeline can be used without a reference genome. But I read the whole file and found that it needs ref_annotation.GTF and ref_genome.FASTA as mandatory arguments.

transcript_biotype

Hi Thomas,
Just to let you know that in GENCODE release 39 of the human annotation, from last December, the gene_biotype and transcript_biotype fields became gene_type and transcript_type. As far as I remember, FEELnc reads this field. Might this cause problems?
See you soon ;)
Kévin Muret
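
To see which of these attribute names a given annotation actually uses before choosing the value for `-b`, the GTF attribute column can be scanned. This is a minimal sketch in Python, not part of FEELnc; the function name `biotype_keys` is hypothetical, and only the four attribute names mentioned above are counted:

```python
import re
from collections import Counter

# Matches `key "value"` pairs in the 9th GTF column.
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')

BIOTYPE_KEYS = ("gene_biotype", "transcript_biotype", "gene_type", "transcript_type")

def biotype_keys(gtf_lines):
    """Count how often each biotype-style attribute key appears in GTF lines."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        for key, _value in ATTR_RE.findall(fields[8]):
            if key in BIOTYPE_KEYS:
                counts[key] += 1
    return counts
```

If the counts show `transcript_type` rather than `transcript_biotype`, the filter option would presumably be written as `-b transcript_type=protein_coding`.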

mRNA training set

Dear Sir/Madam,

I am very interested in using this tool on a non-model organism whose transcriptome was recently assembled. I understand that in order to do this, I need to supply a training set of reliable protein-coding genes, hence my question: what would be the best approach to getting a training set?
My first idea was to align the transcripts to a curated database like UniProt/Swiss-Prot and keep transcripts with >=75% identity over 90% of the transcript length. Then I remembered that this would exclude coding transcripts that are truncated due to the assembly protocol, library prep or other reasons, so I figured I could use the predictions made by BUSCO as a training set. What do you think? Would this be a sensible approach? Could you suggest some other approach?

Best regards,

Juan Montenegro

FEELnc_filter.pl kicks out all monoexonic transcript although --monoex was set to -1

Hi @tderrien,

I have a set of novel transcripts from stringtie. I prefiltered the candidates with gffcompare and only used class codes i, u and x and fed those into the FEELnc pipeline. I have strand-specific data hence I set --monoex=-1 but it still kicks out ALL monoexonic RNA candidates.
So my command was
FEELnc_filter.pl -i prefilteredCandidateTrans.gtf -a gencode.v27.annotation_proteinCoding.gtf --monoex=-1 -p 20
I also tried the line below but I wasn't sure about the usage of -b - I am using gencode annotation, which uses transcript_type instead of transcript_biotype
FEELnc_filter.pl -i prefilteredCandidateTrans.gtf -a gencode.v27.annotation.gtf -b transcript_type=protein_coding --monoex=-1 -p 20
I am using v0.1.0.

Thanks for your help!

error calculating the coding potential

Hi Thomas
Joao from CRG here, I hope everything is fine with you!

I'm trying your tool but I get an error when running "FEELnc_codpot.pl"

"Use of uninitialized value $fileno in array element at /usr/local/share/perl5/Bio/DB/IndexedBase.pm line 1022."

We just had a system upgrade, maybe incompatible packages?

thanks
João

lncRNA gene counts

Hi,

I am trying to find gene counts from lncRNAs to create a lncRNA signature which can differentiate between patients with different conditions and I want to use htseq-count to count lncRNA reads in the samples that I have. Htseq-count of course requires you to provide a gtf file, however, it is unclear to me which one I should provide to count lncRNA reads specifically.
Should I use the (gencode) full annotation file, combine this with the feelnc lncRNAs which have been found using the codpot module, let htseq-count do its work and then select all lncRNAs to see what the specific counts of these reads are? Or should I use the discovered lncRNAs from feelnc only?

Thanks in advance!

The number of complete ORF found with computeORF mode is 0 transcripts...

Hi,

When I run FEELnc_codpot.pl (with standard options) on my dataset, I get the following error message:
The number of complete ORF found with computeORF mode is 0 transcripts... That's not enough to train the program

The feelnc_codpot_out directory is empty.

Is there a way to monitor what computeORF is doing?

Cheers,

Christophe

--------------------- WARNING --------------------- MSG: Got a sequence without letters. Could not guess alphabet

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet