clemgoub / dnapipete

dnaPipeTE (de-novo assembly & annotation Pipeline for Transposable Elements) is a pipeline designed to find, annotate and quantify Transposable Elements in small samples of NGS datasets. It is useful for quantifying the proportion of TEs in newly sequenced genomes, since it does not require a genome assembly and works on small datasets (< 1X).

transposable-elements annotations assembly trinity annotation-pipeline repeatmasker pipeline bioinformatics genomics

dnapipete's Issues

Reading Trinity output files

Hi,
I'm having some trouble understanding the repeat lengths given in the Trinity.fasta.out files.
Under the repeat 'begin' and 'end' columns (which I assume give the length when one is subtracted from the other), some values appear in parentheses () and some do not make sense (e.g. begin larger than end).
Could you please clarify?
Cheers
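
A minimal sketch of how such a line could be read, assuming Trinity.fasta.out follows the standard RepeatMasker .out layout. In that layout a value in parentheses is the number of bases remaining after the match ("left"), not a coordinate, and for hits on the complement strand (strand column "C") the three repeat columns are ordered (left), end, begin instead of begin, end, (left). The function name, the dictionary keys and the example line are illustrative only.

```python
def parse_rm_out_line(line):
    """Parse one data line of a RepeatMasker .out file into a dict.

    Assumes the standard whitespace-separated layout:
    score, div, del, ins, query, q_begin, q_end, q_(left), strand,
    repeat, class/family, then three repeat-position columns.
    """
    f = line.split()

    def strip_parens(value):
        # "(22462)" -> 22462; parentheses mark "bases left", not a coordinate
        return int(value.strip("()"))

    strand = f[8]  # "+" or "C" (complement)
    if strand == "C":
        # Complement hits list the repeat columns as (left), end, begin
        rep_left, rep_end, rep_begin = strip_parens(f[11]), int(f[12]), int(f[13])
    else:
        rep_begin, rep_end, rep_left = int(f[11]), int(f[12]), strip_parens(f[13])

    q_begin, q_end = int(f[5]), int(f[6])
    return {
        "query": f[4],
        "query_begin": q_begin,
        "query_end": q_end,
        "query_length": q_end - q_begin + 1,
        "strand": strand,
        "repeat": f[9],
        "repeat_class": f[10],
        "repeat_begin": rep_begin,
        "repeat_end": rep_end,
        "repeat_left": rep_left,
        "repeat_length": rep_end - rep_begin + 1,
    }


# Illustrative line (made up for this example, not taken from a real run)
example = "1320 15.6 6.2 0.0 contig_1 6563 6781 (22462) C MER7A DNA/MER2_type (0) 337 104 1"
hit = parse_rm_out_line(example)
print(hit["query_length"], hit["repeat_length"])
```

With this reading, the length of the match on the contig comes from the query begin/end columns, and the repeat columns describe the position within the library consensus.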

Repbase Libraries

I've successfully installed dnaPipeTE on a cluster and my home computer, but I am still getting this message when I run test_config.sh:

Testing RepeatMasker Libraries...
RepeatMasker.lib doesn't include the Repbase sequences! Follow instruction to install RepeatMasker libraries on https://github.com/clemgoub/dnaPipeTE

This is what the library directory looks like on the cluster:

Libraries]$ ls -l
total 1599056
-rw-r--r-- 1 212876 Jan 28 2017 DfamConsensus.embl
-rw-r--r-- 1 1510837557 Sep 24 2015 Dfam.hmm
-rw-r--r-- 1 214 Jan 28 2017 README.meta
-rw-r--r-- 1 22475384 Aug 7 2015 RepeatAnnotationData.pm
-rw-r--r-- 1 129371 Sep 5 13:04 RepeatMasker.lib
-rw-r--r-- 1 209744 Sep 5 13:04 RepeatMaskerLib.embl
-rw-r--r-- 1 43039 Sep 5 13:09 RepeatMasker.lib.nhr
-rw-r--r-- 1 5272 Sep 5 13:09 RepeatMasker.lib.nin
-rw-r--r-- 1 28191 Sep 5 13:09 RepeatMasker.lib.nsq
-rw-r--r-- 1 0325957 Jan 31 2014 RepeatPeps.lib
-rw-r--r-- 1 1516945 Sep 5 13:09 RepeatPeps.lib.phr
-rw-r--r-- 1 84520 Sep 5 13:09 RepeatPeps.lib.pin
-rw-r--r-- 1 9463396 Sep 5 13:09 RepeatPeps.lib.psq
-rw-r--r-- 1 4401 May 28 2009 RepeatPeps.readme
-rw-r--r-- 1 17204287 Jan 28 2017 RMRBMeta.embl
-rw-r--r-- 1 64450715 Aug 29 2016 taxonomy.dat

I have a working GIRI login. Any advice?

Error building graphs

Hi, I'm trying to test the installation. All the programs are working.

However, I'm having trouble getting dnaPipeTE.py to work. I ran the test but got an error. I looked at the code and tried to understand it, but it's getting hard.

Here are the last lines of the output:

#######################################################
### Estimation of Repeat content from blast outputs ###
#######################################################
parsing blastout and adding RM annotations for each read...
Done, results in: blast_out/blastout_final_fmtd_annoted
#########################################
### OK, lets build some pretty graphs ###
#########################################
Drawing graphs...
null device 
          1 
null device 
          1 
null device 
          1 
null device 
          1 
Loading required package: methods
Error in read.table(file1) : no lines available in input
Execution halted
Done
Removing Trinity runs files...
done
Finishin time: Fri Feb 21 16:18:18 2020
########################
#   see you soon !!!   #
########################

At first I thought it was an R library problem, but then I realized that some files needed for the plotting are empty (sorted_families, reads_landscape and factors_and_colors).

Any idea what's going on, Clement?

Thanks,

Nicolás
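
A quick way to confirm which plotting inputs are empty before R is called could look like the sketch below. The three file names come from the report above; where they sit inside the output folder is not stated here, so the paths are passed in by the caller and the helper name is hypothetical.

```python
import os


def empty_or_missing(paths):
    """Return the paths that do not exist or have zero size."""
    return [p for p in paths if not os.path.isfile(p) or os.path.getsize(p) == 0]


if __name__ == "__main__":
    # Adjust to wherever these files are in your own output folder
    candidates = ["sorted_families", "reads_landscape", "factors_and_colors"]
    for path in empty_or_missing(candidates):
        print("empty or missing:", path)
```

An empty file here points upstream of R: read.table fails with "no lines available in input" whenever it is given a zero-length file.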

trinity install as part of dnapipeTE

The command I ran is ./init.sh

I'm getting the following error during the trinity install:

g++ -W -Wall -Wno-unused -Wno-deprecated -ansi -pedantic -Wno-long-long -fno-nonansi-builtins -Wctor-dtor-privacy -Wsign-promo -Woverloaded-virtual -Wendif-labels -O3 -ggdb3 -DMAKE_DATE='"Wed 17 Mar 2021 08:40:09 PM UTC "' -DMAKE_OS_RELEASE='"5.3.0-64-generic"' -DMAKE_RELEASE='"3.0"' -DNEW_MAKEFILE -imacros system/BigFileDefines.h -pthread -ftemplate-depth-30 -fno-strict-aliasing -mieee-fp -fopenmp -c ./aligns/KmerAlignCore.cc -o obj/aligns/KmerAlignCore.o
./aligns/KmerAlignCore.cc:6:10: fatal error: aligns/KmerAlignCore.h: No such file or directory
6 | #include "aligns/KmerAlignCore.h"
| ^~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make[1]: *** [Makefile:474: aligns/KmerAlignCore.o] Error 1

From what I can tell, and that is not a lot, this is perhaps related to newer versions of gcc being incompatible with something about this. I can get Trinity installed just fine in a conda environment, but can't figure out how to get dnaPipeTE to look for that instead of its own specific install. If you know either a) how to fix this error or b) how to make dnaPipeTE use a system install of Trinity (if it even can, I have no idea how dependent it is on that particular version of Trinity) I would appreciate it.

Best_RM_annot_80

Hi,
I'm trying to work with the Best_RM_annot_80 files.
My problem is I can't really make sense of some of the columns. I also couldn't find any hint of what the headers for that file might be.
Is there a good source for that? I've looked at the RM website as well as the dnaPipeTE README file and several articles.

Thanks in advance

Figuring out how Counts.txt is calculated

Hello,

I am trying to understand the dnaPipeTE outputs. Specifically, I am trying to figure out how the numbers in Counts.txt are calculated. When I look at the reads_per_component_and_annotation file and simply sum the numbers in the second (base pairs) column for the relevantly annotated rows, I can't seem to arrive at the same number as dnaPipeTE.
For example, in Counts.txt I find
Simple_repeat 34570800
However, if I add up all rows assigned as Simple_repeat in reads_per_component_and_annotation, I get:
Simple_repeat 10408258

Similarly, in Counts.txt I have:
LTR 376818

But when I look at reads_per_component_and_annotation and add up all rows with the "LTR" annotation, I get:
LTR 149042

Is there some normalization involved?

The reason I am asking is that I will probably need to manually annotate some un-annotated and mis-annotated contigs, so I would like to know what the next steps in the pipeline are once I am done with the reannotation.

Thanks,
Markéta
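
The manual sum described above can be reproduced with a short sketch. It only adds up the base-pair column per annotation string and does not attempt to reproduce how dnaPipeTE itself builds Counts.txt. The column positions are assumptions: base pairs in the second column as stated above, and the annotation in the fourth column; both are parameters so they can be changed if the file is laid out differently.

```python
from collections import defaultdict


def bp_per_annotation(rows, bp_col=1, annot_col=3):
    """Sum the base-pair column for each annotation string.

    Column indexes are zero-based and are assumptions about the file layout.
    """
    totals = defaultdict(int)
    for row in rows:
        fields = row.split()
        if len(fields) <= max(bp_col, annot_col):
            continue  # skip short or blank lines
        try:
            bp = int(float(fields[bp_col]))
        except ValueError:
            continue  # skip header or non-numeric lines
        totals[fields[annot_col]] += bp
    return dict(totals)


if __name__ == "__main__":
    with open("reads_per_component_and_annotation") as handle:
        for annotation, bp in sorted(bp_per_annotation(handle).items()):
            print(annotation, bp)
```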

influence of the input parameters on estimated TE loads

Hello again,

We (@reubwn, @jensbast and me) are wondering about the influence of the input parameters (genome size and sampling coverage) on the estimate of the total TE load.

We started digging into this because dnaPipeTE finds far fewer TEs than assembly-based approaches in bdelloid rotifers (these reads: ERR2135445; genomes are published here). They are difficult genomes, degenerate tetraploids, so it was not clear to us what genome size we should use. We tested three values of input genome size and coverage and, basically, the more nucleotides we sample, the higher the fraction of TEs annotated. You can read the details in this thread.

We hoped that it was just the rotifer being weird, but we tried the same with a far more sane genome, a parasitoid wasp (SRR7028347), and the pattern was much the same: the more nucleotides sampled, the higher the TE load detected (details later in the same thread).

This is a bit worrying to us, as the TE load should not depend on the depth of sampling (at least we would hope not). We are running out of ideas for how to explain the pattern. Could you help us make sense of it?

Best,
Kamil

Not all contigs in reads_per_component_and_annotation

Hi,
I want to know the read count for every contig. There are 19 contigs in Trinity.fasta but only 17 in reads_per_component_and_annotation. One of the 2 remaining contigs even has a high proportion of its length covered by a RepeatMasker annotation. I'm wondering whether the 2 contigs disappeared because of a low read count, and whether any filtering threshold was used. Thanks.

Chenjiaqi

missing 'reads_per_component_and_annotation' in output folder

Hello!

I have run dnaPipeTE on a few different species, most of them around 0.1x coverage. My outputs include everything that should be there, except for the reads_per_component_and_annotation file, and the graphs. The graphs are to be expected as I am running from a supercomputer cluster without R. Do I need R for the reads_per_component_and_annotation file to generate as well?
Is it safe to just use the Counts.txt file to determine the percentage of a TE class in the genome? Is there a way to determine superfamily annotations without this file?

Hopefully this makes sense, I am very new to this. Thank you!

Error getting after Run

awk: fatal: cannot open file /home/gnomeadmin/te_analysis/dnaPipeTE/dnaPipeTE//Trinity_run0/chrysalis/readsToComponents.out.sort' for reading (No such file or directory)
sed: can't read /home/gnomeadmin/te_analysis/dnaPipeTE/dnaPipeTE//Trinity_run1/Trinity.fasta: No such file or directory
awk: fatal: cannot open file /home/gnomeadmin/te_analysis/dnaPipeTE/dnaPipeTE//Trinity_run1/Trinity.fasta' for reading (No such file or directory)
Traceback (most recent call last):
  File "./dnaPipeTE.py", line 698, in <module>
    RepeatMasker(config['DEFAULT']['RepeatMasker'], args.RepeatMasker_library, args.RM_species, args.cpu, args.output_folder, args.RM_threshold)
  File "./dnaPipeTE.py", line 381, in __init__
    self.repeatmasker_run()
  File "./dnaPipeTE.py", line 400, in repeatmasker_run
    with open(self.output_folder+"/Trinity.fasta.out", 'r') as trinity_handle:
FileNotFoundError: [Errno 2] No such file or directory: '/home/gnomeadmin/te_analysis/dnaPipeTE/dnaPipeTE//Trinity.fasta.out'

Using sample_size option

Hi,
I'm trying to subsample some of the reads in my fastq file using the sample_size option.
Does this option sample reads randomly or by order of appearance in the fastq file? If it is not random by default, is there a way to make it so?

Thanks
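
For anyone who wants a random subsample regardless of how the option behaves, reads can be drawn before running the pipeline. The sketch below uses reservoir sampling over four-line FASTQ records. It is not a description of what dnaPipeTE does internally, and the function name and seed parameter are illustrative.

```python
import random


def sample_fastq_records(lines, k, seed=None):
    """Reservoir-sample k four-line FASTQ records from an iterable of lines.

    Every record has the same probability of being kept, and the input is
    read only once, so the whole file never has to fit in memory.
    """
    rng = random.Random(seed)
    reservoir = []
    record = []
    seen = 0
    for line in lines:
        record.append(line)
        if len(record) == 4:  # header, sequence, '+', qualities
            seen += 1
            if len(reservoir) < k:
                reservoir.append(record)
            else:
                j = rng.randrange(seen)
                if j < k:
                    reservoir[j] = record
            record = []
    return reservoir


if __name__ == "__main__":
    with open("reads.fastq") as handle, open("sample.fastq", "w") as out:
        for rec in sample_fastq_records(handle, k=100000, seed=1):
            out.writelines(rec)
```

Fixing the seed makes the subsample reproducible; leaving it as None gives a different sample on each run.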

Issues installing on a cluster/HPC

Hi, I use dnaPipeTE frequently and it's great. I would like to get it working on my university's cluster, but that requires some changes which I cannot work out. Specifically, my cluster has RepeatMasker and Trinity already installed through a module loading system, so I cannot install them again but can use the preinstalled copies. How do I change dnaPipeTE to use the Trinity and RepeatMasker that are already in the environment/path (as in, I can use Trinity by just typing Trinity)?

Any help would be great.

Thanks

Re-running analysis

Hi again Clement!
I hope you are OK.
I am writing to ask you about something. I managed to classify some Trinity contigs that were not classified by dnaPipeTE, and I was wondering if there is a way to re-run part of the analysis carried out by the program without running the whole pipeline again. Specifically, I would like to re-calculate the Counts.txt file and produce a new repeat landscape.
I believe it is possible because I have been reading your code and, if I understood correctly, the main statistics for all contigs are already calculated, such as the % divergence of each read from its contig and the number of mapping reads and bases. I guess I could do it by adding the reads mapping to this newly classified contig to the reads_landscape file from sorted.reads_vs_unannoted.blast.out and re-running the Rscript (you already explained to me how to do so, hehe).
For the Counts file, I think I could do one of two things: either calculate the new counts by hand, adding the bp of this contig to the corresponding class and subtracting them from "Unclassified", or try to re-run the function "count" that is inside the main Python script.
If I follow the second alternative, would I have to move the reads matching the contig from sorted.reads_vs_unannoted.blast.out to sorted.reads_vs_annoted.blast.out and also include the new classification in one_RM_hit_per_Trinity_contigs, or is there a simpler way?
I am sorry if this is a little confusing, but I thought it might interest you because some of these things could help if you still want to develop new checkpoints in the pipeline.

Thanks in advance!!
Mylena

gz compression of inputfiles

Hi again.

I seem to have an issue with the handling of compressed fastq input and get this error:
Start time: Fri Sep 9 16:26:32 2016
gz compression detected for /reads/xy.fastq.gz
counting reads number...Traceback (most recent call last):
  File "/progz/dnaPipeTE/dnaPipeTE.py", line 697, in <module>
    Sampler = FastqSamplerToFasta(args.input_file, args.sample_size, args.genome_size, args.genome_coverage, args.sample_number, args.output_folder, False)
  File "/progz/dnaPipeTE/dnaPipeTE.py", line 156, in __init__
    self.get_sampled_id(self.fastq_R1)
  File "/progz/dnaPipeTE/dnaPipeTE.py", line 188, in get_sampled_id
    with gzip.open(file_name+".gz", 'rt') as file1 :
  File "/usr/lib64/python3.2/gzip.py", line 46, in open
    return GzipFile(filename, mode, compresslevel)
  File "/usr/lib64/python3.2/gzip.py", line 156, in __init__
    raise IOError("Mode " + mode + " not supported")
IOError: Mode rt not supported

The input was an xy.fastq.gz file, which was generated by "gzip xy.fastq".
Any idea whether it's a Python problem or a dnaPipeTE problem?
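
The traceback shows Python 3.2, and text modes such as 'rt' were only added to gzip.open in Python 3.3, so this looks like a Python version limitation rather than a problem with the input file. Besides upgrading Python, one possible workaround is to open the file in binary mode and decode it with io.TextIOWrapper. This is a sketch only: the helper name is hypothetical and it has not been tested against dnaPipeTE or on Python 3.2 itself.

```python
import gzip
import io


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file for reading as text."""
    if path.endswith(".gz"):
        # Binary mode is accepted by gzip.open on every Python 3 release;
        # the wrapper turns the byte stream into text lines.
        return io.TextIOWrapper(gzip.open(path, "rb"), encoding="ascii")
    return open(path, "r")


if __name__ == "__main__":
    with open_fastq("xy.fastq.gz") as handle:
        n_lines = sum(1 for _ in handle)
    print(n_lines // 4, "reads")
```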

Landscape error

Hi Clement,
I have this issue when I look at the landscape graph. I tried different coverages for a ~2GB genome, and because the number of "families" is high, the landscape graph is not drawn properly. I'm sending you three pictures showing the problem. I guess it's due to the height parameter in the R script, but as I'm not used to working with R, I'm unable to solve it efficiently.
Also, I wanted to know if there is some way to re-run the landscapes.R script. I guess the input files are reads_landscape and factors_and_colors... but are there others?
landscape02X.pdf
landscape008X.pdf
landscape025X.pdf

Question on input

Dear Clément,

Happy new year. Thank you for all your earlier help with running dnaPipeTE. Today I spoke with a collaborator (Alex Suh) and told him that dnaPipeTE had died because it used too much memory for too long on my cluster. He said that one should downscale the input data. I had tried with 15Gb of input, which is 5x (3Gb genome). Alex recommended asking you, but definitely downscaling the data. Do you have any recommendations on how low I should go?

Thank you for your time.
José Cerca
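
The arithmetic behind "15Gb of input is 5x of a 3Gb genome" can be written out as a small sketch. It assumes, without having checked the dnaPipeTE source, that the amount sampled per run is simply genome size times the requested coverage. The function names and the 100 bp read length are illustrative only.

```python
def coverage(input_bases, genome_size):
    """Sequencing depth represented by a given number of bases."""
    return input_bases / genome_size


def bases_to_sample(genome_size, genome_coverage):
    """Bases needed to reach a target coverage (assumed: size x coverage)."""
    return int(round(genome_size * genome_coverage))


def reads_to_sample(genome_size, genome_coverage, read_length):
    """Number of reads of a fixed length needed, rounded up."""
    bases = bases_to_sample(genome_size, genome_coverage)
    return -(-bases // read_length)  # ceiling division on integers


genome = 3_000_000_000  # 3 Gb genome
print(coverage(15_000_000_000, genome))   # depth of the full input
print(bases_to_sample(genome, 0.1))       # bases for a 0.1x sample
print(reads_to_sample(genome, 0.1, 100))  # reads, assuming 100 bp reads
```

Since the project description says the pipeline works on datasets below 1X, a 5x input is well above the range it is described for.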

FileNotFoundError: [Errno 2] No such file or directory: '/dnapipete_3/Trinity.fasta.out'

Hello!

I am trying to run dnaPipeTE with publicly available WGS data.

I will greatly appreciate it if you can help me with the following error message:

parseTagData: ID field not to EMBL spec "SNAP-OL2 repeatmasker; DNA; ???; BP.
" from DE RepbaseID: SNAP-OL2XX

at /home/Softwares/dnaPipeTE/bin/RepeatMasker/RepeatMasker line 7611.
Traceback (most recent call last):
  File "./dnaPipeTE.py", line 698, in <module>
    RepeatMasker(config['DEFAULT']['RepeatMasker'], args.RepeatMasker_library, args.RM_species, args.cpu, args.output_folder, args.RM_threshold)
  File "./dnaPipeTE.py", line 381, in __init__
    self.repeatmasker_run()
  File "./dnaPipeTE.py", line 400, in repeatmasker_run
    with open(self.output_folder+"/Trinity.fasta.out", 'r') as trinity_handle:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/data/Data/dnapipete_3/Trinity.fasta.out'

ggplot2

Almost at the end of the run I get this message:

#########################################
### OK, lets build some pretty graphs ###
#########################################
Drawing graphs...
null device 
          1 
null device 
          1 
null device 
          1 
null device 
          1 
Error in library(ggplot2) : there is no package called ‘ggplot2’
Execution halted
Done

Am I missing graphics as a result?

Thanks - Claudio

Installation issue in Linux Redhat 6.2

I have been trying to install dnaPipeTE on a Linux server (Red Hat, version 6.2). After adding the login and password to the file (init.sh), I tried to run it as suggested in the installation instructions and got the following message.

[nishma@login1 dnaPipeTE]$ ./init.sh
GIRINST_USERNAME=: Command not found.
GIRINST_PASSWORD=: Command not found.
GIRINST_USERNAME: Undefined variable.

The same username and password worked on another Linux computer, but its RAM is too small, so I could not run the analyses there.

Installation issue: blastx & makeblastdb not found

Hi Clement,

I'm trying to install dnaPipeTE on my directory on a shared cluster (it cannot be installed like the other tools because it has to be run in the installation folder)

the installation runs fine until the blast step; I put the path to the locally installed blast
/home/datawork-lep-s/deepadapt/dnaPipeTE/bin/ncbi-blast-2.2.28+/bin

but I get the following errors:
/home/datawork-lep-s/deepadapt/dnaPipeTE/bin/ncbi-blast-2.2.28+/bin/makeblastdb does not exist
/home/datawork-lep-s/deepadapt/dnaPipeTE/bin/ncbi-blast-2.2.28+/bin/blastx does not exist

indeed blastx and makeblastdb are not in this directory, nor in the ncbi-rmblastn-2.2.28/bin directory

I have a local installation of blast if that helps?
Could you help me solve this issue?

Thanks in advance, and I'm looking forward to using this tool!
Alexandra

RepeatMasker checkpoint never triggers as RepeatMasker done

Hello,
I have run dnaPipeTE on a university cluster which has access to the RepeatMasker library. The analysis got past the RepeatMasker step but failed to produce graphs. I wanted to rerun the last part on my computer, where I know the graphs do get produced, but there I only have an old RepeatMasker library. dnaPipeTE successfully skips the Trinity phase but always reruns RepeatMasker.

I suspect the problem is that test_RepeatMasker checks for the presence of the file /Annotation/Best_RM_annot_80_80, which seems never to be produced. Shouldn't it be checking for /Annotation/Best_RM_annot_80 instead? Or did something go wrong with my run, and /Annotation/Best_RM_annot_80_80 should be there?

Trinity requires access to Java version 1.6 or 1.7

Hi !

I got an error while trying to test dnaPipeTE with the following command: python3 ./dnaPipeTE.py -input test_dataset.fastq -output /home/loutre/dnaPipeTE/try -genome_size 10000000 -genome_coverage 0.1 -sample_number 1
Here is the error message I get:

Start time: Fri Jan 27 10:22:45 2017
sampling file found, skipping sampling...
###################################
### TRINITY to assemble repeats ###
###################################

***** TRINITY iteration 1 *****

Selecting reads for Trinity iteration number 1...
awk: fatal : impossible d'ouvrir le fichier « /home/loutre/dnaPipeTE/try/Trinity_run0/chrysalis/readsToComponents.out.sort » en lecture (Aucun fichier ou dossier de ce type)
Done

Current settings:
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        8192
coredump(blocks)     0
memory(kbytes)       unlimited
locked memory(kbytes) 64
process              515438
nofiles              1024
vmemory(kbytes)      unlimited
locks                unlimited


Error, Trinity requires access to Java version 1.6 or 1.7.  Currently installed version is: java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
Trinity iteration 1 Done'
renaming Trinity output...
awk: fatal : impossible d'ouvrir le fichier « /home/loutre/dnaPipeTE/try/Trinity_run1/Trinity.fasta » en lecture (Aucun fichier ou dossier de ce type)
done
/home/loutre/dnaPipeTE/try/Annotation/one_RM_hit_per_Trinity_contigs
/home/loutre/dnaPipeTE/try/Annotation/Best_RM_annot_80-80
/home/loutre/dnaPipeTE/try/Annotation/Best_RM_annot_partial
#######################################
### REPEATMASKER to anotate contigs ###
#######################################

RepeatMasker version open-4.0.6
The RepeatMasker installation directory ($RepeatMaskerConfig::REPEATMASKER_DIR) is incorrectly set in the RepeatMaskerConfig.pm file.  Please open the RepeatMaskerConfig.pm file  and edit the $RepeatMaskerConfig::REPEATMASKER_DIR line.
Traceback (most recent call last):
  File "./dnaPipeTE.py", line 700, in <module>
    RepeatMasker(config['DEFAULT']['RepeatMasker'], args.RepeatMasker_library, args.RM_species, args.cpu, args.output_folder, args.RM_threshold)
  File "./dnaPipeTE.py", line 359, in __init__
    self.repeatmasker_run()
  File "./dnaPipeTE.py", line 378, in repeatmasker_run
    with open(self.output_folder+"/Trinity.fasta.out", 'r') as trinity_handle:
FileNotFoundError: [Errno 2] No such file or directory: '/home/loutre/dnaPipeTE/try/Trinity.fasta.out'
Loutre:~/dnaPipeTE$ pwd 
/home/loutre/dnaPipeTE

I think this is a problem with the Java version. I already have Trinity on my computer, and it works fine with Java 1.8. My Trinity version is 1.3; the one installed with dnaPipeTE is trinityrnaseq_r2013_08_14.

Do you think it would be possible to update the Trinity version? Or should I install an older Java version?

RepeatMasker library

Hi!

I am installing dnaPipeTE, following all the steps and including the giri username and subscription. However, I got these errors:

When running init.sh:

(...)

Saving to: ‘RepBaseRepeatMaskerEdition-20170127.tar.gz’

RepBaseRepeatMaskerEdition-20170127.tar.gz    100%[===============================================================================================>]   9,43K  --.-KB/s    in 0,03s   

2019-08-06 14:00:44 (296 KB/s) - ‘RepBaseRepeatMaskerEdition-20170127.tar.gz’ saved [9661/9661]


gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

##################################################################################################
installation of dependencies done, now run the ./configure script in the ./bin/RepeatMasker folder

And if I continue with the dependencies configuration and then try ./test_config.sh

./test_config.sh 
This is the test script for dnaPipeTE
                  ***                

We will test a few dependancies to be sure tha the pipeline run properly


Testing Java...
java version OK!


Testing RepeatMasker Libraries...
RepeatMasker.lib doesn't include the Repbase sequences! Follow instruction to install RepeatMasker libraries on https://github.com/clemgoub/dnaPipeTE

Do you know how I can solve it? or where the problem may be?

Thanks

modify dnaPipeTE.py --JM to --max_memory to support new Trinity versions

The changes from v1 are great!
I just managed to get it running on a new system and was trying trinity 2.2 (latest).
dnaPipeTE.py needs to be amended to use --max_memory instead of --JM to be compatible.

There seems to be another issue at the Butterfly step

Butterfly assemblies are written to /dnaPipeTE/testout/Trinity_run1/Trinity.fasta

Trinity iteration 1 Done'
Traceback (most recent call last):
  File "./dnaPipeTE.py", line 699, in <module>
    Trinity(config['DEFAULT']['Trinity'], config['DEFAULT']['Trinity_memory'], args.cpu, config['DEFAULT']['Trinity_glue'], args.output_folder, sample_files, args.sample_number, args.contig_length)
  File "./dnaPipeTE.py", line 301, in __init__
    self.new_version_correction()
  File "./dnaPipeTE.py", line 329, in new_version_correction
    year = re.search('\d{4}', str(out)).group(0)
AttributeError: 'NoneType' object has no attribute 'group'

any idea how to fix this? my python knowledge is rather limited.
cheers
eckart
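
The AttributeError means that re.search found no run of four digits in the string it was given, so it returned None and the call to .group(0) failed. Older Trinity releases carry a date in their name (for example trinityrnaseq_r2013_08_14), while names such as Trinity-v2.5.1 contain no four-digit run. A guarded lookup could look like the sketch below. The function name is hypothetical, and what new_version_correction should do when no year is found is not known from this report.

```python
import re


def release_year(version_text, default=None):
    """Return the first 4-digit run in a version string as an int.

    Returns `default` instead of raising when the string holds no year.
    """
    match = re.search(r"\d{4}", str(version_text))
    return int(match.group(0)) if match else default


print(release_year("trinityrnaseq_r2013_08_14"))
print(release_year("Trinity-v2.5.1"))
```

The caller can then treat a None result as "newer, version-numbered Trinity" rather than crashing.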

Why is single or low copy DNA 100% in the TEs_piechart file?

During a test run, it ran smoothly for about 5 min and produced the files Base_per_components.pdf and landscape.pdf, but the TEs_piechart.pdf file is an entirely gray figure and shows only "single or low copy DNA".

When I ran ./test_config.sh, it showed "java version < 1.8". But I actually have java 10.0.1, perl 3.5, R 3.4.2, but GNU 2.12 installed in my environment.

$python
 Python 3.5.1 (default, Jun 24 2016, 15:59:19) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-17)] on linux

$java -version
openjdk version "10.0.1" 2018-04-17
OpenJDK Runtime Environment (build 10.0.1+10)
OpenJDK 64-Bit Server VM (build 10.0.1+10, mixed mode)

The error files are as attached.
TEs_piechart.pdf
dnaPipeTE.52650.err.txt
dnaPipeTE.52650.out.txt

How to set -genome_size parameter?

Hello,

For the -genome_size parameter, should we use haploid genome size (e.g. human = 3 Gb) or total DNA content (for human this would be 2N = 6 Gb)?

I am working with asexual organisms with fully collapsed assembly span of ~150 Mb, and am unsure whether to use 150 Mb or 300 Mb for this parameter. Or perhaps it is not so important?

Thanks for your help,
reubwn

parameter -cpu vs real number of used cpu

Hello,

I am trying to run dnaPipeTE on a university computing cluster. To run a job there I need to specify the number of CPUs the job will use. If the job exceeds the specified number of CPUs at any point, it gets killed.
I have found that if I set the dnaPipeTE parameter -cpu to exactly the number of CPUs specified for the job, the job gets killed at the Trinity stage. I tried adding some buffer CPUs for the job, with several trials and killed jobs. Currently, if I use dnaPipeTE -cpu 6 and allocate 10 CPUs for the job, the job gets killed at the RepeatMasker stage, apparently because RepeatMasker tries to use more than 10 CPUs.
Any suggestions on how to estimate the real number of CPUs used?

Thanks,
Markéta

Help with paired reads

Hello,
I have low to high coverage paired reads of a lot of different species that I work with.
Is there a way to use the paired end sequencing data with this software?
Could I get away with using just the forward reads?
Please let me know what you think.
Best,
Basanta

Feature Request: RepeatMasker compatible fasta headers

Hi Clement,

Just a small request based on the work I'm doing. I'm using the dnaPipeTE contigs as a custom library of repeats for my organism for repeatmasker, and I've had to go through and edit the fasta headers in the Annotation file to include the classification in a format compatible with repeatmasker.

Currently an example of a fasta header here is:

>DNA_comp_TRINITY_DN16970_c0_g1_i1

and for repeatmasker compatibility it would look like

>DNA_comp_TRINITY_DN16970_c0_g1_i1#DNA

For deeper level classification it might read

>DNA_comp_TRINITY_DN16970_c0_g1_i1#DNA/Helitron

for example.

As it stands, I have gotten around this with a sed command, but other users may also appreciate this feature, be it as an option or as a default.

Thanks again for such a great tool and for helping out with any previous issues too!
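
The renaming described above could also be done with a few lines of Python instead of sed. This is a sketch under one assumption: the class is the part of the header before "_comp_", as in the example header. For deeper classifications such as DNA/Helitron, which are not present in the header, the value has to be supplied by the caller. The function name is illustrative.

```python
def add_rm_class(header, classification=None):
    """Append '#Class' to a FASTA header line for RepeatMasker libraries.

    If no classification is given, it is taken from the text before
    '_comp_' in the header (an assumption based on the example above).
    """
    name = header.rstrip("\n")
    if not name.startswith(">") or "#" in name:
        return name  # not a header, or already tagged
    if classification is None:
        if "_comp_" not in name:
            return name  # nothing to infer the class from
        classification = name[1:].split("_comp_", 1)[0]
    return name + "#" + classification


print(add_rm_class(">DNA_comp_TRINITY_DN16970_c0_g1_i1"))
print(add_rm_class(">DNA_comp_TRINITY_DN16970_c0_g1_i1", "DNA/Helitron"))
```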

Figuring out read_per_component_and_annotation/bar_graph and Count.txt

Hello,

I am trying to figure out the output of dnaPipeTE.
What does the bar plot obtained from read_per_component_and_annotation represent? What does rep >0.1 mean? Does it signify something about the assembled contigs?
Is there a file with the mapped base pairs from which the Count.txt analysis is done? I am aware that read_per_component_and_annotation has the read_mapped and bp_mapped information, but these numbers differ from those in the Count.txt file. So I was wondering whether there is some file from which the Count.txt analysis is derived.
Lastly, can we get access to a file with all the reads that were used for the blastn search?

nearly no TEs annotated

Hello,

I have been running dnaPipeTE and until now it worked well. I ran exactly the same read set with exactly the same parameters

python3 ./dnaPipeTE.py -input "$READ_DIR"/all_reads.fastq.gz \
    -output "$OUTPUT" \
    -cpu 32 -genome_size "$GENOME_SIZE" -genome_coverage 0.5 -sample_number 3

and very similar config files without specified rm library

rm_species = All
repeatmasker_library =

However, when I look at the results, I get almost no TEs annotated by 1.3.1 in comparison to 1.2 (pasted Counts.txt files):

> TEs_2 # version 1.2.0
               V1        V2
1             LTR  11122701
2            LINE   1486291
3            SINE    102149
4             DNA   3251843
5            MITE         0
6        Helitron    267206
7            rRNA   6914419
8  Low_Complexity     30440
9       Satellite    512504
10 Tandem_repeats         0
11  Simple_repeat    321150
12         others     37846
13             na  87606406
14         Others         0
15          Total 129595103
> TEs_1 # version 1.3.1
               V1        V2
1             LTR         0
2            LINE     14004
3            SINE         0
4             DNA         0
5            MITE         0
6        Helitron         0
7            rRNA         0
8  Low_Complexity   1911368
9       Satellite         0
10 Tandem_repeats         0
11  Simple_repeat  19292537
12         others         0
13             na  88947189
14         Others         0
15          Total 129595357

The only thing I could think of is a difference in the RM database. I found this in the log of the 1.3.1 run: Master RepeatMasker Database: /.../RepeatMaskerLib.embl ( Complete Database: dc20170127 ), but I don't have the log from the 1.2 run anymore. I suppose that it used the corresponding database for 1.2.

Where do you think the problem could be?

Best,
Kamil

P.S. We already cited you in our preprint; here we are just adding some data for a review :-)

RepeatMasker.lib

I am using dnaPipeTE to see what transposable elements are present in the genome of the insect species I am studying. I am having an issue with it, and I would be really thankful if you could help me through this.

I get the following error whenever I run ./test_config.sh.

"RepeatMasker.lib doesn't include the Repbase sequences! Follow instruction to install RepeatMasker libraries on https://github.com/clemgoub/dnaPipeTE"

Can anyone say how this can be resolved?

repbase

It seems GIRI asks for a non-free subscription to obtain the RM libraries, so my username and password only get some XML file from their repository. Any ideas how to solve this? Would a flat fasta be enough?

thanks!

Trinity file empty (Jellyfish ran out of Memory) - FIXED

Hello,
So I made the test data work fine, but when I try to run my own data, the pipeline fails for some reason; Trinity doesn't seem to be creating any files.
I have attached my error file.

-thanks!
dna_pipe_Asha.e.152537.txt

RepeatMasker libraries not found

Hi!

I'm attempting to install your program, but when configuring RepeatMasker it doesn't seem to find the required libraries. I'm pointing to a path that should contain them, but I get this error message:

No repeat libraries found!  At a minimum the Dfam_consensus
is required to run.  Please download and install the latest 
Dfam_consensus.  It is highly recommended that you also install the
latest RepBase RepeatMasker Edition library obtainable from GIRI.
General instructions can be found here: http://www.repeatmasker.org

The folder does have a file called DfamConsensus.embl, but is this not what it is looking for?

Thanks so much for any help!

Caroline

Pacbio reads ?

Hi,

I am wondering if dnaPipeTE works with PacBio reads?

Trinity.fasta empty ~ java version

The program stops saying that Trinity.fasta (in the output folder) is empty (indeed 0 bytes). Way before this happens I get this message:

Use of uninitialized value $java_version in pattern match (m//) at /home/manager/dnaPipeTE/bin/trinityrnaseq-Trinity-v2.5.1/Trinity line 1023.

** Warning, Trinity cannot determine which version of Java is being used. Version 1.8 is required.

I did run fixjava with an OK. This is my current version:

manager@sb:~/dnaPipeTE$ java -version
openjdk version "1.8.0_141-BLFS"
OpenJDK Runtime Environment (build 1.8.0_141-BLFS-b15)
OpenJDK 64-Bit Server VM (build 25.141-b15, mixed mode)

Would manually adding the version in /dnaPipeTE/bin/trinityrnaseq-Trinity-v2.5.1/Trinity at line 1023 fix it?

Trinity iteration errors

The program gives this error:
awk: fatal: cannot open file 'output/Trinity_run0/chrysalis/readsToComponents.out.sort' for reading (No such file or directory)

This can be traced back to 'trinity_iteration' and 'select_reads', where 'iteration+1' is used in some places and 'iteration' in others, so file naming is not consistent. This 'iteration' error is easily fixed, but the underlying problem remains: 'awk' expects output from Trinity, but in the first iteration this output does not exist.
In the code there is the call 'self.trinity_iteration(0)' (line 298), but iteration '0' is not handled differently within that method. Maybe some code got lost somehow?

Annotation database with dnaPipeTE?

Hi!!

Could a dnaPipeTE assembly+annotation be used as a database to annotate the assembly of repetitive sequences generated by other software such as RepeatExplorer?

Thanks!

Landscapes.R plot issue

Thanks for the new update! It runs very smoothly; I just found a problem with the landscapes.R plotting pipeline. The "factors_and_colors" file does not always have the correct number of columns in each entry, so R cannot read it. For example, the entry "DNA/Academ-1" has no class associated with it (in my file), so the second column is left empty. I made the plot manually after fixing the file, but thought you might like to be aware of it.
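A quick way to locate such entries before plotting is to count the fields on each line. The sketch below assumes the file is whitespace-delimited, which is how R's `read.table` splits lines by default; the function name `find_ragged_rows` and the example rows are made up for illustration and are not taken from the real "factors_and_colors" file.

```python
from collections import Counter

def find_ragged_rows(lines):
    """Return (line_number, field_count) for rows whose whitespace-separated
    field count differs from the most common field count in the input."""
    counts = [
        (number, len(line.split()))
        for number, line in enumerate(lines, start=1)
        if line.strip()  # skip blank lines
    ]
    if not counts:
        return []
    expected = Counter(n for _, n in counts).most_common(1)[0][0]
    return [(number, n) for number, n in counts if n != expected]

# Hypothetical rows: the second one lacks its class column.
rows = [
    "LTR/Gypsy LTR colorA",
    "DNA/Academ-1 colorB",
    "LINE/CR1 LINE colorC",
]
print(find_ragged_rows(rows))
```

To check a real file, pass an open file handle instead of the list, for example `find_ragged_rows(open(path))`, where `path` points at your own copy of the file.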

issue in the last part of analysis

Greetings, I am writing because I'm having problems with the "Counts" part of the analysis: the "Counts.txt" file contains only the total TE count and the "na" count; the other fields are 0 even though the "reads_per_component_and_annotation" file and all the other files clearly show the presence of different TEs and several TE families. I suspect the problem is related to RAM or disk space, though no warning is shown.
I obtained the same result twice.
So I was wondering whether there is a way to redo only that step, or whether I can estimate the missing TE proportions myself.

thanks in advance

about the output file reads_per_component_and_annotation

Hi dnaPipeTE devs!

I would like to understand all the columns in the output file reads_per_component_and_annotation. From the manual, I understand that there are 5 columns: read counts, aligned bases, contig name, RM annotation and proportion of contig with RM hit. However, my reads_per_component_and_annotation file contains 6 columns.
For example :

168 24506 comp_TRINITY_DN4252_c0_g1_i2 2983 Gypsy-32_LMi-I LTR/Gypsy 0.7562856185048609
168 24940 comp_TRINITY_DN4286_c3_g4_i4
166 23137 comp_TRINITY_DN4278_c3_g2_i1 1457 Penelope-43_LMi LINE/Penelope 0.9993136582017845
165 13058 comp_TRINITY_DN2925_c0_g1_i1 218 RTE-53_LMi LINE/RTE-BovB 0.5504587155963303
165 22865 comp_TRINITY_DN4295_c12_g1_i1 339 CR1-4_LMi LINE/CR1 0.9970501474926253
165 24465 comp_TRINITY_DN4223_c5_g1_i14 509 Mariner-10_LMi DNA/TcMar-Tc1 0.7033398821218074
158 23363 comp_TRINITY_DN4009_c0_g3_i1 3032 Gypsy-53_LMi-I LTR/Gypsy 0.9993403693931399

I would like to know what the number right after the contig name is.

Best Wishes,

Abhijeet
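For working with this file while the column question is open, the lines quoted above can be split into named fields. This is only a sketch based on those example lines: annotated lines have seven whitespace-separated fields and unannotated lines have three. The key names are assumptions, and the fourth field is deliberately labelled `unknown_col4` because its meaning is exactly what is being asked.

```python
def parse_annotation_line(line):
    """Split one reads_per_component_and_annotation line into a dict.
    Key names are illustrative; 'unknown_col4' is the undocumented field."""
    fields = line.split()
    record = {
        "reads": int(fields[0]),
        "aligned_bases": int(fields[1]),
        "contig": fields[2],
    }
    if len(fields) == 7:  # line carries a RepeatMasker annotation
        record.update({
            "unknown_col4": int(fields[3]),
            "rm_name": fields[4],
            "rm_class": fields[5],
            "rm_proportion": float(fields[6]),
        })
    return record

# Two lines copied from the example above.
annotated = "168 24506 comp_TRINITY_DN4252_c0_g1_i2 2983 Gypsy-32_LMi-I LTR/Gypsy 0.7562856185048609"
unannotated = "168 24940 comp_TRINITY_DN4286_c3_g4_i4"
print(parse_annotation_line(annotated))
print(parse_annotation_line(unannotated))
```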

Calculating TEs divergence

Hi,
I have used dnaPipeTE on a large collection of wild barley I have, comprised of several populations of the same species. I've used the cultivated barley 'repbase' database as reference.
I was wondering whether the reads_landscape files can help in assessing population, rather than species, divergence (based on TEs) and if so, should I be using a different species database as reference?
Also, the landscape graph was not produced for any of the samples I've analysed although the landscape_reads files were. Is that an indication of the process not finishing?

Thank you for your time

Getting reads from blast output

Hi,

I was wondering if there is any way to match the blast output to the reads in my .fq files. I looked at the sorted and unsorted "blast.out" files in the blast_out folder, but I can't find any read names in there. I'd like to create a library of reads that fall into the "single or low copy DNA" category on the piechart so that I can assemble these regions separately.

Any suggestions would be appreciated!

FileNotFoundError in repeatmasker_run

Hi,

In line 378, shouldn't the file be Trinity.fasta rather than Trinity.fasta.out?
with open(self.output_folder+"/Trinity.fasta.out", 'r') as trinity_handle

Thanks,
Rahul

Problem with repeatmasker modules

Dear dnaPipeTE designers, I really look forward to running this pipeline to identify TEs in my popgen-level dataset. It is just what I needed.

I managed to install it, but keep getting an error when trying the test data.

Command

module load Java/1.8.0_212  # Getting Java
module load Bowtie2/2.4.1-GCC-9.3.0 ## and bowtie2

conda activate DNApipeTE
conda install -c conda-forge perl-text-soundex # This installed the module needed.
cpan text::soundex # This installed the module needed..

PERL5LIB=/cluster/projects/nn9408k/cerca/conda/envs/DNApipeTE/bin/perl:$PERL5LIB

python3 ./dnaPipeTE.py -input ./test/test_dataset.fastq -output ./tmp -genome_size 2000000 -genome_coverage 0.5 -sample_number 2

Error:

Can't locate Text/Soundex.pm in @INC (@INC contains: /cluster/home/josece/local_bin/dnaPipeTE/bin/RepeatMasker /cluster/projects/nn9408k/cerca/conda/envs/DNApipeTE/bin/perl /node/lib/perl5 /cluster/lib/perl5/x86_64-linux-thread-multi /cluster/lib/perl5 /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /cluster/home/josece/local_bin/dnaPipeTE/bin/RepeatMasker/Taxonomy.pm line 83.
BEGIN failed--compilation aborted at /cluster/home/josece/local_bin/dnaPipeTE/bin/RepeatMasker/Taxonomy.pm line 83.
Compilation failed in require at /cluster/home/josece/local_bin/dnaPipeTE/bin/RepeatMasker/RepeatMasker line 313.
BEGIN failed--compilation aborted at /cluster/home/josece/local_bin/dnaPipeTE/bin/RepeatMasker/RepeatMasker line 313.
Traceback (most recent call last):
  File "./dnaPipeTE.py", line 698, in <module>
    RepeatMasker(config['DEFAULT']['RepeatMasker'], args.RepeatMasker_library, args.RM_species, args.cpu, args.output_folder, args.RM_threshold)
  File "./dnaPipeTE.py", line 381, in __init__
    self.repeatmasker_run()
  File "./dnaPipeTE.py", line 400, in repeatmasker_run
    with open(self.output_folder+"/Trinity.fasta.out", 'r') as trinity_handle:
FileNotFoundError: [Errno 2] No such file or directory: './tmp/Trinity.fasta.out'

I see this is a common RepeatMasker error; however, I tried everything I could find on Google and I can't get it working. Would you happen to know how to solve this? I have no sudo rights.
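One thing that may be worth checking: Perl searches for modules only in directories, and the PERL5LIB value in the command above ends in bin/perl, which looks like the interpreter rather than a module directory. The sketch below lists which directories from a search path actually contain a given module file. The helper name `find_perl_module` is made up for illustration, and reading the directories from the PERL5LIB environment variable is an assumption about how you would want to use it.

```python
import os

def find_perl_module(module, search_dirs):
    """Return the directories in search_dirs that contain the module file,
    e.g. 'Text::Soundex' is looked up as Text/Soundex.pm."""
    relative_path = os.path.join(*module.split("::")) + ".pm"
    return [
        directory
        for directory in search_dirs
        if os.path.isfile(os.path.join(directory, relative_path))
    ]

if __name__ == "__main__":
    # Directories currently listed in PERL5LIB (empty entries are skipped).
    dirs = [d for d in os.environ.get("PERL5LIB", "").split(os.pathsep) if d]
    print(find_perl_module("Text::Soundex", dirs))
```

An empty list means none of the listed directories holds Text/Soundex.pm, in which case PERL5LIB would need to point at the directory where the conda or cpan install actually placed the module.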

About input file

Dear dnaPipeTE developer
I'm confused about the input option.
I have paired-end files, but dnaPipeTE only handles single-end input.
I don't know whether I should provide just one end, or reverse the reverse file and merge it with the forward file.
Would you mind giving me some advice?
Thanks
yours Zhang

TE consensus sequences

Hello,

I want to ask a question. In the output_folder, which file contains the TE consensus sequences?
thanks in advance

Thanks,
Lina Zhao

Blast parsing errors at higher coverages

Hi there

I've recently switched machines and thus have a fresh install of RM/blast and dnaPipeTE. I had a few successful runs at 0.2/0.3X coverage (except that it pretty much always says: join: contigsTrinityRM.sorted: No such file or directory), but I encounter a weird problem when parsing the blast output in runs with higher coverage (0.35/0.4X).

my command was:
python3 ~/progz/dnaPipeTE/dnaPipeTE.py -input $FILE -output $F/DNAPIPETE_$NEWF -cpu $CPUs -genome_size 450000000 -genome_coverage $COV -sample_number 3
mv landscape.pdf $F/DNAPIPETE_$NEWF/
mv Rplots.pdf $F/DNAPIPETE_$NEWF/

Parsing blast3 output...
rm: cannot remove '/scratch/scratchspace/QMUL_apocrita_temp_copy/007/mt_reduced/repeatcontent/DNAPIPETE_Srichteri_littleb_0.35/blast_out/int.reads_vs_annoted.blast.out': No such file or directory
#######################################################

Estimation of Repeat content from blast outputs

#######################################################
parsing blastout and adding RM annotations for each read...
join: contigsTrinityRM.sorted: No such file or directory
Done, results in: blast_out/blastout_final_fmtd_annoted
#########################################

OK, lets build some pretty graphs

#########################################
Drawing graphs...
Error in read.table(paste(folder, file1, sep = "/")) :
no lines available in input
edit: this comes from graph.R because the input file does not exist
Execution halted

I can't really find the error. It seems that parsing blast output 2 works but files are missing for blast output 3.
In particular, the step creating int.reads_vs_annoted.blast.out seems to fail, since this file is missing:
sort -k1,1 -k12,12nr -k11,11n /scratch/repeatcontent/DNAPIPETE_Srichteri_bigB_0.40/blast_out/reads_vs_annoted.blast.out > /scratch/repeatcontent/DNAPIPETE_Srichteri_bigB_0.40/blast_out/int.reads_vs_annoted.blast.out

Any idea? Did others encounter this error? The funny thing is that if I run lower coverages from the same input file, everything seems to work (though it still complains about join: contigsTrinityRM.sorted: No such file or directory).

Just another hint for the next version:
I think the landscape plot is generated in the current directory, but it would be better to generate it in the output directory. I also had to soft-link blastparser.py into the current directory, otherwise it would not be found.

rm: unable to remove "blast_contigs_1_fmtd": File or directory does not exist

Dear @clemgoub,

I'm trying to set up my dnaPipeTE installation using the test dataset for a first analysis.

Trinity and RepeatMasker seem to work properly; however, I still have a problem during the
repeat estimation phase, after the third Blast run. It seems that some files were not created,
so they cannot be removed. Since my test folder is named "prova1", I ran:

python3 ./dnaPipeTE.py -input test_dataset.fastq -output prova1/ -genome_size 10000000 -genome_coverage 0.1 -sample_number 1

The results are different from those provided in the test directory, and I get this error:

rm: unable to remove "prova1 // blast_contigs_1_fmtd": File or directory does not exist

This is a short version of the log file:

`Start time: Mon May 15 17:49:27 2017
sampling file found, skipping sampling...
Trinity files found, skipping assembly...
prova1/Annotation/Best_RM_annot_80-80
#######################################

REPEATMASKER to anotate contigs

#######################################

RepeatMasker version open-4.0.6
Search Engine: NCBI/RMBLAST [ 2.2.27+ ]
Master RepeatMasker Database: ./bin/RepeatMasker/Libraries/RepeatMaskerLib.embl ( Complete Database: 20150807 )

analyzing file prova1/Trinity.fasta

Some previous RepeatMasker output files were moved to the directory
prova1//Trinity.fasta.preMonMay151749272017.RMoutput
in order not to overwrite them.

Checking for E. coli insertion elements
identifying Simple Repeats in batch 1 of 1
identifying matches to root sequences in batch 1 of 1
identifying Simple Repeats in batch 1 of 1
processing output:
cycle 1
cycle 2
cycle 3
cycle 4
cycle 5
cycle 6
cycle 7
cycle 8
cycle 9
cycle 10
Generating output...
masking
done
24 line read, sorting...
sort done, filtering...
15 lines in one_RM_hit_per_Trinity_contigs
0 lines in Best_RM_annot_80
12 lines in Best_RM_annot_partial
Done
#########################################

Making contigs annotation from RM

#########################################
Done

Making blast sample...
sampling file found, skipping sampling...
total number of reads: 100125
maximum number of reads to sample: 12048
fastq : test_dataset.fastq
sampling 1 samples of max 12048 reads to reach coverage...
999984 bases sampled in 12048 reads
s_test_dataset.fastq_blast done.
#######################################################

Blast 1 : raw reads against all repeats contigs

#######################################################
Blast 1 files found, skipping Blast 1 ...
###################################################

Blast 2 : raw reads against annoted repeats

###################################################
Blast 2 files found, skipping Blast 2 ...
#####################################################

Blast 3 : raw reads against unannoted repeats

#####################################################
Blast 3 files found, skipping Blast 3 ...
#######################################################

Estimation of Repeat content from blast outputs

#######################################################
parsing blastout and adding RM annotations for each read...
awk: cmd. line:1: warning: escape sequence \$' treated as plain $'
rm: cannot remove "prova1/blast_contigs_1_fmtd": No such file or directory
Done, results in: blast_out/blastout_final_fmtd_annoted
#########################################

OK, lets build some pretty graphs

#########################################
Drawing graphs...
null device
1
null device
1
null device
1
null device
1
Warning message:
Removed 3 rows containing missing values (geom_bar).
Warning message:
Removed 3 rows containing missing values (geom_bar).
Done
Removing Trinity runs files...
find: "prova1/Trinity_run*": No such file or directory
done
Finishin time: Mon May 15 17:49:46 2017
########################

see you soon !!!

########################`

In my test analysis, LTR/Pao elements are absent from the landscape.pdf output, whereas Counts.txt looks just like yours. What is the problem? Any help will be greatly appreciated.

Thank you in advance,
Massimiliano.
