

dnapipete's Introduction


dnaPipeTE (for de-novo assembly & annotation Pipeline for Transposable Elements), is a pipeline designed to find, classify and quantify Transposable Elements and other repeats in low coverage (< 1X) NGS datasets. It is very useful to quantify the proportion of TEs in newly sequenced genomes since it does not require genome assembly and works directly on raw short-reads.

  • 👪 dnaPipeTE was created in 2015 by Clément Goubert and Laurent Modolo at the LBBE, with later contributions from Romain Lannes (@rLannes), @pauram and T. Mason Linscott. Thanks a lot!

  • 📦 The container version has been made possible thanks to Stéphane Delmotte of the LBBE.

    • The current version of dnaPipeTE is v.1.4c "container" and is available through Docker/Singularity (see Installation). Changelogs can be found here.
    • From now on, only the container versions of dnaPipeTE will have support. Thank you for your understanding! Container versions are stored on the Docker Hub.
    • The last non-container version of dnaPipeTE 1.3.1 is available here.
  • 📄 You can read the original publication in GBE

  • 📊 A companion repository dnaPT_utils provides useful scripts for post-processing and to create customizable figures. It requires a UNIX environment with bash, R and cd-hit. It is not required for execution of dnaPipeTE.

    dnaPT_utils has been added to the latest distribution (v1.4c).

  • 🩺 If you encounter some issues with dnaPipeTE, you can request assistance here!

  • 🧑‍🏫 An introductory tutorial to dnaPipeTE is available on the TE-hub

  • 🧑‍🏫 An advanced tutorial is published in the book "Transposable Elements, Methods and Protocols (2022)": dnaPipeTE chapter



Installation

System requirement

dnaPipeTE can now run on any system compatible with Docker or Singularity. A minimum of 16 GB of RAM is recommended, and multiple CPUs will improve the execution speed.

Trinity (used for the repeats' assembly) can use a lot of RAM! Here are some examples of RAM usage:

  • 100,000 reads: ~10 GB RAM (two Trinity iterations)
  • 3,000,000 reads: ~40 GB RAM (two Trinity iterations)

Docker (root users)

Docker must be installed and running on the execution machine. For more details see https://docs.docker.com/get-docker/. Then, download the dnaPipeTE container:

sudo docker pull clemgoub/dnapipete:latest

Singularity/Apptainer (non-root users, HPC,...)

For users of High Performance Clusters (HPC) and other systems with no root privileges, it is recommended to use Singularity (usually provided with the base software; for more information see https://sylabs.io/guides/3.0/user-guide/installation.html).

To use dnaPipeTE with Singularity you need to create an image of the container on your machine.

mkdir ~/dnaPipeTE
cd ~/dnaPipeTE
singularity pull --name dnapipete.img docker://clemgoub/dnapipete:latest

Building the image takes ~20 minutes and is only required once.

Running dnaPipeTE

Create a project folder

mkdir ~/Project
cd ~/Project

~/Project will be mounted into the /mnt directory of the Docker or Singularity container and will contain the inputs and outputs.

Input File

The input file must be a single-end FASTQ or FASTQ.GZ file of NGS reads. It can be either the R1 or R2 end of a paired-end library. dnaPipeTE performs the sampling automatically, so you can provide a large file (> 1X) as input.

IMPORTANT: We recommend removing mitochondrial DNA and other non-nuclear DNA from your reads (symbionts, viruses, contaminants). If mtDNA reads are left in the samples, the mitochondrial genome will be assembled and will appear as one of the most abundant repeats in the output, with a size of ~10kb (it may also be wrongly classified as a TE!).

For the following examples, we will consider a fictitious read file called reads_input.fastq
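Before choosing the sampling parameters, it can help to know how much coverage the input file actually holds. The following is a minimal sketch, not part of dnaPipeTE: the function names are hypothetical, and it assumes a standard four-line-per-record FASTQ file, plain or gzip-compressed.

```python
import gzip
from pathlib import Path


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def fastq_stats(path):
    """Return (number of reads, total bases), assuming 4 lines per record."""
    n_reads = 0
    n_bases = 0
    with open_fastq(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # the sequence is the 2nd line of each record
                n_reads += 1
                n_bases += len(line.strip())
    return n_reads, n_bases


def estimated_coverage(n_bases, genome_size):
    """Coverage (X) = sequenced bases / genome size in bp."""
    return n_bases / genome_size


# Example (hypothetical file and genome size):
# reads, bases = fastq_stats("reads_input.fastq")
# print(reads, bases, estimated_coverage(bases, 170_000_000))
```

If the estimated coverage is lower than the total requested through -genome_coverage and -sample_number, the input is probably too small for the chosen settings.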

Interactive usage

Docker

# start the dnaPipeTE container
sudo docker run -it -v ~/Project:/mnt clemgoub/dnapipete:latest

Once in the container, run:

python3 dnaPipeTE.py -input /mnt/reads_input.fastq -output /mnt/output -RM_lib ../RepeatMasker/Libraries/RepeatMasker.lib -genome_size 170000000 -genome_coverage 0.1 -sample_number 2 -RM_t 0.2 -cpu 2

Singularity

singularity shell --bind ~/Project:/mnt ~/dnaPipeTE/dnapipete.img

Once in the container, run:

cd /opt/dnaPipeTE # <<<--- This line is very important to run the program with singularity!
python3 dnaPipeTE.py -input /mnt/reads_input.fastq -output /mnt/output -RM_lib ../RepeatMasker/Libraries/RepeatMasker.lib -genome_size 170000000 -genome_coverage 0.1 -sample_number 2 -RM_t 0.2 -cpu 2

Batch file usage

We create a file dnaPT_cmd.sh that will contain the dnaPipeTE command:

Docker:

#! /bin/bash 
python3 dnaPipeTE.py -input /mnt/reads_input.fastq -output /mnt/output -RM_lib ../RepeatMasker/Libraries/RepeatMasker.lib -genome_size 170000000 -genome_coverage 0.1 -sample_number 2 -RM_t 0.2 -cpu 2 

and then

sudo docker run -v ~/Project:/mnt clemgoub/dnapipete:latest /mnt/dnaPT_cmd.sh

Singularity

#! /bin/bash 
cd /opt/dnaPipeTE # <<<--- This line is very important to run the program with singularity!
python3 dnaPipeTE.py -input /mnt/reads_input.fastq -output /mnt/output -RM_lib ../RepeatMasker/Libraries/RepeatMasker.lib -genome_size 170000000 -genome_coverage 0.1 -sample_number 2 -RM_t 0.2 -cpu 2 

and then

singularity exec --bind ~/Project:/mnt ~/dnaPipeTE/dnapipete.img /mnt/dnaPT_cmd.sh

dnaPipeTE arguments

Argument Description
-input input fastq or fastq.gz files (single end only). It will be sampled
-output complete path with name for the outputs
-cpu maximum number of cpu to use
-sample_number number of trinity iterations
-genome_size size of the genome [use it with -genome_coverage; if used, do not use -sample_size] Ex. 175000000 for 175Mb
-genome_coverage coverage of the genome for each sample [use it with -genome_size; if used, do not use -sample_size] Ex: 0.1 for 0.1X coverage per sample
-sample_size number of reads to sample [use without -genome_size and -genome_coverage]
-RM_lib path to repeat library for RepeatMasker. By default use ../RepeatMasker/Libraries/RepeatMasker.lib. For a custom library, the header format must follow: >Repeat_name#CLASS/Subclass with CLASS in "DNA, LINE, LTR, SINE, MITE, Helitron, Simple Repeat, Satellite"
-RM_t Annotation threshold: minimal percentage of the query (dnaPipeTE contig) aligned on the repeat to keep the annotation from RepeatMasker. Ex: 0.2 for 20% of query in db
-keep_Trinity_output Keep Trinity output files at the end of the run. By default these files are removed (they are large and numerous).
-contig_length minimum size of a repeat contig to be retained (default 200bp)
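The relationship between -genome_size, -genome_coverage and -sample_size can be sketched with a little arithmetic. This is an illustration only, not dnaPipeTE's sampling code: the function names are hypothetical, and the read length is an assumed input.

```python
def bases_per_sample(genome_size, genome_coverage):
    """Number of bases needed to reach the requested coverage in one sample."""
    return round(genome_size * genome_coverage)


def reads_per_sample(genome_size, genome_coverage, read_length):
    """Approximate number of reads of a fixed length needed for one sample."""
    return bases_per_sample(genome_size, genome_coverage) // read_length


def check_sampling_options(genome_size=None, genome_coverage=None, sample_size=None):
    """Apply the rule from the table: use either -sample_size alone,
    or -genome_size together with -genome_coverage."""
    if (genome_size is None) != (genome_coverage is None):
        raise ValueError("-genome_size and -genome_coverage must be used together")
    by_coverage = genome_size is not None
    if by_coverage and sample_size is not None:
        raise ValueError("do not combine -sample_size with -genome_size/-genome_coverage")
    if not by_coverage and sample_size is None:
        raise ValueError("give -sample_size, or -genome_size with -genome_coverage")
    return "coverage" if by_coverage else "sample_size"


# 170 Mb genome at 0.1X per sample, as in the example commands above
print(bases_per_sample(170_000_000, 0.1))
```

With a 10 Mb genome at 0.1X and 83 bp reads (an assumed length), `reads_per_sample` gives 12048. That agrees with the "999984 bases sampled in 12048 reads" line in the test-dataset log quoted in one of the issues below, though the real sampling code may compute this differently.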

Continuing a crashed run: dnaPipeTE is able to skip some steps if a run crashes after a checkpoint. For example, if it crashes during the Trinity assembly, the sampling won't be performed again if you relaunch the run in the same output folder. The checkpoints are: 1) sampling of Trinity inputs; 2) Trinity assembly.

dnaPipeTE OUTPUTS

dnaPipeTE produces a lot of outputs, some of them are very interesting!

The output folder is divided into the following parts:

  • main folder (output name):

important files:

File Description
"Trinity.fasta" this file contains the dnaPipeTE contigs; it is the last assembly performed with Trinity
"reads_per_component_and_annotation" table with the count of reads and bp aligned per dnaPipeTE contig (from blastn 1), as well as its best RepeatMasker annotation. Columns:
  • 1: counts (#reads)
  • 2: aligned bases
  • 3: dnaPipeTE contig name
  • 4: RepeatMasker hit length (bp)
  • 5: RepeatMasker annotation
  • 6: RepeatMasker classification
  • 7: hit length / dnaPipeTE contig length
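The column layout above can be read with a few lines of Python. This is a minimal sketch under two assumptions: the file is whitespace-delimited, and contigs without a RepeatMasker hit carry only the first three columns (as in the example rows quoted in one of the issues below). The `ContigRow` type and `parse_row` function are hypothetical names, not part of dnaPipeTE.

```python
from typing import NamedTuple, Optional


class ContigRow(NamedTuple):
    read_count: int                # column 1: counts (#reads)
    aligned_bases: int             # column 2: aligned bases
    contig: str                    # column 3: dnaPipeTE contig name
    rm_hit_length: Optional[int]   # column 4: RepeatMasker hit length (bp)
    rm_annotation: Optional[str]   # column 5: RepeatMasker annotation
    rm_class: Optional[str]        # column 6: RepeatMasker classification
    hit_fraction: Optional[float]  # column 7: hit length / contig length


def parse_row(line):
    """Parse one line; unannotated contigs only carry the first 3 columns."""
    fields = line.split()
    if len(fields) == 3:
        return ContigRow(int(fields[0]), int(fields[1]), fields[2], None, None, None, None)
    if len(fields) == 7:
        return ContigRow(int(fields[0]), int(fields[1]), fields[2],
                         int(fields[3]), fields[4], fields[5], float(fields[6]))
    raise ValueError(f"unexpected number of columns: {len(fields)}")


row = parse_row("168 24506 comp_TRINITY_DN4252_c0_g1_i2 2983 Gypsy-32_LMi-I LTR/Gypsy 0.7562856185048609")
print(row.contig, row.rm_class, row.rm_hit_length)
```

The number right after the contig name is therefore the RepeatMasker hit length in bp (column 4).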

less important files you may like:

File Description
"Trinity.fasta.out" raw RepeatMasker output (not sorted) of Trinity.fasta on the repeat libraries.
"Counts.txt" count of bp of the sample aligned for each TE class (used for the pieChart)
"Reads_to_components_Rtable.txt" input file to compute the reads and bp per contig (one line per reads)
"Bases_per_component.pdf/png" graph with the number of base-pairs aligned on each dnaPipeTE contig (from blast 1), ordered by genome proportion of the dnaPipeTE contig. -- however, see dnaPT_utils improved graphs
"pieChart.pdf/png" graph with the relative proportion of the main repeat classes, informs about the estimated proportion of repeats in the genome (from blastn 2 and 3) -- however, see: dnaPT_utils for improved graphs
"reads_landscape" reads used for the landscape graph, including the blastn divergence between each read and the contig on which it maps. To plot the landscape, see dnaPT_utils
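The per-class base counts in "Counts.txt" can be turned into genome proportions by dividing by the total. The sketch below assumes a layout that is not documented here: two whitespace-separated columns (class name, bp count) and a row named "Total". The function names are hypothetical.

```python
def read_counts(lines):
    """Parse 'class count' pairs; lines that do not have two fields are skipped."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue
        name, value = fields
        counts[name] = int(value)
    return counts


def proportions(counts, total_key="Total"):
    """Fraction of sampled bases assigned to each class."""
    total = counts[total_key]
    return {name: value / total for name, value in counts.items() if name != total_key}


# Figures taken from a Counts.txt pasted in one of the issues below
example = ["LTR 11122701", "LINE 1486291", "na 87606406", "Total 129595103"]
for name, fraction in proportions(read_counts(example)).items():
    print(f"{name}\t{fraction:.2%}")
```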
  • "Annotation" folder:

important files:

File Description
"one_RM_hit_per_Trinity_contigs" sorted RepeatMasker output containing the best hit on the repeat library for each of the dnaPipeTE contigs (Trinity.fasta)
  • 1: dnaPipeTE contig name
  • 2: hit length on dnaPipeTE contig
  • 3: proportion of dnaPipeTE contig covered by hit
  • 4: hit name
  • 5: hit classification
  • 6: hit target length
  • 7: hit coordinates on target
  • 8: proportion of target covered by the hit

less important files you may like:

File Description
"Best_RM_annot_80_80" subset of the previous table, including contigs for which at least 80% of the sequence maps to at least 80% of the target sequence.
"Best_RM_annot_partial" same, but for contigs for which at least 80% of the sequence maps to less than 80% of the target sequence
"[repeat-class].fasta" subsets of the Trinity.fasta file for each repeat type detected by RepeatMasker
"unannotated.fasta" subset of the Trinity.fasta file for contigs that didn't find any match...
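The 80-80 rule separating the two subsets can be written as a small function. This is a sketch of the rule as described above, not the pipeline's own code. It assumes that columns 3 and 8 of "one_RM_hit_per_Trinity_contigs" are fractions between 0 and 1, and the function name is hypothetical.

```python
def classify_annotation(contig_coverage, target_coverage, threshold=0.8):
    """Return the subset a hit would fall into, following the 80-80 rule.

    contig_coverage: proportion of the dnaPipeTE contig covered by the hit (column 3)
    target_coverage: proportion of the target covered by the hit (column 8)
    """
    if contig_coverage < threshold:
        return None  # neither subset, as the two files are described above
    if target_coverage >= threshold:
        return "Best_RM_annot_80_80"
    return "Best_RM_annot_partial"


print(classify_annotation(0.95, 0.90))
print(classify_annotation(0.95, 0.40))
```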

  • "blast_out" folder:

important files:

File Description
"sorted.reads_vs_Trinity.fasta.blast.out" best hit per reads from blastn 1
"sorted.reads_vs_annotated.blast.out" best hit per reads from blastn 2
"sorted.reads_vs_unannotated.blast.out" best hit per reads from blastn 3
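Picking the best hit per read from a raw blast output can be sketched as follows. The sketch assumes the standard 12-column tab-separated BLAST format, with the e-value in column 11 and the bit score in column 12. It ranks hits by highest bit score, then lowest e-value, mirroring the `sort -k1,1 -k12,12nr -k11,11n` command quoted in one of the issues below. The function name is hypothetical and dnaPipeTE's own filtering may differ.

```python
def best_hit_per_read(lines):
    """Keep, for each read (column 1), the hit with the highest bit score
    (column 12); ties are broken by the lowest e-value (column 11)."""
    best = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        read = fields[0]
        rank = (-float(fields[11]), float(fields[10]))  # smaller rank is better
        if read not in best or rank < best[read][0]:
            best[read] = (rank, fields)
    return {read: fields for read, (rank, fields) in best.items()}


# Hypothetical hits: read1 matches two contigs, read2 matches one
hits = [
    "read1\tcontigA\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-40\t150",
    "read1\tcontigB\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180",
    "read2\tcontigA\t90.0\t80\t8\t0\t1\t80\t5\t84\t1e-20\t90",
]
for read, fields in best_hit_per_read(hits).items():
    print(read, fields[1], fields[11])
```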

less important files you may like:

File Description
"reads_vs_[anything]" raw blast out from previous files
Trinity_runX These files contain the raw Trinity outputs and intermediate files produced during the assembly steps. For further detail see the Trinity documentation (http://trinityrnaseq.sourceforge.net/)

Changelog

Changelog v1.4c Oct.2022

  • Update RepeatMasker to v.4.1.3
  • Update R from 3.3.3 to 4.2.1
  • Fix issues #12, #55 and #73 (thanks to T. Mason Linscott)
  • Adds dnaPT_utils to the container

Changelog v1.3.1c March.2022

  • First container version
  • dnaPipeTE.py
    • The docker-specific config.ini has to be used.
    • blast2: the database (annotated dnaPipeTE contigs) is no longer merged with Repbase for this blast, as Repbase is not freely accessible anymore. The merge was meant to rescue low-copy TEs that were missed in the assembly but present in Repbase. Dropping it has virtually no influence on the results.

Changelog v1.3.1 07.Dec.2017

  • Fixed missing class column for some Academ families causing errors with landscape graphs (thanks @rotifergirl for reporting!)

Changelog v1.3 01.Dec.2017

  • Updated Trinity with latest version (v2.5.1)
  • Updated RepeatMasker with latest version (version Open 4.0.7)
  • Compatible with latest Repbase (RepeatMasker compatible) libraries (20170127)
  • Fixed a bug in the blast sample, which turned out to be a recycling of sample 1 instead of a new independent sample. However, tests showed this had no striking influence on the results (actual sampling variation between runs is more likely to create variation between outputs).
  • Removed most of the files from the bin folder and replaced them with the init.sh script so that users can make their own installation.
  • Landscape graphs are now expressed relative to genome %
  • Cleaned the git repository of large files

You can download previous versions from the GitHub repository by clicking on the "branch" menu and selecting the desired version.


Changelog v1.2

  • Estimation of repeat content is now based on the ratio of bases (bp) aligned on repeat contigs over the total number of bases sampled, instead of the number of reads mapping over the total number of reads sampled; this produces a better estimate of the repeat content and reduces potential overestimations. In addition, it allows more accurate estimates if the size of the reads used as input is variable.
  • If different parts of the same read match different repeat contigs (e.g. in the case of adjacent TEs or a TE nested in a TE), all bases are retained instead of only those of the best hit.
  • New graph "Bases per component" replaces "reads per component"; it is very similar to the reads per component graph but represents the total amount of bases aligned over the dnaPipeTE contigs.
  • Bug fix: in last version, repbase library was not merged to annotated dnaPipeTE contigs for repeat estimates, now it is.
  • New option: "-Trin_glue" to specify a minimum number of reads supporting the joining of kmer contigs during assembly (Chrysalis step in trinity)
  • New option: "-contig_length" to set a minimum size (in bp) to report a contig (default is 200 bp)
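The difference between the read-based estimate and the base-based estimate introduced in v1.2 can be illustrated with made-up numbers. The values and function names below are hypothetical and only show why partially aligned reads inflate a read-based estimate.

```python
def read_based_estimate(reads_mapped, reads_sampled):
    """Pre-1.2 style: fraction of sampled reads with a hit on a repeat contig."""
    return reads_mapped / reads_sampled


def base_based_estimate(aligned_bases, bases_sampled):
    """v1.2 style: fraction of sampled bases aligned on repeat contigs."""
    return aligned_bases / bases_sampled


# Hypothetical sample: 100 reads of 100 bp; 30 reads hit a repeat contig,
# but only 50 bp of each of those reads actually align.
reads_sampled, read_length = 100, 100
reads_mapped, aligned_per_read = 30, 50

print(read_based_estimate(reads_mapped, reads_sampled))
print(base_based_estimate(reads_mapped * aligned_per_read, reads_sampled * read_length))
```

In this example the read-based ratio counts each partially aligned read in full, so it comes out at twice the base-based ratio.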


dnapipete's Issues

RepeatMasker library

Hi!

I am installing dnaPipeTE, following all the steps and including the giri username and subscription. However, I got these errors:

When running init.sh:

(...)

Saving to: ‘RepBaseRepeatMaskerEdition-20170127.tar.gz’

RepBaseRepeatMaskerEdition-20170127.tar.gz    100%[===============================================================================================>]   9,43K  --.-KB/s    in 0,03s   

2019-08-06 14:00:44 (296 KB/s) - ‘RepBaseRepeatMaskerEdition-20170127.tar.gz’ saved [9661/9661]


gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

##################################################################################################
installation of dependencies done, now run the ./configure script in the ./bin/RepeatMasker folder

And if I continue with the dependencies configuration and then try ./test_config.sh

./test_config.sh 
This is the test script for dnaPipeTE
                  ***                

We will test a few dependancies to be sure tha the pipeline run properly


Testing Java...
java version OK!


Testing RepeatMasker Libraries...
RepeatMasker.lib doesn't include the Repbase sequences! Follow instruction to install RepeatMasker libraries on https://github.com/clemgoub/dnaPipeTE

Do you know how I can solve it? or where the problem may be?

Thanks

Using sample_size option

Hi,
I'm trying to subsample some of the reads in my fastq file by using the sample_size option.
Does this option sample reads randomly or by order of appearance in the fastq file? If it is not random by default, is there a way to make it so?

Thanks

about the output file reads_per_component_and_annotation

Hi dnaPipeTE devs!

I would like to understand all the columns in the output file reads_per_component_and_annotation. From the manual, I understand that there are 5 columns: read counts, aligned bases, contig name, RM annotation and proportion of contig with RM hit. However, my reads_per_component_and_annotation file contains 6 columns.
For example :

168 24506 comp_TRINITY_DN4252_c0_g1_i2 2983 Gypsy-32_LMi-I LTR/Gypsy 0.7562856185048609
168 24940 comp_TRINITY_DN4286_c3_g4_i4
166 23137 comp_TRINITY_DN4278_c3_g2_i1 1457 Penelope-43_LMi LINE/Penelope 0.9993136582017845
165 13058 comp_TRINITY_DN2925_c0_g1_i1 218 RTE-53_LMi LINE/RTE-BovB 0.5504587155963303
165 22865 comp_TRINITY_DN4295_c12_g1_i1 339 CR1-4_LMi LINE/CR1 0.9970501474926253
165 24465 comp_TRINITY_DN4223_c5_g1_i14 509 Mariner-10_LMi DNA/TcMar-Tc1 0.7033398821218074
158 23363 comp_TRINITY_DN4009_c0_g3_i1 3032 Gypsy-53_LMi-I LTR/Gypsy 0.9993403693931399

I would like to know what is the number right after the contig name.

Best Wishes,

Abhijeet

FileNotFoundError in repeatmasker_run

Hi,

In line 378, shouldn't the file be Trinity.fasta rather than Trinity.fasta.out?
with open(self.output_folder+"/Trinity.fasta.out", 'r') as trinity_handle

Thanks,
Rahul

Trinity.fasta empty ~ java version

The program stops saying that Trinity.fasta (in the output folder) is empty (indeed 0 bytes). Way before this happens I get this message:

Use of uninitialized value $java_version in pattern match (m//) at /home/manager/dnaPipeTE/bin/trinityrnaseq-Trinity-v2.5.1/Trinity line 1023.


** Warning, Trinity cannot determine which version of Java is being used. Version 1.8 is required.

I did run fixjava with an OK. This is my current version:

manager@sb:~/dnaPipeTE$ java -version
openjdk version "1.8.0_141-BLFS"
OpenJDK Runtime Environment (build 1.8.0_141-BLFS-b15)
OpenJDK 64-Bit Server VM (build 25.141-b15, mixed mode)
manager@sb:~/dnaPipeTE$ java -version
openjdk version "1.8.0_141-BLFS"
OpenJDK Runtime Environment (build 1.8.0_141-BLFS-b15)
OpenJDK 64-Bit Server VM (build 25.141-b15, mixed mode)

Would manually adding the version in /dnaPipeTE/bin/trinityrnaseq-Trinity-v2.5.1/Trinity at line 1023 fix it?

Figuring out read_per_component_and_annotation/bar_graph and Count.txt

Hello,

I am trying to figure out the output of dnaPipeTE.
What does the bar plot obtained from read_per_component_and_annotation represent? What does rep > 0.1 mean? Does it say something about the assembled contigs?
Is there a file with the mapped base pairs from which Count.txt is computed? I am aware that read_per_component_and_annotation has the read_mapped and bp_mapped information, but these numbers differ from those in the Count.txt file, so I was wondering which file is used to produce Count.txt.
Lastly, can we get access to the file containing all the reads that were used for the blastn search?

Question on input

Dear Clément,

happy new year. Thank you for all your help before in running dnaPipeTE. Today I spoke with a collaborator (Alex Suh), and I told him that dnaPipeTE had died since it pulled too much memory for too long in my cluster. He said that one should downscale the input data. I tried with 15Gb of input, which is 5x (3Gb genome). Alex recommended to ask you, but definitely downscale the data. I wonder if you have any recommendations on how low I should go?

Thank you for your time.
José Cerca

Getting reads from blast output

Hi,

I was wondering if there is any way to match the blast output to the reads in my .fq files. I looked at the sorted and unsorted "blast.out" files in the blast_out folder, but I can't find any read names in there. I'd like to create a library of reads that fall into the "single or low copy DNA" category on the piechart so that I can assemble these regions separately.

Any suggestions would be appreciated!

nearly no TEs annotated

Hello,

I ran dnaPipeTE and it has worked well until now. I ran exactly the same read set with exactly the same parameters

python3 ./dnaPipeTE.py -input "$READ_DIR"/all_reads.fastq.gz \
    -output "$OUTPUT" \
    -cpu 32 -genome_size "$GENOME_SIZE" -genome_coverage 0.5 -sample_number 3

and very similar config files without specified rm library

rm_species = All
repeatmasker_library = 

However, when I look at the results, I get almost no TEs annotated by 1.3.1 in comparison to 1.2 (pasted Counts.txt files):

> TEs_2 # version 1.2.0
               V1        V2
1             LTR  11122701
2            LINE   1486291
3            SINE    102149
4             DNA   3251843
5            MITE         0
6        Helitron    267206
7            rRNA   6914419
8  Low_Complexity     30440
9       Satellite    512504
10 Tandem_repeats         0
11  Simple_repeat    321150
12         others     37846
13             na  87606406
14         Others         0
15          Total 129595103
> TEs_1 # version 1.3.1
               V1        V2
1             LTR         0
2            LINE     14004
3            SINE         0
4             DNA         0
5            MITE         0
6        Helitron         0
7            rRNA         0
8  Low_Complexity   1911368
9       Satellite         0
10 Tandem_repeats         0
11  Simple_repeat  19292537
12         others         0
13             na  88947189
14         Others         0
15          Total 129595357

The only thing I could think of is a difference in the RepeatMasker database. I could find this in the log of run 1.3.1: Master RepeatMasker Database: /.../RepeatMaskerLib.embl ( Complete Database: dc20170127 ), but I don't have a log from the 1.2 run anymore. I suppose that it used the analogous database of 1.2.

Where do you think could be a problem?

Best,
Kamil

P.S. We already cited you in our preprint; here we are just adding some data for a review :-)

RepeatMasker libraries not found

Hi!

I'm attempting to install your program, but when configuring RepeatMasker it doesn't seem to find the required libraries. I'm pointing to a path that should contain them, but I get this error message:

No repeat libraries found!  At a minimum the Dfam_consensus
is required to run.  Please download and install the latest 
Dfam_consensus.  It is highly recommended that you also install the
latest RepBase RepeatMasker Edition library obtainable from GIRI.
General instructions can be found here: http://www.repeatmasker.org

The folder does have a file called DfamConsensus.embl, but is this not what it is looking for?

Thanks so much for any help!

Caroline

Trinity iteration errors

The program gives this error:
awk: fatal: cannot open file 'output/Trinity_run0/chrysalis/readsToComponents.out.sort' for reading (No such file or directory)

This can be traced back to 'trinity_iteration' and 'select_reads', where in some cases 'iteration+1' is used and in other cases 'iteration': file naming is not consistent. This 'iteration' error is easily fixed, but the underlying problem remains: 'awk' expects output from Trinity, and in the first iteration this output does not exist.
In the code there is this call: 'self.trinity_iteration(0)' (line 298), however iteration '0' is not handled differently within that method. Maybe some code got lost somehow?

RepeatMasker checkpoint never triggers as RepeatMasker done

Hello,
I have run dnaPipeTE on a university cluster which has access to the RepeatMasker library. The analysis did get past the RepeatMasker step but failed to produce graphs. I wanted to rerun the last part on my computer, where I know the graphs do get produced but where I only have an old RepeatMasker library. dnaPipeTE does successfully skip the Trinity phase but always reruns RepeatMasker.

I suspect the problem is that test_RepeatMasker checks for the presence of the file /Annotation/Best_RM_annot_80_80, which seems to be never produced. Shouldn't it be checking for /Annotation/Best_RM_annot_80 instead? Or did something go wrong with my run, and /Annotation/Best_RM_annot_80_80 should be there?

rm: unable to remove "blast_contigs_1_fmtd": File or directory does not exist

Dear @clemgoub,

I'm trying to setup my dnaPipeTE installation using the test dataset for the first analysis.

Trinity and RepeatMasker seem to work properly, however I still have a problem during the
estimation of repeat phase, after the third Blast run. It seems that some files were not created
so it is impossible to remove them. Since my test folder is named "prova1", I ran:

python3 ./dnaPipeTE.py -input test_dataset.fastq -output prova1/ -genome_size 10000000 -genome_coverage 0.1 -sample_number 1

results are different from those provided in the test directory and I have this error:

rm: unable to remove "prova1 // blast_contigs_1_fmtd": File or directory does not exist

This is a short version of the log file:

`Start time: Mon May 15 17:49:27 2017
sampling file found, skipping sampling...
Trinity files found, skipping assembly...
prova1/Annotation/Best_RM_annot_80-80
#######################################

REPEATMASKER to anotate contigs

#######################################

RepeatMasker version open-4.0.6
Search Engine: NCBI/RMBLAST [ 2.2.27+ ]
Master RepeatMasker Database: ./bin/RepeatMasker/Libraries/RepeatMaskerLib.embl ( Complete Database: 20150807 )

analyzing file prova1/Trinity.fasta

Some previous RepeatMasker output files were moved to the directory
prova1//Trinity.fasta.preMonMay151749272017.RMoutput
in order not to overwrite them.

Checking for E. coli insertion elements
identifying Simple Repeats in batch 1 of 1
identifying matches to root sequences in batch 1 of 1
identifying Simple Repeats in batch 1 of 1
processing output:
cycle 1
cycle 2
cycle 3
cycle 4
cycle 5
cycle 6
cycle 7
cycle 8
cycle 9
cycle 10
Generating output...
masking
done
24 line read, sorting...
sort done, filtering...
15 lines in one_RM_hit_per_Trinity_contigs
0 lines in Best_RM_annot_80
12 lines in Best_RM_annot_partial
Done
#########################################

Making contigs annotation from RM

#########################################
Done

Making blast sample...
sampling file found, skipping sampling...
total number of reads: 100125
maximum number of reads to sample: 12048
fastq : test_dataset.fastq
sampling 1 samples of max 12048 reads to reach coverage...
999984 bases sampled in 12048 reads
s_test_dataset.fastq_blast done.
#######################################################

Blast 1 : raw reads against all repeats contigs

#######################################################
Blast 1 files found, skipping Blast 1 ...
###################################################

Blast 2 : raw reads against annoted repeats

###################################################
Blast 2 files found, skipping Blast 2 ...
#####################################################

Blast 3 : raw reads against unannoted repeats

#####################################################
Blast 3 files found, skipping Blast 3 ...
#######################################################

Estimation of Repeat content from blast outputs

#######################################################
parsing blastout and adding RM annotations for each read...
awk: cmd. line:1: warning: escape sequence \$' treated as plain $'
rm: cannot remove "prova1/blast_contigs_1_fmtd": No such file or directory
Done, results in: blast_out/blastout_final_fmtd_annoted
#########################################

OK, lets build some pretty graphs

#########################################
Drawing graphs...
null device
1
null device
1
null device
1
null device
1
Warning message:
Removed 3 rows containing missing values (geom_bar).
Warning message:
Removed 3 rows containing missing values (geom_bar).
Done
Removing Trinity runs files...
find: "prova1/Trinity_run*": No such file or directory
done
Finishin time: Mon May 15 17:49:46 2017
########################

see you soon !!!

########################`

In my test analysis LTR/Pao are absent from the landscape.pdf output, whereas Counts.txt looks just like yours. What is the problem? Any help will be greatly appreciated.

Thank you in advance,
Massimiliano.

Blast parsing errors at higher coverages

Hi there

I've recently switched machines and thus have a fresh install of RM/blast and dnaPipeTE. I had a few successful runs at 0.2/0.3X coverage (except that it pretty much always says: join: contigsTrinityRM.sorted: No such file or directory), but I encounter a weird problem when parsing the blast output in runs with higher coverage (0.35/0.4X).

my command was:
python3 ~/progz/dnaPipeTE/dnaPipeTE.py -input $FILE -output $F/DNAPIPETE_$NEWF -cpu $CPUs -genome_size 450000000 -genome_coverage $COV -sample_number 3
mv landscape.pdf $F/DNAPIPETE_$NEWF/
mv Rplots.pdf $F/DNAPIPETE_$NEWF/

Parsing blast3 output...
rm: cannot remove '/scratch/scratchspace/QMUL_apocrita_temp_copy/007/mt_reduced/repeatcontent/DNAPIPETE_Srichteri_littleb_0.35/blast_out/int.reads_vs_annoted.blast.out': No such file or directory
#######################################################

Estimation of Repeat content from blast outputs

#######################################################
parsing blastout and adding RM annotations for each read...
join: contigsTrinityRM.sorted: No such file or directory
Done, results in: blast_out/blastout_final_fmtd_annoted
#########################################

OK, lets build some pretty graphs

#########################################
Drawing graphs...
Error in read.table(paste(folder, file1, sep = "/")) :
no lines available in input
edit: this comes from graph.R because the input file does not exist
Execution halted

I can't really find the error. It seems that parsing blast output 2 works but that files are missing for blast output 3;
in particular, the step creating int.reads_vs_annoted.blast.out seems to fail, since this file is missing:
sort -k1,1 -k12,12nr -k11,11n /scratch/repeatcontent/DNAPIPETE_Srichteri_bigB_0.40/blast_out/reads_vs_annoted.blast.out > /scratch/repeatcontent/DNAPIPETE_Srichteri_bigB_0.40/blast_out/int.reads_vs_annoted.blast.out

Any idea? Did others encounter the error? Funny thing is that if I run lower coverages from the same inputfile, then everything seems to work (still complaining about the join: contigsTrinityRM.sorted: No such file or directory tho).

just another hint for the next version:
I think the landscapes plot is generated in the current DIR but should be better generated in the Output DIR. I had to softlink the blastparser.py in the current DIR, otherwise it would not find it.

RepeatMasker.lib

I am using dnaPipeTE to see what transposable elements are present in the genome of the insect species I am studying. I was having an issue with it, I will be really thankful if you can please help me through this.

I get the following error whenever I run ./test_config.sh.

"RepeatMasker.lib doesn't include the Repbase sequences! Follow instruction to install RepeatMasker libraries on https://github.com/clemgoub/dnaPipeTE"

Can anyone say how this can be resolved?

Pacbio reads ?

Hi,

I am wondering if dnaPipeTE works with PacBio reads?

Problem with repeatmasker modules

Dear dnaPipeTE designers -I really look forward to running this pipeline and identify TEs on my popgen-level dataset. It is just what I needed.

I managed to install it, but keep getting an error when trying the test data.

Command

module load Java/1.8.0_212  # Gettting java
module load Bowtie2/2.4.1-GCC-9.3.0 ## and bowtie2

conda activate DNApipeTE
conda install -c conda-forge perl-text-soundex # This installed the module needed.
cpan text::soundex # This installed the module needed..

PERL5LIB=/cluster/projects/nn9408k/cerca/conda/envs/DNApipeTE/bin/perl:$PERL5LIB

python3 ./dnaPipeTE.py -input ./test/test_dataset.fastq -output ./tmp -genome_size 2000000 -genome_coverage 0.5 -sample_number 2

Error:

Can't locate Text/Soundex.pm in @INC (@INC contains: /cluster/home/josece/local_bin/dnaPipeTE/bin/RepeatMasker /cluster/projects/nn9408k/cerca/conda/envs/DNApipeTE/bin/perl /node/lib/perl5 /cluster/lib/perl5/x86_64-linux-thread-multi /cluster/lib/perl5 /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /cluster/home/josece/local_bin/dnaPipeTE/bin/RepeatMasker/Taxonomy.pm line 83.
BEGIN failed--compilation aborted at /cluster/home/josece/local_bin/dnaPipeTE/bin/RepeatMasker/Taxonomy.pm line 83.
Compilation failed in require at /cluster/home/josece/local_bin/dnaPipeTE/bin/RepeatMasker/RepeatMasker line 313.
BEGIN failed--compilation aborted at /cluster/home/josece/local_bin/dnaPipeTE/bin/RepeatMasker/RepeatMasker line 313.
Traceback (most recent call last):
  File "./dnaPipeTE.py", line 698, in <module>
    RepeatMasker(config['DEFAULT']['RepeatMasker'], args.RepeatMasker_library, args.RM_species, args.cpu, args.output_folder, args.RM_threshold)
  File "./dnaPipeTE.py", line 381, in __init__
    self.repeatmasker_run()
  File "./dnaPipeTE.py", line 400, in repeatmasker_run
    with open(self.output_folder+"/Trinity.fasta.out", 'r') as trinity_handle:
FileNotFoundError: [Errno 2] No such file or directory: './tmp/Trinity.fasta.out'

I see this is a common RepeatMasker error; however, I tried everything I could find on Google and I can't get it working. Would you happen to know how to solve this? I have no sudo rights.

Why is single or low copy DNA 100% in the TEs_piechart file?

During the test run, it ran smoothly for about 5 min and produced the files Base_per_components.pdf and landscape.pdf, but the TEs_piechart.pdf file is an entirely gray figure and shows only "single or low copy DNA".

When I ran ./test_config.sh, it showed "java version < 1.8". But I actually installed java 10.0.1, perl 3.5 and R 3.4.2, but GNU 2.12, in my environment.

$python
 Python 3.5.1 (default, Jun 24 2016, 15:59:19) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-17)] on linux

$java -version
openjdk version "10.0.1" 2018-04-17
OpenJDK Runtime Environment (build 10.0.1+10)
OpenJDK 64-Bit Server VM (build 10.0.1+10, mixed mode)

The error files are as attached.
TEs_piechart.pdf
dnaPipeTE.52650.err.txt
dnaPipeTE.52650.out.txt

About input file

Dear dnaPipeTE developer
I'm confused about the input option.
I have paired-end files, but dnaPipeTE only handles single-end reads.
I don't know whether I should just use one end, or reverse the reverse file and merge it with the forward file.
Would you mind giving me some advice?
Thanks
yours Zhang

Best_RM_annot_80

Hi,
I'm trying to work with the Best_RM_annot_80 files.
My problem is I can't really make sense of some of the columns. I also couldn't find any hint of what the headers for that file might be.
Is there a good source for that? I've looked at the RM website as well as the dnaPipeTE README file and several articles.

Thanks in advance

Help with paired reads

Hello,
I have low to high coverage paired reads of a lot of different species that I work with.
Is there a way to use the paired end sequencing data with this software?
Could I get away with using just the forward reads?
Please let me know what you think.
Best,
Basanta

Annotation database with dnaPipeTE?

Hi!!

Could a dnaPipeTE assembly+annotation be used as a database to annotate the assembly of repetitive sequences generated by other software such as RepeatExplorer?

Thanks!

repbase

It seems GIRI asks for a non-free subscription to obtain the RM libraries, so my username and password are only getting some XML file from their repository. Any ideas how to solve this? Would a flat fasta be enough?

thanks!

issue in the last part of analysis

Greetings, I am writing because I'm having problems with the "Counts" part of the analysis: the "Counts.txt" file contains only the counts for total TEs and the "na" ones; the other fields are 0, even though the "reads_per_component_and_annotation" file and all the other files clearly show the presence of different TEs and several TE families. I suspect the problem is related to RAM or disk space, though no warning is shown.
I obtained the same results twice.
So I was wondering if there is a way to rerun only that step, or if I can estimate the missing TE proportions myself.

thanks in advance

Not all contigs in reads_per_component_and_annotation

Hi,
I want to know the read count for every contig. There are 19 contigs in Trinity.fasta but only 17 in reads_per_component_and_annotation. One of the remaining 2 contigs even has a high proportion covered by a RepeatMasker annotation. I'm wondering whether the 2 contigs disappeared because of a low read count, and whether any filtering threshold was used. Thanks.

Chenjiaqi

Error getting after Run

awk: fatal: cannot open file /home/gnomeadmin/te_analysis/dnaPipeTE/dnaPipeTE//Trinity_run0/chrysalis/readsToComponents.out.sort' for reading (No such file or directory)
sed: can't read /home/gnomeadmin/te_analysis/dnaPipeTE/dnaPipeTE//Trinity_run1/Trinity.fasta: No such file or directory
awk: fatal: cannot open file /home/gnomeadmin/te_analysis/dnaPipeTE/dnaPipeTE//Trinity_run1/Trinity.fasta' for reading (No such file or directory)
Traceback (most recent call last):
  File "./dnaPipeTE.py", line 698, in <module>
    RepeatMasker(config['DEFAULT']['RepeatMasker'], args.RepeatMasker_library, args.RM_species, args.cpu, args.output_folder, args.RM_threshold)
  File "./dnaPipeTE.py", line 381, in __init__
    self.repeatmasker_run()
  File "./dnaPipeTE.py", line 400, in repeatmasker_run
    with open(self.output_folder+"/Trinity.fasta.out", 'r') as trinity_handle:
FileNotFoundError: [Errno 2] No such file or directory: '/home/gnomeadmin/te_analysis/dnaPipeTE/dnaPipeTE//Trinity.fasta.out'

parameter -cpu vs real number of used cpu

Hello,

I am trying to run dnaPipeTE on a university computing cluster. To run a job there I need to specify the number of CPUs the job will use. If the job exceeds the specified number of CPUs at any time, it gets killed.
I have found that if I set the dnaPipeTE parameter -cpu to exactly the number of CPUs requested for the job, the job gets killed at the Trinity stage. I tried adding some buffer CPUs for the job, with several trials and killed jobs. I am currently at the point where, if I use dnaPipeTE -cpu 6 and allocate 10 CPUs for the job, the job gets killed at the RepeatMasker stage, apparently because RepeatMasker tries to use more than 10 CPUs.
Any suggestions on how to estimate the real number of CPUs used?

Thanks,
Markéta

Figuring out how Counts.txt is calculated

Hello,

I am trying to understand dnaPipeTE outputs. Specifically, I am trying to figure out how the numbers in Counts.txt are calculated. When I look into the reads_per_component_and_annotation file and simply sum the numbers in the second (base pairs) column for the relevantly annotated rows, I can't seem to arrive at the same number as dnaPipeTE.
For example, in Counts.txt I find:
Simple_repeat 34570800
However, if I add all rows annotated as Simple_repeat in reads_per_component_and_annotation I get:
Simple_repeat 10408258

Similarly, in Counts.txt I have:
LTR 376818

But when looking into reads_per_component_and_annotation and adding all rows with the "LTR" annotation I get:
LTR 149042

Is there some normalization involved?

The reason I am asking is that I will probably need to manually annotate some un-annotated and mis-annotated contigs, so I would like to know what the next steps in the pipeline are once I am done with the reannotation.

Thanks,
Markéta
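
A cross-check like the one described above can be scripted. The sketch below only sums one numeric column per annotation label; the column positions (`bp_col`, `annot_col`), the whitespace-delimited layout and the example rows are assumptions for illustration, not taken from the dnaPipeTE code, and it does not reproduce whatever Counts.txt does internally.

```python
from collections import defaultdict

def sum_bp_by_annotation(lines, bp_col=1, annot_col=-1):
    """Sum a base-pair column per annotation label.

    bp_col and annot_col are 0-based field indexes. Both defaults are
    assumptions about the file layout and should be checked by hand.
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or truncated rows
        try:
            bp = int(float(fields[bp_col]))
        except (ValueError, IndexError):
            continue  # skip rows without a numeric bp field
        totals[fields[annot_col]] += bp
    return dict(totals)

# Made-up rows, only to show the call; real rows will differ.
example = [
    "120 9000 comp1 LTR",
    "80 6000 comp2 LTR",
    "300 20000 comp3 Simple_repeat",
]
print(sum_bp_by_annotation(example))
```

Comparing these sums with Counts.txt for a couple of classes shows how far apart the two numbers are before any manual reannotation.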

Repbase Libraries

I've successfully installed dnaPipeTE on a cluster and my home computer, but I am still getting this message when I run test_config.sh:

Testing RepeatMasker Libraries...
RepeatMasker.lib doesn't include the Repbase sequences! Follow instruction to install RepeatMasker libraries on https://github.com/clemgoub/dnaPipeTE

This is what the library directory looks like on the cluster:

Libraries]$ ls -l
total 1599056
-rw-r--r-- 1 212876 Jan 28 2017 DfamConsensus.embl
-rw-r--r-- 1 1510837557 Sep 24 2015 Dfam.hmm
-rw-r--r-- 1 214 Jan 28 2017 README.meta
-rw-r--r-- 1 22475384 Aug 7 2015 RepeatAnnotationData.pm
-rw-r--r-- 1 129371 Sep 5 13:04 RepeatMasker.lib
-rw-r--r-- 1 209744 Sep 5 13:04 RepeatMaskerLib.embl
-rw-r--r-- 1 43039 Sep 5 13:09 RepeatMasker.lib.nhr
-rw-r--r-- 1 5272 Sep 5 13:09 RepeatMasker.lib.nin
-rw-r--r-- 1 28191 Sep 5 13:09 RepeatMasker.lib.nsq
-rw-r--r-- 1 0325957 Jan 31 2014 RepeatPeps.lib
-rw-r--r-- 1 1516945 Sep 5 13:09 RepeatPeps.lib.phr
-rw-r--r-- 1 84520 Sep 5 13:09 RepeatPeps.lib.pin
-rw-r--r-- 1 9463396 Sep 5 13:09 RepeatPeps.lib.psq
-rw-r--r-- 1 4401 May 28 2009 RepeatPeps.readme
-rw-r--r-- 1 17204287 Jan 28 2017 RMRBMeta.embl
-rw-r--r-- 1 64450715 Aug 29 2016 taxonomy.dat

I have a working GIRI login. Any advice?

missing 'reads_per_component_and_annotation' in output folder

Hello!

I have run dnaPipeTE on a few different species, most of them around 0.1x coverage. My outputs include everything that should be there, except for the reads_per_component_and_annotation file, and the graphs. The graphs are to be expected as I am running from a supercomputer cluster without R. Do I need R for the reads_per_component_and_annotation file to generate as well?
Is it safe to just use the Counts.txt file to determine the percentage of a TE class in the genome? Is there a way to determine superfamily annotations without this file?

Hopefully this makes sense, I am very new to this. Thank you!
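
For the percentage question, a minimal sketch is shown below. It assumes Counts.txt holds two-field rows of the form `name value` (as in `Simple_repeat 34570800`) and that one row holds the denominator; the row name `Total` and the example numbers are hypothetical and need to be checked against the real file.

```python
def class_percentages(lines, total_key="Total"):
    """Return {class: percent of total} from 'name value' rows.

    total_key names the row holding the denominator. "Total" is an
    assumption and may need to be changed to match the real file.
    """
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # ignore rows that are not 'name value'
        try:
            counts[fields[0]] = float(fields[1])
        except ValueError:
            continue  # ignore rows with a non-numeric value
    total = counts.pop(total_key)
    if total <= 0:
        raise ValueError("total must be positive")
    return {name: 100.0 * value / total for name, value in counts.items()}

# Made-up numbers, only to show the call.
example = ["LTR 250", "Simple_repeat 750", "Total 10000"]
print(class_percentages(example))
```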

Trinity requires access to Java version 1.6 or 1.7

Hi !

I got an error while trying to test dnaPipeTE with the following command: python3 ./dnaPipeTE.py -input test_dataset.fastq -output /home/loutre/dnaPipeTE/try -genome_size 10000000 -genome_coverage 0.1 -sample_number 1
Here is the error message I get:

Start time: Fri Jan 27 10:22:45 2017
sampling file found, skipping sampling...
###################################
### TRINITY to assemble repeats ###
###################################

***** TRINITY iteration 1 *****

Selecting reads for Trinity iteration number 1...
awk: fatal: cannot open file `/home/loutre/dnaPipeTE/try/Trinity_run0/chrysalis/readsToComponents.out.sort' for reading (No such file or directory)
Done

Current settings:
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        8192
coredump(blocks)     0
memory(kbytes)       unlimited
locked memory(kbytes) 64
process              515438
nofiles              1024
vmemory(kbytes)      unlimited
locks                unlimited


Error, Trinity requires access to Java version 1.6 or 1.7.  Currently installed version is: java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
Trinity iteration 1 Done'
renaming Trinity output...
awk: fatal: cannot open file `/home/loutre/dnaPipeTE/try/Trinity_run1/Trinity.fasta' for reading (No such file or directory)
done
/home/loutre/dnaPipeTE/try/Annotation/one_RM_hit_per_Trinity_contigs
/home/loutre/dnaPipeTE/try/Annotation/Best_RM_annot_80-80
/home/loutre/dnaPipeTE/try/Annotation/Best_RM_annot_partial
#######################################
### REPEATMASKER to anotate contigs ###
#######################################

RepeatMasker version open-4.0.6
The RepeatMasker installation directory ($RepeatMaskerConfig::REPEATMASKER_DIR) is incorrectly set in the RepeatMaskerConfig.pm file.  Please open the RepeatMaskerConfig.pm file  and edit the $RepeatMaskerConfig::REPEATMASKER_DIR line.
Traceback (most recent call last):
  File "./dnaPipeTE.py", line 700, in <module>
    RepeatMasker(config['DEFAULT']['RepeatMasker'], args.RepeatMasker_library, args.RM_species, args.cpu, args.output_folder, args.RM_threshold)
  File "./dnaPipeTE.py", line 359, in __init__
    self.repeatmasker_run()
  File "./dnaPipeTE.py", line 378, in repeatmasker_run
    with open(self.output_folder+"/Trinity.fasta.out", 'r') as trinity_handle:
FileNotFoundError: [Errno 2] No such file or directory: '/home/loutre/dnaPipeTE/try/Trinity.fasta.out'
Loutre:~/dnaPipeTE$ pwd 
/home/loutre/dnaPipeTE

I think this is a problem with the Java version. I already have Trinity on my computer, and it worked fine with Java 1.8. My Trinity version is 1.3; the one you install is trinityrnaseq_r2013_08_14.

Do you think it would be possible to update the Trinity version? Or should I install an older Java version?

Issues installing on a cluster/HPC

Hi, I use dnaPipeTE frequently and it's great. I would like to get it working on my university's cluster, but this requires some changes which I cannot work out. Specifically, my cluster has RepeatMasker and Trinity already installed in a module loading system, so I cannot install them again, but I can use the preinstalled copies. How do I change dnaPipeTE to use the Trinity and RepeatMasker that are already in the environment/path (as in, I can use Trinity by just typing Trinity)?

Any help would be great.

Thanks

Installation issue in Linux Redhat 6.2

I have been trying to install dnaPipeTE on a Linux server (Red Hat, version 6.2). After adding the login and password to the file (init.sh), I tried to run it as suggested in the installation instructions and got the following message.

[nishma@login1 dnaPipeTE]$ ./init.sh
GIRINST_USERNAME=: Command not found.
GIRINST_PASSWORD=
: Command not found.
GIRINST_USERNAME: Undefined variable.

The same username and password worked on another Linux computer, but its RAM is too small, so I could not run the analyses there.

How to set -genome_size parameter?

Hello,

For the -genome_size parameter, should we use haploid genome size (e.g. human = 3 Gb) or total DNA content (for human this would be 2N = 6 Gb)?

I am working with asexual organisms with fully collapsed assembly span of ~150 Mb, and am unsure whether to use 150 Mb or 300 Mb for this parameter. Or perhaps it is not so important?

Thanks for your help,
reubwn
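
To see what is at stake in that choice, the sketch below assumes that the amount of sequence sampled is simply `-genome_size` multiplied by `-genome_coverage`. That relationship is an assumption here, not something checked against the dnaPipeTE code, and the function name is hypothetical.

```python
def sampled_bases(genome_size_bp, genome_coverage):
    """Bases sampled under the assumption size * coverage."""
    return round(genome_size_bp * genome_coverage)

# Same coverage, two candidate genome sizes (150 Mb vs 300 Mb).
for size in (150_000_000, 300_000_000):
    print(size, sampled_bases(size, 0.1))
```

Under that assumption, doubling the genome size at a fixed coverage doubles the number of bases sampled, so the two parameters are interchangeable as far as the sample itself is concerned.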

Landscape error

Hi Clement,
I have this issue when I look at the landscape graph. I tried different coverages for a ~2GB genome, and as the number of "families" is high, the landscape graph is not drawn properly. I'm sending you three pictures showing the problem. I guess it's due to the height parameter in the R script, but as I'm not used to working with R, I'm unable to solve it efficiently.
Also, I wanted to know if there is a way to re-run the landscapes.R script. I guess the input files are reads_landscape and factors_and_colors... but are there others?
landscape02X.pdf
landscape008X.pdf
landscape025X.pdf

modify dnaPipeTE.py --JM to --max_memory to support new Trinity versions

Hi

The changes from v1 are great!
I just managed to get it running on a new system and was trying Trinity 2.2 (latest).
dnaPipeTE.py needs to be amended: --max_memory instead of --JM, to make it compatible.

There seems to be another issue at the Butterfly step

Butterfly assemblies are written to /dnaPipeTE/testout/Trinity_run1/Trinity.fasta

Trinity iteration 1 Done'
Traceback (most recent call last):
  File "./dnaPipeTE.py", line 699, in <module>
    Trinity(config['DEFAULT']['Trinity'], config['DEFAULT']['Trinity_memory'], args.cpu, config['DEFAULT']['Trinity_glue'], args.output_folder, sample_files, args.sample_number, args.contig_length)
  File "./dnaPipeTE.py", line 301, in __init__
    self.new_version_correction()
  File "./dnaPipeTE.py", line 329, in new_version_correction
    year = re.search('\d{4}', str(out)).group(0)
AttributeError: 'NoneType' object has no attribute 'group'

any idea how to fix this? my python knowledge is rather limited.
cheers
eckart
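
The AttributeError in the traceback above is what Python raises when `re.search` finds no match and returns `None`, here presumably because the version text contains no four-digit year. A minimal sketch of a guarded lookup follows; the function name and the new-style version string are hypothetical, and this is not a patch to dnaPipeTE.py.

```python
import re

def release_year(version_text):
    """Return the first 4-digit run in version_text, or None if absent."""
    match = re.search(r"\d{4}", str(version_text))
    return match.group(0) if match else None

# Old-style release name, as quoted elsewhere on this page.
print(release_year("trinityrnaseq_r2013_08_14"))
# Hypothetical new-style version string without a year.
print(release_year("Trinity-v2.2.0"))
```

Returning `None` lets the caller decide how to treat a release without a year instead of crashing.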

Re-running analysis

Hi again Clement!
I hope you are OK.
I am writing to ask you about something. I managed to classify some Trinity contigs that were not classified by dnaPipeTE, and I was wondering if there is a way to re-run some of the analyses carried out by the program without running the whole pipeline again. Specifically, I would like to recalculate the Counts.txt file and produce a new repeat landscape.
I believe it is possible, because I have been reading your code and, if I understood correctly, the main statistics for all contigs are already calculated, such as the % divergence of each read to its contig and the number of mapping reads and bases. I guess I could do it by adding the reads mapping to the newly classified contig to reads_landscape from sorted.reads_vs_unannoted.blast.out and re-running the R script (you already explained to me how to do so, hehe).
In the case of the Counts file, I think I could do one of two things: either calculate the new counts by hand, adding the bp of this contig to the corresponding class and subtracting them from "Unclassified", or try to re-run the function "count" that is inside the main Python script.
If I follow the second alternative, would I have to move the reads matching the contig from sorted.reads_vs_unannoted.blast.out to sorted.reads_vs_annoted.blast.out and also include the new classification in one_RM_hit_per_Trinity_contigs, or is there a simpler way?
I am sorry if this is a little confusing, but I thought it might interest you, because some of these things could help if you still want to develop new checkpoints in the pipeline.

Thanks in advance!!
Mylena

ggplot2

Almost at the end of the run I get this message:

#########################################
### OK, lets build some pretty graphs ###
#########################################
Drawing graphs...
null device 
          1 
null device 
          1 
null device 
          1 
null device 
          1 
Error in library(ggplot2) : there is no package called ‘ggplot2’
Execution halted
Done

Am I missing graphics as a result?

Thanks - Claudio

Feature Request: RepeatMasker compatible fasta headers

Hi Clement,

Just a small request based on the work I'm doing. I'm using the dnaPipeTE contigs as a custom library of repeats for my organism for repeatmasker, and I've had to go through and edit the fasta headers in the Annotation file to include the classification in a format compatible with repeatmasker.

Currently an example of a fasta header here is:

>DNA_comp_TRINITY_DN16970_c0_g1_i1

and for repeatmasker compatibility it would look like

>DNA_comp_TRINITY_DN16970_c0_g1_i1#DNA

For deeper level classification it might read

>DNA_comp_TRINITY_DN16970_c0_g1_i1#DNA/Helitron

for example.

As it stands, I have gotten around this with a sed command, but other users may also appreciate this feature, be it as an option or as a default.

Thanks again for such a great tool and for helping out with any previous issues too!
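
For anyone who prefers a script to sed, the header rewrite described above can be sketched as follows. The function name is hypothetical, and the class (and optional family) has to be supplied by the caller; nothing here is derived from dnaPipeTE itself.

```python
def add_rm_class(header, rm_class, family=None):
    """Append '#class' or '#class/family' to a FASTA header line."""
    header = header.rstrip("\n")
    if not header.startswith(">"):
        raise ValueError("not a FASTA header: %r" % header)
    if "#" in header:
        return header  # already tagged, leave untouched
    suffix = rm_class if family is None else "%s/%s" % (rm_class, family)
    return "%s#%s" % (header, suffix)

print(add_rm_class(">DNA_comp_TRINITY_DN16970_c0_g1_i1", "DNA"))
print(add_rm_class(">DNA_comp_TRINITY_DN16970_c0_g1_i1", "DNA", "Helitron"))
```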

TE consensus sequences

Hello,

I want to ask a question. In the output_folder, which file contains the resulting TE consensus sequences?
thanks in advance

Thanks,
Lina Zhao

Error building graphs

Hi, I'm trying to test the installation. All the programs are working.

However, I have some problems getting dnaPipeTE.py to work. I ran the test but got an error. I looked at the code and tried to understand it, but it's getting hard.

Here are the last lines of the output:

#######################################################
### Estimation of Repeat content from blast outputs ###
#######################################################
parsing blastout and adding RM annotations for each read...
Done, results in: blast_out/blastout_final_fmtd_annoted
#########################################
### OK, lets build some pretty graphs ###
#########################################
Drawing graphs...
null device 
          1 
null device 
          1 
null device 
          1 
null device 
          1 
Loading required package: methods
Error in read.table(file1) : no lines available in input
Execution halted
Done
Removing Trinity runs files...
done
Finishin time: Fri Feb 21 16:18:18 2020
########################
#   see you soon !!!   #
########################

At first, I thought it was an R library problem, but I then realized that some files needed for the plotting are empty (sorted_families, reads_landscape and factors_and_colors).

Any idea what's going on Clement?

Thanks,

Nicolás

Installation issue: blastx & makeblastdb not found

Hi Clement,

I'm trying to install dnaPipeTE in my directory on a shared cluster (it cannot be installed like the other tools because it has to be run from the installation folder).

The installation runs fine until the blast step; I entered the path to the locally installed blast:
/home/datawork-lep-s/deepadapt/dnaPipeTE/bin/ncbi-blast-2.2.28+/bin

but I get the following errors:
/home/datawork-lep-s/deepadapt/dnaPipeTE/bin/ncbi-blast-2.2.28+/bin/makeblastdb does not exist
/home/datawork-lep-s/deepadapt/dnaPipeTE/bin/ncbi-blast-2.2.28+/bin/blastx does not exist

Indeed, blastx and makeblastdb are not in this directory, nor in the ncbi-rmblastn-2.2.28/bin directory.

I have a local installation of blast if that helps?
Could you help me solve this issue?

Thanks in advance, and I'm looking forward to using this tool!
Alexandra

Reading Trinity output files

Hi,
I'm having some trouble understanding the length of repeats given in the Trinity.fasta.out files.
Under repeat 'begin' and 'end' (which I'm assuming give the length when subtracted from one another), some values appear in parentheses () and sometimes do not make sense (i.e. begin larger than end).
Could you please clarify?
Cheers
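
As background to this question: in the RepeatMasker `.out` table, a value in parentheses is a count of bases left beyond the match rather than a coordinate, and for matches on the complement strand (`C`) the three repeat-position columns are ordered (left), begin, end, which is why "begin" can be larger than "end". The sketch below parses one row on that basis; the row is an illustrative one in the style of the RepeatMasker documentation, not dnaPipeTE output, and the function name is hypothetical.

```python
def parse_rm_out_line(line):
    """Parse one data row of a RepeatMasker .out table.

    Returns the matched length on the query and on the repeat
    consensus. Parenthesised fields are 'bases left' counts.
    """
    f = line.split()
    q_begin, q_end = int(f[5]), int(f[6])
    strand = f[8]  # '+' or 'C' (complement)
    # Fields 12-14 are begin, end, (left) on '+', and
    # (left), begin, end on 'C': drop the parenthesised one.
    coords = [int(x) for x in f[11:14] if not x.startswith("(")]
    r_lo, r_hi = min(coords), max(coords)
    return {
        "query": f[4],
        "strand": strand,
        "repeat": f[9],
        "class": f[10],
        "query_len": q_end - q_begin + 1,
        "repeat_len": r_hi - r_lo + 1,
    }

row = ("1320 15.6 6.2 0.0 HSU08988 6563 6781 (22462) "
       "C MER7A DNA/MER2_type (0) 336 103")
print(parse_rm_out_line(row))
```

Taking the minimum and maximum of the two plain coordinates gives a length that is correct for both strands.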

gz compression of inputfiles

Hi again.

I seem to have an issue with the handling of compressed fastq input and get this error:
Start time: Fri Sep 9 16:26:32 2016
gz compression detected for /reads/xy.fastq.gz
counting reads number...
Traceback (most recent call last):
  File "/progz/dnaPipeTE/dnaPipeTE.py", line 697, in <module>
    Sampler = FastqSamplerToFasta(args.input_file, args.sample_size, args.genome_size, args.genome_coverage, args.sample_number, args.output_folder, False)
  File "/progz/dnaPipeTE/dnaPipeTE.py", line 156, in __init__
    self.get_sampled_id(self.fastq_R1)
  File "/progz/dnaPipeTE/dnaPipeTE.py", line 188, in get_sampled_id
    with gzip.open(file_name+".gz", 'rt') as file1 :
  File "/usr/lib64/python3.2/gzip.py", line 46, in open
    return GzipFile(filename, mode, compresslevel)
  File "/usr/lib64/python3.2/gzip.py", line 156, in __init__
    raise IOError("Mode " + mode + " not supported")
IOError: Mode rt not supported

The input was an xy.fastq.gz file generated by "gzip xy.fastq".
Any idea if it's a Python problem or a dnaPipeTE problem?
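
For context, text mode (`'rt'`) in `gzip.open` was only added in Python 3.3, which matches the Python 3.2 paths in the traceback. The sketch below shows one way to read a gzipped FASTQ as text without `'rt'`, by opening in binary mode and wrapping the stream. The function name and the tiny test file are made up for illustration, and this has not been tried inside dnaPipeTE.

```python
import gzip
import io
import os
import tempfile

def count_fastq_reads(path):
    """Count reads in a gzip-compressed FASTQ (4 lines per record)."""
    n_lines = 0
    # Open in binary mode and wrap it, which avoids mode 'rt'.
    with gzip.open(path, "rb") as raw:
        for _ in io.TextIOWrapper(raw, encoding="ascii"):
            n_lines += 1
    return n_lines // 4

# Tiny self-made file, only to show the call.
tmp = os.path.join(tempfile.mkdtemp(), "xy.fastq.gz")
with gzip.open(tmp, "wb") as out:
    out.write(b"@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n")
print(count_fastq_reads(tmp))
```

Upgrading to Python 3.3 or later is the other obvious route, since `'rt'` is accepted there.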

trinity install as part of dnapipeTE

The code is ./init.sh

I'm getting the following error during the trinity install:

g++ -W -Wall -Wno-unused -Wno-deprecated -ansi -pedantic -Wno-long-long -fno-nonansi-builtins -Wctor-dtor-privacy -Wsign-promo -Woverloaded-virtual -Wendif-labels -O3 -ggdb3 -DMAKE_DATE='"Wed 17 Mar 2021 08:40:09 PM UTC "' -DMAKE_OS_RELEASE='"5.3.0-64-generic"' -DMAKE_RELEASE='"3.0"' -DNEW_MAKEFILE -imacros system/BigFileDefines.h -pthread -ftemplate-depth-30 -fno-strict-aliasing -mieee-fp -fopenmp -c ./aligns/KmerAlignCore.cc -o obj/aligns/KmerAlignCore.o
./aligns/KmerAlignCore.cc:6:10: fatal error: aligns/KmerAlignCore.h: No such file or directory
6 | #include "aligns/KmerAlignCore.h"
| ^~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make[1]: *** [Makefile:474: aligns/KmerAlignCore.o] Error 1

From what I can tell, and that is not a lot, this is perhaps related to newer versions of gcc being incompatible with something about this. I can get Trinity installed just fine in a conda environment, but can't figure out how to get dnaPipeTE to look for that instead of its own specific install. If you know either a) how to fix this error or b) how to make dnaPipeTE use a system install of Trinity (if it even can, I have no idea how dependent it is on that particular version of Trinity) I would appreciate it.

Calculating TEs divergence

Hi,
I have used dnaPipeTE on a large collection of wild barley I have, comprised of several populations of the same species. I've used the cultivated barley 'repbase' database as reference.
I was wondering whether the reads_landscape files can help in assessing population, rather than species, divergence (based on TEs) and if so, should I be using a different species database as reference?
Also, the landscape graph was not produced for any of the samples I've analysed although the landscape_reads files were. Is that an indication of the process not finishing?

Thank you for your time

FileNotFoundError: [Errno 2] No such file or directory: '/dnapipete_3/Trinity.fasta.out'

Hello!

I am trying to run dnaPipeTE with publicly available WGS data.

I will greatly appreciate it if you can help me with the following error message:

parseTagData: ID field not to EMBL spec "SNAP-OL2 repeatmasker; DNA; ???; BP.
" from DE RepbaseID: SNAP-OL2XX

at /home/Softwares/dnaPipeTE/bin/RepeatMasker/RepeatMasker line 7611.
Traceback (most recent call last):
  File "./dnaPipeTE.py", line 698, in <module>
    RepeatMasker(config['DEFAULT']['RepeatMasker'], args.RepeatMasker_library, args.RM_species, args.cpu, args.output_folder, args.RM_threshold)
  File "./dnaPipeTE.py", line 381, in __init__
    self.repeatmasker_run()
  File "./dnaPipeTE.py", line 400, in repeatmasker_run
    with open(self.output_folder+"/Trinity.fasta.out", 'r') as trinity_handle:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/data/Data/dnapipete_3/Trinity.fasta.out'

Landscapes.R plot issue

Thanks for the new update! It runs very smoothly; I just found a problem with the landscapes.R plotting step. The "factors_and_colors" file does not always have the correct number of columns in each entry, so R cannot read it. For example, the entry "DNA/Academ-1" does not have a class associated with it (in my file), so the second column is left empty there. I made the plot manually after fixing the file, but thought you might like to be aware of it.
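
Rows like the one described above can be located before R is run. The sketch below flags rows whose field count differs from what is expected or that contain an empty field. The tab delimiter, the three-column layout and the example rows are assumptions for illustration, not the documented format of factors_and_colors.

```python
def rows_with_wrong_width(lines, expected_cols, sep="\t"):
    """Return (line number, text) for rows with missing or empty fields."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        fields = text.split(sep)
        if len(fields) != expected_cols or any(f == "" for f in fields):
            bad.append((lineno, text))
    return bad

# Hypothetical rows: family, class, colour.
example = [
    "LTR/Gypsy\tLTR\t#FF0000",
    "DNA/Academ-1\t\t#0000FF",  # class column left empty
]
print(rows_with_wrong_width(example, expected_cols=3))
```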

influence of the input parameters on estimated TE loads

Hello again,

We (@reubwn, @jensbast and me) are wondering about the influence of the input parameters (genome size and sampling coverage) on the estimate of the total TE load.

The reason we started to dig into it was that dnaPipeTE finds way fewer TEs than assembly-based approaches in bdelloid rotifers (these reads: ERR2135445; genomes are published here). They are difficult genomes, degenerate tetraploids, and therefore it was not that clear to us which genome size we should go for. We tested three values of input genome size and coverage, and basically the more nucleotides we sample, the higher the fraction of TEs annotated. You can read details in this thread.

We hoped that it's just the rotifer that is weird, but we tried the same with a genome that is far more sane, a parasitoid wasp (SRR7028347), and the pattern was much the same: the more nucleotides sampled, the higher the TE load detected (details later in the same thread).

This is a bit worrying to us, as the TE load should not depend on the depth of sampling (at least we would hope not). We are running out of ideas for how to explain the pattern. Could you help us make sense of it?

Best,
Kamil
