Greetings, I am writing because I'm having problems with the "Counts" part of the analysis:

issue in the last part of analysis about dnapipete HOT 5 CLOSED

clemgoub commented on August 15, 2024

issue in the last part of analysis

from dnapipete.

Comments (5)

clemgoub commented on August 15, 2024

Hi!

This is odd. Are you using a custom TE library? I think what you refer to as the total TE content is actually the total number of bp in the sample, the rest being NA. This is why I wouldn't be surprised by this result if your library is "in-house" and does not match the RepeatMasker format (the TE classes are not recognized).
If that's not the case, can you send me your log/error files as well as your command line?

Thanks,

Clément

DNAcastigator commented on August 15, 2024

python3 ./dnaPipeTE.py -input /home/angeloruggieri/Locusta_migratoriaSRR764581_trimmed.fastq.gz -output /home/angeloruggieri/Locusta_migratoriaSRR764581_trimmed.fastq.gz_output_0.5x -cpu 15 -genome_size 5868000000 -genome_coverage 0.5

Where can I find the log/error file?
I found the problem for the current run: the device ran out of space during the command "sort -k1,1 -k12,12nr -k11,11n", so the file "sorted.reads_vs_annoted.blast.out" is empty. All the files in "Annotation" are full. I'm pretty sure I can't solve this storage space problem, so I can't simply run it again and hope it works. Is there a way to know whether all the other files were written properly? Can I somehow estimate the information I'm missing?
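One rough way to approach the "were the other files written properly" question is to scan the output directory for zero-byte files and check how much free space is left on the device. The sketch below is only an illustration: the function names `find_empty_files` and `free_gib` are hypothetical and not part of dnaPipeTE, and a non-empty file can still be truncated, so this check catches only the most obvious failures.

```python
import shutil
from pathlib import Path


def find_empty_files(output_dir):
    """Return the sorted relative paths of zero-byte files under output_dir."""
    root = Path(output_dir)
    return sorted(
        str(p.relative_to(root))
        for p in root.rglob("*")
        if p.is_file() and p.stat().st_size == 0
    )


def free_gib(path):
    """Return the free space, in GiB, on the device that holds path."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    import sys

    # Usage: python3 check_output.py <dnaPipeTE output directory>
    out_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    print(f"Free space: {free_gib(out_dir):.1f} GiB")
    for name in find_empty_files(out_dir):
        print(f"EMPTY: {name}")
```

An empty list only means that no file has zero size; it does not prove that every step finished.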

clemgoub commented on August 15, 2024

I see,

Sorry, this is a major issue with large, repeat-rich genomes: sorting these files takes forever with the current version.

For your case, I suggest you use 0.1X coverage. Even though it would be informative to go to 0.5X, I am pretty sure that you are going to reach saturation. For example, I used this level of coverage with the Asian tiger mosquito, which has a ~1 Gbp genome and 50% repeats, including 30% TEs.

You can try several runs from 0.01X to 0.1X and compare the total % of TEs discovered to see if you reach a plateau. If you don't, try going beyond 0.1X. If dnaPipeTE keeps crashing due to a memory problem, please let me know. I am currently working on an update; I will take your issue into account and try to make the sorting possible when these cases occur.
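The coverage sweep suggested above could be scripted roughly as follows. This is a minimal sketch, not part of dnaPipeTE: it only builds command lines from the flags shown in the command earlier in this thread, and the helper names, the intermediate coverage values, and the plateau rule (last two runs within one percentage point) are all assumptions for illustration. The total TE percentages have to be read from each run's results by hand.

```python
import shlex


def build_commands(reads, out_prefix, genome_size, coverages, cpu=15):
    """Build one dnaPipeTE command line (as an argument list) per coverage value."""
    commands = []
    for cov in coverages:
        commands.append([
            "python3", "./dnaPipeTE.py",
            "-input", reads,
            "-output", f"{out_prefix}_{cov}x",
            "-cpu", str(cpu),
            "-genome_size", str(genome_size),
            "-genome_coverage", str(cov),
        ])
    return commands


def reached_plateau(te_percentages, tolerance=1.0):
    """Assumed rule: the last two runs differ by less than `tolerance` points."""
    if len(te_percentages) < 2:
        return False
    return abs(te_percentages[-1] - te_percentages[-2]) < tolerance


if __name__ == "__main__":
    # Hypothetical file names; the coverage steps are illustrative only.
    for cmd in build_commands("reads.fastq.gz", "run", 5868000000,
                              [0.01, 0.025, 0.05, 0.1]):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the commands rather than launching them keeps the sketch safe to run; each line can be pasted into a shell once the paths are correct.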

Best,

Clément

clemgoub commented on August 15, 2024

And to reply to your question: indeed, the way to know the quantity of each repeat would technically be to sort the file reads_vs_annoted.blast.out, ordering the hits of each query (read) (1) by name, then (2) by blast score, and finally (3) by e-value in case of a tie. The problem is that the command I have in the script runs forever. The rest of the pipeline then uses sorted.reads_vs_annoted.blast.out as input for the remaining analyses. In particular, it not only takes the best hit per read: if a read overlaps two TEs (for example, a TE inserted into another TE), dnaPipeTE has a script that saves this information. So a regular "best hit" sort won't produce the same output.
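The ordering described above (read name, then score descending, then e-value ascending) can be illustrated on a few rows. The sketch assumes tab-separated BLAST output in which column 1 is the read name, column 11 the e-value and column 12 the bit score, which matches the key positions in the `sort -k1,1 -k12,12nr -k11,11n` command quoted earlier; `sort_blast_hits` is a hypothetical name. It sorts in memory, so it does not solve the problem for very large files, and it does not reproduce the overlapping-TE handling mentioned above.

```python
def sort_blast_hits(lines):
    """Sort tabular BLAST rows by read name, bit score (desc), then e-value (asc)."""
    def key(line):
        fields = line.rstrip("\n").split("\t")
        # fields[0] = read name, fields[11] = bit score, fields[10] = e-value
        return (fields[0], -float(fields[11]), float(fields[10]))
    return sorted(lines, key=key)


if __name__ == "__main__":
    # Made-up rows for illustration only.
    rows = [
        "read2\tTE_A\t90\t100\t0\t0\t1\t100\t1\t100\t1e-20\t150",
        "read1\tTE_B\t90\t100\t0\t0\t1\t100\t1\t100\t1e-30\t200",
        "read1\tTE_C\t90\t100\t0\t0\t1\t100\t1\t100\t1e-10\t300",
    ]
    for row in sort_blast_hits(rows):
        print(row)
```

Python compares read names by code point, whereas `sort` follows the current locale, so the order of names can differ slightly between the two. If the disk that filled up during the original run was the temporary directory, GNU sort's `-T` option can point its temporary files to another location; whether that was the bottleneck here is an assumption.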

clemgoub commented on August 15, 2024

Closing this issue since it concerns the non-container version. Please DM if further support is needed. Thank you!
