
Comments (5)

clemgoub commented on August 15, 2024

Hi!

This is odd. Are you using a custom TE library? I think what you refer to as the total TE content is actually the total number of bp in the sample, the rest being NA. I would not be surprised by this result if your library is "in house" and does not match the RepeatMasker format (in which case the TE classes are not recognized).
If that's not the case, can you send me your log/error files as well as your command line?

Thanks,

Clément

from dnapipete.

DNAcastigator commented on August 15, 2024

python3 ./dnaPipeTE.py -input /home/angeloruggieri/Locusta_migratoriaSRR764581_trimmed.fastq.gz -output /home/angeloruggieri/Locusta_migratoriaSRR764581_trimmed.fastq.gz_output_0.5x -cpu 15 -genome_size 5868000000 -genome_coverage 0.5

Where can I find the log/error file?
I found the problem for the current run: the device ran out of space during the command "sort -k1,1 -k12,12nr -k11,11n", so the file "sorted.reads_vs_annoted.blast.out" is empty. All the files in "Annotation" are non-empty. I'm pretty sure I can't solve this disk space problem, so I can't just run it again and hope it works. Is there a way to know whether all the other files were properly written? Can I somehow estimate the information I'm missing?


clemgoub commented on August 15, 2024

I see,

Sorry, it's a major issue with large genomes with a lot of repeats... Sorting these files is endless with the current version.

For your case, I suggest you use 0.1X coverage. Even though it would be informative to go to 0.5X, I am pretty sure that you will reach saturation before that. For example, I used this level of coverage with the Asian Tiger Mosquito, which has a ~1 Gbp genome and 50% repeats, including 30% of TE.

You can try several runs, from 0.01X to 0.1X, and compare the total % of TE discovered to see whether you reach a plateau. If you don't, try going beyond 0.1X. If dnaPipeTE keeps crashing due to a memory problem, please let me know. I am currently working on an update; I will take your issue into account and try to make the sorting possible when such cases occur.
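For illustration, the coverage scan described above could be scripted roughly as follows. This is only a sketch: the input file name is a hypothetical placeholder, the intermediate coverage values (0.02 and 0.05) are an assumption, and the flags are the ones from the command line posted earlier in this thread. By default it only prints the commands, so nothing runs until you change `runner`.

```shell
#!/bin/sh
# Hypothetical placeholders: adjust to your own data and install location.
reads="reads_trimmed.fastq.gz"
genome_size=5868000000   # genome size used earlier in this thread
cpus=15

# Dry run by default: each command is printed instead of executed.
# Set runner="" to actually launch dnaPipeTE.
runner="echo"

scan_coverage() {
    # Coverage values between 0.01X and 0.1X (intermediate steps are an assumption).
    for cov in 0.01 0.02 0.05 0.1; do
        $runner python3 ./dnaPipeTE.py \
            -input "$reads" \
            -output "${reads}_output_${cov}x" \
            -cpu "$cpus" \
            -genome_size "$genome_size" \
            -genome_coverage "$cov"
    done
}

scan_coverage
```

Each run writes to its own output folder, so the total TE percentage of the runs can be compared afterwards to see whether it levels off.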

Best,

Clément


clemgoub commented on August 15, 2024

And to reply to your question: indeed, the way to know the quantity of each repeat would technically be to sort the file reads_vs_annoted.blast.out, ordering the hits of each query (read) (1) by name, then (2) by blast score, and finally (3) by e-value in case of a tie. The problem is that, with the command I have in the script, it runs forever. The rest of the script then uses sorted.reads_vs_annoted.blast.out as input for the remaining analyses. In particular, it does not only take the best hit per read: if a read overlaps two TEs (for example, a TE inserted into another TE), dnaPipeTE has a script that saves this information. So a regular "best hit" sort won't produce the same output.
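As a rough sketch of that three-key sort, here is the same ordering applied to a tiny made-up BLAST table. It is not the command dnaPipeTE itself runs. It assumes GNU sort, and it adds two things the script's command does not have: `-T` to put sort's temporary files in a directory of your choice (by default they go to `$TMPDIR` or `/tmp`, which may be the device that fills up) and `-S` to cap the memory buffer. It also uses `g` instead of `n` for the e-value key, since GNU sort's `-n` does not parse exponent notation such as `1e-40`. The toy reads, TE names, and the 1G buffer size are assumptions for illustration.

```shell
#!/bin/sh
# Toy stand-in for reads_vs_annoted.blast.out (12 tab-separated columns:
# column 1 = read name, column 11 = e-value, column 12 = bit score).
workdir=$(mktemp -d)
mkdir "$workdir/sort_tmp"          # in practice: a directory on a disk with free space
tab=$(printf '\t')

printf '%s\n' \
    'read2 TE_A 90.0 100 10 0 1 100 1 100 1e-20 150' \
    'read1 TE_B 95.0 100 5 0 1 100 1 100 1e-30 180' \
    'read1 TE_A 98.0 100 2 0 1 100 1 100 1e-40 180' \
    'read1 TE_C 85.0 100 15 0 1 100 1 100 1e-10 90' |
    tr ' ' '\t' > "$workdir/reads_vs_annoted.blast.out"

# (1) read name, (2) bit score descending, (3) e-value ascending on ties.
# LC_ALL=C gives plain byte-order comparison for the name key.
LC_ALL=C sort -t "$tab" -T "$workdir/sort_tmp" -S 1G \
    -k1,1 -k12,12nr -k11,11g \
    "$workdir/reads_vs_annoted.blast.out" \
    > "$workdir/sorted.reads_vs_annoted.blast.out"

# Show read and TE name of each hit, in sorted order.
sorted_hits=$(cut -f 1,2 "$workdir/sorted.reads_vs_annoted.blast.out")
printf '%s\n' "$sorted_hits"

rm -rf "$workdir"
```

For read1, the two hits with the same score are ordered by e-value, so TE_A comes before TE_B, followed by the lower-scoring TE_C; all hits of a read stay together, which is what the downstream overlap step needs.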


clemgoub commented on August 15, 2024

Closing this issue since it concerns the non-container version. Please DM if further support is needed. Thank you!
