
Comments (5)

SionBayliss commented on August 12, 2024

Hi Alex,

That looks like an issue with missing sequences, which can sometimes happen when erroneous characters/headers make it through the gene filtering step. You will notice some files didn't pass QC, so something might be up with the other files too. Did you annotate them with prokka?

I am happy to have a look at a subset of files if we can't find a solution.

All the best,
Sion

from pirate.

alexweisberg commented on August 12, 2024

Hi Sion,

Thanks for looking into it. Roughly half of the files were annotated by prokka (we sequenced them), and the other half were converted from NCBI gbk files.

I tried a few small subsets run with a few of our prokka-annotated genomes and a few NCBI genomes, and they completed successfully. So there may be a small subset of genomes that are having some kind of specific issue.

I found one of the locus tags that was missing from the expanded pangenome mcl file (A4_00008 from input file A4.gff). Here is what I found when I searched for it in the input file and the modified gff file:
[image: pirate_duplicate_id]
The locus tag is somehow included in the next gene region in the modified_gff file version. When I include this gff file in a small run of only 5 genomes, it runs to completion correctly though, and the modified_gff file has this weird ID too.
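For anyone wanting to repeat this kind of check without grepping by hand, here is a minimal Python sketch. It is not part of PIRATE; the function name `features_mentioning` and the sample lines are made up for illustration, and it assumes a standard 9-column GFF3 with `key=value` attributes separated by `;`. It lists every feature whose attribute values contain a given locus tag, so a tag that turns up in a neighbouring gene region is easy to spot.

```python
def features_mentioning(gff_lines, tag):
    """Return (line_number, feature_type, attributes) for each feature
    that has `tag` as the value of any attribute (ID, locus_tag, prev_ID, ...)."""
    hits = []
    for number, raw in enumerate(gff_lines, start=1):
        line = raw.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # sequence section starts here, no more features
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue  # skip comments, directives and malformed lines
        values = [
            pair.split("=", 1)[1]
            for pair in cols[8].split(";")
            if "=" in pair
        ]
        if tag in values:
            hits.append((number, cols[2], cols[8]))
    return hits


# Hypothetical example lines, not taken from the real A4.gff
example = [
    "##gff-version 3",
    "contig1\tprokka\tCDS\t10\t300\t.\t+\t0\tID=A4_00008;locus_tag=A4_00008",
    "contig1\tprokka\tCDS\t400\t900\t.\t+\t0\tID=A4_00009;prev_ID=A4_00008",
]
for hit in features_mentioning(example, "A4_00008"):
    print(hit)
```

Running it on both the input GFF and the modified GFF and comparing the hits should show whether the tag has ended up attached to more than one feature.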

I think there may be an issue due to the large size of the dataset and parallel threads on our cluster. I will try removing genomes until it runs to completion.

Best,
Alex


SionBayliss commented on August 12, 2024

Hi Alex,

I don't expect it is a problem with the cluster (but I might be wrong). I expect that this is an issue with a subset of files from the NCBI that have really weird/erroneous annotation. This can happen sometimes as they have not been annotated consistently or using the same pipelines. You might want to reannotate the NCBI files using prokka and see if PIRATE completes.

Those new fields in the GFF are created by PIRATE. PIRATE tries to standardise locus tags in order to avoid issues with annotation. One of the early scripts in the pipeline renames every locus_tag/ID to "name of the genome"_"ascending number of the CDS in the file" (e.g. the first CDS is called genomename_0001). The old locus tag/ID is moved to the prev_ID/prev_locustag field in the modified GFF file in the PIRATE folder. By default it only considers CDS features. This isn't a terribly elegant way to fix the issue, but I was encountering many problems similar to yours where inconsistent annotation affected the outputs.
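To make the renaming scheme concrete, here is a rough Python sketch of the idea. This is not PIRATE's actual code (PIRATE's own scripts are the reference); the function name `standardise_locus_tags` and the `width` parameter are invented for illustration, and the 4-digit padding is only an assumption taken from the genomename_0001 example above.

```python
def standardise_locus_tags(gff_lines, genome, width=4):
    """Rename ID/locus_tag of each CDS to <genome>_<counter> and keep the
    old values in prev_ID/prev_locustag. Other lines pass through unchanged."""
    out, count, in_fasta = [], 0, False
    for raw in gff_lines:
        line = raw.rstrip("\n")
        if line.startswith("##FASTA"):
            in_fasta = True  # everything after this is sequence, not features
        cols = line.split("\t")
        if in_fasta or line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            out.append(line)
            continue
        count += 1
        new_id = f"{genome}_{count:0{width}d}"  # zero-padded counter (assumed width)
        # Parse key=value attribute pairs, preserving their order
        attrs = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        renamed = {"ID": new_id, "locus_tag": new_id}
        if "ID" in attrs:
            renamed["prev_ID"] = attrs.pop("ID")
        if "locus_tag" in attrs:
            renamed["prev_locustag"] = attrs.pop("locus_tag")
        renamed.update(attrs)  # remaining attributes (product, gene, ...) follow
        cols[8] = ";".join(f"{key}={value}" for key, value in renamed.items())
        out.append("\t".join(cols))
    return out


# Hypothetical input lines for a genome called "A4"
sample = [
    "##gff-version 3",
    "contig1\tprokka\tCDS\t1\t90\t.\t+\t0\tID=ABC_1;locus_tag=ABC_1;product=foo",
    "contig1\tprokka\tCDS\t100\t190\t.\t-\t0\tID=ABC_2;locus_tag=ABC_2",
]
for line in standardise_locus_tags(sample, "A4"):
    print(line)
```

Because the counter depends only on the order of CDS features in the file, the new names are consistent no matter which pipeline produced the original annotation, which is the point of the standardisation step.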

I hope that helps.

All the best,
Sion


alexweisberg commented on August 12, 2024

Hi,

I re-annotated my genomes with prokka, and I was able to run an analysis of >1000 genomes successfully with 32 CPU cores in 16 hours. Thanks for helping me get it set up!

I noticed in the manual that the section on converting the output to a binary format ("Convert to binary presence-absence or count") likely has a typo. The command referring to generating a paralog presence/absence table should probably refer to the "paralogs_to_Rtab.pl" script rather than the "PIRATE_to_Rtab.pl" script. Currently both commands are identical.

Thanks,
Alex


SionBayliss commented on August 12, 2024

Hi Alex,

I am glad I could help.

All the best,
Sion

