Hi, I've set up an initial run with ~800 genomes, using the diamond mode to speed thin

Error after MCL clustering step (pirate issue, closed)

alexweisberg commented on August 12, 2024

Error after MCL clustering step

Comments (5)

SionBayliss commented on August 12, 2024

Hi Alex,

That looks like an issue with missing sequences which can sometimes happen when erroneous characters/headers make it through the gene filtering step. You will notice some files didn't pass QC so something might be up with the other files. Did you annotate them with prokka?

I am happy to have a look at a subset of files if we can't find a solution.

All the best,
Sion

alexweisberg commented on August 12, 2024

Hi Sion,

Thanks for looking into it. Roughly half of the files were annotated by prokka (we sequenced them), and the other half were converted from NCBI gbk files.

I tried a few small subset runs, each with a few of our prokka-annotated genomes and a few NCBI genomes, and they completed successfully. So there may be a small subset of genomes with some specific issue.

I found one of the locus tags that was missing from the expanded pangenome mcl file (A4_00008 from input file A4.gff). Here is what I found when I searched for it in the input file and the modified gff file:

The locus tag is somehow included in the next gene region in the modified_gff file version. When I include this gff file in a small run of only 5 genomes, it runs to completion correctly though, and the modified_gff file has this weird ID too.

I think there may be an issue due to the large size of the dataset and parallel threads on our cluster. I will try removing genomes until it runs to completion.

Best,
Alex

SionBayliss commented on August 12, 2024

Hi Alex,

I don't expect it is a problem with the cluster (but I might be wrong). I expect that this is an issue with a subset of files from the NCBI that have really weird/erroneous annotation. This can happen sometimes as they have not been annotated consistently or using the same pipelines. You might want to reannotate the NCBI files using prokka and see if PIRATE completes.

Those new fields in the GFF are created by PIRATE. PIRATE tries to standardise locus tags in order to avoid issues with annotation. One of the early scripts in the pipeline renames every locus_tag/ID to the "name of the genome"_"ascending number of the CDS in the file" (e.g. the first CDS is called genomename_0001). The old locus tag/ID is moved to the prev_ID/prev_locustag field in the modified GFF file in the PIRATE folder. By default it only considers CDS features. This isn't a terribly elegant fix, but I was encountering many issues similar to yours, where inconsistent annotation affected the outputs.
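To make the renaming scheme described above concrete, here is a minimal Python sketch. It is not PIRATE's own code: the function name `standardise_locus_tags`, the `width` padding parameter, and the example GFF lines are all hypothetical, and it assumes tab-separated GFF3 feature lines with `key=value` attributes in the ninth column. It only illustrates the idea of numbering CDS features in file order and keeping the old identifiers under prev_ID/prev_locustag.

```python
def standardise_locus_tags(gff_lines, genome_name, width=4):
    """Rename CDS IDs to <genome_name>_<n>; keep old values as prev_ID/prev_locustag.

    Illustrative only: 'width' (zero padding) is an assumed parameter,
    not a documented PIRATE option.
    """
    renamed = []
    cds_count = 0
    for line in gff_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        # Pass through comments, malformed lines and non-CDS features unchanged.
        if line.startswith("#") or len(fields) != 9 or fields[2] != "CDS":
            renamed.append(line)
            continue
        cds_count += 1
        new_id = f"{genome_name}_{cds_count:0{width}d}"
        attrs = []
        for attr in fields[8].split(";"):
            if not attr:
                continue
            key, _, value = attr.partition("=")
            if key == "ID":
                attrs.append(f"prev_ID={value}")
            elif key == "locus_tag":
                attrs.append(f"prev_locustag={value}")
            else:
                attrs.append(attr)
        # New identifiers go first, followed by the preserved attributes.
        fields[8] = ";".join([f"ID={new_id}", f"locus_tag={new_id}"] + attrs)
        renamed.append("\t".join(fields))
    return renamed


# Hypothetical input: two CDS features and one tRNA feature.
example = [
    "##gff-version 3",
    "contig1\tProdigal\tCDS\t1\t300\t.\t+\t0\tID=ABC_1;locus_tag=ABC_1;product=hypothetical protein",
    "contig1\tProdigal\ttRNA\t400\t470\t.\t+\t.\tID=ABC_2",
    "contig1\tProdigal\tCDS\t500\t800\t.\t-\t0\tID=ABC_3;locus_tag=ABC_3",
]
result = standardise_locus_tags(example, "A4")
for row in result:
    print(row)
```

In this sketch the tRNA line is left untouched, which mirrors the "only considers CDS features" default mentioned above. A CDS line whose attributes are malformed would be where a mismatch between the input and the modified GFF could creep in.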

I hope that helps.

All the best,
Sion

alexweisberg commented on August 12, 2024

Hi,

I re-annotated my genomes with prokka, and I was able to run an analysis of >1000 genomes successfully with 32 CPU cores in 16 hours. Thanks for helping me get it set up!

I noticed in the manual that the section on converting the output to a binary format ("Convert to binary presence-absence or count") likely has a typo. The command referring to generating a paralog presence/absence table should probably refer to the "paralogs_to_Rtab.pl" script rather than the "PIRATE_to_Rtab.pl" script. Currently both commands are identical.

Thanks,
Alex

SionBayliss commented on August 12, 2024

Hi Alex,

I am glad I could help.

All the best,
Sion
