Hi, After finishing running PIRATE, there is an output file called c

questions on output files about pirate HOT 4 CLOSED

sionbayliss commented on August 12, 2024

questions on output files

from pirate.

Comments (4)

SionBayliss commented on August 12, 2024

The PIRATE.gene_families.tsv file has more information on the particular genes included in the alignment. Genes are considered to be core when they are present in >95% of all isolates. See the README for more details on the output files.

For an amino acid alignment you can either concatenate the aligned amino acid sequences in the feature_sequences/ directory (*.aa.fasta) or translate the core alignment file directly. That file retains the codon (triplet) structure of the amino acid alignment, so it should be suitable for translation by other software without any preprocessing.
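As a rough illustration of the second option, here is a minimal sketch of translating a gapped codon alignment with only the Python standard library. It assumes that gaps in the core alignment fall on whole codons (as described above) and uses the standard codon table; the function names `translate_aligned` and `translate_fasta` are made up for this example and are not part of PIRATE. Codons that contain ambiguous bases or partial gaps are written as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_aligned(nt_seq):
    """Translate one aligned nucleotide sequence codon by codon.

    '---' becomes '-', any codon not in the table (ambiguous bases,
    partial gaps) becomes 'X'. Trailing bases that do not fill a
    complete codon are ignored.
    """
    nt_seq = nt_seq.upper().replace("U", "T")
    usable = len(nt_seq) - len(nt_seq) % 3
    residues = []
    for i in range(0, usable, 3):
        codon = nt_seq[i:i + 3]
        if codon == "---":
            residues.append("-")
        else:
            residues.append(CODON_TABLE.get(codon, "X"))
    return "".join(residues)


def translate_fasta(in_path, out_path):
    """Translate every record of a nucleotide FASTA alignment (hypothetical helper)."""
    records = []  # list of [header, sequence] pairs, in file order
    with open(in_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                records.append([line, ""])
            elif line and records:
                records[-1][1] += line
    with open(out_path, "w") as out:
        for header, seq in records:
            out.write(f"{header}\n{translate_aligned(seq)}\n")
```

Something like `translate_fasta("core_alignment.fasta", "core_alignment.aa.fasta")` would then write the translated alignment; the output file name here is only an example.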

limin321 commented on August 12, 2024

Hi S,

That's a very helpful suggestion. I think I can get all the core gene names from PIRATE.gene_families.tsv and core_alignment.fasta. It seems there is no direct way to filter the table for genes that are present in >95% of all isolates. Do you have any suggestions for how to do that? The only way I can think of right now is to BLAST the core gene alignment against the whole genome sequences to find out which genes are core.

I would like to extract the core gene aa sequences and concatenate them.
However, when I open PIRATE.gene_families.tsv in Excel, the first two columns are allele_name (e.g. g031830_000006) and gene_family (g031830), whereas the *.aa.fasta files in the feature_sequences folder are named like g031830.aa.fasta.

Also, there is a g000001.aa.fasta file in the feature_sequences folder, but I don't see g000001 anywhere in PIRATE.gene_families.tsv.

In the picture, you will see that the smallest gene_family name is g000091, yet there are plenty of *.aa.fasta files starting from g000001.aa.fasta; g000001.aa.fasta; etc. Why don't the gene_family names in the table match those in the feature_sequences folder?

That is why I am confused: after filtering all the core genes in PIRATE.gene_families.tsv, which name should I use to match the *.aa.fasta files?

I am so sorry for so many questions.

Best,
LC

SionBayliss commented on August 12, 2024

Hi LC,

You just convert the 'number of genomes' column to a percentage using the number of input samples.
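A minimal sketch of that conversion, using only the Python standard library. The column name `number_genomes` is an assumption about the header in PIRATE.gene_families.tsv (check your own file and pass the right name), and `core_families` is a made-up helper name, not something PIRATE provides. The strict `>` comparison follows the ">95%" wording used earlier in this thread.

```python
import csv


def core_families(tsv_path, n_samples, threshold=95.0,
                  count_column="number_genomes",  # assumed header name
                  family_column="gene_family"):
    """Return (gene_family, percent_present) for families above the threshold.

    n_samples is the number of genomes given to PIRATE as input.
    """
    core = []
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            percent = 100.0 * int(row[count_column]) / n_samples
            if percent > threshold:
                core.append((row[family_column], percent))
    return core
```

The family names returned this way (e.g. g031830) should correspond to the file names in feature_sequences/ (g031830.aa.fasta). Reading the table with a script also avoids any display limits in a spreadsheet program.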

The gene family column should start with g00001. Are you sure you haven't modified it in some way? The presence of alignments with g00001 in the feature sequences directory indicates that there should be an entry in the PIRATE.gene_families.tsv file (the script that creates the alignments uses the gene_families file as input). Alternatively, have you run PIRATE multiple times into the same output folder with different settings?

I recommend that you read the README thoroughly. The answers to most of your questions are there.

limin321 commented on August 12, 2024

Hi S,

Thank you so much for the suggestion. I double-checked: my Excel ran into some issues and was unable to display all of the data. So sorry for the inconvenience.

Best,
