
questions on output files (pirate, closed)

sionbayliss commented on August 12, 2024
questions on output files


Comments (4)

SionBayliss commented on August 12, 2024

The PIRATE.gene_families.tsv file has more information on the particular genes included in the alignment. Genes are considered to be core when they are present in >95% of all isolates. See the README for more details on the output files.

For an alignment of aa sequences, you can either concatenate the aligned amino acid sequences in the feature_sequences/ directory (*.aa.fasta) or directly translate the core alignment file. This file retains the triplet (codon) structure needed for an amino acid alignment, so it should be suitable for translation by other software without any preprocessing.

S
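
As a rough illustration of the "translate the core alignment" option, here is a minimal Python sketch that translates an in-frame, gapped nucleotide alignment codon by codon. It assumes the standard genetic code, turns a fully gapped codon into a single gap, and marks any codon it cannot resolve as X. The function names are hypothetical and not part of PIRATE.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_aligned(seq):
    """Translate an in-frame, gapped nucleotide sequence codon by codon.

    '---' becomes '-', unresolvable codons (partial gaps, N, ...) become 'X',
    and any trailing bases that do not fill a whole codon are ignored.
    """
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3].upper()
        if codon == "---":
            protein.append("-")
        else:
            protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

To use it, open the alignment (for example the core_alignment.fasta mentioned later in this thread), loop over `read_fasta(handle)`, and write out `translate_aligned(sequence)` for each record. An established library such as Biopython would be a reasonable alternative to the hand-written table.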


limin321 commented on August 12, 2024

Hi S,

That's a very helpful suggestion. I think I could get all the core gene names from PIRATE.gene_families.tsv and core_alignment.fasta. However, there seems to be no direct way to filter the table for genes that are present in >95% of all isolates. Do you have any suggestion for how to do that? The only way I can think of right now is to BLAST the core gene alignment against the whole genome sequences to find out which genes are core.

I would like to extract the core gene aa sequences and concatenate them.
However, when we open PIRATE.gene_families.tsv in Excel, the first two columns are allele_name (e.g. g031830_000006) and gene_family (g031830), whereas the *.aa.fasta files in the feature_sequences folder are named like g031830.aa.fasta.

Also, there is a g000001.aa.fasta file in the feature_sequences folder, but I don't see g000001 anywhere in PIRATE.gene_families.tsv.

[Screenshot: Screen Shot 2020-07-27 at 2 24 38 PM]

In the picture, you will see that the smallest gene_family name is g000091; however, there are plenty of *.aa.fasta files starting from g000001.aa.fasta. Why don't the gene_family names in the table match those in the feature_sequences folder?

That is why I am confused: after filtering all the core genes in PIRATE.gene_families.tsv, which name should I use to match the *.aa.fasta files?

I am so sorry for so many questions.

Best,
LC


SionBayliss commented on August 12, 2024

Hi LC,

You can convert the 'number of genomes' column to a percentage by dividing it by the number of input samples.

The gene family column should start with g000001. Are you sure you haven't modified it in some way? The presence of g000001 alignments in the feature sequences directory indicates that there should be a matching entry in the PIRATE.gene_families.tsv file (the script that creates the alignments uses the gene_families file as input). Alternatively, have you run PIRATE multiple times into the same output folder with different settings?

I recommend that you read the README thoroughly. The answers to most of your questions are there.

S
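
The percentage conversion described above could be sketched in Python roughly as follows. This is a minimal example, not PIRATE's own code: the column names `gene_family` and `number_genomes` are assumptions about the table header and should be checked against your file, the helper names are hypothetical, and the strict ">95%" cut-off follows the definition given earlier in this thread.

```python
import csv


def core_gene_families(handle, n_samples, threshold=0.95,
                       family_col="gene_family", count_col="number_genomes"):
    """Return gene family names present in more than `threshold` of samples.

    `handle` is an open tab-separated file with a header row.
    The column names are assumptions; adjust them to match your table.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    core = []
    for row in reader:
        # Fraction of input genomes that contain this gene family.
        fraction = int(row[count_col]) / n_samples
        if fraction > threshold:
            core.append(row[family_col])
    return core


def aa_fasta_name(family):
    """Build the alignment file name used in this thread, e.g. g031830.aa.fasta."""
    return f"{family}.aa.fasta"
```

For instance, one could open PIRATE.gene_families.tsv, call `core_gene_families(handle, n_samples)` with the number of genomes given to PIRATE, and map each returned family through `aa_fasta_name` to find the matching file in the feature_sequences folder.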


limin321 commented on August 12, 2024

Hi S,

Thank you so much for the suggestion. I double-checked: my Excel ran into some issues that made it unable to display all the data. So sorry for the inconvenience.

Best,

