Comments (4)
The PIRATE.gene_families.tsv file has more information on the particular genes included in the alignment. Genes are considered core when they are present in >95% of all isolates. See the README for more details on the output files.
For an alignment of amino acid sequences, you can either concatenate the aligned amino acid sequences in the feature_sequences/ directory (*.aa.fasta) or directly translate the core alignment file. This file retains the triplet (codon) structure of an amino acid alignment, so it should be suitable for translation by other software without any preprocessing.
S
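As an illustration of the second option (translating the core alignment directly), here is a minimal Python sketch. It is not part of PIRATE: the function name translate_codon_alignment is hypothetical, the standard genetic code is assumed, and the input is assumed to be one aligned nucleotide sequence whose gaps fall on codon boundaries. An all-gap codon becomes a single gap, and any codon that cannot be translated (partial gaps, ambiguous bases) becomes X.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_codon_alignment(seq):
    """Translate one aligned nucleotide sequence codon by codon.

    Hypothetical helper, not a PIRATE function.
    '---' maps to '-', untranslatable codons map to 'X'.
    """
    seq = seq.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("alignment length is not a multiple of 3")
    protein = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        if codon == "---":
            protein.append("-")
        else:
            protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)


# Toy sequence for illustration only, not real PIRATE output.
print(translate_codon_alignment("ATG---GCTTAA"))
```

Applying the function to every record in the core alignment should preserve the column structure, because each sequence shrinks by exactly a factor of three.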
from pirate.
Hi S,
That's a very helpful suggestion. I think I can get all the core gene names from PIRATE.gene_families.tsv and core_alignment.fasta. However, there seems to be no direct way to filter the table for genes that are present in >95% of all isolates. Do you have any suggestions for how to do that? The only way I can think of right now is to BLAST the core gene alignment against the whole genome sequences to find out which genes are core.
I would like to extract the core gene amino acid sequences and concatenate them.
When I open PIRATE.gene_families.tsv in Excel, the first two columns are allele_name (e.g. g031830_000006) and gene_family (e.g. g031830). However, in the feature_sequences folder, the *.aa.fasta files are named like g031830.aa.fasta.
Also, there is a g000001.aa.fasta file in the feature_sequences folder, but I don't see g000001 anywhere in PIRATE.gene_families.tsv.
In the picture, you can see that the lowest gene_family name in the table is g000091, yet there are plenty of *.aa.fasta files with lower numbers (g000001.aa.fasta, etc.). Why don't the gene_family names in the table match the files in the feature_sequences folder? This is why I am confused: after filtering the core genes in PIRATE.gene_families.tsv, which name should I use to match the *.aa.fasta files?
I am sorry for asking so many questions.
Best,
LC
Hi LC,
You can convert the 'number of genomes' column to a percentage by dividing it by the number of input samples.
The gene family column should start with g000001. Are you sure you haven't modified the file in some way? The presence of a g000001 alignment in the feature sequences directory indicates that there should be a matching entry in the PIRATE.gene_families.tsv file (the script that creates the alignments uses the gene_families file as input). Alternatively, have you run PIRATE multiple times into the same output folder with different settings?
I recommend that you read the README thoroughly. The answers to most of your questions are there.
S
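The percentage calculation described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not PIRATE code: the function name core_gene_families is hypothetical, and the column names gene_family and number_genomes are assumptions that should be checked against the header of your own PIRATE.gene_families.tsv. The strict >95% cut-off follows the definition given earlier in this thread. Reading the table with a script also sidesteps any spreadsheet display limits.

```python
import csv


def core_gene_families(lines, n_genomes, threshold=0.95,
                       family_col="gene_family",
                       count_col="number_genomes"):
    """Return gene families present in more than `threshold` of genomes.

    Hypothetical helper. `lines` is any iterable of TSV lines, e.g. an
    open file handle. The column names are assumptions; check them
    against the header of your own table.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    core = []
    for row in reader:
        fraction = int(row[count_col]) / n_genomes
        if fraction > threshold and row[family_col] not in core:
            core.append(row[family_col])
    return core


# Made-up table for illustration only (20 input genomes).
demo = [
    "allele_name\tgene_family\tnumber_genomes",
    "g000001_000001\tg000001\t20",
    "g000002_000001\tg000002\t12",
    "g000003_000001\tg000003\t20",
]
core = core_gene_families(demo, n_genomes=20)

# Alignment file names follow the pattern seen in this thread
# (e.g. g031830.aa.fasta in the feature_sequences folder).
core_files = [f"{family}.aa.fasta" for family in core]
print(core_files)
```

For a real run, pass an open file handle instead of the demo list, for example `with open("PIRATE.gene_families.tsv", newline="") as fh: core = core_gene_families(fh, n_genomes=...)`, where the number of genomes is the number of input samples.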
Hi S,
Thank you so much for the suggestion. I double-checked, and it turns out my Excel ran into an issue that prevented it from displaying all the data. Sorry for the inconvenience.
Best,