Comments (11)
Okay! It was the headers. RefSeq headers for a couple of my species looked like this:
lcl|NC_004818.2_prot_XP_311158.5_1 [gene=KIBRLG] [locus_tag=AgaP_AGAP000002] [db_xref=VectorBase:AGAP000002-PA,GeneID:1272272] [protein=AGAP000002-PA] [protein_id=XP_311158.5] [location=complement(join(582..865,950..3120,3211..3370,3459..3760,15747..15871))] [gbkey=CDS]
I found an alternative repository with different headers, which looked like this:
AGAP000002-PA | transcript=AGAP000002-RA | gene=AGAP000002 | organism=Anopheles_gambiae_PEST | gene_product=unspecified product | transcript_product=unspecified product | location=AgamP4_X:582-15871(-) | protein_length=1013 | sequence_SO=chromosome | SO=protein_coding_gene | is_pseudo=false
After this, MACSE aligned the orthologs without issue.
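For anyone without an alternative repository to fall back on, here is a rough sketch of how the long RefSeq headers could be shortened before running the pipeline. It is not part of compare_genomes; the function names `simplify_header` and `build_translation_table` are made up for illustration, and it assumes the `[protein_id=...]` field is a suitable unique ID.

```python
import re


def simplify_header(header: str) -> str:
    """Return a short ID for a FASTA header (given without the leading '>')."""
    # RefSeq CDS-style headers carry the protein accession in [protein_id=...].
    match = re.search(r"\[protein_id=([^\]]+)\]", header)
    if match:
        return match.group(1)
    # Otherwise fall back to the first whitespace-delimited token, if any.
    tokens = header.split()
    return tokens[0] if tokens else ""


def build_translation_table(headers):
    """Map each simplified ID back to its original header.

    Raises ValueError if two headers collapse to the same ID, since that
    would silently merge sequences downstream.
    """
    table = {}
    for header in headers:
        new_id = simplify_header(header)
        if new_id in table:
            raise ValueError(f"duplicate simplified ID: {new_id}")
        table[new_id] = header
    return table
```

Keeping the table around means the original annotation can be recovered after alignment, which is roughly what a "translation table" option would need to do.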
from compare_genomes.
I can take a look! I almost wonder if a "translation table" would be a better optional approach for datasets like mine, where the headers are kind of a mess, haha.
Regardless, I was able to get things to run quite well, and I'm excited to try it on a larger dataset!
Question for you: what kind of filtering do you do on inputs before you run the pipeline? I realized the gene sets I downloaded had not been curated for isoforms, which resulted in some hilarious levels of purported gene duplication (the gene sets from our genomes were pre-curated for isoforms; I just didn't think to do that to this new set when I was testing header issues).
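One common way to do that isoform curation is to keep only the longest protein per gene. A minimal sketch, assuming each header carries a `gene=...` field as in the two header styles shown earlier in this thread; `read_fasta` and `longest_isoform_per_gene` are hypothetical helper names, not pipeline functions.

```python
import re

# Matches "[gene=KIBRLG]" as well as "| gene=AGAP000002 |" (assumed header styles).
GENE_RE = re.compile(r"\bgene=([^\]\s|]+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_isoform_per_gene(records):
    """Keep the longest sequence for each gene, in order of first appearance."""
    best = {}
    for header, seq in records:
        match = GENE_RE.search(header)
        # If no gene field is found, treat the whole header as its own gene.
        gene = match.group(1) if match else header
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return list(best.values())
```

Ties are resolved in favour of the first isoform seen. Longest-isoform selection is a heuristic; a curated canonical transcript list, where one exists, would be the safer choice.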
Yes, it's failing to align with MACSE; however I have had no problems aligning with 32 cores and ~160GB of RAM - possibly some of your sequences are very large. Your options are:
- checking that the number of CPUs and the amount of RAM set in process.config match your system's,
- increasing the RAM, if possible,
- running MACSE like so: ranwez/MACSE_V2_PIPELINES#7, or
- replacing MACSE with another aligner.
Great suggestions - I did what you said, and notably I updated the MACSE wrapper to have more memory. That got rid of the heap space error, but now I have this in my error log:
42422 sequence(s) with genetic code The_Standard_Code
Build draft alignment using greedy strategy based on UPGMA tree
Start building smaller alignments
4242 fasta file:OG0006748.aligned.unsorted.cds.tmp not found
java.io.FileNotFoundException: OG0006748.aligned.unsorted.cds.tmp (No such file or directory)
    at java.base/java.io.FileInputStream.open0(Native Method)
    at java.base/java.io.FileInputStream.open(FileInputStream.java:216)
    at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
    at java.base/java.io.FileInputStream.<init>(FileInputStream.java:111)
    at java.base/java.io.FileReader.<init>(FileReader.java:60)
    at bioObject.CodingDnaSeq.readFasta(CodingDnaSeq.java:562)
    at utils.MacseMain.main(MacseMain.java:590)
rm: cannot remove 'OG0006748*.tmp': No such file or directory
rm: cannot remove 'OG0006748.AA.prot': No such file or directory
Is that still likely related to running out of memory?
I think it has something to do with the pipeline reruns. The module single_gene_orthogroups_tree.nf cleans up the extracted sequences for alignment and other temporary files. Can you try rerunning the entire single_gene_orthogroups_tree.nf module again?
So I tried various things and kept running into the same error... I think the problem is my input file formats. The files from RefSeq have pipes which I think may be causing problems! I am going to see about modifying those input files and trying again.
The pipes should not be a problem, as I am accounting for those, but I may be completely wrong and some steps may be failing because of them. Have you also tried completely restarting the pipeline from scratch (good ol' turn it off and on again)?
I did try to restart the pipeline. I noticed A LOT of the files that are pulled out by single_gene_orthogroups_tree.nf are 80 MB with 40k sequences in them! We've run some of these genomes through OrthoFinder before and never had such large gene families found...
I can try manipulating headers tomorrow and seeing...
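A quick way to see which extracted files are suspiciously large is to count the `>` lines in each one. This is only a sketch: the `*.fasta` pattern and the 1000-sequence threshold are placeholders, so adjust both to whatever the module actually writes out.

```python
from pathlib import Path


def count_sequences(fasta_path):
    """Count FASTA records in a file by counting header lines."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def flag_large_orthogroups(directory, pattern="*.fasta", max_sequences=1000):
    """Return {file name: sequence count} for files exceeding max_sequences.

    Both `pattern` and `max_sequences` are illustrative defaults, not values
    taken from the pipeline.
    """
    flagged = {}
    for path in sorted(Path(directory).glob(pattern)):
        n = count_sequences(path)
        if n > max_sequences:
            flagged[path.name] = n
    return flagged
```

A single-gene orthogroup should hold roughly one sequence per species, so a file with tens of thousands of records points at the extraction step or the headers rather than at a real gene family.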
I wonder if those are because of my sequence extractor scripts or OrthoFinder's orthogroup detection.
Thanks for your tenacity on this. Please let me know if it still fails. Also, if you can share your config files, I can recreate the errors and we'll debug further.
Nice! Then I think it's probably an issue with my sequence extractor (a Julia script called from the pipeline). I'll look at it when I have time, but feel free to submit pull requests if you have time to fix it.
Awesome! We have not had problems with isoforms messing with our gene duplication estimates. BUSCO analyses seem to be within expected ranges for the plant species we've had to deal with so far. That said, I'm not intimately familiar with the annotation pipeline, which may or may not deal with isoforms well enough.