
inab / trimal


A tool for automated alignment trimming in large-scale phylogenetic analyses. Development version: 2.0

Home Page: http://trimal.cgenomics.org

License: GNU General Public License v3.0

Parrot 0.36% C++ 72.62% C 1.23% Clarion 9.80% Python 5.32% Roff 0.04% Makefile 0.26% Shell 10.37%
multiple-sequence-alignment trimming bioinformatics-tool

trimal's People

Contributors

markjens, nicodr97, scapella

trimal's Issues

Implement support for different genetic codes

Different genetic codes: Universal, mammalian mt, yeast mt, mold mt, invertebrate mt, ciliate nuclear, echinoderm mt, euplotid mt, alternative yeast nuclear, ascidian mt.

When it comes to labeling stop codons, you need just a nuclear and mitochondrial code parameter.
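
As a rough illustration of why the choice of code matters for labeling stop codons, here is a minimal sketch. The enum and function names are hypothetical and not part of trimAl; only three codes are shown, with stop codons as listed in the NCBI translation tables (1, 2 and 6).

```cpp
#include <set>
#include <string>

// Hypothetical identifiers; not taken from the trimAl sources.
enum class GeneticCode { Standard, VertebrateMito, CiliateNuclear };

// Stop codons (DNA alphabet, upper case) for a few genetic codes.
inline std::set<std::string> stopCodons(GeneticCode code) {
  switch (code) {
    case GeneticCode::Standard:        // NCBI translation table 1
      return {"TAA", "TAG", "TGA"};
    case GeneticCode::VertebrateMito:  // NCBI translation table 2
      return {"TAA", "TAG", "AGA", "AGG"};
    case GeneticCode::CiliateNuclear:  // NCBI translation table 6
      return {"TGA"};
  }
  return {};
}

// True when `codon` terminates translation under the given code.
inline bool isStopCodon(const std::string &codon, GeneticCode code) {
  return stopCodons(code).count(codon) > 0;
}
```

For example, TGA is a stop codon under the standard code but not under the vertebrate mitochondrial code, where AGA and AGG are stops instead.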

Complementary functionality doesn't work as expected when using consistency-based methods

I have detected unexpected behaviour when trimming an alignment with consistency-based methods, which compute their scores from the level of agreement across a set of input alignments.

trimAl doesn't return those columns which are removed when applying -ct parameter. I have detected it using -ct alone or with other methods e.g. -gt

@Vicfero could you please verify to what extent this affects the new version?

trimAl doesn't recognise alignment

Hi,

I have tried to find an answer by searching google but couldn't find anything.
I aligned my data with MAFFT and now want to trim it with trimAl. The first two sequences worked, but then I received the following error message:

ERROR: The sequences in the input alignment should be aligned in order to use trimming method.

I will attach the file in question. CoaE.mafft.zip

The command used was:

trimal -in CoaE.mafft -out CoaE.triaml.fasta -fasta - automated1

Thank you for helping.

PS I really do hope this isn't a stupid question.

Provide an updated windows version

Provide an updated Windows-compatible version incorporating the latest trimAl features and fixed bugs.
Ideally we should have an (automated) mechanism for providing this specific compilation.

Problem using -compareset, -ct, -gt together.

When using -compareset, -ct, and -gt together (perhaps this is not allowed?) I get an output alignment with some extra garbage to the right of the last legitimate column:
(sorry I don't know how to get fixed-width font here) .

For example:

tomfy@t410:~/trimAl_1.4/dataset$ trimal -compareset fileset1 -ct 0.1 -gt 0.1
6 46
Sp8 ---GKVIV-YGIVLGTKSDQFSVVWLFPWNGLQIHMMGII
Sp17 FAYTDLLL-IGFLLKTV-ATFGDTWFQLWQGLDLNKMPVF
Sp10 ----AVL--FVIMLGTI-TKFSSEWFFAWLGLEINMMVII
Sp26 AAAAALLTYLGLFLGTDYENFAAAAANAWLGLEINMMAQI
Sp33 ----TILNIAGLHMETD-INFSLAWFQAWGGLEINKQAIL
Sp6 ---AAILT-LGIYLFTLCAVISVSWYLAWLGLEINMMAIINKMPVF

tomfy@t410:~/trimAl_1.4/dataset$ trimal -compareset fileset1 -ct 0.2 -gt 0.5
6 38
Sp8 GIVLGTKSFSVVWLFPWNGLQIHMMGIIQAIL
Sp17 GFLLKTV-FGDTWFQLWQGLDLNKMPVFMAQI
Sp10 VIMLGTI-FSSEWFFAWLGLEINMMVIIMVII
Sp26 GLFLGTDYFAAAAANAWLGLEINMMAQIMPVF
Sp33 GLHMETD-FSLAWFQAWGGLEINKQAILMGII
Sp6 GIYLFTLCISVSWYLAWLGLEINMMAII

Improve documentation and manual

The documentation needs to be intensively extended to incorporate the latest improvements.
It is also lacking enough clarity about how to use specific functions.

Please tag a "Release" so it can be packaged

Ideally, to put trimal in Brew / Linuxbrew we need a tagged release.

Would it be possible to make one using the "Releases" tab?

Even if it is 1.4b that's fine - we just need a .tar.gz to download.

Improve columns mapping

Improve the "-colnumbering" parameter by providing more information about which columns of the old alignment correspond to those of the new one.
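
One possible shape for such a mapping, as a minimal sketch: the function name and the boolean "keep" mask are assumptions for illustration, not trimAl's internal representation.

```cpp
#include <vector>

// keep[i] is true when original column i survives trimming.
// Returns, for each column of the trimmed alignment (in order),
// the 0-based index of the column it came from in the original alignment.
inline std::vector<int> newToOldColumns(const std::vector<bool> &keep) {
  std::vector<int> mapping;
  for (int i = 0; i < static_cast<int>(keep.size()); ++i)
    if (keep[i]) mapping.push_back(i);
  return mapping;
}
```

With a mask of keep, drop, drop, keep, keep, the trimmed columns 0, 1, 2 map back to original columns 0, 3, 4.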

trimAl capital letters

Check whether it is already fixed to consider lower and upper case letter the same symbol.

statal prints nothing - no warning to user

Running statal -in file.aln runs for a while and prints nothing.

I assume that one of the -sc* options is needed to get an output.

Can you flag an error if no output option is provided?

Improving warning when working with backtranslation

WARNING: Cutting sequence "Phy006C668_LYNLY" at first appearance of stop codon "TAA" (residue "") at position 1045 (length: 1047) <<<

This warning is not necessary since the protein sequence end has been reached.

readAl/trimAl fails to output to nexus, mega and nbrf if sequences are classified 'Deg'

If the input alignment (DNA/RNA) type is classified as DNADeg or RNADeg, both programs fail to output correctly:

NEXUS OUTPUT:

#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=12 NCHAR=637;
;
...

NEXUS DESIRED OUTPUT

#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=12 NCHAR=637;
FORMAT DATATYPE=DNA INTERLEAVE=yes GAP=-;

MEGA OUTPUT:

#MEGA
!Title dataset/huge/example4.sequential.phy.nexus;
NSeqs=12 Nsites=637 indel=- CodeTable=Standard;

MEGA DESIRED OUTPUT

#MEGA
!Title dataset/huge/example4.sequential.phy.nexus;
!Format DataType=DNA NSeqs=12 Nsites=637 indel=- CodeTable=Standard;

NBRF OUTPUT

>;VRA17
VRA17 637 bases

NBRF DESIRED OUTPUT

>DL;VRA17
VRA17 637 bases

The issue seems to be located on these lines:

  /* Compute output file datatype */
  getTypeAlignment();
  if (dataType == DNAType)
    alg_datatype = "DL";
  else if (dataType == RNAType)
    alg_datatype = "RL";
  else if (dataType == AAType)
    alg_datatype = "P1";
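
A minimal sketch of one way the degenerate types could be handled, assuming they should be tagged like their non-degenerate counterparts (which is what the desired NBRF output above suggests). The enum below is hypothetical; the real trimAl constants may be named or valued differently.

```cpp
#include <string>

// Hypothetical enum standing in for trimAl's sequence type constants.
enum SequenceType { UnknownType, DNAType, RNAType, AAType, DNADegType, RNADegType };

// Map a detected sequence type to an NBRF/PIR tag, treating degenerate
// nucleotide alphabets like plain DNA/RNA. Unknown types yield an empty tag.
inline std::string nbrfTag(SequenceType type) {
  switch (type) {
    case DNAType:
    case DNADegType:
      return "DL";
    case RNAType:
    case RNADegType:
      return "RL";
    case AAType:
      return "P1";
    default:
      return "";
  }
}
```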

Degraded.zip

'compareset' error when using trimAL 1.2 on Windows

Dear developer,

I experienced an error while using trimAl. When I ran the command "trimal -compareset Api0000040.cmp", I got the error: "Alignment not loaded: "" Check the file's content.". I also tried using the absolute path of the alignment, but I still got that error. Could you please tell me how to resolve this? I am using trimAl 1.2 on Windows (downloaded from http://trimal.cgenomics.org/downloads).

By the way, could you please compile a new version of trimAl for Windows? I can currently only use version 1.2 on Windows.

Best wishes,

Dong Zhang

readAl / trimAl creates an empty file when trying to output a non-aligned format/alignment to an aligned-format.

Both programs output an empty file when given a non-aligned file or alignment (e.g. dataset/example.007.AA.only_seqs) and asked to save it in an aligned format (clustal, phylip, etc.).

The program warns about this problem, but the check is done in the 'alignmentXToFile' functions, after the file has already been created.
The problem seems to be in the function "alignment::saveAlignment(char *destFile)" in "alignment.cpp", lines 607-673:

bool alignment::saveAlignment(char *destFile) {

  ofstream file;

  if(sequences == NULL)
    return false;

  if((residNumber == 0)  || (sequenNumber == 0)) {
    cerr << endl << "WARNING: Output alignment has not been generated. "
      << "It is empty." << endl << endl;
    return true;
  }

  /* File open and correct open check */
  file.open(destFile);
  if(!file) return false;

  /* Depending on the output format, we call to the appropiate function */
  switch(oformat) {
    case 1:
      alignmentClustalToFile(file);
      break;
    case 3:
      alignmentNBRF_PirToFile(file);
      break;
    case 8:
      alignmentFastaToFile(file);
      break;
    case 11:
      alignmentPhylip3_2ToFile(file);
      break;
    case 12:
      alignmentPhylipToFile(file);
      break;
    case 13:
      alignmentPhylip_PamlToFile(file);
      break;
    case 17:
      alignmentNexusToFile(file);
      break;
    case 21: case 22:
      alignmentMegaToFile(file);
      break;
    case 99:
      getSequences(file);
      break;
    case 100:
      alignmentColourHTML(file);
      break;
    default:
      return false;
  }

  /* Close the output file */
  file.close();

  /* All is OK, return true */
  return true;
}

The error reporting this problem:

ERROR: Sequences are not aligned. Format (X) not compatible with unaligned sequences.

is issued after the ofstream has been opened, which is why an empty file is created.
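
A minimal sketch of the reordering this suggests: validate first, and only open the output stream once the data is known to be writable. The helper names and the plain vector-of-strings representation are assumptions for illustration, not the trimAl classes.

```cpp
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// True when there is at least one sequence and all sequences have the same length.
inline bool isAligned(const std::vector<std::string> &seqs) {
  if (seqs.empty()) return false;
  for (const auto &s : seqs)
    if (s.size() != seqs.front().size()) return false;
  return true;
}

// Hypothetical writer: the alignment check happens before the file is created,
// so a failed check leaves nothing on disk.
inline bool saveAligned(const std::vector<std::string> &seqs, const std::string &path) {
  if (!isAligned(seqs)) {
    std::cerr << "ERROR: Sequences are not aligned.\n";
    return false;  // no file has been created at this point
  }
  std::ofstream file(path);
  if (!file) return false;
  for (const auto &s : seqs) file << s << '\n';
  return true;
}
```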

Removing sequences with manual overlap

The option to remove sequences using -seqoverlap and -resoverlap produces results I find unexpected. From what I can tell, when -resoverlap checks whether a residue is the same in the other sequences, it does not consider the base identity (for a DNA alignment), only whether the position holds a gap character or any DNA base. Therefore, for a gap-free alignment, if I change all bases in a sequence to e.g. "T", the sequence will not be removed from the alignment, even with strict settings (e.g. -resoverlap 0.9 -seqoverlap 95).

Is this how it is supposed to work? If so, I'm curious why it works like this and not as I expected. Thanks

Should gaps be included in the conservation score computation?

Dear developers,
Would it be better if the BLOSUM matrix in this package (matrix.BLOSUM62) contained one more state for gaps (i.e. '-')? When I tried to remove fully conserved columns, I expected the result to exclude monotonic columns, such as:

A
A
A
A
A

However, a column like

A
-
A
A
A

also had a conservation score of 1.0, as computed by the -scc option, although it was not fully conserved. It does not seem straightforward to detect specifically the first case with -gt and -st. Is there any suggestion for people trying to remove monotonic columns (i.e. columns containing a single residue type and no gaps)?
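
Outside of trimAl, the distinction between the two columns can be checked directly. The following is a minimal sketch, assuming '-' is the only gap symbol, that case should be ignored, and that `col` is a valid index for every sequence; the function name is made up for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// True when every sequence carries the same residue at column `col`
// and no sequence has a gap there.
inline bool isMonotonicColumn(const std::vector<std::string> &aln, std::size_t col) {
  if (aln.empty()) return false;
  const char first =
      static_cast<char>(std::toupper(static_cast<unsigned char>(aln[0][col])));
  if (first == '-') return false;
  for (const auto &seq : aln) {
    const char c =
        static_cast<char>(std::toupper(static_cast<unsigned char>(seq[col])));
    if (c != first) return false;  // a different residue or a gap
  }
  return true;
}
```

Under this definition the all-A column is monotonic, while the column with one gap is not.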

TrimAl compareset error

Dear Salvador Capella-Gutiérrez and Toni Gabaldón,

I was testing trimAl with the -compareset option (using a few nucleotide alignments with the same sequences in the same order across alignments), but I got a few errors (please see the attachment). Am I missing something?
trimAl was compiled from the latest version (v1.4.rev22 build[2015-05-21]) available at https://github.com/scapella/trimal.

Are these errors related to alignment differences? If not, is it possible that a future version can fix these?

Thanks for your attention.

Best regards,
Emanuel Maldonado.
CIIMAR, University of Porto.

trimal_compareset_error.txt
msa.zip

Error message on some alignments

Hello!
On some alignment files I get multiple errors along the lines of:
'Error: the symbol 'R' accesing the matrix is not defined in this object' (the symbol is changing).

The thing is that my sequences are DNA, but as far as I can see in the code, the default similarity matrix is for proteins. I have tested about 10 MSAs (from very similar sources); some work and some produce the error. On some files statal -ssc gives those errors, but trimal works as expected.

Do you have any hints on how this could be resolved?

Thanks!

trimal.exe has stopped working message

Hi there,

I have been using trimal v.1.2 on a local computer from the command prompt. When I tried to run a command to trim my alignment, the message "trimal.exe has stopped working" popped up and the program stopped. Can you please tell me what I have done wrong?
The command line was trimal -in -out -automated1

My guess is that the alignment file was too big. I tried to find out whether there is any size limitation for trimal, but couldn't find any information on the main website. Could you please clarify this?

Thank you in advance

Any plans for multi-threading support in trimAl ?

Are there any plans to support multi-threading?

I understand it is not trivial to implement, but OpenMP pragmas could make it easy to parallelize parts of the code that loop over columns because they are independent operations?

I work on DNA alignments with 5,000,000 columns and 100s of rows, and most operations are surprisingly slow.
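
To illustrate the idea, here is a minimal sketch of a per-column statistic with an OpenMP pragma on the column loop. It is not trimAl code: the function name and the vector-of-strings alignment are assumptions, and all rows are assumed to have the same length. Compiled with -fopenmp the loop runs in parallel; without it the pragma is ignored and the loop runs serially.

```cpp
#include <string>
#include <vector>

// Fraction of gap characters ('-') in each column. Each iteration writes
// only to its own frac[c], so the columns can be processed independently.
inline std::vector<double> gapFractionPerColumn(const std::vector<std::string> &aln) {
  if (aln.empty()) return {};
  const long ncols = static_cast<long>(aln[0].size());
  std::vector<double> frac(ncols, 0.0);
  #pragma omp parallel for
  for (long c = 0; c < ncols; ++c) {
    long gaps = 0;
    for (const auto &seq : aln)
      if (seq[c] == '-') ++gaps;
    frac[c] = static_cast<double>(gaps) / static_cast<double>(aln.size());
  }
  return frac;
}
```

Whether this helps in practice depends on memory layout: with sequences stored row by row, walking down a column touches a different string on every step, so the speed-up may be smaller than the core count suggests.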

Problem with "n"

When the alignment contains "n" rather than "N", trimal gives an error:
Error: the symbol 'n' accesing the matrix is not defined in this object
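
A common way to avoid this class of error is to normalize the case of a symbol before looking it up. The following is a minimal sketch with a made-up alphabet and function name; it does not reflect how trimAl's similarity matrix is actually indexed.

```cpp
#include <cctype>
#include <string>

// Hypothetical lookup: index of a nucleotide symbol within a fixed alphabet,
// or -1 when the symbol is not defined. Case is ignored.
inline int symbolIndex(char symbol) {
  static const std::string alphabet = "ACGTUN";
  const char upper =
      static_cast<char>(std::toupper(static_cast<unsigned char>(symbol)));
  const std::string::size_type pos = alphabet.find(upper);
  return pos == std::string::npos ? -1 : static_cast<int>(pos);
}
```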

statal - basic report with #seq

Is there a way for statal to give me the basic information of

  • name of the alignment (if the format supports it)
  • number of entries
  • length of the alignment
  • alphabet, e.g. AGTC or AGTCN-

Perhaps this could be the default when no -sg* option is provided?

Accept stdin as input

A feature request: would be very convenient to have an option to pipe into trimal:

my_pipe | trimal -in /dev/stdin

or

my_pipe | trimal -in -
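
A minimal sketch of the second form, where "-" is treated as standard input. The function name is hypothetical, and the optional stream parameter exists only so the behaviour can be exercised without a real pipe.

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Read the whole input: from `standardInput` when `path` is "-",
// otherwise from the named file. Returns an empty string when the
// file cannot be opened (a real tool would report the error instead).
inline std::string readAll(const std::string &path,
                           std::istream &standardInput = std::cin) {
  std::ostringstream buffer;
  if (path == "-") {
    buffer << standardInput.rdbuf();
  } else {
    std::ifstream file(path);
    if (!file) return "";
    buffer << file.rdbuf();
  }
  return buffer.str();
}
```

Reading everything into memory first also sidesteps the fact that a pipe cannot be rewound, which matters if the parser needs more than one pass over the input.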

trimal -selectcols doesn't work

ERROR: Parameter "-selectcols" not valid.

I found a PDF manual somewhere in which "select" is used instead, but it gives the same error.

statal - alphabet distribution report

I deal a lot with core genome SNP alignments (DNA) across 100s of bacterial samples. A useful report would be like this:

ID    #A  #G  #T  #C  #N  #-
aln1  12  31  11  31   0   8
aln2  11  44  12  32   2   5
aln3  10  33  12  32  10   2
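
The counts behind one row of such a report could be gathered as in this minimal sketch. The function name is made up, symbols are folded to upper case as an assumption, and the example works on one alignment at a time.

```cpp
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Count how often each symbol (case-folded to upper case) occurs
// across all sequences of one alignment, gaps included.
inline std::map<char, long> symbolCounts(const std::vector<std::string> &aln) {
  std::map<char, long> counts;
  for (const auto &seq : aln)
    for (char c : seq)
      ++counts[static_cast<char>(std::toupper(static_cast<unsigned char>(c)))];
  return counts;
}
```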

trimal changes fasta headers when contain ':'

Hi

I use trimAl v1.4.rev9 build[2012-08-09] and have problems with some sequence names.
Some names look like 'TCOGS2:TC012457-PA'.
(They are "official" names from the Tribolium castaneum genome)

If I run trimAl on the dataset containing this sequence
trimal -in input.fas -noallgaps -keepseqs
'TCOGS2:TC012457-PA' is truncated to 'TCOGS2'

Could you prevent trimAl from truncating sequence names that contain ':'?

Thanks
Regards
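
For comparison, a minimal sketch of header parsing that stops only at whitespace, so that characters such as ':' stay part of the name. The function name is hypothetical and this is not how trimAl currently tokenizes headers.

```cpp
#include <cstddef>
#include <string>

// Extract the sequence name from a FASTA header line: everything after the
// leading '>' up to the first whitespace character.
inline std::string fastaName(const std::string &header) {
  const std::size_t start = (!header.empty() && header[0] == '>') ? 1 : 0;
  const std::size_t end = header.find_first_of(" \t\r\n", start);
  if (end == std::string::npos) return header.substr(start);
  return header.substr(start, end - start);
}
```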

Compilation problem (mac os 10.13, Kernel Version: Darwin 17.0.0)

Hi,

I tried to compile both v1.2 & v1.4. Below are the error messages.
########
g++ -Wall -c compareFiles.cpp
g++: warning: couldn’t understand kern.osversion ‘17.0.0
In file included from /usr/include/Availability.h:194:0,
from /usr/include/stdlib.h:61,
from compareFiles.h:30,
from compareFiles.cpp:27:
/usr/include/AvailabilityInternal.h:25584:74: error: missing binary operator before token "("
#if defined(__has_feature) && defined(__has_attribute) && __has_attribute(availability)
^
In file included from /usr/include/stdlib.h:61:0,
from compareFiles.h:30,
from compareFiles.cpp:27:
/usr/include/Availability.h:387:74: error: missing binary operator before token "("
#if defined(__has_feature) && defined(__has_attribute) && __has_attribute(availability)
^
make: *** [compareFiles.o] Error 1

Compilation time and memory

Hi,

During compilation, I had to stop because the make command was using all my system memory (12 GB). Is that normal? Below is where I had to stop the compilation process.

I'm running Mac OS X Lion.

Thank you,
Bernardo


dhcp-172-17-27-227:source bernardo$ make
g++ -Wall -O2 -c alignment.cpp rwAlignment.cpp autAlignment.cpp
g++ -Wall -O2 -c statisticsGaps.cpp
g++ -Wall -O2 -c utils.cpp
g++ -Wall -O2 -c similarityMatrix.cpp
g++ -Wall -O2 -c statisticsConservation.cpp
g++ -Wall -O2 -c sequencesMatrix.cpp
g++ -Wall -O2 -c compareFiles.cpp
g++ -Wall -O2 -o readal readAl.cpp -lm alignment.o statisticsGaps.o utils.o similarityMatrix.o statisticsConservation.o sequencesMatrix.o compareFiles.o
g++ -Wall -O2 -o trimal main.cpp -lm alignment.o statisticsGaps.o utils.o similarityMatrix.o statisticsConservation.o sequencesMatrix.o compareFiles.o

^Cmake: *** [trimal] Interrupt: 2

Option to remove all ambiguous columns

Hi,
It would be nice to have a flag to remove all ambiguous columns, i.e. those containing anything besides A, C, T and G for DNA. For example, something similar to -gt, with which we could fine-tune how many ambiguities are allowed!

Thanks,
Mohammad

Does trimal have a codon alignment option ?

Hello, I have a codon alignment file and I wondered if Trimal could have an option to select blocks made to contain only complete codons?

Thank you for your answer.

Best regards

Can't install trimal on Mac

Hi there,

I just downloaded trimal v.1.2 for Mac and tried to install it manually by following the instructions in the readme file, but it doesn't work. Also, I couldn't find a bin directory in the downloaded folder. Could you please help me with this issue?

Thanks

duplicate sequence IDs

Dear Salvador,
I get a segmentation fault in trimal when running the following command:

trimal -in COG0185.0.faa -out COG0185.0.fna -backtrans inMSA0.fna -ignorestopcodon -gt 0.1 -cons 60

Without the -backtrans option, the program runs fine. I was hoping you could help me.

All the best,
Falk Hildebrand

Answer from Salvador:
Dear Falk,

Thanks for using trimAl and contacting me regarding this unexpected behaviour.

I played a bit with your input file and realized that you have some repeated IDs for the nucleotide files ...
2 >394503_COG0185
2 >411474_COG0185
2 >445973_COG0185
2 >699246_COG0185
2 >718252_COG0185
... and for the protein files:
2 >394503_COG0185

I just kept the first appearance of such sequences in the attached files, and everything worked as expected.
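
A check like the following could flag repeated IDs up front instead of crashing later. It is a minimal sketch with a made-up function name, operating on a plain list of IDs rather than on trimAl's alignment objects.

```cpp
#include <map>
#include <string>
#include <vector>

// Return every ID that appears more than once, in sorted order.
inline std::vector<std::string> duplicatedIds(const std::vector<std::string> &ids) {
  std::map<std::string, int> seen;
  for (const auto &id : ids) ++seen[id];
  std::vector<std::string> duplicates;
  for (const auto &entry : seen)
    if (entry.second > 1) duplicates.push_back(entry.first);
  return duplicates;
}
```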

Throw error when only input output are given

When only the input and output are given to trimal (new users may expect a default trimming mode), trimal simply outputs the input unchanged.
It seems to me that it should instead complain that no option has been given.

This may save time for new users of trimal.

BLOSUM45 error

Hi there,
I just ran into an error that I believe is caused by the BLOSUM45 matrix (it seems that the wildcard character is needed but not accepted by trimal). Both commands below produce a segmentation fault:

trimal -in PTHR26451_2456.ali -out PTHR26451_2456.phy -gt 0.9 -cons 60 -st 0.3 -matrix BLOSUM45b
trimal -in PTHR26451_2456.ali -out PTHR26451_2456.phy -gt 0.9 -cons 60

Please find attached the files that cause the error for your consideration.
Many thanks,
DE

Data.zip

Improving alignment stats

When providing stats, for instance similarity values or gap values, it would be good to also report the average value for the whole alignment.
