inab / trimal

A tool for automated alignment trimming in large-scale phylogenetic analyses. Development version: 2.0

Home Page: https://trimal.readthedocs.io/

License: GNU General Public License v3.0

multiple-sequence-alignment trimming bioinformatics-tool

trimal's Introduction

Basic Installation
==================

The simplest way to compile this package is:

  1. 'cd' to the directory containing the package's source code ('source').

  2. Type 'make' to compile the package.

  3. Optionally, run trimAl/readAl on the examples in the 'dataset'
     directory to check that the installation is correct.

   By default, 'make' compiles the source code of trimAl and readAl in the
current directory. After that, you can either add the current directory to
your PATH or move these files to '/usr/local/bin' or '/usr/bin' using
root privileges.

trimal's Issues

Redundant assignment in the sequencesMatrix = operator

https://github.com/scapella/trimal/blob/b240b070f0c9651761361bb3d55bd0e3c328558a/source/sequencesMatrix.cpp#L92

should be

matrix[i][j] = matrix[j][i];

readAl / trimAl creates an empty file when asked to write unaligned sequences in an aligned format

Both programs output an empty file when given a file of unaligned sequences (e.g. dataset/example.007.AA.only_seqs) and asked to save it in an alignment format (clustal, phylip, etc.).

The program warns about this problem, but the check is done in the 'alignmentXToFile' functions, after the file has already been created.
The problem seems to be in the function "alignment::saveAlignment(char *destFile)" in "alignment.cpp", lines 607-673:

bool alignment::saveAlignment(char *destFile) {

  ofstream file;

  if(sequences == NULL)
    return false;

  if((residNumber == 0)  || (sequenNumber == 0)) {
    cerr << endl << "WARNING: Output alignment has not been generated. "
      << "It is empty." << endl << endl;
    return true;
  }

  /* File open and correct open check */
  file.open(destFile);
  if(!file) return false;

  /* Depending on the output format, we call to the appropiate function */
  switch(oformat) {
    case 1:
      alignmentClustalToFile(file);
      break;
    case 3:
      alignmentNBRF_PirToFile(file);
      break;
    case 8:
      alignmentFastaToFile(file);
      break;
    case 11:
      alignmentPhylip3_2ToFile(file);
      break;
    case 12:
      alignmentPhylipToFile(file);
      break;
    case 13:
      alignmentPhylip_PamlToFile(file);
      break;
    case 17:
      alignmentNexusToFile(file);
      break;
    case 21: case 22:
      alignmentMegaToFile(file);
      break;
    case 99:
      getSequences(file);
      break;
    case 100:
      alignmentColourHTML(file);
      break;
    default:
      return false;
  }

  /* Close the output file */
  file.close();

  /* All is OK, return true */
  return true;
}

The error reporting this problem:

ERROR: Sequences are not aligned. Format (X) not compatible with unaligned sequences.

is emitted after the ofstream has been opened, which is why an empty file is created.
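
One possible way to avoid the empty file is to run the alignment check before the output stream is opened. The following is only a minimal sketch under stated assumptions: `isAligned` and `saveAligned` are hypothetical names, the sequences are held as plain strings, and the output is a bare dump of the rows. It is not trimAl's actual code.

```cpp
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical helper: sequences count as aligned when all have the same length.
bool isAligned(const std::vector<std::string>& seqs) {
    for (const std::string& s : seqs)
        if (s.size() != seqs.front().size()) return false;
    return true;
}

// Hypothetical, simplified stand-in for a saveAlignment-style function.
// The validity checks run first, so a failed check never creates the file.
bool saveAligned(const std::vector<std::string>& seqs, const std::string& destFile) {
    if (seqs.empty()) return false;
    if (!isAligned(seqs)) {
        std::cerr << "ERROR: Sequences are not aligned." << std::endl;
        return false;               // destFile has not been touched
    }
    std::ofstream file(destFile);   // opened only after all checks pass
    if (!file) return false;
    for (const std::string& s : seqs) file << s << '\n';
    return true;
}
```

With this ordering, a call such as `saveAligned({"ACGT", "AC"}, "out.phy")` returns false without creating `out.phy`.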

trimal -selectcols doesn't work

ERROR: Parameter "-selectcols" not valid.

I found a PDF manual somewhere that uses "select" instead, but that gives the same error.

Please tag a "Release" so it can be packaged

Ideally, to put trimal in Brew / Linuxbrew we need a tagged release.

Would it be possible to make one using the "Releases" tab?

Even if it is 1.4b that's fine - we just need a .tar.gz to download.

Star (*) symbol is not recognized.

My alignment contains * characters representing stop codons. trimAl seems to delete them, so it reports that the file contains unaligned sequences.

Improve documentation and manual

The documentation needs to be intensively extended to incorporate the latest improvements.
It is also lacking enough clarity about how to use specific functions.

Any plans for multi-threading support in trimAl ?

Are there any plans to support multi-threading?

I understand it is not trivial to implement, but OpenMP pragmas could make it easy to parallelize the parts of the code that loop over columns, since those appear to be independent operations.

I work on DNA alignments with 5,000,000 columns and 100s of rows, and most operations are surprisingly slow.
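
To illustrate the kind of loop meant here, the sketch below counts gaps per column with an OpenMP pragma on the column loop. `gapsPerColumn` is a hypothetical helper written for this example, not a trimAl function, and it assumes all rows have the same length. Whether trimAl's own column loops are free of shared state would need to be checked in the source.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of trimAl: counts gaps in every column.
// Each column is written by exactly one iteration, so iterations are independent.
std::vector<int> gapsPerColumn(const std::vector<std::string>& aln) {
    const long cols = aln.empty() ? 0 : static_cast<long>(aln[0].size());
    std::vector<int> gaps(cols, 0);
    // The pragma is ignored, with no change in results, when built without -fopenmp.
    #pragma omp parallel for schedule(static)
    for (long c = 0; c < cols; ++c) {
        int n = 0;
        for (std::size_t r = 0; r < aln.size(); ++r)
            if (aln[r][c] == '-') ++n;
        gaps[c] = n;
    }
    return gaps;
}
```

Compiling with `g++ -O2 -fopenmp` enables the parallel loop; without the flag the same code runs serially.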

Improving alignment stats

When providing stats, for instance similarity values or gap values, it would be good to also have the average value for the whole alignment.

Tag a new release?

Lots of commits since 1.4.1 ?

statal - basic report with #seq

Is there a way for statal to give me the following basic information?

name of the alignment (if the format supports it)
number of entries
length of the alignment
alphabet, e.g. AGTC or AGTCN-

Perhaps this could be the default when no -sg* option is provided?

ERROR: Alignment not loaded:

Dear developers,
When I trim the CLUSTAL file produced by muscle (3.8) with trimAl (v1.4.rev22), all the symbols become the first one. Could you give me some suggestions?
Command:
trimal -in region_muscle.clw -out region_muscle_trimal_auto -automated1

region_muscle.txt

Add an option for avoiding dumping any output alignment

duplicate sequence IDs

Dear Salvador,
I get a segmentation fault in trimal when running the following command:

trimal -in COG0185.0.faa -out COG0185.0.fna -backtrans inMSA0.fna -ignorestopcodon -gt 0.1 -cons 60

Without the -backtrans option, the program runs fine. I was hoping you could help me.

All the best,
Falk Hildebrand

Answer from Salvador:
Dear Falk,

Thanks for using trimAl and contacting me regarding this unexpected behaviour.

I played a bit with your input file and realized that you have some repeated IDs for the nucleotide files ...
2 >394503_COG0185
2 >411474_COG0185
2 >445973_COG0185
2 >699246_COG0185
2 >718252_COG0185
... and for the protein files:
2 >394503_COG0185

I just kept the first appearance of such sequences in the attached files, and everything worked as expected.
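
A check like the one described in the answer can be scripted before running -backtrans. The sketch below is a hypothetical stand-alone helper (`duplicatedIds` is not part of trimAl) and assumes the ID is the header text after '>' up to the first whitespace.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns every FASTA ID that appears more than once.
// The ID is taken as the header text after '>' up to the first whitespace.
std::vector<std::string> duplicatedIds(std::istream& fasta) {
    std::map<std::string, int> seen;
    std::string line;
    while (std::getline(fasta, line)) {
        if (line.empty() || line[0] != '>') continue;   // skip sequence lines
        std::istringstream header(line.substr(1));
        std::string id;
        header >> id;
        ++seen[id];
    }
    std::vector<std::string> dups;
    for (const auto& entry : seen)
        if (entry.second > 1) dups.push_back(entry.first);
    return dups;
}
```

Running it on both the protein and the nucleotide file and removing the reported IDs (or keeping only their first occurrence) reproduces the manual clean-up described above.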

Problem using -compareset, -ct, -gt together.

When using -compareset, -ct, and -gt together (perhaps this is not allowed?) I get an output alignment with some extra garbage to the right of the last legitimate column:
(sorry I don't know how to get fixed-width font here) .

For example:

tomfy@t410:~/trimAl_1.4/dataset$ trimal -compareset fileset1 -ct 0.1 -gt 0.1
6 46
Sp8 ---GKVIV-YGIVLGTKSDQFSVVWLFPWNGLQIHMMGII
Sp17 FAYTDLLL-IGFLLKTV-ATFGDTWFQLWQGLDLNKMPVF
Sp10 ----AVL--FVIMLGTI-TKFSSEWFFAWLGLEINMMVII
Sp26 AAAAALLTYLGLFLGTDYENFAAAAANAWLGLEINMMAQI
Sp33 ----TILNIAGLHMETD-INFSLAWFQAWGGLEINKQAIL
Sp6 ---AAILT-LGIYLFTLCAVISVSWYLAWLGLEINMMAIINKMPVF

tomfy@t410:~/trimAl_1.4/dataset$ trimal -compareset fileset1 -ct 0.2 -gt 0.5
6 38
Sp8 GIVLGTKSFSVVWLFPWNGLQIHMMGIIQAIL
Sp17 GFLLKTV-FGDTWFQLWQGLDLNKMPVFMAQI
Sp10 VIMLGTI-FSSEWFFAWLGLEINMMVIIMVII
Sp26 GLFLGTDYFAAAAANAWLGLEINMMAQIMPVF
Sp33 GLHMETD-FSLAWFQAWGGLEINKQAILMGII
Sp6 GIYLFTLCISVSWYLAWLGLEINMMAII

Does trimal have a codon alignment option ?

Hello, I have a codon alignment file and I wondered if Trimal could have an option to select blocks made to contain only complete codons?

Thank you for your answer.

Best regards

readal phylip-m10

says this:

ERROR: Parameter "-phylip-m10" not valid.

Removing alignment length from taxon names?

Is there an output option where the alignment length is not added to taxon names? Thanks.

trimal.exe has stopped working message

Hi there,

I have been using trimal v.1.2 on a local computer from the command prompt. When I tried to run a command to trim my alignment, the message "trimal.exe has stopped working" popped up and the program stopped working. Can you please tell me what I have done wrong?
The command line was trimal -in -out -automated1

My guess was that the alignment file was too big. I'm trying to find if there's any size limitation for trimal but couldn't find any information on the main website. Could you please clarify this issue?

Thank you in advance

Complementary functionality doesn't work as expected when using consistency-based methods

I have detected unexpected behaviour when trimming an alignment with consistency-based methods, which compute their scores from the level of agreement across a set of input alignments.

With the complementary option, trimAl doesn't return the columns that are removed when applying the -ct parameter. I have detected this using -ct alone or together with other methods, e.g. -gt.

@Vicfero could you please verify to what extent this affects the new version?

TrimAl compareset error

Dear Salvador Capella-Gutiérrez and Toni Gabaldón,

I was testing trimAl with the -compareset option (using a few nucleotide alignments with the same sequences in the same order across alignments), but I got a few errors (please see the attachment). Am I missing something?
TrimAl was compiled from the latest version (v1.4.rev22 build[2015-05-21]) available at https://github.com/scapella/trimal.

Are these errors related to alignment differences? If not, is it possible that a future version can fix these?

Thanks for your attention.

Best regards,
Emanuel Maldonado.
CIIMAR, University of Porto.

trimal_compareset_error.txt
msa.zip

Option to remove all ambiguous columns

Hi,
It would be nice to have a flag to remove all ambiguous columns (anything besides ACTG for DNA, for example), similar to -gt, with which we could fine-tune how much ambiguity is allowed.

Thanks,
Mohammad

Provide an updated windows version

Provide an updated Windows-compatible version incorporating the latest trimAl features and bug fixes.
Ideally we should have an (automated) mechanism for providing this specific build.

'compareset' error when using trimAL 1.2 on Windows

Dear developer,

I experienced an error while using trimAl. When I used the "trimal -compareset Api0000040.cmp" command, I got an error: "Alignment not loaded: "" Check the file's content.". I also tried to use the absolute path of the alignment, but I still got that error. Could you please tell me how to resolve this? I am using trimAl 1.2 on Windows (downloaded from http://trimal.cgenomics.org/downloads).

By the way, could you please compile a new version of trimAl for Windows? I can currently only use version 1.2 on Windows.

Best wishes,

Dong Zhang

Throw error when only input output are given

When only input and output are given to trimal (new users may expect a default trim mode), trimal simply writes the input to the output.
It seems to me that it should instead complain that no option has been given.

This may save time for new users of trimal.

Implement Stockholm format support

Error message on some alignments

Hello!
On some alignment files I get multiple errors, along the line of:
'Error: the symbol 'R' accesing the matrix is not defined in this object' (the symbol is changing).

The thing is, that my sequences are DNA, but as far as I've seen in the code, the default similarity matrix is for proteins. I have tested about 10 MSAs (which are from very similar sources), and some work, some produce the error. On some files statal -ssc gives those errors, but trimal works as expected.

Do you have some hints how could this be resolved?

Thanks!

Removing sequences with a % gaps equal or greater than a threshold

Implement AXT alignments support

I'm not sure whether it is worthwhile to implement this very specific format in trimAl (or any associated program).

http://genome.ucsc.edu/goldenPath/help/axt.html

Degenerate codes in alignments

Create an option that allows degenerate codes in the alignments.

Can't install trimal on Mac

Hi there,

I just downloaded trimal v.1.2 for Mac and tried to install it manually by following the instructions in the readme file, but it doesn't work. Also, I couldn't find a bin directory in the downloaded folder. Could you please help me with this issue?

Thanks

trimAl doesn't recognise alignment

Hi,

I have tried to find an answer by searching google but couldn't find anything.
I aligned my data with mafft and now wanted to trim it with trimAl. The first two sequences worked, but then I received the following error message:

ERROR: The sequences in the input alignment should be aligned in order to use trimming method.

I will attach the file in question. CoaE.mafft.zip

The command used was:

trimal -in CoaE.mafft -out CoaE.triaml.fasta -fasta - automated1

Thank you for helping.

PS I really do hope this isn't a stupid question.

Compilation time and memory

Hi,

During compilation, I had to stop because the make command was using all my system memory (12 GB). Is that normal? Below is where I had to stop the compilation process.

I'm running MACOSX Lion.

Thank you,
Bernardo

dhcp-172-17-27-227:source bernardo$ make
g++ -Wall -O2 -c alignment.cpp rwAlignment.cpp autAlignment.cpp
g++ -Wall -O2 -c statisticsGaps.cpp
g++ -Wall -O2 -c utils.cpp
g++ -Wall -O2 -c similarityMatrix.cpp
g++ -Wall -O2 -c statisticsConservation.cpp
g++ -Wall -O2 -c sequencesMatrix.cpp
g++ -Wall -O2 -c compareFiles.cpp
g++ -Wall -O2 -o readal readAl.cpp -lm alignment.o statisticsGaps.o utils.o similarityMatrix.o statisticsConservation.o sequencesMatrix.o compareFiles.o
g++ -Wall -O2 -o trimal main.cpp -lm alignment.o statisticsGaps.o utils.o similarityMatrix.o statisticsConservation.o sequencesMatrix.o compareFiles.o

^Cmake: *** [trimal] Interrupt: 2

Should gaps be included in the conservation score computation?

Dear developers,
Would it be better for the BLOSUM matrix in this package (matrix.BLOSUM62) to contain one more state for gaps (i.e. '-')? When I tried to remove fully conserved columns, I expected the result to exclude monotonic columns, such as:

A
A
A
A
A

However, a column like

A
-
A
A
A

also had a 1.0 conservation score, computed by the -scc function, although it was not fully conserved. It does not seem straightforward to detect specifically the first case with -gt and -st. Is there any suggestion for people trying to remove monotonic columns (i.e. containing a single residue type and no gaps)?
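
Outside trimAl, the distinction between the two example columns can be made with a direct test. The sketch below uses a hypothetical helper, `isMonotonicColumn`, and assumes the alignment is held as equal-length strings; it does not use trimAl's conservation scores.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: true when column `col` holds one residue type and no gaps.
bool isMonotonicColumn(const std::vector<std::string>& aln, std::size_t col) {
    if (aln.empty()) return false;
    const char first = aln[0][col];
    if (first == '-') return false;
    for (const std::string& row : aln)
        if (row[col] != first) return false;   // covers gaps and other residues
    return true;
}
```

For the two columns shown above, the all-A column passes the test and the column containing a gap does not.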

trimAl capital letters

Check whether it is already fixed so that lower and upper case letters are treated as the same symbol.

Improving warning when working with backtranslation

WARNING: Cutting sequence "Phy006C668_LYNLY" at first appearance of stop codon "TAA" (residue "") at position 1045 (length: 1047) <<<

This warning is not necessary since the protein sequence end has been reached.

Problem with "n"

When the alignment contains "n" instead of "N", trimal gives an error:
Error: the symbol 'n' accesing the matrix is not defined in this object
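
A common workaround for case-related lookups is to normalize the residues before they are used to index a similarity matrix. The sketch below shows that idea with a hypothetical helper, `normalizeCase`; it is an assumption about how such a fix could look, not a description of trimAl's code.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: upper-cases residues so that 'n' and 'N' index the
// same row of a similarity matrix. Gaps and other symbols are left unchanged.
std::string normalizeCase(std::string seq) {
    for (char& c : seq)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return seq;
}
```

Until the program handles this itself, upper-casing the input alignment beforehand has the same effect.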

Improve columns mapping

Improve the "-colnumbering" parameter by providing more information about which columns of the old alignment correspond to those of the new one.

Accept stdin as input

A feature request: it would be very convenient to have an option to pipe into trimal:

my_pipe | trimal -in /dev/stdin

my_pipe | trimal -in -

statal prints nothing - no warning to user

Running statal -in file.aln runs for a while and prints nothing.

I assume that one of the -sc* options is needed to get an output.

Can you flag an error if no output option is provided?

statal - alphabet distribution report

I deal a lot with core genome SNP alignments (DNA) across 100s of bacterial samples. A useful report would be like this:

ID      #A  #G  #T  #C  #N  #-
aln1   12  31   11  31  0   8
aln2   11  44   12  32  2   5
aln3   10   33  12  32  10  2
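
The counts in such a report come down to a per-symbol tally. The sketch below shows one way to compute a single row of the table; `symbolCounts` is a hypothetical helper, and counting case-insensitively is an assumption made for this example.

```cpp
#include <cctype>
#include <map>
#include <string>

// Hypothetical helper: counts each symbol in one sequence, case-insensitively.
std::map<char, int> symbolCounts(const std::string& seq) {
    std::map<char, int> counts;
    for (char c : seq)
        ++counts[static_cast<char>(std::toupper(static_cast<unsigned char>(c)))];
    return counts;
}
```

Calling it once per sequence (or once per concatenated alignment) and printing the entries for A, G, T, C, N and '-' would give the columns listed above.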

Compareset ends without reporting when one alignment couldn't be loaded

trimAl ends with a Segmentation fault when using -compareset and any of the alignments specified can't be loaded.

Removing sequences with manual overlap

The option to remove sequences using -seqoverlap and -resoverlap produces results I find unexpected. From what I can tell, when -resoverlap checks whether a residue is the same in the other sequences, it does not consider the base identity (for a DNA alignment), only whether there is a gap character or any DNA base. Therefore, for a gap-free alignment, if I change all bases in a sequence to e.g. "T", the sequence will not be removed from the alignment, even with strict settings (e.g. -resoverlap 0.9 -seqoverlap 95).

Is this how it is supposed to work and if so I'm curious why it works like this and not as I might have expected it to? Thanks

FYI - I have packaged trimAl 1.2

FYI

https://github.com/tseemann/homebrew-bioinformatics-linux/blob/master/trimal.rb

This will eventually be migrated to full homebrew-science if it can compile on OS X cleanly.

https://github.com/Homebrew/homebrew-science

Remove "{" and "}" symbols

Just to simplify trimAl command line, these symbols '{' and '}' should be removed

Compareset may crash when more alignments than sequences on the alignments are provided

This is due to this line: compareFiles.cpp:56

Currently: numResiduesAlig = new int[numSeqs];

Should be: numResiduesAlig = new int[numAlignments];
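
The underlying rule is that an array indexed by alignment must be sized by the number of alignments. The sketch below illustrates this with a container that takes its size from the data; `residuesPerAlignment` is hypothetical, and what numResiduesAlig actually stores in trimAl is not shown in this report, so the per-alignment residue count is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: one residue count per alignment (not per sequence).
// Each alignment is a list of rows; gaps are not counted as residues.
std::vector<int> residuesPerAlignment(
        const std::vector<std::vector<std::string>>& alignments) {
    std::vector<int> counts(alignments.size(), 0);   // sized by alignment count
    for (std::size_t a = 0; a < alignments.size(); ++a)
        for (const std::string& row : alignments[a])
            for (char c : row)
                if (c != '-') ++counts[a];
    return counts;
}
```

With three alignments of two sequences each, the result has three entries, which is the case where an array sized by the number of sequences would be written out of bounds.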

Compilation problem (mac os 10.13, Kernel Version: Darwin 17.0.0)

Hi,

I tried to compile both v1.2 & v1.4. Below are the error messages.
########
g++ -Wall -c compareFiles.cpp
g++: warning: couldn’t understand kern.osversion ‘17.0.0
In file included from /usr/include/Availability.h:194:0,
from /usr/include/stdlib.h:61,
from compareFiles.h:30,
from compareFiles.cpp:27:
/usr/include/AvailabilityInternal.h:25584:74: error: missing binary operator before token "("
#if defined(__has_feature) && defined(__has_attribute) && __has_attribute(availability)
^
In file included from /usr/include/stdlib.h:61:0,
from compareFiles.h:30,
from compareFiles.cpp:27:
/usr/include/Availability.h:387:74: error: missing binary operator before token "("
#if defined(__has_feature) && defined(__has_attribute) && __has_attribute(availability)
^
make: *** [compareFiles.o] Error 1

trimal changes fasta headers when contain ':'

I use trimAl v1.4.rev9 build[2012-08-09] and have problems with some sequence names.
Some names look like 'TCOGS2:TC012457-PA'.
(They are "official" names from the Tribolium castaneum genome)

If I run trimAl on the dataset containing this sequence
trimal -in input.fas -noallgaps -keepseqs
'TCOGS2:TC012457-PA' is truncated to 'TCOGS2'

Could you prevent trimAl from truncating sequence names that contain ':'?

Thanks
Regards

readAl/trimAl fails to output to nexus, mega and nbrf if sequences are classified 'Deg'

If the input alignment (DNA/RNA) type is classified as DNADeg or RNADeg, both programs fail to output correctly:

NEXUS OUTPUT:

#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=12 NCHAR=637;
;
...

NEXUS DESIRED OUTPUT

#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=12 NCHAR=637;
FORMAT DATATYPE=DNA INTERLEAVE=yes GAP=-;

MEGA OUTPUT:

#MEGA
!Title dataset/huge/example4.sequential.phy.nexus;
NSeqs=12 Nsites=637 indel=- CodeTable=Standard;

MEGA DESIRED OUTPUT

#MEGA
!Title dataset/huge/example4.sequential.phy.nexus;
!Format DataType=DNA NSeqs=12 Nsites=637 indel=- CodeTable=Standard;

NBRF OUTPUT

>;VRA17
VRA17 637 bases

NBRF DESIRED OUTPUT

>DL;VRA17
VRA17 637 bases

The issue seems to be located in these lines, which leave alg_datatype unset for the degenerate types:

  /* Compute output file datatype */
  getTypeAlignment();
  if (dataType == DNAType)
    alg_datatype = "DL";
  else if (dataType == RNAType)
    alg_datatype = "RL";
  else if (dataType == AAType)
    alg_datatype = "P1";
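
One way to cover the degenerate types is to map them to the same code as their non-degenerate counterparts, which matches the desired NBRF output shown above. The sketch below is hypothetical: the `SeqType` enum only mirrors the type names mentioned in this report, and `nbrfDatatype` is not a trimAl function.

```cpp
#include <string>

// Hypothetical enum mirroring the type names mentioned in the report.
enum SeqType { DNAType, RNAType, AAType, DNADeg, RNADeg, UnknownType };

// Hypothetical helper: degenerate DNA/RNA map to the same NBRF/PIR code as
// their non-degenerate counterparts, so the datatype field is never left empty
// for a recognized type.
std::string nbrfDatatype(SeqType t) {
    switch (t) {
        case DNAType: case DNADeg: return "DL";
        case RNAType: case RNADeg: return "RL";
        case AAType:               return "P1";
        default:                   return "";
    }
}
```

The same grouping would apply to the DATATYPE fields of the NEXUS and MEGA headers.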

Degraded.zip

BLOSUM45 error

Hi there,
I just ran into an error that I believe is caused by the BLOSUM45 matrix (it seems that the wildcard character is needed but not accepted by trimal). Both commands below produce a segmentation fault.

trimal -in PTHR26451_2456.ali -out PTHR26451_2456.phy -gt 0.9 -cons 60 -st 0.3 -matrix BLOSUM45b
trimal -in PTHR26451_2456.ali -out PTHR26451_2456.phy -gt 0.9 -cons 60

Please find attached the files that cause the error for your consideration.
Many thanks,
DE

Data.zip

Implement support for different genetic codes

Different genetic codes: Universal, mammalian mt, yeast mt, mold mt, invertebrate mt, ciliate nuclear, echinoderm mt, euplotid mt, alternative yeast nuclear, ascidian mt.

When it comes to labeling stop codons, you need just a nuclear and mitochondrial code parameter.