agordon / fastx_toolkit
FASTA/FASTQ pre-processing programs
License: Other
FASTX-Toolkit
=============

*******************************************************************
*                                                                 *
*  FASTX TOOLKIT is unmaintained software.                        *
*  No new features have been added since 2010.                    *
*                                                                 *
*  There are many better alternatives for low-level FASTQ/FASTA   *
*  manipulation. Use at your own risk.                            *
*                                                                 *
*******************************************************************

Short Summary
===============

The FASTX-Toolkit is a collection of command line tools for
Short-Reads FASTA/FASTQ files preprocessing.

More Details
==============

Next-Generation sequencing machines usually produce FASTA or FASTQ files,
containing multiple short-reads sequences (possibly with quality information).

The main processing of such FASTA/FASTQ files is mapping (aka aligning) the
sequences to reference genomes or other databases using specialized programs.
Examples of such mapping programs are:
    Blat (http://www.kentinformatics.com/index.asp),
    SHRiMP (http://compbio.cs.toronto.edu/shrimp),
    LastZ (http://www.bx.psu.edu/miller_lab),
    MAQ (http://maq.sourceforge.net/)
and many others.

However, it is sometimes more productive to preprocess the FASTA/FASTQ files
before mapping the sequences to the genome - manipulating the sequences to
produce better mapping results.

The FASTX-Toolkit tools perform some of these preprocessing tasks.

Available Tools
===============

FASTQ-to-FASTA - Converts a FASTQ file to a FASTA file.
FASTQ-Statistics - Scans a FASTQ file, and produces some statistics about
    the quality and the sequences in the file.
FASTQ-Quality-BoxPlot and FASTQ-Nucleotides-Distribution - Generate charts
    based on the statistics generated by FASTQ-Statistics. These charts can
    be used to quickly see the quality of the sequenced library.
FASTQ-Quality-Converter - Converts from ASCII to numeric quality scores.
FASTQ-Quality-Filter - Removes low-quality sequences from FASTQ files.
FASTX-Artifacts-Filter - Removes some sequencing artifacts from FASTA/Q files.
FASTX-Barcode-Splitter - A common practice is to sequence multiple biological
    samples in the same library (marking each sample using a dedicated
    barcode). The resulting FASTA/Q file contains intermixed sequences from
    those samples. This tool separates FASTA/Q files into several individual
    files, based on the barcodes.
FASTX-Clipper - Adapters (aka Linkers) are added to the library (before
    sequencing), and should be removed from the resulting FASTA/Q file.
    This tool removes (clips) adapters.
FASTA-Clipping-Histogram - After clipping a FASTA file, this tool generates
    a chart showing the length of the clipped sequences.
FASTX-Reverse-Complement - Produces a reverse-complement of a FASTA/Q file.
    If a FASTQ file is given, the quality scores are also reversed.
FASTX-Trimmer - Extracts sub-sequences from a FASTA/Q file. Two examples are:
    removing barcodes from the 5'-end of all sequences in a FASTQ file;
    cutting 7 nucleotides from the 3'-end of all sequences in a FASTA file.

Galaxy
======

Galaxy (https://usegalaxy.org) is a web-based framework for computational
biology.

While the programs in the FASTX-Toolkit are command-line based, the package
includes the necessary files to integrate the tools into a Galaxy server,
allowing users to execute these tools from their web browser.

If you run your own local mirror of a Galaxy server, you can integrate the
FASTX-Toolkit into your Galaxy server.

Software Requirements
=====================

1. GCC is required to compile most tools.
2. The FASTA-Clipping-Histogram tool requires Perl and the "PerlIO::gzip" and
   "GD::Graph::bars" modules.
   Installing the Perl modules can be accomplished by running:
       $ sudo cpan 'PerlIO::gzip'
       $ sudo cpan 'GD::Graph::bars'
3. FASTX-Barcode-Splitter requires the GNU Sed program.
4. FASTQ-Quality-Boxplot and FASTQ-Nucleotides-Distribution require the
   'gnuplot' program.

Installation
============

When downloading the git repository from github, use the following:

    $ git clone https://github.com/agordon/fastx_toolkit
    $ cd fastx_toolkit
    $ ./reconf
    $ ./configure
    $ make

When downloading a released version archive:

    $ wget https://github.com/agordon/fastx_toolkit/releases/download/0.0.14/fastx_toolkit-0.0.14.tar.bz2
    $ tar -xjvf fastx_toolkit-0.0.14.tar.bz2
    $ cd fastx_toolkit-0.0.14
    $ ./configure
    $ make

The available releases are here:
    https://github.com/agordon/fastx_toolkit/releases

To install the tools, run (as root):

    $ sudo make install

This will install the tools into /usr/local/bin.
To install the tools to a different location, change the 'configure' step to:

    $ ./configure --prefix=/DESTINATION/DIRECTORY

The libgtextutils package is required to build fastx-toolkit,
see https://github.com/agordon/libgtextutils/ .

Command Line Usage
==================

Most tools support the "-h" argument to show a short help screen.
Better documentation is not available at this moment.
Some more details and examples are available in the <help> section of the
XML tool files (in the 'galaxy' subdirectory).

Galaxy Installation
===================

Galaxy installation should be done manually, and requires technical
understanding of the Galaxy framework.

1. Build and install the command line tools (as described above).
2. Make a backup of your Galaxy installation (better safe than sorry).
3. Run the 'install_galaxy_files.sh' script, and specify the Galaxy root
   directory. This script copies the files from the 'galaxy' sub-directory
   into your Galaxy mirror directory.
4. Manually add the content of the ./galaxy/fastx_toolkit_conf.xml file
   into your Galaxy's tool_conf.xml.
5. Edit the [YOUR-GALAXY]/tool-data/fastx_clipper_sequences.txt file,
   and add your custom adapters/linkers.
6. Modify "fastx_barcode_splitter_galaxy_wrapper.sh" as explained below
   (see section "Special configuration for Barcode-Splitter").
7. Restart Galaxy.

Always make a backup of your Galaxy server files before trying to install
the FASTX-Toolkit.

Galaxy Testing
==============

The following tools support Galaxy's functional testing
(run from Galaxy's main directory):

    $ sh run_functional_tests.sh -id cshl_fastq_qual_conv
    $ sh run_functional_tests.sh -id cshl_fastq_to_fasta
    $ sh run_functional_tests.sh -id cshl_fastq_qual_stat
    $ sh run_functional_tests.sh -id cshl_fastx_trimmer
    $ sh run_functional_tests.sh -id cshl_fastx_reverse_complement
    $ sh run_functional_tests.sh -id cshl_fastx_artifacts_filter
    $ sh run_functional_tests.sh -id cshl_fasta_collapser
    $ sh run_functional_tests.sh -id cshl_fastx_clipper

Special configuration for Barcode-Splitter
==========================================

When running the barcode-splitter tool from the command line you specify a
prefix directory - the output files will be written to that directory
(similar to GNU's split program usage).

Running the barcode-splitter inside Galaxy requires a special hack because
(I don't know how to|Galaxy can't) create a variable number of output
datasets. The number of required output files is determined by the tool only
AFTER reading the barcodes description file.

The Galaxy version of Barcode-Splitter works like this:

1. A FASTA/FASTQ file and a barcode description file are fed to the tool.
2. The tool produces a single output dataset (inside Galaxy).
   This output is an HTML file, containing links to the split FASTA files.
3. Users can use the links to get the split FASTA files.
   (Since Galaxy's 'upload data' tool accepts URLs, this is not a real
   problem).
4. As the Galaxy administrator, you'll have to edit the
   'fastx_barcode_splitter_galaxy_wrapper.sh' script and change BASEPATH and
   PUBLICURL to point to a publicly accessible path on your server.

Example:
fastx_barcode_splitter_galaxy_wrapper.sh contains:

    BASEPATH="/media/sdb1/galaxy/barcode_splits/"
    PUBLICURL="http://tango.cshl.edu/barcode_splits/"

When a user runs the barcode splitter tool, the FASTA files will be
generated in "/media/sdb1/galaxy/barcode_splits/".
The URL "http://tango.cshl.edu/barcode_splits" is set (in an apache server)
to serve files from "/media/sdb1/galaxy/barcode_splits/", with the following
configuration:

    Alias /barcode_splits "/media/sdb1/galaxy/barcode_splits/"
    <Directory "/media/sdb1/galaxy/barcode_splits/">
        AllowOverride None
        Order allow,deny
        Allow from all
    </Directory>

Licenses
========

FASTX-Toolkit is distributed under the Affero GPL version 3 or later (AGPLv3),
EXCEPT
all files under the 'galaxy' sub-directory are distributed under the same
license as Galaxy itself (which is an MIT-style license).

While IANAL, these licenses basically mean that:

1. You're free to use FASTX-toolkit.
2. You're free to integrate FASTX-toolkit in your Galaxy mirror server
   (or any other server).
3. You're free to modify the files under 'galaxy', without making your
   modifications public.
4. If you modify the FASTX-toolkit tools, and make those modifications
   publicly available (either as downloadable tools, as part of another
   product, or as a web-based server) - you must make the modified source
   code freely available (free as in speech).

See the COPYING file for the full Affero GPL.
See the GALAXY-LICENSE file for Galaxy's license.

Please remember:

THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU.
SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

=============

Please send all comments, suggestions, bug reports (or better yet - bug fixes)
to [email protected] .
Hi to all!
I am trying to replicate some results from an article in which the authors state that they used fastx_quality_filter + fastx_trimmer with paired-end reads. I am not sure how to do this: there does not seem to be an option to run these programs on both files of a read pair at the same time. If I run them on each file independently, I suppose I then have to synchronize the reads again. Is this correct?
Thank you,
Vera
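One way to re-synchronize the two files after filtering each one independently is to keep only the reads whose name survives in both. The following is a minimal Python sketch of that idea, not part of FASTX-Toolkit. It assumes 4-line FASTQ records and mate names that differ only by a trailing /1 or /2; the helper names (parse_fastq, read_id, resync) are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# One FASTQ record: header line, sequence, separator ('+'), quality string.
Record = Tuple[str, str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from FASTQ text, assuming exactly 4 lines per record."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, sep, qual = stripped[i:i + 4]
        yield (header, seq, sep, qual)


def read_id(header: str) -> str:
    """Read name without the leading '@' and without a /1 or /2 mate suffix."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def resync(r1: Iterable[Record], r2: Iterable[Record]) -> List[Tuple[Record, Record]]:
    """Keep only pairs whose read name is present in both files (R1 order)."""
    r2_by_id: Dict[str, Record] = {read_id(rec[0]): rec for rec in r2}
    pairs = []
    for rec in r1:
        mate = r2_by_id.get(read_id(rec[0]))
        if mate is not None:
            pairs.append((rec, mate))
    return pairs


# Toy data: read "a" was dropped from R2 and read "c" from R1 by the filter.
r1_lines = ["@a/1", "ACGT", "+", "IIII", "@b/1", "GGCC", "+", "IIII"]
r2_lines = ["@b/2", "TTAA", "+", "IIII", "@c/2", "CCGG", "+", "IIII"]
pairs = resync(parse_fastq(r1_lines), parse_fastq(r2_lines))
for left, right in pairs:
    print(left[0], right[0])  # prints "@b/1 @b/2"
```

The sketch holds all of R2 in memory, which is fine for small files but would need a different strategy for large sequencing runs.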
It took me a while to figure out that the installation sequence is
cd fastx_toolkit/
./reconf
./configure
make
make install
I think the second step needs documentation.
Thanks! Volker
Would be nice to have the toolkit automatically recognize gzip compressed files as input instead of having to uncompress and pipe to fastx toolkit.
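As an illustration of what such auto-detection could look like, here is a small Python sketch (not FASTX-Toolkit code) that checks the two-byte gzip magic number and opens the file accordingly. The function name open_maybe_gzip is hypothetical.

```python
import gzip
import os
import tempfile
from typing import IO

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_maybe_gzip(path: str) -> IO[str]:
    """Open a text file, transparently decompressing it if it is gzipped."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "rt")


# Demo: the same record is read back identically from both kinds of file.
record = "@read1\nACGT\n+\nIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "reads.fastq")
    gz_path = os.path.join(tmp, "reads.fastq.gz")
    with open(plain_path, "wt") as fh:
        fh.write(record)
    with gzip.open(gz_path, "wt") as fh:
        fh.write(record)
    with open_maybe_gzip(plain_path) as fh:
        plain_text = fh.read()
    with open_maybe_gzip(gz_path) as fh:
        gz_text = fh.read()
```

Detecting by content rather than by file extension also works for files that were renamed, but it cannot be applied to a non-seekable stream such as STDIN without buffering the first bytes.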
This is different from version 0.0.13, which used Q64 as the default. Am I right?
If so, it would be better if this change were made clearer to users.
The link to download release 0.0.14 is broken:
https://github.com/agordon/fastx_toolkit/releases/download/0.0.14/fastx_toolkit-0.0.14.tar.bz2
Could you please provide binaries for Linux x86_64?
I cloned your repo and successfully created a configure file, but it won't complete successfully quite yet. Could you provide any tips? Thanks!
git clone https://github.com/agordon/fastx_toolkit.git
cd fastx_toolkit/
libtoolize --force
aclocal
autoheader
automake --force-missing --add-missing
vim configure.ac
automake --force-missing --add-missing
autoconf
autoreconf
automake --force-missing --add-missing
./configure --prefix=/home/unix/slowikow/.local/
./configure: line 14512: syntax error near unexpected token `GTEXTUTILS,gtextutils'
./configure: line 14512: `PKG_CHECK_MODULES(GTEXTUTILS,gtextutils)'
In file included from seqalign_test.cpp:5:0:
../libfastx/sequence_alignment.h:146:32: error: ‘ssize_t’ does not name a type
score_type safe_score ( const ssize_t query_index, const ssize_t target_index)
^
../libfastx/sequence_alignment.h:146:59: error: ‘ssize_t’ does not name a type
score_type safe_score ( const ssize_t query_index, const ssize_t target_index)
^
../libfastx/sequence_alignment.h:244:37: error: ‘ssize_t’ has not been declared
void find_alignment_starting_point(ssize_t &new_query_index, ssize_t &new_targ
^
../libfastx/sequence_alignment.h:244:63: error: ‘ssize_t’ has not been declared
void find_alignment_starting_point(ssize_t &new_query_index, ssize_t &new_targ
^
Makefile:256: recipe for target 'seqalign_test.o' failed
make[3]: *** [seqalign_test.o] Error 1
make[3]: Leaving directory '/media/wkstn/Data/Course/Project/fastx_toolkit-0.0.12/src/seqalign_test'
Makefile:252: recipe for target 'all-recursive' failed
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory '/media/wkstn/Data/Course/Project/fastx_toolkit-0.0.12/src'
Makefile:279: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/media/wkstn/Data/Course/Project/fastx_toolkit-0.0.12'
Makefile:209: recipe for target 'all' failed
make: *** [all] Error 2
Excuse me, I have a question that I would like answered. I built this software under the arm architecture, but when I tested it, I found that its output file was inconsistent with the given standard output, but it was consistent with the result under the x86 architecture. Is this a successful build?
Build fails under clang 6 due to #pragma pack change during compilation.
We could either build everything with -fpack-struct=1 or restore default packing size after the struct def as shown in the patch below.
--- src/libfastx/fastx.h.orig 2018-05-16 14:50:08 UTC
+++ src/libfastx/fastx.h
@@ -58,7 +58,7 @@ typedef enum {
OUTPUT_SAME_AS_INPUT=3
} OUTPUT_FILE_TYPE;
-#pragma pack(1)
+#pragma pack(push,1)
typedef struct
{
/* Record data - common for FASTA/FASTQ */
@@ -115,6 +115,7 @@ typedef struct
FILE* input;
FILE* output;
} FASTX ;
+#pragma pack(pop)
void fastx_init_reader(FASTX *pFASTX, const char* filename,
There is no "configure" in https://github.com/agordon/fastx_toolkit/archive/0.0.14.tar.gz, just "configure.ac".
Installation
============
To compile to tools, run:
$ ./configure
$ make
I ran "autoreconf -fvi" to make it and the Makefile, but the README is incorrect.
Hi, when I run the command
fastx_quality_stats -i input.fastq -o output.stats
where input.fastq consists of
@0 <unknown description>
A
+
]
@1 <unknown description>
A
+
]
then output.stats consists of
column count min max sum mean Q1 med Q3 IQR lW rW A_Count C_Count G_Count T_Count N_Count Max_count
1 2 60 60 120 60.00 60 50 50 -10 75 35 2 0 0 0 0 2
Note the med column has a value of 50, whereas the mean column has a value of 60, and the two quality scores in input.fastq are identical (]).
I believe this is a bug?
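The expected values can be checked by hand. The sketch below decodes the quality character from the example, assuming a Phred offset of 33 (an assumption, though it matches the reported min, max and mean of 60), and computes the mean and median with Python's standard library.

```python
import statistics

PHRED_OFFSET = 33  # assumed offset; consistent with min/max = 60 in the report


def decode_quality(qual: str, offset: int = PHRED_OFFSET) -> list:
    """Convert an ASCII quality string to a list of numeric scores."""
    return [ord(ch) - offset for ch in qual]


# Column 1 of the two reads in input.fastq: both have quality character ']'.
scores = decode_quality("]") + decode_quality("]")
mean_q = statistics.mean(scores)
median_q = statistics.median(scores)
print(scores, mean_q, median_q)
```

With two identical scores of 60, the mean and the median are both 60, so a reported median of 50 (and an IQR of -10) is inconsistent with the input.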
PS: fastx_quality_stats -h prints:
usage: fastx_quality_stats [-h] [-N] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.14 by A. Gordon ([email protected])
[-h] = This helpful help screen.
[-i INFILE] = FASTQ input file. default is STDIN.
[-o OUTFILE] = TEXT output file. default is STDOUT.
[-N] = New output format (with more information per nucleotide/cycle).
The *OLD* output TEXT file will have the following fields (one row per column):
column = column number (1 to 36 for a 36-cycles read solexa file)
count = number of bases found in this column.
min = Lowest quality score value found in this column.
max = Highest quality score value found in this column.
sum = Sum of quality score values for this column.
mean = Mean quality score value for this column.
Q1 = 1st quartile quality score.
med = Median quality score.
Q3 = 3rd quartile quality score.
IQR = Inter-Quartile range (Q3-Q1).
lW = 'Left-Whisker' value (for boxplotting).
rW = 'Right-Whisker' value (for boxplotting).
A_Count = Count of 'A' nucleotides found in this column.
C_Count = Count of 'C' nucleotides found in this column.
G_Count = Count of 'G' nucleotides found in this column.
T_Count = Count of 'T' nucleotides found in this column.
N_Count = Count of 'N' nucleotides found in this column.
max-count = max. number of bases (in all cycles)
The *NEW* output format:
cycle (previously called 'column') = cycle number
max-count
For each nucleotide in the cycle (ALL/A/C/G/T/N):
count = number of bases found in this column.
min = Lowest quality score value found in this column.
max = Highest quality score value found in this column.
sum = Sum of quality score values for this column.
mean = Mean quality score value for this column.
Q1 = 1st quartile quality score.
med = Median quality score.
Q3 = 3rd quartile quality score.
IQR = Inter-Quartile range (Q3-Q1).
lW = 'Left-Whisker' value (for boxplotting).
rW = 'Right-Whisker' value (for boxplotting).
Hi Gordon and all,
In "Computer Security, Privacy, and DNA Sequencing: Compromising Computers with Synthesized DNA, Privacy Leaks, and More" (2017), Ney, Koscher, Organick, Ceze & Kohno, University of Washington, report a buffer overflow in FASTX-Toolkit, caused by the difference between MAX_SEQ_LINE_LENGTH (25000) and MAX_SEQUENCE_LENGTH (2000). Would it suffice to set MAX_SEQ_LINE_LENGTH to 2000 to solve the problem?
Excuse me, does fastx_toolkit provide official test cases?
Hi, I am very confused about how to use fastx_barcode_splitter.pl with paired-end FASTQ files.
I have two FASTQ files, R1 and R2, one per read end, and fastx_barcode_splitter.pl seems to handle only single-end FASTQ files.
What should I do?
Thanks !
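One possible approach outside the toolkit is to decide the sample from the barcode on R1 and carry the R2 mate along with it. The Python sketch below illustrates this under stated assumptions: reads are already loaded as synchronized (R1, R2) pairs, the barcode sits at the 5' end of R1, and matching is exact. It is not how fastx_barcode_splitter.pl is implemented, and the function and sample names are hypothetical.

```python
from typing import Dict, List, Tuple

# One FASTQ record: header line, sequence, separator ('+'), quality string.
Record = Tuple[str, str, str, str]
Pair = Tuple[Record, Record]


def split_pairs_by_barcode(
    pairs: List[Pair], barcodes: Dict[str, str]
) -> Dict[str, List[Pair]]:
    """Assign each read pair to a sample using the barcode at the start of R1."""
    bins: Dict[str, List[Pair]] = {name: [] for name in barcodes}
    bins["unmatched"] = []
    for r1, r2 in pairs:
        for name, barcode in barcodes.items():
            if r1[1].startswith(barcode):  # exact match, no mismatches allowed
                bins[name].append((r1, r2))
                break
        else:
            bins["unmatched"].append((r1, r2))
    return bins


# Toy data with hypothetical sample names and barcodes.
barcodes = {"sampleA": "ACGT", "sampleB": "TTGG"}
pairs = [
    (("@r1/1", "ACGTAAAA", "+", "IIIIIIII"), ("@r1/2", "CCCCCCCC", "+", "IIIIIIII")),
    (("@r2/1", "TTGGAAAA", "+", "IIIIIIII"), ("@r2/2", "GGGGGGGG", "+", "IIIIIIII")),
    (("@r3/1", "NNNNAAAA", "+", "IIIIIIII"), ("@r3/2", "AAAAAAAA", "+", "IIIIIIII")),
]
bins = split_pairs_by_barcode(pairs, barcodes)
for name, members in bins.items():
    print(name, [r1[0] for r1, _ in members])
```

Because both mates are binned together, the per-sample R1 and R2 outputs stay in the same order and need no later re-synchronization.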
Hi, fastx_toolkit and libgtextutils no longer compile when using a recent compiler like GCC 7:
make[3]: Entering directory '/tmp/SBo/fastx_toolkit-0.0.14/src/fasta_formatter'
g++ -DHAVE_CONFIG_H -I. -I../.. -I/usr/local/include/gtextutils -I../../src/libfastx -O2 -fPIC -Werror=implicit-fallthrough -Wall -Wextra -Wformat-nonliteral -Wformat-security -Wswitch-default -Wswitch-enum -Wunused-parameter -Wfloat-equal -Werror -DDEBUG -g -O1 -MT fasta_formatter.o -MD -MP -MF .deps/fasta_formatter.Tpo -c -o fasta_formatter.o fasta_formatter.cpp
fasta_formatter.cpp: In function ‘void parse_command_line(int, char**)’:
fasta_formatter.cpp:105:9: error: this statement may fall through [-Werror=implicit-fallthrough=]
usage();
~~~~~^~
fasta_formatter.cpp:107:3: note: here
case 'i':
^~~~
cc1plus: all warnings being treated as errors
make[3]: *** [Makefile:425: fasta_formatter.o] Error 1
make[3]: Leaving directory '/tmp/SBo/fastx_toolkit-0.0.14/src/fasta_formatter'
The installation of fastx_toolkit 0.0.14 installs m4 macros into $prefix/share/aclocal. It seems wrong to install these files when they are not actually needed at runtime.
All tools parse (and many use) the -Q option to specify the quality offset, with default 33. However, this is not documented on the website or in the -h output of the affected tools.
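To show why the offset matters, the Python sketch below decodes the same quality character under both common offsets and re-encodes a Phred+64 string as Phred+33. It is an illustration only, not toolkit code; the function names are hypothetical.

```python
def decode_quality(qual: str, offset: int) -> list:
    """Numeric quality scores for an ASCII quality string at a given offset."""
    return [ord(ch) - offset for ch in qual]


def phred64_to_phred33(qual: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33."""
    scores = decode_quality(qual, 64)
    if min(scores) < 0:
        # Characters below '@' cannot be valid Phred+64 scores.
        raise ValueError("quality string is not valid Phred+64")
    return "".join(chr(score + 33) for score in scores)


# The same character means very different scores under the two offsets.
as_q33 = decode_quality("h", 33)
as_q64 = decode_quality("h", 64)
converted = phred64_to_phred33("hhhh")
print(as_q33, as_q64, converted)  # prints "[71] [40] IIII"
```

A file read with the wrong offset therefore yields scores that are shifted by 31, which is why an undocumented default can silently change filtering results.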
Shouldn't the actual configure file be included in the repo? It is currently in your .gitignore, so when I clone the repo there is no ./configure to do the build. If I download the .tar.bz2 directly from your site it includes the configure file.
Your repo does include the configure.ac file, but I don't remember that being used for the initial build.
Thanks!
Line 105: usage(); should be followed by "exit();"
Otherwise, compilation fails on Ubuntu 18.04 with message:
fasta_formatter.cpp:105:9: error: this statement may fall through [-Werror=implicit-fallthrough=]
usage();
~~~~~^~
fasta_formatter.cpp:107:3: note: here
case 'i':
^~~~
cc1plus: all warnings being treated as errors
Hi Gordon,
similarly to libgtextutils, FASTX-Toolkit fails to build when hardening flags are enabled.
make[5]: Entering directory `/home/charles/debian/debian-med/fastx-toolkit/src/libfastx'
gcc -DHAVE_CONFIG_H -I. -I../.. -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -Wall -Wextra -Wformat-nonliteral -Wformat-security -Wswitch-default -Wswitch-enum -Wunused-parameter -Wfloat-equal -Werror -DDEBUG -g -O1 -MT chomp.o -MD -MP -MF .deps/chomp.Tpo -c -o chomp.o chomp.c
mv -f .deps/chomp.Tpo .deps/chomp.Po
gcc -DHAVE_CONFIG_H -I. -I../.. -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -Wall -Wextra -Wformat-nonliteral -Wformat-security -Wswitch-default -Wswitch-enum -Wunused-parameter -Wfloat-equal -Werror -DDEBUG -g -O1 -MT fastx.o -MD -MP -MF .deps/fastx.Tpo -c -o fastx.o fastx.c
In file included from /usr/include/stdio.h:937:0,
from fastx.c:18:
In function 'fgets',
inlined from 'fastx_read_next_record' at fastx.c:324:11:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:261:2: error: call to '__fgets_chk_warn' declared with attribute warning: fgets called with bigger size than length of destination buffer [-Werror]
return __fgets_chk_warn (__s, __bos (__s), __n, __stream);
^
In function 'fgets',
inlined from 'fastx_read_next_record' at fastx.c:370:12:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:261:2: error: call to '__fgets_chk_warn' declared with attribute warning: fgets called with bigger size than length of destination buffer [-Werror]
return __fgets_chk_warn (__s, __bos (__s), __n, __stream);
^
cc1: all warnings being treated as errors
make[5]: *** [fastx.o] Error 1
make[5]: Leaving directory `/home/charles/debian/debian-med/fastx-toolkit/src/libfastx'
make[4]: *** [all-recursive] Error 1
make[4]: Leaving directory `/home/charles/debian/debian-med/fastx-toolkit/src'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/home/charles/debian/debian-med/fastx-toolkit'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/home/charles/debian/debian-med/fastx-toolkit'
dh_auto_build: make -j1 returned exit code 2
make[1]: *** [override_dh_auto_build] Error 2
make[1]: Leaving directory `/home/charles/debian/debian-med/fastx-toolkit'
make: *** [build] Error 2
dpkg-buildpackage: error: debian/rules build gave error exit status 2
Cheers,
Charles Plessy
Debian Med packaging team,
http://www.debian.org/devel/debian-med
Tsurumi, Kanagawa, Japan
I'm fairly new to suggesting fixes for code, but I think I've got a useful fix for an error that comes up when compiling the toolkit using clang 17. I haven't had this issue when compiling with gcc 11.4.0 on ubuntu, but my colleague had the issue with clang 17 on macOS.
There is an unused variable in the fastx_artifacts_filter code that throws an error, specifically "n_count". I used grep to make sure that variable is not used in any other code (it's not) and just deleted it using a sed script. Here is the script that fixes the issue, run this from the fastx_toolkit-0.0.14 directory:
$ sed -i '88,90d;58d' src/fastx_artifacts_filter/fastx_artifacts_filter.c
I couldn't find any instruction on how to cite fastx toolkit in a paper. What reference should we use to give credit to the authors?
-Gael
I am trying to translate ONT Minion reads from fastq to fasta but get the following error
fastq_to_fasta: Error: invalid quality score data on line 19124 (quality_tok = "+"
Any suggestions?