transcript / samsa2 Goto Github PK

SAMSA pipeline, version 2.0. An open-source metatranscriptomics pipeline for analyzing microbiome data, built around DIAMOND and customizable reference databases.

License: GNU General Public License v3.0

R 38.33% Python 30.81% Shell 30.86%

samsa2's Introduction

SAMSA2 - A complete metatranscriptome analysis pipeline

Version 2.2.0 - Yesod - Modifications added with help from [email protected]:

Multithreading added for PEAR, Trimmomatic, SortMeRNA
The script automatically creates a checkpoint file. Once a step is finished, it writes the name of that specific step in checkpoint and that step is skipped on a rerun of the master_script. This is done to avoid re-running CPU-intensive steps if unnecessary.
A new version of the master script now exists, called "master_script_preserving_unmerged.sh". In this script, in the merging step, unmerged reads are concatenated and added to a single file. The forward read and the reverse (complement) read are concatenated with a string of 20 Ns in the middle: This is done through a new R script entitled: combining_umerged.R
Extra care is taken to remove unnecessary files once a step is performed to keep disk usage at a minimum.
Each step contains an exit statement to be printed if the master script dies due to an unforseen error.
Trimmomatic removes adapter contamination according to a specific fasta file.
All options, read & program location are to be specified in the first section of the script.
The script is formated to be run on a HPC using a SLURM job scheduler, but this can be easily changed / removed.
The flag --num_alignments 0 in the ribosomal sortmrna step has been removed. This caused problems and slowed things down a lot. Plus, we don't care about the rRNA alignments - whether a sequence aligns to 1 or 1,000 rRNA, it's out anyways...

Version 2 of the SAMSA pipeline - faster! Lighter! More options! Less waiting!

How to cite:

If you use SAMSA2, please cite the following paper:

Westreich, S.T., Treiber, M.L., Mills, D.A. et al. SAMSA2: a standalone metatranscriptome analysis pipeline. BMC Bioinformatics 19, 175 (2018) doi:10.1186/s12859-018-2189-z

New in version 2:

DIAMOND integration, allowing for SAMSA2 to be run without ever needing an MG-RAST account.
Option to annotate against custom databases.
Better, more polished R scripts that can be executed from the command line.
PCA plots and other graphical outputs.
Filtering of ribosomes for even more speed.
And more!

Dependencies

SAMSA2 requires Python2 for aggregation scripts. Currently, this pipeline works mostly with Python3, although there may be some errors not yet caught.

The following programs can be downloaded OR can be installed from the binaries provided in the programs/ folder.

DIAMOND, version 0.8.3: https://github.com/bbuchfink/diamond
Trimmomatic, a flexible read cleaner: http://www.usadellab.org/cms/?page=trimmomatic
PEAR, if using paired-end data (recommended): https://sco.h-its.org/exelixis/web/software/pear/
SortMeRNA: http://bioinfo.lifl.fr/RNA/sortmerna/

Quick start

Download SAMSA2:
git clone https://github.com/transcript/samsa2.git
Either install the dependencies from the links above, or use the setup_and_test/package_installation.bash script provided with SAMSA2 for installing from the included binaries.
Access the full databases by downloading them using the full_database_download.bash script, located in the setup_and_test folder. This downloads the full RefSeq bacteria database, for organism and specific functional results, and the SEED Subsystems database, which is used for hierarchical functional ontology.
Make changes to the master_script.bash, which performs the first 3 of 4 steps in the SAMSA2 pipeline (preprocessing, annotation, aggregation)
If not using master_script, use DIAMOND to annotate your reads against a database of your choosing (note that database must be local and DIAMOND-indexed). See "example_DIAMOND_annotation_script.bash" for more details.
If not using master_script, use "DIAMOND_analysis_counter.py" to create a ranked abundance summary of the DIAMOND results from each metatransciptome file.
Import these abundance summaries into R and use "run_DESeq_stats.R" to determine the most significantly differing features between either individual metatranscriptomes, or control vs. experimental groups.

SAMSA: Simple Analysis of Metatranscriptomes by Sequence Annotation

Metatranscriptome, RNA-seq data from multiple members of a microbial community, offers incredibly powerful insights into the workings of a complex ecosystem. RNA sequences are able to not only identify the individual members of a community down to the strain level, but can also provide information on the activity of these microbes at the time of sample collection - something that cannot be determined through other meta- (metagenome, 16S rRNA sequencing) method.

However, working with metatranscriptome data often proves challenging, given its high complexity and large size. SAMSA is one of the first bioinformatics pipelines designed with metatranscriptome data specifically in mind. It accepts raw sequence data in FASTQ form as its input, and performs four phases:

Preprocessing: If the sequencing was paired-end, PEAR is used to merge mate pairs. Trimmomatic is used for the removal of adaptor contamination and low-quality reads. SortMeRNA removes ribosomal sequences, as these don't contribute to the mRNA functional profile of the metatranscriptome.

Annotation: Annotation is completed using DIAMOND, an accelerated BLAST-like sequence aligner. (Why DIAMOND? At a standard rate of 10 annotations per second, a standard BLAST approach would take several months to finish - just for a single file!)

Note: DIAMOND is set by default to run in fast mode, where it prioritizes speed. If you are especially concerned about accuracy, you can change DIAMOND to sensitive mode by editing lines 162 and 199 in the master_script, replacing "--fast" in the DIAMOND command with "--sensitive". This may significantly increase memory requirements.

Aggregation: DIAMOND returns results on a per-read basis, a bit like a ticker tape or a line item receipt. In the aggregation step, Python scripts condense these line-by-line results to create summary tables.

Analysis: R scripts use DESeq to compute most significantly different features between control vs. experimental samples. These R scripts generate a tabular output with assigned p-values and log2FoldChange scores for each feature. These 'features' can be either organisms or specific functions. R can also create graphs showing visual representation of the metatranscriptome(s).

Individual programs in SAMSA2 and their functions

For more information, please consult the manual, which goes into more detail on each step in the SAMSA pipeline.

Preprocessing: The following program steps can either be run through master_script.bash or individually:

PEAR - merges two paired-end FASTQ sequencing files together to make a single file of extended fragments.
Trimmomatic - removes low-quality sequences and/or adaptor contamination.
SortMeRNA - removes ribosomal reads, as these don't contribute to the mRNA profile of the metatranscriptome.

Annotation: This step can be performed through master_script.bash, or individually.

DIAMOND_example_script.bash - a guide to structuring commmands for DIAMOND to perform annotations or for building dictionaries, including customizable paths to system locations and reference dictionaries.

Aggregation: This step can be performed through master_script.bash, or individually.

DIAMOND_analysis_counter.py - creates a summary file from DIAMOND results, condensing identical matches and providing counts. Can be enabled to return either organism or functional results.
DIAMOND_subsystems_analysis_counter.py - handles Subsystems hierarchical annotations, keeping all hierarchy levels in the output.
DIAMOND_specific_organism_counter.py - allows for subsetting of results to obtain only results matching a specific genus, species, or function.

Analysis:
Note: these programs are located in the "R_scripts" folder. They all require two sets of DIAMOND summarized results from the aggregation step; an experimental set and a control set.

diversity_stats.R - this program computes Shannon and Simpson diversity scores.
make_DESeq_PCA.R - this program creates a PCA plot of the supplied results (either organism or functional category).
make_DESeq_heatmap.R - generates a heatmap contrasting control vs. experimental sets.
make_combined_graphs.R - generates two stacked bar graphs: a relative version (y axis stretched to 100%) and an absolute version (y axis is raw counts).
run_DESeq_stats.R - computes p-values and most significant differences between summarized results sets.
volcano_plot.R - creates a volcano plot of the DESeq stats from run_DESeq_stats.R output

Need assistance?

Step 1: check the documentation! The documentation includes in-depth explanations of each step, including sample commands. Be sure to check there if you're having an issue on one particular step.

If you're unsure if your files are being processed properly, take a look at the sample files. These correspond to each step in the pipeline. If a quick look (from the command line, "less $file") reveals a dissimilar setup to these example files, there may be an issue with the most recent program used in the pipeline.

Check out the Google Group for SAMSA! https://groups.google.com/forum/#!forum/samsa-bioinformatics-group.

If you find an error with one of these programs, or simply want to ask me questions, you can contact me at [email protected].

Citations of other tools used:

Westreich, S.T., Korf, I., Mills, D.A., Lemay, D.G. (2016) SAMSA: A comprehensive metatranscriptome analysis pipeline. BMC Bioinformatics.

Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.

Kopylova E., Noé L. and Touzet H., "SortMeRNA: Fast and accurate filtering of ribosomal RNAs in metatranscriptomic data", Bioinformatics (2012), doi: 10.1093/bioinformatics/bts611.

Zhang, J., Kobert, K., Flouri, T., Stamatakis, A. (2014). PEAR: a fast and accurate Illumina paired-end Paired-End reAd mergeR. Bioinformatics.

=======

samsa2's People

Contributors

Stargazers

Watchers

samsa2's Issues

Where is the File 2 (R2) or reverse file in the script?

I’ve been looking at the master_script.sh of SAMSA2 and I’ve noticed that the line to identify the reverse file or the R2 file is not really clear. Instead it shows an echo for the file 1 (R1) and ‘awk’ command that will out-put a file with a printed name of R2.

Considering that two independent files exist, for reverse and forward sequences and SAMSA2 can call out one of the files, why does it not call file 2 (R2)?

STEP 1: MERGING OF PAIRED-END FILES USING PEAR
Note: paired-end files are usually named using R1 and R2 in the name.
Example: control_1.R1.fastq
control_1.R2.fastq

Note: if using single-end sequencing, skip this step (comment out).
Note: if performing R analysis (step 6), be sure to name files with
the appropriate prefix ("control_$file" and "experimental_$file")!

cd $starting_files_location
for file in $starting_files_location/*.gz
do
gunzip $file
done

for file in $starting_files_location/_R1
do
file1=$file
file2=echo $file1 | awk -F "R1" '{print $1 "R2" $2}'
out_path=echo $file | awk -F "_R1" '{print $1 ".merged"}'
out_name=echo ${out_path##*/}

$pear_location/pear -f $file1 -r $file2 -o $out_name

done

mkdir $starting_location/step_1_output/
mv $starting_files_location/merged $starting_location/step_1_output/
echo -e "\nPaired-end merging step completed.\n"

####################################################################

I would like to know what exactly is the command doing for file 2 (R2) ?

Thank you for your time,
Camille

make_DESeq_PCA.R

In make_DESeq_PCA.R, line 132 should read:
pdf(file = paste(save_filename,".pdf",sep = ""), width=10, height=7)

Currently, it reads:
pdf(file = save_filename, sep = "", width=10, height=7)

Combining the concatenated (unassembled) forward and reverse reads with the assembled reads

Hi Sam,

I've done a paired-end sequencing and managed to merge ~ 70% of the forward and reverse sequences using pear. I have then used the combining_unmerged.R on the unassembled forward and reverse sequences to concatenate them. I was wondering if you combine the results of concatenated forward and reverse reads at any step with the rest of the assembled reads, and if yes in which script I can find that information. Thanks in advance!

Mona

How much difference is expected between master_script and preserving_unmerged.sh

Dear Sam,

How much difference is expected between master_script_preserving_unmerged.sh and master_script.sh to a same dataset?

Thanks.

Shicheng

Bad operand protections - DIAMOND_specific_organism_retriever.py can be overwritten when calling it

Recently I've been helping some researchers with their Samsa2 troubles. I've noticed a few students using operands inappropriately leading to some headaches. I know it's hard to account for this stuff but it might be nice to actively help prevent them!

In the file DIAMOND_specific_organism_retriever.py you might consider adding the following check after the -O, sys.argv parsing. I have seen a few students make this mistake.

if target_org_outfile == os.path.basename(__file__): sys.exit("Cannot write to the file we are specifying: target_org_outfile: {}".format(target_org_outfile))

error with DIAMOND_subsystems_analysis_counter.py

Hi,

I get an error with DIAMOND_subsystems_analysis_counter.py when i run the test_of_master_script_tiny.bash

Traceback (most recent call last):
File "path_to_folder/samsa2/python_scripts/DIAMOND_subsystems_analysis_counter.py", line 105, in
db = open (db_name, "r", encoding='utf-8', errors='ignore')
TypeError: file() takes at most 3 arguments (4 given)

I have python 2.7.X installed

Thanks

Trim and merge OR merge and trim?

Hi Sam,

I'd like to know the reason that you trim assembled reads (and so you treat them as single-end sequences), instead of first trimming them as paired-end reads and then merge them.

I'd also like to thank you for the nice documentation.

Mona

problems running "test_of_master_script_tiny.bash"

Hi,
We had problems running "test_of_master_script_tiny.bash" under Ubuntu-MATE 18.04 because of the way BASH_SOURCE is used to get the directory where the script is running from and where the scripts are in the ../bash_scripts below it - The BASH_SOURCE strategy only works if an absolute path is used to invoke the test e.g.:

bash /home/username/samsa2/setup_and_test/test_of_master_script_tiny.bash

There are also a couple of bugs in "bash_scripts/lib/common.h":
#1 $SAMSA prefix is missing for the awk source file
#2 Command not found error because $@ is quoted inappropriately

diff --git a/bash_scripts/lib/common.sh b/bash_scripts/lib/common.sh
index b8cf7ed..141c0ca 100644
--- a/bash_scripts/lib/common.sh
+++ b/bash_scripts/lib/common.sh
@@ -42,7 +42,7 @@ export LOGFILE
 
 # get_config <varname>
 get_config() {
-  PROG="bash_scripts/lib/parse_config.awk"
+  PROG="$SAMSA/bash_scripts/lib/parse_config.awk"
   VAR_NAME="$1"
   DEF_VAL="undef"
   for arg in "$@"; do
@@ -97,7 +97,7 @@ fatal() {
 
 # log <cmd...>
 log() {
-  "$@" 2>&1 | tee -a "$LOGFILE"
+  $@ 2>&1 | tee -a "$LOGFILE"
   return ${PIPESTATUS[0]}
 }

Parsing of genus/species names

Hello,

I've been using the SAMSA2 pipeline and it works great for my application.

One thing I've noticed is that the genus/species names reported for Step 5 outputs are parsed using the final two space-separated names in the taxonomy. Most of the time this works well enough (e.g., the output is something like Bacillus subtilis, a proper genus and species pair).

But I seem to have quite a few cases where the output is something like "sp. Root239" or "sp. NRRL", the latter of which is particularly uninformative because NRRL is a type collection and so could really be pointing to anything.

I'm wondering if there is a way to modify the output of the script so that the user can get the full taxonomy? I see that the DIAMOND_general_RefSeq_analysis_counter.py python script deals with this function (around line 132 if I'm reading this correctly?). Maybe even having an option to add a column for taxid in the output here would be useful.

Thanks for any input you have on this!
Matt

How does the Refseq database supplied with samsa2 script differ to the one available online

I have run the pipeline with two refseq databases, one that is downloaded with the samsa2 scripts:
"https://bioshare.bioinformatics.ucdavis.edu/bioshare/download/2c8s521xj9907hn/RefSeq_bac.fa

and the other I downloaded from:
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/

I get different results when using the two databases and I noticed the one from ncbi is twice the size, so what is the difference between the two?

SAMSA2 wrapper for Galaxy

Hi,
Do you have a wrapper for samsa2 to install it in Galaxy?
Thanks
Mushal

'Bacteria' in the result of combined_graphs_unified.R

Hi Sam,

Applying combined_graphs_unified.R and SAMSA2 pipeline, I obtained a figure as the blow. My question is how to understand the 'Bacteria' in the Genus list since I think all others also belong to bacteria, right? Hope to receive some explanation to my confusion. Thanks.

Shicheng

trimming for adaptors

In the trimmomatic step of the pipeline (e.g.: Line 108 of https://github.com/transcript/samsa2/blob/master/bash_scripts/master_script.sh), there is no cleaning of adaptor contamination. I think this should be included by default as suggested by trimmomatic. http://www.usadellab.org/cms/?page=trimmomatic

Is there any reason not to include it? In addition, what is the reasoning for doing the trimming after the merging? Intuitively, I feel that it would make more sense to trim, then merge (i.e. you don't want to merge sequence based on adaptor contamination) ?

thanks,

minor bug in master_script_for_sample_files.bash

Hi,

There's a minor bug in this bash script (samsa2/sample_files_paired-end/master_script_for_sample_files.bash).

I think Line 123 should read "if [ -f $starting_files_location/step_2_output/raw_counts.txt ]" not, "if [ -f $starting_files_location/step_2_output/raw_counts.txt]" (missing space after .txt)

thanks,

sebastien

Checkpoint TRIMMO

Hi,

the first step of TRIMMING is repeated, probably because the checkpoint is not written.

the first checkpoint trimming is created at line 125 master_script.sh
printf "TRIMMO\n" >>pipeline/checkpoints

All other checkpoints are written into
printf "MERGING\n" >>$INPUT_DIR/checkpoints

Can you please explain shortly what the checkpoint is doing with the debug function ?

checked <cmd...>

checked() {
  if [[ -n "$DRY_RUN" ]]; then
    debug "DRY RUN: $@"
    logstr "$@"
  else
    debug "Running $@"
    log "$@"
    status=$?
    if [[ $status -ne 0 ]]; then
      echo "'$@' exited with non-zero status $status" >&2
      exit $status
    fi
    debug "Finished running $@"
  fi
}

Truncated names in RefSeq organism results

Hi Sam,

Just a small bug - I've noticed that for some of the organism names the beginning of the name is removed from the organism results.

You can see this in some of your sample files, for example: https://github.com/transcript/samsa2/blob/master/sample_files_paired-end/6_RefSeq_org_results/control_1_TINY.RefSeq_annot_organism.tsv

Line 13 should be 'Prevotella sp. HMSC073D09', line 17 should be 'Bacteroidales bacterium KA00344' and so on. I fortunately haven't run into too many of these in my own results so I can just grep the truncated part to get the full name manually from the database header.

This happens with all of the organism names containing "sp." but also with others, so I can't quite work out why it's grabbing from the middle of the [organism name] from the RefSeq header (is it expecting only two words?).

Cheers,
Rachael

make_DESeq_heatmap.R

It's not entirely clear how the clustering is done in this script to generate the heatmap, but the same (sample) clustering seems to applied to both the x and y axis. Therefore, the information in the heatmap is actually only one dimensional and redundant.

Instead, in gene expression studies, data is usually clustered by samples on one axis, while on the other, genes are clustered. Also, to avoid having thousands of genes (or here: organisms, functions or subsystems) on heatmap, usually only the expression of differentially expressed genes are used for clustering (see for example pheatmap(test) in ?pheatmap). Updates to script should reflect that I think. Thanks.

diversity_stats.R

Currently, L109-114 only make sense if you have 12 experimental and 12 control samples. It would make more sense to have a more general parsing such as:
divShannon_exp <- mean(diversity(flipped_complete_table[grep("exp",rownames(flipped_complete_table)),], index = "shannon"))

You just need to make sure your sample names contain "exp" and "control" in their names (note that I also had to modify the "GET FILE NAME" section, because currently it does not parse the information properly).

Finally, if you want this script to work with the Subsystems_results directory, you should modify Lines 66, 70,84,88 to add a "fill = T" argument (eg:

control_table <- read.table(file = x, header = F, quote = "", sep = "\t",fill = T)

This would take care of rows that do not have a hierarchy ("NO HIERARCHY"). This is probably true in other scripts too.

Including locusTags alongwith gene names

Hi,
In the DIAMOND_analysis_counter.py the gene products are extracted and a *function.tsv file is outputted. However, due to the large inconsistencies in naming, sometimes the gene names are truncated or missed as well as all hypothetical proteins clubbed as one. Is it possible that an option for getting counts for each locus tag can be introduced? This will also likely give an idea of which locus tag and co-localized genes are actively used for a given genome and also make downstream linking of outputs to custom databases using locus tags more flexible.
The output tsv for instance can be formatted to give the following fields:

|-----------------------------------------------------------------------|
| RelativeAbundance | RawCount | LocusTag | GeneName/Product            | 
|-----------------------------------------------------------------------|
| 42.1377616129     | 2877037  | XX_0201  | Dehydrogenase               |
|-----------------------------------------------------------------------|

The locus tag, for instance, can help in pathway enrichment analysis by linking to KEGG orthologs.

Best wishes,
Sudarshan
Disclaimer: Not a bioinformatician and pardon me if this is a trivial request.

DIAMOND_analysis_counter.py issue

To whom it may concern,

I have created myfile.daa , then I obtain a BLAST m 8 data table (through diamond view).
Then, I'd like to aggregate results considering the organism through:
$python2 DIAMOND_analysis_counter.py –I final_output –D uniprot_sprot.dmnd –O
I have tried, but then as output, I receive this:
WARNING: need to specify either organism results (with -O flag in command) or functional results (with -F flag in command)

What am I missing?

Thanks,
Marco

SAMS2 bash script for 16S rRNA pipeline

Dear Sam,

I think it will be great if you can provide a bash script for 16S rRNA pipeline. what we need to do is replace the 16S rRNA from SortMeRNA to the following steps, rather than reads of non-16S rRNA.

Thanks.

Shicheng

snakemake implementation of samsa2 pipeline

Hi,

i started working on a snakemake implementation of the samsa2 pipeline that can

use single ended and paired end sample
progressively add more samples to the same experiment without re-running unnecessary
computation
run tasks as cluster jobs in parallel

would you guys be interested in incorporating that into samsa2? It's not quite done yet but getting close.

Regards,
Wolfgang

Slow run time and issue with running run_DESeq_stats.R

Hi, I'm a new user of samsa2 and just so as to test, I used just two metatranscriptome libraries (approx. 16 Mio x 2 paired-end reads in each of the two libraries) but it it took 2 full days to reach until SUBSYS_AGGREG checkpoint on my High-performance PC (245G available total memory; 16TB available space and no other major computational jobs running at the time of running samsa2 pipeline). I wonder how I might be able to scale up the analysis speed when I'll need to input a much higher number of metatranscriptomic libraries? I think the DIAMOND step is taking the most time (understandably) but I'm curious to know if this rather slow speed of analysis is just encountered by only me and not you or the other users?

Also, I get the below error while run_DESeq_stats.R is called from the master_script_preserving_unmerged.sh.

[1] "USAGE: $ run_DESeq_stats.R -I working_directory/ -O save.filename"
Working directory is /data/samsa2/step_5_output/RefSeq_results/org_results
Error in colnames<-(*tmp*, value = "SRR6493775") :
attempt to set 'colnames' on an object with less than two dimensions
Calls: colnames<- -> colnames<-
Execution halted
'Rscript /data/samsa2/R_scripts/run_DESeq_stats.R -I /data/samsa2/step_5_output/RefSeq_results/org_results -O RefSeq_org_DESeq_results.tab -R /data/samsa2/step_2_output/raw_counts.txt' exited with non-zero status 1

This is how the first few lines of one of the files (control_SRR6493783.merged.RefSeq_annot_organism.tsv) at /data/samsa2/step_5_output/RefSeq_results/org_results look like:
14.8100107084 1206281 Bacteroides
14.0013343093 1140414 Bacteroides uniformis
6.71950210668 547306 Ruminococcus bromii
5.89554032774 480194 Bacteroides sp.
3.24016904525 263913 Faecalibacterium prausnitzii
3.06935339037 250000 Ruminococcus sp.
2.79296425627 227488 Bacteroides xylanisolvens
2.71861223975 221432 Eubacterium rectale
2.4190924585 197036 unclassified Bacteroides
2.41829442662 196971 Ruminococcus

This is the content of /data/samsa2/step_2_output/raw_counts.txt
SRR6493775.cleaned.forward 13993245
SRR6493783.cleaned.forward 16517196

Any help here would be much appreciated!

Normalize based on gene length?

From Pedro Torres, suggestion to normalize counts based on gene length, as longer genes would have more reads.

Probably is feasible by mapping the database and getting a geneLength number for each gene. This could be plugged in later, likely at the R step, and used to normalize based on gene lengths.

Assigning NAs as 1 might bias alpha diversity estimates?

Hi Sam,

I am looking at your script for estimating alpha diversity indices (e.g. 'diversity_stats.R') and at the merge step where 'complete_table' is generated. Following this the NAs are assigned as 1. Given how sensitive alpha diversities are to singletons, I believe this could bias the diversity estimates generated by the script. Is there a reason why the NAs were not assigned as 0? Or am I wrong in my assumptions?

Thanks,
Paul

How DIAMOND_analysis_counter.py works?

Dear Sam,

Can you provide more details about how DIAMOND_analysis_counter.py and RefSeq_reduce_to_genus.py work to aggregate taxa? I am not python expert. What I find are:

1, xxx.merged.RefSeq_annot_organism.tsv in step_5_output has both species and genus terms
2, Not sure whether terms suffix with sp. belongs to species, others belong to genus?
3, RefSeq_reduce_to_genus.py -I xxx.merged.RefSeq_annot_organism.tsv will generate reduced form, but some genus will be lost. for example, Bacteria is lost in the attached example.

Thanks. Shicheng

Shicheng

SEED Subsystems: level 4 category does not always share the same higher level categories

Hi Sam,

I've been exploring the SEED subsystems results and I've noticed that in the Subsystems_pie_charts.R script, the first file is read in as data_table, then all subsequent files are put into temp_table where only the counts and the level 4 name is kept before merging with data_table (lines 67-68):

temp_table <- temp_table[,c(2,3)]
data_table <- merge(temp_table, data_table, by = "Level4")

It would be no problem to use the level 3, 2 and 1 hierarchy names from the first file if each level 4 category always grouped into the same level 3, 2 and 1 hierarchies - however strangely, this isn't the case.

As an example, when I look in my results for the level 4 name DNA topoisomerase I (EC 5.99.1.2), some samples have a line like this in the .hierarchy.reduced file:

0.159525979945 28.0 DNA topoisomerase I (EC 5.99.1.2) Conserved_gene_cluster_associated_with_Met-tRNA_formyltransferase Clustering-based subsystems

and others have a line like this:

0.132864735128 31.0 DNA topoisomerase I (EC 5.99.1.2) DNA_topoisomerases,_Type_I,_ATP-independent DNA Metabolism DNA replication

Indeed when I look for DNA topoisomerase I (EC 5.99.1.2) in the subsys_db.fa database provided with SAMSA2, I can see that there's several entries and they do differ in their level 3, 2 and 1 hierarchy names despite an identical level 4 name.

I am uncertain how large an effect this may have on the downstream analysis (when comparing between two of my samples, 24 of the 655 level 4 terms in common between them had variable hierarchies). At the moment I can't think of a nice way to deal with this but wanted to bring it to your attention.

(Also, should line 68 include all = TRUE? Otherwise we are keeping only the level 4 functions that are present in every single sample)

Cheers,
Rachael

Read duplication

Hi all,
do you have experience with duplicated reads ? The annotation step with DIAMOND is extremely slow because of the high number of duplicated reads. Do you have ideas to deduplicate the dataset before annotation BLAST ?

Best, Michael

Showing error in diamond subsystem analysis

After running the pipeline in diamond subsystem analysis step i am getting some error.
Kindly help me to solve the error. The error message is listed below.

Traceback (most recent call last):
File "python_scripts/DIAMOND_subsystems_analysis_counter.py", line 119, in
for line in db:
File "/home/amit/anaconda3/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 3173: invalid start byte

Thank you.
Kind regards,
Amit

meaning behind "checked" in test script demo

Hi,
I'm trying to run the test program "test_of_master_script_tiny.bash" and I've had to edit a bunch already (fill in all the bash vars, remove weird quotes around things), but this one takes the cake.

What's the meaning behind "checked"? It's sprinkled throughout the code like a command, but it's not actually a program call. What is it supposed to have been?

run_DESeq_stats error

The script is complaining about how control_table isn't found. in line 84.
Is that batch of code (lines 84-86) supposed to be within the for-loop?

Step 5

Hi, thanks very much for this pipeline. However, we're having a problem at step 5. We seem to be able to create the annotated refseq file and the diamond binary file at the end of step 4, but step 5 seems to produce empty tsv files. Do you have any idea what's going wrong? We can send you a test file if necessary?
Many thanks in advance
Rachael

From @utpalh: code problems

@utpalh, I'm making this a new issue so it stays open. #9

I came upon the following error while processing for Level 2 subsystem and level 4 subsystem

[1] "USAGE: $ run_DESeq_stats.R -I working_directory/ -O save.filename -L level (1,2,3,4)"
Working directory is /home/utpal/Documents/samsa2/step_5_output/Subsystems_results/
Saving results as Subsystems_level-2_DESeq_results.tab
Calculating DESeq results for hierarchy level 2
Error in setnames(x, value) :
Can't assign 4 names to a 3 column data.table
Calls: colnames<- ... colnames<- -> names<- -> names<-.data.table -> setnames
Execution halted
[1] "USAGE: $ run_DESeq_stats.R -I working_directory/ -O save.filename -L level (1,2,3,4)"
Working directory is /home/utpal/Documents/samsa2/step_5_output/Subsystems_results/
Saving results as Subsystems_level-1_DESeq_results.tab
Calculating DESeq results for hierarchy level 4
Error in row.names<-.data.frame(tmp, value = value) :
duplicate 'row.names' are not allowed
Calls: rownames<- ... rownames<- -> row.names<- -> row.names<-.data.frame
In addition: Warning message:
non-unique value when setting 'row.names': ‘NO HIERARCHY’
Execution halted

Help with scripts

I'm new to running bash scripts so I was hoping to get some advice...

I downloaded samsa2 and have tried following the guidelines for installation using the scripts in the setup_and_test folder.

I have used gmod +x to give executable permission to the scripts and I try to run them using:

bash pathway_to_script.sh

When doing this I get the following error:

package_installation.bash: line 32: package_installation.bash/../bash_scripts/lib/common.sh: Not a directory

When looking at the script I see this line:

Set pathway for SAMSA to location of samsa2 GitHub download:

IGNORE_DEPS=1 source "${BASH_SOURCE%/*}/../bash_scripts/lib/common.sh"

What is the correct way to set this pathway? Do I need to delete the whole line as it is and add in the path to the samsa2 download or is there another way to do it?

Thanks for your help!

Estimate microbial diversity with rRNA

Dear Sam,

I remember, in metagenome research, we can use 16S rRNA to estimate metagenome diversity. I am thinking why it is not showed in the samsa2 pipeline since in the default mode, samsa2 remove all rRNAs.

Thanks.

Shicheng

The flag --num_alignments 0 in the ribosomal sortmrna step has been removed. This caused problems and slowed things down a lot. Plus, we don't care about the rRNA alignments - whether a sequence aligns to 1 or 1,000 rRNA, it's out anyways...

Docker/Singularity image?

First of all, thanks for this tool/pipeline: I've just started tinkering with it but it seems well designed. I was wondering if you are considering to build a Docker/Singularity image for making SAMSA2 installation even easier: I know it is a major effort but I'm convinced it could be very useful in spreading the software adoption (e.g., HPC environments). Thanks in advance.

Code problems

Greetings!

I was trying to use SAMSA2, and running the master_script.bash, it fails at line 89 because of what i think is a typo: the "##*/" in "out_name=echo ${out_path##*/}". I tried editing that out and further problems arose because of file names (changed .merged to merged. and *.cleaned to *_cleaned). This edits allowed me to get to the python scripts, however, after identifying the genes, it fails to find their identifiers in the databases for retrieving functional and taxonomic information.

I downloaded SAMSA2 from github, and installed everything with the supplied scripts from the tool. Am I using an older non-functional version of the tool?

Thank you for your attention

Errors in master_script_preserving_unmerged.sh and combining_unmerged.R

Hi, I am getting errors when trying to run the master script for including unmerged reads ("master_script_preserving_unmerged.sh"). The errors seem to stem from lines 159-161 of the script.

$R_programs/combining_umerged.R $STEP_1

printf "MERGING\n" >>pipeline/checkpoints

This gives the errors:

line 178: /combining_umerged.R: No such file or directory
insect_script.sh: line 180: pipeline/checkpoints: No such file or directory

I tried replacing the unfound variables $R_programs with $R_DIR and pipeline/checkpoints with $INPUT_DIR/checkpoints.

However then I get an error related to the R script "combining_unmerged.R"

/cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/intel2016.4/r/3.5.0/bin/Rscript: bad interpreter: No such file or directory

Any help with this would be appreciated. Thanks!

Running the included R scripts to generate figures

Sorry I'm still new to this field.

Can anyone give me some advice on how to run the provided r scripts to generate the different figures?l I have completed the full master script so have run the initial run_DESeq_stats.R but I'd like to generate some figures as well. I've been trying to work it out but I'm not sure what to use as input to the script (control_table, experimental_table etc).

Also I would like to look at which bacteria are the most active within the experimental samples relative to the control, as well as what pathways may be up/downregulated within particular genus of interest. Is that possible to do?

For example, based on the RefSeq org results Streptococcus spp seem to be significantly more active in the experimental samples, can I also look at what functions/pathways are active in the Streptococcus spp alone rather than the whole dataset?

Fail to install R 3.0 Ubuntu 16.04LTS

Hello

Before running the package_installation.bash, I tried to install R 3.0.2 on the Ubutu16.04LTS from the source code, but failed. Here is link of installing R3.0.2 https://stackoverflow.com/questions/39870825/installing-r-3-0-2-in-ubuntu-16-04
Our Linux was installed in VirtualBox on Windows10

I wanted to ask what Linux OS and R version did you use?

We also used the following command to install R3.5. but couldn't install the R packages

sudo apt install r-base-core

Please advise..
Thanks

database option in DIAMOND_analysis_counter.py

Dear Sir/Madam,

I want to know how to prepare this -D database for DIAMOND_analysis_counter.py script.

DIAMOND_analysis_counter.py

-D, database, specifies a reference database to search against for results

Thanks.

Shicheng

IndexElist index out of rangerror:

I am trying to work with the full release of RefSeq, so following instruction I have downloaded and concatenated all the files, and produced the DIAMOND index, using the gzipped file. In order to save disk space I have left the gzipped RefSeq full file in the database folder, but it ended up it's not a good move..In fact, when running the master-script, at DIAMOND_analysis_counter.py step, I get:

Starting database analysis now.
Traceback (most recent call last):
File "/home/stefano/scripts/samsa2/python_scripts/DIAMOND_analysis_counter.py", line 138, in
db_entry = db_entry[1][:-1]
IndexError: list index out of range
'python /home/stefano/scripts/samsa2/python_scripts/DIAMOND_analysis_counter.py -I /home/stefano/GoogleDrive/TEST-SAMSA2/step_4_output/MO1.R1.anqrpht.clean.fastq.merged.RefSeq_annotated -D /home/stefano/scripts/samsa2/full_databases/RefSeq_full.gz -O' exited with non-zero status 1

I suppose the error arise since the DIAMOND_analysis_counter.py does not check and process gzipped files: indeed, I uncompressed the file, and the script worked fine.

So maybe it could be useful do add the capability of handling gzipepd file to DIAMOND_analysis_counter.py (and other scripts, using that db)
Cheers,
Stefano

Error in DIAMOND_analysis_counter.py

Hi,
I have compiled a DIAMOND database from the current RefSeq database, but apparently the script DIAMOND_analysis_counter.py get stuck at one line.

Do you have any idea if I need to do any preprocessing of the database before starting DIAMOND_analysis_counter.py

Traceback (most recent call last):
  File "samsa2/python_scripts/DIAMOND_analysis_counter.py", line 151, in <module>
    if split_db_org[1] == "sp.":
IndexError: list index out of range

line 162, in <module>
    db_org = split_db_org[1] + " " + split_db_org[2]
IndexError: list index out of range

Best, Michael

make_combined_graphs.R

currently, Lines 30-33 read:
} else {
cat ("Working directory is ", opt$directory, "\n")
wd_location <- opt$directory
}

it is missing the setwd() step (as in other scripts) and should read:
} else {
cat ("Working directory is ", opt$directory, "\n")
wd_location <- opt$directory
setwd(wd_location)
}

Update make_combined_graphs.R script to also work with Subsystems input

Currently, the make_combined_graphs.R is designed only to work with RefSeq input.

This can be reworked to also take Subsystems results as input, but there will need to be additional functionality added to flatten to the chosen hierarchy level. Add it to my to-do list.

Different database?

Hello,
We want to use a different database (NR) for the pipeline but we've encountered a few issues. The pipeline doesn't progress from step 4 to step 5 when we change the database. Also we need to produce a custom NR database that only contains bacterial and fungal hits - do you know if this is possible?
Many thanks

High number of hypothetical proteins

I've been interested in looking at genes associated with a particular genus (Streptococcus), to do this I have run the DIAMOND_specific_organism_retriever.py script followed by DIAMOND_analysis_counter.py. When pulling out reads using the first script I am able to get roughly 10k to 100k reads depending on sample. However when running the analysis counter roughly 70-90% of the reads are annotated as "hypothetical proteins".

Is that to be expected and is there any way to improve on this? I find the results quite surprising considering how well studied Strep is.

How to deal with the same proteins with slightly different names from the RefSeq_bac database?

Hi Sam,

In my statistical analysis, some of the functions from the RefSeq_bac database are being categorized as different proteins only because of a small difference in their names like a dash (e.g. "(3R)-hydroxymyristoyl ACP dehydratase" "(3R)-hydroxymyristoyl-ACP dehydratase"), a comma, or lower/uppercase letters (e.g. "(2fe-2S)-binding domain-containing protein" and "(2Fe-2S)-binding domain-containing protein").
Also, some others are partial or complete sequences of the same protein (e.g. "(2Fe-2S) ferredoxin" and "(2Fe-2S) ferredoxin, partial").

I wanted to know if you correct those names in the database or after annotation-aggregation. And if yes, would you please guide me on how to do it?

-Mona

Error with DIAMOND Database

The master script stops after the ribo-depletion step with error

Error: Database file is not a DIAMOND database.

'diamond blastx --db samsa2/full_databases/RefSeq_bac -q samsa2/step_3_output/test.merged.ribodepleted.fastq -a test.merged.ribodepleted.fastq.RefSeq -t ./ -k 1' exited with non-zero status 1

Do I have to convert the RefSeq_bac to DIAMOND format ?

Thanks for you help!

install errors

Hi,
I'm trying to install your pipeline, and I'm running into an error with the package_installation.bash script.
It's having issues with line 32.

package_installation.bash: line 32: package_installation.bash/../bash_scripts/lib/common.sh: Not a directory

any way to get around this?

transcript / samsa2 Goto Github PK

samsa2's Introduction

SAMSA2 - A complete metatranscriptome analysis pipeline

Version 2 of the SAMSA pipeline - faster! Lighter! More options! Less waiting!

How to cite:

New in version 2:

Dependencies

Quick start

SAMSA: Simple Analysis of Metatranscriptomes by Sequence Annotation

Individual programs in SAMSA2 and their functions

Need assistance?

Citations of other tools used:

samsa2's People

Contributors

Stargazers

Watchers

Forkers

samsa2's Issues

checked <cmd...>

Set pathway for SAMSA to location of samsa2 GitHub download:

Recommend Projects

Recommend Topics

Recommend Org