Giter Site home page Giter Site logo

transcript / samsa2 Goto Github PK

View Code? Open in Web Editor NEW
53.0 53.0 36.0 55.02 MB

SAMSA pipeline, version 2.0. An open-source metatranscriptomics pipeline for analyzing microbiome data, built around DIAMOND and customizable reference databases.

License: GNU General Public License v3.0

R 38.33% Python 30.81% Shell 30.86%

samsa2's People

Contributors

benbechade avatar eugenekim76 avatar jjkoehorst avatar kaedenn avatar lisakmalins avatar mltreiber avatar pjtorres avatar seb951 avatar transcript avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

samsa2's Issues

install errors

Hi,
I'm trying to install your pipeline, and I'm running into an error with the package_installation.bash script.
It's having issues with line 32.

package_installation.bash: line 32: package_installation.bash/../bash_scripts/lib/common.sh: Not a directory

any way to get around this?

Checkpoint TRIMMO

Hi,

the first step of TRIMMING is repeated, probably because the checkpoint is not written.

the first checkpoint trimming is created at line 125 master_script.sh
printf "TRIMMO\n" >>pipeline/checkpoints

All other checkpoints are written into
printf "MERGING\n" >>$INPUT_DIR/checkpoints

Can you please explain shortly what the checkpoint is doing with the debug function ?

checked <cmd...>

checked() {
  if [[ -n "$DRY_RUN" ]]; then
    debug "DRY RUN: $@"
    logstr "$@"
  else
    debug "Running $@"
    log "$@"
    status=$?
    if [[ $status -ne 0 ]]; then
      echo "'$@' exited with non-zero status $status" >&2
      exit $status
    fi
    debug "Finished running $@"
  fi
}

How does the Refseq database supplied with samsa2 script differ to the one available online

I have run the pipeline with two refseq databases, one that is downloaded with the samsa2 scripts:
"https://bioshare.bioinformatics.ucdavis.edu/bioshare/download/2c8s521xj9907hn/RefSeq_bac.fa

and the other I downloaded from:
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/

I get different results when using the two databases and I noticed the one from ncbi is twice the size, so what is the difference between the two?

Truncated names in RefSeq organism results

Hi Sam,

Just a small bug - I've noticed that for some of the organism names the beginning of the name is removed from the organism results.

You can see this in some of your sample files, for example: https://github.com/transcript/samsa2/blob/master/sample_files_paired-end/6_RefSeq_org_results/control_1_TINY.RefSeq_annot_organism.tsv

Line 13 should be 'Prevotella sp. HMSC073D09', line 17 should be 'Bacteroidales bacterium KA00344' and so on. I fortunately haven't run into too many of these in my own results so I can just grep the truncated part to get the full name manually from the database header.

This happens with all of the organism names containing "sp." but also with others, so I can't quite work out why it's grabbing from the middle of the [organism name] from the RefSeq header (is it expecting only two words?).

Cheers,
Rachael

diversity_stats.R

Currently, L109-114 only make sense if you have 12 experimental and 12 control samples. It would make more sense to have a more general parsing such as:
divShannon_exp <- mean(diversity(flipped_complete_table[grep("exp",rownames(flipped_complete_table)),], index = "shannon"))

You just need to make sure your sample names contain "exp" and "control" in their names (note that I also had to modify the "GET FILE NAME" section, because currently it does not parse the information properly).

Finally, if you want this script to work with the Subsystems_results directory, you should modify Lines 66, 70,84,88 to add a "fill = T" argument (eg:

control_table <- read.table(file = x, header = F, quote = "", sep = "\t",fill = T)

This would take care of rows that do not have a hierarchy ("NO HIERARCHY"). This is probably true in other scripts too.

problems running "test_of_master_script_tiny.bash"

Hi,
We had problems running "test_of_master_script_tiny.bash" under Ubuntu-MATE 18.04 because of the way BASH_SOURCE is used to get the directory where the script is running from and where the scripts are in the ../bash_scripts below it - The BASH_SOURCE strategy only works if an absolute path is used to invoke the test e.g.:

bash /home/username/samsa2/setup_and_test/test_of_master_script_tiny.bash

There are also a couple of bugs in "bash_scripts/lib/common.h":
#1 $SAMSA prefix is missing for the awk source file
#2 Command not found error because $@ is quoted inappropriately

diff --git a/bash_scripts/lib/common.sh b/bash_scripts/lib/common.sh
index b8cf7ed..141c0ca 100644
--- a/bash_scripts/lib/common.sh
+++ b/bash_scripts/lib/common.sh
@@ -42,7 +42,7 @@ export LOGFILE
 
 # get_config <varname>
 get_config() {
-  PROG="bash_scripts/lib/parse_config.awk"
+  PROG="$SAMSA/bash_scripts/lib/parse_config.awk"
   VAR_NAME="$1"
   DEF_VAL="undef"
   for arg in "$@"; do
@@ -97,7 +97,7 @@ fatal() {
 
 # log <cmd...>
 log() {
-  "$@" 2>&1 | tee -a "$LOGFILE"
+  $@ 2>&1 | tee -a "$LOGFILE"
   return ${PIPESTATUS[0]}
 }

Code problems

Greetings!

I was trying to use SAMSA2, and running the master_script.bash, it fails at line 89 because of what i think is a typo: the "##*/" in "out_name=echo ${out_path##*/}". I tried editing that out and further problems arose because of file names (changed .merged to merged. and *.cleaned to *_cleaned). This edits allowed me to get to the python scripts, however, after identifying the genes, it fails to find their identifiers in the databases for retrieving functional and taxonomic information.

I downloaded SAMSA2 from github, and installed everything with the supplied scripts from the tool. Am I using an older non-functional version of the tool?

Thank you for your attention

database option in DIAMOND_analysis_counter.py

Dear Sir/Madam,

I want to know how to prepare this -D database for DIAMOND_analysis_counter.py script.

DIAMOND_analysis_counter.py

-D, database, specifies a reference database to search against for results

Thanks.

Shicheng

Running the included R scripts to generate figures

Sorry I'm still new to this field.

Can anyone give me some advice on how to run the provided r scripts to generate the different figures?l I have completed the full master script so have run the initial run_DESeq_stats.R but I'd like to generate some figures as well. I've been trying to work it out but I'm not sure what to use as input to the script (control_table, experimental_table etc).

Also I would like to look at which bacteria are the most active within the experimental samples relative to the control, as well as what pathways may be up/downregulated within particular genus of interest. Is that possible to do?

For example, based on the RefSeq org results Streptococcus spp seem to be significantly more active in the experimental samples, can I also look at what functions/pathways are active in the Streptococcus spp alone rather than the whole dataset?

Errors in master_script_preserving_unmerged.sh and combining_unmerged.R

Hi, I am getting errors when trying to run the master script for including unmerged reads ("master_script_preserving_unmerged.sh"). The errors seem to stem from lines 159-161 of the script.

$R_programs/combining_umerged.R $STEP_1

printf "MERGING\n" >>pipeline/checkpoints

This gives the errors:

line 178: /combining_umerged.R: No such file or directory
insect_script.sh: line 180: pipeline/checkpoints: No such file or directory

I tried replacing the unfound variables $R_programs with $R_DIR and pipeline/checkpoints with $INPUT_DIR/checkpoints.

However then I get an error related to the R script "combining_unmerged.R"

/cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/intel2016.4/r/3.5.0/bin/Rscript: bad interpreter: No such file or directory

Any help with this would be appreciated. Thanks!

make_combined_graphs.R

currently, Lines 30-33 read:
} else {
cat ("Working directory is ", opt$directory, "\n")
wd_location <- opt$directory
}

it is missing the setwd() step (as in other scripts) and should read:
} else {
cat ("Working directory is ", opt$directory, "\n")
wd_location <- opt$directory
setwd(wd_location)
}

Estimate microbial diversity with rRNA

Dear Sam,

I remember, in metagenome research, we can use 16S rRNA to estimate metagenome diversity. I am thinking why it is not showed in the samsa2 pipeline since in the default mode, samsa2 remove all rRNAs.

Thanks.

Shicheng

  • The flag --num_alignments 0 in the ribosomal sortmrna step has been removed. This caused problems and slowed things down a lot. Plus, we don't care about the rRNA alignments - whether a sequence aligns to 1 or 1,000 rRNA, it's out anyways...

Assigning NAs as 1 might bias alpha diversity estimates?

Hi Sam,

I am looking at your script for estimating alpha diversity indices (e.g. 'diversity_stats.R') and at the merge step where 'complete_table' is generated. Following this the NAs are assigned as 1. Given how sensitive alpha diversities are to singletons, I believe this could bias the diversity estimates generated by the script. Is there a reason why the NAs were not assigned as 0? Or am I wrong in my assumptions?

Thanks,
Paul

Combining the concatenated (unassembled) forward and reverse reads with the assembled reads

Hi Sam,

I've done a paired-end sequencing and managed to merge ~ 70% of the forward and reverse sequences using pear. I have then used the combining_unmerged.R on the unassembled forward and reverse sequences to concatenate them. I was wondering if you combine the results of concatenated forward and reverse reads at any step with the rest of the assembled reads, and if yes in which script I can find that information. Thanks in advance!

  • Mona

Trim and merge OR merge and trim?

Hi Sam,

I'd like to know the reason that you trim assembled reads (and so you treat them as single-end sequences), instead of first trimming them as paired-end reads and then merge them.

I'd also like to thank you for the nice documentation.

Mona

'Bacteria' in the result of combined_graphs_unified.R

Hi Sam,

Applying combined_graphs_unified.R and SAMSA2 pipeline, I obtained a figure as the blow. My question is how to understand the 'Bacteria' in the Genus list since I think all others also belong to bacteria, right? Hope to receive some explanation to my confusion. Thanks.

Shicheng

SAMSA2-2020-Shicheng-Guo-Strain-Skin

Including locusTags alongwith gene names

Hi,
In the DIAMOND_analysis_counter.py the gene products are extracted and a *function.tsv file is outputted. However, due to the large inconsistencies in naming, sometimes the gene names are truncated or missed as well as all hypothetical proteins clubbed as one. Is it possible that an option for getting counts for each locus tag can be introduced? This will also likely give an idea of which locus tag and co-localized genes are actively used for a given genome and also make downstream linking of outputs to custom databases using locus tags more flexible.
The output tsv for instance can be formatted to give the following fields:

|-----------------------------------------------------------------------|
| RelativeAbundance | RawCount | LocusTag | GeneName/Product            | 
|-----------------------------------------------------------------------|
| 42.1377616129     | 2877037  | XX_0201  | Dehydrogenase               |
|-----------------------------------------------------------------------|

The locus tag, for instance, can help in pathway enrichment analysis by linking to KEGG orthologs.

Best wishes,
Sudarshan
Disclaimer: Not a bioinformatician and pardon me if this is a trivial request.

From @utpalh: code problems

@utpalh, I'm making this a new issue so it stays open. #9

I came upon the following error while processing for Level 2 subsystem and level 4 subsystem

[1] "USAGE: $ run_DESeq_stats.R -I working_directory/ -O save.filename -L level (1,2,3,4)"
Working directory is /home/utpal/Documents/samsa2/step_5_output/Subsystems_results/
Saving results as Subsystems_level-2_DESeq_results.tab
Calculating DESeq results for hierarchy level 2
Error in setnames(x, value) :
Can't assign 4 names to a 3 column data.table
Calls: colnames<- ... colnames<- -> names<- -> names<-.data.table -> setnames
Execution halted
[1] "USAGE: $ run_DESeq_stats.R -I working_directory/ -O save.filename -L level (1,2,3,4)"
Working directory is /home/utpal/Documents/samsa2/step_5_output/Subsystems_results/
Saving results as Subsystems_level-1_DESeq_results.tab
Calculating DESeq results for hierarchy level 4
Error in row.names<-.data.frame(tmp, value = value) :
duplicate 'row.names' are not allowed
Calls: rownames<- ... rownames<- -> row.names<- -> row.names<-.data.frame
In addition: Warning message:
non-unique value when setting 'row.names': ‘NO HIERARCHY’
Execution halted

Fail to install R 3.0 Ubuntu 16.04LTS

Hello

Before running the package_installation.bash, I tried to install R 3.0.2 on the Ubutu16.04LTS from the source code, but failed. Here is link of installing R3.0.2 https://stackoverflow.com/questions/39870825/installing-r-3-0-2-in-ubuntu-16-04
Our Linux was installed in VirtualBox on Windows10

I wanted to ask what Linux OS and R version did you use?

We also used the following command to install R3.5. but couldn't install the R packages

sudo apt install r-base-core

Please advise..
Thanks

Read duplication

Hi all,
do you have experience with duplicated reads ? The annotation step with DIAMOND is extremely slow because of the high number of duplicated reads. Do you have ideas to deduplicate the dataset before annotation BLAST ?

Best, Michael

Showing error in diamond subsystem analysis

After running the pipeline in diamond subsystem analysis step i am getting some error.
Kindly help me to solve the error. The error message is listed below.

Traceback (most recent call last):
File "python_scripts/DIAMOND_subsystems_analysis_counter.py", line 119, in
for line in db:
File "/home/amit/anaconda3/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 3173: invalid start byte

Thank you.
Kind regards,
Amit

Bad operand protections - DIAMOND_specific_organism_retriever.py can be overwritten when calling it

Recently I've been helping some researchers with their Samsa2 troubles. I've noticed a few students using operands inappropriately leading to some headaches. I know it's hard to account for this stuff but it might be nice to actively help prevent them!

In the file DIAMOND_specific_organism_retriever.py you might consider adding the following check after the -O, sys.argv parsing. I have seen a few students make this mistake.

if target_org_outfile == os.path.basename(__file__): sys.exit("Cannot write to the file we are specifying: target_org_outfile: {}".format(target_org_outfile))

SAMS2 bash script for 16S rRNA pipeline

Dear Sam,

I think it will be great if you can provide a bash script for 16S rRNA pipeline. what we need to do is replace the 16S rRNA from SortMeRNA to the following steps, rather than reads of non-16S rRNA.

Thanks.

Shicheng

run_DESeq_stats error

The script is complaining about how control_table isn't found. in line 84.
Is that batch of code (lines 84-86) supposed to be within the for-loop?

Normalize based on gene length?

From Pedro Torres, suggestion to normalize counts based on gene length, as longer genes would have more reads.

Probably is feasible by mapping the database and getting a geneLength number for each gene. This could be plugged in later, likely at the R step, and used to normalize based on gene lengths.

make_DESeq_PCA.R

In make_DESeq_PCA.R, line 132 should read:
pdf(file = paste(save_filename,".pdf",sep = ""), width=10, height=7)

Currently, it reads:
pdf(file = save_filename, sep = "", width=10, height=7)

Different database?

Hello,
We want to use a different database (NR) for the pipeline but we've encountered a few issues. The pipeline doesn't progress from step 4 to step 5 when we change the database. Also we need to produce a custom NR database that only contains bacterial and fungal hits - do you know if this is possible?
Many thanks

Help with scripts

I'm new to running bash scripts so I was hoping to get some advice...

I downloaded samsa2 and have tried following the guidelines for installation using the scripts in the setup_and_test folder.

I have used gmod +x to give executable permission to the scripts and I try to run them using:

bash pathway_to_script.sh

When doing this I get the following error:

package_installation.bash: line 32: package_installation.bash/../bash_scripts/lib/common.sh: Not a directory

When looking at the script I see this line:

Set pathway for SAMSA to location of samsa2 GitHub download:

IGNORE_DEPS=1 source "${BASH_SOURCE%/*}/../bash_scripts/lib/common.sh"

What is the correct way to set this pathway? Do I need to delete the whole line as it is and add in the path to the samsa2 download or is there another way to do it?

Thanks for your help!

IndexElist index out of rangerror:

I am trying to work with the full release of RefSeq, so following instruction I have downloaded and concatenated all the files, and produced the DIAMOND index, using the gzipped file. In order to save disk space I have left the gzipped RefSeq full file in the database folder, but it ended up it's not a good move..In fact, when running the master-script, at DIAMOND_analysis_counter.py step, I get:

Starting database analysis now.
Traceback (most recent call last):
File "/home/stefano/scripts/samsa2/python_scripts/DIAMOND_analysis_counter.py", line 138, in
db_entry = db_entry[1][:-1]
IndexError: list index out of range
'python /home/stefano/scripts/samsa2/python_scripts/DIAMOND_analysis_counter.py -I /home/stefano/GoogleDrive/TEST-SAMSA2/step_4_output/MO1.R1.anqrpht.clean.fastq.merged.RefSeq_annotated -D /home/stefano/scripts/samsa2/full_databases/RefSeq_full.gz -O' exited with non-zero status 1

I suppose the error arise since the DIAMOND_analysis_counter.py does not check and process gzipped files: indeed, I uncompressed the file, and the script worked fine.

So maybe it could be useful do add the capability of handling gzipepd file to DIAMOND_analysis_counter.py (and other scripts, using that db)
Cheers,
Stefano

Error with DIAMOND Database

The master script stops after the ribo-depletion step with error

Error: Database file is not a DIAMOND database.

'diamond blastx --db samsa2/full_databases/RefSeq_bac -q samsa2/step_3_output/test.merged.ribodepleted.fastq -a test.merged.ribodepleted.fastq.RefSeq -t ./ -k 1' exited with non-zero status 1

Do I have to convert the RefSeq_bac to DIAMOND format ?

Thanks for you help!

Where is the File 2 (R2) or reverse file in the script?

I’ve been looking at the master_script.sh of SAMSA2 and I’ve noticed that the line to identify the reverse file or the R2 file is not really clear. Instead it shows an echo for the file 1 (R1) and ‘awk’ command that will out-put a file with a printed name of R2.

Considering that two independent files exist, for reverse and forward sequences and SAMSA2 can call out one of the files, why does it not call file 2 (R2)?


STEP 1: MERGING OF PAIRED-END FILES USING PEAR
Note: paired-end files are usually named using R1 and R2 in the name.
Example: control_1.R1.fastq
control_1.R2.fastq

Note: if using single-end sequencing, skip this step (comment out).
Note: if performing R analysis (step 6), be sure to name files with
the appropriate prefix ("control_$file" and "experimental_$file")!

cd $starting_files_location
for file in $starting_files_location/*.gz
do
gunzip $file
done

for file in $starting_files_location/_R1
do
file1=$file
file2=echo $file1 | awk -F "R1" '{print $1 "R2" $2}'
out_path=echo $file | awk -F "_R1" '{print $1 ".merged"}'
out_name=echo ${out_path##*/}

$pear_location/pear -f $file1 -r $file2 -o $out_name

done

mkdir $starting_location/step_1_output/
mv $starting_files_location/merged $starting_location/step_1_output/
echo -e "\nPaired-end merging step completed.\n"

####################################################################

I would like to know what exactly is the command doing for file 2 (R2) ?

Thank you for your time,
Camille

Step 5

Hi, thanks very much for this pipeline. However, we're having a problem at step 5. We seem to be able to create the annotated refseq file and the diamond binary file at the end of step 4, but step 5 seems to produce empty tsv files. Do you have any idea what's going wrong? We can send you a test file if necessary?
Many thanks in advance
Rachael

Error in DIAMOND_analysis_counter.py

Hi,
I have compiled a DIAMOND database from the current RefSeq database, but apparently the script DIAMOND_analysis_counter.py get stuck at one line.

Do you have any idea if I need to do any preprocessing of the database before starting DIAMOND_analysis_counter.py

Traceback (most recent call last):
  File "samsa2/python_scripts/DIAMOND_analysis_counter.py", line 151, in <module>
    if split_db_org[1] == "sp.":
IndexError: list index out of range

line 162, in <module>
    db_org = split_db_org[1] + " " + split_db_org[2]
IndexError: list index out of range

Best, Michael

SEED Subsystems: level 4 category does not always share the same higher level categories

Hi Sam,

I've been exploring the SEED subsystems results and I've noticed that in the Subsystems_pie_charts.R script, the first file is read in as data_table, then all subsequent files are put into temp_table where only the counts and the level 4 name is kept before merging with data_table (lines 67-68):

temp_table <- temp_table[,c(2,3)]
data_table <- merge(temp_table, data_table, by = "Level4")

It would be no problem to use the level 3, 2 and 1 hierarchy names from the first file if each level 4 category always grouped into the same level 3, 2 and 1 hierarchies - however strangely, this isn't the case.

As an example, when I look in my results for the level 4 name DNA topoisomerase I (EC 5.99.1.2), some samples have a line like this in the .hierarchy.reduced file:

0.159525979945 28.0 DNA topoisomerase I (EC 5.99.1.2) Conserved_gene_cluster_associated_with_Met-tRNA_formyltransferase Clustering-based subsystems

and others have a line like this:

0.132864735128 31.0 DNA topoisomerase I (EC 5.99.1.2) DNA_topoisomerases,_Type_I,_ATP-independent DNA Metabolism DNA replication

Indeed when I look for DNA topoisomerase I (EC 5.99.1.2) in the subsys_db.fa database provided with SAMSA2, I can see that there's several entries and they do differ in their level 3, 2 and 1 hierarchy names despite an identical level 4 name.

I am uncertain how large an effect this may have on the downstream analysis (when comparing between two of my samples, 24 of the 655 level 4 terms in common between them had variable hierarchies). At the moment I can't think of a nice way to deal with this but wanted to bring it to your attention.

(Also, should line 68 include all = TRUE? Otherwise we are keeping only the level 4 functions that are present in every single sample)

Cheers,
Rachael

Docker/Singularity image?

First of all, thanks for this tool/pipeline: I've just started tinkering with it but it seems well designed. I was wondering if you are considering to build a Docker/Singularity image for making SAMSA2 installation even easier: I know it is a major effort but I'm convinced it could be very useful in spreading the software adoption (e.g., HPC environments). Thanks in advance.

minor bug in master_script_for_sample_files.bash

Hi,

There's a minor bug in this bash script (samsa2/sample_files_paired-end/master_script_for_sample_files.bash).

I think Line 123 should read "if [ -f $starting_files_location/step_2_output/raw_counts.txt ]" not, "if [ -f $starting_files_location/step_2_output/raw_counts.txt]" (missing space after .txt)

thanks,

sebastien

make_DESeq_heatmap.R

It's not entirely clear how the clustering is done in this script to generate the heatmap, but the same (sample) clustering seems to applied to both the x and y axis. Therefore, the information in the heatmap is actually only one dimensional and redundant.

Instead, in gene expression studies, data is usually clustered by samples on one axis, while on the other, genes are clustered. Also, to avoid having thousands of genes (or here: organisms, functions or subsystems) on heatmap, usually only the expression of differentially expressed genes are used for clustering (see for example pheatmap(test) in ?pheatmap). Updates to script should reflect that I think. Thanks.

Parsing of genus/species names

Hello,

I've been using the SAMSA2 pipeline and it works great for my application.

One thing I've noticed is that the genus/species names reported for Step 5 outputs are parsed using the final two space-separated names in the taxonomy. Most of the time this works well enough (e.g., the output is something like Bacillus subtilis, a proper genus and species pair).

But I seem to have quite a few cases where the output is something like "sp. Root239" or "sp. NRRL", the latter of which is particularly uninformative because NRRL is a type collection and so could really be pointing to anything.

I'm wondering if there is a way to modify the output of the script so that the user can get the full taxonomy? I see that the DIAMOND_general_RefSeq_analysis_counter.py python script deals with this function (around line 132 if I'm reading this correctly?). Maybe even having an option to add a column for taxid in the output here would be useful.

Thanks for any input you have on this!
Matt

trimming for adaptors

In the trimmomatic step of the pipeline (e.g.: Line 108 of https://github.com/transcript/samsa2/blob/master/bash_scripts/master_script.sh), there is no cleaning of adaptor contamination. I think this should be included by default as suggested by trimmomatic. http://www.usadellab.org/cms/?page=trimmomatic

Is there any reason not to include it? In addition, what is the reasoning for doing the trimming after the merging? Intuitively, I feel that it would make more sense to trim, then merge (i.e. you don't want to merge sequence based on adaptor contamination) ?

thanks,

DIAMOND_analysis_counter.py issue

To whom it may concern,

I have created myfile.daa , then I obtain a BLAST m 8 data table (through diamond view).
Then, I'd like to aggregate results considering the organism through:
$python2 DIAMOND_analysis_counter.py –I final_output –D uniprot_sprot.dmnd –O
I have tried, but then as output, I receive this:
WARNING: need to specify either organism results (with -O flag in command) or functional results (with -F flag in command)

What am I missing?

Thanks,
Marco

meaning behind "checked" in test script demo

Hi,
I'm trying to run the test program "test_of_master_script_tiny.bash" and I've had to edit a bunch already (fill in all the bash vars, remove weird quotes around things), but this one takes the cake.

What's the meaning behind "checked"? It's sprinkled throughout the code like a command, but it's not actually a program call. What is it supposed to have been?

How to deal with the same proteins with slightly different names from the RefSeq_bac database?

Hi Sam,

In my statistical analysis, some of the functions from the RefSeq_bac database are being categorized as different proteins only because of a small difference in their names like a dash (e.g. "(3R)-hydroxymyristoyl ACP dehydratase" "(3R)-hydroxymyristoyl-ACP dehydratase"), a comma, or lower/uppercase letters (e.g. "(2fe-2S)-binding domain-containing protein" and "(2Fe-2S)-binding domain-containing protein").
Also, some others are partial or complete sequences of the same protein (e.g. "(2Fe-2S) ferredoxin" and "(2Fe-2S) ferredoxin, partial").

I wanted to know if you correct those names in the database or after annotation-aggregation. And if yes, would you please guide me on how to do it?

-Mona

High number of hypothetical proteins

I've been interested in looking at genes associated with a particular genus (Streptococcus), to do this I have run the DIAMOND_specific_organism_retriever.py script followed by DIAMOND_analysis_counter.py. When pulling out reads using the first script I am able to get roughly 10k to 100k reads depending on sample. However when running the analysis counter roughly 70-90% of the reads are annotated as "hypothetical proteins".

Is that to be expected and is there any way to improve on this? I find the results quite surprising considering how well studied Strep is.

snakemake implementation of samsa2 pipeline

Hi,

i started working on a snakemake implementation of the samsa2 pipeline that can

  • use single ended and paired end sample
  • progressively add more samples to the same experiment without re-running unnecessary
    computation
  • run tasks as cluster jobs in parallel

would you guys be interested in incorporating that into samsa2? It's not quite done yet but getting close.

Regards,
Wolfgang

Slow run time and issue with running run_DESeq_stats.R

Hi, I'm a new user of samsa2 and just so as to test, I used just two metatranscriptome libraries (approx. 16 Mio x 2 paired-end reads in each of the two libraries) but it it took 2 full days to reach until SUBSYS_AGGREG checkpoint on my High-performance PC (245G available total memory; 16TB available space and no other major computational jobs running at the time of running samsa2 pipeline). I wonder how I might be able to scale up the analysis speed when I'll need to input a much higher number of metatranscriptomic libraries? I think the DIAMOND step is taking the most time (understandably) but I'm curious to know if this rather slow speed of analysis is just encountered by only me and not you or the other users?

Also, I get the below error while run_DESeq_stats.R is called from the master_script_preserving_unmerged.sh.

[1] "USAGE: $ run_DESeq_stats.R -I working_directory/ -O save.filename"
Working directory is /data/samsa2/step_5_output/RefSeq_results/org_results
Error in colnames<-(*tmp*, value = "SRR6493775") :
attempt to set 'colnames' on an object with less than two dimensions
Calls: colnames<- -> colnames<-
Execution halted
'Rscript /data/samsa2/R_scripts/run_DESeq_stats.R -I /data/samsa2/step_5_output/RefSeq_results/org_results -O RefSeq_org_DESeq_results.tab -R /data/samsa2/step_2_output/raw_counts.txt' exited with non-zero status 1

This is how the first few lines of one of the files (control_SRR6493783.merged.RefSeq_annot_organism.tsv) at /data/samsa2/step_5_output/RefSeq_results/org_results look like:
14.8100107084 1206281 Bacteroides
14.0013343093 1140414 Bacteroides uniformis
6.71950210668 547306 Ruminococcus bromii
5.89554032774 480194 Bacteroides sp.
3.24016904525 263913 Faecalibacterium prausnitzii
3.06935339037 250000 Ruminococcus sp.
2.79296425627 227488 Bacteroides xylanisolvens
2.71861223975 221432 Eubacterium rectale
2.4190924585 197036 unclassified Bacteroides
2.41829442662 196971 Ruminococcus

This is the content of /data/samsa2/step_2_output/raw_counts.txt
SRR6493775.cleaned.forward 13993245
SRR6493783.cleaned.forward 16517196

Any help here would be much appreciated!

How DIAMOND_analysis_counter.py works?

Dear Sam,

Can you provide more details about how DIAMOND_analysis_counter.py and RefSeq_reduce_to_genus.py work to aggregate taxa? I am not python expert. What I find are:

1, xxx.merged.RefSeq_annot_organism.tsv in step_5_output has both species and genus terms
2, Not sure whether terms suffix with sp. belongs to species, others belong to genus?
3, RefSeq_reduce_to_genus.py -I xxx.merged.RefSeq_annot_organism.tsv will generate reduced form, but some genus will be lost. for example, Bacteria is lost in the attached example.

Thanks. Shicheng

SP-Reduceto-Genus-SAMSA2-2020-Shicheng-Guo-Strain-Skin

Shicheng

error with DIAMOND_subsystems_analysis_counter.py

Hi,

I get an error with DIAMOND_subsystems_analysis_counter.py when i run the test_of_master_script_tiny.bash

Traceback (most recent call last):
File "path_to_folder/samsa2/python_scripts/DIAMOND_subsystems_analysis_counter.py", line 105, in
db = open (db_name, "r", encoding='utf-8', errors='ignore')
TypeError: file() takes at most 3 arguments (4 given)

I have python 2.7.X installed

Thanks

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.