SARS2seq is a pipeline designed to process raw FastQ data from targeted SARS-CoV-2 sequencing and generate biologically correct consensus sequences of the SARS-CoV-2 genome.
Home Page: https://rivm-bioinformatics.github.io/SARS2seq/
License: GNU Affero General Public License v3.0
Python 97.83%
Shell 2.17%
sars2seq's Issues
As reported by @landmanf
The current scripts for retrieving the unique amplicon coverages do not properly handle the starting and stopping positions of said amplicons.
Currently, the ending position of amplicon 1 is immediately the starting position of amplicon 2. However, there should be a "void" space in between those two amplicons as this is the space where both amplicons overlap and therefore this area shouldn't be counted for either amplicon.
The issue can probably be found in the snippet below:
def Find_NonOverlap(df):
    dd = df.to_dict(orient="records")
    startingpoint = {}
    endingpoint = {}
    lastindex = list(enumerate(dd))[-1][0]
    for x, v in enumerate(dd):
        s = v.get("leftstop")
        t_end = v.get("rightstart")
        if x != lastindex:
            end_override = dd[x + 1].get("leftstop")
        else:
            end_override = None
        if end_override is not None:
            if end_override in range(s, t_end):
                primerstart = s
                primerend = end_override
            else:
                primerstart = s
                primerend = t_end
        else:
            primerstart = s
            primerend = t_end
        startingpoint[primerstart] = v.get("name")
        endingpoint[primerend] = v.get("name")
    startdf = (
        pd.DataFrame.from_dict(startingpoint, orient="index")
        .reset_index()
        .rename(columns={0: "name", "index": "unique_start"})
    )
    enddf = (
        pd.DataFrame.from_dict(endingpoint, orient="index")
        .reset_index()
        .rename(columns={0: "name", "index": "unique_end"})
    )
    df = pd.merge(df, startdf, on="name", how="inner")
    df = pd.merge(df, enddf, on="name", how="inner")
    return df
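In the snippet above, the end of each amplicon is clipped to the next amplicon's "leftstop", but the start is never clipped to the previous amplicon's "rightstart", which would explain why the two unique regions touch. A minimal sketch of one possible fix follows. It is not the pipeline's actual implementation: it works on plain dictionaries instead of a DataFrame, the function name `find_unique_regions` is hypothetical, the coordinates are made-up example values, and it assumes that only directly neighbouring amplicons overlap.

```python
def find_unique_regions(amplicons):
    """Return the region of each amplicon that no neighbouring amplicon covers.

    Each input dict needs the keys "name", "leftstop" and "rightstart",
    mirroring the column names used in the snippet above.
    """
    # Assumption: sorting by "leftstop" puts the amplicons in genome order.
    ordered = sorted(amplicons, key=lambda amp: amp["leftstop"])
    regions = []
    for i, amp in enumerate(ordered):
        start = amp["leftstop"]
        end = amp["rightstart"]
        if i > 0:
            # Clip the start so the overlap with the previous amplicon is skipped.
            start = max(start, ordered[i - 1]["rightstart"])
        if i < len(ordered) - 1:
            # Clip the end so the overlap with the next amplicon is skipped.
            end = min(end, ordered[i + 1]["leftstop"])
        regions.append(
            {"name": amp["name"], "unique_start": start, "unique_end": end}
        )
    return regions


# Hypothetical coordinates, for illustration only.
amplicons = [
    {"name": "amp_1", "leftstop": 54, "rightstart": 385},
    {"name": "amp_2", "leftstop": 342, "rightstart": 704},
    {"name": "amp_3", "leftstop": 664, "rightstart": 1004},
]

for region in find_unique_regions(amplicons):
    print(region)
```

With these example values, amp_1 ends at 342 and amp_2 starts at 385, so positions 342 to 385 form the "void" that is counted for neither amplicon.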
This issue serves for logging purposes
Thanks for your wonderful work! The pipeline processes paired-end sequencing data well.
I wonder whether this workflow could also process single-end sequencing data.
Rewrite the regex rules to make sure that a sample will not be run by the pipeline if its name contains whitespace:
Valid filename:
Example_data.fastq.gz
Invalid filename:
Example_data .fastq.gz
(note the space between the sample name and the file extension)
Change the regex to something like the following for nanopore data: ([ ]*)([\S]*)\.f(ast)?q(\.gz)?
Group 1 matches only if there's a space in the sample name; it is an empty group if all is well.
Group 2 matches the actual sample name as long as there's no space in it. It is an empty group if there's a space directly before the file extension.
Group 3 & 4: matches the file extension(s)
Using these groups allows us to not include a sample in the samplesheet if it has a space in its name
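The group logic described above can be sketched as follows. This is only an illustration of the proposed regex, not the pipeline's code: the helper name `sample_name_or_none` is hypothetical, and it assumes the pattern is applied with `re.search` to a bare filename without a directory part.

```python
import re

# The regex proposed above for nanopore data.
FASTQ_PATTERN = re.compile(r"([ ]*)([\S]*)\.f(ast)?q(\.gz)?")


def sample_name_or_none(filename):
    """Return the sample name, or None if the file should be skipped."""
    match = FASTQ_PATTERN.search(filename)
    if match is None:
        # Not a FastQ file at all.
        return None
    spaces, name = match.group(1), match.group(2)
    if spaces or not name:
        # Group 1 captured whitespace, or group 2 is empty:
        # leave this sample out of the samplesheet.
        return None
    return name


for candidate in ["Example_data.fastq.gz", "Example_data .fastq.gz"]:
    print(repr(candidate), "->", sample_name_or_none(candidate))
```

Rejecting on a non-empty group 1 also covers a space in the middle of the name, where group 2 would otherwise hold only the part after the space.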