sars2seq from rivm-bioinformatics

sars2seq's Issues

Unique amplicon coverages overview does not properly take primer start and stopping positions into account

As reported by @landmanf:
The current scripts for retrieving the unique amplicon coverages do not properly handle the start and stop positions of the amplicons.

Currently, the end position of amplicon 1 is also the start position of amplicon 2. There should instead be a "void" between the two amplicons: the region where both amplicons overlap belongs to neither, so it should not be counted for either amplicon.

The issue can probably be found in the snippet below:

SARS2seq/SARS2seq/workflow/scripts/amplicon_covs.py

Lines 138 to 176 in 077ec6d

    
    def Find_NonOverlap(df):
        dd = df.to_dict(orient="records")
        startingpoint = {}
        endingpoint = {}
        lastindex = list(enumerate(dd))[-1][0]
        for x, v in enumerate(dd):
            s = v.get("leftstop")
            t_end = v.get("rightstart")
            if x != lastindex:
                end_override = dd[x + 1].get("leftstop")
            else:
                end_override = None
            if end_override is not None:
                if end_override in range(s, t_end):
                    primerstart = s
                    primerend = end_override
                else:
                    primerstart = s
                    primerend = t_end
            else:
                primerstart = s
                primerend = t_end
            startingpoint[primerstart] = v.get("name")
            endingpoint[primerend] = v.get("name")

        startdf = (
            pd.DataFrame.from_dict(startingpoint, orient="index")
            .reset_index()
            .rename(columns={0: "name", "index": "unique_start"})
        )
        enddf = (
            pd.DataFrame.from_dict(endingpoint, orient="index")
            .reset_index()
            .rename(columns={0: "name", "index": "unique_end"})
        )
        df = pd.merge(df, startdf, on="name", how="inner")
        df = pd.merge(df, enddf, on="name", how="inner")

        return df

This issue is for logging purposes.
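The expected behaviour described in this issue can be sketched as follows. This is not the fix that was applied in the repository; it is a minimal illustration that assumes the amplicons are sorted by position and that each record carries the "name", "leftstop" and "rightstart" keys used in the snippet above. The function name unique_regions and the example coordinates are hypothetical. The input is a list of records, as produced by df.to_dict(orient="records").

```python
def unique_regions(amplicons):
    """Return copies of the amplicon records with unique_start/unique_end added.

    Assumption: records are sorted by position and only neighbouring
    amplicons overlap. The overlap with the previous and the next amplicon
    is excluded, so it is counted for neither of them.
    """
    result = []
    last = len(amplicons) - 1
    for i, amp in enumerate(amplicons):
        start, end = amp["leftstop"], amp["rightstart"]
        if i > 0:
            # Skip the part that is still covered by the previous amplicon.
            start = max(start, amplicons[i - 1]["rightstart"])
        if i < last:
            # Stop where the next amplicon begins.
            end = min(end, amplicons[i + 1]["leftstop"])
        result.append({**amp, "unique_start": start, "unique_end": end})
    return result


# Hypothetical coordinates, for illustration only.
example = [
    {"name": "amp_1", "leftstop": 30, "rightstart": 400},
    {"name": "amp_2", "leftstop": 350, "rightstart": 700},
    {"name": "amp_3", "leftstop": 650, "rightstart": 1000},
]

for row in unique_regions(example):
    print(row["name"], row["unique_start"], row["unique_end"])
```

With these example values, amp_1 ends at 350 and amp_2 starts at 400, so the overlapping region 350 to 400 is assigned to neither amplicon. If an amplicon is completely covered by its neighbours, unique_start can exceed unique_end; a real implementation would need to decide how to report that case.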

Can this workflow process single-end sequencing data?

Thanks for your wonderful work! The workflow processes paired-end sequencing data well.

I wonder whether it can also process single-end sequencing data.

Don't include a sample if the sample name has a space

Rewrite the regex rules so that the pipeline does not run a sample whose name contains whitespace:

Valid filename:
Example_data.fastq.gz

Invalid filename:
Example_data .fastq.gz (note the space between the sample name and the file extension)

Change the regex to something like the following for nanopore data: ([ ]*)([\S]*)\.f(ast)?q(\.gz)?

Group 1 matches only if there is a space in the sample name; it is empty if all is well.
Group 2 matches the actual sample name as long as the name contains no space; it is empty if there is a space directly before the extension.
Groups 3 and 4 match the optional parts of the file extension ("ast" and ".gz").

Using these groups allows us to leave a sample out of the samplesheet if its name contains a space.
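A minimal sketch of how the groups could be used to filter file names is shown below. The helper name sample_name is hypothetical, and so is the way the pattern is applied: the sketch uses re.search and appends a trailing `$` anchor, which is not part of the regex proposed above. The anchor is added so that the match always runs to the end of the file name and any space before the extension ends up in group 1.

```python
import re

# Pattern from the issue, plus a trailing "$" (an addition made for this sketch).
SAMPLE_RE = re.compile(r"([ ]*)([\S]*)\.f(ast)?q(\.gz)?$")


def sample_name(filename):
    """Return the sample name, or None if the file should be skipped."""
    match = SAMPLE_RE.search(filename)
    if match is None:
        return None  # not a .fq/.fastq(.gz) file
    spaces, name = match.group(1), match.group(2)
    if spaces or not name:
        return None  # a space was found, or there is no sample name at all
    return name


for candidate in ["Example_data.fastq.gz", "Example_data .fastq.gz", "Example data.fq"]:
    print(repr(candidate), "->", sample_name(candidate))
```

Only the first file name yields a sample name ("Example_data"); the other two return None and would be left out of the samplesheet. Note that `[ ]*` only detects literal spaces, so other whitespace such as tabs would need a broader character class.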
