rvolden / c3poa


Computational pipeline for calling consensi on R2C2 nanopore data

License: GNU General Public License v2.0

Python 98.94% Shell 1.06%

c3poa's Issues

<determine_consensus.py> <pyabpoa.msa_aligner.msa> IndexError: list index out of range

Hi @rvolden,

I've been having issues trying to get C3POa.py running successfully, and I've referenced issues #17 #18 to no avail.

I would greatly appreciate it if you could guide me towards resolving these issues. Naturally, I'd be more than happy to dig deeper and submit pull requests as well 😄

Info

I started looking into these issues because the program finished unexpectedly fast and my R2C2_Consensus.fasta file contained very few reads.
In order to make sure the problems were not specific to my dataset, I used the 10xR2C2 PBMCs MinION data (SRX7522427; SRR10851883) from the preprint to reproduce these errors.
I added .get() to the pool.apply_async code in order to see the error messages (#17).
I tested this on 2 different systems, one running Ubuntu 18.04 (no bsub) and the other CentOS 7 (bsub available), as another user mentioned that this resolved the issue (#18).
To ensure that this is not a pyabpoa or abpoa issue, I ran the following code to confirm they are both working:

import pyabpoa as pa
a = pa.msa_aligner()
seqs=[
'CCGAAGA',
'CCGAACTCGA',
'CCCGGAAGA',
'CCGAAGA'
]
res=a.msa(seqs, out_cons=True, out_msa=True, out_pog='pog.png', incr_fn='') # perform multiple sequence alignment 
                                                                # generate a figure of alignment graph to pog.png

for seq in res.cons_seq:
    print(seq)  # print consensus sequence

res.print_msa() # print row-column multiple sequence alignment in PIR format

output:

CCGAAGA
CC--GAA---GA
CC--GAACTCGA
CCCGGAA---GA
CC--GAA---GA

Running abpoa on this:

>1
CGTCAATCTATCGAAGCATACGCGGGCAGAGCCGAAGACCTCGGCAATCCA
>2
CCACGTCAATCTATCGAAGCATACGCGGCAGCCGAACTCGACCTCGGCAATCAC
>3
CGTCAATCTATCGAAGCATACGCGGCAGAGCCCGGAAGACCTCGGCAATCAC
>4
CGTCAATGCTAGTCGAAGCAGCTGCGGCAGAGCCGAAGACCTCGGCAATCAC
>5
CGTCAATCTATCGAAGCATTCTACGCGGCAGAGCCGACCTCGGCAATCAC
>6
CGTCAATCTAGAAGCATACGCGGCAAGAGCCGAAGACCTCGGCCAATCAC
>7
CGTCAATCTATCGGTAAAGCATACGCTCTGTAGCCGAAGACCTCGGCAATCAC
>8
CGTCAATCTATCTTCAAGCATACGCGGCAGAGCCGAAGACCTCGGCAATC
>9
CGTCAATGGATCGAGTACGCGGCAGAGCCGAAGACCTCGGCAATCAC
>10
CGTCAATCTAATCGAAGCATACGCGGCAGAGCCGTCTACCTCGGCAATCACGT

returned this:

>Consensus_sequence
CGTCAATCTATCGAAGCATACGCGGCAGAGCCGAAGACCTCGGCAATCAC

splint.fasta used:

>UMI_Splint_1
TGAGGCTGATGAGTTCCATANNNNNTATATNNNNNATCAC
TACTTAGTTTTTTGATAGCTTCAAGCCAGAGTTGTCTTTT
TCTCTTTGCTGGCAGTAAAAGTATTGTGTACCTTTTGCTG
GGTCAGGTTGTTCTTTAGGAGGAGTAAAAGGATCAAATGC
ACTAANNNNNTATATNNNNNGCGATCGAAAATATCCCTTT
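
For reference, the splint_dict dump in the traceback further down holds two sequences for UMI_Splint_1, and the second is the reverse complement of the first (N stays N). A minimal sketch of that operation follows; the function name is illustrative and not taken from C3POa.

```python
# Translation table for DNA bases; N (any base) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


# The last 20 bases of the splint above become the first 20 bases
# of the second splint_dict entry.
print(reverse_complement("GCGATCGAAAATATCCCTTT"))
```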

Command to run C3POa.py:
python3 C3POa.py -r SRR10851883.1.fastq -o testout_srr1/ -s splint.fasta -c config -l 1000 -d 500 -n 40 -g 1000

Issue

Aligning splints to reads with blat
Preprocessing: 100%|██████████| 4572/4572 [03:29<00:00, 21.84it/s]
Catting psls: 100%|██████████| 4572/4572 [00:17<00:00, 262.37it/s]
Removing preprocessing files: 100%|██████████| 4572/4572 [00:23<00:00, 196.20it/s]
Calling consensi:   0%|          | 0/4572 [00:00<?, ?it/s]RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/miniconda3/envs/ont/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/data/C3POa/C3POa.py", line 172, in analyze_reads
    consensus, repeats = determine_consensus(
  File "/data/C3POa/bin/determine_consensus.py", line 49, in determine_consensus
    res = poa_aligner.msa(subreads, out_cons=True, out_msa=True)
  File "python/pyabpoa.pyx", line 137, in pyabpoa.msa_aligner.msa
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

File "/data/C3POa/C3POa.py", line 290, in <module>
    286  if not args.reads or not args.splint_file:
    287      print('Reads (--reads/-r) and splint (--splint_file/-s) are required', file=sys.stderr)
    288      sys.exit(1)
    289  mp.set_start_method("spawn")
--> 290  main(args)
    ..................................................
     args.reads = '/data/C3POa/SRR10851883.1.fastq
                   '
     args.splint_file = '/data/C3POa/splint.fasta'
     sys.stderr = <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8' >
     mp.set_start_method = <method 'DefaultContext.set_start_method' of <multiprocessin
                            g.context.DefaultContext object at 0x2b6889583d00> context.p
                            y:241>
     args = Namespace(blatThreads=False, compress_output=False, config='
             /data/C3POa/config', groupSize=1000, lencutoff=1000, mdistcutoff=500, numThreads=4
             0, out_path='/data/C3POa/testout_srr1/', reads='/data/C3POa/SRR10851883.1.fastq', 
             splint_file='/data/C3POa/splint.fasta', zero=True)
    ..................................................

File "/data/C3POa/C3POa.py", line 255, in main
    186  def main(args):
 (...)
    251              continue
    252          tmp_reads.append(read)
    253          current_num += 1
    254          if current_num == target:
--> 255              pool.apply_async(analyze_reads,
    256                  args=(args, tmp_reads, splint_dict, adapter_dict, adapter_set, iteration, racon),
    ..................................................
     args = Namespace(blatThreads=False, compress_output=False, config='
             /data/C3POa/config', groupSize=1000, lencutoff=1000, mdistcutoff=500, numThreads=4
             0, out_path='/data/C3POa/testout_srr1/', reads='/data/C3POa/SRR10851883.1.fastq',
             splint_file='/data/C3POa/splint.fasta', zero=True)
     read = ('SRR10851883.1.1048', 'CAGGGCAGAGAGGACATTGTACCCCCAGTAGCGAAT
             CTCCGGACAGTAGATGTGAGTAAACATTCGGCAGAAAAGCTCTGCCAAACTGAGTGCTCT
             GATGTGACTTTTTCATCAAGTCAGTATTCCTGGGATCTCTTGTACGTGATAATCTCACTC
             TTGTACATAGTAATCTCACTCTTGTACACCCAATGCTCTAGCTCTTATAAAGGTTTCATC
             GTTTCTGCTTACCTAGTTTCTTTCCCTCTGTTCTCTCCCACCAGACTGGACTCTGAAATG
             GGCATGTACAGAGACGAAGAGACCCCCAACATGCTTCAGGCTTTGAGTGGAGAGGACACA
             GCCTCTGCTGGGACAGGGAGACTAGAGGGATGTGGAGTCCCTGAAGATGCTTTTGGACAA
             TGGTCACAGAGGTTGGGACAGTGGCAGGAGATACCATTCACCCAGGATCTCCAGGACAAG
             AGATCAGCCTGGCAGTTACA...
     current_num = 1000
     target = 1000
     pool.apply_async = <method 'Pool.apply_async' of <multiprocessing.pool.Pool sta
                         te=RUN pool_size=40> pool.py:450>
     tmp_reads = [('SRR10851883.1.2', 'AAGGGATATTTTCGATCGCGACACATATATGTAGTTAG
                  TGCATTTGGTCTTTTACTCCTCTAAAGAACAACCTGACCCAGCAAAAGGTACACAATACT
                  TTTACTGCCAGCAAAGAGAAAAAAAGACAACTCTACGGCGAAGCTATCAAAACTAAGTAG
                  TGATCTACCATATAAGCCATATGGAACTCATCAACCTCACTACACGACGCTCTTCCGATC
                  TCATGACACATCAACAACACCTATCGTTTTTTTTTTTTTTTTAAAAACCTTACTTTTGTT
                  TAATTTGTTTTGCCAAACAAACACAGTGAAACTTTAGTCTGACTAATTGTACAGAAAATA
                  GAATTTGTAACCAGTAGCAAACAAAACAGGATAAACCTAAGTCCCTGGCAAGCTGGATCT
                  CCATCAGCAGGTCATCCAGATCCCCTTGTAAATGGCCTCCTGGGAAAATAAGTCCTAGGA
                  GCAGAGCCTGGAATTTACCA...
     splint_dict = {'UMI_Splint_1': ['TGAGGCTGATGAGTTCCATANNNNNTATATNNNNNATCACT
                    ACTTAGTTTTTTGATAGCTTCAAGCCAGAGTTGTCTTTTTCTCTTTGCTGGCAGTAAAAG
                    TATTGTGTACCTTTTGCTGGGTCAGGTTGTTCTTTAGGAGGAGTAAAAGGATCAAATGCA
                    CTAANNNNNTATATNNNNNGCGATCGAAAATATCCCTTT', 'AAAGGGATATTTTCGAT
                    CGCNNNNNATATANNNNNTTAGTGCATTTGATCCTTTTACTCCTCCTAAAGAACAACCTG
                    ACCCAGCAAAAGGTACACAATACTTTTACTGCCAGCAAAGAGAAAAAGACAACTCTGGCT
                    TGAAGCTATCAAAAAACTAAGTAGTGATNNNNNATATANNNNNTATGGAACTCATCAGCC
                    TCA', ]}
     adapter_dict = {'SRR10851883.1.2': ['UMI_Splint_1', '-', ],
                     'SRR10851883.1.3': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.4': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.5': ['UMI_Splint_1', '-', ],
                     'SRR10851883.1.6': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.7': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.8': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.9': ['UMI_Splint_1', '-', ],
                     'SRR10851883.1.10': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.11': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.12': ['UMI_Splint_1', '+', ],
                     '...
     adapter_set = {'UMI_Splint_1', }
     iteration = 1
     racon = '/data/C3POa/racon/build/bin/racon'
    ..................................................

File "/home/miniconda3/envs/ont/lib/python3.8/multiprocessing/pool.py", line 771, in get
    764  def get(self, timeout=None):
 (...)
    767          raise TimeoutError
    768      if self._success:
    769          return self._value
    770      else:
--> 771          raise self._value
    ..................................................
     self = <multiprocessing.pool.ApplyResult object at 0x2b6a1a7f09d0>
     timeout = None
     TimeoutError = <class 'multiprocessing.context.TimeoutError'>
     self._success = False
     self._value = IndexError('list index out of range')
    ..................................................

---- (full traceback above) ----
File "/data/C3POa/C3POa.py", line 290, in <module>
    main(args)
File "/data/C3POa/C3POa.py", line 255, in main
    pool.apply_async(analyze_reads,
File "/home/miniconda3/envs/ont/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value

IndexError: list index out of range
Calling consensi:   0%|          | 0/4572 [00:24<?, ?it/s]
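
The traceback above is only visible because .get() was called on the result of pool.apply_async. A minimal sketch of that behaviour, assuming a hypothetical worker (analyze_batch) in place of analyze_reads; ThreadPool is used only so the sketch runs anywhere, and it shares the apply_async interface with multiprocessing.Pool.

```python
from multiprocessing.pool import ThreadPool


def analyze_batch(batch):
    """Hypothetical stand-in for a worker that fails on one input."""
    if not batch:
        raise IndexError("list index out of range")
    return len(batch)


def run(batches, wait_for_result):
    """Submit every batch; optionally call .get() on each AsyncResult."""
    with ThreadPool(2) as pool:
        pending = [pool.apply_async(analyze_batch, (b,)) for b in batches]
        if wait_for_result:
            # .get() blocks and re-raises any exception from the worker.
            return [p.get() for p in pending]
        pool.close()
        pool.join()
        # Without .get(), a failure stays stored inside the AsyncResult.
        return [p.successful() for p in pending]


print(run([["read"], []], wait_for_result=False))  # no exception is raised here
```

Calling run(..., wait_for_result=True) on the same input raises the IndexError in the parent instead of hiding it.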

This is the structure of the output directory:

C3POa/testout_srr1/
├── c3poa.log
├── tmp
│   └── splint_to_read_alignments.psl
└── UMI_Splint_1
    └── tmp1
        ├── R2C2_Consensus.fasta
        ├── racon_messages.log
        ├── SRR10851883.1.5_overlaps.paf
        ├── SRR10851883.1.5_subreads.fastq
        └── subreads.fastq

3 directories, 7 files

Both SRR10851883.1.5_overlaps.paf and SRR10851883.1.5_subreads.fastq are empty.
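
A quick way to confirm which temporary files are zero bytes is sketched below. The helper name is hypothetical; the directory passed in would be the tmp1 folder from the tree above.

```python
from pathlib import Path


def empty_files(directory):
    """Return the names of zero-byte regular files in a directory, sorted."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_size == 0
    )
```

For example, empty_files('testout_srr1/UMI_Splint_1/tmp1') would list the per-read files that were created but never written to.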
c3poa.log:

C3POa version: v2.2.2
No splint reads: 506104 (10.40%)
Under len cutoff: 294213 (6.05%)
Total thrown away reads: 800317 (16.45%)
Total reads: 4865719

C3POa/testout_srr1/UMI_Splint_1/tmp1/racon_messages.log:

[racon::Polisher::initialize] loaded target sequences 0.000042 s
[racon::Polisher::initialize] loaded sequences 0.000045 s
[racon::Polisher::initialize] loaded overlaps 0.000029 s
[racon::Polisher::initialize] aligned overlaps 0.002215 s
[racon::Polisher::initialize] transformed data into windows 0.000026 s
[racon::Polisher::polish] generated consensus 0.008006 s
[racon::Polisher::] total = 0.010624 s

C3POa/testout_srr1/UMI_Splint_1/tmp1/R2C2_Consensus.fasta (truncated):

>SRR10851883.1.2_23.12_3003_0_2171
...
>SRR10851883.1.3_23.53_4715_3_1121
...
>SRR10851883.1.4_20.86_2621_1_1429
...

C3POa/testout_srr1/UMI_Splint_1/tmp1/subreads.fastq (truncated):

@SRR10851883.1.2_0
AAGGG ...
+
.8,.> ...
@SRR10851883.1.2_1
TTTTA ...
+
#$%%- ...
@SRR10851883.1.3_1
GTATT ...
+
,'&&' ...
@SRR10851883.1.3_2
ATTGT ...
+
B9768 ...
@SRR10851883.1.3_3
AGTAT ...
+
E6::8 ...
@SRR10851883.1.3_0
CAGGG ...
+
%,()/ ...
@SRR10851883.1.3_4
ATTGT ...
+
)()') ...
@SRR10851883.1.4_1
ATTGT ...
+
:98+* ...
@SRR10851883.1.4_0
ATAGC ...
+
/''.-. ...
@SRR10851883.1.4_2
TATTG ...
+
@@978 ...
@SRR10851883.1.5_0
CGGTG ...
+
%(''%# ...
@SRR10851883.1.5_1
TTTAC ...
+
KOH<B ...

Additional info (if it helps)

I tried to investigate further by removing the reliance on multiprocessing in hopes of narrowing down the issue:

Reading existing psl file
File "/data/C3POa/C3POa.py", line 286, in <module>
    282  if not args.reads or not args.splint_file:
    283      print('Reads (--reads/-r) and splint (--splint_file/-s) are required', file=sys.stderr)
    284      sys.exit(1)
    285  #mp.set_start_method("fork")
--> 286  main(args)
    ..................................................
     args.reads = '/data/C3POa/SRR10851883.1.fastq'
     args.splint_file = '/data/C3POa/splint.fasta'
     sys.stderr = <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8' >
     args = Namespace(blatThreads=False, compress_output=False, config='
             /data/C3POa/config', groupSize=1000, lencutoff=1000, mdistcutoff=500, numThreads=4
             0, out_path='/data/C3POa/testout_srr1/', reads='/data/C3POa/SRR10851883.1.fastq',
             splint_file='/data/C3POa/splint.fasta', zero=True)
    ..................................................

File "/data/C3POa/C3POa.py", line 254, in main
    178  def main(args):
 (...)
--> 254              analyze_reads(args, tmp_reads, splint_dict, adapter_dict, adapter_set, iteration, racon)
    255              iteration += 1
    ..................................................
     args = Namespace(blatThreads=False, compress_output=False, config='
             /data/C3POa/config', groupSize=1000, lencutoff=1000, mdistcutoff=500, numThreads=4
             0, out_path='/data/C3POa/testout_srr1/', reads='/data/C3POa/SRR10851883.1.fastq',
             splint_file='/data/C3POa/splint.fasta', zero=True)
     tmp_reads = [('SRR10851883.1.2', 'AAGGGATATTTTCGATCGCGACACATATATGTAGTTAG
                  TGCATTTGGTCTTTTACTCCTCTAAAGAACAACCTGACCCAGCAAAAGGTACACAATACT
                  TTTACTGCCAGCAAAGAGAAAAAAAGACAACTCTACGGCGAAGCTATCAAAACTAAGTAG
                  TGATCTACCATATAAGCCATATGGAACTCATCAACCTCACTACACGACGCTCTTCCGATC
                  TCATGACACATCAACAACACCTATCGTTTTTTTTTTTTTTTTAAAAACCTTACTTTTGTT
                  TAATTTGTTTTGCCAAACAAACACAGTGAAACTTTAGTCTGACTAATTGTACAGAAAATA
                  GAATTTGTAACCAGTAGCAAACAAAACAGGATAAACCTAAGTCCCTGGCAAGCTGGATCT
                  CCATCAGCAGGTCATCCAGATCCCCTTGTAAATGGCCTCCTGGGAAAATAAGTCCTAGGA
                  GCAGAGCCTGGAATTTACCA...
     splint_dict = {'UMI_Splint_1': ['TGAGGCTGATGAGTTCCATANNNNNTATATNNNNNATCACT
                    ACTTAGTTTTTTGATAGCTTCAAGCCAGAGTTGTCTTTTTCTCTTTGCTGGCAGTAAAAG
                    TATTGTGTACCTTTTGCTGGGTCAGGTTGTTCTTTAGGAGGAGTAAAAGGATCAAATGCA
                    CTAANNNNNTATATNNNNNGCGATCGAAAATATCCCTTT', 'AAAGGGATATTTTCGAT
                    CGCNNNNNATATANNNNNTTAGTGCATTTGATCCTTTTACTCCTCCTAAAGAACAACCTG
                    ACCCAGCAAAAGGTACACAATACTTTTACTGCCAGCAAAGAGAAAAAGACAACTCTGGCT
                    TGAAGCTATCAAAAAACTAAGTAGTGATNNNNNATATANNNNNTATGGAACTCATCAGCC
                    TCA', ]}
     adapter_dict = {'SRR10851883.1.2': ['UMI_Splint_1', '-', ],
                     'SRR10851883.1.3': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.4': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.5': ['UMI_Splint_1', '-', ],
                     'SRR10851883.1.6': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.7': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.8': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.9': ['UMI_Splint_1', '-', ],
                     'SRR10851883.1.10': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.11': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.12': ['UMI_Splint_1', '+', ],
                     '...
     adapter_set = {'UMI_Splint_1', }
     iteration = 1
     racon = '/data/C3POa/racon/build/bin/racon'
    ..................................................

File "/data/C3POa/C3POa.py", line 165, in analyze_reads
    113  def analyze_reads(args, reads, splint_dict, adapter_dict, adapter_set, iteration, racon):
 (...)
    161          if not os.path.isdir(tmp_dir):
    162              os.mkdir(tmp_dir)
    163          subread_file = tmp_dir + 'subreads.fastq'
    164  
--> 165          consensus, repeats = determine_consensus(
    166              args, read, subreads, qual_subreads, dangling_subreads, qual_dangling_subreads,
    ..................................................
     args = Namespace(blatThreads=False, compress_output=False, config='
             /data/C3POa/config', groupSize=1000, lencutoff=1000, mdistcutoff=500, numThreads=4
             0, out_path='/data/C3POa/testout_srr1/', reads='/data/C3POa/SRR10851883.1.fastq',
             splint_file='/data/C3POa/splint.fasta', zero=True)
     reads = [('SRR10851883.1.2', 'AAGGGATATTTTCGATCGCGACACATATATGTAGTTAG
              TGCATTTGGTCTTTTACTCCTCTAAAGAACAACCTGACCCAGCAAAAGGTACACAATACT
              TTTACTGCCAGCAAAGAGAAAAAAAGACAACTCTACGGCGAAGCTATCAAAACTAAGTAG
              TGATCTACCATATAAGCCATATGGAACTCATCAACCTCACTACACGACGCTCTTCCGATC
              TCATGACACATCAACAACACCTATCGTTTTTTTTTTTTTTTTAAAAACCTTACTTTTGTT
              TAATTTGTTTTGCCAAACAAACACAGTGAAACTTTAGTCTGACTAATTGTACAGAAAATA
              GAATTTGTAACCAGTAGCAAACAAAACAGGATAAACCTAAGTCCCTGGCAAGCTGGATCT
              CCATCAGCAGGTCATCCAGATCCCCTTGTAAATGGCCTCCTGGGAAAATAAGTCCTAGGA
              GCAGAGCCTGGAATTTACCA...
     splint_dict = {'UMI_Splint_1': ['TGAGGCTGATGAGTTCCATANNNNNTATATNNNNNATCACT
                    ACTTAGTTTTTTGATAGCTTCAAGCCAGAGTTGTCTTTTTCTCTTTGCTGGCAGTAAAAG
                    TATTGTGTACCTTTTGCTGGGTCAGGTTGTTCTTTAGGAGGAGTAAAAGGATCAAATGCA
                    CTAANNNNNTATATNNNNNGCGATCGAAAATATCCCTTT', 'AAAGGGATATTTTCGAT
                    CGCNNNNNATATANNNNNTTAGTGCATTTGATCCTTTTACTCCTCCTAAAGAACAACCTG
                    ACCCAGCAAAAGGTACACAATACTTTTACTGCCAGCAAAGAGAAAAAGACAACTCTGGCT
                    TGAAGCTATCAAAAAACTAAGTAGTGATNNNNNATATANNNNNTATGGAACTCATCAGCC
                    TCA', ]}
     adapter_dict = {'SRR10851883.1.2': ['UMI_Splint_1', '-', ],
                     'SRR10851883.1.3': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.4': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.5': ['UMI_Splint_1', '-', ],
                     'SRR10851883.1.6': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.7': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.8': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.9': ['UMI_Splint_1', '-', ],
                     'SRR10851883.1.10': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.11': ['UMI_Splint_1', '+', ],
                     'SRR10851883.1.12': ['UMI_Splint_1', '+', ],
                     '...
     adapter_set = {'UMI_Splint_1', }
     iteration = 1
     racon = '/data/C3POa/racon/build/bin/racon'
     os.path.isdir = <function 'isdir' genericpath.py:39>
     tmp_dir = '/data/C3POa/testout_srr1/UMI_Splint_1/tmp1/'
     subread_file = '/data/C3POa/testout_srr1/UMI_Splint_1/tmp1/subreads.fastq'
     consensus = 'ATTGTGTACCTTTTGCTGGGTCAGGTTGTTCTTTAGGAGGAGTAAAAGGATCAAATGCA
                  CTAAATAACTATATAGACAGCGATCGAAAATATCCCCTTTAAGCAGTGGTATCAACTTTC
                  GAGTATGGGGGAGCCGCTGAGAAGCGCAGAAGGCGGGCCCCGTCTGAGGTCTGGCAGTCA
                  GAGACAGCCGGGCGCCCACAGCCCGAGCGCCCACGGCAGCACCATGCCCGCACTCCTGGA
                  GCGCCCCAAGCTTTCCACGCCATGGCCAGGGCGCTGCACCGGCATTATGATGGAGCCCAG
                  GAGCGCAAAGCAGGAGGAAGAAGAGGTGGATAAGATGATGGAACAGAAGATGAAGGAAGA
                  ACAGGAGAGAAGGAAGAAAAAGGAGATGGAAGAGAATGTCATTAGAGGAGACCAAGGGAA
                  CAAATTCTGAAGTTGGAGGAGAAGCTTTTGGCTCTACAGGAAGAAGCACCAGCTTTCCTG
                  CAGCTCAAGAAAGTTTACAT...
     repeats = 1
     read = ('SRR10851883.1.5', 'CGGTGTATGGTTCAGGTAAGAGAGTCCTCCTTCTTCGGG
             GAGAGTATAGTTGGGCCCATGATCCTGGCCTTCCAGGCCAGTGGAGACCAGCCGAGGGCT
             CGTCCAAAAGGCGTAGAACCGACCTTTGTCGGAGGTTGGCACCATTTTGGAAATATACAT
             CATAAGAGGGCCTTTGGGGTCACAGCTTTCTAATGCCCATGGCAGCCTCGTCGATCAGGG
             GCCCCTCATTACAGGCTCGCAGCGGTACTTCTGGGCCGTCACAGGGAGGGCAGGTGGATG
             GTGATCATCTGCAACAAGGCGTCTCCGGCAGGCAGCCAGCGGCGCGTCACAGCCTTCAGC
             AGGGGTTTGCCTTCTTTGTCCTTGTCGCAGCTGTCAGTTTGATGTCCAGTTTCTCTATCA
             GTTTTGCTGTCTCCTCTTTCTTGAAATTCATGATCATAAACACCTTGAAGATGGGGTCCA
             GGATCAGCTGGCGGAAGGGT...
     subreads = []
     qual_subreads = []
     dangling_subreads = ['CGGTGTATGGTTCAGGTAAGAGAGTCCTCCTTCTTCGGGGAGAGTATAGTTGGGCCCA
                          TGATCCTGGCCTTCCAGGCCAGTGGAGACCAGCCGAGGGCTCGTCCAAAAGGCGTAGAAC
                          CGACCTTTGTCGGAGGTTGGCACCATTTTGGAAATATACATCATAAGAGGGCCTTTGGGG
                          TCACAGCTTTCTAATGCCCATGGCAGCCTCGTCGATCAGGGGCCCCTCATTACAGGCTCG
                          CAGCGGTACTTCTGGGCCGTCACAGGGAGGGCAGGTGGATGGTGATCATCTGCAACAAGG
                          CGTCTCCGGCAGGCAGCCAGCGGCGCGTCACAGCCTTCAGCAGGGGTTTGCCTTCTTTGT
                          CCTTGTCGCAGCTGTCAGTTTGATGTCCAGTTTCTCTATCAGTTTTGCTGTCTCCTCTTT
                          CTTGAAATTCATGATCATAAACACCTTGAAGATGGGGTCCAGGATCAGCTGGCGGAAGGG
                          TGTGGCAGCTTCTTCCCGCT...
     qual_dangling_subreads = ["%(''%#$($(&&#+,,'$&%4&1:>B@>-4,+'$%$$&0*%%$)'+((4)8//611-(
                               (*)%%*&&()*35$>5HC?>3;?B831311.3-,&%+'&&-2'&%'8:;6/.),2=<DA>
                               >A@>>FG+45?>>&<?></454>=BLLJG?=),:??;.AB@E>@><32EAB=@=@@&'A;
                               <?:675444-%$%&$$)''-11?@C=?A@-B?9+*-,(2+746?F3/+$%&$%%)5=04,
                               ))?112//9::?BA4?BA@?-(()?4:,11;446D?FD==>@>ED?BFFD7;?DCBM@;B
                               <A>?EFF?2))?F>?GEBID?BA((%%*-*,,4;<GF?5.**<;>BIHF*;:=??A?;26
                               6<8:1&9&&'&15/(-'3246>===,,F779AB?6A4=4C6D?IN.;@GABBE,;7;A?A
                               ?DCC@@FJEB5-104:)/AE>9B?FBA<@;;:???DC003CDB>:2-(/-//%%&*1%&(
                               &&-,136167;5A>6<1$#$...
    ..................................................

File "/data/C3POa/bin/determine_consensus.py", line 49, in determine_consensus
    13   def determine_consensus(args, read, subreads, sub_qual, dangling_subreads, qual_dangling_subreads, racon, tmp_dir, subread_file):
 (...)
    45               os.system('rm {tmp_files}'.format(tmp_files=' '.join([tmp_subread_file, overlap_file])))
    46               return '', 0
    47           abpoa_cons = pairwise_consensus(res.msa_seq, subreads, sub_qual)
    48       else:
--> 49           res = poa_aligner.msa(subreads, out_cons=True, out_msa=True)
    50           if not res.cons_seq:
    ..................................................
     args = Namespace(blatThreads=False, compress_output=False, config='
             /data/C3POa/config', groupSize=1000, lencutoff=1000, mdistcutoff=500, numThreads=4
             0, out_path='/data/C3POa/testout_srr1/', reads='/data/C3POa/SRR10851883.1.fastq',
             splint_file='/data/C3POa/splint.fasta', zero=True)
     read = ('SRR10851883.1.5', 'CGGTGTATGGTTCAGGTAAGAGAGTCCTCCTTCTTCGGG
             GAGAGTATAGTTGGGCCCATGATCCTGGCCTTCCAGGCCAGTGGAGACCAGCCGAGGGCT
             CGTCCAAAAGGCGTAGAACCGACCTTTGTCGGAGGTTGGCACCATTTTGGAAATATACAT
             CATAAGAGGGCCTTTGGGGTCACAGCTTTCTAATGCCCATGGCAGCCTCGTCGATCAGGG
             GCCCCTCATTACAGGCTCGCAGCGGTACTTCTGGGCCGTCACAGGGAGGGCAGGTGGATG
             GTGATCATCTGCAACAAGGCGTCTCCGGCAGGCAGCCAGCGGCGCGTCACAGCCTTCAGC
             AGGGGTTTGCCTTCTTTGTCCTTGTCGCAGCTGTCAGTTTGATGTCCAGTTTCTCTATCA
             GTTTTGCTGTCTCCTCTTTCTTGAAATTCATGATCATAAACACCTTGAAGATGGGGTCCA
             GGATCAGCTGGCGGAAGGGT...
     subreads = []
     sub_qual = []
     dangling_subreads = ['CGGTGTATGGTTCAGGTAAGAGAGTCCTCCTTCTTCGGGGAGAGTATAGTTGGGCCCA
                          TGATCCTGGCCTTCCAGGCCAGTGGAGACCAGCCGAGGGCTCGTCCAAAAGGCGTAGAAC
                          CGACCTTTGTCGGAGGTTGGCACCATTTTGGAAATATACATCATAAGAGGGCCTTTGGGG
                          TCACAGCTTTCTAATGCCCATGGCAGCCTCGTCGATCAGGGGCCCCTCATTACAGGCTCG
                          CAGCGGTACTTCTGGGCCGTCACAGGGAGGGCAGGTGGATGGTGATCATCTGCAACAAGG
                          CGTCTCCGGCAGGCAGCCAGCGGCGCGTCACAGCCTTCAGCAGGGGTTTGCCTTCTTTGT
                          CCTTGTCGCAGCTGTCAGTTTGATGTCCAGTTTCTCTATCAGTTTTGCTGTCTCCTCTTT
                          CTTGAAATTCATGATCATAAACACCTTGAAGATGGGGTCCAGGATCAGCTGGCGGAAGGG
                          TGTGGCAGCTTCTTCCCGCT...
     qual_dangling_subreads = ["%(''%#$($(&&#+,,'$&%4&1:>B@>-4,+'$%$$&0*%%$)'+((4)8//611-(
                               (*)%%*&&()*35$>5HC?>3;?B831311.3-,&%+'&&-2'&%'8:;6/.),2=<DA>
                               >A@>>FG+45?>>&<?></454>=BLLJG?=),:??;.AB@E>@><32EAB=@=@@&'A;
                               <?:675444-%$%&$$)''-11?@C=?A@-B?9+*-,(2+746?F3/+$%&$%%)5=04,
                               ))?112//9::?BA4?BA@?-(()?4:,11;446D?FD==>@>ED?BFFD7;?DCBM@;B
                               <A>?EFF?2))?F>?GEBID?BA((%%*-*,,4;<GF?5.**<;>BIHF*;:=??A?;26
                               6<8:1&9&&'&15/(-'3246>===,,F779AB?6A4=4C6D?IN.;@GABBE,;7;A?A
                               ?DCC@@FJEB5-104:)/AE>9B?FBA<@;;:???DC003CDB>:2-(/-//%%&*1%&(
                               &&-,136167;5A>6<1$#$...
     racon = '/data/C3POa/racon/build/bin/racon'
     tmp_dir = '/data/C3POa/testout_srr1/UMI_Splint_1/tmp1/'
     subread_file = '/data/C3POa/testout_srr1/UMI_Splint_1/tmp1/subreads.fastq'
     tmp_subread_file = '/data/C3POa/testout_srr1/UMI_Splint_1/tmp1/SRR10851883.1.5_subreads.fastq'
     overlap_file = '/data/C3POa/testout_srr1/UMI_Splint_1/tmp1/SRR10851883.1.5_overlaps.paf'
    ..................................................

File "python/pyabpoa.pyx", line 137, in pyabpoa.msa_aligner.msa

---- (full traceback above) ----
File "/data/C3POa/C3POa.py", line 286, in <module>
    main(args)
File "/data/C3POa/C3POa.py", line 254, in main
    analyze_reads(args, tmp_reads, splint_dict, adapter_dict, adapter_set, iteration, racon)
File "/data/C3POa/C3POa.py", line 165, in analyze_reads
    consensus, repeats = determine_consensus(
File "/data/C3POa/bin/determine_consensus.py", line 49, in determine_consensus
    res = poa_aligner.msa(subreads, out_cons=True, out_msa=True)
File "python/pyabpoa.pyx", line 137, in pyabpoa.msa_aligner.msa

IndexError: list index out of range
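
In the locals dumped above, subreads is an empty list at the moment poa_aligner.msa is called for read SRR10851883.1.5, which is consistent with the IndexError. A minimal sketch of a guard follows, assuming the empty list is indeed the trigger. Both consensus_or_none and FakeAligner are hypothetical names; FakeAligner only imitates the call signature seen in the traceback so the sketch runs without pyabpoa.

```python
def consensus_or_none(aligner, subreads):
    """Call the aligner only when there is at least one subread."""
    if not subreads:
        return None  # nothing to align; let the caller skip this read
    return aligner.msa(subreads, out_cons=True, out_msa=True)


class FakeAligner:
    """Hypothetical stand-in for an msa_aligner object (assumption)."""

    def msa(self, seqs, out_cons, out_msa):
        # Indexing an empty list here would raise IndexError.
        return seqs[0]


print(consensus_or_none(FakeAligner(), []))        # None
print(consensus_or_none(FakeAligner(), ["ACGT"]))  # ACGT
```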

I also printed out the overlap_file and tmp_subread_file in C3POa/bin/determine_consensus.py (~ line 30) to inspect which files are being read:

       # temporary subreads specific for the current read (req. by racon)
       tmp_subread_file = tmp_dir + '{name}_subreads.fastq'.format(name=name)
       tmp_subread_fh = open(tmp_subread_file, 'w+')
-->    print(overlap_file, tmp_subread_file)

       # align subreads together using abPOA
       poa_aligner = poa.msa_aligner(match=5)

and these are the outputs before the program exits:

/data/C3POa/testout_srr1/UMI_Splint_1/tmp1/SRR10851883.1.3_overlaps.paf
/data/C3POa/testout_srr1/UMI_Splint_1/tmp1/SRR10851883.1.3_subreads.fastq
/data/C3POa/testout_srr1/UMI_Splint_1/tmp1/SRR10851883.1.4_overlaps.paf
/data/C3POa/testout_srr1/UMI_Splint_1/tmp1/SRR10851883.1.4_subreads.fastq
/data/C3POa/testout_srr1/UMI_Splint_1/tmp1/SRR10851883.1.5_overlaps.paf
/data/C3POa/testout_srr1/UMI_Splint_1/tmp1/SRR10851883.1.5_subreads.fastq

I suppose I will stop here for now. Thank you for your attention and assistance 😃

Calling Consensi Hang

Hi, I'm trying to replicate the B-Cell data from your 2018 paper (lovely work btw!)

I've downloaded the SRA data, so I'm working with SRR6924616_R2C2_full_length_cDNA_sequencing_of_single_human_B_cells_1.fastq.gz.

I've cloned your repo and run the setup without issue, and installed Racon and BLAT through conda (both are available in the path).

When running the command (using a subset of the data or the whole enchilada), it seems to hang on the "Calling consensi" portion for a long period of time, then the script finishes. The splint.fasta is the one included in your repo. The output directory contains a splint_to_read_alignments.psl file which is sizeable, but the R2C2_Consensus.fasta & R2C2_Subreads.fastq are empty.

Command:

python3 C3POa.py \
                -r ../../Data/R2C2/SRR6924616_R2C2_full_length_cDNA_sequencing_of_single_human_B_cells_1.fastq.gz \
                -o ../C3POa_All -s splint.fasta -l 1000 -d 500 -n 8 -g 1000

Log Contents:

C3POa version: v2.2.3
Total reads: 2873159
No splint reads: 796038 (27.71%)
Under len cutoff: 668197 (23.26%)
Total thrown away reads: 1464235 (50.96%)
Reads after preprocessing: 1408924
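
The counts in this log are internally consistent, which can be checked with a few lines. This is a sketch that assumes the "Key: count (pct%)" layout shown above; the function names are illustrative.

```python
import re

# Matches lines such as "No splint reads: 796038 (27.71%)".
# The version line is skipped because its value does not start with a digit.
LOG_LINE = re.compile(r"^(?P<key>[^:]+):\s*(?P<count>\d+)")


def parse_counts(log_text):
    """Return {label: integer count} for every counted line in the log."""
    counts = {}
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            counts[match.group("key")] = int(match.group("count"))
    return counts


def reads_kept(counts):
    """Total reads minus the two discard categories."""
    thrown = counts["No splint reads"] + counts["Under len cutoff"]
    return counts["Total reads"] - thrown
```

With the numbers above, 796038 + 668197 = 1464235 reads are discarded and 2873159 - 1464235 = 1408924 remain, so preprocessing did pass reads on to the consensus step.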

Output:

Aligning splints to reads with blat
Preprocessing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2205/2205 [19:48<00:00,  1.85it/s]
Catting psls: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2205/2205 [00:22<00:00, 99.51it/s]
Removing preprocessing files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2205/2205 [00:00<00:00, 2347.87it/s]
Calling consensi:   0%|                                                                                                                                                                                  | 0/2205 [1:44:10<?, ?it/s]
Catting consensus reads: 0it [00:00, ?it/s]
Catting subreads: 0it [00:00, ?it/s]
Removing files: 0it [00:00, ?it/s]

The system shows 100% usage with those 8 threads actively working, even though the consensus-calling step reports no progress. Any advice would be very welcome, including pointers to obvious errors I could be making! (And thanks for maintaining this repo.)
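
A progress bar stuck at 0 with empty outputs matches the symptom in the first issue above, where a worker exception stayed hidden until .get() was added. Whether that applies here is not known. As a sketch of one way to make worker failures visible without blocking, apply_async accepts an error_callback. The worker below is hypothetical, and ThreadPool is used only so the example runs anywhere; multiprocessing.Pool takes the same arguments.

```python
from multiprocessing.pool import ThreadPool


def call_consensus(batch_id):
    """Hypothetical worker: odd-numbered batches fail."""
    if batch_id % 2:
        raise ValueError(f"batch {batch_id} failed")
    return batch_id


def run_with_error_log(batch_ids):
    """Collect results and worker exceptions through callbacks."""
    results, errors = [], []
    pool = ThreadPool(2)
    for batch_id in batch_ids:
        pool.apply_async(
            call_consensus,
            (batch_id,),
            callback=results.append,      # called with the return value
            error_callback=errors.append,  # called with the exception
        )
    pool.close()
    pool.join()  # callbacks have all run once join() returns
    return sorted(results), sorted(str(e) for e in errors)


print(run_with_error_log(range(4)))
```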

As an aside, it would be really great if you were able to host the final bam for this paper on something like Zenodo (to compare pipeline outputs), but obviously understand that's outside the scope of this Github issue!

Thanks,

Andrew

C3POa.py: AttributeError: 'tuple' object has no attribute 'append'

I tried to process the preprocessed R2C2 fastq data with C3POa.py:

python3 ./C3POa.py --reads splint_reads/R2C2_raw_reads.fastq --path ./consensus --matrix ./NUC.4.4.mat --config config --slencutoff 1000 --groupSize 1000 --numThreads 16

It returns an AttributeError (see below):

Traceback (most recent call last):
  File "./C3POa.py", line 708, in <module>
    main()
  File "./C3POa.py", line 683, in main
    read_list = read_fastq_file(input_file)
  File "./C3POa.py", line 610, in read_fastq_file
    read_list[-1].append(line)
AttributeError: 'tuple' object has no attribute 'append'
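
The traceback only establishes that read_list[-1] was a tuple when .append was called on it; tuples are immutable and have no append method. The sketch below is not C3POa's read_fastq_file. It is a hypothetical reader, assuming well-formed four-line FASTQ records, that buffers each record in a list and converts it to a tuple only once it is complete.

```python
def read_fastq_records(lines):
    """Group FASTQ lines into (name, sequence, quality) tuples."""
    records = []
    current = []  # mutable buffer for the record being read
    for line in lines:
        current.append(line.rstrip("\n"))
        if len(current) == 4:
            header, seq, _plus, qual = current
            # Freeze the record as a tuple only after all four lines arrived.
            records.append((header[1:], seq, qual))
            current = []
    return records


print(read_fastq_records(["@read1\n", "ACGT\n", "+\n", "!!!!\n"]))
```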

The input file (R2C2_raw_reads.fastq) is the output of C3POa_preprocessing.py, with the splint position in the header.
The config file is tab-delimited:

poa     /data/analysis/bio-pipeline/poaV2/poa
racon   /data/analysis/racon/bin/racon
gonk    /data/analysis/gonk/gonk
minimap2        /data/analysis/minimap2/minimap2
consensus       /data/analysis/C3POa/consensus.py
racon   /data/analysis/racon/bin/racon
blat    /data/analysis/blat
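
Note that racon appears twice in the config above. A minimal sketch of reading such a two-column file and reporting repeated keys follows. It is a hypothetical helper, not C3POa's own parser, and it assumes paths contain no whitespace.

```python
def parse_config(lines):
    """Return ({tool: path}, [tools listed more than once])."""
    paths, duplicates = {}, []
    for line in lines:
        fields = line.split()  # splits on tabs or runs of spaces
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        tool, path = fields
        if tool in paths:
            duplicates.append(tool)
        paths[tool] = path  # the last entry wins
    return paths, duplicates
```

Run on the seven lines above, this returns six tools and reports racon as the only duplicate; both racon lines point to the same path, so the repeat is harmless in this case.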

Is there anything wrong with my settings? How to fix it? Thank you.

PingPing

Error in running C3POa.py

Hi, I am struggling with the C3POa.py script.
I was able to run the preprocessing script without an issue and can see the expected files.
But when I try to use these files with C3POa.py, I get this error:

rm: cannot remove `/home/jlee20/dataset/ONT/PRJNA448331/consensus//tmp1': No such file or directory
/home/jlee20/dataset/ONT/PRJNA448331/raw_data/preprocessed/splint_reads/1/R2C2_raw_reads.fastq
Traceback (most recent call last):
  File "/camhpc/pkg/C3POa/1.0/centos6/C3POa.py", line 755, in <module>
    main()
  File "/camhpc/pkg/C3POa/1.0/centos6/C3POa.py", line 752, in main
    analyze_reads(read_list)
  File "/camhpc/pkg/C3POa/1.0/centos6/C3POa.py", line 713, in analyze_reads
    score_list_f = split_SW(name, forward, step=True)
  File "/camhpc/pkg/C3POa/1.0/centos6/C3POa.py", line 485, in split_SW
    run_water(step, seq1, seq2, totalLen, diag_dict, diag_set)
  File "/camhpc/pkg/C3POa/1.0/centos6/C3POa.py", line 459, in run_water
    diag_set, diag_dict)
  File "/camhpc/pkg/C3POa/1.0/centos6/C3POa.py", line 512, in parse_file
    for line in open(matrix_file):
FileNotFoundError: [Errno 2] No such file or directory: 'SW_PARSE.txt'

Can you let me know what is going wrong? My bash script looks like this:

outpath=/home/jlee20/dataset/ONT/PRJNA448331/consensus/
path=/camhpc/pkg/C3POa/1.0/centos6/NUC.4.4.mat
path2=/camhpc/pkg/C3POa/1.0/centos6/config_file
path3=/home/jlee20/dataset/ONT/PRJNA448331/raw_data/preprocessed/splint_reads
fasta=consensus.fasta
output=$outpath$fasta
reads=/home/jlee20/dataset/ONT/PRJNA448331/raw_data/preprocessed/splint_reads/1/R2C2_raw_reads.fastq

python3 /camhpc/pkg/C3POa/1.0/centos6/C3POa.py --reads $reads --matrix $path --config $path2 --output $output -p $outpath

Thank you!
Joon

R2C2_C3POa_10X_data analysis

Hello,
Thanks for the amazing tool.
I'm trying to analyze long-read data generated on an ONT sequencer following your C3POa workflow, but the run does not progress past the "Calling consensi" step.
The tools and their dependencies are installed, and I prepared the UMI_Splint.fasta used in the experiment, but the process stops as shown below:

command:
(base) [ukhussein@ldragon3 C3POa-2.2.3]$ python3 C3POa.py -r ../../projects/nanopore_R2C2/10X_071_R2C2/test/dngqu0264_71_fastq_pass.tar.gz -s ./UMI_Splint.fasta/UMI_Splints.fasta -d 500 -l 100 -g 1000 -n 32 -o out2

Output:

Log Contents:
$ cat/out/c3poa.log
C3POa version: v2.2.3
Total reads: 1687451
No splint reads: 1505291 (89.21%)
Under len cutoff: 15 (0.00%)
Total thrown away reads: 1505306 (89.21%)
Reads after preprocessing: 182145

Could you please help me figure out what the problem is?
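
A minimal sketch for sanity-checking the counts in c3poa.log, assuming the log keeps the "label: count" layout shown above. The helper name `parse_counts` is made up for illustration and is not part of C3POa. It only confirms that the numbers are internally consistent; it does not explain why the run stops.

```python
import re

# Log text copied from the post above.
LOG = """\
C3POa version: v2.2.3
Total reads: 1687451
No splint reads: 1505291 (89.21%)
Under len cutoff: 15 (0.00%)
Total thrown away reads: 1505306 (89.21%)
Reads after preprocessing: 182145
"""

def parse_counts(log_text):
    """Return {label: integer count} for every 'label: N' line in the log."""
    counts = {}
    for line in log_text.splitlines():
        match = re.match(r"^(.+?):\s+(\d+)\b", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts

counts = parse_counts(LOG)
kept = counts["Total reads"] - counts["Total thrown away reads"]
kept_pct = 100 * kept / counts["Total reads"]
print(f"kept {kept} of {counts['Total reads']} reads ({kept_pct:.2f}%)")
```

With these numbers, roughly nine in ten reads had no detectable splint, so checking that the splint FASTA matches the one used in the library prep may be a reasonable first step.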

Splint and adapter sequences : B-Cell dataset

Hello,
I am trying to replicate the results from your paper for the 96 single-cell part, as I would like to get acquainted with the tool and see all the results.

I am trying to gather all the information needed to run C3POa, but:

I cannot find the DNA splint sequence (is the one in #3 OK for these samples?)
and I don't know how to obtain the file required by this option:
-a sequence of cDNA adapter sequences in fasta format. Sequence names must be 3Prime_adapter and 5Prime_adapter

Could you share the required fasta with the 96 adapter sequences, and the splint one if it's different?

thanks a lot,
Mattia

Error when running C3POa.py

Hi, I've tried running C3POa using data downloaded from your 10xR2C2 manuscript. C3POa_preprocessing.py works fine, but I was unable to get results from the C3POa.py step.

Here's my command running C3POa.py

mkdir -p ./C3POa_out/PBMCs-Rep1-MinION
python3 /histor/zhao/zhangjy/02.scNanopore/scripts/C3POa/C3POa.py \
        -r ./C3POa_preprocess/PBMCs-Rep1-MinION/UMI_Splint_1/R2C2_raw_reads.fastq \
        -p ./C3POa_out/PBMCs-Rep1-MinION \
        -m /histor/zhao/zhangjy/02.scNanopore/scripts/C3POa/NUC.4.4.mat \
        -n 8 \
        -l 1000 \
        -g 1000 \
        -d 500 \
        -c ./C3POa.config

Content of C3POa.config:

# Order doesn't matter
# If you use the config file, you should provide paths to all of the programs
# You need to include all of the example programs
# Use tabs to separate the program name from the path
poa /histor/zhao/zhangjy/02.scNanopore/scripts/C3POa/bio-pipeline/poaV2/poa
racon   /histor/zhao/zhangjy/02.scNanopore/scripts/C3POa/build/bin/racon
gonk    /histor/zhao/zhangjy/02.scNanopore/scripts/C3POa/gonk/gonk
minimap2    /histor/public/software/minimap2/minimap2
consensus   /histor/zhao/zhangjy/02.scNanopore/scripts/C3POa/consensus.py
blat    /histor/public/software/UCSC_utility/blat/blat

Output when running C3POa.py (every line reports the same kind of error):

./C3POa_preprocess/PBMCs-Rep1-MinION/UMI_Splint_1/R2C2_raw_reads.fastq
rm: cannot remove './C3POa_out/PBMCs-Rep1-MinION//tmp1': No such file or directory
rm: cannot remove './C3POa_out/PBMCs-Rep1-MinION//tmp2': No such file or directory
rm: cannot remove './C3POa_out/PBMCs-Rep1-MinION//tmp3': No such file or directory
rm: cannot remove './C3POa_out/PBMCs-Rep1-MinION//tmp4': No such file or directory
rm: cannot remove './C3POa_out/PBMCs-Rep1-MinION//tmp5': No such file or directory
rm: cannot remove './C3POa_out/PBMCs-Rep1-MinION//tmp6': No such file or directory
rm: cannot remove './C3POa_out/PBMCs-Rep1-MinION//tmp7': No such file or directory
rm: cannot remove './C3POa_out/PBMCs-Rep1-MinION//tmp8': No such file or directory
rm: cannot remove './C3POa_out/PBMCs-Rep1-MinION//tmp9': No such file or directory
mkdir: cannot create directory './C3POa_out/PBMCs-Rep1-MinION//tmp9': No such file or directory
rm: cannot remove './C3POa_out/PBMCs-Rep1-MinION//tmp10': No such file or directory
mkdir: cannot create directory './C3POa_out/PBMCs-Rep1-MinION//tmp10': No such file or directory
rm: cannot remove './C3POa_out/PBMCs-Rep1-MinION//tmp11': No such file or directory
mkdir: cannot create directory './C3POa_out/PBMCs-Rep1-MinION//tmp11': No such file or directory
rm: cannot remove './C3POa_out/PBMCs-Rep1-MinION//tmp12': No such file or directory
mkdir: cannot create directory './C3POa_out/PBMCs-Rep1-MinION//tmp12': No such file or directory
rm: cannot remove './C3POa_out/PBMCs-Rep1-MinION//tmp13': No such file or directory
mkdir: cannot create directory './C3POa_out/PBMCs-Rep1-MinION//tmp13': No such file or directory
rm: cannot remove './C3POa_out/PBMCs-Rep1-MinION//tmp14': No such file or directory
mkdir: cannot create directory './C3POa_out/PBMCs-Rep1-MinION//tmp14': No such file or directory
rm: cannot remove './C3POa_out/PBMCs-Rep1-MinION//tmp15': No such file or directory
mkdir: cannot create directory './C3POa_out/PBMCs-Rep1-MinION//tmp15': No such file or directory
...

And every R2C2_Consensus.fasta and R2C2_Subreads.fastq under the output directory is empty.

$ tree ./C3POa_out/PBMCs-Rep1-MinION
.
|-- R2C2_Consensus.fasta
|-- R2C2_Subreads.fastq
|-- tmp1
|   `-- R2C2_Consensus.fasta
|-- tmp2
|   `-- R2C2_Consensus.fasta
|-- tmp3
|   `-- R2C2_Consensus.fasta
|-- tmp4
|   `-- R2C2_Consensus.fasta
|-- tmp5
|   `-- R2C2_Consensus.fasta
|-- tmp6
|   `-- R2C2_Consensus.fasta
|-- tmp7
|   `-- R2C2_Consensus.fasta
`-- tmp8
    `-- R2C2_Consensus.fasta

8 directories, 10 files

Do you have any suggestions? Thanks!

Confusion about the C3POa_preprocessing.py and C3POa_postprocessing.py results ?

Hello,
I have two questions:

C3POa_preprocessing.py produces two types of output: R2C2_raw_reads.fastq and No_splint_reads.fastq. Can the reads in No_splint_reads.fastq be aligned directly to the appropriate genome? Do they contain incomplete splint or cDNA adapter sequences?
C3POa_postprocessing.py produces one result file: R2C2_full_length_consensus_reads_R2.fasta. What is the meaning of the last number in the read name (377 in the example below)? Subtracting 377 from the actual length of the sequence below gives 80. How should we interpret the number 80?

82d4e633-f330-464b-8c04-bcc093f37174_15.52_2488_2_735_377
GGCGACCAATGAGATCTTACACCGTCGGCAGCGTCAGATGTGTATAAGAGACAGTGAATTTGGTGGGAGCTTTGCTGAACTCCTCTACAGGTTCCGATTGTCTGAAGATGCCCGTCCGGGCTTTGCTAAGACTGGCTCCTGTGCTGCTTGGGAACCCACAGCCGATGGTCATGTGACCACCTCTGCCGAGGGAAACCCACCCTAGAAATGACTGCGGCACAGAACCCTGATGGCAGCAAAGTGAAGGCAGCATTTCTGTTTTGTAACCAATGGAACTCAGTTGCATACGCCTCACTGTGCTTAAAATTCATGTTGAAAATAAGACAGATAACGCTGGTGTTGTCCACGTGTCATATGATGTATAAAACATCAGTTAAAACTCACATTTTGTAACAAAGATTTTGTTTGTTTTCAAAAAAAAAAAAAAAAACATTTGCGTTGATACCACTGCTTAAAG
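
The arithmetic in the question can be reproduced with a short sketch. It splits the read name on underscores without assuming what each field means (that is the open question here), and it takes the sequence length of 457 from the post's own statement that length minus 377 equals 80. The function name `split_header` is hypothetical.

```python
def split_header(header):
    """Split a consensus read name into the read ID and its trailing fields."""
    read_id, *fields = header.lstrip(">").split("_")
    return read_id, fields

header = "82d4e633-f330-464b-8c04-bcc093f37174_15.52_2488_2_735_377"
read_id, fields = split_header(header)
last_value = int(fields[-1])

# Assumption: 457 is the sequence length implied by the post (457 - 377 = 80).
sequence_length = 457
print(read_id)
print(fields)
print(sequence_length - last_value)
```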

Issues when running C3POa.py

Hi, I've been trying to run C3POa.py but I'm running into some issues:
c3poa_test.log

I thought that racon might be the problem, but when I look at the racon_messages there don't seem to be any errors:
[racon::Polisher::initialize] loaded target sequences 0.000064 s
[racon::Polisher::initialize] loaded sequences 0.000112 s
[racon::Polisher::initialize] loaded overlaps 0.000172 s
[racon::Polisher::initialize] aligned overlaps 0.000247 s
[racon::Polisher::initialize] transformed data into windows 0.000022 s
[racon::Polisher::polish] generated consensus 0.007131 s
[racon::Polisher::] total = 0.007923 s

So I'm not quite sure what's going wrong here. Any help would be greatly appreciated! Thanks :)

Error when running C3Poa.py with 10x R2C2 nanopore data

Dear @rvolden,

Hello,
I am using C3POa.py to preprocess 10x based-Nanopore R2C2 sequencing data.

python3.7 C3POa/C3POa.py -r data/R2C2_q7_pass_merged.fastq -o c3poa_output -s data/10x_UMI_splint.fasta -c data/config -l 1000 -d 500 -n 10 -g 1000

When I did the first run, I thought it was going well, but I got the following error.

Reading existing psl file
 89%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                 | 4404/4934 [50:17<06:03,  1.46it/s]
cat: c3poa_output/Splint1/tmp*/R2C2_Consensus.fasta: No such file or directory
cat: c3poa_output/Splint1/tmp*/subreads.fastq: No such file or directory

Also, I obtained another error when I tried to do the same thing after erasing the c3poa_output/results.

Aligning splints to reads with blat
Loaded 200 letters in 1 sequences
Searched 40289422513 bases in 4933273 sequences
  0%|                                                                                                                                                                            | 0/4934 [6:49:10<?, ?it/s]
cat: c3poa_output/Splint1/tmp*/R2C2_Consensus.fasta: No such file or directory
cat: c3poa_output/Splint1/tmp*/subreads.fastq: No such file or directory

and then,

Aligning splints to reads with blat
Traceback (most recent call last):
  File "C3POa/C3POa.py", line 227, in <module>
    main(args)
  File "C3POa/C3POa.py", line 182, in main
    adapter_dict, adapter_set, no_splint = preprocess(blat, args.out_path, tmp_dir, read_list, args.splint_file, tmp_adapter_dict)
  File "/appl/applications/cellphonedb/cpdb-venv/bin/C3POa/bin/preprocess.py", line 24, in preprocess
    with open(align_psl) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'c3poa_output/tmp/splint_to_read_alignments.psl'

Could you comment on what might be causing these issues and what I need to do to resolve them?

---input information---
config:

# Order doesn't matter
# If you use the config file, you should provide paths to all of the programs
# You need to include all of the example programs
# Use tabs to separate the program name from the path
racon   /appl/racon/racon/build/bin/racon
blat    /appl/blat/blatSrc/bin/blat

10x_UMI_splint.fasta:

>Splint1
TGAGGCTGATGAGTTCCATANNNNNTATATNNNNNATCAC
TACTTAGTTTTTTGATAGCTTCAAGCCAGAGTTGTCTTTT
TCTCTTTGCTGGCAGTAAAAGTATTGTGTACCTTTTGCTG
GGTCAGGTTGTTCTTTAGGAGGAGTAAAAGGATCAAATGC
ACTAANNNNNTATATNNNNNGCGATCGAAAATATCCCTTT

Thank you for your support.
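
As a quick cross-check of the splint file, the sketch below counts the bases and the N positions in the FASTA shown above. The total can be compared against the "Loaded 200 letters in 1 sequences" line that blat printed. `read_fasta` is a hypothetical helper written for this illustration, not a C3POa function.

```python
# Splint FASTA copied from the post above.
SPLINT_FASTA = """\
>Splint1
TGAGGCTGATGAGTTCCATANNNNNTATATNNNNNATCAC
TACTTAGTTTTTTGATAGCTTCAAGCCAGAGTTGTCTTTT
TCTCTTTGCTGGCAGTAAAAGTATTGTGTACCTTTTGCTG
GGTCAGGTTGTTCTTTAGGAGGAGTAAAAGGATCAAATGC
ACTAANNNNNTATATNNNNNGCGATCGAAAATATCCCTTT
"""

def read_fasta(text):
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

splints = read_fasta(SPLINT_FASTA)
for name, seq in splints.items():
    print(name, len(seq), seq.upper().count("N"))
```

The length agrees with the blat message, so the file itself is being read as intended; the 20 N positions are the UMI placeholders.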

Very few subreads and consensus reads despite many reads after preprocessing

Hello,
I'm testing out C3POa v2.2.3 on a small test dataset (176,000 reads from a much larger PromethION run). I'm hoping to use C3POa's demultiplexing feature and I've prepared a splints file with four sequences. Initial processing looks good at first:

$ # python3 C3POa.py -r /data/chunk.fastq -s /data/splints.fasta -l 100 -d 500 -g 1000 -o out
Aligning splints to reads with blat
Preprocessing:  99%|█████████████████████████████████████████████████████████████████████████▌| 176/177 [02:09<00:00,  1.36it/s]
Catting psls: 100%|██████████████████████████████████████████████████████████████████████████| 176/176 [00:01<00:00, 129.95it/s]
Removing preprocessing files: 100%|█████████████████████████████████████████████████████████| 176/176 [00:00<00:00, 2590.99it/s]
Calling consensi:   0%|                                                                                 | 0/177 [02:25<?, ?it/s]
Catting consensus reads: 100%|█████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 11949.58it/s]
Catting subreads: 100%|███████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 7898.00it/s]
Removing files: 100%|█████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 4450.05it/s]
Catting consensus reads: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 9177.91it/s]
Catting subreads: 100%|███████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 8104.33it/s]
Removing files: 100%|█████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 4262.17it/s]
Catting consensus reads: 100%|███████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 13879.81it/s]
Catting subreads: 100%|██████████████████████████████████████████████████████████████████████| 87/87 [00:00<00:00, 11671.71it/s]
Removing files: 100%|█████████████████████████████████████████████████████████████████████████| 87/87 [00:00<00:00, 4974.09it/s]
Catting consensus reads: 100%|████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 7833.72it/s]
Catting subreads: 100%|██████████████████████████████████████████████████████████████████████| 82/82 [00:00<00:00, 10553.00it/s]
Removing files: 100%|█████████████████████████████████████████████████████████████████████████| 82/82 [00:00<00:00, 4555.04it/s]
(lr-c3poa) root@f8924132b3ed:/#

$ cat out/c3poa.log
C3POa version: v2.2.3
Total reads: 176000
No splint reads: 37306 (21.20%)
Under len cutoff: 0 (0.00%)
Total thrown away reads: 37306 (21.20%)
Reads after preprocessing: 138694

However, in checking the output subread and consensus files, I see very few entries:

# grep -c '^[>@]' out/10x_Splint_*/*
out/10x_Splint_1/R2C2_Consensus.fasta:4
out/10x_Splint_1/R2C2_Subreads.fastq:96
out/10x_Splint_2/R2C2_Consensus.fasta:1
out/10x_Splint_2/R2C2_Subreads.fastq:60
out/10x_Splint_3/R2C2_Consensus.fasta:19
out/10x_Splint_3/R2C2_Subreads.fastq:282
out/10x_Splint_4/R2C2_Consensus.fasta:13
out/10x_Splint_4/R2C2_Subreads.fastq:325

These seem like awfully low numbers to me, but it's not clear to me where they're getting lost. Shouldn't the total number of subreads add up to reads after preprocessing? And assuming 5-10 passes per subreads, shouldn't the number of consensus reads be somewhere between 14k - 30k reads? Is there a way to know what's happening to the rest of the reads? Or is my understanding simply incorrect?

Thanks,
-Kiran
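
One caveat about the counts above: on a FASTQ file, `grep -c '^[>@]'` also matches quality lines that happen to start with `@` or `>`, so it can overcount records. A minimal sketch that counts records by line number instead, assuming the usual four lines per record with no wrapping (the toy data is invented for illustration):

```python
import io

def count_fastq_records(handle):
    """Count records in a FASTQ stream that uses exactly four lines per record."""
    n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4

# Toy FASTQ in which both quality lines start with '@' or '>'.
toy = "@read1\nACGT\n+\n@>II\n@read2\nGGCC\n+\n>III\n"

true_count = count_fastq_records(io.StringIO(toy))
# What a '^[>@]' pattern would count on the same text.
naive_count = sum(1 for line in toy.splitlines() if line[:1] in (">", "@"))
print(true_count, naive_count)
```

For a real file, `count_fastq_records(open("out/10x_Splint_1/R2C2_Subreads.fastq"))` would give the record count; this does not change the question of why so few reads survive, it only makes the subread numbers trustworthy.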

C3POa_preprocessing halts when parsing splint blat alignments

Hi there,

I've been trying to run C3POa to error correct some RCA data (albeit using a custom protocol not R2C2 - thank you for writing a program flexible enough to accommodate new splint sequences).

I think the program installed correctly and ran normally - it creates a "R2C2_temp_for_BLAT.fasta" and a "Splint_to_read_alignments.psl", the latter of which has 349408 lines - so we are clearly finding my custom splint sequence all over the place as expected.

However, the program halts after this, writing to stderr:

Traceback (most recent call last):
File "C3POa_preprocessing.py", line 214, in <module>
main()
File "C3POa_preprocessing.py", line 209, in main
adapter_dict = parse_blat(output_path)
File "C3POa_preprocessing.py", line 148, in parse_blat
adapter_dict[read_name][strand].append((adapter, float(a[0]), position))
KeyError: '495d8ff0-8c89-4f54-bca7-9344e74a1951'

This is the name of the first read in the Splint_to_read_alignments.psl file. So I'm reading this as a problem with the parsing? The .psl has this format (first 10 lines):

64 1 0 0 0 0 1 2 - 495d8ff0-8c89-4f54-bca7-9344e74a1951 1358 545 610 Splint_TruSeq 67 0 67 2 12,53, 748,760, 0,14,
65 0 0 0 1 1 1 2 + 71b57cbd-a089-456d-8822-2ff160f5c5fc 1066 275 341 Splint_TruSeq 67 0 67 3 7,42,16, 275,283,325, 0,7,51,
25 0 0 0 0 0 1 3 - 71b57cbd-a089-456d-8822-2ff160f5c5fc 1066 798 823 Splint_TruSeq 67 35 63 2 7,18, 243,250, 35,45,
26 0 0 0 1 41 1 41 + 22de2a2c-2a66-4fb9-a79b-7ab146ef45a2 1208 355 422 Splint_TruSeq 67 0 67 2 13,13, 355,409, 0,54,
65 1 0 0 1 1 1 1 - 22de2a2c-2a66-4fb9-a79b-7ab146ef45a2 1208 355 422 Splint_TruSeq 67 0 67 3 17,8,41, 786,804,812, 0,17,26,
58 0 0 0 1 4 2 9 + 9c6466b8-df94-44cf-9d2d-102aff9e1e51 1031 364 426 Splint_TruSeq 67 0 67 3 28,12,18, 364,396,408, 0,35,49,
26 0 0 0 1 36 1 41 - 9c6466b8-df94-44cf-9d2d-102aff9e1e51 1031 364 426 Splint_TruSeq 67 0 67 2 13,13, 605,654, 0,54,
60 1 0 0 0 0 3 6 + b9318330-dbb3-4aff-8a22-fe48469af700 1047 414 475 Splint_TruSeq 67 0 67 4 10,23,10,18, 414,424,447,457, 0,13,37,49,
59 0 0 0 3 4 4 8 + 7e872ed6-c0cb-4cf1-a9e1-8f2f7e9e50fc 1298 1154 1217 Splint_TruSeq 67 0 67 6 7,13,5,9,7,18, 1154,1162,1177,1182,1191,1199, 0,7,21,27,38,49,
59 3 0 0 1 6 1 5 - 67c37574-ba0a-4a3a-aff4-f05cf80a894c 1047 928 996 Splint_TruSeq 67 0 67 2 16,46, 51,73, 0,21,

Would be grateful for your help in figuring out what's going wrong - I'm really excited to have a crack at these data.

Regards,
Chris L

EDIT - I'm running python 3.6.8 as part of the Anaconda distribution, if that helps at all.
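
For anyone hitting the same KeyError: the failing line appends to `adapter_dict[read_name][strand]`, which raises if `read_name` was never added to the dictionary beforehand. The sketch below is not C3POa's `parse_blat`; it is a standalone illustration, using three of the PSL lines above, of grouping hits with a `defaultdict` so that unseen read names cannot raise. Taking column 12 (the start coordinate next to the read name) as the position is an assumption.

```python
from collections import defaultdict

# Three alignment lines copied from the post above.
PSL_LINES = [
    "64 1 0 0 0 0 1 2 - 495d8ff0-8c89-4f54-bca7-9344e74a1951 1358 545 610 Splint_TruSeq 67 0 67 2 12,53, 748,760, 0,14,",
    "65 0 0 0 1 1 1 2 + 71b57cbd-a089-456d-8822-2ff160f5c5fc 1066 275 341 Splint_TruSeq 67 0 67 3 7,42,16, 275,283,325, 0,7,51,",
    "25 0 0 0 0 0 1 3 - 71b57cbd-a089-456d-8822-2ff160f5c5fc 1066 798 823 Splint_TruSeq 67 35 63 2 7,18, 243,250, 35,45,",
]

def group_hits(lines):
    """Group PSL hits as {read_name: {strand: [(adapter, matches, position)]}}."""
    hits = defaultdict(lambda: defaultdict(list))
    for line in lines:
        a = line.split()
        if len(a) < 21 or not a[0].isdigit():
            continue  # skip PSL header lines and anything malformed
        strand, read_name, adapter = a[8], a[9], a[13]
        position = int(a[11])  # assumption: start coordinate within the read
        hits[read_name][strand].append((adapter, float(a[0]), position))
    return hits

hits = group_hits(PSL_LINES)
for read_name, by_strand in hits.items():
    print(read_name, dict(by_strand))
```

If the original code fills `adapter_dict` from the FASTQ read names first, a mismatch between those names and the names in the .psl (for example, extra text after the read ID) would be one thing to rule out.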

question and IndexError: list index out of range

Hello,

I would like to use the pipeline to analyze very short amplicons (250 to 500 bp). Do you think that is possible with this pipeline?

I've tried to use it. I installed all the requirements and I'm using Python (3.6) and NumPy in an environment. I installed the rest of the software with setup.py (latest version), along with Go (the install went fine) and blat; I had some problems with the path for blat, but I finally got it working.
When I run this command for the preprocessing:

(phy3)python C3POa_preprocessing.py -i BC26.fastq -o /home/ivan/C3POa -q 7 -l 160 -s Adapters_1d2.fasta -c configf.txt

I obtained:
Traceback (most recent call last):
File "C3POa_preprocessing.py", line 63, in <module>
progs = configReader(args['config'])
File "C3POa_preprocessing.py", line 41, in configReader
progs[line[0]] = line[1]
IndexError: list index out of range

The configf.txt file (I've copied your example):

# Order doesn't matter
# If you use the config file, you should provide paths to all of the programs
# You need to include all of the example programs
# Use tabs to separate the program name from the path

poa /home/ivan/C3POa/bio-pipeline/poaV2/poa
racon /home/ivan/C3POa/racon/bin/racon
water /home/ivan/C3POa/EMBOSS-6.6.0/emboss/water
minimap2 /home/ivan/C3POa/minimap2/minimap2
consensus /home/ivan/C3POa/consensus.py
racon /home/ivan/C3POa/racon/bin/racon
blat /home/ivan/blatSrc

I changed /home/ivan/blatSrc to blat, but the result is the same.
What do you think the problem could be? The blat installation?

Thank you very much
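
The traceback points at `progs[line[0]] = line[1]`, which fails whenever a config line does not split into two fields. If the file separates names and paths with spaces rather than tabs, or contains blank lines, a tab-based split yields a single element and `line[1]` is out of range. The sketch below is a more forgiving reader written for illustration only; it is not the `configReader` in C3POa, whose full code is not shown here.

```python
def read_config(lines):
    """Read 'name<whitespace>path' pairs, skipping blanks and '#' comments."""
    progs = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 1)  # split on the first run of any whitespace
        if len(fields) != 2:
            raise ValueError(f"expected 'name<TAB>path', got: {raw!r}")
        progs[fields[0]] = fields[1].strip()
    return progs

example = [
    "# Use tabs to separate the program name from the path",
    "",
    "poa /home/ivan/C3POa/bio-pipeline/poaV2/poa",    # space-separated
    "racon\t/home/ivan/C3POa/racon/bin/racon",        # tab-separated
]
progs = read_config(example)
print(progs)

# Why a tab-only split breaks on the space-separated line:
print("poa /home/ivan/C3POa/bio-pipeline/poaV2/poa".split("\t"))
```

In practice the simpler fix is probably to re-save the config with real tab characters and no empty lines, then rerun.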

duplicate subread record issue

There are duplicate subread records after running C3POa.py.

Can this be a problem for the consensus reads generated by C3POa.py?
Also, is it okay to proceed with the downstream 10xR2C2 processing with these duplicate reads in subreads.fastq?

[hrs@node35 Splint1]$ cat R2C2_Subreads.fastq | awk 'NR%4==1'  | uniq -c | awk '$1!=1' | head
      3 @8746a94c-ed1c-4415-8ff6-46b3410d05e4_subread_2
      3 @ae3b22d7-e17b-4c0a-a601-816dd374ea95_subread_2
      2 @d33dedd5-b1b1-4e79-8894-00353593798b_subread_0
      3 @d33dedd5-b1b1-4e79-8894-00353593798b_subread_1
      2 @d33dedd5-b1b1-4e79-8894-00353593798b_subread_2
      7 @d33dedd5-b1b1-4e79-8894-00353593798b_subread_3
      3 @f4837886-b2ea-4ed0-95d5-13a4c48a21c3_subread_17
      3 @686f952f-49da-4e18-9d98-5b6038b8cded_subread_12
      7 @d5da835f-1e94-4a42-a609-6743df4b8bdf_subread_0
      3 @d5da835f-1e94-4a42-a609-6743df4b8bdf_subread_2

[hrs@node35 Splint1]$ cat R2C2_Subreads.fastq | awk 'NR%4==0'  | uniq -c | awk '$1!=1' | head | less -S
      3 A?<8FAC<>?9678;(009;<.*8ABN@>?=>;@9:91):*//=IGTMPCIF=AA;;?BADAF?9<AAEGB=>?FGAA013AB25;41D=@B?>9>E<;<>G<<=?==%%'%$4:7*3+&&($%$*97,=D@B=GE9<A?BBKB;:':<<CC:0,@?&/5&'2A@@;:@BB>A@E=948:;754>>?;=8?AFG9'0,;?+((&&??DBC
      3 AB??.03;<9<KHHD3<:@@:C@7=@@7)@<:>?A5?::53(878824<AB=HHIBCB5?44124->CCBADD?@L68946A@>25DIGCFDFE:[email protected])*+*/5799;;6/&%&7<<769?84:944@?DHIDGEI9;CBDPSIHAFIBHCC>AB>?@-2@BH0DIJLE>A@=?44AGHDKKJAA<-=CC?
      2 26022;4<AD8;>EB6+.<AA54:500-=?;?<>,,,++8..04FFPMHAC8<>?:/--'*(.-/((-<=?DCKDEB????204,.88::;>79;?<5632%'222BC=?B=:?C;HHE=9CED?LJG?4HG?/;<?A?=75998:620.,*))('''&&####"%#$$$$$$%%&&&'())((''&%'14<D>:/,?=>ACG8-%.GEI
      3 101;:@BBC=?AB=47&;??-+&$*+-+%%$$##$"%*/0;6CDDE@<90..$&$&%<?>@@A@AFECDE?B>BDC><.568@?><=?9<<=90:88<400/-&&)+:A=A@9:8<:5111&+187CC928EDB===?8010-)1>?2,,,+++++++,,-//00/..-+)'&&%%%%03578;88:=01''8''&)----.//0**78:
      2 =<7.FA@<>B:=A@;0)*8FB205A?C>BBA?7+%(()**34<<CAC1)(7221+*23-%$+(057:@>AA=B1))-.76886=FEAD?==;-1555;+))++27:01=?=EFA?7:A<CIEADE@@>><BA5;=<9=CGBGB7%%(<?2'&%%%$$$$$$$$%%%%%%&&''''''''''''&&''''''&&&&%*129?BC=II@;;9
      7 -0.--8<?C;DDF<?EFA79<?CC=CD:,:?:587<<>BF:><8<;2@>;<>41-;-57?A>=<</+&1$%+0AA=E:>04;?DGHGKKMHJB4*(*=D39;AMNEQNNLLEBIB=@EE?A@E@DDBSKD<;55@55435+0--11&/3/>+28)70.$+443:,A:71)$:65=>>;8;CAD8>EDCGEAE?EADJJKKD@A@C;?A?=
      3 34+)555336&+-5*3:(<E;024589:;2//)$$$$('257:9?=:99+7).676:;=;;;;:4*+,$,*++)'''&3236622995893,45:633712$$0.&'487986646<57/*022:8>>A@>7),68863966-**022/.,+*('$*//45642748:<3-239-657'$%$%$%#&$$)*-0/,%.8;:9;=:7-'((+
      3 5=??>..??@UPRKAF@CG?C?%/:*,6;<>@B=9BHGJ=9<67:KJ=CKHBOG@;<@9.7:=F>DDFHECEAAAKLIB8<C@89;BHD>?@AA?FFC9854420/0,/:>=>;AEHCC@>C?A--1<AIGG<>9B?A3;A;<@;:C<<CDHSO3>?B?AB>:F@KIGCMC;;=@?66999<@@B=<732889=;79?;<=>@>--885(
      7 >?$'),)*+576><>B@026><>>A;()8<=5GEAGF;01+1,,7DCC=7=>>0:8A;<>A>CB@>:9:??+''01/29.100)&+')563$$$$%&967:<A=D=;<;ACBAED@B3--@5?>A>@;<555,1>?>AD@DEFBA8((*7&547?;0DFCAB@A>>@AEBLLE@>98(*4478?9>>CAEIBAECI>@E@AH>BEF??@?
      3 )%&,*01212<:<677<+048897-<?.*&)8221'(%99>=?>*:@AA<;:;7=?<&&+*(+FCD66>AA=D?<;'D?;B&&&4<4<9<@A+8<CBL<G??<?<<?>G:?91%111.:A?9=;A:@?DC;>?CE>C@'++,214;=@CE<DE8;IC??2B>>?@>C><-*+8:;AKHJMDC<<=DNNL5QHC:>DF><245JH:BC<<=
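
Note that `uniq -c` only collapses adjacent identical lines, so the counts above would miss duplicates that are scattered through the file. A minimal sketch that counts every header regardless of position, assuming four lines per FASTQ record (the toy records are invented for illustration):

```python
from collections import Counter
import io

def duplicated_headers(handle):
    """Return {header: count} for FASTQ headers that occur more than once."""
    counts = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 0:  # first line of each four-line record
            counts[line.rstrip("\n")] += 1
    return {name: n for name, n in counts.items() if n > 1}

# Toy file: the duplicate of @a_subread_0 is not adjacent to the original.
toy = (
    "@a_subread_0\nACGT\n+\nIIII\n"
    "@b_subread_0\nAC\n+\nII\n"
    "@a_subread_0\nACGT\n+\nIIII\n"
)
dups = duplicated_headers(io.StringIO(toy))
print(dups)
```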

Error when using bsub

Hi, thanks for developing this program. I ran C3POa on the login nodes and it worked. But when I used bsub to submit the job to the compute nodes, it got stuck at the "Calling consensi" step.

Aligning splints to reads with blat
Preprocessing: 100%|██████████| 10/10 [00:06<00:00,  1.59it/s]
Catting psls: 100%|██████████| 10/10 [00:00<00:00, 86.01it/s]
Removing preprocessing files: 100%|██████████| 10/10 [00:00<00:00, 287.00it/s]
Calling consensi:   0%|          | 0/10 [00:00<?, ?it/s]

the command I used is this:
bsub -q TEST-A -n 32 -e c3pao.err -o c3poa.out 'python ~/software/c3poa/C3POa/C3POa.py -r ../01.QC/nanofilt/rep1/test.fastq -o output_test/ -s ../10x_UMI_splint.fasta -c c3pao.config -l 1000 -d 500 -n 32 -g 1000'

Do you know how to run C3POa as a job submitted with bsub? Thank you!

Best regards,
Chujie

No fastq files generated

Hi Roger,

nice tool and paper. I was going to test your algorithm on some of my RCA nanopore data and everything seems to be going okay based on the messages I get (set the minimum length to 20 kb to make testing faster). I have successfully used the same data for INC-seq before.

$ python3 C3POa_preprocessing.py -i readsfilter.fastq -o /Users/user/C3POa/output -q 8 -l 20000 -s Splint_Aga2.fasta -c configblat.txt

Using minimap2 from your path, not the config file.
Using consensus from your path, not the config file.
Using water from your path, not the config file.
Using poa from your path, not the config file.
Using racon from your path, not the config file.
Reading and filtering fastq file
Running BLAT to find splint locations (This can take hours)
Loaded 288 letters in 1 sequences
Searched 6953471 bases in 283 sequences
Parsing BLAT output
Writing fastq output files in bins of 4000 into separate folders

However, the fastq files are nowhere to be found. The splint_reads folder stays empty. Any idea what the reason might be? The Splint_to_read_alignments.psl appears to contain sensible information.
Thanks!
Philipp

Question about -d parameter?

Hi,

-d median distance between peaks cutoff. This should be the length of your shortest
input sequence in your library preparation. Defaults to 500

I have a question about the -d parameter.
What does "median distance between peaks cutoff" mean?

Is the median distance the median length of a1 and b1, of a1, b1 and a2, or something else?
Does "between peaks" mean between two splint sequences?

Thanks.

HJTsai
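
To make the question concrete, here is one possible reading of the help text, offered as an assumption rather than a description of what C3POa does internally: take the positions along a raw read where repeats are detected (the "peaks"), compute the gaps between neighbouring positions, and compare the median gap with the `-d` value. The positions below are hypothetical.

```python
from statistics import median

def median_peak_distance(peak_positions):
    """Median gap between consecutive peak positions along one read."""
    ordered = sorted(peak_positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two peaks to measure a distance")
    return median(gaps)

# Hypothetical peak positions (in bases) along a single raw read.
peaks = [120, 1130, 2150, 3140]
distance = median_peak_distance(peaks)
cutoff = 500  # the documented default for -d
print(distance, distance >= cutoff)
```

Under this reading the median is taken over all gaps in the read, not over a fixed pair such as a1 and b1.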

DNA splint sequence

I wanted to try out this script using the SRA-archived SIRV data, but I cannot find a DNA splint sequence supplied. I think I worked out the sequence by finding the fwd and rev primer sequences in a downloaded lambda DNA fasta file, but I am unsure whether this is the exact lambda DNA sequence you used.

The sequence I derived is:

TGAGGCTGATGAGTTCCATATTTGAAAAGTTTTCATCACTACTTAGTTTTTTGATAGCTTCAAGCCAGAGTTGTCTTTTTCTATCTACTCTCATACAACCAATAAATGCTGAAATGAATTCTAAGCGGAGATCGCCTAGTGATTTTAAACTATTGCTGGCAGCATTCTTGAGTCCAATATAAAAGTATTGTGTACCTTTTGCTGGGTCAGGTTGTTCTTTAGGAGGAGTAAAAGGATCAAATGCACTAAACGAAACTGAAACAAGCGATCGAAAATATCCCTTT

How to remove the 3' and 5' adapters or the polyA tail from the postprocessing sequences

Hi,
Once I have the postprocessing sequences, what I need to do next is remove the 3' and 5' adapters and the polyA tail. Do you have a good implementation or software for this? Thanks.
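
A very rough sketch of the idea, not a recommendation of a specific tool: strip a trailing run of A's with a regular expression and remove adapters by exact prefix and suffix match. Both function names and the `min_run` threshold are hypothetical, and exact matching is an oversimplification, since consensus reads can still carry errors in the adapter region; a dedicated trimmer that tolerates mismatches is likely the better choice for real data.

```python
import re

def trim_poly_a(seq, min_run=10):
    """Remove a trailing run of at least `min_run` A's; leave shorter runs alone."""
    return re.sub(rf"A{{{min_run},}}$", "", seq)

def trim_adapters(seq, five_prime, three_prime):
    """Remove exact-match adapters from the start and end of a sequence."""
    if five_prime and seq.startswith(five_prime):
        seq = seq[len(five_prime):]
    if three_prime and seq.endswith(three_prime):
        seq = seq[:-len(three_prime)]
    return seq

# Toy sequences invented for illustration.
print(trim_poly_a("ACGT" + "A" * 12))
print(trim_poly_a("ACGTAAA"))
print(trim_adapters("GGACGTCC", "GG", "CC"))
```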

Issues with C3POa.py: cannot execute gonk

I'm having issues with the C3POa.py script at the gonk stage. Preprocessing worked out well, but as I run C3POa.py on one of my fastq files, I am getting this error from bash:

Traceback (most recent call last):
File "/Users/ckim/C3POa/C3POa.py", line 663, in <module>
main()
File "/Users/ckim/C3POa/C3POa.py", line 656, in main
analyze_reads(read_list)
File "/Users/ckim/C3POa/C3POa.py", line 624, in analyze_reads
scoreList = split_SW(name, seed, seq)
File "/Users/ckim/C3POa/C3POa.py", line 419, in split_SW
scoreList = runGonk(seq1, seq)
File "/Users/ckim/C3POa/C3POa.py", line 388, in runGonk
scoreList = parse_file(scores)
File "/Users/ckim/C3POa/C3POa.py", line 373, in parse_file
for line in open(scores):
FileNotFoundError: [Errno 2] No such file or directory: '/Users/ckim/20200213_0143_20200212_ASD_mCh_R2c2test///SW_PARSE.txt'

When I check the gonk_messages to see the error, this is the error reported:

sh: /Users/ckim/C3POa/gonk/gonk: cannot execute binary file

I made sure to install the Go dependency for gonk. I followed the setup instructions from the beginning, but haven't been able to get any further with the script. Any help is appreciated, and I'm happy to give more information as needed. Thanks!
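
"cannot execute binary file" is typically what the shell reports when a file is in an executable format the current system cannot run, for example a Linux build used on macOS. One way to check is to look at the first bytes of the file. The sketch below is a generic diagnostic, not part of C3POa; it is demonstrated on a throwaway file, and on the real system one would call `binary_format("/Users/ckim/C3POa/gonk/gonk")` instead.

```python
import os
import tempfile

# Leading bytes ("magic numbers") of common executable formats.
MAGIC = {
    b"\x7fELF": "ELF (Linux and most other Unix systems)",
    b"\xcf\xfa\xed\xfe": "Mach-O 64-bit (macOS)",
    b"\xce\xfa\xed\xfe": "Mach-O 32-bit (macOS)",
    b"\xca\xfe\xba\xbe": "Mach-O universal binary (macOS; also Java class files)",
}

def binary_format(path):
    """Guess the executable format of a file from its first four bytes."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head[:2] == b"#!":
        return "script with a shebang line"
    return MAGIC.get(head, "unknown")

# Demonstration on a temporary file that merely starts with the ELF magic bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x7fELF" + b"\x00" * 12)
demo_path = tmp.name
demo_format = binary_format(demo_path)
os.remove(demo_path)
print(demo_format)
```

If the gonk binary turns out to be ELF on a Mac, rebuilding it locally with Go would be the likely next step.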

Number of repeats used for consensus

I have generated consensus sequences for different datasets using C3POa. I am trying to compute some statistics by establishing a correlation between the number of subreads and the accuracy of the consensus. When I split the output file based on the information in the header of each consensus sequence in the C3POa output, I noticed that in all my output files the count jumps from "1" to "3", with no sequences labelled "2". I have checked my input file, and I have data that should fall into the "2" category. I am not sure why this is happening or whether I am misunderstanding the output file. Thanks for your assistance!
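
A minimal sketch of the tally described above, useful for confirming the gap independently of any splitting script. It assumes the repeat count is the fourth underscore-separated field of the header (index 3), which is an assumption to verify against the C3POa README for the version in use; the field index is a parameter for that reason. The first header is taken from another issue on this page, and the other two are hypothetical.

```python
from collections import Counter

def repeat_histogram(headers, field=3):
    """Count consensus reads per value of one underscore-separated header field."""
    counts = Counter()
    for header in headers:
        parts = header.lstrip(">").split("_")
        counts[int(parts[field])] += 1
    return counts

headers = [
    ">82d4e633-f330-464b-8c04-bcc093f37174_15.52_2488_2_735_377",
    ">readA_12.10_5000_1_900",  # hypothetical
    ">readB_13.40_9000_3_880",  # hypothetical
]
histogram = repeat_histogram(headers)
for repeats in sorted(histogram):
    print(repeats, histogram[repeats])
```

Read IDs that themselves contain underscores would shift the field index, so that is worth checking before trusting the tally.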