It seems that "igdiscover dereplicate" is not processing any sequence. I installed IgD

Complain if the input reads contain non-nucleotide characters/spaces about igdiscover-legacy HOT 10 CLOSED

bridgelan commented on August 29, 2024

Complain if the input reads contain non-nucleotide characters/spaces

from igdiscover-legacy.

Comments (10)

marcelm commented on August 29, 2024

Thanks for providing a full log file; that makes it easier to find out what is going on!

The problem appears to be the primer trimming step in IgDiscover. IgDiscover uses Cutadapt to remove them and it also uses the --discard-untrimmed option which makes it throw away all reads that do not contain a primer. Since you have already removed primer sequences, all reads are thrown away. You can see that in the log file:

Total reads processed:                 281,290
Reads with adapters:                         0 (0.0%)
Reads written (passing filters):             0 (0.0%)

Creating the stats/trimming-succesful file by hand unfortunately cannot solve the problem because IgDiscover will then just try to continue with an empty file.

IgDiscover should work correctly if you just give it the unprocessed reads. It will then merge paired-end reads and remove primers. If you want to continue to do these steps yourself, you need to not provide any primers. That is, give an empty list of primers in for the forward_primers: and reverse_primers: settings in the configuration.

from igdiscover-legacy.

marcelm commented on August 29, 2024

I've analyzed this file with IgBlast before, and it seems non-corrupted: it detected 90.000 productive sequences in it. Is this too small for IgDiscover?

It’s a bit low compared to what we had available when developing IgDiscover (roughly one million input reads, I don’t remember how many of those were productive), but this means only that you may not be able to discover some rarely expressed germline sequences. You should definitely get some results.

from igdiscover-legacy.

bridgelan commented on August 29, 2024

Thanks for the really fast reply! Just to confirm if I understood it well:

My first option: give to IgDiscover the unprocessed reads;
Second option: execute again with the processed reads, but with an empty primer list.

And the empty list should have some empty line, like

forward_primers:
- 
reverse_primers:
-

or is completely empty, like

forward_primers:
reverse_primers:

from igdiscover-legacy.

bridgelan commented on August 29, 2024

I ran it again, with the "completely empty" option, and it seems the cutadapt problem was solved:

igdiscover dereplicate --json=stats/groups.json --minimum-length=300 reads/4-trimmed.fasta.gz | pigz > reads/sequences.fasta.gz
INFO: 281290 sequences processed
INFO: 157215 sequences long enough
INFO: 98092 dereplicated sequences written
INFO: CPU time 00:00:03.1. Maximum memory usage 0.323 GB

But now, a new error appeared, haha.

time igdiscover igblast --sequence-type=Ig --rename 'testV3D0M'_ --threads=8 --stats=iteration-01/stats/assigned.json iteration-01/database reads/sequences.fasta.gz | pigz > iteration-01/assigned.tab.gz
INFO: Running IgBLAST on database sequences to find CDR/FR region locations
INFO: Running IgBLAST on input reads
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/site-packages/igdiscover/cli/igblast.py", line 226, in __call__
    records = list(parser)
  File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/site-packages/igdiscover/parse.py", line 706, in __iter__
    yield self._parse_record(record_lines, fasta_record)
  File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/site-packages/igdiscover/parse.py", line 786, in _parse_record
    assert qsequence == full_sequence[hit.query_start:hit.query_start+len(qsequence)]
AssertionError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/bridgelan/miniconda3/envs/igdiscover/bin/igdiscover", line 10, in <module>
    sys.exit(main())
  File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/site-packages/igdiscover/__main__.py", line 95, in main
    to_run()
  File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/site-packages/igdiscover/__main__.py", line 93, in <lambda>
    to_run = lambda: subcommand(args)
  File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/site-packages/igdiscover/cli/igblast.py", line 392, in main
    for record in igblast(database, sequences, sequence_type=args.sequence_type,
  File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/site-packages/igdiscover/cli/igblast.py", line 359, in igblast
    for igblast_output, igblast_records in pool.imap(runner, chunks, chunksize=1):
  File "/home/bridgelan/miniconda3/envs/igdiscover/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
AssertionError
Error memory mapping:/tmp/tmp18lnhlkp/J.nsq openedFilesCount=30 threadID=2
Error memory mapping:/tmp/tmp18lnhlkp/D.nhr openedFilesCount=29 threadID=2
WORKER: T2 BATCH # 36 CEXCEPTION: Attempt to access NULL pointer.
WORKER: T2 BATCH # 36 CEXCEPTION: Attempt to access NULL pointer.
Error memory mapping:/tmp/tmp18lnhlkp/V.nhr openedFilesCount=33 threadID=1
WORKER: T1 BATCH # 36 CEXCEPTION: Cannot memory map /tmp/tmp18lnhlkp/V.nhr. Number of files opened: 33
Error memory mapping:/tmp/tmp18lnhlkp/V.nhr openedFilesCount=33 threadID=1
WORKER: T1 BATCH # 36 CEXCEPTION: Cannot memory map /tmp/tmp18lnhlkp/V.nhr. Number of files opened: 33
Error memory mapping:/tmp/tmp18lnhlkp/V.nhr openedFilesCount=33 threadID=1
WORKER: T1 BATCH # 36 CEXCEPTION: Cannot memory map /tmp/tmp18lnhlkp/V.nhr. Number of files opened: 33

real	0m48,464s
user	1m27,708s
sys	0m0,974s
[Mon Jul  5 10:16:17 2021]
Error in rule igdiscover_igblast:
    jobid: 52
    output: iteration-01/assigned.tab.gz, iteration-01/stats/assigned.json
    shell:
        time igdiscover igblast --sequence-type=Ig --rename 'testV3D0M'_ --threads=8 --stats=iteration-01/stats/assigned.json iteration-01/database reads/sequences.fasta.gz | pigz > iteration-01/assigned.tab.gz
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job igdiscover_igblast since they might be corrupted:
iteration-01/assigned.tab.gz
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/bridgelan/Documentos/testV3D0M/.snakemake/log/2021-07-05T101517.688811.snakemake.log
Total CPU time: 0h 2.00m
ERROR:

The problem seems to be coming from genes not annotated in "human.ndm.imgt", correct? Or does it have to do with the database I'm using? Should I open a new issue for this?

from igdiscover-legacy.

marcelm commented on August 29, 2024

I’m not actually at work at the moment, so I don’t have time for fully debugging this right now, perhaps tonight. In any case, you can ignore the second part of the stacktrace (the part where you got the WORKER: ... messages), that is just an error that IgBLAST throws because it got interrupted due to the actual error. IgDiscover runs IgBLAST with a custom database, so whether things are annotated in human.ndm.imgt or not does not matter, so that shouldn’t be the problem.

The problem is here:

assert qsequence == full_sequence[hit.query_start:hit.query_start+len(qsequence)]

and I will need to look into the code to understand why that assertion is there.

If possible, it would be helpful if I had access to the contents of the iteration-01/database/ folder and your reads/sequences.fasta.gz. You can send the files to me by e-mail (my address is on my GitHub profile), and I promise not to share the data and to delete the files as soon as possible.

You don’t need to open a new issue at the moment.

from igdiscover-legacy.

bridgelan commented on August 29, 2024

I sent you the Google Drive to the folder containing the data you requested.

I decided to sent you the full folder, in the case it's necessary to have them.

I'm really thankful for your help, Marcel!

At the same time, I'll annotate this sequences with IgBlast to confirm the data integrity again.

from igdiscover-legacy.

marcelm commented on August 29, 2024

Thanks a lot for the files, this has made it a lot easier to find out what is going on.

The problem is that your input reads contain spaces within the sequence. IgBLAST ignores them, but IgDiscover does not. As a sanity check, IgDiscover tests whether the query sequence that it sends to IgBLAST is the same as the one reported back from IgBLAST, and that fails because of the extra space character.

May I ask which tool generated that file? If that is a commonly used tool, it may be worth supporting spaces in reads in IgDiscover (but it’s unlikely I have the time to implement this anytime soon).

For now, I suggest you either run the tool that you used in such a way that it does not insert the spaces, or that you manually remove the spaces from the files that you already have. Here’s a sed command for this (that only works if each FASTA record occupies exactly two lines):

zcat reads.fasta.gz | sed '2~2s/ //' | pigz > reads-nospace.fasta.gz

from igdiscover-legacy.

bridgelan commented on August 29, 2024

The tool is not commonly used, it's a specific piece of a VDJ pipeline code that I'm using to correct the sequences through barcodes, coming from this article.

I will reexamine the input data and the pipeline code to understand what's inserting these spaces. I don't think it's worth your time to support this feature in IgDiscover, haha: I'm afraid that this correction script is generating unwanted modifications in the sequences. It might be interesting to raise a specific Python error in this case, but I don't know if it's a priority, space characters are really unexpected in the middle of a FASTA sequence line (until now, haha).

Marcel, I'm sincerely thankful for your time to help me, man. Keep up the good work!

from igdiscover-legacy.

marcelm commented on August 29, 2024

You’re welcome! I will leave this issue open for now until I have made a decision of whether IgDiscover should complain about spaces in the input sequences.

from igdiscover-legacy.

marcelm commented on August 29, 2024

This repository is outdated and is going to be archived. Please see the new repository at https://gitlab.com/gkhlab/igdiscover22/ or the homepage at https://www.igdiscover.se/ for the most recent and maintained IgDiscover version.

from igdiscover-legacy.

Complain if the input reads contain non-nucleotide characters/spaces about igdiscover-legacy HOT 10 CLOSED

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent