Hello, I have a nucleotide.fasta file with a lot of sequences, which I translated to

Retrieve the nucleotide sequences initially used from the SeqIDs of the protein sequences (question) about seqkit HOT 9 CLOSED

silvia1234567890 commented on June 14, 2024

Retrieve the nucleotide sequences initially used from the SeqIDs of the protein sequences (question)

from seqkit.

Comments (9)

shenwei356 commented on June 14, 2024

#415 (comment)

shenwei356 commented on June 14, 2024

You can use the option --id-regexp to specify the IDs for matching.

seqkit grep -f nonproductiveIDs.txt nucleotide.fasta -o nonprodnucl.fasta  --id-regexp '^(.+)_frame='

silvia1234567890 commented on June 14, 2024

Hello,
I used that option and I still got this message:

[INFO] 608703 patterns loaded from file

and the file "nonprodnucl.fasta" is empty. Is it okay to use -o for the output files?

Thank you

shenwei356 commented on June 14, 2024

"[INFO] 608703 patterns loaded from file"

This message only reports how many patterns were loaded; it is not an error.

The key problem is that these patterns have suffixes like _frame=1, which do not exist in the sequence file. So, we need to remove these suffixes before searching.

seqkit grep -f <(perl -pne  's/_frame\=\d+$//' nonproductiveIDs.txt) nucleotide.fasta -o nonprodnucl.fasta

"Is it okay to use -o for the output files?"

It is.
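
To see what the suffix stripping in the command above does before it reaches seqkit grep, here is a minimal sketch on a toy ID list. The file names and IDs are hypothetical, and sed is swapped in for the perl one-liner purely for illustration; the substitution is the same idea.

```shell
# Hypothetical sample IDs, as they might appear in nonproductiveIDs.txt
printf '%s\n' 'seqA_frame=1' 'seqB_frame=3' 'seqC' > sample_ids.txt

# Remove a trailing _frame=<digits>; lines without the suffix pass through unchanged
sed -E 's/_frame=[0-9]+$//' sample_ids.txt > stripped_ids.txt

# Inspect the cleaned patterns
cat stripped_ids.txt
```

The cleaned list is what the `<( ... )` process substitution hands to `-f`, so the patterns can match IDs that never carried the _frame= suffix.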

silvia1234567890 commented on June 14, 2024

Oh, I understand.
I would like to create IDs.txt from the first 15 characters of each header, because that part is a UMI and it is unique to every sequence. How could I do that?

shenwei356 commented on June 14, 2024

seqkit seq nonprod_seq.fasta -ni --id-regexp '(^.+?)\|' -o IDs.txt

seqkit seq nonprod_seq.fasta -ni | cut -d '|' -f 1 > IDs.txt

Then you need to add the same --id-regexp '(^.+?)\|' when using seqkit grep, if the format does not change in nucleotide.fasta.
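
As a rough illustration of the two ways to cut out the UMI, the sketch below works on made-up headers. It assumes the UMI is exactly 15 characters and is followed by '|', which the regular expression above implies but the thread never shows; the header text and file names are hypothetical.

```shell
# Hypothetical headers: a 15-character UMI, then '|' and further fields
printf '%s\n' 'ACGTACGTACGTACG|sample1_frame=1' 'TTGCATTGCATTGCA|sample2_frame=2' > sample_headers.txt

# Option 1: everything before the first '|'
cut -d '|' -f 1 sample_headers.txt > umi_by_field.txt

# Option 2: the first 15 characters, regardless of any delimiter
cut -c 1-15 sample_headers.txt > umi_by_width.txt

# Both files should agree when the UMI really is 15 characters long
cmp umi_by_field.txt umi_by_width.txt && echo "same IDs"
```

Splitting on the delimiter is the safer choice if UMI length could ever vary; the fixed-width cut only works while every UMI is exactly 15 characters.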

silvia1234567890 commented on June 14, 2024

And how could I create IDs.txt with the full header except the _frame= part (for example)?

shenwei356 commented on June 14, 2024

Well, you can simply combine the methods above.

seqkit seq nonprod_seq.fasta -n | perl -pne  's/_frame\=\d+$//' > IDs.txt

silvia1234567890 commented on June 14, 2024

Hello, I ran this code:
seqkit seq nonproductive_seqs.fasta -ni --id-regexp 's/_frame\=\d+$//' -o IDs.txt
and I'm getting this error:
[ERRO] fastx: regular expression must contain "(" and ")" to capture matched ID. default: ^(\S+)\s?
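
A likely reading of this error, going only by its text: --id-regexp expects a pattern with a parenthesised group that captures the ID, whereas 's/_frame\=\d+$//' is a Perl substitution, which in the earlier reply was piped through perl rather than passed to --id-regexp. The sketch below shows the capture-group idea with sed on a made-up header; the header and file names are hypothetical, and sed stands in for seqkit here.

```shell
# Hypothetical header resembling those in nonproductive_seqs.fasta
printf '%s\n' 'ACGTACGTACGTACG|sample1_frame=1' > one_header.txt

# The capture group keeps the part in parentheses and drops _frame=<digits>
sed -E 's/^(.+)_frame=[0-9]+$/\1/' one_header.txt > captured_id.txt

# Inspect the captured ID
cat captured_id.txt
```

In seqkit terms this would presumably correspond to a capturing pattern such as the '^(.+)_frame=' one suggested earlier in the thread, though that is not tested here.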
