chang_et_al_zebrafish_tes's Issues
A bug in <3.make_TE_classification.sh>
Hello! Thanks for this useful workflow. I think one input file, TE_GTF (danRer11.TEtrans_uID.gtf), in 3.make_TE_classification.sh may be incorrect, as shown below.
TE_GTF="./data/danRer11.TEtrans_uID.gtf"
time python ./scripts/clean_OCTFTA_output.py \
--te_gtf ./data/danRer11.TEtrans_uID.gtf \
--input ./data/danRer11.nonalt.fa.out.elem_sorted.csv \
--te_length ./data/tes.lengths \
--te_length_cutoff 2 \
--output ./data/danRer11.TEtrans_uID.dfrag.gtf
According to the workflow, TE_GTF (danRer11.TEtrans_uID.gtf) is generated in 2.make_gene_and_TE_counts.sh by the following code:
rmsk2bed < ./data/danRer11.nonalt.fa.out > ./data/danRer11.nonalt.fa.out.bed
# Convert to GTF format
time python ./scripts/RM_bed2GTF.py -i ./data/danRer11.nonalt.fa.out.bed -o ./data/danRer11.nonalt.fa.out.gtf
# Adapt GTF to TEtranscript requested format
# Use ./data/danRer11_classes_wredundant.txt file to solve some TE superfamily/family name redundancy.
# Subset TE classes: DNA, LINE, LTR, RC, SINE
sed 's/?//g' ./data/danRer11_classes_wredundant.txt \
| awk '{if ($2=="DNA" || $2=="LINE" || $2=="LTR" || $2=="RC" || $2=="SINE") print }' \
> ./data/danRer11_classes_wredundant.subClass.txt
python ./scripts/adapt_GTF_TEtranscript.py \
-i ./data/danRer11.nonalt.fa.out.gtf \
-c ./data/danRer11_classes_wredundant.subClass.txt \
-o ./data/danRer11.TEtrans_uID.gtf
After running the above code, I got danRer11.TEtrans_uID.gtf, which looks like:
chr1 RepeatMasker exon 1 397 1686 - . gene_id "Harbinger-N29_DR"; transcript_id "Harbinger-N29_DR"; family_id "PIF-Harbinger"; class_id "DNA";
chr1 RepeatMasker exon 403 462 404 + . gene_id "L2-5_DRe"; transcript_id "L2-5_DRe"; family_id "L2"; class_id "LINE";
chr1 RepeatMasker exon 460 532 309 + . gene_id "hAT-N25_DR"; transcript_id "hAT-N25_DR"; family_id "hAT-Ac"; class_id "DNA";
chr1 RepeatMasker exon 469 611 494 + . gene_id "Harbinger-8N2_DR"; transcript_id "Harbinger-8N2_DR"; family_id "PIF-Harbinger"; class_id "DNA";
However, this is not the format that clean_OCTFTA_output.py expects. The correct format looks like:
chr1 RepeatMasker exon 403 462 404 + . gene_id "L2-5_DRe"; transcript_id "L2-5_DRe"; family_id "L2"; class_id "LINE"; geneId "ENSDARG00000099104"; transcriptId "ENSDART00000158290"; annotation_fix "Distal"; TE_GE_strand "OS"; annotation_2 "Intergenic.OS"; annotation_3 "Intergenic.OS"
chr1 RepeatMasker exon 625 1027 2198 + . gene_id "L2-5_DRe"; transcript_id "L2-5_DRe_dup1"; family_id "L2"; class_id "LINE"; geneId "ENSDARG00000099104"; transcriptId "ENSDART00000158290"; annotation_fix "Distal"; TE_GE_strand "OS"; annotation_2 "Intergenic.OS"; annotation_3 "Intergenic.OS"
chr1 RepeatMasker exon 1026 1390 1791 + . gene_id "L2-5_DRe"; transcript_id "L2-5_DRe_dup2"; family_id "L2"; class_id "LINE"; geneId "ENSDARG00000099104"; transcriptId "ENSDART00000158290"; annotation_fix "Distal"; TE_GE_strand "OS"; annotation_2 "Intergenic.OS"; annotation_3 "Intergenic.OS"
I got the following error when running ./scripts/clean_OCTFTA_output.py in 3.make_TE_classification.sh:
File "./scripts/clean_OCTFTA_output.py", line 318, in <module>
print_TEgtf_line = make_TEgtf_print_line(TEgtf_lines[i])
File "./scripts/clean_OCTFTA_output.py", line 180, in make_TEgtf_print_line
attribute_string = 'gene_id "%s"; transcript_id "%s"; defrag_transcript_id "%s"; family_id "%s"; class_id "%s"; geneId "%s"; transcriptId "%s"; annotation_fix "%s"; TE_GE_strand "%s"; annotation_2 "%s"; annotation_3 "%s"' % (TEgtf_line[8]['gene_id'], TEgtf_line[8]['transcript_id'], TEgtf_line[8]['defrag_transcript_id'], TEgtf_line[8]['family_id'], TEgtf_line[8]['class_id'], TEgtf_line[8]['geneId'], TEgtf_line[8]['transcriptId'], TEgtf_line[8]['annotation_fix'], TEgtf_line[8]['TE_GE_strand'], TEgtf_line[8]['annotation_2'], TEgtf_line[8]['annotation_3'])
KeyError: 'geneId'
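To make the mismatch concrete, here is a minimal sketch of the check I have in mind. It parses the attribute column of the two kinds of GTF lines shown above and lists which keys from the format string in the traceback are absent. The helper names `parse_gtf_attributes` and `missing_keys` are my own and are not part of the repository, and I am assuming that `defrag_transcript_id` is added by clean_OCTFTA_output.py itself, so it is not checked.

```python
import re

# Keys read by the format string on line 180 of clean_OCTFTA_output.py
# (copied from the traceback). defrag_transcript_id is left out on the
# assumption that the script adds it itself.
EXPECTED_KEYS = [
    "gene_id", "transcript_id", "family_id", "class_id",
    "geneId", "transcriptId", "annotation_fix",
    "TE_GE_strand", "annotation_2", "annotation_3",
]


def parse_gtf_attributes(attribute_column):
    """Turn 'key "value"; key "value";' into a dict (hypothetical helper)."""
    return dict(re.findall(r'(\S+) "([^"]*)"', attribute_column))


def missing_keys(attribute_column):
    """Return the expected keys that the attribute column does not contain."""
    attrs = parse_gtf_attributes(attribute_column)
    return [key for key in EXPECTED_KEYS if key not in attrs]


# Attribute column as produced by adapt_GTF_TEtranscript.py
plain = ('gene_id "L2-5_DRe"; transcript_id "L2-5_DRe"; '
         'family_id "L2"; class_id "LINE";')

# Attribute column in the annotated format shown above
annotated = plain + (' geneId "ENSDARG00000099104"; '
                     'transcriptId "ENSDART00000158290"; '
                     'annotation_fix "Distal"; TE_GE_strand "OS"; '
                     'annotation_2 "Intergenic.OS"; '
                     'annotation_3 "Intergenic.OS"')

print(missing_keys(plain))      # the six annotation keys
print(missing_keys(annotated))  # empty list
```

The plain line lacks all six annotation keys, starting with `geneId`, which matches the first key the traceback complains about.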
So I suspect danRer11.TEtrans_uID.gtf may not be the right input file for the following code in 3.make_TE_classification.sh:
TE_GTF="./data/danRer11.TEtrans_uID.gtf"
time python ./scripts/clean_OCTFTA_output.py \
--te_gtf ./data/danRer11.TEtrans_uID.gtf \
--input ./data/danRer11.nonalt.fa.out.elem_sorted.csv \
--te_length ./data/tes.lengths \
--te_length_cutoff 2 \
--output ./data/danRer11.TEtrans_uID.dfrag.gtf
Could you check whether something is wrong with the input file of 3.make_TE_classification.sh? Thank you!
Error in running clean_OCTFTA_output.py
Hello! I encountered the following error when running clean_OCTFTA_output.py:
File "clean_OCTFTA_output.py", line 336, in <module>
print_TEgtf_line = make_TEgtf_print_line(new_TEgtf_line)
File "clean_OCTFTA_output.py", line 180, in make_TEgtf_print_line
attribute_string = 'gene_id "%s"; transcript_id "%s"; defrag_transcript_id "%s"; family_id "%s"; class_id "%s"; geneId "%s"; transcriptId "%s"; annotation_fix "%s"; TE_GE_strand "%s"; annotation_2 "%s"; annotation_3 "%s"' % (TEgtf_line[8]['gene_id'], TEgtf_line[8]['transcript_id'], TEgtf_line[8]['defrag_transcript_id'], TEgtf_line[8]['family_id'], TEgtf_line[8]['class_id'], TEgtf_line[8]['geneId'], TEgtf_line[8]['transcriptId'], TEgtf_line[8]['annotation_fix'], TEgtf_line[8]['TE_GE_strand'], TEgtf_line[8]['annotation_2'], TEgtf_line[8]['annotation_3'])
KeyError: 'geneId'
I suspect that one of my input files is wrong. Could you please advise what the problem is? Thank you so much for your time!
My command:
$ python clean_OCTFTA_output.py -t ce10.TEtrans_uID.gtf -i elem.csv -l ce10.fa.out.length -c 2 -o t1.txt
My output:
Reading TE GTF file: ce10.TEtrans_uID.gtf
Reading |################################| 36467/36467
Reading TE length file: ce10.fa.out.length
Reading |################################| 155/155
Reading TE length file: elem.csv
Reading | | 1/82286
Traceback (most recent call last):
File "clean_OCTFTA_output.py", line 336, in <module>
print_TEgtf_line = make_TEgtf_print_line(new_TEgtf_line)
File "clean_OCTFTA_output.py", line 180, in make_TEgtf_print_line
attribute_string = 'gene_id "%s"; transcript_id "%s"; defrag_transcript_id "%s"; family_id "%s"; class_id "%s"; geneId "%s"; transcriptId "%s"; annotation_fix "%s"; TE_GE_strand "%s"; annotation_2 "%s"; annotation_3 "%s"' % (TEgtf_line[8]['gene_id'], TEgtf_line[8]['transcript_id'], TEgtf_line[8]['defrag_transcript_id'], TEgtf_line[8]['family_id'], TEgtf_line[8]['class_id'], TEgtf_line[8]['geneId'], TEgtf_line[8]['transcriptId'], TEgtf_line[8]['annotation_fix'], TEgtf_line[8]['TE_GE_strand'], TEgtf_line[8]['annotation_2'], TEgtf_line[8]['annotation_3'])
KeyError: 'geneId'
My input files:
elem.csv:
Score %_Div %_Del %_Ins Query Beg. End. Length Sense Element Family Pos_Repeat_Beg Pos_Repeat_End Pos_Repeat_Left ID Num_Assembled %_of_Ref
###432 21.9 2.4 0 chrI 1622 1744 126 + LONGPAL1 DNA/MULE-MuDR 136 261 -2330 4 1 0.049
.......
ce10.TEtrans_uID.gtf:
chrI RepeatMasker exon 1622 1744 432 + . gene_id "LONGPAL1"; transcript_id "LONGPAL1"; family_id "MULE-MuDR"; class_id "DNA";
chrI RepeatMasker exon 2052 3026 8509 + . gene_id "PALTTTAAA3"; transcript_id "PALTTTAAA3"; family_id "DNA"; class_id "DNA";
chrI RepeatMasker exon 3124 3652 4521 + . gene_id "PALTTTAAA3"; transcript_id "PALTTTAAA3_dup1"; family_id "DNA"; class_id "DNA";
chrI RepeatMasker exon 4423 4750 1259 + . gene_id "CELE2"; transcript_id "CELE2"; family_id "DNA"; class_id "DNA";
chrI RepeatMasker exon 6781 6886 541 + . gene_id "PALTTAA3_CE"; transcript_id "PALTTAA3_CE"; family_id "PiggyBac"; class_id "DNA";
chrI RepeatMasker exon 7166 7254 381 + . gene_id "PALTTTAAA3"; transcript_id "PALTTTAAA3_dup2"; family_id "DNA"; class_id "DNA";
chrI RepeatMasker exon 7297 7307 304 - . gene_id "PALTTTAAA3"; transcript_id "PALTTTAAA3_dup3"; family_id "DNA"; class_id "DNA";
chrI RepeatMasker exon 7308 7362 319 + . gene_id "PALTTTAAA1"; transcript_id "PALTTTAAA1"; family_id "DNA"; class_id "DNA";
......
ce10.fa.out.length:
CELE1 329
CELE11 218
CELE12A 368
CELE12B 171
CELE14A 177
CELE14B 187
CELE2 325
CELE4 470
CELE42 238
CELE45 266
CELE46A 438
CELE46B 705
CELE6 158
CELE7 363
CELETC2 446
CEMUDR1 7227
CEMUDR2 5505
CER1 7881
CER10-I_CE 11155
......
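In case it helps with diagnosing this, below is a rough pre-flight check that could be run on a GTF before passing it to clean_OCTFTA_output.py. It counts how many lines lack the annotation attributes named in the traceback. This is only a sketch: the function name `count_unannotated_lines` is hypothetical, the key list is taken from the error message rather than from the script source, and the file is assumed to be tab-separated with the attributes in column 9.

```python
import os
import tempfile

# Annotation keys named in the traceback's format string; taken from the
# error message, not from the script source.
ANNOTATION_KEYS = ("geneId", "transcriptId", "annotation_fix",
                   "TE_GE_strand", "annotation_2", "annotation_3")


def count_unannotated_lines(gtf_path):
    """Return (total, unannotated) record counts for a tab-separated GTF."""
    total = unannotated = 0
    with open(gtf_path) as handle:
        for line in handle:
            # Skip blank lines and comment/header lines.
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            attributes = fields[8] if len(fields) > 8 else ""
            total += 1
            # A record is unannotated if any expected key is absent.
            if any(f'{key} "' not in attributes for key in ANNOTATION_KEYS):
                unannotated += 1
    return total, unannotated


# Demo on a temporary file holding the first ce10 line quoted above.
demo_line = "\t".join([
    "chrI", "RepeatMasker", "exon", "1622", "1744", "432", "+", ".",
    'gene_id "LONGPAL1"; transcript_id "LONGPAL1"; '
    'family_id "MULE-MuDR"; class_id "DNA";',
])
with tempfile.NamedTemporaryFile("w", suffix=".gtf", delete=False) as tmp:
    tmp.write(demo_line + "\n")
total, unannotated = count_unannotated_lines(tmp.name)
os.unlink(tmp.name)
print(total, unannotated)
```

If every record in ce10.TEtrans_uID.gtf is reported as unannotated, the file is probably still in the plain TEtranscripts format and has not been through the gene-annotation step yet.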