vaquerizaslab / chang_et_al_zebrafish_tes


chang_et_al_zebrafish_tes's Issues

A bug in <3.make_TE_classification.sh>

Hello! Thanks for this useful workflow. I think one input file in 3.make_TE_classification.sh, the TE_GTF file danRer11.TEtrans_uID.gtf, may be incorrect (shown below).

TE_GTF="./data/danRer11.TEtrans_uID.gtf"
time python ./scripts/clean_OCTFTA_output.py \
  --te_gtf ./data/danRer11.TEtrans_uID.gtf \
  --input ./data/danRer11.nonalt.fa.out.elem_sorted.csv \
  --te_length ./data/tes.lengths \
  --te_length_cutoff 2 \
  --output ./data/danRer11.TEtrans_uID.dfrag.gtf

According to the workflow, TE_GTF (danRer11.TEtrans_uID.gtf) is generated by the following code in 2.make_gene_and_TE_counts.sh:

rmsk2bed < ./data/danRer11.nonalt.fa.out > ./data/danRer11.nonalt.fa.out.bed
# Convert to GTF format
time python ./scripts/RM_bed2GTF.py -i ./data/danRer11.nonalt.fa.out.bed -o ./data/danRer11.nonalt.fa.out.gtf
# Adapt GTF to TEtranscript requested format
# Use ./data/danRer11_classes_wredundant.txt file to solve some TE superfamily/family name redundancy.
# Subset TE classes: DNA, LINE, LTR, RC, SINE
sed 's/?//g' ./data/danRer11_classes_wredundant.txt \
  | awk '{if ($2=="DNA" || $2=="LINE" || $2=="LTR" || $2=="RC" || $2=="SINE") print }' \
  > ./data/danRer11_classes_wredundant.subClass.txt
python ./scripts/adapt_GTF_TEtranscript.py \
  -i ./data/danRer11.nonalt.fa.out.gtf \
  -c ./data/danRer11_classes_wredundant.subClass.txt \
  -o ./data/danRer11.TEtrans_uID.gtf

After running the above code, I got danRer11.TEtrans_uID.gtf, which looks like:

chr1    RepeatMasker    exon    1       397     1686    -       .       gene_id "Harbinger-N29_DR"; transcript_id "Harbinger-N29_DR"; family_id "PIF-Harbinger"; class_id "DNA";
chr1    RepeatMasker    exon    403     462     404     +       .       gene_id "L2-5_DRe"; transcript_id "L2-5_DRe"; family_id "L2"; class_id "LINE";
chr1    RepeatMasker    exon    460     532     309     +       .       gene_id "hAT-N25_DR"; transcript_id "hAT-N25_DR"; family_id "hAT-Ac"; class_id "DNA";
chr1    RepeatMasker    exon    469     611     494     +       .       gene_id "Harbinger-8N2_DR"; transcript_id "Harbinger-8N2_DR"; family_id "PIF-Harbinger"; class_id "DNA";

However, this is not the format that clean_OCTFTA_output.py expects. The expected format carries additional gene annotation attributes and looks like:

chr1    RepeatMasker    exon    403     462     404     +       .       gene_id "L2-5_DRe"; transcript_id "L2-5_DRe"; family_id "L2"; class_id "LINE"; geneId "ENSDARG00000099104"; transcriptId "ENSDART00000158290"; annotation_fix "Distal"; TE_GE_strand "OS"; annotation_2 "Intergenic.OS"; annotation_3 "Intergenic.OS"
chr1    RepeatMasker    exon    625     1027    2198    +       .       gene_id "L2-5_DRe"; transcript_id "L2-5_DRe_dup1";  family_id "L2"; class_id "LINE"; geneId "ENSDARG00000099104"; transcriptId "ENSDART00000158290"; annotation_fix "Distal"; TE_GE_strand "OS"; annotation_2 "Intergenic.OS"; annotation_3 "Intergenic.OS"
chr1    RepeatMasker    exon    1026    1390    1791    +       .       gene_id "L2-5_DRe"; transcript_id "L2-5_DRe_dup2"; family_id "L2"; class_id "LINE"; geneId "ENSDARG00000099104"; transcriptId "ENSDART00000158290"; annotation_fix "Distal"; TE_GE_strand "OS"; annotation_2 "Intergenic.OS"; annotation_3 "Intergenic.OS"
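The difference between the two formats can be checked before running the script. This is a minimal sketch, assuming the ninth GTF column is a semicolon-separated list of `key "value"` pairs as in the lines above. `parse_attributes`, `missing_keys` and `REQUIRED_KEYS` are hypothetical names, not part of the repository, and the key list is taken from the example lines rather than from the script itself.

```python
import re

# Attributes present in the expected format but absent from my file.
# Assumption: this list is read off the example lines, not the script.
REQUIRED_KEYS = [
    "geneId", "transcriptId", "annotation_fix",
    "TE_GE_strand", "annotation_2", "annotation_3",
]

# Matches one `key "value"` pair in the attribute column.
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_attributes(gtf_line):
    """Return the attribute column (9th field) of a GTF line as a dict."""
    # Split on any whitespace so tab- and space-separated lines both work.
    fields = gtf_line.rstrip("\n").split(None, 8)
    if len(fields) < 9:
        return {}
    return dict(ATTR_RE.findall(fields[8]))


def missing_keys(gtf_line, required=REQUIRED_KEYS):
    """List the required attribute keys that the line does not carry."""
    attrs = parse_attributes(gtf_line)
    return [key for key in required if key not in attrs]


# One line from the danRer11.TEtrans_uID.gtf that I generated.
line = (
    'chr1\tRepeatMasker\texon\t403\t462\t404\t+\t.\t'
    'gene_id "L2-5_DRe"; transcript_id "L2-5_DRe"; '
    'family_id "L2"; class_id "LINE";'
)
print(missing_keys(line))
```

For the line from my file this reports all six annotation keys as missing, while a line in the expected format reports none.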

And I got the following error when running ./scripts/clean_OCTFTA_output.py in 3.make_TE_classification.sh:

  File "./scripts/clean_OCTFTA_output.py", line 318, in <module>
    print_TEgtf_line = make_TEgtf_print_line(TEgtf_lines[i])
  File "./scripts/clean_OCTFTA_output.py", line 180, in make_TEgtf_print_line
    attribute_string = 'gene_id "%s"; transcript_id "%s"; defrag_transcript_id "%s"; family_id "%s"; class_id "%s"; geneId "%s"; transcriptId "%s"; annotation_fix "%s"; TE_GE_strand "%s"; annotation_2 "%s"; annotation_3 "%s"' % (TEgtf_line[8]['gene_id'], TEgtf_line[8]['transcript_id'], TEgtf_line[8]['defrag_transcript_id'], TEgtf_line[8]['family_id'], TEgtf_line[8]['class_id'], TEgtf_line[8]['geneId'], TEgtf_line[8]['transcriptId'], TEgtf_line[8]['annotation_fix'], TEgtf_line[8]['TE_GE_strand'], TEgtf_line[8]['annotation_2'], TEgtf_line[8]['annotation_3'])
KeyError: 'geneId'

So I suspect that danRer11.TEtrans_uID.gtf is not the right input file for the following command in 3.make_TE_classification.sh:

TE_GTF="./data/danRer11.TEtrans_uID.gtf"
time python ./scripts/clean_OCTFTA_output.py \
  --te_gtf ./data/danRer11.TEtrans_uID.gtf \
  --input ./data/danRer11.nonalt.fa.out.elem_sorted.csv \
  --te_length ./data/tes.lengths \
  --te_length_cutoff 2 \
  --output ./data/danRer11.TEtrans_uID.dfrag.gtf

Can you help me check if there is anything wrong with the input file of 3.make_TE_classification.sh? Thank you!

Error in running clean_OCTFTA_output.py

Hello! I encountered the following error when running clean_OCTFTA_output.py:

File "clean_OCTFTA_output.py", line 336, in <module>
   print_TEgtf_line = make_TEgtf_print_line(new_TEgtf_line)
 File "clean_OCTFTA_output.py", line 180, in make_TEgtf_print_line
   attribute_string = 'gene_id "%s"; transcript_id "%s"; defrag_transcript_id "%s"; family_id "%s"; class_id "%s"; geneId "%s"; transcriptId "%s"; annotation_fix "%s"; TE_GE_strand "%s"; annotation_2 "%s"; annotation_3 "%s"' % (TEgtf_line[8]['gene_id'], TEgtf_line[8]['transcript_id'], TEgtf_line[8]['defrag_transcript_id'], TEgtf_line[8]['family_id'], TEgtf_line[8]['class_id'], TEgtf_line[8]['geneId'], TEgtf_line[8]['transcriptId'], TEgtf_line[8]['annotation_fix'], TEgtf_line[8]['TE_GE_strand'], TEgtf_line[8]['annotation_2'], TEgtf_line[8]['annotation_3'])
KeyError: 'geneId'

I suspect that one of my input files is wrong. Could you please advise what the problem is? Thank you so much for your time!

My command:

$ python clean_OCTFTA_output.py -t ce10.TEtrans_uID.gtf -i elem.csv -l ce10.fa.out.length -c 2 -o t1.txt

My output:


Reading TE GTF file: ce10.TEtrans_uID.gtf
Reading |################################| 36467/36467
Reading TE length file: ce10.fa.out.length
Reading |################################| 155/155
Reading TE length file: elem.csv
Reading |                                | 1/82286
Traceback (most recent call last):
  File "clean_OCTFTA_output.py", line 336, in <module>
    print_TEgtf_line = make_TEgtf_print_line(new_TEgtf_line)
  File "clean_OCTFTA_output.py", line 180, in make_TEgtf_print_line
    attribute_string = 'gene_id "%s"; transcript_id "%s"; defrag_transcript_id "%s"; family_id "%s"; class_id "%s"; geneId "%s"; transcriptId "%s"; annotation_fix "%s"; TE_GE_strand "%s"; annotation_2 "%s"; annotation_3 "%s"' % (TEgtf_line[8]['gene_id'], TEgtf_line[8]['transcript_id'], TEgtf_line[8]['defrag_transcript_id'], TEgtf_line[8]['family_id'], TEgtf_line[8]['class_id'], TEgtf_line[8]['geneId'], TEgtf_line[8]['transcriptId'], TEgtf_line[8]['annotation_fix'], TEgtf_line[8]['TE_GE_strand'], TEgtf_line[8]['annotation_2'], TEgtf_line[8]['annotation_3'])
KeyError: 'geneId'
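The failure can be reproduced outside the script. This is a minimal sketch and not the repository's code: `strict_attribute_string` and `tolerant_attribute_string` are hypothetical helpers, and the key order is copied from the format string at line 180 of the traceback. Since Python evaluates the lookups left to right and the error names 'geneId' rather than 'defrag_transcript_id', the script appears to set 'defrag_transcript_id' itself before this point, so it is filled in by hand here.

```python
# Keys looked up by the format string in the traceback, in order.
KEYS = [
    "gene_id", "transcript_id", "defrag_transcript_id", "family_id",
    "class_id", "geneId", "transcriptId", "annotation_fix",
    "TE_GE_strand", "annotation_2", "annotation_3",
]

# Attributes of the first record in ce10.TEtrans_uID.gtf.
# Assumption: defrag_transcript_id is added by the script, so it is
# supplied manually for this reproduction.
attrs = {
    "gene_id": "LONGPAL1",
    "transcript_id": "LONGPAL1",
    "defrag_transcript_id": "LONGPAL1",
    "family_id": "MULE-MuDR",
    "class_id": "DNA",
}


def strict_attribute_string(attrs):
    """Index every key directly; raises KeyError on the first absent key."""
    values = tuple(attrs[key] for key in KEYS)
    return "; ".join('%s "%s"' % pair for pair in zip(KEYS, values))


def tolerant_attribute_string(attrs, placeholder="NA"):
    """Same layout, but absent keys are written with a placeholder value."""
    return "; ".join(
        '%s "%s"' % (key, attrs.get(key, placeholder)) for key in KEYS
    )


try:
    strict_attribute_string(attrs)
except KeyError as err:
    print("KeyError:", err)

print(tolerant_attribute_string(attrs))
```

Whether placeholder values are acceptable to the later steps of the workflow is unknown to me, so this only illustrates the cause of the error. The GTF probably needs the gene annotation attributes added before it is passed to the script.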

My input files:

elem.csv:

 Score	%_Div	%_Del	%_Ins	Query	Beg.	End.	Length	Sense	Element	Family	Pos_Repeat_Beg	Pos_Repeat_End	Pos_Repeat_Left	ID	Num_Assembled	%_of_Ref
###432	21.9	2.4	0	chrI	1622	1744	126	+	LONGPAL1	DNA/MULE-MuDR	136	261	-2330	4	1	0.049
.......

ce10.TEtrans_uID.gtf:

chrI	RepeatMasker	exon	1622	1744	432	+	.	gene_id "LONGPAL1"; transcript_id "LONGPAL1"; family_id "MULE-MuDR"; class_id "DNA";
chrI	RepeatMasker	exon	2052	3026	8509	+	.	gene_id "PALTTTAAA3"; transcript_id "PALTTTAAA3"; family_id "DNA"; class_id "DNA";
chrI	RepeatMasker	exon	3124	3652	4521	+	.	gene_id "PALTTTAAA3"; transcript_id "PALTTTAAA3_dup1"; family_id "DNA"; class_id "DNA";
chrI	RepeatMasker	exon	4423	4750	1259	+	.	gene_id "CELE2"; transcript_id "CELE2"; family_id "DNA"; class_id "DNA";
chrI	RepeatMasker	exon	6781	6886	541	+	.	gene_id "PALTTAA3_CE"; transcript_id "PALTTAA3_CE"; family_id "PiggyBac"; class_id "DNA";
chrI	RepeatMasker	exon	7166	7254	381	+	.	gene_id "PALTTTAAA3"; transcript_id "PALTTTAAA3_dup2"; family_id "DNA"; class_id "DNA";
chrI	RepeatMasker	exon	7297	7307	304	-	.	gene_id "PALTTTAAA3"; transcript_id "PALTTTAAA3_dup3"; family_id "DNA"; class_id "DNA";
chrI	RepeatMasker	exon	7308	7362	319	+	.	gene_id "PALTTTAAA1"; transcript_id "PALTTTAAA1"; family_id "DNA"; class_id "DNA";
......

ce10.fa.out.length:

CELE1	329
CELE11	218
CELE12A	368
CELE12B	171
CELE14A	177
CELE14B	187
CELE2	325
CELE4	470
CELE42	238
CELE45	266
CELE46A	438
CELE46B	705
CELE6	158
CELE7	363
CELETC2	446
CEMUDR1	7227
CEMUDR2	5505
CER1	7881
CER10-I_CE	11155
......
