chang_et_al_zebrafish_tes's People

Contributors

jonathan-wells, monkeysylvia

chang_et_al_zebrafish_tes's Issues

A bug in <3.make_TE_classification.sh>

Hello! Thanks for this good workflow. I think the input file assigned to TE_GTF in 3.make_TE_classification.sh, danRer11.TEtrans_uID.gtf, may be incorrect (shown below).

TE_GTF="./data/danRer11.TEtrans_uID.gtf"
time python ./scripts/clean_OCTFTA_output.py \
  --te_gtf ./data/danRer11.TEtrans_uID.gtf \
  --input ./data/danRer11.nonalt.fa.out.elem_sorted.csv \
  --te_length ./data/tes.lengths \
  --te_length_cutoff 2 \
  --output ./data/danRer11.TEtrans_uID.dfrag.gtf

According to the workflow, danRer11.TEtrans_uID.gtf (TE_GTF) is generated by the following code in <2.make_gene_and_TE_counts.sh>:

rmsk2bed < ./data/danRer11.nonalt.fa.out > ./data/danRer11.nonalt.fa.out.bed
# Convert to GTF format
time python ./scripts/RM_bed2GTF.py -i ./data/danRer11.nonalt.fa.out.bed -o ./data/danRer11.nonalt.fa.out.gtf
# Adapt GTF to TEtranscript requested format
# Use ./data/danRer11_classes_wredundant.txt file to solve some TE superfamily/family name redundancy.
# Subset TE classes: DNA, LINE, LTR, RC, SINE
sed 's/?//g' ./data/danRer11_classes_wredundant.txt \
  | awk '{if ($2=="DNA" || $2=="LINE" || $2=="LTR" || $2=="RC" || $2=="SINE") print }' \
  > ./data/danRer11_classes_wredundant.subClass.txt
python ./scripts/adapt_GTF_TEtranscript.py \
  -i ./data/danRer11.nonalt.fa.out.gtf \
  -c ./data/danRer11_classes_wredundant.subClass.txt \
  -o ./data/danRer11.TEtrans_uID.gtf
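
The class-subsetting step (the sed and awk pipeline above) can also be sketched in Python, which makes it easier to inspect what is kept. This is only an illustration: it assumes the class file is whitespace-separated with the TE class in the second column, as the awk expression implies, and the function name and example rows are made up.

```python
# Hypothetical Python equivalent of:
#   sed 's/?//g' file | awk '{if ($2=="DNA" || ... || $2=="SINE") print }'
KEEP_CLASSES = {"DNA", "LINE", "LTR", "RC", "SINE"}


def subset_te_classes(lines):
    """Strip '?' characters, then keep lines whose 2nd field is a kept class."""
    kept = []
    for line in lines:
        cleaned = line.rstrip("\n").replace("?", "")  # sed 's/?//g'
        fields = cleaned.split()                      # awk default field splitting
        if len(fields) >= 2 and fields[1] in KEEP_CLASSES:
            kept.append(cleaned)
    return kept


# Made-up rows, only to show the behaviour; the real column layout of
# danRer11_classes_wredundant.txt beyond column 2 is not shown in this issue.
example_rows = ["teA\tDNA?\tfamA", "teB\tSatellite\tfamB", "teC\tLTR\tfamC"]
for row in subset_te_classes(example_rows):
    print(row)
```

Rows whose class carries a trailing "?" (for example "DNA?") are kept after the "?" is stripped, which matches what the sed step does before the awk filter runs.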

After running the above code, I got danRer11.TEtrans_uID.gtf, which looks like:

chr1    RepeatMasker    exon    1       397     1686    -       .       gene_id "Harbinger-N29_DR"; transcript_id "Harbinger-N29_DR"; family_id "PIF-Harbinger"; class_id "DNA";
chr1    RepeatMasker    exon    403     462     404     +       .       gene_id "L2-5_DRe"; transcript_id "L2-5_DRe"; family_id "L2"; class_id "LINE";
chr1    RepeatMasker    exon    460     532     309     +       .       gene_id "hAT-N25_DR"; transcript_id "hAT-N25_DR"; family_id "hAT-Ac"; class_id "DNA";
chr1    RepeatMasker    exon    469     611     494     +       .       gene_id "Harbinger-8N2_DR"; transcript_id "Harbinger-8N2_DR"; family_id "PIF-Harbinger"; class_id "DNA";

However, this is not the format that clean_OCTFTA_output.py recognizes. The format it expects looks like:

chr1    RepeatMasker    exon    403     462     404     +       .       gene_id "L2-5_DRe"; transcript_id "L2-5_DRe"; family_id "L2"; class_id "LINE"; geneId "ENSDARG00000099104"; transcriptId "ENSDART00000158290"; annotation_fix "Distal"; TE_GE_strand "OS"; annotation_2 "Intergenic.OS"; annotation_3 "Intergenic.OS"
chr1    RepeatMasker    exon    625     1027    2198    +       .       gene_id "L2-5_DRe"; transcript_id "L2-5_DRe_dup1";  family_id "L2"; class_id "LINE"; geneId "ENSDARG00000099104"; transcriptId "ENSDART00000158290"; annotation_fix "Distal"; TE_GE_strand "OS"; annotation_2 "Intergenic.OS"; annotation_3 "Intergenic.OS"
chr1    RepeatMasker    exon    1026    1390    1791    +       .       gene_id "L2-5_DRe"; transcript_id "L2-5_DRe_dup2"; family_id "L2"; class_id "LINE"; geneId "ENSDARG00000099104"; transcriptId "ENSDART00000158290"; annotation_fix "Distal"; TE_GE_strand "OS"; annotation_2 "Intergenic.OS"; annotation_3 "Intergenic.OS"
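
The difference between the two formats can be made explicit by parsing the attribute column of each line and listing which annotation keys are absent. The sketch below is a hypothetical helper, not part of the workflow; the key list is copied from the expected sample lines above, and it assumes standard tab-separated GTF input.

```python
import re

# Annotation keys that appear in the expected sample lines above but not in
# the GTF written by adapt_GTF_TEtranscript.py.
ANNOTATION_KEYS = ["geneId", "transcriptId", "annotation_fix",
                   "TE_GE_strand", "annotation_2", "annotation_3"]

# Matches one `key "value"` pair in the GTF attribute column.
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_gtf_attributes(line):
    """Return the column-9 attributes of one tab-separated GTF line as a dict."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(columns))
    return dict(ATTR_RE.findall(columns[8]))


def missing_annotation_keys(line):
    """List the annotation keys that a GTF line does not carry."""
    attrs = parse_gtf_attributes(line)
    return [key for key in ANNOTATION_KEYS if key not in attrs]


# The first eight columns of the L2-5_DRe record shown above.
prefix = "chr1\tRepeatMasker\texon\t403\t462\t404\t+\t.\t"
produced = prefix + 'gene_id "L2-5_DRe"; transcript_id "L2-5_DRe"; family_id "L2"; class_id "LINE";'
print(missing_annotation_keys(produced))
```

For the record produced by 2.make_gene_and_TE_counts.sh, all six annotation keys are reported as missing, while a line in the expected format yields an empty list.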

And I got the following error when running ./scripts/clean_OCTFTA_output.py in 3.make_TE_classification.sh:

  File "./scripts/clean_OCTFTA_output.py", line 318, in <module>
    print_TEgtf_line = make_TEgtf_print_line(TEgtf_lines[i])
  File "./scripts/clean_OCTFTA_output.py", line 180, in make_TEgtf_print_line
    attribute_string = 'gene_id "%s"; transcript_id "%s"; defrag_transcript_id "%s"; family_id "%s"; class_id "%s"; geneId "%s"; transcriptId "%s"; annotation_fix "%s"; TE_GE_strand "%s"; annotation_2 "%s"; annotation_3 "%s"' % (TEgtf_line[8]['gene_id'], TEgtf_line[8]['transcript_id'], TEgtf_line[8]['defrag_transcript_id'], TEgtf_line[8]['family_id'], TEgtf_line[8]['class_id'], TEgtf_line[8]['geneId'], TEgtf_line[8]['transcriptId'], TEgtf_line[8]['annotation_fix'], TEgtf_line[8]['TE_GE_strand'], TEgtf_line[8]['annotation_2'], TEgtf_line[8]['annotation_3'])
KeyError: 'geneId'

So I suspect that danRer11.TEtrans_uID.gtf is not the right input file for the following code in <3.make_TE_classification.sh>:

TE_GTF="./data/danRer11.TEtrans_uID.gtf"
time python ./scripts/clean_OCTFTA_output.py \
  --te_gtf ./data/danRer11.TEtrans_uID.gtf \
  --input ./data/danRer11.nonalt.fa.out.elem_sorted.csv \
  --te_length ./data/tes.lengths \
  --te_length_cutoff 2 \
  --output ./data/danRer11.TEtrans_uID.dfrag.gtf

Can you help me check if there is anything wrong with the input file of 3.make_TE_classification.sh? Thank you!

Error in running clean_OCTFTA_output.py

Hello! I encountered the following problems when running clean_OCTFTA_output.py:

File "clean_OCTFTA_output.py", line 336, in <module>
   print_TEgtf_line = make_TEgtf_print_line(new_TEgtf_line)
 File "clean_OCTFTA_output.py", line 180, in make_TEgtf_print_line
   attribute_string = 'gene_id "%s"; transcript_id "%s"; defrag_transcript_id "%s"; family_id "%s"; class_id "%s"; geneId "%s"; transcriptId "%s"; annotation_fix "%s"; TE_GE_strand "%s"; annotation_2 "%s"; annotation_3 "%s"' % (TEgtf_line[8]['gene_id'], TEgtf_line[8]['transcript_id'], TEgtf_line[8]['defrag_transcript_id'], TEgtf_line[8]['family_id'], TEgtf_line[8]['class_id'], TEgtf_line[8]['geneId'], TEgtf_line[8]['transcriptId'], TEgtf_line[8]['annotation_fix'], TEgtf_line[8]['TE_GE_strand'], TEgtf_line[8]['annotation_2'], TEgtf_line[8]['annotation_3'])
KeyError: 'geneId'

I suspect that my input file is wrong. Can you please advise what the problem is? Thank you so much for your time!

My command:

$ python clean_OCTFTA_output.py -t ce10.TEtrans_uID.gtf -i elem.csv -l ce10.fa.out.length -c 2 -o t1.txt

My output:


Reading TE GTF file: ce10.TEtrans_uID.gtf
Reading |################################| 36467/36467
Reading TE length file: ce10.fa.out.length
Reading |################################| 155/155
Reading TE length file: elem.csv
Reading |                                | 1/82286Traceback (most recent call last):
  File "clean_OCTFTA_output.py", line 336, in <module>
    print_TEgtf_line = make_TEgtf_print_line(new_TEgtf_line)
  File "clean_OCTFTA_output.py", line 180, in make_TEgtf_print_line
    attribute_string = 'gene_id "%s"; transcript_id "%s"; defrag_transcript_id "%s"; family_id "%s"; class_id "%s"; geneId "%s"; transcriptId "%s"; annotation_fix "%s"; TE_GE_strand "%s"; annotation_2 "%s"; annotation_3 "%s"' % (TEgtf_line[8]['gene_id'], TEgtf_line[8]['transcript_id'], TEgtf_line[8]['defrag_transcript_id'], TEgtf_line[8]['family_id'], TEgtf_line[8]['class_id'], TEgtf_line[8]['geneId'], TEgtf_line[8]['transcriptId'], TEgtf_line[8]['annotation_fix'], TEgtf_line[8]['TE_GE_strand'], TEgtf_line[8]['annotation_2'], TEgtf_line[8]['annotation_3'])
KeyError: 'geneId'

My input files:

elem.csv:

 Score	%_Div	%_Del	%_Ins	Query	Beg.	End.	Length	Sense	Element	Family	Pos_Repeat_Beg	Pos_Repeat_End	Pos_Repeat_Left	ID	Num_Assembled	%_of_Ref
###432	21.9	2.4	0	chrI	1622	1744	126	+	LONGPAL1	DNA/MULE-MuDR	136	261	-2330	4	1	0.049
.......

ce10.TEtrans_uID.gtf:

chrI	RepeatMasker	exon	1622	1744	432	+	.	gene_id "LONGPAL1"; transcript_id "LONGPAL1"; family_id "MULE-MuDR"; class_id "DNA";
chrI	RepeatMasker	exon	2052	3026	8509	+	.	gene_id "PALTTTAAA3"; transcript_id "PALTTTAAA3"; family_id "DNA"; class_id "DNA";
chrI	RepeatMasker	exon	3124	3652	4521	+	.	gene_id "PALTTTAAA3"; transcript_id "PALTTTAAA3_dup1"; family_id "DNA"; class_id "DNA";
chrI	RepeatMasker	exon	4423	4750	1259	+	.	gene_id "CELE2"; transcript_id "CELE2"; family_id "DNA"; class_id "DNA";
chrI	RepeatMasker	exon	6781	6886	541	+	.	gene_id "PALTTAA3_CE"; transcript_id "PALTTAA3_CE"; family_id "PiggyBac"; class_id "DNA";
chrI	RepeatMasker	exon	7166	7254	381	+	.	gene_id "PALTTTAAA3"; transcript_id "PALTTTAAA3_dup2"; family_id "DNA"; class_id "DNA";
chrI	RepeatMasker	exon	7297	7307	304	-	.	gene_id "PALTTTAAA3"; transcript_id "PALTTTAAA3_dup3"; family_id "DNA"; class_id "DNA";
chrI	RepeatMasker	exon	7308	7362	319	+	.	gene_id "PALTTTAAA1"; transcript_id "PALTTTAAA1"; family_id "DNA"; class_id "DNA";
......

ce10.fa.out.length:

CELE1	329
CELE11	218
CELE12A	368
CELE12B	171
CELE14A	177
CELE14B	187
CELE2	325
CELE4	470
CELE42	238
CELE45	266
CELE46A	438
CELE46B	705
CELE6	158
CELE7	363
CELETC2	446
CEMUDR1	7227
CEMUDR2	5505
CER1	7881
CER10-I_CE	11155
......
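
The traceback above fails inside the `%` formatting at line 180 of clean_OCTFTA_output.py, which looks up keys such as `geneId` directly in the attribute dict, whereas the ce10 GTF records shown above carry only gene_id, transcript_id, family_id and class_id. The sketch below reproduces that situation with a hypothetical `format_attributes` function that reports every missing key at once instead of stopping at the first KeyError. It is not the script's actual code; the key order is copied from the format string in the traceback.

```python
# Keys used by the format string shown in the traceback (line 180).
REQUIRED_KEYS = (
    "gene_id", "transcript_id", "defrag_transcript_id", "family_id", "class_id",
    "geneId", "transcriptId", "annotation_fix", "TE_GE_strand",
    "annotation_2", "annotation_3",
)


def format_attributes(attrs):
    """Build the GTF attribute string, or fail with a list of missing keys."""
    missing = [key for key in REQUIRED_KEYS if key not in attrs]
    if missing:
        raise ValueError("TE GTF record lacks attribute(s): " + ", ".join(missing))
    return "; ".join('%s "%s"' % (key, attrs[key]) for key in REQUIRED_KEYS)


# Attributes of the first ce10.TEtrans_uID.gtf record shown above.
ce10_attrs = {"gene_id": "LONGPAL1", "transcript_id": "LONGPAL1",
              "family_id": "MULE-MuDR", "class_id": "DNA"}

try:
    format_attributes(ce10_attrs)
except ValueError as err:
    print(err)
```

A message of this kind would point to the GTF lacking the gene annotation attributes, rather than to elem.csv or the length file.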
