
Comments (21)

akramdi commented on August 11, 2024

Hi Alison,

@amira glad you were able to fix it. That behavior is expected. We throw out the junctions that don't have a strand because we use strand information when determining whether a read is corrected or inconsistent.

Just to conclude on my side: I realized that the short-read junctions I use for the correction did not have strand information (no XS tag). So I redid the mapping to get this extra tag and was able to run FLAIR with no error. (The "hack" I implemented was forcing the correction regardless of the strand info; there was no need for it.)
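For anyone wanting to check this before running the correction step, here is a minimal sketch that counts spliced short-read alignments lacking the strand tag. It assumes plain SAM text as input (for example the output of `samtools view -h`) and that the strand tag has the form `XS:A:+` or `XS:A:-`; the function name is hypothetical and not part of FLAIR.

```python
def count_spliced_without_xs(sam_lines):
    """Return (spliced, spliced_missing_xs) for an iterable of SAM text lines."""
    spliced = 0
    missing = 0
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        cigar = fields[5]
        if "N" not in cigar:              # no intron gap, so no splice junction
            continue
        spliced += 1
        # Optional tags start at column 12; look for the XS:A strand tag.
        if not any(tag.startswith("XS:A:") for tag in fields[11:]):
            missing += 1
    return spliced, missing
```

If the second number is close to the first, the junctions derived from that mapping will most likely carry no strand.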

Comments here have been very helpful, thanks a lot!

Best,
Amira

from flair.

belgravia commented on August 11, 2024

Hi Dharm,

Since you have the raw reads aligned ok, have you already looked into the corrected reads bed or psl file? Are the full length reads present there? Also, your commands look fine and thanks for your patience as we work together to figure this out!

-Alison

Dharmendra-G-1 commented on August 11, 2024

Hello Alison,
As per your suggestion, I looked into the .psl and .bed files from the nanopore read alignment, i.e. the output I get from flair align:

python /opt/flair/flair.py align -r all_4.fastq -g Sus_scrofa.Sscrofa11.1.dna.toplevel_with_PL5.fa -m /opt/minimap2-2.14_x64-linux/minimap2 -o flair.aligned.PL5. -t 40 -p -v1.3

I can see long isoforms for GAPDH and for genes in the PL5 region in both the .psl and .bed files. The images are attached for your reference:

The following is .psl file alignment for GAPDH long isoforms:
GAPDH_PSL_alignment

The following is the .psl file alignment for long isoforms of several genes in the PL5 region:
PL5_PSL_alignment copy

The following is .bed file alignment for GAPDH long isoforms:
GAPDH_bed_igv_snapshot

GAPDH_bed_S_igv_snapshot

The following is .bed file alignment for PL5 region long isoforms:

PL5_bed_igv_snapshot

PL5_bed_S_igv_snapshot copy

Please advise which parameters I need to change to avoid losing the long isoforms in the output FASTA files of the collapse step.

Thanks,

With Regards,
Dharm

belgravia commented on August 11, 2024

To clarify, the igv shots you are showing me are aligned reads, right? I think you should also check the *all_corrected.bed file (or the *all_corrected.psl file, they're equivalent). It's possible that the flair-correct step is removing the long reads.
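One way to do this check without a genome browser is to compare read names between the aligned BED and the corrected BED. This is a rough sketch, assuming both files are standard BED with the read name in column 4 and that the correction step leaves read names unchanged; the helper names are made up for illustration.

```python
def read_names(bed_lines):
    """Collect the name column (column 4) from an iterable of BED lines."""
    names = set()
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue                      # skip blank, comment and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4:
            names.add(fields[3])
    return names


def dropped_reads(aligned_bed_lines, corrected_bed_lines):
    """Names present in the aligned BED but absent from the corrected BED."""
    return read_names(aligned_bed_lines) - read_names(corrected_bed_lines)
```

Called with two open file handles, `dropped_reads(open(aligned), open(corrected))` returns the reads that did not survive the correction step, which can then be inspected individually.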

-Alison

Dharmendra-G-1 commented on August 11, 2024

Yes, you are absolutely right, the images above are from the first step (the aligned files). As per your suggestion, I also looked into the *all_corrected files, which show that the long isoforms are broken up and not connected, so they are not counted as single long isoforms. The images are attached for your reference. Please let me know what steps can be taken to avoid this:

Bed file with GAPDH same as above:

corrected_GAPDH_BED

corrected_GAPDH_BED_tail

For PL5 region genes:
corrected_PL5_gene_PSL copy

corrected_PL5_regions_gene_bed

Why are the corrected .bed and .psl files missing the long isoforms?

Thanks,

With Regards,
Dharm

belgravia commented on August 11, 2024

The purpose of flair-correct is to correct spurious splice junctions in noisy reads to splice junctions we're more confident in (i.e. annotations, short-reads), and if the junction can't be corrected then the read is removed. This informs me that there may be an issue with how the script is handling splice junctions from your GTF. Can you send me a link to where you downloaded your gtf from? We might have to add some code to cover cases like this. Thanks for the cooperation :)
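To make the idea concrete, here is a simplified sketch of this kind of correction. It is not the code in FLAIR: the 10 bp window, the function names, and the representation of a read as a list of intron (start, end) pairs are all assumptions for illustration. Each intron boundary is snapped to the nearest trusted site within the window, and a read with any boundary that cannot be snapped is dropped.

```python
from bisect import bisect_left


def nearest_within(sorted_sites, pos, window):
    """Return the trusted site closest to pos if within window, else None."""
    i = bisect_left(sorted_sites, pos)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = sorted_sites[max(i - 1, 0):i + 1]
    best = min(candidates, key=lambda s: abs(s - pos), default=None)
    if best is None or abs(best - pos) > window:
        return None
    return best


def correct_read(introns, known_starts, known_ends, window=10):
    """Snap each (start, end) intron to trusted sites.

    known_starts and known_ends must be sorted lists. Returns the corrected
    intron list, or None when any boundary has no trusted site nearby
    (the read would be filtered out).
    """
    corrected = []
    for start, end in introns:
        new_start = nearest_within(known_starts, start, window)
        new_end = nearest_within(known_ends, end, window)
        if new_start is None or new_end is None:
            return None
        corrected.append((new_start, new_end))
    return corrected
```

Under this scheme a multi-exon read is lost as soon as one of its junctions has no trusted counterpart nearby, which is why an incomplete or mis-parsed junction set can remove long reads wholesale.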

Dharmendra-G-1 commented on August 11, 2024

I am working on pig cell RNA-seq using Nanopore long-read technology. The original GTF file was downloaded from ftp://ftp.ensembl.org/pub/release-95/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.95.gtf.gz. Some of the transgenes we are trying to express in the Payload 5 regions are added as additional chromosomes in the GTF file.

Checking the GAPDH gene would be more helpful.
Thanks

Dharmendra-G-1 commented on August 11, 2024

Hello Alison,
Just wondering if you have had time to look into the above GTF file to see whether it has problems that may lead to the shorter isoforms and counts.
Thanks

csoulette commented on August 11, 2024

Hi Dharmendra,

We've had a bit of trouble recapitulating the weird filtering issue you've raised in this thread.

I've taken nanopore reads derived from the GAPDH locus from a human sequencing experiment, and successfully converted them into isoforms using the Sus_scrofa.Sscrofa11.1.dna.toplevel.fa genome and associated annotation file. When unnecessary filtering occurs during the correction step, it means that the splice sites for each read cannot be corrected. This is usually due to issues when building the splice site database that we query each read against.

I was wondering if it would be possible to share some of the reads that are being filtered so that we can try to recapitulate the issue ourselves. Perhaps it would be possible to share reads from a locus that is being heavily filtered but is not important to your studies?

Thanks~

-CMS

Dharmendra-G-1 commented on August 11, 2024

Hello CMS,
Thanks for the comments and for working on this issue. As per your request, here are selected reads for the GAPDH region where the long isoforms go missing:

long_reads_selected_for_GAPDH_region_marked_duplicates.bam.gz

Hope this will help us resolve this issue.

Thanks

belgravia commented on August 11, 2024

So I took the reads you provided, made a bed, and ran: python flair.py correct -q long_reads_selected_for_GAPDH_region_marked_duplicates.bed -f Sus_scrofa.Sscrofa11.1.95.gtf -c pig.chromsizes.

Here are the *_all_corrected.bed reads that I got: https://genome.ucsc.edu/s/atang14/susScr11
You'll notice that it has the single-exon reads in your screenshots, but it also looks like many of the multi-exonic reads are getting corrected and kept. Seeing as we're running essentially the same commands on the same data and getting different results, I'm guessing it's something about the environment. Maybe the script didn't finish running? Did it output any errors?

-Alison

Dharmendra-G-1 commented on August 11, 2024

No, I didn't get an error message. I repeated the run and again got no error message. I also tried with the selected GAPDH-region reads that I sent you previously, and I am still missing long reads.

Dharmendra-G-1 commented on August 11, 2024

Do you have a dockerized version of FLAIR? I could use that to rule out environment problems that may be causing the missing longer isoforms, since I am not getting any error messages when I run FLAIR on our end.
Thanks

belgravia commented on August 11, 2024

Hi Dharm,

We're working on a Docker now to hopefully solve whatever issue is occurring. In the meantime, could you send me your annotation file that you've been using with extra PL5 entries? Since we were able to run the correction step successfully in our hands with your reads, with the only difference in our commands being the annotation file, I'd like to try running the correction step with your reads/annotation to try and debug further.

-Alison

akramdi commented on August 11, 2024

Hi Alison,

I am contributing to this thread because I think I am facing the same problem: FLAIR runs without errors but does not report some isoforms that we can clearly see after mapping. As mentioned above, it looks like it's due to the correction step: when a read does not overlap a known/annotated junction, it is removed.

I work on transposable elements in Arabidopsis thaliana and most of their genes are poorly annotated. Is there any way to keep these reads and correct them using only short reads?

Thank you!

Best,
Amira

belgravia commented on August 11, 2024

Hi Amira and Dharm,

We have made a docker for FLAIR. You can use it like so:
docker pull quay.io/brookslab/flair
docker run -w /usr/data -v [your_path_to_data]:/usr/data -t -d [image_id]
docker exec [container_id] python3 /usr/local/flair/flair.py align [rest_of_your_command]
If the docker gives the same results (no long reads after flair-correct) then we'd have to take a closer look at the data/annotation. Dharm, I know you're using a custom annotation file, so that might be causing issues for FLAIR, since in our hands the portion of data that you sent looks OK.

Amira, we are working on correcting with only short reads specified if you don't have an annotation.
-Alison

csoulette commented on August 11, 2024

Hi Amira,

You will find the latest version of ssCorrect will now allow you to run correction without GTF annotations. Let me know if there are any issues. Thanks ~

Best,
CMS

akramdi commented on August 11, 2024

Hi Alison and CMS,

My apologies for the late reply and thank you for the fix!

I updated FLAIR and reran the correction step. I first ran the command without the GTF file, assuming that it was no longer mandatory, but it still is. So I tried with the GTF file; here's my command:

python $SOURCE/flair.py correct -f $TAIR10GTF -c $CHRLEN -q $queryONTreads -j $shortReadsJunctions -o $OUTDIR -t 30

I checked the correction output and some isoforms are still missing. Here are some examples:

grey track: *all_corrected.bed
green track: *all_inconsistent.bed

ex1_FLAIR
ex2_FLAIR

For me, the inconsistent isoforms here (green) are the ones that should be reported as consistent.

Could you please look a bit more into it? I can send you the GTF file I am working with if it helps.

Many thanks in advance,
Best
Amira

belgravia commented on August 11, 2024

Thanks for your patience Amira.

We have made the fixes necessary so that the gtf argument is optional, and at least one of a gtf file or short read junction file needs to be specified. Please try again with only your short read junctions and let us know how that goes.

We have also slightly altered the syntax for the flair-correct command such that a genome sequence fasta file is required. The genome file must also be indexed (you can run samtools faidx yourgenome.fa to generate the .fai). So your command might look something like python $SOURCE/flair.py correct -c $CHRLEN -q $queryONTreads -j $shortReadsJunctions -o $OUTDIR -t 30 -g genome.fa now.
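As a sanity check before launching, the two requirements above (at least one of GTF or junction file, and an indexed genome) can be tested up front. The sketch below only assembles the argument list using the flags as they appear in this thread (`-c`, `-q`, `-g`, `-f`, `-j`, `-o`, `-t`); the helper itself is hypothetical and the flags may differ in other FLAIR versions.

```python
import os


def build_correct_command(flair_py, query_bed, chrom_sizes, genome_fa,
                          gtf=None, junctions_bed=None,
                          out_prefix=None, threads=None):
    """Assemble a flair-correct argument list after basic input checks."""
    if gtf is None and junctions_bed is None:
        raise ValueError("specify a GTF, a short-read junction file, or both")
    if not os.path.exists(genome_fa + ".fai"):
        raise FileNotFoundError(
            f"{genome_fa}.fai not found; run: samtools faidx {genome_fa}")
    cmd = ["python", flair_py, "correct",
           "-c", chrom_sizes, "-q", query_bed, "-g", genome_fa]
    if gtf is not None:
        cmd += ["-f", gtf]
    if junctions_bed is not None:
        cmd += ["-j", junctions_bed]
    if out_prefix is not None:
        cmd += ["-o", out_prefix]
    if threads is not None:
        cmd += ["-t", str(threads)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, so a missing index or missing junction source is reported before any processing starts.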

-Alison

akramdi commented on August 11, 2024

Hi Alison,

Thanks again for the fixes!

I updated FLAIR and tried again by running the command as you suggested. I got the following error:

> python $SOURCE/flair.py correct -c $CHRLEN -q $queryONTreads -j $shortReadsJunctions -o $OUTDIR -t 30 -g $GENOME
Step 2/5: Processing additional junction file  /kingdoms/a2e/workspace2/kramdi/ATAC-1_A2016/ONT_analysis/detect_isoforms/FLAIR/extract-spliceJunction-shortReads/shortReadsA2E_junctions.nreads5.bed ...
No junctions from GTF or junctionsBed to correct with. Exiting...
Correction command did not exit with success status

I looked a bit into the code and it seems that, in the script ssCorrect.py, the case where the strand is "0" (unknown) in the junction file is ignored and only the values "1" and "2" are converted to "+" and "-".
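To illustrate what I mean, here is a small sketch of that behaviour. It is my own reconstruction, not the actual code from ssCorrect.py, and the record layout (chrom, start, end, strand code) is an assumption. Codes "1" and "2" become "+" and "-", and anything else, including "0", is skipped.

```python
# Strand codes as described above; "0" (unknown) is deliberately absent.
STRAND_BY_CODE = {"1": "+", "2": "-"}


def stranded_junctions(records):
    """Keep only junctions with a known strand.

    records: iterable of (chrom, start, end, strand_code) tuples.
    Returns a list of (chrom, start, end, strand) with strand "+" or "-".
    """
    kept = []
    for chrom, start, end, code in records:
        strand = STRAND_BY_CODE.get(str(code))
        if strand is None:                # unknown strand: junction is skipped
            continue
        kept.append((chrom, int(start), int(end), strand))
    return kept
```

If every junction in the file has code "0", the result is an empty list, which matches the "No junctions from GTF or junctionsBed to correct with" message above.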

I edited my local version and was able to run the script with no error. The consistent isoforms are now the ones I expect visually. Here are the new results for the examples shown in my previous post:

ex1_FLAIR_after_fix

ex2_FLAIR_after_fix

Blue track: consistent isoforms
Green track: inconsistent isoforms

Could you please check that part of the script ssCorrect.py that generates the error and update the repo?

I'll carry on with the rest of the steps (collapse and quantify). So far, it looks pretty good! So thanks a lot!

Best,
Amira

belgravia commented on August 11, 2024

Hi @Dharmendra-G-1, you may want to check out issue #34. They had the same issue with longer reads getting removed at the correction step, and it was due to the kind of genome annotation file format they were using. We've fixed that issue now, and since the only thing that was different between your failed and my successful FLAIR runs on the data you provided was the annotation, it's possible that your issues may be fixed. A little late, but I thought I'd let you know :)

@amira glad you were able to fix it. That behavior is expected. We throw out the junctions that don't have a strand because we use strand information when determining whether a read is corrected or inconsistent.
-Alison
