rvolden / mandalorion

Pipeline to identify isoforms from full-length cDNA sequencing data

License: MIT License

Python 100.00%

mandalorion's Issues

cell-type specific isoform analysis

Dear Rodger,

Could you help me in the analysis of cell-type specific isoforms?

After obtaining the cell barcode list for each cell type, I pooled the reads.fasta files and the subreads.fastq files into single files (pooled.fasta and pooled.fastq, respectively) and ran Mandalorion, but it failed.

The problem seems to arise because the read IDs and subread IDs have been stripped of the additional information that follows the "_" separator.

Here are the details of what I did.

I used the fasta and fastq files from the MergeUMIs10x.py step in 10xR2C2.

$ head cell_208_GTTTGGAAGTGTCATC.merged.fasta
>71c10016-715b-4bb8-8808-2059ecc38312
AGACGTTCTTCGCCGA....ATGACACTTCCAAAC
>b81eb912-76a4-4ac1-b18c-f47010c334f5
GCTCTTTCTCAGTGA.....CCGGGTGGTTTGCTT

$ head cell_208_GTTTGGAAGTGTCATC.merged.subreads.fastq
@71c10016-715b-4bb8-8808-2059ecc38312
TATTGTGTACCTTTTGCTAG...CGGCCGCCCA
+
>?;<;;>==<;=@>()$$$-336...5;=>>;.&
@71c10016-715b-4bb8-8808-2059ecc38312
ATACCTTCCGTTCA...TGCGGCCGCCCATAGC
+
###$%-.%/088<+-2...>?ADPC?@BJ

After getting the barcode lists per cell type using Seurat, I merged the fasta and fastq files of the same cell type:

$ cat \
   cell_173_CACGTTCGTATGTCCA.merged.fasta \
   cell_229_TGCCGAGCATGACGTT.merged.fasta \
   > Pooled_reads.fasta

$ cat \
   cell_173_CACGTTCGTATGTCCA.merged.subreads.fastq \
   cell_229_TGCCGAGCATGACGTT.merged.subreads.fastq \
  > pooled.fastq

I ran Mandalorion on the pooled reads:

python3 Mandalorion_nonqsub.sh \
 -c ${cfg} \
 -s 500 \
 -g ${gtf_fn_step10} \
 -G ${ref} \
 -a ${adapter} \
 -f pooled.fasta \
 -b pooled.fastq \
 -p ${newdir}/mandalorian_output \
 -O 0,70,0,70 \
 -t ${thread_step10} \
 -e TGGG,AAAA

I got the following error message:

[M::mm_idx_gen::66.980*1.80] collected minimizers
[M::mm_idx_gen::75.689*3.01] sorted minimizers
[M::main::75.689*3.01] loaded/built the index for 25 target sequence(s)
[M::mm_mapopt_update::80.209*2.90] mid_occ = 748
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 25
[M::mm_idx_stat::82.165*2.85] distinct minimizers: 167178949 (35.49% are singletons); average occurrences: 5.986; average spacing: 3.086
[M::worker_pipeline::215.486*10.32] mapped 30792 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -G 400k --secondary=no -ax splice:hq -t 15 Homo_sapiens_assembly38.fasta Pooled_reads.fasta
[M::main] Real time: 215.642 sec; CPU: 2224.208 sec; Peak RSS: 27.709 GB
SAM from mm: false
Took 844.717088ms to run.
rm: cannot remove 'mandalorian_output//parsed_reads/': No such file or directory
rm: cannot remove 'mandalorian_output//mp/': No such file or directory
Using medaka from your path, not the config file.
rm: cannot remove 'mandalorian_output//mp': No such file or directory
Traceback (most recent call last):
  File "createConsensi.py", line 318, in <module>
    main()
  File "createConsensi.py", line 303, in main
    determine_consensus(name, fasta, fastq, str(counter))
  File "createConsensi.py", line 159, in determine_consensus
    fastq_reads_full, fastq_reads_partial = read_fastq_file(fastq)
  File "createConsensi.py", line 122, in read_fastq_file
    number = int(name_root[1])
IndexError: list index out of range

Could you help me with this issue?
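
The IndexError at `int(name_root[1])` suggests that `read_fastq_file` splits each subread header on "_" and parses the second field as an integer, which the bare UUID headers shown above cannot satisfy. Below is a minimal sketch of one possible workaround, assuming (this is not confirmed by the source) that a header of the form `@<read id>_<subread number>` is what the parser expects. The function name `number_subreads` and the per-read numbering from 0 are illustrative choices, not part of Mandalorion.

```python
def number_subreads(fastq_in, fastq_out):
    """Append `_<n>` to every FASTQ header, where n counts the subreads
    seen so far for that read ID (0, 1, 2, ...).

    Assumes strict 4-line FASTQ records. Records are read four lines at a
    time instead of searching for '@', because quality strings may also
    start with '@'.
    """
    seen = {}  # read ID -> number of subreads already written
    while True:
        header = fastq_in.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        seq = fastq_in.readline()
        plus = fastq_in.readline()
        qual = fastq_in.readline()

        read_id = header.rstrip("\n")[1:]  # drop the leading '@'
        n = seen.get(read_id, 0)
        seen[read_id] = n + 1
        fastq_out.write(f"@{read_id}_{n}\n{seq}{plus}{qual}")
```

Called with two open file handles (for example `pooled.fastq` for reading and a new file for writing), two records that share a read ID would come out as `<id>_0` and `<id>_1`.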

How does Mandalorion resolve cases where a single read supports multiple isoforms?

If a single read supports two or more isoforms that are not internal to each other,
how does Mandalorion resolve the situation?

Does it divide the read count by the number of isoforms the read supports?
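
To make the question concrete, here is a small sketch of the fractional scheme being asked about, in which each read contributes 1/k to each of the k isoforms it supports. This only illustrates the hypothetical scheme; whether Mandalorion does anything like this is exactly what the question asks, and the names `fractional_counts` and `read_to_isoforms` are made up for the example.

```python
from collections import defaultdict


def fractional_counts(read_to_isoforms):
    """Split each read's count evenly across the isoforms it supports.

    `read_to_isoforms` maps a read ID to the list of isoform IDs that the
    read is compatible with (a hypothetical input for this illustration).
    """
    counts = defaultdict(float)
    for isoforms in read_to_isoforms.values():
        if not isoforms:
            continue  # read supports no isoform, contributes nothing
        share = 1.0 / len(isoforms)
        for isoform in isoforms:
            counts[isoform] += share
    return dict(counts)
```

With one read supporting isoforms A and B and a second read supporting only A, this scheme would assign 1.5 to A and 0.5 to B.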

How to convert Mandalorion output PSL to SQANTI input GTF

What tool do you use to convert PSL to GTF?

I found some scripts for this purpose in various repositories, but none of the ones I tried worked, and they are no longer maintained.
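
For reference, a conversion can be sketched directly from the two format definitions. The function below is a minimal, unofficial sketch: it assumes a standard 21-column, tab-separated nucleotide PSL line with a single-character strand, treats every PSL block as one exon (small alignment gaps are not merged), and uses the query name as both `gene_id` and `transcript_id`, which is a placeholder choice. The name `psl_to_gtf_lines` and the `source` label are illustrative.

```python
def psl_to_gtf_lines(psl_line, source="mandalorion"):
    """Convert one PSL alignment line into GTF transcript and exon lines.

    PSL target coordinates are 0-based, half-open; GTF coordinates are
    1-based, inclusive, so starts get +1 and ends stay as they are.
    """
    fields = psl_line.rstrip("\n").split("\t")
    strand, name, chrom = fields[8], fields[9], fields[13]
    t_start, t_end = int(fields[15]), int(fields[16])
    # blockSizes and tStarts are comma-separated lists with a trailing comma
    block_sizes = [int(x) for x in fields[18].split(",") if x]
    block_starts = [int(x) for x in fields[20].split(",") if x]

    attrs = f'gene_id "{name}"; transcript_id "{name}";'

    def gtf(feature, start, end):
        return "\t".join(
            [chrom, source, feature, str(start), str(end), ".", strand, ".", attrs]
        )

    lines = [gtf("transcript", t_start + 1, t_end)]
    for start, size in zip(block_starts, block_sizes):
        lines.append(gtf("exon", start + 1, start + size))
    return lines
```

Applying this to every non-header line of the PSL file and writing the returned lines to a file yields a GTF with one transcript per isoform. Whether SQANTI accepts it without further attributes should be checked against the SQANTI documentation.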

medaka command error

I temporarily resolved the previous issue (#2) by adding unique numbers starting from 0,

but I ran into another issue with medaka, as follows:

Traceback (most recent call last):
  File "createConsensi.py", line 318, in <module>
    main()
  File "createConsensi.py", line 303, in main
    determine_consensus(name, fasta, fastq, str(counter))
  File "createConsensi.py", line 273, in determine_consensus
    temp_folder + '/' + counter, temp_folder + '/' + counter)

sError: Command 'medaka -f -i  mp/1_subsampled.fastq -d  mp/1_consensus_1.fasta -o mp/1 > mp/1_medaka_messages.txt 2>&1' returned non-zero exit status 2

(I changed os.system to subprocess.run to propagate exceptions.)

Running the medaka command directly gave this error:

usage: medaka [-h] [--version]
              {compress_bam,features,train,consensus,smolecule,consensus_from_features,fastrle,stitch,variant,snp,tools}
              ...
medaka: error: argument command: invalid choice: 'mp/1_subsampled.fastq' (choose from 'compress_bam', 'features', 'train', 'consensus', 'smolecule', 'consensus_from_features', 'fastrle', 'stitch', 'variant', 'snp', 'tools')

I initially installed medaka using this command:

conda create -n medaka -c conda-forge -c bioconda medaka
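
The usage message shows that the installed medaka expects a subcommand (such as `consensus` or `smolecule`) as its first argument, so the pipeline's `medaka -f -i ...` call fails at argument parsing. Which invocation is the right replacement depends on the medaka version and is not settled here. The sketch below only illustrates the `os.system` to `subprocess.run` change mentioned above: with `check=True`, a non-zero exit status raises `subprocess.CalledProcessError` instead of being silently ignored. The helper name `run_logged` is hypothetical.

```python
import subprocess


def run_logged(cmd, log_path):
    """Run `cmd` (a list of arguments), write stdout and stderr to
    `log_path`, and raise CalledProcessError on a non-zero exit status.
    """
    with open(log_path, "w") as log:
        subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)
```

Passing the command as a list avoids shell quoting problems, and the log file takes the place of the `> ... 2>&1` redirection in the original command string.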

Newbie question - Can I even use Mandalorion?

Hi. I just downloaded our very first batch of ONT cDNA reads from the contractor we use for sequencing, and I wish to assemble transcripts from these data. I searched online for potential assemblers and found a master's thesis that evaluated Mandalorion, so I went ahead, downloaded Mandalorion, and got it set up. But then I saw that Mandalorion takes as input reads produced by R2C2. From looking around online, it seems to me that R2C2 is not a standard process, so it is very likely that the "standard workflow" ONT data our contractor generated is not the result of R2C2. Assuming I am right that my data are not R2C2, does that mean there is no way for me to use Mandalorion?

Thanks in advance for answering this question, and I apologize for having to ask such a basic question.

John Martinson
