rvolden / mandalorion
Pipeline to identify isoforms from full-length cDNA sequencing data
License: MIT License
Dear Rodger,
Could you help me with the analysis of cell-type-specific isoforms?
After obtaining the cell barcode list for each cell type, I pooled the reads.fasta files and the subreads.fastq files into single files (pooled.fasta and pooled.fastq, respectively) and ran Mandalorion, but it failed.
The problem seems to arise because the read IDs and subread IDs have been trimmed of the additional information separated by "_".
Here are the details of what I did:
$ head cell_208_GTTTGGAAGTGTCATC.merged.fasta
>71c10016-715b-4bb8-8808-2059ecc38312
AGACGTTCTTCGCCGA....ATGACACTTCCAAAC
>b81eb912-76a4-4ac1-b18c-f47010c334f5
GCTCTTTCTCAGTGA.....CCGGGTGGTTTGCTT
$ head cell_208_GTTTGGAAGTGTCATC.merged.subreads.fastq
@71c10016-715b-4bb8-8808-2059ecc38312
TATTGTGTACCTTTTGCTAG...CGGCCGCCCA
+
>?;<;;>==<;=@>()$$$-336...5;=>>;.&
@71c10016-715b-4bb8-8808-2059ecc38312
ATACCTTCCGTTCA...TGCGGCCGCCCATAGC
+
###$%-.%/088<+-2...>?ADPC?@BJ
$ cat \
cell_173_CACGTTCGTATGTCCA.merged.fasta \
cell_229_TGCCGAGCATGACGTT.merged.fasta \
> Pooled_reads.fasta
$ cat \
cell_173_CACGTTCGTATGTCCA.merged.subreads.fastq \
cell_229_TGCCGAGCATGACGTT.merged.subreads.fastq \
> pooled.fastq
python3 Mandalorion_nonqsub.sh \
-c ${cfg} \
-s 500 \
-g ${gtf_fn_step10} \
-G ${ref} \
-a ${adapter} \
-f pooled.fasta \
-b pooled.fastq \
-p ${newdir}/mandalorian_output \
-O 0,70,0,70 \
-t ${thread_step10} \
-e TGGG,AAAA
[M::mm_idx_gen::66.980*1.80] collected minimizers
[M::mm_idx_gen::75.689*3.01] sorted minimizers
[M::main::75.689*3.01] loaded/built the index for 25 target sequence(s)
[M::mm_mapopt_update::80.209*2.90] mid_occ = 748
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 25
[M::mm_idx_stat::82.165*2.85] distinct minimizers: 167178949 (35.49% are singletons); average occurrences: 5.986; average spacing: 3.086
[M::worker_pipeline::215.486*10.32] mapped 30792 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -G 400k --secondary=no -ax splice:hq -t 15 Homo_sapiens_assembly38.fasta Pooled_reads.fasta
[M::main] Real time: 215.642 sec; CPU: 2224.208 sec; Peak RSS: 27.709 GB
SAM from mm: false
Took 844.717088ms to run.
rm: cannot remove 'mandalorian_output//parsed_reads/': No such file or directory
rm: cannot remove 'mandalorian_output//mp/': No such file or directory
Using medaka from your path, not the config file.
rm: cannot remove 'mandalorian_output//mp': No such file or directory
Traceback (most recent call last):
File "createConsensi.py", line 318, in <module>
main()
File "createConsensi.py", line 303, in main
determine_consensus(name, fasta, fastq, str(counter))
File "createConsensi.py", line 159, in determine_consensus
fastq_reads_full, fastq_reads_partial = read_fastq_file(fastq)
File "createConsensi.py", line 122, in read_fastq_file
number = int(name_root[1])
IndexError: list index out of range
Could you help me with this issue?
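The traceback above suggests that createConsensi.py splits each subread name on "_" and expects an integer in the second field, which the trimmed IDs no longer carry. The following is a minimal sketch of one possible workaround, not a confirmed fix: it appends a per-read counter to every FASTQ header. The function name `number_subreads`, the `_<n>` suffix scheme, and the assumption of strict 4-line FASTQ records are all illustrative; the naming that Mandalorion actually expects is not stated in this thread.

```python
from collections import defaultdict


def number_subreads(fastq_lines):
    """Yield FASTQ lines with '_<n>' appended to each header's read ID.

    Assumption: every record spans exactly 4 lines, so headers are found
    by position rather than by a leading '@' (quality strings may also
    start with '@').
    """
    seen = defaultdict(int)  # read ID -> number of subreads seen so far
    for index, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if index % 4 == 0:
            # Header line: keep only the ID, drop any trailing description.
            read_id = line[1:].split()[0]
            yield f"@{read_id}_{seen[read_id]}"
            seen[read_id] += 1
        else:
            yield line


if __name__ == "__main__":
    # Hypothetical two-subread example with shortened sequences.
    example = [
        "@71c10016-715b-4bb8-8808-2059ecc38312", "TATTGTGT", "+", ">?;<;;>=",
        "@71c10016-715b-4bb8-8808-2059ecc38312", "ATACCTTC", "+", "###$%-.%",
    ]
    for out_line in number_subreads(example):
        print(out_line)
```

With headers rewritten this way, splitting a name on "_" yields a second field that `int()` can parse. Whether the numbering should restart for each read or run across the whole file is a guess here and should be checked against the upstream tooling.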
If a single read supports two or more isoforms that are not internal to each other,
how does Mandalorion resolve the situation?
Does it divide the read count by the number of isoforms that the read supports?
What tool do you use to convert psl to gtf?
I found scripts for this purpose in several repositories, but none of the ones I tried worked, and they are no longer maintained.
I temporarily resolved the previous issue (#2) by adding unique numbers starting from 0,
but I then ran into another issue with medaka, as follows:
Traceback (most recent call last):
File "createConsensi.py", line 318, in <module>
main()
File "createConsensi.py", line 303, in main
determine_consensus(name, fasta, fastq, str(counter))
File "createConsensi.py", line 273, in determine_consensus
temp_folder + '/' + counter, temp_folder + '/' + counter)
sError: Command 'medaka -f -i mp/1_subsampled.fastq -d mp/1_consensus_1.fasta -o mp/1 > mp/1_medaka_messages.txt 2>&1' returned non-zero exit status 2
(I changed os.system to subprocess.run to propagate exceptions.)
Running the medaka command directly gave this error:
usage: medaka [-h] [--version]
{compress_bam,features,train,consensus,smolecule,consensus_from_features,fastrle,stitch,variant,snp,tools}
...
medaka: error: argument command: invalid choice: 'mp/1_subsampled.fastq' (choose from 'compress_bam', 'features', 'train', 'consensus', 'smolecule', 'consensus_from_features', 'fastrle', 'stitch', 'variant', 'snp', 'tools')
I initially installed medaka using this command
conda create -n medaka -c conda-forge -c bioconda medaka
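The change from os.system to subprocess.run mentioned above can be sketched as follows. This is an illustration of the general technique only, not the code in createConsensi.py: the helper name `run_checked` is hypothetical, and a small Python one-liner stands in for the medaka call so that the example runs without medaka installed.

```python
import subprocess
import sys


def run_checked(command):
    """Run a shell command string and raise if it exits non-zero.

    os.system(command) only returns the exit status, which is easy to
    ignore; subprocess.run(..., check=True) raises CalledProcessError.
    """
    return subprocess.run(command, shell=True, check=True)


if __name__ == "__main__":
    # Stand-in for the failing medaka call: a command that exits with status 2.
    failing = f'"{sys.executable}" -c "import sys; sys.exit(2)"'
    try:
        run_checked(failing)
    except subprocess.CalledProcessError as err:
        print("command failed with exit status", err.returncode)
```

Because the command string is passed through the shell, redirections such as `> file 2>&1` keep working as they did under os.system.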
Hi. I just downloaded our very first batch of ONT cDNA reads from the contractor we use for sequencing, and I wish to assemble transcripts from these data. I searched online for potential assemblers and found a master's thesis that evaluated Mandalorion, so I went ahead and downloaded Mandalorion and set it up. But then I saw that Mandalorion takes as input reads that resulted from R2C2. From looking around online a bit, it seems to me that R2C2 is not a standard process, so it is extremely likely that the "standard workflow" ONT data our contractor generated are not the result of R2C2. Assuming I am right that my data are not R2C2, does that mean there is no way for me to use Mandalorion?
Thanks in advance for answering this question, and I apologize for having to ask such a basic question.
John Martinson