Building SuperTranscripts: A linear representation of transcriptome data
License: Other
After installing Lace:
conda env create -f environment.yml
conda activate lace
pip install .
running Lace returns:
Lace/Lace_run.py -h
Traceback (most recent call last):
File "/share/software/Lace/latest/Lace/Lace_run.py", line 14, in <module>
from Lace.BuildSuperTranscript import SuperTran
File "/home/rna/software/miniconda3/envs/lace/lib/python3.9/site-packages/Lace/BuildSuperTranscript.py", line 11, in <module>
import networkx as nx
File "/home/rna/software/miniconda3/envs/lace/lib/python3.9/site-packages/networkx/__init__.py", line 114, in <module>
import networkx.generators
File "/home/rna/software/miniconda3/envs/lace/lib/python3.9/site-packages/networkx/generators/__init__.py", line 14, in <module>
from networkx.generators.intersection import *
File "/home/rna/software/miniconda3/envs/lace/lib/python3.9/site-packages/networkx/generators/intersection.py", line 13, in <module>
from networkx.algorithms import bipartite
File "/home/rna/software/miniconda3/envs/lace/lib/python3.9/site-packages/networkx/algorithms/__init__.py", line 16, in <module>
from networkx.algorithms.dag import *
File "/home/rna/software/miniconda3/envs/lace/lib/python3.9/site-packages/networkx/algorithms/dag.py", line 23, in <module>
from fractions import gcd
ImportError: cannot import name 'gcd' from 'fractions' (/home/rna/software/miniconda3/envs/lace/lib/python3.9/fractions.py)
Searched 68096 bases in 10 sequences
add_edge() takes exactly 3 arguments (4 given)
FAILED to construct
add_edge() takes exactly 3 arguments (4 given)
FAILED to construct
add_edge() takes exactly 3 arguments (4 given)
FAILED to construct
add_edge() takes exactly 3 arguments (4 given)
FAILED to construct
The following lines are reported and then the process stops.
Bypass the I/O by writing an aligner or incorporating BLAT using Cython?
I think Lace still isn't handling the strand properly, e.g. see RECK in /mnt/storage/nadiad/work_area/20160203_ALL/simulation/SIM/lace on our server
input files are all.fasta and all.groupings
in /mnt/storage/nadiad/work_area/20160203_ALL/simulation/SIM
Lace was installed using the conda instructions on the wiki.
This newest version of Lace (including commenting out the print_exception line) returns a file where some sequences have a -1 count for transcripts and whirls. This seems unusual.
Ideally, the output would describe which blocks belong to which transcripts,
and include something to indicate that a bubble was burst or that the longest contig was returned.
e.g.
NDUFV2.fasta Number of transcripts: 4, Bubbles broken
[chej2tc@mu01 Example]$ /GS01/software/biosoft/python/python3.5/bin/python /GS01/software/biosoft/Lace-1.00/Lace.py -o test2 Example_Genome.fasta clusters.txt
(Lace ASCII-art banner)
Lace Version: 0.82
Last Editted: 30/01/17
Creating output directory
Creating dictionary of transcripts in clusters...
Creating a fasta file per gene...
Now Building SuperTranscript for each gene...
sh: blat: command not found
FAILED to construct
sh: blat: command not found
FAILED to construct
BUILT SUPERTRANSCRIPTS ---- 0.11878800392150879 seconds ----
Dear Oshlack,
I ran Lace.py with this command, and although the output file was created, there were two FAILED messages. I don't know why it failed. Is there a package I didn't install correctly, or is this expected when running Lace?
Another question:
I don't understand how to use the "ClusterFile". For the species I study I only have a Trinity assembly, and I don't know how to create the ClusterFile. Also, why do we need a ClusterFile?
"GenomeFile" makes me think of the genome reference fasta file.
Does a cluster sequence in the "SuperDuper.fasta" file correspond to a unigene? If so, can I run the regular non-model-species transcriptome pipeline, i.e. annotate these cluster genes (unigenes) and do downstream differential expression analysis?
I made an assembly of Arabidopsis using Trinity and am trying to run Lace; however, I am getting errors. I figured out where it goes wrong, but not why.
In the function BuildGraph, between #Copy graph before simplifying and ####### Whirl Elimination ######################, I get a runtime error: dictionary changed size during iteration.
And in the first BLAT run, I sometimes get the error:
add_edge() takes 3 positional arguments but 4 were given
I am working with Python 3.6 and used your method to install Lace. Any tips?
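The add_edge() error matches the signature change between networkx 1.x and 2.x: 1.x accepted an attribute dictionary as an extra positional argument, while 2.x only accepts keyword attributes. A minimal stand-in (hypothetical code, neither networkx nor Lace itself) reproduces the message:

```python
# Stand-in mirroring the networkx 2.x Graph.add_edge signature
# (self, u, v, **attr) -- an illustration, not networkx code.
class Graph:
    def __init__(self):
        self.edges = {}

    def add_edge(self, u, v, **attr):
        self.edges[(u, v)] = attr

G = Graph()
try:
    # networkx 1.x style: attribute dict passed positionally
    G.add_edge("n1", "n2", {"weight": 3})
except TypeError as e:
    print(e)  # add_edge() takes 3 positional arguments but 4 were given

# networkx 2.x style: unpack the dict into keyword arguments
G.add_edge("n1", "n2", **{"weight": 3})
```

In other words, code written against networkx 1.x must either be ported to keyword-style calls or run in an environment with networkx 1.x pinned.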
I have a more conceptual question. I have used Lace to create the reference SuperTranscript from de novo assembly in order to call variants between 2 individuals reared under 2 conditions (4 samples/4 libraries/4 vcf files). Reads used for calling SNPs were the same used for the de novo assembly, which was performed by trinity.
I would like to ask why only heterozygous SNPs, defined as those with at least one read supporting the reference allele, should be further analysed. I thought homozygous SNPs would be more informative for me, as I want to detect any differences between the 2 strains.
Are these heterozygous SNPs those that are represented by GT:0/1 in the vcf files?
Thanks! Sofia
On a large dataset (made from 30 mouse samples, of different tissues, 100M RNAseq reads per sample) Lace consistently stalls without error. I traced this to excessive memory usage (>200GB of RAM), which exceeds our capacity to run the program on the whole dataset.
The de novo assembly was conducted in Trinity and the clustering was done using the necklace protocol: https://github.com/Oshlack/necklace.
Hi there,
Lace & SuperTranscripts sound excellent for non-model organisms without a reference genome. I'd love to try it, though my application would be slightly different. I think it may still work, but I'd like your opinion.
I work with single celled eukaryotic algae. While they don't seem to usually splice their transcripts, they are riddled with paralogs which they transcribe. I'd like to use this method to compare paralogs by treating them the same as splice variants. Do you see any problems with this?
Cheers!
Hello!
Could you please help me resolve an issue I encountered while using Lace. Every time I run it on my dataset, the job finishes without any warnings, but nothing is produced. Lace works successfully on the test data. I have Corset-produced clusters.
I have no idea how to resolve the issue. I have been reinstalling and reconfiguring Lace and trying different SLURM parameters for weeks.
Best regards
Asan
Problem reported by user over email.
Appears to be a single cluster that uses all the memory.
User sent data and the problem was reproduced. We need to investigate Cluster-16676.1839
I noticed during my use of Lace that when I included networkx v2 in my conda environment, it ran without error but created an incorrect SuperTranscriptome where all "whirl" counts were set to 0 and no case change occurred in the sequences.
It would be helpful to raise this error at runtime so that new users can identify it and adjust their environment setup accordingly (especially since many other programs require networkx v2).
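The requested check could look something like the hypothetical guard below (not part of Lace), which could be called once at startup with networkx.__version__:

```python
# Hypothetical startup guard (not in Lace itself): refuse to run when the
# installed networkx major version differs from the one Lace was written for.
def check_networkx_version(version: str, required_major: int = 1) -> None:
    major = int(version.split(".")[0])
    if major != required_major:
        raise RuntimeError(
            f"Lace requires networkx {required_major}.x but found {version}; "
            "please adjust your environment (e.g. pin networkx in environment.yml)."
        )

check_networkx_version("1.11")      # passes silently
# check_networkx_version("2.6.3")   # raises RuntimeError
```

Failing fast with an explicit message would prevent the silent, incorrect output described above.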
Hello,
Lace always stalls, without stopping or giving any error...
The individual fasta files are generated and SuperTranscripts are then built (some using large amounts of RAM, up to 256 GB plus 128 GB of swap), but after a while the program stalls for a long time. After I stop it with Ctrl-C, the following output is given. Any ideas what goes wrong?
^CProcess ForkPoolWorker-5:
Traceback (most recent call last):
File "/data/analysis/Dietmar/SW/Supertranscript/Lace-master/Lace.py", line 192, in
Split(args.GenomeFile,args.ClusterFile,args.cores,args.maxTran,args.outputDir)
File "/data/analysis/Dietmar/SW/Supertranscript/Lace-master/Lace.py", line 136, in Split
pool.join()
File "/data/analysis/Dietmar/SW/Anaconda/Anaconda3/envs/lace/lib/python3.6/multiprocessing/pool.py", line 510, in join
self._worker_handler.join()
File "/data/analysis/Dietmar/SW/Anaconda/Anaconda3/envs/lace/lib/python3.6/threading.py", line 1056, in join
self._wait_for_tstate_lock()
File "/data/analysis/Dietmar/SW/Anaconda/Anaconda3/envs/lace/lib/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
Traceback (most recent call last):
KeyboardInterrupt
File "/data/analysis/Dietmar/SW/Anaconda/Anaconda3/envs/lace/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/data/analysis/Dietmar/SW/Anaconda/Anaconda3/envs/lace/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/data/analysis/Dietmar/SW/Anaconda/Anaconda3/envs/lace/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/data/analysis/Dietmar/SW/Anaconda/Anaconda3/envs/lace/lib/python3.6/multiprocessing/queues.py", line 343, in get
res = self._reader.recv_bytes()
File "/data/analysis/Dietmar/SW/Anaconda/Anaconda3/envs/lace/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/data/analysis/Dietmar/SW/Anaconda/Anaconda3/envs/lace/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/data/analysis/Dietmar/SW/Anaconda/Anaconda3/envs/lace/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
KeyboardInterrupt
Hello!
Here is an issue I have recently encountered: the Python version was incompatible with the networkx package.
Traceback (most recent call last):
File "/project/_app/Lace/Lace/Lace_run.py", line 14, in <module>
from Lace.BuildSuperTranscript import SuperTran
File "/home/user/miniconda3/envs/lace/lib/python3.11/site-packages/Lace/BuildSuperTranscript.py", line 11, in <module>
import networkx as nx
File "/home/user/miniconda3/envs/lace/lib/python3.11/site-packages/networkx/__init__.py", line 114, in <module>
import networkx.generators
File "/home/user/miniconda3/envs/lace/lib/python3.11/site-packages/networkx/generators/__init__.py", line 14, in <module>
from networkx.generators.intersection import *
File "/home/user/miniconda3/envs/lace/lib/python3.11/site-packages/networkx/generators/intersection.py", line 13, in <module>
from networkx.algorithms import bipartite
File "/home/user/miniconda3/envs/lace/lib/python3.11/site-packages/networkx/algorithms/__init__.py", line 16, in <module>
from networkx.algorithms.dag import *
File "/home/user/miniconda3/envs/lace/lib/python3.11/site-packages/networkx/algorithms/dag.py", line 23, in <module>
from fractions import gcd
ImportError: cannot import name 'gcd' from 'fractions' (/home/user/miniconda3/envs/lace/lib/python3.11/fractions.py)
Have a look at a compatibility table: fractions.gcd(a, b) was removed in Python 3.9, and math.gcd(a, b) (available since Python 3.5) is its replacement. Either a recent networkx version should be used or the Python version downgraded.
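A minimal illustration of the change, written as a generic compatibility shim (not networkx code):

```python
# fractions.gcd existed up to Python 3.8 (deprecated since 3.5); from
# Python 3.9 onward only math.gcd is available, with the same behaviour.
try:
    from fractions import gcd  # Python <= 3.8 only
except ImportError:
    from math import gcd       # Python >= 3.5; the only option on >= 3.9

print(gcd(12, 18))  # 6
```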
Best regards
Asan
Add functionality to:
Sadly, networkx 2.3 clashes with Python 3.9, since fractions.gcd() has been removed in favour of math.gcd(). Because of that, running Lace after installation causes an error:
ImportError: cannot import name 'gcd' from 'fractions'
Pinning the version of Python in Lace-1.14.1/environment.yml fixes this problem. I tried python=3.5, and now it works.
Could you add an appropriate shebang (#!/usr/bin/env python) to the various Python scripts? Users (and Bioconda) won't need to modify them then.
In the example "Differential Transcript Usage on a non-model organism", the script for the DTU analysis requires biological replicates. Is it possible to do the analysis without biological replicates? The DTU script from the example (voom_diff.R) doesn't run without them.
I modified the script and it now runs without biological replicates, but I don't know if the analysis is correct.
My script is this:
# library
library('edgeR')
## Read in data
counts <- read.table("Counts/counts.txt",header=TRUE,sep="\t")
##Define groups
treatment = c('T1','T2')
## Make exon id
eid = paste0(counts$Chr,"-S",counts$Start,"-E",counts$End)
## Define design matrix
design <- model.matrix(~treatment)
## Make DGElist and normalise
dx <- DGEList(counts[,c(7:8)])
dx <- calcNormFactors(dx, group = treatment)
## glmFit
gfit <- glmFit(dx, design, dispersion = 0.1)
ds <- diffSpliceDGE(gfit, geneid = counts$Chr, exonid = eid)
## Results
topSpliceDGE(ds, number = 20, test = "Simes")
plotSpliceDGE(ds)
I found a few odd entries in the chicken SuperDuperTrans.gff for the genes AKAP2 and FAM188B: blocks are annotated beyond the length of the SuperTranscript. Both of these genes have another gene whose name includes theirs (PALM2-AKAP2 and INMT-FAM188B), and I think the annotation of the two is getting confused. Some output from a command that ran into the issue is pasted below.
Feature (AKAP2:4374-5634) beyond the length of AKAP2 size (3007 bp). Skipping.
Feature (AKAP2:5637-7198) beyond the length of AKAP2 size (3007 bp). Skipping.
Feature (FAM188B:2849-3179) beyond the length of FAM188B size (753 bp). Skipping.
Feature (FAM188B:3179-3409) beyond the length of FAM188B size (753 bp). Skipping.
Feature (FAM188B:2849-3179) beyond the length of FAM188B size (753 bp). Skipping.
Feature (FAM188B:3179-3237) beyond the length of FAM188B size (753 bp). Skipping.
Feature (FAM188B:3238-3282) beyond the length of FAM188B size (753 bp). Skipping.
Hi
on the https://github.com/Oshlack/Lace/wiki/Usage-Documentation page the command
python Lace.py Example/Example_Genome.fasta Example/clusters.txt -t -o Test
does not work because it should be
python3 Lace.py Example/Example_Transcripts.fasta Example/clusters.txt -t -o Test
Add Logger to script
Hello!
I configured a virtual environment using your environment.yml file and changed the Python version to 3.9. While running Lace on the test data, this error occurred:
'DiGraph' object has no attribute 'node'
The suggestion on StackExchange is to change the networkx version to 1.1 or to modify the files used by DiGraph.
What is your preferred way to solve the issue?
Best regards
Asan
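For context, Graph.node was a deprecated alias that networkx removed in 2.4 in favour of Graph.nodes. A minimal stand-in (not networkx code) reproduces the error and shows the fix:

```python
# Stand-in illustrating the rename: networkx 2.4 removed the Graph.node
# alias, keeping only Graph.nodes. This class only mimics that surface.
class DiGraph:
    def __init__(self):
        self.nodes = {}   # the networkx 2.x name; 1.x also exposed .node

G = DiGraph()
G.nodes["n1"] = {"base": "A"}

# Code written for networkx 1.x accesses G.node[...] and fails on 2.4+:
try:
    G.node["n1"]
except AttributeError as e:
    print(e)  # 'DiGraph' object has no attribute 'node'

# Fix: use G.nodes[...] instead (or pin networkx < 2.4 in environment.yml).
print(G.nodes["n1"]["base"])  # A
```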
The 9th column should give the gene ID and/or transcript ID, not be ".".
Suggestion: so that typing "ls" in the directory where ribbon ran doesn't take forever, and so the final .fasta and .psl files are quick to identify.
Error reported at the end of a run (to do with clean up).
It doesn't seem to affect the results.
e.g.
ZNF385A.fasta Number of transcripts: 1
should be
ZNF385A Number of transcripts: 1
So it matches the cluster ID.
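A one-line sketch of a possible fix (hypothetical, not Lace's actual code) that strips the extension so the printed name matches the cluster ID:

```python
import os

# Strip the ".fasta" extension from the per-gene file name before printing,
# so the reported name matches the cluster ID.
fname = "ZNF385A.fasta"
cluster_id = os.path.splitext(os.path.basename(fname))[0]
print(f"{cluster_id} Number of transcripts: 1")  # ZNF385A Number of transcripts: 1
```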
Hello team of the Oshlack lab,
do you have experience with BUSCO analysis on SuperTranscript data?
I have used Corset and Lace to cluster and stitch plant transcriptome assemblies. Afterwards, BUSCO did not find a lot of the expected orthologs. However, when using OrthoFinder (which uses BLAST/DIAMOND) to find orthologs in additional species, the assemblies looked more complete.
Do you think, SuperTranscripts are in principle compatible with BUSCO?
Could you think of an alternative way to check the completeness of the SuperTranscript assemblies?
Thank you,
Maria
At least featureCounts expects it. A one-line example:
PYGL SuperTranscript exon 0 301 . . 0
vs.
PYGL SuperTranscript exon 0 301 . . 0 .
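Until that is fixed upstream, one possible workaround (a hypothetical post-processing helper, not part of Lace) is to pad each GFF line to the nine tab-separated fields featureCounts expects:

```python
# Pad a GFF line to the expected number of tab-separated fields, using "."
# for any missing trailing columns (here, the 9th "attributes" column).
def pad_gff_line(line: str, n_fields: int = 9) -> str:
    fields = line.rstrip("\n").split("\t")
    fields += ["."] * (n_fields - len(fields))
    return "\t".join(fields)

fixed = pad_gff_line("PYGL\tSuperTranscript\texon\t0\t301\t.\t.\t0")
print(fixed.count("\t"))  # 8 tabs -> 9 fields
```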