fenyolab / l1em Goto Github PK

View Code? Open in Web Editor NEW

30.0 12.0 9.0 6.31 MB

Estimate locus specific human LINE-1 expression.

License: Other

Python 68.92% Shell 30.86% Dockerfile 0.21%

l1em's Introduction

Installation

conda way

You will need

git (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
anaconda (https://docs.anaconda.com/anaconda/install/)

Download from github

git clone https://github.com/FenyoLab/L1EM

Create conda environment

cd L1EM
conda env create -f L1EM.yml

Before running L1EM, activate the environment:

source activate L1EM

When finished, deactivate the environment:

source deactivate L1EM

old way

Alternatively you can install the following dependencies yourself:

python version 2.7+ (version 2.7 tested)
bwa (version 0.7.17 tested)
samtools (version 1.9 tested)
numpy (version 1.14.3 tested)
scipy (version 1.1.0 tested)
pysam (version 0.15.0 tested)
bedtools (version 2.27.1 tested)

No compiling of L1EM is necessary. Python scripts will be called from inside the L1EM directory.

If necessary, you can specify the path for bwa and samtools in the run_L1EM.sh script. You must use samtools >=1.0. Early version of pysam will not work. I highly recommend that you use bwa 0.7.17. Earlier versions may differ in how they write the XA tag. This will lead to inaccurate results without throwing an error.

Quick guide

First time: build L1EM reference

You will need the hg38 reference genome in fasta format, with bwa index. Downloaded from UCSC genome browser:

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
zcat hg38.fa.gz > hg38.fa
bwa index hg38.fa

Note: this will take some time.

Then you can build the L1EM reference using the provided shell script:

bash generate_L1EM_fasta_and_index.sh /fullpathto/hg38.fa

This should be done inside the L1EM directory

Executing the L1-EM pipeline

You will need a bam file with strand specific paired end read alignments to hg38. You can use any aligner, but make sure that all reads from the original fastq files are present trimming should be okay, but is tested. Filtering reads will potentially break the pipeline.

First move to an empty directory and then execute the shell script:

bash -e /fullpathto/run_L1EM.sh /fullpathto/alignments.bam /fullpathto/L1EM /fullpathto/hg38.fa

L1EM will write files with specific names, so do NOT run two instances of L1EM in the same directory.

At the end of the run_L1EM.sh script are a commented set of commands to delete all the intermediate files. If you wish to automatically delete intermediate files, you can delete these comments.

Output

At completion, three tab delimited tables will be written.

full_counts.txt: raw count estimates for each L1HS/L1PA* element with any aligned read pairs
l1hs_transcript_counts.txt: expression estimates for L1HS elements, reported as raw counts
filter_L1HS_FPM.txt: L1HS whose expression is supported by at least 100 read pairs, reported as FPM (read pairs per million properly aligned)

The rows of all files are L1 loci.

For full_counts.txt each of the five transcript types: only, runon, passive (sense), passive (antisense), antisense are reported.

For l1hs_transcript_counts.txt and filter_L1HS_FPM.txt only proper transcription from L1HS elements start at the 5' UTR is reported.

The results are also written as pickle files to facilitate further analysis in python. To generate a python dictionary with keys being the transcript names and values being the relative expression:

X_est = dict(zip(pickle.load(open('names_final.pkl')),pickle.load(open('X_final.pkl'))))

Additional details

Our Bioinformatics paper introducing L1EM: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz724/5581349
More details can be found in manual.md

Mouse Version

Scripts and annotation to measure the expression of LINE-1 loci in mm39 has been added. The mouse version uses all the same methodology as the human version, but has not been as rigorously tested.

Download and index the mm39 reference genome (UCSC genome browser version)

wget http://hgdownload.cse.ucsc.edu/goldenPath/mm39/bigZips/mm39.fa.gz
zcat mm39.fa.gz > mm39.fa
bwa index mm39.fa

Build the mm39 L1EM reference.

bash generate_mm39_L1EM_fasta_and_index.sh /fullpathto/mm39.fa

Run L1EM.

bash /fullpathto/run_L1EM_mm39.sh /fullpathto/alignments.bam /fullpathto/L1EM /fullpathto/mm39.fa

All L1Md loci are quantified in full_counts.txt. Normalized expression of 5' UTR intact young (L1Md_Tf I/II/II, L1Md_Gf I/II, L1Md_A I/II/III) LINE-1 loci supported by at least 100 reads can be found in filter_active_L1Md_FPM.txt.

l1em's People

Contributors

Stargazers

Watchers

Forkers

seanchen607 mywanuo troygarry hallvaaw danielfaulkner azrealh xtmgah fw262 felixkrueger

l1em's Issues

Error in L1EM step4: UnboundLocalError: local variable 'new_read_id1' referenced before assignment

Hi, thanks for implementing L1EM!

I have installed L1EM as indicated in the manual and built the human reference as indicated.

I next tried to use the pipeline on one stranded PE RNA-seq sample but at STEP-4 an error rose (pasted below). The pipeline nevertheless continues its run (I guess this is a consequence of the bash resilience, introducing set -e might solve this[?]), and all the txt output files are empty.

I was wondering whether I am somehow mis-using L1EM and what's going on.

ERROR:
STEP 4: G(R) matrix construction
Traceback (most recent call last):
File "/home/fansalon/Software/L1EM//L1EM/G_of_R.py", line 308, in
main()
File "/home/fansalon/Software/L1EM//L1EM/G_of_R.py", line 294, in main
if not (new_read_id1 or new_read_id2):
UnboundLocalError: local variable 'new_read_id1' referenced before assignment

COMMAND:

source activate L1EM
genome=~/Software/L1EM/hg38/hg38.fa
# map reads to hg38
bwa mem -v 1 -t $cpu $genome $reads1 $reads2 > ${name}.sam
# convert sam to bam and sort it
samtools view -@ $cpu -b ${name}.sam -o ${name}.bam
samtools sort -@ $cpu ${name}.bam -o ${name}.sorted.bam
samtools index -@ $cpu ${name}.sorted.bam
## Run L1EM
bash ~/Software/L1EM/run_L1EM.sh ${PWD}/${name}.sorted.bam ~/Software/L1EM/ $genome

Thanks,
Federico

Installation download instructions needs updating

Not a major issue, but the download from github instructions in the README likely needs to be updated from "git clone https://github.com/wmckerrow/L1EM" to "git clone https://github.com/FenyoLab/L1EM."

Error running L1EM

Hi I am following the steps to run L1EM exactly as suggested in your manual (provided absolute paths for all data)
THE CMD:
for i in *bam; do bash /L1EM/run_L1EM.sh /l1em/"${i}" /L1EM /l1em/L1EM_hg38_reference/GRCh38.primary_assembly.genome.fa; done;

I get the foll. error:
THE ERROR LOG:
STEP 1: realign
[E::hts_open_format] Failed to open file 4239.bam
samtools view: failed to open "4239.bam" for reading: No such file or directory
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 0 reads
[main] Version: 0.7.17-r1188
[main] CMD: L1EM/bin/bwa aln -k 3 -n 3 -t 16 -i 20 /l1em/L1EM_hg38_reference/GRCh38.primary_assembly.genome.fa unaligned.fq1
[main] Real time: 2.559 sec; CPU: 2.467 sec
[main] Version: 0.7.17-r1188
[main] CMD: L1EM/bin/bwa aln -k 3 -n 3 -t 16 -i 20 /l1em/L1EM_hg38_reference/GRCh38.primary_assembly.genome.fa unaligned.fq2
[main] Real time: 2.678 sec; CPU: 2.671 sec
[main] Version: 0.7.17-r1188
[main] CMD: L1EM/bin/bwa sampe /l1em/L1EM_hg38_reference/GRCh38.primary_assembly.genome.fa 1.sai 2.sai unaligned.fq1 unaligned.fq2
[main] Real time: 0.003 sec; CPU: 0.002 sec
STEP 2: extract
[E::hts_open_format] Failed to open file 4239.bam
Traceback (most recent call last):
File "/L1EM/utilities/read_or_pair_overlap_bed.py", line 73, in
main()
File "/L1EM/utilities/read_or_pair_overlap_bed.py", line 39, in main
inbam = pysam.AlignmentFile(bamfile,'rb')
File "pysam/libcalignmentfile.pyx", line 734, in pysam.libcalignmentfile.AlignmentFile.cinit
File "pysam/libcalignmentfile.pyx", line 933, in pysam.libcalignmentfile.AlignmentFile._open
IOError: [Errno 2] could not open alignment file 4239.bam: No such file or directory
[E::hts_open_format] Failed to open file temp.bam
samtools sort: can't open "temp.bam": No such file or directory
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 0 reads
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 0 reads
STEP 3: candidate alignments
[bwa_aln] fail to locate the index
[bwa_aln] fail to locate the index
[bwa_sai2sam_pe] fail to locate the index
STEP 4: G(R) matrix construction
[E::hts_open_format] Failed to open file 4239.bam
Traceback (most recent call last):
File "/L1EM/CGC/median_template_and_pairs.py", line 34, in
for read in pysam.AlignmentFile(bamfile):
File "pysam/libcalignmentfile.pyx", line 734, in pysam.libcalignmentfile.AlignmentFile.cinit
File "pysam/libcalignmentfile.pyx", line 933, in pysam.libcalignmentfile.AlignmentFile._open
IOError: [Errno 2] could not open alignment file 4239.bam: No such file or directory
usage: G_of_R.py [-h] -b BAMFILE [-e ERROR_PROB] [-m MAX_START2START_LEN]
[-r READS_PER_PICKLE] [-p PREFIX] [-n NMDIFF] -i INSERT_MEAN
[--flanking FLANKING] [--as_start AS_START] [-w WIGGLE]
[--min_len MIN_LEN] [--min_exon_len MIN_EXON_LEN]
G_of_R.py: error: argument -i/--insert_mean: expected one argument
STEP 5: Expectation maximization
ls: cannot access ../G_of_R/*pk2: No such file or directory
ls: cannot access ../G_of_R/*TE_list.txt: No such file or directory
cp: missing destination file operand after ‘TE_list.txt’
Try 'cp --help' for more information.
Traceback (most recent call last):
File "/L1EM/L1EM/L1EM.py", line 168, in
main()
File "/L1EM/L1EM.py", line 117, in main
for name in open(TE_list):
IOError: [Errno 2] No such file or directory: 'TE_list.txt'
STEP 6: Writing results
Traceback (most recent call last):
File "/L1EM/utilities/report_l1_exp_counts.py", line 29, in
X_est = dict(zip(pickle.load(open('names_final.pkl')),pickle.load(open('X_final.pkl'))))
IOError: [Errno 2] No such file or directory: 'names_final.pkl'
Traceback (most recent call last):
File "/L1EM/utilities/report_l1hs_transcription.py", line 29, in
X_est = dict(zip(pickle.load(open('names_final.pkl')),pickle.load(open('X_final.pkl'))))
IOError: [Errno 2] No such file or directory: 'names_final.pkl'
Traceback (most recent call last):
File "/L1EM/utilities/filtered_and_normalized_l1hs.py", line 25, in
X_est = dict(zip(pickle.load(open(sys.argv[1])),pickle.load(open(sys.argv[2]))))
IOError: [Errno 2] No such file or directory: 'names_final.pkl'
STEP 7: Clean up
cp: cannot stat ‘*final.pkl’: No such file or directory

split: invalid number of lines: ‘41539,6’

Getting this error when executing L1EM
split: invalid number of lines: ‘41539,6’
Not sure what to do.

mm39 L1EM annotation offset

Let me apologise for this issue, we seem to have gotten the intentional +-400 flanking regions wrong. All good, sorry for the inconvenience.

README File Installation Details

I am interested to install the software the old way. It has a list of dependencies to install, but I would also like to see instructions about how to install the L1EM Python software itself. Do I simply copy all of the Python code files into a directory listed in my PYTHONPATH variable or is there more to installing it? Also, there's a duplicated sentence in Output section.

At completion, two tab delimited table will be written. At completion, three tab delimited tables will be written.

Determination of category 1 and category 2 transcripts

Hello,

I was wondering if you can provide some detail on how category 1 (has 5' UTR) and category 2 (no 5' UTR) transcripts are determined. Is this based on blastn results or is this information derived from the RepeatMasker L1HS and L1PA annotations?

Thanks for your help,
Michael

L1EM for Non-strand specific RNAseq datasets

Can L1EM be used to reliably predict line elements from non-strand specific rnaseq datasets?

L1EM using hg19 reference

Hi!

I have read in the wiki that we have to use hg38 reference. However, I was wondering if it is possible to choose another reference such as hg19?

Thanks!

Coordinate or neighbor genes information of non-"only" L1s

Thank you for making this great tool.

Where can I find information about the neighboring genes or coordinates for reads that are classified as run-on, passive, or antisense? I would like to look at the genomic neighborhood of these reads to determine if the L1 sequence retro-transposed to a new genomic location and is expressed from that new genomic location.

Thank you,
Michael

About L1EM/annotation/L1EM.400.bed

Hello,

I find the gene L1TD1 's loci in hg38.gtf from ucsc is chr1 ncbiRefSeq transcript 62194849 62212328 . + . gene_id "L1TD1"; transcript_id "NM_019079.5"; gene_name "L1TD1";

But it is not in the file L1EM/annotation/L1EM.400.bed . ( I use bedtools intersect to check it, still don't have overlap peak with chr1 62194849 62212328 .)

So if i find some RNA-seq files have highly L1TD1 express (use featureCounts and hg38.gtf, bam files is used hisat2), but it is not in this pipeline results file full_counts.txt.

What should I do? Should I add peak chr1 62194849 62212328 to the file L1EM/annotation/L1EM.400.bed ?

Thank you very much for your busy reading my issues.

[E::idx_find_and_load] Could not retrieve index file for '../split_fqs/../split_fqs/L1.al.aln.bam'

Hi!

After running the following command, according to the wiki:

bash -e run_L1EM_unstranded.sh /fullpathto/sampleA.bam /fullpathto/L1EM /fullpathto/hg38.fa

I get these errors:

[main] CMD: /opt/software/Compiler/gcccore/system/bwa/0.7.17/bin/bwa sampe -n 10000000 -N 10000000 /mnt/netapp2/conda_python_envs/py_envs/L1EM//annotation/L1EM.400.fa L1.ab.R1.aln.sai L1.ab.R2.aln.sai L1.fq1.ab L1.fq2.ab
[main] Real time: 17.741 sec; CPU: 9.017 sec
[E::idx_find_and_load] Could not retrieve index file for '../split_fqs/../split_fqs/L1.af.aln.bam'
[E::idx_find_and_load] Could not retrieve index file for '../split_fqs/../split_fqs/L1.aj.aln.bam'
[E::idx_find_and_load] Could not retrieve index file for '../split_fqs/../split_fqs/L1.ab.aln.bam'
[E::idx_find_and_load] [E::idx_find_and_load] Could not retrieve index file for '../split_fqs/../split_fqs/L1.am.aln.bam'Could not retrieve index file for '../split_fqs/../split_fqs/L1.ap.a
ln.bam'

[E::idx_find_and_load] Could not retrieve index file for '../split_fqs/../split_fqs/L1.ak.aln.bam'
[E::idx_find_and_load] Could not retrieve index file for '../split_fqs/../split_fqs/L1.an.aln.bam'
[E::idx_find_and_load] [E::idx_find_and_load] Could not retrieve index file for '../split_fqs/../split_fqs/L1.ai.aln.bam'Could not retrieve index file for '../split_fqs/../split_fqs/L1.ad.a
ln.bam'

[E::idx_find_and_load] Could not retrieve index file for '../split_fqs/../split_fqs/L1.ao.aln.bam'
[E::idx_find_and_load] Could not retrieve index file for '../split_fqs/../split_fqs/L1.ac.aln.bam'
[E::idx_find_and_load] Could not retrieve index file for '../split_fqs/../split_fqs/L1.ae.aln.bam'
[E::idx_find_and_load] Could not retrieve index file for '../split_fqs/../split_fqs/L1.aa.aln.bam'
[E::idx_find_and_load] Could not retrieve index file for '../split_fqs/../split_fqs/L1.ag.aln.bam'
[E::idx_find_and_load] Could not retrieve index file for '../split_fqs/../split_fqs/L1.ah.aln.bam'
[E::idx_find_and_load] Could not retrieve index file for '../split_fqs/../split_fqs/L1.al.aln.bam'

This is the content of split_fqs directory:

L1.aa.aln.bam     L1.ac.R1.aln.sai  L1.ae.R2.aln.sai  L1.ah.aln.bam     L1.aj.R1.aln.sai  L1.al.R2.aln.sai  L1.ao.aln.bam     L1.fq1.ab  L1.fq1.ai  L1.fq1.ap  L1.fq2.ag  L1.fq2.an
L1.aa.R1.aln.sai  L1.ac.R2.aln.sai  L1.af.aln.bam     L1.ah.R1.aln.sai  L1.aj.R2.aln.sai  L1.am.aln.bam     L1.ao.R1.aln.sai  L1.fq1.ac  L1.fq1.aj  L1.fq2.aa  L1.fq2.ah  L1.fq2.ao
L1.aa.R2.aln.sai  L1.ad.aln.bam     L1.af.R1.aln.sai  L1.ah.R2.aln.sai  L1.ak.aln.bam     L1.am.R1.aln.sai  L1.ao.R2.aln.sai  L1.fq1.ad  L1.fq1.ak  L1.fq2.ab  L1.fq2.ai  L1.fq2.ap
L1.ab.aln.bam     L1.ad.R1.aln.sai  L1.af.R2.aln.sai  L1.ai.aln.bam     L1.ak.R1.aln.sai  L1.am.R2.aln.sai  L1.ap.aln.bam     L1.fq1.ae  L1.fq1.al  L1.fq2.ac  L1.fq2.aj
L1.ab.R1.aln.sai  L1.ad.R2.aln.sai  L1.ag.aln.bam     L1.ai.R1.aln.sai  L1.ak.R2.aln.sai  L1.an.aln.bam     L1.ap.R1.aln.sai  L1.fq1.af  L1.fq1.am  L1.fq2.ad  L1.fq2.ak
L1.ab.R2.aln.sai  L1.ae.aln.bam     L1.ag.R1.aln.sai  L1.ai.R2.aln.sai  L1.al.aln.bam     L1.an.R1.aln.sai  L1.ap.R2.aln.sai  L1.fq1.ag  L1.fq1.an  L1.fq2.ae  L1.fq2.al
L1.ac.aln.bam     L1.ae.R1.aln.sai  L1.ag.R2.aln.sai  L1.aj.aln.bam     L1.al.R1.aln.sai  L1.an.R2.aln.sai  L1.fq1.aa         L1.fq1.ah  L1.fq1.ao  L1.fq2.af  L1.fq2.am

As you can see, there are no indexes for *.aln.bam files. Do you know how can I solve this?
Thanks!