gwct / referee Goto Github PK
View Code? Open in Web Editor NEWReference genome quality scores
Home Page: https://gwct.github.io/referee/
License: GNU General Public License v3.0
Reference genome quality scores
Home Page: https://gwct.github.io/referee/
License: GNU General Public License v3.0
At the moment I am working on amplicon data from an aploid genome (same length, already aligned with a multiple alignment). Since I cannot generate a .bam file I cannot use Referee to calculate GL and I've to call mutations from aligned fastq through base count generation and evaluation of frequency of SNVs. To validate a mutation I would like to calculate genotype likelihood, but I cannot do it efficiently since my background in statistics is not good enough.
I would like to use the formulae cited ot Referee page, but I am confused regarding the equation 3.
P(br/Ai)={(ebem)/3 if b!=Ai || 1 - (ebem) if b=Ai
If I am not wrong, Ai should be EACH allele for the specific read b is the position tested. Said that I would like to use the formula 4b to calculate the Log likelihood for each nucleotide. However I cannot perform the right calculation. Since I haven't the mapping qualities I assume an em=0..0001 (uniquely mapped, I used blast to map them). Assuming that in a specific position the Reference base is A and we observed 1000 A with an em=0.0001 and we see 100 G with an em=0.01, how could I calculate lilekihood to test whether the G is a virtually true mutation?
I thank you in advance for your willingness
Hi,
I seem to be experiencing an error related to multiprocessing, however, the error also occurs when -p 1
is set.
Any clue what might be the problem?
#
# =================================================
__
_ __ ___ / _| ___ _ __ ___ ___
| '__/ _ \ |_ / _ \ '__/ _ \/ _ \
| | | __/ _| __/ | | __/ __/
|_| \___|_| \___|_| \___|\___|
Reference genome quality score calculator.
Pseudo assembly by iterative mapping.
#
# Version 1.2 released on August 08, 2020
# Referee was developed by Gregg Thomas and Matthew Hahn
# Citation: https://doi.org/10.1093/gbe/evz088
# Website: https://gwct.github.io/referee/
# Report issues: https://github.com/gwct/referee/issues
#
# The date and time at the start is: 02.14.2023 | 10:26:14
# Using Python version: 3.9.15
#
# The program was called as: /home/rks/referee-1.2/referee.py -p 1 -gl results/BBR060303/referee/BBR060303.pileup -ref results/common/finalAssemblies/BBR060303_final.fasta -d results/BBR060303/referee/ -prefix BBR060303_referee_scores --pileup --correct --overwrite --mapq --quiet --haploid
#
# -----------------------------------------------------------------------------------------------------------------------------
# INPUT/OUTPUT INFO
# Input file: results/BBR060303/referee/BBR060303.pileup
# Reference file: results/common/finalAssemblies/BBR060303_final.fasta
# Output directory: results/BBR060303/referee/
# Output prefix: BBR060303_referee_scores
# -----------------------------------------------------------------------------------------------------------------------------
# OPTIONS INFO
# Option Current setting Current action
# -----------------------------------------------------------------------------------------------------------------------------
# --pileup True Input type set to pileup. Referee will calculate genotype likelihoods.
# --mapq True Incorporating mapping qualities (7th column of pileup file) into quality score calculations if they are present.
# --fastq False Not writing output in FASTQ format.
# --bed False Not writing output in BED format.
# --mapped False Calculating scores for every position in the reference genome.
# --haploid True Calculating genotype likelihoods and quality scores for HAPLOID data (4 genotypes).
# --raw False NOT printing raw Referee score in tabbed output.
# --correct True Suggesting higher scoring alternative base when reference score is negative or reference base is N.
# --quiet True No further information will be output while Referee is running.
# -p 1 Referee will use this many processes to run.
# -l 100000 This many lines will be read per process to be calculated at one time in parallel
# -----------------------------------------------------------------------------------------------------------------------------
# Running...
# 02.14.2023 10:26:14 Detecting compression In progress... ������������������������������Success! 0.01172 8e-05 14.72656 22.28906
# 02.14.2023 10:26:14 Indexing reference FASTA In progress... ������������������������������Success! 0.01654 0.00473 14.72656 22.28906
# 02.14.2023 10:26:14 Getting scaffold lengths from index In progress... ������������������������������Success! Read 1 scaffolds 0.01669 4e-05 14.72656 22.28906
# 02.14.2023 10:26:14 Computing likelihood look-up table In progress... ������������������������������Success! 0.02417 0.00741 17.52344 24.84766
# 02.14.2023 10:26:14 Processing lines 1-100000 In progress... Traceback (most recent call last):
File "/home/rks/referee-1.2/referee.py", line 166, in <module>
referee(globs);
File "/home/rks/referee-1.2/referee.py", line 88, in referee
for result in pool.starmap(CALC.refCalc, ((line_chunk, globs) for line_chunk in line_chunks)):
File "/home/rks/miniconda3/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "/home/rks/miniconda3/lib/python3.9/multiprocessing/pool.py", line 771, in get
raise self._value
File "/home/rks/miniconda3/lib/python3.9/multiprocessing/pool.py", line 537, in _handle_tasks
put(task)
File "/home/rks/miniconda3/lib/python3.9/multiprocessing/connection.py", line 211, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/rks/miniconda3/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot pickle '_thread.RLock' object
Best,
Richard
Hello,
encountering this error
+ Making tmp directory: referee-tmpdir-04-10-2020.11-23-18-rYrIwO
Traceback (most recent call last):
File "./referee-1.1.21/referee.py", line 110, in <module>
referee(files, globs, step_start_time);
File "./referee-1.1.21/referee.py", line 51, in referee
for result in pool.imap(RC.getSubPID, range(globs['num-procs'])):
File "/usr/lib/python2.7/multiprocessing/pool.py", line 673, in next
raise value
multiprocessing.pool.MaybeEncodingError: Error sending result: 'psutil.Process(pid=8172, name='referee.py', started='11:29:58')'. Reason: 'TypeError("can't pickle thread.lock objects",)'
The command used was
/referee-1.1.21/referee.py -gl reads-gl.glf.gz -ref ref.fasta --fastq --correct -o reference -p 20
On Ubuntu 18.04
thank you
Hello,
I have tried running the input line for referee:
python referee.py -gl file.mpileup -ref ref.file.fasta --pileup
First I received this error:
File "/work/jgjohns6/penicillata/referee/referee.py", line 10, in
import sys, os, multiprocessing as mp, shutil, lib.refcore as RC, lib.ref_calc as CALC, \
File "/work/jgjohns6/penicillata/referee/lib/ref_calc.py", line 210
print gt, log_gls[gt];
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(gt, log_gls[gt])?
Then, after manually adding in the parentheses I received a similar one:
Traceback (most recent call last):
File "/work/jgjohns6/penicillata/referee/referee.py", line 10, in
import sys, os, multiprocessing as mp, shutil, lib.refcore as RC, lib.ref_calc as CALC, \
File "/work/jgjohns6/penicillata/referee/lib/ref_calc.py", line 211
print rq, lr, l_match, l_mismatch;
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(rq, lr, l_match, l_mismatch)?
However even after going in and adding those parentheses, I received the following error:
File "/work/jgjohns6/penicillata/referee/referee.py", line 10, in
import sys, os, multiprocessing as mp, shutil, lib.refcore as RC, lib.ref_calc as CALC, \
File "/work/jgjohns6/penicillata/referee/lib/ref_calc.py", line 1, in
import math, re, refcore as RC, ref_out as OUT, sys
ModuleNotFoundError: No module named 'refcore'
I am not sure what is happening here, and why the program won't just run.
Quality scores are log10(likelihood of ref/likelihood of nonref) (right?).
To prevent misinterpretations, the documentation should make it clear that a quality score of 1 is equivalent to the ref being having 10 times the likelihood of nonref, and a quality score of 4 means 10000 times.
This is different from the standard quality score scalings in which these situations would result in scores of 10 and 40.
Hello,
I was just wondering if there were any recommendations regarding the options to use for generating the pileup file with samtools mpileup?
EDIT: had missed the walkthrough
When processing a single input file with multiple threads, referee splits the file into multiple temporary files and then processes those files. If a user has a large file, they will try to use multiple processes to speed the analysis; however, this causes the large file to be duplicated. Additionally, before the program splits the files, referee reads the entire file once to count the number of lines. Then reads it again to split the file. On network-backed storage, this can take a really long time, and the user won't know why referee isn't generating any output or log updates.
An alternative strategy is for referee to avoid counting lines is to measure the file size using os.stat
and then jump a certain percentage into the file, find the next newline and split there.
Another strategy is to get rid of the temporary files and have one file-reading process and multiple data-processing processes.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.