gwct / referee Goto Github PK

View Code? Open in Web Editor NEW

19.0 5.0 3.0 90.44 MB

Reference genome quality scores

Home Page: https://gwct.github.io/referee/

License: GNU General Public License v3.0

Python 84.39% R 8.00% Shell 7.61%

genomics bioinformatics quality-score genome-assembly

referee's People

Contributors

Stargazers

Watchers

Forkers

pythseq altingia reedacartwright

referee's Issues

Unable to reproduce Referee calculations from raw data

At the moment I am working on amplicon data from an aploid genome (same length, already aligned with a multiple alignment). Since I cannot generate a .bam file I cannot use Referee to calculate GL and I've to call mutations from aligned fastq through base count generation and evaluation of frequency of SNVs. To validate a mutation I would like to calculate genotype likelihood, but I cannot do it efficiently since my background in statistics is not good enough.
I would like to use the formulae cited ot Referee page, but I am confused regarding the equation 3.

P(br/Ai)={(ebem)/3 if b!=Ai || 1 - (ebem) if b=Ai

If I am not wrong, Ai should be EACH allele for the specific read b is the position tested. Said that I would like to use the formula 4b to calculate the Log likelihood for each nucleotide. However I cannot perform the right calculation. Since I haven't the mapping qualities I assume an em=0..0001 (uniquely mapped, I used blast to map them). Assuming that in a specific position the Reference base is A and we observed 1000 A with an em=0.0001 and we see 100 G with an em=0.01, how could I calculate lilekihood to test whether the G is a virtually true mutation?
I thank you in advance for your willingness

TypeError: cannot pickle '_thread.RLock' object

Hi,

I seem to be experiencing an error related to multiprocessing, however, the error also occurs when -p 1 is set.
Any clue what might be the problem?

#
# =================================================
                   __                    
         _ __ ___ / _| ___ _ __ ___  ___ 
        | '__/ _ \ |_ / _ \ '__/ _ \/ _ \
        | | |  __/  _|  __/ | |  __/  __/
        |_|  \___|_|  \___|_|  \___|\___|
    Reference genome quality score calculator.

       Pseudo assembly by iterative mapping.

#
# Version 1.2 released on August 08, 2020
# Referee was developed by Gregg Thomas and Matthew Hahn
# Citation:      https://doi.org/10.1093/gbe/evz088
# Website:       https://gwct.github.io/referee/
# Report issues: https://github.com/gwct/referee/issues
#
# The date and time at the start is: 02.14.2023 | 10:26:14
# Using Python version:              3.9.15
#
# The program was called as: /home/rks/referee-1.2/referee.py -p 1 -gl results/BBR060303/referee/BBR060303.pileup -ref results/common/finalAssemblies/BBR060303_final.fasta -d results/BBR060303/referee/ -prefix BBR060303_referee_scores --pileup --correct --overwrite --mapq --quiet --haploid
#
# -----------------------------------------------------------------------------------------------------------------------------
# INPUT/OUTPUT INFO
# Input file:       results/BBR060303/referee/BBR060303.pileup
# Reference file:   results/common/finalAssemblies/BBR060303_final.fasta
# Output directory: results/BBR060303/referee/
# Output prefix:    BBR060303_referee_scores
# -----------------------------------------------------------------------------------------------------------------------------
# OPTIONS INFO
# Option            Current setting     Current action
# -----------------------------------------------------------------------------------------------------------------------------
# --pileup          True                Input type set to pileup. Referee will calculate genotype likelihoods.
# --mapq            True                Incorporating mapping qualities (7th column of pileup file) into quality score calculations if they are present.
# --fastq           False               Not writing output in FASTQ format.
# --bed             False               Not writing output in BED format.
# --mapped          False               Calculating scores for every position in the reference genome.
# --haploid         True                Calculating genotype likelihoods and quality scores for HAPLOID data (4 genotypes).
# --raw             False               NOT printing raw Referee score in tabbed output.
# --correct         True                Suggesting higher scoring alternative base when reference score is negative or reference base is N.
# --quiet           True                No further information will be output while Referee is running.
# -p                1                   Referee will use this many processes to run.
# -l                100000              This many lines will be read per process to be calculated at one time in parallel
# -----------------------------------------------------------------------------------------------------------------------------
# Running...
# 02.14.2023  10:26:14  Detecting compression                   In progress...                ������������������������������Success!                      0.01172             8e-05           14.72656                 22.28906            
# 02.14.2023  10:26:14  Indexing reference FASTA                In progress...                ������������������������������Success!                      0.01654             0.00473         14.72656                 22.28906            
# 02.14.2023  10:26:14  Getting scaffold lengths from index     In progress...                ������������������������������Success! Read 1 scaffolds     0.01669             4e-05           14.72656                 22.28906            
# 02.14.2023  10:26:14  Computing likelihood look-up table      In progress...                ������������������������������Success!                      0.02417             0.00741         17.52344                 24.84766            
# 02.14.2023  10:26:14  Processing lines 1-100000               In progress...                Traceback (most recent call last):
  File "/home/rks/referee-1.2/referee.py", line 166, in <module>
    referee(globs);
  File "/home/rks/referee-1.2/referee.py", line 88, in referee
    for result in pool.starmap(CALC.refCalc, ((line_chunk, globs) for line_chunk in line_chunks)):
  File "/home/rks/miniconda3/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/rks/miniconda3/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
  File "/home/rks/miniconda3/lib/python3.9/multiprocessing/pool.py", line 537, in _handle_tasks
    put(task)
  File "/home/rks/miniconda3/lib/python3.9/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/rks/miniconda3/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: cannot pickle '_thread.RLock' object

Best,
Richard

multiprocessing.pool.MaybeEncodingError: Error sending result:

Hello,

encountering this error


+ Making tmp directory: referee-tmpdir-04-10-2020.11-23-18-rYrIwO
Traceback (most recent call last):
  File "./referee-1.1.21/referee.py", line 110, in <module>
    referee(files, globs, step_start_time);
  File "./referee-1.1.21/referee.py", line 51, in referee
    for result in pool.imap(RC.getSubPID, range(globs['num-procs'])):
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 673, in next
    raise value
multiprocessing.pool.MaybeEncodingError: Error sending result: 'psutil.Process(pid=8172, name='referee.py', started='11:29:58')'. Reason: 'TypeError("can't pickle thread.lock objects",)'

The command used was


/referee-1.1.21/referee.py -gl reads-gl.glf.gz -ref ref.fasta --fastq --correct -o reference -p 20

On Ubuntu 18.04

thank you

Error Message

Hello,
I have tried running the input line for referee:
python referee.py -gl file.mpileup -ref ref.file.fasta --pileup
First I received this error:
File "/work/jgjohns6/penicillata/referee/referee.py", line 10, in
import sys, os, multiprocessing as mp, shutil, lib.refcore as RC, lib.ref_calc as CALC, \
File "/work/jgjohns6/penicillata/referee/lib/ref_calc.py", line 210
print gt, log_gls[gt];
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(gt, log_gls[gt])?

Then, after manually adding in the parentheses I received a similar one:
Traceback (most recent call last):
File "/work/jgjohns6/penicillata/referee/referee.py", line 10, in
import sys, os, multiprocessing as mp, shutil, lib.refcore as RC, lib.ref_calc as CALC, \
File "/work/jgjohns6/penicillata/referee/lib/ref_calc.py", line 211
print rq, lr, l_match, l_mismatch;
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(rq, lr, l_match, l_mismatch)?

However even after going in and adding those parentheses, I received the following error:
File "/work/jgjohns6/penicillata/referee/referee.py", line 10, in
import sys, os, multiprocessing as mp, shutil, lib.refcore as RC, lib.ref_calc as CALC, \
File "/work/jgjohns6/penicillata/referee/lib/ref_calc.py", line 1, in
import math, re, refcore as RC, ref_out as OUT, sys
ModuleNotFoundError: No module named 'refcore'

I am not sure what is happening here, and why the program won't just run.

Quality scores are not phred-scaled.

Quality scores are log10(likelihood of ref/likelihood of nonref) (right?).

To prevent misinterpretations, the documentation should make it clear that a quality score of 1 is equivalent to the ref being having 10 times the likelihood of nonref, and a quality score of 4 means 10000 times.

This is different from the standard quality score scalings in which these situations would result in scores of 10 and 40.

Recommendation for the pileup file

Hello,

I was just wondering if there were any recommendations regarding the options to use for generating the pileup file with samtools mpileup?

EDIT: had missed the walkthrough

Multithreading over a single 400GB file is slow.

When processing a single input file with multiple threads, referee splits the file into multiple temporary files and then processes those files. If a user has a large file, they will try to use multiple processes to speed the analysis; however, this causes the large file to be duplicated. Additionally, before the program splits the files, referee reads the entire file once to count the number of lines. Then reads it again to split the file. On network-backed storage, this can take a really long time, and the user won't know why referee isn't generating any output or log updates.

An alternative strategy is for referee to avoid counting lines is to measure the file size using os.stat and then jump a certain percentage into the file, find the next newline and split there.

Another strategy is to get rid of the temporary files and have one file-reading process and multiple data-processing processes.

gwct / referee Goto Github PK

referee's People

Contributors

Stargazers

Watchers

Forkers

referee's Issues

Unable to reproduce Referee calculations from raw data

TypeError: cannot pickle '_thread.RLock' object

multiprocessing.pool.MaybeEncodingError: Error sending result:

Error Message

Quality scores are not phred-scaled.

Recommendation for the pileup file

Multithreading over a single 400GB file is slow.

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent