
NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.

License: Other


neat's Introduction

The NEAT Project v4.2

Welcome to the NEAT project, the NExt-generation sequencing Analysis Toolkit, version 4.2. This release includes several fixes and some restructuring; see the ChangeLog for notes. There is still a lot of work to be done. We have discarded fasta file writing for now and removed that code; we may add it back as a feature in the future if users call for it. We also removed GC bias for now: it severely complicated the implementation and had very few noticeable effects, and after discussions with some people at the Illinois Institute for Genomic Biology, it sounded like GC bias may be a non-factor with improved chemistries. Both features will be reintroduced if needed or called for.

We are also working on redeveloping NEAT in Rust, a memory- and thread-safe language that will lend itself well to the way NEAT works. Check that out here: https://github.com/ncsa/rusty-neat

Stay tuned over the coming weeks for exciting updates to NEAT, and learn how to contribute yourself. If you'd like to use some of our code, no problem! Just review the license, first.

NEAT's read-simulator is a fine-grained read simulator that generates realistic-looking data using models learned from specific datasets. Several supporting utilities generate the models used for simulation and compare the outputs of alignment and variant calling to the golden BAM and golden VCF produced by NEAT.

This is release v4.2 of the software. While it has been tested, it does represent a shift in the software with the introduction of a configuration file. For a stable release using the old command line interface, please see: NEAT 3.0 (or check out older tagged releases)

To cite this work, please use:

Stephens, Zachary D., Matthew E. Hudson, Liudmila S. Mainzer, Morgan Taschuk, Matthew R. Weber, and Ravishankar K. Iyer. "Simulating next-generation sequencing datasets from empirical mutation and sequencing models." PloS one 11, no. 11 (2016): e0167047.


Requirements (the most up-to-date requirements are found in the environment.yml file)

  • Some version of Anaconda to set up the environment
  • Python == 3.10.*
  • poetry == 1.3.*
  • biopython == 1.79
  • pkginfo
  • matplotlib
  • numpy
  • pyyaml
  • scipy
  • pysam
  • frozendict

Installation

To install NEAT, you must create a virtual environment using a tool such as conda. Once the environment is activated, you can use poetry to build a wheel file, which can then be pip installed. Run these commands from within the NEAT directory.

> conda env create -f environment.yml -n neat
> conda activate neat
> poetry build
> pip install dist/neat*whl

Alternatively, if you wish to work with NEAT in the development environment, you can use poetry install within the NEAT repo, after creating the conda environment:

> conda env create -f environment.yml -n neat
> conda activate neat
> poetry install

Note: If any packages fail to resolve, check the channels and try manually pip installing the package to see if that helps (but note that NEAT is not tested against the pip versions).

Test your install by running:

> neat --help

You can also try running it using the python command directly:

> python -m neat --help

Usage

NEAT's core functionality is invoked with the read-simulator command. Here is the simplest invocation of read-simulator, using default parameters. This command produces a single-ended fastq file with reads of length 151, ploidy 2, coverage 10X, and the default sequencing-substitution and mutation-rate models.

Contents of neat_config.yml:

reference: /path/to/my/genome.fa

Command:

neat read-simulator -c neat_config.yml -o simulated_data

The output prefix should not specify a file extension (i.e., .fasta, .fastq, etc.), as these will be appended by NEAT.
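As a sketch of why no extension should be supplied, the output names can be thought of as the prefix plus NEAT-appended suffixes. The exact suffixes below are illustrative assumptions, not NEAT's literal file names:

```python
from pathlib import Path

def output_files(prefix: str, paired: bool = False) -> list:
    """Illustrative sketch (not NEAT's actual code): build output file
    names by appending suffixes to a user-supplied prefix."""
    p = Path(prefix)
    if paired:
        fastqs = [f"{p}_r1.fastq.gz", f"{p}_r2.fastq.gz"]
    else:
        fastqs = [f"{p}.fastq.gz"]
    return fastqs + [f"{p}_golden.bam", f"{p}_golden.vcf.gz"]

print(output_files("/home/me/simulated_reads"))
```

Passing a prefix that already ends in `.fastq` would produce doubled extensions, which is why NEAT asks for a bare prefix.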

A config file is required. The config is a yml file specifying the input parameters. The following is a brief description of the potential inputs in the config file. See NEAT/config_template/template_neat_config.yml for a template config file to copy and use for your runs.

reference: full path to a fasta file to generate reads from
read_len: the length of the reads for the fastq (if using). Integer value, default 101
coverage: desired coverage value. Float or int, default 10
ploidy: desired ploidy (number of copies of each chromosome in the organism). Default 2
paired_ended: set to True if paired-ended reads are desired. Setting this to True requires either values for fragment_mean and fragment_st_dev, or the path to a valid fragment_model
fragment_mean: use with paired-ended reads to set the fragment length mean manually
fragment_st_dev: use with paired-ended reads to set the standard deviation of the fragment length

The following values can be set to True or omitted to use the defaults. If True, NEAT will produce that file type. The defaults are:

produce_bam: False
produce_vcf: False
produce_fastq: True

error_model: full path to an error model generated by NEAT. Leave empty to use the default model (based on human data sequenced by Illumina)
mutation_model: full path to a mutation model generated by NEAT. Leave empty to use the default model (based on human data sequenced by Illumina)
fragment_model: full path to a fragment length model generated by NEAT. Leave empty to use the default model (based on human data sequenced by Illumina)

threads: the number of threads for NEAT to use. Note: this feature is not yet fully implemented
avg_seq_error: average sequencing error rate for the sequencing machine; use to increase or decrease the rate of errors in the reads. Float between 0 and 0.3. Default is set by the error model
rescale_qualities: rescale the quality scores to reflect the avg_seq_error rate above. Set to True if you notice issues with the sequencing error rates in your dataset
include_vcf: full path to a list of variants in VCF format to include in the simulation. These will be inserted, as they appear in the input VCF, into the final VCF and into the corresponding fastq and bam files, if requested
target_bed: full path to a list of regions in BED format to target. All areas outside these regions will have coverage of 0
discard_bed: full path to a list of regions to discard, in BED format
mutation_rate: desired rate of mutation for the dataset. Float between 0.0 and 0.3 (default is determined by the mutation model)
mutation_bed: full path to a list of regions with a column describing the mutation rate of each region, as a float between 0 and 0.3. The mutation rate must be in the third column, e.g., mut_rate=0.00
rng_seed: manually enter a seed for the random number generator, used for repeating runs. Must be an integer
min_mutations: the minimum number of mutations NEAT should add per contig. Default 0. We recommend setting this to at least 1 for small contigs, so NEAT will produce at least one mutation per contig
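A few of these values have numeric constraints that are easy to check before a run. The following is an illustrative sketch only, not NEAT's own validation code; the checks mirror the ranges described above:

```python
def validate_config(cfg: dict) -> list:
    """Illustrative sketch (not NEAT's own validation): sanity-check a
    few of the numeric config values described above."""
    problems = []
    if not 0.0 <= cfg.get("mutation_rate", 0.0) <= 0.3:
        problems.append("mutation_rate must be between 0.0 and 0.3")
    if not 0.0 <= cfg.get("avg_seq_error", 0.0) <= 0.3:
        problems.append("avg_seq_error must be between 0 and 0.3")
    if cfg.get("paired_ended") and not (
        {"fragment_mean", "fragment_st_dev"} <= cfg.keys()
        or "fragment_model" in cfg
    ):
        problems.append("paired_ended needs fragment_mean and "
                        "fragment_st_dev, or a fragment_model")
    return problems

print(validate_config({"paired_ended": True}))
```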

The command line options for NEAT are as follows:

Universal options can be applied to any subfunction. They should come before the subcommand name (e.g., neat --log-level DEBUG read-simulator ...), except -h or --help, which can appear anywhere in the command.

Universal Options Description
-h, --help Displays usage information
--no-log Turn off log file creation
--log-name LOG_NAME Custom name for log file, can be a full path (default is current working directory with a name starting with a timestamp)
--log-level VALUE VALUE must be one of [DEBUG, INFO, WARN, WARNING, ERROR] - sets level of log to display
--log-detail VALUE VALUE must be one of [LOW, MEDIUM, HIGH] - how much info to write for each log record
--silent-mode Writes logs, but suppresses stdout messages

read-simulator command line options

Option Description
-c VALUE, --config VALUE The VALUE should be the name of the config file to use for this run
-o OUTPUT, --output OUTPUT The path, including filename prefix, to use to write the output files

Functionality

Diagram describing the way that genReads simulates datasets

NEAT produces simulated sequencing datasets. It creates FASTQ files with reads sampled from a provided reference genome, using sequencing error rates and mutation rates learned from real sequencing data. The strength of NEAT lies in the user's ability to customize many sequencing parameters and to produce 'golden' true-positive datasets. We are working on expanding the functionality even further to model more species, generate larger variants, model tumor/normal data, and more!

Features:

  • Simulate single-end and paired-end reads
  • Custom read length
  • Can introduce random mutations and/or mutations from a VCF file
    • Supported mutation types include SNPs, indels (of any length), inversions, translocations, duplications
    • Can emulate multi-ploid heterozygosity for SNPs and small indels
  • Can simulate targeted sequencing via BED input specifying regions to sample from
  • Can accurately simulate large, single-end reads with high indel error rates (PacBio-like) given a model
  • Specify simple fragment length model with mean and standard deviation or an empirically learned fragment distribution using utilities/computeFraglen.py
  • Simulates quality scores using either the default model or empirically learned quality scores using neat gen_mut_model
  • Introduces sequencing substitution errors using either the default model or empirically learned from utilities/
  • Output a VCF file with the 'golden' set of true positive variants. These can be compared to bioinformatics workflow output (includes coverage and allele balance information)
  • Output a BAM file with the 'golden' set of aligned reads. These indicate where each read originated and how it should be aligned with the reference
  • Create paired tumour/normal datasets using characteristics learned from real tumour data
  • Parallelization. COMING SOON!

Examples

The following commands are examples for common types of data to generate. Each simulation uses a reference genome in fasta format to generate reads of 126 bases with the default 10X coverage, and outputs paired fastq files, a BAM file, and a VCF file. The random variants inserted into the sequence will be present in the VCF, and all of the reads will show their proper alignment in the BAM. Unless told otherwise, the simulator will also insert some "sequencing error": random changes in some reads that represent false positives from sequencing.
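Coverage ties read count to genome size: coverage is total sequenced bases divided by genome length. A back-of-the-envelope sketch (the function name is illustrative, not part of NEAT):

```python
def reads_for_coverage(genome_len: int, coverage: float, read_len: int,
                       paired: bool = False) -> int:
    """Approximate number of reads (or read pairs) needed so that
    coverage = total sequenced bases / genome length."""
    bases_needed = coverage * genome_len
    bases_per_unit = read_len * (2 if paired else 1)
    return round(bases_needed / bases_per_unit)

# 10X over a 1 Mb genome with 126 bp paired reads:
print(reads_for_coverage(1_000_000, 10, 126, paired=True))
```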

Whole genome simulation

Simulate whole genome dataset with random variants inserted according to the default model.

[contents of neat_config.yml]
reference: hg19.fa
read_len: 126
produce_bam: True
produce_vcf: True
paired_ended: True
fragment_mean: 300
fragment_st_dev: 30

neat read-simulator                  \
        -c neat_config.yml          \
        -o /home/me/simulated_reads
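The fragment_mean/fragment_st_dev pair in the config above describes a simple fragment-length model. A rough sketch of what such a model implies; the plain normal distribution here is an assumption (NEAT can instead use an empirically learned fragment_model):

```python
import random

def sample_fragment_lengths(n, mean=300.0, st_dev=30.0, read_len=126,
                            seed=0):
    """Draw n fragment lengths from a normal distribution with the given
    mean and standard deviation; fragments shorter than the read length
    are redrawn, since a read cannot exceed its fragment."""
    rng = random.Random(seed)
    lengths = []
    while len(lengths) < n:
        frag = round(rng.gauss(mean, st_dev))
        if frag >= read_len:
            lengths.append(frag)
    return lengths

lengths = sample_fragment_lengths(1000)
print(min(lengths), max(lengths))
```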

Targeted region simulation

Simulate a targeted region of a genome, e.g. exome, with a targeted bed:

[contents of neat_config.yml]
reference: hg19.fa
read_len: 126
produce_bam: True
produce_vcf: True
paired_ended: True
fragment_mean: 300
fragment_st_dev: 30
target_bed: hg19_exome.bed

neat read-simulator                 \
        -c neat_config.yml          \
        -o /home/me/simulated_reads

Insert specific variants

Simulate a whole genome dataset with only the variants in a provided VCF file, by setting include_vcf and setting mutation_rate to 0.

[contents of neat_config.yml]
reference: hg19.fa
read_len: 126
produce_bam: True
produce_vcf: True
paired_ended: True
fragment_mean: 300
fragment_st_dev: 30
include_vcf: NA12878.vcf
mutation_rate: 0

neat read-simulator                 \
        -c neat_config.yml          \
        -o /home/me/simulated_reads

Single end reads

Simulate single end reads by omitting paired ended options.

[contents of neat_config.yml]
reference: hg18.fa
read_len: 126
produce_bam: True
produce_vcf: True

neat read-simulator                 \
        -c neat_config.yml          \
        -o /home/me/simulated_reads

Large single end reads

Simulate PacBio-like reads by providing an error model. (Not yet implemented in NEAT 4.0)

[contents of neat_config.yml]
reference: hg19.fa
read_len: 5000
error_model: errorModel_pacbio.pickle.gz
avg_seq_error: 0.1

neat read-simulator                 \
        -c neat_config.yml          \
        -o /home/me/simulated_reads

Utilities

Several utility scripts are distributed with NEAT to generate the models used for simulation.

neat model-fraglen

Computes empirical fragment length distribution from sample data.

neat model-fraglen   \
    -i input.bam            \
    -o /path/to/prefix

This creates a fraglen.pickle.gz model at the given output prefix.
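NEAT's models are gzipped pickle files, so an existing model can be inspected with the standard library. A sketch; the dict contents here are made up for illustration, and real model contents depend on the NEAT version:

```python
import gzip
import os
import pickle
import tempfile

# Write a dummy "model" the way NEAT stores models (gzip + pickle),
# then read it back. The keys are illustrative assumptions.
path = os.path.join(tempfile.mkdtemp(), "fraglen.pickle.gz")
model = {"fragment_mean": 300.0, "fragment_st_dev": 30.0}

with gzip.open(path, "wb") as fh:
    pickle.dump(model, fh)

with gzip.open(path, "rb") as fh:
    loaded = pickle.load(fh)

print(loaded)
```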

neat gen-mut-model

Takes a reference genome and a VCF file to generate mutation models:

neat gen-mut-model reference.fa input_variants.vcf   \
        -o /home/me/models

Trinucleotides are identified in the reference genome and the variant file. Frequencies of each trinucleotide transition are calculated and output as a pickle (.p) file.
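The trinucleotide-counting step can be sketched as follows. This is a simplified illustration: the real gen-mut-model also tallies which variants fall in each context to compute transition frequencies.

```python
from collections import Counter

def trinuc_contexts(seq: str) -> Counter:
    """Count every overlapping trinucleotide in a sequence, skipping
    windows that contain ambiguous bases such as N."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(len(seq) - 2)
                   if set(seq[i:i + 3]) <= set("ACGT"))

print(trinuc_contexts("ACGTNACG"))
```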

Option Description
-o Path to output file and prefix. Required.
--bed Flag that indicates you are using a bed-restricted vcf and fasta (see below)
--save-trinuc Save trinucleotide counts for reference
--human-sample Use to skip unnumbered scaffolds in human references
--skip-common Do not save common snps or high mutation areas

neat model-seq-err

Generates a sequencing error model for NEAT. This script needs revision to improve the quality-score model and, eventually, to include code to learn sequencing errors from pileup data.

Note that model-seq-err does not accept SAM input, as genSeqErrorModel.py did. If you would like to use a BAM/SAM file, use samtools to convert it to a fastq, then use the fastq as input.

Note additionally that the input file must be either unzipped or bgzipped. If your file is currently gzipped, decompress it and recompress it with the bgzip utility that ships alongside samtools:

gzip -d my_file.fastq.gz
bgzip my_file.fastq

This blocked zip format allows for indexing of the file.

For quality scores, note that passing a single number makes NEAT check quality scores for every value, which can slow down model creation; binned quality scores are therefore advisable.
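As a sketch, binning amounts to snapping each observed score to the nearest bin value; the bins [2, 11, 24, 37] are the defaults shown for -Q in the usage example. The function name is illustrative, not NEAT's API:

```python
def bin_quality(q: int, bins=(2, 11, 24, 37)) -> int:
    """Snap a Phred quality score to the nearest bin value; binning keeps
    the learned error model small and fast to build."""
    return min(bins, key=lambda b: abs(b - q))

print([bin_quality(q) for q in (0, 8, 20, 30, 40)])  # → [2, 11, 24, 24, 37]
```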

Soon we will take a samtools mpileup output as input and have some plotting features.

neat model-seq-err                                    \
        -i input_read.fq (.gz)                       \
        -o /path/to/prefix                             \
        -q quality score offset [33]                  \
        -Q maximum quality score [2, 11, 24, 37]      \
        -m maximum number of reads to process [all]   \

Please note that -i2 can be used in place of -i to produce paired data.

neat plot_mutation_model

Performs plotting and comparison of mutation models generated from genMutModel.py (Not yet implemented in NEAT 4.0).

neat plot_mutation_model                                                \
        -i model1.pickle.gz [model2.pickle.gz] [model3.pickle.gz]...    \
        -l legend_label1 [legend_label2] [legend_label3]...             \
        -o path/to/pdf_plot_prefix

neat vcf_compare

Tool for comparing VCF files (Not yet implemented in NEAT 4.0).

neat vcf_compare
        -r <ref.fa>        * Reference Fasta                           \
        -g <golden.vcf>    * Golden VCF                                \
        -w <workflow.vcf>  * Workflow VCF                              \
        -o <prefix>        * Output Prefix                             \
        -m <track.bed>     Mappability Track                           \
        -M <int>           Maptrack Min Len                            \
        -t <regions.bed>   Targeted Regions                            \
        -T <int>           Min Region Len                              \
        -c <int>           Coverage Filter Threshold [15]              \
        -a <float>         Allele Freq Filter Threshold [0.3]          \
        --vcf-out          Output Match/FN/FP variants [False]         \
        --no-plot          No plotting [False]                         \
        --incl-homs        Include homozygous ref calls [False]        \
        --incl-fail        Include calls that failed filters [False]   \
        --fast             No equivalent variant detection [False]

Mappability track examples: https://github.com/zstephens/neat-repeat/tree/master/example_mappabilityTracks

Note on Sensitive Patient Data

ICGC's "Access Controlled Data" documentation can be found at https://docs.icgc.org/portal/access/. To have access to controlled germline data, a DACO must be submitted. Open tier data can be obtained without a DACO, but germline alleles that do not match the reference genome are masked and replaced with the reference allele. Controlled data includes unmasked germline alleles.

neat's People

Contributors

123abcxyz321, johanneskoester, joshfactorial, keshav-gandhi, lsmainzer, meridith-e, morgantaschuk, mrweber2, niemasd, thondeboer, yashwasnik7, zstephens


neat's Issues

Copy number variations input

Hi,

I see that in gen_reads.py, one of the outputs is golden VCF file which includes variants which will be present in golden BAM file. One of the lines in header of that VCF file is:
##ALT=<ID=CNV,Description="Copy number variable region">

Is there a way to insert copy numbers into generated BAM file? For example, to provide an input with copy number regions and their numbers (ex. chr1 1 100 7 - where columns are: chromosome, start, end, copy number) or something similar which will output golden BAM file and information about copy number variations?

Thank you!

Error when sampling reads

Describe the bug
Hello,

I am trying to use NEAT to generate some synthetic TB reads, using a VCF file to introduce known mutations associated with antibiotic resistance. When I run my code (seen below) I get an error that I think is related to Biopython. I'm not sure if it's related to the release of Biopython I am using, which is v1.80 (installed using conda), or to the way my reference and VCF are formatted, so for clarity I have attached them below as well (as .txt, since GitHub doesn't support .vcf or .fasta). Any help would be greatly appreciated.

Below is the error I am experiencing:

(base) [matthew.bird syn-data]$ python3 NEAT/gen_reads.py -r reference/TB_ref.fasta -R 300 -v simulated_reads_c100.vcf -c 100 --pe 300 1 -o simulated_reads_c100

Using default sequencing error model.
Warning: Read length of error model (147) does not match -R value (300), rescaling model...
Using default gc-bias model.
Using artificial fragment length distribution. mean=300 std=1
Index not found, creating one... 
0.022 (sec)
--------------------------------
reading input VCF...

Warning: Found variants without a GT field, assuming heterozygous...
found 16 valid variants in input vcf.
 * 0 variants skipped: (qual filtered / ref genotypes / invalid syntax)
 * 0 variants skipped due to multiple variants found per position
--------------------------------
reading ChrI... 
2.210 (sec)
found 16 valid variants for ChrI in input VCF...
--------------------------------
sampling reads...
Traceback (most recent call last):
  File "NEAT/gen_reads.py", line 896, in <module>
    main()
  File "NEAT/gen_reads.py", line 618, in main
    all_inserted_variants = sequences.random_mutations()
  File "/hpscol02/tenant2/hpc_storage/home/matthew.bird/syn-data/NEAT/source/SequenceContainer.py", line 602, in random_mutations
    temp = MutableSeq(self.sequences[i])
  File "/home/matthew.bird/anaconda3/lib/python3.8/site-packages/Bio/Seq.py", line 1662, in __init__
    raise TypeError(
TypeError: The sequence data given to a MutableSeq object should be a string or an array (not a Seq object etc)

simulated_reads_c100.txt
TB_ref.txt

Paired-end reads have identical sequence

Hello,

Describe the bug:

I would like to use neat for the simulation of targeted paired-end read sequencing. For some reason though, about half of the generated read pairs have the same sequence for forward and reverse (i.e. in read pair 1, the forward read sequence == reverse read sequence). As I set the fragment length to about 400 (previously 272, which produced the same result), I was expecting no overlap to exist, but instead, the reads overlap completely.
It also seems that such read pairs are not written into the golden BAM produced by the --bam flag.

To Reproduce

I used the following command:

python NEAT/gen_reads.py -c 5000 -r GRCh37_sub.fa -R 101 -o neat_output/neat_5k_272_15_hotspots_cancer -tr neat_input/regions.bed --pe 400 15 --bam -M 0 -E 0 -to 0

The reference genome is a subset of GRCh37 containing only chromosomes 1, 2 and 3; the BED file simply contains one target region:
chr1 10000 12000

Expected behavior

To not find any read pairs in which the forward and reverse read have an identical sequence
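One quick sanity check for this report: with paired-end sequencing, the two mates of a fragment can only share sequence when the fragment is shorter than twice the read length. A small sketch (function name is illustrative):

```python
def expected_overlap(frag_len: int, read_len: int) -> int:
    """Number of overlapping bases between the two mates of a fragment
    sequenced from both ends (0 when the fragment is long enough)."""
    return max(0, 2 * read_len - frag_len)

# With the settings in this report (fragment length 400, reads of 101),
# mates should not overlap at all, so identical mates indicate a bug.
print(expected_overlap(400, 101), expected_overlap(150, 101))
```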

Error with utilities/vcf_compare_OLD.py

Describe the bug
I ran python ~/NEAT/utilities/vcf_compare_OLD.py --help to confirm that cloning the git repo worked. However, I got the following error, which seems to indicate a bug in the argument parser of this script.

(neat_conda) python ~/NEAT/utilities/vcf_compare_OLD.py 
Traceback (most recent call last):
  File "/storage/hpc/group/UMAG_test/WORKING/jakth2/deep-variant/scripts/setup/NEAT/utilities/vcf_compare_OLD.py", line 37, in <module>
    parser = argparse.ArgumentParser('python %prog [options] -r <ref.fa> -g <golden.vcf> -w <workflow.vcf>',
TypeError: __init__() got an unexpected keyword argument 'version'
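For context, the `version=` keyword to `ArgumentParser` is a Python 2-era argparse feature that was removed in Python 3, which is what raises this TypeError. A minimal sketch of the modern replacement (the version string and option names here are illustrative):

```python
import argparse

# In Python 3, a version string is attached via an argument with
# action="version" rather than ArgumentParser(version=...).
parser = argparse.ArgumentParser(
    prog="vcf_compare_OLD.py",
    description="python vcf_compare_OLD.py [options] -r <ref.fa> ...")
parser.add_argument("--version", action="version", version="NEAT 3.0")
parser.add_argument("-r", dest="ref", help="Reference Fasta")

args = parser.parse_args(["-r", "ref.fa"])
print(args.ref)
```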

To Reproduce
Steps to reproduce the behavior:

  1. cd /path/to/working/directory
  2. conda activate -p /path/to/neat_environment with packages listed as Requirements
  3. git clone https://github.com/ncsa/NEAT.git
  4. cd NEAT
  5. pip install .
... 
Successfully installed NEAT-3.0
  1. See error

Expected behavior

python vcf_compare_OLD.py
        -r <ref.fa>        * Reference Fasta                           \
        -g <golden.vcf>    * Golden VCF                                \
        -w <workflow.vcf>  * Workflow VCF                              \
        -o <prefix>        * Output Prefix                             \
        -m <track.bed>     Mappability Track                           \
        -M <int>           Maptrack Min Len                            \
        -t <regions.bed>   Targetted Regions                           \
        -T <int>           Min Region Len                              \
        -c <int>           Coverage Filter Threshold [15]              \
        -a <float>         Allele Freq Filter Threshold [0.3]          \
        --vcf-out          Output Match/FN/FP variants [False]         \
        --no-plot          No plotting [False]                         \
        --incl-homs        Include homozygous ref calls [False]        \
        --incl-fail        Include calls that failed filters [False]   \
        --fast             No equivalent variant detection [False]

Desktop (please complete the following information):

  • Device: remote machine
  • OS: CentOS Linux 7 (Core)
  • Dependency Versions:
(neat_conda)[user@remote_machine deep-variant]$ conda list
# packages in environment at ~/neat_conda:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
biopython                 1.79             py39h3811e60_1    conda-forge
brotli                    1.0.9                h7f98852_6    conda-forge
brotli-bin                1.0.9                h7f98852_6    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.18.1               h7f98852_0    conda-forge
ca-certificates           2021.10.8            ha878542_0    conda-forge
certifi                   2021.10.8        py39hf3d152e_1    conda-forge
cycler                    0.11.0             pyhd8ed1ab_0    conda-forge
dbus                      1.13.6               h5008d03_3    conda-forge
expat                     2.4.3                h9c3ff4c_0    conda-forge
fontconfig                2.13.1            hba837de_1005    conda-forge
fonttools                 4.28.5           py39h3811e60_0    conda-forge
freetype                  2.10.4               h0708190_1    conda-forge
gettext                   0.19.8.1          h73d1719_1008    conda-forge
glib                      2.70.2               h780b84a_1    conda-forge
glib-tools                2.70.2               h780b84a_1    conda-forge
gst-plugins-base          1.14.0               hbbd80ab_1  
gstreamer                 1.14.0               h28cd5cc_2  
icu                       58.2              hf484d3e_1000    conda-forge
jpeg                      9d                   h36c2ea0_0    conda-forge
kiwisolver                1.3.2            py39h1a9c180_1    conda-forge
krb5                      1.19.2               hcc1bbae_3    conda-forge
lcms2                     2.12                 hddcbb42_0    conda-forge
ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
libblas                   3.9.0           13_linux64_openblas    conda-forge
libbrotlicommon           1.0.9                h7f98852_6    conda-forge
libbrotlidec              1.0.9                h7f98852_6    conda-forge
libbrotlienc              1.0.9                h7f98852_6    conda-forge
libcblas                  3.9.0           13_linux64_openblas    conda-forge
libcurl                   7.81.0               h2574ce0_0    conda-forge
libdeflate                1.9                  h7f98852_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 11.2.0              h1d223b6_11    conda-forge
libgfortran-ng            11.2.0              h69a702a_11    conda-forge
libgfortran5              11.2.0              h5c6108e_11    conda-forge
libglib                   2.70.2               h174f98d_1    conda-forge
libgomp                   11.2.0              h1d223b6_11    conda-forge
libiconv                  1.16                 h516909a_0    conda-forge
liblapack                 3.9.0           13_linux64_openblas    conda-forge
libnghttp2                1.43.0               h812cca2_1    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libopenblas               0.3.18          pthreads_h8fe5266_0    conda-forge
libpng                    1.6.37               h21135ba_2    conda-forge
libssh2                   1.10.0               ha56f1ee_2    conda-forge
libstdcxx-ng              11.2.0              he4da1e4_11    conda-forge
libtiff                   4.2.0                h85742a9_0  
libuuid                   2.32.1            h7f98852_1000    conda-forge
libwebp-base              1.2.2                h7f98852_0    conda-forge
libxcb                    1.13              h7f98852_1004    conda-forge
libxml2                   2.9.12               h03d6c58_0  
libzlib                   1.2.11            h36c2ea0_1013    conda-forge
lz4-c                     1.9.3                h9c3ff4c_1    conda-forge
matplotlib                3.5.1            py39hf3d152e_0    conda-forge
matplotlib-base           3.5.1            py39h2fa2bec_0    conda-forge
matplotlib-venn           0.11.6             pyh9f0ad1d_0    conda-forge
munkres                   1.1.4              pyh9f0ad1d_0    conda-forge
ncurses                   6.3                  h9c3ff4c_0    conda-forge
neat                      3.0                      pypi_0    pypi
numpy                     1.22.0           py39h91f2184_1    conda-forge
olefile                   0.46               pyh9f0ad1d_1    conda-forge
openssl                   1.1.1l               h7f98852_0    conda-forge
packaging                 21.3               pyhd8ed1ab_0    conda-forge
pandas                    1.3.5            py39hde0f152_0    conda-forge
pcre                      8.45                 h9c3ff4c_0    conda-forge
pillow                    7.2.0            py39h6f3857e_2    conda-forge
pip                       21.3.1             pyhd8ed1ab_0    conda-forge
pthread-stubs             0.4               h36c2ea0_1001    conda-forge
pyparsing                 3.0.6              pyhd8ed1ab_0    conda-forge
pyqt                      5.9.2            py39h2531618_6  
pysam                     0.17.0           py39h20405f9_1    bioconda
python                    3.9.9           h62f1059_0_cpython    conda-forge
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python_abi                3.9                      2_cp39    conda-forge
pytz                      2021.3             pyhd8ed1ab_0    conda-forge
qt                        5.9.7                h5867ecd_1  
readline                  8.1                  h46c0cb4_0    conda-forge
scipy                     1.7.3            py39hee8e79c_0    conda-forge
setuptools                59.8.0           py39hf3d152e_0    conda-forge
sip                       4.19.13          py39h2531618_0  
six                       1.16.0             pyh6c4a22f_0    conda-forge
sqlite                    3.37.0               h9cd32fc_0    conda-forge
tk                        8.6.11               h27826a3_1    conda-forge
tornado                   6.1              py39h3811e60_2    conda-forge
tzdata                    2021e                he74cb21_0    conda-forge
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
xorg-libxau               1.0.9                h7f98852_0    conda-forge
xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.11            h36c2ea0_1013    conda-forge
zstd                      1.4.9                ha95c52a_0    conda-forge

Additional context
I'm trying to compare a VCF from a variant calling algorithm against the truth set generated by NEAT, but I am unable to do so.

Refactor main code

We want a single point of entry with suboptions, rather than a fragmented code base, similar to how GATK is called for all of its subfunctions:

gatk [ToolName] ToolArguments

I think most of the model creations could be changed to flags in the main gen_reads workflow. Rather than running gen_mut_model.py, then running gen_reads.py, we have one command, where if you need to create the mutation model you can either do it as part of the gen_reads run or as a standalone process.

neat GenMutModel -i /path/to/file

neat GenReads -i /path/to/file --gmm /path/to/other_file

The key component here is to move all the Python code to Source, so that we don't have to mess around with importing from neighboring modules, as we currently do in genSeqErrorModel.py (importing the DiscreteDistribution class, though that could probably just be replaced with standard modules). We would use a bash or Python launcher in the main folder that would call the source code and perform the actual runs.
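A minimal sketch of such a single entry point using argparse subcommands, in the spirit of `gatk [ToolName] ToolArguments`; the subcommand and option names are illustrative, not the final interface:

```python
import argparse

# One parser with a required subcommand, each subcommand carrying
# its own options.
parser = argparse.ArgumentParser(prog="neat")
sub = parser.add_subparsers(dest="command", required=True)

gen_mut = sub.add_parser("gen-mut-model")
gen_mut.add_argument("-i", dest="input", required=True)

reads = sub.add_parser("read-simulator")
reads.add_argument("-c", dest="config", required=True)
reads.add_argument("-o", dest="output", required=True)

args = parser.parse_args(["read-simulator", "-c", "neat_config.yml",
                          "-o", "out"])
print(args.command, args.config)
```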

Number of variants generated from mutational model lower than expected

Hi,
I'm generating tumor reads using the same code described in #49 and bypassing the error using a simple try-except.
Anyway, the number of mutations inserted (153) is considerably lower than in the original input tumor dataset (~56,000) from which the mutation model was generated.
Looking into the code, if I'm not wrong, the number of variants generated with the Poisson distribution is being lowered.

Do you think the number of mutations generated is correct, or is there a bug in the mutation model / artificial read generation?
Thanks a lot.

Incorporating utilities into NEAT in a streamlined way

Allow for runs of the modeling utilities during a run of the sequencing simulator. For example, if you want to analyze fragment length of some sample data and use that model in your simulation, that would all be done in one step, by entering inputs in the config.

Overhaul probabilities

This will need to be broken up into smaller pieces, but the general idea is that we want to overhaul how probabilities are calculated in NEAT. The current code, especially probability.py, is a slow point.

Specifics:

  1. compute_fraglen.py could just calculate a median and std dev or equivalent, then that could be used as input to a standard probability distribution about that point.
  2. Investigate the same for sequencing error
  3. Investigate the same for mutation model
  4. In SequenceContainer, investigate how to improve the random mutations models.
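Item 1 above could be sketched as follows, assuming observed template lengths have been pulled from a BAM (the data here is synthetic, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for template lengths harvested from a real BAM (synthetic data).
observed = rng.normal(300, 30, size=10_000)

# Replace the full empirical table (probability.py) with two summary stats...
mu, sigma = float(observed.mean()), float(observed.std())

# ...and draw simulated fragment lengths from a standard distribution.
simulated = rng.normal(mu, sigma, size=1_000).round().astype(int)
```

Two floats replace a per-length probability table, and sampling becomes a single vectorized call.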

Provide the algorithm used to create BAM files

Hi there, first of all thank you for this repo; it's amazing and saves me a lot of time benchmarking!
My problem relates to the algorithm used to create the golden BAM file. I used Picard CompareSAMs to check the gold-standard BAM against the BAM created manually from the FASTQs, and I see a lot of differences.

Steps:
A) I created simulated data (the bed file is a single row bed file chr21\t0\t46709983)
python gen_reads.py -r hg38.analysisSet.fa -R 150 -o ../NEAT --bam --vcf --pe 300 30 -tr chr21.hg38.bed
B) I trimmed the fastqs
fastp --thread 16 --detect_adapter_for_pe -i read1.fastq.gz -I read2.fastq.gz -o trimmed_read1.fastq.gz -O trimmed_read2.fastq.gz -h trimmed.rep_html -j trimmed.rep_json
C) Aligned with bwa-mem2
bwa-mem2 mem -R '@RG\\tID:A1\\tSM:NEAT\\tLB:lib1\\tPU:run1\\tPL:ILLUMINA' -Y -K 100000000 -M -t 32 hg38.analysisSet.fa read1.fastq.gz read2.fastq.gz -o aligned.sam
D) sort
samtools view -Shb -@ 5 aligned.sam | samtools sort -@ 5 -m 5G - > aligned.bam
samtools index aligned.bam

E) Compare results
gatk CompareSAMs aligned.bam gold.bam --LENIENT_HEADER true -O compare.tsv

The output I get is
MAPPING_MATCH: 1800623
MAPPINGS_DIFFER: 749533
ARE_EQUAL: NO

Thus I was wondering which algorithm is used to create the BAM files and how to reproduce it.

Thank you very much

matplotlib_venn --> matplotlib-venn

Describe the bug
When installing the packages, matplotlib_venn did not work for me; only when I changed it to matplotlib-venn did I have no issues with conda. Just posting in case others experience the same issue.

Number of sites in VCF does not correspond to -M setting?

Describe the bug
This isn't really a bug, more likely a need for clarification. But let's say I specify -M 0.1. I was expecting that 10% of sites in the input FASTA file would then be labeled in the golden VCF file as SNPs. However, when I do this for a single chromosome of length 61420004, I only find 1164343 unique sites with SNPs in the golden VCF file. This is only 1.9% of the total sites.

To Reproduce
Steps to reproduce the behavior:
python gen_reads.py -r {input.fa} -o {prefix} --bam --vcf -R 150 --pe 300 30 -c 20 -M 0.1
zgrep -v "#" {prefix}_golden.vcf.gz | cut -f 1,2 | sort | uniq | wc -l

Expected behavior
My expectation was that 10% of sites would have SNPs with setting -M 0.1, but I'm probably misunderstanding something.

Any clarification would be helpful! Thanks!

setup.py not updated

Hello,
I noticed the setup.py file still references repickle.py and vcf_compare_OLD.py, which causes an error during pip setup. After updating setup.py, installation worked.

sequencing error rate scale

Less of an issue and more of a question, but I'm struggling to understand the scale of the sequencing error variable. As an example, you use PacBio reads with -E 0.10, but PacBio reads have an average sequencing error rate of 15%. Does that mean 0.10 equates to a 10% error rate? I'm trying to simulate Illumina reads with an average sequencing error rate (which on the high end is 0.1 per base sequenced), so I set -E to 0.1, but am now thinking that may be wrong. Could you explain the scale of -E to me?

KeyError: 'AVG_MUT_RATE' when specifying mutModel

Describe the bug
Using gen_reads.py with the -m option and one of the provided models gives an error.

Traceback (most recent call last):
File "/home/aaronkwc/neat-genreads/gen_reads.py", line 901, in <module>
main()
File "/home/aaronkwc/neat-genreads/gen_reads.py", line 195, in main
mut_model = parse_input_mutation_model(mut_model, 1)
File "/home/aaronkwc/neat-genreads/source/SequenceContainer.py", line 1076, in parse_input_mutation_model
out_model[0] = pickle_dict['AVG_MUT_RATE']
KeyError: 'AVG_MUT_RATE'

To Reproduce
python ~/neat-genreads/gen_reads.py -r hg19_chrM.fasta -R 101 --pe 300 30 -c 20 -m ~/neat-genreads/models/MutModel_CLLE-ES_ICGC.p -o output_tumor

Tried some other mutation models in the repo but still got the same error.

Expected behavior
Not giving an error message

Parallel simulations

Hi, does version 3 still have a parameter to parallelize gen_reads.py for whole genomes? I used the "--job" parameter in version 2 to parallelize simulation. Thanks.

Error simulating reads using input VCF for human Y chromosome

Hello, I am using NEAT to simulate paired-end reads using the human reference genome GRCh38 and variants from a VCF of a 1000 Genomes individual for chromosome Y. When running NEAT, I get the following error:

# Chromosome Y (ploidy set to 1)

~/miniconda3/envs/NEAT_env/bin/python3.8 ~/packages/NEAT/gen_reads.py -r GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa -R 101 -o NA07048_NEAT_simulated_chrY_10x_haploid --vcf --pe 300 30 -v 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chrY.recalibrated_variants_NA07048.vcf -M 0 -p 1

Using default sequencing error model.
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa.fai
--------------------------------
reading input VCF...

found 1273 valid variants in input vcf.
 * 167631 variants skipped: (qual filtered / ref genotypes / invalid syntax)
 * 0 variants skipped due to multiple variants found per position
--------------------------------
reading chrY...

1379.071 (sec)
found 1273 valid variants for chrY in input VCF...
--------------------------------
sampling reads...
[Traceback (most recent call last):
  File "/home/amtarave/packages/NEAT/gen_reads.py", line 902, in <module>
    main()
  File "/home/amtarave/packages/NEAT/gen_reads.py", line 549, in main
    print(f'PROCESSING WINDOW: {(start, end), [buffer_added]}, '
UnboundLocalError: local variable 'buffer_added' referenced before assignment

I am not sure what is causing this UnboundLocalError ('buffer_added' referenced before assignment).

Would you be able to assist me with troubleshooting this?

I am attaching here the reference genome (chromosome Y only) and the VCF I used in the above command.

20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chrY.recalibrated_variants_NA07048.vcf.gz

GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa.gz

Thank you.

Best,
Angela

Add support for multi-allelic variants within input VCF

Hi!

We've noticed that multi-allelic VCF entries are not supported within NEAT.

NEAT seems to always choose the first ALT within the VCF entry.

# assume we're just using first alt for inserted variants?

Curious on your thoughts re. whether or not this could ultimately be supported?

i.e., here only a C will be inserted, whereas ideally (with ploidy=2), a 1/2 (C/G) would be inserted into chr1:156736767, wholly replacing the A ref.

chr1 156736767 . A C,G 50 PASS platforms=5;platformnames=Illumina,PacBio,CG,10X,Solid;datasets=7;datasetnames=HiSeqPE300x,CCS15kb_20kb,CGnormal,HiSeq250x250,10XChromiumLR,HiSeqMatePair,SolidSE75bp;callsets=11;callsetnames=HiSeqPE300xSentieon,CCS15kb_20kbDV,CCS15kb_20kbGATK4,CGnormal,HiSeqPE300xfreebayes,HiSeq250x250Sentieon,10XLRGATK,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes,SolidSE75GATKHC;datasetsmissingcall=IonExome;callable=CS_HiSeqPE300xSentieon_callable,CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_HiSeq250x250Sentieon_callable GT:PS:DP:ADALL:AD:GQ 1/2:.:1144:0,181,158:51,294,258:546

Granted this may become a bit tricky with the ability for users to also choose a ploidy.
Assuming ploidy matches max(#alts) in a VCF, could multi-allelic support be formally added?

Thoughts?
cc @ajaltomare
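A minimal sketch of how a multi-allelic GT could map onto haplotypes at ploidy 2; this is a hypothetical helper illustrating the requested behavior, not NEAT's current code (which takes only the first ALT):

```python
def alleles_for_genotype(ref, alts, gt):
    """Map a GT string like '1/2' onto per-haplotype alleles.
    Index 0 is REF; indices 1..n are the comma-separated ALTs."""
    alleles = [ref] + alts
    return [alleles[int(i)] for i in gt.replace("|", "/").split("/")]

# The record above: REF A, ALT C,G, GT 1/2 at ploidy 2.
print(alleles_for_genotype("A", ["C", "G"], "1/2"))  # ['C', 'G']
```

Haplotype 1 would receive the C and haplotype 2 the G, wholly replacing the A ref as described.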

Fasta-only mode

Need a better fasta-only mode. We can make read length optional with a default of 101. From there we can create a better fasta-only approach to generating mutations and output.

IndexError when generating artificial tumor reads

Hi,
I'm trying to generate tumor reads using the following parameters:

gen_reads.py -r hg19.fa \
        -R 150 \
        --pe-model ${FRAGLENMODEL} \
        -c 2500 \
        -e ${SEQERRORMODEL} \
        --gc-model ${GCBIASMODEL} \
        -tr ${BED} \
        -m ${MUTMODEL} \
        -o output

but I'm having this error:

File "/path/to/NEAT/gen_reads.py", line 896, in <module>
   main()
 File "/path/to/NEAT/gen_reads.py", line 618, in main
   all_inserted_variants = sequences.random_mutations()
 File "/path/to/NEAT/source/SequenceContainer.py", line 546, in random_mutations
   if self.black_list[p][k]:
IndexError: index 18001 is out of bounds for axis 0 with size 18001

The mutation model was generated using a VCF assembled from a batch of tumor samples. The fragment length model and the GC bias model were generated from a tumor BAM taken from the same batch used for the mutation model. The sequencing error model was generated from the same sample's FASTQs.
BED file comprises a series of genes of interest.

The error didn't appear when the script was tested with a BED file of a single gene (TP53) and using hg38 as a reference.
One solution could be to wrap the offending lookup in a try/except that catches the IndexError and continues, but I'd like your input about the error.
Thanks a lot for the tool, anyway!
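Rather than swallowing the IndexError, a bounds-checked lookup would at least make the failure explicit. This is a hypothetical guard sketch, since the real fix is likely an off-by-one in how black_list is sized against the targeted-region coordinates:

```python
import numpy as np

def is_blacklisted(black_list, p, k):
    """Bounds-checked lookup: out-of-range positions are treated as
    not blacklisted. (Hypothetical guard, not NEAT's actual fix.)"""
    if 0 <= p < black_list.shape[0] and 0 <= k < black_list.shape[1]:
        return bool(black_list[p][k])
    return False

bl = np.zeros((18001, 2), dtype=bool)
bl[5][1] = True
print(is_blacklisted(bl, 18001, 0))  # False: the index that raised IndexError
print(is_blacklisted(bl, 5, 1))      # True
```

A bare try/except would hide any future indexing bug; the guard keeps the skip behavior while making the boundary condition visible.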

Potential Bug in compute_fraglen.py - Incorrect fraglength distribution

Describe the bug
Fragment Length Distribution is incorrect, even when using NEAT's gen_reads.py

To Reproduce

  1. Create a NEAT dataset with fragment length distribution mean = 300 and SD = 10 (test.fa is 1 Mb of random nucleotides, no Ns):
    python ~/NEAT/gen_reads.py -R 100 -r test.fa -c 10 --pe 300 10 -o NEAT_test

  2. Map reads (I used BWA):
    bwa mem -M test.fa NEAT_test_read1.fq.gz NEAT_test_read2.fq.gz > NEAT_test.bam

  3. Compute the fragment distribution as a sanity check:
    python ~/NEAT/utilities/compute_fraglen.py -i NEAT_test.bam -o NEAT_test

  4. Open the pickle file:
    import pandas
    pandas.read_pickle("NEAT_test.p")
    [[100], [1.0]]

Expected behavior
I would expect the same distribution that I gave gen_reads.py (mean of 300 with SD of 10) to come back out.

Potential Reason
I think the bug has to do with the count_frags method:
First 9 fields in BAM from BAM file:
NEAT_test-test-1 99 test 9039 60 100M = 9254 315

First 9 fields of BAM from script(adding print line within count_frags for loop, approx. line 96):
NEAT_test-test-1 99 0 9038 60 100M 0 9253 100

I believe that converting from an iterable pysam type to a string (line 95) causes some of the fields to change. This also implies that, since the mate mapping (field 8) is always equal to the reference (field 3), every read will look like it was mate mapped, per your code (line 108).

I think this can be fixed by using pysam's built-in attributes, such as item.mate_is_unmapped or item.template_length, instead of splitting the line by tabs.

Multi-allelic mutational model

Suggestion via email: When building mutational models, take into account multi-allelic sites. Add multi-allelic functionality to NEAT across the board.

Unable to simulate HOM (1/1) variants? Reads/VCF only show HET (0/1)

Version
NEAT-genReads V3.2

Hi - Just recently started using this tool, but noticed that the reads/VCF/BAM only support HET calls even when using an input VCF file containing many genotypes. Ploidy is set to 2, so I assume two haplotypes are being constructed from which to sample reads, but still we are only seeing HET support in the reads. Any ideas?

Steps to reproduce below.

truth_vcf

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NEAT
phix 363     .       T       C       222     .       .       GT:PL   0/1:255,0,255
phix 466     .       T       A       222     .       .       GT:PL   1/1:255,0,255
phix 469     .       T       C       222     .       .       GT:PL   1/1:255,0,255
phix 513     .       G       A       222     .       .       GT:PL   1/1:255,0,255
phix 528     .       G       T       222     .       .       GT:PL   0/1:255,0,255

Running NEAT (phix as an example)

$ python ~/git/NEAT/gen_reads.py -r genome.fa -R 150 -o neat-sim -c 35.0 -E 0.0 -M 0.0 --pe 350 70 -d --bam --vcf -p 2 --force-coverage -v truth.vcf
Using default sequencing error model.
Warning: Quality scores no longer exactly representative of error probability. Error model scaled by 0.000 to match desired rate...
Warning: Read length of error model (101) does not match -R value (150), rescaling model...
Using default gc-bias model.
Using artificial fragment length distribution. mean=350 std=70
found index genome.fa.fai
--------------------------------
reading input VCF...

found 5 valid variants in input vcf.
 * 0 variants skipped: (qual filtered / ref genotypes / invalid syntax)
 * 0 variants skipped due to multiple variants found per position
--------------------------------
reading elemphix_padded...
0.002 (sec)
found 5 valid variants for elemphix_padded in input VCF...
--------------------------------
sampling reads...
[PROCESSING WINDOW: ((0, 5566), [0]), next: (5216, 5566), isLastTime: True]
Read sampling completed in 0 (sec)
Writing output VCF...
NEAT finished the simulation in 0.8293910026550293

Followed by naive pileup + bcftools call to check GT status

$ samtools view -h neat-sim_golden.bam | bcftools mpileup --threads 1 --no-BAQ -Q 0 --ff UNMAP,SECONDARY,QCFAIL -Ov -f genome.fa -a 'AD,ADF,ADR,DP,SP,INFO/AD,INFO/ADF,INFO/ADR,FORMAT/SP' - | bcftools call --threads 1 -m -A | grep -v 0/0
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.11+htslib-1.11
##bcftoolsCommand=mpileup --threads 1 --no-BAQ -Q 0 --ff UNMAP,SECONDARY,QCFAIL -Ov -f genome.fa -a AD,ADF,ADR,DP,SP,INFO/AD,INFO/ADF,INFO/ADR,FORMAT/SP -
##reference=file://genome.fa
##contig=<ID=phix,length=5566>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (high-quality bases)">
##FORMAT=<ID=ADF,Number=R,Type=Integer,Description="Allelic depths on the forward strand (high-quality bases)">
##FORMAT=<ID=ADR,Number=R,Type=Integer,Description="Allelic depths on the reverse strand (high-quality bases)">
##INFO=<ID=AD,Number=R,Type=Integer,Description="Total allelic depths (high-quality bases)">
##INFO=<ID=ADF,Number=R,Type=Integer,Description="Total allelic depths on the forward strand (high-quality bases)">
##INFO=<ID=ADR,Number=R,Type=Integer,Description="Total allelic depths on the reverse strand (high-quality bases)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=ICB,Number=1,Type=Float,Description="Inbreeding Coefficient Binomial test (bigger is better)">
##INFO=<ID=HOB,Number=1,Type=Float,Description="Bias in the number of HOMs number (smaller is better)">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##bcftools_callVersion=1.11+htslib-1.11
##bcftools_callCommand=call --threads 1 -m -A; Date=Tue May 24 08:53:40 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NEAT
phix 363     .       T       C       185     .       DP=28;ADF=6,10;ADR=8,4;AD=14,14;VDB=0.201504;SGB=-0.686358;RPB=0.800964;MQB=1;MQSB=1;BQB=0.869704;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=6,8,10,4;MQ=60     GT:PL:DP:SP:ADF:ADR:AD  0/1:218,0,255:28:6:6,10:8,4:14,14
phix 466     .       T       A       200     .       DP=32;ADF=9,5;ADR=10,8;AD=19,13;VDB=0.244124;SGB=-0.683931;RPB=0.981766;MQB=1;MQSB=1;BQB=0.549964;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=9,10,5,8;MQ=60     GT:PL:DP:SP:ADF:ADR:AD  0/1:233,0,255:32:1:9,5:10,8:19,13
phix 469     .       T       C       181     .       DP=33;ADF=9,5;ADR=10,9;AD=19,14;VDB=0.383333;SGB=-0.686358;RPB=0.869806;MQB=1;MQSB=1;BQB=0.260949;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=9,10,5,9;MQ=60     GT:PL:DP:SP:ADF:ADR:AD  0/1:214,0,255:33:1:9,5:10,9:19,14
phix 513     .       G       A       222     .       DP=39;ADF=12,7;ADR=9,11;AD=21,18;VDB=0.866788;SGB=-0.691153;RPB=0.930232;MQB=1;MQSB=1;BQB=0.334958;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=12,9,7,11;MQ=60   GT:PL:DP:SP:ADF:ADR:AD  0/1:255,0,255:39:5:12,7:9,11:21,18
phix 528     .       G       T       222     .       DP=41;ADF=13,9;ADR=8,11;AD=21,20;VDB=0.931105;SGB=-0.692067;RPB=0.966558;MQB=1;MQSB=1;BQB=0.983471;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=13,8,9,11;MQ=60   GT:PL:DP:SP:ADF:ADR:AD  0/1:255,0,255:41:5:13,9:8,11:21,20

As you can see, all 5 come through as 0/1 HET. Is it possible to simulate reads across either GT as defined in the VCF? Any thoughts what could be going wrong here?

Cheers

Quality scores not mimicking real data

Hello! I've been using this program to simulate some whole genome sequencing data for horses to evaluate different variant calling and variant filtration methods. So far it's been really easy to use and understand; however, I've been having issues with the Sequencing Error Model. Regardless of what options I use, what sample FASTQs I provide, or even if I use the default error models that come with the program, I can't get any variation in the quality scores.
The code that I'm using to generate the error models looks like this:

python genSeqErrorModel.py \
       -i /scratch.global/marlo072/simulateddata/M11445_R1_001.fastq.gz\
       -o /scratch.global/marlo072/simulateddata/M11445.SeqErrorModel\
       -i2 /scratch.global/marlo072/simulateddata/M11445_R2_001.fastq.gz \

Then I use the generated model with the following command:

python gen_reads.py \
       -r /scratch.global/marlo072/simulateddata/goldenPath.Ec_build-3.0_wMSY.fa \
       -R 130 \
       -o /scratch.global/marlo072/simulateddata/S005 \
       -c 10 \
       -e /scratch.global/marlo072/simulateddata/M11445.pickle.gz \
       -m /scratch.global/marlo072/simulateddata/M11445.MutModel.pickle.gz \
       --pe-model  /scratch.global/marlo072/simulateddata/M11445.fraglenModel.pickle.gz \
       --gc-model /scratch.global/marlo072/simulateddata/M11445GCModel.pickle.gz \
       --vcf \
       --bam

Another thing that may be worth noting is that the name of the sequencing error model appears different in the two commands because the output name for the error model, "M11445.SeqErrorModel," gets truncated to just "M11445" when genSeqErrorModel.py is run, so the corresponding output file is "M11445.pickle.gz". I've seen this happen any time there's a period in the output name.

When running FastQC on the resulting FASTQs, I get this distribution, which centers around the correct average from the example FastQC report but does not follow any sort of pattern across the read length.
image

Thank you for all of the great work you've done with this tool and I look forward to using it more in the future.

neat read-simulator takes very long

Describe the bug
When I try to simulate reads with VCF input, the process seems to take forever. I did the following:
neat read-simulator -c neat_config.yaml -o neat/

2023-07-13 17:00:37,523:INFO:neat.read_simulator.runner:Using configuration file neat_config.yaml
2023-07-13 17:00:37,523:INFO:neat.read_simulator.runner:Saving output files to data
2023-07-13 17:00:37,524:INFO:neat.read_simulator.utils.options:Run Configuration...
2023-07-13 17:00:37,524:INFO:neat.read_simulator.utils.options:Input fasta: chr22.fa
2023-07-13 17:00:37,524:INFO:neat.read_simulator.utils.options:Producing the following files:
  - data/neat_r1.fastq.gz
  - data/neat_r2.fastq.gz
  - data/neat_golden.bam

2023-07-13 17:00:37,524:INFO:neat.read_simulator.utils.options:Single threading - 1 thread.
2023-07-13 17:00:37,524:INFO:neat.read_simulator.utils.options:Running in paired-ended mode.
2023-07-13 17:00:37,524:INFO:neat.read_simulator.utils.options:Generating fragment model based on mean=300.0, st dev=30.0
2023-07-13 17:00:37,524:INFO:neat.read_simulator.utils.options:Using a read length of 126
2023-07-13 17:00:37,524:INFO:neat.read_simulator.utils.options:Average coverage: 1
2023-07-13 17:00:37,524:INFO:neat.read_simulator.utils.options:Using default error model.
2023-07-13 17:00:37,524:INFO:neat.read_simulator.utils.options:Ploidy value: 2
2023-07-13 17:00:37,524:INFO:neat.read_simulator.utils.options:RNG seed value for run: 5404759810307010
2023-07-13 17:00:37,524:INFO:neat.read_simulator.runner:Reading Models...
2023-07-13 17:00:37,525:INFO:neat.read_simulator.runner:Reading chr22.fa.
2023-07-13 17:00:40,037:INFO:neat.read_simulator.runner:Beginning simulation.
2023-07-13 17:00:40,584:INFO:neat.read_simulator.runner:Generating variants for chr22
2023-07-13 17:03:39,927:INFO:neat.read_simulator.utils.generate_variants:Finished generating random mutations in 2.98 minutes
2023-07-13 17:03:39,927:INFO:neat.read_simulator.utils.generate_variants:Added 51203 mutations to chr22
2023-07-13 17:03:39,927:INFO:neat.read_simulator.runner:Outputting temp vcf for chr22 for later use
2023-07-13 17:03:40,358:INFO:neat.read_simulator.utils.local_file_writer:Finished outputting temp vcf/fasta
2023-07-13 17:03:40,361:INFO:neat.read_simulator.utils.generate_reads:Sampling reads..

All the files to replicate the issue can be downloaded here


Inconsistent coverage with high depth (> 20,000X)

Describe the bug
I'm not sure it is a bug, but as far as I can see, the random variation in coverage can become very large with high-depth datasets.

Background: we're testing sensitivity of variant callers in high-depth, very low allele fraction scenarios. We're using NEAT to generate the reads: a typical scenario expects around ~25,000X-30,000X coverage. Our approach is also focused, so we have a BED file with the regions we are interested in.

The issue (or perhaps working as intended, I don't know) is that NEAT is very variable at such high coverages. We tried with two depths:

  • 25,000X - after alignment with bwa-mem and measurement of coverage (with mosdepth) yields ~14K
  • 30,000X - we have around 18K coverage or so

The options we used NEAT with:

-R 150 \
--pe 192 37 \  # Determined from our existing data
-c 30000 \     # or 25000
-E 0.004 \     # From sequencing data
-tr ${BED} \
-M 0.002

The question is: is this behavior a feature rather than a bug? Do we have to push coverage much higher to actually get close to our estimated amount?

To Reproduce
Steps to reproduce the behavior:

  1. Generate FASTQ with NEAT at high depth
  2. Align against hg38 with bwa-mem
  3. Measure coverage with mosdepth

Expected behavior
A median coverage closer to the actual value specified.

Incorporate pybedtools into compute_gc + improvements

Refactor the GC bias utility to take an input BAM instead of a genomecov file. It should be able to take either as -i. Update the description and help text to make this clear. Require the ".bam" suffix (binary alignment file) as the way the user indicates that the input is a BAM; every other suffix will be treated as a genomecov file.

See for description of the tool usage:
https://daler.github.io/pybedtools/autodocs/pybedtools.bedtool.BedTool.genome_coverage.html?highlight=genomecov

  • Refactor to detect a bam input by the ".bam" suffix.
    • If the input is a bam (ends in ".bam"), run pybedtools genome coverage on file and either use the output file as input or restructure output to match the output of the current process_genomecov (call the new function process_bam). Refactor process_genomecov if needed
    • If input is not a bam, process similar to how process_genomecov currently processes, refactor as needed per above
  • Currently the code misses the case where the genomecov file transitions from one chromosome to the next within a window in process_genomecov. It needs the current splt[0] compared to the current_ref variable; if we have changed chromosomes, we need to skip the partial window and start at the beginning of the new chromosome
  • Function process_genomecov has a check: if "N" not in seq. Remove the if check and process the next two lines regardless
  • Refactor variables for clarity.
  • Add unit tests
  • In calculate_coverage, instead of printing k/float(window), print just k, not as a percentage. my_len is just the sample size for that bin. Instead of that, print running_total. Refactor print statements to use f-strings.
  • The if statement in calculate_coverage should only change how the mean is calculated, instead of changing the print statement and the bin_dict assignment
  • There is a scope bug in calculate_coverage. It modifies the bin_dict in place, instead of generating a dictionary and returning it. The return should be the overall mean of the data set and the dictionary of bins: mean coverage
  • calculate_coverage just takes the mean and then computes overall averages. This could be done in the scope of the gc bin counting of process_genomecov
  • in the gc_bias model, instead of range(window_size + 1) report only the window_size and the y_out.

Algorithm description:

  • Takes in a genomecov file, a TSV file with lines like "chrom 23 34", defined as: chromosome name, position (int, 1-based), and coverage at that position (int). Positions in Python and FASTA files are 0-based, so you'll need to subtract 1 from the position value to get the index.
  • User enters desired window (default 50)
  • A dictionary of values from 0 to 50 (the 51 possible counts of G's and C's in each 50-base window)
  • Tabulate the GC count for every 50 bases; this will be your bin. Counting G's and C's in a 50-base window yields a total from 0 to 50 inclusive. For each bin, find the average coverage across the windows that fell into that count.
  • We also keep track of the total coverage for the windows and the total number of windows, to calculate an overall average.
  • Finally, we normalize the bin averages by dividing by the overall average, giving a relative average for coverage bias for each concentration of GCs within the window size
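The algorithm described above can be condensed into a small sketch (illustrative only, not the final compute_gc implementation):

```python
def gc_bias_bins(seq, coverage, window=50):
    """Relative coverage bias per GC count in fixed-size windows.
    seq: reference sequence; coverage: per-base coverage (same length)."""
    # gc_count -> [sum of window mean coverages, number of windows]
    bin_totals = {n: [0.0, 0] for n in range(window + 1)}
    total_cov, n_windows = 0.0, 0
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        mean_cov = sum(coverage[start:start + window]) / window
        bin_totals[gc][0] += mean_cov
        bin_totals[gc][1] += 1
        total_cov += mean_cov
        n_windows += 1
    overall = total_cov / n_windows
    # Normalize each bin's mean by the overall mean: relative bias per GC count.
    return {gc: (s / n) / overall for gc, (s, n) in bin_totals.items() if n}
```

For example, a sequence whose GC-rich windows carry higher coverage yields bias values above 1.0 for high GC counts and below 1.0 for low ones.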

Insertion of CNVs

Hi, thanks for the great tool!

I saw the issue here about inserting CNVs, which should be provided in a VCF given with the -v option. I tried doing that, but I can see that these variants are skipped because they don't contain ACTG in the ALT field.

reading input VCF...

found 247 valid variants in input vcf.
 * 0 variants skipped: (qual filtered / ref genotypes / invalid syntax)
 * 1 variants skipped due to multiple variants found per position
--------------------------------
reading EB0001...
20.439 (sec)
found 219 valid variants for EB0001 in input VCF...
7 variants skipped...
 - [0] ref allele does not match reference
 - [0] attempting to insert into N-region
 - [7] alt allele contains non-ACGT characters

I tried two different lines with the CNV format:

EB0001 1087750 . C <CNV> . PASS SVTYPE=CNV;END=1090600;SVLEN=2850;Ploidy=1;DP=117 GT:DP:CN:AF 1:83:2:1
EB0001 1087750 . C CNV . PASS SVTYPE=CNV;END=1090600;SVLEN=2850;Ploidy=1;DP=117 GT:DP:CN:AF 1:83:2:1

In both cases it complains about non-ACTG ALT alleles. Can you provide an example of a CNV line in the VCF format that you got working?

Running on ChrM only

I am running a bed file against ChrM.

Two issues.

  1. When you feed in a bed file that has more regions than the target reference, you get the following error:
NEAT/gen_reads.py", line 327, in main
    for k in input_regions.keys():
RuntimeError: dictionary changed size during iteration

This is a Python 2 to 3 issue. I think it can be resolved by changing the code to for k in list(input_regions):
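The failure and the suggested fix can be demonstrated in isolation:

```python
# Minimal reproduction: mutating a dict while iterating raises in Python 3.
input_regions = {"chr1": [(0, 100)], "chrM": [(0, 16569)]}
try:
    for k in input_regions.keys():
        if k != "chrM":
            del input_regions[k]
except RuntimeError:
    pass  # "dictionary changed size during iteration"

# The fix: iterate over a snapshot of the keys.
input_regions = {"chr1": [(0, 100)], "chrM": [(0, 16569)]}
for k in list(input_regions):
    if k != "chrM":
        del input_regions[k]
print(input_regions)  # {'chrM': [(0, 16569)]}
```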

  2. When running on chrM only, I get the following error:
python NEAT/gen_reads.py -r reference_input/chrM.fa -R 150 -o Homo_sapiens.GRCh38_simVar_60x_test -c 60 --force-coverage -to 0.02 -E 0.001 --pe 350 30 -tr target_beds/chrM.bed

Using default sequencing error model.
Warning: Quality scores no longer exactly representative of error probability. Error model scaled by 0.151 to match desired rate...
Warning: Read length of error model (101) does not match -R value (150), rescaling model...
Using default gc-bias model.
Using artificial fragment length distribution. mean=350, std=30
Index not found, creating one...
0.002 (sec)
reading chrM...
Traceback (most recent call last):
  File "/PATH-REDACTED/NEAT/gen_reads.py", line 892, in <module>
    main()
  File "/PATH-REDACTED/NEAT/gen_reads.py", line 412, in main
    (ref_sequence, n_regions) = read_ref(reference, ref_index[chrom], n_handling)
  File "/PATH-REDACTED/NEAT/source/ref_func.py", line 137, in read_ref
    temp = MutableSeq(my_dat)
  File "/PATH-REDACTED/miniconda3/lib/python3.7/site-packages/Bio/Seq.py", line 1663, in __init__
    "The sequence data given to a MutableSeq object "

EDIT: I am now seeing this error on all chromosomes. ChrM just runs fastest.

I am doing this to simulate each chromosome individually and speed up data generation.

Update package versions

Before launching the next version, update the package dependencies to their most recent versions and make sure everything still runs.

WES Style Data - Variant Inserts Missing with `-v` option on.

I am trying to generate WES style data to ensure a variant calling pipeline is functioning properly.

I first ran the simulator with the '-tr' option pointing to my bed file of WES regions. However, the inserted variants in the resulting truth file cover the whole genome, not just the target regions.

For my testing, I would like to check precision/recall in only those regions with target coverage, and not complicate my testing with off target variants. Therefore, I intersected the gold standard VCF against my WES regions. I then ran NEAT a second time specifying the intersected VCF file from the first run as the -v option.

In my results I saw ~40-50% recall and 98% precision. In trying to diagnose the poor recall, I see the following behavior in the data from NEAT:

Properly called variant:
correct_call

Missing variant:
missing_call

As you can see for the missing variant, no bases were altered, and therefore no call was made.

It makes perfect sense that a call wasn't made where there was no inserted variant; however, I would expect when you specify the -v option that all variants in the VCF would be used. Is this not the case?

Multi-threading NEAT

Hi,

I saw in your changeLogs this:

For now, we've eliminated the "Jobs" option and the merge jobs function. We plan to implement multi-threading instead as a way to speed up NEAT's simulation process.

Could you tell me if multi-threading is implemented or how can I speed up NEAT?

Thank you!

IndexError: list index out of range

Hi,

I'm trying to simulate some specific SNPs into my reference bacterium. I have included my VCF file (as a .txt, since GitHub won't let me upload a .vcf), which is responsible for causing the error, although I'm not sure why. I have identified that it's something to do with the first entry in the .vcf file (the SNP at position 2155168). I have triple-checked that the position, REF, and ALT are all correct, and as far as I can see they are, so I'm not sure why I am getting this error. Any help would be great.

python gen_reads.py -r TB_ref.fasta -R 147 -v TB_vcf.vcf --pe 300 30 -o test_data

Traceback (most recent call last):
  File "/mnt/c/Users/Matt/Desktop/UKHSA/Projects/Current/TB/programs/NEAT/gen_reads.py", line 896, in <module>
    main()
  File "/mnt/c/Users/Matt/Desktop/UKHSA/Projects/Current/TB/programs/NEAT/gen_reads.py", line 305, in main
    (sample_names, input_variants) = parse_vcf(input_vcf, ploidy=ploids)
  File "/mnt/c/Users/Matt/Desktop/UKHSA/Projects/Current/TB/programs/NEAT/source/vcf_func.py", line 105, in parse_vcf
    pl_out = parse_line(splt, col_dict, col_samp)
  File "/mnt/c/Users/Matt/Desktop/UKHSA/Projects/Current/TB/programs/NEAT/source/vcf_func.py", line 15, in parse_line
    reference_allele = vcf_line[col_dict['REF']]
IndexError: list index out of range

vcf.txt
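For what it's worth, the traceback points at `vcf_line[col_dict['REF']]`, which is consistent with a data line splitting into fewer columns than expected, e.g. if spaces were used somewhere instead of tabs. A defensive parse like this sketch (hypothetical helper, contig name illustrative; not NEAT's actual vcf_func.py code) would surface that instead of crashing:

```python
# Hypothetical defensive parse that reports short lines instead of raising
# IndexError; not NEAT's actual code.
def parse_vcf_line(line: str):
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:  # a VCF data line has at least 8 fixed columns
        raise ValueError(
            f"expected >= 8 tab-separated columns, got {len(fields)}; "
            "check for spaces used instead of tabs"
        )
    chrom, pos, vid, ref, alt = fields[:5]
    return chrom, int(pos), vid, ref, alt

print(parse_vcf_line("NC_000962.3\t2155168\t.\tC\tT\t.\tPASS\t."))
# ('NC_000962.3', 2155168, '.', 'C', 'T')
```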

Missing information from "golden" vcf output

Hey there,

first of all thank you for this amazing tool, it has really helped me in my research!

I have noticed that in the "golden" .vcf output file there is some information missing.

Specifically, in the INFO column there is only the value for the ploidy indicator (WP), although in the meta-data section others are mentioned (e.g. DP, AF, DUP, etc.).

This is the command that I ran to generate my data:
python gen_reads.py -r testing/TP53/TP53.fasta -R 150 -c 5000 -M 0.1 -o testing/TP53/TP53 --pe 300 30 --bam --vcf

I am also attaching a screenshot from the "golden" file.
image

Thank you in advance,
Stella

HG19 input crashes NEAT

Describe the bug
Hi there, I just downloaded the latest release (compatible with Python 3) and, after installing all requirements, I tried to run gen_reads.py. Even though I don't get any error (only a warning regarding the bed file: Warning: Reference contains sequences not found in targeted regions BED file. But I saw that's quite normal, and I know the reference matches the bed file because I used them in the germline pipeline), the output files are still empty after 7-8 hours.

This is my env

python==3.8.8 (I tried also with python 3.8.5 and 3.6 but it doesn't work)
numpy==1.19.5
biopython==1.78
matplotlib==3.3.4
matplotlib_venn==0.11.6
pandas==1.2.1
pysam==0.16.0.1

This is my command using hg38 as ref

python gen_reads.py -r ../../pipelines/genomes/hg38_analysisSet/hg38.analysisSet.fa -R 101 -o ../../pipelines/data/simulated_data/test_1 --vcf --pe 300 30 -tr ../../pipelines/genomes/geneAnnotations/hg38.exome.pad20nts.ncbiRefSeq.bed -c 50

This is my log

Using default sequencing error model.
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index ../../pipelines/genomes/hg38_analysisSet/hg38.analysisSet.fa.fai
Warning: Reference contains sequences not found in targeted regions BED file.
reading chr1...

I'm running it on Ubuntu 20.04

While when I use hg19, I get this error:

Traceback (most recent call last):
  File "gen_reads.py", line 902, in <module>
    main()
  File "gen_reads.py", line 549, in main
    print(f'PROCESSING WINDOW: {(start, end), [buffer_added]}, '
UnboundLocalError: local variable 'buffer_added' referenced before assignment
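That UnboundLocalError usually means `buffer_added` was only assigned inside a branch that wasn't taken for this input before the print statement ran. A minimal illustration of the failure mode and the usual fix (variable name borrowed from the traceback; the surrounding logic is hypothetical, not gen_reads.py's actual code):

```python
# 'buffer_added' stands in for the variable from the traceback; the
# condition stands in for whatever branch gen_reads.py takes.

def process_window_broken(condition: bool):
    if condition:
        buffer_added = 10
    return buffer_added  # UnboundLocalError when condition is False

def process_window_fixed(condition: bool):
    buffer_added = 0  # initialize before the conditional branch
    if condition:
        buffer_added = 10
    return buffer_added

print(process_window_fixed(False))  # 0
```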

thanks :)

NOTE: I'm reproducing this bug from Zach's repository. It was originally submitted by @joys8998

Test Ticket

Did you get an email notification about this?

Testing


Bam input into compute_fraglen.py measures read length rather than insert size

Describe the bug
If a bam file is input directly into compute_fraglen.py without prior conversion to a .sam, the output pickle file appears to contain a model describing the read length rather than the insert size. I used this method because the command described in the readme (./samtools view toy.bam | python compute_fraglen.py) yielded the following error:

usage: compute_fraglen.py [-h] -i input -o output
compute_fraglen.py: error: the following arguments are required: -i, -o

To Reproduce
python compute_fraglen.py -i file.bam -o fraglen

Expected behavior
A model of the insert size; what we receive instead is a model of the read length.

Desktop (please complete the following information):

  • OS: CentOS Linux 7 (Core)
  • Python v3.7.2
  • pysam v0.15.2
  • samtools v1.10
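For reference, the distinction at the heart of this report can be shown on a single SAM record: the insert size is TLEN (column 9), while the read length is the length of SEQ (column 10). A model built from len(SEQ) instead of |TLEN| would describe read length, exactly as observed (toy record with illustrative values, not actual compute_fraglen.py code):

```python
# In SAM format, the fragment (insert) size is TLEN (column 9); the read
# length is the length of the SEQ field (column 10).
def tlen_and_read_length(sam_line: str):
    fields = sam_line.rstrip("\n").split("\t")
    tlen = abs(int(fields[8]))  # column 9: template/insert length
    read_len = len(fields[9])   # column 10: length of the SEQ field
    return tlen, read_len

# A toy paired-end alignment: 10 bp read, 350 bp insert.
line = "r1\t99\tchr1\t100\t60\t10M\t=\t440\t350\tACGTACGTAC\t" + "I" * 10
print(tlen_and_read_length(line))  # (350, 10)
```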

Specification of variant allele frequency for inserted mutation ?

Hi,

Thanks a lot for creating these tools, they look very useful!

I am not sure if this is a feature that would be nice to have or if it already exists, but so far I could not find it: is it possible to specify variant allele frequencies for injected variants? For example, I would like to be able to specify that the variant 1:100000 A>T will be present with an allelic fraction of 9%.

Is this something that is possible already? If so, via which option? I tried using the INFO field with AF=0.09 in the VCF containing injected variants (option -v), but I have the impression that this input is not considered. I also tried to specify this in the fourth field ("meta_data") of the bed file provided via the -tr option, but my impression is that this is also ignored.

Best,
Yvan

P.S. Does the project have a user group? It was not very straightforward for me to find where, or to whom, my questions could be addressed.

Coverage not working as intended

Hello dev,
I have been using your tool for some indel simulation, and I see that the -c parameter is hit or miss when it comes to simulation.

Commands that i tried

python ./NEAT/gen_reads.py -r hs37d5.fa -tr target.bed -R 151 -c 25 -E 0.0027 -M 0 -v dup_50_100_NGS120.vcf --pe 255 84 -p 2 --bam --vcf -o neat-50-100-dup-NGS120

In this case I get coverage around 25, but it drops below 25 in places, and it doesn't generate the variants that I provided in the VCF file.

In order to force the coverage i used the --force-coverage parameter shown in the help document
--force-coverage [debug] ignore fancy models, force coverage to be constant (default: False)

python ./NEAT/gen_reads.py -r hs37d5.fa -tr target.bed -R 151 -c 25 --force-coverage -E 0.0027 -M 0 -v dup_50_100_NGS120.vcf --pe 255 84 -p 2 --bam --vcf -o neat-50-100-dup-NGS120

After using this command, the variants I want are generated nicely, but the coverage in some regions jumps 20-50x higher, which is also not helpful, as the coverage goes way above the value I want to test at.

Is there a way to keep the coverage at a certain level while still generating good reads containing the variants?

FP SNPs in "golden" vcf?

Describe the bug
I generate reads with the following command:

"python /gpfs/home1/gozhegov/NEAT/gen_reads.py -r /gpfs/work3/0/qtholstg/hg38_res/Ref/Ref_simplified.fa -c 50 -R 147 -v {input.vcf} -tr  /gpfs/work3/0/qtholstg/hg38_res/intervals/Agilent_V6_hg38.uniq.bed -o {params.out_prefix} --pe 300 80 --bam --vcf"

In the golden VCF, I see a lot of SNPs that are present neither in the golden BAM (these regions are not covered at all) nor in the VCF that was used for producing the reads.

Is this a bug, or am I using the tool wrong?

Screenshot from 2023-03-12 18-05-29

Problem with vcf_compare_OLD.py

Hello, I tried to use the tool for comparing "golden" vcf with produced by pipeline.

Command was

python NEAT/utilities/vcf_compare_OLD.py -r ../../hg38_res/Ref/GRCh38_full_analysis_set_plus_decoy_hla.fa -g FEMALE_reads/NL_VUMC_KG-002592_golden.vcf.gz -w COMPARISON/PER_SAMPLE/GLnexus_on_Haplotypecallerdefault/NL_VUMC_KG-002592_WES.vcf.gz -o COMPARISON/GLnexus_on_HC_vs_golden/NL_VUMC_KG-002592

And it ends with an error.

SUCCESS: deleting duplicate, unzipped vcf file (NL_VUMC_KG-002592_WES.vcf)
SUCCESS: saving false negative venn plot here (COMPARISON/GLnexus_on_HC_vs_golden/NL_VUMC_KG-002592_FNvenn.pdf)                                                                                           
Traceback (most recent call last):
  File "/gpfs/work3/0/qtholstg/Georgii_tests/NEAT_test/NEAT/utilities/vcf_compare_OLD.py", line 874, in <module>                                                                                          
    main()
  File "/gpfs/work3/0/qtholstg/Georgii_tests/NEAT_test/NEAT/utilities/vcf_compare_OLD.py", line 812, in main                                                                                              
    mpl.savefig(ouf)
  File "/home/gozhegov/miniconda3/envs/preprocess/lib/python3.10/site-packages/matplotlib/pyplot.py", line 954, in savefig                                                                                
    res = fig.savefig(*args, **kwargs)
  File "/home/gozhegov/miniconda3/envs/preprocess/lib/python3.10/site-packages/matplotlib/figure.py", line 3274, in savefig                                                                               
    self.canvas.print_figure(fname, **kwargs)
  File "/home/gozhegov/miniconda3/envs/preprocess/lib/python3.10/site-packages/matplotlib/backend_bases.py", line 2338, in print_figure                                                                   
    result = print_method(
  File "/home/gozhegov/miniconda3/envs/preprocess/lib/python3.10/site-packages/matplotlib/backend_bases.py", line 2204, in <lambda>                                                                       
    print_method = functools.wraps(meth)(lambda *args, **kwargs: meth(
  File "/home/gozhegov/miniconda3/envs/preprocess/lib/python3.10/site-packages/matplotlib/backends/backend_pdf.py", line 2808, in print_pdf                                                               
    file = PdfFile(filename, metadata=metadata)
  File "/home/gozhegov/miniconda3/envs/preprocess/lib/python3.10/site-packages/matplotlib/backends/backend_pdf.py", line 713, in __init__                                                                 
    fh, opened = cbook.to_filehandle(filename, "wb", return_opened=True)
  File "/home/gozhegov/miniconda3/envs/preprocess/lib/python3.10/site-packages/matplotlib/cbook/__init__.py", line 492, in to_filehandle                                                                  
    fh = open(fname, flag, encoding=encoding)
FileNotFoundError: [Errno 2] No such file or directory: 'COMPARISON/GLnexus_on_HC_vs_golden/NL_VUMC_KG-002592_FNvenn.pdf' 
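One common cause of a FileNotFoundError from `open(..., "wb")` at this point is that the output directory passed via -o does not exist yet: matplotlib's savefig() creates the file but not its parent directories. A sketch of the usual fix (the cause is inferred from the error, and the path is illustrative; this is not vcf_compare_OLD.py's actual code):

```python
# Ensure the output directory exists before saving; savefig()/open() will
# not create missing parent directories.
import os

out_pdf = "COMPARISON/GLnexus_on_HC_vs_golden/sample_FNvenn.pdf"
os.makedirs(os.path.dirname(out_pdf), exist_ok=True)
# mpl.savefig(out_pdf) would now find the directory in place
```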

Bypass Bam Functionality when not Needed

The code in place to handle creating the golden BAM is both computationally expensive and integrated into the codebase in such a way that it is basically impossible to bypass at this point. We either need to refactor the core functionality of NEAT, or spend time teasing out which parts are BAM-creation-specific and which aren't. One possibility is to refactor the code into two core tracks:

  1. A no-BAM-needed version of gen_reads, which just inserts mutations and adjusts the golden VCF output.
  2. A BAM-needed, slower, clunkier version.

MutableSeq Object error

I have been receiving an error message when trying to run NEAT to generate artificial sequence data with the gen_reads.py script. Any help would be much appreciated, thanks!

To Reproduce

I have only recently tried to use NEAT and so have set up a conda environment containing the dependencies. I will attach a text file with my conda environment details.
conda_list.txt

I have then done a git clone of the repository and run the setup.py script.

I have then used the genSeqErrorModel.py script to generate an error model for the types of reads I would like to simulate using:
python genSeqErrorModel.py -i 150_reads1.fq -i2 150_reads2.fq -o errormodel_150
This then generates the file errormodel_150.pickle.gz

I have then used bedtools genomecov and then compute_gc.py to generate a gc coverage bias
bedtools genomecov -ibam 150_reads.bam -g ref.fa > 150_reads.bed
python compute_gc.py -r ref.fa -i 150_reads.bed -o gc_cov
This generates the file gc_cov.pickle.gz

From here I have run the following command:

python gen_reads.py -R 150 --vcf -p 1 -e errormodel_150.pickle.gz --gc-model gc_cov.pickle.gz -r ref.fa -o sim_data150

and receive the following output:

found index ref.fa.fai
reading NC_012920.1...
0.008 (sec)

sampling reads...
Traceback (most recent call last):
  File "gen_reads.py", line 892, in <module>
    main()
  File "gen_reads.py", line 615, in main
    all_inserted_variants = sequences.random_mutations()
  File "/nfs/anaconda3/envs/read_sim/lib/python3.8/site-packages/NEAT-3.0-py3.8.egg/source/SequenceContainer.py", line 600, in random_mutations
    temp = MutableSeq(self.sequences[i])
  File "/nfs/anaconda3/envs/read_sim/lib/python3.8/site-packages/Bio/Seq.py", line 1662, in __init__
    raise TypeError(
TypeError: The sequence data given to a MutableSeq object should be a string or an array (not a Seq object etc)

Expected behavior
I would have expected it to output reads of 150bp in length in fastq format with a VCF file that have a similar gc content and error profile to real sequence data I possess.

Desktop (please complete the following information):

  • OS: [windows10]
  • Browser [chrome]

Additional context
I am trying to get this working on a remote server that is running on Ubuntu 20.04.3

Skipped variants in VCF file

I am trying to introduce variants via a VCF file, but when I do so, NEAT skips a large proportion of them and I'm not quite sure why (although I think it might be syntax that NEAT isn't used to). I have attached the VCF in question to this post. In principle, I assume that NEAT is skipping all variants whose GT field isn't 1/1. This may not be the problem, but I thought I would ask in case they are being filtered out for another reason.

`Warning: Found variants without a GT field, assuming heterozygous..
found 3614 valid variants in input vcf.
3997 variants skipped: (qual filtered / ref genotypes / invalid syntax)
0 variants skipped due to multiple variants found per position

reading AL123456.3 Mycobacterium tuberculosis H37Rv complete genome..
3.196 (sec)

sampling reads..`

N0004.txt

Inserting New variants using VCF file not inserting the SNP/INDELS

Thank you for this amazing tool.
I'm trying to simulate whole-exome sequencing data from a mouse FASTA file.
I want to insert certain SNPs/Indels using the -v filename.vcf option.
To check that the code works, I created a toy FASTA file (littleFasta.fa) that contains just 300 bases. It looks like this in its entirety:

>chr11
CCCTGCCAGGGCTGCTGGTGATTCTCCACATCCTTAGGCTCCGCGGTGCTTACCTTCAGG
ACTCTCCAGTTGTAACCCCTTTGTTGGGATGCCTGGGAGCCAGACAAGGTCACCCCATTT
TTTAAGAGAGGACGAAGGTGAGAGGGAGACTACAATGAAAAGGTTGGGAGGGGCCCCAGG
CATGGCCCCTGTGTGTGGAAAACACAGGTGACCACCGGCACCCAGACTGTCTACACTATG
CCTCCAGAAGGCACTTTGCCTAGCAACAGGCCTGACCATGCAGCGCTGGTCCAATCTCTC
I use a toy BED file (babyBed.bed) in order to specify the exonic regions for targeted sequencing.

chr11 0000000 0000100 NR_003517.2 0.000000 + mm39_ncbiRefSeq exon . gene_id "NR_003517.2"; transcript_id "NR_003517.2";
chr11 0000200 0000300 NR_003517.2 0.000000 + mm39_ncbiRefSeq exon . gene_id "NR_003517.2"; transcript_id "NR_003517.2";

The VCF file specifying the variants is called spec4.vcf and looks like:
##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
11 1 . C A 100 PASS . GT 0/1
11 30 . A T 100 PASS . GT 1/1
11 60 . G T 100 PASS . GT 1/1
11 240 . G A,T 100 PASS . GT 1/2

The variants get parsed, but when I look at the reads, the bases that are supposed to change (for example, C > A) don't actually have the substitution.

I used this command to run the simulation:
python gen_reads.py -r littleFasta.fa -R 20 -c 1 --discard-offtarget -o outputfile -v spec.vcf -M 0 -tr babyBed.bed

Screenshot (138)

As you can see, even for reads spanning the third variant, G > T (the last base in the first line of littleFasta.fa), none of the reads have the variant inserted.
What am I doing wrong here? Can anyone please help me out?
Thank you

Pickle error

I cloned the repo, and am trying to run the base command.

I get the following error:

NEAT/source/SequenceContainer.py", line 872, in __init__
    error_dat = pickle.load(open(model_path, 'rb'), encoding="bytes")
_pickle.UnpicklingError: invalid load key, '\x1f'.

Can you advise?
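For anyone hitting this: '\x1f' is the first byte of the gzip magic number, which suggests a gzip-compressed model file (NEAT's models ship as .pickle.gz) was handed straight to pickle.load() without decompression. A minimal reproduction and fix (the cause is inferred from the error message, not confirmed against the repo):

```python
# Reproduce the UnpicklingError by loading gzip-compressed pickle bytes
# directly, then load them correctly by decompressing first.
import gzip
import io
import pickle

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as fh:
    pickle.dump({"read_length": 101}, fh)  # stand-in for a model file

# Loading the compressed bytes as a plain pickle fails exactly as reported:
buf.seek(0)
try:
    pickle.load(buf)
except pickle.UnpicklingError as err:
    print(err)  # invalid load key, '\x1f'.

# Decompressing first succeeds:
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as fh:
    print(pickle.load(fh))  # {'read_length': 101}
```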
