Giter Site home page Giter Site logo

gtctovcf's Introduction

GTC to VCF converter

Semi-archived state

As of May 2023, it is now recommended to use the fully-supported Illumina solution: DRAGEN Array for GTC to VCF conversion. That tool is the most up-to-date with bug fixes and improvements. This repository will remain open, but may be fully archived in the near future.

Requirements

GTCtoVCF is currently only known to work with python2.7. The following packages are also required. Note that the pysam package is currently not supported on the Windows platform.

Requirement Version
pysam 0.9.0
numpy 1.11.2
pyvcf 0.6.8

An easy way to obtain a version of python with these dependencies available is to use "Miniconda". Miniconda is a minimal python installation along with a package manager ("conda") that can be used to install additional packages. First, obtain the 64-bit python 2.7 installer from https://conda.io/miniconda.html and install

bash Miniconda2-latest-Linux-x86_64.sh

This will run the installer and ask you where you would like to install Miniconda. Then, install numpy, pyvcf, and pysam

conda install -c miniconda numpy=1.11.2
conda install -c bioconda pyvcf=0.6.8
conda install -c bioconda pysam=0.9.0

where conda is a the package manager binary located in the installation location specified in the first step.

Usage

usage: gtc_to_vcf.py [-h] [--gtc-paths GTC_PATHS [GTC_PATHS ...]]
                     --manifest-file MANIFEST_FILE --genome-fasta-file
                     GENOME_FASTA_FILE [--output-vcf-path OUTPUT_VCF_PATH]
                     [--skip-indels] [--log-file LOG_FILE]
                     [--expand-identifiers] [--unsquash-duplicates]
                     [--auxiliary-loci AUXILIARY_LOCI]
                     [--filter-loci FILTER_LOCI] [--disable-genome-cache]
                     [--include-attributes [{GT,GQ,BAF,LRR} [{GT,GQ,BAF,LRR} ...]]]
                     [--version]

Convert GTC file to VCF format

optional arguments:
  -h, --help            show this help message and exit
  --gtc-paths GTC_PATHS [GTC_PATHS ...]
                        One or more GTC files or directories to process (optional)
  --manifest-file MANIFEST_FILE
                        Bead pool manifest for product (*.csv or *.bpm)
  --genome-fasta-file GENOME_FASTA_FILE
                        Reference genome in fasta format
  --output-vcf-path OUTPUT_VCF_PATH
                        Path for generation of VCF output (default is
                        output.vcf)
  --skip-indels         Skip processing of indels (default is False)
  --log-file LOG_FILE   File to write logging information (optional)
  --expand-identifiers  For VCF entries with multiple corresponding manifest
                        entries, list all manifest identifiers in VCF ID field
  --unsquash-duplicates
                        Generate unique VCF records for duplicate assays
  --auxiliary-loci AUXILIARY_LOCI
                        VCF file with auxiliary definitions of loci (optional)
  --filter-loci FILTER_LOCI
                        File containing list of loci names to filter from
                        input manifest (optional)
  --disable-genome-cache
                        Disable caching of genome reference data
  --include-attributes [{GT,GQ,BAF,LRR} [{GT,GQ,BAF,LRR} ...]]
                        Additional attributes to include in VCF FORMAT output
                        (optional) (default is GT GQ)
  --version             show program's version number and exit


Input details

Input and output files

Input GTC files are specified with the --gtc-paths option. One or more paths may be specified with this option. If a given path corresponds to a directory, the script will identify all GTC files within that directory for processing. A combination of file and directory paths may be specified. There is a performance benefit to analyzing multiple GTC files in a single invocation of the program, as it minimizes the IO overhead of reading manifest and genome reference data to memory. When multiple input GTC files are specified, a single VCF file is produced for each input GTC file, as opposed to a single multi-sample VCF file. When no GTC file is provided, the program will still produce an output VCF file without sample genotyping information.

The --output-vcf-path option may either be a file or directory. If the argument is a directory, the program will automatically determine an appropriate name for the VCF output file created within that directory. The behavior is summarized in the following table

# GTC files --vcf-output-path Behavior
0 directory VCF filename determined from manifest filename
1 directory VCF filename determined from input GTC file
2+ directory VCF filenames determined from input GTC files
0 file VCF filename determined from --vcf-output-path argument
1 file VCF filename determined from --vcf-output-path argument
2+ file Error

Manifests

The supplied manifest file may either be in CSV or BPM format; however, a CSV format manifest is required to generate indel records in the output VCF. When running the converter with a BPM manifest, indel processing must be explicitly disabled with the "--skip-indels" option. In either case, the manifest must provide RefStrand annotations. The GTC to VCF converter depends on the presence of accurate mapping information within the manifest, and may produce inaccurate results if the mapping information is incorrect. Mapping information should follow the implicit dbSNP standard, where

  • Positions are reported with 1-based indexing.
  • Positions in the PAR are reported with mapping position to the X chromosome.
  • For an insertion relative to the reference, the position of the base immediately 5' to the insertion (on the plus strand) is given.
  • For a deletion relative to the reference, the position of the most 5' deleted based (on the plus strand) is given.

Any standard product manifest provided by Illumina will already follow these conventions.

Reference genome

The contig identifiers in the provided genome FASTA file must match exactly the chromosome identifiers specified in the provided manifest. For a standard human product manifest, this means that the contig headers should read ">1" rather than ">chr1". For compatibility with BaseSpace Variant Interpreter (https://www.illumina.com/informatics/research/biological-data-interpretation/variant-interpreter.html), the specified path of the reference must contain either contain the string GrCh37 or GrCh38, for build 37 and 38, respectively. Suitable whole genome FASTA files can be built with the download_reference.sh script located within the scripts directory. This bash script is dependent on samtools (http://www.htslib.org/download/). If running on OSX, you may need to install coreutils (e.g. brew install coreutils).

Squashing duplicates

In the manifest, there can be cases where the same variant is probed by multiple different assays. These assays may be the same design or alternate designs for the same locus. In the default mode of operation, these duplicates will be "squashed" into a single record in the VCF. The method used to incorporate information across multiple assays is under the latter "Output description" heading. When the "--unsquash-duplicates" option is provided, this "squashing" behavior is disabled, and each duplicate assay will be reported in a separate entry in the VCF file. This option is helpful when you are interested in investigating or validating the performance of individual assays, rather than trying to generate genotypes for specific variants. Note that if a locus has more than two alleles and is also queried with duplicated designs, the duplicates will not be unsquashed.

Genome cache

By default, the entire reference genome will be read into memory. Generally, this will be more efficient than reading data from the indexed reference on disk at the expense of greater memory utilization. For situations in which the genome caching is not desirable (low memory availability or a small input manifest), it is possible to disable this default behavior with the "--disable-genome-cache" option.

Auxiliary loci

Certain classes of variant types (such as multi-nucleotide variants) are not currently supported in the upstream analysis software that produces GTC files. However, it is possible to query this type of variant by creating a SNP design that differentiates the specific multi-nucleotide alleles of interest. For example, if the true source sequence is

ATGC[AT/CG]GTAA

This assay could be designed as a SNP assay with the following source sequence

ATGC[A/C]NNNN

The GTC converter tool provides an option to supply a list of auxiliary records (in VCF format) to restore the true alleles for these cases in the output VCF. There are several restrictions around this function

  • The auxiliary definition must be bi-allelic.
  • The auxiliary definition must be a multi-nucleotide variant.
  • There must not be multiple array assays (e.g., duplicates) for the locus.

Include Attributes

Default attributes to be included in the VCF are "GT" (genotype) and "GQ" (genotype score). With this option, you can compose which fields from the GTC are output to the VCF. The additional fields available (from GTC version 5) are "BAF" (B Allele Frequency) and "LRR" (Log R Ratio).

Output description

The VCF file output follows VCF4.1 format (https://samtools.github.io/hts-specs/VCFv4.1.pdf). Some additional details on output formatting:

  • Genotypes are adjusted to reflect the sample ploidy. Calls are haploid for loci on Y, MT and non-PAR chromosome X for males.
  • Multiple SNPs in the input manifest which are mapped to the same chromosomal coordinate (e.g. tri-allelic loci or duplicated sites) are collapsed into one VCF entry and a combined genotype generated. To produce the combined genotype, the set of all possible genotypes is enumerated based on the queried alleles. Genotypes which are not possible based on called alleles and assay design limitations (e.g. infiniumII designs cannot distinguish between A/T and C/G calls) are filtered. If only one consistent genotype remains after the filtering process, then the site is assigned this genotype Otherwise, the genotype is ambiguous (more than 1) or inconsistent (less than 1) and a no-call is returned. Please see 'test_class.py' in the 'tests' folder for unit tests and examples demonstrating the genotype merging process.

Docker

Build

Build the Docker image:

docker build -t gtc_to_vcf .

Usage

Set the full path to the location of the GTC files (gtc_dir), manifest csv or bpm (manifest), and reference fasta (ref). The reference fasta must be indexed with samtools faidx, with an accompanying .fasta.fai file.

gtc_dir=/path/to/gtcs
manifest=/data/projects/cag/iscan/gsa-manifest-clusterfile/GSA-24v3-0_A1.csv
ref=/data/projects/cag/reference-data/reference-genomes/human_g1k_v37.fasta

docker run --rm \
-u $(id -u):$(id -g) \
-v $gtc_dir:/data \
-v $manifest:/tmp/$(basename ${manifest}) \
-v $ref:/tmp/ref.fasta \
-v ${ref}.fai:/tmp/ref.fasta.fai \
gtc_to_vcf --gtc-paths . --manifest-file /tmp/$(basename ${manifest}) --genome-fasta-file /tmp/ref.fasta --output-vcf-path .

License

Copyright 2018 Illumina

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

gtctovcf's People

Contributors

brianhill11 avatar eyherabh avatar gvandesteeg avatar jjzieve avatar jzieve avatar kelleyryanm avatar nkrumm avatar stephenturner avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

gtctovcf's Issues

VCF REF column outputting in bytes instead of string

Hello, we are trying to apply GTCtoVCF on Illumina's iScan data with Global Diversity Array. We've converted from IDAT to GTC via iaap-cli, but we noticed the VCF output from GTCtoVCF has a few formatting issues that hopefully you could help resolve.

  1. In the REF column, instead of a base C, we have b'C', which suggested that the output from the python script is in bytes instead of strings.
  2. In the ALT column, we not only have the alternative allele, but also the reference allele. Could this be related to the fact that the REF column is using bytes characters instead of strings?
  3. In the GT field, genotypes are encoded as 1, 2. no 0. Not sure if this is also related to the bytes format?

Example line from our output:
1 762320 JHU_1.762319,exm2268640 b'C' T,C . PASS . GT:GQ 2/2:6

Are there environmental variables that we should specify to prevent this behavior?

For reference, we used the manifest from https://support.illumina.com/downloads/infinium-global-diversity-array-v1-product-files.html and the references were built using the provided download_reference.sh

Python3 support

Hello.
In issue #65 @jjzieve said that GTCtoVCF should work on python3, but last commit on master says:

GTCtoVCF is currently only known to work with python2.7

I can confirm that GTCtoVCF working on Python3 (just tested) and even working with last numpy (1.21.4) and pysam (0.17.0).
Only problem is PyVCF and its use_2to3, removed in setuptools 58, that's why I have to use pip install setuptools<58.

If you plan to support the tool further, I suggest finding a replacement for pyvcf - other people looking (like here).

Here is part of my Dockerfile (we're using Python 3.7.5):

RUN pip install --no-cache-dir 'setuptools<58'
RUN pip install --no-cache-dir numpy pyvcf pysam


ENV GTCTOVCF_VERSION="1.2.1"
RUN cd "$SOFT" \
    && wget -q "https://github.com/Illumina/GTCtoVCF/archive/refs/tags/${GTCTOVCF_VERSION}.tar.gz" -O "$SOFT/GTCtoVCF-${GTCTOVCF_VERSION}.tar.gz" \
    && tar -xzf "$SOFT/GTCtoVCF-${GTCTOVCF_VERSION}.tar.gz" \
    && rm "$SOFT/GTCtoVCF-${GTCTOVCF_VERSION}.tar.gz"
ENV GTCTOVCF="$SOFT/GTCtoVCF-${GTCTOVCF_VERSION}/gtc_to_vcf.py"

Tool assumptions and behaviour

Hi @jzieve,

As a fix to #78, I might need to use a different csv beadpool manifest version than the orignal manifest used to generate idat/gtc. Many of new manifests will have fewer loci than the manifest. To ensure the resulting vcf is correct, may I please confirm some assumptions of the tool:

  1. Assumes snps to be in the same order between beadpool manifest and gtc.
  2. Manifest file name encoded in gtc should match the one provided as the parameter.
  3. Does not need any sequence column for SNP records. No effect or consequence.
  4. Will skip snp records that do not have ref_strand set.
  5. Sequence is required for indel records. It can be bypassed with a fake sequence. It will fail on those records and then skip.

Do you see any simple change to the code that will retrieve the genotypes from gtc based on the order in manifest (e.g. by querying through names) and not assuming the same order?

About ImportError: No module named vcf.parser

Hello, i cloned this project and want to use.
when i run this program the result became like this

Traceback (most recent call last):
File "gtc_to_vcf.py", line 10, in <module>
from vcf.parser import Writer, Reader
ImportError: No module named vcf.parser

I used python 2.7, anaconda, pycharm and i installed

conda install -c anaconda numpy=1.11.2
conda install -c bioconda pyvcf=0.6.8
conda install -c bioconda pysam=0.9.0

all these things.

and when i just run 'gtc_to_vcf.py' at pycharm
the result came like this

usage: gtc_to_vcf.py [-h] [--gtc-paths GTC_PATHS [GTC_PATHS ...]]
                     --manifest-file MANIFEST_FILE --genome-fasta-file
                     GENOME_FASTA_FILE [--output-vcf-path OUTPUT_VCF_PATH]
                     [--skip-indels] [--log-file LOG_FILE]
                     [--expand-identifiers] [--unsquash-duplicates]
                     [--auxiliary-loci AUXILIARY_LOCI]
                     [--filter-loci FILTER_LOCI] [--disable-genome-cache]
                     [--version]
gtc_to_vcf.py: error: argument --manifest-file is required

So i thought that VCF module has installed well.
but when i run such as

python gtc_to_vcf.py args~~~~~~

the error came out.
Can i get some help?

Reference not detected if lowercase characters

In V 1.2.0, if the reference was "a", and GTC said "A", this wouldn't be a problem. However, in V1.2.1, this is no longer the case and causes problems like
10 779284 GSA-rs2486591 g A,G . PASS

V1.2.0 would have been:
10 779284 GSA-rs2486591 G A . PASS

Limitations of --output-vcf-path and --log-file

There are two limitations inconsistencies that I believe would be important to resolve:

  1. --output-vcf-path does not allow to output to stdout. I got the following error:
GTC converter - ERROR - Unable to write to output file /dev/stdout

It would be nice to be able to output to stdout as this could be piped into tools like bcftools to generate compressed binary files rather than generate really large uncompressed text files. Ideally, the default behaviour could be to output to stdout if --output-vcf-path is not used
2) --log-file generates a log file (more or less including what is output on stderr) but even if the option is active the program will generate all the log output to stderr, duplicating this output. It is great that by default the log goes to stderr, but if the --log-file option is used then stderr should not be used instead

Last, I think it would be beneficial if the check for whether the manifest file matches the manifest file used in the gtc files happened in the very beginning, so that the user could promptly fix that without having to wait a very long time.

.vcf format for BaseSpace Variant Interpreter?

Dear Dr. Kelley,

I am Pelayo G. de Lena and I work as a bioinformatician for the laboratory "Instituto de Estudios Celulares y Moleculares (ICM) in Lugo (Spain). On November 29th we had a videoconference about the GSA chip we are using, I hope you remember me!

As I said, we are using your GTCtoVCF tool. I would like to be able to import those .vcf to your BaseSpace Variant Interpreter tool but the GTCtoVCF output format for .vcf is not compatible and gives different formatting errors when trying to import it. I would like to ask you some questions about this:

  1. Can we upload genotyping .vcf's files or does the Variant Interpreter only accept vcf's that come from sequencing?
  2. Do you have any tool/script to reformat the vcf that come out of GTCtoVCF to import into BaseSpace Variant Interpreter directly?
  3. Does GenomeStudio has any way to create vcf files for BaseSpace?

Thank you in advance for your time and attention.
Best regards.

Error using manifest in CSV format

A user is attempting to run the conversion using the CSV version of the manifest file in order to generate indel records in the vcf. It results in an error (see attached) suggesting the same version must be used as was used during generation of gtc calls.

error

Indel haploid calls printed as diploid

Hi there,

We came across this issue while trying to run bcftools norm command to merge multi-line variants into a single line. Here is the command and error:

bcftools norm -m +any -Oz -o output.sample1234.vcf.gz sample1234.vcf.gz
Error at X:48386632: cannot combine diploid with haploid genotype

It was trying to combine these records (sample seems to be a male):

X       48386632        seq-rs797045547 T       TG      .       PASS    .       GT:GQ   0:0.474965
X       48386632        seq-rs587783616 T       G       .       PASS    .       GT:GQ   0/1:0.409332

The 0/1 genotype call doesn't make sense to me for a male sample. It seems that there are many calls on the X chromosome that are haploid, and many that are diploid. My guess is that this piece of the code is part of the issue:

def format_vcf_genotype(vcf_allele1_char, vcf_allele2_char, ploidy):
    """
    Create a VCF representation of the genotype based on alleles. Format appropriately if haploid.
    Args:
        vcf_allele1_char (string): 0,1,2 etc.
        vcf_allele2_char (string): 0,1,2, etc.
        vcf_record (vcf._Record): Record for the entry analyzed
        ploidy(int): Expected ploidy.
    Returns
        string: String representation of genotype (e.g., "0/1"), adjusted for haploid calls if applicable
    """
    assert ploidy < 3

   if ploidy == 1:
        if vcf_allele1_char == vcf_allele2_char:
            return str(vcf_allele1_char)

    vcf_genotype = ""
    if vcf_allele2_char < vcf_allele1_char:
        vcf_genotype = str(vcf_allele2_char) + "/" + str(vcf_allele1_char)
    else:
        vcf_genotype = str(vcf_allele1_char) + "/" + str(vcf_allele2_char)
    return vcf_genotype

Since indels are printed out as either 0/1 or 1/0 (see the convert_indel_genotype_to_vcf function) the check if vcf_allele1_char == vcf_allele2_char: never evaluates to true, and therefore falls through and gets printed as a diploid call.

My thought was that something like this might make more sense:

if ploidy == 1:
        if vcf_allele1_char == vcf_allele2_char:
            return str(vcf_allele1_char)
        else: 
            return str(max(vcf_allele1_char, vcf_allele2_char))

Does that seem reasonable? If so, I can open a pull request.

Thanks,
Brian

error in mainifest

When I run gtc_to_vcf.py on my OmniX data, I got some errors as follows,

2018-09-12 15:17:32,991 - GTC converter - INFO - Processing 1 GTC files
2018-09-12 15:17:32,992 - GTC converter - INFO - Caching reference data
2018-09-12 15:18:11,128 - GTC converter - INFO - Finished caching reference data
2018-09-12 15:18:11,129 - GTC converter - INFO - Handling GTC file /gpfs/gsfs3/users/guow4/projectX1_2018/NIMHfamily/data/201805230017_R01C01.gtc
2018-09-12 15:18:11,166 - GTC converter - WARNING - Provided manifest name: /data/guow4/projectX1_2018/NIMHfamily/GP0441-IN1_NCIprocessed/GP0441-IN1_AnalysisManifest_2918.csv and manifest file used to generate GTC file: InfiniumOmniExpress-24v1-2_A1.bpm do not match, skipping

manifest.csv file>

Sample_ID SentrixBarcode_A SentrixPosition_A Sample_Plate Sample_Well Sample_Group Gender Sample_Name
SC103044_PC04743_A01 2.02E+11 R01C01 WG4001934-DNA A01 sAASP-001 M 1760018
SC102980_PC04743_B01 2.02E+11 R03C01 WG4001934-DNA B01 sAASP-001 F 1866001

Could somebody help me to figure out if the error is due to wrong manifest file or not? I am new to this field. Thanks.

3 things: iteritems() vs items(); rect to polar; sorting errors

  1. using python 3.5+ we get an error because python likes the use of ".items()" not the old ".iteritems()" from the 2.7ish syntax? Is this an issue for others using 3+?

LocusEntryFactory.py: for _, value in position2record.iteritems():

  1. These two sort lines are crashing saying they cannot sort string() < int() or some type conflict

2.1)LocusEntryFactory.py: return sorted(result, key=lambda entry: (self._chrom_sort_function(entry.vcf_record.CHROM), entry.vcf_record.POS, entry.vcf_record.REF))
2.2) ReaderTemplateFactory.py: sorted(contigs.items(), key=lambda x: self._chrom_sort_function(x[0])))

e.g.:
Traceback (most recent call last):
if limit >= 0:
TypeError: unorderable types: AttributeError() >= int()

  1. IlluminaBeadArrayFiles.py", line 600
    def rect_to_polar((x,y)):
    this should be def rect_to_polar(x,y):? does the code even use this function ? Anyhow it's a syntax error for me

Script reports that reference has missing chromosome 6

Is there a known cause for why the script will report the following error upon execution?

"2018-03-01 16:51:29,607 - GTC converter - ERROR - FASTA reference is missing entry for chromosome 6"

This seems to only be a problem with the GrCh38 reference; it seems to work fine with build 37.

GTC converter - WARNING - Reference is missing entry for chromosomes

Hi Dr. Kelley!
I get this error when using your tools in a Linux Virtual Machine hosted in windows. When downloading genome fasta file i found some warnings related to "grep" command at the end of the process. I continue as usual with the installation and when running the script i get this error.
Could you please help me to fix that? I re-download the reference, the manifest and the git repo and i still having the same problem.

Thank you in advance.

Pelayo

$ ./gtc_to_vcf.py --gtc-paths /home/alfonso/Escritorio/GTCs/ --manifest-file /home/alfonso/Escritorio/GSA-24v2-0_A1.csv --genome-fasta-file /home/alfonso/Escritorio/GrCh37/hg19.fa --output-vcf-path /home/alfonso/Escritorio/GTCs

Replace ID in output VCF with `--id` option

Whether processing multiple files as fixed in #9, or when processing a single VCF with --output-vcf-path, the ID in the VCF file always takes the filename of the GTC file. Later we can change this with bcftools reheader or hack it with sed -i, but this adds additional complexity to our pipeline. It would be nice if there's an option when converting a single file to allow for an option like --id <newid> that would allow the ID in the VCF file to be set on the command line. Thanks Ryan.

cc @vpnagraj @punkrockscience

Precision difference when updating numpy versions

Its suggested in the README to use numpy=1.11. But if its updated to a newer version (e.g. 1.15.2) regression tests will fail due to precision differences when outputting floats (e.g. GQ score).

Example:
`AssertionError: '1\t1158675\t1KG_1_1158675\tC\tCCTT\t.\tPASS\t.\tGT:GQ\t0/0:0.43702784\n' != '1\t1158675\t1KG_1_1158675\tC\tCCTT\t.\tPASS\t.\tGT:GQ\t0/0:0.437028\n'

  • 1 1158675 1KG_1_1158675 C CCTT . PASS . GT:GQ 0/0:0.43702784
    ? - -
  • 1 1158675 1KG_1_1158675 C CCTT . PASS . GT:GQ 0/0:0.437028`

Probable solution:

  • Limit precision (either via rounding or output formatting) to a hardcoded (or user-specified value) and regenerate regression truth data.

Python3 Question

The README on your github states to use python 2.7, which will reach end-of-life the end of this year. Is this compatible with python3 yet? If not, I can update the code myself, but if it is, it saves me a lot of time.

Thanks

Bus error when converting GTC to VCF

Hi Illumina team,
we got an "Bus error" when trying to convert a GTC file to VCF. The GTC file was generated from the Infinium LCG Assay idat files using iaap-cli gencall tool on linux.

  1. error message
    2020-10-21 12:32:35,934 - GTC converter - WARNING - Reference allele is not queried for locus: rs876659253_mnv 2020-10-21 12:32:35,941 - GTC converter - WARNING - Reference allele is not queried for locus: ilmnseq_rs878853738_mnv 2020-10-21 12:32:35,957 - GTC converter - WARNING - Reference allele is not queried for locus: rs886040595_mnv Bus error
  2. command used
    ~/molecpathlab/bin/GTCtoVCF/gtc_to_vcf.py --gtc-paths /gpfs/data/molecpathlab/development/Infinium_LCG/test --manifest-file /gpfs/data/molecpathlab/data/Infinium/GDA-8v1-0_A1.csv --genome-fasta-file /gpfs/data/molecpathlab/data/Infinium/genome.fa --output-vcf-path /gpfs/data/molecpathlab/development/Infinium_LCG/test --log-file /gpfs/data/molecpathlab/development/Infinium_LCG/test/Infinium_LCG_test.log --skip-indels --disable-genome-cache

Contig header and Chrom record order do not match reference order

Something I'm altering with post-processing, but it'd be nice if the contig order matched the reference given. In my example, the reference is 1-22,X,Y,MT and the program writes contig headers and records as 1-22,MT, X,Y.

If the reference order was maintained I think that would obviate concerns about "human specific" sorting in code.

Thanks!

BAF and LRR

With GenomeStudio it is possible to output information such as BAF and LRR from the GTC files. I don't understand how it is possible to do the same using GTCtoVCF. It seems like it would be something very easy to add.

vcf.parser error

Hi Ryan:
i'm a biologist learning bioinformatics.
When using your facility GTCtoVCF i have this error.
Could you please help me? I went step by step througt the readme file
I really appreciate your advice.
Thank you in advance

Pelayo

Traceback (most recent call last):
File "./gtc_to_vcf.py", line 9, in
from vcf.parser import Writer, Reader
ImportError: No module named vcf.parser

call rate filtering

Hi Ryan.

We've run into a use case where we're generating lots of VCFs from a directory full of GTCs then doing some merging and postprocessing from there. Except we're getting derailed by samples that have a low call rate, and presumably bad genotypes at sites that are called (when I say low, I mean like <90%, <70%, etc., really low). I know this could probably be addressed upstream somehow, and certainly downstream - I could check stats on call rate and implement some sample selection with bcftools or something else. But as I'm guessing the call rate is built into the GTC file somewhere itself, could the script be modified with a flag that will skip over any GTC files having a call rate below a threshold defined as an option? Are there better ways of handling this that you could envision?

Thanks,

Stephen

Any experience about GQ settings

Hi team,
Could you help to explain the meaning of GQ? what is the GQ range? Can we try to use the GQ to filter low-confidence calls?
Best,
Xianying

Next release

There is about 9 months difference between the latest commit of develop and master branches. When do you plan to do a new release to add latest changes to the master (and new version tag)?

SB calculation

Hallo,

I am analysing vcf files that are produced from miseq. There is strand bias filter used, but i dont know how this SB is calculated. I need to understand this to be able to find the right threshold. Can you please tell me, how the SB is calculated in miseq pipeline? Thankyou!

Unclear Error message

When there is a missing attribute in the GTC, it will report the missing entry in its numerical representation, it would be nice to report what exactly is missing. For example instead of reporting "GTC converter - ERROR - 1004", it would to nice to report "GTC converter - ERROR - 1004(Missing genotype quality score)"

Empty ALT genotypes?

Hello @jzieve

I see some of the genotypes have empty ALT allele. This could be simply missing calls. But instead of using . or N, the field is simply empty. In other records I do see the ALT being populated with an N. I can fix it by adding . or N in place of empty ALT field. But wanted to check with you if there is any reason that it might be intentionally set as empty by the GTCtoVCF.

Example records:

chr1 47851 cnvi0146654 C . PASS . GT:GQ ./.:0
chr1 50251 cnvi0146656 T . PASS . GT:GQ ./.:0
chr1 51938 cnvi0151530 T N,A . PASS . GT:GQ ./.:0
chr1 52651 cnvi0146655 T . PASS . GT:GQ ./.:0
chr1 55338 cnvi0159124 A N . PASS . GT:GQ ./.:0
chr1 64251 cnvi0146663 A . PASS . GT:GQ ./.:0
chr1 65451 cnvi0147451 A . PASS . GT:GQ ./.:0
chr1 80386 rs3878915 C A . PASS . GT:GQ ./.:0
chr1 82154 rs4477212 A T,C . PASS . GT:GQ 1/1:3
chr1 82620 cnvi0052563 A N . PASS . GT:GQ ./.:0

Thanks in advance for your help.

Discrepancies with the VCF Specification

Thanks for the software. While using it, I discovered two items that are not following the VCF specification.
https://samtools.github.io/hts-specs/VCFv4.2.pdf

  1. Commas are used to delimit the ID field, and the specification states semicolons (page 4)
  2. "GQ" is already a reserved keyword in the FORMAT field (page 6)

I'm doing some post-processing as a workaround, but figured I'd mention my findings.

Best,
Ben

GQ score clarification

Hi, I have converted some GTC files to VCF and I have noticed that the GQ score is always between 0 and 17. I have done the conversion with both Beeline and IAAP but the result does not change. I was under the impression that this value should be equal to the GenCall score. Could you please clarify?
Is there a way to have the actual GenCall score in the GTC files?

ERROR - Reference is missing entry for chromosome 3

Hello, I met an error with GTCtoVCF running.

I ran GTCtoVCF in my Ubuntu VM on Windows 10 (Anaconda, python v2.7) and it gave me an error like this:

"GTC converter - ERROR - Reference is missing entry for chromosome 3"

I tried three times to run the GTCtoVCF tool with the same error,
and in the error message, the chromosome number changed at each try.

And the last log file said:

2021-11-29 12:09:20,695 - GTC converter - WARNING - Skipping indel X:2836035
2021-11-29 12:09:20,695 - GTC converter - WARNING - Skipping indel X:77387141
2021-11-29 12:09:22,464 - GTC converter - ERROR - Reference is missing entry for chromosome 3
2021-11-29 12:09:22,465 - GTC converter - DEBUG - Traceback (most recent call last):
File "/home/moirai/Tools/GTCtoVCF-develop/gtc_to_vcf.py", line 309, in main
driver(gtc_paths, manifest_reader, genome_reader, output_vcf_files, args.expand_identifiers, args.unsquash_duplicates, auxiliary_records, args.include_attributes, logger)
File "/home/moirai/Tools/GTCtoVCF-develop/gtc_to_vcf.py", line 141, in driver
locus_entries = LocusEntryFactory(vcf_record_factory, genome_reader.get_contig_order(), unsquash_duplicates, logger).create_locus_entries(manifest_reader)
File "/home/moirai/Tools/GTCtoVCF-develop/LocusEntryFactory.py", line 38, in create_locus_entries
result.append(self._generate_locus_entry(record_group))
File "/home/moirai/Tools/GTCtoVCF-develop/LocusEntryFactory.py", line 81, in _generate_locus_entry
return LocusEntry(bpm_record_group, self._vcf_record_factory.create_vcf_record(bpm_record_group))
File "/home/moirai/Tools/GTCtoVCF-develop/VcfRecordFactory.py", line 78, in create_vcf_record
return self._get_record_for_snv(bpm_record_group)
File "/home/moirai/Tools/GTCtoVCF-develop/VcfRecordFactory.py", line 179, in _get_record_for_snv
chrom, start_index, start_index + 1)
File "/home/moirai/Tools/GTCtoVCF-develop/ReferenceGenome.py", line 178, in get_reference_bases
"Reference is missing entry for chromosome " + str(chrom))
ValueError: Reference is missing entry for chromosome 3

My command was:

/home/MyUserID/Tools/GTCtoVCF-develop/gtc_to_vcf.py --gtc-paths /MyGTCFilePath/ --manifest-file /IlluminaBPMFile --genome-fasta-file /MyRefGenomeFile(GrCh38) --output-vcf-path /home/MyUserID/VCF --skip-indels --log-file /home/MyUserID/VCF/log.txt

Can you help me with this issue?

Thanks.

GTC parser error : All GT values in converted vcf are './.'

I tried to convert all GTC files generated from Illumina Iscan using Beeline software to VCF

gtcpath=203239960044_gtc
outpath=~/P/Iscan/ISCP2
python2.7 /mnt/EdicoNAS/Crepo/GTCtoVCF-1.1.1/gtc_to_vcf.py \
        --gtc-paths $gtcpath \
        --manifest-file ../GSA-24v2-0_A1.csv \
        --genome-fasta-file /mnt/EdicoNAS/Crepo/GTCtoVCF-1.1.1/Reference/hg19.fa \
        --output-vcf-path $outpath

File conversion was successful, but all the GT values in all VCF are './.' which I understood from the VCF representation is no call for diploid cases

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 203239960044_R01C01
1 565433 rs9701055 C T . PASS . GT:GQ ./.:0.0
1 567667 rs9651229 C T . PASS . GT:GQ ./.:0.0
1 568208 rs9701872 T C . PASS . GT:GQ ./.:0.0
1 568527 rs11497407 G A . PASS . GT:GQ ./.:0.0
1 727841 GSA-rs116587930 G A . PASS . GT:GQ ./.:0.0

For another set of gtc files after conversion, the following GQ were found

  1. 0/2
  2. 0/3
  3. 2/2
    How to interpret these genotypes.

Please help to resolve this issue

scripts/download_reference.sh fails when missing `realpath`

Since my initial genome fasta was not accepted, I opted to use the downloaded version. The script tries to set an absolute path early on, using output_file=`realpath ${1}; but I'm currently working on a server where this tool is not installed. The script then continues, but only downloads the .gz files. From the looks of the second for loop, it should actually unzip the files and concatenate output to the specified output file and then run samtools.

download_reference.sh is outdated

Hey,

The $remote FTP site seems to have changed for build 38 to :
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/Assembled_chromosomes/GCF_000001405.39_GRCh38.p13_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/

Unable to run download_reference.sh on Mac

I tried to build suitable whole genome FASTA files can be built with the download_reference.sh script on macOS Mojave:
./download_reference.sh output-37.fa 37

I got this error:
readlink: illegal option -- f
usage: readlink [-n] [file ...]
usage: dirname path

Does anyone know how to fix it? Thanks in advance!

logs usage of BPM file when CSV is invoked

Even when the following command is used:
gtc_to_vcf.py \ --gtc-paths \ $GTDIR/idats_with_GTCs/202238520009/ \ --manifest-file \ $GTDIR/GSA-24v1-0_C1.csv \ --genome-fasta-file \ ~/ReferenceData/grch37/grch37.fa \ --output-vcf-path \ $GTDIR/vcf \ --log-file \ $GTDIR/vcf/gtc_to_vcf.log 1> $GTDIR/gtc2vcf.out 2> $GTDIR/gtc2vcf.err &

the following is found in the logs:
2018-07-14 17:19:05,734 - GTC converter - INFO - Manifest file used for GTC conversion identified as: GSA-24v1-0_C1.bpm
The .BPM file does indeed exist in the same folder as .CSV but I did not use the "skip-indels" flag

Merging B allele frequency and log R ratio for multiple BPMRecords

Hi there,

I'm working on extending your code to allow for the extraction of the B allele frequency and log R ratio as additional FORMAT fields in the VCF file (potential solution for #14 ). I was wondering what your recommended method would be for merging multiple measurements (i.e. B allele frequency and log R ratio) at a particular location?

For context, here is sample code:

def generate_sample_format_info(self, bpm_records, vcf_record, sample_name):
        """
        Get the sample b allele frequency
        Args:
        Returns:
            float: B allele frequency
        """
        # TODO: we need to handle the case where there are multiple BPMs
        idx = bpm_records[0].index_num
        return self._b_allele_freq[idx]

The above code naively returns the B allele frequency of the first BPMRecord, but how should they be merged ideally?

Also, I would be happy to open a pull request if you would like to include this functionality once we've figured out how best to merge these records. Adding this additional information to the VCF is specified using an additional command line parameter. Please let me know if you would be interested.

Thanks,
Brian

Any workaround for missing RefStrand and SourceSeq column in the manifest?

Hello,

Thanks for this very useful tool. I am working on a compilation of data catalog. The dataset span many different beadpool manifests across time some of them pre-2009 or so. I have both binary bead pool manifest as well the csv version.

I wish to use the GTCtoVCF to convert the entire dataset to VCF for downstream processing. However since these beadpool manifests are older most does not have the RefStrand column and many do not have the SourceSeq column. I am fine with --skip-indels and focus on the single-nucleotide-polymorphsims, however, the RefStrand is abosolutely required and missing in both the binary and csv version.

Could you please advise on possible workaround to infer the RefStrand column? I have following columns populated consistently across chips:

address_a_id
chr
genome_build
ilmn_id
ilmn_strand
map_info
name
ploidy
snp
source
source_strand
source_version
species

I am working with the following manifests:

HumanOmni2.5-4v1_B.csv
GSAMD-24v1-0_20011747_A1.csv
GSAMD-24v2-0_20024620_B1.csv
Human610-Quadv1_B.csv
Human660W-Quad_v1_A.csv
Cardio-Metabo_Chip_11395247_C.csv
Rare_Cancer_272049_A.csv
HumanOmniExpress-12v1_A.csv
Peguses_FU_11602373_A.csv
Human1M-Duov3_B.csv
HumanHap550v3_B.csv
Consortium-OncoArray_15047405_A.csv
Cancer_BeadChip_11459870_B.csv
Immuno_BeadChip_11419691_B.csv
CGEMS_P_F2_272225_A.csv
Breast_Wide_Track_271628_A.csv
BDCHP-1X10-HUMANHAP550_11218540_C.csv
HumanOmni2.5S-8v1_B.csv
HumanOmni1-Quad_v1-0_B.csv
HumanExome-12v1_A.csv
HumanOmni1S-8v1_A.csv
GSAv3Confluence_20032937X371431_A1.csv
HumanOmni25-4v1_C.csv
HumanOmni2.5-8v1_A.csv
HumanOmni2.5-8v1_A.csv
HumanOmni5-4v1_B.csv

I would appreciate any suggestion or pointers. Thank you.

Illumina Generated Manifest has no RefStrand column

I have no problems running this using BPM, however, when using an Illumina provided manifest (.csv) it is missing the RefStrand header. These are the headers that were provided, can you tell me which one maps to RefStrand (I want to say it is likely SourceStrand, but I would like confirmation before changing a header name; if is is SourceStrand, should I update all SourceStrand embedded headers to be RefStrand)?:

IlmnID,
Name,
IlmnStrand,
SNP,
AddressA_ID,
AlleleA_ProbeSeq,
AddressB_ID,
AlleleB_ProbeSeq,
GenomeBuild,
Chr,
MapInfo,
Ploidy,
Species,
Source,
SourceVersion,
SourceStrand,
SourceSeq,
TopGenomicSeq,
BeadSetID

Reference allele is not queried for locus

I'm seeing 52 lines in the log that the reference allele isn't queried for a particular locus for the omni array I'm running (which includes some custom content) What causes this, and what should be done about it?

Here are a couple of examples. Looks like the reference allele is one thing, but the possible alleles (presumably from the manifest) do not include the reference allele. Is this an error in the manifest?

7       16831874        rs28945081      C       T,G     .       PASS    .       GT:GQ   2/2:0.49126184
15      20740293        rs373904595     T       G,C     .       PASS    .       GT:GQ   2/2:0.32350013

reduce manifest to extract subset?

Thanks for making this publicly available!

I'm testing this on some Omni 5 data. Taking about 20 minutes per sample. If I only want to extract a subset of SNPs (e.g., just SNPs with rs numbers), if I reduce the manafest CSV to contain only the SNPs I want in the final VCF, will this run at all (without error), and will this reduce processing/IO time?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.