freeseek / gtc2vcf

Tools to convert Illumina IDAT/BPM/EGT/GTC and Affymetrix CEL/CHP files to VCF

License: MIT License

gtc2vcf

A set of tools to convert Illumina and Affymetrix DNA microarray intensity data files into VCF files without using Microsoft Windows. You can use the final output to run the pipeline to detect mosaic chromosomal alterations. If you use this tool in your publication, please cite this website. For any feedback or questions, contact the author

Usage
Installation
Software Installation
Identifying chip type for IDAT and CEL files
Convert Illumina IDAT files to GTC files
Convert Illumina GTC files to VCF
Convert Affymetrix CEL files to CHP files
Convert Affymetrix CHP files to VCF
Using an alternative genome reference
Plot variants
Illumina GenCall
Acknowledgements

Usage

Illumina data tool:

Usage: bcftools +gtc2vcf [options] [<A.gtc> ...]

Plugin options:
    -l, --list-tags                   list available FORMAT tags with description for VCF output
    -t, --tags LIST                   list of output FORMAT tags [GT,GQ,IGC,BAF,LRR,NORMX,NORMY,R,THETA,X,Y]
    -b, --bpm <file>                  BPM manifest file
    -c, --csv <file>                  CSV manifest file (can be gzip compressed)
    -e, --egt <file>                  EGT cluster file
    -f, --fasta-ref <file>            reference sequence in fasta format
        --set-cache-size <int>        select fasta cache size in bytes
        --gc-window-size <int>        window size in bp used to compute the GC content (-1 for no estimate) [200]
    -g, --gtcs <dir|file>             GTC genotype files from directory or list from file
    -i, --idat                        input IDAT files rather than GTC files
        --capacity <int>              number of variants to read from intensity files per I/O operation [32768]
        --adjust-clusters             adjust cluster centers in (Theta, R) space (requires --bpm and --egt)
        --use-gtc-sample-names        use sample name in GTC files rather than GTC file name
        --do-not-check-bpm            do not check whether BPM and GTC files match manifest file name
        --do-not-check-eof            do not check whether the BPM and EGT readers reach the end of the file
        --genome-studio <file>        input a GenomeStudio final report file (in matrix format)
        --no-version                  do not append version and command line to the header
    -o, --output <file>               write output to a file [standard output]
    -O, --output-type u|b|v|z|t[0-9]  u/b: un/compressed BCF, v/z: un/compressed VCF
                                      t: GenomeStudio tab-delimited text output, 0-9: compression level [v]
        --threads <int>               number of extra output compression threads [0]
    -x, --extra <file>                write GTC metadata to a file
    -v, --verbose                     print verbose information
    -W, --write-index[=FMT]           Automatically index the output files [off]

Manifest options:
        --beadset-order               output BeadSetID normalization order (requires --bpm and --csv)
        --fasta-flank                 output flank sequence in FASTA format (requires --csv)
    -s, --sam-flank <file>            input flank sequence alignment in SAM/BAM format (requires --csv)
        --genome-build <assembly>     genome build ID used to update the manifest file [GRCh38]

Examples:
    bcftools +gtc2vcf -i 5434246082_R03C01_Grn.idat
    bcftools +gtc2vcf 5434246082_R03C01.gtc
    bcftools +gtc2vcf -b HumanOmni2.5-4v1_H.bpm -c HumanOmni2.5-4v1_H.csv
    bcftools +gtc2vcf -e HumanOmni2.5-4v1_H.egt
    bcftools +gtc2vcf -c GSA-24v3-0_A1.csv -e GSA-24v3-0_A1_ClusterFile.egt -f human_g1k_v37.fasta -o GSA-24v3-0_A1.vcf
    bcftools +gtc2vcf -c HumanOmni2.5-4v1_H.csv -f human_g1k_v37.fasta 5434246082_R03C01.gtc -o 5434246082_R03C01.vcf
    bcftools +gtc2vcf -f human_g1k_v37.fasta --genome-studio GenotypeReport.txt -o GenotypeReport.vcf

Examples of manifest file options:
    bcftools +gtc2vcf -b GSA-24v3-0_A1.bpm -c GSA-24v3-0_A1.csv --beadset-order
    bcftools +gtc2vcf -c GSA-24v3-0_A1.csv --fasta-flank -o GSA-24v3-0_A1.fasta
    bwa mem -M GCA_000001405.15_GRCh38_no_alt_analysis_set.fna GSA-24v3-0_A1.fasta -o GSA-24v3-0_A1.sam
    bcftools +gtc2vcf -c GSA-24v3-0_A1.csv --sam-flank GSA-24v3-0_A1.sam -o GSA-24v3-0_A1.GRCh38.csv

Affymetrix data tool:

Usage: bcftools +affy2vcf [options] --csv <file> --fasta-ref <file> [<A.chp> ...]

Plugin options:
    -l, --list-tags                 list available FORMAT tags with description for  VCF output
    -t, --tags LIST                 list of output FORMAT tags [GT,CONF,BAF,LRR,NORMX,NORMY,DELTA,SIZE]
    -c, --csv <file>                CSV manifest file (can be gzip compressed)
    -f, --fasta-ref <file>          reference sequence in fasta format
        --set-cache-size <int>      select fasta cache size in bytes
        --gc-window-size <int>      window size in bp used to compute the GC content (-1 for no estimate) [200]
        --probeset-ids              tab delimited file with column 'probeset_id' specifying probesets to convert
        --calls <file>              apt-probeset-genotype calls output (can be gzip compressed)
        --confidences <file>        apt-probeset-genotype confidences output (can be gzip compressed)
        --summary <file>            apt-probeset-genotype summary output (can be gzip compressed)
        --snp <file>                apt-probeset-genotype SNP posteriors output (can be gzip compressed)
        --chps <dir|file>           input CHP files rather than tab delimited files
        --cel <file>                input CEL files rather CHP files
        --adjust-clusters           adjust cluster centers in (Contrast, Size) space (requires --snp)
        --no-version                do not append version and command line to the header
    -o, --output <file>             write output to a file [standard output]
    -O, --output-type u|b|v|z[0-9]  u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level [v]
        --threads <int>             number of extra output compression threads [0]
    -x, --extra <file>              write CHP metadata to a file (requires CHP files)
    -v, --verbose                   print verbose information
    -W, --write-index[=FMT]         Automatically index the output files [off]

Manifest options:
        --fasta-flank               output flank sequence in FASTA format (requires --csv)
    -s, --sam-flank <file>          input flank sequence alignment in SAM/BAM format (requires --csv)

Examples:
    bcftools +affy2vcf \
        --csv GenomeWideSNP_6.na35.annot.csv \
        --fasta-ref human_g1k_v37.fasta \
        --chps cc-chp/ \
        --snp AxiomGT1.snp-posteriors.txt \
        --output AxiomGT1.vcf \
        --extra report.tsv
    bcftools +affy2vcf \
        --csv GenomeWideSNP_6.na35.annot.csv \
        --fasta-ref human_g1k_v37.fasta \
        --calls AxiomGT1.calls.txt \
        --confidences AxiomGT1.confidences.txt \
        --summary AxiomGT1.summary.txt \
        --snp AxiomGT1.snp-posteriors.txt \
        --output AxiomGT1.vcf

Examples of manifest file options:
    bcftools +affy2vcf -c GenomeWideSNP_6.na35.annot.csv --fasta-flank -o  GenomeWideSNP_6.fasta
    bwa mem -M GCA_000001405.15_GRCh38_no_alt_analysis_set.fna GenomeWideSNP_6.fasta -o GenomeWideSNP_6.sam
    bcftools +affy2vcf -c GenomeWideSNP_6.na35.annot.csv -s GenomeWideSNP_6.sam -o GenomeWideSNP_6.na35.annot.GRCh38.csv

Installation

Install basic tools (Debian/Ubuntu specific if you have admin privileges)

sudo apt install wget unzip git g++ zlib1g-dev bwa samtools msitools cabextract mono-devel libgdiplus icu-devtools bcftools

Optionally, you can install these libraries to activate further HTSlib features

sudo apt install libbz2-dev libssl-dev liblzma-dev libgsl0-dev

Preparation steps

mkdir -p $HOME/bin $HOME/GRCh3{7,8} && cd /tmp

We recommend compiling the source code but, wherever this is not possible, Linux x86_64 pre-compiled binaries are available for download here. However, notice that you will require BCFtools version 1.20 or newer. You can also download a previous version of the plugin through bioconda
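Because the plugins require BCFtools version 1.20 or newer, it can help to check the installed version before going further. The following is a minimal sketch, assuming GNU sort with the -V option is available; the helper name version_ge is hypothetical and not part of BCFtools

```shell
# version_ge A B: succeed if version A is greater than or equal to version B.
# Hypothetical helper; relies on GNU sort -V for version ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example (requires bcftools on the PATH; the first line printed by
# `bcftools --version` has the form "bcftools 1.20"):
# installed="$(bcftools --version | head -n 1 | cut -d ' ' -f 2)"
# version_ge "$installed" 1.20 || echo "BCFtools $installed is too old" >&2
```

The comparison is done on version order rather than string order, so 1.9 is correctly treated as older than 1.20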

Download latest version of HTSlib and BCFtools (if not downloaded already)

wget https://github.com/samtools/bcftools/releases/download/1.20/bcftools-1.20.tar.bz2
tar xjvf bcftools-1.20.tar.bz2

Download and compile plugins code (make sure you are using gcc version 5 or newer)

cd bcftools-1.20/
/bin/rm -f plugins/{idat2gtc.c,gtc2vcf.{c,h},affy2vcf.c}
wget -P plugins https://raw.githubusercontent.com/freeseek/gtc2vcf/master/{idat2gtc.c,gtc2vcf.{c,h},affy2vcf.c}
make
/bin/cp bcftools plugins/{idat2gtc,gtc2vcf,affy2vcf}.so $HOME/bin/

Make sure the directory with the plugins is available to BCFtools

export PATH="$HOME/bin:$PATH"
export BCFTOOLS_PLUGINS="$HOME/bin"
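To confirm that the three plugin files copied above are actually present in the plugin directory, a small check like the following can be used. This is only a sketch; the function name missing_plugins is hypothetical

```shell
# missing_plugins DIR: print the name of each expected plugin file not found in DIR.
# Hypothetical helper; the three names are the plugins copied to $HOME/bin above.
missing_plugins() {
  for plugin in idat2gtc gtc2vcf affy2vcf; do
    [ -f "$1/$plugin.so" ] || echo "$plugin.so"
  done
}

# Example:
# missing_plugins "$BCFTOOLS_PLUGINS"
```

An empty output means all three files were found. Running bcftools plugin -l afterwards shows which plugins BCFtools itself can load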

Install the GRCh37 human genome reference

wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | \
  gzip -d > $HOME/GRCh37/human_g1k_v37.fasta
samtools faidx $HOME/GRCh37/human_g1k_v37.fasta
bwa index $HOME/GRCh37/human_g1k_v37.fasta

Install the GRCh38 human genome reference (following the suggestion from Heng Li)

wget -O- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | \
  gzip -d > $HOME/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
samtools faidx $HOME/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
bwa index $HOME/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

Affymetrix provides the Analysis Power Tools (APT) for free, which allow calling genotypes from raw intensity data using an algorithm derived from BRLMM-P

mkdir -p $HOME/bin && cd /tmp
wget https://downloads.thermofisher.com/APT/APT_2.11.8/apt_2.11.8_linux_64_x86_binaries.zip
unzip -ojd $HOME/bin apt_2.11.8_linux_64_x86_binaries.zip apt_2.11.8_linux_64_x86_binaries/bin/apt-probeset-genotype
chmod a+x $HOME/bin/apt-probeset-genotype

Identifying chip type for IDAT and CEL files

To convert a pair of green and red IDAT files with raw Illumina intensities into a GTC file with genotype calls, you need to provide both a BPM manifest file with the location of the probes and an EGT cluster file with the expected intensities of each genotype cluster. It is important to provide the correct BPM and EGT files; otherwise the calling will fail, possibly generating a GTC file with meaningless calls. Unfortunately, newer IDAT files do not contain information about which BPM manifest file to use. The gtc2vcf bcftools plugin can be used to guess which files to use

path_to_idat_folder="..."
bcftools +gtc2vcf \
  -i -g $path_to_idat_folder

This will generate a spreadsheet table with information about each IDAT file including a guess for what manifest and cluster files you should use. If a guess is not provided, contact the author for troubleshooting

Similarly, you can use the affy2vcf bcftools plugin to extract chip type information from CEL files

path_to_cel_folder="..."
bcftools +affy2vcf \
  --cel --chps $path_to_cel_folder

Convert Illumina IDAT files to GTC files

The idat2gtc bcftools plugin can be used to convert Illumina IDAT files to GTC files

bpm_manifest_file="..."
egt_cluster_file="..."
path_to_idat_folder="..."
path_to_gtc_folder="..."
bcftools +idat2gtc \
  --bpm $bpm_manifest_file \
  --egt $egt_cluster_file \
  --idats $path_to_idat_folder \
  --output $path_to_gtc_folder

The output is equivalent to the output of the Illumina GenCall algorithm while being significantly faster

If you do not have the manifest and cluster files for the Illumina IDAT files you are trying to convert, make sure to check the links in Illumina.md

If you run the command with the option --autocall-date "", the output should be deterministic. Using the --preset option, you can generate output equivalent to what you obtain with any of the following:

Illumina AutoConvert
Illumina AutoConvert 2.0
Illumina Array Analysis Platform Genotyping Command Line Interface
Illumina Microarray Analytics Array Analysis Command Line Interface

If you similarly patch those tools to make them generate deterministic output, you should be able to verify that you get the same md5sum
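One way to carry out the md5sum comparison mentioned above is sketched here. It assumes the two sets of GTC files sit in separate folders with matching file names and that the coreutils md5sum command is available; the function name compare_gtc_md5 is hypothetical

```shell
# compare_gtc_md5 DIR_A DIR_B: for each *.gtc file in DIR_A, report whether the
# file with the same name in DIR_B has an identical MD5 checksum.
# Hypothetical helper, not part of gtc2vcf.
compare_gtc_md5() {
  for gtc_a in "$1"/*.gtc; do
    [ -e "$gtc_a" ] || continue          # no GTC files in DIR_A
    name="$(basename "$gtc_a")"
    gtc_b="$2/$name"
    if [ ! -f "$gtc_b" ]; then
      echo "$name MISSING"
    elif [ "$(md5sum < "$gtc_a")" = "$(md5sum < "$gtc_b")" ]; then
      echo "$name SAME"
    else
      echo "$name DIFFERENT"
    fi
  done
}

# Example:
# compare_gtc_md5 gtc_from_idat2gtc gtc_from_iaap_cli
```

Any line not ending in SAME points to a file worth inspecting, for example because a timestamp was still written or a different preset was used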

Convert Illumina GTC files to VCF

Specifications for Illumina BPM, EGT, and GTC files were obtained through Illumina's BeadArrayFiles library and GTCtoVCF script. Specifications for IDAT files were obtained through Henrik Bengtsson's illuminaio package

bpm_manifest_file="..."
csv_manifest_file="..."
egt_cluster_file="..."
path_to_gtc_folder="..."
ref="$HOME/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna" # or ref="$HOME/GRCh37/human_g1k_v37.fasta"
out_prefix="..."
bcftools +gtc2vcf \
  --no-version -Ou \
  --bpm $bpm_manifest_file \
  --csv $csv_manifest_file \
  --egt $egt_cluster_file \
  --gtcs $path_to_gtc_folder \
  --fasta-ref $ref \
  --extra $out_prefix.tsv | \
  bcftools sort -Ou -T ./bcftools. | \
  bcftools norm --no-version -o $out_prefix.bcf -Ob -c x -f $ref --write-index

Heavy random access to the reference will be needed, so it is important that enough extra memory be available for the operating system to cache the reference, or else the task can run excruciatingly slowly. Notice that the gtc2vcf bcftools plugin will drop unlocalized variants. The final VCF might contain duplicates. If this is an issue, bcftools norm -d exact can be used to remove such variants

At least one of the BPM or the CSV manifest files has to be provided. Normalized intensities cannot be computed without the BPM manifest file. Indel alleles cannot be inferred and will be skipped without the CSV manifest file. Information about genotype cluster centers will be included in the VCF if the EGT cluster file is provided

You can use gtc2vcf to convert one GTC file at a time, but we strongly advise converting multiple files at once, as single-sample VCF files will consume a lot of storage space. If you convert hundreds of GTC files at once, you can use the --adjust-clusters option, which will recenter the genotype clusters rather than using those provided in the EGT cluster file and will compute less noisy LRR values. If you use the --adjust-clusters option and you are using the output for calling mosaic chromosomal alterations, then it is safe to turn the median BAF/LRR adjustments off during that step (i.e. use --adjust-BAF-LRR -1)
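The rules about which of the BPM, CSV, and EGT files are needed can be summarized as a small preflight check run before the conversion. This is a sketch only; the function name describe_inputs and the wording of its messages are hypothetical and are not produced by gtc2vcf itself

```shell
# describe_inputs BPM CSV EGT: print what the conversion will be unable to do
# given which files are provided (pass an empty string for a missing file).
# Hypothetical helper; it only restates the rules described in the text.
describe_inputs() {
  bpm="$1"; csv="$2"; egt="$3"
  if [ -z "$bpm" ] && [ -z "$csv" ]; then
    echo "error: provide at least one of the BPM or CSV manifest files"
    return 1
  fi
  [ -n "$bpm" ] || echo "note: no BPM manifest, normalized intensities cannot be computed"
  [ -n "$csv" ] || echo "note: no CSV manifest, indels will be skipped"
  [ -n "$egt" ] || echo "note: no EGT cluster file, cluster centers will not be included"
  return 0
}

# Example:
# describe_inputs "$bpm_manifest_file" "$csv_manifest_file" "$egt_cluster_file"
```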

Optionally, between the conversion and the sorting step you can include a bcftools reheader --samples <file> command to assign new names to the samples where <file> contains old_name new_name\n pairs separated by whitespaces, each on a separate line, with old_name being the GTC file name without the .gtc extension in this case
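The file passed to bcftools reheader --samples can be generated from the GTC file names. The sketch below assumes that the new names are simply the old names with a prefix added, which is purely illustrative; the function name make_sample_map is hypothetical

```shell
# make_sample_map GTC_DIR PREFIX: print one "old_name new_name" pair per GTC file,
# where old_name is the GTC file name without the .gtc extension.
# Hypothetical helper; the PREFIX-based naming scheme is only an illustration.
make_sample_map() {
  for gtc in "$1"/*.gtc; do
    [ -e "$gtc" ] || continue            # no GTC files in GTC_DIR
    old="$(basename "$gtc" .gtc)"
    echo "$old $2$old"
  done
}

# Example:
# make_sample_map "$path_to_gtc_folder" study1_ > samples_map.txt
```

In practice the new names would more likely come from a sample sheet, in which case the same two-column layout applies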

When running the conversion, the gtc2vcf plugin will double check that the SNP manifest metadata information in the GTC file matches the descriptor file name in the BPM file to make sure you are using the correct manifest file. Sometimes, due to discrepancies between the BPM file name provided by Illumina and the internal descriptor file name, this safety check fails. To turn off this feature in these cases, you can use option --do-not-check-bpm

Convert Affymetrix CEL files to CHP files

Affymetrix provides a best practice workflow for genotyping data generated using SNP6 and Axiom arrays. As an example, the following command will run the genotyping for the Affymetrix SNP6 array:

path_to_output_folder="..."
cel_list_file="..."
apt-probeset-genotype \
  --analysis-files-path . \
  --xml-file GenomeWideSNP_6.apt-probeset-genotype.AxiomGT1.xml \
  --out-dir $path_to_output_folder \
  --cel-files $cel_list_file \
  --special-snps GenomeWideSNP_6.specialSNPs \
  --chip-type GenomeWideEx_6 \
  --chip-type GenomeWideSNP_6 \
  --table-output false \
  --cc-chp-output \
  --write-models \
  --read-models-brlmmp GenomeWideSNP_6.generic_prior.txt
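The file passed with --cel-files can be generated from a folder of CEL files. The sketch below assumes the layout described in the APT documentation, a text file whose first line is cel_files followed by one CEL file path per line, and assumes the files use the .CEL extension; the function name make_cel_list is hypothetical

```shell
# make_cel_list CEL_DIR: print a "cel_files" header line followed by the path
# of each *.CEL file in CEL_DIR, one per line.
# Hypothetical helper; the header line follows the assumed APT list format.
make_cel_list() {
  echo "cel_files"
  for cel in "$1"/*.CEL; do
    [ -e "$cel" ] || continue            # no CEL files in CEL_DIR
    echo "$cel"
  done
}

# Example:
# make_cel_list "$path_to_cel_folder" > "$cel_list_file"
```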

Affymetrix provides Library and NetAffx Annotation files for their arrays (here, here, and here)

As an example, the following commands will obtain the files necessary to run the genotyping for the Affymetrix SNP6 array:

wget http://www.affymetrix.com/Auth/support/downloads/library_files/genomewidesnp6_libraryfile.zip
wget http://www.affymetrix.com/Auth/analysis/downloads/lf/genotyping/GenomeWideSNP_6/SNP6_supplemental_axiom_analysis_files.zip
wget http://www.affymetrix.com/Auth/analysis/downloads/na35/genotyping/GenomeWideSNP_6.na35.annot.csv.zip
unzip -oj genomewidesnp6_libraryfile.zip CD_GenomeWideSNP_6_rev3/Full/GenomeWideSNP_6/LibFiles/GenomeWideSNP_6.{cdf,chrXprobes,chrYprobes,specialSNPs}
unzip -o SNP6_supplemental_axiom_analysis_files.zip GenomeWideSNP_6.{generic_prior.txt,apt-probeset-genotype.AxiomGT1.xml,AxiomGT1.sketch}
unzip -o GenomeWideSNP_6.na35.annot.csv.zip GenomeWideSNP_6.na35.annot.csv

Note: If the program exits due to different chip types or probe counts with error message such as Wrong CEL ChipType: expecting: 'GenomeWideSNP_6' and #######.CEL is: 'GenomeWideEx_6' then make sure you included the option --chip-type GenomeWideEx_6 --chip-type GenomeWideSNP_6 or --force to the command line to solve the problem

Convert Affymetrix CHP files to VCF

The affy2vcf bcftools plugin can be used to convert Affymetrix CHP files to VCF

csv_manifest_file="..." # for example csv_manifest_file="GenomeWideSNP_6.na35.annot.csv"
ref="$HOME/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna" # or ref="$HOME/GRCh37/human_g1k_v37.fasta"
path_to_chp_folder="cc-chp"
path_to_txt_folder="..."
out_prefix="..."
bcftools +affy2vcf \
  --no-version -Ou \
  --csv $csv_manifest_file \
  --fasta-ref $ref \
  --chps $path_to_chp_folder \
  --snp $path_to_txt_folder/AxiomGT1.snp-posteriors.txt \
  --extra $out_prefix.tsv | \
  bcftools sort -Ou -T ./bcftools. | \
  bcftools norm --no-version -o $out_prefix.bcf -Ob -c x -f $ref --write-index

Heavy random access to the reference will be needed, so it is important that enough extra memory be available for the operating system to cache the reference or else the task can run excruciatingly slowly. The final VCF might contain duplicates. If this is an issue bcftools norm -d exact can be used to remove such variants. There is often no need to use the --adjust-clusters option for Affymetrix data as the cluster posteriors are already adjusted using the data processed by the genotype caller

Using an alternative genome reference

Illumina provides GRCh38/hg38 manifests for many of its genotyping arrays. However, if your genotyping array is not supported for the newer reference by Illumina, you can use the --fasta-flank and --sam-flank options to realign the flank sequences from the manifest files you have and recompute the marker positions. This approach uses flank sequence and strand information to identify the marker coordinates. It requires a sequence aligner such as bwa to realign the sequences, and it seems to reproduce the coordinates provided by Illumina more than 99.9% of the time. Mapping information will follow the implicit dbSNP standard. Occasionally the flank sequence provided by Illumina is incorrect and it is impossible to recover the correct marker coordinate from the flank sequence alone

You first have to generate an alignment file for the flank sequences from a CSV manifest file

csv_manifest_file="..."
ref="$HOME/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna" # or ref="$HOME/GRCh37/human_g1k_v37.fasta"
bam_alignment_file="..."
bcftools +gtc2vcf \
  -c $csv_manifest_file \
  --fasta-flank | \
  bwa mem -M $ref - | \
  samtools view -bS \
  -o $bam_alignment_file

Notice that you need to use the -M option to mark shorter split hits as secondary, and you should not sort the output BAM file, as gtc2vcf expects it to have the sequences in the same order as in the CSV file. Then you load the alignment file while converting your GTC files to VCF by including the -s $bam_alignment_file option

Some older manifest files from Illumina have thousands of markers with incorrect RefStrand annotations that will lead to incorrect genotypes. While Illumina has not explained why this is the case, it still distributes incorrect manifests. If you are using one of the following manifests

Human1M-Duov3_H
Human610-Quadv1_H
Human660W-Quad_v1_H
HumanCytoSNP-12v2-1_Anova
HumanOmni1-Quad_v1-0-Multi_H
HumanOmni1-Quad_v1-0_H

we advise either contacting Illumina to demand a fixed version or using gtc2vcf to realign the flank sequences

Also, Illumina assigns chromosomal positions to indels by first left aligning the flank sequences in an incoherent way (see here). Apparently this is incoherent enough that Illumina also cannot get the coordinates of homopolymer indels right. For example, chromosome 13 ClinVar indel rs80359507 is assigned to position 32913838 in the manifest file for the GSA-24v2-0 array, but it is assigned to position 32913837 in the manifest file for the GSA-24v3-0 array (GRCh37 coordinates). If you want to trust genotypes at homopolymer indels, we advise using gtc2vcf to realign the flank sequences

The same functionality exists for the affy2vcf tool to convert Affymetrix data

Plot variants

Install basic tools (Debian/Ubuntu specific if you have admin privileges):

sudo apt install r-cran-optparse r-cran-ggplot2 r-cran-data.table r-cran-gridextra

Download R scripts

/bin/rm -f $HOME/bin/gtc2vcf_plot.R
wget -P $HOME/bin https://raw.githubusercontent.com/freeseek/gtc2vcf/master/gtc2vcf_plot.R
chmod a+x $HOME/bin/gtc2vcf_plot.R

Plot variant (for Illumina data)

gtc2vcf_plot.R \
  --illumina \
  --vcf input.vcf \
  --chrom 11 \
  --pos 66328095 \
  --png rs1815739.png

Plot variant (for Affymetrix data)

gtc2vcf_plot.R \
  --affymetrix \
  --vcf input.vcf \
  --chrom 1 \
  --pos 196642233 \
  --png rs800292.png

Illumina GenCall

To genotype raw Illumina IDAT intensity files using Illumina GenCall algorithms, Illumina has provided over the course of the years several command line interfaces written in the .NET language:

AutoConvert (2011)
AutoConvert 2.0 (2017)
IAAP CLI (2019)
Array Analysis CLI (2023)

We provide instructions to install and run these interfaces. The sed -i -e ':a' -e 'N' -e '$!ba' installation commands are used to prevent the interfaces from timestamping the output GTC files by removing the System.DateTime calls and accesses to the CreationTime property from the binaries, with the goal of making each execution completely reproducible. The AutoConvert 2.0, IAAP CLI, and Array Analysis CLI binaries all perform version 1.2.0 of the normalization step and seem to produce the exact same results, while AutoConvert only performs version 1.1.2 of the normalization step, yielding somewhat different results. If you want to run these binaries but fail to download them, contact the author for troubleshooting

Illumina also provides the Beeline software for free, and this includes the AutoConvert.exe command line executable, which allows calling genotypes from raw intensity data using Illumina's proprietary GenCall algorithm. AutoConvert is almost entirely written in Mono/.Net language, except for one small mathematical function (findClosestSitesToPointsAlongAxis) which is included within a Windows PE32+ library (MathRoutines.dll). As this is unmanaged code, to be run on Linux with Mono it needs to be embedded in an equivalent Linux ELF64 library (libMathRoutines.dll.so) as shown below. This function is run as part of the normalization of the raw intensities when sampling 400 candidate homozygotes before calling genotypes.

Illumina AutoConvert

To run Illumina AutoConvert (version 1.6.3.1) you will need to change the hardcoded Windows backslashes into UNIX slashes (https://en.wikipedia.org/wiki/Slash_(punctuation)), as shown below

mkdir -p $HOME/bin && cd /tmp
wget http://support.illumina.com/content/dam/illumina-support/documents/downloads/software/beeline/autoconvert-software-v1-6-3-installer.zip
wget http://raw.githubusercontent.com/freeseek/gtc2vcf/master/nearest_neighbor.c
unzip -o autoconvert-software-v1-6-3-installer.zip 
msiextract -C Illumina/AutoConvert SetupAutoConvert64_1.6.3.1.msi
msiextract -l SetupAutoConvert64_1.6.3.1.msi | grep DLL$ | while read dll; do mv Illumina/AutoConvert/$dll Illumina/AutoConvert/${dll%DLL}dll; done
gcc -fPIC -shared -O2 -o Illumina/AutoConvert/libMathRoutines.dll.so nearest_neighbor.c
sed -i 's/\x00\x03\\\x00/\x00\x03\/\x00/' Illumina/AutoConvert/AutoCallLib.dll
sed -i 's/G\x00R\x00N\x00.\x00i\x00d\x00a\x00t\x00/G\x00r\x00n\x00.\x00i\x00d\x00a\x00t\x00/' Illumina/AutoConvert/AutoCallLib.dll
sed -i 's/R\x00E\x00D\x00.\x00i\x00d\x00a\x00t\x00/R\x00e\x00d\x00.\x00i\x00d\x00a\x00t\x00/' Illumina/AutoConvert/AutoCallLib.dll
sed -i 's/\\\x00M\x00o\x00d\x00u\x00l\x00e\x00s\x00\\\x00B\x00S\x00G\x00T\x00\\\x00C\x00l\x00u\x00s\x00t\x00e\x00r\x00A\x00l\x00g\x00o\x00r\x00i\x00t\x00h\x00m\x00s\x00\\\x00/\/\x00M\x00o\x00d\x00u\x00l\x00e\x00s\x00\/\x00B\x00S\x00G\x00T\x00\/\x00C\x00l\x00u\x00s\x00t\x00e\x00r\x00A\x00l\x00g\x00o\x00r\x00i\x00t\x00h\x00m\x00s\x00\/\x00/' Illumina/AutoConvert/AutoCallLib.dll
sed -i 's/\\\x00M\x00o\x00d\x00u\x00l\x00e\x00s\x00\\\x00B\x00S\x00G\x00T\x00/\/\x00M\x00o\x00d\x00u\x00l\x00e\x00s\x00\/\x00B\x00S\x00G\x00T\x00/' Illumina/AutoConvert/Modules/BSGT/ClusterAlgorithms/{GoldenGate/GGCA,InfiniumII/I2CA,GenTrain/ILCA}.dll
sed -i 's/\\\x00d\x00a\x00t\x00.\x00b\x00i\x00n\x00/\/\x00d\x00a\x00t\x00.\x00b\x00i\x00n\x00/' Illumina/AutoConvert/Modules/BSGT/ClusterAlgorithms/{GoldenGate/GGCA,InfiniumII/I2CA,GenTrain/ILCA}.dll
sed -i -e ':a' -e 'N' -e '$!ba' -e 's/\x28\xa6\x00\x00\x0a\x13\x40\x12\x40\x28\xa7\x00\x00\x0a\x72\xad\x12\x00\x70\x28\xa6\x00\x00\x0a\x13\x40\x12\x40\x28\xa8\x00\x00\x0a\x28\x23\x00\x00\x0a/\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x7e\x16\x00\x00\x0a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00/' Illumina/AutoConvert/AutoCallLib.dll
sed -i -e ':a' -e 'N' -e '$!ba' -e 's/\x11\x0e\x6f\xe5\x00\x00\x0a\x13\x11\x12\x11\x28\xe6\x00\x00\x0a\x72\xad\x12\x00\x70\x11\x0e\x6f\xe5\x00\x00\x0a\x13\x12\x12\x12\x28\xe7\x00\x00\x0a\x28\x23\x00\x00\x0a/\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x7e\x16\x00\x00\x0a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00/' Illumina/AutoConvert/AutoCallLib.dll
rm autoconvert-software-v1-6-3-installer.zip SetupAutoConvert64_1.6.3.1.msi nearest_neighbor.c
mv Illumina/AutoConvert $HOME/bin/
rmdir Illumina

You can run Illumina's proprietary GenCall algorithm on a single IDAT file pair

mono $HOME/bin/AutoConvert/AutoConvert.exe \
  $idat_green_file \
  $path_to_output_folder \
  $bpm_manifest_file \
  $egt_cluster_file

Make sure that the red IDAT file is in the same folder as the green IDAT file. Alternatively you can run on multiple IDAT file pairs

mono $HOME/bin/AutoConvert/AutoConvert.exe \
  $path_to_idat_folder \
  $path_to_output_folder \
  $bpm_manifest_file \
  $egt_cluster_file
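Since the red IDAT file must sit in the same folder as the green IDAT file, it can be useful to look for unpaired files before starting a run. This is a sketch that assumes the files follow the _Grn.idat and _Red.idat naming used in the examples of this document; the function name unpaired_idats is hypothetical

```shell
# unpaired_idats IDAT_DIR: print each *_Grn.idat file in IDAT_DIR that has no
# matching *_Red.idat file in the same folder.
# Hypothetical helper, not part of AutoConvert or gtc2vcf.
unpaired_idats() {
  for grn in "$1"/*_Grn.idat; do
    [ -e "$grn" ] || continue            # no green IDAT files in IDAT_DIR
    red="${grn%_Grn.idat}_Red.idat"
    [ -f "$red" ] || echo "$grn"
  done
}

# Example:
# unpaired_idats "$path_to_idat_folder"
```

An empty output means every green IDAT file found has its red counterpart next to it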

Illumina AutoConvert 2.0

To run Illumina AutoConvert 2.0 (version 2.0.1.179) you will need to separately download an additional Mono/.Net library (Heatmap.dll) from GenomeStudio or the polyploid clustering module and include it in your binary directory, most likely due to differences in how Mono and .Net resolve library dependencies, as shown below

mkdir -p $HOME/bin && cd /tmp
wget http://support.illumina.com/content/dam/illumina-support/documents/downloads/software/beeline/autoconvert-software-v2-0-1-installer.zip
wget http://support.illumina.com/content/dam/illumina-support/documents/downloads/software/genomestudio/genomestudiopolyploidclusteringv1-0.msi
wget http://raw.githubusercontent.com/freeseek/gtc2vcf/master/nearest_neighbor.c
unzip -o autoconvert-software-v2-0-1-installer.zip
msiextract AutoConvertInstaller.msi
msiextract genomestudiopolyploidclusteringv1-0.msi
mv Heatmap.DLL Illumina/AutoConvert\ 2.0/
gcc -fPIC -shared -O2 -o Illumina/AutoConvert\ 2.0/libMathRoutines.dll.so nearest_neighbor.c
sed -i 's/^\(     <AutosomalCallRateThreshold>\)0.97\(<\/AutosomalCallRateThreshold>\r\)$/\10.0\2/' Illumina/AutoConvert\ 2.0/AutoCallConfig.xml
sed -i 's/\\\x00d\x00a\x00t\x00.\x00b\x00i\x00n\x00/\/\x00d\x00a\x00t\x00.\x00b\x00i\x00n\x00/' Illumina/AutoConvert\ 2.0/{GGCA,I2CA,HDCA,ILCA,ILCA3}.dll
sed -i -e ':a' -e 'N' -e '$!ba' -e 's/\x28\xc7\x00\x00\x0a\x13\x3f\x12\x3f\x28\xc8\x00\x00\x0a\x72\xa8\x15\x00\x70\x28\xc7\x00\x00\x0a\x13\x3f\x12\x3f\x28\xc9\x00\x00\x0a\x28\x1f\x00\x00\x0a/\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x7e\x12\x00\x00\x0a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00/' Illumina/AutoConvert\ 2.0/AutoCallLib.dll
msiextract -l genomestudiopolyploidclusteringv1-0.msi | grep -v Heatmap.DLL | xargs rm
rmdir Modules/BSPC/clusteralgorithms/*
rmdir -p Modules/BSPC/clusteralgorithms
rm autoconvert-software-v2-0-1-installer.zip AutoConvertInstaller.msi genomestudiopolyploidclusteringv1-0.msi nearest_neighbor.c
mv Illumina/AutoConvert\ 2.0 $HOME/bin/
rmdir Illumina

We change the autosomal call rate threshold to 0.0 to more aggressively call gender in lower quality samples

If you need to get the Heatmap.dll library from GenomeStudio instead, you can use the following code

wget ftp://webdata2:[email protected]/downloads/software/genomestudio/genomestudio-software-v2-0-4-5-installer.zip
unzip -oj genomestudio-software-v2-0-4-5-installer.zip
cabextract GenomeStudioInstaller.exe
msiextract a0
mv Illumina/GenomeStudio\ 2.0/Heatmap.dll Illumina/AutoConvert\ 2.0/
rm genomestudio-software-v2-0-4-5-installer.zip GenomeStudioInstaller.exe {,a}0 u{0..5} Illumina/GenomeStudio\ 2.0 -r

You can run Illumina's proprietary GenCall algorithm on a single IDAT file pair

mono $HOME/bin/AutoConvert\ 2.0/AutoConvert.exe \
  $idat_green_file \
  $path_to_output_folder \
  $bpm_manifest_file \
  $egt_cluster_file

Make sure that the red IDAT file is in the same folder as the green IDAT file. Alternatively you can run on multiple IDAT file pairs

mono $HOME/bin/AutoConvert\ 2.0/AutoConvert.exe \
  $path_to_idat_folder \
  $path_to_output_folder \
  $bpm_manifest_file \
  $egt_cluster_file

Make sure that the IDAT files have the same name prefix as the IDAT folder name. The software might require up to 8GB of RAM to run. Illumina provides manifest (BPM) and cluster (EGT) files for their arrays here. Notice that if you provide the wrong BPM file, you will get an error such as: Normalization failed! Unable to normalize! and if you provide the wrong EGT file, you will get an error such as System.Exception: Unrecoverable Error...Exiting! Unable to find manifest entry ######## in the cluster file!

Illumina Array Analysis Platform Genotyping Command Line Interface

Illumina provides the Illumina Array Analysis Platform Genotyping Command Line Interface software for free for research use and this includes the iaap-cli 1.1.0 which runs natively on Linux

mkdir -p $HOME/bin && cd /tmp
wget ftp://webdata2:[email protected]/downloads/software/iaap/iaap-cli-linux-x64-1.1.0.tar.gz
tar xzvf iaap-cli-linux-x64-1.1.0.tar.gz -C $HOME/bin/ iaap-cli-linux-x64-1.1.0/iaap-cli --strip-components=1
sed -i -e ':a' -e 'N' -e '$!ba' -e 's/\x28\x17\x01\x00\x0a\x13\x07\x12\x07\x72\xdd\x23\x00\x70\x28\x18\x01\x00\x0a/\x00\x00\x00\x00\x00\x00\x00\x00\x00\x7e\x92\x00\x00\x0a\x00\x00\x00\x00\x00/' $HOME/bin/iaap-cli/ArrayAnalysis.NormToGenCall.Services.dll
rm iaap-cli-linux-x64-1.1.0.tar.gz

Once iaap-cli is properly installed in your system, run Illumina's proprietary GenCall algorithm on multiple IDAT file pairs

CLR_ICU_VERSION_OVERRIDE="$(uconv -V | sed 's/.* //g')" LANG="en_US.UTF-8" $HOME/bin/iaap-cli/iaap-cli \
  gencall \
  $bpm_manifest_file \
  $egt_cluster_file \
  $path_to_output_folder \
  --idat-folder $path_to_idat_folder \
  --output-gtc \
  --gender-estimate-call-rate-threshold 0.0

It is important to set the LANG environment variable to en_US.UTF-8: if it is set to other values, a bug in iaap-cli causes malformed GTC files to be generated. Due to another bug in iaap-cli, IDAT filenames cannot include more than two _ characters and should be formatted as BARCODE_POSITION_(Red|Grn).idat. When using iaap-cli you cannot process old array manifest files with loci data encoded as version 5 or older, such as HumanHap650Yv3_A.bpm, as the corresponding code was not carried over, and you will get the error Error in reading file. Unknown Manifest version. The AutoConvert command line tool can read older manifest files. We change the autosomal call rate threshold to 0.0 both to more aggressively call gender in lower quality samples and to deal with an implementation issue that causes loci with null cluster scores to be included in the determination of the autosomal call rate threshold
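The file name restriction can be checked before running iaap-cli. The following sketch tests whether a name has the form BARCODE_POSITION_(Red|Grn).idat with exactly two _ characters; the function name valid_iaap_name is hypothetical

```shell
# valid_iaap_name FILE: succeed if the file name has the form
# BARCODE_POSITION_(Red|Grn).idat with exactly two underscores.
# Hypothetical helper, not part of iaap-cli.
valid_iaap_name() {
  case "$(basename "$1")" in
    *_*_*_*) return 1 ;;                             # three or more underscores
    ?*_?*_Grn.idat | ?*_?*_Red.idat) return 0 ;;     # BARCODE_POSITION_(Grn|Red).idat
    *) return 1 ;;                                   # anything else
  esac
}

# Example:
# for f in "$path_to_idat_folder"/*.idat; do
#   valid_iaap_name "$f" || echo "rename before running iaap-cli: $f"
# done
```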

Illumina Microarray Analytics Array Analysis Command Line Interface

Illumina provides the Illumina Microarray Analytics Array Analysis Command Line Interface software free of charge for research use. It includes array-analysis-cli 2.1.0, which runs natively on Linux

mkdir -p $HOME/bin && cd /tmp
wget -O array-analysis-cli-linux-x64-v2.1.0.tar.gz "http://support.illumina.com/softwaredownload.html?assetId=72f8a34f-0933-4256-bad6-73d830436c74&assetDetails=IlluminaMicroarrayAnalyticsArrayAnalysisCLIv2.1LinuxInstaller-2.1-array-analysis-cli-linux-x64-v2.1.0.tar.gz"
tar xzvf array-analysis-cli-linux-x64-v2.1.0.tar.gz -C $HOME/bin/ --strip-components=1
sed -i -e ':a' -e 'N' -e '$!ba' -e 's/\x28\x89\x00\x00\x0a\x0A\x12\x00\x72\xa3\x15\x00\x70\x28\x8a\x00\x00\x0a/\x00\x00\x00\x00\x00\x00\x00\x00\x72\xfc\x0d\x00\x70\x00\x00\x00\x00\x00/' $HOME/bin/array-analysis-cli//ArrayAnalysis.Core.dll
rm array-analysis-cli-linux-x64-v2.1.0.tar.gz

Once array-analysis-cli is properly installed in your system, run Illumina's proprietary GenCall algorithm on multiple IDAT file pairs

$HOME/bin/array-analysis-cli/array-analysis-cli \
  genotype call \
  --bpm-manifest $bpm_manifest_file \
  --cluster-file $egt_cluster_file \
  --idat-folder .

With array-analysis-cli we cannot lower the autosomal call rate threshold to 0.0 to call gender more aggressively in lower quality samples, as the default value of 0.97 is hardcoded
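
To make the effect of such a threshold concrete, here is a small Python sketch of a call-rate gate. It is an illustration only and not Illumina's implementation: the function names, the use of None for a no-call, the comparison with >=, and the idea of passing an explicit set of loci without cluster data are all assumptions.

```python
def autosomal_call_rate(calls, excluded=frozenset()):
    """Fraction of autosomal loci that received a genotype call.

    calls: dict mapping locus name -> genotype string, or None for a no-call.
    excluded: loci left out of the denominator (for example loci with null
    cluster scores, which can never receive a call).
    """
    kept = [g for locus, g in calls.items() if locus not in excluded]
    if not kept:
        return 0.0
    return sum(g is not None for g in kept) / len(kept)


def gender_is_estimated(call_rate, threshold=0.97):
    """Assumed gate: a sample is only sexed when its call rate reaches the threshold."""
    return call_rate >= threshold


if __name__ == "__main__":
    # 96 called loci plus 4 loci that have no cluster and so are never called.
    calls = {f"snp{i}": "AB" for i in range(96)}
    calls.update({f"null{i}": None for i in range(4)})
    nulls = {f"null{i}" for i in range(4)}
    print(autosomal_call_rate(calls))         # null-cluster loci counted
    print(autosomal_call_rate(calls, nulls))  # null-cluster loci excluded
```

In this toy case a sample with every callable locus called still falls below 0.97 when the null-cluster loci are counted, which is the kind of situation a threshold of 0.0 sidesteps.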

Acknowledgements

This work is supported by NIH grant R01 HG006855, NIH grant R01 MH104964, NIH grant R01 MH123451, US Department of Defense Breast Cancer Research Breakthrough Award W81XWH-16-1-0316 (project BC151244), and the Stanley Center for Psychiatric Research

gtc2vcf's Issues

Can't install wine32 on my machine

I have wine64 installed, but the commands I ran from the README also needed wine32. So I switched to a 32-bit VM, and then it needed wine64. I am pretty frustrated with installing this pipeline, as I would prefer using it to running Beeline 100 times to convert my IDAT files to GTC.

More GS report queries...

Hi,

I have attempted the thankless task of using a GenomeStudio .txt file; I don't have other options.

This is my genomestudio header:

Index   Name    Address Chr     Position        GenTrain Score  59_1.GType      59_1.Score      59_1.Theta      59_1.R  59_1.X Raw      59_1.Y Raw      59_1.X  59_1.Y  59_1.B Allele Freq      59_1.Log R Ratio
        59_1.Top Alleles        59_1.Import Calls       59_1.Concordance        59_1.Orig Call  59_1.CNV Value  59_1.CNV Confidence     59_1.Plus/Minus Alleles
1       rs1000000       95775890        12      126890980       0.7825049       AB      0.7878883       0.4333902       2.230212        14921   7256    1.232208        0.9980044       0.5075449       0.008405539     AG              -1                              AG
2       rs1000002       20798118        3       183635768       0.8463691       AB      0.879837        0.4056498       1.041987        7707    3384    0.5987776       0.4432094       0.4658202       -0.06460849     AG              -1                              TC

This is what I get after running with the --genome-studio option.
As you can see, the VCF almost exclusively has A/N as reference and G/N as alternative.
Counts REF: A=400K+, N=270K+, C=817. No G or T
Counts ALT: G=380K+, C=80K+, N=230K+, T=600. No A

I assume something went wrong there. If there is a fix, I would be grateful for advice.
Jakub

> ##contig=<ID=chrUn_GL000218v1,length=161147>
> ##contig=<ID=chrEBV,length=171823>
> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
> ##FORMAT=<ID=IGC,Number=1,Type=Float,Description="Illumina GenCall Confidence Score">
> ##FORMAT=<ID=BAF,Number=1,Type=Float,Description="B Allele Frequency">
> ##FORMAT=<ID=LRR,Number=1,Type=Float,Description="Log R Ratio">
> ##bcftools_+gtc2vcfVersion=1.9+htslib-1.9
> ##bcftools_+gtc2vcfCommand=gtc2vcf -f GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --genome-studio P150645.txt -o P150645.vcf; Date=Sun Apr 19 16:38:13 2020
> #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  59_1
> chr12   126890980       rs1000000       A       G       .       .       .       GT:IGC:BAF:LRR  0/1:0.787888:0.507545:0.00840554
> chr3    183635768       rs1000002       A       G       .       .       .       GT:IGC:BAF:LRR  0/1:0.879837:0.46582:-0.0646085
> chr4    95733906        rs10000023      A       N       .       .       .       GT:IGC:BAF:LRR  0/0:0.755057:0.0114496:0.0626198
> chr3    98342907        rs1000003       A       N       .       .       .       GT:IGC:BAF:LRR  0/0:0.790033:0.0316309:0.157232
> chr4    103374154       rs10000030      N       G       .       .       .       GT:IGC:BAF:LRR  1/1:0.7819:0.989484:0.0112141
> chr4    38924330        rs10000037      A       G       .       .       .       GT:IGC:BAF:LRR  0/1:0.899376:0.512505:-0.0105628
> chr4    165621955       rs10000041      A       N       .       .       .       GT:IGC:BAF:LRR  0/0:0.923617:0.0272044:0.23714
> chr4    5237152 rs10000042      N       G       .       .       .       GT:IGC:BAF:LRR  1/1:0.784419:1:0.0955165
> chr4    118948220       rs10000049      A       N       .       .       .       GT:IGC:BAF:LRR  0/0:0.432164:0.00305451:0.0203382
> chr2    237752054       rs1000007       A       N       .       .       .       GT:IGC:BAF:LRR  0/0:0.908865:0.0369862:-0.0206654
> chr4    43022222        rs10000073      A       N       .       .       .       GT:IGC:BAF:LRR  0/0:0.925892:0.0150725:0.0507056
> chr4    17348363        rs10000081      A       N       .       .       .       GT:IGC:BAF:LRR  0/0:0.905235:0:0.0410943
> chr4    21895517        rs10000092      A       G       .       .       .       GT:IGC:BAF:LRR  0/1:0.839045:0.558548:-0.333746
> chr4    53623677        rs10000105      N       G       .       .       .       GT:IGC:BAF:LRR  1/1:0.864878:0.981329:0.114554
> chr4    37796830        rs10000119      N       G       .       .       .       GT:IGC:BAF:LRR  1/1:0.907763:1:-0.069378
> chr4    109106451       rs10000124      N       C       .       .       .       GT:IGC:BAF:LRR  1/1:0.810363:0.995153:-0.108182
> chr4    80666077        rs10000154      A       N       .       .       .       GT:IGC:BAF:LRR  0/0:0.926977:0:-0.175949
> chr2    235690982       rs1000016       A       N       .       .       .       GT:IGC:BAF:LRR  0/0:0.870474:0:-0.068737
> chr4    69033099        rs10000160      N       G       .       .       .       GT:IGC:BAF:LRR  1/1:0.901467:1:0.174004
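
For anyone wanting to reproduce allele tallies like the ones quoted in this report, a minimal Python sketch follows. The function name tally_alleles is hypothetical, and splitting on any whitespace is an assumption that holds for the lines shown here (a strict parser would split on tabs only).

```python
from collections import Counter


def tally_alleles(vcf_lines):
    """Count REF and ALT values over the data lines of a VCF."""
    ref, alt = Counter(), Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split()
        ref[fields[3]] += 1  # column 4 is REF
        alt[fields[4]] += 1  # column 5 is ALT
    return ref, alt


if __name__ == "__main__":
    sample = [
        "#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  59_1",
        "chr12   126890980       rs1000000       A       G       .       .       .       GT:IGC:BAF:LRR  0/1:0.787888:0.507545:0.00840554",
        "chr4    95733906        rs10000023      A       N       .       .       .       GT:IGC:BAF:LRR  0/0:0.755057:0.0114496:0.0626198",
        "chr4    103374154       rs10000030      N       G       .       .       .       GT:IGC:BAF:LRR  1/1:0.7819:0.989484:0.0112141",
    ]
    print(tally_alleles(sample))
```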

Docker

Dear Giulio,

Thanks a lot for such a nice workflow for the conversion of GTC files to VCF.

Because of some limitations, I wasn't able to install everything. I tried to package the whole pipeline in a Docker container, and I failed there too.

Do you have any plans to make a Docker container that runs the whole process?

Really appreciate it.

Regards

Some package installed wrong in Centos

Dear freeseek,
I am having some trouble installing gtc2vcf. When I tried to build htslib, something went wrong.
$./configure
checking for gcc... /usr/local/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether /usr/local/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc accepts -g... yes
checking for /usr/local/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc option to accept ISO C89... none needed
checking for ranlib... /usr/local/anaconda3/bin/x86_64-conda_cos6-linux-gnu-ranlib
checking for grep that handles long lines and -e... /usr/bin/grep
checking for C compiler warning flags... -Wall
checking for pkg-config... /usr/bin/pkg-config
checking pkg-config is at least version 0.9.0... yes
checking for special C compiler options needed for large files... no
checking for _FILE_OFFSET_BITS value needed for large files... no
checking shared library type for unknown-Linux... plain .so
checking whether the compiler accepts -fvisibility=hidden... yes
checking how to run the C preprocessor... /usr/local/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cpp
checking for egrep... /usr/bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for stdlib.h... (cached) yes
checking for unistd.h... (cached) yes
checking for sys/param.h... yes
checking for getpagesize... yes
checking for working mmap... yes
checking for gmtime_r... yes
checking for fsync... yes
checking for drand48... yes
checking for srand48_deterministic... no
checking whether fdatasync is declared... yes
checking for fdatasync... yes
checking for library containing log... -lm
checking for zlib.h... no
checking for inflate in -lz... no
configure: error: zlib development files not found

HTSlib uses compression routines from the zlib library http://zlib.net.
Building HTSlib requires zlib development files to be installed on the build
machine; you may need to ensure a package such as zlib1g-dev (on Debian or
Ubuntu Linux) or zlib-devel (on RPM-based Linux distributions or Cygwin)
is installed.
FAILED. This error must be resolved in order to build HTSlib successfully.
But zlib-devel is already installed; the version is zlib-devel-1.2.7-18.el7.x86_64

I hope you can give me some help. Thank you.

Best wishes,
Crane

Problem converting Illumina Genome reports to vcf

Hi,
Thank you for the wonderful set of tools for converting Illumina reports to VCF files.

I am getting an error when using the matrix format Illumina reports.

Error is as follows:

./bcftools +gtc2vcf.so --no-version -o --genome-studio /Users/vikrants/Desktop/testvcf/ILHC24-12806_FinalReport.txt -f /Users/vikrants/res/hg38.fa

Reading GTC file /Users/vikrants/Desktop/testvcf/ILHC24-12806_FinalReport.txt
GTC file /Users/vikrants/Desktop/testvcf/ILHC24-12806_FinalReport.txt format identifier is bad

Can you please have a look and let me know why I am getting this error?

P.S. I generated the matrix format report from GenomeStudio.

Thanks in advance,
Vikrant
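
One thing worth checking in the command quoted in this report: -o takes a value, so the token that follows it, --genome-studio, may be consumed as the output filename, leaving the report path as a positional GTC input. That would be consistent with the "Reading GTC file" message. The sketch below reproduces the effect with Python's getopt module, which follows the conventions of C getopt; the option table is a simplified assumption rather than the plugin's actual one, and the paths are shortened placeholders.

```python
import getopt


def parse(argv):
    """Parse a simplified, assumed subset of the plugin's options."""
    opts, rest = getopt.gnu_getopt(
        argv, "o:f:", ["no-version", "genome-studio="]
    )
    return dict(opts), rest


if __name__ == "__main__":
    # Same token order as in the report: -o swallows the next token.
    print(parse(["--no-version", "-o", "--genome-studio", "report.txt",
                 "-f", "hg38.fa"]))
    # Giving -o its own value lets --genome-studio be parsed as an option.
    print(parse(["--no-version", "-o", "out.vcf", "--genome-studio",
                 "report.txt", "-f", "hg38.fa"]))
```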

ERROR in converting CHP to VCF

Hi, developer. After installing bcftools-1.11 and gtc2vcf, I ran the following command
/data_6t/lizhan/02.software/bcftools-1.11/bcftools +affy2vcf \
  --no-version -Ou \
  --csv $csv_manifest_file \
  --fasta-ref $ref \
  --chps $path_to_chp_folder \
  --snp $path_to_txt_folder/AxiomGT1.snp-posteriors.txt \
  --extra $out_prefix.tsv

but I got the following error message.

Writing to ./bcftools-sort.ribgu4
/data_6t/lizhan/02.software/bcftools-1.11/plugins/affy2vcf.so:
dlopen .. /data_6t/lizhan/02.software/bcftools-1.11/plugins/affy2vcf.so: undefined symbol: set_wmode
affy2vcf:
dlopen .. affy2vcf: cannot open shared object file: No such file or directory

The bcftools plugin "affy2vcf" was not found or is not functional in
BCFTOOLS_PLUGINS="/data_6t/lizhan/02.software/bcftools-1.11/plugins".

Is the plugin path correct?
Run "bcftools plugin -l" or "bcftools plugin -lvv" for a list of available plugins.

Could not load "affy2vcf".

RUN bcftools +affy2vcf --models get some error . How do i fix it?

I'm sorry to bother you.
I got the error "Probe Set AX-82929059 not found in models file" when I ran bcftools +affy2vcf.
How do I fix it? Thanks!

Can I run bcftools +affy2vcf without the --models xxxxxx.snp-posteriors.txt option? Does it make any difference? If I don't use the --models option, I do get a VCF file.

bcftools +affy2vcf \
  --csv ../APT-library/biobank/Axiom_BioBank1.na35.annot.csv \
  --fasta-ref ../resource-humanv37/human_g1k_v37.fasta \
  --calls ./GPS-step7-output/AxiomGT1.calls.txt \
  --confidences ./GPS-step7-output/AxiomGT1.confidences.txt \
  --summary ./GPS-step7-output/AxiomGT1.summary.txt \
  --models ./GPS-step7-output/AxiomGT1.snp-posteriors.txt \
  --output ./bcf-output/AxiomGT1.vcf

--- RUNNING LOG ---
Reading CSV file ../APT-library/biobank/Axiom_BioBank1.na35.annot.csv
Reading SNP file ./GPS-step7-output/AxiomGT1.snp-posteriors.txt
Writing VCF file
Probe Set AX-82929059 not found in models file

bcftools +affy2vcf \
  --csv ../APT-library/biobank/Axiom_BioBank1.na35.annot.csv \
  --fasta-ref ../resource-humanv37/human_g1k_v37.fasta \
  --chps ./GPS-step7-output/cc-chp/ \
  --models ./GPS-step7-output/AxiomGT1.snp-posteriors.txt \
  --output bcf0517chp.vcf

--- RUNNING LOG ---
Reading CSV file ../APT-library/biobank/Axiom_BioBank1.na35.annot.csv
Reading CHP file ./GPS-step7-output/cc-chp//xxxxxxxxxxx.chp
...
Reading SNP file ./GPS-step7-output/AxiomGT1.snp-posteriors.txt
Writing VCF file
Probe Set AX-82929059 not found in models file
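
To see how many probe sets are affected beyond the first one reported, the identifiers in the calls file can be compared with those in the models file. The Python sketch below is a rough aid under stated assumptions: both files are taken to be tab-delimited with the probe set ID in the first column, comment lines are assumed to start with #, the column-header line is assumed to start with probeset_id or id, and models entries are assumed to possibly carry a :<n> suffix. The function names are hypothetical.

```python
def probeset_ids(lines):
    """Yield the first tab-separated field of each data line.

    Lines starting with '#' and the column-header line (assumed to start
    with 'probeset_id' or 'id') are skipped.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        first = line.split("\t", 1)[0]
        if first in ("probeset_id", "id"):
            continue
        yield first


def missing_from_models(calls_lines, models_lines):
    """Return probe set IDs present in the calls but absent from the models."""
    # Assumption: a models entry such as 'AX-1:1' belongs to probe set 'AX-1'.
    models = {pid.split(":", 1)[0] for pid in probeset_ids(models_lines)}
    return sorted(pid for pid in set(probeset_ids(calls_lines))
                  if pid not in models)


if __name__ == "__main__":
    calls = ["#%header", "probeset_id\tS1.CEL", "AX-1\t0", "AX-82929059\t1"]
    models = ["#%header", "id\tBB\tAB\tAA\tCV", "AX-1\t1,2,3", "AX-1:1\t1,2,3"]
    print(missing_from_models(calls, models))
```

With real files, the two arguments could be open file handles for AxiomGT1.calls.txt and AxiomGT1.snp-posteriors.txt.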

patch not working; vcf version is out of date

Hello -- We could not patch the +gtc2vcf plugin using bcftools/1.9 on CentOS 6

In another install attempt (our first), "MODE_SWAP" was reported as undefined in the C code.

bcftools-1.9/plugins]$ patch < fixref.patch
patching file fixref.c
Hunk #1 FAILED at 91.
Hunk #2 succeeded at 104 (offset -1 lines).
Hunk #3 FAILED at 134.
Hunk #4 FAILED at 155.
Hunk #5 succeeded at 180 (offset -5 lines).
Hunk #6 succeeded at 193 with fuzz 2 (offset -6 lines).
Hunk #7 succeeded at 236 (offset -6 lines).
Hunk #8 succeeded at 428 with fuzz 2 (offset -14 lines).
Hunk #9 succeeded at 586 (offset -14 lines).
3 out of 9 hunks FAILED -- saving rejects to file fixref.c.rej

This is with bcftools-1.9 etc.
somehow we something without the patch and it gave vcf version 3ish not 4.2?
Any plans to do more with this plugin maybe cover indels and some updating for the vcf spec?

I like the concept of making a bcftools plugin - that's kinda nifty :-)

vcf files not being saved

I have approximately 4000 gtcs that I am trying to convert to vcf files using the gtc2vcf plugin but even though the script reads gtcs correctly and writes the vcf file - no output is produced. I have tried to run it by reducing the number of gtcs to 8 and get the same result.
I get this output:
Writing to ./bcftools-sort.XXXXXXMMTHoa
gtc2vcf 2022-01-12 https://github.com/freeseek/gtc2vcf
Reading BPM file /bochica/shared/numom/raw_babies/GUER_20211019_MEGA_1001_1002/Multi-EthnicGlobal_D2.bpm
Reading EGT file /bochica/shared/numom/raw_babies/GUER_20211019_MEGA_1001_1002/Multi-EthnicGlobal_D1_ClusterFile.egt
Reading GTC file /home5/maamir/mfgitry/somegtc/206043240081_R02C01.gtc
Reading GTC file /home5/maamir/mfgitry/somegtc/206043240081_R07C01.gtc
Reading GTC file /home5/maamir/mfgitry/somegtc/206043240081_R06C01.gtc
Reading GTC file /home5/maamir/mfgitry/somegtc/206043240081_R01C01.gtc
Reading GTC file /home5/maamir/mfgitry/somegtc/206043240081_R08C01.gtc
Reading GTC file /home5/maamir/mfgitry/somegtc/206043240081_R03C01.gtc
Reading GTC file /home5/maamir/mfgitry/somegtc/206043240081_R05C01.gtc
Reading GTC file /home5/maamir/mfgitry/somegtc/206043240081_R04C01.gtc
Writing VCF file
Lines total/missing-reference/skipped: 1748250/23814/14885
Merging 2 temporary files
Cleaning
Lines total/split/realigned/skipped: 1733365/0/0/23817

But no bcftools-sort.XXXXXXMMTHoa subdirectory is present in my directory when the program has stopped running.

Below is the code I am using -

ref="/home5/maamir/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna"
bcftools +gtc2vcf \
  --no-version -Ou \
  --bpm /bochica/shared/numom/raw_babies/GUER_20211019_MEGA_1001_1002/Multi-EthnicGlobal_D2.bpm \
  --egt /bochica/shared/numom/raw_babies/GUER_20211019_MEGA_1001_1002/Multi-EthnicGlobal_D1_ClusterFile.egt \
  --gtcs /home5/maamir/mfgi \
  --fasta-ref $ref \
  --extra $out_prefix.tsv | \
  bcftools sort -Ou -T ./bcftools-sort.XXXXXX | \
  bcftools norm --no-version -Ob -c x -f $ref && \
  bcftools index --f $out_prefix.bcf

NORMX/NORMY/R/THETA missing from GenomeStudio text output

Thanks for an excellent tool! I have been trying to use it to generate input for a CNV calling pipeline, and was pleased to discover the -Ot option for GenomeStudio text format export, which looked close enough to the format I needed. However, it seems some fields that make it to the VCF output are not exported to the text format.

Specifically, the ones I miss are NORMX/NORMY/R/THETA. I checked the code of gtcs_to_gs, and all the missing fields seem to depend on BPM_LOOKUPS being set. I couldn't see a reason why it wouldn't be though, so maybe this is the wrong track.

Exporting the same collection of GTCs to VCF had the proper format tags included.

This call:

bcftools +${GTC2VCF} \
        -Ot \
        --bpm ${BATCH1_MFT_BPM} \
        --csv ${BATCH1_MFT_CSV} \
        --egt ${BATCH1_EGT} \
        --gtcs ${GTCDIR}/${BATCH1_NAME} \
        --fasta-ref ${REF} > ${OUT_PREFIX}.FDT.tsv

Produces output with these columns (truncated):

Index
Name
Address
Chr
Position
GenTrain Score
Frac A
Frac C
Frac G
Frac T
204379800081_R02C02.GType
204379800081_R02C02.Score
204379800081_R02C02.B Allele Freq
204379800081_R02C02.Log R Ratio
204379800081_R02C02.X Raw
204379800081_R02C02.Y Raw
204379800081_R02C02.Top Alleles
204379800081_R02C02.Plus/Minus Alleles
204379800081_R02C01.GType
204379800081_R02C01.Score
204379800081_R02C01.B Allele Freq
204379800081_R02C01.Log R Ratio
204379800081_R02C01.X Raw
204379800081_R02C01.Y Raw
...

While an equivalent call requesting vcf output:

bcftools +${GTC2VCF} \
        -Ou \
        --bpm ${BATCH1_MFT_BPM} \
        --csv ${BATCH1_MFT_CSV} \
        --egt ${BATCH1_EGT} \
        --gtcs ${GTCDIR}/${BATCH1_NAME} \
        --fasta-ref ${REF} \
        --extra ${OUT_PREFIX}.tsv | \
        bcftools sort -Ou -T $TMPDIR/bcftools-sort.XXXXXX | \
        bcftools norm -Oz -o ${OUT_PREFIX}.vcf.gz -c x -f $REF

produces a VCF with the expected format tags:

GT:GQ:IGC:BAF:LRR:NORMX:NORMY:R:THETA:X:Y

Tested on the stable version from http://software.broadinstitute.org/software/gtc2vcf/ and the current github version getting the same results.

I can query the VCF to get the data I need, but thought I should report this since the behavior was unexpected.
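
As a stopgap for the workaround mentioned in this report (pulling the missing fields from the VCF instead), here is a minimal Python sketch that pairs FORMAT keys with per-sample values for one VCF data line. The function names are hypothetical and the example record is made up for illustration; only the tag list is taken from the report.

```python
def sample_fields(format_field, sample_field):
    """Pair the FORMAT keys of a VCF record with one sample's values."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # zip stops at the shorter list, so dropped trailing values are tolerated.
    return dict(zip(keys, values))


def extract(vcf_line, tags=("NORMX", "NORMY", "R", "THETA")):
    """Return, for each sample on a tab-delimited data line, the requested tags.

    Tags that are absent from the record are reported as '.'.
    """
    cols = vcf_line.rstrip("\r\n").split("\t")
    fmt, samples = cols[8], cols[9:]
    return [[sample_fields(fmt, s).get(t, ".") for t in tags] for s in samples]


if __name__ == "__main__":
    # Made-up record with two samples, for illustration only.
    record = "\t".join([
        "chr1", "1000", "rs0", "A", "G", ".", ".", ".",
        "GT:BAF:LRR:NORMX:NORMY:R:THETA",
        "0/1:0.5:0.01:0.9:0.8:1.7:0.47",
        "0/0:0.0:0.02:1.5:0.1:1.6:0.04",
    ])
    print(extract(record))
```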

Failed to read 1359180426 bytes when convert gtc files

Dear Giulio,
Thank you for developing such a good tool to deal with IDAT files. I have successfully converted IDAT files to GTC files, thank you for your suggestion. When I ran the code as in the guide, an error occurred. I saw that someone had a similar issue (#13), but the solution there does not apply to me. I used --gtcs with a folder containing 103 GTC files, and I get the same error with fewer files.

$bcftools +gtc2vcf \
--no-version -Ou \
--bpm $bpm_manifest_file \
--csv $csv_manifest_file \
--egt $egt_cluster_file \
--gtcs $path_to_gtc_folder \
--fasta-ref $ref \
--extra $out_prefix.tsv

gtc2vcf 2020-08-26 https://github.com/freeseek/gtc2vcf
Reading BPM file /media/EXTend2018/Wanghe2019/GEO/GSE113093/InfiniumPsychArray-24v1-1_A1.bpm
Reading CSV file /media/EXTend2018/Wanghe2019/GEO/GSE113093/InfiniumPsychArray-24v1-1_A1.csv
Reading EGT file /media/EXTend2018/Wanghe2019/GEO/GSE113093/InfiniumPsychArray-24v1-1_A1_ClusterFile.egt
Reading GTC file /media/EXTend2018/Wanghe2019/GEO/GSE113093/GSE113093_GTC/GSM3096512_200687150051.gtc
Failed to read 1359180426 bytes from stream

Best wishes,
Crane

affy2vcf: How to make the IDs of generated vcf files be rsid from annotation files given by Affymetrix but not probeset id?

Hi,

I have succeeded in converting CEL files to VCF files, but I found that the ID column of the VCF files still contains probeset IDs. I have tried

bcftools annotate -a 00-All.vcf.gz -c ID xxx.vcf.gz

to annotate the ID column with rsIDs, but some SNPs still fail to be annotated because 00-All.vcf.gz does not contain all the SNPs from GenomeWideSNP_6.na35.annot.csv. Is there any way to annotate the IDs in the step

bcftools +affy2vcf \
  --no-version -Ou \
  --csv $csv_manifest_file \
  --fasta-ref $ref \
  --chps $path_to_chp_folder \
  --snp $path_to_txt_folder/AxiomGT1.snp-posteriors.txt \
  --extra $out_prefix.tsv | \
  bcftools sort -Ou -T ./bcftools-sort.XXXXXX | \
  bcftools norm --no-version -Ob -o $out_prefix.bcf -c x -f $ref && \
  bcftools index -f $out_prefix.bcf

or transform the GenomeWideSNP_6.na35.annot.csv to vcf annotation file? Thank you!
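
One possible route for the second option raised in this question is to build a probeset-to-rsID lookup from the annotation CSV and rewrite the ID column afterwards. The Python sketch below is only a starting point: the column names "Probe Set ID" and "dbSNP RS ID", the # comment prefix, and the "---" placeholder for a missing rsID are assumptions about the annotation file, and both function names are hypothetical.

```python
import csv
import io


def probeset_to_rsid(csv_text, id_col="Probe Set ID", rs_col="dbSNP RS ID"):
    """Map probe set IDs to rsIDs from the text of an annotation CSV.

    Comment lines starting with '#' are dropped; rows whose rsID is empty
    or '---' are skipped.
    """
    data = "".join(l for l in io.StringIO(csv_text) if not l.startswith("#"))
    mapping = {}
    for row in csv.DictReader(io.StringIO(data)):
        rsid = (row.get(rs_col) or "").strip()
        if rsid and rsid != "---":
            mapping[row[id_col]] = rsid
    return mapping


def rewrite_id(vcf_line, mapping):
    """Replace the ID column (third field) of a VCF data line when mapped."""
    if vcf_line.startswith("#"):
        return vcf_line  # header lines are passed through unchanged
    cols = vcf_line.split("\t")
    cols[2] = mapping.get(cols[2], cols[2])
    return "\t".join(cols)


if __name__ == "__main__":
    text = '#%note\n"Probe Set ID","dbSNP RS ID"\n"SNP_A-1","rs123"\n"SNP_A-2","---"\n'
    lookup = probeset_to_rsid(text)
    print(rewrite_id("1\t100\tSNP_A-1\tA\tG\n", lookup))
```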

Some installation issues

Hi Giulio,

I ran into some issues when installing the tools. I'm using Ubuntu 16.04 and I'm not experienced with installing software on Ubuntu. Could you help me with them?

Cannot install libicu66

sudo apt install libicu66
Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package libicu66

I did some search on Google but did not find a package named libicu66. If I just want to convert .idat files into .vcf files (do not have .bpm files), do I need to install this package?

Cannot install gtc2vcf correctly
I tried to use the first method to install gtc2vcf:

git clone --branch=develop --recurse-submodules git://github.com/samtools/htslib.git
git clone --branch=develop git://github.com/samtools/bcftools.git
/bin/rm -f bcftools/plugins/{gtc2vcf.{c,h},affy2vcf.c}
wget -P bcftools/plugins https://raw.githubusercontent.com/freeseek/gtc2vcf/master/{gtc2vcf.{c,h},affy2vcf.c}
cd htslib && autoheader && (autoconf || autoconf) && ./configure --disable-bz2 --disable-gcs --disable-lzma && make && cd ..
cd bcftools && make && cd ..
/bin/cp bcftools/{bcftools,plugins/{gtc,affy}2vcf.so} $HOME/bin/
export PATH="$HOME/bin:$PATH"
export BCFTOOLS_PLUGINS="$HOME/bin"

These commands all run correctly but when I tried to use

gtc2vcf

I got

gtc2vcf: command not found

When I tried

gtc2vcf.so

I got

Segmentation fault (core dumped)

My system has 16 GB of RAM and 8 cores. Do you think this is due to a lack of RAM?

When I tried to use the alternative method to install gtc2vcf, I got :

sudo apt install ./{libhts3_1.11-4,bcftools_1.11-1,gtc2vcf_1.11-dev}_amd64.deb
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Note, selecting 'libhts3' instead of './libhts3_1.11-4_amd64.deb'
Note, selecting 'bcftools' instead of './bcftools_1.11-1_amd64.deb'
Note, selecting 'gtc2vcf' instead of './gtc2vcf_1.11-dev_amd64.deb'
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 bcftools : Depends: libc6 (>= 2.29) but 2.23-0ubuntu11.2 is to be installed
 gtc2vcf : Depends: libc6 (>= 2.29) but 2.23-0ubuntu11.2 is to be installed
 libhts3 : Depends: libc6 (>= 2.29) but 2.23-0ubuntu11.2 is to be installed
           Depends: libdeflate0 (>= 1.0) but it is not installable
           Depends: libssl1.1 (>= 1.1.0) but it is not installable
E: Unable to correct problems, you have held broken packages.

Consequently, when I tried to find the BPM manifest information, I got nothing.

bcftools + gtc2vcf -i -g ~/Desktop/test/
Could not initialize , neither run or init found

Any suggestions would be greatly appreciated!

Thank you!
Xiaotong

Feature request: alternative genome reference for --genome-studio input

Hello, and thanks for a great tool!

I am working on some older genotype data (on the PsychChip) where the IDAT files have unfortunately been lost to time, but where we do have a reasonably rich GenomeStudio text format export, and the original csv manifest file used when generating the export. I want to combine this with newer genotyping waves where we do have the IDATs, and would like to remap the markers using gtc2vcf to hopefully be done with strand and allele issues once and for all. But currently gtc2vcf does not permit --genome-studio to be used with --csv and/or --sam-flank.

Would it be possible to extend gtc2vcf to this use case, or is there some vital information I am missing that makes it a bad idea or impossible?

The GS export has columns (followed by 6-15 repeated for each sample):

1: Index
2: Name
3: Address
4: Chr
5: Position
6: S1.GType
7: S1.Score
8: S1.Theta
9: S1.R
10: S1.X Raw
11: S1.Y Raw
12: S1.X
13: S1.Y
14: S1.B Allele Freq
15: S1.Log R Ratio
16: ...

My csv manifest has columns:

1: IlmnID
2: Name
3: IlmnStrand
4: SNP
5: AddressA_ID
6: AlleleA_ProbeSeq
7: AddressB_ID
8: AlleleB_ProbeSeq
9: GenomeBuild
10: Chr
11: MapInfo
12: Ploidy
13: Species
14: Source
15: SourceVersion
16: SourceStrand
17: SourceSeq
18: TopGenomicSeq
19: BeadSetID

No releases

Commit messages contain phrases like new release or new version, but there are no versioned releases/tags for this repo. That makes it hard to create a reproducible deployment for reproducible science...

build in macOS

Dear gtc2vcf team

I was wondering whether a prebuilt binary for macOS exists?
If not, are there any instructions for building the package from source on macOS?

Thank you in advance.

Regards,
Sina

idat to vcf without bpm file

Hi @freeseek ,

I don't have the BPM manifest file, but I do have the CSV manifest file. Is there any option to convert the IDAT files to GTC and VCF?

Regards,
Karthick

Include other metrics in the vcf output

Hi there,

I'm looking for a way to include the "cluster separation" [0-1] metric to the output vcf produced using the gtc2vcf method. Could someone please tell me if this would be possible and how I could change the code to achieve this goal?

Thank you!

No output VCF files

Hello

I see that two steps are necessary to convert from .CEL to .VCF. The first step generates xxxxx.AxiomGT1.chp files (where xxxxx is the name of the original file). Is this correct?

Now, I'm having a problem with the second step. When I run that part of the program I get no errors, but I also can't find the VCF files. This is the command I'm running:

bcftools +affy2vcf \
  --no-version -Ou \
  --csv "GenomeWideSNP_6.na35.annot.csv" \
  --fasta-ref "human_g1k_v37.fasta" \
  --chps /home/adrianib/Proyecto/cc-chp \
  --snp /home/adrianib/Proyecto/AxiomGT1.snp-posteriors.txt \
  --extra result.tsv | \
  bcftools sort -Ou -T ./bcftools-sort.XXXXXX | \
  bcftools norm --no-version -Ob -o result.bcf -c x -f "human_g1k_v37.fasta" && \
  bcftools index -f result.bcf

I see there is no option to indicate the output folder as in the first step. Could this be the reason I don't have output VCF files?

In summary, I have this:
Original file: xxxxx.CEL
1st step (CEL to CHP): xxxxx.AxiomGT1.chp
2nd step (CHP to VCF): ?

And my question is: Should I have a xxxxx.VCF file at the end of the second step?

Thanks for your help
Adrian

will generate same output as using AutoConvert via Beeline?

Hello,

Thank you for the handy tool! I'm able to generate GTC files from IDAT files using your software. However, I'd like to know whether the results are the same as those from Beeline's AutoConvert function. I don't have a Windows machine with Illumina software, so I can't compare them myself. I would really appreciate any input.

Thanks
Fan

Genomestudio file to vcf

Dear Freeseek,

The conversion from a GenomeStudio file to a VCF file works fine, but a lot of SNPs are missing after the conversion. I looked into this and observed that only the SNPs without any missing values are in the VCF file, but I am not sure about this yet, so I have some questions.

Is it true that the gtc2vcf tool only keeps the SNPs without any missing values after conversion? Or is there another way to handle them in this tool? And is it right to use -- for missing values in the GenomeStudio file?

Thanks in advance!

how to convert CEL to CHP?

After installing bcftools, I followed the README and ran the following command, but I got an error message.
path_to_output_folder="..."
cel_list_file="..."
apt-probeset-genotype \
  --analysis-files-path . \
  --xml-file GenomeWideSNP_6.apt-probeset-genotype.AxiomGT1.xml \
  --out-dir $path_to_output_folder \
  --cel-files $cel_list_file \
  --special-snps GenomeWideSNP_6.specialSNPs \
  --chip-type GenomeWideEx_6 \
  --chip-type GenomeWideSNP_6 \
  --table-output false \
  --cc-chp-output \
  --write-models \
  --read-models-brlmmp GenomeWideSNP_6.generic_prior.txt

My question is: what software needs to be installed in order to use apt-probeset-genotype?

Question affy2vcf

Hi Giulio,

Quick question: I see that the affy2vcf workflow converts CEL to CHP and then CHP to VCF. I am wondering whether both steps are required to get from CEL to VCF. I don't see this stated as a requirement in the description, and I know PennCNV goes from CEL to VCF but requires multiple steps. Let me know whether we can go straight from CEL to VCF. Thanks.
Brian

No output files generated for Illumina reports

Hello,

I have tried to use the following command to convert Illumina reports to VCF.

bcftools +gtc2vcf --genome-studio FinalReport24.txt -o GenotypeReport24.vcf

Output from the run in the terminal is only one line:

gtc2vcf 2021-06-01 https://github.com/freeseek/gtc2vcf

And the GenotypeReport24.vcf file is created but with no contents in it.

An extract from the Illumina report:

[Header]
GSGT Version	2.0.4
Processing Date	3/29/2021 4:13 PM
Content		GSA-24v3-0_A2.bpm
Num SNPs	654027
Total SNPs	654027
Num Samples	24
Total Samples	24
File 	24 of 24
[Data]
Sample Index	Sample ID	Sample Name	SNP Index	SNP Name	Chr	Position	GT Score	GC Score	Allele1 - AB	Allele2 - AB	Allele1 - Top	Allele2 - Top	Allele1 - Forward	Allele2 - Forward	Allele1 - Design	Allele2 - Design	Theta	R	X Raw	Y Raw	X	Y	B Allele Freq	Log R Ratio	SNP Aux	SNP	ILMN Strand	Top Genomic Sequence	Customer Strand
24	03-031		1	1:103380393	1	102914837	0.7987	0.8136	B	B	G	G	G	G	C	C	0.963	0.722	1101	3453	0.040	0.682	1.0000	0.3609	0	[T/C]	BOT		TOP
24	03-031		2	1:109439680	1	108897058	0.8792	0.4803	A	A	A	A	A	A	A	A	0.039	0.895	11409	497	0.843	0.052	0.0000	0.4173	0	[A/G]	TOP		TOP

I spent some hours trying to figure out what I might be doing wrong but couldn't figure it out.
Any tips on what might be going wrong with my steps are appreciated.

Thanks,
Rashindrie

Update

Tried with below command

ref="/tmp/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna"
bcftools +gtc2vcf   --no-version -Ov -o $out_prefix  --genome-studio "FinalReport24.txt" -f $ref

Output on terminal

gtc2vcf 2021-06-01 https://github.com/freeseek/gtc2vcf
Writing VCF file
Could not recognize INFO field: [Header]
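
The error message suggests that the parser reached the [Header] line where it expected something else. A quick way to tell a [Header]/[Data] style report apart from a plain table before attempting a conversion is sketched below in Python. This is a hypothetical helper, not part of gtc2vcf, and it makes no claim about which layouts the plugin accepts; it only assumes that header entries are tab-separated key/value pairs, as in the extract above.

```python
def split_final_report(lines):
    """Split a '[Header]'/'[Data]' style report into a header dict and data lines.

    Returns (None, lines) when the text does not start with '[Header]'.
    """
    lines = [l.rstrip("\r\n") for l in lines]
    if not lines or lines[0].strip() != "[Header]":
        return None, lines
    header, data_start = {}, len(lines)
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "[Data]":
            data_start = i + 1  # data begins on the line after '[Data]'
            break
        key, _, value = line.partition("\t")
        header[key.strip()] = value.strip()
    return header, lines[data_start:]


if __name__ == "__main__":
    report = [
        "[Header]",
        "GSGT Version\t2.0.4",
        "Content\t\tGSA-24v3-0_A2.bpm",
        "Num Samples\t24",
        "[Data]",
        "Sample Index\tSample ID\tSNP Name",
        "24\t03-031\t1:103380393",
    ]
    header, data = split_final_report(report)
    print(header)
    print(data[0].split("\t"))
```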

cannot open more than 4096 files at once while 30546 is required

Hi!
I get this error when trying to convert GTC files to VCF with your nice tool:

$HOME/bin/bcftools +$HOME/bin/gtc2vcf.so --no-version -Ou -b $manifest_file -e $egt_file -g $gtc_list -f $ref -x $out.sex
...cannot open more than 4096 files at once while 30546 is required

Do I need another machine with more than 4 GB of RAM, or is there something I can do about the RAM capacity?

Thank you in advance Dr. Genovese!!
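
Going by its wording, the message refers to the number of files open at once, which points to a per-process open-file limit rather than to RAM; that reading is an assumption. Under it, one workaround would be to convert the GTC files in batches that each fit under the limit and combine the per-batch outputs afterwards (for example with bcftools merge). The Python sketch below only computes the batches: the function name and the reserve of 16 descriptors kept free for other files are hypothetical choices.

```python
def batches(paths, limit, reserve=16):
    """Split paths into consecutive batches of at most (limit - reserve) items.

    reserve is an assumed number of descriptors kept free for the manifest,
    the reference, the output and the standard streams.
    """
    size = limit - reserve
    if size <= 0:
        raise ValueError("limit must exceed reserve")
    return [paths[i:i + size] for i in range(0, len(paths), size)]


if __name__ == "__main__":
    import resource  # Unix only

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("open-file limit (soft, hard):", soft, hard)
    gtcs = [f"sample{i}.gtc" for i in range(30546)]  # placeholder names
    print(len(batches(gtcs, 4096)), "batches")
```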

IDAT not found

Hi freeseek,
Thank you for your help yesterday; I installed gtc2vcf successfully. But when I convert IDAT to GTC, the IDAT files are never found at the expected location. I have tried many times but can't solve it. I checked the BPM and EGT files, and I am sure they are right:
Chip Prefix (Guess),InfiniumPsychArray-24v1-1
I don't know why the IDAT files are not found. My IDAT files are named like this:
GSM3096512_200687150051_R01C01_Grn.idat
GSM3096512_200687150051_R01C01_Red.idat

This is the log:
ArrayAnalysis.NormToGenCall.CLI.App[0]
[10:25:21 2352]: Crawling /media/EXTend2018/Wanghe2019/GEO/GSE113093/GSE113093_RAW for samples ...
info: ArrayAnalysis.NormToGenCall.CLI.App[0]
[10:25:21 3578]: Number of samples to process: 103
info: ArrayAnalysis.NormToGenCall.Services.NormToGenCallSvc[0]
[10:25:21 3714]:
Starting processing...
Manifest file: /media/EXTend2018/Wanghe2019/GEO/GSE113093/InfiniumPsychArray-24v1-1_A1.bpm
Cluster file: /media/EXTend2018/Wanghe2019/GEO/GSE113093/InfiniumPsychArray-24v1-1_A1_ClusterFile.egt
Include file:
Output directory: /media/EXTend2018/Wanghe2019/GEO/GSE113093
GenCall score cutoff: 0.15
GenTrain ID: 3
Gender Estimate Settings:
Version: 2
MinAutosomalLoci : 100
MaxAutosomalLoci : 10000
MinXLoci : 20
MinYLoci : 20
AutosomalCallRateThreshold : 0.97
YIntensityThreshold : 0.3
XIntensityThreshold : 0.9
XHetRateThreshold : 0.1
Output Settings:
Output GTC: True
Output PED: False
PED tab delmited: False
PED use customer strand: False
Number of threads: 1
Buffer size: 131072

info: ArrayAnalysis.NormToGenCall.Services.NormToGenCallSvc[0]
ArrayAnalysis.NormToGenCall.Services.NormToGenCallSvc[0]
[12:33:32 8929]: Failed to normalize or gencall - GSM3096512_200687150051_R01C01: IDAT not found at location: /media/EXTend2018/Wanghe2019/GEO/GSE113093/GSE113093_RAW/GSM3096512_200687150051_Red.idat
at ArrayAnalysis.NormToGenCall.Services.SampleNormToGenCallSvc.LoadIdat(String idatPath, Manifest manifest) in /src/ArrayAnalysis.NormToGenCall.Services/Services/SampleNormToGenCallSvc.cs:line 63
at ArrayAnalysis.NormToGenCall.Services.SampleNormToGenCallSvc.Normalize(NormalizationBase normAlg, Manifest manifest, Byte[] transformLookups, Boolean needGreen, Boolean needRed, SampleData sample, String[] includeLociNames) in /src/ArrayAnalysis.NormToGenCall.Services/Services/SampleNormToGenCallSvc.cs:line 106
at ArrayAnalysis.NormToGenCall.Services.NormToGenCallSvc.<>c__DisplayClass7_0.b__2(SampleData sample) in /src/ArrayAnalysis.NormToGenCall.Services/Services/NormToGenCallSvc.cs:line 113
...Many IDAT files fail like this.

Best wishes,
Crane
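
Looking at the log, the tool searches for GSM3096512_200687150051_Red.idat, so it seems to read the GEO accession as the barcode and drop R01C01. One workaround worth trying (an assumption, not confirmed here) is to present the files under names of the form `<barcode>_<position>_<channel>.idat`. A minimal sketch using symbolic links; `strip_geo_prefix`, `link_idats` and the directory names are illustrative:

```shell
# GSM3096512_200687150051_R01C01_Red.idat -> 200687150051_R01C01_Red.idat
strip_geo_prefix() {
    printf '%s\n' "${1#GSM*_}"
}

# Create prefix-free symlinks in dst_dir for every GSM*_*.idat file in src_dir.
link_idats() {
    src_dir=$(cd "$1" && pwd) || return 1   # absolute path so the links stay valid
    dst_dir=$2
    mkdir -p "$dst_dir" || return 1
    for f in "$src_dir"/GSM*_*.idat; do
        [ -e "$f" ] || continue             # no match: the glob stays literal
        name=$(basename "$f")
        ln -sf "$f" "$dst_dir/$(strip_geo_prefix "$name")"
    done
}
```

For example, `link_idats GSE113093_RAW idat_clean` would fill a hypothetical idat_clean directory, which could then be given to the converter instead of the original folder.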

Failed to open file "Ou" : No such file or directory

Hello, freeseek,

I cannot seem to convert my .gtc files to a vcf file using the following code:

bpm_manifest_file="InfiniumOmni2-5-8v1-5_A1.bpm"
csv_manifest_file="InfiniumOmni2-5-8v1-5_A1.csv"
egt_cluster_file="InfiniumOmni2-5-8v1-5_A1_ClusterFile.egt"
ref="$HOME/GRCh37/human_g1k_v37.fasta"
out_prefix="batch1_vcf"
bcftools +gtc2vcf --no-version -Ou --bpm $bpm_manifest_file --csv $csv_manifest_file --egt $egt_cluster_file --gtcs $path_to_gtc_folder --fasta-ref $ref --extra $out_prefix.tsv | bcftools sort -Ou -T ./bcftools-sort.XXXXXX | bcftools norm --no-version -Ob -c x -f $ref | tee $out_prefix.bcf | bcftools index --force --output $out_prefix.bcf.csi

The error that I receive is [E::hts_open_format] Failed to open file "Ou" : No such file or directory
Reading BPM file InfiniumOmni2-5-8v1-5_A1.bpm
Could not read Ou
Failed to read from standard input: unknown file type
index: "-" is in a format that cannot be usefully indexed

I've tried adapting the command by reading through the other issues that have come up, but have had no luck creating a bcf file that has > 0 bytes. May I ask for assistance in resolving this issue? I should mention that the manifest and cluster files provided by illumina are in the same directory in which I am running this command.

Thank you,
Chris

Sample_ID from samples file not saved to VCF -file

First of all, thank You very much for this excellent pipeline!

I have been able to convert IDAT files to GTC successfully, and during the conversion iaap-cli recognises the sample ID from the samples file. However, when converting from GTC to VCF, the ID is set back to "SentrixBarcode_A_SentrixPosition_A".

Samples CSV file is structured as follows:

[Data]
Sample_ID,SentrixBarcode_A,SentrixPosition_A,Path

During the iaap-cli conversion I get the message:
info: ArrayAnalysis.NormToGenCall.Services.NormToGenCallSvc[0]
[07:09:03 1893]: Writing [Sample_ID_Obfuscated] to gtc...

When I query the IDs from the converted VCF file with bcftools query -l, I get:
[SentrixBarcode_A]_[SentrixPosition_A]
[SentrixBarcode_A]_[SentrixPosition_A]
[SentrixBarcode_A]_[SentrixPosition_A]
.....

I know I can annotate the VCF IDs again, but I would rather build a pipeline where this is not necessary.

Error in names(object) <- nm gtc2vcf_plot.R

Dear freeseek,

I have some issues running the R script gtc2vcf_plot.R to generate plots. My input was first a .vcf file, but I got an error about the file format, so I compressed it with bgzip into a .vcf.gz file (as suggested in the message) with the following command: bgzip file.vcf. After that, I got the error below.

gtc2vcf_plot.R 2020-09-01 https://github.com/freeseek/gtc2vcf
Command: bcftools query --format "[%CHROM\t%POS\t%ID\t%INFO/meanR_AA\t%INFO/meanR_AB\t%INFO/meanR_BB\t%INFO/meanTHETA_AA\t%INFO/meanTHETA_AB\t%INFO/meanTHETA_BB\t%INFO/devR_AA\t%INFO/devR_AB\t%INFO/devR_BB\t%INFO/devTHETA_AA\t%INFO/devTHETA_AB\t%INFO/devTHETA_BB\t%GT\t%X\t%Y\t%NORMX\t%NORMY\t%R\t%THETA\t%BAF\t%LRR\n]" all_qc.unphased_extra.vcf.gz -r 11:66328095-66328095
Error in names(object) <- nm :
  'names' attribute [24] must be the same length as the vector [0]
Calls: setNames
In addition: Warning message:
In fread(cmd = cmd, sep = "\t", header = FALSE, na.strings = ".",  :
  File '/tmp/RtmpaEu4mo/file13974573efd5' has size 0. Returning a NULL data.frame.
Execution halted

Thanks in advance!

gtc2vcf cannot open gtc files

I used the GenCall algorithm to generate gtc files. My generated gtc files are not in a readable format- is this supposed to be the case? (I have set the LANG variable as instructed). My egt and bpm files are also correctly called on, and GenCall seems to run fine.
However, the gtc2vcf plugin is also unable to read in these gtc files.

This is the command I have used to generate gtc files:
LANG="en_US.UTF-8" $HOME/bin/iaap-cli/iaap-cli gencall /path/to/manifest/file.bpm /path/to/cluster/file.egt /path/to/output/folder --idat-folder /path/to/idat/folder/--output-gtc --gender-estimate-call-rate-threshold -0.1

Am I generating gtc files incorrectly?

GTC file format identifier is bad

I'm using bcftools Version: 1.10.2 (using htslib 1.10.2)

bcftools +gtc2vcf -c HumanOmniExpressExome-8-v1-0-B.csv -f human_g1k_v37.fasta test.gtc -o test.vcf
================================================================================
Reading CSV file HumanOmniExpressExome-8-v1-0-B.csv
BPM manifest file version = 0
Name of manifest = HumanOmniExpressExome-8v1_B.bpm
Number of loci = 951117
================================================================================
Reading GTC files
GTC file test.gtc format identifier is bad

First couple of lines in my gtc file

[Header]
Autocall Version        1.6.2.2
Processing Date 8/24/2012 9:10 PM
Content HumanOmniExpressExome-8v1_B.bpm
Cluster File    StCtrCEPH_OMXEX_B.egt
Gender  F
Num SNPs        951117
Total SNPs      951117
Num Samples     1
Total Samples   1
[Data]
SNP Name        Chromosome      Position        GC Score        Allele1 - Top   Allele2 - Top   Allele1 - AB    Allele2 - AB  X       Y       Raw X   Raw Y   R Illumina      Theta Illumina  bAllele Freq    Log R Ratio Illumina
200610-104      MT      212     0.4097353       A       A       A       A       3.1761038       0.037173282     23245.0       433.0   3.213277        0.0074506905    0.0020603272    0.27923167
200610-106      MT      246     0.3716166       A       A       A       A       3.1220326       0.1416725       22856.0       1400.0  3.2637053       0.028868914     0.0     0.3078592

my manifest file

HumanOmniExpressExome-8-v1-0-B.csv
Illumina, Inc.
[Heading]
Descriptor File Name,HumanOmniExpressExome-8v1_B.bpm
Assay Format,Infinium HD Super
Date Manufactured,4/21/2014
Loci Count ,951117
[Assay]
IlmnID,Name,IlmnStrand,SNP,AddressA_ID,AlleleA_ProbeSeq,AddressB_ID,AlleleB_ProbeSeq,GenomeBuild,Chr,MapInfo,Ploidy,Species,Source,SourceVersion,SourceStrand,SourceSeq,TopGenomicSeq,BeadSetID,RefStrand,Exp_Clusters
200610-104-0_B_F_1867864664,200610-104,BOT,[T/C],0095685332,CGCACCTACGTTCAATATTACAGGCGAACATACTTACTAAAGTGTGTTAA,,,37,MT,212,diploid,Homo sapiens,BGI,0,BOT,TTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTACTAAAGTGTGTTAA[T/C]TAATTAATGCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAA,TTGTTATGATGTCTGTGTGGAAAGTGGCTGTGCAGACATTCAATTGTTATTATTATGTCCTACAAGCATTAATTA[A/G]TTAACACACTTTAGTAAGTATGTTCGCCTGTAATATTGAACGTAGGTGCGATAAATAA,485,+,2
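
The file shown above is plain text that starts with [Header], which looks like a GenomeStudio-style text report rather than a binary GTC file. Binary GTC files are expected to begin with the three ASCII bytes gtc, which is presumably what the "format identifier" check refers to. A quick check under that assumption (`gtc_kind` is a made-up helper name):

```shell
# Prints "gtc" when the file starts with the bytes g, t, c and "not-gtc" otherwise.
gtc_kind() {
    magic=$(head -c 3 "$1" 2>/dev/null)
    if [ "$magic" = "gtc" ]; then
        echo gtc
    else
        echo not-gtc
    fi
}
```

Running `gtc_kind test.gtc` on the file from this report should print not-gtc, since its first bytes are `[He`.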

Issue "Reading EGT file: Data block version 5 in cluster file not supported"

Hello,

After I obtained GTC files from IDAT files using the Human CNV 370 manifest (.egt and .bpm files), I ran this code to get a VCF file:

source /software/bcftools/1.9/start_bcftools.sh
bpm_manifest_file="humancnv370v1_c.bpm"
egt_cluster_file="HumanCNV370v1_C.egt"
gtc_list_file="gtc_370.txt"
ref="human_g1k_v37.fasta"
out_prefix="X"
bcftools +gtc2vcf \
--no-version -Ov \
-b $bpm_manifest_file \
-e $egt_cluster_file \
-g $gtc_list_file \
-f $ref \
-x $out_prefix.sex | \
bcftools sort -Ov -T ./bcftools-sort.XXXXXX | \
bcftools norm --no-version -Ov -o $out_prefix.vcf -c x -f $ref && \
bcftools index -f $out_prefix.vcf

I get the following error:
Reading EGT file HumanCNV370v1_C.egt
Data block version 5 in cluster file not supported
[E::bcf_hdr_read] Input is not detected as bcf or vcf format
Could not read VCF/BCF headers from -
Cleaning
Failed to read from standard input: unknown file type

Can you please help me with this?

Thank you.

Error in running Affymetrix

I already have the call, intensities, and confidence file. I am running the gtc2vcf on my Affymetrix genotype calls and intensities with the code provided but it returns with this error message:
[W::bcf_record_check] Bad BCF record: Invalid CONTIG id -1

[E::bcf_hdr_read] Input is not detected as bcf or vcf format

Hello,

when I try to convert .gtc files to .vcf I get the error "[E::bcf_hdr_read] Input is not detected as bcf or vcf format". It seems like the .gtc header size is bigger than expected. Can you please help me to fix this error?

Thank you.

compressed VCF

Hi,

To convert CHP files to BCF, I use this command:

bcftools +affy2vcf \
--no-version -Ou \
--csv "GenomeWideSNP_6.na35.annot.csv" \
--fasta-ref "human_g1k_v37.fasta" \
--chps /home/user/project/cc-chp/NAME \
--snp /home/user/project/AxiomGT1.snp-posteriors.txt \
--extra NAME.tsv | \
bcftools sort -Ou -T ./bcftools-sort.XXXXXX | \
bcftools norm --no-version -Ob -o NAME.vcf -c x -f "human_g1k_v37.fasta" && \
bcftools index -f NAME.vcf

I was wondering: if I want VCF output instead, do I need to change lines 2, 8 and 9 to "-Ov", "-Ov" and "-Oz", respectively? As I understand it, "-Ov" and "-Oz" are for VCF, whereas "-Ou" and "-Ob" are for BCF.

If this is correct, It would look like this:

bcftools +affy2vcf \
--no-version -Ov \
--csv "GenomeWideSNP_6.na35.annot.csv" \
--fasta-ref "human_g1k_v37.fasta" \
--chps /home/user/project/cc-chp/NAME \
--snp /home/user/project/AxiomGT1.snp-posteriors.txt \
--extra NAME.tsv | \
bcftools sort -Ov -T ./bcftools-sort.XXXXXX | \
bcftools norm --no-version -Oz -o NAME.vcf -c x -f "human_g1k_v37.fasta" && \
bcftools index -f NAME.vcf

When I run it in this way, I have the VCF file in the end, but also I have this message:

index: "NAME.vcf" is in a format that cannot be usefully indexed

I just want to know whether the change is correct and, if it is, whether there is any way to index the file usefully.
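
For context: bcftools index works on BCF and on BGZF-compressed VCF, not on plain-text VCF, and compressed VCF files are conventionally named .vcf.gz. The sketch below only checks whether a file starts with the gzip magic bytes (1f 8b). It cannot tell BGZF from ordinary gzip, and `is_gzip` is a hypothetical helper.

```shell
# Succeeds when the first two bytes of the file are the gzip magic number 1f 8b.
is_gzip() {
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d '[:space:]')
    [ "$magic" = "1f8b" ]
}
```

If `is_gzip NAME.vcf` fails, the file is uncompressed text and cannot be indexed as is. With -Oz, writing the output to NAME.vcf.gz and indexing that file is the usual pattern.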

idat to vcf conversion

Hello, I have a list of idat files. I can read them in R using https://github.com/HenrikBengtsson/illuminaio
But how can I convert them into a vcf file? If I use +gtc2vcf plugin as follows:
bcftools +gtc2vcf -c /shire/databases/InfiniumOmni2-5-8v1-5_A1.csv -f /shire/databases/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna -i EGAF00000868323.idat -o test.vcf
I am getting this:
IDAT file only allowed when converting to CSV
Any help? Best, Zillur

Problem compile affy2vcf.c

Hello,

I just tried to compile bcftools with your new plugin and got this error:

gcc -fPIC -shared -g -Wall -O2 -I. -I../htslib    -o plugins/affy2vcf.so version.c plugins/affy2vcf.c 
In file included from plugins/affy2vcf.c:39:0:
plugins/gtc2vcf.h: In function ‘flank_reverse_complement’:
plugins/gtc2vcf.h:186:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (size_t i = 0; i < len / 2; i++) {
  ^
plugins/gtc2vcf.h:186:2: note: use option -std=c99 or -std=gnu99 to compile your code
plugins/gtc2vcf.h: In function ‘flank_left_shift’:
plugins/gtc2vcf.h:215:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (const char *ptr = middle + 2; ptr < right; ptr++)
  ^
plugins/gtc2vcf.h: In function ‘get_position’:
plugins/gtc2vcf.h:306:4: error: ‘for’ loop initial declarations are only allowed in C99 mode
    for (int k = 0; k < n_cigar && qlen > 1; k++) {
    ^
plugins/affy2vcf.c: In function ‘read_bytes’:
plugins/affy2vcf.c:79:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (int i = 0; i < nbytes; i++)
   ^
plugins/affy2vcf.c: In function ‘read_string16’:
plugins/affy2vcf.c:132:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (int i = 0; i < len; i++) {
   ^
plugins/affy2vcf.c: In function ‘xda_cel_print’:
plugins/affy2vcf.c:298:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (int i = 0; i < xda_cel->num_cells; i++)
   ^
plugins/affy2vcf.c:308:12: error: redefinition of ‘i’
   for (int i = 0; i < xda_cel->num_masked_cells; i++)
            ^
plugins/affy2vcf.c:298:12: note: previous definition of ‘i’ was here
   for (int i = 0; i < xda_cel->num_cells; i++)
            ^
plugins/affy2vcf.c:308:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (int i = 0; i < xda_cel->num_masked_cells; i++)
   ^
plugins/affy2vcf.c:317:12: error: redefinition of ‘i’
   for (int i = 0; i < xda_cel->num_outlier_cells; i++)
            ^
plugins/affy2vcf.c:308:12: note: previous definition of ‘i’ was here
   for (int i = 0; i < xda_cel->num_masked_cells; i++)
            ^
plugins/affy2vcf.c:317:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (int i = 0; i < xda_cel->num_outlier_cells; i++)
   ^
plugins/affy2vcf.c: In function ‘agcc_read_data_header’:
plugins/affy2vcf.c:459:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < data_header->n_parameters; i++)
  ^
plugins/affy2vcf.c:465:11: error: redefinition of ‘i’
  for (int i = 0; i < data_header->n_parents; i++)
           ^
plugins/affy2vcf.c:459:11: note: previous definition of ‘i’ was here
  for (int i = 0; i < data_header->n_parameters; i++)
           ^
plugins/affy2vcf.c:465:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < data_header->n_parents; i++)
  ^
plugins/affy2vcf.c: In function ‘agcc_read_data_set’:
plugins/affy2vcf.c:477:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < data_set->n_parameters; i++)
  ^
plugins/affy2vcf.c:482:11: error: redefinition of ‘i’
  for (int i = 0; i < data_set->n_cols; i++) {
           ^
plugins/affy2vcf.c:477:11: note: previous definition of ‘i’ was here
  for (int i = 0; i < data_set->n_parameters; i++)
           ^
plugins/affy2vcf.c:482:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < data_set->n_cols; i++) {
  ^
plugins/affy2vcf.c:492:11: error: redefinition of ‘i’
  for (int i = 0; i < data_set->n_cols; i++) {
           ^
plugins/affy2vcf.c:482:11: note: previous definition of ‘i’ was here
  for (int i = 0; i < data_set->n_cols; i++) {
           ^
plugins/affy2vcf.c:492:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < data_set->n_cols; i++) {
  ^
plugins/affy2vcf.c: In function ‘agcc_read_data_group’:
plugins/affy2vcf.c:514:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < data_group->num_data_sets; i++)
  ^
plugins/affy2vcf.c: In function ‘agcc_init’:
plugins/affy2vcf.c:548:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < agcc->num_data_groups; i++)
  ^
plugins/affy2vcf.c: In function ‘agcc_destroy_parameters’:
plugins/affy2vcf.c:576:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < n_parameters; i++) {
  ^
plugins/affy2vcf.c: In function ‘agcc_destroy_data_header’:
plugins/affy2vcf.c:591:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < data_header->n_parents; i++)
  ^
plugins/affy2vcf.c: In function ‘agcc_destroy_data_set’:
plugins/affy2vcf.c:600:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < data_set->n_cols; i++)
  ^
plugins/affy2vcf.c: In function ‘agcc_destroy_data_group’:
plugins/affy2vcf.c:610:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < data_group->num_data_sets; i++)
  ^
plugins/affy2vcf.c: In function ‘agcc_destroy’:
plugins/affy2vcf.c:623:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < agcc->num_data_groups; i++)
  ^
plugins/affy2vcf.c: In function ‘agcc_print_parameters’:
plugins/affy2vcf.c:639:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < n_parameters; i++) {
  ^
plugins/affy2vcf.c:674:4: error: ‘for’ loop initial declarations are only allowed in C99 mode
    for (int j = 0; j < parameters[i].n_value / 2; j++)
    ^
plugins/affy2vcf.c: In function ‘agcc_print_data_header’:
plugins/affy2vcf.c:694:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < data_header->n_parents; i++)
  ^
plugins/affy2vcf.c: In function ‘agcc_print_data_set’:
plugins/affy2vcf.c:731:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < data_set->n_cols; i++)
  ^
plugins/affy2vcf.c:749:11: error: redefinition of ‘i’
  for (int i = 0; i < data_set->n_cols; i++) {
           ^
plugins/affy2vcf.c:731:11: note: previous definition of ‘i’ was here
  for (int i = 0; i < data_set->n_cols; i++)
           ^
plugins/affy2vcf.c:749:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < data_set->n_cols; i++) {
  ^
plugins/affy2vcf.c:774:11: error: redefinition of ‘i’
  for (int i = 0; i < data_set->n_rows; i++) {
           ^
plugins/affy2vcf.c:749:11: note: previous definition of ‘i’ was here
  for (int i = 0; i < data_set->n_cols; i++) {
           ^
plugins/affy2vcf.c:774:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < data_set->n_rows; i++) {
  ^
plugins/affy2vcf.c:776:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (int j = 0; j < data_set->n_cols; j++) {
   ^
plugins/affy2vcf.c: In function ‘agcc_print_data_group’:
plugins/affy2vcf.c:788:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < data_group->num_data_sets; i++)
  ^
plugins/affy2vcf.c: In function ‘agcc_print’:
plugins/affy2vcf.c:799:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < agcc->num_data_groups; i++)
  ^
plugins/affy2vcf.c: In function ‘agccs_to_tsv’:
plugins/affy2vcf.c:826:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int j = 0; j < 20; j++)
  ^
plugins/affy2vcf.c:829:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < n; i++) {
  ^
plugins/affy2vcf.c:833:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (int j = 0, k = 0; j < 20; j++) {
   ^
plugins/affy2vcf.c: In function ‘cels_to_tsv’:
plugins/affy2vcf.c:976:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < n; i++) {
  ^
plugins/affy2vcf.c:1004:4: error: ‘for’ loop initial declarations are only allowed in C99 mode
    for (int k = 0; k < data_header->parameters[j].n_value / 2; k++)
    ^
plugins/affy2vcf.c: In function ‘models_init’:
plugins/affy2vcf.c:1119:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < 2; i++) {
  ^
plugins/affy2vcf.c: In function ‘models_destroy’:
plugins/affy2vcf.c:1225:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < 2; i++) {
  ^
plugins/affy2vcf.c:1227:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (int j = 0; j < models->n_snps[i]; j++)
   ^
plugins/affy2vcf.c: In function ‘annot_init’:
plugins/affy2vcf.c:1316:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < ncols; i++) {
  ^
plugins/affy2vcf.c:1421:5: error: ‘for’ loop initial declarations are only allowed in C99 mode
     for (int i = 1; i < ncols; i++) {
     ^
plugins/affy2vcf.c: In function ‘annot_destroy’:
plugins/affy2vcf.c:1538:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < annot->n_records; i++) {
  ^
plugins/affy2vcf.c: In function ‘report_destroy’:
plugins/affy2vcf.c:1594:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < report->n_samples; i++)
  ^
plugins/affy2vcf.c: In function ‘varitr_init_cc’:
plugins/affy2vcf.c:1645:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < n; i++) {
  ^
plugins/affy2vcf.c: In function ‘varitr_init_txt’:
plugins/affy2vcf.c:1700:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (int i = 1; i < ncols; i++) {
   ^
plugins/affy2vcf.c:1716:4: error: ‘for’ loop initial declarations are only allowed in C99 mode
    for (int i = 1; i < ncols; i++) {
    ^
plugins/affy2vcf.c:1733:4: error: ‘for’ loop initial declarations are only allowed in C99 mode
    for (int i = 1; i < ncols; i++) {
    ^
plugins/affy2vcf.c: In function ‘varitr_loop’:
plugins/affy2vcf.c:1782:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (int i = 0; i < varitr->nsmpl; i++) {
   ^
plugins/affy2vcf.c:1839:4: error: ‘for’ loop initial declarations are only allowed in C99 mode
    for (int i = 1; i < 1 + varitr->nsmpl; i++)
    ^
plugins/affy2vcf.c:1852:4: error: ‘for’ loop initial declarations are only allowed in C99 mode
    for (int i = 1; i < 1 + varitr->nsmpl; i++)
    ^
plugins/affy2vcf.c:1885:4: error: ‘for’ loop initial declarations are only allowed in C99 mode
    for (int i = 1; i < 1 + varitr->nsmpl; i++)
    ^
plugins/affy2vcf.c:1895:13: error: redefinition of ‘i’
    for (int i = 1; i < 1 + varitr->nsmpl; i++) {
             ^
plugins/affy2vcf.c:1885:13: note: previous definition of ‘i’ was here
    for (int i = 1; i < 1 + varitr->nsmpl; i++)
             ^
plugins/affy2vcf.c:1895:4: error: ‘for’ loop initial declarations are only allowed in C99 mode
    for (int i = 1; i < 1 + varitr->nsmpl; i++) {
    ^
plugins/affy2vcf.c: In function ‘hdr_init’:
plugins/affy2vcf.c:1949:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < n; i++) {
  ^
plugins/affy2vcf.c: In function ‘adjust_clusters’:
plugins/affy2vcf.c:2139:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < n; i++) {
  ^
plugins/affy2vcf.c: In function ‘compute_baf_lrr’:
plugins/affy2vcf.c:2228:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < n; i++) {
  ^
plugins/affy2vcf.c: In function ‘process’:
plugins/affy2vcf.c:2340:5: error: ‘for’ loop initial declarations are only allowed in C99 mode
     for (int i = 0; i < nsmpl; i++) {
     ^
plugins/affy2vcf.c:2389:4: error: ‘for’ loop initial declarations are only allowed in C99 mode
    for (int i = 0; i < 2; i++) {
    ^
plugins/affy2vcf.c: In function ‘run’:
plugins/affy2vcf.c:2708:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (int i = 0; i < report->n_samples; i++) {
   ^
plugins/affy2vcf.c:2729:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < nfiles; i++) {
  ^
plugins/affy2vcf.c:2825:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (int i = 0; i < nfiles; i++)
   ^
plugins/affy2vcf.c:2829:11: error: redefinition of ‘i’
  for (int i = 0; i < nfiles; i++) {
           ^
plugins/affy2vcf.c:2729:11: note: previous definition of ‘i’ was here
  for (int i = 0; i < nfiles; i++) {
           ^
plugins/affy2vcf.c:2829:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = 0; i < nfiles; i++) {

Any suggestions for solving this compilation problem? I have CentOS 7, and compilation of all other plugins works perfectly fine.

Best,

Petr.
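
The compiler note in the log already points at a fix: this GCC defaults to a pre-C99 dialect, and -std=gnu99 (or -std=c99) enables declarations inside the for statement. How to pass the flag to the bcftools build depends on its Makefile, so treat something like make CFLAGS="-g -Wall -O2 -std=gnu99" as an assumption to verify. The sketch below only confirms that the flag makes a C99-style loop compile; `c99_loop_compiles` is a hypothetical helper and `cc` is assumed to be on the PATH.

```shell
# Compile a tiny program that declares its loop variable inside the for statement.
c99_loop_compiles() {
    dir=$(mktemp -d) || return 1
    cat > "$dir/loop.c" <<'EOF'
#include <stdio.h>
int main(void) {
    int sum = 0;
    for (int i = 0; i < 4; i++)   /* declaration inside for: needs C99 */
        sum += i;
    printf("%d\n", sum);
    return 0;
}
EOF
    if cc -std=gnu99 -o "$dir/loop" "$dir/loop.c"; then
        status=0
    else
        status=1
    fi
    rm -rf "$dir"
    return $status
}

# Only run the check when a C compiler is available.
if command -v cc >/dev/null 2>&1; then
    c99_loop_compiles && echo "C99 loop compiles with -std=gnu99"
fi
```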

Error: Too many open files

Hi,

I am currently using gtc2vcf to convert ~1800 GTC files into a single BCF file, and I got the error below. However, when I tried a small set (about 20 samples) that included the sample named in the error, 9479477122_R04C01.gtc, the pipeline worked and produced a correct BCF file. Is this a memory problem? Could you please help me figure this out? Thank you very much for your help!

"Could not open 9479477122_R04C01.gtc: Too many open files
[E::bcf_hdr_read] Input is not detected as bcf or vcf format
Could not read VCF/BCF headers from -
Cleaning
Failed to read from standard input: unknown file type"

Best regards,
Qidi
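
"Too many open files" comes from the operating system's per-process limit on open file descriptors, not from memory, which would explain why 20 samples work and ~1800 do not. If the limit cannot be raised with ulimit -n, one workaround (an assumption, not something confirmed in this thread) is to convert the GTC files in batches and combine the per-batch outputs afterwards, for example with bcftools merge. The sketch covers only the batching step; `split_gtc_list` and the file names are illustrative.

```shell
# Split a text file with one GTC path per line into chunks of at most batch_size lines.
# Output files are named <prefix>aa, <prefix>ab, ... by split(1).
split_gtc_list() {
    list=$1
    batch_size=$2
    prefix=$3
    split -l "$batch_size" "$list" "$prefix"
}

# Example (hypothetical file names):
#   split_gtc_list gtc_list.txt 500 gtc_batch_
```

Each resulting list could then be passed to the plugin through --gtcs, which accepts a file listing GTC files.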

Possible to extract SNP table metrics?

Thank you for developing this tool! The one single Windows dependency we have is in running GenomeStudio, and getting rid of this is a huge help.

I am wondering if it would be possible to extract SNP table metrics using this tool. For instance, we often need to extract e.g. log R ratio and B allele frequencies when using PennCNV (http://penncnv.openbioinformatics.org/en/latest/user-guide/input/), among other minor interactions with GenomeStudio. Would it be possible to extract these starting from IDAT files without ever having to interact with GenomeStudio?

Thanks again for your work!!

can't find file to patch at input line 3

Hi!
While following the README for your fantastic tool, I got stuck at this step and do not know how to proceed. In fact, I skipped to the next step (compile htslib and bcftools...). In the end I can use the IDAT to GTC converter for Illumina, but I want to run the whole tool.
Could you please help me with this?

I paste the error and some additional information below.

Add patch (to allow the fixref plugin to flip BAF values) and code for plugins

/bin/rm -f bcftools/plugins/{gtc2vcf.c,affy2vcf.c,fixref.patch}
wget -P bcftools/plugins https://raw.githubusercontent.com/freeseek/gtc2vcf/master/{gtc2vcf.c,affy2vcf.c,fixref.patch}
cd bcftools/plugins && patch < fixref.patch && cd ../..

the error is:
can't find file to patch at input line 3
Perhaps you should have used the -p or --strip option?
The text leading up to this was:

|--- fixref.c 2018-09-05 12:00:00.000000000 -0500
|+++ fixref.c 2018-09-05 12:00:00.000000000 -0500

File to patch:

Could not initialize gtc2vcf.so, neither run or init found

Hello, I am trying to use the "Convert Illumina GTC files to VCF" example shown in the README, but I am getting this error:

Writing to .
Could not initialize gtc2vcf.so, neither run or init found
[E::bcf_hdr_read] Input is not detected as bcf or vcf format
Failed to open -: unknown file type
Failed to open -: unknown file type

Looking in the file, there is a run function defined, but no init function, and bcftools vcfplugin.c appears to be checking for both.

I am using bcftools version 1.9. Any idea what could be causing this?

plink matrix format

Hi, I have data from dbGaP that is in 'plink matrix format'. Can I use this tool? If not, what is the best way to prepare this data for MoCHa?

GTC files cannot be listed through both command interface and file list when only submitting a .txt file

Hi,

I am getting the error message "GTC files cannot be listed through both command interface and file list" even though I am only submitting a single .txt file with a list of the gtc file names. I have tried this where the actual gtc files are in the directory where I am running the script, and also where they are in their own directory. I am running on a google cloud instance and using a singularity container. Here is the code, and I have attached the gtc_list file.

`bpm_manifest_file="./GDA_PGx-8v1-0_20042614_A2.bpm"
csv_manifest_file="./ProjectDetailReport ILMN GDA 07-11-22 AMS1.csv"
egt_cluster_file="./GDA FINAL 3 plate validation reclustered 06302022.egt"
path_to_gtc_folder="./gtc_file_list.csv"
ref="./GRCh38_full_analysis_set_plus_decoy_hla.fa" # or ref="$HOME/GRCh37/human_g1k_v37.fasta"
out_prefix="206486390022"

singularity exec gtc2vcf_072922.sif bcftools +gtc2vcf \
--no-version -Ou \
--bpm $bpm_manifest_file \
--csv $csv_manifest_file \
--egt $egt_cluster_file \
--gtcs ./gtc_list_file.txt \
--fasta-ref $ref \
--output $out_prefix.vcf \
--output-type v \
--extra $out_prefix.tsv \
--verbose
`

Thank you
Harry
gtc_list_file.txt
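
One possible explanation, not confirmed in this thread: the manifest and cluster file names contain spaces, and the variables are expanded without quotes. The shell then splits each name into several words, and the extra words may be taken as GTC files given on the command line in addition to the --gtcs list. The sketch shows the splitting itself; `count_args` is a hypothetical helper that just reports how many arguments it received.

```shell
# Print the number of arguments received.
count_args() {
    echo "$#"
}

csv_manifest_file="./ProjectDetailReport ILMN GDA 07-11-22 AMS1.csv"

count_args --csv $csv_manifest_file     # unquoted: prints 6 (the name splits into 5 words)
count_args --csv "$csv_manifest_file"   # quoted: prints 2
```

Quoting every expansion, as in --csv "$csv_manifest_file" and --egt "$egt_cluster_file", keeps each file name as a single argument.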

How to install wine64 without "sudo"?

Hi,

Can you tell me if there is a way to install wine64 without using the sudo command? I don't have sudo rights on the cluster that I use.

Thanks.

Citation

Hi Giulio,

I'm trying to cite this tool in my manuscripts but I did not find a related paper. Could you please share a citation format?

Great thanks!

Xiaotong

idat or gtc in command line

I would like to use IDAT files rather than GTC files. Do you have an example command line?
bcftools +gtc2vcf -Ou --bpm .bpm --egt egt --idat filelink --fasta-ref fasta --extra gtc2vcf_idat".tsv" --output gtc2vcf_idat".vcf.gz" --threads 35 --output-type z
where filelink lists each IDAT file.
The error that I get is:
The --idat option can only be used alone or with option --gtcs
Could you explain in more detail how to use IDAT files with gtc2vcf? Which algorithms are involved, and what is the benefit?
thank you

VCF lines lacking GT tag

Dear freeseek,

I installed the gtc2vcf plugin yesterday in docker:
https://gitlab.com/intelliseq/workflows/-/blob/BIOINFO-998-genotype-source/src/main/docker/task/task_gtc-to-vcf/Dockerfile
(the reference is added later).
The plugin works without raising any error, but some vcf lines don't have the GT tag:

chr1    30345446        22:24375752_CNV_GSTT1   A       C       .       .       GC=0.4625;ALLELE_A=0;ALLELE_B=1;FRAC_A=0.360656;FRAC_C=0.262295;FRAC_G=0.229508;FRAC_T=0.147541;NORM_ID=1;BEADSET_ID=1705;INTENSITY_ONLY;ASSAY_TYPE=0;GenTrain_Score=0;Orig_Score=0.68275;Cluster_Sep=0.948275;N_AA=1236;N_AB=0;N_BB=0;devR_AA=0.30742;devR_AB=0.39422;devR_BB=0.20131;devTHETA_AA=0.0121041;devTHETA_AB=0.0223607;devTHETA_BB=0.0223607;meanR_AA=2.73401;meanR_AB=3.4935;meanR_BB=2.30665;meanTHETA_AA=0.130089;meanTHETA_AB=0.554171;meanTHETA_BB=0.978252;Intensity_Threshold=0.05     GQ:IGC:BAF:LRR:NORMX:NORMY:R:THETA:X:Y   0:0:0.0246714:-0.31685:1.76758:0.427339:2.19492:0.151014:32616:2228     0:0:0.0246714:-0.31685:1.76758:0.427339:2.19492:0.151014:32616:2228
chr1    109685814       1:110228436_CNV_GSTM1   T       C       .       .       GC=0.385;ALLELE_A=0;ALLELE_B=1;FRAC_A=0.147541;FRAC_C=0;FRAC_G=0.180328;FRAC_T=0.672131;NORM_ID=0;BEADSET_ID=1625;INTENSITY_ONLY;ASSAY_TYPE=0;GenTrain_Score=0;Orig_Score=0.376871;Cluster_Sep=0.173743;N_AA=0;N_AB=0;N_BB=1239;devR_AA=0.1;devR_AB=0.1;devR_BB=0.1;devTHETA_AA=0.0223607;devTHETA_AB=0.0223607;devTHETA_BB=0.140788;meanR_AA=0.17845;meanR_AB=0.194985;meanR_BB=0.207459;meanTHETA_AA=0.0145364;meanTHETA_AB=0.297995;meanTHETA_BB=0.581454;Intensity_Threshold=0.05     GQ:IGC:BAF:LRR:NORMX:NORMY:R:THETA:X:Y  0:0:1.28708:-0.170678:0.0549625:0.129349:0.184312:0.744207:1017:614      0:0:1.28708:-0.170678:0.0549625:0.129349:0.184312:0.744207:1017:614

The program is run with this wdl:
https://gitlab.com/intelliseq/workflows/-/blob/dev/src/main/wdl/tasks/gtc-to-vcf/gtc-to-vcf.wdl

Is it intentional? This has not happened with the previous installation (bcftools11-54-gaf54707, htslib1.11-74-gb8dcbd1
and gtc2vcf cloned on 2021-01-20).
Best,
Kasia

How to get Call_Freq, AAfreq, BBfreq, and AB Freq

Hi, thanks for the great tool. I have a question: I want to remove some poor-quality SNPs from the VCF file. The filtering should be based on the following thresholds:

"Call Freq" < 0.97
"AA Freq" = 1 AND "AA T Mean" > 0.3
"BB Freq" = 1 AND "BB T Dev" > 0.06

I can see that AA T Mean and BB T Dev are in the VCF file, but I am unable to find Call Freq, AA Freq, BB Freq and AB Freq.
Please let me know how I can get these values.
I look forward to your reply.
Thanks
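
For illustration only, here is a minimal Python sketch of how such frequencies could be derived from the GT values of one VCF record. It is not part of gtc2vcf. The function name `genotype_frequencies` is hypothetical, and it assumes that AA/AB/BB frequencies are taken over called genotypes only, which may differ from how GenomeStudio defines them. The `allele_a` and `allele_b` arguments correspond to the ALLELE_A and ALLELE_B INFO fields seen in the VCF output.

```python
def genotype_frequencies(gts, allele_a=0, allele_b=1):
    """Return call/AA/AB/BB frequencies for one variant.

    gts      -- list of GT strings such as "0/0", "0/1", "./."
    allele_a -- VCF allele index of Illumina allele A (INFO/ALLELE_A)
    allele_b -- VCF allele index of Illumina allele B (INFO/ALLELE_B)

    Assumption: AA/AB/BB frequencies use called genotypes as denominator.
    """
    counts = {"AA": 0, "AB": 0, "BB": 0}
    letter = {str(allele_a): "A", str(allele_b): "B"}
    total = 0
    for gt in gts:
        total += 1
        alleles = gt.replace("|", "/").split("/")
        # Skip missing, haploid, or unexpected allele codes (counted as no-call)
        if len(alleles) != 2 or any(a not in letter for a in alleles):
            continue
        key = "".join(sorted(letter[a] for a in alleles))
        counts[key] += 1
    called = sum(counts.values())
    result = {"call_freq": called / total if total else 0.0}
    for key, n in counts.items():
        result[key.lower() + "_freq"] = n / called if called else 0.0
    return result


freqs = genotype_frequencies(["0/0", "0/1", "1/1", "./.", "0/0"])
print(freqs)
```

A variant could then be flagged when, for example, `freqs["call_freq"] < 0.97`. The meanTHETA_AA and devTHETA_BB values would still have to be read from the INFO field of the same record.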

Can't open .xcl.bcf file

Hello,

I got the .xcl.bcf file after this step:

/bcftools annotate --no-version -Ob -o $pfx.unphased.bcf -x ID,QUAL,INFO,^FMT/GT,^FMT/BAF,^FMT/LRR $pfx.vcf &&
/bcftools index -f $pfx.unphased.bcf

n=$(/bcftools query -l $pfx.unphased.bcf|wc -l);
ns=$((n*98/100));
echo '##INFO=<ID=JK,Number=1,Type=Float,Description="Jukes Cantor">' |
/bcftools annotate --no-version -Ou -a $dup -c CHROM,FROM,TO,JK -h /dev/stdin $pfx.unphased.bcf |
/bcftools +/fill-tags.so --no-version -Ou -- -t NS,ExcHet |
bcftools +mochatools.so --no-version -Ou -- -x $sex -G |
bcftools annotate --no-version -Ob -o $pfx.xcl.bcf \
-i 'FILTER!="." && FILTER!="PASS" || JK<.02 || NS<'$ns' || ExcHet<1e-6 || AC_Sex_Test>6' \
-x FILTER,^INFO/JK,^INFO/NS,^INFO/ExcHet,^INFO/AC_Sex_Test &&
bcftools index -f $pfx.xcl.bcf

Then, when I ran eagle:

for chr in {1..22} X; do
eagle \
--geneticMapFile $map \
--chrom $chr \
--outPrefix $pfx.chr$chr \
--numThreads 4 \
--vcfRef $kgp_pfx${chr}$kgp_sfx.bcf \
--vcfTarget $pfx.unphased.bcf \
--vcfOutFormat b \
--noImpMissing \
--outputUnphased \
--vcfExclude $dir/$pfx.xcl.bcf && bcftools index -f $pfx.chr$chr.bcf
done

I get the following error: ERROR: Could not open X.xcl.bcf for reading: unknown file type.

I have full permissions on the file. I am not sure whether this is a problem with eagle or with how the file was generated.

Can you please help me with this?

Thank you.
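
As a possible first diagnostic for an "unknown file type" message, here is a minimal Python sketch that checks whether a file looks like a BCF or VCF at all, for example to catch an empty file left behind by a failed pipeline step. The function name `sniff_variant_file` is hypothetical. The check relies on compressed BCF being BGZF (gzip-compatible) data whose decompressed content starts with the bytes `BCF\2`. It only inspects the first bytes and does not validate the rest of the file.

```python
import gzip


def sniff_variant_file(path):
    """Guess the type of a variant file from its leading bytes."""
    with open(path, "rb") as fh:
        head = fh.read(16)
    compressed = head.startswith(b"\x1f\x8b")  # gzip/BGZF magic number
    if compressed:
        try:
            with gzip.open(path, "rb") as gz:
                head = gz.read(16)
        except (OSError, EOFError):
            # Corrupt or truncated compressed stream
            return "unknown"
    if head.startswith(b"BCF\x02"):
        return "bcf" if compressed else "bcf (uncompressed)"
    if head.startswith(b"##fileformat=VCF"):
        return "vcf.gz" if compressed else "vcf"
    # Empty files and anything else end up here
    return "unknown"
```

Calling `sniff_variant_file("X.xcl.bcf")` on the file from the error message would return `"unknown"` for an empty or non-BCF file, which would point to the file-generation step rather than to eagle.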

IDAT to GTC not working

Hello, I tried to obtain GTC files from IDAT files using the command line from the tutorial:

mono $HOME/bin/autoconvert/AutoConvert.exe $path_to_idat_folder $path_to_output_folder $manifest_file $egt_file

Unfortunately, the process gives me the normalization error you mentioned. I used a custom cluster file and a custom manifest file with a ".csv" extension. Could this be the cause of the error? I have to use custom EGT and CSV or BPM files because some SNPs were added. Is there a solution to this issue?

Thank you

gtc2vcf's Issues

./bcftools +gtc2vcf.so --no-version -o --genome-studio /Users/vikrants/Desktop/testvcf/ILHC24-12806_FinalReport.txt -f /Users/vikrants/res/hg38.fa

Add patch (to allow the fixref plugin to flip BAF values) and code for plugins

the error is:
can't find file to patch at input line 3
Perhaps you should have used the -p or --strip option?
The text leading up to this was:

|--- fixref.c 2018-09-05 12:00:00.000000000 -0500
|+++ fixref.c 2018-09-05 12:00:00.000000000 -0500