broadinstitute / seqr-loading-pipelines

Hail-based pipelines for annotating variant callsets and exporting them to Elasticsearch

License: MIT License

Python 98.84% Dockerfile 0.50% Shell 0.66%


seqr-loading-pipelines's People

Contributors

averywpx, barslmn, bpblanken, bw2, ch-kr, cseed, dependabot[bot], gtiao, hanars, harindra-a, illusional, karynne7, knguyen142, mattsolo1, mike-w-wilson, nawatts, nlsvtn, nvnieuwk, shifasz, tommyli, williamphu


seqr-loading-pipelines's Issues

samples_num_alt_1 does not exist in the Elasticsearch index

I used the v0.1 version of load_dataset_to_es to load the VCF into Elasticsearch. When I tried to add the dataset to seqr via the web interface, seqr expected a field "samples_num_alt_1" that doesn't exist in the index schema. I couldn't find this specific field in _mapping/variant; I do see per-sample sample_id_num_alt fields, e.g. S00001_CIDR_num_alt. I have the latest code from hail-elasticsearch-pipelines. Could you point out what I might be missing? Thanks!

combined_reference_data_grch37.ht on Google Cloud - metadata.json.gz is absent

The input reference files need to be in the .ht (Hail Table) format. There are three such files: clinvar, hgmd, and the combined reference data. These are available as VCF files, so looking through the code I found this script:

https://github.com/macarthur-lab/hail-elasticsearch-pipelines/blob/master/hail_scripts/v02/convert_vcf_to_hail.py

which I used to generate all three .ht files from VCFs. It ran successfully; however, when I run the pipeline now, it reports that metadata.json.gz is missing for combined_reference_data_grch37. Maybe I just need to use a different script to generate combined_reference_data_grch37:

https://github.com/macarthur-lab/hail-elasticsearch-pipelines/blob/master/download_and_create_reference_datasets/v02/hail_scripts/write_combined_reference_data_ht.py

I am trying to work out how to use it, but that makes things more complicated since I already have the VCF and do not need to generate it. Is there a way to generate this metadata.json.gz and fix the issue?
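
For reference, a minimal sketch of the VCF-to-Hail-Table conversion under discussion, assuming Hail 0.2 and placeholder file names (not the repository's exact script). metadata.json.gz is written as part of ht.write(), so its absence usually points at an interrupted or different write path rather than a missing extra step:

```python
# A hedged sketch, assuming Hail 0.2 and illustrative file names: a Hail Table is a
# directory, and a complete ht.write() should always produce metadata.json.gz inside it.
import hail as hl

hl.init(default_reference='GRCh37')

mt = hl.import_vcf('combined_reference_grch37.vcf.bgz', force_bgz=True, min_partitions=100)
ht = mt.rows()  # reference data only needs the variant-level (row) fields
ht.write('combined_reference_data_grch37.ht', overwrite=True)
```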

Seqr loading command does not recognize the --subset-samples argument

The seqr loading command is not recognizing the --subset-samples argument:

bash-3.2$ python gcloud_dataproc/load_GRCh38_dataset.py --host 10.56.10.4 --subset-samples gs://seqr-datasets/GRCh38/Rare_Genomes_Project/RGP_ids.txt --project-guid Rare_Genomes_Project --sample-type WGS --dataset-type GATK_VARIANTS --remap-sample-ids gs://seqr-datasets/GRCh38/Rare_Genomes_Project/remap_ids.txt gs://seqr-datasets/GRCh38/RDG_WGS_Broad_Internal/v1/RDG_WGS_Broad_Internal.vcf.gz
(re)download fam file from seqr? [Y/n]

According to the repo, this should only come up when a fam file or a subset file is not supplied.

Thanks!

gcloud_dataproc scripts failing due to missing gcloud resource

Our import pipeline started failing today because it refers to a resource in google cloud storage that no longer seems to be available:

ERROR: (gcloud.beta.dataproc.clusters.create) INVALID_ARGUMENT: Google Cloud Storage object does not exist 'gs://hail-common/vep/vep/GRCh37/vep85-GRCh37-init.sh'.

I can see that it truly doesn't exist any more using gsutil. In fact, the parent folders several levels up have been removed, e.g.:

gsutil ls gs://hail-common/vep/
CommandException: One or more URLs matched no objects.

However, the content is still referenced by the scripts in this repo, e.g.:

https://github.com/macarthur-lab/hail-elasticsearch-pipelines/blob/master/gcloud_dataproc/v01/create_cluster_GRCh37.py

Any guidance on whether this content was intentionally removed (if so, is there another location we can point to) or if it can be restored?

Launch Luigi pipeline on Spark

Hi,

I need to submit the Luigi pipeline to a Spark cluster. I do not see 'PySparkTask' defined or inherited anywhere, so does that mean there is no way to start Luigi on Spark, or do I just need to write a simple wrapper that runs the command

LUIGI_CONFIG_PATH=configs/seqr-loading-local.cfg python seqr_loading.py SeqrMTToESTask --local-scheduler

specified in the README.md? How would you recommend starting it on a local Spark cluster (not Google Cloud)?
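
A hedged sketch of the "simple wrapper" idea, under the assumptions that it is run from the luigi_pipeline directory and that the task's required parameters come from the cfg file; it simply drives the same Luigi task from Python and lets Hail create its own local SparkContext:

```python
# A hedged sketch, not an official entry point: programmatically launch the documented
# Luigi task; Hail will spin up a local Spark backend when the task initialises it.
import os

os.environ['LUIGI_CONFIG_PATH'] = 'configs/seqr-loading-local.cfg'

import luigi
from seqr_loading import SeqrMTToESTask  # module from luigi_pipeline/ in this repo

if __name__ == '__main__':
    luigi.build([SeqrMTToESTask()], local_scheduler=True, workers=1)
```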

hail export VCF errors

Hi guys,

In the export-VCF section of the v01 script 'load_dataset_to_es.py' in hail_scripts, shown below,

if args.export_vcf:
    logger.info("Writing out to VCF...")
    vds.export_vcf(args.step0_output_vcf, overwrite=True)

the overwrite parameter throws an error because the argument does not exist. Here is my output:

Stage 4:====================================================>(2956 + 1) / 2957]2020-07-26 12:36:59,314 INFO Writing out to VCF...
Traceback (most recent call last):
File "/home/ttysam01/Seqr_install/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 900, in
run_pipeline()
File "/home/ttysam01/Seqr_install/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 867, in run_pipeline
hc, vds = step0_init_and_run_vep(hc, vds, args)
File "/home/ttysam01/Seqr_install/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 162, in wrapper
result = f(*args, **kwargs)
File "/home/ttysam01/Seqr_install/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 439, in step0_init_and_run_vep
vds.export_vcf(args.step0_output_vcf, overwrite=True)
TypeError: export_vcf() got an unexpected keyword argument 'overwrite'

Is it okay if I just remove the parameter?
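
For completeness, a hedged sketch of the workaround implied by the traceback (Hail 0.1's export_vcf has no overwrite keyword): drop the argument and delete any previous output manually before rerunning.

```python
# A hedged sketch, assuming Hail 0.1 where export_vcf() takes no `overwrite` keyword;
# any existing output at args.step0_output_vcf should be removed before rerunning.
if args.export_vcf:
    logger.info("Writing out to VCF...")
    vds.export_vcf(args.step0_output_vcf)
```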

Error exporting a Hail table to Elasticsearch

I've tried to export the Hail table to Elasticsearch with the following parameters and hit an error. If I remove the "raise NotImplementedError" in the code, I get a different set of errors (run.log).

/root/export_ht_to_es.py --ht-url 's3://mybucket/gnomad.exomes.r2.1.sites.ht' --host 'XX.XX.XX.XX' --port XXXX --index-name 'data' --index-type 'variant' --num-shards 10 --es-block-size 5000

Traceback (most recent call last):
File "/root/export_ht_to_es.py", line 55, in
main(args)
File "/root/export_ht_to_es.py", line 41, in main
verbose=True
File "/root/hail_scripts/v02/utils/elasticsearch_client.py", line 143, in export_table_to_elasticsearch
disable_index_for_fields=disable_index_for_fields,
File "/root/hail_scripts/v02/utils/elasticsearch_utils.py", line 51, in elasticsearch_schema_for_table
properties = _elasticsearch_mapping_for_type(table.row_value.dtype)["properties"]
File "/root/hail_scripts/v02/utils/elasticsearch_utils.py", line 23, in _elasticsearch_mapping_for_type
return {"properties": {field: _elasticsearch_mapping_for_type(dtype[field]) for field in dtype.fields}}
File "/root/hail_scripts/v02/utils/elasticsearch_utils.py", line 23, in
return {"properties": {field: _elasticsearch_mapping_for_type(dtype[field]) for field in dtype.fields}}
File "/root/hail_scripts/v02/utils/elasticsearch_utils.py", line 25, in _elasticsearch_mapping_for_type
element_mapping = _elasticsearch_mapping_for_type(dtype.element_type)
File "/root/hail_scripts/v02/utils/elasticsearch_utils.py", line 23, in _elasticsearch_mapping_for_type
return {"properties": {field: _elasticsearch_mapping_for_type(dtype[field]) for field in dtype.fields}}
File "/root/hail_scripts/v02/utils/elasticsearch_utils.py", line 23, in
return {"properties": {field: _elasticsearch_mapping_for_type(dtype[field]) for field in dtype.fields}}
File "/root/hail_scripts/v02/utils/elasticsearch_utils.py", line 31, in _elasticsearch_mapping_for_type
raise NotImplementedError
NotImplementedError

Thanks and Best,
Jack
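
A small debugging sketch that may help narrow this down (the .ht path is the one from the command above; this is not part of the pipeline): the NotImplementedError is raised for a Hail type the Elasticsearch schema mapper does not handle, so listing the row field types shows which field is responsible.

```python
# A hedged debugging sketch: print each row field and its Hail dtype so the type that
# trips _elasticsearch_mapping_for_type can be identified.
import hail as hl

ht = hl.read_table('s3://mybucket/gnomad.exomes.r2.1.sites.ht')
dtype = ht.row_value.dtype
for field in dtype.fields:
    print(field, dtype[field])
```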

Update ES index in luigi pipeline instead of recreating it

Hi,

In the 0.1 pipeline there was a script called update_reference_data_in_existing_index.py which allowed us to update an already existing index with additional annotations. How would you recommend doing this in the 0.2 version? I do not see any scripts there for that purpose, so does that mean we just need to write our own based on the 0.1 pipeline?

Regards.

luigi v02 pipeline: GC spark out-of-memory error. Need way to set spark params like driver & executor memory.

Is there a way to adjust spark params like
https://github.com/macarthur-lab/hail-elasticsearch-pipelines/blob/master/gcloud_dataproc/submit.py#L13-L17
?

Currently, I am getting GC out-of-memory errors when trying to load the test dataset locally:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/luigi/worker.py", line 199, in run
    new_deps = self._run_get_new_deps()
  File "/usr/local/lib/python3.6/dist-packages/luigi/worker.py", line 141, in _run_get_new_deps
    task_gen = self.task.run()
  File "seqr_loading.py", line 51, in run
    mt.write(self.output().path, stage_locally=True)
  File "</usr/local/lib/python3.6/dist-packages/decorator.py:decorator-gen-1008>", line 2, in write
  File "/usr/local/lib/python3.6/dist-packages/hail/typecheck/check.py", line 585, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/usr/local/lib/python3.6/dist-packages/hail/matrixtable.py", line 2500, in write
    Env.backend().execute(MatrixWrite(self._mir, writer))
  File "/usr/local/lib/python3.6/dist-packages/hail/backend/backend.py", line 108, in execute
    result = json.loads(Env.hc()._jhc.backend().executeJSON(self._to_java_ir(ir)))
  File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python3.6/dist-packages/hail/utils/java.py", line 221, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: OutOfMemoryError: GC overhead limit exceeded

Java stack trace:
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.json4s.JsonAST$JObject$.apply(JsonAST.scala:158)
	at org.json4s.MonadicJValue.org$json4s$MonadicJValue$$rec$4(MonadicJValue.scala:158)
	at org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$rec$4$1.apply(MonadicJValue.scala:158)
	at org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$rec$4$1.apply(MonadicJValue.scala:158)
	at scala.collection.immutable.List.map(List.scala:284)
	at org.json4s.MonadicJValue.org$json4s$MonadicJValue$$rec$4(MonadicJValue.scala:158)
	at org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$rec$4$2.apply(MonadicJValue.scala:159)
	at org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$rec$4$2.apply(MonadicJValue.scala:159)
	at scala.collection.immutable.List.map(List.scala:288)
	at org.json4s.MonadicJValue.org$json4s$MonadicJValue$$rec$4(MonadicJValue.scala:159)
	at org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$rec$4$1.apply(MonadicJValue.scala:158)
	at org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$rec$4$1.apply(MonadicJValue.scala:158)
	at scala.collection.immutable.List.map(List.scala:288)
	at org.json4s.MonadicJValue.org$json4s$MonadicJValue$$rec$4(MonadicJValue.scala:158)
	at org.json4s.MonadicJValue.mapField(MonadicJValue.scala:162)
	at org.json4s.MonadicJValue.transformField(MonadicJValue.scala:175)
	at is.hail.rvd.AbstractRVDSpec$.read(AbstractRVDSpec.scala:34)
	at is.hail.variant.RVDComponentSpec.rvdSpec(MatrixTable.scala:96)
	at is.hail.variant.RVDComponentSpec.read(MatrixTable.scala:108)
	at is.hail.expr.ir.TableNativeReader.apply(TableIR.scala:170)
	at is.hail.expr.ir.TableRead.execute(TableIR.scala:294)
	at is.hail.expr.ir.TableLeftJoinRightDistinct.execute(TableIR.scala:897)
	at is.hail.expr.ir.TableMapRows.execute(TableIR.scala:921)
	at is.hail.expr.ir.Interpret$.apply(Interpret.scala:768)
	at is.hail.expr.ir.Interpret$.apply(Interpret.scala:90)
	at is.hail.expr.ir.CompileAndEvaluate$$anonfun$1.apply(CompileAndEvaluate.scala:33)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:24)
	at is.hail.expr.ir.CompileAndEvaluate$.apply(CompileAndEvaluate.scala:33)
	at is.hail.backend.Backend$$anonfun$execute$1.apply(Backend.scala:86)
	at is.hail.backend.Backend$$anonfun$execute$1.apply(Backend.scala:86)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:8)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:7)



Hail version: 0.2.18-08ec699f0fd4
Error summary: OutOfMemoryError: GC overhead limit exceeded
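
One hedged way to set driver and executor memory for a local run (assuming the pipeline is using a pip-installed Hail that creates its own local SparkContext via pyspark; the memory values are examples only):

```python
# A hedged sketch: PYSPARK_SUBMIT_ARGS is read by pyspark when it launches the JVM
# gateway, so it must be exported before Hail/Spark is initialised.
import os

os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 8g --executor-memory 8g pyspark-shell'

import hail as hl

hl.init()  # the backing local Spark JVM now starts with the settings above
```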

Add contig check to loading pipeline

In the past year, the MacArthur lab has twice been delivered a VCF that was missing at least one chromosome. We need to add a contig check to the seqr loading pipeline, likely in the VCF-to-MT task, to prevent us from loading these in.
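
A hedged sketch of what such a check could look like (not the repository's implementation; the reference genome, file name, and contig list are assumptions for illustration):

```python
# A hedged sketch of a contig check: compare the contigs present in the callset against
# the primary contigs expected for the reference genome, and fail loudly if any are missing.
import hail as hl

mt = hl.import_vcf('callset.vcf.bgz', reference_genome='GRCh38', force_bgz=True)
present = mt.aggregate_rows(hl.agg.collect_as_set(mt.locus.contig))
expected = {'chr%s' % c for c in list(range(1, 23)) + ['X', 'Y']}
missing = expected - present
if missing:
    raise ValueError('Callset is missing contigs: %s' % sorted(missing))
```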

Running section in Readme needs editing

export_callset_to_ES.py will try to VEP annotate if vds/vcf is unannotated

...
if not any(field.name == "vep" for field in vds.variant_schema.fields):
    vep_output_path = args.dataset_path.replace(".vds", "").replace(".vcf.gz", "").replace(".vcf.bgz", "") + ".vep.vds"
    if not inputs_older_than_outputs([args.dataset_path], [vep_output_path]):
        vds = vds.vep(config="/vep/vep-gcloud.properties", root='va.vep', block_size=1000)  #, csq=True)
        vds.write(vep_output_path, overwrite=True)
...

If someone runs this on a Dataproc cluster created without the VEP option, it will error. Also, I'm not sure the check is working correctly, as annotation was attempted on my already-annotated VDS input.

VEP run from within the pipeline fails for GRCh38 and VEP99

I can run VEP successfully locally with the very same command that fails from within my current Hail 0.2 pipeline:

/vep/ensembl-tools-release-99/vep -i batch1.vcf --format vcf --json --everything --allele_number --no_stats --cache --offline --minimal --verbose --assembly GRCh38 --dir_cache /var/lib/spark/vep/vep_cache --fasta /vep/homo_sapiens/GRCh38/hg38.fa --plugin LoF,human_ancestor_fa:/vep/loftee_data_grch38/human_ancestor.fa.gz,filter_position:0.05,min_intron_size:15,conservation_file:/vep/loftee_data_grch38/loftee.sql,loftee_path:/vep/loftee_grch38,gerp_bigwig:/vep/loftee_data_grch38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,run_splice_predictions:0 --dir_plugins /vep/loftee_grch38 -o vep_output

The output from the pipeline is very uninformative: when mt.write() happens, VEP fails, the command (the same as above but without -i batch1.vcf and with -o STDOUT) is printed, and the error code is 2. Here is the output:

ERROR: [pid 4129] Worker Worker(...)
Traceback (most recent call last):
  File "/.conda/envs/py37/lib/python3.7/site-packages/luigi/worker.py", line 199, in run
    new_deps = self._run_get_new_deps()
  File "/.conda/envs/py37/lib/python3.7/site-packages/luigi/worker.py", line 141, in _run_get_new_deps
    task_gen = self.task.run()
  File "/hail-elasticsearch-pipelines/luigi_pipeline/seqr_loading.py", line 54, in run
    self.read_vcf_write_mt()
  File "/hail-elasticsearch-pipelines/luigi_pipeline/seqr_loading.py", line 90, in read_vcf_write_mt
    mt.write(self.output().path, overwrite=True)
  File "</.conda/envs/py37/lib/python3.7/site-packages/decorator.py:decorator-gen-1036>", line 2, in write
  File "/hail-elasticsearch-pipelines/hail_builds/v02/hail-0.2-3a68be23cb82d7c7fb5bf72668edcd1edf12822e.zip/hail/typecheck/check.py", line 585, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/hail-elasticsearch-pipelines/hail_builds/v02/hail-0.2-3a68be23cb82d7c7fb5bf72668edcd1edf12822e.zip/hail/matrixtable.py", line 2508, in write
    Env.backend().execute(MatrixWrite(self._mir, writer))
  File "/hail-elasticsearch-pipelines/hail_builds/v02/hail-0.2-3a68be23cb82d7c7fb5bf72668edcd1edf12822e.zip/hail/backend/backend.py", line 109, in execute
    result = json.loads(Env.hc()._jhc.backend().executeJSON(self._to_java_ir(ir)))
  File "/spark/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/hail-elasticsearch-pipelines/hail_builds/v02/hail-0.2-3a68be23cb82d7c7fb5bf72668edcd1edf12822e.zip/hail/utils/java.py", line 225, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: HailException: VEP command '/vep/ensembl-tools-release-99/vep --format vcf --json --everything --allele_number --no_stats --cache --offline --minimal --verbose --assembly GRCh38 --dir_cache /vep/vep_cache --fasta /vep/homo_sapiens/GRCh38/hg38.fa --plugin LoF,human_ancestor_fa:/vep/loftee_data_grch38/human_ancestor.fa.gz,filter_position:0.05,min_intron_size:15,conservation_file:/vep/loftee_data_grch38/loftee.sql,loftee_path:/vep/loftee_grch38,gerp_bigwig:/vep/loftee_data_grch38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,run_splice_predictions:0 --dir_plugins /vep/loftee_grch38 -o STDOUT' failed with non-zero exit status 2
  VEP Error output:


Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 7.0 failed 4
 times, most recent failure: Lost task 2.3 in stage 7.0 (TID 3017, 137.187.60.61, executor 2): 
is.hail.utils.HailException: VEP command '/vep/ensembl-tools-release-99/vep --format vcf --
json --everything --allele_number --no_stats --cache --offline --minimal --verbose --assembly 
GRCh38 --dir_cache /vep/vep_cache --fasta /vep/homo_sapiens/GRCh38/hg38.fa --plugin 
LoF,human_ancestor_fa:/vep/loftee_data_grch38/human_ancestor.fa.gz,filter_position:0.05,min_i
ntron_size:15,conservation_file:/vep/loftee_data_grch38/loftee.sql,loftee_path:/vep/loftee_grch3
8,gerp_bigwig:/vep/loftee_data_grch38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,ru
n_splice_predictions:0 --dir_plugins /vep/loftee_grch38 -o STDOUT' failed with non-zero exit 
status 2

Exit code 2 may be related to permissions, but I changed the permissions of all VEP folders, subfolders, and files to 777. I checked the Hadoop file permissions too, and they are fine. I am not even sure how to proceed with debugging this. Could you suggest anything I could look into further?

I checked Hadoop with hdfs dfsadmin -report: there is 200 GB available (DFS Remaining%: 20.37%), but cache usage is 100%. I am not sure whether that is OK, but Hadoop otherwise seems to be working well.

I also asked on the Hail forum, but they are not sure either:

https://discuss.hail.is/t/no-vep-debug-output/1302/6

I am using the new VEP 99.

Error when uploading multiple VCF files together for a family set | HailException: arguments refer to no files

Hi all,

I have successfully uploaded individual VCF samples to our local seqr installation, but Hail fails when I try to upload multiple VCF files together (all in the same directory). If I upload them individually, only the child sample is accepted; seqr won't accept the indexes for the parent VCFs, saying it expects to see the others there already.

When I attempt to upload multiple VCFs for the family set together, I get the error below.

Here is my bash script (it has worked for single samples):
#!/bin/bash

GENOME_VERSION="37" # should be "37" or "38" SAMPLE_TYPE="WES" # can be "WES" or "WGS" DATASET_TYPE="VARIANTS" # can be "VARIANTS" (for GATK VCFs) or "SV" (for Manta VCFs) PROJECT_GUID="R0001_project1" # should match the ID in the url of the project page INPUT_VCF="file1.vcf.gz,file2.vcf.gz,file3.vcf.gz"

python2.7 gcloud_dataproc/submit.py --run-locally --driver-memory 40G --executor-memory 40G hail_scripts/v01/load_dataset_to_es.py --spark-home $SPARK_HOME --genome-version $GENOME_VERSION --project-guid $PROJECT_GUID --sample-type $SAMPLE_TYPE --dataset-type $DATASET_TYPE --skip-validation --exclude-hgmd --vep-block-size 100 --es-block-size 10 --num-shards 1 --hail-version 0.1 --output-vds GEN14 --use-nested-objects-for-vep --use-nested-objects-for-genotypes --cpu-limit 8 $INPUT_VCF

It won't let me run it without supplying --output-vds to name the whole set with a prefix.
I was also told to try using a wildcard (file*.vcf.gz), but it seems to use only the first matching file.

Without the wildcard I get this error:

[thasan@seqr03 hail_elasticsearch_pipelines]$ ./multi-upload.sh /usr/local/seqr/seqr/../bin/spark-2.0.2-bin-hadoop2.7/bin/spark-submit --master local[8] --driver-memory 40G --executor-memory 40G --num-executors 10 --conf spark.driver.extraJavaOptions=-Xss4M --conf spark.executor.extraJavaOptions=-Xss4M --conf spark.executor.memoryOverhead=5g --conf spark.driver.maxResultSize=30g --conf spark.kryoserializer.buffer.max=1g --conf spark.memory.fraction=0.1 --conf spark.default.parallelism=1 --jars hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --conf spark.driver.extraClassPath=hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --conf spark.executor.extraClassPath=hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --py-files hail_builds/v01/hail-v01-10-8-2018-90c855449.zip "hail_scripts/v01/load_dataset_to_es.py" "--genome-version" "37" "--project-guid" "R0001_project1" "--sample-type" "WES" "--dataset-type" "VARIANTS" "--skip-validation" "--exclude-hgmd" "--vep-block-size" "100" "--es-block-size" "10" "--num-shards" "1" "--output-vds" "GEN14" "--use-nested-objects-for-vep" "--use-nested-objects-for-genotypes" "file1.vcf.gz,file2.vcf.gz,file3.vcf.gz" --username 'thasan' --directory 'seqr03.nygenome.org:/usr/local/seqr/seqr/hail_elasticsearch_pipelines'

DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support Requirement already satisfied: elasticsearch in /usr/lib/python2.7/site-packages (7.0.4) Requirement already satisfied: urllib3>=1.21.1 in /usr/lib/python2.7/site-packages (from elasticsearch) (1.25.3) WARNING: You are using pip version 19.2.3, however version 19.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command. 2019-10-24 13:25:44,312 INFO Index name: r0001_project1__wes__grch37__variants__20191024 2019-10-24 13:25:44,312 INFO Command args: /usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py --index r0001_project1__wes__grch37__variants__20191024 2019-10-24 13:25:44,315 INFO Parsed args: {'cpu_limit': None, 'create_snapshot': False, 'dataset_type': 'VARIANTS', 'directory': 'seqr03.nygenome.org:/usr/local/seqr/seqr/hail_elasticsearch_pipelines', 'discard_missing_genotypes': False, 'dont_delete_intermediate_vds_files': False, 'dont_update_operations_log': False, 'es_block_size': 10, 'exclude_1kg': False, 'exclude_cadd': False, 'exclude_clinvar': False, 'exclude_dbnsfp': False, 'exclude_eigen': False, 'exclude_exac': False, 'exclude_gene_constraint': False, 'exclude_gnomad': False, 'exclude_gnomad_coverage': False, 'exclude_hgmd': True, 'exclude_mpc': False, 'exclude_omim': False, 'exclude_primate_ai': False, 'exclude_splice_ai': False, 'exclude_topmed': False, 'exclude_vcf_info_field': False, 'export_vcf': False, 'fam_file': None, 'family_id': None, 'filter_interval': '1-MT', 'genome_version': '37', 'host': 'localhost', 'ignore_extra_sample_ids_in_tables': False, 'ignore_extra_sample_ids_in_vds': False, 'index': 'r0001_project1__wes__grch37__variants__20191024', 'individual_id': None, 'input_dataset': 'file1.vcf.gz,file2.vcf.gz,file3.vcf.gz', 'max_samples_per_index': 250, 'not_gatk_genotypes': False, 'num_shards': 1, 'only_export_to_elasticsearch_at_the_end': False, 'output_vds': 'GEN14', 'port': '9200', 'project_guid': 'R0001_project1', 'remap_sample_ids': None, 'sample_type': 'WES', 'skip_annotations': False, 'skip_validation': True, 'skip_vep': False, 'skip_writing_intermediate_vds': False, 'start_with_sample_group': 0, 'start_with_step': 0, 'stop_after_step': 1000, 'subset_samples': None, 'use_child_docs_for_genotypes': False, 'use_nested_objects_for_genotypes': True, 'use_nested_objects_for_vep': True, 'use_temp_loading_nodes': False, 'username': 'thasan', 'vep_block_size': 100}
2019-10-24 13:25:44,315 INFO ==> create HailContext Running on Apache Spark version 2.0.2 SparkUI available at http://10.1.27.167:4040 Welcome to __ __ <>__ / /_/ /__ __/ / / __ / _ / / / // //_,//_/ version 0.1-105a497 2019-10-24 13:25:46,449 INFO is_running_locally = True 2019-10-24 13:25:46,449 INFO

=============================== pipeline - step 0 - run vep =============================== 2019-10-24 13:25:46,449
INFO ==> import: file1.vcf.gz,file2.vcf.gz,file3.vcf.gz 2019-10-24 13:25:46 Hail: WARN: file1.vcf.gz,file2.vcf.gz,file3.vcf.gz' refers to no files Traceback (most recent call last): File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 900, in run_pipeline() File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 867, in run_pipeline hc, vds = step0_init_and_run_vep(hc, vds, args) File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 162, in wrapper result = f(*args, **kwargs) File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 419, in step0_init_and_run_vep not_gatk_genotypes=args.not_gatk_genotypes, File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/utils/vds_utils.py", line 67, in read_in_dataset vds = hc.import_vcf(input_path, force_bgz=True, min_partitions=10000, generic=not_gatk_genotypes) File "", line 2, in import_vcf File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_builds/v01/hail-v01-10-8-2018-90c855449.zip/hail/java.py", line 121, in handle_py4j hail.java.FatalError: HailException: arguments refer to no files

Java stack trace: is.hail.utils.HailException: arguments refer to no files at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:6) at is.hail.utils.package$.fatal(package.scala:27) at is.hail.io.vcf.LoadVCF$.globAllVCFs(LoadVCF.scala:105) at is.hail.HailContext.importVCFs(HailContext.scala:544) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:748)

Hail version: 0.1-105a497 Error summary: HailException: arguments refer to no files Traceback (most recent call last): File "gcloud_dataproc/submit.py", line 99, in subprocess.check_call(command, shell=True) File "/usr/lib64/python2.7/subprocess.py", line 542, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '/usr/local/seqr/seqr/../bin/spark-2.0.2-bin-hadoop2.7/bin/spark-submit --master local[8] --driver-memory 40G --executor-memory 40G --num-executors 10 --conf spark.driver.extraJavaOptions=-Xss4M --conf spark.executor.extraJavaOptions=-Xss4M --conf spark.executor.memoryOverhead=5g --conf spark.driver.maxResultSize=30g --conf spark.kryoserializer.buffer.max=1g --conf spark.memory.fraction=0.1 --conf spark.default.parallelism=1 --jars hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --conf spark.driver.extraClassPath=hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --conf spark.executor.extraClassPath=hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --py-files hail_builds/v01/hail-v01-10-8-2018-90c855449.zip "hail_scripts/v01/load_dataset_to_es.py" "--genome-version" "37" "--project-guid" "R0001_project1" "--sample-type" "WES" "--dataset-type" "VARIANTS" "--skip-validation" "--exclude-hgmd" "--vep-block-size" "100" "--es-block-size" "10" "--num-shards" "1" "--output-vds" "GEN14" "--use-nested-objects-for-vep" "--use-nested-objects-for-genotypes" "file1.vcf.gz,file2.vcf.gz,file3.vcf.gz" --username 'thasan' --directory 'seqr03.nygenome.org:/usr/local/seqr/seqr/hail_elasticsearch_pipelines' ' returned non-zero exit status 1
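
From the log above, the comma-separated input value reaches hc.import_vcf as a single path string, which matches no file. A hedged sketch of a possible workaround inside the v01 script (assuming Hail 0.1's import_vcf accepts a list of paths; this is not a change that exists in the repository):

```python
# A hedged sketch, not the repo's fix: split the comma-separated --input argument into a
# list of paths before handing it to Hail, instead of passing the raw string as one "file".
input_paths = [p.strip() for p in args.input_dataset.split(',') if p.strip()]
vds = hc.import_vcf(input_paths, force_bgz=True, min_partitions=10000)
```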

Execution of the Luigi pipeline on AWS shows excessively high memory usage

Hi,

While running the pipeline on AWS on a dataset of size 2 GB, one of the Spark collect() actions requires 1.4 TiB, which does not seem right. I just ran the Luigi pipeline as-is, without any modifications. Here is the Stack Overflow thread:

https://stackoverflow.com/questions/60528553/why-aws-input-parameter-in-application-history-of-emr-is-so-big-for-spark-ap

I suspect that some part of the code needs to be optimized, either in the Luigi pipeline or in native Hail code, but I do not know how to trace Spark memory utilization down to specific lines of Hail code. How would you go about that?

Issue in load_dataset_to_es in v0.1

I got a generic index-out-of-bounds error in step2_export_to_elasticsearch, as below:
Caused by: java.lang.IndexOutOfBoundsException: 0
File "/opt/seqr/hail-elasticsearch-pipelines/hail_scripts/v01/utils/elasticsearch_client.py", line 373, in export_kt_to_elasticsearch
kt.to_dataframe().show(n=5)

I debugged it a little and found that this line returns an empty KeyTable. Digging further, I found that domains = va.domains[va.aIndex-1] in site_fields_list prevents the application from generating the KeyTable, so the DataFrame is empty as well. I had to modify the list to skip this field, and then the data loaded successfully. I believe split_multi() has already been run, since I saw the message "Hail: WARN: called redundant split on an already split VDS" elsewhere.
Any thoughts on this issue?

luigi v02 pipeline: make HGMD path optional

Currently, if hgmd_ht_path isn't set, or is set to an empty string, I get errors like:

ERROR: [pid 12013] Worker Worker(salt=203839206, workers=1, host=seqr-install-ubuntu, username=weisburd, pid=12013) failed    SeqrVCFToMTTask(source_paths=["./tests/data/1kg_30variants.vcf.bgz"], dest_path=test.mt, genome_version=37, vep_runner=VEP, reference_ht_path=gs://seqr-reference-data/GRCh37/all_reference_data/combined_reference_data_grch37.ht, clinvar_ht_path=gs://seqr-reference-data/GRCh37/clinvar/clinvar.GRCh37.ht, hgmd_ht_path=, sample_type=WGS, validate=False, dataset_type=VARIANTS, remap_path=./tests/data/remap_testing.tsv, subset_path=./tests/data/subset_testing.tsv)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/luigi/worker.py", line 199, in run
    new_deps = self._run_get_new_deps()
  File "/usr/local/lib/python3.6/dist-packages/luigi/worker.py", line 141, in _run_get_new_deps
    task_gen = self.task.run()
  File "seqr_loading.py", line 45, in run
    hgmd = hl.read_table(self.hgmd_ht_path)
  File "</usr/local/lib/python3.6/dist-packages/decorator.py:decorator-gen-1204>", line 2, in read_table
  File "/usr/local/lib/python3.6/dist-packages/hail/typecheck/check.py", line 585, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/usr/local/lib/python3.6/dist-packages/hail/methods/impex.py", line 2196, in read_table
    return Table(TableRead(tr, False))
  File "/usr/local/lib/python3.6/dist-packages/hail/table.py", line 336, in __init__
    self._type = self._tir.typ
  File "/usr/local/lib/python3.6/dist-packages/hail/ir/base_ir.py", line 142, in typ
    self._compute_type()
  File "/usr/local/lib/python3.6/dist-packages/hail/ir/table_ir.py", line 201, in _compute_type
    self._type = Env.backend().table_type(self)
  File "/usr/local/lib/python3.6/dist-packages/hail/backend/backend.py", line 119, in table_type
    jir = self._to_java_ir(tir)
  File "/usr/local/lib/python3.6/dist-packages/hail/backend/backend.py", line 104, in _to_java_ir
    ir._jir = ir.parse(r(ir), ir_map=r.jirs)
  File "/usr/local/lib/python3.6/dist-packages/hail/ir/base_ir.py", line 147, in parse
    return Env.hail().expr.ir.IRParser.parse_table_ir(code, ref_map, ir_map)
  File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python3.6/dist-packages/hail/utils/java.py", line 221, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: IllegalArgumentException: Can not create a Path from an empty string

Java stack trace:
java.lang.IllegalArgumentException: Can not create a Path from an empty string
	at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126)
	at org.apache.hadoop.fs.Path.<init>(Path.java:134)
	at is.hail.io.fs.HadoopFS.is$hail$io$fs$HadoopFS$$_fileSystem(HadoopFS.scala:156)
	at is.hail.io.fs.HadoopFS.isDir(HadoopFS.scala:173)
	at is.hail.variant.RelationalSpec$.read(MatrixTable.scala:39)
	at is.hail.expr.ir.TableNativeReader.spec$lzycompute(TableIR.scala:138)
	at is.hail.expr.ir.TableNativeReader.spec(TableIR.scala:135)
	at is.hail.expr.ir.TableNativeReader.fullType(TableIR.scala:145)
	at is.hail.expr.ir.IRParser$$anonfun$table_ir_1$2.apply(Parser.scala:1024)
	at is.hail.expr.ir.IRParser$$anonfun$table_ir_1$2.apply(Parser.scala:1024)
	at scala.Option.getOrElse(Option.scala:121)
	at is.hail.expr.ir.IRParser$.table_ir_1(Parser.scala:1024)
	at is.hail.expr.ir.IRParser$.table_ir(Parser.scala:998)
	at is.hail.expr.ir.IRParser$$anonfun$parse_table_ir$2.apply(Parser.scala:1395)
	at is.hail.expr.ir.IRParser$$anonfun$parse_table_ir$2.apply(Parser.scala:1395)
	at is.hail.expr.ir.IRParser$.parse(Parser.scala:1384)
	at is.hail.expr.ir.IRParser$.parse_table_ir(Parser.scala:1395)
	at is.hail.expr.ir.IRParser$.parse_table_ir(Parser.scala:1394)
	at is.hail.expr.ir.IRParser.parse_table_ir(Parser.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)



Hail version: 0.2.18-08ec699f0fd4
Error summary: IllegalArgumentException: Can not create a Path from an empty string
DEBUG: 1 running tasks, waiting for next task to finish
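
A hedged sketch of the requested behaviour (not the repository's actual change; it assumes the task structure shown in the log above): skip reading and joining the HGMD table when no path is configured.

```python
# A hedged sketch of making hgmd_ht_path optional: only read and join the HGMD table
# when a non-empty path is configured.
hgmd = hl.read_table(self.hgmd_ht_path) if self.hgmd_ht_path else None

if hgmd is not None:
    mt = mt.annotate_rows(hgmd=hgmd[mt.row_key])
```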

base_mt_schema.py, line 61 purpose

Hi,

I am not completely sure, since I am relatively new to Python, but I am wondering about the file

https://github.com/macarthur-lab/hail-elasticsearch-pipelines/blob/master/luigi_pipeline/lib/model/base_mt_schema.py

where line 61 does this:

getattr(self, fn_require.__name__)()

It seems the return value is not saved anywhere, so is this a bug, or how does the return value get saved?

For instance, if you pass 'fn_require=genotypes' like you do on line 204 in this file:

https://github.com/macarthur-lab/hail-elasticsearch-pipelines/blob/master/luigi_pipeline/lib/model/seqr_mt_schema.py

Then getattr(self, fn_require.__name__)() just returns:

hl.agg.collect(hl.struct(**self._genotype_fields()))

Which is an array.

Remapping & subsetting doesn't work when sample IDs are integers

Traceback (most recent call last):
  File "/tmp/45994490-c6e6-467a-a534-f6344d6c0896/entire_vds_pipeline.py", line 458, in <module>
    raise ValueError(warning_message)
ValueError: Found only 0 out of 27 samples specified for subsetting
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [45994490-c6e6-467a-a534-f6344d6c0896] entered state [ERROR] while waiting for [DONE].
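
A hedged sketch of a workaround consistent with the symptom (written in Hail 0.2 syntax, not the repository's fix; the file and field names are placeholders, and mt is assumed to be an already-imported MatrixTable): coerce the IDs from the subset table to strings so they match the callset's string sample IDs.

```python
# A hedged sketch: force subset-table IDs to strings so that numeric-looking sample IDs
# still match the (string) sample IDs in the MatrixTable.
import hail as hl

subset = hl.import_table('subset_ids.txt', no_header=True)
subset = subset.key_by(s=hl.str(subset.f0))
mt = mt.filter_cols(hl.is_defined(subset[mt.s]))
```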

New VEP: where to get the reference files

Hi,

I see that for the new 0.2 pipeline we have a different VEP config file:

https://storage.googleapis.com/seqr-hail/vep/vep85-loftee-gcloud.json

It references a new required reference file that was not present in the previous pipeline version: /root/.vep/loftee_data/GERP_scores.final.sorted.txt.gz

Where should I download this file? I found one here, but I can't download it for some reason (either the speed is too slow, or after finishing an 11 GB download it turns out to be a 60 KB text file):

https://personal.broadinstitute.org/konradk/loftee_data/GRCh37/

Error when trying to load a VCF into ES

I came across your scripts while trying to import my VCF files into a local ES instance. I am trying to use "hail_scripts/v01/load_dataset_to_es.py", but it crashes regardless of the parameters I use.

The error seems to be related to the ES type-mapping code. I have tried both Python 3 and 2.7 with identical results. Have you seen this bug before? Am I doing something wrong?

# elasticsearch field types for arrays are the same as for simple types:
for vds_type, es_type in VDS_TO_ES_TYPE_MAPPING.items():
    VDS_TO_ES_TYPE_MAPPING.update({"Array[%s]" % vds_type: es_type})
    VDS_TO_ES_TYPE_MAPPING.update({"Set[%s]" % vds_type: es_type})
RuntimeError: dictionary changed size during iteration
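
For what it's worth, a hedged sketch of a common fix for this particular RuntimeError: iterate over a snapshot of the dictionary so it can safely be extended inside the loop.

```python
# A hedged sketch of a fix: take a snapshot of the items before mutating the dict, so
# adding the Array[...] / Set[...] entries does not break the iteration.
for vds_type, es_type in list(VDS_TO_ES_TYPE_MAPPING.items()):
    VDS_TO_ES_TYPE_MAPPING["Array[%s]" % vds_type] = es_type
    VDS_TO_ES_TYPE_MAPPING["Set[%s]" % vds_type] = es_type
```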

Error in load_clinvar_to_es in v0.1

File "/opt/seqr/hail-elasticsearch-pipelines/hail_scripts/v01/load_clinvar_to_es.py", line 100, in
summary = vds.summarize()
File "", line 2, in summarize
File "/opt/seqr/hail-elasticsearch-pipelines/hail_builds/v01/hail-v01-10-8-2018-90c855449.zip/hail/java.py", line 121, in handle_py4j
hail.java.FatalError: NumberFormatException: For input string: "NW_003315947.1"

I did some searching; is this related to the VCF header?
Thanks

luigi.parameter._FrozenOrderedDict param issue

Hi,

While I was running your pipeline recently I faced an error on line 54 of the file:

https://github.com/macarthur-lab/hail-elasticsearch-pipelines/blob/master/luigi_pipeline/lib/hail_tasks.py

luigi.parameter._FrozenOrderedDict was giving an error:

AttributeError: module 'luigi.parameter' has no attribute '_FrozenOrderedDict'

I checked the luigi.parameter master branch and I do not see _FrozenOrderedDict there. If we substitute FrozenOrderedDict without the underscore, it works, but I am not sure whether that is safe. Do you know why this is happening? It may be a bug.
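
If useful, a hedged compatibility sketch based on the observation above (the name lost its leading underscore in newer Luigi releases):

```python
# A hedged compatibility sketch: prefer the newer public name and fall back to the old
# private one, so hail_tasks.py works with both older and newer Luigi releases.
try:
    from luigi.parameter import FrozenOrderedDict
except ImportError:
    from luigi.parameter import _FrozenOrderedDict as FrozenOrderedDict
```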

Export to ES on step2 throws HTTP_EXCEPTION 'Root mapping definition has unsupported parameters'

I've tried running this submit script a few different ways but ultimately run into this error each time. I've even tried redeploying seqr/Elasticsearch and Hail, but the setup is still iffy using the two-step installer scripts. The main error seems to be that ES returns a transport error with code 400, but I can't discern why that might be.

The output of step 2 is below:
2019-08-08 08:18:27,278 INFO ==> creating elasticsearch index r0001_project1__wes__grch37__variants__20190807 2019-08-08 08:18:27,294 WARNING PUT http://localhost:9200/r0001_project1__wes__grch37__variants__20190807 [status:400 request:0.015s] Traceback (most recent call last): File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 900, in <module> run_pipeline() File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 869, in run_pipeline hc, vds = step2_export_to_elasticsearch(hc, vds, args) File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 162, in wrapper result = f(*args, **kwargs) File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 653, in step2_export_to_elasticsearch run_after_index_exists=(lambda: route_index_to_temp_es_cluster(True, args)) if args.use_temp_loading_nodes else None, File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 398, in export_to_elasticsearch force_merge=force_merge, File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/utils/elasticsearch_client.py", line 207, in export_vds_to_elasticsearch verbose=verbose) File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/utils/elasticsearch_client.py", line 366, in export_kt_to_elasticsearch _meta=_meta) File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/shared/elasticsearch_client.py", line 142, in create_or_update_mapping self.es.indices.create(index=index_name, body=elasticsearch_mapping) File "/usr/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 76, in _wrapped return func(*args, params=params, **kwargs) File "/usr/lib/python2.7/site-packages/elasticsearch/client/indices.py", line 91, in create params=params, body=body) File "/usr/lib/python2.7/site-packages/elasticsearch/transport.py", line 314, in perform_request status, headers_response, data = connection.perform_request(method, url, params, body, headers=headers, ignore=ignore, timeout=timeout) File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 180, in perform_request self._raise_error(response.status, raw_data) File "/usr/lib/python2.7/site-packages/elasticsearch/connection/base.py", line 125, in _raise_error raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info) elasticsearch.exceptions.RequestError: TransportError(400, u'mapper_parsing_exception', u'Root mapping definition has unsupported parameters: [variant : {_meta={gencodeVersion=19, genomeVersion=37, datasetType=VARIANTS, sampleType=WES, sourceFilePath=/usr/local/seqr/seqr/hail_elasticsearch_pipelines/GEN18-61P-1-D1.germline.vep.vds}, _all={enabled=false}, properties={samples_ab_10_to_15={eager_global_ordinals=true, type=keyword}, samples_gq_55_to_60={eager_global_ordinals=true, type=keyword}, codingGeneIds={type=keyword}, originalAltAlleles={type=keyword}, samples_ab_0_to_5={eager_global_ordinals=true, type=keyword}, mainTranscript_gene_symbol={type=keyword}, mainTranscript_lof_filter={type=keyword}, samples_num_alt_2={eager_global_ordinals=true, type=keyword}, samples_num_alt_1={eager_global_ordinals=true, type=keyword}, mainTranscript_hgvsp={type=keyword}, ref={type=keyword}, samples_gq_90_to_95={eager_global_ordinals=true, type=keyword}, mainTranscript_hgvs={type=keyword}, variantId={type=keyword}, 
mainTranscript_cdna_end={type=integer}, samples_ab_25_to_30={eager_global_ordinals=true, type=keyword}, samples_gq_45_to_50={eager_global_ordinals=true, type=keyword}, samples_gq_30_to_35={eager_global_ordinals=true, type=keyword}, samples_gq_50_to_55={eager_global_ordinals=true, type=keyword}, mainTranscript_amino_acids={type=keyword}, samples_ab_15_to_20={eager_global_ordinals=true, type=keyword}, sortedTranscriptConsequences={type=nested, properties={amino_acids={type=keyword}, biotype={type=keyword}, lof={type=keyword}, lof_flags={type=keyword}, major_consequence_rank={type=integer}, codons={type=keyword}, gene_symbol={type=keyword}, domains={type=keyword}, canonical={type=integer}, transcript_rank={type=integer}, lof_filter={type=keyword}, hgvs={type=keyword}, cdna_end={type=integer}, hgvsc={type=keyword}, cdna_start={type=integer}, transcript_id={type=keyword}, protein_id={type=keyword}, category={type=keyword}, gene_id={type=keyword}, major_consequence={type=keyword}, hgvsp={type=keyword}, consequence_terms={type=keyword}}}, mainTranscript_transcript_id={type=keyword}, AC={type=integer}, mainTranscript_codons={type=keyword}, AF={type=half_float}, mainTranscript_canonical={type=integer}, mainTranscript_protein_id={type=keyword}, alt={type=keyword}, rsid={type=keyword}, domains={type=keyword}, xstop={type=long}, mainTranscript_category={type=keyword}, AN={type=integer}, transcriptIds={type=keyword}, xstart={type=long}, mainTranscript_gene_id={type=keyword}, transcriptConsequenceTerms={type=keyword}, samples_gq_60_to_65={eager_global_ordinals=true, type=keyword}, samples_gq_20_to_25={eager_global_ordinals=true, type=keyword}, samples_ab_20_to_25={eager_global_ordinals=true, type=keyword}, samples_gq_15_to_20={eager_global_ordinals=true, type=keyword}, mainTranscript_domains={type=keyword}, mainTranscript_biotype={type=keyword}, docId={type=keyword}, samples_gq_0_to_5={eager_global_ordinals=true, type=keyword}, mainTranscript_major_consequence={type=keyword}, aIndex={type=integer}, samples_ab_5_to_10={eager_global_ordinals=true, type=keyword}, pos={type=integer}, samples_gq_85_to_90={eager_global_ordinals=true, type=keyword}, end={type=integer}, samples_gq_70_to_75={eager_global_ordinals=true, type=keyword}, geneIds={type=keyword}, samples_gq_10_to_15={eager_global_ordinals=true, type=keyword}, samples_gq_25_to_30={eager_global_ordinals=true, type=keyword}, mainTranscript_major_consequence_rank={type=integer}, samples_ab_30_to_35={eager_global_ordinals=true, type=keyword}, samples_gq_65_to_70={eager_global_ordinals=true, type=keyword}, mainTranscript_hgvsc={type=keyword}, xpos={type=long}, samples_gq_5_to_10={eager_global_ordinals=true, type=keyword}, start={type=integer}, genotypes={type=nested, properties={ab={type=half_float, doc_values=false}, num_alt={type=byte, doc_values=false}, sample_id={type=keyword}, gq={type=byte, doc_values=false}, dp={type=short, doc_values=false}}}, filters={type=keyword}, contig={type=keyword}, mainTranscript_lof_flags={type=keyword}, mainTranscript_lof={type=keyword}, mainTranscript_cdna_start={type=integer}, samples_ab_35_to_40={eager_global_ordinals=true, type=keyword}, samples_no_call={eager_global_ordinals=true, type=keyword}, samples_gq_75_to_80={eager_global_ordinals=true, type=keyword}, samples_gq_80_to_85={eager_global_ordinals=true, type=keyword}, samples_gq_35_to_40={eager_global_ordinals=true, type=keyword}, samples_ab_40_to_45={eager_global_ordinals=true, type=keyword}, samples_gq_40_to_45={eager_global_ordinals=true, type=keyword}}}]') 
/usr/local/seqr/bin/spark-2.0.2-bin-hadoop2.7/bin/spark-submit --driver-memory 5G --executor-memory 5G --num-executors 8 --conf spark.driver.extraJavaOptions=-Xss4M --conf spark.executor.extraJavaOptions=-Xss4M --conf spark.executor.memoryOverhead=5g --conf spark.driver.maxResultSize=30g --conf spark.kryoserializer.buffer.max=1g --conf spark.memory.fraction=0.1 --conf spark.default.parallelism=1 --jars hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --conf spark.driver.extraClassPath=hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --conf spark.executor.extraClassPath=hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --py-files hail_builds/v01/hail-v01-10-8-2018-90c855449.zip "hail_scripts/v01/load_dataset_to_es.py" "--genome-version" "37" "--project-guid" "R0001_project1" "--sample-type" "WES" "--dataset-type" "VARIANTS" "--skip-validation" "--exclude-hgmd" "--vep-block-size" "100" "--es-block-size" "10" "--num-shards" "1" "--use-nested-objects-for-vep" "--use-nested-objects-for-genotypes" "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/GEN18-61P-1-D1.germline.vcf.gz" --username 'root' --directory 'seqr02.nygenome.org:/usr/local/seqr/seqr/hail_elasticsearch_pipelines'

Traceback (most recent call last): File "gcloud_dataproc/submit.py", line 99, in <module> subprocess.check_call(command, shell=True) File "/usr/lib64/python2.7/subprocess.py", line 542, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '/usr/local/seqr/bin/spark-2.0.2-bin-hadoop2.7/bin/spark-submit --driver-memory 5G --executor-memory 5G --num-executors 8 --conf spark.driver.extraJavaOptions=-Xss4M --conf spark.executor.extraJavaOptions=-Xss4M --conf spark.executor.memoryOverhead=5g --conf spark.driver.maxResultSize=30g --conf spark.kryoserializer.buffer.max=1g --conf spark.memory.fraction=0.1 --conf spark.default.parallelism=1 --jars hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --conf spark.driver.extraClassPath=hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --conf spark.executor.extraClassPath=hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --py-files hail_builds/v01/hail-v01-10-8-2018-90c855449.zip "hail_scripts/v01/load_dataset_to_es.py" "--genome-version" "37" "--project-guid" "R0001_project1" "--sample-type" "WES" "--dataset-type" "VARIANTS" "--skip-validation" "--exclude-hgmd" "--vep-block-size" "100" "--es-block-size" "10" "--num-shards" "1" "--use-nested-objects-for-vep" "--use-nested-objects-for-genotypes" "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/GEN18-61P-1-D1.germline.vcf.gz" --username 'root' --directory 'seqr02.nygenome.org:/usr/local/seqr/seqr/hail_elasticsearch_pipelines' ' returned non-zero exit status 1

How is dataset_type used in the Luigi pipeline?

There is a dataset_type parameter in luigi_pipeline. Is it used anywhere? I can't find any code that uses it in the pipeline:

https://github.com/macarthur-lab/hail-elasticsearch-pipelines/blob/42ddf5e5105f600786c3a4c4fd105676ec6d8ab3/luigi_pipeline/seqr_loading.py#L35

You seem to have a separate file for SVs - https://github.com/macarthur-lab/hail-elasticsearch-pipelines/blob/master/sv_pipeline/load_data.py - so does that mean we should run it separately for SVs instead of the Luigi one?

Update CADD scores

We're currently using precomputed SNP and INDEL CADD scores v1.3 (released in 2015).
CADD v1.4 just got released 1 month ago, and has

... support of the genome build GRCh38/hg38, and also includes a new CADD model for GRCh37/hg19. We further fixed some minor issues identified in CADD v1.3 with respect to how annotations were interpreted in the model and updated some of the underlying datasets. We included a new splice score (dbscSNV) and measures of genome-wide variant density. The GRCh38 and GRCh37 models are based on the same (or, if these were not available, very similar) annotations. Comprehensive information about the new models can be found in the release notes.

(from http://cadd.gs.washington.edu/news, http://cadd.gs.washington.edu/download)

We should update the CADD files we currently use for annotations:

gs://seqr-reference-data/GRCh37/CADD/CADD_snvs_and_indels.vds
gs://seqr-reference-data/GRCh38/CADD/CADD_snvs_and_indels.vds

which were generated by downloading the pre-computed .tsvs from http://cadd.gs.washington.edu/download, converting them to VCFs by running

cat header.vcf <(zcat ../GRCh37/${input_tsv} | grep -v ^# | awk -v OFS='\t' '($3 ~ /(A|C|G|T)/) && ($4 ~ /(A|C|G|T)/) { print $1, $2, ".", $3, $4, ".", ".", "RawScore="$5";PHRED="$6 }' ) | bgzip >  ${input_base}.vcf.gz

in /humgen/atgu1/fs03/shared_resources/CADD/GRCh38/

and then converting the VCFs to VDS using https://github.com/macarthur-lab/hail-elasticsearch-pipelines/blob/master/download_and_create_reference_datasets/hail_scripts/v01/write_cadd_vds.py

Add command line and/or config file args to SeqrMTToESTask

The v01 pipeline supports these options on the command line:

--genome-version $GENOME_VERSION
--project-guid $PROJECT_GUID
--sample-type $SAMPLE_TYPE
--dataset-type $DATASET_TYPE
--vep-block-size 100
--es-block-size 10
--num-shards 1

Is there a way to set these for SeqrMTToESTask?
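
A hedged sketch of Luigi's generic parameter mechanism, which is what the v02 tasks rely on (the task and parameter names below are illustrative, not a verified list of SeqrMTToESTask's parameters): any luigi.Parameter can be set either in the config file referenced by LUIGI_CONFIG_PATH, under a section named after the task, or directly on the command line.

```python
# A hedged sketch of how Luigi task parameters are supplied, using illustrative names.
import luigi

class ExampleLoadingTask(luigi.Task):
    genome_version = luigi.Parameter(default='37')
    vep_block_size = luigi.IntParameter(default=100)
    es_block_size = luigi.IntParameter(default=10)

    def run(self):
        print(self.genome_version, self.vep_block_size, self.es_block_size)

if __name__ == '__main__':
    luigi.run()

# In the cfg pointed to by LUIGI_CONFIG_PATH:      Or on the command line:
#   [ExampleLoadingTask]                             python tasks.py ExampleLoadingTask \
#   genome_version=38                                  --genome-version 38 --local-scheduler
#   vep_block_size=200
```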

Created index does not have a valid schema

Hello, I need some assistance figuring out why 3 of my created indexes will not submit properly to the seqr web interface. We have 6 families total and 6 VCF samples. While I was able to load each VCF through hail_elasticsearch_pipelines, I was only able to add the index for 3 of them; the other 3 give the same error, "does not have a valid schema".

I am wondering what could have caused this, and can show you my .xls family and individuals files in addition to the index that was generated from the .vcf file.

After thoroughly checking them, I don't believe it's a typo or a name error keeping it from uploading since otherwise ES would have told me the index does not exist.

[Screenshot attached: Screen Shot 2019-09-25 at 1.47.15 PM]

Not able to download reference data from Google Storage

Dear Team,

I am not able to download datasets from the Google Storage space.
This is one example where I tried to download data but it didn't work:

/gpfs/projects/bioinfo/najeeb/tools/google-cloud-sdk/bin/gsutil cp gs://seqr-reference-data/GRCh37/TopMed/bravo-dbsnp-all.removed_chr_prefix.liftunder_GRCh37.mt/ .

However, I can see the data is available at this location:
/gpfs/projects/bioinfo/najeeb/tools/google-cloud-sdk/bin/gsutil ls gs://seqr-reference-data/GRCh37/TopMed/bravo-dbsnp-all.removed_chr_prefix.liftunder_GRCh37.mt/
gs://seqr-reference-data/GRCh37/TopMed/bravo-dbsnp-all.removed_chr_prefix.liftunder_GRCh37.mt/
gs://seqr-reference-data/GRCh37/TopMed/bravo-dbsnp-all.removed_chr_prefix.liftunder_GRCh37.mt/_SUCCESS
gs://seqr-reference-data/GRCh37/TopMed/bravo-dbsnp-all.removed_chr_prefix.liftunder_GRCh37.mt/metadata.json.gz
gs://seqr-reference-data/GRCh37/TopMed/bravo-dbsnp-all.removed_chr_prefix.liftunder_GRCh37.mt/cols/
gs://seqr-reference-data/GRCh37/TopMed/bravo-dbsnp-all.removed_chr_prefix.liftunder_GRCh37.mt/entries/
gs://seqr-reference-data/GRCh37/TopMed/bravo-dbsnp-all.removed_chr_prefix.liftunder_GRCh37.mt/globals/
gs://seqr-reference-data/GRCh37/TopMed/bravo-dbsnp-all.removed_chr_prefix.liftunder_GRCh37.mt/references/
gs://seqr-reference-data/GRCh37/TopMed/bravo-dbsnp-all.removed_chr_prefix.liftunder_GRCh37.mt/rows/

Could you please help with this?

Thanks
Najeeb

mt.write() is not working: hail.utils.java.FatalError: IOException: error=2, No such file or directory

I am running the first task, SeqrVCFToMTTask, and all of the steps run successfully until line 63, which writes out the final MatrixTable:

mt.write(self.output().path, stage_locally=True, overwrite=True)

The output file is created (but it is 0 bytes). The stack trace I see is the following:

Traceback (most recent call last):
File "/opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/luigi/worker.py", line 199, in run
new_deps = self._run_get_new_deps()
File "/opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/luigi/worker.py", line 141, in _run_get_new_deps
task_gen = self.task.run()
File "/opt/seqr/hail-elasticsearch-pipelines/luigi_pipeline/seqr_loading.py", line 37, in run
self.read_vcf_write_mt()
File "/opt/seqr/hail-elasticsearch-pipelines/luigi_pipeline/seqr_loading.py", line 71, in read_vcf_write_mt
mt.write(self.output().path, overwrite=True)
File "</opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/decorator.py:decorator-gen-1036>", line 2, in write
File "/opt/seqr/hail-elasticsearch-pipelines/hail_builds/v02/hail-0.2-3a68be23cb82d7c7fb5bf72668edcd1edf12822e.zip/hail/typecheck/check.py", line 585, in wrapper
return original_func(*args, **kwargs)
File "/opt/seqr/hail-elasticsearch-pipelines/hail_builds/v02/hail-0.2-3a68be23cb82d7c7fb5bf72668edcd1edf12822e.zip/hail/matrixtable.py", line 2508, in write
Env.backend().execute(MatrixWrite(self._mir, writer))
File "/opt/seqr/hail-elasticsearch-pipelines/hail_builds/v02/hail-0.2-3a68be23cb82d7c7fb5bf72668edcd1edf12822e.zip/hail/backend/backend.py", line 109, in execute
result = json.loads(Env.hc()._jhc.backend().executeJSON(self._to_java_ir(ir)))
File "/opt/seqr/spark/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
answer, self.gateway_client, self.target_id, self.name)
File "/opt/seqr/hail-elasticsearch-pipelines/hail_builds/v02/hail-0.2-3a68be23cb82d7c7fb5bf72668edcd1edf12822e.zip/hail/utils/java.py", line 225, in deco
'Error summary: %s' % (deepest, full, hail.version, deepest)) from None
hail.utils.java.FatalError: IOException: error=2, No such file or directory

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 7.0 failed 1 times, most recent failure: Lost task 3.0 in stage 7.0 (TID 2666, localhost, executor driver): java.io.IOException: Cannot run program "/vep": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at is.hail.utils.richUtils.RichIterator$.pipe$extension(RichIterator.scala:47)

I found an almost identical thread on the Hail 0.2 forum:

https://discuss.hail.is/t/cant-write-vep-annotated-hail-table/855/3

The issue there is that VEP fails because Hail can't find the /vep path specified in the vep85-loftee-local.json config file. OK, but is it possible to install VEP on Hadoop (in your case, Google Cloud)? Or to dynamically change the path that Hail looks up in the Luigi pipeline? VEP is installed locally, and I copied everything that could be copied to Hadoop for VEP (although copying to Hadoop gives an error, "filepath is not a valid DFS filename", because of file names such as Bio::DB::HTS.3pm).
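
In case it helps, a hedged sketch of pointing Hail at a VEP config explicitly (Hail 0.2's hl.vep accepts a config path; the JSON location below is a placeholder). Whatever the "command" and data paths inside that JSON are, they must resolve on every Spark executor, which is why a driver-only VEP install fails only when mt.write() forces the annotation to run on the workers:

```python
# A hedged sketch, not a verified fix: pass the VEP JSON config explicitly and make sure
# the VEP install it references exists on every node that runs Spark tasks.
import hail as hl

mt = hl.read_matrix_table('callset.mt')
mt = hl.vep(mt, config='file:///opt/vep/configs/vep85-loftee-local.json')
```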
