test-process's Issues

'GT' are present in MT but missing from VDS for same samples

akhattab said:

Hi, hope all is well.
I’m working on the All of Us platform to calculate some PRS. I’m using custom code and a Hail MT for the calculations, which works but is time-consuming and expensive.
Part of the code subsets the variants in the PRS weights df from the MT, and I came across @tpoterba’s post about filtering a VDS using hail.vds.filter_intervals, which is SUPER fast compared to the MT. The issue is that most samples are missing ‘GT’ even though they are present in the MT.
[screenshot]

My code:

PGS_sites = hl.import_locus_intervals(interval_file, reference_genome='GRCh38', skip_invalid_intervals=True)
vds_prs = hl.vds.filter_intervals(vds_subset, PGS_sites, keep=True)

The number of variants from a PRS score present in MT:
[screenshot]

The number of variants from the same PRS score present in VDS for the same cohort:
[screenshot]

Am I missing something here? Any thoughts?

Thank you!
Ahmed

danking said:

What code are you using to show the genotypes? vds.variant_data.GT.show() shows only the variant data. The VDS is a split representation: reference data and variant data are stored separately, each in its own Hail MatrixTable: vds.reference_data and vds.variant_data.
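
To make the split concrete, here is a minimal sketch (my illustration; the path is hypothetical and vds is assumed to be a VariantDataset) of inspecting the two components:

import hail as hl

vds = hl.vds.read_vds('gs://my-bucket/dataset.vds')  # hypothetical path

# The variant component holds only the called variant sites; this is what
# vds.variant_data.GT.show() displays.
vds.variant_data.describe()

# The reference component holds the homozygous-reference blocks.
vds.reference_data.describe()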

If, for your analysis, you need reference information (in particular, if you need to decide if a given sample is a confident homozygous reference call instead of a no call), you need to densify the VDS. There’s some more information about this at the AoU VDS page.

I recommend you seriously consider using the smaller, pre-densified callsets unless you absolutely need genome-wide data. Densification is not a cheap process: it converts a very sparse representation into a very large, dense one. This is particularly true if you only need GTs! The majority of the VDS is quality metadata like PL and AD.

akhattab said:

Hi @danking, thank you for your response!
Yes, I used vds.variant_data.GT.show() to show the genotypes.

danking:

If, for your analysis, you need reference information (in particular, if you need to decide if a given sample is a confident homozygous reference call instead of a no call)

Does that mean an NA could either be a no call or a homozygous reference? I queried vds.reference_data for some of the variants with NA in vds.variant_data, and they also have NA in the reference_data
[screenshots]

Is it safe to assume that the variants with NA are homozygous reference?

danking said:

Does that mean an NA could either be a no call or a homozygous reference?

Yes, that’s exactly what NA means.

The reference data is not stored row-by-row. The reference data is stored like a GVCF: there are “reference blocks” which implicitly span multiple rows.
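
As a minimal sketch of looking at those blocks directly (my addition; vds is assumed to be loaded, and the entry field names reflect my understanding of the reference component, so treat them as assumptions):

vds.reference_data.describe()          # entry schema; reference blocks typically carry an END field
vds.reference_data.entries().show(5)   # each row here is one reference block, covering its locus through END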

Is it safe to assume that the variants with NA are homozygous reference?

No, it is not.

If you need to confidently know if a sample is homozygous reference or no call you must use hl.vds.to_dense_mt:

mt = hl.vds.to_dense_mt(vds)
mt.GT.show()  # if you see NA here they are no calls

Note that densification is an expensive and time-consuming process! You probably want to filter to a small set of intervals-of-interest first using hl.vds.filter_intervals:

vds = hl.vds.filter_intervals(
    vds,
    hl.literal([
        hl.locus_interval('chr1', 123, 456)
    ])
)

If possible, I recommend using the pre-densified datasets because the All of Us project has already paid the cost of producing dense data at those loci!

akhattab said:

Great! Thank you so much, Dan. I appreciate the help!

akhattab said:

Hi @danking, sorry to keep bugging you with this but I have a follow-up question.
I’m trying to understand how Hail treats the missing calls.
When using mt.GT.show I could see the missing calls as NA:

[screenshot]

But when using hl.is_missing, they are still NA instead of True I guess:
[screenshot]

Same when using is_defined:
[screenshot]

My question is, why are these calls ‘ignored’?
Thank you!

danking said:

Hey! Sorry for the delay; I answered in your other question, “How does Hail treat missing calls?”

Potential Outdated Error Statement for Hail Version Incompatibility

darn_matren said:

Hello team,

I hit an odd error yesterday when reading in a Hail Table that a coworker had written in Hail 0.2.120 while I was working with Hail 0.2.113. The error said that I could not read the file because it was written in a more recent version of Hail - which makes sense, because once I updated my Hail version it worked - but oddly the message said that the Hail Table was written in format 1.7.0 and that my version, which supports 1.6.0, was not able to read it, which looked wrong! I was able to replicate the issue (after downgrading to Hail 0.2.113 in a conda environment) and I’ll paste it below:

Hail version: 0.2.113-cf32652c5077
Error summary: HailException: incompatible file format when reading: gs://filename.ht
supported version: 1.6.0, found 1.7.0
The cause of this error is usually an attempt to use an older version of Hail to read files generated by a newer version. This is not supported (Hail native files are back-compatible, but not forward-compatible).
To read this file, use a newer version of Hail.

danking said:

1.6.0 is a version of the file format. 0.2.113 is a version of the Hail Python library. I’ll update the message to be more clear.
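
For anyone hitting this, a quick way to check which library version is installed (the 0.2.x number, which is separate from the native file-format version reported in the error):

import hail as hl
print(hl.version())  # e.g. '0.2.113-cf32652c5077'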

danking said:

I’ve made this change here. I would love your feedback on the verbiage there!

darn_matren said:

That’s interesting, thanks for the quick response!

Error in Terra When Initializing HAIL

beneopp said:

I was unable to run the 01-genome-wide-association-study.ipynb notebook from Hail-Notebook-Tutorials. I received the following warning after the hl.init() command:

/opt/conda/lib/python3.10/site-packages/hailtop/aiocloud/aiogoogle/user_config.py:43: UserWarning: Reading spark-defaults.conf to determine GCS requester pays configuration. This is deprecated. Please use `hailctl config set gcs_requester_pays/project` and `hailctl config set gcs_requester_pays/buckets`.
  warnings.warn(
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/usr/lib/spark/jars/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.3.0
SparkUI available at http://saturn-26616ce0-7b3e-4578-bb69-4567cf490d15-m.c.terra-67a6826d.internal:37893

Then I got the following error when attempting to download the file:

2023-10-03 21:56:25.636 Hail: INFO: downloading 1KG VCF ...
  Source: https://storage.googleapis.com/hail-tutorial/1kg.vcf.bgz
2023-10-03 21:56:28.692 Hail: INFO: importing VCF and writing to matrix table...
---------------------------------------------------------------------------
Py4JNetworkError                          Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/py4j/clientserver.py:516, in ClientServerConnection.send_command(self, command)
    515 if answer.strip() == "":
--> 516     raise Py4JNetworkError("Answer from Java side is empty")
    517 if answer.startswith(proto.RETURN_MESSAGE):

Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Py4JNetworkError Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/py4j/java_gateway.py:1038, in GatewayClient.send_command(self, command, retry, binary)
1037 try:
-> 1038 response = connection.send_command(command)
1039 if binary:

File /opt/conda/lib/python3.10/site-packages/py4j/clientserver.py:539, in ClientServerConnection.send_command(self, command)
538 logger.info("Error while receiving.", exc_info=True)
--> 539 raise Py4JNetworkError(
540 "Error while sending or receiving", e, proto.ERROR_ON_RECEIVE)

Py4JNetworkError: Error while sending or receiving

During handling of the above exception, another exception occurred:

TypeError Traceback (most recent call last)
Cell In[3], line 1
----> 1 hl.utils.get_1kg('data/')

File /opt/conda/lib/python3.10/site-packages/hail/utils/tutorial.py:86, in get_1kg(output_dir, overwrite)
84 cluster_readable_vcf = _copy_to_tmp(fs, local_path_uri(tmp_vcf), extension='vcf.bgz')
85 info('importing VCF and writing to matrix table...')
---> 86 hl.import_vcf(cluster_readable_vcf, min_partitions=16).write(matrix_table_path, overwrite=True)
88 tmp_sample_annot = os.path.join(tmp_dir, '1kg_annotations.txt')
89 source = resources['1kg_annotations']

File <decorator-gen-1484>:2, in import_vcf(path, force, force_bgz, header_file, min_partitions, drop_samples, call_fields, reference_genome, contig_recoding, array_elements_required, skip_invalid_loci, entry_float_type, filter, find_replace, n_partitions, block_size, _create_row_uids, _create_col_uids)

File /opt/conda/lib/python3.10/site-packages/hail/typecheck/check.py:584, in _make_dec.<locals>.wrapper(__original_func, *args, **kwargs)
581 @decorator
582 def wrapper(__original_func, *args, **kwargs):
583     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 584     return __original_func(*args_, **kwargs_)

File /opt/conda/lib/python3.10/site-packages/hail/methods/impex.py:2822, in import_vcf(path, force, force_bgz, header_file, min_partitions, drop_samples, call_fields, reference_genome, contig_recoding, array_elements_required, skip_invalid_loci, entry_float_type, filter, find_replace, n_partitions, block_size, _create_row_uids, _create_col_uids)
2812 hl.utils.warning(
2813 f'You are trying to read {path} with ONE core of parallelism. This '
2814 'will be very slow. If this file is block-gzipped (bgzip-ed), use '
2815 'force_bgz=True instead.'
2816 )
2818 reader = ir.MatrixVCFReader(path, call_fields, entry_float_type, header_file,
2819 n_partitions, block_size, min_partitions,
2820 reference_genome, contig_recoding, array_elements_required,
2821 skip_invalid_loci, force_bgz, force, filter, find_replace)
-> 2822 return MatrixTable(ir.MatrixRead(reader, drop_cols=drop_samples, drop_row_uids=not _create_row_uids, drop_col_uids=not _create_col_uids))

File /opt/conda/lib/python3.10/site-packages/hail/matrixtable.py:709, in MatrixTable.init(self, mir)
706 self._col_indices = Indices(self, {self._col_axis})
707 self._entry_indices = Indices(self, {self._row_axis, self._col_axis})
--> 709 self._type = self._mir.typ
711 self._global_type = self._type.global_type
712 self._col_type = self._type.col_type

File /opt/conda/lib/python3.10/site-packages/hail/ir/base_ir.py:494, in MatrixIR.typ(self)
491 @property
492 def typ(self):
493 if self._type is None:
--> 494 self.compute_type(deep_typecheck=False)
495 return self._type

File /opt/conda/lib/python3.10/site-packages/hail/ir/base_ir.py:485, in MatrixIR.compute_type(self, deep_typecheck)
483 def compute_type(self, deep_typecheck):
484 if deep_typecheck or self._type is None:
--> 485 computed = self._compute_type(deep_typecheck)
486 if self._type is not None:
487 assert self._type == computed

File /opt/conda/lib/python3.10/site-packages/hail/ir/matrix_ir.py:185, in MatrixRead._compute_type(self, deep_typecheck)
183 def _compute_type(self, deep_typecheck):
184 if self._type is None:
--> 185 return Env.backend().matrix_type(self)
186 else:
187 return self._type

File /opt/conda/lib/python3.10/site-packages/hail/backend/py4j_backend.py:184, in Py4JBackend.matrix_type(self, mir)
183 def matrix_type(self, mir):
--> 184 jir = self._to_java_matrix_ir(mir)
185 return tmatrix._from_java(jir.typ())

File /opt/conda/lib/python3.10/site-packages/hail/backend/py4j_backend.py:170, in Py4JBackend._to_java_matrix_ir(self, ir)
169 def _to_java_matrix_ir(self, ir):
--> 170 return self._to_java_ir(ir, self._parse_matrix_ir)

File /opt/conda/lib/python3.10/site-packages/hail/backend/py4j_backend.py:145, in Py4JBackend._to_java_ir(self, ir, parse)
143 r = CSERenderer(stop_at_jir=True)
144 # FIXME parse should be static
--> 145 ir._jir = parse(r(finalize_randomness(ir)), ir_map=r.jirs)
146 return ir._jir

File /opt/conda/lib/python3.10/site-packages/hail/backend/py4j_backend.py:158, in Py4JBackend._parse_matrix_ir(self, code, ir_map)
157 def _parse_matrix_ir(self, code, ir_map={}):
--> 158 return self._jbackend.parse_matrix_ir(code, ir_map)

File /opt/conda/lib/python3.10/site-packages/py4j/java_gateway.py:1321, in JavaMember.call(self, *args)
1314 args_command, temp_args = self._build_args(*args)
1316 command = proto.CALL_COMMAND_NAME +
1317 self.command_header +
1318 args_command +
1319 proto.END_COMMAND_PART
-> 1321 answer = self.gateway_client.send_command(command)
1322 return_value = get_return_value(
1323 answer, self.gateway_client, self.target_id, self.name)
1325 for temp_arg in temp_args:

File /opt/conda/lib/python3.10/site-packages/py4j/java_gateway.py:1055, in GatewayClient.send_command(self, command, retry, binary)
1053 response = self.send_command(command, binary=binary)
1054 else:
-> 1055 logging.exception(
1056 "Exception while sending command.")
1057 response = proto.ERROR
1058 except KeyboardInterrupt:
1059 # For KeyboardInterrupt triggered from Python shell, it should
1060 # clean up the connection so the connection is
(...)
1064 # See also py4j/py4j#440 for
1065 # more details.

File /opt/conda/lib/python3.10/logging/init.py:2113, in exception(msg, exc_info, *args, **kwargs)
2107 def exception(msg, *args, exc_info=True, **kwargs):
2108 """
2109 Log a message with severity 'ERROR' on the root logger, with exception
2110 information. If the logger has no handlers, basicConfig() is called to add
2111 a console handler with a pre-defined format.
2112 """
-> 2113 error(msg, *args, exc_info=exc_info, **kwargs)

File /opt/conda/lib/python3.10/logging/init.py:2105, in error(msg, *args, **kwargs)
2103 if len(root.handlers) == 0:
2104 basicConfig()
-> 2105 root.error(msg, *args, **kwargs)

File /opt/conda/lib/python3.10/logging/init.py:1506, in Logger.error(self, msg, *args, **kwargs)
1497 """
1498 Log 'msg % args' with severity 'ERROR'.
1499
(...)
1503 logger.error("Houston, we have a %s", "major problem", exc_info=1)
1504 """
1505 if self.isEnabledFor(ERROR):
-> 1506 self._log(ERROR, msg, args, **kwargs)

TypeError: Log._log() got an unexpected keyword argument 'exc_info'

danking said:

Hi @beneopp !

This looks like a broken Hail installation. Can you describe how you started your Hail cluster and how you installed Hail?

beneopp said:

I launched the Jupyter Cloud Environment on Terra. I used the standard Hail application configuration and reduced the number of CPUs to 2. Attached is a screenshot of the configuration I used.

danking said:

@beneopp hmm. Most folks use clusters, not “Spark single node”, so it’s possible that path is less well tested in Terra.

Can you trigger the error again and send us the Hail .log file? The easiest way to get that is to SSH to the master node of the cluster (always named CLUSTER_NAME-m). If you don’t have the ability to SSH, you’ll have to use the notebook to gsutil cp ...log gs://your-bucket/logfile then download it on your laptop.

Hail is definitely failing to start properly and the reason why should be in that log file.

beneopp said:

Thank you for your prompt reply.

I tried the environment with “Spark single node” and 4 CPUs, and the startup step worked fine. I think the error might be caused by the machine used. Attached is the log file from when 2 CPUs are used and the error occurs.

In general, I am wondering what makes the “Spark single node” parameter important. I am new to using this tool and don’t know anything about Spark. If it is important, I think it would be helpful to explain this in the Hail-Notebook-Tutorials workspace.

[screenshot of the environment dashboard]

hail-20231004-1700-0.2.120-f00f916faf78.log (21.6 KB)

danking said:

@beneopp , which workspace are you referring to? The Hail team doesn’t own any workspaces, those are produced by Terra. I can ask them to update the workspaces though if there is specific feedback.

We have some general advice on using the cloud in the Hail docs. In general, people use “Spark clusters”. Google has a “Dataproc” product which allows you to start and stop clusters. We provide a tool that helps you do that called hailctl dataproc.

In Terra, you have to use their UI instead. For most analyses, you don’t want to use “Spark single node” because that means you’re using one “node” (aka “VM” aka “computer”) rather than a cluster of nodes. That means you’re limited to the number of cores on that one computer. Most sequencing datasets are too large to wait for a single computer to read all the data.


Hmm, this log indicates that you have two SparkContexts and that one of them is stopped and the other is active. This shouldn’t normally happen. If you start a new notebook from scratch and run

import hail as hl
hl.init()
hl.balding_nichols_model(1, 10, 10).show(10, 10)

Do you see a matrix of genotypes? If not, can you copy all the output you see here? We need to sort out why your notebook somehow has two SparkContexts.

Efficient merging of bgen shards - export bgen

jwillett said:

Does the Hail team have recommendations on efficiently merging shards from export_bgen(…, parallel=“header_per_shard”)? I was transforming bgen to pgen, then merging using plink2, but the output was not being accepted by regenie.

danking said:

Hey @jwillett !

Hmm. I wouldn’t use parallel="header_per_shard"; I’d use parallel='separate_header' and then cat (perhaps a 100-way tree of cats if there are too many files). The header_per_shard mode is really meant for when you’re going to run some code on each partition separately.
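
As a sketch of the two modes being compared (my illustration; the output paths are hypothetical, and mt is assumed to be a MatrixTable with a GP entry field suitable for BGEN export):

import hail as hl

# One separate header file plus headerless shards that can simply be concatenated.
hl.export_bgen(mt, 'gs://my-bucket/out_separate_header', parallel='separate_header')

# Each shard is a complete, self-contained BGEN file; useful when each partition
# is processed independently.
hl.export_bgen(mt, 'gs://my-bucket/out_header_per_shard', parallel='header_per_shard')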

jwillett said:

Is there a way to remove headers on produced shards to enable the use of cat? I had to move my projects forward, using header_per_shard, given that I will be doing GWAS (so running code on each separate partition is acceptable). I would prefer to combine all the shards into a single file before running GWAS to keep things organized.

danking said:

The BGEN format is precisely defined, so this should be possible.

According to the BGEN format, the first four bytes are an unsigned little endian 32-bit integer indicating the offset to the start position of the variant data (relative to the end of this 32-bit integer). For each file you can get that offset:

import hailtop.fs as hfs
import struct

x = bytearray(4)
hfs.open(path_to_bgen, 'rb').readinto(x)
offset = struct.unpack('<I', x)[0]
print(offset + 4)

Then you can use tail -c +offset to skip the header and the sample block.
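
Building on that, here is a minimal sketch (not an official Hail utility; the helper is hypothetical) of doing the same header-and-sample-block skip in Python instead of with tail:

import struct
import hailtop.fs as hfs

def variant_data_bytes(shard_path):
    # Read the whole shard into memory (fine for modestly sized shards), then
    # skip the 4-byte offset integer plus `offset` more bytes, which covers the
    # header block and, if present, the sample identifier block.
    with hfs.open(shard_path, 'rb') as f:
        data = f.read()
    offset = struct.unpack('<I', data[:4])[0]
    return data[4 + offset:]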

jwillett said:

Great, thanks! That will save on file count a lot.

Hail SPARK version error

jin0008 said:

Hi, I want to install Hail on my Ubuntu machine, but when I run a simple Hail query I get this message.
Could you give me some advice?
Thanks

This Hail JAR was compiled for Spark 3.1.3, cannot run with Spark 3.4.1.
The major and minor versions must agree, though the patch version can differ.

danking said:

You need to install Apache Spark 3.1.3. You have Apache Spark 3.4.1. Try:

pip3 install pyspark==3.1.3
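
If you are not sure which Spark you have, a quick check (assuming PySpark is the Spark installation Hail is picking up):

import pyspark
print(pyspark.__version__)  # the major.minor version must match the Hail JAR's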

Cannot use exome_v7.1

aou_user_01 said:

I received an email that the srWGS smaller callsets were upgraded to v7.1. Here is the path of the exome dataset I tried: gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/multiMT/hail.mt

and I received this error.

Java stack trace:
is.hail.utils.HailException: incompatible file format when reading: gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/multiMT/hail.mt
supported version: 1.6.0, found 1.7.0
The cause of this error is usually an attempt to use an older version of Hail to read files generated by a newer version. This is not supported (Hail native files are back-compatible, but not forward-compatible).
To read this file, use a newer version of Hail.

I’m using workbench version 7, so I don’t know why I received an error about a supported-version discrepancy. Has anyone tried the new dataset and succeeded?

asppagh54 said:

Hi,
Any solution for this error? I am facing the same error.
Thanks

danking said:

@aou_user_01 , @asppagh54 ,

This means you’re using an older version of Hail than the version which was used to generate that matrix table. You can update your version of Hail with pip (pip install -U hail), but I’ll also contact AoU to ask if they can update the default version of Hail in the AoU RWB.

How does Hail treat missing calls?

akhattab said:

Hello, I’m trying to understand how Hail treats the missing calls.
When using mt.GT.show I could see the missing calls as NA:

[screenshot]

But when using hl.is_missing, they are still NA instead of True:
[screenshot]

Same when using is_defined:
[screenshot]

My question is, why are these calls ‘ignored’?
Thank you!

danking said:

Hey @akhattab ,

Yeah, this is an unfortunate and confusing part of the Hail interface, particularly as it relates to show. Those NA values are not missing but “filtered”. Filtered entries are produced with filter_entries. This is the entry analogue to row or column filtering, but unlike row or column filtering, we can’t simply remove the entire row or column (because there are other, unfiltered entries in the same row or column).

You can convert filtered entries to missing entries with mt.unfilter_entries(). You can compute statistics about the number of filtered entries with mt.compute_entry_filter_stats.

There is no way to test for the filtered status of a particular field, because it’s not the field that is filtered; it is the entry of the matrix that is filtered.
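
A minimal sketch of the two methods mentioned above (my addition; mt is assumed to be a MatrixTable that has been through filter_entries at some point):

# Turn filtered entries into ordinary missing entries.
mt_unfiltered = mt.unfilter_entries()

# Or annotate rows and columns with counts of filtered entries.
mt_stats = mt.compute_entry_filter_stats()
mt_stats.rows().show(5)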

By convention:

  • “Filtered” entries are like filtered rows or columns: we want to pretend that we didn’t even observe them. They’re not part of our dataset anymore.
  • “Missing” values are data we measured but whose value we’re uncertain of. We want to consider them as part of our analysis but as a separate, third category of thing: something we know exists but can’t observe directly.

akhattab said:

Great. Thank you so much!

Future support of Elasticsearch version 8

souckmi said:

Hello everyone,

I wanted to ask if the hail.methods.export_elasticsearch() method will support Elasticsearch version 8 in the near future, since according to the documentation it currently supports Elasticsearch versions 6.8.x - 7.x.x.

Thanks for the answer!

danking said:

We don’t have plans to add version 8 support, but we welcome a pull request to add that support. My guess is that it’s a relatively easy change.

What's different between mt and ht file format?

SimonLi5601 said:

I thought these were simple concepts, but I can’t find them documented anywhere except the VDS documentation. Also, how are they different from a VDS file? Thanks

danking said:

Hey SimonLi5601 !

You’re right that the native file formats are not well documented. The bulk of our users are uninterested in these details. I think you’re the third person to ask about their details in the eight years of the project.

I’ll directly answer your question below, but I’m curious to better understand your motivations and goals. That will help me give you a more useful answer.


An “.mt” “file” is a compressed, partitioned, indexed, binary format for Matrix Tables (which is like a pandas DataFrame with an extra index dimension). An “.ht” “file” is a compressed, partitioned, indexed, binary format for Tables (which is effectively an out-of-core pandas DataFrame).


To be a bit more specific, an “.mt” file is a little bit of metadata plus four “.ht” files (though they lack the extension):

  • globals
  • cols
  • rows
  • entries

and two optional pieces:

  • index
  • references

An “.ht” file is a little bit of metadata plus:

  • rows (this contains the partitioned data)

and two optional pieces:

  • index
  • references
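
You can see this layout for yourself by listing a native file with Hail's filesystem helper (the path here is hypothetical):

import hail as hl

for entry in hl.hadoop_ls('gs://my-bucket/my_dataset.mt'):
    print(entry['path'])
# Expect subdirectories like globals, cols, rows, and entries (plus index and
# references when present) and a small metadata file.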

Zip: length mismatch error

klaricch said:

Hi,

If I run the code in the attached file
am_test.txt (1.3 KB)
I get the below error:

2023-08-09 15:56:05.872 Hail: INFO: wrote table with 104 rows in 104 partitions to gs://gnomad-tmp-4day/__iruid_845-wg8w0gBbPsilRDfRegbVW3
INFO (constraint_pipeline 463): Copying log to logging bucket...
2023-08-09 15:56:33.276 Hail: INFO: copying log to 'gs://gnomad-tmp/gnomad_v2.1.1_testing/constraint/logging/constraint_pipeline.log'...
Traceback (most recent call last):
  File "/tmp/17e6dae80e9548a9a94895fd07e6dee0/constraint_pipeline.py", line 785, in &lt;module&gt;
    main(args)
  File "/tmp/17e6dae80e9548a9a94895fd07e6dee0/constraint_pipeline.py", line 412, in main
    oe_ht = apply_models(
  File "/tmp/17e6dae80e9548a9a94895fd07e6dee0/pyscripts_ge6ozu1m.zip/gnomad_constraint/utils/constraint.py", line 396, in apply_models
  File "/tmp/17e6dae80e9548a9a94895fd07e6dee0/pyscripts_ge6ozu1m.zip/gnomad_constraint/utils/constraint.py", line 279, in create_observed_and_possible_ht
  File "&lt;decorator-gen-1118&gt;", line 2, in checkpoint
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 584, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 1331, in checkpoint
    self.write(output=output, overwrite=overwrite, stage_locally=stage_locally, _codec_spec=_codec_spec)
  File "&lt;decorator-gen-1120&gt;", line 2, in write
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 584, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 1377, in write
    Env.backend().execute(ir.TableWrite(self._tir, ir.TableNativeWriter(output, overwrite, stage_locally, _codec_spec)))
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 82, in execute
    raise e.maybe_user_error(ir) from None
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 76, in execute
    result_tuple = self._jbackend.executeEncode(jir, stream_codec, timed)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 35, in deco
    raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
hail.utils.java.FatalError: HailException: zip: length mismatch: 62164, 104

Java stack trace:
is.hail.utils.HailException: zip: length mismatch: 62164, 104
	at __C8160Compiled.__m8201split_ToArray(Emit.scala)
	at __C8160Compiled.__m8169split_CollectDistributedArray(Emit.scala)
	at __C8160Compiled.__m8164split_Let(Emit.scala)
	at __C8160Compiled.apply(Emit.scala)
	at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$4(CompileAndEvaluate.scala:61)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$2(CompileAndEvaluate.scala:61)
	at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$2$adapted(CompileAndEvaluate.scala:59)
	at is.hail.backend.ExecuteContext.$anonfun$scopedExecution$1(ExecuteContext.scala:140)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.ExecuteContext.scopedExecution(ExecuteContext.scala:140)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:59)
	at is.hail.expr.ir.CompileAndEvaluate$.evalToIR(CompileAndEvaluate.scala:33)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:30)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:58)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:63)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:67)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:62)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:22)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:20)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:20)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:50)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:462)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:498)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:75)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:75)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:63)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:350)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:495)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:494)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)



Hail version: 0.2.120-f00f916faf78
Error summary: HailException: zip: length mismatch: 62164, 104

One of the input tables (context_ht) has 8061724269 rows and 62164 partitions, and another input table (mutation_ht) has 104 rows and 104 partitions. If I add “mutation_ht = mutation_ht.repartition(1)” at line 37, the script runs fine. If I add “mutation_ht = mutation_ht.naive_coalesce(1)”, it gives the same error.

Why does the original partitioning of the input tables result in an error?
Why does repartition make it work but naive_coalesce doesn’t?
How do you typically choose whether to use repartition vs naive_coalesce?

Thank you!

danking said:

klaricch:

create_observed_and_possible_ht

Which is here:

  <a href="https://github.com/broadinstitute/gnomad-constraint/blob/9fa1849c7fc2d63063fbc07b8c0035b2b2850885/gnomad_constraint/utils/constraint.py#L157" target="_blank" rel="noopener">github.com</a>

broadinstitute/gnomad-constraint/blob/9fa1849c7fc2d63063fbc07b8c0035b2b2850885/gnomad_constraint/utils/constraint.py#L157

<pre class="onebox"><code class="lang-py">
  <ol class="start lines" start="147" style="counter-reset: li-counter 146 ;">
      <li>    # vep root annotation.</li>
      <li>    ht = add_most_severe_csq_to_tc_within_vep_root(ht)</li>
      <li></li>
      <li>    if require_exome_coverage:</li>
      <li>        # Filter out locus with undefined exome coverage.</li>
      <li>        ht = ht.filter(hl.is_defined(ht.exome_coverage))</li>
      <li></li>
      <li>    return ht</li>
      <li></li>
      <li></li>
      <li class="selected">def create_observed_and_possible_ht(</li>
      <li>    exome_ht: hl.Table,</li>
      <li>    context_ht: hl.Table,</li>
      <li>    mutation_ht: hl.Table,</li>
      <li>    max_af: float = 0.001,</li>
      <li>    keep_annotations: Tuple[str] = (</li>
      <li>        "context",</li>
      <li>        "ref",</li>
      <li>        "alt",</li>
      <li>        "methylation_level",</li>
      <li>    ),</li>
  </ol>
</code></pre>

chrisvittal said:

I was able to download the log file and will be investigating more now.

chrisvittal said:

klaricch Sorry to bother you, but the most recent log is one from a successful run. Do you have a log from a failed one?

klaricch said:

I’ll rerun it quickly and send an example of a failed one

Ruchit10 said:

I’m getting a similar (length mismatch) error while running RMC, specifically this script. My cluster config is below.

rmc/pipeline/two_breaks/run_batches_dataproc.py
--run-sections-over-threshold \
    --search-num \
    2 \
    --freeze \
    7 \
    --group-size \
    50

The hail log is attached
round2search_for_two_breaks_run_batches_dataproc.log (5.1 MB)

Ruchit10 said:

Possibly this line is causing it? I’ll see if removing naive_coalesce fixes it like it did for klaricch

Estimating Large-Scale Storage Requirements

mgarcia said:

Hello,

I am in the process of running our hail-based sample quality control script on roughly 470,000 exomes on the UK Biobank Research Analysis Platform. While running a test, my job failed due to insufficient storage space.

I saw this thread that suggested that the requirements for a full-scale job can often be linearly estimated from running a lower-scale test. Do you think this would be helpful in estimating the full-scale storage requirements for processing 470,000 exomes?

For context, these were the parameters used for running this job:
98 nodes, each node with 64 GB of memory and roughly 300 GB of storage. You can see in the screenshot below that the storage requirements reached the limit just prior to the job failing.

Unfortunately, the temporary JupyterLab environment closed shortly after failing, and I was unable to get the hail log file. Do you have any suggestions for setting a safe upper limit of storage that avoids future errors such as this one?

We are estimating the job to run for roughly 2-3 days. The storage requirements were roughly 130GB per node in the first 8 hours of running the job, but by the ninth hour this exceeded the allocated 300GB of storage.

For reference, I am also sharing our updated Sample QC script.
hail_sample_qc_YL_MG_Step1_082823.txt (10.2 KB)

Hail can't export tsv.bgz

TaotaoTan said:

Hi,

I was using Hail to perform some variant QC on summary statistics downloaded from UKBB on the All of Us platform. I was able to run the pipeline but failed to export the QC-ed summary stats as a bgz file. Here is a sketch of my pipeline:

var_lst = hl.read_table("varlst_filename_in")
sumstats = hl.read_table("sumstats_filename_in")
var_lst = var_lst.annotate(**sumstats[var_lst.locus])
var_lst = var_lst.filter(xxx)
sumstats_qced = var_lst.select(col1 = var_lst.col1,
                         col2 = var_lst.col2
)

sumstats_qced.show() # this works
sumstats_qced.count() # this works, 8,604,715 rows
sumstats_qced.export("sumstats_qced.tsv.bgz") # this doesn't work, see error message

Here are the error messages:

ERROR:root:Exception while sending command.=============>(140094 + 32) / 140126]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1224, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1038, in send_command
response = connection.send_command(command)
File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1229, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving

Py4JError Traceback (most recent call last)
/tmp/ipykernel_1091/2060613055.py in <module>
4 sumstats_QC_quant(f"{bucket}/Sumstats/continuous-50-both_sexes-irnt_checkpoint.ht",
5 f"{bucket}/Sumstats/WGS_Height_QCed.tsv.bgz",
----> 6 var_wgs)
7
8 print("DBP " + str(datetime.now()))

/tmp/ipykernel_1091/205199695.py in sumstats_QC_quant(ht_filename_in, filename_out, var_wgs)
41 neglog10_pval_meta = var_wgs.neglog10_pval_meta_hq)
42
---> 43 sumstats_QCed.export(filename_out)

<decorator-gen-1190> in export(self, output, types_file, header, parallel, delimiter)

/opt/conda/lib/python3.7/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
575 def wrapper(__original_func, *args, **kwargs):
576     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577     return __original_func(*args_, **kwargs_)
578
579     return wrapper

/opt/conda/lib/python3.7/site-packages/hail/table.py in export(self, output, types_file, header, parallel, delimiter)
1097 parallel = ir.ExportType.default(parallel)
1098 Env.backend().execute(
-> 1099 ir.TableWrite(self._tir, ir.TableTextWriter(output, types_file, header, parallel, delimiter)))
1100
1101 def group_by(self, *exprs, **named_exprs) -> 'GroupedTable':

/opt/conda/lib/python3.7/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
97 # print(self._hail_package.expr.ir.Pretty.apply(jir, True, -1))
98 try:
---> 99 result_tuple = self._jbackend.executeEncode(jir, stream_codec, timed)
100 (result, timings) = (result_tuple._1(), result_tuple._2())
101 value = ir.typ._from_encoding(result)

/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py in call(self, *args)
1321 answer = self.gateway_client.send_command(command)
1322 return_value = get_return_value(
-> 1323 answer, self.gateway_client, self.target_id, self.name)
1324
1325 for temp_arg in temp_args:

/opt/conda/lib/python3.7/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
19 import pyspark
20 try:
---> 21 return f(*args, **kwargs)
22 except py4j.protocol.Py4JJavaError as e:
23 s = e.java_exception.toString()

/opt/conda/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
334 raise Py4JError(
335 "An error occurred while calling {0}{1}{2}".
--> 336 format(target_id, ".", name))
337 else:
338 type = answer[1]

Py4JError: An error occurred while calling o1.executeEncode

danking said:

Hey @TaotaoTan !

Sorry you’re having trouble with Hail! This error usually means you’ve run out of RAM on the driver machine. What size driver are you using for your cluster? I recommend an n1-standard-16 (16 cores, 3.75GB RAM per core) when working with >100k partitions.

Help reformatting variant ID string to annotate MatrixTable keyed by locus, allele

jbs said:

Hello! I’m new to Hail so apologies if this question has been asked before or is a bit basic.

I am trying to annotate a matrix table of genomic data (mt) keyed by [locus, allele], with variant annotations in another hail MT (vat_table). Specifically, I would like to eventually filter the variants mt by their corresponding variant type found in vat_table.variant_type.

However, the fields in vat_table don’t seem to be well-formatted for parsing as loci. The variant ID field (vat_table.vid) is formatted as '1-10001-T-C' (str), which is not recognized by hl.parse_variant(). I can use the replace('-', ':') method to change the format, but then calling:

hl.parse_variant(vat_table.vid, reference_genome='GRCh38')

throws an error, since the contig needs to be in the form ‘chr1’. One solution would be to concatenate ‘chr’ with each of the vat_table.vid character strings, but I am having trouble figuring out how to do this in Hail.

Another option would be to just concatenate four character-string fields, vat_table.contig (ex: 'chr1'), vat_table.position (ex: '10001'), vat_table.ref_allele (ex: 'T'), and vat_table.alt_allele (ex: 'C'), to form a DIY vid field. However, I am having trouble doing this as well.

A final option would be to change the format of the vat_table.vid field when I originally load the table with the call below:

vat_table = hl.import_table(vat_path, force = True, quote = '"', delimiter = "\t", force_bgz = True)

I see this thread advising on how to adjust the contig format when reading in a VCF, but I’m not sure if there is a nice way to adjust this from a tab-delimited file. I am unfortunately not able to make any changes directly in the source file.

Hopefully that is clear, and thanks so much for your help!

iris-garden said:

Hi,

If you’d like to go with the first option, you could do something along these lines:

vat_table = hl.import_table(...)
vat_table = vat_table.annotate(vid = "chr" + vat_table.vid.replace("-", ":"))

For the second,

vat_table = vat_table.annotate(vid = vat_table.contig + ":" + vat_table.position + ":" + vat_table.ref_allele + ":" + vat_table.alt_allele)

As far as I can tell, we unfortunately don’t have an equivalent to the contig_recoding keyword for import_table, but someone else from the team may be able to chime in if I’m missing something there.

Hope that helps!
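
Once vid is in the ‘chr1:10001:T:C’ form, a hedged sketch of parsing it and joining onto the MatrixTable follows; the variant_type field name comes from the question above, and the ‘snv’ filter value is only an example.

# parse the reformatted ID into locus/alleles and key the annotation table by them
parsed = hl.parse_variant(vat_table.vid, reference_genome='GRCh38')
vat_table = vat_table.annotate(locus=parsed.locus, alleles=parsed.alleles)
vat_table = vat_table.key_by('locus', 'alleles')

# annotate the [locus, alleles]-keyed MatrixTable and filter on the variant type
mt = mt.annotate_rows(variant_type=vat_table[mt.row_key].variant_type)
mt = mt.filter_rows(mt.variant_type == 'snv')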

Appending/merging/combining multiple VCF files using Hail

Aks said:

Hello Everyone,

I have a list of VCF files from specific ethnicity such as American Indian, Chinese, European etc

Under each ethnicity, I have around 100+ files.

Currently, I computed the VARIANT QC metrics such as call_rate, n_het etc for one file as shown in the hail tutorial (refer image below)

image is here

However, now I would like to have one file for each ethnicity and then compute VARIANT_QC metrics.

I already referred to this post and this post but don’t think this addresses my query

How can I do this across all files under a specific ethnicity?

Can you help me with this?

Is there a Hail way to do this, or a Python or R approach that veterans here are aware of?

tpoterba said:

You’re generally going to have a better time if you combine everything into one dataset and compute statistics on subsets from that dataset, rather than iterating through many smaller datasets. We need more information about these input VCFs, though. Are these project VCFs (GT, GQ, etc FORMAT fields) for a group of samples from sequencing data? Those cannot be losslessly combined (a site might appear in one VCF but not another).

If your VCFs are genotype data, it’s probably possible to combine since those have the same set of variants.
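
For that genotype-data case, a minimal, hedged sketch of combining per-cohort VCFs that share the same variant sites is below; vcf_paths and the reference genome are assumptions, and union_cols requires the row keys to line up across files.

import functools
import hail as hl

# import each VCF as a MatrixTable (all files must contain the same variant sites)
mts = [hl.import_vcf(path, reference_genome='GRCh38') for path in vcf_paths]

# stack the samples side by side, then run QC on the combined dataset
combined = functools.reduce(lambda left, right: left.union_cols(right), mts)
combined = hl.variant_qc(combined)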

Aks said:

Hi tpoterba

tpoterba:

Are these project VCFs (GT, GQ, etc FORMAT fields) for a group of samples from sequencing data? Those cannot be losslessly combined (a site might appear in one VCF but not another).

Yes, my VCF file contains fields such as GT, GP, etc FORMAT fields.

I also did a mt.describe() for one file from each ethnicity and they all look the same as shown below

image

Does hail have any function to combine all these files together? I see VCF combiner but I believe this is different from what I am looking for (they talk about gVCF which I think is different from my file which has vcf.gz extension)

Support importing phased BGEN files

RossDeVito said:

Hey team,

Hail preserves phase information when importing VCF files, but import_bgen() notes that it does not. Is there something about BGEN files that makes this harder? It would be useful for my current work simulating and modeling haplotype level cis interactions.

Hail SPARK version error

jin0008 said:

Hi I want to install Hail in my ubuntu.
But when I run simple hail query, I got this message.
Could you give me some advice for me?
Thanks

This Hail JAR was compiled for Spark 3.1.3, cannot run with Spark 3.4.1.
The major and minor versions must agree, though the patch version can differ.

danking said:

You need to install Apache Spark 3.1.3. You have Apache Spark 3.4.1. Try:

pip3 install pyspark==3.1.3
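
A quick, hedged check that the reinstall took effect: the version PySpark reports from Python should now match the Spark version the Hail JAR was built for.

import pyspark

# the Hail JAR in the error above was built for Spark 3.1.x, so this should print 3.1.3
print(pyspark.__version__)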

An error occurred while calling z:is.hail.backend.spark.SparkBackend.apply. : java.lang.ExceptionInInitializerError

ldomenech said:

Hello,

I installed hail in my Mac M2 Pro with pip:

pip install hail

Then I tried to run my first query, using your example, to verify everything was okay:

ipython
import hail as hl
mt = hl.balding_nichols_model(n_populations=3,
                              n_samples=10,
                              n_variants=100)
mt.show()

But the command

mt = hl.balding_nichols_model(n_populations=3,
                              n_samples=10,
                              n_variants=100)

gives me an error that I don’t know how to solve:

An error occurred while calling z:is.hail.backend.spark.SparkBackend.apply.
: java.lang.ExceptionInInitializerError

I was trying to do what you specified here How do I fix "java.net.BindException: Can't assign requested address"? - #4 by tpoterba :

"First try this:

sudo /bin/sh -c 'echo "127.0.0.1 $(hostname)" >> /etc/hosts'
Now try to use Hail. If Hail gives the same error, try this:

echo 'export PYSPARK_SUBMIT_ARGS="--driver-memory 16g --executor-memory 16g --conf spark.driver.bindAddress=127.0.0.1 pyspark-shell"' >> ~/.$(basename $SHELL)rc

Now try Hail again. If it still fails, respond to this thread."

But no success.

Some other info that may be useful:

java --version
openjdk 21 2023-09-19

OpenJDK Runtime Environment Zulu21.28+85-CA (build 21+35)

OpenJDK 64-Bit Server VM Zulu21.28+85-CA (build 21+35, mixed mode, sharing)

$python --version

Python 3.11.5



$ipython --version

8.15.0

Could you please help me to solve this?

Thank you very much!

danking said:

Hey @ldomenech !

Can you post the complete stack trace and any other output? I need all that info to be certain what is wrong.

In the meantime, you should try using Java 8 or Java 11. I’m not certain Hail will work with Java 21. We have some links about install java here. If you’re using a Mac, you can check these instructions for changing the version of Java once you’ve installed an older version:

I have Mac OS X. But I already installed Java 10! How can I install Java 8 and make hail work? This is the error message I’m getting! py4j.protocol.Py4JJavaError: An error occurred while calling z:is.hail.HailContext.apply. : is.hail.utils.HailException: Hail requires Java 8, found 10.0.1 at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:9) at is.hail.utils.package$.fatal(package.scala:26) at is.hail.HailContext$.apply(HailContext.scala:221) at is.hail.HailContext.apply(HailCo…
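
A small, hedged check you can run from Python to see which Java the java binary on your PATH reports; the Hail versions discussed here expect Java 8 or 11.

import subprocess

# `java -version` writes to stderr, so print that stream
result = subprocess.run(['java', '-version'], capture_output=True, text=True)
print(result.stderr)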

ldomenech said:

danking:

stack trace

Hi @danking ,

I’ve installed Java 8 and used it, and the example query is working now!

Thank you so much!

(let me know if you want more information about the error anyway)

When to densify?

skoyamamd said:

Hi Dan King,

Thank you for your insightful presentation today!
Per our conversation, I’d appreciate it if you could take a moment to assess the code below for computational equivalence. Our goal here is to extract biallelic variants with AC >= 100
I expect bypassing densifying entire vds will significantly reduce computational time.

While my initial tests suggest consistent results, I want to ensure its reliability across various scenarios. Your expertise in this review would be highly appreciated.

## Code1
vds = hl.vds.read_vds("some_very_big.vds")
mt = vds.variant_data
mt = hl.split_multi_hts(mt)
mt = hl.variant_qc(mt)
mt = mt.annotate_rows(AC100 = mt.variant_qc.AC[1] > 99)
mt = mt.filter_rows(mt.AC100)
vds.variant_data = mt
mt = hl.vds.to_dense_mt(vds)
mt.write("filtered.mt", overwrite=True)
## Code2
vds = hl.vds.read_vds("some_very_big.vds")
mt = hl.vds.to_dense_mt(vds)
mt = hl.split_multi_hts(mt)
mt = hl.variant_qc(mt)
mt = mt.annotate_rows(AC100 = mt.variant_qc.AC[1] > 99)
mt = mt.filter_rows(mt.AC100)
mt.write("filtered.mt", overwrite=True)

danking said:

OK, Patrick and I gave this a thought and we agree that both should have the same output. If a variant row doesn’t exist, it won’t exist in the densification. If you remove a row after densification, it won’t exist.

The presence or absence of a variant row should not affect the value of other variant rows. Densified variant rows are affected by the source variant row and the overlapping reference data. You don’t touch the reference data, so there should be no effect.

We’ll look into why Hail doesn’t automatically push the filter up into the variant data.

danking said:

Hmm. I think part of the challenge for Hail automatically doing this is that variant_qc very much depends on reference data, but you’ve cleverly chosen to only look at the first alternate allele’s allele count.

I think it’d be hard for us to automatically do this optimization. We’ll keep this in mind as we develop further.

danking said:

And an issue to track the optimization: [query] Hail should be able to automatically push backwards alternate allele only VDS filters · Issue #13695 · hail-is/hail · GitHub


Docker error when running VEP

dlcotter said:

I’m running the following code:

import hail as hl
hl.init()
all_genomes_mt = hl.read_matrix_table( f'gs://{genomes_mt_path}' )
result = hl.vep(all_genomes_mt)

And getting the following error:

hail.utils.java.FatalError: HailException: VEP command '/vep --format vcf --json --everything --allele_number --no_stats --cache --offline --minimal --assembly GRCh38 --fasta /opt/vep/.vep/homo_sapiens/95_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz --plugin LoF,loftee_path:/opt/vep/Plugins/,gerp_bigwig:/opt/vep/.vep/gerp_conservation_scores.homo_sapiens.GRCh38.bw,human_ancestor_fa:/opt/vep/.vep/human_ancestor.fa.gz,conservation_file:/opt/vep/.vep/loftee.sql --dir_plugins /opt/vep/Plugins/ -o STDOUT' failed with non-zero exit status 125
  VEP Error output:
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

I used the following command to create the Dataproc cluster the above is running on:

hailctl dataproc start vep-cluster \
    --image-version=2.1.7-debian11 \
    --autoscaling-policy=autoscaling-policy  \
    --master-machine-type=n1-highmem-8 \
    --worker-machine-type=n1-highmem-8 \
    --worker-boot-disk-size=1000 \
    --secondary-worker-type=non-preemptible \
    --preemptible-worker-boot-disk-size=1000 \
    --properties=dataproc:dataproc.logging.stackdriver.enable=true,dataproc:dataproc.monitoring.stackdriver.enable=true,spark:spark.sql.shuffle.partitions=24240,spark:spark.default.parallelism=24240 \
    --vep GRCh38

I’ve kept everything very close to default since this is the first time I’ve run VEP with Hail, but the default configuration doesn’t seem to work. Do I need to configure Docker a certain way? Or change something about the way I’m creating the cluster?

Thank you,
Daniel Cotter

danking said:

Hey @dlcotter !

My apologies! This should be fixed in latest Hail 0.2.123. The issue with some more information about root cause is #12936.

danking said:

A post was split to a new topic: LowerUninterpretable error while running VEP

Sample id name problem

guido said:

I have a problem: does Hail only support a sample ID field named “s”?

Best,
meng

iris-garden said:

Hi Meng,

I’m not entirely clear on what your question is. Would you mind sharing a code example that demonstrates the problem you’re running into? That should make it easier for us to help.

Thanks!

Hail progress report UKBB RAP

yluo411 said:

Hi,

May I ask whether Hail provides any command or interface to print log files showing the current status or progress? For instance, I am going to run Hail on the UKBB RAP for 500,000 samples, so I need to print what percentage is done and how much is left in order to track progress. I would appreciate any suggestions. Thank you.

Best,
Yuyang

danking said:

Hi @yluo411 !

Hail uses Apache Spark for distributed computing. Apache Spark will give you a progress bar that shows you how many partitions remain in the current stage. However, in general, Spark cannot tell you how many more stages are left.

If you’re looking for detailed information about what’s happening in the cluster, the Apache Spark status page is the right place to look. That’s usually served on port 4040 on a master node of the Spark cluster.

Hail also writes everything it’s doing to a Hail log file. The location of that file is printed when Hail initializes. You can also set a specific location using hl.init. In general, the log file is very large and hard for users to interpret.
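
A minimal sketch of choosing the log location up front; the path is illustrative.

import hail as hl

# write the (very verbose) Hail log next to your notebook instead of a temp directory
hl.init(log='./hail_ukbb_run.log')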

Can you share more specifics about what you want to know about your pipeline?

Future support of Elasticsearch version 8

souckmi said:

Hello everyone,

I wanted to ask whether the hail.methods.export_elasticsearch() method will support Elasticsearch version 8 in the near future, since according to the documentation it currently supports Elasticsearch versions 6.8.x - 7.x.x.

Thanks for the answer!

danking said:

We don’t have plans to add version 8 support, but we welcome a pull request to add that support. My guess is that it’s a relatively easy change.


Error summary: ClassTooLargeException: Class too large

MsUTR said:

Hi all! I know this has been posted a few times, but I do not think that any of the solutions are applicable to this. I am trying to import a gnomAD VCF (gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz) with a simple chunk of code:

Hail version: 0.2.108-fc03e9d5dc08
Error summary: ClassTooLargeException: Class too large: __C30721collect_distributed_array_matrix_native_writer

I am not too sure how to resolve this, but weirdly, when I ran this on the non-liftover VCF (gnomad.exomes.r2.1.1.sites.vcf.bgz), it actually worked fine. Will appreciate any input on this matter!

danking said:

Hey, MsUTR , I’m sorry you’re running into this issue. We’ll look into it, but this kind of problem is hard to fix. The root cause is that this VCF file has a very large number of fields. Hail’s parser generates code so that it can parse these kinds of VCFs really fast but, due to JVM limitations, that code can grow too large.

This isn’t particularly high priority for us because gnomAD has publicly released Hail Tables for these VCFs. Using a Hail Table avoids the parsing problem and saves you the cost of importing. Can you use one of these tables instead?

  1. gs://gcp-public-data--gnomad/release/2.1.1/liftover_grch38/ht/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.ht
  2. s3://gnomad-public-us-east-1/release/2.1.1/liftover_grch38/ht/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.ht
  3. https://datasetgnomad.blob.core.windows.net/dataset/release/2.1.1/liftover_grch38/ht/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.ht
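
A hedged sketch of reading the released Hail Table directly instead of importing the VCF; the path is the Google Cloud one listed above.

import hail as hl

# reading the pre-built sites table avoids VCF parsing (and the generated-code size limit) entirely
ht = hl.read_table(
    'gs://gcp-public-data--gnomad/release/2.1.1/liftover_grch38/ht/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.ht'
)
ht.describe()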

danking said:

Hmm.

I was unable to reproduce this, so I closed the GitHub issue I had created for the issue. [query] gnomAD 2.1.1 sites table VCF is not parseable with Hail · Issue #13249 · hail-is/hail · GitHub .


Combine GVCF to VDS (variant dataset in Hail) and convert VDS to VCF

Dong said:

Hi all

I combined GVCF 5K files to one VDS by using Hail library

And I get vds (variant dataset) directory

So I want to read vds data and convert MT (matrix table including VCF data in Hail)

But I can not convert MT to VCF format files with INFO field

Below code is my code that combine multiple GVCF to one vds and convert it to vcf)

import glob
import hail as hl
hl.init()

from gnomad.utils.annotations import (
    get_adj_expr,
    get_lowqual_expr,
)
from gnomad.utils.sparse_mt import (
    get_as_info_expr,
    get_site_info_expr,
    INFO_INT32_SUM_AGG_FIELDS,
    INFO_SUM_AGG_FIELDS,
    split_info_annotation,
    split_lowqual_annotation,
)

# combine multiple GVCF to VDS
gvcf_list = glob.glob("*.gvcf.gz")
combiner = hl.vds.new_combiner(
    gvcf_paths = gvcf_list,
    output_path = output.vds,
    use_genome_default_intervals=True,
    reference_genome = "GRCh38"
)
combiner.run()

# read vds data and make mt
vds = hl.vds.read_vds(output_vds)
mt = hl.vds.to_dense_mt(vds)
mt = mt.filter_rows((hl.len(mt.alleles) > 1))
mt = mt.transmute_entries(**mt.gvcf_info)
mt = mt.annotate_rows(alt_alleles_range_array=hl.range(1, hl.len(mt.alleles)))

# make INFO expression
info_expr = get_site_info_expr(
    mt,
    sum_agg_fields=[],
    int32_sum_agg_fields=[],
    array_sum_agg_fields=['SB']
)

info_expr = info_expr.annotate(
    **get_as_info_expr(
        mt,
        sum_agg_fields=[],
        int32_sum_agg_fields=[],
        array_sum_agg_fields=['SB']
    )
)

# Add AC and AC_raw
# First compute ACs for each non-ref allele, grouped by adj
grp_ac_expr = hl.agg.array_agg(
    lambda ai: hl.agg.filter(
        mt.LA.contains(ai),
        hl.agg.group_by(
            get_adj_expr(mt.LGT, mt.GQ, mt.DP, mt.LAD),
            hl.agg.sum(
                mt.LGT.one_hot_alleles(mt.LA.map(hl.str))[mt.LA.index(ai)]
            ),
        ),
    ),
    mt.alt_alleles_range_array,
)

# Then, for each non-ref allele, compute
#   AC as the adj group
#   AC_raw as the sum of adj and non-adj groups
info_expr = info_expr.annotate(
    AC_raw=grp_ac_expr.map(
        lambda i: hl.int32(i.get(True, 0) + i.get(False, 0))
    ),
    AC=grp_ac_expr.map(lambda i: hl.int32(i.get(True, 0))),
)

# Annotating raw MT with pab max
info_expr = info_expr.annotate(
    AS_pab_max=hl.agg.array_agg(
        lambda ai: hl.agg.filter(
            mt.LA.contains(ai) & mt.LGT.is_het(),
            hl.agg.max(
                hl.binom_test(
                    mt.LAD[mt.LA.index(ai)], hl.sum(mt.LAD), 0.5, 'two-sided'
                )
            ),
        ),
        mt.alt_alleles_range_array,
    )
)

info_ht = mt.select_rows(info=info_expr).rows()

mt = mt.annotate_rows(info=info_ht[mt.row_key].info)
mt = mt.annotate_entries(
    GT = hl.vds.lgt_to_gt(mt.LGT, mt.LA)
)

drop_to_field = ['BaseQRankSum',
                 'ExcessHet',
                 'InbreedingCoeff',
                 'MLEAC',
                 'MLEAF',
                 'MQRankSum',
                 'RAW_MQandDP',
                 'ReadPosRankSum',
                 "LGT"]

mt = mt.drop(*drop_to_field)
hl.export_vcf(mt, output.vcf.bgz)

I got error message
“Error summary: HailException: VCF does not support SIndexablePointer(PCArray[PInt32])”

What is my fault?

Thank you.

danking said:

Hey @Dong !

My apologies, this is our fault. You should never see this kind of error message. I’ve asked a member of the development team to figure out why this is happening and to fix it for you. We’ll respond with a quick-fix if we have one.

danking said:

@Dong , it looks likely that you’re trying to export to VCF a field which is an array of arrays of integers. Can you share the output of mt.describe() just before hl.export_vcf?

VCF doesn’t permit arrays of arrays, so you’ll need to either use a TSV/CSV or change that field to not be an array of arrays.

Dong said:

Hi @danking

it’s my mt.describe() result

>>> mt.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'alt_alleles_range_array': array<int32>
    'info': struct {
        ReadPosRankSum: float64, 
        MQRankSum: float64, 
        SB: array<int32>, 
        FS: float64, 
        SOR: float64, 
        AS_ReadPosRankSum: array<float64>, 
        AS_MQRankSum: array<float64>, 
        AS_SB_TABLE: array<array<int32>>, 
        AS_FS: array<float64>, 
        AS_SOR: array<float64>, 
        AC_raw: array<int32>, 
        AC: array<int32>, 
        AS_pab_max: array<float64>
    }
----------------------------------------
Entry fields:
    'DP': int32
    'GQ': int32
    'MIN_DP': int32
    'LA': array<int32>
    'LAD': array<int32>
    'LPGT': call
    'LPL': array<int32>
    'RGQ': int32
    'PID': str
    'PS': int32
    'SB': array<int32>
    'GT': call
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

Thank you

danking said:

Yeah, @Dong, the issue is AS_SB_TABLE. VCF doesn’t support arrays of arrays. You have to choose a different representation. For example, you might do this:

mt = mt.annotate(
    info = mt.info.annotate(
        AS_SB_TABLE = mt.info.AS_SB_TABLE.map(
            lambda inner_array: hl.str('|').join(inner_array)
        )
    )
)

But be aware! Whoever consumes this VCF will need to know how to handle AS_SB_TABLE’s |-delimited nature.

Dong said:

To @danking
Thank you for your reply

But I got error message.
I changed some code

mt.annotate → mt.annotate_rows

hl.str(‘|’).join(inner_array) → hl.str(‘|’).join(list(map(lambda x : str(x), inner_array)))
because I got an error “TypeError: Expected str collection, int32 found”

so I run below code

mt = mt.annotate_rows(
    info = mt.info.annotate(
        AS_SB_TABLE = mt.info.AS_SB_TABLE.map(
            lambda inner_array : hl.str('|').join(
                list(map(lambda x : str(x), inner_array))
            )
        )
    )
)

Error message …

hail.expr.expressions.base_expression.ExpressionException: <ArrayNumericExpression of type array> object is not iterable

How can i fix it?

Thank you!!

danking said:

@Dong ,

You have to use the Hail versions of all these methods:

mt = mt.annotate_rows(
    info = mt.info.annotate(
        AS_SB_TABLE = mt.info.AS_SB_TABLE.map(
            lambda inner_array : hl.str('|').join(
                hl.map(lambda x: hl.str(x), inner_array)
            )
        )
    )
)

hl.list is not necessary in this case.
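
As an aside, a hedged equivalent using hl.delimit, which stringifies and joins the inner array in one step, would look like this:

mt = mt.annotate_rows(
    info = mt.info.annotate(
        AS_SB_TABLE = mt.info.AS_SB_TABLE.map(
            # hl.delimit converts each element to a string and joins with the delimiter
            lambda inner_array: hl.delimit(inner_array, '|')
        )
    )
)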


Hail stuck when running array data

hyslin said:

Hi,

Each time I run a Hail GWAS in All of Us with array data, I encounter issues when trying to save a checkpoint or when running logistic regression. I believe my code is concise, and considering the data size in the latest version (v7) of the array data, it should be manageable. However, the program consistently gets stuck, usually at (0 + 48) / 74 .

2023-10-03 16:24:19.862 Hail: INFO: logistic_regression_rows: running wald on 6529 samples for response variable y,
    with input variable x, and 5 additional covariates...
[Stage 4:>                                                        (0 + 48) / 74]

The array data I used was:

# read in microarray data
mt = hl.read_matrix_table('gs://fc-aou-datasets-controlled/v7/microarray/hail.mt/')

After annotating this matrix table with my phenotype/ancestry file, conducting some quick QC, and attempting to run logistic regression on each variant, the program gets stuck.

However, if I randomly select some variants across the genome to run the GWAS, the program runs smoothly and at a good speed.

# only use a subset of intervals
mt = hl.filter_intervals(mt, [hl.parse_locus_interval(x) for x in allintervals[1:4000]])

2023-10-03 16:21:06.348 Hail: INFO: logistic_regression_rows: running wald on 6529 samples for response variable y,
    with input variable x, and 5 additional covariates...
2023-10-03 16:22:20.280 Hail: INFO: wrote table with 745 rows in 64 partitions to /tmp/persist_table4hGORAWhdh
    Total size: 45.17 KiB
    * Rows: 45.15 KiB
    * Globals: 11.00 B
    * Smallest partition: 1 rows (125.00 B)
    * Largest partition:  26 rows (1.47 KiB)

So, I assumed that the size of the matrix table might be the cause of the issue. However, how do I resolve this?

danking said:

Hey @hyslin ,

I’m sorry to hear Hail is giving you trouble!

How many variants and samples are in this matrix table? mt.count() will give you that. It looks like you have 74 partitions, which is a very very small number of partitions, particularly if you’re working with a large dataset. It also looks like your cluster only has 48 cores, which is very small for working with large datasets. Folks typically use large clusters (100-1000 cores) for brief periods of time. Using an autoscaling policy is really helpful for doing this, but I’m not sure if those are available inside the AoU RWB yet.

Your second example is running logistic regression on just 745 rows, so I’m not surprised that it is faster.

I expect logistic regression to scale roughly linearly in the variants.

Can you also share the full script you ran? There could be innocuous looking things that are slowing down the analysis.

hyslin said:

Hi danking,

Thank you for your reply!

I had 1739268 variants and 6529 samples before QC. I didn’t get a chance to check how many variants and samples were left after QC because it took too long to compute. I guess QC is the most time intensive and computation heavy step?

Initially, I used the default Hail Genomic Analysis environment setting plus 10 Preemptible workers. As I increased those preemp. workers to 100, the program no longer got stuck at ( 0 + 48 ) / 74, but it still ran quite slowly and seemed to stop working at (12 + 62 ) / 74. Are there any other configurations I could change to significantly reduce the computation time?

Here I am pasting my full script for the first example:

## import packages
import os
import hail as hl

# set up hail
hl.init(default_reference = "GRCh38")

# get the workspace bucket
bucket = os.getenv("WORKSPACE_BUCKET")
bucket

# what's in my bucket
!gsutil ls {bucket}/data

# read in microarray data
mt = hl.read_matrix_table('gs://fc-aou-datasets-controlled/v7/microarray/hail.mt/')

# read ancestry+phenotype file
sample_filename = 'gs://......../data/afa.tsv'
afa = (hl.import_table(sample_filename,
                       types={'research_id':hl.tstr},
                       impute=True,
                       key='research_id')
      )

# annotate ancestry and phenotype
mt = mt.annotate_cols(anc_pheno = afa[mt.s])
mt = mt.filter_cols(hl.is_defined(mt.anc_pheno), keep=True)

# how many variants and samples are there
mt.count()

# simple QC
mt = hl.variant_qc(mt)
mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.01, keep=True)
mt = mt.filter_rows(mt.variant_qc.call_rate > 0.95, keep=True)

# how many variants and samples are there after QC
#mt.count()

# also skip this
#mt = mt.checkpoint('gs://......./data/output/mt_checkpoint.mt')

# make covariates
covariates = [1.0, mt.anc_pheno.is_male,
              mt.anc_pheno.PC1,
              mt.anc_pheno.PC2,
              mt.anc_pheno.PC3]

# run logistic regression for each row
log_reg = hl.logistic_regression_rows(
    test='wald',
    y=mt.anc_pheno.is_case,
    x=mt.GT.n_alt_alleles(),
    covariates=covariates
)

# save to my bucket
log_reg.export(f'{bucket}/data/output/log_reg.tsv.bgz')

Any suggestions would be greatly appreciated!

danking said:

If most of your variants survive the filter, I’d expect this to take ~2000 minutes which is over a day.


Since you’re doing the QC process in the same pipeline as the logistic regression, it’s hard to know what the slow part is. Usually folks use two steps:

# do QC ...
mt.rows().select().write('gs://bucket/rows-that-pass-qc.ht')

ht = hl.read_table('gs://bucket/rows-that-pass-qc.ht')
print(f'{ht.count()} rows passing QC') # counting a table after read is instant

mt = hl.read_matrix_table(...)
mt = mt.semi_join_rows(ht) # filter to the passing variants

This also lets us diagnose how long we expect this to take.


Hmm. 74 partitions means you’re doing ~23000 logistic regressions per partition (assuming most variants pass filters). That’s a lot of regressions! We need to make this parameter a public & supported parameter, but for now try using _n_partitions to increase the number of partitions:

hl.read_matrix_table(..., _n_partitions=750)

I definitely recommend using checkpoint after importing a non-Hail-format dataset:

sample_filename = 'gs://......../data/afa.tsv'
afa = (hl.import_table(sample_filename,
                              types={'research_id':hl.tstr},
                              impute=True,
                              key='research_id')
             )
afa = afa.checkpoint('gs://...', overwrite=True)

I recommend against directly exporting as generating a compressed text file can be memory intensive. Instead try this:

log_reg.write(f'{bucket}/data/output/log_reg.ht')
log_reg = hl.read_table(f'{bucket}/data/output/log_reg.ht')
log_reg.export(...)

hyslin said:

Thank you so much, danking!

I modified my code using the suggestions you provided, and they have been very helpful! I have separated my code into different notebooks for running in the background. However, as I ran the QC notebook, I encountered an issue when writing out columns (samples). Could this be due to a memory issue? The error message appears sporadically, which is quite disappointing. What steps can I take to prevent this from happening?
(for running this notebook I used 100 preemptible workers and 5000 partitions)

The very last step that cause error:

# write out QC'ed variants and samples
mt.rows().select().write(f'{bucket}/data/qc_var.ht', overwrite=True)
mt.cols().select().write(f'{bucket}/data/qc_sam.ht', overwrite=True)
2023-10-08 20:31:10.452 Hail: INFO: wrote table with 1046146 rows in 74 partitions to gs://fc-secure-80e50d85-fbc9-4359-9a7b-e4948a1089e0/data/qc_var.ht
2023-10-08 20:31:11.236 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1224, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1038, in send_command
response = connection.send_command(command)
File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1229, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving

LowerUninterpretable error while running VEP

dlcotter said:

Hi Dan,
I upgraded Hail to 0.2.124, and VEP is working on the cluster, which is great, but I am running into a different error now, and I can’t tell from the error message if it is being caused by the input to VEP or something else.

Here is the code:

all_genomes_mt = hl.read_matrix_table( f'gs://{genomes_mt_path}' )
just_variants = all_genomes_mt.rows()
vep_results = hl.vep( just_variants )

And here is the full stacktrace:

Traceback (most recent call last):                      (16566 + 2504) / 121044]
  File "/tmp/d9c64535d84a478bba71557458f7876a/run-vep.py", line 31, in <module>
    vep_results = hl.vep( just_variants )
  File "<decorator-gen-1752>", line 2, in vep
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 587, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/methods/qc.py", line 1181, in vep
    'tolerateParseError': tolerate_parse_error})).persist()
  File "<decorator-gen-1224>", line 2, in persist
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 587, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 2112, in persist
    return Env.backend().persist(self)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/backend.py", line 208, in persist
    persisted = dataset.checkpoint(tempfile.__enter__())
  File "<decorator-gen-1214>", line 2, in checkpoint
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 587, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 1331, in checkpoint
    self.write(output=output, overwrite=overwrite, stage_locally=stage_locally, _codec_spec=_codec_spec)
  File "<decorator-gen-1216>", line 2, in write
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 587, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 1377, in write
    Env.backend().execute(ir.TableWrite(self._tir, ir.TableNativeWriter(output, overwrite, stage_locally, _codec_spec)))
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 82, in execute
    raise e.maybe_user_error(ir) from None
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 76, in execute
    result_tuple = self._jbackend.executeEncode(jir, stream_codec, timed)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 35, in deco
    raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
hail.utils.java.FatalError: SparkException: Job 0 cancelled because SparkContext was shut down

Java stack trace:
org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1(DAGScheduler.scala:1188)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1$adapted(DAGScheduler.scala:1186)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:1186)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:2887)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2784)
at org.apache.spark.SparkContext.$anonfun$stop$11(SparkContext.scala:2095)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1484)
at org.apache.spark.SparkContext.stop(SparkContext.scala:2095)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:125)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2293)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1021)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1020)
at is.hail.backend.spark.SparkBackend.parallelizeAndComputeWithIndex(SparkBackend.scala:406)
at is.hail.backend.BackendUtils.collectDArray(BackendUtils.scala:86)
at __C832Compiled.__m836split_Let(Emit.scala)
at __C832Compiled.apply(Emit.scala)
at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$4(CompileAndEvaluate.scala:61)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$2(CompileAndEvaluate.scala:61)
at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$2$adapted(CompileAndEvaluate.scala:59)
at is.hail.backend.ExecuteContext.$anonfun$scopedExecution$1(ExecuteContext.scala:140)
at is.hail.utils.package$.using(package.scala:637)
at is.hail.backend.ExecuteContext.scopedExecution(ExecuteContext.scala:140)
at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:59)
at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$apply$1(CompileAndEvaluate.scala:19)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.CompileAndEvaluate$.apply(CompileAndEvaluate.scala:19)
at is.hail.expr.ir.TableWriter.apply(TableWriter.scala:46)
at is.hail.expr.ir.Interpret$.run(Interpret.scala:865)
at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:59)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:20)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:58)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:63)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:77)
at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:26)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:26)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:24)
at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:23)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:72)
at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:22)
at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:20)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:20)
at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:50)
at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:505)
at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:541)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:76)
at is.hail.utils.package$.using(package.scala:637)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:76)
at is.hail.utils.package$.using(package.scala:637)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:62)
at is.hail.backend.spark.SparkBackend.$anonfun$withExecuteContext$1(SparkBackend.scala:345)
at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:538)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:537)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)

The Hail logs on the master node say the error preceding the shutdown of the SparkContext was an "error while applying lowering 'LowerOrInterpretNonCompilable'", presumably in reference to is.hail.expr.ir.LowerOrInterpretNonCompilable, but I don’t know what that function does (I looked at the source, but couldn’t even guess what it was trying to do, specifically). Any idea where I should start troubleshooting?

Thanks,
Daniel Cotter

danking said:

Hey @dlcotter ! Can you share the log file? You can upload it to gist.github.com if its too large to upload here.

dlcotter said:

Attached.
burden-testing_hail-20230929-1752-0.2.124-13536b531342.log (606.6 KB)


PCA Projection onto existing PCA

soisa001 said:

I have a srWGS (41 million SNPs) and lrWGS (6 million SNPs) Hail MT. The lrWGS MT SNPs are a subset of the srWGS SNPs (same variant sites, but with slightly different GT calls). The samples are the same ID between both.

I am looking to generate a PCA of the pruned srWGS SNPs, and then generate a PCA of the pruned lrWGS SNPs, and project the new points onto the existing srWGS dataset. Essentially, this would show how “different” or impactful the missing SNPs are in the lrWGS datset as compared to the srWGS dataset. Is this doable with Hail’s built in functions?

If I merge the two tables together (rename lrWGS sample IDs), then generate a single PCA, would this yield the same result as above?

Thank you.

patrick-schultz said:

Hi @soisa001,
Normally we advise to only perform PCA using common variants, which I believe is standard practice (I’m not a scientist). But it seems you want to verify that practice?

Hail’s PCA method returns all the information you should need to compute some measure of “difference” of the two PCAs, but how to do that would depend on what measure you want to use. Did you have an idea how you want to do that?

If I merge the two tables together (rename lrWGS sample IDs), then generate a single PCA, would this yield the same result as above?

This would definitely be different than either two datasets alone. I think this is probably not what you want to do.

patrick-schultz said:

Also be aware that PCA with 41 million SNPs would be an absolutely massive computation. Hail has a scalable PCA method which should be able to do it, but I’m not sure if a PCA of that scale has been attempted before, or how expensive it would be.

soisa001 said:

Hi Patrick,

Thanks for the quick response. I was looking to do something similar to https://github.com/DReichLab/EIG/blob/master/POPGEN/lsqproject.pdf
where you could consider the srWGS as the complete dataset, and the lrWGS as like an archaic dataset (lrWGS is low coverage and has a subset of variants of the srWGS).
First I would generate the srWGS PCA and then “project” the lrWGS samples onto that.

Would this be possible in Hail?
I would probably downsample variants as well before creating the PCA, so that it’s not so large.

Thanks

patrick-schultz said:

Thanks for the reference, I think I understand now. To make sure: you would downsample to common variants, but even then, some of those variants would be missing (at least in some samples) in lrWGS, and you want to project the lrWGS samples into the PC space of srWGS, using the method in the pdf you linked rather than mean imputing the missing variants. Is that right?

soisa001 said:

Yes, that's the general gist of it. We are also interested in computing the srWGS space with the complete variant set (i.e., not downsampling), then projecting the lrWGS onto it.
It sounds like the current Hail method imputes missing information?
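
A rough sketch of one way to set this up in Hail: run hl.hwe_normalized_pca on the srWGS data with compute_loadings=True, then use hl.experimental.pc_project to place the lrWGS samples in that PC space. If I read it correctly, pc_project normalizes by the reference allele frequency and lets missing calls drop out of the dot product (effectively mean imputation in the normalized space), rather than doing the least-squares projection from the linked note. Names like sr_mt, lr_mt, and pca_af are assumptions for illustration:

import hail as hl

# Assumed inputs: sr_mt (srWGS) and lr_mt (lrWGS) MatrixTables with GT entries,
# both already restricted to the common, LD-pruned variants to be used for PCA.

# 1. PCA on the srWGS data, keeping the loadings.
eigenvalues, sr_scores, sr_loadings = hl.hwe_normalized_pca(sr_mt.GT, k=10, compute_loadings=True)

# 2. Attach the srWGS allele frequency to the loadings table;
#    pc_project needs it to normalize the genotypes being projected.
sr_af = sr_mt.annotate_rows(pca_af=hl.agg.mean(sr_mt.GT.n_alt_alleles()) / 2).rows()
sr_loadings = sr_loadings.annotate(pca_af=sr_af[sr_loadings.key].pca_af)

# 3. Project the lrWGS samples into the srWGS PC space.
lr_scores = hl.experimental.pc_project(lr_mt.GT, sr_loadings.loadings, sr_loadings.pca_af)

sr_scores.show(5)
lr_scores.show(5)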

Help reformatting variant ID string to annotate MatrixTable keyed by locus, allele

jbs said:

Hello! I’m new to Hail so apologies if this question has been asked before or is a bit basic.

I am trying to annotate a matrix table of genomic data (mt), keyed by [locus, allele], with variant annotations from another Hail table (vat_table). Specifically, I would like to eventually filter the variants in mt by their corresponding variant type, found in vat_table.variant_type.

However, the fields in vat_table don't seem to be well formatted for parsing as loci. The variant ID field (vat_table.vid) is formatted as '1-10001-T-C' (str), which is not recognized by hl.parse_variant(). I can use the replace('-', ':') method to change the format, but then calling:

hl.parse_variant(vat_table.vid, reference_genome='GRCh38')

throws an error, since the contig needs to be in the form 'chr1'. One solution would be to concatenate 'chr' with each of the vat_table.vid character strings, but I am having trouble figuring out how to do this in Hail.

Another option would be to just concatenate four character-string fields, vat_table.contig (ex: 'chr1'), vat_table.position (ex: '10001'), vat_table.ref_allele (ex: 'T'), and vat_table.alt_allele (ex: 'C'), to form a DIY vid field. However, I am having trouble doing this as well.

A final option would be to change the format of the vat_table.vid field when I originally load the table with the call below:

vat_table = hl.import_table(vat_path, force=True, quote='"', delimiter="\t", force_bgz=True)

I see this thread advising on how to adjust the contig format when reading in a VCF, but I'm not sure if there is a nice way to adjust this from a tab-delimited file. I am unfortunately not able to make any changes directly in the source file.

Hopefully that is clear, and thanks so much for your help!

iris-garden said:

Hi,

If you’d like to go with the first option, you could do something along these lines:

vat_table = hl.import_table(...)
vat_table = vat_table.annotate(vid = "chr" + vat_table.vid.replace("-", ":"))

For the second,

vat_table = vat_table.annotate(
    vid = vat_table.contig + ":" + vat_table.position + ":" + vat_table.ref_allele + ":" + vat_table.alt_allele
)

As far as I can tell, we unfortunately don’t have an equivalent to the contig_recoding keyword for import_table, but someone else from the team may be able to chime in if I’m missing something there.

Hope that helps!
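
Putting the pieces together, a rough end-to-end sketch under the assumptions in the question (vat_path, a vid column like '1-10001-T-C', a variant_type column, and GRCh38 loci; the "snp" value in the filter is a placeholder for whatever category the VAT actually uses):

import hail as hl

vat_table = hl.import_table(vat_path, quote='"', delimiter="\t", force_bgz=True)

# Build a chr-prefixed, colon-delimited variant string and parse it into locus/alleles.
vat_table = vat_table.annotate(vid="chr" + vat_table.vid.replace("-", ":"))
parsed = hl.parse_variant(vat_table.vid, reference_genome="GRCh38")
vat_table = vat_table.annotate(locus=parsed.locus, alleles=parsed.alleles)
vat_table = vat_table.key_by("locus", "alleles")

# Annotate the genotype MatrixTable and filter on the VAT's variant type.
mt = mt.annotate_rows(variant_type=vat_table[mt.row_key].variant_type)
mt = mt.filter_rows(mt.variant_type == "snp")  # placeholder value; substitute the VAT's actual category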
