test-process's Issues

'GT' are present in MT but missing from VDS for same samples

akhattab said:

Hi, hope all is well.
I’m working on the All of Us platform to calculate some PRS. I’m using custom code and a Hail MT for the calculations, which works but is time-consuming and expensive.
Part of the code subsets the variants in the PRS weights df from the MT, and I came across @tpoterba’s post about filtering a VDS using hail.vds.filter_intervals, which is SUPER fast compared to the MT. The issue is that most samples are missing ‘GT’ even though they are present in the MT.
[screenshot]

My code:

PGS_sites = hl.import_locus_intervals(interval_file, reference_genome='GRCh38', skip_invalid_intervals=True)
vds_prs = hl.vds.filter_intervals(vds_subset, PGS_sites, keep=True)

The number of variants from a PRS score present in MT:
[screenshot]

The number of variants from the same PRS score present in VDS for the same cohort:
[screenshot]

Am I missing something here? Any thoughts?

Thank you!
Ahmed

danking said:

What code are you using to show the genotypes? vds.variant_data.GT.show() shows only the variant data. The VDS is a split representation: reference data and variant data are stored separately, each in its own Hail MatrixTable: vds.reference_data and vds.variant_data.
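
To make the split concrete, here is a minimal sketch (my illustration; the path is hypothetical and vds is assumed to be a VariantDataset) of inspecting the two components:

import hail as hl

vds = hl.vds.read_vds('gs://my-bucket/dataset.vds')  # hypothetical path

# The variant component holds only the called variant sites; this is what
# vds.variant_data.GT.show() displays.
vds.variant_data.describe()

# The reference component holds the homozygous-reference blocks.
vds.reference_data.describe()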

If, for your analysis, you need reference information (in particular, if you need to decide if a given sample is a confident homozygous reference call instead of a no call), you need to densify the VDS. There’s some more information about this at the AoU VDS page.

I recommend you seriously consider using the smaller, pre-densified callsets unless you absolutely need genome-wide data. Densification is not a cheap process: it converts a very sparse representation into a very large, dense one. This is particularly true if you only need GTs! The majority of the VDS is quality metadata like PL and AD.

akhattab said:

Hi @danking, thank you for your response!
Yes, I used vds.variant_data.GT.show() to show the genotypes.

danking:

If, for your analysis, you need reference information (in particular, if you need to decide if a given sample is a confident homozygous reference call instead of a no call)

Does that mean an NA could either be a no call or a homozygous reference? I queried vds.reference_data for some of the variants with NA in vds.variant_data, and they also have NA in the reference_data
[screenshots]

Is it safe to assume that the variants with NA are homozygous reference?

danking said:

Does that mean an NA could either be a no call or a homozygous reference?

Yes, that’s exactly what NA means.

The reference data is not stored row-by-row. The reference data is stored like a GVCF: there are “reference blocks” which implicitly span multiple rows.
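
As a minimal sketch of looking at those blocks directly (my addition; vds is assumed to be loaded, and the entry field names reflect my understanding of the reference component, so treat them as assumptions):

vds.reference_data.describe()          # entry schema; reference blocks typically carry an END field
vds.reference_data.entries().show(5)   # each row here is one reference block, covering its locus through END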

Is it safe to assume that the variants with NA are homozygous reference?

No, it is not.

If you need to confidently know if a sample is homozygous reference or no call you must use hl.vds.to_dense_mt:

mt = hl.vds.to_dense_mt(vds)
mt.GT.show()  # if you see NA here they are no calls

Note that densification is an expensive and time-consuming process! You probably want to filter to a small set of intervals-of-interest first using hl.vds.filter_intervals:

vds = hl.vds.filter_intervals(
    vds,
    hl.literal([
        hl.locus_interval('chr1', 123, 456)
    ])
)

If possible, I recommend using the pre-densified datasets because the All of Us project has already paid the cost of producing dense data at those loci!

akhattab said:

Great! Thank you so much, Dan. I appreciate the help!

akhattab said:

Hi @danking, sorry to keep bugging you with this but I have a follow-up question.
I’m trying to understand how Hail treats the missing calls.
When using mt.GT.show I could see the missing calls as NA:

[screenshot]

But when using hl.is_missing, they are still NA instead of True I guess:
[screenshot]

Same when using is_defined:
[screenshot]

My question is, why are these calls ‘ignored’?
Thank you!

danking said:

Hey! Sorry for the delay; I answered in your other question, “How does Hail treat missing calls?”

Potential Outdated Error Statement for Hail Version Incompatibility

darn_matren said:

Hello team,

I hit an odd error yesterday when reading in a Hail Table that a coworker had written in Hail 0.2.120 while I was working with Hail 0.2.113. The error said that I could not read the file because it was written in a more recent version of Hail - which makes sense, because once I updated my Hail version it worked - but oddly the message said that the Hail Table was written in format 1.7.0 and that my version, which supports 1.6.0, was not able to read it, which looked wrong! I was able to replicate the issue (after downgrading to Hail 0.2.113 in a conda environment) and I’ll paste it below:

Hail version: 0.2.113-cf32652c5077
Error summary: HailException: incompatible file format when reading: gs://filename.ht
supported version: 1.6.0, found 1.7.0
The cause of this error is usually an attempt to use an older version of Hail to read files generated by a newer version. This is not supported (Hail native files are back-compatible, but not forward-compatible).
To read this file, use a newer version of Hail.

danking said:

1.6.0 is a version of the file format. 0.2.113 is a version of the Hail Python library. I’ll update the message to be more clear.
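
For anyone hitting this, a quick way to check which library version is installed (the 0.2.x number, which is separate from the native file-format version reported in the error):

import hail as hl
print(hl.version())  # e.g. '0.2.113-cf32652c5077'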

danking said:

I’ve made this change here. I would love your feedback on the verbiage there!

darn_matren said:

That’s interesting, thanks for the quick response!

Error in Terra When Initializing HAIL

beneopp said:

I was unable to run the 01-genome-wide-association-study.ipynb notebook from Hail-Notebook-Tutorials. I received the following warning after the hl.init() command:

/opt/conda/lib/python3.10/site-packages/hailtop/aiocloud/aiogoogle/user_config.py:43: UserWarning: Reading spark-defaults.conf to determine GCS requester pays configuration. This is deprecated. Please use `hailctl config set gcs_requester_pays/project` and `hailctl config set gcs_requester_pays/buckets`.
  warnings.warn(
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/usr/lib/spark/jars/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.3.0
SparkUI available at http://saturn-26616ce0-7b3e-4578-bb69-4567cf490d15-m.c.terra-67a6826d.internal:37893

Then I got the following error when attempting to download the file:

2023-10-03 21:56:25.636 Hail: INFO: downloading 1KG VCF ...
  Source: https://storage.googleapis.com/hail-tutorial/1kg.vcf.bgz
2023-10-03 21:56:28.692 Hail: INFO: importing VCF and writing to matrix table...
---------------------------------------------------------------------------
Py4JNetworkError                          Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/py4j/clientserver.py:516, in ClientServerConnection.send_command(self, command)
    515 if answer.strip() == "":
--> 516     raise Py4JNetworkError("Answer from Java side is empty")
    517 if answer.startswith(proto.RETURN_MESSAGE):

Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Py4JNetworkError Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/py4j/java_gateway.py:1038, in GatewayClient.send_command(self, command, retry, binary)
1037 try:
-> 1038 response = connection.send_command(command)
1039 if binary:

File /opt/conda/lib/python3.10/site-packages/py4j/clientserver.py:539, in ClientServerConnection.send_command(self, command)
538 logger.info("Error while receiving.", exc_info=True)
--> 539 raise Py4JNetworkError(
540 "Error while sending or receiving", e, proto.ERROR_ON_RECEIVE)

Py4JNetworkError: Error while sending or receiving

During handling of the above exception, another exception occurred:

TypeError Traceback (most recent call last)
Cell In[3], line 1
----> 1 hl.utils.get_1kg('data/')

File /opt/conda/lib/python3.10/site-packages/hail/utils/tutorial.py:86, in get_1kg(output_dir, overwrite)
84 cluster_readable_vcf = _copy_to_tmp(fs, local_path_uri(tmp_vcf), extension='vcf.bgz')
85 info('importing VCF and writing to matrix table...')
---> 86 hl.import_vcf(cluster_readable_vcf, min_partitions=16).write(matrix_table_path, overwrite=True)
88 tmp_sample_annot = os.path.join(tmp_dir, '1kg_annotations.txt')
89 source = resources['1kg_annotations']

File <decorator-gen-1484>:2, in import_vcf(path, force, force_bgz, header_file, min_partitions, drop_samples, call_fields, reference_genome, contig_recoding, array_elements_required, skip_invalid_loci, entry_float_type, filter, find_replace, n_partitions, block_size, _create_row_uids, _create_col_uids)

File /opt/conda/lib/python3.10/site-packages/hail/typecheck/check.py:584, in _make_dec.<locals>.wrapper(__original_func, *args, **kwargs)
581 @decorator
582 def wrapper(__original_func, *args, **kwargs):
583     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 584     return __original_func(*args_, **kwargs_)

File /opt/conda/lib/python3.10/site-packages/hail/methods/impex.py:2822, in import_vcf(path, force, force_bgz, header_file, min_partitions, drop_samples, call_fields, reference_genome, contig_recoding, array_elements_required, skip_invalid_loci, entry_float_type, filter, find_replace, n_partitions, block_size, _create_row_uids, _create_col_uids)
2812 hl.utils.warning(
2813 f'You are trying to read {path} with ONE core of parallelism. This '
2814 'will be very slow. If this file is block-gzipped (bgzip-ed), use '
2815 'force_bgz=True instead.'
2816 )
2818 reader = ir.MatrixVCFReader(path, call_fields, entry_float_type, header_file,
2819 n_partitions, block_size, min_partitions,
2820 reference_genome, contig_recoding, array_elements_required,
2821 skip_invalid_loci, force_bgz, force, filter, find_replace)
-> 2822 return MatrixTable(ir.MatrixRead(reader, drop_cols=drop_samples, drop_row_uids=not _create_row_uids, drop_col_uids=not _create_col_uids))

File /opt/conda/lib/python3.10/site-packages/hail/matrixtable.py:709, in MatrixTable.init(self, mir)
706 self._col_indices = Indices(self, {self._col_axis})
707 self._entry_indices = Indices(self, {self._row_axis, self._col_axis})
--> 709 self._type = self._mir.typ
711 self._global_type = self._type.global_type
712 self._col_type = self._type.col_type

File /opt/conda/lib/python3.10/site-packages/hail/ir/base_ir.py:494, in MatrixIR.typ(self)
491 @property
492 def typ(self):
493 if self._type is None:
--> 494 self.compute_type(deep_typecheck=False)
495 return self._type

File /opt/conda/lib/python3.10/site-packages/hail/ir/base_ir.py:485, in MatrixIR.compute_type(self, deep_typecheck)
483 def compute_type(self, deep_typecheck):
484 if deep_typecheck or self._type is None:
--> 485 computed = self._compute_type(deep_typecheck)
486 if self._type is not None:
487 assert self._type == computed

File /opt/conda/lib/python3.10/site-packages/hail/ir/matrix_ir.py:185, in MatrixRead._compute_type(self, deep_typecheck)
183 def _compute_type(self, deep_typecheck):
184 if self._type is None:
--> 185 return Env.backend().matrix_type(self)
186 else:
187 return self._type

File /opt/conda/lib/python3.10/site-packages/hail/backend/py4j_backend.py:184, in Py4JBackend.matrix_type(self, mir)
183 def matrix_type(self, mir):
--> 184 jir = self._to_java_matrix_ir(mir)
185 return tmatrix._from_java(jir.typ())

File /opt/conda/lib/python3.10/site-packages/hail/backend/py4j_backend.py:170, in Py4JBackend._to_java_matrix_ir(self, ir)
169 def _to_java_matrix_ir(self, ir):
--> 170 return self._to_java_ir(ir, self._parse_matrix_ir)

File /opt/conda/lib/python3.10/site-packages/hail/backend/py4j_backend.py:145, in Py4JBackend._to_java_ir(self, ir, parse)
143 r = CSERenderer(stop_at_jir=True)
144 # FIXME parse should be static
--> 145 ir._jir = parse(r(finalize_randomness(ir)), ir_map=r.jirs)
146 return ir._jir

File /opt/conda/lib/python3.10/site-packages/hail/backend/py4j_backend.py:158, in Py4JBackend._parse_matrix_ir(self, code, ir_map)
157 def _parse_matrix_ir(self, code, ir_map={}):
--> 158 return self._jbackend.parse_matrix_ir(code, ir_map)

File /opt/conda/lib/python3.10/site-packages/py4j/java_gateway.py:1321, in JavaMember.call(self, *args)
1314 args_command, temp_args = self._build_args(*args)
1316 command = proto.CALL_COMMAND_NAME +
1317 self.command_header +
1318 args_command +
1319 proto.END_COMMAND_PART
-> 1321 answer = self.gateway_client.send_command(command)
1322 return_value = get_return_value(
1323 answer, self.gateway_client, self.target_id, self.name)
1325 for temp_arg in temp_args:

File /opt/conda/lib/python3.10/site-packages/py4j/java_gateway.py:1055, in GatewayClient.send_command(self, command, retry, binary)
1053 response = self.send_command(command, binary=binary)
1054 else:
-> 1055 logging.exception(
1056 "Exception while sending command.")
1057 response = proto.ERROR
1058 except KeyboardInterrupt:
1059 # For KeyboardInterrupt triggered from Python shell, it should
1060 # clean up the connection so the connection is
(...)
1064 # See also py4j/py4j#440 for
1065 # more details.

File /opt/conda/lib/python3.10/logging/init.py:2113, in exception(msg, exc_info, *args, **kwargs)
2107 def exception(msg, *args, exc_info=True, **kwargs):
2108 """
2109 Log a message with severity 'ERROR' on the root logger, with exception
2110 information. If the logger has no handlers, basicConfig() is called to add
2111 a console handler with a pre-defined format.
2112 """
-> 2113 error(msg, *args, exc_info=exc_info, **kwargs)

File /opt/conda/lib/python3.10/logging/init.py:2105, in error(msg, *args, **kwargs)
2103 if len(root.handlers) == 0:
2104 basicConfig()
-> 2105 root.error(msg, *args, **kwargs)

File /opt/conda/lib/python3.10/logging/init.py:1506, in Logger.error(self, msg, *args, **kwargs)
1497 """
1498 Log 'msg % args' with severity 'ERROR'.
1499
(...)
1503 logger.error("Houston, we have a %s", "major problem", exc_info=1)
1504 """
1505 if self.isEnabledFor(ERROR):
-> 1506 self._log(ERROR, msg, args, **kwargs)

TypeError: Log._log() got an unexpected keyword argument 'exc_info'

danking said:

Hi @beneopp !

This looks like a broken Hail installation. Can you describe how you started your Hail cluster and how you installed Hail?

beneopp said:

I launched the Jupyter Cloud Environment on Terra. I used the standard Hail application configuration and reduced the number of CPUs to 2. Attached is a screenshot of the configuration I used.

danking said:

@beneopp hmm. Most folks use clusters, not “Spark single node”, so it’s possible that path is less well tested in Terra.

Can you trigger the error again and send us the Hail .log file? The easiest way to get that is to SSH to the master node of the cluster (always named CLUSTER_NAME-m). If you don’t have the ability to SSH, you’ll have to use the notebook to gsutil cp ...log gs://your-bucket/logfile then download it on your laptop.

Hail is definitely failing to start properly and the reason why should be in that log file.

beneopp said:

Thank you for your prompt reply.

I tried the environment with “Spark single node” and 4 CPUs, and the startup step worked fine. I think the error might be caused by the machine used. Attached is the log file from when 2 CPUs are used and the error occurs.

In general, I am wondering what makes the “Spark single node” parameter important. I am new to using this tool and don’t know anything about Spark. If it is important, I think it would be helpful to explain this in the Hail-Notebook-Tutorials workspace.

[screenshot of the environment dashboard]

hail-20231004-1700-0.2.120-f00f916faf78.log (21.6 KB)

danking said:

@beneopp , which workspace are you referring to? The Hail team doesn’t own any workspaces, those are produced by Terra. I can ask them to update the workspaces though if there is specific feedback.

We have some general advice on using the cloud in the Hail docs. In general, people use “Spark clusters”. Google has a “Dataproc” product which allows you to start and stop clusters. We provide a tool that helps you do that called hailctl dataproc.

In Terra, you have to use their UI instead. For most analyses, you don’t want to use “Spark single node” because that means you’re using one “node” (aka “VM” aka “computer”) rather than a cluster of nodes. That means you’re limited to the number of cores on that one computer. Most sequencing datasets are too large to wait for a single computer to read all the data.


Hmm, this log indicates that you have two SparkContexts and that one of them is stopped and the other is active. This shouldn’t normally happen. If you start a new notebook from scratch and run

import hail as hl
hl.init()
hl.balding_nichols_model(1, 10, 10).show(10, 10)

Do you see a matrix of genotypes? If not, can you copy all the output you see here? We need to sort out why your notebook somehow has two SparkContexts.

Efficient merging of bgen shards - export bgen

jwillett said:

Does the Hail team have recommendations on efficiently merging shards from export_bgen(…, parallel=“header_per_shard”)? I was transforming bgen to pgen, then merging using plink2, but the output was not being accepted by regenie.

danking said:

Hey @jwillett !

Hmm. I wouldn’t use parallel="header_per_shard"; I’d use parallel='separate_header' and then cat (perhaps a 100-way tree of cats if there are too many files). The header_per_shard mode is really meant for when you’re going to run some code on each partition separately.
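
As a sketch of the two modes being compared (my illustration; the output paths are hypothetical, and mt is assumed to be a MatrixTable with a GP entry field suitable for BGEN export):

import hail as hl

# One separate header file plus headerless shards that can simply be concatenated.
hl.export_bgen(mt, 'gs://my-bucket/out_separate_header', parallel='separate_header')

# Each shard is a complete, self-contained BGEN file; useful when each partition
# is processed independently.
hl.export_bgen(mt, 'gs://my-bucket/out_header_per_shard', parallel='header_per_shard')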

jwillett said:

Is there a way to remove headers on produced shards to enable the use of cat? I had to move my projects forward, using header_per_shard, given that I will be doing GWAS (so running code on each separate partition is acceptable). I would prefer to combine all the shards into a single file before running GWAS to keep things organized.

danking said:

The BGEN format is precisely defined, so this should be possible.

According to the BGEN format, the first four bytes are an unsigned little endian 32-bit integer indicating the offset to the start position of the variant data (relative to the end of this 32-bit integer). For each file you can get that offset:

import hailtop.fs as hfs
import struct

x = bytearray(4)
hfs.open(path_to_bgen, 'rb').readinto(x)
offset = struct.unpack('<I', x)[0]
print(offset + 4)

Then you can use tail -c +offset to skip the header and the sample block.
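
Building on that, here is a minimal sketch (not an official Hail utility; the helper is hypothetical) of doing the same header-and-sample-block skip in Python instead of with tail:

import struct
import hailtop.fs as hfs

def variant_data_bytes(shard_path):
    # Read the whole shard into memory (fine for modestly sized shards), then
    # skip the 4-byte offset integer plus `offset` more bytes, which covers the
    # header block and, if present, the sample identifier block.
    with hfs.open(shard_path, 'rb') as f:
        data = f.read()
    offset = struct.unpack('<I', data[:4])[0]
    return data[4 + offset:]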

jwillett said:

Great, thanks! That will save on file count a lot.

Hail SPARK version error

jin0008 said:

Hi, I want to install Hail on my Ubuntu machine, but when I run a simple Hail query I get this message.
Could you give me some advice?
Thanks

This Hail JAR was compiled for Spark 3.1.3, cannot run with Spark 3.4.1.
The major and minor versions must agree, though the patch version can differ.

danking said:

You need to install Apache Spark 3.1.3. You have Apache Spark 3.4.1. Try:

pip3 install pyspark==3.1.3
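
If you are not sure which Spark you have, a quick check (assuming PySpark is the Spark installation Hail is picking up):

import pyspark
print(pyspark.__version__)  # the major.minor version must match the Hail JAR's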

Cannot use exome_v7.1

aou_user_01 said:

I received an email that the srWGS smaller callsets were upgraded to v7.1. Here is the path of the exome dataset I tried: gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/multiMT/hail.mt

and I received this error.

Java stack trace:
is.hail.utils.HailException: incompatible file format when reading: gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/multiMT/hail.mt
supported version: 1.6.0, found 1.7.0
The cause of this error is usually an attempt to use an older version of Hail to read files generated by a newer version. This is not supported (Hail native files are back-compatible, but not forward-compatible).
To read this file, use a newer version of Hail.

I’m using workbench version 7, so I don’t know why I received an error about a supported-version discrepancy. Has anyone tried the new dataset and succeeded?

asppagh54 said:

Hi,
Any solution for this error? I am facing the same error.
Thanks

danking said:

@aou_user_01 , @asppagh54 ,

This means you’re using an older version of Hail than the version which was used to generate that matrix table. You can update your version of Hail with pip (pip install -U hail), but I’ll also contact AoU to ask if they can update the default version of Hail in the AoU RWB.

How does Hail treat missing calls?

akhattab said:

Hello, I’m trying to understand how Hail treats the missing calls.
When using mt.GT.show I could see the missing calls as NA:

[screenshot]

But when using hl.is_missing, they are still NA instead of True:
[screenshot]

Same when using is_defined:
[screenshot]

My question is, why are these calls ‘ignored’?
Thank you!

danking said:

Hey @akhattab ,

Yeah, this is an unfortunate and confusing part of the Hail interface, particularly as it relates to show. Those NA values are not missing but “filtered”. Filtered entries are produced with filter_entries. This is the entry analogue to row or column filtering, but unlike row or column filtering, we can’t simply remove the entire row or column (because there are other, unfiltered entries in the same row or column).

You can convert filtered entries to missing entries with mt.unfilter_entries(). You can compute statistics about the number of filtered entries with mt.compute_entry_filter_stats.

There is no way to test for the filtered status of a particular field, because it’s not the field that is filtered; it is the entry of the matrix that is filtered.
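
A minimal sketch of the two methods mentioned above (my addition; mt is assumed to be a MatrixTable that has been through filter_entries at some point):

# Turn filtered entries into ordinary missing entries.
mt_unfiltered = mt.unfilter_entries()

# Or annotate rows and columns with counts of filtered entries.
mt_stats = mt.compute_entry_filter_stats()
mt_stats.rows().show(5)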

By convention:

  • “Filtered” entries are like filtered rows or columns: we want to pretend that we didn’t even observe them. They’re not part of our dataset anymore.
  • “Missing” values are data we measured but whose value we’re uncertain of. We want to consider them as part of our analysis but as a separate, third category of thing: something we know exists but can’t observe directly.

akhattab said:

Great. Thank you so much!

Future support of Elasticsearch version 8

souckmi said:

Hello everyone,

I wanted to ask if the hail.methods.export_elasticsearch() method will support Elasticsearch version 8 in the near future, since according to the documentation it currently supports Elasticsearch versions 6.8.x - 7.x.x.

Thanks for the answer!

danking said:

We don’t have plans to add version 8 support, but we welcome a pull request to add that support. My guess is that it’s a relatively easy change.

What's different between mt and ht file format?

SimonLi5601 said:

I thought these were simple concepts, but I can’t find them documented anywhere except the VDS documentation. Also, how are they different from a VDS file? Thanks

danking said:

Hey SimonLi5601 !

You’re right that the native file formats are not well documented. The bulk of our users are uninterested in these details. I think you’re the third person to ask about their details in the eight years of the project.

I’ll directly answer your question below, but I’m curious to better understand your motivations and goals. That will help me give you a more useful answer.


An “.mt” “file” is a compressed, partitioned, indexed, binary format for Matrix Tables (which is like a pandas DataFrame with an extra index dimension). An “.ht” “file” is a compressed, partitioned, indexed, binary format for Tables (which is effectively an out-of-core pandas DataFrame).


To be a bit more specific, an “.mt” file is a little bit of metadata plus four “.ht” files (though they lack the extension):

  • globals
  • cols
  • rows
  • entries

and two optional pieces:

  • index
  • references

An “.ht” file is a little bit of metadata plus:

  • rows (this contains the partitioned data)

and two optional pieces:

  • index
  • references
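
You can see this layout for yourself by listing a native file with Hail's filesystem helper (the path here is hypothetical):

import hail as hl

for entry in hl.hadoop_ls('gs://my-bucket/my_dataset.mt'):
    print(entry['path'])
# Expect subdirectories like globals, cols, rows, and entries (plus index and
# references when present) and a small metadata file.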

Zip: length mismatch error

klaricch said:

Hi,

If I run the code in the attached file
am_test.txt (1.3 KB)
I get the below error:

2023-08-09 15:56:05.872 Hail: INFO: wrote table with 104 rows in 104 partitions to gs://gnomad-tmp-4day/__iruid_845-wg8w0gBbPsilRDfRegbVW3
INFO (constraint_pipeline 463): Copying log to logging bucket...
2023-08-09 15:56:33.276 Hail: INFO: copying log to 'gs://gnomad-tmp/gnomad_v2.1.1_testing/constraint/logging/constraint_pipeline.log'...
Traceback (most recent call last):
  File "/tmp/17e6dae80e9548a9a94895fd07e6dee0/constraint_pipeline.py", line 785, in &lt;module&gt;
    main(args)
  File "/tmp/17e6dae80e9548a9a94895fd07e6dee0/constraint_pipeline.py", line 412, in main
    oe_ht = apply_models(
  File "/tmp/17e6dae80e9548a9a94895fd07e6dee0/pyscripts_ge6ozu1m.zip/gnomad_constraint/utils/constraint.py", line 396, in apply_models
  File "/tmp/17e6dae80e9548a9a94895fd07e6dee0/pyscripts_ge6ozu1m.zip/gnomad_constraint/utils/constraint.py", line 279, in create_observed_and_possible_ht
  File "&lt;decorator-gen-1118&gt;", line 2, in checkpoint
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 584, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 1331, in checkpoint
    self.write(output=output, overwrite=overwrite, stage_locally=stage_locally, _codec_spec=_codec_spec)
  File "&lt;decorator-gen-1120&gt;", line 2, in write
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 584, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 1377, in write
    Env.backend().execute(ir.TableWrite(self._tir, ir.TableNativeWriter(output, overwrite, stage_locally, _codec_spec)))
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 82, in execute
    raise e.maybe_user_error(ir) from None
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 76, in execute
    result_tuple = self._jbackend.executeEncode(jir, stream_codec, timed)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 35, in deco
    raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
hail.utils.java.FatalError: HailException: zip: length mismatch: 62164, 104

Java stack trace:
is.hail.utils.HailException: zip: length mismatch: 62164, 104
	at __C8160Compiled.__m8201split_ToArray(Emit.scala)
	at __C8160Compiled.__m8169split_CollectDistributedArray(Emit.scala)
	at __C8160Compiled.__m8164split_Let(Emit.scala)
	at __C8160Compiled.apply(Emit.scala)
	at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$4(CompileAndEvaluate.scala:61)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$2(CompileAndEvaluate.scala:61)
	at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$2$adapted(CompileAndEvaluate.scala:59)
	at is.hail.backend.ExecuteContext.$anonfun$scopedExecution$1(ExecuteContext.scala:140)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.ExecuteContext.scopedExecution(ExecuteContext.scala:140)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:59)
	at is.hail.expr.ir.CompileAndEvaluate$.evalToIR(CompileAndEvaluate.scala:33)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:30)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:58)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:63)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:67)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:62)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:22)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:20)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:20)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:50)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:462)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:498)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:75)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:75)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:63)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:350)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:495)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:494)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)



Hail version: 0.2.120-f00f916faf78
Error summary: HailException: zip: length mismatch: 62164, 104

One of the input tables (context_ht) has 8061724269 rows and 62164 partitions, and another input table (mutation_ht) has 104 rows and 104 partitions. If I add “mutation_ht = mutation_ht.repartition(1)” at line 37, the script runs fine. If I add “mutation_ht = mutation_ht.naive_coalesce(1)”, it gives the same error.

Why does the original partitioning of the input tables result in an error?
Why does repartition make it work but naive_coalesce doesn’t?
How do you typically choose whether to use repartition vs naive_coalesce?

Thank you!

danking said:

klaricch:

create_observed_and_possible_ht

Which is here:

  <a href="https://github.com/broadinstitute/gnomad-constraint/blob/9fa1849c7fc2d63063fbc07b8c0035b2b2850885/gnomad_constraint/utils/constraint.py#L157" target="_blank" rel="noopener">github.com</a>

broadinstitute/gnomad-constraint/blob/9fa1849c7fc2d63063fbc07b8c0035b2b2850885/gnomad_constraint/utils/constraint.py#L157

<pre class="onebox"><code class="lang-py">
  <ol class="start lines" start="147" style="counter-reset: li-counter 146 ;">
      <li>    # vep root annotation.</li>
      <li>    ht = add_most_severe_csq_to_tc_within_vep_root(ht)</li>
      <li></li>
      <li>    if require_exome_coverage:</li>
      <li>        # Filter out locus with undefined exome coverage.</li>
      <li>        ht = ht.filter(hl.is_defined(ht.exome_coverage))</li>
      <li></li>
      <li>    return ht</li>
      <li></li>
      <li></li>
      <li class="selected">def create_observed_and_possible_ht(</li>
      <li>    exome_ht: hl.Table,</li>
      <li>    context_ht: hl.Table,</li>
      <li>    mutation_ht: hl.Table,</li>
      <li>    max_af: float = 0.001,</li>
      <li>    keep_annotations: Tuple[str] = (</li>
      <li>        "context",</li>
      <li>        "ref",</li>
      <li>        "alt",</li>
      <li>        "methylation_level",</li>
      <li>    ),</li>
  </ol>
</code></pre>

chrisvittal said:

I was able to download the log file and will be investigating more now.

chrisvittal said:

klaricch Sorry to bother you, but the most recent log is one from a successful run. Do you have a log from a failed one?

klaricch said:

I’ll rerun it quickly and send an example of a failed one

Ruchit10 said:

I’m getting a similar (length mismatch) error while running RMC, specifically this script. My cluster config is below.

rmc/pipeline/two_breaks/run_batches_dataproc.py
--run-sections-over-threshold \
    --search-num \
    2 \
    --freeze \
    7 \
    --group-size \
    50

The hail log is attached
round2search_for_two_breaks_run_batches_dataproc.log (5.1 MB)

Ruchit10 said:

Possibly this line is causing it? I’ll see if removing naive_coalesce fixes it like it did for klaricch

Estimating Large-Scale Storage Requirements

mgarcia said:

Hello,

I am in the process of running our hail-based sample quality control script on roughly 470,000 exomes on the UK Biobank Research Analysis Platform. While running a test, my job failed due to insufficient storage space.

I saw this thread that suggested that the requirements for a full-scale job can often be linearly estimated from running a lower-scale test. Do you think this would be helpful in estimating the full-scale storage requirements for processing 470,000 exomes?

For context, these were the parameters used for running this job:
98 nodes, each node with 64 GB of memory and roughly 300 GB of storage. You can see in the screenshot below that the storage requirements reached the limit just prior to the job failing.

Unfortunately, the temporary JupyterLab environment closed shortly after failing, and I was unable to get the hail log file. Do you have any suggestions for setting a safe upper limit of storage that avoids future errors such as this one?

We are estimating the job to run for roughly 2-3 days. The storage requirements were roughly 130GB per node in the first 8 hours of running the job, but by the ninth hour this exceeded the allocated 300GB of storage.

For reference, I am also sharing our updated Sample QC script.
hail_sample_qc_YL_MG_Step1_082823.txt (10.2 KB)

Hail can't export tsv.bgz

TaotaoTan said:

Hi,

I was using Hail to perform some variant QC on summary statistics downloaded from UKBB on the All of Us platform. I was able to run the pipeline but failed to export the QC-ed summary stats as a bgz file. Here is a sketch of my pipeline:

var_lst = hl.read_table("varlst_filename_in")
sumstats = hl.read_table("sumstats_filename_in")
var_lst = var_lst.annotate(**sumstats[var_lst.locus])
var_lst = var_lst.filter(xxx)
sumstats_qced = var_lst.select(col1 = var_lst.col1,
                         col2 = var_lst.col2
)

sumstats_qced.show() # this works
sumstats_qced.count() # this works, 8,604,715 rows
sumstats_qced.export("sumstats_qced.tsv.bgz") # this doesn't work, see error message

Here are the error messages:

ERROR:root:Exception while sending command.=============>(140094 + 32) / 140126]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1224, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1038, in send_command
response = connection.send_command(command)
File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1229, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving

Py4JError Traceback (most recent call last)
/tmp/ipykernel_1091/2060613055.py in <module>
4 sumstats_QC_quant(f"{bucket}/Sumstats/continuous-50-both_sexes-irnt_checkpoint.ht",
5 f"{bucket}/Sumstats/WGS_Height_QCed.tsv.bgz",
----> 6 var_wgs)
7
8 print("DBP " + str(datetime.now()))

/tmp/ipykernel_1091/205199695.py in sumstats_QC_quant(ht_filename_in, filename_out, var_wgs)
41 neglog10_pval_meta = var_wgs.neglog10_pval_meta_hq)
42
---> 43 sumstats_QCed.export(filename_out)

<decorator-gen-1190> in export(self, output, types_file, header, parallel, delimiter)

/opt/conda/lib/python3.7/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
575 def wrapper(__original_func, *args, **kwargs):
576     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577     return __original_func(*args_, **kwargs_)
578
579     return wrapper

/opt/conda/lib/python3.7/site-packages/hail/table.py in export(self, output, types_file, header, parallel, delimiter)
1097 parallel = ir.ExportType.default(parallel)
1098 Env.backend().execute(
-> 1099 ir.TableWrite(self._tir, ir.TableTextWriter(output, types_file, header, parallel, delimiter)))
1100
1101 def group_by(self, *exprs, **named_exprs) -> 'GroupedTable':

/opt/conda/lib/python3.7/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
97 # print(self._hail_package.expr.ir.Pretty.apply(jir, True, -1))
98 try:
---> 99 result_tuple = self._jbackend.executeEncode(jir, stream_codec, timed)
100 (result, timings) = (result_tuple._1(), result_tuple._2())
101 value = ir.typ._from_encoding(result)

/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py in call(self, *args)
1321 answer = self.gateway_client.send_command(command)
1322 return_value = get_return_value(
-> 1323 answer, self.gateway_client, self.target_id, self.name)
1324
1325 for temp_arg in temp_args:

/opt/conda/lib/python3.7/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
19 import pyspark
20 try:
---> 21 return f(*args, **kwargs)
22 except py4j.protocol.Py4JJavaError as e:
23 s = e.java_exception.toString()

/opt/conda/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
334 raise Py4JError(
335 "An error occurred while calling {0}{1}{2}".
--> 336 format(target_id, ".", name))
337 else:
338 type = answer[1]

Py4JError: An error occurred while calling o1.executeEncode

danking said:

Hey @TaotaoTan !

Sorry you’re having trouble with Hail! This error usually means you’ve run out of RAM on the driver machine. What size driver are you using for your cluster? I recommend an n1-standard-16 (16 cores, 3.75GB RAM per core) when working with >100k partitions.

Help reformatting variant ID string to annotate MatrixTable keyed by locus, allele

jbs said:

Hello! I’m new to Hail so apologies if this question has been asked before or is a bit basic.

I am trying to annotate a matrix table of genomic data (mt) keyed by [locus, allele], with variant annotations in another hail MT (vat_table). Specifically, I would like to eventually filter the variants mt by their corresponding variant type found in vat_table.variant_type.

However, the fields in vat_table don’t seem to be well-formatted for parsing as loci. The variant ID field (vat_table.vid) is formatted as '1-10001-T-C' (str), which is not recognized by hl.parse_variant(). I can use the replace('-', ':') method to change the format, but then calling:

hl.parse_variant(vat_table.vid, reference_genome='GRCh38')

throws an error, since the contig needs to be in the form ‘chr1’. One solution would be to concatenate ‘chr’ with each of the vat_table.vid character strings, but I am having trouble figuring out how to do this in Hail.

Another option would be to just concatenate four character-string fields, vat_table.contig (ex: 'chr1'), vat_table.position (ex: '10001'), vat_table.ref_allele (ex: 'T'), and vat_table.alt_allele (ex: 'C'), to form a DIY vid field. However, I am having trouble doing this as well.

A final option would be to change the format of the vat_table.vid field when I originally load the table with the call below:

vat_table = hl.import_table(vat_path, force = True, quote = '"', delimiter = "\t", force_bgz = True)

I see this thread advising on how to adjust the contig format when reading in a VCF, but I’m not sure if there is a nice way to adjust this from a tab-delimited file. I am unfortunately not able to make any changes directly in the source file.

Hopefully that is clear, and thanks so much for your help!

iris-garden said:

Hi,

If you’d like to go with the first option, you could do something along these lines:

vat_table = hl.import_table(...)
vat_table = vat_table.annotate(vid = "chr" + vat_table.vid.replace("-", ":"))

For the second,

vat_table = vat_table.annotate(vid = vat_table.contig + ":" + vat_table.position + ":" + vat_table.ref_allele + ":" + vat_table.alt_allele)

As far as I can tell, we unfortunately don’t have an equivalent to the contig_recoding keyword for import_table, but someone else from the team may be able to chime in if I’m missing something there.

Hope that helps!
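
Once vid is in the ‘chr1:10001:T:C’ form, a hedged sketch of parsing it and joining onto the MatrixTable follows; the variant_type field name comes from the question above, and the ‘snv’ filter value is only an example.

# parse the reformatted ID into locus/alleles and key the annotation table by them
parsed = hl.parse_variant(vat_table.vid, reference_genome='GRCh38')
vat_table = vat_table.annotate(locus=parsed.locus, alleles=parsed.alleles)
vat_table = vat_table.key_by('locus', 'alleles')

# annotate the [locus, alleles]-keyed MatrixTable and filter on the variant type
mt = mt.annotate_rows(variant_type=vat_table[mt.row_key].variant_type)
mt = mt.filter_rows(mt.variant_type == 'snv')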

Appending/merging/combining multiple VCF files using Hail

Aks said:

Hello Everyone,

I have a list of VCF files from specific ethnicity such as American Indian, Chinese, European etc

Under each ethnicity, I have around 100+ files.

Currently, I computed the VARIANT QC metrics such as call_rate, n_het etc for one file as shown in the hail tutorial (refer image below)

image is here

However, now I would like to have one file for each ethnicity and then compute VARIANT_QC metrics.

I already referred to this post and this post but don’t think this addresses my query

How can I do this across all files under a specific ethnicity?

Can you help me with this?

Is there a Hail way to do this, or a Python or R approach that veterans here are aware of?

tpoterba said:

You’re generally going to have a better time if you combine everything into one dataset and compute statistics on subsets from that dataset, rather than iterating through many smaller datasets. We need more information about these input VCFs, though. Are these project VCFs (GT, GQ, etc FORMAT fields) for a group of samples from sequencing data? Those cannot be losslessly combined (a site might appear in one VCF but not another).

If your VCFs are genotype data, it’s probably possible to combine since those have the same set of variants.
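
For that genotype-data case, a minimal, hedged sketch of combining per-cohort VCFs that share the same variant sites is below; vcf_paths and the reference genome are assumptions, and union_cols requires the row keys to line up across files.

import functools
import hail as hl

# import each VCF as a MatrixTable (all files must contain the same variant sites)
mts = [hl.import_vcf(path, reference_genome='GRCh38') for path in vcf_paths]

# stack the samples side by side, then run QC on the combined dataset
combined = functools.reduce(lambda left, right: left.union_cols(right), mts)
combined = hl.variant_qc(combined)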

Aks said:

Hi tpoterba

tpoterba:

Are these project VCFs (GT, GQ, etc FORMAT fields) for a group of samples from sequencing data? Those cannot be losslessly combined (a site might appear in one VCF but not another).

Yes, my VCF file contains fields such as GT, GP, etc FORMAT fields.

I also did a mt.describe() for one file from each ethnicity and they all look the same as shown below

image

Does hail have any function to combine all these files together? I see VCF combiner but I believe this is different from what I am looking for (they talk about gVCF which I think is different from my file which has vcf.gz extension)

Support importing phased BGEN files

RossDeVito said:

Hey team,

Hail preserves phase information when importing VCF files, but import_bgen() notes that it does not. Is there something about BGEN files that makes this harder? It would be useful for my current work simulating and modeling haplotype level cis interactions.

Hail SPARK version error

jin0008 said:

Hi I want to install Hail in my ubuntu.
But when I run simple hail query, I got this message.
Could you give me some advice for me?
Thanks

This Hail JAR was compiled for Spark 3.1.3, cannot run with Spark 3.4.1.
The major and minor versions must agree, though the patch version can differ.

danking said:

You need to install Apache Spark 3.1.3. You have Apache Spark 3.4.1. Try:

pip3 install pyspark==3.1.3
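
A quick, hedged check that the reinstall took effect: the version PySpark reports from Python should now match the Spark version the Hail JAR was built for.

import pyspark

# the Hail JAR in the error above was built for Spark 3.1.x, so this should print 3.1.3
print(pyspark.__version__)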

An error occurred while calling z:is.hail.backend.spark.SparkBackend.apply. : java.lang.ExceptionInInitializerError

ldomenech said:

Hello,

I installed hail in my Mac M2 Pro with pip:

pip install hail

Then I tried to run my first query, using your example, to verify everything was okay:

ipython
import hail as hl
mt = hl.balding_nichols_model(n_populations=3,
                              n_samples=10,
                              n_variants=100)
mt.show()

But the command

mt = hl.balding_nichols_model(n_populations=3,
                              n_samples=10,
                              n_variants=100)

gives me an error that I don’t know how to solve:

An error occurred while calling z:is.hail.backend.spark.SparkBackend.apply.
: java.lang.ExceptionInInitializerError

I was trying to do what you specified here How do I fix "java.net.BindException: Can't assign requested address"? - #4 by tpoterba :

"First try this:

sudo /bin/sh -c 'echo "127.0.0.1 $(hostname)" >> /etc/hosts'
Now try to use Hail. If Hail gives the same error, try this:

echo 'export PYSPARK_SUBMIT_ARGS="--driver-memory 16g --executor-memory 16g --conf spark.driver.bindAddress=127.0.0.1 pyspark-shell"' >> ~/.$(basename $SHELL)rc

Now try Hail again. If it still fails, respond to this thread."

But no success.

Some other info that may be useful:

java --version
openjdk 21 2023-09-19

OpenJDK Runtime Environment Zulu21.28+85-CA (build 21+35)

OpenJDK 64-Bit Server VM Zulu21.28+85-CA (build 21+35, mixed mode, sharing)

$python --version

Python 3.11.5



$ipython --version

8.15.0

Could you please help me to solve this?

Thank you very much!

danking said:

Hey @ldomenech !

Can you post the complete stack trace and any other output? I need all that info to be certain what is wrong.

In the meantime, you should try using Java 8 or Java 11. I’m not certain Hail will work with Java 21. We have some links about install java here. If you’re using a Mac, you can check these instructions for changing the version of Java once you’ve installed an older version:

I have Mac OS X. But I already installed Java 10! How can I install Java 8 and make hail work? This is the error message I’m getting! py4j.protocol.Py4JJavaError: An error occurred while calling z:is.hail.HailContext.apply. : is.hail.utils.HailException: Hail requires Java 8, found 10.0.1 at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:9) at is.hail.utils.package$.fatal(package.scala:26) at is.hail.HailContext$.apply(HailContext.scala:221) at is.hail.HailContext.apply(HailCo…
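
A small, hedged check you can run from Python to see which Java the java binary on your PATH reports; the Hail versions discussed here expect Java 8 or 11.

import subprocess

# `java -version` writes to stderr, so print that stream
result = subprocess.run(['java', '-version'], capture_output=True, text=True)
print(result.stderr)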

ldomenech said:

danking:

stack trace

Hi @danking ,

I’ve installed Java 8 and used it, and the example query is working now!

Thank you so much!

(let me know if you want more information about the error anyway)

When to densify?

skoyamamd said:

Hi Dan King,

Thank you for your insightful presentation today!
Per our conversation, I’d appreciate it if you could take a moment to assess the code below for computational equivalence. Our goal here is to extract biallelic variants with AC >= 100
I expect bypassing densifying entire vds will significantly reduce computational time.

While my initial tests suggest consistent results, I want to ensure its reliability across various scenarios. Your expertise in this review would be highly appreciated.

## Code1
vds = hl.vds.read_vds("some_very_big.vds")
mt = vds.variant_data
mt = hl.split_multi_hts(mt)
mt = hl.variant_qc(mt)
mt = mt.annotate_rows(AC100 = mt.variant_qc.AC[1] > 99)
mt = mt.filter_rows(mt.AC100)
vds.variant_data = mt
mt = hl.vds.to_dense_mt(vds)
mt.write("filtered.mt", overwrite=True)
## Code2
vds = hl.vds.read_vds("some_very_big.vds")
mt = hl.vds.to_dense_mt(vds)
mt = hl.split_multi_hts(mt)
mt = hl.variant_qc(mt)
mt = mt.annotate_rows(AC100 = mt.variant_qc.AC[1] > 99)
mt = mt.filter_rows(mt.AC100)
mt.write("filtered.mt", overwrite=True)

danking said:

OK, Patrick and I gave this a thought and we agree that both should have the same output. If a variant row doesn’t exist, it won’t exist in the densification. If you remove a row after densification, it won’t exist.

The presence or absence of a variant row should not affect the value of other variant rows. Densified variant rows are affected by the source variant row and the overlapping reference data. You don’t touch the reference data, so there should be no effect.

We’ll look into why Hail doesn’t automatically push the filter up into the variant data.

danking said:

Hmm. I think part of the challenge for Hail automatically doing this is that variant_qc very much depends on reference data, but you’ve cleverly chosen to only look at the first alternate allele’s allele count.

I think it’d be hard for us to automatically do this optimization. We’ll keep this in mind as we develop further.

danking said:

And an issue to track the optimization: [query] Hail should be able to automatically push backwards alternate allele only VDS filters · Issue #13695 · hail-is/hail · GitHub


Docker error when running VEP

dlcotter said:

I’m running the following code:

import hail as hl
hl.init()
all_genomes_mt = hl.read_matrix_table( f'gs://{genomes_mt_path}' )
result = hl.vep(all_genomes_mt)

And getting the following error:

hail.utils.java.FatalError: HailException: VEP command '/vep --format vcf --json --everything --allele_number --no_stats --cache --offline --minimal --assembly GRCh38 --fasta /opt/vep/.vep/homo_sapiens/95_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz --plugin LoF,loftee_path:/opt/vep/Plugins/,gerp_bigwig:/opt/vep/.vep/gerp_conservation_scores.homo_sapiens.GRCh38.bw,human_ancestor_fa:/opt/vep/.vep/human_ancestor.fa.gz,conservation_file:/opt/vep/.vep/loftee.sql --dir_plugins /opt/vep/Plugins/ -o STDOUT' failed with non-zero exit status 125
  VEP Error output:
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

I used the following command to create the Dataproc cluster the above is running on:

hailctl dataproc start vep-cluster \
    --image-version=2.1.7-debian11 \
    --autoscaling-policy=autoscaling-policy  \
    --master-machine-type=n1-highmem-8 \
    --worker-machine-type=n1-highmem-8 \
    --worker-boot-disk-size=1000 \
    --secondary-worker-type=non-preemptible \
    --preemptible-worker-boot-disk-size=1000 \
    --properties=dataproc:dataproc.logging.stackdriver.enable=true,dataproc:dataproc.monitoring.stackdriver.enable=true,spark:spark.sql.shuffle.partitions=24240,spark:spark.default.parallelism=24240 \
    --vep GRCh38

I’ve kept everything very close to default since this is the first time I’ve run VEP with Hail, but the default configuration doesn’t seem to work. Do I need to configure Docker a certain way? Or change something about the way I’m creating the cluster?

Thank you,
Daniel Cotter

danking said:

Hey @dlcotter !

My apologies! This should be fixed in latest Hail 0.2.123. The issue with some more information about root cause is #12936.

danking said:

A post was split to a new topic: LowerUninterpretable error while running VEP

Sample id name problem

guido said:

I have a problem: does Hail only support a sample ID field named “s”?

Best,
meng

iris-garden said:

Hi Meng,

I’m not entirely clear on what your question is. Would you mind sharing a code example that demonstrates the problem you’re running into? That should make it easier for us to help.

Thanks!

Hail progress report UKBB RAP

yluo411 said:

Hi,

May I ask whether Hail provides any command or interface to print log files showing the current status or progress? For instance, I am going to run Hail on the UKBB RAP for 500,000 samples, so I need to print what percentage is done and how much is left in order to track progress. I would appreciate any suggestions. Thank you.

Best,
Yuyang

danking said:

Hi @yluo411 !

Hail uses Apache Spark for distributed computing. Apache Spark will give you a progress bar that shows you how many partitions remain in the current stage. However, in general, Spark cannot tell you how many more stages are left.

If you’re looking for detailed information about what’s happening in the cluster, the Apache Spark status page is the right place to look. That’s usually served on port 4040 on a master node of the Spark cluster.

Hail also writes everything it’s doing to a Hail log file. The location of that file is printed when Hail initializes. You can also set a specific location using hl.init. In general, the log file is very large and hard for users to interpret.
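
A minimal sketch of choosing the log location up front; the path is illustrative.

import hail as hl

# write the (very verbose) Hail log next to your notebook instead of a temp directory
hl.init(log='./hail_ukbb_run.log')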

Can you share more specifics about what you want to know about your pipeline?

Future support of Elasticsearch version 8

souckmi said:

Hello everyone,

I wanted to ask whether the hail.methods.export_elasticsearch() method will support Elasticsearch version 8 in the near future, since according to the documentation it currently supports Elasticsearch versions 6.8.x - 7.x.x.

Thanks for the answer!

danking said:

We don’t have plans to add version 8 support, but we welcome a pull request to add that support. My guess is that it’s a relatively easy change.


Error summary: ClassTooLargeException: Class too large

MsUTR said:

Hi all! I know this has been posted a few times, but I do not think that any of the solutions are applicable to this. I am trying to import a gnomAD VCF (gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz) with a simple chunk of code:

Hail version: 0.2.108-fc03e9d5dc08
Error summary: ClassTooLargeException: Class too large: __C30721collect_distributed_array_matrix_native_writer

I am not too sure how to resolve this, but weirdly, when I ran this on the non-liftover VCF (gnomad.exomes.r2.1.1.sites.vcf.bgz), it actually worked fine. Will appreciate any input on this matter!

danking said:

Hey, MsUTR , I’m sorry you’re running into this issue. We’ll look into it, but this kind of problem is hard to fix. The root cause is that this VCF file has a very large number of fields. Hail’s parser generates code so that it can parse these kinds of VCFs really fast but, due to JVM limitations, that code can grow too large.

This isn’t particularly high priority for us because gnomAD has publicly released Hail Tables for these VCFs. Using a Hail Table avoids the parsing problem and saves you the cost of importing. Can you use one of these tables instead?

  1. gs://gcp-public-data--gnomad/release/2.1.1/liftover_grch38/ht/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.ht
  2. s3://gnomad-public-us-east-1/release/2.1.1/liftover_grch38/ht/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.ht
  3. https://datasetgnomad.blob.core.windows.net/dataset/release/2.1.1/liftover_grch38/ht/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.ht
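
A hedged sketch of reading the released Hail Table directly instead of importing the VCF; the path is the Google Cloud one listed above.

import hail as hl

# reading the pre-built sites table avoids VCF parsing (and the generated-code size limit) entirely
ht = hl.read_table(
    'gs://gcp-public-data--gnomad/release/2.1.1/liftover_grch38/ht/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.ht'
)
ht.describe()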

danking said:

Hmm.

I was unable to reproduce this, so I closed the GitHub issue I had created for the issue. [query] gnomAD 2.1.1 sites table VCF is not parseable with Hail · Issue #13249 · hail-is/hail · GitHub .


Combine GVCF to VDS (variant dataset in Hail) and convert VDS to VCF

Dong said:

Hi all

I combined GVCF 5K files to one VDS by using Hail library

And I get vds (variant dataset) directory

So I want to read vds data and convert MT (matrix table including VCF data in Hail)

But I can not convert MT to VCF format files with INFO field

Below code is my code that combine multiple GVCF to one vds and convert it to vcf)

import glob
import hail as hl
hl.init()

from gnomad.utils.annotations import (
    get_adj_expr,
    get_lowqual_expr,
)
from gnomad.utils.sparse_mt import (
    get_as_info_expr,
    get_site_info_expr,
    INFO_INT32_SUM_AGG_FIELDS,
    INFO_SUM_AGG_FIELDS,
    split_info_annotation,
    split_lowqual_annotation,
)

# combine multiple GVCF to VDS
gvcf_list = glob.glob("*.gvcf.gz")
combiner = hl.vds.new_combiner(
    gvcf_paths = gvcf_list,
    output_path = output.vds,
    use_genome_default_intervals=True,
    reference_genome = "GRCh38"
)
combiner.run()

# read vds data and make mt
vds = hl.vds.read_vds(output_vds)
mt = hl.vds.to_dense_mt(vds)
mt = mt.filter_rows((hl.len(mt.alleles) > 1))
mt = mt.transmute_entries(**mt.gvcf_info)
mt = mt.annotate_rows(alt_alleles_range_array=hl.range(1, hl.len(mt.alleles)))

# make INFO expression
info_expr = get_site_info_expr(
    mt,
    sum_agg_fields=[],
    int32_sum_agg_fields=[],
    array_sum_agg_fields=['SB']
)

info_expr = info_expr.annotate(
    **get_as_info_expr(
        mt,
        sum_agg_fields=[],
        int32_sum_agg_fields=[],
        array_sum_agg_fields=['SB']
    )
)

# Add AC and AC_raw
# First compute ACs for each non-ref allele, grouped by adj
grp_ac_expr = hl.agg.array_agg(
    lambda ai: hl.agg.filter(
        mt.LA.contains(ai),
        hl.agg.group_by(
            get_adj_expr(mt.LGT, mt.GQ, mt.DP, mt.LAD),
            hl.agg.sum(
                mt.LGT.one_hot_alleles(mt.LA.map(hl.str))[mt.LA.index(ai)]
            ),
        ),
    ),
    mt.alt_alleles_range_array,
)

# Then, for each non-ref allele, compute
#   AC as the adj group
#   AC_raw as the sum of adj and non-adj groups
info_expr = info_expr.annotate(
    AC_raw=grp_ac_expr.map(
        lambda i: hl.int32(i.get(True, 0) + i.get(False, 0))
    ),
    AC=grp_ac_expr.map(lambda i: hl.int32(i.get(True, 0))),
)

# Annotating raw MT with pab max
info_expr = info_expr.annotate(
    AS_pab_max=hl.agg.array_agg(
        lambda ai: hl.agg.filter(
            mt.LA.contains(ai) & mt.LGT.is_het(),
            hl.agg.max(
                hl.binom_test(
                    mt.LAD[mt.LA.index(ai)], hl.sum(mt.LAD), 0.5, 'two-sided'
                )
            ),
        ),
        mt.alt_alleles_range_array,
    )
)

info_ht = mt.select_rows(info=info_expr).rows()

mt = mt.annotate_rows(info=info_ht[mt.row_key].info)
mt = mt.annotate_entries(
    GT = hl.vds.lgt_to_gt(mt.LGT, mt.LA)
)

drop_to_field = ['BaseQRankSum',
                 'ExcessHet',
                 'InbreedingCoeff',
                 'MLEAC',
                 'MLEAF',
                 'MQRankSum',
                 'RAW_MQandDP',
                 'ReadPosRankSum',
                 "LGT"]

mt = mt.drop(*drop_to_field)
hl.export_vcf(mt, output.vcf.bgz)

I got error message
“Error summary: HailException: VCF does not support SIndexablePointer(PCArray[PInt32])”

What is my fault?

Thank you.

danking said:

Hey @Dong !

My apologies, this is our fault. You should never see this kind of error message. I’ve asked a member of the development team to figure out why this is happening and to fix it for you. We’ll respond with a quick-fix if we have one.

danking said:

@Dong , it looks likely that you’re trying to export to VCF a field which is an array of arrays of integers. Can you share the output of mt.describe() just before hl.export_vcf?

VCF doesn’t permit arrays of arrays, so you’ll need to either use a TSV/CSV or change that field to not be an array of arrays.

Dong said:

Hi @danking

it’s my mt.describe() result

>>> mt.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'alt_alleles_range_array': array<int32>
    'info': struct {
        ReadPosRankSum: float64, 
        MQRankSum: float64, 
        SB: array<int32>, 
        FS: float64, 
        SOR: float64, 
        AS_ReadPosRankSum: array<float64>, 
        AS_MQRankSum: array<float64>, 
        AS_SB_TABLE: array<array<int32>>, 
        AS_FS: array<float64>, 
        AS_SOR: array<float64>, 
        AC_raw: array<int32>, 
        AC: array<int32>, 
        AS_pab_max: array<float64>
    }
----------------------------------------
Entry fields:
    'DP': int32
    'GQ': int32
    'MIN_DP': int32
    'LA': array<int32>
    'LAD': array<int32>
    'LPGT': call
    'LPL': array<int32>
    'RGQ': int32
    'PID': str
    'PS': int32
    'SB': array<int32>
    'GT': call
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

Thank you

danking said:

Yeah, @Dong, the issue is AS_SB_TABLE. VCF doesn’t support arrays of arrays. You have to choose a different representation. For example, you might do this:

mt = mt.annotate(
    info = mt.info.annotate(
        AS_SB_TABLE = mt.info.AS_SB_TABLE.map(
            lambda inner_array: hl.str('|').join(inner_array)
        )
    )
)

But be aware! Whoever consumes this VCF will need to know how to handle AS_SB_TABLE’s |-delimited nature.

Dong said:

To @danking
Thank you for your reply

But I got error message.
I changed some code

mt.annotate → mt.annotate_rows

hl.str(‘|’).join(inner_array) → hl.str(‘|’).join(list(map(lambda x : str(x), inner_array)))
because I got an error “TypeError: Expected str collection, int32 found”

so I run below code

mt = mt.annotate_rows(
    info = mt.info.annotate(
        AS_SB_TABLE = mt.info.AS_SB_TABLE.map(
            lambda inner_array : hl.str('|').join(
                list(map(lambda x : str(x), inner_array))
            )
        )
    )
)

Error message …

hail.expr.expressions.base_expression.ExpressionException: <ArrayNumericExpression of type array> object is not iterable

How can i fix it?

Thank you!!

danking said:

@Dong ,

You have to use the Hail versions of all these methods:

mt = mt.annotate_rows(
    info = mt.info.annotate(
        AS_SB_TABLE = mt.info.AS_SB_TABLE.map(
            lambda inner_array : hl.str('|').join(
                hl.map(lambda x: hl.str(x), inner_array)
            )
        )
    )
)

hl.list is not necessary in this case.
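
As an aside, a hedged equivalent using hl.delimit, which stringifies and joins the inner array in one step, would look like this:

mt = mt.annotate_rows(
    info = mt.info.annotate(
        AS_SB_TABLE = mt.info.AS_SB_TABLE.map(
            # hl.delimit converts each element to a string and joins with the delimiter
            lambda inner_array: hl.delimit(inner_array, '|')
        )
    )
)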


Hail stuck when running array data

hyslin said:

Hi,

Each time I run a Hail GWAS in All of Us with array data, I encounter issues when trying to save a checkpoint or when running logistic regression. I believe my code is concise, and considering the data size in the latest version (v7) of the array data, it should be manageable. However, the program consistently gets stuck, usually at (0 + 48) / 74 .

2023-10-03 16:24:19.862 Hail: INFO: logistic_regression_rows: running wald on 6529 samples for response variable y,
    with input variable x, and 5 additional covariates...
[Stage 4:>                                                        (0 + 48) / 74]

The array data I used was:

# read in microarray data
mt = hl.read_matrix_table('gs://fc-aou-datasets-controlled/v7/microarray/hail.mt/')

After annotating this matrix table with my phenotype/ancestry file, conducting some quick QC, and attempting to run logistic regression on each variant, the program gets stuck.

However, if I randomly select some variants across the genome to run the GWAS, the program runs smoothly and at a good speed.

# only use a subset of intervals
mt = hl.filter_intervals(mt, [hl.parse_locus_interval(x) for x in allintervals[1:4000]])

2023-10-03 16:21:06.348 Hail: INFO: logistic_regression_rows: running wald on 6529 samples for response variable y,
    with input variable x, and 5 additional covariates...
2023-10-03 16:22:20.280 Hail: INFO: wrote table with 745 rows in 64 partitions to /tmp/persist_table4hGORAWhdh
    Total size: 45.17 KiB
    * Rows: 45.15 KiB
    * Globals: 11.00 B
    * Smallest partition: 1 rows (125.00 B)
    * Largest partition:  26 rows (1.47 KiB)

So, I assumed that the size of the matrix table might be the cause of the issue. However, how do I resolve this?

danking said:

Hey @hyslin ,

I’m sorry to hear Hail is giving you trouble!

How many variants and samples are in this matrix table? mt.count() will give you that. It looks like you have 74 partitions, which is a very very small number of partitions, particularly if you’re working with a large dataset. It also looks like your cluster only has 48 cores, which is very small for working with large datasets. Folks typically use large clusters (100-1000 cores) for brief periods of time. Using an autoscaling policy is really helpful for doing this, but I’m not sure if those are available inside the AoU RWB yet.

Your second example is running logistic regression on just 745 rows, so I’m not surprised that it is faster.

I expect logistic regression to scale roughly linearly in the variants.

Can you also share the full script you ran? There could be innocuous looking things that are slowing down the analysis.

hyslin said:

Hi danking,

Thank you for your reply!

I had 1739268 variants and 6529 samples before QC. I didn’t get a chance to check how many variants and samples were left after QC because it took too long to compute. I guess QC is the most time intensive and computation heavy step?

Initially, I used the default Hail Genomic Analysis environment setting plus 10 Preemptible workers. As I increased those preemp. workers to 100, the program no longer got stuck at ( 0 + 48 ) / 74, but it still ran quite slowly and seemed to stop working at (12 + 62 ) / 74. Are there any other configurations I could change to significantly reduce the computation time?

Here I am pasting my full script for the first example:

## import packages
import os
import hail as hl

# set up hail
hl.init(default_reference = "GRCh38")

# get the workspace bucket
bucket = os.getenv("WORKSPACE_BUCKET")
bucket

# what's in my bucket
!gsutil ls {bucket}/data

# read in microarray data
mt = hl.read_matrix_table('gs://fc-aou-datasets-controlled/v7/microarray/hail.mt/')

# read ancestry+phenotype file
sample_filename = 'gs://......../data/afa.tsv'
afa = (hl.import_table(sample_filename,
                       types={'research_id':hl.tstr},
                       impute=True,
                       key='research_id')
      )

# annotate ancestry and phenotype
mt = mt.annotate_cols(anc_pheno = afa[mt.s])
mt = mt.filter_cols(hl.is_defined(mt.anc_pheno), keep=True)

# how many variants and samples are there
mt.count()

# simple QC
mt = hl.variant_qc(mt)
mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.01, keep=True)
mt = mt.filter_rows(mt.variant_qc.call_rate > 0.95, keep=True)

# how many variants and samples are there after QC
#mt.count()

# also skip this
#mt = mt.checkpoint('gs://......./data/output/mt_checkpoint.mt')

# make covariates
covariates = [1.0, mt.anc_pheno.is_male,
              mt.anc_pheno.PC1,
              mt.anc_pheno.PC2,
              mt.anc_pheno.PC3]

# run logistic regression for each row
log_reg = hl.logistic_regression_rows(
    test='wald',
    y=mt.anc_pheno.is_case,
    x=mt.GT.n_alt_alleles(),
    covariates=covariates
)

# save to my bucket
log_reg.export(f'{bucket}/data/output/log_reg.tsv.bgz')

Any suggestions would be greatly appreciated!

danking said:

If most of your variants survive the filter, I’d expect this to take ~2000 minutes which is over a day.


Since you’re doing the QC process in the same pipeline as the logistic regression, it’s hard to know what the slow part is. Usually folks use two steps:

# do QC ...
mt.rows().select().write('gs://bucket/rows-that-pass-qc.ht')

ht = hl.read_table('gs://bucket/rows-that-pass-qc.ht')
print(f'{ht.count()} rows passing QC') # counting a table after read is instant

mt = hl.read_matrix_table(...)
mt = mt.semi_join_rows(ht) # filter to the passing variants

This also lets us diagnose how long we expect this to take.


Hmm. 74 partitions means you’re doing ~23000 logistic regressions per partition (assuming most variants pass filters). That’s a lot of regressions! We need to make this parameter a public & supported parameter, but for now try using _n_partitions to increase the number of partitions:

hl.read_matrix_table(..., _n_partitions=750)

I definitely recommend using checkpoint after importing a non-Hail-format dataset:

sample_filename = 'gs://......../data/afa.tsv'
afa = (hl.import_table(sample_filename,
                              types={'research_id':hl.tstr},
                              impute=True,
                              key='research_id')
             )
afa = afa.checkpoint('gs://...', overwrite=True)

I recommend against directly exporting as generating a compressed text file can be memory intensive. Instead try this:

log_reg.write(f'{bucket}/data/output/log_reg.ht')
log_reg = hl.read_table(f'{bucket}/data/output/log_reg.ht')
log_reg.export(...)

hyslin said:

Thank you so much, danking!

I modified my code using the suggestions you provided, and they have been very helpful! I have separated my code into different notebooks for running in the background. However, as I ran the QC notebook, I encountered an issue when writing out columns (samples). Could this be due to a memory issue? The error message appears sporadically, which is quite disappointing. What steps can I take to prevent this from happening?
(for running this notebook I used 100 preemptible workers and 5000 partitions)

The very last step that cause error:

# write out QC'ed variants and samples
mt.rows().select().write(f'{bucket}/data/qc_var.ht', overwrite=True)
mt.cols().select().write(f'{bucket}/data/qc_sam.ht', overwrite=True)
2023-10-08 20:31:10.452 Hail: INFO: wrote table with 1046146 rows in 74 partitions to gs://fc-secure-80e50d85-fbc9-4359-9a7b-e4948a1089e0/data/qc_var.ht
2023-10-08 20:31:11.236 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1224, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1038, in send_command
response = connection.send_command(command)
File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1229, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving

LowerUninterpretable error while running VEP

dlcotter said:

Hi Dan,
I upgraded Hail to 0.2.124, and VEP is working on the cluster, which is great, but I am running into a different error now, and I can’t tell from the error message if it is being caused by the input to VEP or something else.

Here is the code:

all_genomes_mt = hl.read_matrix_table( f'gs://{genomes_mt_path}' )
just_variants = all_genomes_mt.rows()
vep_results = hl.vep( just_variants )

And here is the full stacktrace:

Traceback (most recent call last):                      (16566 + 2504) / 121044]
  File "/tmp/d9c64535d84a478bba71557458f7876a/run-vep.py", line 31, in <module>
    vep_results = hl.vep( just_variants )
  File "<decorator-gen-1752>", line 2, in vep
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 587, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/methods/qc.py", line 1181, in vep
    'tolerateParseError': tolerate_parse_error})).persist()
  File "<decorator-gen-1224>", line 2, in persist
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 587, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 2112, in persist
    return Env.backend().persist(self)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/backend.py", line 208, in persist
    persisted = dataset.checkpoint(tempfile.__enter__())
  File "<decorator-gen-1214>", line 2, in checkpoint
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 587, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 1331, in checkpoint
    self.write(output=output, overwrite=overwrite, stage_locally=stage_locally, _codec_spec=_codec_spec)
  File "<decorator-gen-1216>", line 2, in write
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 587, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 1377, in write
    Env.backend().execute(ir.TableWrite(self._tir, ir.TableNativeWriter(output, overwrite, stage_locally, _codec_spec)))
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 82, in execute
    raise e.maybe_user_error(ir) from None
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 76, in execute
    result_tuple = self._jbackend.executeEncode(jir, stream_codec, timed)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 35, in deco
    raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
hail.utils.java.FatalError: SparkException: Job 0 cancelled because SparkContext was shut down

Java stack trace:
org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1(DAGScheduler.scala:1188)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1$adapted(DAGScheduler.scala:1186)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:1186)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:2887)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2784)
at org.apache.spark.SparkContext.$anonfun$stop$11(SparkContext.scala:2095)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1484)
at org.apache.spark.SparkContext.stop(SparkContext.scala:2095)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:125)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2293)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1021)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1020)
at is.hail.backend.spark.SparkBackend.parallelizeAndComputeWithIndex(SparkBackend.scala:406)
at is.hail.backend.BackendUtils.collectDArray(BackendUtils.scala:86)
at __C832Compiled.__m836split_Let(Emit.scala)
at __C832Compiled.apply(Emit.scala)
at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$4(CompileAndEvaluate.scala:61)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$2(CompileAndEvaluate.scala:61)
at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$2$adapted(CompileAndEvaluate.scala:59)
at is.hail.backend.ExecuteContext.$anonfun$scopedExecution$1(ExecuteContext.scala:140)
at is.hail.utils.package$.using(package.scala:637)
at is.hail.backend.ExecuteContext.scopedExecution(ExecuteContext.scala:140)
at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:59)
at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$apply$1(CompileAndEvaluate.scala:19)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.CompileAndEvaluate$.apply(CompileAndEvaluate.scala:19)
at is.hail.expr.ir.TableWriter.apply(TableWriter.scala:46)
at is.hail.expr.ir.Interpret$.run(Interpret.scala:865)
at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:59)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:20)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:58)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:63)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:77)
at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:26)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:26)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:24)
at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:23)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:72)
at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:22)
at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:20)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:20)
at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:50)
at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:505)
at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:541)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:76)
at is.hail.utils.package$.using(package.scala:637)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:76)
at is.hail.utils.package$.using(package.scala:637)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:62)
at is.hail.backend.spark.SparkBackend.$anonfun$withExecuteContext$1(SparkBackend.scala:345)
at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:538)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:537)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)

The Hail logs on the master node say the error preceding the shutdown of the SparkContext was an "error while applying lowering 'LowerOrInterpretNonCompilable'", presumably in reference to is.hail.expr.ir.LowerOrInterpretNonCompilable, but I don’t know what that function does (I looked at the source, but couldn’t even guess what it was trying to do, specifically). Any idea where I should start troubleshooting?

Thanks,
Daniel Cotter

danking said:

Hey @dlcotter ! Can you share the log file? You can upload it to gist.github.com if its too large to upload here.

dlcotter said:

Attached.
burden-testing_hail-20230929-1752-0.2.124-13536b531342.log (606.6 KB)


PCA Projection onto existing PCA

soisa001 said:

I have a srWGS (41 million SNPs) and lrWGS (6 million SNPs) Hail MT. The lrWGS MT SNPs are a subset of the srWGS SNPs (same variant sites, but with slightly different GT calls). The samples are the same ID between both.

I am looking to generate a PCA of the pruned srWGS SNPs, and then generate a PCA of the pruned lrWGS SNPs, and project the new points onto the existing srWGS dataset. Essentially, this would show how “different” or impactful the missing SNPs are in the lrWGS datset as compared to the srWGS dataset. Is this doable with Hail’s built in functions?

If I merge the two tables together (rename lrWGS sample IDs), then generate a single PCA, would this yield the same result as above?

Thank you.

patrick-schultz said:

Hi @soisa001,
Normally we advise to only perform PCA using common variants, which I believe is standard practice (I’m not a scientist). But it seems you want to verify that practice?

Hail’s PCA method returns all the information you should need to compute some measure of “difference” of the two PCAs, but how to do that would depend on what measure you want to use. Did you have an idea how you want to do that?

If I merge the two tables together (rename lrWGS sample IDs), then generate a single PCA, would this yield the same result as above?

This would definitely be different than either two datasets alone. I think this is probably not what you want to do.

patrick-schultz said:

Also be aware that PCA with 41 million SNPs would be an absolutely massive computation. Hail has a scalable PCA method which should be able to do it, but I’m not sure if a PCA of that scale has been attempted before, or how expensive it would be.

soisa001 said:

Hi Patrick,

Thanks for the quick response. I was looking to do something similar to https://github.com/DReichLab/EIG/blob/master/POPGEN/lsqproject.pdf
where you could consider the srWGS as the complete dataset, and the lrWGS as like an archaic dataset (lrWGS is low coverage and has a subset of variants of the srWGS).
First I would generate the srWGS PCA and then “project” the lrWGS samples onto that.

Would this be possible in Hail?
I would probably downsample variants as well before creating the PCA, so that it’s not so large.

Thanks

patrick-schultz said:

Thanks for the reference, I think I understand now. To make sure: you would downsample to common variants, but even then, some of those variants would be missing (at least in some samples) in lrWGS, and you want to project the lrWGS samples into the PC space of srWGS, using the method in the pdf you linked rather than mean imputing the missing variants. Is that right?

soisa001 said:

Yes, that's the general gist of it. We are also interested in computing the srWGS space with the complete variant set (i.e., not downsampling), then projecting the lrWGS onto it.
It sounds like the current Hail method imputes missing information?
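
A rough sketch of one way to set this up in Hail: run hl.hwe_normalized_pca on the srWGS data with compute_loadings=True, then use hl.experimental.pc_project to place the lrWGS samples in that PC space. If I read it correctly, pc_project normalizes by the reference allele frequency and lets missing calls drop out of the dot product (effectively mean imputation in the normalized space), rather than doing the least-squares projection from the linked note. Names like sr_mt, lr_mt, and pca_af are assumptions for illustration:

import hail as hl

# Assumed inputs: sr_mt (srWGS) and lr_mt (lrWGS) MatrixTables with GT entries,
# both already restricted to the common, LD-pruned variants to be used for PCA.

# 1. PCA on the srWGS data, keeping the loadings.
eigenvalues, sr_scores, sr_loadings = hl.hwe_normalized_pca(sr_mt.GT, k=10, compute_loadings=True)

# 2. Attach the srWGS allele frequency to the loadings table;
#    pc_project needs it to normalize the genotypes being projected.
sr_af = sr_mt.annotate_rows(pca_af=hl.agg.mean(sr_mt.GT.n_alt_alleles()) / 2).rows()
sr_loadings = sr_loadings.annotate(pca_af=sr_af[sr_loadings.key].pca_af)

# 3. Project the lrWGS samples into the srWGS PC space.
lr_scores = hl.experimental.pc_project(lr_mt.GT, sr_loadings.loadings, sr_loadings.pca_af)

sr_scores.show(5)
lr_scores.show(5)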

Help reformatting variant ID string to annotate MatrixTable keyed by locus, allele

jbs said:

Hello! I’m new to Hail so apologies if this question has been asked before or is a bit basic.

I am trying to annotate a matrix table of genomic data (mt), keyed by [locus, allele], with variant annotations from another Hail table (vat_table). Specifically, I would like to eventually filter the variants in mt by their corresponding variant type, found in vat_table.variant_type.

However, the fields in vat_table don't seem to be well formatted for parsing as loci. The variant ID field (vat_table.vid) is formatted as '1-10001-T-C' (str), which is not recognized by hl.parse_variant(). I can use the replace('-', ':') method to change the format, but then calling:

hl.parse_variant(vat_table.vid, reference_genome='GRCh38')

throws an error, since the contig needs to be in the form 'chr1'. One solution would be to concatenate 'chr' with each of the vat_table.vid character strings, but I am having trouble figuring out how to do this in Hail.

Another option would be to just concatenate four character-string fields, vat_table.contig (ex: 'chr1'), vat_table.position (ex: '10001'), vat_table.ref_allele (ex: 'T'), and vat_table.alt_allele (ex: 'C'), to form a DIY vid field. However, I am having trouble doing this as well.

A final option would be to change the format of the vat_table.vid field when I originally load the table with the call below:

vat_table = hl.import_table(vat_path, force=True, quote='"', delimiter="\t", force_bgz=True)

I see this thread advising on how to adjust the contig format when reading in a VCF, but I'm not sure if there is a nice way to adjust this from a tab-delimited file. I am unfortunately not able to make any changes directly in the source file.

Hopefully that is clear, and thanks so much for your help!

iris-garden said:

Hi,

If you’d like to go with the first option, you could do something along these lines:

vat_table = hl.import_table(...)
vat_table = vat_table.annotate(vid = "chr" + vat_table.vid.replace("-", ":"))

For the second,

vat_table = vat_table.annotate(
    vid = vat_table.contig + ":" + vat_table.position + ":" + vat_table.ref_allele + ":" + vat_table.alt_allele
)

As far as I can tell, we unfortunately don’t have an equivalent to the contig_recoding keyword for import_table, but someone else from the team may be able to chime in if I’m missing something there.

Hope that helps!
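
Putting the pieces together, a rough end-to-end sketch under the assumptions in the question (vat_path, a vid column like '1-10001-T-C', a variant_type column, and GRCh38 loci; the "snp" value in the filter is a placeholder for whatever category the VAT actually uses):

import hail as hl

vat_table = hl.import_table(vat_path, quote='"', delimiter="\t", force_bgz=True)

# Build a chr-prefixed, colon-delimited variant string and parse it into locus/alleles.
vat_table = vat_table.annotate(vid="chr" + vat_table.vid.replace("-", ":"))
parsed = hl.parse_variant(vat_table.vid, reference_genome="GRCh38")
vat_table = vat_table.annotate(locus=parsed.locus, alleles=parsed.alleles)
vat_table = vat_table.key_by("locus", "alleles")

# Annotate the genotype MatrixTable and filter on the VAT's variant type.
mt = mt.annotate_rows(variant_type=vat_table[mt.row_key].variant_type)
mt = mt.filter_rows(mt.variant_type == "snp")  # placeholder value; substitute the VAT's actual category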
