
S3 ContentEncoding is disregarded

goranvinterhalter commented on June 7, 2024

Comments (2)

goranvinterhalter commented on June 7, 2024

@mpenkov I don't know whether these instructions are correct. For example, is the Metadata field uncompressed_size required (as used in https://github.com/etianen/django-s3-storage/blob/master/django_s3_storage/storage.py#L331)?
I observed that Chrome automatically decompresses the file when a pre-signed URL is used, but I'm having trouble replicating this in the steps below. For now, this is what I have.

Create the file and upload it (I'm referring to the bucket as <bucket>):

echo "hello world" | gzip -c > a.txt
aws s3 cp a.txt <bucket>/a.txt --content-encoding gzip

Check ContentEncoding is set:

In [36]: import boto3

In [37]: client = boto3.client("s3")

In [38]: obj = client.get_object(Bucket="<bucket>", Key="a.txt")

In [39]: obj["ContentEncoding"]
Out[39]: 'gzip'
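
As a possible workaround, the ContentEncoding value checked above can drive decompression by hand. This is only a sketch: the helper name `read_decoded` is hypothetical and is not part of smart_open or boto3, and it assumes `client` behaves like a boto3 S3 client whose `get_object` response carries a readable `Body` and, when it was set at upload time, a `ContentEncoding` entry.

```python
import gzip


def read_decoded(client, bucket, key):
    """Return the object's bytes, gunzipped when ContentEncoding is gzip.

    Hypothetical helper: `client` is assumed to behave like a boto3 S3
    client, i.e. get_object() returns a dict with a readable "Body" and,
    if it was set at upload time, a "ContentEncoding" entry.
    """
    obj = client.get_object(Bucket=bucket, Key=key)
    raw = obj["Body"].read()
    # ContentEncoding is absent from the response if it was never set.
    if obj.get("ContentEncoding", "").lower() == "gzip":
        return gzip.decompress(raw)
    return raw
```

With the file uploaded above, `read_decoded(boto3.client("s3"), "<bucket>", "a.txt")` should return the decompressed bytes rather than the raw gzip stream.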

Reading with smart_open:

In [1]: import smart_open

In [2]: smart_open.open("<bucket>/a.txt", "rb").read()
Out[2]: b'\x1f\x8b\x08\x00\xa2k\x87c\x00\x03\xcbH\xcd\xc9\xc9W(\xcf/\xcaI\xe1\x02\x00-;\x08\xaf\x0c\x00\x00\x00'

In [3]: smart_open.open("<bucket>/a.txt").read()
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In [3], line 1
----> 1 smart_open.open("<bucket>/a.txt").read()

File /opt/python/3.9.14/lib/python3.9/codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
    319 def decode(self, input, final=False):
    320     # decode input (taking the buffer into account)
    321     data = self.buffer + input
--> 322     (result, consumed) = self._buffer_decode(data, self.errors, final)
    323     # keep undecoded input until the next call
    324     self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
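
The error position is consistent with the raw gzip stream being decoded as text: every gzip stream begins with the magic bytes 0x1f 0x8b, and 0x8b is not a valid UTF-8 start byte. A minimal local sketch using only the standard library, with no S3 involved, reproduces the same failure:

```python
import gzip

# Same content as `echo "hello world" | gzip -c`, built locally.
compressed = gzip.compress(b"hello world\n")

# gzip streams always start with the magic bytes 0x1f 0x8b.
assert compressed[:2] == b"\x1f\x8b"

try:
    compressed.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0x1f is valid ASCII; decoding fails at index 1 (byte 0x8b).
    failure_position = exc.start

# Decompressing first recovers the original text.
text = gzip.decompress(compressed).decode("utf-8")
```

This suggests the bytes are reaching the text decoder still compressed, which matches the binary read in Out[2].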

mpenkov commented on June 7, 2024

Thank you for the report.

The following two statements seem inconsistent to me:

1. It's hard to give precise steps
2. Simply put, uploading a .txt file with a .txt extension whose content has been gzipped and whose ContentEncoding value is "gzip" should be automatically decompressed, but it is not

Why is it difficult to show the precise source code for 2)?
