Comments (11)

jkleint commented on July 26, 2024

Yeah, it's working very well! Thank you so much. I'll test a bit more next week.

laughingman7743 commented on July 26, 2024

I checked the performance.
https://gist.github.com/laughingman7743/2e4d83ca4e394dc645e9ea9a45fe78ba
PandasCursor is ultra fast. 😆

laughingman7743 commented on July 26, 2024

It would be nice to create a new PandasCursor like AsyncCursor.

cursor = connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
                 region_name='us-west-2',
                 cursor_class=PandasCursor).cursor()
df = cursor.execute("SELECT * FROM many_rows").as_pandas()

When the query execution completes, the cursor downloads the CSV result from S3 into memory, feeds it to read_csv, and returns a DataFrame.
I think such a cursor would be easy to implement. (It is not really a cursor anymore. 😅)
Is smart_open faster than downloading the CSV with boto3?
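
Something like this rough sketch, perhaps (the helper name and the naive bucket/key split are illustrative only, and boto3 is assumed to pick up credentials and region from the environment):

import io

import boto3
import pandas as pd

def fetch_result_as_dataframe(output_location):
    """Download an Athena result CSV (e.g. 's3://bucket/path/to/query-id.csv')
    into memory and return it as a DataFrame."""
    bucket, key = output_location.replace('s3://', '', 1).split('/', 1)
    obj = boto3.client('s3').get_object(Bucket=bucket, Key=key)
    return pd.read_csv(io.BytesIO(obj['Body'].read()))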

laughingman7743 commented on July 26, 2024

By the way, the performance of the JDBC driver seems to have improved.
https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-athena-streaming-jdbc-driver/

PyAthenaJDBC v2.0.0, which supports the new driver, has been released.
https://github.com/laughingman7743/PyAthenaJDBC/releases/tag/v2.0.0
Would you be willing to check its performance?

jkleint commented on July 26, 2024

I tried PyAthenaJDBC, and it connects, but it just hangs executing a query (PyAthena works fine). Seems to hang at cursor.py:118 running statement.execute(). I'm not very familiar with JDBC, so I may be missing something basic. Do I need to do anything besides pipenv install PyAthenaJDBC?

Update: Looks like it's a timeout:

jpype._jexception.SQLExceptionPyRaisable: java.sql.SQLException: [Simba][AthenaJDBC](100131) An error has been thrown from the AWS SDK client. Unable to execute HTTP request: Connect to athena.us-east-1.amazonaws.com:444 [athena.us-east-1.amazonaws.com/52.204.11.152, athena.us-east-1.amazonaws.com/34.197.0.216, athena.us-east-1.amazonaws.com/34.199.65.113, athena.us-east-1.amazonaws.com/34.232.91.196, athena.us-east-1.amazonaws.com/52.2.248.1, athena.us-east-1.amazonaws.com/52.71.54.205] failed: Connection timed out (Connection timed out)

I'm using the exact same connect call for both. Any reason PyAthena would connect but JDBC wouldn't?

I may look into writing a new cursor. Thanks!

laughingman7743 commented on July 26, 2024

It should work as long as Java is installed; JDBC does not require any special environment beyond that.
It seems like a problem with your network environment. 🤔

I tried implementing PandasCursor. Please check the following branch.
/pull/47

Usage:

from pyathena import connect
from pyathena.pandas_cursor import PandasCursor

cursor = connect(s3_staging_dir='s3://YOUR_BUCKET/path/to/',
                 region_name='us-west-2',
                 cursor_class=PandasCursor).cursor()
df = cursor.execute("SELECT * FROM many_rows").as_pandas()
print(cursor.state)
print(cursor.state_change_reason)
print(cursor.completion_date_time)
print(cursor.submission_date_time)
print(cursor.data_scanned_in_bytes)
print(cursor.execution_time_in_millis)
print(cursor.output_location)
print(df.describe())

jkleint commented on July 26, 2024

That looks great. One remaining issue is that this leaves Pandas to guess the column types, which isn't always possible. Here's what I'm using to convert SQL types to Pandas types:

from decimal import Decimal
from binascii import unhexlify
import json

import pandas as pd

SQL_PANDAS_TYPES = {
    'boolean': bool,
    'tinyint': int,
    'smallint': int,
    'integer': int,
    'bigint': int,
    'float': float,
    'real': float,
    'double': float,
    'char': str,
    'varchar': str,
}

SQL_PANDAS_CONVERTERS = {
    'decimal': Decimal,
    'varbinary': lambda b: unhexlify(b.replace(' ', '')),
    'json': json.loads,
}


def read_sql_csv(csvfile, sqltypes):
    """:Return: a Pandas DataFrame made from `csvfile` using the "dtypes" from a dict
    `{colname: sqltype}` giving the SQL type of each column."""
    dtype = {col: SQL_PANDAS_TYPES[sqltype] for col, sqltype in sqltypes.items() if sqltype in SQL_PANDAS_TYPES}
    converters = {col: SQL_PANDAS_CONVERTERS[sqltype] for col, sqltype in sqltypes.items() if sqltype in SQL_PANDAS_CONVERTERS}
    parse_dates = [col for col, sqltype in sqltypes.items() if sqltype in ('date', 'time', 'timestamp', 'time with time zone', 'timestamp with time zone')]
    times = [col for col, sqltype in sqltypes.items() if sqltype in ('time', 'time with time zone')]   # parse_dates will add today's date to pure times
    result = pd.read_csv(csvfile, dtype=dtype, parse_dates=parse_dates, converters=converters, infer_datetime_format=True)
    result.loc[:, times] = result.loc[:, times].apply(lambda r: r.dt.time)
    return result

(Pandas read_csv() isn't very consistent about types and conversion, so you have to special-case dates/times and a few others.)

I left array, map, and row types as strings since Athena's CSV representation of these types is ambiguous, so there's no way to parse them correctly. (I suggest cast(col as json) if you want to use complex types.)

Then you call it like:

sqltypes = dict(desc[0:2] for desc in cursor.description)
read_sql_csv(csvfile, sqltypes) 

I haven't tested this thoroughly, or on Python 2 (but I tried to make it 2/3 compatible).
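
As a quick, made-up sanity check of read_sql_csv (the column names and values are invented, not real Athena output):

import io

csvfile = io.StringIO(
    'price,payload,created_at,opens_at\n'
    '1.50,68 65 6c 6c 6f,2018-01-01 12:34:56.000,09:30:00.000\n'
)
df = read_sql_csv(csvfile, {'price': 'decimal',
                            'payload': 'varbinary',
                            'created_at': 'timestamp',
                            'opens_at': 'time'})
# price parses as Decimal, payload as bytes, created_at as datetime64, opens_at as time-of-day
print(df)
print(df.dtypes)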

laughingman7743 commented on July 26, 2024

Thanks for the nice advice.
I implemented type conversion with reference to your code, and I wrote tests.
Please check the branch. Thanks!

jkleint commented on July 26, 2024

Hey that was fast! Thank you for your work.

I tried it on a table containing timestamps, and unfortunately it was still very slow. Looking into it, all the time was spent parsing datetimes. I noticed you used dateutil as a converter. When using converters=..., they are called one at a time on each value, with slow Python code. When you use Pandas' parse_dates=..., it uses fast vectorized C code. A quick test shows that Pandas' date parsing is over 60X faster than dateutil.parser.parse.

import io

import pandas as pd
from dateutil.parser import parse

csv = io.StringIO()
pd.Series(pd.date_range('2018-01-01', '2019-01-01', freq='min')).to_csv(csv, index=False, header=['time'])
csv.seek(0)
%time df = pd.read_csv(csv, parse_dates=['time'])
csv.seek(0)
%time df = pd.read_csv(csv, converters={'time': parse})
CPU times: user 723 ms, sys: 69.7 ms, total: 793 ms
Wall time: 793 ms
CPU times: user 50 s, sys: 276 ms, total: 50.3 s
Wall time: 50.3 s

That's why I went out of my way to special-case dates in my code; using vectorized date parsing is much faster, even when you have to go back and pull out the times.
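
For example, a tiny made-up illustration of that caveat (the column name is invented):

import io

import pandas as pd

buf = io.StringIO('t\n12:34:56\n23:45:01\n')
df = pd.read_csv(buf, parse_dates=['t'])   # parse_dates attaches today's date to bare times
df['t'] = df['t'].dt.time                  # pull the time-of-day back out
print(df)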

laughingman7743 commented on July 26, 2024

Oh, dateutil.parser.parse is very, very slow. 😱
Now I understand why you parse the datetime types with Pandas.
I've fixed the datetime parsing. Are there any other performance problems?

from pyathena.

jkleint commented on July 26, 2024

Over 13X faster, impressive! Thank you again for this. You are a fantastic open-source maintainer; it was a pleasure working with you and with your software.
