Comments (11)

jkleint commented on July 26, 2024

Yeah, it's working very well! Thank you so much. I'll test a bit more next week.

laughingman7743 commented on July 26, 2024

I checked the performance.
https://gist.github.com/laughingman7743/2e4d83ca4e394dc645e9ea9a45fe78ba
PandasCursor is ultra fast. 😆

laughingman7743 commented on July 26, 2024

It would be nice to create a new PandasCursor like AsyncCursor.

cursor = connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
                 region_name='us-west-2',
                 cursor_class=PandasCursor).cursor()
df = cursor.execute("SELECT * FROM many_rows").as_pandas()

When the query execution completes, the cursor downloads the CSV result from S3 into memory, feeds it to read_csv, and returns a DataFrame.
I think such a cursor would be easy to implement. (It is not really a cursor anymore. 😅)
Is smart_open faster than downloading the CSV with boto3?
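
Something like this rough sketch, perhaps (the helper name and the naive bucket/key split are illustrative only, and boto3 is assumed to pick up credentials and region from the environment):

import io

import boto3
import pandas as pd

def fetch_result_as_dataframe(output_location):
    """Download an Athena result CSV (e.g. 's3://bucket/path/to/query-id.csv')
    into memory and return it as a DataFrame."""
    bucket, key = output_location.replace('s3://', '', 1).split('/', 1)
    obj = boto3.client('s3').get_object(Bucket=bucket, Key=key)
    return pd.read_csv(io.BytesIO(obj['Body'].read()))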

laughingman7743 commented on July 26, 2024

By the way, the performance of the JDBC driver seems to have improved.
https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-athena-streaming-jdbc-driver/

PyAthenaJDBC v2.0.0, which supports the new driver, has been released.
https://github.com/laughingman7743/PyAthenaJDBC/releases/tag/v2.0.0
Would you be willing to check its performance?

jkleint commented on July 26, 2024

I tried PyAthenaJDBC, and it connects, but it just hangs executing a query (PyAthena works fine). Seems to hang at cursor.py:118 running statement.execute(). I'm not very familiar with JDBC, so I may be missing something basic. Do I need to do anything besides pipenv install PyAthenaJDBC?

Update: Looks like it's a timeout:

jpype._jexception.SQLExceptionPyRaisable: java.sql.SQLException: [Simba][AthenaJDBC](100131) An error has been thrown from the AWS SDK client. Unable to execute HTTP request: Connect to athena.us-east-1.amazonaws.com:444 [athena.us-east-1.amazonaws.com/52.204.11.152, athena.us-east-1.amazonaws.com/34.197.0.216, athena.us-east-1.amazonaws.com/34.199.65.113, athena.us-east-1.amazonaws.com/34.232.91.196, athena.us-east-1.amazonaws.com/52.2.248.1, athena.us-east-1.amazonaws.com/52.71.54.205] failed: Connection timed out (Connection timed out)

I'm using the exact same connect call for both. Any reason PyAthena would connect but JDBC wouldn't?

I may look into writing a new cursor. Thanks!

laughingman7743 commented on July 26, 2024

It should work as long as Java is installed; JDBC does not require any special environment beyond that.
It seems like a problem with your network environment. 🤔

I tried implementing PandasCursor. Please check the following branch.
/pull/47

Usage:

from pyathena import connect
from pyathena.pandas_cursor import PandasCursor

cursor = connect(s3_staging_dir='s3://YOUR_BUCKET/path/to/',
                 region_name='us-west-2',
                 cursor_class=PandasCursor).cursor()
df = cursor.execute("SELECT * FROM many_rows").as_pandas()
print(cursor.state)
print(cursor.state_change_reason)
print(cursor.completion_date_time)
print(cursor.submission_date_time)
print(cursor.data_scanned_in_bytes)
print(cursor.execution_time_in_millis)
print(cursor.output_location)
print(df.describe())

jkleint commented on July 26, 2024

That looks great. One remaining issue is that this leaves Pandas to guess the column types, which isn't always possible. Here's what I'm using to convert SQL types to Pandas types:

from decimal import Decimal
from binascii import unhexlify
import json

import pandas as pd

SQL_PANDAS_TYPES = {
    'boolean': bool,
    'tinyint': int,
    'smallint': int,
    'integer': int,
    'bigint': int,
    'float': float,
    'real': float,
    'double': float,
    'char': str,
    'varchar': str,
}

SQL_PANDAS_CONVERTERS = {
    'decimal': Decimal,
    'varbinary': lambda b: unhexlify(b.replace(' ', '')),
    'json': json.loads,
}


def read_sql_csv(csvfile, sqltypes):
    """:Return: a Pandas DataFrame made from `csvfile` using the "dtypes" from a dict
    `{colname: sqltype}` giving the SQL type of each column."""
    dtype = {col: SQL_PANDAS_TYPES[sqltype] for col, sqltype in sqltypes.items() if sqltype in SQL_PANDAS_TYPES}
    converters = {col: SQL_PANDAS_CONVERTERS[sqltype] for col, sqltype in sqltypes.items() if sqltype in SQL_PANDAS_CONVERTERS}
    parse_dates = [col for col, sqltype in sqltypes.items() if sqltype in ('date', 'time', 'timestamp', 'time with time zone', 'timestamp with time zone')]
    times = [col for col, sqltype in sqltypes.items() if sqltype in ('time', 'time with time zone')]   # parse_dates will add today's date to pure times
    result = pd.read_csv(csvfile, dtype=dtype, parse_dates=parse_dates, converters=converters, infer_datetime_format=True)
    result.loc[:, times] = result.loc[:, times].apply(lambda r: r.dt.time)
    return result

(Pandas read_csv() isn't very consistent about types and conversion, so you have to special-case dates/times and a few others.)

I left array, map, and row types as strings since Athena's CSV representation of these types is ambiguous, so there's no way to parse them correctly. (I suggest cast(col as json) if you want to use complex types.)

Then you call it like:

sqltypes = dict(desc[0:2] for desc in cursor.description)
read_sql_csv(csvfile, sqltypes) 

I haven't tested this thoroughly, or on Python 2 (but I tried to make it 2/3 compatible).
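
As a quick, made-up sanity check of read_sql_csv (the column names and values are invented, not real Athena output):

import io

csvfile = io.StringIO(
    'price,payload,created_at,opens_at\n'
    '1.50,68 65 6c 6c 6f,2018-01-01 12:34:56.000,09:30:00.000\n'
)
df = read_sql_csv(csvfile, {'price': 'decimal',
                            'payload': 'varbinary',
                            'created_at': 'timestamp',
                            'opens_at': 'time'})
# price parses as Decimal, payload as bytes, created_at as datetime64, opens_at as time-of-day
print(df)
print(df.dtypes)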

laughingman7743 commented on July 26, 2024

Thanks for the nice advice.
I implemented type conversion with reference to your code, and I wrote tests.
Please check the branch. Thanks!

jkleint commented on July 26, 2024

Hey that was fast! Thank you for your work.

I tried it on a table containing timestamps, and unfortunately it was still very slow. Looking into it, all the time was spent parsing datetimes. I noticed you used dateutil as a converter. When using converters=..., they are called one at a time on each value, with slow Python code. When you use Pandas' parse_dates=..., it uses fast vectorized C code. A quick test shows that Pandas' date parsing is over 60X faster than dateutil.parser.parse.

import io

import pandas as pd
from dateutil.parser import parse

csv = io.StringIO()
pd.Series(pd.date_range('2018-01-01', '2019-01-01', freq='min')).to_csv(csv, index=False, header=['time'])
csv.seek(0)
%time df = pd.read_csv(csv, parse_dates=['time'])
csv.seek(0)
%time df = pd.read_csv(csv, converters={'time': parse})
CPU times: user 723 ms, sys: 69.7 ms, total: 793 ms
Wall time: 793 ms
CPU times: user 50 s, sys: 276 ms, total: 50.3 s
Wall time: 50.3 s

That's why I went out of my way to special-case dates in my code; using vectorized date parsing is much faster, even when you have to go back and pull out the times.
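
For example, a tiny made-up illustration of that caveat (the column name is invented):

import io

import pandas as pd

buf = io.StringIO('t\n12:34:56\n23:45:01\n')
df = pd.read_csv(buf, parse_dates=['t'])   # parse_dates attaches today's date to bare times
df['t'] = df['t'].dt.time                  # pull the time-of-day back out
print(df)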

laughingman7743 commented on July 26, 2024

Oh, dateutil.parser.parse is very, very slow. 😱
Now I understand why you parse the datetime types with Pandas.
I've fixed the datetime parsing. Are there any other performance problems?

from pyathena.

jkleint commented on July 26, 2024

Over 13X faster, impressive! Thank you again for this. You are a fantastic open-source maintainer; it was a pleasure working with you and with your software.
