Comments (11)
Yeah, it's working very well! Thank you so much. I'll test a bit more next week.
I checked the performance.
https://gist.github.com/laughingman7743/2e4d83ca4e394dc645e9ea9a45fe78ba
PandasCursor is ultra fast. 😆
It would be nice to create a new PandasCursor, like AsyncCursor.
cursor = connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
region_name='us-west-2',
cursor_class=PandasCursor).cursor()
df = cursor.execute("SELECT * FROM many_rows").as_pandas()
When the query execution completes, the cursor would download the CSV from S3 into memory, feed the data to read_csv, and return a DataFrame. I think such a cursor would be easy to implement. (It is not really a cursor anymore. 😅)
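A rough sketch of what I have in mind (the output-location parsing and the bucket/key handling here are just assumptions for illustration):

import io
import re

import boto3
import pandas as pd

def as_pandas(output_location, region_name='us-west-2'):
    # output_location is the query result CSV, e.g. 's3://bucket/path/to/key.csv',
    # as reported in Athena's GetQueryExecution response.
    bucket, key = re.match(r's3://([^/]+)/(.+)', output_location).groups()
    obj = boto3.client('s3', region_name=region_name).get_object(Bucket=bucket, Key=key)
    # Read the whole object into memory and hand it to read_csv.
    return pd.read_csv(io.BytesIO(obj['Body'].read()))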
Is smart_open faster than downloading the CSV with boto3?
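For reference, a quick way to compare the two (a sketch; the bucket and key are placeholders, and it assumes smart_open's open() accepts s3:// URLs, which recent versions do):

import io
import time

import boto3
import pandas as pd
from smart_open import open as s3_open

BUCKET, KEY = 'YOUR_S3_BUCKET', 'path/to/result.csv'

# Download the whole object with boto3, then parse from memory.
start = time.time()
obj = boto3.client('s3').get_object(Bucket=BUCKET, Key=KEY)
df_boto3 = pd.read_csv(io.BytesIO(obj['Body'].read()))
print('boto3:', time.time() - start)

# Stream the object directly through smart_open.
start = time.time()
with s3_open('s3://{}/{}'.format(BUCKET, KEY), 'rb') as f:
    df_smart = pd.read_csv(f)
print('smart_open:', time.time() - start)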
By the way, the performance of the JDBC driver seems to have improved.
https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-athena-streaming-jdbc-driver/
PyAthenaJDBC v2.0.0, which supports the new driver, has been released.
https://github.com/laughingman7743/PyAthenaJDBC/releases/tag/v2.0.0
Would you be willing to check the performance?
I tried PyAthenaJDBC, and it connects, but it just hangs executing a query (PyAthena works fine). It seems to hang at cursor.py:118, running statement.execute(). I'm not very familiar with JDBC, so I may be missing something basic. Do I need to do anything besides pipenv install PyAthenaJDBC?
Update: Looks like it's a timeout:
jpype._jexception.SQLExceptionPyRaisable: java.sql.SQLException: [Simba][AthenaJDBC](100131) An error has been thrown from the AWS SDK client. Unable to execute HTTP request: Connect to athena.us-east-1.amazonaws.com:444 [athena.us-east-1.amazonaws.com/52.204.11.152, athena.us-east-1.amazonaws.com/34.197.0.216, athena.us-east-1.amazonaws.com/34.199.65.113, athena.us-east-1.amazonaws.com/34.232.91.196, athena.us-east-1.amazonaws.com/52.2.248.1, athena.us-east-1.amazonaws.com/52.71.54.205] failed: Connection timed out (Connection timed out)
I'm using the exact same connect call for both. Any reason PyAthena would connect but JDBC wouldn't?
I may look into writing a new cursor. Thanks!
It will work as long as Java is installed; JDBC does not require any special environment beyond that. It seems like a problem with your network environment. 🤔
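Note that the timeout in the log above is on port 444: the new streaming JDBC driver fetches results over port 444 rather than 443, so a firewall blocking outbound 444 would produce exactly this hang. A quick check (a sketch, not part of the library):

import socket

# If 443 connects but 444 times out, something on the network is blocking
# the streaming-result port that the new JDBC driver uses.
for port in (443, 444):
    s = socket.socket()
    s.settimeout(5)
    try:
        s.connect(('athena.us-east-1.amazonaws.com', port))
        print('port {}: open'.format(port))
    except OSError as e:
        print('port {}: {}'.format(port, e))
    finally:
        s.close()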
I tried implementing PandasCursor. Please check the following branch: #47
Usage:
from pyathena import connect
from pyathena.pandas_cursor import PandasCursor
cursor = connect(s3_staging_dir='s3://YOUR_BUCKET/path/to/',
region_name='us-west-2',
cursor_class=PandasCursor).cursor()
df = cursor.execute("SELECT * FROM many_rows").as_pandas()
print(cursor.state)
print(cursor.state_change_reason)
print(cursor.completion_date_time)
print(cursor.submission_date_time)
print(cursor.data_scanned_in_bytes)
print(cursor.execution_time_in_millis)
print(cursor.output_location)
print(df.describe())
That looks great. One further bit: this leaves pandas to guess the column types, which isn't always possible. Here's what I'm using to convert SQL types to pandas types:
from decimal import Decimal
from binascii import unhexlify
import json

import pandas as pd
SQL_PANDAS_TYPES = {
'boolean': bool,
'tinyint': int,
'smallint': int,
'integer': int,
'bigint': int,
'float': float,
'real': float,
'double': float,
'char': str,
'varchar': str,
}
SQL_PANDAS_CONVERTERS = {
'decimal': Decimal,
'varbinary': lambda b: unhexlify(b.replace(' ', '')),
'json': json.loads,
}
def read_sql_csv(csvfile, sqltypes):
""":Return: a Pandas DataFrame made from `csvfile` using the "dtypes" from a dict
`{colname: sqltype}` giving the SQL type of each column."""
dtype = {col: SQL_PANDAS_TYPES[sqltype] for col, sqltype in sqltypes.items() if sqltype in SQL_PANDAS_TYPES}
converters = {col: SQL_PANDAS_CONVERTERS[sqltype] for col, sqltype in sqltypes.items() if sqltype in SQL_PANDAS_CONVERTERS}
parse_dates = [col for col, sqltype in sqltypes.items() if sqltype in ('date', 'time', 'timestamp', 'time with time zone', 'timestamp with time zone')]
times = [col for col, sqltype in sqltypes.items() if sqltype in ('time', 'time with time zone')] # parse_dates will add today's date to pure times
result = pd.read_csv(csvfile, dtype=dtype, parse_dates=parse_dates, converters=converters, infer_datetime_format=True)
result.loc[:, times] = result.loc[:, times].apply(lambda r: r.dt.time)
return result
(Pandas' read_csv() isn't very consistent about types and conversion, so you have to special-case dates/times and a few others.) I left array, map, and row types as strings, since Athena's CSV representation of these types is ambiguous and there's no way to parse them correctly. (I suggest cast(col as json) if you want to use complex types.)
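For example (a sketch; the table and column names are hypothetical):

# Cast a complex-typed column to JSON in the query so that the 'json'
# converter above (json.loads) can parse it into Python lists/dicts.
cursor.execute("SELECT id, cast(tags AS json) AS tags FROM many_rows")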
Then you call it like:
sqltypes = dict(desc[0:2] for desc in cursor.description)
read_sql_csv(csvfile, sqltypes)
I haven't tested this thoroughly, or on Python 2 (but I tried to make it 2/3 compatible).
Thanks for the nice advice. I tried implementing the type conversion with reference to your code, and I wrote tests. Please check the branch. Thanks!
Hey, that was fast! Thank you for your work.
I tried it on a table containing timestamps, and unfortunately it was still very slow. Looking into it, all the time was spent parsing datetimes. I noticed you used dateutil as a converter. When you use converters=..., they are called one at a time on each value, in slow Python code. When you use Pandas' parse_dates=..., it uses fast vectorized C code. A quick test shows that Pandas' date parsing is over 60X faster than dateutil.parser.parse:
import io

import pandas as pd
from dateutil.parser import parse
csv = io.StringIO()
pd.Series(pd.date_range('2018-01-01', '2019-01-01', freq='min')).to_csv(csv, index=False, header=['time'])
csv.seek(0)
%time df = pd.read_csv(csv, parse_dates=['time'])
csv.seek(0)
%time df = pd.read_csv(csv, converters={'time': parse})
CPU times: user 723 ms, sys: 69.7 ms, total: 793 ms
Wall time: 793 ms
CPU times: user 50 s, sys: 276 ms, total: 50.3 s
Wall time: 50.3 s
That's why I went out of my way to special-case dates in my code: vectorized date parsing is much faster, even when you have to go back and pull out the times.
Oh, dateutil.parser.parse is very, very slow. 😱
Now I understand why you parse the datetime type in Pandas.
I've fixed the parsing of the datetime type. Are there any other performance problems?
Over 13X faster, impressive! Thank you again for this. You are a fantastic open source maintainer; it was a pleasure working with you and your software.