This project forked from aws/aws-sdk-pandas

Pandas on AWS

Home Page: https://aws-data-wrangler.readthedocs.io

License: Apache License 2.0

AWS Data Wrangler

Pandas on AWS

IMPORTANT NOTE: Version 1.0.0 is coming soon with several breaking changes.

Please pin the version you are using in your environment.
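
As a minimal sketch of what an exact pin looks like, the helper below checks whether a requirement line locks the package to a single version with `==`. The function name `is_pinned` and the version string `0.3.2` are placeholders chosen for illustration, not a recommendation from the project; substitute the version you actually run.

```python
import re

# Matches an exact pin such as "awswrangler==0.3.2".
# The version number used in the examples is a placeholder, not a recommendation.
_EXACT_PIN = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*==\s*([0-9][A-Za-z0-9_.\-]*)\s*$")


def is_pinned(requirement: str, package: str = "awswrangler") -> bool:
    """Return True only if `requirement` pins `package` to one exact version."""
    match = _EXACT_PIN.match(requirement)
    return bool(match) and match.group(1).lower() == package.lower()


if __name__ == "__main__":
    print(is_pinned("awswrangler==0.3.2"))  # exact pin
    print(is_pinned("awswrangler>=0.3.2"))  # open-ended range, not a pin
```

A range such as `>=` would let a fresh install pick up 1.0.0 and its breaking changes, which is what the note above asks you to avoid.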

AWS Data Wrangler is turning one year old, and the team is collecting feedback and feature requests for version 1.0.0. So far, three major changes are planned:

  • API redesign
  • Nested data types support
  • Deprecation of PySpark support
    • PySpark support takes up a considerable part of the development time, and this has not been reflected in user adoption. Only 2 of our 66 issues on GitHub are related to Spark.
    • In addition, the integration between PySpark and PyArrow/Pandas remains experimental, and we have had a hard time keeping it stable.


Resources

Use Cases

PySpark

| FROM | TO | Features |
| --- | --- | --- |
| PySpark DataFrame | Amazon Redshift | Blazing fast using parallel Parquet on S3 behind the scenes. Append/Overwrite/Upsert modes |
| PySpark DataFrame | Glue Catalog | Register Parquet or CSV DataFrame on Glue Catalog |
| Nested PySpark DataFrame | Flat PySpark DataFrames | Flatten structs and break up arrays in child tables |
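
To make the idea of flattening structs and breaking arrays out into child tables concrete, here is a small sketch on plain Python dictionaries rather than PySpark DataFrames. It is not the library's implementation: the function names, the `id`/`parent_id`/`pos` columns, and the `root` table name are all assumptions made for illustration, and array elements are assumed to be scalars.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def flatten_records(records: List[Row], root: str = "root") -> Dict[str, List[Row]]:
    """Flatten nested records into one parent table plus one child table per array."""
    tables: Dict[str, List[Row]] = {root: []}
    for row_id, record in enumerate(records):
        flat: Row = {"id": row_id}
        _flatten_struct(record, "", flat, tables, root, row_id)
        tables[root].append(flat)
    return tables


def _flatten_struct(value: Row, prefix: str, flat: Row,
                    tables: Dict[str, List[Row]], root: str, row_id: int) -> None:
    for key, item in value.items():
        name = f"{prefix}{key}"
        if isinstance(item, dict):
            # Struct: promote nested fields to prefixed columns on the parent row.
            _flatten_struct(item, name + "_", flat, tables, root, row_id)
        elif isinstance(item, list):
            # Array: one child-table row per element, linked back by parent_id.
            child = tables.setdefault(f"{root}_{name}", [])
            for pos, element in enumerate(item):
                child.append({"parent_id": row_id, "pos": pos, "value": element})
        else:
            flat[name] = item


if __name__ == "__main__":
    nested = [{"name": "a", "address": {"city": "x", "zip": "1"}, "tags": ["p", "q"]}]
    for table, rows in flatten_records(nested).items():
        print(table, rows)
```

Each array becomes its own table so that the parent table keeps exactly one row per input record, which is the shape relational targets generally expect.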

Pandas

| FROM | TO | Features |
| --- | --- | --- |
| Pandas DataFrame | Amazon S3 | Parquet, CSV, Partitions, Parallelism, Overwrite/Append/Partitions-Upsert modes, KMS Encryption, Glue Metadata (Athena, Spectrum, Spark, Hive, Presto) |
| Amazon S3 | Pandas DataFrame | Parquet (Pushdown filters), CSV, Fixed-width formatted, Partitions, Parallelism, KMS Encryption, Multiple files |
| Amazon Athena | Pandas DataFrame | Workgroups, S3 output path, Encryption, and two different engines: ctas_approach=False (batching, for restricted-memory environments) and ctas_approach=True (blazing fast, parallelism, and enhanced data types) |
| Pandas DataFrame | Amazon Redshift | Blazing fast using parallel Parquet on S3 behind the scenes. Append/Overwrite/Upsert modes |
| Amazon Redshift | Pandas DataFrame | Blazing fast using parallel Parquet on S3 behind the scenes |
| Pandas DataFrame | Amazon Aurora | Supported engines: MySQL, PostgreSQL. Blazing fast using parallel CSV on S3 behind the scenes. Append/Overwrite modes |
| Amazon Aurora | Pandas DataFrame | Supported engines: MySQL. Blazing fast using parallel CSV on S3 behind the scenes |
| CloudWatch Logs Insights | Pandas DataFrame | Query results |
| Glue Catalog | Pandas DataFrame | List tables and get table details. Good fit with Jupyter Notebooks. |
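
The Partitions-Upsert mode listed for writing to Amazon S3 can be pictured with a small in-memory sketch. It assumes the mode means "replace every partition that appears in the new data and leave all other partitions untouched"; that reading, the function name `upsert_partitions`, and the dictionary-of-partitions layout are assumptions for illustration, not the library's code.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]
Partitions = Dict[Tuple[Any, ...], List[Row]]


def upsert_partitions(existing: Partitions, new_rows: List[Row],
                      partition_cols: List[str]) -> Partitions:
    """Replace partitions present in `new_rows`; keep every other partition as-is."""
    incoming: Partitions = {}
    for row in new_rows:
        # The partition key is the tuple of the row's partition column values.
        key = tuple(row[col] for col in partition_cols)
        incoming.setdefault(key, []).append(row)
    result = dict(existing)   # untouched partitions are carried over
    result.update(incoming)   # touched partitions are overwritten wholesale
    return result


if __name__ == "__main__":
    stored = {(2019,): [{"year": 2019, "v": 1}], (2020,): [{"year": 2020, "v": 2}]}
    fresh = [{"year": 2020, "v": 3}, {"year": 2021, "v": 4}]
    print(upsert_partitions(stored, fresh, ["year"]))
```

Under this reading, Append would extend the row list of each touched partition instead of replacing it, and Overwrite would discard `existing` entirely.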

General

| Feature | Details |
| --- | --- |
| List S3 objects | e.g. wr.s3.list_objects("s3://...") |
| Delete S3 objects | Parallel |
| Delete listed S3 objects | Parallel |
| Delete NOT listed S3 objects | Parallel |
| Copy listed S3 objects | Parallel |
| Get the size of S3 objects | Parallel |
| Get CloudWatch Logs Insights query results | |
| Load partitions on Athena/Glue table | Through "MSCK REPAIR TABLE" |
| Create EMR cluster | "For humans" |
| Terminate EMR cluster | "For humans" |
| Get EMR cluster state | "For humans" |
| Submit EMR step(s) | "For humans" |
| Get EMR step state | "For humans" |
| Query Athena to receive Python primitives | Returns Iterable[Dict[str, Any]] |
| Load and unzip SageMaker job outputs | |
| Dump Amazon Redshift as Parquet files on S3 | |
| Dump Amazon Aurora as CSV files on S3 | Only for MySQL engine |
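
The "Parallel" deletes in the table above can be sketched as batching keys and fanning the batches out to a thread pool. This is only an illustration of the general technique, not the library's implementation: `keys_not_listed`, `delete_in_parallel`, and the injected `delete_batch` callable are hypothetical names, and no AWS call is made. The default batch size of 1000 reflects the per-request key limit of the S3 multi-object delete API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Sequence


def keys_not_listed(all_keys: Iterable[str], keep: Iterable[str]) -> List[str]:
    """Return the keys that are NOT in `keep`, preserving their original order."""
    keep_set = set(keep)
    return [key for key in all_keys if key not in keep_set]


def chunked(keys: Sequence[str], size: int) -> List[List[str]]:
    """Split `keys` into consecutive batches of at most `size` items."""
    return [list(keys[i:i + size]) for i in range(0, len(keys), size)]


def delete_in_parallel(keys: Sequence[str],
                       delete_batch: Callable[[List[str]], int],
                       batch_size: int = 1000,
                       max_workers: int = 4) -> int:
    """Call `delete_batch` on each batch concurrently; return the total deleted.

    `delete_batch` is a caller-supplied (hypothetical) function that deletes one
    batch of keys and returns how many it removed.
    """
    batches = chunked(keys, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(delete_batch, batches))


if __name__ == "__main__":
    objects = [f"s3://bucket/key{i}" for i in range(2500)]
    doomed = keys_not_listed(objects, keep=objects[:10])
    print(delete_in_parallel(doomed, delete_batch=len))  # `len` stands in for a real delete
```

Injecting `delete_batch` keeps the batching and concurrency logic testable without credentials; in real use it would wrap whatever client performs the deletion.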
