parquet-resultset's Introduction

parquet-resultset

The parquet-resultset library can be used to convert a standard SQL ResultSet into Parquet or Avro.

For example:

        /* Standard SQL connection stuff here, where statement is a standard java.sql.Statement. */

        ResultSet resultSet = statement.executeQuery("SELECT * FROM Widget");

        String schemaName = "Widget";
        String namespace = "org.mycompany.widgets";
        
        // here listeners can be added which will be notified when records are parsed
        List<TransformerListener> listeners = new ArrayList<>();

        listeners.add(new TransformerListener() {
        
            @Override
            public void onSchemaParsed(SchemaResults schemaResults) {
                // called when the schema is parsed
            }
            
            @Override
            public void onRecordParsed(GenericRecord record) {
                // called whenever a record/row is parsed
            }
        });

        ResultSetTransformer transformer = new ResultSetParquetTransformer();
        // or new ResultSetArvoTransformer()
        InputStream inputStream = transformer.toParquet(resultSet, schemaName, namespace, listeners); 
        

The input stream can be written out anywhere, e.g. to a file or to S3, where it can then be queried by Athena.
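
As a minimal sketch of writing the stream to a file: Parquet is a binary format, so the copy below is byte-oriented (`Files.copy`) rather than going through a character `Writer`. The class name `StreamToFile`, the helper `writeTo`, and the stand-in byte array are illustrative assumptions and not part of parquet-resultset; in real use the stream would be the one returned by the transformer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StreamToFile {

    /**
     * Copies the stream to the target path byte for byte, closes the stream,
     * and returns the number of bytes written. Hypothetical helper.
     */
    public static long writeTo(InputStream in, Path target) throws IOException {
        try (InputStream source = in) {
            // REPLACE_EXISTING is needed because createTempFile already created the file
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the InputStream returned by the transformer (assumption)
        byte[] payload = {'P', 'A', 'R', '1', (byte) 0x80, (byte) 0xFF, 'P', 'A', 'R', '1'};
        Path target = Files.createTempFile("widget_", ".parquet");
        long written = writeTo(new ByteArrayInputStream(payload), target);
        System.out.println(written + " bytes written to " + target);
    }
}
```

Bytes such as `0x80` and `0xFF` survive this copy unchanged, which is not guaranteed when the data is decoded and re-encoded as text.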

Building

./gradlew build

License

This project is Open Source software released under the https://www.apache.org/licenses/LICENSE-2.0.html[Apache 2.0 license].

parquet-resultset's People

Contributors

benleov

parquet-resultset's Issues

NUMERIC without PRECISION is mapped to Double

CREATE TABLE ifrsbox.cfe.ledger_account_entry (
    id_account_credit        NUMERIC (9)             NOT NULL
    , id_account_debit       NUMERIC (9)             NOT NULL
    , id_instrument_ref      NUMERIC (9)             NOT NULL
    , id_accounting_event    NUMERIC (5)             NOT NULL
    , value_date             TIMESTAMP               NOT NULL
    , posting_date           TIMESTAMP               NOT NULL
    , amount                 DECIMAL (23,5)          NOT NULL
    , amount_bc              DECIMAL (23,5)          NOT NULL
    , reversed               CHARACTER VARYING (1)   NOT NULL
)
;

The first 4 columns are mapped to Double, which is wrong since they form the PRIMARY KEY.
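
One possible way to avoid the Double mapping, sketched under assumptions: decide the target type from the column's precision and scale, which JDBC reports through `ResultSetMetaData.getPrecision` and `ResultSetMetaData.getScale`. The class `NumericMapping`, the enum `Target`, and the thresholds are hypothetical and do not reflect the library's actual mapping code.

```java
import java.sql.Types;

public class NumericMapping {

    /** Hypothetical target types; not the library's own type model. */
    public enum Target { INT, LONG, DECIMAL, OTHER }

    /**
     * Chooses a target type for a column. In real use, precision and scale would come
     * from ResultSetMetaData.getPrecision(col) and ResultSetMetaData.getScale(col).
     */
    public static Target choose(int sqlType, int precision, int scale) {
        if (sqlType != Types.NUMERIC && sqlType != Types.DECIMAL) {
            return Target.OTHER;
        }
        if (scale == 0 && precision > 0) {
            if (precision <= 9) {
                return Target.INT;   // at most 9 decimal digits always fit in a 32-bit int
            }
            if (precision <= 18) {
                return Target.LONG;  // at most 18 decimal digits always fit in a 64-bit long
            }
        }
        // Fractional digits, very large precision, or unknown precision: keep exact decimals
        return Target.DECIMAL;
    }

    public static void main(String[] args) {
        System.out.println(choose(Types.NUMERIC, 9, 0));   // id_account_credit NUMERIC (9)
        System.out.println(choose(Types.DECIMAL, 23, 5));  // amount DECIMAL (23,5)
    }
}
```

With a rule like this, key columns such as `NUMERIC (9)` would stay exact integers, while `DECIMAL (23,5)` would keep its fractional digits.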

PARQUET file can't be read

Greetings!

First of all, big thank you for providing this library. Secondly, I am just starting with DuckDB and Parquet so it could be entirely my fault.

After creating a parquet file from a simple ResultSet, I can't import it into DuckDB:

file = File.createTempFile(targetTableName + "_", ".parquet");
String namespace = "com.manticore.etl";

rs = st.executeQuery();

List<TransformerListener> listeners = new ArrayList<>();
ResultSetTransformer transformer = new ResultSetParquetTransformer();
InputStream inputStream = transformer.transform(rs, targetTableName, namespace, listeners);
FileWriter fileWriter = new FileWriter(file);
IOUtils.copy(inputStream, fileWriter, Charset.defaultCharset());
fileWriter.flush();
fileWriter.close();

D select count(*) from read_parquet('/tmp/ledger_account_entry_16076762837875017917.parquet');
Error: Invalid Error: TProtocolException: Invalid data

Am I doing something wrong in the java code?
Do I need to change the compression or deflate the file (I guess, it's SNAPPY compressed)?

I will appreciate any hint.
Thanks a lot in advance.

DATE Columns having NULL values --> NPE

Exporting works in general, but not when there is a DATE column with NULL values.
VARCHAR null seems to work though.

Sample:

CREATE TABLE ifrsbox.cfe.instrument (
    id_instrument                    CHARACTER VARYING (40)  NOT NULL
    , id_instrument_commitment       CHARACTER VARYING (40)  NULL
    , id_instrument_type             CHARACTER VARYING (12)  NOT NULL
    , start_date                     TIMESTAMP               NOT NULL
    , end_date                       TIMESTAMP               NULL
    , id_currency                    CHARACTER VARYING (3)   NOT NULL
    , id_calendar                    CHARACTER VARYING (12)  NULL
    , id_business_day_convention     CHARACTER VARYING (12)  NULL
    , discount_curve                 CHARACTER VARYING (12)  NULL
    , discount_spread                CHARACTER VARYING (12)  NULL
)
;

INSERT INTO cfe.instrument
VALUES (    'deposit_1', NULL, 'deposit'
            , {d '2015-12-31'}, {d '2017-12-31'}, 'USD'
            , 'DEFAULT', 'P', NULL
            , NULL )
;

INSERT INTO cfe.instrument
VALUES (    'acc_1', NULL, 'curr_acc'
            , {d '2015-12-31'}, NULL, 'USD'
            , 'DEFAULT', NULL, NULL
            , NULL )
;
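
The source does not show where the NPE originates. A common cause of this pattern, assumed here, is calling a method on the value returned by `ResultSet.getTimestamp` (or `getDate`), which is `null` for SQL NULL. The sketch below shows a null-safe conversion; the class `NullSafeTimestamp` and the helper `toEpochMillis` are hypothetical and not part of the library.

```java
import java.sql.Timestamp;

public class NullSafeTimestamp {

    /**
     * Converts a timestamp to epoch milliseconds, passing SQL NULL through as null
     * instead of dereferencing it. Hypothetical helper.
     */
    public static Long toEpochMillis(Timestamp value) {
        if (value == null) {
            return null; // nullable column such as end_date: keep the NULL
        }
        return value.getTime();
    }

    public static void main(String[] args) {
        // Stand-ins for resultSet.getTimestamp("end_date") on the two sample rows
        Timestamp endDateRow1 = Timestamp.valueOf("2017-12-31 00:00:00");
        Timestamp endDateRow2 = null;
        System.out.println(toEpochMillis(endDateRow1));
        System.out.println(toEpochMillis(endDateRow2));
    }
}
```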

License

Hi, can you please add a license to the project for clarification?

Like Apache 2.0 :-)

Thank you kindly
Mike
