googleclouddataproc / hive-bigquery-storage-handler

Hive Storage Handler for interoperability between BigQuery and Apache Hive

License: Apache License 2.0

Languages: Java 99.90%, Dockerfile 0.10%
Topics: hive, bigquery, gcp, google, apache, hadoop

hive-bigquery-storage-handler's Introduction

Hive-BigQuery StorageHandler [No Longer Maintained]

This is a Hive StorageHandler plugin that enables Hive to interact with BigQuery. It allows you to keep your existing pipelines while moving your data to BigQuery. It uses the high-throughput BigQuery Storage API to read data and the BigQuery API to write data.

The following steps are performed on a Dataproc cluster in Google Cloud Platform. If you need to run them on your own cluster, you will need to set up the Google Cloud SDK and the Google Cloud Storage connector for Hadoop.

Getting the StorageHandler

  1. Check it out from GitHub.
  2. Build it with the new Google Hadoop BigQuery Connector:
git clone https://github.com/GoogleCloudPlatform/hive-bigquery-storage-handler  
cd hive-bigquery-storage-handler  
mvn clean install  
  3. Deploy hive-bigquery-storage-handler-1.0-shaded.jar.

Using the StorageHandler to access BigQuery

  1. Enable the BigQuery Storage API. Follow these instructions and check the pricing details.

  2. Copy the compiled jar to a Google Cloud Storage bucket that can be accessed by your Hive cluster.

  3. Open Hive CLI and load the jar as shown below:

hive> add jar gs://<Jar location>/hive-bigquery-storage-handler-1.0-shaded.jar;  
  4. Verify the jar is loaded successfully:
hive> list jars;  

At this point you can operate Hive just as you normally would.

Creating BigQuery tables

If you already have a BigQuery table, here is how you can define a Hive table that refers to it:

CREATE TABLE bq_test (word_count bigint, word string)  
 STORED BY 
 'com.google.cloud.hadoop.io.bigquery.hive.HiveBigQueryStorageHandler' 
 TBLPROPERTIES ( 
 'bq.dataset'='<BigQuery dataset name>', 
 'bq.table'='<BigQuery table name>', 
 'mapred.bq.project.id'='<Your Project ID>', 
 'mapred.bq.temp.gcs.path'='gs://<Bucket name>/<Temporary path>', 
 'mapred.bq.gcs.bucket'='<Cloud Storage Bucket name>' 
 );

You will need to provide the following table properties:

Property | Value
bq.dataset | BigQuery dataset id (optional if the Hive database name matches the BigQuery dataset name)
bq.table | BigQuery table name (optional if the Hive table name matches the BigQuery table name)
mapred.bq.project.id | Your project id
mapred.bq.temp.gcs.path | Temporary file location in a GCS bucket
mapred.bq.gcs.bucket | Temporary GCS bucket name
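
For example, here is the same definition with the placeholders filled in using hypothetical values (project my-project, dataset wordcount_dataset, table wordcount_output, and temporary bucket my-temp-bucket):

-- All identifiers below (project, dataset, table, bucket) are hypothetical placeholders
CREATE TABLE wordcount_output (word_count bigint, word string)
 STORED BY
 'com.google.cloud.hadoop.io.bigquery.hive.HiveBigQueryStorageHandler'
 TBLPROPERTIES (
 'bq.dataset'='wordcount_dataset',
 'bq.table'='wordcount_output',
 'mapred.bq.project.id'='my-project',
 'mapred.bq.temp.gcs.path'='gs://my-temp-bucket/tmp/bq',
 'mapred.bq.gcs.bucket'='my-temp-bucket'
 );

Since the Hive table name matches the BigQuery table name in this sketch, bq.table could be omitted.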

Data Type Mapping

BigQuery | Hive | Description
INTEGER | BIGINT | Signed 8-byte integer
FLOAT | DOUBLE | 8-byte double-precision floating point number
DATE | DATE | Format is YYYY-[M]M-[D]D. The supported range is 0001-01-01 to 9999-12-31
TIMESTAMP | TIMESTAMP | An absolute point in time since the Unix epoch, with millisecond precision in Hive versus microsecond precision in BigQuery
BOOLEAN | BOOLEAN | Boolean values represented by the keywords TRUE and FALSE
STRING | STRING | Variable-length character data
BYTES | BINARY | Variable-length binary data
REPEATED | ARRAY | Repeated values
RECORD | STRUCT | Nested structures

Filtering

The BigQuery Storage API allows column pruning and predicate filtering so that only the data you are interested in is read.

Column Pruning

Since BigQuery is backed by a columnar datastore, it can efficiently stream data without reading all columns.
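
For example, using the bq_test table defined above, a query that selects a single column lets the handler stream only that column from BigQuery:

SELECT word FROM bq_test;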

Predicate Filtering

The Storage API supports arbitrary pushdown of predicate filters. To enable predicate pushdown, ensure hive.optimize.ppd is set to true.
Filters on all primitive-type columns are pushed down to the storage layer, improving read performance. Predicate pushdown is not supported on complex types such as arrays and structs; for example, a filter like address.city = "Sunnyvale" will not be pushed down to BigQuery.
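
As an illustration, here is a minimal sketch assuming the bq_test table defined above (hive.optimize.ppd is a standard Hive session property):

-- Enable predicate pushdown for this session
SET hive.optimize.ppd=true;

-- A filter on a primitive column is eligible for pushdown to the BigQuery storage layer
SELECT word, word_count FROM bq_test WHERE word_count > 100;

-- A filter on a nested field, such as address.city = "Sunnyvale", would instead be evaluated in Hive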

Caveats

  1. Ensure that the table exists in BigQuery and that column names are always lowercase.
  2. A timestamp column in Hive is interpreted as timezoneless and stored as an offset from the Unix epoch with millisecond precision.
    To display it in a human-readable format, the from_unixtime UDF can be used, for example:
    from_unixtime(cast(cast(<timestampcolumn> as bigint)/1000 as bigint), 'yyyy-MM-dd hh:mm:ss')
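
For instance, a minimal sketch using a hypothetical table events with a TIMESTAMP column event_ts:

-- events and event_ts are hypothetical placeholders for illustration
SELECT from_unixtime(cast(cast(event_ts as bigint)/1000 as bigint), 'yyyy-MM-dd hh:mm:ss') AS event_time
FROM events;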

Issues

  1. Writing to BigQuery will fail when using Apache Tez as the execution engine. As a workaround, set hive.execution.engine=mr to use MapReduce as the execution engine (see the sketch after this list).
  2. The STRUCT type is not supported unless an Avro schema is explicitly specified using either the avro.schema.literal or avro.schema.url table property. The table definition below covers all supported types with the schema specified explicitly. Note: if the table does not need a STRUCT, specifying the schema is optional.
     CREATE TABLE dbname.alltypeswithSchema(currenttimestamp TIMESTAMP, currentdate DATE, userid BIGINT, sessionid STRING, skills ARRAY<STRING>,
       eventduration DOUBLE, eventcount BIGINT, is_latest BOOLEAN, keyset BINARY, addresses ARRAY<STRUCT<status: STRING, street: STRING, city: STRING, state: STRING, zip: BIGINT>>)
       STORED BY 'com.google.cloud.hadoop.io.bigquery.hive.HiveBigQueryStorageHandler'
       TBLPROPERTIES (
        'bq.dataset'='bqdataset',
        'bq.table'='bqtable',
        'mapred.bq.project.id'='bqproject',
        'mapred.bq.temp.gcs.path'='gs://bucketname/prefix',
        'mapred.bq.gcs.bucket'='bucketname',
        'avro.schema.literal'='{"type":"record","name":"alltypesnonnull",
            "fields":[{"name":"currenttimestamp","type":["null",{"type":"long","logicalType":"timestamp-micros"}], "default" : null}
                     ,{"name":"currentdate","type":{"type":"int","logicalType":"date"}, "default" : -1},{"name":"userid","type":"long","doc":"User identifier.", "default" : -1}
                     ,{"name":"sessionid","type":["null","string"], "default" : null},{"name":"skills","type":["null", {"type":"array","items":"string"}], "default" : null}
                     ,{"name":"eventduration","type":["null","double"], "default" : null},{"name":"eventcount","type":["null","long"], "default" : null}
                     ,{"name":"is_latest","type":["null","boolean"], "default" : null},{"name":"keyset","type":["null","bytes"], "default" : null}
                     ,{"name":"addresses","type":["null", {"type":"array",
                        "items":{"type":"record","name":"__s_0",
                        "fields":[{"name":"status","type":"string"},{"name":"street","type":"string"},{"name":"city","type":"string"},{"name":"state","type":"string"},{"name":"zip","type":"long"}]
                        }}], "default" : null
                      }
                    ]
            }'
       );
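
For the Tez write failure in issue 1 above, the workaround mentioned there is a single session setting (hive.execution.engine is a standard Hive property):

-- Use MapReduce instead of Tez as the execution engine when writing to BigQuery
SET hive.execution.engine=mr;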

hive-bigquery-storage-handler's People

Contributors

dependabot[bot], prathapreddy123, roderickyao


hive-bigquery-storage-handler's Issues

Error while querying an empty table

Querying an empty table using the command below

select * from table

throws the error: java.lang.NoClassDefFoundError: com/google/cloud/hadoop/repackaged/bigquery/com/google/cloud/bigquery/storage/v1beta1/Storage$FinalizeStreamRequest$Builder (state=,code=0)

NPE when COUNT using MapReduce

When using hive.execution.engine=mr, SELECT COUNT(ID) FROM TABLE throws an NPE. It seems that the output table value is NULL when this runs.

Job setup failed : java.lang.NullPointerException
at com.google.cloud.hadoop.io.bigquery.hive.WrappedBigQueryAvroOutputFormat.getTempFilename(WrappedBigQueryAvroOutputFormat.java:61)
at com.google.cloud.hadoop.io.bigquery.hive.HiveBigQueryOutputCommitter.setupJob(HiveBigQueryOutputCommitter.java:47)
at org.apache.hadoop.mapred.OutputCommitter.setupJob(OutputCommitter.java:265)
at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobSetup(CommitterEventHandler.java:255)
at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:235)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

NoSuchFieldError: MANDATORY_CONFIG_PROPERTIES_INPUT

Building this repo as-is and attempting to use it in a Dataproc 1.3 cluster results in the following error:

hive> add jar gs://<snip>/artifacts/hive-bigquery-storage-handler-1.0-shaded.jar;
Added [/tmp/dba1500d-5dbd-4d08-b2f3-6a22068d88c2_resources/hive-bigquery-storage-handler-1.0-shaded.jar] to class path
Added resources: [gs://<snip>/artifacts/hive-bigquery-storage-handler-1.0-shaded.jar]
hive> list jars;
/tmp/dba1500d-5dbd-4d08-b2f3-6a22068d88c2_resources/hive-bigquery-storage-handler-1.0-shaded.jar
hive> select * from temp.bq_test limit 10;
20/07/24 17:35:53 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
20/07/24 17:35:53 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
OK
20/07/24 17:35:54 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
Exception in thread "main" java.lang.NoSuchFieldError: MANDATORY_CONFIG_PROPERTIES_INPUT
        at com.google.cloud.hadoop.io.bigquery.DirectBigQueryInputFormat.getTable(DirectBigQueryInputFormat.java:91)
        at com.google.cloud.hadoop.io.bigquery.DirectBigQueryInputFormat.getSplits(DirectBigQueryInputFormat.java:76)
        at com.google.cloud.hadoop.io.bigquery.hive.WrappedBigQueryAvroInputFormat.getSplits(WrappedBigQueryAvroInputFormat.java:81)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextSplits(FetchOperator.java:372)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:304)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146)
        at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2098)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:252)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:153)

(Redacted the name of the bucket I've put this stuff in)

Add Cloud Build Config

Add a Cloud Build configuration to facilitate automatic building and releases of the source code.

Failing to build with dependency error

To reproduce:

  1. clone
  2. mvn clean install
[INFO] ----< com.google.cloud.hive.bigquery:hive-bigquery-storage-handler >----
[INFO] Building hive-bigquery-storage-handler 1.0
[INFO] --------------------------------[ jar ]---------------------------------
Downloading from repository.jboss.org: http://repository.jboss.org/nexus/content/groups/public/net/minidev/json-smart/maven-metadata.xml
[WARNING] Could not transfer metadata net.minidev:json-smart/maven-metadata.xml from/to repository.jboss.org (http://repository.jboss.org/nexus/content/groups/public/): Transfer failed for http://repository.jboss.org/nexus/content/groups/public/net/minidev/json-smart/maven-metadata.xml
Downloading from conjars: http://conjars.org/repo/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.pom
[WARNING] The POM for org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde is missing, no dependency information available
Downloading from conjars: http://conjars.org/repo/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  2.826 s
[INFO] Finished at: 2021-06-12T16:06:12-07:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project hive-bigquery-storage-handler: Could not resolve dependencies for project com.google.cloud.hive.bigquery:hive-bigquery-storage-handler:jar:1.0: Failure to find org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde in https://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]

The dependency it is failing on has recently moved: https://github.com/pentaho/pentaho-aggdesigner/pull/137/files

Support BigQuery Partitioning

Right now the handler blocks creating partitioned tables. BigQuery supports partitioning based on timestamp and integer range; this storage handler should support these.

Add publicly accessible jar

Currently each user must pull down the source, compile, and host their own version of the jar. Create a public version so users can easily reference and list all versions:

gs://hive-bigquery-storage-handler/hive-bigquery-storage-handler-1.0-shaded.jar

Ideally these jars can be built and deployed using a cloudbuild.yml file on release tags.

Add unit tests

None of the classes currently have an associated unit test. Add tests for all relevant source files.

INSERT with no result when using Tez

When using Tez as the Hive execution engine, an INSERT statement returns successfully but no rows are inserted. This works under the MapReduce engine.

Tez somehow does not seem to call OutputCommitter at all.

Records are not inserted to BigQuery table

Facing an issue with record insertion: INSERT statements give a success message but no records are inserted into the BQ table. At insertion time, the handler first stores records in temp files in the GCS bucket that was set at table creation time. I can see the generated temp files in the bucket, but those temp files are never pushed on to the BQ table. I am using Hive 2.3.9 with the compatible Hadoop version 2.10.1 and am testing with MR as the execution engine. MR somehow does not seem to call the OutputCommitter.

Customized configuration files for my project are attached here, along with the Hive-generated logs for the INSERT operation.
configuration_files.zip
hive.log

Note: joins and SELECT operations are working. Below are the query output logs for the INSERT operation.

Insert into bq_test values ('Vishal', 33);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hduser_20220214125524_7b6db0f8-9792-4acb-b4f6-6e4de208f2fa
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2022-02-14 12:55:27,341 Stage-3 map = 100%, reduce = 0%
Ended Job = job_local758927901_0001
MapReduce Jobs Launched:
Stage-Stage-3: HDFS Read: 7 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 3.203 seconds

PERMISSION_DENIED: Request had insufficient authentication scopes

hive> select * from bq_test;
OK
Failed with exception java.io.IOException:com.google.cloud.hadoop.repackaged.bigquery.com.google.api.gax.rpc.PermissionDeniedException: com.google.cloud.hadoop.repackaged.bigquery.io.grpc.StatusRuntimeException: PERMISSION_DENIED: Request had insufficient authentication scopes.
Time taken: 2.37 seconds
Note: I have set up Hadoop/Hive on a GCP VM and am trying to connect to BigQuery.
