
This project forked from awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore



Home Page: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html

License: Apache License 2.0


AWS Glue Data Catalog Client for Apache Hive Metastore

The AWS Glue Data Catalog is a fully managed, Apache Hive Metastore compatible, metadata repository. Customers can use the Data Catalog as a central repository to store structural and operational metadata for their data.

AWS Glue provides out-of-the-box integration with Amazon EMR that enables customers to use the AWS Glue Data Catalog as an external Hive Metastore. To learn more, see the documentation at https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html.

This is an open-source implementation of the Apache Hive Metastore client on Amazon EMR clusters that uses the AWS Glue Data Catalog as an external Hive Metastore. It serves as a reference implementation for building a Hive Metastore-compatible client that connects to the AWS Glue Data Catalog. It may be ported to other Hive Metastore-compatible platforms such as other Hadoop and Apache Spark distributions.

This package is compatible with Spark 3 and Hive 3.

Note: for this client implementation to be used with Apache Hive, the patch from the HIVE-12679 JIRA must be applied to Hive. All versions of Apache Hive running on Amazon EMR that support the AWS Glue Data Catalog as the metastore already include this patch.

Patching Apache Hive and Installing It Locally

Obtain a copy of Hive from GitHub at https://github.com/apache/hive.

git clone https://github.com/apache/hive.git

Before building the Hive client, you must apply the Hive 3.1 patch (branch_3.1.patch), which is included in this repository. Copy the patch file to the path used in the commands below (here, your home directory), then apply it from inside the local Hive clone you created above and build Hive.

cd hive
git checkout branch-3.1
git apply -3 ~/branch_3.1.patch
mvn clean install -DskipTests

Building the Hive Client

Once you have successfully patched and installed Hive locally, move into the AWS Glue Data Catalog Client repository and update the following property in pom.xml.

<hive3.version>3.1.3</hive3.version>

You are now ready to build the Hive client.

cd aws-glue-datacatalog-hive3-client
mvn clean package -DskipTests

Building the Spark Client

Because Spark uses a fork of Hive based on the 2.3 branch, building the Spark client requires Hive 2.3 built with the HIVE-12679 patch for that branch (HIVE-12679.branch-2.3.patch in the commands below).

cd <your local Hive repo>
git checkout branch-2.3
patch -p0 <HIVE-12679.branch-2.3.patch
mvn clean install -DskipTests

Go back to the AWS Glue Data Catalog Client repository and update the following property in pom.xml to match the version of Hive you just patched and installed locally.

<spark-hive.version>2.3.10-SNAPSHOT</spark-hive.version>
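If you switch between Hive builds often, the pom.xml property updates in this section and the Hive client section can be scripted. The following is a minimal sketch in Python, not part of this repository: the helper name set_pom_property and the sample POM text (including its starting version numbers) are hypothetical, and it assumes each property element appears exactly once in the file.

```python
import re


def set_pom_property(pom_text: str, name: str, value: str) -> str:
    """Return pom_text with the text inside <name>...</name> replaced by value.

    Hypothetical helper. It raises ValueError unless the element appears
    exactly once, so a typo in the property name cannot pass silently.
    """
    pattern = re.compile(r"(<{0}>)[^<]*(</{0}>)".format(re.escape(name)))
    found = len(pattern.findall(pom_text))
    if found != 1:
        raise ValueError(f"expected exactly one <{name}> element, found {found}")
    # A function replacement keeps backslashes in `value` from being
    # interpreted as regex group references.
    return pattern.sub(lambda m: m.group(1) + value + m.group(2), pom_text)


# Stand-in for the real pom.xml; the starting versions are placeholders.
sample_pom = """<project>
  <properties>
    <hive3.version>3.1.0</hive3.version>
    <spark-hive.version>2.3.9</spark-hive.version>
  </properties>
</project>
"""

updated = set_pom_property(sample_pom, "hive3.version", "3.1.3")
updated = set_pom_property(updated, "spark-hive.version", "2.3.10-SNAPSHOT")
print(updated)
```

To use it on the real file, read pom.xml into a string, call the helper, and write the result back. A plain text substitution is used here rather than an XML parser so that comments and formatting in the POM are left untouched.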

You are now ready to build the Spark client.

cd aws-glue-datacatalog-spark-client
mvn clean package -DskipTests

If you have trouble building the individual modules and have both patched versions of Hive installed locally, you can instead build both clients from the root directory of the AWS Glue Data Catalog Client repository.

Configuring Hive to Use the Hive Client

You need to ensure that the AWS Glue Data Catalog Client jar is in Hive's CLASSPATH and also set the "hive.metastore.client.factory.class" HiveConf variable for Hive to pick up and instantiate the AWS Glue Data Catalog Client. For instance, on Amazon EMR, the client jar is located in /usr/lib/hive/lib/ and the HiveConf is set in /usr/lib/hive/conf/hive-site.xml.

<property>
	<name>hive.metastore.client.factory.class</name>
	<value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
</property>
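On clusters that are provisioned by script, the property above can be added to an existing hive-site.xml programmatically. The sketch below is one possible approach using only the Python standard library; the helper name set_hive_property and the sample configuration are hypothetical and not part of this client. It assumes the file has the usual configuration root element, and note that ElementTree drops XML comments and the XML declaration when it rewrites the document.

```python
import xml.etree.ElementTree as ET

FACTORY_KEY = "hive.metastore.client.factory.class"
FACTORY_CLASS = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def set_hive_property(site_xml: str, name: str, value: str) -> str:
    """Return site_xml with property `name` set to `value` (added if absent)."""
    root = ET.fromstring(site_xml)
    if root.tag != "configuration":
        raise ValueError("expected a <configuration> root element")
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            # Property already present: overwrite its value in place.
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # Property not present: append a new <property> element.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Stand-in for an existing hive-site.xml; the property in it is a placeholder.
existing = (
    "<configuration>"
    "<property><name>example.setting</name><value>on</value></property>"
    "</configuration>"
)
print(set_hive_property(existing, FACTORY_KEY, FACTORY_CLASS))
```

The function is idempotent: running it again on its own output updates the existing entry instead of adding a duplicate.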

Configuring Spark to Use the Spark Client

Similarly, for Spark, you need to install the client jar in Spark's CLASSPATH and create or update Spark's own hive-site.xml to add the above property. On Amazon EMR, this is set in /usr/lib/spark/conf/hive-site.xml. You can also find the location of the Spark client jar in /usr/lib/spark/conf/spark-defaults.conf.

Enabling Client-Side Caching for the Catalog

Currently, we provide support for caching:

a) Table metadata - Response from Glue's GetTable operation (https://docs.aws.amazon.com/glue/latest/webapi/API_GetTable.html#API_GetTable_ResponseSyntax)
b) Database metadata - Response from Glue's GetDatabase operation (https://docs.aws.amazon.com/glue/latest/webapi/API_GetDatabase.html#API_GetDatabase_ResponseSyntax)

Both these entities have dedicated caches for themselves and can be enabled/tuned individually.

To enable/tune the Table cache, use the following properties in your Hive or Spark configuration file:

<property>
	<name>aws.glue.cache.table.enable</name>
	<value>true</value>
</property>
<property>
	<name>aws.glue.cache.table.size</name>
	<value>1000</value>
</property>
<property>
	<name>aws.glue.cache.table.ttl-mins</name>
	<value>30</value>
</property>

To enable/tune Database cache:

<property>
	<name>aws.glue.cache.db.enable</name>
	<value>true</value>
</property>
<property>
	<name>aws.glue.cache.db.size</name>
	<value>1000</value>
</property>
<property>
	<name>aws.glue.cache.db.ttl-mins</name>
	<value>30</value>
</property>

NOTE: The caching logic is disabled by default.
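To make the size and ttl-mins settings concrete, here is a small Python sketch of a bounded cache with a time-to-live. It is an illustration only and not the client's actual implementation: the class name TtlCache, the fake loader, and the injectable clock are hypothetical, and two behaviors are assumptions because this README does not specify them: entries expire a fixed time after they are loaded, and the least recently used entry is evicted when the cache is full.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable


class TtlCache:
    """Bounded cache whose entries expire ttl_mins after being loaded."""

    def __init__(self, max_size: int, ttl_mins: float,
                 clock: Callable[[], float] = time.monotonic):
        self._max_size = max_size
        self._ttl_secs = ttl_mins * 60
        self._clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, value)

    def get(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            # Fresh hit: mark as most recently used and skip the remote call.
            self._entries.move_to_end(key)
            return entry[1]
        # Miss or expired entry: call the loader (e.g. a GetTable request).
        value = loader(key)
        self._entries[key] = (now + self._ttl_secs, value)
        self._entries.move_to_end(key)
        # Enforce the size limit by evicting least recently used entries.
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)
        return value


calls = []


def fake_get_table(name):
    """Stand-in for a metadata lookup; records each remote call."""
    calls.append(name)
    return {"Name": name}


now = [0.0]  # manual clock (seconds) so the example needs no sleeping
cache = TtlCache(max_size=1000, ttl_mins=30, clock=lambda: now[0])
cache.get("db.events", fake_get_table)  # miss: loader is called
cache.get("db.events", fake_get_table)  # hit: served from the cache
now[0] += 31 * 60                       # advance past the 30-minute TTL
cache.get("db.events", fake_get_table)  # expired: loader is called again
print(len(calls))  # 2
```

The trade-off the settings control is the usual one: a longer TTL or larger size means fewer calls to the catalog, at the cost of possibly serving metadata that has since changed.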

License

This library is licensed under the Apache 2.0 License.
