Basic wrapper code to execute and monitor HQL script(s) against Apache Hive via a JDBC connection.
Takes the JobConfig.root directory specified in framework_config.xml (or via -JobConfig on command line) and executes the HQL files found in that location in lexicographical order.
Each sub-directory containing additional HQL files is treated as an independent set of HQL that can be executed in parallel with the HQL stored in other sub-directories.
For example, given a JobConfig root containing files 00_Init1.hql and 10_Init1.hql, sub-directories 20_Rule1 and 30_Rule2, and files 80_Fin1.hql and 90_Fin2.hql, the wrapper will first execute: -
- 00_Init1.hql
- 10_Init1.hql
and then start a separate thread to execute the files in directory 20_Rule1: -
- 00_Rule1
- 20_Rule1
and also start a thread to execute the files in directory 30_Rule2: -
- 10_Rule2.hql
- 20_Rule2.hql
- 99_Rule2.hql
and then: -
- 80_Fin1.hql
- 90_Fin2.hql
Note that a failure of a file in a sub-directory will not prevent top-level files or files in other sub-directories from running. For example, a failure in file 00_Rule1 will prevent 20_Rule1 from executing, but 30_Rule2, 80_Fin1.hql and 90_Fin2.hql will still execute.
A failure in file 10_Init1 will prevent all other tasks from executing.
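The scheduling model above can be sketched as follows. This is an illustrative approximation, not the wrapper's actual code: `HqlSchedulerSketch` and `runHql` are hypothetical names, and `runHql` simply prints instead of submitting to Hive over JDBC. Top-level files run sequentially in lexicographical order, each sub-directory becomes one sequential task, and the sub-directory tasks run in parallel on a thread pool.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch of the execution model: lexicographical ordering of
// top-level files, one sequential task per sub-directory, tasks in parallel.
public class HqlSchedulerSketch {

    // Sort HQL file names lexicographically, mirroring the execution order.
    static List<String> executionOrder(Collection<String> names) {
        List<String> sorted = new ArrayList<>(names);
        Collections.sort(sorted);
        return sorted;
    }

    // Placeholder for the real JDBC submission (assumption, not the actual API).
    static void runHql(String file) {
        System.out.println("executing " + file);
    }

    public static void main(String[] args) throws Exception {
        // Top-level init files run one after another.
        executionOrder(List.of("10_Init1.hql", "00_Init1.hql"))
                .forEach(HqlSchedulerSketch::runHql);

        // Each sub-directory becomes one sequential task; tasks run in parallel.
        Map<String, List<String>> subDirs = Map.of(
                "20_Rule1", List.of("00_Rule1", "20_Rule1"),
                "30_Rule2", List.of("10_Rule2.hql", "20_Rule2.hql", "99_Rule2.hql"));
        ExecutorService pool = Executors.newFixedThreadPool(subDirs.size());
        for (List<String> files : subDirs.values()) {
            // A failure inside one task stops that task only, not its siblings.
            pool.submit(() -> executionOrder(files).forEach(HqlSchedulerSketch::runHql));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);

        // Finalisation files run at top level.
        executionOrder(List.of("90_Fin2.hql", "80_Fin1.hql"))
                .forEach(HqlSchedulerSketch::runHql);
    }
}
```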
See TestHqlWrapper.java for an example job chain.
The wrapper is configured via two XML config files that should be somewhere on the classpath. Each config item has an associated description and follows the same schema as standard Hadoop configuration files (e.g. hdfs-site.xml, hive-site.xml etc.):
framework.config.xml
- jdbc.drivers - Java classname of the JDBC driver.
- jdbc.url - JDBC connection URL.
- Hql.FileSystem - Name of the filesystem from which to retrieve the HQL files to be run (see JobConfig.root). Setting to hdfs://server_details will make the framework look for files on the Hadoop HDFS filesystem. Setting to file:/// will use the local filesystem. If this parameter is not set then the local filesystem is assumed.
- JobConfig.root - Name of the root directory where the wrapper will look for HQL files to execute.
- monitor.interval - Polling interval (in seconds) for task monitoring. The wrapper will sleep for this length of time before waking up and checking the status of running job(s).
- kerberos.principal - The Kerberos principal to authenticate against. If blank, a non-Kerberos system is assumed.
- mail.hostname - Hostname used to send emails from the wrapper code.
- mail.mailfrom - Name used in the From: field of sent emails.
- mailto - Comma-separated list of email addresses to email.
- mailSuccess - Send emails on success? (true/false)
- mailFailure - Send emails on failure? (true/false)
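A minimal framework.config.xml might look like the following. All values are illustrative placeholders; the driver class shown is the standard HiveServer2 JDBC driver, and the URL, paths and addresses should be replaced with your own.

```xml
<configuration>
  <property>
    <name>jdbc.drivers</name>
    <value>org.apache.hive.jdbc.HiveDriver</value>
    <description>Java classname of the JDBC driver.</description>
  </property>
  <property>
    <name>jdbc.url</name>
    <value>jdbc:hive2://localhost:10000/default</value>
    <description>JDBC connection URL.</description>
  </property>
  <property>
    <name>Hql.FileSystem</name>
    <value>file:///</value>
    <description>Filesystem on which the HQL files are stored.</description>
  </property>
  <property>
    <name>JobConfig.root</name>
    <value>/opt/hql-jobs</value>
    <description>Root directory containing the HQL files to execute.</description>
  </property>
  <property>
    <name>monitor.interval</name>
    <value>30</value>
    <description>Polling interval in seconds for task monitoring.</description>
  </property>
</configuration>
```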
user-password.xml
This file should contain the JDBC user/password details for the Hive connection and should be kept secure. A good solution is to add your home directory to the classpath and store user-password.xml there.
<configuration>
  <property>
    <name>jdbc.username</name>
    <value>user</value>
    <description>Userid.</description>
  </property>
  <property>
    <name>jdbc.password</name>
    <value>password</value>
    <description>Password.</description>
  </property>
</configuration>
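For illustration, the Hadoop-style name/value pairs above can be read with nothing but the JDK's DOM parser. This is only a sketch: the real wrapper presumably loads these files through Hadoop's Configuration class, and `ConfigReaderSketch` is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustrative reader for Hadoop-style <name>/<value> configuration XML.
public class ConfigReaderSketch {

    // Parse the XML and collect each <property> as a name -> value entry.
    static Map<String, String> readProperties(String xml) throws Exception {
        Map<String, String> props = new HashMap<>();
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element p = (Element) nodes.item(i);
            String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
            String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
            props.put(name, value);
        }
        return props;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<configuration><property><name>jdbc.username</name>"
                + "<value>user</value></property></configuration>";
        System.out.println(readProperties(xml).get("jdbc.username")); // prints "user"
    }
}
```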
Command-line options: -
- --JobConfig - Name of the job configuration directory (overrides JobConfig.root in the XML config if set).
- --DryRun - Parse and run the jobs but do not submit anything to Hive.
- --help - Command-line help.
Maven is used as the build tool with the following goals: -
mvn clean compile test package install
Thanks to Bob Freitas for the Hive unit test framework.