
This project is an Ansible library of modules for integration with Hadoop services. It provides a set of libraries to interact with Hadoop services in a simple and flexible way, combining Ansible's easy syntax with the Hadoop APIs to give administrators and developers a powerful DevOps platform for automating operations on large-scale clusters.

License: GNU General Public License v3.0


ahdp's Introduction


ahdp

ahdp is an Ansible library of modules for integration with the Hadoop framework. It provides a simple and flexible way to interact with different Hadoop services using Ansible's easy syntax. The purpose of this project is to give DevOps engineers and platform administrators a simple way to automate their operations on large-scale Hadoop clusters using Ansible.

Features

Currently, ahdp provides modules to interact with HDFS through WebHDFS or HttpFS, and with HiveServer2 or Impala through Thrift.

  • Ansible libraries and utility functions for HDFS operations:

    • Create directories and files
    • Change directory and file attributes and ownership
    • Manage ACLs
    • Manage extended attributes
    • Fetch files from and copy files to HDFS
    • Advanced search functionality
    • Manage HDFS snapshots
    • The HDFS modules are based on the pywhdfs project to establish WebHDFS and HttpFS connections with the HDFS service:
      • Supports both secure (Kerberos, Token) and insecure clusters
      • Supports HA clusters and handles namenode failover
      • Supports HDFS federation with multiple nameservices and mount points (see the sketch after this list)
    • Please refer to the hdfs modules documentation for more details about all the supported modules.
  • Ansible libraries and utility functions for Hive operations:

    • Create and delete databases
    • Manage privileges on Hive database objects
    • The Hive modules are based on the impyla client project to interact with the Hive service:
      • Supports multiple authentication mechanisms (Kerberos, LDAP, PLAIN, NOSASL)
      • Supports SSL
      • Works with both HiveServer2 and Impala daemon connections
    • Please refer to the hive modules documentation for more details about all the supported modules.
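
As a sketch of the federation support mentioned above, the nameservices parameter accepts a list of entries, each with its own URLs and mount points. The values below are illustrative and follow the same structure as the usage example further down; the variable is then passed to the HDFS modules as "{{nameservices | to_json}}":

nameservices:
  - urls: "http://namenode1.example.com:50070"
    mounts: "/data"
  - urls: "http://namenode2.example.com:50070"
    mounts: "/user"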

Installation & Configuration

The ahdp modules need to be installed on both the Ansible control host and the client machines (for instance the gateways that you will use as targets in your playbooks). Normally, simple Ansible modules do not need to exist on the target machines; however, since the HDFS and Hive modules use a custom module_utils, they need to be installed on the target machines as well.

To install ahdp on the Ansible host:

pip install ahdp

or

pip install ahdp[ansible]

To install ahdp on the target hosts:

pip install ahdp[client]

The client extra will also install the Python dependencies that need to exist on the target machines.

Make sure the following system packages are also installed on the target machines:

  • libffi-devel
  • gcc-c++
  • python-devel
  • krb5-devel
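
The system packages and the ahdp client can also be installed on all edge nodes with a small bootstrap playbook. The sketch below assumes a yum-based distribution and an illustrative "edge_nodes" inventory group:

- hosts: edge_nodes
  become: yes
  tasks:
    - name: Install the system packages needed to build the client dependencies
      yum:
        name: libffi-devel,gcc-c++,python-devel,krb5-devel
        state: present
    - name: Install ahdp and its client dependencies
      pip:
        name: "ahdp[client]"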

Note: To use the ahdp modules you need to configure Ansible so it knows where the modules are located. Simply add the library option to your ~/.ansible.cfg or to /etc/ansible/ansible.cfg; for instance, if you have Python 2.7, the modules path will be:

library = /usr/lib/python2.7/site-packages/ahdp/modules/

You can also manually place the modules in a path of your choice and then set the library option to that path.
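
For reference, a minimal ~/.ansible.cfg with this setting would look like the following (the path assumes the Python 2.7 site-packages install shown above):

[defaults]
library = /usr/lib/python2.7/site-packages/ahdp/modules/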

USAGE

The best way to use ahdp is to run Ansible playbooks and see it at work. There are some test cases under the "test" directory which give a high-level idea of how an Ansible project can be structured around Hadoop.

Below is a simple playbook that creates a user home directory in HDFS and a database in Hive:

- hosts: localhost
  vars:
    nameservices:
      - urls: "http://localhost:50070"
        mounts: "/"
    hs2_host: "localhost"
  tasks:
    - name: Create HDFS user home directory
      hdfsfile:
        authentication: "none"
        state: "directory"
        path: "/user/ansible"
        owner: "ansible"
        group: "supergroup"
        mode: "0700"
        nameservices: "{{nameservices | to_json}}"
    - name: Create User hive database
      hive_db:
        authentication: "NOSASL"
        user: "hive"
        host: "{{hs2_host}}"
        port: 10000
        db: "ansible"
        owner: "ansible"
        state: "present"

To run the playbook, simply run:

 ansible-playbook simple_test.yml
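
Before running against a real cluster, you can also validate the playbook with the standard syntax check option:

 ansible-playbook simple_test.yml --syntax-check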

SOME GOOD PRACTICES

This project aims to make Hadoop administration and operations easier using Ansible. Below are some useful tips and guidelines on how to structure a Hadoop project in Ansible:

  • Create a separate inventory group for each Hadoop cluster and create separate groups for the different services and roles (a sketch follows below). If you are using a Cloudera distribution you can also use a dynamic inventory based on the [cloudera api](tools/cloudera.py).
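A minimal static inventory for this layout could look as follows; cluster, group, and host names are illustrative:
[cluster1_edge]
edge01.cluster1.example.com

[cluster1_namenodes]
nn01.cluster1.example.com
nn02.cluster1.example.com

[cluster1:children]
cluster1_edge
cluster1_namenodes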
  • Define a gateway or edge node for each cluster and use it as the target in your Ansible playbooks. Make sure the ahdp project and its dependencies are installed on all edge nodes; you can also configure pywhdfs and use its CLI to interact programmatically with the HDFS service.
  • Create separate group variables for every cluster where you define the connection parameters. For instance, below is a configuration example for a standalone Hadoop installation:
cat group_vars/local/local.yml
---
nameservices:
  - urls: "http://localhost:50070"
    mounts: "/"
hs2_host: "localhost"
impala_daemon_host: "localhost"
cloudera_manager_host: "localhost"
  • Use Ansible Vault for passwords, and create a separate vault file for every cluster.
ansible-vault view group_vars/local/local_encrypted.yml
---
hdfs_kerberos_password: "password"
hdfs_principal: "hdfs@LOCALDOMAIN"
hive_ldap_password: "password"
  • Create Ansible roles for the different kinds of operational tasks you perform on your platforms, and use "--limit=NAME_OF_CLUSTER" to control the target cluster.
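For instance, assuming an illustrative playbook named site.yml, a cluster group named cluster1, and a vault-encrypted variables file:
ansible-playbook site.yml --limit=cluster1 --ask-vault-pass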

Questions or Ideas?

I'd love to hear what you think about ahdp and appreciate any ideas or suggestions. Pull requests are also very welcome!

ahdp's Issues

Example of using register for ACLs

Can you give an example of using register for ACL entries, and then using that result to set ACLs? It would be useful to have an example of how to do that in the README.
