
snakebite-py3's Introduction

🐍 About This Fork 🍴

This is a fork of https://github.com/spotify/snakebite, via https://github.com/kirklg/snakebite/tree/feature/python3. We maintain it enough to work for our needs at the Internet Archive. We use the library with our CDH5 cluster and have not tested it with any other versions of Hadoop. Please help us improve this! Or make your own fork. No hard feelings.

Snakebite mini logo

Snakebite is a Python library that provides a pure Python HDFS client and a wrapper around Hadoop's minicluster. The client uses protobuf to communicate with the NameNode and comes as both a library and a command-line interface. Currently, the snakebite client supports most actions that involve the NameNode, as well as reading data from DataNodes.

Note: all methods that read data from a DataNode can check the CRC during transfer, but this is disabled by default for performance reasons. This is the opposite of the stock Hadoop client's behaviour.

Snakebite-py3 requires Python 3 and python-protobuf 2.4.1 or higher.

Snakebite 1.3.x has been tested mainly against Cloudera CDH4.1.3 (hadoop 2.0.0) in production. Tests pass on HortonWorks HDP 2.0.3.22-alpha (protocol versions 7 and 8)

Snakebite 2.x has been tested on Hortonworks HDP2.0 and CDH5 Beta and ONLY supports Hadoop 2.2.0 and up (protocol version 9)!

Installing

Snakebite-py3 releases will be available through pypi at https://pypi.python.org/pypi/snakebite-py3/

To install snakebite run:

pip install snakebite-py3

Documentation

More information and documentation can be found at https://snakebite.readthedocs.io/en/latest/

Development

Travis CI status: see the project's Travis builds. Join the chat at https://gitter.im/spotify/snakebite

Copyright 2013-2016 Spotify AB
Copyright 2016-2019 Internet Archive and individual contributors

snakebite-py3's People

Contributors

aarya123, adam-miller, aeroevan, anyman, bolkedebruin, davefnbuck, dterror-zz, elukey, galgeek, gbadiali, gitter-badger, gpoulin, guhehehe, hammer, hansohn, jkukul, julian, kawaa, kirklg, kngenie, nlevitt, ogrisel, phoet, pragmattica, ravwojdyla, ro-ket, tarrasch, zline


snakebite-py3's Issues

sasl.process() fails

I try to connect to HDFS via AutoConfigClient, but get an error:

  File "/mnt/kubernetes/sandbox/data-governance--change-control-statistics-reporter/lib/python3.6/site-packages/snakebite/rpc_sasl.py", line 109, in connect
    initiate.token = self.sasl.process()
TypeError: None has type NoneType, but expected one of: bytes
[in /mnt/kubernetes/sandbox/data-governance--change-control-statistics-reporter/lib/python3.6/site-packages/cmdlineutil/__init__.py:230, 140396487395136 (MainThread), 104 (MainProcess)]

Unit tests fail with errors

Tox unit tests are failing with the following errors:

test_execute_does_not_swallow_tracebacks (test.commandlineparser_test.CommandLineParserExecuteTest) ... ERROR
test_copyToLocal_directory_structure (test.copytolocal_test.CopyToLocalTest) ... FAIL
test_copyToLocal_relative_directory_structure (test.copytolocal_test.CopyToLocalTest) ... FAIL

These failures are related to the Python 3 migration and require minor changes to correct. I will send a PR over shortly to resolve them.

StopIteration Error Raised at Runtime with Python 3.7 Version

Getting an error from a wrapped function, with no way to catch the StopIteration:

Encountered transient exception generator raised StopIteration

In File "/opt/azkaban-executor/venvs/WIP_DAP_LD-4892-SNAPSHOT_mahima_gupta/lib/python3.9/site-packages/snakebite/client.py", line 1549, in wrapped

Change

yield next(results)

to

try:
yield next(results)
except StopIteration:
return
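The underlying cause is PEP 479: since Python 3.7, a StopIteration that escapes a generator body is converted into a RuntimeError instead of silently ending the generator. A minimal sketch of both variants (using a hypothetical `results` iterator, not snakebite's actual code):

```python
# PEP 479 demo: StopIteration leaking from a generator becomes RuntimeError.
def broken(results):
    while True:
        yield next(results)  # next() raises StopIteration when exhausted

def fixed(results):
    while True:
        try:
            yield next(results)
        except StopIteration:
            return  # end the generator cleanly instead of leaking the exception

try:
    list(broken(iter([1, 2])))
except RuntimeError as e:
    print("broken raises:", type(e).__name__)  # broken raises: RuntimeError

print("fixed yields:", list(fixed(iter([1, 2]))))  # fixed yields: [1, 2]
```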

Deprecation warning due to invalid escape sequences in Python 3.8

Deprecation warnings are raised due to invalid escape sequences in Python 3.8. Below is a log of the warnings raised while compiling all the Python files. Using raw strings or escaping the backslashes will fix this issue.

I will add a PR

find . -iname '*.py'  | xargs -P 4 -I{} python3.8 -Wall -m py_compile {} 

./test/df_test.py:32: DeprecationWarning: invalid escape sequence \s
  (filesystem, capacity, used, remaining, pct) = re.split("\s+", expected_output)
./snakebite/rpc_sasl.py:85: DeprecationWarning: invalid escape sequence \/
  service = re.split('[\/@]', str(self.hdfs_namenode_principal))[0]
./snakebite/minicluster.py:204: DeprecationWarning: invalid escape sequence \d
  m = re.match(".*Started MiniDFSCluster -- namenode on port (\d+).*", line)
./snakebite/minicluster.py:214: DeprecationWarning: invalid escape sequence \s
  (perms, replication, owner, group, length, date, time, path) = re.split("\s+", line)
./snakebite/minicluster.py:236: DeprecationWarning: invalid escape sequence \s
  fields = re.split("\s+", line)
./snakebite/minicluster.py:240: DeprecationWarning: invalid escape sequence \s
  (length, path) = re.split("\s+", line)
./snakebite/minicluster.py:238: DeprecationWarning: invalid escape sequence \s
  (length, space_consumed, path) = re.split("\s+", line)
./snakebite/minicluster.py:253: DeprecationWarning: invalid escape sequence \s
  (_, dir_count, file_count, length, path) = re.split("\s+", line)
./snakebite/client.py:816: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  elif not load['error'] is '':
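A quick sketch of both fixes, using a hypothetical `ls`-style sample line rather than snakebite's actual data: prefix regex literals with `r` so `\s` and `\d` are regex syntax instead of (invalid) string escapes, and compare strings with `!=` rather than `is not`:

```python
import re

# Hypothetical sample line, for illustration only.
line = "drwxr-xr-x   - hdfs hdfs          0 2019-01-01 00:00 /tmp"

# Raw string: \s reaches the regex engine untouched, no DeprecationWarning.
fields = re.split(r"\s+", line)
print(fields[-1])  # /tmp

# SyntaxWarning fix: `is` tests object identity, not equality; use != for strings.
load = {"error": ""}
if load["error"] != "":
    raise RuntimeError(load["error"])
```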

Regular expression in minicluster _get_namenode_port has invalid escape sequence in python >= 3.8

This regular expression:
m = re.match(".*Started MiniDFSCluster -- namenode on port (\d+).*", line)

found here:
https://github.com/internetarchive/snakebite-py3/blob/master/snakebite/minicluster.py#L204

is not returning the namenode port as desired but None instead, which leads to the error:
"TypeError: %d format: a number is required, not NoneType"

I suspect this is caused by Python 3.8 (and above) warning about the invalid escape sequence when encountering "\d"

I suspect the solution is to pass the regular expression as a raw string, i.e.:
m = re.match(r".*Started MiniDFSCluster -- namenode on port (\d+).*", line)

Snakebite doesn't work with HDFS RPC encryption

This is a tracking task to list all the work needed to solve one outstanding issue with snakebite. When RPC encryption is enabled for HDFS, the following happens:

  • snakebite contacts the HDFS NameNode via Hadoop RPC, negotiating the encryption settings using the GSS-API via SASL. It needs to retrieve the list of blocks to read/write and the related DataNodes to talk to. This part works fine.
  • snakebite then has to contact every HDFS DataNode, using a specific RPC protocol that is not Hadoop RPC. Authentication is done via DIGEST-MD5 over SASL, which also allows setting the encryption level if needed (to then negotiate AES encryption). This part currently doesn't work because the required code relies on SASL functionality that is not implemented in pure-sasl (namely DIGEST-MD5).

I opened an issue against pure-sasl (thobbs/pure-sasl#32), but some work would be needed to add the missing features.

The alternative would be to use sasl (https://github.com/cloudera/python-sasl), but unfortunately that library has not been maintained since 2016. There is a fork we could consider that should support DIGEST-MD5 + GSS-API: cloudera/python-sasl#15 (comment)

Kerberos support for Python3

Hi!

I found this fork while looking at Apache Airflow's support for HDFS, and I am wondering if I can help update the code base (if nobody is already looking into it). I work for the Wikimedia Foundation and we'd be really interested in using snakebite in production, but we'd need Kerberos support. Is there any work in progress, or any previous failed attempts, that I should be aware of?

Thanks in advance!

HAClient doesn't fail over to alt Namenode

Hello,

First off, thank you for the snakebite Python3 port. I really appreciate it!!!

As I started using it, I noticed a bug: the HAClient class doesn't properly fail over to the alternate NameNode when an org.apache.hadoop.ipc.StandbyException is thrown. Instead, it raises the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/anaconda3/envs/anaconda3/lib/python3.7/site-packages/snakebite/client.py", line 174, in ls
    recurse=recurse):
  File "/usr/local/anaconda3/envs/anaconda3/lib/python3.7/site-packages/snakebite/client.py", line 1223, in _find_items
    fileinfo = self._get_file_info(path)
  File "/usr/local/anaconda3/envs/anaconda3/lib/python3.7/site-packages/snakebite/client.py", line 1351, in _get_file_info
    return self.service.getFileInfo(request)
  File "/usr/local/anaconda3/envs/anaconda3/lib/python3.7/site-packages/snakebite/service.py", line 43, in <lambda>
    rpc = lambda request, service=self, method=method.name: service.call(service_stub_class.__dict__[method], request)
  File "/usr/local/anaconda3/envs/anaconda3/lib/python3.7/site-packages/snakebite/service.py", line 49, in call
    return method(self.service, controller, request)
  File "/usr/local/anaconda3/envs/anaconda3/lib/python3.7/site-packages/google/protobuf/service_reflection.py", line 267, in <lambda>
    self._StubMethod(inst, method, rpc_controller, request, callback))
  File "/usr/local/anaconda3/envs/anaconda3/lib/python3.7/site-packages/google/protobuf/service_reflection.py", line 284, in _StubMethod
    method_descriptor.output_type._concrete_class, callback)
  File "/usr/local/anaconda3/envs/anaconda3/lib/python3.7/site-packages/snakebite/channel.py", line 450, in CallMethod
    return self.parse_response(byte_stream, response_class)
  File "/usr/local/anaconda3/envs/anaconda3/lib/python3.7/site-packages/snakebite/channel.py", line 421, in parse_response
    self.handle_error(header)
  File "/usr/local/anaconda3/envs/anaconda3/lib/python3.7/site-packages/snakebite/channel.py", line 424, in handle_error
    raise RequestError("\n".join([header.exceptionClassName, header.errorMsg]))
snakebite.errors.RequestError: org.apache.hadoop.ipc.StandbyException
Operation category READ is not supported in state standby
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
	at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1978)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1368)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4096)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1130)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:851)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1865)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345)

It seems Python 3 doesn't have unbound methods; methods looked up on a class are plain functions unless accessed through an instantiated object. So when HAClient._wrap_methods() runs the following:

    @classmethod
    def _wrap_methods(cls):
        # Add HA support to all public Client methods, but only do this when we haven't done this before
        for name, meth in inspect.getmembers(cls, inspect.ismethod):
            if not name.startswith("_"): # Only public methods
                if inspect.isgeneratorfunction(meth):
                    setattr(cls, name, cls._ha_gen_method(meth))
                else:
                    setattr(cls, name, cls._ha_return_method(meth))

When inspect.getmembers is called with the inspect.ismethod filter, it finds none of the Client parent class's functions, so nothing gets wrapped with _ha_gen_method or _ha_return_method.
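The difference is easy to reproduce without snakebite. A minimal sketch (with a hypothetical Client class, not snakebite's) showing why the ismethod filter comes up empty on a class in Python 3:

```python
import inspect

class Client:
    def ls(self, path):
        return path

# On the class itself, `ls` is a plain function in Python 3, so the
# ismethod filter used by _wrap_methods finds nothing to wrap.
print(inspect.getmembers(Client, inspect.ismethod))  # []

# Filtering with isfunction finds it.
names = [n for n, _ in inspect.getmembers(Client, inspect.isfunction)]
print(names)  # ['ls']

# Bound methods only exist on an instance.
print(inspect.ismethod(Client().ls))  # True
```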

Any objections if I create a pull request to resolve the issue and get HA Namenode failover working again?
