
libhdfs3's Introduction

libhdfs3

A Native C/C++ HDFS Client

Description

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

HDFS is implemented in Java and additionally provides a JNI-based C language library, libhdfs. To use libhdfs, users must deploy the HDFS JARs on every machine, which adds operational complexity for non-Java clients that just want to integrate with HDFS.

Libhdfs3, designed as an alternative implementation of libhdfs, is built directly on the native Hadoop RPC protocol and the HDFS data transfer protocol. It avoids the drawbacks of JNI and has a lightweight code base with a small memory footprint. In addition, it is easy to use and deploy.

Libhdfs3 is developed by Pivotal and used in HAWQ, a massively parallel database engine in the Pivotal Hadoop Distribution.
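
For illustration, below is a minimal, hedged sketch of reading a file through the libhdfs-compatible C API that libhdfs3 exposes. The header path hdfs/hdfs.h, the NameNode address, and the file path are assumptions; consult the installed headers for the exact signatures.

// Minimal sketch (not from the project docs): connect, read a file, print it.
#include <hdfs/hdfs.h>   // assumed install location of the libhdfs3 header
#include <fcntl.h>       // O_RDONLY
#include <cstdio>

int main() {
    // Build a connection to an assumed NameNode; host and port are placeholders.
    struct hdfsBuilder *builder = hdfsNewBuilder();
    hdfsBuilderSetNameNode(builder, "namenode.example.com");
    hdfsBuilderSetNameNodePort(builder, 8020);
    hdfsFS fs = hdfsBuilderConnect(builder);
    if (!fs) {
        std::fprintf(stderr, "failed to connect to HDFS\n");
        return 1;
    }

    // Open a (hypothetical) file for reading and stream it to stdout.
    hdfsFile in = hdfsOpenFile(fs, "/tmp/example.txt", O_RDONLY, 0, 0, 0);
    if (!in) {
        std::fprintf(stderr, "failed to open file\n");
        hdfsDisconnect(fs);
        return 1;
    }

    char buf[4096];
    tSize n;
    while ((n = hdfsRead(fs, in, buf, sizeof(buf))) > 0) {
        std::fwrite(buf, 1, static_cast<size_t>(n), stdout);
    }

    hdfsCloseFile(fs, in);
    hdfsDisconnect(fs);
    return 0;
}

A program like this would typically be linked against the built library with -lhdfs3; a concrete compile line appears in the issues section below.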

========================

Installation

Requirement

To build libhdfs3, the following libraries are needed.

cmake (2.8+)                    http://www.cmake.org/
boost (tested on 1.53+)         http://www.boost.org/
google protobuf                 http://code.google.com/p/protobuf/
libxml2                         http://www.xmlsoft.org/
kerberos                        http://web.mit.edu/kerberos/
libgsasl                        http://www.gnu.org/software/gsasl/

To run the code coverage test, the following tools are needed.

gcov (included in gcc distribution)
lcov (tested on 1.9)            http://ltp.sourceforge.net/coverage/lcov.php

Configuration

Assume the libhdfs3 home directory is LIBHDFS3_HOME.

cd LIBHDFS3_HOME
mkdir build
cd build
../bootstrap

The environment variables CC and CXX can be used to select the compiler. The "bootstrap" script is essentially a wrapper around the cmake command; you can invoke cmake directly to tune the configuration.

Run command "../bootstrap --help" for more configuration.

Build

Run the following command to build.

make

To build concurrently, run make with the -j option.

make -j8

Test

To run the unit tests, run the following command.

make unittest

To run the function tests, first start HDFS and create the function test configuration file at LIBHDFS3_HOME/test/data/function-test.xml; an example can be found at LIBHDFS3_HOME/test/data/function-test.xml.example. Then run the following command.

make functiontest

To generate the code coverage report, run the command below. The report can be found at BUILD_DIR/CodeCoverageReport/index.html.

make ShowCoverage

Install

To install libhdfs3, run the following command.

make install

Wiki

https://github.com/PivotalRD/libhdfs3/wiki

libhdfs3's People

Contributors

alesapin, alexey-milovidov, amosbird, avogar, chenxing-xc, excitoon, gfrontera, gogowen, houfangdong, jsc0218, kssenii, lhuang09287750, liubingxing, loong-hy, m1eyu2018, meenarenganathan22, michael1589, taiyang-li, tomscut, zhanglistar, zhongyuankai


libhdfs3's Issues

undefined symbol: crc_pcl (./libhdfs3.so)

After running make, the library libhdfs3.so was generated normally with no errors.
However, the output of "ldd -r libhdfs3.so" shows:
# ldd -r src/libhdfs3.so
linux-vdso.so.1 (0x00007fff75ed1000)
...
libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007ffa63c86000)
undefined symbol: crc_pcl (src/libhdfs3.so)

It seems that crc_iscsi_v_pcl.asm was not compiled.

Any suggestions? Thanks in advance.

We need a Release for this libhdfs3

We are using libhdfs3 in our project, but the code changes from time to time.
Please create a release for this project so it is easier to depend on.
Thank you!

error: 'unique_lock' is not a member of `std`

Hi! I'm trying to build this library under Ubuntu 22.04

I call ../bootstrap and make and get the following errors:

/home/felix/Projects/alfabank/libhdfs3/src/client/SystemECPolicies.cpp: In member function ‘void Hdfs::Internal::SystemECPolicies::addEcPolicy(int8_t, std::shared_ptr<Hdfs::Internal::ECPolicy>)’:
/home/felix/Projects/alfabank/libhdfs3/src/client/SystemECPolicies.cpp:62:10: error: ‘unique_lock’ is not a member of ‘std’
   62 |     std::unique_lock<std::shared_mutex> lock(mutex);
      |          ^~~~~~~~~~~
/home/felix/Projects/alfabank/libhdfs3/src/client/SystemECPolicies.cpp:25:1: note: ‘std::unique_lock’ is defined in header ‘<mutex>’; did you forget to ‘#include <mutex>’?
   24 | #include "SystemECPolicies.h"
  +++ |+#include <mutex>
   25 | 
/home/felix/Projects/alfabank/libhdfs3/src/client/SystemECPolicies.cpp:62:39: error: expected primary-expression before ‘>’ token
   62 |     std::unique_lock<std::shared_mutex> lock(mutex);
      |                                       ^
/home/felix/Projects/alfabank/libhdfs3/src/client/SystemECPolicies.cpp:62:41: error: ‘lock’ was not declared in this scope; did you mean ‘lockf’?
   62 |     std::unique_lock<std::shared_mutex> lock(mutex);
      |                                         ^~~~
      |                                         lockf
make[2]: *** [src/CMakeFiles/libhdfs3-static.dir/build.make:731: src/CMakeFiles/libhdfs3-static.dir/client/SystemECPolicies.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
[  5%] Building CXX object src/CMakeFiles/libhdfs3-shared.dir/network/TcpSocket.cpp.o
cd /home/felix/Projects/alfabank/libhdfs3/build/src && /usr/bin/g++ -DTEST_HDFS_PREFIX=\"./\" -D_GNU_SOURCE -D__STDC_FORMAT_MACROS -Dlibhdfs3_shared_EXPORTS -I/home/felix/Projects/alfabank/libhdfs3/src -I/home/felix/Projects/alfabank/libhdfs3/src/common -I/home/felix/Projects/alfabank/libhdfs3/build/src -I/usr/include/libxml2 -I/home/felix/Projects/alfabank/libhdfs3/mock -fno-strict-aliasing -fno-omit-frame-pointer -msse4.2 -Wl,--export-dynamic -std=c++0x -D_GLIBCXX_USE_NANOSLEEP -Wall -O2 -g -DNDEBUG -fPIC -std=gnu++20 -MD -MT src/CMakeFiles/libhdfs3-shared.dir/network/TcpSocket.cpp.o -MF CMakeFiles/libhdfs3-shared.dir/network/TcpSocket.cpp.o.d -o CMakeFiles/libhdfs3-shared.dir/network/TcpSocket.cpp.o -c /home/felix/Projects/alfabank/libhdfs3/src/network/TcpSocket.cpp
/home/felix/Projects/alfabank/libhdfs3/src/client/SystemECPolicies.cpp: In member function ‘void Hdfs::Internal::SystemECPolicies::addEcPolicy(int8_t, std::shared_ptr<Hdfs::Internal::ECPolicy>)’:
/home/felix/Projects/alfabank/libhdfs3/src/client/SystemECPolicies.cpp:62:10: error: ‘unique_lock’ is not a member of ‘std’
   62 |     std::unique_lock<std::shared_mutex> lock(mutex);
      |          ^~~~~~~~~~~
/home/felix/Projects/alfabank/libhdfs3/src/client/SystemECPolicies.cpp:25:1: note: ‘std::unique_lock’ is defined in header ‘<mutex>’; did you forget to ‘#include <mutex>’?
   24 | #include "SystemECPolicies.h"
  +++ |+#include <mutex>
   25 | 
/home/felix/Projects/alfabank/libhdfs3/src/client/SystemECPolicies.cpp:62:39: error: expected primary-expression before ‘>’ token
   62 |     std::unique_lock<std::shared_mutex> lock(mutex);
      |                                       ^
/home/felix/Projects/alfabank/libhdfs3/src/client/SystemECPolicies.cpp:62:41: error: ‘lock’ was not declared in this scope; did you mean ‘lockf’?
   62 |     std::unique_lock<std::shared_mutex> lock(mutex);

How can I fix this?

P.S.

g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
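
The GCC note above already hints at the likely cause. A minimal sketch of the kind of change that would address it (an assumption, not a confirmed patch from the maintainers) is to include the missing standard headers at the top of src/client/SystemECPolicies.cpp:

// Sketch only: add the headers that declare std::unique_lock and std::shared_mutex.
#include "SystemECPolicies.h"

#include <mutex>         // std::unique_lock
#include <shared_mutex>  // std::shared_mutex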

Could you please release a version to the public or tag it

We recently fixed a bug in libhdfs3 related to libgsasl, allowing it to connect to HDFS clusters with secure mode enabled. We have also replaced libgsasl with Cyrus SASL for the same purpose. If possible, I would like to contribute our code to this community. Could you please release a version to the public or tag one?

dfs.client.use.datanode.hostname has no effect

The Hadoop cluster is equipped with two networks. One is connected to a 10G switch for communication between servers inside the cluster (IPs like 192.168.*.*), and the other is connected to a Gigabit switch for client servers outside the cluster (IPs like 80.99.*.*). I have tried setting up the HDFS services (NameNode, DataNode, etc.) to bind to wildcard addresses, while also setting "dfs.client.use.datanode.hostname" to "true" in hdfs-client.xml. Before starting the ClickHouse server, I set "LIBHDFS3_CONF=/home/clickhouse/server/conf/hdfs-client.xml". In other words, I tried passing the configuration options using all of those methods, with no effect. I also made sure that the client server's hosts file resolves the cluster hostnames to Gigabit network IPs (such as 80.99.*.* cdh1...), but the client server still accesses the 10G network (192.168.*.*) through hostname resolution.
(Screenshots attached: Snipaste_2024-07-10_09-51-16, Snipaste_2024-07-10_09-50-47.)
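
For reference, a hedged sketch of how this setting would typically appear in the hdfs-client.xml passed via LIBHDFS3_CONF (illustrative only; not a confirmed fix for this issue):

<!-- Illustrative snippet: dfs.client.use.datanode.hostname asks the client to
     connect to DataNodes by hostname instead of the IP reported by the NameNode. -->
<property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
</property>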

Files written by gohdfs library are unreadable by Clickhouse

I already submitted the issue to the mentioned gohdfs library: colinmarc/hdfs#313

When I write a simple CSV file (initially I tried Parquet, to the same effect), the file written by gohdfs ends up not being readable by ClickHouse. This might also be an issue with ClickHouse, as all other data lakes I use can read these files without issues.

What I tried so far:

  • the attached program that writes directly to hdfs: cannot be read by clickhouse
  • the attached program that writes to a local file, then upload using gohdfs put: cannot be read by clickhouse
  • the attached program that writes to a local file, then upload using hdfs dfs -put (hadoop tools): can be read by clickhouse
  • the attached program that writes directly to hdfs, use hdfs dfs -get followed by hdfs dfs -put: can be read by clickhouse, nothing missing

This is the same code:

package main

import (
	"log"
	"fmt"

	"github.com/colinmarc/hdfs/v2"
)

func main() {
	var err error
	client, err := hdfs.New("")
	if err != nil {
		log.Println("Can't create hdfs client", err)
		return
	}
	_ = client.Remove("/random/yet/existing/path/flat.csv")
	fw, err := client.Create("/random/yet/existing/path/flat.csv")
	if err != nil {
		log.Println("Can't create writer", err)
		return
	}

	num := 100
	for i := 0; i < num; i++ {
		if _, err = fmt.Fprintf(fw, "%d,%d,%f\n", int32(20+i%5), int64(i), float32(50.0)); err != nil {
			log.Println("Write error", err)
		}
	}
	log.Println("Write Finished")
	if err = fw.Close(); err != nil {
		log.Println("Issue closing file", err)
	}
	log.Println("Wrote ", num, "rows")
}

This is the response from running the latest clickhouse-local:

server.internal :) select * from hdfs('hdfs://nameservice1/random/yet/existing/path/flat.csv', 'CSV')

SELECT *
FROM hdfs('hdfs://nameservice1/random/yet/existing/path/flat.csv', 'CSV')

Query id: e33d9bf7-41b0-4025-a5fc-8dc6ebb65c0f


0 rows in set. Elapsed: 60.292 sec.

Received exception:
Code: 210. DB::Exception: Fail to read from HDFS: hdfs://nameservice1, file path: /random/yet/existing/path/flat.csv. Error: 
HdfsIOException: InputStreamImpl: cannot read file: /random/yet/existing/path/flat.csv, from position 0, size: 1048576.	
Caused by: HdfsIOException: InputStreamImpl: all nodes have been tried and no valid replica can be read for Block: [block pool 
ID: BP-2134387385-192.168.12.6-1648216715170 block ID 1367614010_294283603].: Cannot extract table structure from CSV
 format file. You can specify the structure manually. (NETWORK_ERROR)

I'm using Hadoop 2.7.3.2.6.5.0-292. Using strace, I can see that clickhouse is trying to access different data nodes, so there is a lot of network traffic.

clickhouse version 23, connect hdfs error

CREATE TABLE guandata.of0b4be941d9_0
(
    id Nullable(Int32),
    customer_name Nullable(String),
    comments Nullable(String)
) ENGINE = HDFS('hdfs://nameserver1/guandata-store/of0b4/part-000.snappy.parquet', 'Parquet')

select * from guandata.of0b4be941d9_0;
When I execute the SQL query, ClickHouse reports an error.

connection reset by peer: While executing ParquetBlockInputFormat: While executing HDFSSource. (NETWORK_ERROR)

The error logs on the HDFS side are attached as a screenshot.

My environment involves connecting ClickHouse to HDFS with Kerberos authentication to query data. I have confirmed that my Kerberos configuration is correct. I can obtain tickets using kinit and verify them with klist.

Erasure code is not supported

Currently, libhdfs3 does not support erasure codes, limiting its use in certain scenarios. We should implement this feature.

Strange error while compiling

While linking libhdfs3 with my source, I am getting the following error. The internet does not say much either!

$ c++ -o hdfs3 hdfs3.cpp -I /home/mark/source/libhdfs3-clickhouse/include -L /home/mark/source/libhdfs3-clickhouse/build/src/ -lkrb5 -lhdfs3
/home/mark/source/libhdfs3-clickhouse/build/src//libhdfs3.so: undefined reference to `crc_pcl'
collect2: error: ld returned 1 exit status
gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Any idea how to fix this?

hdfsBuilderSetPrincipal doesn't work

Hi Team,
I get an error like this:

2021-03-14 14:10:20.339761, p16581, th139969607461248, INFO Retrying connect to server: "hadoopc1h5:25000". Already tried 9 time(s)
2021-03-14 14:10:22.359599, p16581, th139969607461248, ERROR Failed to setup RPC connection to "hadoopc1h5:25000" caused by:
RpcChannel.cpp: 747: Problem with callback handler
	@	Hdfs::Internal::UnWrapper<Hdfs::SafeModeException, Hdfs::SaslException, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing>::unwrap(char const*, int)
	@	Hdfs::Internal::UnWrapper<Hdfs::AccessControlException, Hdfs::SafeModeException, Hdfs::SaslException, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing>::unwrap(char const*, int)
	@	Hdfs::Internal::UnWrapper<Hdfs::UnsupportedOperationException, Hdfs::AccessControlException, Hdfs::SafeModeException, Hdfs::SaslException, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing>::unwrap(char const*, int)
	@	Hdfs::Internal::UnWrapper<Hdfs::RpcNoSuchMethodException, Hdfs::UnsupportedOperationException, Hdfs::AccessControlException, Hdfs::SafeModeException, Hdfs::SaslException, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing>::unwrap(char const*, int)
	@	Hdfs::Internal::UnWrapper<Hdfs::NameNodeStandbyException, Hdfs::RpcNoSuchMethodException, Hdfs::UnsupportedOperationException, Hdfs::AccessControlException, Hdfs::SafeModeException, Hdfs::SaslException, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing>::unwrap(char const*, int)
	@	Hdfs::Internal::HandlerRpcResponseException(std::__exception_ptr::exception_ptr)
	@	Hdfs::Internal::RpcChannelImpl::readOneResponse(bool) [clone .cold]
	@	Hdfs::Internal::RpcChannelImpl::setupSaslConnection()
	@	Hdfs::Internal::RpcChannelImpl::connect()
	@	Hdfs::Internal::RpcChannelImpl::invokeInternal(std::shared_ptr<Hdfs::Internal::RpcRemoteCall>)
	@	Hdfs::Internal::RpcChannelImpl::invoke(Hdfs::Internal::RpcCall const&)
	@	Hdfs::Internal::NamenodeImpl::invoke(Hdfs::Internal::RpcCall const&)
	@	Hdfs::Internal::NamenodeImpl::getFsStats()
	@	Hdfs::Internal::NamenodeProxy::getFsStats()
	@	Hdfs::Internal::FileSystemImpl::getFsStats()
	@	Hdfs::Internal::FileSystemImpl::connect()
	@	Hdfs::FileSystem::connect(char const*, char const*, char const*)
	@	hdfsBuilderConnect
	@	main
	@	__libc_start_main
	@	Unknown

My code and HDFS config:

// Includes added for completeness; schemaAddress, GetEnv, die and
// thrill::vfs::HdfsGetLastError come from the reporter's own project.
#include <hdfs/hdfs.h>
#include <iostream>
#include <locale>
#include <sstream>

int main() {
    hdfsBuilder* builder = hdfsNewBuilder();
    hdfsBuilderSetNameNode(builder, schemaAddress.c_str());
    hdfsBuilderSetForceNewInstance(builder);

    hdfsBuilderSetPrincipal(builder, "ossuser");

    std::stringstream ss;
    ss.imbue(std::locale::classic());
    ss << "/tmp/krb5cc_0";
    std::cout << ss.str() << std::endl;
    const char * userCCpath = GetEnv("LIBHDFS3_TEST_USER_CCPATH", ss.str().c_str());
    hdfsBuilderSetKerbTicketCachePath(builder, userCCpath);

    hdfsFS hdfs = hdfsBuilderConnect(builder);
    if (!hdfs) {
        die("Could not connect to HDFS server \"" << schemaAddress << "\": " << thrill::vfs::HdfsGetLastError());
    }

    std::cout << "success connect to hdfs!!!!!!!!!!!" << std::endl;

    return 1;
}

hdfs-client.xml

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<!-- RPC client configuration -->
<property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
    <description>
    the RPC authentication method, valid values include "simple" or "kerberos". default is "simple"
    </description>
</property>
</configuration>

klist shows:

Ticket cache: FILE:/tmp//krb5cc_0
Default principal: [email protected]

Valid starting     Expires            Service principal
03/14/21 14:09:56  03/15/21 14:09:56  krbtgt/[email protected]
03/14/21 14:10:05  03/15/21 14:09:56  hdfs/[email protected]

Am I using hdfsBuilderSetPrincipal in the wrong way?
Can you please give me an example or a reference for how to use libhdfs3 to connect to a Kerberized cluster?

Appreciate your support!
Thanks!
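
As a point of reference only, here is a hedged sketch combining the builder calls already shown above for a Kerberized connection. All values are placeholders, and this is not a verified answer to the issue; the principal and ticket cache path must match what klist reports on your machine.

// Hedged sketch using the builder API as in the report above.
struct hdfsBuilder *builder = hdfsNewBuilder();
hdfsBuilderSetNameNode(builder, "hadoopc1h5:25000");
hdfsBuilderConfSetStr(builder, "hadoop.security.authentication", "kerberos");
hdfsBuilderSetPrincipal(builder, "user@EXAMPLE.COM");          // placeholder; many setups use the full principal shown by klist
hdfsBuilderSetKerbTicketCachePath(builder, "/tmp/krb5cc_0");   // cache created by kinit
hdfsFS fs = hdfsBuilderConnect(builder);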
