momhiar / datafusion-objectstore-hdfs-no-zip

This project forked from datafusion-contrib/datafusion-objectstore-hdfs

HDFS based on Java implementation as a remote ObjectStore for DataFusion without zip tokio test

License: Apache License 2.0

Rust 96.02% Dockerfile 3.98%

datafusion-objectstore-hdfs-no-zip's Introduction

This fork is used for a personal project and was forked from the original repository. In this fork, the automated tests have been removed to suit the development of another application.

datafusion-objectstore-hdfs

HDFS as a remote ObjectStore for DataFusion.

Querying files on HDFS with DataFusion

This crate introduces HadoopFileSystem as a remote ObjectStore, which makes it possible to query files stored on HDFS.

For HDFS access, we rely on the fs-hdfs library. It only provides Rust FFI bindings for libhdfs, which is compiled from a set of C files provided by the official Hadoop community.

Prerequisites

Because libhdfs is itself only a C interface wrapper, and the actual HDFS access is implemented in a set of Java jars, this crate needs the Hadoop client jars and a JRE in order to work.

Prepare JAVA

  1. Install Java.

  2. Specify and export JAVA_HOME.

Prepare Hadoop client

  1. To get a Hadoop distribution, download a recent stable release from one of the Apache Download Mirrors. Currently, we support Hadoop-2 and Hadoop-3.

  2. Unpack the downloaded Hadoop distribution, for example into /opt/hadoop, then set the following environment variables:

export HADOOP_HOME=/opt/hadoop

export PATH=$PATH:$HADOOP_HOME/bin

Prepare JRE environment

  1. First, add the JVM's native libraries to the library path. For example, on macOS:

export DYLD_LIBRARY_PATH=$JAVA_HOME/jre/lib/server

  2. Because the compiled libhdfs is a JNI native implementation, it needs a proper CLASSPATH to load the Hadoop jars. For example:

export CLASSPATH=$CLASSPATH:`hadoop classpath --glob`
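
A missing variable from the steps above tends to show up only when the JVM is loaded at run time, so it can help to check the environment before touching HDFS. The following is a minimal sketch of such a check, not part of this crate: `missing_vars` and `check_hdfs_env` are hypothetical helper names, and the list of variables is simply the ones exported in this section (the platform-specific library path is left out on purpose).

```rust
use std::env;

/// Environment variables exported in the setup steps of this README.
const REQUIRED_VARS: [&str; 3] = ["JAVA_HOME", "HADOOP_HOME", "CLASSPATH"];

/// Returns the names of required variables that are unset or blank.
/// `lookup` is injected so the check can be exercised without touching
/// the real process environment.
fn missing_vars<F>(lookup: F) -> Vec<&'static str>
where
    F: Fn(&str) -> Option<String>,
{
    REQUIRED_VARS
        .iter()
        .copied()
        .filter(|&name| lookup(name).map_or(true, |value| value.trim().is_empty()))
        .collect()
}

/// Runs the check against the real process environment.
/// (Hypothetical helper, intended to be called once at startup.)
fn check_hdfs_env() -> Result<(), String> {
    let missing = missing_vars(|name| env::var(name).ok());
    if missing.is_empty() {
        Ok(())
    } else {
        Err(format!(
            "missing environment variables: {}",
            missing.join(", ")
        ))
    }
}
```

Calling `check_hdfs_env()` before registering the object store would turn a missing export into an explicit error message. It only verifies that the variables are set, not that the paths they point to are valid.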

Examples

Suppose there is an HDFS directory:

let hdfs_file_uri = "hdfs://localhost:8020/testing/tpch_1g/parquet/line_item";

that contains a list of Parquet files. We can then query these files as follows:

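use std::sync::Arc;

use datafusion::prelude::*; // SessionContext, ParquetReadOptions
use url::Url;
// HadoopFileSystem is provided by this crate.

// The statements below use .await and ?, so they must run inside an
// async function that returns a Result.

// Register the HDFS object store for the hdfs:// scheme.
let ctx = SessionContext::new();
let url = Url::parse("hdfs://").unwrap();
ctx.runtime_env().register_object_store(&url, Arc::new(HadoopFileSystem));

// Register the Parquet directory as a table.
let table_name = "line_item";
println!(
    "Register table {} with parquet file {}",
    table_name, hdfs_file_uri
);
ctx.register_parquet(table_name, &hdfs_file_uri, ParquetReadOptions::default()).await?;

// Run a query and collect the resulting record batches.
let sql = "SELECT count(*) FROM line_item";
let result = ctx.sql(sql).await?.collect().await?;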
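
To make the shape of the URI used above more concrete, here is a small std-only sketch that splits an `hdfs://host[:port]/path` string into its parts. It is an illustration only: `HdfsLocation` and `split_hdfs_uri` are hypothetical names, not part of this crate or of DataFusion, and the sketch ignores cases such as IPv6 literals or user info that a real URL parser like the `url` crate handles.

```rust
/// Parts of an HDFS URI of the form `hdfs://host[:port]/path`.
#[derive(Debug, PartialEq)]
struct HdfsLocation<'a> {
    host: &'a str,
    port: Option<u16>,
    path: &'a str,
}

/// Splits an `hdfs://` URI into host, optional port, and path.
/// Returns `None` if the scheme is not `hdfs` or the port is not a valid u16.
fn split_hdfs_uri(uri: &str) -> Option<HdfsLocation<'_>> {
    let rest = uri.strip_prefix("hdfs://")?;
    // Everything up to the first '/' is the authority; the remainder is the path.
    let (authority, path) = match rest.find('/') {
        Some(idx) => (&rest[..idx], &rest[idx..]),
        None => (rest, "/"),
    };
    // The authority is either `host` or `host:port`.
    let (host, port) = match authority.rsplit_once(':') {
        Some((host, port)) => (host, Some(port.parse::<u16>().ok()?)),
        None => (authority, None),
    };
    Some(HdfsLocation { host, port, path })
}
```

For the example URI, this yields the host `localhost`, the port `8020`, and the path `/testing/tpch_1g/parquet/line_item`, which is the directory holding the Parquet files.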

Testing

  1. First, fetch the test data repository by initializing the submodules:

git submodule update --init --recursive

  2. Run the tests:

cargo test

During testing, a mock HDFS cluster is started automatically.

  3. To build and run the tests with the hdfs3 feature enabled:

cargo build --no-default-features --features datafusion-objectstore-hdfs/hdfs3,datafusion-objectstore-hdfs-testing/hdfs3,datafusion-hdfs-examples/hdfs3

cargo test --no-default-features --features datafusion-objectstore-hdfs/hdfs3,datafusion-objectstore-hdfs-testing/hdfs3,datafusion-hdfs-examples/hdfs3

Run the ballista-sql test with:

cargo run --bin ballista-sql --no-default-features --features hdfs3

datafusion-objectstore-hdfs-no-zip's People

Contributors

kyotoyaho, momhiar, ted-jiang, yahonanjing, yjshen
