baidu / bigflow

Baidu Bigflow is an interface for writing distributed computing programs that provides many simple, flexible, and powerful APIs. Using Bigflow, you can easily handle data of any scale. Bigflow processes 4 PB+ of data inside Baidu and runs about 10k jobs every day.

Home Page: http://baidu.github.io/bigflow

License: Apache License 2.0

CMake 2.12% C++ 50.44% Shell 0.45% Python 42.60% Scala 1.36% Perl 2.92% C 0.10% Thrift 0.01%


bigflow's People

Contributors

acmol, advancedxy, chunyang-wen, himdd, jimmycasey, scaugrated, tushushu, wanglun, wzhiqing, yshysh


bigflow's Issues

Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.0.2:compile (default-compile) on project flume-runtime: Compilation failure

[ 46%] Built target flume-runtime-spark-profile_scripts

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.0.2:compile (default-compile) on project flume-runtime: Compilation failure
[ERROR] Failure executing javac, but could not parse the error:
[ERROR] javac: invalid target release: 1.8
[ERROR] Usage: javac
[ERROR] use -help for a list of possible options
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
make[2]: *** [flume/CMakeFiles/bulid_spark_launcher_jar] Error 1
make[1]: *** [flume/CMakeFiles/bulid_spark_launcher_jar.dir/all] Error 2
make: *** [all] Error 2

manage byproducts

While running a Bigflow program, I find that it outputs some byproducts, e.g.,
entity-*
.flume
...
After several runs, the working folder becomes a mess.
Could you please put those byproducts into a pre-specified subfolder so that they are easier to manage?
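
Until such an option exists, a workaround is a small cleanup script. A minimal sketch, using the byproduct patterns from this issue and an example subfolder name (not a Bigflow convention):

import glob
import os
import shutil

BYPRODUCT_PATTERNS = ["entity-*", ".flume"]   # byproducts mentioned above
TARGET_DIR = "bigflow_byproducts"             # example name only

if not os.path.isdir(TARGET_DIR):
    os.makedirs(TARGET_DIR)

for pattern in BYPRODUCT_PATTERNS:
    for path in glob.glob(pattern):
        # move each byproduct into the dedicated subfolder
        shutil.move(path, os.path.join(TARGET_DIR, os.path.basename(path)))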

A benchmark system is needed

We should have a benchmark for this open-source version of Bigflow.
We should go even further and build a benchmark system that gives us benchmark results every day.
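
As a starting point, a minimal timing harness could look like the sketch below. It only measures the wall-clock time of an arbitrary callable that runs a Bigflow job; the "wordcount" name is just an example, not an existing benchmark.

import json
import time

def run_benchmark(name, job, repeat=3):
    # `job` is any callable that executes a Bigflow pipeline end to end.
    durations = []
    for _ in range(repeat):
        start = time.time()
        job()
        durations.append(time.time() - start)
    return {"name": name, "runs": durations, "best": min(durations)}

# Example usage with a hypothetical word-count job:
# print(json.dumps(run_benchmark("wordcount", lambda: wordcount_pipeline().run())))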

Enable travis' build matrix

Some combinations are:

OS Compiler
Ubuntu gcc
Ubuntu clang
Mac clang
Docker(CentOS) gcc
Docker(CentOS) clang
Windows(WSL) gcc(tbd)
Windows(WSL) clang(tbd)
Windows x

Remove code which is used for streaming process

Since the stream-processing part of Bigflow hasn't been open-sourced, all the code related to it should be removed. We've done most of the work, but some stream-processing code still remains.
So, remove the unused code and keep the codebase clean.

Compilation error due to mvn

An error occurs when executing commands involving mvn.

After I add .m2/settings.xml, the build continues. I am not sure what happened.

I think there should be instructions about this in the build documentation.

Continuous Integration Support

Continuous integration is required for Bigflow, and we should have a system to support continuous integration.

Maybe we can use travis-ci.org or TeamCity, or we can set up a Jenkins instance on BCC (Baidu Cloud Compute)?

Support structured IO formats on Spark

Definitions

Structured input formats here specifically mean ORC files and Parquet files.

Current Status

Bigflow on DCE supports ORC files (read only) and Parquet files with its own loaders, as DCE doesn't support reading ORC or Parquet natively.

For ORC files, Bigflow uses ORC's C++ API. At the time ORC support was added, ORC's C++ API only supported reading.

For Parquet files, Bigflow also uses the C++ API. Currently, parquet-cpp only partially supports nested structures.

Bigflow on Spark supports neither ORC nor Parquet for now. This doc lists some details on how we can add support for ORC and Parquet files.

Parquet Support Architecture Overview on DCE

(Figure: Parquet support architecture on DCE)

The ORC loader follows a similar procedure.

How to add support for the Spark pipeline

Read support

The RecordBatch in the architecture above is an Arrow RecordBatch. Spark already supports transforming a Dataset into RDD[ArrowPayload]
(see Dataset.scala), though not publicly.

It would be straightforward to add Parquet read support to the Spark pipeline, and even ORC or CSV files.

Impl details to add read support

  1. Use SparkSession to read Parquet or ORC files (the Spark pipeline currently uses SparkContext)
  2. Implement toArrowPayload in flume-runtime, as Spark doesn't expose it publicly
  3. Reuse and refactor the current PythonFromRecordBatchProcessor
  4. Modify Bigflow's planner to use PythonFromRecordBatchProcessor for the Spark pipeline's structured input when constructing the Flume task
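
For reference, a minimal sketch of the Python-side consumption of Arrow RecordBatches using pyarrow; this is not Bigflow's actual PythonFromRecordBatchProcessor, and the file path is only an example:

import pyarrow.parquet as pq

# Read a Parquet file and iterate over Arrow RecordBatches, roughly the shape
# of data a RecordBatch-based processor would receive on the Python side.
table = pq.read_table("part-00000.parquet")   # example path
for batch in table.to_batches():
    rows = batch.to_pydict()                  # column name -> list of Python values
    print(batch.num_rows, list(rows.keys()))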

Write support

Bigflow uses its own sinker implementation to write a PCollection (or PType) to an external target.

The current implementation on DCE should also work on Spark, although some additional work is
needed, namely:

  1. Refactor the current ParquetSinker and Arrow schema converter
  2. Add write support for ORC files (ORC's C++ API is adding write support incrementally)
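
On the write side, the essential step is turning Python records into an Arrow table and writing it out. A minimal sketch with pyarrow, assuming example data and an example output path (not the actual ParquetSinker code):

import pyarrow as pa
import pyarrow.parquet as pq

# Build an Arrow table from Python records and write it to Parquet.
records = {"word": ["flume", "bigflow"], "count": [3, 5]}   # example data
table = pa.Table.from_pydict(records)
pq.write_table(table, "output.parquet")                     # example output path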

References

  1. Apache Arrow is a promising in-memory columnar format; we can leverage more
    of its power. See the Arrow SlideShare.

cc @himdd @chunyang-wen @bb7133 @acmol; comments and PRs are appreciated.

urllib2.URLError

Environment

OS: CentOS 7.2
Spark: HDP-2.1
Hadoop: HDP-2.7

I have set SPARK_HOME, HADOOP_HOME and BIGFLOW_PYTHON_HOME.

Description

I get an error after running print pipeline.read(input.TextFile("hdfs://localhost:8020/data/test/new.txt")).get():

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bigflow/ptype.py", line 87, in get
    if not self._is_readable():
  File "bigflow/ptype.py", line 127, in _is_readable
    return requests.is_node_cached(self._pipeline.id(), self._node.id())
  File "bigflow/rpc/requests.py", line 68, in _wrapper
    result, status = func(*args, **kw)
  File "bigflow/rpc/requests.py", line 318, in is_node_cached
    response = _service.request(request, "is_node_cached")
  File "bigflow/rpc/service.py", line 89, in request
    response = urllib2.urlopen(req, request_json, 365 * 24 * 60 * 60)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>

I tried hdfs://data/test/new.txt and /data/test/new.txt, but it still doesn't work.
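
Note that the refused connection in the traceback is the client's RPC call to the local Bigflow backend service (bigflow/rpc/service.py), not HDFS itself, so it is worth checking whether that backend process is actually running and listening. A minimal probe sketch; the host and port are placeholders to be replaced with whatever address the backend reports at startup:

import socket

HOST, PORT = "127.0.0.1", 8080   # placeholders, not Bigflow's real defaults

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(3)
try:
    sock.connect((HOST, PORT))
    print("backend reachable")
except socket.error as exc:
    print("backend not reachable: %s" % exc)
finally:
    sock.close()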

Hive Support

Read/write InputFormat/OutputFormat and SerDe information from/to the Hive Metastore.
Read/write data from/to a Hive table or partition.
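
A possible user-facing shape for this feature, purely illustrative: input.HiveTable and output.HiveTable do not exist today, and their names and parameters are hypothetical.

from bigflow import input, output

# `pipeline` is assumed to be an existing Bigflow pipeline object.
rows = pipeline.read(input.HiveTable(database="dw", table="logs", partition="dt=20171201"))
pipeline.write(rows, output.HiveTable(database="dw", table="logs_copy"))
pipeline.run()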

readline install failed

(Screenshot: pip install readline fails but the build continues)

As illustrated in the screenshot, pip install readline failed but the build continued.

The build scripts should be improved so that readline builds successfully and so that this kind of error is detected.

An API Plan optimization layer is needed

We should add an optimization layer within the current API layer; it could be called the API Plan layer.

At the moment, the API layer transforms the user's code into a LogicalPlan directly, and some information is lost. For example, the LogicalPlan doesn't know what a join is; it only knows that two nodes are cogrouped and that a Processor then processes the cogrouped result.

E.g.,

pc1.distinct().join(pc2)

is equivalent to

pc1.cogroup(pc2) \
       .apply_values(lambda p1, p2: p1.distinct().cartesian(p2)) \
       .flatten_values()

But we can't optimize this automatically without the help of the API Plan.

So, the API Plan is meant to keep all the information we can get from the user's code and to optimize the plan using that information.
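
A rough sketch of what such an API Plan node could record; the class and field names below are made up for illustration and are not existing Bigflow code:

class ApiPlanNode(object):
    """Hypothetical node that keeps the high-level operation (e.g. 'join',
    'distinct') that the LogicalPlan would otherwise lose."""

    def __init__(self, op, inputs, options=None):
        self.op = op              # e.g. "join", "distinct", "cogroup"
        self.inputs = inputs      # upstream nodes or PCollection names
        self.options = options or {}

# pc1.distinct().join(pc2) could then be recorded as:
plan = ApiPlanNode("join", [ApiPlanNode("distinct", ["pc1"]), "pc2"])
# An optimizer that sees "distinct" feeding "join" can choose to rewrite it
# into the cogroup + apply_values form shown above.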

English documents are required

We have some documents in English, but they contain many mistakes and they're outdated.

They need to be refined.

Our documents are generated by sphinx-doc, and sphinx-doc has internationalization support.

So I think all the comments should be written in English, especially the ones that are the sources of the generated documents. Chinese documents should then be generated with the help of sphinx-doc's internationalization support.
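
For reference, sphinx-doc's i18n setup is a couple of lines in conf.py plus a gettext build step; the paths below are examples, not Bigflow's actual docs layout:

# conf.py (excerpt): enable translation catalogs so Chinese docs can be
# generated from the English sources.
locale_dirs = ["locale/"]    # where the .po translation files live (example path)
gettext_compact = False      # one .po file per source document

# Typical workflow (run from the docs directory):
#   sphinx-build -b gettext . _build/gettext
#   sphinx-intl update -p _build/gettext -l zh_CN
#   sphinx-build -b html -D language=zh_CN . _build/html/zh_CN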

Better handle LINK_ALL_SYMBOLS option in cmake/generic.cmake

We implemented two functions (cc_library, cc_binary) in cmake/generic.cmake that accept LINK_ALL_SYMBOLS as an argument (meaning all symbols are exported when linking, by wrapping the libs with the gcc options "-Wl,--whole-archive" and "-Wl,--no-whole-archive").

  1. A binary or a library can use ALL_SYMBOLS_DEPS to tell the linker that it needs all the symbols in its deps.

  2. A library that has the LINK_ALL_SYMBOLS attribute should export all of its symbols to the targets that depend on it. We expect these symbols to be exported recursively; however, currently only the targets that depend on this library directly link all the symbols.

  3. cc_test doesn't have similar functionality.

So, cmake/generic.cmake's cc_library/cc_binary/cc_test need improvements to handle LINK_ALL_SYMBOLS.
