baidu / bigflow

Baidu Bigflow is an interface for writing distributed computing programs that provides many simple, flexible, and powerful APIs. Using Bigflow, you can easily handle data of any scale. Bigflow processes 4 PB+ of data inside Baidu and runs about 10k jobs every day.

Home Page: http://baidu.github.io/bigflow

License: Apache License 2.0

CMake 2.12% C++ 50.44% Shell 0.45% Python 42.60% Scala 1.36% Perl 2.92% C 0.10% Thrift 0.01%


bigflow's People

Contributors

acmol, advancedxy, chunyang-wen, himdd, jimmycasey, scaugrated, tushushu, wanglun, wzhiqing, yshysh


bigflow's Issues

Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.0.2:compile (default-compile) on project flume-runtime: Compilation failure

[ 46%] Built target flume-runtime-spark-profile_scripts

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.0.2:compile (default-compile) on project flume-runtime: Compilation failure
[ERROR] Failure executing javac, but could not parse the error:
[ERROR] javac: invalid target release: 1.8
[ERROR] Usage: javac
[ERROR] use -help for a list of possible options
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
make[2]: *** [flume/CMakeFiles/bulid_spark_launcher_jar] Error 1
make[1]: *** [flume/CMakeFiles/bulid_spark_launcher_jar.dir/all] Error 2
make: *** [all] Error 2

manage byproducts

While running a Bigflow program, I find that it outputs some byproducts, e.g.,
entity-*
.flume
...
After several runs, the working folder becomes a mess.
Could you please put those byproducts into a pre-specified subfolder so that they are easier to manage?
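
Until such an option exists, a workaround is a small cleanup script. A minimal sketch, using the byproduct patterns from this issue and an example subfolder name (not a Bigflow convention):

import glob
import os
import shutil

BYPRODUCT_PATTERNS = ["entity-*", ".flume"]   # byproducts mentioned above
TARGET_DIR = "bigflow_byproducts"             # example name only

if not os.path.isdir(TARGET_DIR):
    os.makedirs(TARGET_DIR)

for pattern in BYPRODUCT_PATTERNS:
    for path in glob.glob(pattern):
        # move each byproduct into the dedicated subfolder
        shutil.move(path, os.path.join(TARGET_DIR, os.path.basename(path)))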

A benchmark system is needed

We should have a benchmark for this open-source version of Bigflow.
We should go even further and build a benchmark system that gives us benchmark results every day.
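
As a starting point, a minimal timing harness could look like the sketch below. It only measures the wall-clock time of an arbitrary callable that runs a Bigflow job; the "wordcount" name is just an example, not an existing benchmark.

import json
import time

def run_benchmark(name, job, repeat=3):
    # `job` is any callable that executes a Bigflow pipeline end to end.
    durations = []
    for _ in range(repeat):
        start = time.time()
        job()
        durations.append(time.time() - start)
    return {"name": name, "runs": durations, "best": min(durations)}

# Example usage with a hypothetical word-count job:
# print(json.dumps(run_benchmark("wordcount", lambda: wordcount_pipeline().run())))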

Enable travis' build matrix

Some combinations are:

OS Compiler
Ubuntu gcc
Ubuntu clang
Mac clang
Docker(CentOS) gcc
Docker(CentOS) clang
Windows(WSL) gcc(tbd)
Windows(WSL) clang(tbd)
Windows x

Remove code which is used for streaming process

Since the stream-processing part of Bigflow hasn't been open-sourced, all the code related to it should be removed. We've done most of the work, but some stream-processing code still remains.
So, remove the unused code and keep the codebase clean.

Compilation error due to mvn

An error occurs when executing commands involving mvn.

After I add .m2/settings.xml, the build continues. I am not sure what happened.

I think there should be instructions about this in the build documentation.

Continuous Integration Support

Continuous integration is required for Bigflow, and we should have a system to support continuous integration.

Maybe we can use travis-ci.org or TeamCity, or we can set up a Jenkins instance on BCC (Baidu Cloud Compute)?

Support structured IO formats on Spark

Definitions

Structured input formats here specifically mean ORC files and Parquet files.

Current Status

Bigflow on DCE supports ORC files (read only) and Parquet files with its own loaders, as DCE doesn't support reading ORC or Parquet natively.

For ORC files, Bigflow uses ORC's C++ API. At the time ORC support was added, ORC's C++ API only supported reading.

For Parquet files, Bigflow also uses the C++ API. Currently, parquet-cpp only partially supports nested structures.

Bigflow on Spark supports neither ORC nor Parquet for now. This doc lists some details on how we can add support for ORC and Parquet files.

Parquet Support Architecture Overview on DCE

(Figure: Parquet support architecture on DCE)

The ORC loader follows a similar procedure.

How to add support for the Spark pipeline

Read support

The RecordBatch in the architecture above is an Arrow RecordBatch. Spark already supports transforming a Dataset into RDD[ArrowPayload]
(see Dataset.scala), though not publicly.

It would be straightforward to add Parquet read support to the Spark pipeline, and even ORC or CSV files.

Impl details to add read support

  1. Use SparkSession to read Parquet or ORC files (the Spark pipeline currently uses SparkContext)
  2. Implement toArrowPayload in flume-runtime, as Spark doesn't expose it publicly
  3. Reuse and refactor the current PythonFromRecordBatchProcessor
  4. Modify Bigflow's planner to use PythonFromRecordBatchProcessor for the Spark pipeline's structured input when constructing the Flume task
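
For reference, a minimal sketch of the Python-side consumption of Arrow RecordBatches using pyarrow; this is not Bigflow's actual PythonFromRecordBatchProcessor, and the file path is only an example:

import pyarrow.parquet as pq

# Read a Parquet file and iterate over Arrow RecordBatches, roughly the shape
# of data a RecordBatch-based processor would receive on the Python side.
table = pq.read_table("part-00000.parquet")   # example path
for batch in table.to_batches():
    rows = batch.to_pydict()                  # column name -> list of Python values
    print(batch.num_rows, list(rows.keys()))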

Write support

Bigflow uses its own sinker implementation to write a PCollection (or PType) to an external target.

The current implementation on DCE should also work on Spark, although some additional work is
needed, namely:

  1. Refactor the current ParquetSinker and Arrow schema converter
  2. Add write support for ORC files (ORC's C++ API is adding write support incrementally)
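
On the write side, the essential step is turning Python records into an Arrow table and writing it out. A minimal sketch with pyarrow, assuming example data and an example output path (not the actual ParquetSinker code):

import pyarrow as pa
import pyarrow.parquet as pq

# Build an Arrow table from Python records and write it to Parquet.
records = {"word": ["flume", "bigflow"], "count": [3, 5]}   # example data
table = pa.Table.from_pydict(records)
pq.write_table(table, "output.parquet")                     # example output path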

References

  1. Apache Arrow is a promising in-memory columnar format; we can leverage more
    of its power. See the Arrow SlideShare.

cc @himdd @chunyang-wen @bb7133 @acmol; comments and PRs are appreciated.

urllib2.URLError

Environment

OS: CentOS 7.2
Spark: HDP-2.1
Hadoop: HDP-2.7

I have set SPARK_HOME, HADOOP_HOME and BIGFLOW_PYTHON_HOME.

Description

I get an error after running print pipeline.read(input.TextFile("hdfs://localhost:8020/data/test/new.txt")).get():

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bigflow/ptype.py", line 87, in get
    if not self._is_readable():
  File "bigflow/ptype.py", line 127, in _is_readable
    return requests.is_node_cached(self._pipeline.id(), self._node.id())
  File "bigflow/rpc/requests.py", line 68, in _wrapper
    result, status = func(*args, **kw)
  File "bigflow/rpc/requests.py", line 318, in is_node_cached
    response = _service.request(request, "is_node_cached")
  File "bigflow/rpc/service.py", line 89, in request
    response = urllib2.urlopen(req, request_json, 365 * 24 * 60 * 60)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>

I tried hdfs://data/test/new.txt and /data/test/new.txt, but it still doesn't work.
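
Note that the refused connection in the traceback is the client's RPC call to the local Bigflow backend service (bigflow/rpc/service.py), not HDFS itself, so it is worth checking whether that backend process is actually running and listening. A minimal probe sketch; the host and port are placeholders to be replaced with whatever address the backend reports at startup:

import socket

HOST, PORT = "127.0.0.1", 8080   # placeholders, not Bigflow's real defaults

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(3)
try:
    sock.connect((HOST, PORT))
    print("backend reachable")
except socket.error as exc:
    print("backend not reachable: %s" % exc)
finally:
    sock.close()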

Hive Support

Read/write InputFormat/OutputFormat and SerDe information from/to the Hive Metastore.
Read/write data from/to a Hive table or partition.
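
A possible user-facing shape for this feature, purely illustrative: input.HiveTable and output.HiveTable do not exist today, and their names and parameters are hypothetical.

from bigflow import input, output

# `pipeline` is assumed to be an existing Bigflow pipeline object.
rows = pipeline.read(input.HiveTable(database="dw", table="logs", partition="dt=20171201"))
pipeline.write(rows, output.HiveTable(database="dw", table="logs_copy"))
pipeline.run()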

readline install failed

(Screenshot: pip install readline fails but the build continues)

As illustrated in the screenshot, pip install readline failed but the build continued.

The build scripts should be improved so that readline builds successfully and so that this kind of error is detected.

An API Plan optimization layer is needed

We should add an optimization layer within the current API layer; it could be called the API Plan layer.

At the moment, the API layer transforms the user's code into a LogicalPlan directly, and some information is lost. For example, the LogicalPlan doesn't know what a join is; it only knows that two nodes are cogrouped and that a Processor then processes the cogrouped result.

E.g.,

pc1.distinct().join(pc2)

is equivalent to

pc1.cogroup(pc2) \
       .apply_values(lambda p1, p2: p1.distinct().cartesian(p2)) \
       .flatten_values()

But we can't optimize this automatically without the help of the API Plan.

So, the API Plan is meant to keep all the information we can get from the user's code and to optimize the plan using that information.
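
A rough sketch of what such an API Plan node could record; the class and field names below are made up for illustration and are not existing Bigflow code:

class ApiPlanNode(object):
    """Hypothetical node that keeps the high-level operation (e.g. 'join',
    'distinct') that the LogicalPlan would otherwise lose."""

    def __init__(self, op, inputs, options=None):
        self.op = op              # e.g. "join", "distinct", "cogroup"
        self.inputs = inputs      # upstream nodes or PCollection names
        self.options = options or {}

# pc1.distinct().join(pc2) could then be recorded as:
plan = ApiPlanNode("join", [ApiPlanNode("distinct", ["pc1"]), "pc2"])
# An optimizer that sees "distinct" feeding "join" can choose to rewrite it
# into the cogroup + apply_values form shown above.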

English documents are required

We have some documents in English, but they contain many mistakes and they're outdated.

They need to be refined.

Our documents are generated by sphinx-doc, and sphinx-doc has internationalization support.

So I think all the comments should be written in English, especially the ones that are the sources of the generated documents. Chinese documents should then be generated with the help of sphinx-doc's internationalization support.
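
For reference, sphinx-doc's i18n setup is a couple of lines in conf.py plus a gettext build step; the paths below are examples, not Bigflow's actual docs layout:

# conf.py (excerpt): enable translation catalogs so Chinese docs can be
# generated from the English sources.
locale_dirs = ["locale/"]    # where the .po translation files live (example path)
gettext_compact = False      # one .po file per source document

# Typical workflow (run from the docs directory):
#   sphinx-build -b gettext . _build/gettext
#   sphinx-intl update -p _build/gettext -l zh_CN
#   sphinx-build -b html -D language=zh_CN . _build/html/zh_CN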

Better handle LINK_ALL_SYMBOLS option in cmake/generic.cmake

We implemented two functions (cc_library, cc_binary) in cmake/generic.cmake that accept LINK_ALL_SYMBOLS as an argument (meaning all symbols are exported when linking, by wrapping the libs with the gcc options "-Wl,--whole-archive" and "-Wl,--no-whole-archive").

  1. A binary or a library can use ALL_SYMBOLS_DEPS to tell the linker that it needs all the symbols in its deps.

  2. A library that has the LINK_ALL_SYMBOLS attribute should export all of its symbols to the targets that depend on it. We expect these symbols to be exported recursively; however, currently only the targets that depend on this library directly link all the symbols.

  3. cc_test doesn't have similar functionality.

So, cmake/generic.cmake's cc_library/cc_binary/cc_test need improvements to handle LINK_ALL_SYMBOLS.
