federatedai / fate
An Industrial Grade Federated Learning Framework
License: Apache License 2.0
I evaluated the computational efficiency of Paillier in FATE (denoted as paillier_fate) against the Paillier implementation in Python (https://github.com/n1analytics/python-paillier, denoted as paillier_python); paillier_fate is more than six times faster than paillier_python.
The implementations of the two algorithms are similar, so I would like to know how to improve the efficiency of Paillier further. Are there any other homomorphic encryption libraries or implementation tips that could improve efficiency?
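For reference, a minimal benchmark sketch of python-paillier (the phe package) is shown below; the key size, sample count, and timing approach are illustrative assumptions, not the exact setup behind the numbers above.

    import time
    from phe import paillier  # pip install phe (python-paillier)

    # Generate a keypair; a 2048-bit modulus is an assumed, typical choice.
    public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

    values = [float(i) for i in range(200)]

    start = time.time()
    ciphertexts = [public_key.encrypt(v) for v in values]
    print("encrypt: %.3fs" % (time.time() - start))

    start = time.time()
    total = sum(ciphertexts[1:], ciphertexts[0])  # homomorphic addition
    plaintexts = [private_key.decrypt(c) for c in ciphertexts]
    print("add+decrypt: %.3fs" % (time.time() - start))

Most of the time goes into modular exponentiations during encryption/decryption, which is why implementations backed by gmp or batching tricks tend to be much faster than pure Python.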
Is your feature request related to a problem? Please describe.
Referring to #45, this thread addresses the following question:
Measuring write amplification.
Describe the solution you'd like
Write different sizes of data across different numbers of keys to measure the amplification factor.
Describe alternatives you've considered
Additional context
We are using LMDB as the storage engine. We will measure different engines to collect data for comparison.
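One possible way to take such a measurement with the lmdb Python binding is sketched below; the key/value sizes, the test path, and the use of allocated disk blocks for data.mdb are assumptions for illustration only.

    import os
    import lmdb

    DB_PATH = "/tmp/wa_test"  # hypothetical test path
    env = lmdb.open(DB_PATH, map_size=1 << 30)

    logical_bytes = 0
    with env.begin(write=True) as txn:
        for i in range(10000):
            key = ("key-%08d" % i).encode()
            value = os.urandom(256)  # 256-byte values; vary this to measure scale
            txn.put(key, value)
            logical_bytes += len(key) + len(value)
    env.sync()

    # data.mdb is a sparse file sized to map_size, so count allocated blocks
    # (Linux-specific) rather than the apparent file size.
    physical_bytes = os.stat(os.path.join(DB_PATH, "data.mdb")).st_blocks * 512
    print("write amplification: %.2f" % (physical_bytes / logical_bytes))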
Describe the bug
When pickling a DTable, the following error occurred:
Traceback (most recent call last):
File "test.py", line 19, in <module>
test.run()
File "test.py", line 15, in run
table = self.data.mapValues(lambda x: self.fun(x))
File "/data/projects/fate/python/arch/api/cluster/eggroll.py", line 108, in mapValues
return self.__client.map_values(self, func)
File "/data/projects/fate/python/arch/api/cluster/eggroll.py", line 290, in map_values
func_id, func_bytes = self.serialize_and_hash_func(func)
File "/data/projects/fate/python/arch/api/cluster/eggroll.py", line 195, in serialize_and_hash_func
pickled_function = cloudpickle.dumps(func)
File "/data/projects/fate/python/arch/api/utils/cloudpickle.py", line 892, in dumps
cp.dump(obj)
File "/data/projects/fate/python/arch/api/utils/cloudpickle.py", line 271, in dump
return Pickler.dump(self, obj)
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 409, in dump
self.save(obj)
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/data/projects/fate/python/arch/api/utils/cloudpickle.py", line 412, in save_function
self.save_function_tuple(obj)
File "/data/projects/fate/python/arch/api/utils/cloudpickle.py", line 559, in save_function_tuple
save(state)
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 847, in _batch_setitems
save(v)
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 781, in save_list
self._batch_appends(obj)
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 808, in _batch_appends
save(tmp[0])
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 521, in save
self.save_reduce(obj=obj, *rv)
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 634, in save_reduce
save(state)
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 852, in _batch_setitems
save(v)
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 521, in save
self.save_reduce(obj=obj, *rv)
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 634, in save_reduce
save(state)
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 847, in _batch_setitems
save(v)
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 521, in save
self.save_reduce(obj=obj, *rv)
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 634, in save_reduce
save(state)
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 847, in _batch_setitems
save(v)
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 521, in save
self.save_reduce(obj=obj, *rv)
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 634, in save_reduce
save(state)
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 847, in _batch_setitems
save(v)
File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 496, in save
rv = reduce(self.proto)
File "stringsource", line 2, in grpc._cython.cygrpc.Channel.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__
To Reproduce
Steps to reproduce the behavior:
from arch.api import eggroll
import functools

eggroll.init("20190404.1425", 1)

class Test(object):
    def __init__(self):
        self.data = eggroll.parallelize(range(1000), include_key=False)

    def fun(self, x):
        return x * x

    def run(self):
        table = self.data.mapValues(lambda x: self.fun(x))
        print(table.collect())

test = Test()
test.run()
Desktop: Linux
Additional context
Works fine in standalone mode, but not in cluster mode.
Describe the bug
Using the cleanup API in the status tracer_decorator can cause bugs, because in many situations (mostly in the standalone version) whichever of the host or guest finishes first will clean up all of the job's data.
For now, the federated logistic regression (LR) algorithm only supports structured (i.e., tabular) data, which limits the applicability of LR. We could add support for automatic feature engineering to LR so that it can deal with various types of inputs such as text and images.
Neural networks such as RNNs, CNNs, and autoencoders are widely used for learning features from text and images. Therefore, we could add these neural networks as local models with which each party extracts features, and then feed the extracted features into LR, as sketched below.
This feature is suggested for FATE v0.3.
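As an illustration of the proposed setup (not existing FATE code), the sketch below uses a hypothetical frozen local encoder to turn raw inputs into dense features that a plain logistic regression then consumes; the encoder here is just a random projection standing in for an RNN/CNN/autoencoder.

    import numpy as np

    rng = np.random.default_rng(0)

    def local_encoder(x_raw, proj):
        # Stand-in for a locally trained RNN/CNN/autoencoder: map raw inputs
        # (e.g., flattened text or image tensors) to a dense feature vector.
        return np.tanh(x_raw @ proj)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical raw data: 100 samples with 784 raw dimensions (e.g., 28x28 images).
    x_raw = rng.normal(size=(100, 784))
    proj = rng.normal(size=(784, 32))          # frozen encoder weights
    features = local_encoder(x_raw, proj)      # features fed into federated LR

    w, b = np.zeros(32), 0.0                   # LR parameters (learned federatedly)
    predictions = sigmoid(features @ w + b)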
We rented multiple machines on Google Cloud to run the FATE project. In the beginning, we used machines with 2 vCPUs. While running the cluster code for hetero_logistic_regression, we got an error from the arbiter when it was sending public keys to the guest and host (console.log in federation):
[ERROR] 2019-03-18T07:11:15,361 [transferJobSchedulerExecutor-2] [SendProcessor:94] - java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:657)
at java.util.ArrayList.get(ArrayList.java:433)
at com.webank.ai.fate.driver.federation.transfer.service.impl.DefaultProxySelectionService.select(DefaultProxySelectionService.java:81)
at com.webank.ai.fate.driver.federation.transfer.communication.processor.SendProcessor.run(SendProcessor.java:70)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[ERROR] 2019-03-18T07:11:24,396 [transferJobSchedulerExecutor-2] [GrpcChannelFactory:119] - [COMMON][CHANNEL][ERROR] Error getting ManagedChannel after retries
[ERROR] 2019-03-18T07:11:24,397 [transferJobSchedulerExecutor-2] [TransferJobScheduler:127] - [FEDERATION][SCHEDULER] processor failed: transferMetaId: cxz-HeteroLRTransferVariable.paillier_pubkey-HeteroLRTransferVariable.paillier_pubkey.0-2-arbiter-1-guest, exception: java.lang.RuntimeException: should never get here
at com.webank.ai.fate.core.factory.GrpcStubFactory.createGrpcStub(GrpcStubFactory.java:47)
at com.webank.ai.fate.core.factory.GrpcStubFactory.createGrpcStub(GrpcStubFactory.java:56)
at com.webank.ai.fate.core.api.grpc.client.GrpcAsyncClientContext.createStub(GrpcAsyncClientContext.java:207)
at com.webank.ai.fate.core.api.grpc.client.GrpcStreamingClientTemplate.calleeStreamingRpc(GrpcStreamingClientTemplate.java:106)
at com.webank.ai.fate.core.api.grpc.client.GrpcStreamingClientTemplate.calleeStreamingRpcWithImmediateDelayedResult(GrpcStreamingClientTemplate.java:149)
at com.webank.ai.fate.driver.federation.transfer.api.grpc.client.ProxyClient.unaryCall(ProxyClient.java:98)
at com.webank.ai.fate.driver.federation.transfer.api.grpc.client.ProxyClient.requestSendEnd(ProxyClient.java:121)
at com.webank.ai.fate.driver.federation.transfer.communication.processor.SendProcessor.run(SendProcessor.java:98)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
We tried the same code on machines with 8 vCPUs, and the error could not be reproduced.
Here are the configurations of the two groups of machines:
The former ones: Google Cloud n1-standard-2 machines with 2 vCPUs and 7.5GB RAM
The latter ones: Google Cloud n1-standard-8 machines with 8 vCPUs and 30GB RAM
Describe the bug
If the input DTable of any component (LR, SecureBoost, feature engineering, etc.) is empty (no keys at all, or keys whose values are all None), what happens?
Add support for the svm-light sparse data input format.
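For context, svm-light sparse format stores one instance per line as "<label> <index>:<value> ...". A minimal parsing sketch (not the eventual FATE reader) is below; scikit-learn's sklearn.datasets.load_svmlight_file is an existing alternative for local files.

    def parse_svmlight_line(line):
        # "<label> <index>:<value> <index>:<value> ..." -> (label, {index: value})
        parts = line.strip().split()
        label = float(parts[0])
        features = {}
        for item in parts[1:]:
            idx, val = item.split(":")
            features[int(idx)] = float(val)
        return label, features

    label, features = parse_svmlight_line("1 3:0.5 10:1.2 42:-0.7")
    print(label, features)  # 1.0 {3: 0.5, 10: 1.2, 42: -0.7}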
This issue is opened specifically for discussion. Here you can give feedback on problems, such as installation issues you run into when installing FATE, or new features that you think are important to you.
Finally, please ask your questions in English if possible.
Thanks.
dylanfan
Is your feature request related to a problem? Please describe.
All processors of Eggroll run in a single Python process, which is quite slow.
Describe the solution you'd like
Since the GIL limitation of the Python VM cannot be fixed, deploying multiple processor processes would be an easy solution, as sketched below.
Describe alternatives you've considered
Jython
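As a rough illustration of the idea (not Eggroll code), CPU-bound work can sidestep the GIL by running in multiple worker processes, e.g. with the standard multiprocessing module; partition layout and pool size here are arbitrary.

    from multiprocessing import Pool

    def process_partition(partition):
        # CPU-bound work runs in a separate process, so the GIL of the main
        # interpreter does not serialize it.
        return sum(x * x for x in partition)

    if __name__ == "__main__":
        partitions = [range(i * 100000, (i + 1) * 100000) for i in range(8)]
        with Pool(processes=4) as pool:        # number of processors is a tunable
            results = pool.map(process_partition, partitions)
        print(results)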
Is your feature request related to a problem? Please describe.
When we want to initialize a model or do some feature engineering, we may want to know the feature shape first. Thus, being able to fetch a single instance from a DTable is necessary.
Describe the solution you'd like
Provide such an interface in federation; see the sketch below.
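A minimal sketch of such a helper (the helper name is hypothetical, and it assumes collect() yields (key, value) pairs as in the eggroll API used elsewhere in this thread):

    def get_one_instance(table):
        # Hypothetical helper: fetch a single (key, value) pair from a DTable,
        # e.g. to inspect the feature shape before initializing a model.
        for key, value in table.collect():
            return key, value
        return None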
Is your feature request related to a problem? Please describe.
No
Describe the solution you'd like
Proxy and other arch components use similar infrastructure but with different implementations. They should be merged into one.
Describe alternatives you've considered
N/A
Additional context
Regression tests are required.
For now, there is a lack of documentation for public APIs (e.g., eggroll APIs and operators such as HeterologisticGradient.fore_gradient()), which makes the whole framework hard to use.
It would be much better to add clear documentation and even some examples for these APIs.
Describe the bug
In the feature selection component, make sure at least one feature is kept for every party.
Implement a decentralized encryption scheme for FTL without an arbiter in the loop (refer to the paper).
Quantile optimization:
Sparse optimization: the quantile process currently costs O(N * max_feature_dimension); we will speed it up to O(sum of non-sparse features).
SecureBoost optimization:
a. Memory optimization: we use a breadth-first-search algorithm to build trees, and currently use all nodes of one level to build the histograms and find the splits for the next tree level. We now support specifying a maximum number of nodes to process at a time instead of always using all nodes.
b. Distributed finding of candidate splits.
c. Speed up the host-side federated split-finding process.
Add a new function for federated calculation of IV and WOE (see the WOE/IV sketch below).
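For reference, WOE and IV per bin follow the standard definitions (the federated/encrypted part is omitted, and the sign convention for WOE varies between references); a plain sketch, assuming per-bin event/non-event counts are already available:

    import math

    def woe_iv(event_counts, non_event_counts, eps=1e-10):
        # event_counts[i] / non_event_counts[i]: positives / negatives in bin i.
        total_event = sum(event_counts)
        total_non_event = sum(non_event_counts)
        woe, iv = [], 0.0
        for e, ne in zip(event_counts, non_event_counts):
            event_rate = max(e / total_event, eps)
            non_event_rate = max(ne / total_non_event, eps)
            w = math.log(event_rate / non_event_rate)   # one common convention
            woe.append(w)
            iv += (event_rate - non_event_rate) * w
        return woe, iv

    woe, iv = woe_iv([20, 50, 30], [80, 50, 70])
    print(woe, iv)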
Describe the bug
In the workflow, if the flowid is not reset in the validation stage, the guest will receive a stale federation object, which may cause a bug.
To Reproduce
Steps to reproduce the behavior:
Set the train and predict data to completely different ID sets, then run the examples.
Additional context
If the flowid is reset in the validation stage, it works perfectly.
I'm trying to interpret the results and understand the training and evaluation process, including the loss/accuracy of the guest and the host for each round.
For example, I run the standalone logistic regression version and get all the logs: homo_lr_guest.log, homo_lr_host.log and homo_lr_arbiter.log.
Where should I look for the per-round performance of each guest? And for the host model's performance? Is there a good way to interpret the results and get metrics for the whole process (model distribution, guest training performance, model encryption, model merging, host training performance, etc.)?
Thanks!
In practice, we will do a lot of model training experiments and release some models to production. Then we may encounter some problems such as:
FATE ModelManager will solve the problems. At first, it provides these features:
In the future,
Google published their version of a TensorFlow FL stack. It seems to be aimed specifically at horizontal FL with a large number of guests, with models wrapped in Keras.
I'm wondering whether the FATE team is also looking into it, and what your thoughts are. Could it even be embedded into FATE as part of horizontal FL with deep learning?
Thanks!
Regression evaluation methods include:
1. mean_absolute_error
2. mean_squared_error
3. mean_squared_log_error
4. median_absolute_error
5. r2_score
6. root_mean_squared_error
7. Fix some formatting issues in the evaluation module.
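For reference, all of these metrics can be checked against scikit-learn (RMSE is just the square root of MSE); a small sketch, assuming numpy arrays of true and predicted values:

    import numpy as np
    from sklearn import metrics

    y_true = np.array([3.0, 0.5, 2.0, 7.0])
    y_pred = np.array([2.5, 0.0, 2.1, 7.8])

    print(metrics.mean_absolute_error(y_true, y_pred))
    print(metrics.mean_squared_error(y_true, y_pred))
    print(metrics.mean_squared_log_error(y_true, y_pred))
    print(metrics.median_absolute_error(y_true, y_pred))
    print(metrics.r2_score(y_true, y_pred))
    print(np.sqrt(metrics.mean_squared_error(y_true, y_pred)))  # root_mean_squared_error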
We need to develop a Mini-FederatedML task that serves as a test case for federated learning tasks after FATE is deployed.
Through this test case, users can confirm that FATE was deployed successfully and that they can run other federated learning tasks.
Hello WeBank team, I would like to ask whether there is any contact information for a project member (maintainer). I am very interested in this open-source project and would like to discuss it in detail.
Is your feature request related to a problem? Please describe.
Add statistics methods so that values such as the mean, median, and variance are easy to access. This is useful for feature engineering.
Describe the solution you'd like
Add a statistics module and provide the corresponding interface; a rough sketch follows.
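A minimal sketch of the kind of statistics such a module could expose, computed here with plain numpy over a local feature matrix (the distributed DTable version is omitted):

    import numpy as np

    def column_statistics(feature_matrix):
        # feature_matrix: shape (n_samples, n_features)
        x = np.asarray(feature_matrix, dtype=float)
        return {
            "mean": x.mean(axis=0),
            "median": np.median(x, axis=0),
            "variance": x.var(axis=0),
            "min": x.min(axis=0),
            "max": x.max(axis=0),
        }

    stats = column_statistics([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
    print(stats["mean"], stats["variance"])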
1. Add an Encode class.
2. RSA intersection will send encrypted intersection results.
3. RAW intersection's role can be configured.
4. RAW intersection can be configured per role as to whether to encode.
Is your feature request related to a problem? Please describe.
Secret sharing Scheme is a must-have for FATE project.
Describe the solution you'd like
Do R&D on implementing secret sharing operations, such as:
Once secret sharing operations have been created, then
This work does not need to be full-fledged for industrial applications. However, it should be able to help us create various secure federated learning algorithms/prototypes.
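For illustration, the sketch below shows the simplest building block, additive secret sharing over a prime field; it is a toy example, not a proposal for the concrete scheme FATE should adopt, and the field modulus is an assumed value.

    import secrets

    PRIME = 2 ** 61 - 1  # field modulus (assumed size, for illustration only)

    def share(secret, n_parties):
        # Split `secret` into n additive shares that sum to it modulo PRIME.
        shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
        shares.append((secret - sum(shares)) % PRIME)
        return shares

    def reconstruct(shares):
        return sum(shares) % PRIME

    shares = share(123456789, n_parties=3)
    assert reconstruct(shares) == 123456789
    # Additive homomorphism: the sum of two secrets can be reconstructed
    # from the element-wise sum of their shares.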
Is your feature request related to a problem? Please describe.
Referring to issue #45, step 2.
Describe the solution you'd like
Provide a cleanup API to enable batch data cleanup.
Describe alternatives you've considered
Other steps in #45
Additional context
Usage:
eggroll.cleanup(name, namespace, persistent=False)
name: name of the table; supports '*' as a wildcard.
namespace: exact match of the namespace.
persistent: False to clean up IN_MEMORY tables, True to clean up LMDB persistent tables.
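A usage sketch of the proposed API (the namespace value here just reuses the job id from the earlier reproduction example, and the "tmp_*" name pattern is hypothetical):

    from arch.api import eggroll

    # Clean up all in-memory tables in this namespace whose names start with "tmp_".
    eggroll.cleanup("tmp_*", "20190404.1425", persistent=False)

    # Clean up the matching LMDB persistent tables as well.
    eggroll.cleanup("tmp_*", "20190404.1425", persistent=True)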
For now, trained models are stored through the eggroll table-save API. That is fine for simple models such as LR, but for complex models such as CNNs it is tedious to store the model in an eggroll table via that API.
It might be better to add a higher-level eggroll API or other storage mechanisms for storing TensorFlow models, so that we can exploit TensorFlow's built-in model save/load API.
Currently, too much running time is spent on decryption, which accounts for 66% of the running time per iteration.
The cause of this issue is that we currently delegate the entire work of updating the local neural network model (calculating gradients and updating model parameters) to TensorFlow, in order to save engineering effort and support model extensibility. To let TensorFlow automatically update the local model, we need to feed it the plain gradients of all samples. That is, if we have 30,000 samples, we have 30,000 samples' worth of gradients. The bottleneck comes when we decrypt all 30,000 samples of gradients (which is also not safe in terms of data protection), consuming a lot of time on both computation and communication.
The solution to this issue is:
Use a protobuf object to save the model meta in order to support multi-language model loading.
The current examples support running with the train and cross_validation methods, but the predict method is not supported.
Also, I think there is no API to read out the saved prediction results.
It would be great if these features were added.
Note that Proxy.Packet (in arch/networking/proxy/src/main/java/com/webank/ai/fate/networking/proxy/grpc/client/DataTransferPipedClient.java, line 96) does one extra read from a pipe that is already drained, and the read function sets a timeout of 1 second to ensure the pipe has no packets. This may cause a one-second latency at the end of each transfer event.
This can be solved by adding an isDrained check before polling packets from the pipe.
I have tried adding the line "if (isDrained()) return result;" between lines 66 and 67 of PacketQueuePipe.java (in arch/networking/proxy/src/main/java/com/webank/ai/fate/networking/proxy/infra/impl/); the training examples work properly and the latency is removed.
An online service for serving federated learning models.
It should have the following features.
Is your feature request related to a problem? Please describe.
It's very hard to observe training progress when there is no dashboard or visualization of the whole process.
Describe the solution you'd like
Something TensorBoard-like, or at least on the same level as the Spark dashboard.
Describe alternatives you've considered
Nope
Is your feature request related to a problem? Please describe.
Add feature selection methods for federated learning.
Describe the solution you'd like
Add a new workflow for feature selection. Also, provide interfaces for the single-party case.
Visualize offline/online task lists, models, etc.
Support submitting a task and viewing task status.
sigmoid(x):
if x > 0, sigmoid(x) = 1 / (1 + exp(-x))
if x <= 0, sigmoid(x) = exp(x) / (1 + exp(x))
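A direct numpy translation of this numerically stable form (a sketch, not necessarily the exact FATE implementation):

    import numpy as np

    def stable_sigmoid(x):
        x = np.asarray(x, dtype=float)
        out = np.empty_like(x)
        pos = x > 0
        # For x > 0, exp(-x) cannot overflow.
        out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
        # For x <= 0, exp(x) cannot overflow.
        exp_x = np.exp(x[~pos])
        out[~pos] = exp_x / (1.0 + exp_x)
        return out

    print(stable_sigmoid([-1000.0, 0.0, 1000.0]))  # [0. 0.5 1.] with no overflow warnings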
Is your feature request related to a problem? Please describe.
Storage usage is growing quite fast.
Describe the solution you'd like
Describe alternatives you've considered
Precise auto cleanup later depending on dynamic runtime mechanisms.
Additional context
I will fork several sub-threads to track each milestone.
If you have any ideas on the storage issue, please reply. Thanks.
There is a route_table.json configuration file under the conf directory, but I don't know how to configure the guest, host, and arbiter. Can anyone give a more detailed route_table.json configuration demo?
We need to develop a toy example that serves as a quickstart example of FATE for users.
A toy example can also act as a test case for a successful deployment.