
CANOSP-2019 internship project meta-repo for the Private Federated Learning research project

License: Mozilla Public License 2.0


canosp-2019's Introduction


CANOSP-2019

This project implements a minimal server that accepts messages from clients and performs federated learning with differential privacy.

Getting Started

We are using Miniconda to manage the environment. It can be installed using one of the installers available here. For macOS, the bash installer is recommended.

Make sure to have conda init run during conda installation so that your PATH is set properly.

Installing locally to run tests

To install and run the tests for this project you can run:

# Set up the environment.
$ make setup_conda
$ conda activate mozfldp

# Run tests
$ make pytest

Running the server

You can run the server locally, serving requests on port 8000, using:

$ python -m mozfldp.server

Building a release

$ python setup.py sdist

Running from Docker

The server can also be built and run as a Docker container. First, install Docker.

Once you have Docker installed, you can build the container and run tests using:

$ make build_image
$ make docker_tests

To run the service in the container, use:

$ make up

Note that in the above command, we are exposing the container's port 8000 by binding it to port 8090 on the host computer.
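
(This port mapping corresponds to Docker's -p 8090:8000 flag: container port 8000 is published on host port 8090. The exact docker run invocation is defined in the Makefile.)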

Sending data to the server

You can submit arbitrary JSON blobs to the server using HTTP POST.

A sample curl invocation that will work is:

curl -X POST http://127.0.0.1:8000/api/v1/compute_new_weights

{"result":"ok","weights":[[[0.0,0.0,0.0,.... }

Note: if you are running locally, the port will be 8000; port 8090 is used if you are running in a Docker container.
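
The same request can be made from Python. A minimal sketch using the requests package (the empty JSON body is just illustrative, since the server accepts arbitrary blobs):

import requests

# Use port 8000 for a local run, 8090 when hitting the Docker container.
resp = requests.post("http://127.0.0.1:8000/api/v1/compute_new_weights", json={})
print(resp.json()["result"])  # "ok"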

canosp-2019's People

Contributors

bgluth, crankycoder, dzeber, jason-cooke, maharshmellow, mlopatka, shivansh2407, zhaoyuxuan


canosp-2019's Issues

Implement actual minibatch learning

The federated client update algorithm calls for splitting a client's data into "minibatches" of size B and updating the weights once per minibatch. The SGDClassifier doesn't in fact support this (there is an open issue) - it updates the weights once for each training example, which is what is traditionally meant by SGD, "stochastic" because it traverses the dataset in a random order (when shuffle=True).

Using partial_fit on a batch still causes one weight update per training example in the batch, so this isn't really getting us minibatch learning. The difference from fit is that partial_fit starts from the current state of the model and runs one weight update for each example in the supplied dataset, which may be different from the "main" dataset.

However, we can compute the minibatch update by averaging the pointwise updates across the minibatch. Given a minibatch of data x = (x_1, ..., x_k), we want to compute w = w_0 - \eta \nabla E(w_0, x). But since the error E on the minibatch is the average of the error on each individual training example, E(w, x) = \frac{1}{k} \sum_i E(w, x_i), we get the same result by computing pointwise updates w^{(i)} = w_0 - \eta \nabla E(w_0, x_i) (as is done at each step by the SGDClassifier) and averaging them: w = \frac{1}{k} \sum_i w^{(i)}. The difference is that the built-in method updates the weights progressively, whereas we want the initial w_0 to be the same for each i.

I think this can be done with an approach similar to the current version of client_update: for each training example in the minibatch, reset the weights to their value on entering the minibatch and run partial_fit on that single example, then average the resulting weights at the end of the minibatch. We may want to create a fresh classifier instance each time rather than keep reusing the same one, because if we reuse it, the internal step counter will keep incrementing. We should check the source code to see whether that has any unintended consequences.
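
As a minimal sketch of the averaging idea, in pure NumPy and using squared error on a single linear output (the function names are hypothetical, not part of the codebase):

import numpy as np

def pointwise_update(w0, x_i, y_i, eta):
    # One SGD step on a single example, for E(w, x_i) = 0.5 * (w @ x_i - y_i)**2.
    grad = (w0 @ x_i - y_i) * x_i
    return w0 - eta * grad

def minibatch_update(w0, X, y, eta):
    # Average the pointwise updates, each starting from the SAME w0.
    # By linearity this equals w0 - eta * (mean gradient over the minibatch).
    updates = [pointwise_update(w0, x_i, y_i, eta) for x_i, y_i in zip(X, y)]
    return np.mean(updates, axis=0)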

One outstanding issue is the choice of learning rate \eta. The default choice changes on each "time step", i.e. each per-example weight update. Let's pass this through as a param we handle ourselves: keep it constant for now, and maybe implement something adaptive later on.
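
For reference, scikit-learn supports a constant rate directly, which we could surface as a simulation param:

from sklearn.linear_model import SGDClassifier

# learning_rate="constant" disables the default per-step schedule;
# eta0 is then the fixed step size used for every update.
clf = SGDClassifier(learning_rate="constant", eta0=0.01)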

Implement a Client class

Currently the data is loaded and apportioned to the clients by server_update. It would be better, on setting up the runner, to instantiate multiple Client instances that each manage their own data. client_update could then be a method of Client, and the server piece would receive model updates from clients and combine them.
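
A rough sketch of the shape this could take (names and signatures are hypothetical):

import numpy as np

class Client:
    def __init__(self, client_id, features, labels):
        self.client_id = client_id
        self.features = features  # this client's private data
        self.labels = labels

    def client_update(self, init_weights, params):
        # Train locally starting from the server's current weights and
        # return the update; the server never sees the raw data.
        weights = np.array(init_weights)
        # ... run the local minibatch updates here (see #30) ...
        return weights

def server_update(clients, init_weights, params):
    # The server piece only combines the updates it receives.
    updates = [c.client_update(init_weights, params) for c in clients]
    return np.mean(updates, axis=0)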

Allow for customization of the SGDClassifier

We should be able to use custom settings for the SGDClassifier params and have these carry across the full simulation.

It may be worthwhile writing a "Classifier" wrapper class around SGDClassifier that we can instantiate at the beginning and pass in to the runner. This would handle setting params, and it could also be made to handle the minibatch update gymnastics described in #30 (e.g. a batch_update function which internally spawns clones of the SGDClassifier to compute the single-point updates and combines them); see the sketch below.
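
A minimal sketch of such a wrapper, assuming we use sklearn.base.clone to spawn the fresh copies:

from sklearn.base import clone
from sklearn.linear_model import SGDClassifier

class Classifier:
    def __init__(self, **sgd_params):
        # Owns the SGDClassifier params for the whole simulation.
        self._prototype = SGDClassifier(**sgd_params)

    def spawn(self):
        # Fresh, unfitted copy with identical params; sidesteps the
        # internal step counter issue from reusing one instance (#30).
        return clone(self._prototype)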

Tests

Certain portions of the project are still missing test coverage, in particular simulation_runner.py.

Outstanding error in Federated_Learning_Simulation notebook

Flagging that the issue with the notebook crashing, raised in #23, is still outstanding. The error is actually occurring in simulation_utils.client_update. It seems to be related to the steps that convert the parameters from coef/intercept to a combined list and back to separate np arrays.

We might want to refactor to keep coef and intercept separate everywhere to avoid having to do conversions, since this is what the underlying classifier uses.
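
To illustrate the kind of conversion involved (hypothetical helpers; shapes follow the classifier's coef_ (n_classes, n_features) and intercept_ (n_classes,) convention):

import numpy as np

def pack(coef, intercept):
    # Flatten coef and intercept into a single flat list.
    return np.concatenate([coef.ravel(), intercept]).tolist()

def unpack(flat, n_classes, n_features):
    # Inverse of pack; a shape mismatch here is the kind of bug
    # suspected in client_update.
    arr = np.asarray(flat, dtype=float)
    split = n_classes * n_features
    return arr[:split].reshape(n_classes, n_features), arr[split:]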
