openmined / pyvertical

Privacy Preserving Vertical Federated Learning

License: Apache License 2.0

Python 97.21% Shell 2.00% Dockerfile 0.79%
psi private-set-intersection split-neural-network splitnn vertical-federated-learning federated-learning partitioned-data

PyVertical's Introduction


PyVertical

A project developing privacy-preserving vertical federated learning using syft.

  • 🔗 Private entity resolution using Private Set Intersection (PSI)
  • 🔒 Trains a model on vertically partitioned data using SplitNNs, so only data holders can access data

Vertically-partitioned data is data in which fields relating to a single record are distributed across multiple datasets. For example, multiple hospitals may have admissions data on the same patients, or retailers may have transaction data on the same shoppers. Vertically-partitioned data could be used to solve vital problems, but data holders can't simply pool their datasets without breaking user privacy. PyVertical uses PSI to link datasets in a privacy-preserving way. We train SplitNNs on the partitioned data to ensure the data remains separate throughout the entire process.

See the changelog for information on the current status of PyVertical.

NOTE: PyVertical does not currently work with syft 0.3.0

The Process

PyVertical diagram

PyVertical process:

  1. Create partitioned dataset
    • Simulate real-world partitioned dataset by splitting MNIST into a dataset of images and a dataset of labels
    • Give each data point (image + label) a unique ID
    • Randomly shuffle each dataset
    • Randomly remove some elements from each dataset
  2. Link datasets using PSI
    • Use PSI to link indices in each dataset using unique IDs
    • Reorder datasets using linked indices
  3. Train a split neural network
    • Hold both datasets in a dataloader
    • Send images to first part of split network
    • Send labels to second part of split network
    • Train the network
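
The three steps above, sketched in plain Python. This is purely illustrative: real PyVertical performs the linking step with the OpenMined PSI library rather than a cleartext set intersection, and the partitions live with separate syft workers.

```python
import random
import uuid

# 1. Create a partitioned dataset: images and labels, each tagged with a shared unique ID
ids = [uuid.uuid4().hex for _ in range(10)]
images = {uid: f"image-{n}" for n, uid in enumerate(ids)}
labels = {uid: n % 2 for n, uid in enumerate(ids)}

# Randomly shuffle each dataset and drop some elements from each
image_ids = random.sample(list(images), k=8)
label_ids = random.sample(list(labels), k=8)

# 2. "Link" the datasets: in PyVertical this is done privately with PSI
shared = [uid for uid in image_ids if uid in set(label_ids)]

# Reorder both datasets using the linked IDs
linked_images = [images[uid] for uid in shared]
linked_labels = [labels[uid] for uid in shared]

# 3. The aligned (image, label) pairs can now feed the two halves of a SplitNN
for img, lbl in zip(linked_images, linked_labels):
    print(img, lbl)
```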

Requirements

OS

Windows Linux MacOS
❌ ✔️ ✔️

PyTorch 1.4.0 on Windows has known issues, and it cannot be upgraded to a working version until syft is updated as well.

Python

3.6 3.7 3.8 3.9
✔️ ✔️ ✔️ ❌

syft and PSI upstream dependencies do not have Python 3.9 packages.

PyTorch Environment

To install the dependencies, we recommend using Conda:

  1. Clone this repository
  2. In the command line, navigate to your local copy of the repository
  3. Run conda env create -f environment.yml
    • This creates an environment pyvertical-dev
    • Comes with most dependencies you will need
  4. Activate the environment with conda activate pyvertical-dev
  5. Run conda install notebook

N.B. Installing the dependencies takes several steps to circumvent a versioning incompatibility between syft and jupyter. In the future, all packages will be moved into environment.yml.

Tensorflow Environment

To install the dependencies, we recommend using Conda:

  1. Clone this repository
  2. In the command line, navigate to your local copy of the repository
  3. Run conda env create -f tf_environment.yml
    • This creates an environment pyvertical-dev-tf
    • Comes with most dependencies you will need
  4. Activate the environment with conda activate pyvertical-dev-tf
  5. Run conda install notebook

Docker

You can instead opt to use Docker.

To run:

  1. Build the image with docker build -t pyvertical:latest .
  2. Launch a container with docker run -it -p 8888:8888 pyvertical:latest
    • Defaults to launching jupyter lab

Synthea

PyVertical uses fake medical data generated by synthea to demonstrate multi-party, vertical federated learning. Read the synthea docs for the requirements to generate the data. With those prerequisites installed, run the scripts/download_synthea.sh bash script from the root directory of this project; it generates a deterministic dataset and stores it in data/synthea.

Usage

Check out examples/PyVertical Example.ipynb to see PyVertical in action.

Goals

  • MVP
    • Simple example on MNIST dataset
    • One data holder has images, the other has labels
  • Extension demonstration
    • Apply process to electronic health records (EHR) dataset
    • Dual-headed SplitNN: input data is split amongst several data holders
  • Integrate with syft

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Read the OpenMined contributing guidelines and styleguide for more information.

Contributors

TTitcombe Pavlos-p H4LL rsandmann daler3

Testing

We use pytest to test the source code. To run the tests manually:

  1. In the command line, navigate to the root of this repository
  2. Run python -m pytest

CI also checks that the code is formatted according to the contributing guidelines.

Publications

Romanini, D., Hall, A. J., Papadopoulos, P., Titcombe, T., Ismail, A., Cebere, T., Sandmann, R., Roehm, R. & Hoeh, M. A. (2021). PyVertical: A Vertical Federated Learning Framework for Multi-headed SplitNN. arXiv preprint arXiv:2104.00489. (link)

Angelou, N., Benaissa, A., Cebere, B., Clark, W., Hall, A. J., Hoeh, M. A., Liu, D., Papadopoulos, P., Roehm, R., Sandmann, R., Schoppmann, P. & Titcombe, T. (2020). Asymmetric Private Set Intersection with Applications to Contact Tracing and Private Vertical Federated Machine Learning. arXiv preprint arXiv:2011.09350. (link)

You can cite this work using:

@article{romanini2021pyvertical,
    title={PyVertical: A Vertical Federated Learning Framework for Multi-headed SplitNN},
    author={Romanini, Daniele and Hall, Adam James and Papadopoulos, Pavlos and Titcombe, Tom and Ismail, Abbas and Cebere, Tudor and Sandmann, Robert and Roehm, Robin and Hoeh, Michael A},
    journal={arXiv preprint arXiv:2104.00489},
    year={2021}
}

@article{angelou2020asymmetric,
    title={Asymmetric Private Set Intersection with Applications to Contact Tracing and Private Vertical Federated Machine Learning},
    author={Angelou, Nick and Benaissa, Ayoub and Cebere, Bogdan and Clark, William and Hall, Adam James and Hoeh, Michael A and Liu, Daniel and Papadopoulos, Pavlos and Roehm, Robin and Sandmann, Robert and others},
    journal={arXiv preprint arXiv:2011.09350},
    year={2020}
}

License

Apache License 2.0

PyVertical's Issues

Remove hard-coded SplitNN

Description

There is a hardcoded SplitNN in the source code. This should be removed.

Are you interested in working on this improvement yourself?

  • Yes, I am, but anyone else is free to take this

Additional Context

We should remove this before we think about publishing the first PyVertical pip package

Add Windows and MacOS tests

Description

Add test builds on Windows and MacOS. PyVertical involves a code compilation step (PSI) which varies from system to system, so it is important we test all common OSs to catch bugs.

To save time, only test Python 3.7 on each new OS

Type of Test

  • Unit test (e.g. checking a loop, method, or function is working as intended)
  • Integration test (e.g. checking if a certain group or set of functionality is working as intended)
  • Regression test (e.g. checking if by adding or removing a module of code allows other systems to continue to function as intended)
  • Stress test (e.g. checking to see how well a system performs under various situations, including heavy usage)
  • Performance test (e.g. checking to see how efficient a system is as performing the intended task)
  • Other...

Expected Behavior

Opening a PR should run tests on:

  • Ubuntu 18.04
    • Python 3.6, 3.7, 3.8 (current behaviour)
  • Windows (latest version available on github CI)
    • Python 3.7
  • MacOS (latest version available on github CI)
    • Python 3.7

Integrate PSI with workers

Question

The current PyVertical process requires a user to collect IDs from the data holders to compute the set intersection. The data holders must then sort the remaining IDs they hold in the same format. We should integrate the PSI process more closely into vertically partitioned datasets. How we do this is an open question. We must consider:

  • a process for data holders to agree on for sorting IDs
  • a method for data holders to communicate IDs to a third party so the set intersection can be computed
  • a method to confirm that data holder IDs are/can become strings
    • PSI requires strings

Additional context

Part of #28

Dual-headed SplitNN

Feature Description

Implement a dual-headed splitNN. Each head takes some data as input and computes some representation of the data. The two intermediate vectors are combined and the rest of the network computes on the combined data. We should apply this network to Synthea medical data (#40). We need to split the data in a way which makes sense in a real setting.

When training this model, don't worry about the PSI process to link data entities. We will remove/jumble datasets to build the story at a later stage.
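
A minimal PyTorch sketch of the dual-headed idea. Layer sizes, feature counts, and class names here are illustrative assumptions, and the syft machinery for distributing the model parts is omitted:

```python
import torch
import torch.nn as nn

class Head(nn.Module):
    """One data holder's local model: raw features -> intermediate vector."""
    def __init__(self, in_features, out_features=16):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(in_features, out_features), nn.ReLU())

    def forward(self, x):
        return self.layers(x)

class Trunk(nn.Module):
    """Server-side model: combined intermediate vectors -> prediction."""
    def __init__(self, in_features, n_classes=2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features, 32), nn.ReLU(), nn.Linear(32, n_classes)
        )

    def forward(self, x):
        return self.layers(x)

# Two holders with different feature sets; sizes are arbitrary for the sketch
head_a, head_b = Head(in_features=10), Head(in_features=5)
trunk = Trunk(in_features=16 + 16)

x_a, x_b = torch.randn(4, 10), torch.randn(4, 5)
combined = torch.cat([head_a(x_a), head_b(x_b)], dim=1)
logits = trunk(combined)  # loss.backward() would push gradients into both heads
```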

Is your feature request related to a problem?

In many real-world situations, input data is split across data holders. The current implementation of SplitNN takes input data from only one source, and the other data holder is expected to hold the labels.

Make generalised dataset splitter functions (PyTorch)

Feature Description

  • Create functions which split PyTorch datasets into separate datasets
  • Should work for image and non-image datasets
  • Functions should apply random IDs to datapoints
  • Unit test the functions

"Splitting" in this context means to split input features into two separate datasets. For images, split them top/bottom (a further issue will look to extend this)

Is your feature request related to a problem?

We should provide utility code to make it easy for people to turn non-vertically federated datasets into vertically federated ones, for experimental purposes.

We currently have some code built for this task, but it is not generalisable to a wide range of datasets

What alternatives have you considered?

  • Don't provide generalisable code: This is okay for initial experimentation, but PyVertical should be a widely usable package for VFL

Define project workflow

Where?

Documentation on the whole project. Store in a docs folder

Who?

The document will be for technical members, as a reference tool for when contributing to the project

What?

A diagram which outlines what the full PyVertical workflow will look like:

  • How data is partitioned
  • How/what data is supplied to PSI code
  • What PSI returns (indices of matching data, or ordered datasets)
  • How data is sent to SplitNN

Create Dockerfile

Feature Description

Create a Dockerfile which installs all requirements, including the local PyVertical code, and contains a clone of the PyVertical repo. A user should be able to run jupyter notebooks from within the image.

Is your feature request related to a problem?

PyVertical developers have had several problems compiling the PSI code on various operating systems. Having a Dockerfile would allow users to develop (and use) PyVertical more easily.

What alternatives have you considered?

None

Create partition function for federated datasets

Feature Description

Create a function dataset_partition which partitions a dataset, sends the partitioned datasets to the correct worker, and returns a syft.fl.FederatedDataset of partitioned datasets. This builds on the current partition_dataset function in PyVertical, and is similar to syft.fl.dataset_federate

Additional Context

Depends on #28
Blocked by #47

PSI

Description

Apply Private Set Intersection (PSI) to link vertically split data from multiple sources

Why?

To complete this Epic, we need to generate datasets which simulate a real-world setting: datapoints not shared by both datasets, data appearing in random orders, IDs which link data. We then apply PSI to re-link the data

Breakdown

  • Jumble data #7
  • Add unique IDs to data #8
  • Randomly missing data #9
  • Add PSI code #13
  • Apply PSI to re-link data #34

Who else?

May require changes to https://github.com/OpenMined/PSI

Train Model on Synthea data

Feature Description

Train a machine learning model (not a split network) on synthea data to solve some task.
Create a notebook which demonstrates the performance of this model.

Synthea data can be fairly large, so we should find the minimum number of patients we need to develop a useful model. We're just trying to demonstrate a concept, not cure a disease.

Is your feature request related to a problem?

Before we attempt to develop a model on vertically-partitioned synthea data, we should confirm that a model can be trained successfully.

This model will also set the baseline so we can compare performance drop caused by privacy-preserving vertical federated learning

What alternatives have you considered?

None

Additional Context

An existing solution we could use is diabetes prediction https://github.com/IBM/example-health-machine-learning/blob/master/diabetes-prediction.ipynb

Requires #58 to be complete

Vertically Partitioned Data Loader

Feature Description

Develop a torch dataloader which holds two vertically partitioned sets of the same data.
This dataloader must send the parts of the data to the correct location of their corresponding network
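
A minimal sketch of the idea (class name is hypothetical), assuming both partitions are already aligned; a real implementation would additionally send each batch to its holder's worker:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

class TwoPartitionLoader:
    """Iterate two aligned partitions together, yielding matching batches.

    shuffle=False keeps the partitions aligned batch-by-batch; a real
    implementation would draw one shared permutation instead.
    """
    def __init__(self, data_dataset, label_dataset, batch_size=32):
        self.data_loader = DataLoader(data_dataset, batch_size=batch_size, shuffle=False)
        self.label_loader = DataLoader(label_dataset, batch_size=batch_size, shuffle=False)

    def __iter__(self):
        return zip(self.data_loader, self.label_loader)

data_part = TensorDataset(torch.randn(100, 8))
label_part = TensorDataset(torch.randint(0, 2, (100,)))
for (x,), (y,) in TwoPartitionLoader(data_part, label_part):
    pass  # send x to the first model part, y to the second
```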

Additional Context

Part of EPIC #3
Dependent on #6

Add mypy type checking in CI

Description

Add a mypy type check in CI script

Type of Test

  • Unit test (e.g. checking a loop, method, or function is working as intended)
  • Integration test (e.g. checking if a certain group or set of functionality is working as intended)
  • Regression test (e.g. checking if by adding or removing a module of code allows other systems to continue to function as intended)
  • Stress test (e.g. checking to see how well a system performs under various situations, including heavy usage)
  • Performance test (e.g. checking to see how efficient a system is as performing the intended task)
  • Other...

Expected Behavior

mypy type checking should be run as part of the black/linting stage of the CI script. The build should fail if mypy does, but if there are too many existing issues, just set it to warn and open a new issue to reduce mypy errors

Make pip-installable

Feature Description

Make PyVertical pip-installable.
This should happen after the integration of pyvertical with pysyft, when we have a clearer idea of pyvertical-specific functionality

Is your feature request related to a problem?

A pip-installable package would allow people to use pyvertical in their own projects

What alternatives have you considered?

None

Additional Context

None

Simple PSI demonstration

Feature Description

Implement a function that takes two arrays of unique IDs and returns indices of matching IDs.
These arrays do not need to have missing elements, but they must have different orders (otherwise PSI would not be necessary)
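
A cleartext stand-in for such a function. Note both ID lists are visible to one party here, which is exactly what the PSI-based version avoids:

```python
from typing import List, Tuple

def match_ids(ids_a: List[str], ids_b: List[str]) -> Tuple[List[int], List[int]]:
    """Return, for each array, the indices whose IDs appear in both."""
    position_in_b = {uid: i for i, uid in enumerate(ids_b)}
    indices_a, indices_b = [], []
    for i, uid in enumerate(ids_a):
        if uid in position_in_b:
            indices_a.append(i)
            indices_b.append(position_in_b[uid])
    return indices_a, indices_b

# match_ids(["u1", "u2", "u3"], ["u3", "u1"]) -> ([0, 2], [1, 0])
```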

Is your feature request related to a problem?

This will be a first demonstration that PSI can be used to link vertically partitioned data.
The next step is to integrate it into the pipeline and use it to re-order the partitioned datasets

Additional Context

Part of PSI Epic #2
Dependent on #13

Add unique identifier to data

Feature Description

Update the data partitioning function to apply a unique ID to each datapoint (shared across datasets).
The simplest solution would be an integer ID, however (fake) emails or names would also be acceptable to simulate data subjects
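
A sketch of the simplest approach, using a hypothetical add_ids helper and UUIDs as the shared identifiers:

```python
import uuid

def add_ids(datapoints):
    """Attach a shared unique ID to each (image, label) pair before the
    partitions are separated. `add_ids` is a hypothetical helper name."""
    ids = [uuid.uuid4().hex for _ in datapoints]
    image_partition = [(uid, img) for uid, (img, _) in zip(ids, datapoints)]
    label_partition = [(uid, lbl) for uid, (_, lbl) in zip(ids, datapoints)]
    return image_partition, label_partition
```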

Is your feature request related to a problem?

We need some way of linking datapoints across datasets

Additional Context

Part of EPIC #2
Dependent on #4

Create syft-like federated dataloader

Feature Description

Replace the existing dataloader with a dataloader which takes a syft.fl.FederatedDataset. This should extend syft.fl.FederatedDataLoader to account for datasets which may not contain data or targets

Additional Context

Part of #28

Development of SplitNN as a 2d array of Models.

This allows for generic logic which can be used to accommodate arbitrary input/label tensor partitions.

Currently we treat the SplitNN as a 1d array of models. This allows us to perform horizontal splits in the model.

When this works, we should be able to use the same, standard class for any data/ label distribution

Include PSI build in project

Feature Description

Add relevant Python code from PSI project to the project.
Include relevant documentation for compiling/running in README

Additional Context

Part of Epic #2

Improve "dataset.py" to reduce mypy errors

Description

Improve "src/dataset.py" file to reduce mypy check errors.

Additional Information

Mypy check reports 10 errors in "/src/dataset.py" file, as follows:

src/dataset.py:95: error: "Dataset" has no attribute "targets"
src/dataset.py:96: error: "Dataset" has no attribute "data"
src/dataset.py:99: error: Argument 1 to "len" has incompatible type "Dataset"; expected "Sized"
src/dataset.py:100: error: Argument 1 to "len" has incompatible type "Dataset"; expected "Sized"
src/dataset.py:104: error: Argument 1 to "len" has incompatible type "Dataset"; expected "Sized"
src/dataset.py:105: error: Argument 1 to "len" has incompatible type "Dataset"; expected "Sized"
src/dataset.py:111: error: "Dataset" has no attribute "data"
src/dataset.py:112: error: "Dataset" has no attribute "ids"
src/dataset.py:114: error: "Dataset" has no attribute "targets"
src/dataset.py:115: error: "Dataset" has no attribute "ids"
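
One possible way to clear these errors, sketched below (an option, not necessarily the fix that will be chosen): annotate the functions with a Protocol describing what they actually need, instead of the bare Dataset. Note that typing.Protocol requires Python 3.8+; older Pythons need typing_extensions.

```python
from typing import Any, Protocol  # use typing_extensions.Protocol on Python < 3.8

class SizedLabelledDataset(Protocol):
    """Structural type for what src/dataset.py actually requires."""
    data: Any
    targets: Any
    ids: Any

    def __len__(self) -> int: ...

# Annotating parameters as SizedLabelledDataset (rather than Dataset) lets
# mypy accept len(dataset), dataset.data, dataset.targets and dataset.ids.
```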

Integrate with syft

What?

The current implementation of data loaders and datasets is quite hacky.
We should integrate existing syft functionality and extend it to make Vertically-partitioned dataset
a robust class, making it easy for anyone to apply PyVertical to any dataset

Breakdown

  • Build on syft.fl.BaseDataset to create a dataset which holds partitions and may hold either data or targets. This should extend PyVertical's VerticalDataset to include syft functionality of ownership #47
  • Create a function dataset_partition which partitions a dataset, sends the partitioned datasets to the correct worker, and returns a syft.fl.FederatedDataset of partitioned datasets. This builds on the current partition_dataset function in PyVertical, and is similar to syft.fl.dataset_federate #48
  • Replace PartitionDistributingDataLoader with a dataloader which takes a syft.fl.FederatedDataset. This should extend syft.fl.FederatedDataLoader to account for datasets which may not contain data or targets #49
  • Integrate with PSI #50
  • Encrypt unique IDs #54

Additional Context

This will be developed simultaneously with the extended PyVertical demonstration (#25), so, to avoid breaking changes, existing dataloaders/data splitters should be kept until this issue is complete

Add Contributors to README

Where?

Contributors section of README

Who?

Demonstrate to everyone reading the doc the importance of community contributions

What?

Add a section highlighting all contributors to the project. This can be done manually, but ideally we would have a bot to add new contributors.

See the README template for more information

Simple Vertically Partitioned Model

Description

Implement a simple SplitNN using syft which trains on vertically split data provided by multiple data holders

Why?

This may be the first open source implementation of a split model trained on vertically partitioned data. Training the model successfully is a complex process, which requires a custom data pipeline and model architecture.

This epic is the first stage of the project: The data will be ordered (no need for PSI) and we will send data labels away from the data holders

Breakdown

Provide a bulleted or numbered list of how you might break this epic down into smaller issues.

  • Implement function to vertically split dataset (2 data holders/ 1 split) #4
  • Implement vertically split data loader #5
  • Develop vertically split SplitNN #6

Who else?

May require changes in PySyft

Documentation and source code do not match

Description

The code in PyVertical/examples/PyVertical Example.ipynb is outdated: it calls the function compute_psi, which is marked as deprecated in src/psi/util.py.

Find EHR dataset

Question

Find an Electronic Health Record (EHR) dataset for use in the next stage of PyVertical

Further Information

In the next stage of PyVertical, we will apply the concept to a dataset more complex than MNIST.
To demonstrate the utility of the technology, it would be useful to use a dataset with obvious links to real-world applications, yet still simple enough that we can quickly develop a useful product.

It was decided to use EHR data. Some possibilities are:

All suggestions are welcome

Re-add coverage tests

Description

Coverage tests were removed in #23 as there was some issue with the bazel build.
This was a temporary solution to get the PR merged. We should re-add the coverage tests.

Type of Test

  • Unit test (e.g. checking a loop, method, or function is working as intended)
  • Integration test (e.g. checking if a certain group or set of functionality is working as intended)
  • Regression test (e.g. checking if by adding or removing a module of code allows other systems to continue to function as intended)
  • Stress test (e.g. checking to see how well a system performs under various situations, including heavy usage)
  • Performance test (e.g. checking to see how efficient a system is as performing the intended task)
  • Other... Code coverage

Expected Behavior

The CI build should have a task for code coverage. Code coverage fails if baseline statistics are not met (see #27)

Function to vertically partition data

Feature Description

A function which takes a torch dataset, e.g. torchvision.datasets.MNIST and outputs two datasets, each holding a deterministic partition of the dataset (i.e. the first returned dataset always corresponds to the top half of the images)

The datapoints should remain ordered (i.e. so PSI is not required)
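
A sketch of such a function, under the assumption that iterating the dataset yields (image_tensor, label) pairs (e.g. MNIST with a ToTensor transform):

```python
def partition_vertically(dataset):
    """Split each image into top and bottom halves, keeping order so the
    i-th element of both outputs refers to the same record.

    Sketch only: assumes iterating `dataset` yields (image_tensor, label)
    pairs with shape (C, H, W).
    """
    tops, bottoms, labels = [], [], []
    for img, lbl in dataset:
        half = img.shape[-2] // 2           # split along the height axis
        tops.append(img[..., :half, :])     # top half of the image
        bottoms.append(img[..., half:, :])  # bottom half of the image
        labels.append(lbl)
    return tops, bottoms, labels
```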

Is your feature request related to a problem?

To provide vertically partitioned data to the SplitNN, we need an efficient and easy way to partition datasets

Additional Context

This is part of EPIC #3

Remove syft dependency for dataloaders

What?

Remove the syft dataset/dataloader dependency from the VerticalDataset/loaders. Inherit instead from torch dataset/loader

Why?

Those classes are not present in syft >= 0.3.0

2-party Asymmetric learning

Feature Description

Implement an asymmetric learning protocol when calculating the ID intersection between parties.
See this paper for more information

Is your feature request related to a problem?

Asymmetric learning is the case where one of the parties in vertical federated learning has the majority of data IDs.
The major party can learn a great deal about the individuals/entities the minor party holds data on, but the minor party
learns almost nothing about the major party's dataset.

Protocols to protect both parties in this scenario include obscuring the intersection of data IDs by adding random IDs to the set sent to each party.
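
A sketch of the obscuring idea; the dummy-ID scheme and sizes are assumptions for illustration, not a vetted protocol:

```python
import secrets

def pad_with_dummies(ids, target_size):
    """Append random dummy IDs so the set size (and hence the intersection
    fraction) is obscured before PSI runs. Sketch of the idea only: a real
    protocol must ensure dummies cannot collide with genuine IDs."""
    dummies = [secrets.token_hex(16) for _ in range(max(0, target_size - len(ids)))]
    return list(ids) + dummies

# minor_ids = pad_with_dummies(minor_ids, target_size=len(major_ids))
```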

What alternatives have you considered?

None

Additional Context

This may need to be implemented upstream by the PSI team.

This issue should be worked on after:

  • Integration with syft (i.e. we have worker-to-worker communication in place)
  • Robust PSI strategy (securely sending IDs to and from a computational server)

Open questions:

  • Should we always do an obscuring method?
  • If not, what is the determining factor?
  • Should workers be able to agree on using/not using an asymmetric protocol?

Pin PSI version

Description

Update the WORKSPACE file to point to a specific version of the PSI repo (which has been verified to work). Currently we point to master, so a new bug there may break PyVertical

Encrypt IDs

Feature Description

Encrypt IDs before they are sent to PSI. Data holders must agree on an encryption protocol so that PSI still works
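
One possible scheme, sketched with a shared-key keyed hash (an illustrative assumption, not necessarily the protocol PyVertical will adopt): equal raw IDs map to equal blinded IDs, so PSI still matches them, but the raw values never leave the holders.

```python
import hashlib
import hmac

def blind_id(raw_id: str, shared_key: bytes) -> str:
    """Keyed hash of a raw ID. If all data holders agree on shared_key,
    equal raw IDs map to equal blinded IDs, so PSI still matches them,
    while the raw value (e.g. an e-mail) is never transmitted."""
    return hmac.new(shared_key, raw_id.encode(), hashlib.sha256).hexdigest()

# Every holder blinds with the same agreed key before the PSI step
blinded = blind_id("alice@example.com", shared_key=b"agreed-secret")
```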

Is your feature request related to a problem?

In the MVP, unique IDs attached to each data point are sent as raw strings. In real-world settings, unique IDs can expose user identity (e.g. could be an e-mail).

Move linting script into `test.yml`

Description

Currently, the linting/type checking/formatting checks performed by CI are housed in a separate script, lint_python.sh, which is called by the test.yml test script. While this keeps the test script clean, it makes it difficult to work out which check is causing a failure.

We should move the checks from lint_python.sh into test.yml, so we can easily see what's causing failures

Type of Test

  • Unit test (e.g. checking a loop, method, or function is working as intended)
  • Integration test (e.g. checking if a certain group or set of functionality is working as intended)
  • Regression test (e.g. checking if by adding or removing a module of code allows other systems to continue to function as intended)
  • Stress test (e.g. checking to see how well a system performs under various situations, including heavy usage)
  • Performance test (e.g. checking to see how efficient a system is as performing the intended task)
  • Other... Change to code linting test script

Expected Behavior

In the linting stage of the CI logs, each command (e.g. flake8, black, mypy) should be a separate, collapsible command

Apply PSI to SplitNN training example

Feature Description

Apply the PSI code to the SplitNN notebook to re-link vertically partitioned data. In its current state, the partitioned data is ordered, so PSI is unnecessary; change the arguments passed to the partition function to shuffle the partitioned data.

  • Change the parameters passed to the partition function so partitioned data is unordered
  • Demonstrate in the notebook that data is unordered i.e. print part of the data (this is not functional, but it helps to tell the story of what we're trying to achieve)
  • Use the PSI code to re-order the data
  • Train the SplitNN
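
A sketch of the re-ordering step, assuming the PSI step yields the matching indices (cf. the match_ids sketch in the "Simple PSI demonstration" issue above):

```python
def reorder(dataset, linked_indices):
    """Keep only the records found in the intersection, in the shared order.
    `linked_indices` would come from the PSI step."""
    return [dataset[i] for i in linked_indices]

# idx_images, idx_labels = match_ids(image_ids, label_ids)
# images, labels = reorder(images, idx_images), reorder(labels, idx_labels)
# ...then train the SplitNN on the now-aligned partitions
```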

Additional Context

Incorporating PSI into the SplitNN training example will finish the PoC
Do not work on this issue until #32 refactor is complete

Extend vertical partitioned demonstration

Description

Extend the MVP (partitioning MNIST into images and labels) to work on arbitrary vertically partitioned datasets

Why?

The dataset/dataloader/data partitioning/splitNN architecture/PSI functions are coded assuming the provided data is MNIST and the partitioning function splits images and labels. Fortunately, the real world has more data than just MNIST. In this epic we will generalise the code to work with many datasets and partitions

Breakdown

IN PROGRESS

Who else?

May require work in PySyft and PSI

Additional Context

Should be completed after #2 and #3

Extend syft federated datasets

Feature Description

Build on syft.fl.BaseDataset to create a dataset which holds partitions and may hold data and/or targets. This should extend PyVertical's VerticalDataset to include syft functionality of ownership

Additional Context

Part of #28

Bazel builds failing

Description

The bazel build stage of the test script is failing. See https://github.com/OpenMined/PyVertical/pull/39/checks?check_run_id=851670305 and https://github.com/OpenMined/PyVertical/runs/851664851?check_suite_focus=true

Multiple PRs are failing, and one of the PRs had previously passed the bazel build stage on an earlier day, so it looks like this is caused by an upstream change. Investigate whether or not we can do anything about it.

I have successfully built locally on Ubuntu 18.04, Python 3.7. Perhaps it's an issue with github CI's bazel?

How to Reproduce

  1. Look at failing CI tests on PRs

Expected Behavior

The builds pass

Screenshots

NA

System Information

  • OS: Ubuntu
  • OS Version: 18.04
  • Language Version: Python 3.6, 3.7, 3.8

Jumble vertically partitioned data

Feature Description

Update the function which vertically partitions datasets to jumble datapoints, so data indices are different for the datasets.

Is your feature request related to a problem?

For PSI to be necessary, data must be jumbled

Additional Context

Part of EPIC #2
Dependent on #4

Partition dataset by images/labels

Feature Description

Change the data partitioning function from splitting an image into top half/bottom half into (i) one dataset holding images and (ii) the other dataset holding labels

Is your feature request related to a problem?

The current data partitioning system, implemented in #10, is more complex than it needs to be.
Changing the partitioning as described in this issue still demonstrates the core capability of PyVertical (training a model on vertically partitioned data), but is a simpler problem to solve, as we do not need to train a model using two-headed input

Test for isort compatibility

Description

Codebase should adhere to the OpenMined styleguide. This includes correct ordering of imports using isort. We should add isort as a developer requirement and test for isort compatibility in the unit tests. Look at PySyft for the correct isort config

Type of Test

  • Unit test (e.g. checking a loop, method, or function is working as intended)
  • Integration test (e.g. checking if a certain group or set of functionality is working as intended)
  • Regression test (e.g. checking if by adding or removing a module of code allows other systems to continue to function as intended)
  • Stress test (e.g. checking to see how well a system performs under various situations, including heavy usage)
  • Performance test (e.g. checking to see how efficient a system is as performing the intended task)
  • Other...

Expected Behavior

Unit tests should include a test for isort compatibility, which passes

Additional Context

None

Vertically Partition data in SplitNN example

Feature Description

The SplitNN example notebook currently uses a normal dataset and dataloader.
Update the notebook to use partition_dataset and PartitionDistributingDataLoader

To partition the dataset without requiring PSI, call partition_dataset with keep_order=True and remove_data=False
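
Hypothetical usage based on the names in this issue (partition_dataset, PartitionDistributingDataLoader, keep_order, remove_data); the exact import paths and signatures may differ in the PyVertical source:

```python
from torchvision import datasets, transforms

mnist = datasets.MNIST(".", download=True, transform=transforms.ToTensor())

# from src.dataset import partition_dataset
# from src.dataloader import PartitionDistributingDataLoader
#
# # keep_order=True and remove_data=False keep the partitions aligned,
# # so the notebook can train without a PSI linking step
# images, labels = partition_dataset(mnist, keep_order=True, remove_data=False)
# dataloader = PartitionDistributingDataLoader(images, labels, batch_size=128)
```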

Randomly failing PSI tests

Description

Some PSI tests are randomly failing. See https://github.com/OpenMined/PyVertical/pull/51/checks?check_run_id=872479981 for an example log.


Type of Test

  • Unit test (e.g. checking a loop, method, or function is working as intended)
  • Integration test (e.g. checking if a certain group or set of functionality is working as intended)
  • Regression test (e.g. checking if by adding or removing a module of code allows other systems to continue to function as intended)
  • Stress test (e.g. checking to see how well a system performs under various situations, including heavy usage)
  • Performance test (e.g. checking to see how efficient a system is as performing the intended task)
  • Other...

Expected Behavior

Tests should not randomly fail

Additional Context

This may be an issue with the upstream PSI dependency

Differentially-private intermediate tensors example

Feature Description

Develop an example which applies differential privacy to the tensors output from a local user's model before they are sent to the computational server.

Is your feature request related to a problem?

Non-DP tensors could allow a colluding computational server and data holder (in the case of multiple data holders) to identify a user's data using a model inversion attack

What alternatives have you considered?

None

Additional Context

Vertical federated learning reading list

Where?

PyVertical/READING_LIST.md

Who?

Anyone looking to learn more about vertical federated learning, from beginners to ml experts

What?

List of papers, blog posts and any other materials which explain or apply vertical federated learning.
Add a note to the readme which points readers to the reading list.

Additional Context

Some papers to consider:

Merge dataloader with PSI code

What?

Small refactor to dataloader code to integrate functionality introduced by PSI, which is currently split across multiple dataloaders

Dependent on #23

Why?

See conversation in #31 for details

Breakdown

  • Move PSI functionality from PartitioningDistributingDataLoader to NewDataLoader
  • Remove code for PartitioningDistributingDataLoader
  • Rename NewDataLoader to VerticalDataLoader and the existing VerticalDataLoader to SinglePartitionDataLoader

Simple Vertically partitioned SplitNN

Feature Description

Develop a SplitNN with a single input head, using syft to send the model parts to different devices.

See the PySyft tutorials for a base SplitNN; we may be able to use this SplitNN exactly

Place the model in src/splitnn.py

Additional Context

Part of EPIC #3

PSI builds failing

Description

I have run into some problems running build-psi.sh. The error information is as follows:

INFO: Repository rule 'org_openmined_psi' returned: {"remote": "https://github.com/OpenMined/PSI", "commit": "ca2866e3c70a018ed9a5a2e7af5bcbd31dc49df9", "shallow_since": "2020-14-07", "init_submodules": True, "verbose": False, "strip_prefix": "", "patches": [], "patch_tool": "patch", "patch_args": ["-p0"], "patch_cmds": [], "name": "org_openmined_psi"}
ERROR: /home/tanghengzhu/.cache/bazel/_bazel_tanghengzhu/9b06aa84c50b71eaea609540d6aa319b/external/org_openmined_psi/private_set_intersection/javascript/toolchain/cc_toolchain_config.bzl:171:17: name 'CcToolchainConfigInfo' is not defined
ERROR: error loading package '': Extension 'private_set_intersection/javascript/toolchain/cc_toolchain_config.bzl' has errors
ERROR: error loading package '': Extension 'private_set_intersection/javascript/toolchain/cc_toolchain_config.bzl' has errors
INFO: Elapsed time: 8.584s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)

How to Reproduce

  1. Installed bazel
  2. Run the build script .github/workflows/scripts/build-psi.sh

Expected Behavior

This should generate a _psi_bindings.so file and place it in src/psi/.

System Information

  • OS Version: Linux sigmaster 3.10.0 Red Hat 4.8.5-39
  • Language Version: python 3.7.8
  • Package Manager Version: Conda 4.6.14
  • Bazel Version: 1.0.0 (3.7.1 was also tried)

Add code coverage to CI

Description

Add a code coverage report to CI

Type of Test

  • Unit test (e.g. checking a loop, method, or function is working as intended)
  • Integration test (e.g. checking if a certain group or set of functionality is working as intended)
  • Regression test (e.g. checking if by adding or removing a module of code allows other systems to continue to function as intended)
  • Stress test (e.g. checking to see how well a system performs under various situations, including heavy usage)
  • Performance test (e.g. checking to see how efficient a system is as performing the intended task)
  • Other...

Expected Behavior

A code coverage step in the test script.
Each PR should generate a report on code coverage.
If coverage goes below 95%, the build should fail

Additional Context

See the PySyft test script

Train simple neural network on MIMIC data

What?

Train a single-headed neural network on MIMIC data. Look to previous VFL papers for reference.

How long?

Once MIMIC data access has been granted, ~2 weeks.

Is your research related to a problem?

The purpose of this task is to demonstrate that we can learn something useful with NNs on MIMIC (i.e. the data isn't too simple for it to be a compelling use-case). Once this has been proven, we can make the problem applicable to VFL.

Additional Context

DO NOT COMMIT MIMIC DATA. MAKE SURE MIMIC DATA HAS BEEN SCRUBBED FROM ANY RELEVANT NOTEBOOKS

Randomly remove data

Feature Description

Update the data partitioning function to randomly remove datapoints from one (or both) of the datasets.

The chance of this happening should be small so that there is enough data left over on which to train the model
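
A sketch of the removal step; the drop probability is an illustrative assumption:

```python
import random

def randomly_remove(records, drop_probability=0.05):
    """Drop each record independently with a small probability, simulating
    datasets that do not cover exactly the same subjects. A small
    drop_probability leaves enough linked data to train on."""
    return [r for r in records if random.random() >= drop_probability]
```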

Is your feature request related to a problem?

In a real-world setting, we cannot guarantee that all datasets hold data relating to the same subjects. We wish to simulate that

Additional Context

Part of Epic #2
Dependent on #4
