openmined / pyvertical

Privacy Preserving Vertical Federated Learning

License: Apache License 2.0

Python 97.21% Shell 2.00% Dockerfile 0.79%
psi private-set-intersection split-neural-network splitnn vertical-federated-learning federated-learning partitioned-data

PyVertical's Introduction


PyVertical

A project developing privacy-preserving vertical federated learning using syft.

  • 🔗 Private entity resolution using Private Set Intersection (PSI)
  • 🔒 Trains a model on vertically partitioned data using SplitNNs, so only data holders can access data

Vertically-partitioned data is data in which fields relating to a single record are distributed across multiple datasets. For example, multiple hospitals may have admissions data on the same patients, or retailers may have transaction data on the same shoppers. Vertically-partitioned data could be used to solve vital problems, but data holders can't simply pool their datasets without breaking user privacy. PyVertical uses PSI to link datasets in a privacy-preserving way. We train SplitNNs on the partitioned data to ensure the data remains separate throughout the entire process.

See the changelog for information on the current status of PyVertical.

NOTE: PyVertical does not currently work with syft 0.3.0

The Process

PyVertical diagram

PyVertical process:

  1. Create partitioned dataset
    • Simulate real-world partitioned dataset by splitting MNIST into a dataset of images and a dataset of labels
    • Give each data point (image + label) a unique ID
    • Randomly shuffle each dataset
    • Randomly remove some elements from each dataset
  2. Link datasets using PSI
    • Use PSI to link indices in each dataset using unique IDs
    • Reorder datasets using linked indices
  3. Train a split neural network
    • Hold both datasets in a dataloader
    • Send images to first part of split network
    • Send labels to second part of split network
    • Train the network
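
The three steps above, sketched in plain Python. This is purely illustrative: real PyVertical performs the linking step with the OpenMined PSI library rather than a cleartext set intersection, and the partitions live with separate syft workers.

```python
import random
import uuid

# 1. Create a partitioned dataset: images and labels, each tagged with a shared unique ID
ids = [uuid.uuid4().hex for _ in range(10)]
images = {uid: f"image-{n}" for n, uid in enumerate(ids)}
labels = {uid: n % 2 for n, uid in enumerate(ids)}

# Randomly shuffle each dataset and drop some elements from each
image_ids = random.sample(list(images), k=8)
label_ids = random.sample(list(labels), k=8)

# 2. "Link" the datasets: in PyVertical this is done privately with PSI
shared = [uid for uid in image_ids if uid in set(label_ids)]

# Reorder both datasets using the linked IDs
linked_images = [images[uid] for uid in shared]
linked_labels = [labels[uid] for uid in shared]

# 3. The aligned (image, label) pairs can now feed the two halves of a SplitNN
for img, lbl in zip(linked_images, linked_labels):
    print(img, lbl)
```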

Requirements

OS

Windows Linux MacOS
❌ ✔️ ✔️

PyTorch 1.4.0 on Windows has known issues, and it cannot be upgraded to a working version until syft is updated as well.

Python

3.6 3.7 3.8 3.9
✔️ ✔️ ✔️ ❌

syft and PSI upstream dependencies do not have Python 3.9 packages.

PyTorch Environment

To install the dependencies, we recommend using Conda:

  1. Clone this repository
  2. In the command line, navigate to your local copy of the repository
  3. Run conda env create -f environment.yml
    • This creates an environment pyvertical-dev
    • Comes with most dependencies you will need
  4. Activate the environment with conda activate pyvertical-dev
  5. Run conda install notebook

N.B. Installing the dependencies takes several steps to circumvent a versioning incompatibility between syft and jupyter. In the future, all packages will be moved into environment.yml.

Tensorflow Environment

To install the dependencies, we recommend using Conda:

  1. Clone this repository
  2. In the command line, navigate to your local copy of the repository
  3. Run conda env create -f tf_environment.yml
    • This creates an environment pyvertical-dev-tf
    • Comes with most dependencies you will need
  4. Activate the environment with conda activate pyvertical-dev-tf
  5. Run conda install notebook

Docker

You can instead opt to use Docker.

To run:

  1. Build the image with docker build -t pyvertical:latest .
  2. Launch a container with docker run -it -p 8888:8888 pyvertical:latest
    • Defaults to launching jupyter lab

Synthea

PyVertical uses fake medical data generated by synthea to demonstrate multi-party, vertical federated learning. Read the synthea docs for the requirements to generate the data. With those prerequisites installed, run the scripts/download_synthea.sh bash script from the root directory of this project; it generates a deterministic dataset and stores it in data/synthea.

Usage

Check out examples/PyVertical Example.ipynb to see PyVertical in action.

Goals

  • MVP
    • Simple example on MNIST dataset
    • One data holder has images, the other has labels
  • Extension demonstration
    • Apply process to electronic health records (EHR) dataset
    • Dual-headed SplitNN: input data is split amongst several data holders
  • Integrate with syft

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Read the OpenMined contributing guidelines and styleguide for more information.

Contributors

TTitcombe Pavlos-p H4LL rsandmann daler3

Testing

We use pytest to test the source code. To run the tests manually:

  1. In the command line, navigate to the root of this repository
  2. Run python -m pytest

CI also checks that the code is formatted according to the contributing guidelines.

Publications

Romanini, D., Hall, A. J., Papadopoulos, P., Titcombe, T., Ismail, A., Cebere, T., Sandmann, R., Roehm, R. & Hoeh, M. A. (2021). PyVertical: A Vertical Federated Learning Framework for Multi-headed SplitNN. arXiv preprint arXiv:2104.00489. (link)

Angelou, N., Benaissa, A., Cebere, B., Clark, W., Hall, A. J., Hoeh, M. A., Liu, D., Papadopoulos, P., Roehm, R., Sandmann, R., Schoppmann, P. & Titcombe, T. (2020). Asymmetric Private Set Intersection with Applications to Contact Tracing and Private Vertical Federated Machine Learning. arXiv preprint arXiv:2011.09350. (link)

You can cite this work using:

@article{romanini2021pyvertical,
    title={PyVertical: A Vertical Federated Learning Framework for Multi-headed SplitNN},
    author={Romanini, Daniele and Hall, Adam James and Papadopoulos, Pavlos and Titcombe, Tom and Ismail, Abbas and Cebere, Tudor and Sandmann, Robert and Roehm, Robin and Hoeh, Michael A},
    journal={arXiv preprint arXiv:2104.00489},
    year={2021}
}

@article{angelou2020asymmetric,
    title={Asymmetric Private Set Intersection with Applications to Contact Tracing and Private Vertical Federated Machine Learning},
    author={Angelou, Nick and Benaissa, Ayoub and Cebere, Bogdan and Clark, William and Hall, Adam James and Hoeh, Michael A and Liu, Daniel and Papadopoulos, Pavlos and Roehm, Robin and Sandmann, Robert and others},
    journal={arXiv preprint arXiv:2011.09350},
    year={2020}
}

License

Apache License 2.0

PyVertical's Issues

Remove hard-coded SplitNN

Description

There is a hardcoded SplitNN in the source code. This should be removed.

Are you interested in working on this improvement yourself?

  • Yes, I am, but anyone else is free to take this

Additional Context

We should remove this before we think about publishing the first PyVertical pip package

Add Windows and MacOS tests

Description

Add test builds on Windows and MacOS. PyVertical involves a code compilation step (PSI) which varies from system to system, so it is important we test all common OSs to catch bugs.

To save time, only test Python 3.7 on each new OS

Type of Test

  • Unit test (e.g. checking a loop, method, or function is working as intended)
  • Integration test (e.g. checking if a certain group or set of functionality is working as intended)
  • Regression test (e.g. checking if by adding or removing a module of code allows other systems to continue to function as intended)
  • Stress test (e.g. checking to see how well a system performs under various situations, including heavy usage)
  • Performance test (e.g. checking to see how efficient a system is as performing the intended task)
  • Other...

Expected Behavior

Opening a PR should run tests on:

  • Ubuntu 18.04
    • Python 3.6, 3.7, 3.8 (current behaviour)
  • Windows (latest version available on github CI)
    • Python 3.7
  • MacOS (latest version available on github CI)
    • Python 3.7

Integrate PSI with workers

Question

The current PyVertical process requires a user to collect IDs from the data holders to compute the set intersection. The data holders must then sort the remaining IDs they hold in the same format. We should integrate the PSI process more closely into vertically partitioned datasets. How we do this is an open question. We must consider:

  • a process for data holders to agree on for sorting IDs
  • a method for data holders to communicate IDs to a third party so the set intersection can be computed
  • a method to confirm that data holder IDs are/can become strings
    • PSI requires strings

Additional context

Part of #28

Dual-headed SplitNN

Feature Description

Implement a dual-headed splitNN. Each head takes some data as input and computes some representation of the data. The two intermediate vectors are combined and the rest of the network computes on the combined data. We should apply this network to Synthea medical data (#40). We need to split the data in a way which makes sense in a real setting.

When training this model, don't worry about the PSI process to link data entities. We will remove/jumble datasets to build the story at a later stage.
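
A minimal PyTorch sketch of the dual-headed idea. Layer sizes, feature counts, and class names here are illustrative assumptions, and the syft machinery for distributing the model parts is omitted:

```python
import torch
import torch.nn as nn

class Head(nn.Module):
    """One data holder's local model: raw features -> intermediate vector."""
    def __init__(self, in_features, out_features=16):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(in_features, out_features), nn.ReLU())

    def forward(self, x):
        return self.layers(x)

class Trunk(nn.Module):
    """Server-side model: combined intermediate vectors -> prediction."""
    def __init__(self, in_features, n_classes=2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features, 32), nn.ReLU(), nn.Linear(32, n_classes)
        )

    def forward(self, x):
        return self.layers(x)

# Two holders with different feature sets; sizes are arbitrary for the sketch
head_a, head_b = Head(in_features=10), Head(in_features=5)
trunk = Trunk(in_features=16 + 16)

x_a, x_b = torch.randn(4, 10), torch.randn(4, 5)
combined = torch.cat([head_a(x_a), head_b(x_b)], dim=1)
logits = trunk(combined)  # loss.backward() would push gradients into both heads
```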

Is your feature request related to a problem?

In many real-world situations, input data is split across data holders. The current implementation of SplitNN takes input data from only one source, and the other data holder is expected to hold the labels.

Make generalised dataset splitter functions (PyTorch)

Feature Description

  • Create functions which split PyTorch datasets into separate datasets
  • Should work for image and non-image datasets
  • Functions should apply random IDs to datapoints
  • Unit test the functions

"Splitting" in this context means to split input features into two separate datasets. For images, split them top/bottom (a further issue will look to extend this)

Is your feature request related to a problem?

We should provide utility code to make it easy for people to turn non-vertically federated datasets into vertically federated ones, for experimental purposes.

We currently have some code built for this task, but it is not generalisable to a wide range of datasets

What alternatives have you considered?

  • Don't provide generalisable code: This is okay for initial experimentation, but PyVertical should be a widely usable package for VFL

Define project workflow

Where?

Documentation on the whole project. Store in a docs folder

Who?

The document will be for technical members, as a reference tool for when contributing to the project

What?

A diagram which outlines what the full PyVertical workflow will look like:

  • How data is partitioned
  • How/what data is supplied to PSI code
  • What PSI returns (indices of matching data, or ordered datasets)
  • How data is sent to SplitNN

Create Dockerfile

Feature Description

Create a Dockerfile which installs all requirements, including the local PyVertical code, and contains a clone of the PyVertical repo. A user should be able to run jupyter notebooks from within the image.

Is your feature request related to a problem?

PyVertical developers have had several problems compiling the PSI code on various operating systems. Having a Dockerfile would allow users to develop (and use) PyVertical more easily.

What alternatives have you considered?

None

Create partition function for federated datasets

Feature Description

Create a function dataset_partition which partitions a dataset, sends the partitioned datasets to the correct worker, and returns a syft.fl.FederatedDataset of partitioned datasets. This builds on the current partition_dataset function in PyVertical, and is similar to syft.fl.dataset_federate

Additional Context

Depends on #28
Blocked by #47

PSI

Description

Apply Private Set Intersection (PSI) to link vertically split data from multiple sources

Why?

To complete this Epic, we need to generate datasets which simulate a real-world setting: datapoints not shared by both datasets, data appearing in random orders, IDs which link data. We then apply PSI to re-link the data

Breakdown

  • Jumble data #7
  • Add unique IDs to data #8
  • Randomly missing data #9
  • Add PSI code #13
  • Apply PSI to re-link data #34

Who else?

May require changes to https://github.com/OpenMined/PSI

Train Model on Synthea data

Feature Description

Train a machine learning model (not a split network) on synthea data to solve some task.
Create a notebook which demonstrates the performance of this model.

Synthea data can be fairly large, so we should find the minimum number of patients we need to develop a useful model. We're just trying to demonstrate a concept, not cure a disease.

Is your feature request related to a problem?

Before we attempt to develop a model on vertically-partitioned synthea data, we should confirm that a model can be trained successfully.

This model will also set the baseline so we can compare performance drop caused by privacy-preserving vertical federated learning

What alternatives have you considered?

None

Additional Context

An existing solution we could use is diabetes prediction https://github.com/IBM/example-health-machine-learning/blob/master/diabetes-prediction.ipynb

Requires #58 to be complete

Vertically Partitioned Data Loader

Feature Description

Develop a torch dataloader which holds two vertically partitioned sets of the same data.
This dataloader must send the parts of the data to the correct location of their corresponding network
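
A minimal sketch of the idea (class name is hypothetical), assuming both partitions are already aligned; a real implementation would additionally send each batch to its holder's worker:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

class TwoPartitionLoader:
    """Iterate two aligned partitions together, yielding matching batches.

    shuffle=False keeps the partitions aligned batch-by-batch; a real
    implementation would draw one shared permutation instead.
    """
    def __init__(self, data_dataset, label_dataset, batch_size=32):
        self.data_loader = DataLoader(data_dataset, batch_size=batch_size, shuffle=False)
        self.label_loader = DataLoader(label_dataset, batch_size=batch_size, shuffle=False)

    def __iter__(self):
        return zip(self.data_loader, self.label_loader)

data_part = TensorDataset(torch.randn(100, 8))
label_part = TensorDataset(torch.randint(0, 2, (100,)))
for (x,), (y,) in TwoPartitionLoader(data_part, label_part):
    pass  # send x to the first model part, y to the second
```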

Additional Context

Part of EPIC #3
Dependent on #6

Add mypy type checking in CI

Description

Add a mypy type check in CI script

Type of Test

  • Unit test (e.g. checking a loop, method, or function is working as intended)
  • Integration test (e.g. checking if a certain group or set of functionality is working as intended)
  • Regression test (e.g. checking if by adding or removing a module of code allows other systems to continue to function as intended)
  • Stress test (e.g. checking to see how well a system performs under various situations, including heavy usage)
  • Performance test (e.g. checking to see how efficient a system is as performing the intended task)
  • Other...

Expected Behavior

mypy type checking should be run as part of the black/linting stage of the CI script. The build should fail if mypy does, but if there are too many existing issues, just set it to warn and open a new issue to reduce mypy errors

Make pip-installable

Feature Description

Make PyVertical pip-installable.
This should happen after the integration of pyvertical with pysyft, when we have a clearer idea of pyvertical-specific functionality

Is your feature request related to a problem?

A pip-installable package would allow people to use pyvertical in their own projects

What alternatives have you considered?

None

Additional Context

None

Simple PSI demonstration

Feature Description

Implement a function that takes two arrays of unique IDs and returns indices of matching IDs.
These arrays do not need to have missing elements, but they must have different orders (otherwise PSI would not be necessary)
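
A cleartext stand-in for such a function. Note both ID lists are visible to one party here, which is exactly what the PSI-based version avoids:

```python
from typing import List, Tuple

def match_ids(ids_a: List[str], ids_b: List[str]) -> Tuple[List[int], List[int]]:
    """Return, for each array, the indices whose IDs appear in both."""
    position_in_b = {uid: i for i, uid in enumerate(ids_b)}
    indices_a, indices_b = [], []
    for i, uid in enumerate(ids_a):
        if uid in position_in_b:
            indices_a.append(i)
            indices_b.append(position_in_b[uid])
    return indices_a, indices_b

# match_ids(["u1", "u2", "u3"], ["u3", "u1"]) -> ([0, 2], [1, 0])
```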

Is your feature request related to a problem?

This will be a first demonstration that PSI can be used to link vertically partitioned data.
The next step is to integrate it into the pipeline and use it to re-order the partitioned datasets

Additional Context

Part of PSI Epic #2
Dependent on #13

Add unique identifier to data

Feature Description

Update the data partitioning function to apply a unique ID to each datapoint (shared across datasets).
The simplest solution would be an integer ID, however (fake) emails or names would also be acceptable to simulate data subjects
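
A sketch of the simplest approach, using a hypothetical add_ids helper and UUIDs as the shared identifiers:

```python
import uuid

def add_ids(datapoints):
    """Attach a shared unique ID to each (image, label) pair before the
    partitions are separated. `add_ids` is a hypothetical helper name."""
    ids = [uuid.uuid4().hex for _ in datapoints]
    image_partition = [(uid, img) for uid, (img, _) in zip(ids, datapoints)]
    label_partition = [(uid, lbl) for uid, (_, lbl) in zip(ids, datapoints)]
    return image_partition, label_partition
```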

Is your feature request related to a problem?

We need some way of linking datapoints across datasets

Additional Context

Part of EPIC #2
Dependent on #4

Create syft-like federated dataloader

Feature Description

Replace the existing dataloader with a dataloader which takes a syft.fl.FederatedDataset. This should extend syft.fl.FederatedDataLoader to account for datasets which may not contain data or targets

Additional Context

Part of #28

Development of SplitNN as a 2d array of Models.

This allows for generic logic which can be used to accommodate arbitrary input/label tensor partitions.

Currently we treat the SplitNN as a 1d array of models. This allows us to perform horizontal splits in the model.

When this works, we should be able to use the same, standard class for any data/ label distribution

Include PSI build in project

Feature Description

Add relevant Python code from PSI project to the project.
Include relevant documentation for compiling/running in README

Additional Context

Part of Epic #2

Improve "dataset.py" to reduce mypy errors

Description

Improve "src/dataset.py" file to reduce mypy check errors.

Additional Information

Mypy check reports 10 errors in "/src/dataset.py" file, as follows:

src/dataset.py:95: error: "Dataset" has no attribute "targets"
src/dataset.py:96: error: "Dataset" has no attribute "data"
src/dataset.py:99: error: Argument 1 to "len" has incompatible type "Dataset"; expected "Sized"
src/dataset.py:100: error: Argument 1 to "len" has incompatible type "Dataset"; expected "Sized"
src/dataset.py:104: error: Argument 1 to "len" has incompatible type "Dataset"; expected "Sized"
src/dataset.py:105: error: Argument 1 to "len" has incompatible type "Dataset"; expected "Sized"
src/dataset.py:111: error: "Dataset" has no attribute "data"
src/dataset.py:112: error: "Dataset" has no attribute "ids"
src/dataset.py:114: error: "Dataset" has no attribute "targets"
src/dataset.py:115: error: "Dataset" has no attribute "ids"
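
One possible way to clear these errors, sketched below (an option, not necessarily the fix that will be chosen): annotate the functions with a Protocol describing what they actually need, instead of the bare Dataset. Note that typing.Protocol requires Python 3.8+; older Pythons need typing_extensions.

```python
from typing import Any, Protocol  # use typing_extensions.Protocol on Python < 3.8

class SizedLabelledDataset(Protocol):
    """Structural type for what src/dataset.py actually requires."""
    data: Any
    targets: Any
    ids: Any

    def __len__(self) -> int: ...

# Annotating parameters as SizedLabelledDataset (rather than Dataset) lets
# mypy accept len(dataset), dataset.data, dataset.targets and dataset.ids.
```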

Integrate with syft

What?

The current implementation of data loaders and datasets is quite hacky.
We should integrate existing syft functionality and extend it to make Vertically-partitioned dataset
a robust class, making it easy for anyone to apply PyVertical to any dataset

Breakdown

  • Build on syft.fl.BaseDataset to create a dataset which holds partitions and may hold either data or targets. This should extend PyVertical's VerticalDataset to include syft functionality of ownership #47
  • Create a function dataset_partition which partitions a dataset, sends the partitioned datasets to the correct worker, and returns a syft.fl.FederatedDataset of partitioned datasets. This builds on the current partition_dataset function in PyVertical, and is similar to syft.fl.dataset_federate #48
  • Replace PartitionDistributingDataLoader with a dataloader which takes a syft.fl.FederatedDataset. This should extend syft.fl.FederatedDataLoader to account for datasets which may not contain data or targets #49
  • Integrate with PSI #50
  • Encrypt unique IDs #54

Additional Context

This will be developed simultaneously with the extended PyVertical demonstration (#25), so, to avoid breaking changes, existing dataloaders/data splitters should be kept until this issue is complete

Add Contributors to README

Where?

Contributors section of README

Who?

Demonstrate to everyone reading the doc the importance of community contributions

What?

Add a section highlighting all contributors to the project. This can be done manually, but ideally we would have a bot to add new contributors.

See the README template for more information

Simple Vertically Partitioned Model

Description

Implement a simple SplitNN using syft which trains on vertically split data provided by multiple data holders

Why?

This may be the first open source implementation of a split model trained on vertically partitioned data. Training the model successfully is a complex process, which requires a custom data pipeline and model architecture.

This epic is the first stage of the project: The data will be ordered (no need for PSI) and we will send data labels away from the data holders

Breakdown

Provide a bulleted or numbered list of how you might break this epic down into smaller issues.

  • Implement function to vertically split dataset (2 data holders/ 1 split) #4
  • Implement vertically split data loader #5
  • Develop vertically split SplitNN #6

Who else?

May require changes in PySyft

Documentation and source code do not match

Description

The code in PyVertical/examples/PyVertical Example.ipynb is outdated: it calls the function compute_psi, which is marked as deprecated in src/psi/util.py.

Find EHR dataset

Question

Find an Electronic Health Record (EHR) dataset for use in the next stage of PyVertical

Further Information

In the next stage of PyVertical, we will apply the concept to a dataset more complex than MNIST.
To demonstrate the utility of the technology, it would be useful to use a dataset with obvious links to real-world applications, yet still simple enough that we can quickly develop a useful product.

It was decided to use EHR data. Some possibilities are:

All suggestions are welcome

Re-add coverage tests

Description

Coverage tests were removed in #23 as there was some issue with the bazel build.
This was a temporary solution to get the PR merged. We should re-add the coverage tests.

Type of Test

  • Unit test (e.g. checking a loop, method, or function is working as intended)
  • Integration test (e.g. checking if a certain group or set of functionality is working as intended)
  • Regression test (e.g. checking if by adding or removing a module of code allows other systems to continue to function as intended)
  • Stress test (e.g. checking to see how well a system performs under various situations, including heavy usage)
  • Performance test (e.g. checking to see how efficient a system is as performing the intended task)
  • Other... Code coverage

Expected Behavior

The CI build should have a task for code coverage. Code coverage fails if baseline statistics are not met (see #27)

Function to vertically partition data

Feature Description

A function which takes a torch dataset, e.g. torchvision.datasets.MNIST and outputs two datasets, each holding a deterministic partition of the dataset (i.e. the first returned dataset always corresponds to the top half of the images)

The datapoints should remain ordered (i.e. so PSI is not required)
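
A sketch of such a function, under the assumption that iterating the dataset yields (image_tensor, label) pairs (e.g. MNIST with a ToTensor transform):

```python
def partition_vertically(dataset):
    """Split each image into top and bottom halves, keeping order so the
    i-th element of both outputs refers to the same record.

    Sketch only: assumes iterating `dataset` yields (image_tensor, label)
    pairs with shape (C, H, W).
    """
    tops, bottoms, labels = [], [], []
    for img, lbl in dataset:
        half = img.shape[-2] // 2           # split along the height axis
        tops.append(img[..., :half, :])     # top half of the image
        bottoms.append(img[..., half:, :])  # bottom half of the image
        labels.append(lbl)
    return tops, bottoms, labels
```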

Is your feature request related to a problem?

To provide vertically partitioned data to the SplitNN, we need an efficient and easy way to partition datasets

Additional Context

This is part of EPIC #3

Remove syft dependency for dataloaders

What?

Remove the syft dataset/dataloader dependency from the VerticalDataset/loaders. Inherit instead from torch dataset/loader

Why?

Those classes are not present in syft >= 0.3.0

2-party Asymmetric learning

Feature Description

Implement an asymmetric learning protocol when calculating the ID intersection between parties.
See this paper for more information

Is your feature request related to a problem?

Asymmetric learning is the case where one of the parties in vertical federated learning has the majority of data IDs.
The major party can learn a great deal about the individuals/entities the minor party holds data on, but the minor party
learns almost nothing about the major party's dataset.

Protocols to protect both parties in this scenario include obscuring the intersection of data IDs by adding random IDs to the set sent to each party.
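
A sketch of the obscuring idea; the dummy-ID scheme and sizes are assumptions for illustration, not a vetted protocol:

```python
import secrets

def pad_with_dummies(ids, target_size):
    """Append random dummy IDs so the set size (and hence the intersection
    fraction) is obscured before PSI runs. Sketch of the idea only: a real
    protocol must ensure dummies cannot collide with genuine IDs."""
    dummies = [secrets.token_hex(16) for _ in range(max(0, target_size - len(ids)))]
    return list(ids) + dummies

# minor_ids = pad_with_dummies(minor_ids, target_size=len(major_ids))
```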

What alternatives have you considered?

None

Additional Context

This may need to be implemented upstream by the PSI team.

This issue should be worked on after:

  • Integration with syft (i.e. we have worker-to-worker communication in place)
  • Robust PSI strategy (securely sending IDs to and from a computational server)

Open questions:

  • Should we always do an obscuring method?
  • If not, what is the determining factor?
  • Should workers be able to agree on using/not using an asymmetric protocol?

Pin PSI version

Description

Update the WORKSPACE file to point to a specific version of the PSI repo (which has been verified to work). Currently we point to master, so a new bug there may break PyVertical

Encrypt IDs

Feature Description

Encrypt IDs before they are sent to PSI. Data holders must agree on an encryption protocol so that PSI still works
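
One possible scheme, sketched with a shared-key keyed hash (an illustrative assumption, not necessarily the protocol PyVertical will adopt): equal raw IDs map to equal blinded IDs, so PSI still matches them, but the raw values never leave the holders.

```python
import hashlib
import hmac

def blind_id(raw_id: str, shared_key: bytes) -> str:
    """Keyed hash of a raw ID. If all data holders agree on shared_key,
    equal raw IDs map to equal blinded IDs, so PSI still matches them,
    while the raw value (e.g. an e-mail) is never transmitted."""
    return hmac.new(shared_key, raw_id.encode(), hashlib.sha256).hexdigest()

# Every holder blinds with the same agreed key before the PSI step
blinded = blind_id("alice@example.com", shared_key=b"agreed-secret")
```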

Is your feature request related to a problem?

In the MVP, unique IDs attached to each data point are sent as raw strings. In real-world settings, unique IDs can expose user identity (e.g. could be an e-mail).

Move linting script into `test.yml`

Description

Currently, the linting/type checking/formatting checks performed by CI are housed in a separate script, lint_python.sh, which is called by the test.yml test script. While this keeps the test script clean, it makes it difficult to work out which check is causing a failure.

We should move the checks from lint_python.sh into test.yml, so we can easily see what's causing failures

Type of Test

  • Unit test (e.g. checking a loop, method, or function is working as intended)
  • Integration test (e.g. checking if a certain group or set of functionality is working as intended)
  • Regression test (e.g. checking if by adding or removing a module of code allows other systems to continue to function as intended)
  • Stress test (e.g. checking to see how well a system performs under various situations, including heavy usage)
  • Performance test (e.g. checking to see how efficient a system is as performing the intended task)
  • Other... Change to code linting test script

Expected Behavior

In the linting stage of the CI logs, each command (e.g. flake8, black, mypy) should be a separate, collapsible command

Apply PSI to SplitNN training example

Feature Description

Apply the PSI code to the SplitNN notebook to re-link vertically partitioned data. In its current state, the partitioned data is ordered, so PSI is unnecessary; change the arguments passed to the partition function to shuffle the partitioned data.

  • Change the parameters passed to the partition function so partitioned data is unordered
  • Demonstrate in the notebook that data is unordered i.e. print part of the data (this is not functional, but it helps to tell the story of what we're trying to achieve)
  • Use the PSI code to re-order the data
  • Train the SplitNN
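
A sketch of the re-ordering step, assuming the PSI step yields the matching indices (cf. the match_ids sketch in the "Simple PSI demonstration" issue above):

```python
def reorder(dataset, linked_indices):
    """Keep only the records found in the intersection, in the shared order.
    `linked_indices` would come from the PSI step."""
    return [dataset[i] for i in linked_indices]

# idx_images, idx_labels = match_ids(image_ids, label_ids)
# images, labels = reorder(images, idx_images), reorder(labels, idx_labels)
# ...then train the SplitNN on the now-aligned partitions
```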

Additional Context

Incorporating PSI into the SplitNN training example will finish the PoC
Do not work on this issue until #32 refactor is complete

Extend vertical partitioned demonstration

Description

Extend the MVP (partitioning MNIST into images and labels) to work on arbitrary vertically partitioned datasets

Why?

The dataset/dataloader/data partitioning/splitNN architecture/PSI functions are coded assuming the provided data is MNIST and the partitioning function splits images and labels. Fortunately, the real world has more data than just MNIST. In this epic we will generalise the code to work with many datasets and partitions

Breakdown

IN PROGRESS

Who else?

May require work in PySyft and PSI

Additional Context

Should be completed after #2 and #3

Extend syft federated datasets

Feature Description

Build on syft.fl.BaseDataset to create a dataset which holds partitions and may hold data and/or targets. This should extend PyVertical's VerticalDataset to include syft functionality of ownership

Additional Context

Part of #28

Bazel builds failing

Description

The bazel build stage of the test script is failing. See https://github.com/OpenMined/PyVertical/pull/39/checks?check_run_id=851670305 and https://github.com/OpenMined/PyVertical/runs/851664851?check_suite_focus=true

Multiple PRs are failing, and one of the PRs had previously passed the bazel build stage on an earlier day, so it looks like this is caused by an upstream change. Investigate whether or not we can do anything about it.

I have successfully built locally on Ubuntu 18.04, Python 3.7. Perhaps it's an issue with github CI's bazel?

How to Reproduce

  1. Look at failing CI tests on PRs

Expected Behavior

The builds pass

Screenshots

NA

System Information

  • OS: Ubuntu
  • OS Version: 18.04
  • Language Version: Python 3.6, 3.7, 3.8

Jumble vertically partitioned data

Feature Description

Update the function which vertically partitions datasets to jumble datapoints, so data indices are different for the datasets.

Is your feature request related to a problem?

For PSI to be necessary, data must be jumbled

Additional Context

Part of EPIC #2
Dependent on #4

Partition dataset by images/labels

Feature Description

Change the data partitioning function from splitting an image into top half/bottom half into (i) one dataset holding images and (ii) the other dataset holding labels

Is your feature request related to a problem?

The current data partitioning system, implemented in #10, is more complex than it needs to be.
Changing the partitioning as described in this issue still demonstrates the core capability of PyVertical (training a model on vertically partitioned data), but is a simpler problem to solve, as we do not need to train a model using two-headed input

Test for isort compatibility

Description

Codebase should adhere to the OpenMined styleguide. This includes correct ordering of imports using isort. We should add isort as a developer requirement and test for isort compatibility in the unit tests. Look at PySyft for the correct isort config

Type of Test

  • Unit test (e.g. checking a loop, method, or function is working as intended)
  • Integration test (e.g. checking if a certain group or set of functionality is working as intended)
  • Regression test (e.g. checking if by adding or removing a module of code allows other systems to continue to function as intended)
  • Stress test (e.g. checking to see how well a system performs under various situations, including heavy usage)
  • Performance test (e.g. checking to see how efficient a system is as performing the intended task)
  • Other...

Expected Behavior

Unit tests should include a test for isort compatibility, which passes

Additional Context

None

Vertically Partition data in SplitNN example

Feature Description

The SplitNN example notebook currently uses a normal dataset and dataloader.
Update the notebook to use partition_dataset and PartitionDistributingDataLoader

To partition the dataset without requiring PSI, call partition_dataset with keep_order=True and remove_data=False
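
Hypothetical usage based on the names in this issue (partition_dataset, PartitionDistributingDataLoader, keep_order, remove_data); the exact import paths and signatures may differ in the PyVertical source:

```python
from torchvision import datasets, transforms

mnist = datasets.MNIST(".", download=True, transform=transforms.ToTensor())

# from src.dataset import partition_dataset
# from src.dataloader import PartitionDistributingDataLoader
#
# # keep_order=True and remove_data=False keep the partitions aligned,
# # so the notebook can train without a PSI linking step
# images, labels = partition_dataset(mnist, keep_order=True, remove_data=False)
# dataloader = PartitionDistributingDataLoader(images, labels, batch_size=128)
```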

Randomly failing PSI tests

Description

Some PSI tests are randomly failing. See https://github.com/OpenMined/PyVertical/pull/51/checks?check_run_id=872479981 for an example log.


Type of Test

  • Unit test (e.g. checking a loop, method, or function is working as intended)
  • Integration test (e.g. checking if a certain group or set of functionality is working as intended)
  • Regression test (e.g. checking if by adding or removing a module of code allows other systems to continue to function as intended)
  • Stress test (e.g. checking to see how well a system performs under various situations, including heavy usage)
  • Performance test (e.g. checking to see how efficient a system is as performing the intended task)
  • Other...

Expected Behavior

Tests should not randomly fail

Additional Context

This may be an issue with the upstream PSI dependency

Differentially-private intermediate tensors example

Feature Description

Develop an example which applies differential privacy to the tensors output from a local user's model before they are sent to the computational server.

Is your feature request related to a problem?

Non-DP tensors could allow a colluding computational server and data holder (in the case of multiple data holders) to identify a user's data using a model inversion attack

What alternatives have you considered?

None

Additional Context

Vertical federated learning reading list

Where?

PyVertical/READING_LIST.md

Who?

Anyone looking to learn more about vertical federated learning, from beginners to ml experts

What?

List of papers, blog posts and any other materials which explain or apply vertical federated learning.
Add a note to the readme which points readers to the reading list.

Additional Context

Some papers to consider:

Merge dataloader with PSI code

What?

Small refactor to dataloader code to integrate functionality introduced by PSI, which is currently split across multiple dataloaders

Dependent on #23

Why?

See conversation in #31 for details

Breakdown

  • Move PSI functionality from PartitioningDistributingDataLoader to NewDataLoader
  • Remove code for PartitioningDistributingDataLoader
  • Rename NewDataLoader to VerticalDataLoader and the existing VerticalDataLoader to SinglePartitionDataLoader

Simple Vertically partitioned SplitNN

Feature Description

Develop a SplitNN with a single input head, using syft to send the model parts to different devices.

See the PySyft tutorials for a base SplitNN; we may be able to use this SplitNN exactly

Place the model in src/splitnn.py

Additional Context

Part of EPIC #3

PSI builds failing

Description

I have run into some problems running build-psi.sh. The error information is as follows:

INFO: Repository rule 'org_openmined_psi' returned: {"remote": "https://github.com/OpenMined/PSI", "commit": "ca2866e3c70a018ed9a5a2e7af5bcbd31dc49df9", "shallow_since": "2020-14-07", "init_submodules": True, "verbose": False, "strip_prefix": "", "patches": [], "patch_tool": "patch", "patch_args": ["-p0"], "patch_cmds": [], "name": "org_openmined_psi"}
ERROR: /home/tanghengzhu/.cache/bazel/_bazel_tanghengzhu/9b06aa84c50b71eaea609540d6aa319b/external/org_openmined_psi/private_set_intersection/javascript/toolchain/cc_toolchain_config.bzl:171:17: name 'CcToolchainConfigInfo' is not defined
ERROR: error loading package '': Extension 'private_set_intersection/javascript/toolchain/cc_toolchain_config.bzl' has errors
ERROR: error loading package '': Extension 'private_set_intersection/javascript/toolchain/cc_toolchain_config.bzl' has errors
INFO: Elapsed time: 8.584s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)

How to Reproduce

  1. Installed bazel
  2. Run the build script .github/workflows/scripts/build-psi.sh

Expected Behavior

This should generate a _psi_bindings.so file and place it in src/psi/.

System Information

  • OS Version: Linux sigmaster 3.10.0 Red Hat 4.8.5-39
  • Language Version: python 3.7.8
  • Package Manager Version: Conda 4.6.14
  • Bazel Version: 1.0.0 (3.7.1 was also tried)

Add code coverage to CI

Description

Add a code coverage report to CI

Type of Test

  • Unit test (e.g. checking a loop, method, or function is working as intended)
  • Integration test (e.g. checking if a certain group or set of functionality is working as intended)
  • Regression test (e.g. checking if by adding or removing a module of code allows other systems to continue to function as intended)
  • Stress test (e.g. checking to see how well a system performs under various situations, including heavy usage)
  • Performance test (e.g. checking to see how efficient a system is as performing the intended task)
  • Other...

Expected Behavior

A code coverage step in the test script.
Each PR should generate a report on code coverage.
If coverage goes below 95%, the build should fail

Additional Context

See the PySyft test script

Train simple neural network on MIMIC data

What?

Train a single-headed neural network on MIMIC data. Look to previous VFL papers for reference.

How long?

Once MIMIC data access has been granted, ~2 weeks.

Is your research related to a problem?

The purpose of this task is to demonstrate that we can learn something useful with NNs on MIMIC (i.e. the data isn't too simple for it to be a compelling use-case). Once this has been proven, we can make the problem applicable to VFL.

Additional Context

DO NOT COMMIT MIMIC DATA. MAKE SURE MIMIC DATA HAS BEEN SCRUBBED FROM ANY RELEVANT NOTEBOOKS

Randomly remove data

Feature Description

Update the data partitioning function to randomly remove datapoints from one (or both) of the datasets.

The chance of this happening should be small so that there is enough data left over on which to train the model
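
A sketch of the removal step; the drop probability is an illustrative assumption:

```python
import random

def randomly_remove(records, drop_probability=0.05):
    """Drop each record independently with a small probability, simulating
    datasets that do not cover exactly the same subjects. A small
    drop_probability leaves enough linked data to train on."""
    return [r for r in records if random.random() >= drop_probability]
```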

Is your feature request related to a problem?

In a real-world setting, we cannot guarantee that all datasets hold data relating to the same subjects. We wish to simulate that

Additional Context

Part of Epic #2
Dependent on #4
