ashwinrj / federated-learning-pytorch

Implementation of Communication-Efficient Learning of Deep Networks from Decentralized Data

License: MIT License

Python 100.00%
federated-learning distributed-computing deep-learning pytorch python

federated-learning-pytorch's Introduction

Federated-Learning (PyTorch)

Implementation of the vanilla federated learning paper: Communication-Efficient Learning of Deep Networks from Decentralized Data.

Experiments are run on MNIST, Fashion MNIST and CIFAR10 (both IID and non-IID). In the non-IID case, the data can be split among the users equally or unequally.
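
A minimal sketch of the two partitioning schemes, assuming the shard-based non-IID split described in the paper (sort by label, cut into shards, deal a few shards to each user). The names `iid_split`, `noniid_split` and `shards_per_user` are illustrative and are not taken from this repository.

```python
import random


def iid_split(num_items, num_users, seed=1):
    """Shuffle all sample indices and deal an equal share to each user."""
    rng = random.Random(seed)
    idxs = list(range(num_items))
    rng.shuffle(idxs)
    per_user = num_items // num_users
    return {u: idxs[u * per_user:(u + 1) * per_user] for u in range(num_users)}


def noniid_split(labels, num_users, shards_per_user=2, seed=1):
    """Sort indices by label, cut them into shards, give each user a few shards."""
    rng = random.Random(seed)
    idxs = sorted(range(len(labels)), key=lambda i: labels[i])
    num_shards = num_users * shards_per_user
    shard_size = len(labels) // num_shards
    shards = [idxs[s * shard_size:(s + 1) * shard_size] for s in range(num_shards)]
    rng.shuffle(shards)
    return {
        u: [i for shard in shards[u * shards_per_user:(u + 1) * shards_per_user]
            for i in shard]
        for u in range(num_users)
    }


# Toy data: 10 classes, 20 samples each.
labels = [i % 10 for i in range(200)]
iid = iid_split(len(labels), num_users=10)
noniid = noniid_split(labels, num_users=10)
```

With this toy data every non-IID user ends up with samples from at most two classes, while each IID user gets a random mix.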

Since the purpose of these experiments is to illustrate the effectiveness of the federated learning paradigm, only simple models such as MLP and CNN are used.

Requirements

Install all the packages from requirments.txt

  • Python3
  • PyTorch
  • Torchvision

Data

  • Download the train and test datasets manually, or let torchvision download them automatically.
  • Experiments are run on MNIST, Fashion MNIST and CIFAR10.
  • To use your own dataset: move it to the data directory and write a wrapper around the PyTorch Dataset class.
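
A rough sketch of such a wrapper, assuming in-memory samples. PyTorch's map-style datasets only need `__len__` and `__getitem__`; the class name `ListDataset` is hypothetical, and in real use it would subclass `torch.utils.data.Dataset` (the base class is left out here so the sketch runs without PyTorch).

```python
class ListDataset:
    """Map-style dataset over in-memory (sample, label) pairs.

    In real use this would subclass torch.utils.data.Dataset; the base class
    is omitted so the sketch has no dependencies.
    """

    def __init__(self, samples, labels, transform=None):
        if len(samples) != len(labels):
            raise ValueError("samples and labels must have the same length")
        self.samples = samples
        self.targets = labels  # torchvision datasets expose labels as `targets`
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        sample = self.samples[index]
        if self.transform is not None:
            sample = self.transform(sample)
        return sample, self.targets[index]


dataset = ListDataset([1, 2, 3], [0, 1, 0], transform=lambda x: x * 2)
```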

Running the experiments

The baseline experiment trains the model in the conventional way.

  • To run the baseline experiment with MNIST on MLP using CPU:
python src/baseline_main.py --model=mlp --dataset=mnist --epochs=10
  • Or to run it on GPU (e.g., if gpu:0 is available):
python src/baseline_main.py --model=mlp --dataset=mnist --gpu=0 --epochs=10

The federated experiment trains a global model using many local models.

  • To run the federated experiment with CIFAR on CNN (IID):
python src/federated_main.py --model=cnn --dataset=cifar --gpu=0 --iid=1 --epochs=10
  • To run the same experiment under non-IID condition:
python src/federated_main.py --model=cnn --dataset=cifar --gpu=0 --iid=0 --epochs=10

You can change the default values of other parameters to simulate different conditions. Refer to the options section.

Options

The default values for the various parameters passed to the experiment are given in options.py. Details of some of those parameters are given below:

  • --dataset: Default: 'mnist'. Options: 'mnist', 'fmnist', 'cifar'
  • --model: Default: 'mlp'. Options: 'mlp', 'cnn'
  • --gpu: Default: None (runs on CPU). Can also be set to the specific gpu id.
  • --epochs: Number of rounds of training.
  • --lr: Learning rate set to 0.01 by default.
  • --verbose: Detailed log outputs. Activated by default, set to 0 to deactivate.
  • --seed: Random Seed. Default set to 1.

Federated Parameters

  • --iid: Distribution of data amongst users. Default set to IID. Set to 0 for non-IID.
  • --num_users: Number of users. Default is 100.
  • --frac: Fraction of users to be used for federated updates. Default is 0.1.
  • --local_ep: Number of local training epochs in each user. Default is 10.
  • --local_bs: Batch size of local updates in each user. Default is 10.
  • --unequal: Used in non-iid setting. Option to split the data amongst users equally or unequally. Default set to 0 for equal splits. Set to 1 for unequal splits.
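
To make the interaction between --num_users and --frac concrete, here is a small sketch of per-round client selection. It assumes that a round trains `max(int(frac * num_users), 1)` randomly chosen users; the helper name `sample_clients` is illustrative, not a function from this repository.

```python
import random


def sample_clients(num_users, frac, rng):
    """Pick max(int(frac * num_users), 1) distinct client ids for one round."""
    m = max(int(frac * num_users), 1)
    return rng.sample(range(num_users), m)


rng = random.Random(1)  # fixed seed, mirroring the --seed option
selected = sample_clients(num_users=100, frac=0.1, rng=rng)
at_least_one = sample_clients(num_users=5, frac=0.1, rng=rng)
```

With the defaults (100 users, fraction 0.1) this selects 10 users per round; the `max(..., 1)` guard keeps at least one user when the fraction is very small.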

Results on MNIST

Baseline Experiment:

The experiment involves training a single model in the conventional way.

Parameters:

  • Optimizer: SGD
  • Learning Rate: 0.01

Table 1: Test accuracy after training for 10 epochs:

Model    Test Acc
MLP      92.71%
CNN      98.42%

Federated Experiment:

The experiment involves training a global model in the federated setting.

Federated parameters (default values):

  • Fraction of users (C): 0.1
  • Local Batch size (B): 10
  • Local Epochs (E): 10
  • Optimizer : SGD
  • Learning Rate : 0.01

Table 2: Test accuracy after training for 10 global epochs with the parameters above:

Model    IID       Non-IID (equal)
MLP      88.38%    73.49%
CNN      97.28%    75.94%

federated-learning-pytorch's Issues

Parallel computing support

Hi, thanks for providing this wonderful repository. I'm wondering whether there will be support for parallelizing client training in each round,

specifically, running the local updates in federated_main.py in parallel processes:

for idx in idxs_users:
    local_model = LocalUpdate(args=args, dataset=train_dataset,
                              idxs=user_groups[idx], logger=logger)
    w, loss = local_model.update_weights(
        model=copy.deepcopy(global_model), global_round=epoch)
    local_weights.append(copy.deepcopy(w))
    local_losses.append(copy.deepcopy(loss))

Alternatively, are there any suggestions for how to start working on this approach?
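
One possible starting point, sketched with the standard library only. `local_update` is a hypothetical stand-in for `LocalUpdate(...).update_weights(...)`, and a thread pool is used just to keep the example short; real CPU- or GPU-bound training would more likely need process-based workers (for example via `torch.multiprocessing`), which this sketch does not cover.

```python
from concurrent.futures import ThreadPoolExecutor
import copy


def local_update(global_weights, user_id):
    """Hypothetical stand-in for LocalUpdate(...).update_weights(...)."""
    weights = copy.deepcopy(global_weights)  # each client trains its own copy
    weights["w"] = [x + user_id for x in weights["w"]]  # pretend training step
    loss = 1.0 / (user_id + 1)
    return weights, loss


def run_round(global_weights, user_ids, max_workers=4):
    """Run all selected clients concurrently and collect results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda u: local_update(global_weights, u), user_ids))
    local_weights = [w for w, _ in results]
    local_losses = [loss for _, loss in results]
    return local_weights, local_losses


weights, losses = run_round({"w": [0.0, 1.0]}, user_ids=[0, 1, 2])
```

`pool.map` returns results in the order of `user_ids`, so the collected lists line up with the sequential loop.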

AttributeError: 'Namespace' object has no attribute 'gpu_id'

Hello, I ran 'python src/federated_main.py --model=cnn --dataset=mnist --iid=0 --epochs=10 --gpu=1' but keep receiving this error message:
Traceback (most recent call last):
  File "src/federated_main.py", line 34, in <module>
    if args.gpu_id:
AttributeError: 'Namespace' object has no attribute 'gpu_id'

I am using Windows 10 and have checked in Task Manager that I have GPUs, including GPU 1. Thanks.
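
For context, argparse names each attribute after its flag, so a `--gpu` flag yields `args.gpu` and nothing called `args.gpu_id`. The sketch below illustrates that; the parser is a simplified assumption, not a copy of options.py.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # The flag name determines the attribute name: --gpu -> args.gpu
    parser.add_argument("--gpu", type=int, default=None,
                        help="GPU id to use; omit to run on CPU")
    return parser


args = build_parser().parse_args(["--gpu=1"])
has_gpu_id = hasattr(args, "gpu_id")  # False: no --gpu_id flag was defined
use_gpu = args.gpu is not None        # explicit None check, so that id 0 still counts
```

Comparing against `None` rather than relying on truthiness matters once the value is an integer, because GPU id 0 is falsy.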

A Small Issue with the MLP Model

In the MLP model, I think in the last layer it should be F.log_softmax instead of softmax.
Otherwise, the NLL loss would return negative values.
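
A small pure-Python illustration of the point, assuming the loss simply takes the negative of the output entry at the target index (which is what an NLL loss does with its input). The helper names are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits):
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def nll(outputs, target):
    """Negative log-likelihood on its input: minus the target entry."""
    return -outputs[target]


logits = [2.0, 0.5, -1.0]
loss_from_probs = nll(softmax(logits), target=0)          # lies in [-1, 0)
loss_from_log_probs = nll(log_softmax(logits), target=0)  # non-negative
```

Feeding probabilities gives minus the probability, which is negative; feeding log-probabilities gives the usual non-negative loss.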

Miswriting in function 'get_dataset'.

There is a mistake in /src/utils.py.

In the function get_dataset(args) (line 24),

        train_dataset = datasets.MNIST(data_dir, train=True, download=True,
                                       transform=apply_transform)

        test_dataset = datasets.MNIST(data_dir, train=False, download=True,
                                      transform=apply_transform)

should be

        train_dataset = datasets.CIFAR10(data_dir, train=True, download=True,
                                         transform=apply_transform)

        test_dataset = datasets.CIFAR10(data_dir, train=False, download=True,
                                        transform=apply_transform)

Problem running federated_main.py

Traceback (most recent call last):
  File "federated_main.py", line 33, in <module>
    torch.cuda.set_device(args.gpu)
  File "D:\Anaconda3\lib\site-packages\torch\cuda\__init__.py", line 243, in set_device
    device = _get_device_index(device)
  File "D:\Anaconda3\lib\site-packages\torch\cuda\_utils.py", line 20, in _get_device_index
    device = torch.device(device)
RuntimeError: Expected one of cpu, cuda, mkldnn, opengl, opencl, ideep, hip, msnpu device type at start of device string: 0
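
The message complains about the device string "0", which suggests that the GPU id reached PyTorch as a string rather than an integer. A sketch of one way around it, assuming `--gpu` is declared without a `type` (an assumption about options.py, not something confirmed here); `device_string` is a hypothetical helper.

```python
import argparse


def device_string(gpu):
    """Return a device string such as 'cuda:0', or 'cpu' when no GPU id is given."""
    return "cpu" if gpu is None else f"cuda:{int(gpu)}"


parser = argparse.ArgumentParser()
parser.add_argument("--gpu", default=None)  # no type=..., so values arrive as str

args = parser.parse_args(["--gpu=0"])
raw_value = args.gpu              # the string "0", not the integer 0
device = device_string(args.gpu)  # "cuda:0"
```

A string in the form "cuda:0" includes the device type the error asks for; alternatively, converting the id with `int(...)` before passing it on should have the same effect.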

Regarding attribute errors during the federated learning both in equal and unequal cases

While running the code, I got the following attribute errors. Can anyone explain the reasons for them?
For the equal case:

Traceback (most recent call last):
  File "src/federated_main.py", line 36, in <module>
    train_dataset, test_dataset, user_groups = get_dataset(args)
  File "C:\Users\sharm\Downloads\Federated-Learning-PyTorch-master\src\utils.py", line 41, in get_dataset
    user_groups = cifar_noniid(train_dataset, args.num_users)
  File "C:\Users\sharm\Downloads\Federated-Learning-PyTorch-master\src\sampling.py", line 173, in cifar_noniid
    labels = np.array(dataset.train_labels)
  File "C:\Users\sharm\.conda\envs\newEnv\lib\site-packages\torch\utils\data\dataset.py", line 83, in __getattr__
    raise AttributeError
AttributeError

For the unequal case:

Traceback (most recent call last):
  File "src/federated_main.py", line 36, in <module>
    train_dataset, test_dataset, user_groups = get_dataset(args)
  File "C:\Users\sharm\Downloads\Federated-Learning-PyTorch-master\src\utils.py", line 38, in get_dataset
    raise NotImplementedError()
NotImplementedError

regarding saving the file

While running the federated learning code and the MLP code, I am getting this error:
Traceback (most recent call last):
  File "src/federated_main.py", line 129, in <module>
    with open(file_name, 'wb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../save/objects/cifar_cnn_5_C[0.1]_iid[1]_E[10]_B[10].pkl'
170500096it [09:57, 285128.32it/s]
Do I have to create some files first?
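
`open(..., 'wb')` creates the file but not missing parent directories, so one option is to create them just before saving. A minimal sketch; `save_object` and the demo path are illustrative, not part of this repository.

```python
import os
import pickle
import tempfile


def save_object(obj, file_name):
    """Create the parent directory if needed, then pickle obj to file_name."""
    parent = os.path.dirname(file_name)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(file_name, "wb") as f:
        pickle.dump(obj, f)


# Demo in a temporary directory; the path layout below is illustrative.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "save", "objects", "results.pkl")
    save_object([0.9, 0.8], target)
    with open(target, "rb") as f:
        restored = pickle.load(f)
```

Note also that a relative path such as '../save/objects/' is resolved against the current working directory, so it points to a different place depending on where the script is launched from.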

New dataset app

Hi, I want to try the model on a new dataset. Which .py files will I need to change (utils.py and which others)?

CNNcifar got some problem to work

First, thanks for the amazing work. When I run the CNN on the CIFAR10 dataset, I get a runtime error, and I wonder how to solve it.

The error is raised at line 64 of update.py.

Thanks again for your work.

about the average_weights function

In the original paper, it uses a weighted average here. However, the implementation in average_weights is the simple average. Is there a bug or do I misunderstand something? Thanks!
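
For reference, a sketch of the sample-weighted average from the paper, where client k contributes with weight n_k / n. Weights are plain dictionaries of lists here instead of PyTorch state dicts, and `weighted_average` is a hypothetical name.

```python
def weighted_average(local_weights, num_samples):
    """Average per-client weights, weighting client k by n_k / sum(n)."""
    total = sum(num_samples)
    averaged = {}
    for key in local_weights[0]:
        length = len(local_weights[0][key])
        averaged[key] = [
            sum(w[key][i] * n for w, n in zip(local_weights, num_samples)) / total
            for i in range(length)
        ]
    return averaged


clients = [{"layer": [1.0, 2.0]}, {"layer": [3.0, 6.0]}]
equal = weighted_average(clients, num_samples=[50, 50])    # same as a simple mean
unequal = weighted_average(clients, num_samples=[10, 30])  # pulled toward client 2
```

When every client holds the same number of samples, the weighted and simple averages coincide, so the difference only shows up with unequal splits.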

copy.deepcopy(model), why?

Hello, your project has helped me a lot. Thank you very much. I have a question: why is copy.deepcopy(model) needed when implementing a federated learning model? It seems that without copy.deepcopy all models end up with the same weights, and only with it do the models differ. Why is that?
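
The behaviour can be reproduced without PyTorch. In the sketch below a dictionary stands in for the model and `train_in_place` is a hypothetical training step that mutates whatever object it receives.

```python
import copy

global_model = {"w": [0.0, 0.0]}


def train_in_place(model, step):
    """Hypothetical local training that mutates the model it is given."""
    model["w"][0] += step
    return model


# Without copying, every "client" is the same object as the global model.
shared = [train_in_place(global_model, step) for step in (1.0, 2.0)]
all_same_object = all(m is global_model for m in shared)

# With deepcopy, each client trains an independent copy.
fresh_global = {"w": [0.0, 0.0]}
independent = [train_in_place(copy.deepcopy(fresh_global), step) for step in (1.0, 2.0)]
```

Passing the model itself hands every client a reference to one shared object, so all updates accumulate in the same weights; `copy.deepcopy` gives each client its own parameters and leaves the global model untouched.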

AttributeError: 'CIFAR10' object has no attribute 'train_labels'

Files already downloaded and verified
Traceback (most recent call last):
  File "/Federated-Learning-PyTorch/src/sampling.py", line 282, in <module>
    d = cifar_noniid(dataset_train, num)
  File "/Federated-Learning-PyTorch/src/sampling.py", line 248, in cifar_noniid
    labels = np.array(dataset.train_labels)
AttributeError: 'CIFAR10' object has no attribute 'train_labels'
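
Newer torchvision releases expose labels as `targets` rather than `train_labels`. A hedged sketch of a helper that accepts either name; `get_labels` and the two stand-in classes are illustrative, not part of this repository or of torchvision.

```python
def get_labels(dataset):
    """Return the dataset's labels, trying `targets` before `train_labels`."""
    for name in ("targets", "train_labels"):
        labels = getattr(dataset, name, None)
        if labels is not None:
            return list(labels)
    raise AttributeError("dataset has neither `targets` nor `train_labels`")


class NewStyle:  # stand-in for a dataset exposing `targets`
    targets = [3, 1, 2]


class OldStyle:  # stand-in for a dataset exposing `train_labels`
    train_labels = [0, 5]


new_labels = get_labels(NewStyle())
old_labels = get_labels(OldStyle())
```

In sampling.py, the failing line could then read the labels through such a helper, or simply use `dataset.targets` if only recent torchvision versions need to be supported.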

Question

Hello, I am just curious: what does line 72 in update.py do? Does it forward the images through the model?
log_probs = model(images)
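
Calling a model object like a function runs its `forward` method. The toy classes below imitate that mechanism; they are a simplified stand-in, since the real `torch.nn.Module.__call__` also runs registered hooks.

```python
class Module:
    """Tiny stand-in for torch.nn.Module: calling the object runs forward()."""

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError


class Doubler(Module):
    def forward(self, images):
        return [2 * x for x in images]


model = Doubler()
log_probs = model([1, 2, 3])  # same result as model.forward([1, 2, 3])
```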

federated_main.py not working

Hi, I tried to run "python src/federated_main.py --model=cnn --dataset=cifar --gpu=0 --iid=1 --epochs=10", but it is not working (with any options for federated_main.py, including dataset and model).

I found several issues on your GitHub page and applied those fixes, but there seems to be an additional problem with the loop in federated_main.py.

Is anyone else suffering from the same issue, or has anyone fixed it?

The optimizer of clients is created every epoch?

Hi, thanks for the code.

According to the lines in update.py:

if self.args.optimizer == 'sgd':
    optimizer = torch.optim.SGD(model.parameters(), lr=self.args.lr,
                                momentum=0.5)
elif self.args.optimizer == 'adam':
    optimizer = torch.optim.Adam(model.parameters(), lr=self.args.lr,
                                 weight_decay=1e-4)

The optimizer is created for every epoch, is that correct?
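
What is at stake can be shown with a toy optimizer: a freshly created SGD-with-momentum optimizer starts with an empty momentum buffer. `MomentumSGD` below is a hypothetical scalar version written for illustration; it does not reproduce update.py.

```python
class MomentumSGD:
    """Minimal SGD with momentum for a single scalar parameter."""

    def __init__(self, lr, momentum):
        self.lr = lr
        self.momentum = momentum
        self.velocity = 0.0  # state that a fresh optimizer starts without

    def step(self, param, grad):
        self.velocity = self.momentum * self.velocity + grad
        return param - self.lr * self.velocity


def train(param, grads, recreate_each_step):
    opt = MomentumSGD(lr=0.01, momentum=0.5)
    for g in grads:
        if recreate_each_step:
            opt = MomentumSGD(lr=0.01, momentum=0.5)  # momentum buffer reset
        param = opt.step(param, g)
    return param


kept = train(1.0, [1.0, 1.0, 1.0], recreate_each_step=False)
reset = train(1.0, [1.0, 1.0, 1.0], recreate_each_step=True)
```

With the optimizer kept across steps, the velocity builds up and the parameter moves further; recreating it each time reduces the update to plain SGD.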

How to realize communication and "federated"?

  1. Why can I run "federated_main.py" on a single GPU (standalone deployment)? I got acc.png and loss.png, so I believe the script ran successfully. Is that right? Do the code and experiments involve a communication phase? Can this be called federated learning?
  2. If so, which lines of the code implement the communication?
  3. How can I get specific figures for the communication time and the volume of communicated data?

Looking forward to a reply. Many thanks!
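
In a single-process simulation the "communication" presumably amounts to passing weight dictionaries between objects in memory, so there is no network traffic to time. One rough way to estimate the volume is to serialize the weights and measure the result, as in this sketch; `measure_payload` is a hypothetical helper, and pickling plain lists only approximates what a real transfer of tensors would cost.

```python
import pickle
import time


def measure_payload(weights):
    """Serialize the weights as a stand-in for sending them over a network."""
    start = time.perf_counter()
    payload = pickle.dumps(weights)
    elapsed = time.perf_counter() - start
    return len(payload), elapsed


small = {"layer": [0.0] * 10}
large = {"layer": [0.0] * 1000}
small_bytes, small_seconds = measure_payload(small)
large_bytes, large_seconds = measure_payload(large)
```

Multiplying the payload size by the number of selected users and by two (download and upload) would give a rough per-round volume under these assumptions.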
