
Why is the accuracy of training with IID datasets lower than with the patho and diri splits? (federated-learning-in-pytorch, 14 comments, closed)

vaseline555 commented on August 15, 2024
Thank you very much for your kind contribution. I am using your FL architecture for my research purposes, but I have a concern: why is the accuracy of training with IID datasets lower than with the patho and diri splits? Would you explain the reason? In theory, training an ML model on IID data is expected to achieve higher accuracy.

from federated-learning-in-pytorch.

Comments (14)

vaseline555 commented on August 15, 2024

> but for patho and diri the results are sometimes growing sometimes lower than the previous iteration

Yes, it is a very NATURAL phenomenon!
(In technical terms, this is the statistical heterogeneity issue in federated learning.)
That is why federated learning in the wild is very difficult and is also sensitive to the hyperparameters I mentioned
(e.g., C, E, B, R, lr, optimizer, lr_decay, lr_decay_step).

You may try other advanced methods, e.g., FedProx, FedAdam, FedYogi, or FedAdagrad instead of FedAvg, or even personalized FL methods, if you are interested.
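For a feel of what such a method changes, here is a minimal sketch (my own illustration, not this repository's implementation; `fedprox_loss` and its signature are hypothetical) of the FedProx local objective, which adds a proximal term (mu/2)·||w − w_global||² to each client's loss to limit drift away from the global model under heterogeneity:

```python
import torch

def fedprox_loss(base_loss, model, global_params, mu=0.01):
    """Add the FedProx proximal term (mu/2) * ||w - w_global||^2 to a client's loss.

    `global_params` is the list of parameter tensors broadcast by the server
    at the start of the round; `mu` controls how strongly clients are pulled
    back toward the global model (mu = 0 recovers the plain FedAvg objective).
    """
    prox = sum(((w - w_g.detach()) ** 2).sum()
               for w, w_g in zip(model.parameters(), global_params))
    return base_loss + 0.5 * mu * prox
```

In each local step one would wrap the criterion, e.g. `loss = fedprox_loss(criterion(out, y), model, global_params, mu)`, before calling `backward()`; mu is typically tuned per dataset.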


vaseline555 commented on August 15, 2024

Dear @bbw251,
Thank you for your warm comments!

Could you please share example commands (IID & Patho & Diri, with the dataset name you used, K, R, E, B, etc.) so that I can reproduce the unexpected result?

Thank you!


vaseline555 commented on August 15, 2024

@bbw251
Hi, I have just tried the following commands with the MNIST dataset, K=100 clients, R=50 rounds, E=5, B=10, and C=0.1:

  1. IID
python3 main.py \
            --exp_name "FedAvg_MNIST_2NN_IID" --seed 42 --device cuda \
            --dataset MNIST \
            --split_type iid --test_size 0.1 \
            --model_name TwoNN --resize 28 --hidden_size 200 \
            --algorithm fedavg --eval_fraction 1 --eval_type both --eval_every 1 --eval_metrics acc1 acc5 \
            --K 100 --R 50 --E 5 --C 0.1 --B 10 --beta1 0 \
            --optimizer SGD --lr 0.01 --lr_decay 0.99 --lr_decay_step 25 --criterion CrossEntropyLoss
  2. Pathological (with mincls=2)
python3 main.py \
            --exp_name "FedAvg_MNIST_2NN_PATHO" --seed 42 --device cuda \
            --dataset MNIST \
            --split_type patho --mincls 2 --test_size 0.1 \
            --model_name TwoNN --resize 28 --hidden_size 200 \
            --algorithm fedavg --eval_fraction 1 --eval_type both --eval_every 1 --eval_metrics acc1 acc5 \
            --K 100 --R 50 --E 5 --C 0.1 --B 10 --beta1 0 \
            --optimizer SGD --lr 0.01 --lr_decay 0.99 --lr_decay_step 25 --criterion CrossEntropyLoss
  3. Dirichlet (with cncntrtn=0.01)
python3 main.py \
            --exp_name "FedAvg_MNIST_2NN_DIRI" --seed 42 --device cuda \
            --dataset MNIST \
            --split_type diri --cncntrtn 0.01 --test_size 0.1 \
            --model_name TwoNN --resize 28 --hidden_size 200 \
            --algorithm fedavg --eval_fraction 1 --eval_type both --eval_every 1 --eval_metrics acc1 acc5 \
            --K 100 --R 50 --E 5 --C 0.1 --B 10 --beta1 0 \
            --optimizer SGD --lr 0.01 --lr_decay 0.99 --lr_decay_step 25 --criterion CrossEntropyLoss

Here are the corresponding results:

  1. IID FedAvg_MNIST_2NN_IID_231013_122100.log
  • Central evaluation: loss: 0.1455 | acc1: 0.9553 | acc5: 0.9987
  • Local evaluation:
    • Loss: Avg. (0.1426) Std. (0.0785) | Top 10% (0.3214) Std. (0.0337) | Bottom 10% (0.0349) Std. (0.0179)
    • Acc1: Avg. (0.9555) Std. (0.0236) | Top 10% (0.9926) Std. (0.0082) | Bottom 10% (0.9130) Std. (0.0070)
    • Acc5: Avg. (0.9987) Std. (0.0045) | Top 10% (1.0000) Std. (0.0000) | Bottom 10% (0.9869) Std. (0.0070)
  2. Pathological FedAvg_MNIST_2NN_PATHO_231013_122126.log
  • Central evaluation: loss: 0.5953 | acc1: 0.7880 | acc5: 0.9932
  • Local evaluation:
    • Loss: Avg. (0.6345) Std. (0.4361) | Top 10% (1.4864) Std. (0.3184) | Bottom 10% (0.1114) Std. (0.0312)
    • Acc1: Avg. (0.7779) Std. (0.1661) | Top 10% (0.9816) Std. (0.0119) | Bottom 10% (0.4675) Std. (0.0888)
    • Acc5: Avg. (0.9939) Std. (0.0125) | Top 10% (1.0000) Std. (0.0000) | Bottom 10% (0.9628) Std. (0.0132)
  3. Dirichlet FedAvg_MNIST_2NN_DIRI_231013_122148.log
  • Central evaluation: loss: 0.8174 | acc1: 0.7071 | acc5: 0.9865
  • Local evaluation:
    • Loss: Avg. (0.4397) Std. (0.3672) | Top 10% (1.0558) Std. (0.2559) | Bottom 10% (0.0301) Std. (0.0086)
    • Acc1: Avg. (0.8454) Std. (0.1938) | Top 10% (1.0000) Std. (0.0000) | Bottom 10% (0.4545) Std. (0.0962)
    • Acc5: Avg. (0.9956) Std. (0.0042) | Top 10% (1.0000) Std. (0.0000) | Bottom 10% (0.9865) Std. (0.0113)

According to these demo experiments, I found the expected behavior: the performance of the IID split is better than that of the Dirichlet distribution-based or pathological non-IID splits.

If that is not the case for you, I presume it might be due to some configurations such as local batch size, client sampling ratio, optimizer, learning rate scheduling, and so forth.
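For intuition on why a tiny concentration parameter (cncntrtn=0.01) makes the split so hard, here is a rough sketch of Dirichlet label partitioning (a simplified stand-in I wrote for illustration, not this repo's actual splitter): for every class, client shares are drawn from Dirichlet(alpha), so a small alpha pushes each class onto only a few clients:

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=42):
    """Assign sample indices to clients via per-class Dirichlet(alpha) shares.

    Small alpha (e.g., 0.01) -> each class lands on very few clients (non-IID);
    large alpha (e.g., 100) -> class proportions are nearly uniform (close to IID).
    """
    rng = np.random.default_rng(seed)
    clients = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        shares = rng.dirichlet([alpha] * n_clients)  # client proportions for class c
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for k, chunk in enumerate(np.split(idx, cuts)):
            clients[k].extend(chunk.tolist())
    return clients
```

With alpha near 0 most clients end up dominated by a single class, which is exactly the regime where FedAvg's round-to-round accuracy oscillates.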

Please let me know if there are other issues that I didn't catch.
Thank you!


bbw251 commented on August 15, 2024

Thank you for your fast response; I appreciate your dedication, Adam! The results you reproduced above are the expected ones. Let me provide my parameter settings; maybe that is the cause. Here is one example of the command:
for e in 1
do
  for s in 'patho' 'iid' 'diri'
  do
    python main.py \
        --exp_name "FedAvg_MNIST_CNN_BN_${s}_e${e}" --seed 42 --device cpu \
        --dataset MNIST \
        --split_type $s --test_fraction 0 --mincls 2 \
        --model_name OneCNN --resize 28 --hidden_size 5 \
        --algorithm fedavg --eval_fraction 1 --eval_type local --eval_every 1 --eval_metrics acc1 f1 precision recall \
        --K 5 --R 100 --E $e --C 1 --B 30 --beta 0 \
        --optimizer SGD --lr 0.01 --lr_decay 0.95 --lr_decay_step 20 --criterion CrossEntropyLoss
  done
done

Note: I have also tried B = 10, 20 and C = 0.1. (There is test_size in the previous code; maybe you mean test_fraction.)
The other parameters are the default settings you provided in the main function.
Thank you!


vaseline555 commented on August 15, 2024

@bbw251
TL;DR: please pull the newest version of this repository and run it again!

It seems like you customized the model (i.e., the one named OneCNN in your command), and it has a batch-norm layer in it, right?
Since I don't have the OneCNN architecture at hand, I cannot assert the main cause of your situation.

Instead, I STRONGLY recommend that you pull the newest version of this repository first and then run your code again.
There were bugs in the previous version of this repository that specifically occurred when using a model with batchnorm layers.

The running mean and variance should also be federated on top of the BN layer's weight and bias (commonly denoted gamma and beta), but in the previous version only gamma and beta were communicated with the server.
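To see why this matters in code: BatchNorm's running statistics are registered as buffers, not parameters, so any aggregation loop that only iterates over `named_parameters()` silently skips them. A quick sketch:

```python
import torch.nn as nn

# A toy model with one BatchNorm layer, just to inspect its state.
model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))

param_keys = {name for name, _ in model.named_parameters()}  # gamma/beta live here
state_keys = set(model.state_dict().keys())                  # ...plus the BN buffers

# Everything a server would miss if it exchanged only the parameters:
missing = state_keys - param_keys
print(sorted(missing))  # ['1.num_batches_tracked', '1.running_mean', '1.running_var']
```

Exchanging the full `state_dict()`, or syncing the buffers separately, avoids the mismatch.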

Thank you.

Best,
Adam



bbw251 commented on August 15, 2024

For the new version, could you point out the modified parts? I have changed many parts for my own research purposes. I can see you modified the fedavgserver.py file and some lines of fedavg.py and model.py.


vaseline555 commented on August 15, 2024

Since many parts were updated, I cannot pinpoint which parts differ from yours (plus, I have no idea which version you are working on).
You can find all the changes in the commit history:
https://github.com/vaseline555/Federated-Learning-in-PyTorch/commits/main

Thank you.


bbw251 commented on August 15, 2024

The new version is not stable, especially for patho. I tried it without any change except setting the device to cpu:
for b in 0 10
do
  for c in 0.0 0.1 0.2 0.5 1.0
  do
    "C:/Users/A0923/AppData/Local/Programs/Python/Python311/python.exe" main.py \
        --exp_name "FedAvg_MNIST_CNN_Patho_C${c}_B${b}" --seed 42 --device cpu \
        --dataset MNIST \
        --split_type patho --test_size 0 \
        --model_name TwoCNN --resize 28 --hidden_size 200 \
        --algorithm fedavg --eval_fraction 1 --eval_type both --eval_every 1 --eval_metrics acc1 acc5 \
        --K 100 --R 1000 --E 5 --C $c --B $b --beta1 0 \
        --optimizer SGD --lr 0.1 --lr_decay 0.99 --lr_decay_step 10 --criterion CrossEntropyLoss
  done
done
I am not sure the device can make that much difference. I watched about 20 iterations: IID starts from an accuracy of 60 and patho starts from an accuracy of 55 in the first epoch. I am confused!


vaseline555 commented on August 15, 2024

What do you mean by 'unstable', and what do you refer to as 'accuracy'? Is it the central evaluation or the averaged local evaluations?
I don't get the exact problem in this statement: "iid starts from accuracy of 60 and patho starts from accuracy of 55 from the first epoch".
What did you expect from each setting (IID and Patho)?
The convergence speed can vary across combinations of hyperparameters (e.g., C, E, B, lr, lr_decay, lr_decay_step).


vaseline555 commented on August 15, 2024

I still cannot understand the exact problem you are facing...

You said,

> iid starts from accuracy of 60 and patho starts from accuracy of 55 from the first epoch

Then the initial performance of the IID setting is better than that of the Pathological setting, right?
Is this problematic? Isn't this what you originally expected (i.e., IID > Patho)?

  • If it is, could you clarify what is the problem, in detail?

  • If it is not, please also provide detailed information (what results you originally expected & what you actually got with your commands).

Per your uploaded log,

> Here is the example trained for 30 epochs, i am concerned on the "server_evaluated": acc1

What is the issue with the metric "server_evaluated": acc1?
Is it problematic because it showed too low accuracies (~10%)?

  • If it is, it is a very natural result, since you have set C=0 and B=0.
    In theory, C should be neither too small nor too big (see Li et al., 2020), and the batch size (B) should also be appropriately tuned along with the lr.

  • If the low accuracy is NOT the issue, please also clarify your problem in detail.


bbw251 commented on August 15, 2024

I think the server evaluation result is expected to grow in every iteration. I trained with all three data splitting types for 20 global epochs each and found that training with IID keeps improving, but for patho and diri the results sometimes grow and are sometimes lower than in the previous iteration. I have attached the results for patho and diri below. That is what I meant.
FedAvg_MNIST_2NN_diri.json
FedAvg_MNIST_2NN_PATHO.json

Note: the parameter settings are as you provided three days before.
Thank you!


bbw251 commented on August 15, 2024

> Yes, it is a very NATURAL phenomenon!
> (a.k.a. statistical heterogeneity issue in federated learning, in a technical term)

Thank you very much. I want to train in HFL (hierarchical FL). Would you share with me valuable resources or code repositories designed for that, especially for wireless communications?

I am thankful indeed dear Adam!


vaseline555 commented on August 15, 2024

Sorry, I am not familiar with the concept of HFL; you may find related resources on Google.
For example, "hierarchical federated learning github" may be a good search keyword for your case.
Hope this helps!
Thank you.

Best,
Adam

