
deepsec's Introduction

DEEPSEC: A Uniform Platform for Security Analysis of Deep Learning Models

1. Description

DEEPSEC is the first implemented uniform platform for evaluating and securing deep learning models; it comprehensively and systematically integrates state-of-the-art adversarial attacks, defenses, and the utility metrics associated with them.


1.1 Citation:

Xiang Ling, Shouling Ji, Jiaxu Zou, Jiannan Wang, Chunming Wu, Bo Li and Ting Wang, DEEPSEC: A Uniform Platform for Security Analysis of Deep Learning Models, IEEE S&P, 2019

1.2 Glance at the DEEPSEC Repo:

  • RawModels/ contains the code used to train models, as well as the trained models that will be attacked;
  • CleanDatasets/ contains the code used to randomly select the clean samples to be attacked;
  • Attacks/ contains the implementations of the attack algorithms;
  • AdversarialExampleDatasets/ collects the adversarial examples generated by the various attacks;
  • Defenses/ contains the implementations of the defense algorithms;
  • DefenseEnhancedModels/ collects the defense-enhanced (re-trained) models;
  • Evaluations/ contains the code used to evaluate the utility of attacks/defenses and the security performance between attacks and defenses.

1.3 Requirements:

Make sure you have installed all of the following packages or libraries (including any dependencies) on your machine:

  1. PyTorch 0.4
  2. TorchVision 0.2
  3. numpy, scipy, PIL, skimage, tqdm ...
  4. Guetzli: A Perceptual JPEG Compression Tool

1.4 Datasets:

We mainly employ two benchmark datasets: MNIST and CIFAR10.

2. Usage/Experiments

STEP 1. Training the raw models and preparing the clean samples

We first train and save the deep learning models for MNIST and CIFAR10 (under RawModels/), and then randomly select and save the clean samples that will be attacked (under CleanDatasets/).
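
For orientation, here is a minimal sketch of what this step amounts to in plain PyTorch; the tiny fully-connected model, the epoch count, and the save path are illustrative assumptions, not the repository's actual training scripts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model; the repository defines its own MNIST/CIFAR10 architectures in RawModels/.
model = nn.Sequential(nn.Linear(28 * 28, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)

train_loader = DataLoader(
    datasets.MNIST("./data", train=True, download=True, transform=transforms.ToTensor()),
    batch_size=128, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):                                   # epoch count is illustrative
    for x, y in train_loader:
        x, y = x.view(x.size(0), -1).to(device), y.to(device)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()

torch.save(model.state_dict(), "./RawModels/MNIST/model.pt")  # hypothetical save path
```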

STEP 2. Generating adversarial examples

Taking the trained models and the clean samples as input, we generate the corresponding adversarial examples for each kind of adversarial attack implemented in ./Attacks/, and save these adversarial examples under ./AdversarialExampleDatasets/.
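
As a concrete illustration of what this step produces, here is a minimal FGSM sketch in plain PyTorch; it is not the repository's own attack class, and the eps value, tensor names, and save path are assumptions:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.3):
    """One-step FGSM: move each pixel by eps in the direction of the loss gradient's sign."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    x_adv = x_adv + eps * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()          # keep pixels in the valid [0, 1] range

# x_clean, y_clean are the samples selected in STEP 1 (names are hypothetical):
# x_adv = fgsm(model, x_clean, y_clean, eps=0.3)
# torch.save(x_adv, "./AdversarialExampleDatasets/FGSM/MNIST/adv.pt")
```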

STEP 3. Preparing the defense-enhanced models

With the defense parameter settings, we obtain the corresponding defense-enhanced model from the original model for each kind of defense method implemented in ./Defenses/, and save these re-trained defense-enhanced models under ./DefenseEnhancedModels/.

STEP 4: Evaluation of Utility/Security Performance

After generating the adversarial examples and obtaining the defense-enhanced models, we evaluate the utility performance of attacks and defenses, respectively. In addition, we can test the security performance between attacks and defenses, i.e., whether and to what extent the state-of-the-art defenses can defend against attacks. All of this code can be found in ./Evaluations/.
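
A minimal sketch of one such measurement, the misclassification rate of a (defense-enhanced) model on a saved set of adversarial examples; the file paths and variable names are assumptions:

```python
import torch

def misclassification_rate(model, x_adv, y_true):
    """Fraction of adversarial examples that the model classifies incorrectly."""
    with torch.no_grad():
        preds = model(x_adv).argmax(dim=1)
    return (preds != y_true).float().mean().item()

# x_adv = torch.load("./AdversarialExampleDatasets/FGSM/MNIST/adv.pt")   # hypothetical path
# y_true = torch.load("./CleanDatasets/MNIST/labels.pt")                 # hypothetical path
# print(misclassification_rate(defended_model, x_adv, y_true))
```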

3. Update

If you want to contribute new algorithms or updated implementations of existing algorithms, please let us know.

UPDATE: All code has been restructured for better readability and adaptability.


deepsec's Issues

Reporting success rate of unbounded attacks is meaningless

Three of the attacks presented (EAD, CW2, and BLB) are unbounded attacks: rather than finding the “worst-case” (i.e., highest-loss) example within some distortion bound, they seek the closest input subject to the constraint that it is misclassified. Unbounded attacks should always reach 100% “success” eventually, if only by actually changing an image of one class into an image of another class; the correct and meaningful metric to report for unbounded attacks is the distortion required.
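
For concreteness, a minimal sketch of the metric the issue calls for, the average distortion of the generated examples rather than a success rate (tensor names are assumptions):

```python
import torch

def mean_l2_distortion(x_clean, x_adv):
    """Average L2 distance between clean inputs and their adversarial counterparts.
    For unbounded attacks, which eventually succeed on (nearly) every input, this
    is the number worth reporting."""
    diff = (x_adv - x_clean).view(x_clean.size(0), -1)
    return diff.norm(p=2, dim=1).mean().item()
```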

PGD/BIM implementation is incorrect

The PGD (and BIM) implementation in this repository is significantly less effective than reported in prior work. In Table XIV, PGD (or BIM) appears to succeed 82.4% (or 75.6%) of the time. When I run the code in the repository, I get a very similar result: 82.5% (or 74.2%).

This should be somewhat surprising given that prior work reports PGD and BIM succeed nearly 100% of the time with the same distortion bound of 0.3. See for example Figure 4 of Madry et al. (2018), or Table IV of Carlini & Wagner (2017). Indeed, when I put a loop around my FGSM call (using the approach discussed in #3), I reach a 100% attack success rate with both BIM and PGD.
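
For reference, the “loop around FGSM” is just the standard PGD iteration; a minimal sketch in plain PyTorch, with the step size and iteration count chosen as assumptions rather than taken from any particular paper:

```python
import torch
import torch.nn.functional as F

def pgd(model, x, y, eps=0.3, alpha=0.01, steps=40, random_start=True):
    """Iterated FGSM with projection back into the l_infinity eps-ball
    (random_start=True gives PGD, random_start=False gives BIM)."""
    x_adv = x.clone().detach()
    if random_start:
        x_adv = (x_adv + torch.empty_like(x_adv).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)   # project into the eps-ball
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```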

I have not investigated the cause of this discrepancy further.

It is deeply concerning that I have now checked five results (FGSM/PGD/BIM/JSMA/PAT) and all of them have issues (#3 / this issue / #14 / #4). Did you cross-check the results of your attacks against any other libraries?

Comparing attack effectiveness is done incorrectly

Using the data provided, it is not possible to compare the efficacy of different attacks across models. Imagine we would like to decide whether LLC or ILLC was the stronger attack on the CIFAR-10 dataset.

Superficially, I might look at the “Average” column and see that the average model accuracy under LLC is 39.4% compared to 58.7% accuracy under ILLC. While in general averages in security can be misleading, fortunately, for all models except one, LLC reduces the model accuracy more than ILLC does, often by over twenty percentage points.

A reasonable reader might therefore conclude (incorrectly!) that LLC is the stronger attack. Why is this conclusion incorrect? The LLC attack only succeeded 134 times out of 1000 on the baseline CIFAR-10 model. Therefore, when the paper writes that the accuracy of PGD adversarial training under LLC is 61.2%, what this number means is that 38.8% of the adversarial examples that are effective on the baseline model are also effective on the adversarially trained model. How the model would perform on the other 866 examples is not reported. In contrast, when the base model is evaluated on the ILLC attack, the attack succeeded on all 1000 examples. The 83.7% accuracy obtained by adversarial training is therefore inherently incomparable to the 61.2% value.
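
One way to make such numbers comparable is to report the defended model's accuracy over all attacked inputs (all 1000 here), rather than only over the subset that happened to fool the undefended baseline; a minimal sketch (variable names are assumptions):

```python
import torch

def accuracy_under_attack(defended_model, x_adv_all, y_true):
    """Accuracy of the defended model over ALL attacked inputs, not just those
    whose adversarial examples already fooled the undefended baseline model."""
    with torch.no_grad():
        preds = defended_model(x_adv_all).argmax(dim=1)
    return (preds == y_true).float().mean().item()
```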

Attack success rate decreases with distortion bound

It is a basic observation that when given strictly more power, the adversary should never do worse. However, in Table VII the paper reports that MNIST adversarial examples with their l_infinity norm constrained to be less than 0.2 are harder to detect than when constrained to be within 0.5. The reason this table shows this effect is that FGSM, a single-step method, is used to generate these adversarial examples. The table should be re-generated with an optimization-based attack (that actually targets the defense; not a transfer attack) to give meaningful numbers.

Epsilon values studied are too large to be meaningful

On at least two counts, the paper chooses l_infinity distortion bounds that are not well motivated.

  • Throughout the paper, the report studies CIFAR-10 distortions of eps=0.1 and eps=0.2. These values are 3x (or 6x) larger than what is typically studied in the literature. When CIFAR-10 images are perturbed with noise of distortion 0.1, they are often difficult for humans to classify correctly; I'm aware of no other work that studies CIFAR-10 robustness at such extremely high distortion bounds.

  • The paper studies l_infinity distortion bounds as high as eps=0.6 in Table VII, on both MNIST and CIFAR-10, a value so high that any image can be converted to solid grey (and beyond). The entire purpose of bounding the l_infinity norm of adversarial examples is to ensure that the actual true class has not changed. Choosing a distortion bound so large that every image can be converted to a solid grey image fundamentally misunderstands the purpose of the distortion bound.

PGD adversarial training implementation is incorrect

While the idea of adversarial training is straightforward (generate adversarial examples during training and train on those examples until the model learns to classify them correctly), in practice it is difficult to get right. The basic idea has been independently developed at least twice and was the focus of several papers before all of the right ideas were combined by Madry et al. to form the strongest defense to date. A cursory analysis reveals at least three flaws in the re-implementation of this defense:

  • Incorrect loss function. The loss function used in the original paper is a loss on the adversarial examples only, whereas this paper mixes adversarial examples and original examples to form the loss function (see the sketch after this list).

  • Incorrect model architecture. In the original paper, the authors make three claims for the novelty of their method. One of these states: “To reliably withstand strong adversarial attacks, networks require a significantly larger capacity than for correctly classifying benign examples only.” The code that re-implements this defense does not follow this advice and instead uses a substantially smaller model than recommended.

  • Incorrect hyperparameter settings. The original paper trains its MNIST model for 83 epochs; in contrast, the paper here trains for only 20 epochs (roughly 4x fewer iterations).
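
A minimal sketch of the training step the first bullet refers to, where the loss is taken on adversarial examples only; it reuses the pgd() helper sketched under the PGD/BIM issue, and the hyperparameters are assumptions, not the settings of either paper:

```python
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps=0.3):
    """One step of Madry-style adversarial training: generate adversarial examples
    for the current batch and compute the loss on those examples only, rather than
    on a mixture of clean and adversarial inputs."""
    x_adv = pgd(model, x, y, eps=eps, alpha=0.01, steps=40)   # pgd() as sketched above
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)                   # adversarial examples only
    loss.backward()
    optimizer.step()
    return loss.item()
```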

Possibly because of these implementation differences, the DeepSec report finds (incorrectly) that a more basic form of adversarial training performs better than PGD adversarial training.

I didn't re-implement any of the other defenses; the fact that I'm not raising other issues is not because there are none, just that I didn't look for any others.

Paper uses averages instead of the minimum for security analysis

Perhaps the one key factor that differentiates security (and adversarial robustness) from other general forms of robustness is the worst-case mindset from which we evaluate. This paper uses the mean throughout to evaluate both attacks and defenses.

Using the mean over various attacks to compute the “security” of a defense completely misunderstands what it means to perform a security evaluation in the first place.
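
A minimal sketch of the difference, using illustrative numbers that are not taken from the paper:

```python
# Accuracy of one hypothetical defense under several attacks (illustrative values only).
accuracy_under_attack = {"FGSM": 0.90, "PGD": 0.02, "CW2": 0.05}

mean_robustness = sum(accuracy_under_attack.values()) / len(accuracy_under_attack)
worst_case_robustness = min(accuracy_under_attack.values())

print(mean_robustness)        # ~0.32: the misleading "average security"
print(worst_case_robustness)  # 0.02: what a security evaluation should report
```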

For example, the paper bolds the column for the NAT defense when evaluated on CIFAR-10 because it gives the highest “average security” against all attacks. However, this is fundamentally the incorrect evaluation to make: the only metric that matters in security is how well a defense withstands attacks targeting that defense. And in this setting, the alternate adversarial training approach of Madry et al. is strictly stronger.

For this reason, when the paper says that all the defenses are "more or less" effective, it completely misrepresents what is actually going on. In fact, almost all of the defenses studied offer 0% robustness to any actual attack. By misrepresenting this fact, the work of Madry et al. and Xu et al., which actually does mostly satisfy its security claims, is not given the credit it deserves.

Significant and fundamental flaws in methodology, analysis, and conclusions

This framework is designed to "systematically evaluate the existing adversarial attack and defense methods". The research community would be well served by such an analysis. When new defenses are proposed, authors must choose which set of attacks to apply in order to perform an evaluation. A systematic evaluation of which attacks have been most effective in the past could help inform the decision of which attacks should be tried in the future. Similarly, when designing new attacks, a comprehensive review of defenses could help researchers decide which defenses to test against.

Unfortunately, the analysis performed in the DeepSec paper is fundamentally flawed and does not achieve any of these goals. It neither accurately measures the power of attacks nor the efficacy of defenses. I have filed a number of issues that summarize the many ways in which the report is misleading in its methodology and analysis. (Almost all of the conclusions are misleading as a result of these other flaws. I do not comment on the conclusions, but I expect they will need to be completely rewritten once true results are obtained.)

The issues raised are ordered roughly by importance:

#1 Attacks are not run on defenses in an all-pairs manner
#2 Paper uses averages instead of the minimum for security analysis
#3 FGSM implementation is incorrect
#4 PGD adversarial training is implemented incorrectly
#5 Computing the average over different threat models is meaningless
#6 Comparing attack effectiveness is done incorrectly
#7 Epsilon values studied are too large to be meaningful
#8 Detection defenses set per-attack thresholds
#9 Attack success rate decreases with distortion bound
#10 Reporting success rate of unbounded attacks is meaningless
#11 Paper does not report attack success rate for targeted adversarial examples
#12 Discrepancies between tables, text, and code

Computing the average over different threat models is meaningless

Security is all about worst-case guarantees. Despite this fact, the paper draws many of its inferences by looking at average-case robustness.

This is fundamentally flawed.

If a defense gives 0% robustness against one attack and 100% robustness against another attack, the defense is not "50% robust". It is 0% robust: completely broken and ineffective.

Now, this doesn't preclude the defense from being possibly useful or informative in some settings. But it cannot in good faith be called partially secure. If a defense claims l_2 robustness and an l_2 attack can generate adversarial examples on it with distortion similar to that of an undefended model, then it is broken. The fact that some other l_2 attack fails to generate adversarial examples is irrelevant.

Averaging across multiple different attacks, many of which are weak single-step attacks, artificially inflates the apparent robustness. Imagine there were another row that measured robustness to uniform random noise within the distortion bound: by adding this "attack", all defenses would suddenly appear more robust, which is clearly not the case.

Discrepancies between tables, text, and code

Table XIII states that on CIFAR-10 the R+FGSM attack was executed with eps=0.05 and alpha=0.05, whereas the README in the Attack module of the open-source code gives eps=0.1 and alpha=0.5. I assume the code is correct and the table is wrong. Table XIII also states that the “box” constraint for CWL2 is set to [-0.5, 0.5], but in the code the (correct) values of [0.0, 1.0] are used.

Other hyperparameters are missing entirely (e.g., Table XIII does not give the number of iterations used for any of the gradient-based attacks). This is especially confusing when the default values differ from the original attack implementations; for example, this code sets the number of binary search steps for CW2 to 5 (and does not state this in the paper), whereas the original code uses the value 10. Fortunately, this setting often has only a minimal impact on accuracy.

Attacks are not run on defenses in an all-pairs manner

The only meaningful way to evaluate a defense is to measure the effectiveness of attacks run against it.

This paper does not actually measure this, however. It generates adversarial examples on a baseline model and then tests them on different defenses, and uses this as a way to assess the supposed robustness of the various defenses.

This basic flaw completely undermines the purpose of a security evaluation.

As a point of comparison, imagine that I were designing a new computer architecture intended to be secure against memory corruption vulnerabilities. I do this by taking a pre-existing computer architecture and, instead of designing it as little-endian or big-endian, implementing some new “middle-endian” layout where the least significant byte is put in the middle of the word. This crazy new architecture would appear to be perfectly robust against all existing malware. However, it would be fundamentally incorrect to call this new computer architecture “more secure”: the only thing I have done is superficially stop existing exploits from working on the new system.

In the context of adversarial examples, note that this type of analysis is not useless and does tell us something: it tells us something useful about the ability of these attacks to transfer, and about the ability of the models to resist transfer attacks. If the paper had made this observation and drawn its conclusions from this perspective, then at least the fundamental idea behind the table would have been correct. (None of the remaining errors would be resolved, still.)

Worryingly, the DeepSec code itself does not appear to support running any of the attacks on a new defense model. It looks like the code natively supports only loading raw model files into the attacks.
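
For concreteness, an all-pairs evaluation would look roughly like the sketch below: every attack is run directly against every defense-enhanced model, rather than only against the raw model. The attack callables and model dictionaries are placeholders and do not correspond to an actual DeepSec interface.

```python
import torch

def all_pairs_evaluation(defended_models, attacks, x_clean, y_clean):
    """Run every attack directly against every defense-enhanced model.
    `defended_models` maps defense names to loaded models; `attacks` maps attack
    names to callables such as the pgd() sketch above. Both are hypothetical."""
    results = {}
    for defense_name, model in defended_models.items():
        for attack_name, attack_fn in attacks.items():
            x_adv = attack_fn(model, x_clean, y_clean)        # attack the DEFENDED model
            with torch.no_grad():
                preds = model(x_adv).argmax(dim=1)
            results[(defense_name, attack_name)] = (preds == y_clean).float().mean().item()
    return results
```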

Fixing this fatal and fundamental error in the paper's evaluation will not be easy. Many of the defenses are non-differentiable or cause gradient masking, which is why the original papers believed their defenses were secure to begin with. Performing a proper security evaluation necessarily requires adapting attacks to defenses. I see no way to resolve this issue and correct the paper other than devoting significant and extensive work to performing this analysis correctly.

Paper does not report attack success rate for targeted adversarial examples

When measuring how well targeted attacks work, the metric should be the targeted attack success rate. However, Table V measures the model misclassification rate. This is not the right way to measure it.
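
A minimal sketch of the metric the issue argues for (tensor names are assumptions):

```python
import torch

def targeted_success_rate(model, x_adv, y_target):
    """Fraction of adversarial examples classified as the attacker's chosen target
    label; for targeted attacks this, not the misclassification rate, is the metric."""
    with torch.no_grad():
        preds = model(x_adv).argmax(dim=1)
    return (preds == y_target).float().mean().item()
```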

It's also unclear why PGD and BIM are listed as untargeted attacks and not as targeted attacks, when they work both ways (i.e., CW2 is the same, and could just as easily be classified as an untargeted attack).

Detection defenses set per-attack thresholds

In Table VI the paper analyzes three different detection defenses. In this table, the paper reports the true positive rate and false positive rate of the defenses against various attacks. In doing so, the paper varies the detection threshold to make comparisons fair and says: “we try our best to adjust the FPR values of all detection methods to the same level via fine-tuning the parameters.”

However, the paper varies the defense settings on a per-attack basis. This is not a valid thing to do.

When performing a security analysis between the attacker and defender, it is always important to recognize that one of the players goes first and commits to an approach, and the second player then tries to defeat the first. With adversarial example defenses, it is the defender who commits first and the attacker who then tries to find an instance that evades the defense.

As such, it is meaningless to allow the defender to alter the detection hyperparameters depending on which attack will be encountered. If the defender knew which attack was going to be presented, they could do much better than just selecting a different hyperparameter setting for the detection threshold.

Worse yet, because the exact threshold varies, the false positive rates presented in the table actually range from 1.5% to 9.0%. Comparing the true positive rate of two defenses when the corresponding false positive rates vary by a factor of six is meaningless. Worse still, computing the mean TPR across a range of attacks when the FPR varies by a factor of six yields a completely uninterpretable value.

It would be both simpler and more accurate to use a validation set to choose the FPR once for all attacks, and then report the TPR on each attack using the same threshold.
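
A minimal sketch of that procedure, assuming the detector exposes a scalar score where higher means “more likely adversarial”; the 5% FPR target and all names are assumptions:

```python
import torch

def pick_threshold(benign_scores, target_fpr=0.05):
    """Choose ONE detection threshold on held-out benign validation data so that the
    false positive rate is approximately target_fpr, before seeing any attack."""
    k = int((1.0 - target_fpr) * benign_scores.numel())
    return benign_scores.sort().values[k - 1].item()

def tpr_at_threshold(adv_scores, threshold):
    """True positive rate on one attack's adversarial examples at the fixed threshold."""
    return (adv_scores > threshold).float().mean().item()

# threshold = pick_threshold(scores_on_validation_benign)      # fixed once, for all attacks
# for attack_name, scores in scores_per_attack.items():        # hypothetical dict
#     print(attack_name, tpr_at_threshold(scores, threshold))
```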

JSMA implementation is incorrect

The JSMA implementation in this repository is significantly less effective than reported in prior work. In Table XIV, JSMA appears to succeed 76% of the time. When I run the code in the repository, I get a very similar result: 72.3%.

This should be somewhat surprising given that prior work reports JSMA succeeds above 90% of the time with the same distortion bound of 10% of pixels changed. Unfortunately, Papernot et al. (2016) uses a bound of 14.5% and so is not directly comparable, but in Carlini & Wagner (2016) we re-implemented JSMA and found a 90% attack success rate at 78. Indeed, when I run the JSMA attack from CleverHans on this exact same network (using the approach discussed in #3), I reach a 95% attack success rate.

Investigating a bit further, I observe that when attacking a solid-black image and targeting each possible label 0 through 9, the code in this repository returns substantially different adversarial examples than the CleverHans code on which it is based.

FGSM implementation is incorrect

Despite the simplicity of the Fast Gradient Sign Method, it is surprisingly effective at generating adversarial examples on unsecured models. However, Table XIV reports the misclassification rate of FGSM at eps=0.3 on MNIST as 30.4%, significantly less effective than expected given the results of prior work.

I investigate this further by taking the one-line script and following the README to run the FGSM attack on the baseline MNIST model. Doing this yields a misclassification rate of 38.3%. It is mildly concerning that this number is 25% larger than the value reported in the paper, and I'm unable to account for this statistically significant deviation from what the code returns. However, this error is only of secondary concern: as prior work indicates, the success rate of FGSM should be substantially higher.

So I compare against the result of attacking with the CleverHans framework. Because DeepSec is implemented in PyTorch and CleverHans only supports TensorFlow, I load the DeepSec pre-trained PyTorch model weights into a TensorFlow model and generate adversarial examples on this model with the CleverHans implementation of FGSM. CleverHans obtains a 61% misclassification rate, over double the misclassification rate reported in the DeepSec paper. To confirm that the results I obtain are correct, I save these adversarial examples and run the original DeepSec PyTorch model on them, again finding a misclassification rate of 61%. I'm currently not able to explain how DeepSec incorrectly implemented FGSM; however, the fact that the simplest attack is implemented incorrectly is deeply concerning.

The remainder of the issues I'm filing on DeepSec therefore discusses only the methodology and analysis, and not any specific numbers which may or may not be trustworthy.
