1 Introduction
Batch norm is a standard component of modern deep neural networks, and in many cases it makes the training process less sensitive to the choice of hyperparameters [13]. While ease of training is desirable for model developers, an important concern among stakeholders is that of model robustness to plausible, previously unseen inputs during deployment. The adversarial examples phenomenon has exposed unstable predictions across state-of-the-art models [27]. This has led to a variety of methods that aim to improve robustness, but doing so effectively remains a challenge [1, 20, 11, 14]. We believe that a prerequisite to developing methods that increase robustness is an understanding of factors that reduce it.

Approaches for improving robustness often begin with existing neural network architectures that use batch norm, and patch them against specific attacks, e.g., through inclusion of adversarial examples during training [27, 9, 15, 16]. An implicit assumption is that batch norm itself does not reduce robustness; we tested this assumption empirically and found it to be invalid. The original work that introduced batch norm suggested that other forms of regularization can be turned down or disabled when using it without decreasing standard test accuracy. Robustness, however, is less forgiving: it is strongly impacted by the disparate mechanisms of various regularizers. The frequently made observation that adversarial vulnerability can scale with the input dimension [9, 8, 24] highlights the importance of identifying regularizers as more than merely a way to improve test accuracy. In particular, batch norm was a confounding factor in [24], causing the results of their initialization-time analysis to hold after training. By adding regularization and removing batch norm, we show that there is no inherent relationship between adversarial vulnerability and the input dimension.
2 Batch Normalization
We briefly review how batch norm modifies the hidden layers' pre-activations of a neural network. We use the notation of [32], where $i$ is the index for a neuron, $l$ for the layer, and $B$ for a mini-batch of $|B|$ samples from the dataset; $N_l$ denotes the number of neurons in layer $l$, $W^l$ is the matrix of weights, and $\mathbf{b}^l$ is the vector of biases that parametrize layer $l$. The batch mean is defined as $\mu_i = \frac{1}{|B|} \sum_{x \in B} h_i(x)$, and the variance is $\sigma_i^2 = \frac{1}{|B|} \sum_{x \in B} \left( h_i(x) - \mu_i \right)^2$, where $h_i$ denotes the pre-activation of unit $i$. In the batch norm procedure, the mean $\mu_i$ is subtracted from the pre-activation of each unit (consistent with [13]), the result is divided by the standard deviation $\sigma_i$ plus a small constant $c$ to prevent division by zero, then scaled and shifted by the learned parameters $\gamma_i$ and $\beta_i$, respectively. This is described in Eq. (1), where a per-unit nonlinearity $\phi$, e.g., ReLU, is applied after the normalization:

$$\tilde{h}_i = \gamma_i \, \frac{h_i - \mu_i}{\sigma_i + c} + \beta_i. \qquad (1)$$
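To make Eq. (1) concrete, the following NumPy sketch (ours, for illustration; the shapes and constants are assumptions, not the paper's implementation) applies the normalization to a mini-batch, and also demonstrates the batch-wise nature of the transform: identical examples receive different representations when a single other example in the mini-batch changes.

```python
import numpy as np

def batch_norm(h, gamma, beta, c=1e-5):
    """Eq. (1): subtract the batch mean from each unit's pre-activation,
    divide by the batch standard deviation plus a small constant c,
    then scale by gamma and shift by beta."""
    mu = h.mean(axis=0)       # per-unit batch mean
    sigma = h.std(axis=0)     # per-unit batch standard deviation
    return gamma * (h - mu) / (sigma + c) + beta

rng = np.random.default_rng(0)
gamma, beta = np.ones(8), np.zeros(8)

# With gamma = 1 and beta = 0, every unit is pinned to (near-)zero mean
# and unit variance, regardless of the input distribution.
h = rng.normal(loc=3.0, scale=2.0, size=(128, 8))
out = batch_norm(h, gamma, beta)

# The normalization is batch-wise: the first 127 examples are identical
# inputs in both mini-batches, yet their normalized representations
# differ because the batch statistics differ.
other = np.vstack([h[:127], h[127:] + 10.0])
gap = np.abs(batch_norm(other, gamma, beta)[:127] - out[:127]).max()
```

Pinning the first two moments of every unit in this way is precisely the suppression of moment information discussed in the text.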
Note that this procedure fixes the first and second moments of all neurons equally at initialization, independent of the width or depth of the network, thereby suppressing the information contained in these moments. Because batch norm induces a non-local, batch-wise nonlinearity at each unit, this loss of information cannot be recovered by the parameters $\gamma$ and $\beta$. Furthermore, it has been widely observed empirically that these parameters do not influence the effect being studied [31, 33, 32]. Thus, $\gamma$ and $\beta$ can be incorporated into the per-unit nonlinearity without loss of generality. To understand how batch normalization is harmful, consider two mini-batches that differ by only a single example: due to the induced batch-wise nonlinearity, they will have different representations for each example [32]. This difference is further amplified by stacking batch norm layers. Conversely, normalizing the intermediate representations of two different inputs impairs the ability of batch-normalized networks to distinguish high-quality examples (as judged by an "oracle") that ought to be classified with a large prediction margin from low-quality, i.e., more ambiguous, instances. The last layer of a discriminative neural network, in particular, is typically a linear decoding of class label-homogeneous clusters, and thus makes extensive use of information represented via differences in mean and variance at this stage for the purpose of classification. We argue that this information loss, and the inability to maintain relative distances in the input space, reduce adversarial as well as general robustness. Figure 1 shows a degradation of class-relevant input distances in a batch-normalized linear network on a 2D variant of the "Adversarial Spheres" dataset [8].^1

^1 We add a ReLU nonlinearity when attempting to learn the binary classification task posed by [8]. In Appendix C we show that batch norm increases sensitivity to the learning rate in this case. Conversely, class membership is preserved in arbitrarily deep unnormalized networks (see Figure 7 of Appendix C), although we require a scaling factor to increase the magnitude of the activations to see this visually.

3 Empirical Results
We first evaluate the robustness (quantified as the drop in test accuracy under input perturbations) of convolutional networks, with and without batch norm, that were trained using standard procedures. The datasets (MNIST, SVHN, CIFAR-10, and ImageNet) were normalized to zero mean and unit variance. As a white-box adversarial attack we use projected gradient descent (PGD), in its $\ell_\infty$ and $\ell_2$ norm variants, for its simplicity and its ability to degrade performance with little perceptible change to the input [16]. We run PGD for 20 iterations; the bound $\epsilon$ and the step size were set per dataset (one setting for SVHN and CIFAR-10, another for ImageNet). For the $\ell_2$ variant we set the bound as a function of the input dimension $d$. We report the test accuracy for additive Gaussian noise of zero mean and fixed variance, denoted as "Noise" [5], as well as the CIFAR-10-C common corruption benchmark [11]. We found these methods sufficient to demonstrate a considerable disparity in robustness due to batch norm, but this is not intended as a formal security evaluation. All uncertainties are the standard error of the mean.
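For reference, the core of an $\ell_\infty$ PGD attack is a signed-gradient ascent step on the loss, followed by projection onto the $\epsilon$-ball around the clean input. The sketch below is our own illustrative setup on a toy logistic model with an analytic input gradient; the model, $\epsilon$, and step size are assumptions, not the paper's attack configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_linf(x0, y, w, b, eps, step, iters=20):
    """Untargeted L-infinity PGD against p(y=1|x) = sigmoid(w.x + b).
    Each iteration ascends the loss via the gradient sign, then projects
    back into the eps-ball around the clean input x0."""
    x = x0.copy()
    for _ in range(iters):
        p = sigmoid(x @ w + b)
        grad = (p - y) * w                  # d(cross-entropy)/dx for this model
        x = x + step * np.sign(grad)        # ascent step on the loss
        x = np.clip(x, x0 - eps, x0 + eps)  # project into the eps-ball
    return x

rng = np.random.default_rng(0)
d = 100
w, b = rng.normal(size=d), 0.0
x0, y = rng.normal(size=d), 1.0

x_adv = pgd_linf(x0, y, w, b, eps=0.1, step=0.01)
loss = lambda x: -np.log(sigmoid(x @ w + b) + 1e-12)  # cross-entropy for y = 1
# The attack raises the loss while staying within the eps-ball around x0.
```

In practice one would use an attack library against a trained network, as in Appendix A.2; this toy model only isolates the ascend-then-project loop.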
^2 Each experiment has a unique uncertainty, hence the number of decimal places varies.

Table 1: Test accuracy (%) on SVHN with (✓) and without (✗) batch norm.

| BN | Clean | Noise | PGD-$\ell_\infty$ | PGD-$\ell_2$ |
| ✗  |       |       |                   |              |
| ✓  |       |       |                   |              |
For the SVHN dataset, models were trained by stochastic gradient descent (SGD) with momentum 0.9 for 50 epochs, with a batch size of 128; the initial learning rate was dropped by a factor of ten at epochs 25 and 40. Trials were repeated over five random seeds. We show the results of this experiment in Table 1, finding that although batch norm increased clean test accuracy, it reduced test accuracy for additive noise and for both the $\ell_\infty$ and $\ell_2$ PGD attacks.

Table 2: Test accuracy (%) with and without batch norm on CIFAR-10 and the CIFAR-10.1 test set.

|       |    | CIFAR-10 |       |                   |              | CIFAR-10.1 |       |
| Model | BN | Clean    | Noise | PGD-$\ell_\infty$ | PGD-$\ell_2$ | Clean      | Noise |
| VGG   | ✗  |          |       |                   |              |            |       |
| VGG   | ✓  |          |       |                   |              |            |       |
| WRN   | F  |          |       |                   |              |            |       |
| WRN   | ✓  |          |       |                   |              |            |       |
For the CIFAR-10 experiments we trained models with a similar procedure as for SVHN, but with random crops using four-pixel padding, and horizontal flips. We evaluate two families of contemporary models: one without skip connections (VGG), and WideResNets (WRN) using "Fixup" initialization [34] to reduce the use of batch norm. In the first experiment, a basic comparison with and without batch norm shown in Table 2, we evaluated the best model in terms of test accuracy after training for 150 epochs with a fixed learning rate. In this case, inclusion of batch norm for VGG reduces the clean generalization gap (the difference between training and test accuracy), but test accuracy drops for additive noise and for the $\ell_\infty$ and $\ell_2$ PGD variants.

Table 3: Test accuracy (%) of VGG models of depth L on CIFAR-10.

| L  | BN | Clean | Noise | PGD |
| 8  | ✗  |       |       |     |
| 8  | ✓  |       |       |     |
| 11 | ✗  |       |       |     |
| 11 | ✓  |       |       |     |
| 13 | ✗  |       |       |     |
| 13 | ✓  |       |       |     |
| 16 | ✓  |       |       |     |
| 19 | ✓  |       |       |     |
Very similar results are obtained on a new test set, CIFAR-10.1 v6 [18]: batch norm slightly improves the clean test accuracy, but leads to a considerable drop in test accuracy for the case with additive noise, and for both the $\ell_\infty$ and $\ell_2$ PGD variants (PGD absolute values omitted for CIFAR-10.1 in Table 2 for brevity). It has been suggested that one of the benefits of batch norm is that it facilitates training with a larger learning rate [13, 2]. We test this from a robustness perspective in an experiment summarized in Table 3, where the initial learning rate was increased when batch norm was used. We prolonged training for up to 350 epochs, and dropped the learning rate by a factor of ten at epochs 150 and 250 in both cases, which increases clean test accuracy relative to Table 2. The deepest model that was trainable using standard "He" initialization [10] without batch norm was VGG13.^3 None of the deeper batch-normalized models recovered the robustness of the most shallow, or same-depth, unnormalized equivalents, nor does the higher learning rate with batch norm improve robustness compared to baselines trained for the same number of epochs. Additional results for deeper models on SVHN and CIFAR-10 can be found in Appendix A.3.

We also evaluated robustness on the common corruption benchmark comprising 19 types of real-world effects that can be grouped into four categories: "noise", "blur", "weather", and "digital" corruptions [11]. Each corruption has five "severity", or intensity, levels. We report the mean error on the corrupted test set (mCE) by averaging over all intensity levels and corruptions [11].

^3 One of ten random seeds failed to achieve better than chance accuracy on the training set, while the others performed as expected. We report the first three successful runs for consistency with the other experiments.
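The mCE computation described above reduces to an average of error rates over corruption types and severity levels. A minimal sketch, assuming the unweighted average stated in the text; the error values below are synthetic placeholders, not our measurements:

```python
import numpy as np

def mean_corruption_error(err):
    """err: array of shape (n_corruptions, n_severities) holding test
    error (1 - accuracy) for each corruption type at each severity
    level. mCE here is the unweighted mean over both axes, per the
    averaging described in the text for CIFAR-10-C [11]."""
    return err.mean()

# Synthetic error rates for 19 corruptions x 5 severity levels.
rng = np.random.default_rng(0)
err = rng.uniform(0.05, 0.40, size=(19, 5))

mce = mean_corruption_error(err)
per_corruption = err.mean(axis=1)  # averaged over severities, as in Table 4
```

Note that the ImageNet-C variant of mCE additionally normalizes by a baseline model's error; the plain average shown here matches the averaging this paper describes for CIFAR-10-C.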
We summarize the results for two VGG variants and a WideResNet on CIFAR-10-C, trained from scratch on the default training set for three and five random seeds, respectively. Accuracy for the noise corruptions, which caused the largest difference in accuracy with batch norm, is outlined in Table 4. The key takeaway is: for all models tested, the batch-normalized variant had a higher error rate for all corruptions of the "noise" category, at every intensity level.
Table 4: Robustness of three modern convolutional neural network architectures with and without batch norm on the CIFAR-10-C common "noise" corruptions [11]. We use "F" to denote the Fixup variant of the WRN. Values were averaged over the five intensity levels for each corruption.

| Variant | BN | Clean | Gaussian | Impulse | Shot | Speckle |
| VGG8    | ✗  |       |          |         |      |         |
|         | ✓  |       |          |         |      |         |
| VGG13   | ✗  |       |          |         |      |         |
|         | ✓  |       |          |         |      |         |
| WRN28   | F  |       |          |         |      |         |
|         | ✓  |       |          |         |      |         |

Averaging over all 19 corruptions, we find that batch norm increased the mCE for VGG8, for VGG13, and for the WRN. There was a large disparity in accuracy when modulating batch norm for the different corruption categories; we therefore examine these in more detail.
Table 5: Top-5 test accuracy (%) of pre-trained ImageNet models with and without batch norm.

| Model       | BN | Clean | Noise | PGD |
| VGG11       | ✗  |       |       |     |
| VGG11       | ✓  |       |       |     |
| VGG13       | ✗  |       |       |     |
| VGG13       | ✓  |       |       |     |
| VGG16       | ✗  |       |       |     |
| VGG16       | ✓  |       |       |     |
| VGG19       | ✗  |       |       |     |
| VGG19       | ✓  |       |       |     |
| AlexNet     | ✗  |       |       |     |
| DenseNet121 | ✓  |       |       |     |
| ResNet18    | ✓  |       |       |     |
For VGG8, the largest mean generalization gaps under corruption were for the "noise" category: Gaussian, Impulse, Shot, and Speckle. After the "noise" category, the next most damaging corruptions (by the difference in accuracy due to batch norm) were Contrast, Spatter, JPEG, and Pixelate. Results for the remaining corruptions were inconclusive as to whether batch norm improved or degraded robustness, as the random error was comparable to the difference being measured. For VGG13, the batch norm accuracy gap enlarged for Gaussian noise at severity levels 3, 4, and 5, and for Impulse noise at levels 4 and 5. Robustness to the other corruptions seemed to benefit from the slightly higher clean test accuracy of the batch-normalized VGG13; the remaining generalization gaps varied from slightly negative for Zoom blur to positive for Pixelate. For the WRN, the mean generalization gaps for the noise variants followed the same pattern. Note that the large uncertainty for these measurements is due to high variance for the model with batch norm; the Fixup variant had lower variance on average. JPEG compression was the next most affected. Interestingly, some corruptions that led to a positive gap for VGG8 showed a negative gap for the WRN, i.e., batch norm improved accuracy for Contrast, Snow, and Spatter. These were the same corruptions for which VGG13 lost, or did not improve, its robustness when batch norm was removed, which is why we believe they correlate with standard test accuracy (highest for the WRN). Visually, these corruptions appear to preserve texture information. Conversely, noise is applied in a spatially global way that disproportionately degrades textures, emphasizing shapes and edges. It is now known that modern CNNs trained on standard image datasets have a propensity to rely on texture, whereas we would rather they use shape and edge cues [7, 3].
Our results support the idea that batch norm may exacerbate this tendency to leverage superficial textures for classification of image data. Next, we evaluated the robustness of pre-trained ImageNet models from the torchvision.models repository, which conveniently provides models with and without batch norm.^4 Results are shown in Table 5, where batch norm improves top-5 accuracy on noise in some cases, but consistently reduces it for PGD. The trends are the same for top-1 accuracy; only the absolute values were smaller. Given the discrepancy between noise and PGD for ImageNet, we include a black-box transfer analysis in Appendix A.4 that is consistent with the white-box analysis.

^4 https://pytorch.org/docs/stable/torchvision/models.html, v1.1.0.
Finally, we explore the role of batch size and depth in Figure 2. Batch norm limits the maximum trainable depth, which increases with the batch size, but quickly plateaus as predicted by Theorem 3.10 of [32]. Robustness decreases with the batch size for depths that maintain a reasonable test accuracy, at around 25 or fewer layers. This tension between clean accuracy and robustness as a function of the batch size is not observed in unnormalized networks.
In unnormalized networks, we observe that perturbation robustness increases with the depth of the network. This is consistent with the computational benefit of the hidden layers proposed by [23], who take an information-theoretic approach. This analysis uses two mutual information terms: $I(X;T)$, the information in the layer activations $T$ about the input $X$, which is a measure of representational complexity; and $I(T;Y)$, the information in the activations about the label $Y$, which is understood as the predictive content of the learned input representations. It is shown that under SGD training, $I(T;Y)$ generally increases with the number of epochs, while $I(X;T)$ increases initially but decreases throughout the later stage of the training procedure. An information-theoretic argument as to why reducing $I(X;T)$, while ensuring a sufficiently high value of $I(T;Y)$, should promote good generalization from finite samples is given in [29, 22]. We estimate $I(X;T)$ for the batch-normalized networks from the experiment in Figure 2 for subsampled batch sizes and plot it in Figure 3. We assume $I(X;T) = H(T)$, since the networks are noiseless and thus $T$ is deterministic given $X$. We use the "plug-in" maximum-likelihood estimate of the entropy computed on the full MNIST training set [17]. Activations $T$ are taken as the softmax output, which was quantized to 7-bit accuracy. The number of bits was determined by reducing the precision as low as possible without inducing classification errors; this provides a notion of the model's "intrinsic" precision. We use the confidence interval recommended by [17], which contains both bias and variance terms for the regime where the number of samples is large relative to the number of discrete states; this interval is multiplied by ten for each dimension. Our first observation is that the configurations where $I(X;T)$ is low, indicating a more compressed representation, are the same settings where the model obtains high clean test accuracy. The transition of $I(X;T)$ at around 10 bits occurs remarkably close to the theoretical maximum trainable depth. For the unnormalized network, the absolute values of $I(X;T)$ were almost always comparable to, or less than, the lowest value obtained by any batch-normalized network. We therefore omit the comparable figure for brevity, but note that $I(X;T)$ did continue to decrease with depth in many cases, e.g., for a mini-batch size of 20, although these differences were small relative to the worst-case error. The fact that $I(X;T)$ is small where BIM robustness is poor for batch-normalized networks disagrees with our initial hypothesis that more layers were needed to decrease $I(X;T)$. However, this result is consistent with the observation that it is possible for networks to overfit via too much compression [23]. In particular, [32] prove that batch norm loses the information between mini-batches exponentially quickly in the depth of the network, so overfitting via "too much" compression is consistent with our results. This intuition requires further analysis, which is left for future work.

4 Vulnerability and Input Dimension
A recent work [24] analyzes the adversarial vulnerability of batch-normalized networks at initialization time and conjectures, based on a scaling analysis, that under the commonly used initialization scheme of [10], adversarial vulnerability scales as $\sqrt{d}$, where $d$ is the input dimension.
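The flavor of this scaling argument can be checked numerically: for a single linear unit under "He"-style initialization, the $\ell_1$ norm of the input gradient, which bounds the loss change of a small $\ell_\infty$ perturbation, grows like $\sqrt{d}$. The toy check below is ours, for illustration only, not an experiment from [24]:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_l1_grad_norm(d, trials=2000):
    """Sample weights w_i ~ N(0, 2/d) ("He"-style fan-in scaling) and
    return the average l1 norm of w, which is the input-gradient norm
    of a single linear unit. Analytically, E||w||_1 = 2*sqrt(d/pi)."""
    w = rng.normal(scale=np.sqrt(2.0 / d), size=(trials, d))
    return np.abs(w).sum(axis=1).mean()

# Quadrupling the input dimension should roughly double the l1 norm.
ratio = mean_l1_grad_norm(4 * 784) / mean_l1_grad_norm(784)
```

Weight decay counteracts this by shrinking the weights, and hence the input-gradient norm, which is the mechanism the text refers to as correcting the loss scaling.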
Table 6: Test accuracy (%) of the MLP on MNIST at increasing image widths, with and without batch norm.

| Width | BN | Clean | Noise | BIM |
| 28    | ✗  |       |       |     |
|       | ✓  |       |       |     |
| 56    | ✗  |       |       |     |
|       | ✓  |       |       |     |
| 84    | ✗  |       |       |     |
|       | ✓  |       |       |     |
They also show in experiments that independence between vulnerability and the input dimension can be approximately recovered through adversarial training by projected gradient descent (PGD) [16], with a modest trade-off in clean accuracy. We show that this can be achieved by simpler means, and with little to no trade-off, through weight decay, where the regularization constant corrects the loss scaling as the norm of the input increases with $d$.
Table 7: Test accuracy (%) of the MLP at larger image widths when trained with weight decay.

| Width | BN | Clean | Noise | BIM |
| 56    | ✗  |       |       |     |
|       | ✓  |       |       |     |
| 84    | ✗  |       |       |     |
|       | ✓  |       |       |     |
We increase the MNIST image width from 28 to 56, 84, and 112 pixels. The loss is predicted to grow like $\sqrt{d}$ for $\epsilon$-sized attacks by Thm. 4 of [24]. We confirm that without regularization the loss does scale roughly as predicted: the predicted values lie between the loss ratios obtained for the two attack variants at most image widths (see Table 4 of Appendix B). Training with weight decay, however, we obtain adversarial and clean test accuracy ratios that are nearly independent of the input dimension for widths of 56, 84, and 112, relative to the original dataset. A more detailed explanation and results are provided in Appendix B.

Next, we repeat this experiment with a two-hidden-layer ReLU MLP, with the number of hidden units equal to half the input dimension, and optionally use one hidden layer with batch norm.^5 To evaluate robustness, 100 iterations of BIM were used with a step size of 1e-3. We also report test accuracy with additive Gaussian noise of zero mean and unit variance, i.e., the same first two moments as the clean images.^6 Despite a small difference in clean accuracy, Table 6 shows that for the original image resolution, batch norm reduced accuracy for noise and for BIM. Robustness keeps decreasing as the image size increases, with the batch-normalized network having less robustness to BIM and to noise at all sizes. We then apply the regularization constants tuned for the respective input dimensions on the linear model to the ReLU MLP with no further adjustments. Table 7 shows that by adding sufficient regularization to recover the original (width 28, no BN) accuracy for BIM when using batch norm, we induce an increase in clean test error that is substantial for MNIST. Furthermore, using the same regularization constant without batch norm increases clean test accuracy as well as accuracy under the BIM perturbation. Following the guidance in the original work on batch norm [13] to the extreme, i.e., reducing weight decay toward zero when using batch norm, degrades accuracy under perturbation. In all cases, using batch norm greatly reduced test accuracy for noisy and adversarially perturbed inputs, while weight decay increased accuracy for such inputs.

^5 This choice of architecture is mostly arbitrary; the trends were the same for constant-width layers.
^6 We first apply the noise to the original 28x28 pixel images, then resize them to preserve the appearance of the noise.
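The noise protocol of footnote 6 (noise added at the original 28x28 resolution, then resized) can be sketched as follows. We use integer nearest-neighbor upsampling as a stand-in for whatever resampling the experiments actually used, and the input image is a random placeholder:

```python
import numpy as np

def noisy_resized(img28, factor, rng):
    """Add unit-variance Gaussian noise at the original 28x28
    resolution, then upsample by an integer factor with
    nearest-neighbor repetition, so the noise keeps its 28x28
    granularity (its "appearance") at the larger image size."""
    noisy = img28 + rng.normal(loc=0.0, scale=1.0, size=img28.shape)
    return np.repeat(np.repeat(noisy, factor, axis=0), factor, axis=1)

rng = np.random.default_rng(0)
img = rng.uniform(size=(28, 28))             # placeholder MNIST image in [0, 1]
big = noisy_resized(img, factor=2, rng=rng)  # 56 x 56
```

Applying noise before resizing, rather than after, keeps the per-pixel noise statistics comparable across image widths, which is the point of the protocol.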
5 Related Work
Our work examines the effect of batch norm on model robustness at test time. Many references which have an immediate connection to our work were discussed in the previous sections; here we briefly mention other works that do not have a direct relationship to our experiments, but are relevant to the topic of batch norm in general. The original work [13] that introduced batch norm as a technique for improving neural network training and test performance motivated it by the "internal covariate shift" – a term referring to the changing distribution of layer outputs, an effect that requires subsequent layers to steadily adapt to the new distribution and thus slows down the training process. Several follow-up works started from the empirical observation that batch norm usually accelerates and stabilizes training, and attempted to clarify the mechanism behind this effect. One argument is that batch-normalized networks have a smoother optimization landscape due to smaller gradients immediately before the batch-normalized layer [19]. However, [32] study the effect of stacking many batch-normalized layers and prove that this causes gradient explosion that is exponential in the depth of the network for any nonlinearity. In practice, relatively shallow batch-normalized networks yield the expected "helpful smoothing" of the loss surface [19], while very deep networks are not trainable [32]. In our work, we find that a single batch-normalized layer suffices to induce severe adversarial vulnerability.
In Figure 4 we visualize the activations of the penultimate hidden layer in a fully-connected network without and with batch norm over the course of 500 epochs. In the unnormalized network, all data points overlap at initialization. Over the first epochs, the points spread further apart (middle plot) and begin to form clusters. In the final stage, the clusters become tighter. When we introduce two batch norm layers in the network, placing them before the visualized layer, the activation patterns display notable differences, as shown in Figure 4: i) at initialization, all data points are spread out, allowing easier partitioning into clusters and thus facilitating faster training. We believe this is associated with the "helpful smoothing" property identified by [19] for shallow networks; ii) the clusters are more stationary, and the stages of cluster formation and tightening are less distinct; iii) the inter-cluster distance and the clusters themselves are larger. Weight decay's loss scaling mechanism is complementary to other mechanisms identified in the literature, for instance that it increases the effective learning rate [31, 33]. Our results are consistent with these works in that weight decay reduces the generalization gap (between training and test error), even in batch-normalized networks where it is presumed to have no effect. Given that batch norm is not typically used on all layers, the loss scaling mechanism persists, although to a lesser degree, in this case.
6 Conclusion
We found that there is no free lunch with batch norm: the accelerated training properties and occasionally higher clean test accuracy come at the cost of robustness, both to additive noise and to adversarial perturbations. We have shown that there is no inherent relationship between the input dimension and vulnerability. Our results highlight the importance of identifying the disparate mechanisms of regularization techniques, especially when concerned about robustness.
Acknowledgements
The authors wish to acknowledge the financial support of NSERC, CFI, CIFAR and EPSRC. We also acknowledge hardware support from NVIDIA and Compute Canada. Research at the Perimeter Institute is supported by Industry Canada and the province of Ontario through the Ministry of Research & Innovation. We thank Thorsteinn Jonsson for helpful discussions; Colin Brennan, Terrance DeVries and JörnHenrik Jacobsen for technical suggestions; Justin Gilmer for suggesting the common corruption benchmark; Maeve Kennedy, Vithursan Thangarasa, Katya Kudashkina, and Boris Knyazev for comments and proofreading.
References

[1] A. Athalye, N. Carlini, and D. Wagner. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. In International Conference on Machine Learning, pages 274–283, 2018.
[2] N. Bjorck, C. P. Gomes, B. Selman, and K. Q. Weinberger. Understanding Batch Normalization. In Advances in Neural Information Processing Systems 31, pages 7705–7716. Curran Associates, Inc., 2018.
[3] W. Brendel and M. Bethge. Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet. In International Conference on Learning Representations, 2019.
 [4] G. W. Ding, L. Wang, and X. Jin. AdverTorch v0.1: An Adversarial Robustness Toolbox based on PyTorch. arXiv preprint arXiv:1902.07623, 2019.
 [5] N. Ford, J. Gilmer, and E. D. Cubuk. Adversarial Examples Are a Natural Consequence of Test Error in Noise. 2019.
 [6] A. Galloway, T. Tanay, and G. W. Taylor. Adversarial Training Versus Weight Decay. arXiv preprint arXiv:1804.03308, 2018.
[7] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
 [8] J. Gilmer, L. Metz, F. Faghri, S. Schoenholz, M. Raghu, M. Wattenberg, and I. Goodfellow. Adversarial Spheres. In International Conference on Learning Representations Workshop Track, 2018.
 [9] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and Harnessing Adversarial Examples. In International Conference on Learning Representations, 2015.

[10] K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In International Conference on Computer Vision, pages 1026–1034. IEEE Computer Society, 2015.
[11] D. Hendrycks and T. Dietterich. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. In International Conference on Learning Representations, 2019.
 [12] E. Hoffer, I. Hubara, and D. Soudry. Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems 30, pages 1731–1741. Curran Associates, Inc., 2017.
 [13] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
[14] J.-H. Jacobsen, J. Behrmann, N. Carlini, F. Tramèr, and N. Papernot. Exploiting Excessive Invariance Caused by Norm-Bounded Adversarial Robustness. Safe Machine Learning workshop at ICLR, 2019.
 [15] A. Kurakin, I. J. Goodfellow, and S. Bengio. Adversarial Machine Learning at Scale. International Conference on Learning Representations, 2017.

[16] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations, 2018.
[17] L. Paninski. Estimation of Entropy and Mutual Information. Neural Computation, 15:1191–1253, 2003.
[18] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do CIFAR-10 Classifiers Generalize to CIFAR-10? arXiv:1806.00451, 2018.
 [19] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. How Does Batch Normalization Help Optimization? In Advances in Neural Information Processing Systems 31, pages 2488–2498. 2018.
 [20] L. Schott, J. Rauber, M. Bethge, and W. Brendel. Towards the first adversarially robust neural network model on MNIST. In International Conference on Learning Representations, 2019.
 [21] D. Sculley, J. Snoek, A. Wiltschko, and A. Rahimi. Winner’s Curse? On Pace, Progress, and Empirical Rigor. In International Conference on Learning Representations, Workshop, 2018.
[22] R. Shwartz-Ziv, A. Painsky, and N. Tishby. Representation Compression and Generalization in Deep Neural Networks. 2019.
[23] R. Shwartz-Ziv and N. Tishby. Opening the Black Box of Deep Neural Networks via Information. arXiv:1703.00810 [cs], 2017.
[24] C.-J. Simon-Gabriel, Y. Ollivier, L. Bottou, B. Schölkopf, and D. Lopez-Paz. Adversarial Vulnerability of Neural Networks Increases with Input Dimension. arXiv:1802.01421 [cs, stat], 2018.
 [25] D. Soudry, E. Hoffer, M. S. Nacson, and N. Srebro. The Implicit Bias of Gradient Descent on Separable Data. In International Conference on Learning Representations, 2018.
[26] D. Su, H. Zhang, H. Chen, J. Yi, P.-Y. Chen, and Y. Gao. Is Robustness the Cost of Accuracy? A Comprehensive Study on the Robustness of 18 Deep Image Classification Models. In Computer Vision – ECCV 2018, pages 644–661. Springer International Publishing, 2018.
 [27] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
 [28] T. Tanay and L. D. Griffin. A Boundary Tilting Persepective on the Phenomenon of Adversarial Examples. arXiv:1608.07690, 2016.
 [29] N. Tishby and N. Zaslavsky. Deep Learning and the Information Bottleneck Principle. In Information Theory Workshop, pages 1–5. IEEE, 2015.

[30] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness May Be at Odds with Accuracy. In International Conference on Learning Representations, 2019.
[31] T. van Laarhoven. L2 Regularization versus Batch and Weight Normalization. arXiv:1706.05350, 2017.
[32] G. Yang, J. Pennington, V. Rao, J. Sohl-Dickstein, and S. S. Schoenholz. A Mean Field Theory of Batch Normalization. In International Conference on Learning Representations, 2019.
 [33] G. Zhang, C. Wang, B. Xu, and R. Grosse. Three Mechanisms of Weight Decay Regularization. In International Conference on Learning Representations, 2019.
 [34] H. Zhang, Y. N. Dauphin, and T. Ma. Residual Learning Without Normalization via Better Initialization. In International Conference on Learning Representations, 2019.
Appendix A Supplement to Empirical Results
This section contains supplementary explanations and results to those of Section 3.
A.1 Why the VGG Architecture?
For the SVHN and CIFAR-10 experiments, we selected the VGG family of models as a simple yet contemporary convolutional architecture whose development occurred independently of batch norm. This makes it suitable for a causal intervention, given that we want to study the effect of batch norm itself, and not batch norm plus other architectural innovations plus hyperparameter tuning. State-of-the-art architectures, such as Inception and ResNet, whose development is more intimately linked with batch norm, may be less suitable for this kind of analysis. The superior standard test accuracy of these models is somewhat moot given the trade-off between standard test accuracy and robustness demonstrated in this work and elsewhere [28, 6, 26, 30]. Aside from these reasons, and the provision of pre-trained variants on ImageNet with and without batch norm in torchvision.models for ease of reproducibility, this choice of architecture is arbitrary.
A.2 Comparison of PGD to BIM
We used the PGD implementation from [4] with the settings below; the pixel range (clip_min, clip_max) was set per dataset, e.g., $[-1, 1]$:

    from advertorch.attacks import LinfPGDAttack

    adversary = LinfPGDAttack(
        net,
        loss_fn=nn.CrossEntropyLoss(reduction="sum"),
        eps=0.03,
        nb_iter=20,
        eps_iter=0.003,
        rand_init=False,
        clip_min=-1.0,
        clip_max=1.0,
        targeted=False,
    )
We compared PGD to our own BIM implementation with a different step size, for the same number (20) of iterations. This slightly changed test accuracy for $\ell_\infty$ perturbations for both the unnormalized and batch-normalized VGG8 networks, but the difference due to batch norm was identical in both cases. Results were also consistent between PGD and BIM for ImageNet. We also tried increasing the number of PGD iterations for deeper networks: for VGG16 on CIFAR-10, using 40 iterations of PGD with half the step size, instead of 20 iterations, changed accuracy only marginally.
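Such insensitivity to the step size is unsurprising for near-linear loss surfaces: halving the step size while doubling the iteration count drives an iterated sign-gradient attack to the same corner of the $\epsilon$-ball. A toy demonstration on a logistic model (our illustrative setup, not the paper's networks):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def iterative_attack(x0, y, w, b, eps, step, iters):
    """BIM/PGD-style iterated sign-gradient ascent on the loss of a
    logistic model, clipped to the eps-ball around x0 each iteration."""
    x = x0.copy()
    for _ in range(iters):
        grad = (sigmoid(x @ w + b) - y) * w   # analytic input gradient
        x = np.clip(x + step * np.sign(grad), x0 - eps, x0 + eps)
    return x

rng = np.random.default_rng(0)
d, eps = 64, 0.1
w, x0 = rng.normal(size=d), rng.normal(size=d)
b, y = 0.0, 1.0

# Halving the step size while doubling the iterations reaches the same
# corner of the eps-ball for this (linear) model.
xa = iterative_attack(x0, y, w, b, eps, step=0.01, iters=20)
xb = iterative_attack(x0, y, w, b, eps, step=0.005, iters=40)
```

For deep nonlinear networks the two schedules need not coincide exactly, which is why the comparison in the text was run empirically.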
A.3 Additional SVHN and CIFAR-10 Results for Deeper Models
Our first attempt to train VGG models on SVHN with more than 8 layers failed; therefore, for a fair comparison, we report in Table 8 the robustness of the deeper models that were only trainable by using batch norm. None of these models obtained much better robustness in terms of PGD, although they did better for PGD.
Table 8: Test accuracy (%).

L  | Clean | Noise | PGD | PGD
11 |       |       |     |
13 |       |       |     |
16 |       |       |     |
19 |       |       |     |
Fixup initialization was recently proposed to reduce the use of normalization layers in deep residual networks [34]. As a natural test, we compare a WideResNet (28 layers, width factor 10) with Fixup versus the default architecture with batch norm. Note that the Fixup variant still contains one batch norm layer before the classification layer, but the number of batch norm layers is still greatly reduced.[7]

[7] We used the implementation from https://github.com/valilenk/fixup, but stopped training at 150 epochs for consistency with the VGG8 experiment. Both models had already fit the training set by this point.
Table 9: Test accuracy (%) on CIFAR-10 and CIFAR-10.1.

Model | CIFAR-10: Clean | Noise | PGD | PGD | CIFAR-10.1: Clean | Noise
Fixup |                 |       |     |     |                   |
BN    |                 |       |     |     |                   |
We train WideResNets (WRN) with five unique seeds and show their test accuracies in Table 9. Consistent with [18], the higher clean test accuracy on CIFAR-10 obtained by the WRN compared to VGG translated to higher clean accuracy on CIFAR-10.1. However, these gains were wiped out by moderate Gaussian noise. VGG8 dramatically outperforms both WideResNet variants subject to noise, achieving vs. . Unlike for VGG8, the WRN showed little generalization gap between noisy CIFAR-10 and 10.1 variants: is reasonably compatible with , and with . The Fixup variant improves accuracy by for noisy CIFAR-10, for noisy CIFAR-10.1, for PGD, and for PGD. We believe our work serves as compelling motivation for Fixup and other techniques that aim to reduce usage of batch normalization. The role of skip-connections should be isolated in future work, since absolute values were consistently lower for residual networks.
A.4 ImageNet Black-box Transferability Analysis
Table 10: Top 1 and top 5 test accuracy (%) under black-box transfer between VGG models of depth 11–19 on ImageNet. Perturbations are crafted on the source model and evaluated on the target (✗ = without batch norm, ✓ = with batch norm).

Acc. Type | Source | 11 ✗ | 11 ✓ | 13 ✗ | 13 ✓ | 16 ✗ | 16 ✓ | 19 ✗ | 19 ✓
Top 1     | 11 ✗   |  1.2 | 42.4 | 37.8 | 42.9 | 43.8 | 49.6 | 47.9 | 53.8
Top 1     | 11 ✓   | 58.8 |  0.3 | 58.2 | 45.0 | 61.6 | 54.1 | 64.4 | 58.7
Top 5     | 11 ✗   | 11.9 | 80.4 | 75.9 | 80.9 | 80.3 | 83.3 | 81.6 | 85.1
Top 5     | 11 ✓   | 87.9 |  6.8 | 86.7 | 83.7 | 89.0 | 85.7 | 90.4 | 88.1
The discrepancy between the results in additive noise and for white-box BIM perturbations for ImageNet in Section 3 raises a natural question: is gradient masking a factor influencing the success of the white-box results on ImageNet? No: consistent with the white-box results, when the target is unnormalized but the source is batch-normalized, top 1 accuracy is higher, while top 5 accuracy is higher, than vice versa. This can be observed in Table 10 by comparing the diagonals from lower left to upper right. When targeting an unnormalized model, we reduce top 1 accuracy by using a source that is also unnormalized, compared to a difference of only by matching batch-normalized networks. This suggests that the features used by unnormalized networks are more stable than those of batch-normalized networks. Unfortunately, the pre-trained ImageNet models provided by the PyTorch developers do not include hyperparameter settings or other training details. However, we believe that this speaks to the generality of the results, i.e., that they are not sensitive to hyperparameters.
A.5 Batch Norm Limits Maximum Trainable Depth and Robustness
Figure 5: As in Figure 2, but with outliers removed and additional batch sizes from 5–20. Best viewed in colour.
In Figure 5 we show that batch norm not only limits the maximum trainable depth, but that robustness decreases with the batch size for depths that maintain test accuracy, at around 25 or fewer layers. Both clean accuracy and robustness showed little to no relationship with depth or batch size in unnormalized networks. A few outliers are observed for unnormalized networks at large depths and batch sizes, which could be due to the reduced number of parameter update steps that results from a higher batch size and a fixed number of epochs [12]. Note that in Figure 5 the bottom row (without batch norm) appears lighter than the equivalent plot above it (with batch norm), indicating that unnormalized networks obtain lower absolute peak accuracy than batch-normalized networks. Given that the unnormalized networks take longer to converge, we prolong training for a total of 40 epochs. When they do converge, we see more configurations that achieve higher clean test accuracy than batch-normalized networks in Figure 5. Furthermore, good robustness can be achieved simultaneously with good clean test accuracy in unnormalized networks, whereas the regimes of good clean accuracy and good robustness remain mostly non-overlapping in Figure 5.
Appendix B Weight Decay and Input Dimension
Consider a logistic classification model represented by a neural network consisting of a single unit, parameterized by weights $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$, with input denoted by $x \in \mathbb{R}^d$ and true labels $y \in \{\pm 1\}$. Predictions are defined by $\hat{y} = \operatorname{sign}(w^\top x + b)$, and the model is optimized through empirical risk minimization, i.e., by applying stochastic gradient descent (SGD) to the loss function (2), where $\zeta(z) = \log(1 + \exp(z))$:

$$\mathcal{L}(w, b) = \frac{1}{N} \sum_{i=1}^{N} \zeta\bigl(-y_i (w^\top x_i + b)\bigr) \tag{2}$$

We note that $y (w^\top x + b)$ is a scaled, signed distance between $x$ and the classification boundary defined by our model. If we define $d(x)$ as the signed Euclidean distance between $x$ and the boundary, then we have $y (w^\top x + b) = \lVert w \rVert_2 \, y \, d(x)$. Hence, minimizing (2) is equivalent to minimizing

$$\frac{1}{N} \sum_{i=1}^{N} \zeta\bigl(-\lVert w \rVert_2 \, y_i \, d(x_i)\bigr) \tag{3}$$

We define the scaled loss as

$$\zeta_\alpha(z) \coloneqq \zeta(\alpha z) / \alpha \tag{4}$$

and note that adding an $L_2$ regularization term in (3), resulting in (5), can be understood as a way of controlling the scaling of the loss function:

$$\frac{1}{N} \sum_{i=1}^{N} \zeta\bigl(-\lVert w \rVert_2 \, y_i \, d(x_i)\bigr) + \lambda \lVert w \rVert_2^2 = \lVert w \rVert_2 \left( \frac{1}{N} \sum_{i=1}^{N} \zeta_{\lVert w \rVert_2}\bigl(-y_i \, d(x_i)\bigr) + \lambda \lVert w \rVert_2 \right) \tag{5}$$
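The factoring in (5) follows directly from the definition of the scaled loss in (4) and can be checked numerically; the sketch below uses arbitrary illustrative values for the weights, signed distances, and labels:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))  # zeta(z) = log(1 + e^z)

def zeta_alpha(z, alpha):
    return softplus(alpha * z) / alpha  # scaled loss, eq. (4)

# arbitrary example values (illustrative only)
w = np.array([0.6, -1.2, 0.3])
d = np.array([0.8, -0.5, 1.1, 0.2])   # signed distances to the boundary
y = np.array([1, -1, 1, -1])
lam = 0.01

w_norm = np.linalg.norm(w)
lhs = softplus(-w_norm * y * d).mean() + lam * w_norm**2
rhs = w_norm * (zeta_alpha(-y * d, w_norm).mean() + lam * w_norm)
print(np.isclose(lhs, rhs))  # True: the two sides of eq. (5) agree
```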
In Figures 6(a)–6(c), we develop intuition for the different quantities contained in (2) with respect to a typical binary classification problem, while Figures 6(d)–6(f) depict the effect of the regularization parameter $\lambda$ on the scaling of the loss function.

To test this theory empirically, we study a model with a single linear layer (number of units equal to the input dimension) and a cross-entropy loss function on variants of MNIST of increasing input dimension, to approximate the toy model described in the "core idea" from [24] as closely as possible, but with a model capable of learning. Clearly, this model is too simple to obtain competitive test accuracy, but it is a helpful first step that will subsequently be extended to ReLU networks. The model was trained by SGD for 50 epochs with a constant learning rate of 1e-2 and a mini-batch size of 128.

In Table 11 we show that increasing the input dimension by resizing MNIST from $28 \times 28$ to various resolutions with PIL.Image.NEAREST interpolation increases adversarial vulnerability in terms of accuracy and loss. Furthermore, the "adversarial damage", defined as the average increase of the loss after attack, which is predicted to grow like $\sqrt{d}$ by Theorem 4 of [24], falls in between that obtained empirically for and for all image widths except for 112, which experiences slightly more damage than anticipated.

[24] note that independence between vulnerability and the input dimension can be recovered through adversarial-example augmented training by projected gradient descent (PGD), with a small trade-off in terms of standard test accuracy. We find that the same can be achieved through a much simpler approach: weight decay, with the parameter $\lambda$ chosen dependent on the input dimension to correct for the loss scaling. This way we recover input-dimension-invariant vulnerability with little degradation of test accuracy, e.g., see the result for and in Table 11: the accuracy ratio is with weight decay regularization, compared to without.
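For integer scale factors, PIL.Image.NEAREST upsampling simply replicates each pixel, which can be mimicked with np.repeat. The sketch below uses a small random-free stand-in array rather than an actual MNIST digit:

```python
import numpy as np

def resize_nearest(img, factor):
    """Nearest-neighbour upsampling by an integer factor, equivalent to
    PIL.Image.NEAREST for exact multiples: each pixel becomes a
    factor x factor block, so a 28x28 digit becomes 56x56, 84x84, ..."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

img = np.arange(4).reshape(2, 2)  # stand-in for a 28x28 MNIST digit
big = resize_nearest(img, 2)      # 2x2 -> 4x4, as 28 -> 56
print(big)
# [[0 0 1 1]
#  [0 0 1 1]
#  [2 2 3 3]
#  [2 2 3 3]]
```

No new image content is introduced by this resizing; only the input dimension $d$ grows, which is exactly the variable the experiment isolates.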
Compared to PGD training, weight decay regularization i) does not have an arbitrary hyperparameter that ignores inter-sample distances, ii) does not prolong training by a multiplicative factor given by the number of steps in the inner loop, and iii) is less attack-specific. Thus, we do not use adversarially augmented training, because we wish to convey a notion of robustness to unseen attacks and common corruptions. Furthermore, enforcing robustness to perturbations may increase vulnerability to invariance-based examples, where semantic changes are made to the input, thus changing the oracle label, but not the classifier's prediction [14]. Our models trained with weight decay obtained higher accuracy (86% vs. 74% correct) compared to batch norm on a small sample of 100 invariance-based MNIST examples.[8] We make primary use of traditional perturbations, as they are well studied in the literature and straightforward to compute, but solely defending against these is not the end goal. A more detailed comparison between adversarial training and weight decay can be found in [6]. The loss-scaling mechanism of weight decay is complementary to other mechanisms identified in the literature recently, for instance that it also increases the effective learning rate [31, 33]. Our results are consistent with these works in that weight decay reduces the generalization gap, even in batch-normalized networks where it is presumed to have no effect. Given that batch norm is not typically used on the last layer, the loss-scaling mechanism persists in this setting, albeit to a lesser degree.

[8] Invariance-based adversarial examples downloaded from https://github.com/ftramer/ExcessiveInvariance.
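To make the regularized objective of (5) concrete, the sketch below (with synthetic data, not the MNIST setup above) computes the gradient of the logistic loss plus an $L_2$ penalty, which is what SGD with weight decay descends, and verifies it against finite differences:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def reg_loss(w, X, y, lam):
    # logistic loss, eq. (2), plus the L2 penalty lam * ||w||^2
    return softplus(-y * (X @ w)).mean() + lam * np.dot(w, w)

def reg_grad(w, X, y, lam):
    # analytic gradient: mean over samples of -y * sigmoid(-z) * x, plus 2*lam*w
    z = y * (X @ w)
    g = (-(y / (1.0 + np.exp(z)))[:, None] * X).mean(axis=0)
    return g + 2.0 * lam * w

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))
y = rng.choice([-1.0, 1.0], size=32)
w = rng.normal(size=3)
lam, eps = 0.01, 1e-6

# central finite-difference check of the analytic gradient
num = np.array([
    (reg_loss(w + eps * e, X, y, lam) - reg_loss(w - eps * e, X, y, lam)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(reg_grad(w, X, y, lam), num, atol=1e-5))  # True
```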
Table 11: (Relative) clean test accuracy and loss for single-layer models on resized MNIST, with and without weight decay λ.

Model | λ      | Clean Acc. | Clean Loss | Pred.
28    | –      |            |            |
56    | –      |            |            | 2
56    | 0.01   |            |            |
84    | –      |            |            | 3
84    | 0.0225 |            |            |
112   | –      |            |            | 4
112   | 0.05   |            |            |
Model | BN | Clean Acc. | Clean Loss
28    | ✗  |            |
28    | ✓  |            |
56    | ✗  |            |
56    | ✓  |            |
84    | ✗  |            |
84    | ✓  |            |
Appendix C Adversarial Spheres
The "Adversarial Spheres" dataset contains points sampled uniformly from the surfaces of two concentric $d$-dimensional spheres with radii and respectively, and the classification task is to attribute a given point to the inner or outer sphere. We consider the case $d = 2$, that is, data points from two concentric circles. This simple problem poses a challenge to the conventional wisdom regarding batch norm: not only does batch norm harm robustness, it also makes training less stable. In Figure 8 we show that, using the same architecture as in [8], the batch-normalized network is highly sensitive to the learning rate $\eta$. We use SGD instead of Adam to avoid introducing unnecessary complexity, especially since SGD has been shown to converge to the maximum-margin solution for linearly separable data [25]. We use a finite dataset of 500 samples from projected onto the circles. The unnormalized network achieves zero training error for $\eta$ up to 0.1 (not shown), whereas the batch-normalized network is already untrainable at . To evaluate robustness, we sample 10,000 test points from the same distribution for each class (20k total), and apply noise drawn from . We evaluate only the models that could be trained to 100% training accuracy with the smaller learning rate of . The model with batch norm classifies of these points correctly, while the unnormalized net obtains .
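A dataset of this form can be generated with a few lines of NumPy. The radii below (1.0 and 1.3) are illustrative placeholders, as is the norm-threshold rule, which is the Bayes-optimal classifier for this task:

```python
import numpy as np

def sample_circles(n_per_class, r_inner=1.0, r_outer=1.3, seed=0):
    """Sample points uniformly from two concentric circles (d = 2).
    Radii are illustrative placeholders. Labels: 0 = inner, 1 = outer."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=2 * n_per_class)
    r = np.where(np.arange(2 * n_per_class) < n_per_class, r_inner, r_outer)
    X = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    y = (r == r_outer).astype(int)
    return X, y

X, y = sample_circles(250)  # 500 points total, as in the text
# norm-threshold classifier: assign to the nearer circle by radius
pred = (np.linalg.norm(X, axis=1) > (1.0 + 1.3) / 2).astype(int)
print((pred == y).mean())  # 1.0 on clean data
```

Robustness can then be probed by adding noise to X before thresholding, mirroring the evaluation described above.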
Appendix D Author Contributions
In the spirit of [21], we provide a summary of each author’s contributions.

First author formulated the hypothesis, conducted the experiments, and wrote the initial draft.

Second author prepared detailed technical notes on the main references, met frequently with the first author to advance the work, and critically revised the manuscript.

Third author originally conceived the key theoretical concept of Appendix B as well as some of the figures, and provided important technical suggestions and feedback.

Fourth author met with the first author to discuss the work and helped revise the manuscript.

Senior author critically revised several iterations of the manuscript, helped improve the presentation, recommended additional experiments, and sought outside feedback.