### by Zhifeng Wang

**Abstract**

Internal covariate shift (ICS) is a common problem in convolutional neural networks (CNNs). It causes the interconnected neurons of the network to interfere with one another and makes it difficult to optimize the parameters accurately during training. At present, the main method for addressing this problem is the batch normalization (BN) algorithm, which converts the sample features processed by the network into a low-value range and thereby alleviates the inter-layer interference caused by ICS during training. However, the BN algorithm cannot guarantee an improvement in classification accuracy when the mini-batch size is small, while a larger mini-batch size consumes more computing power and memory. To address the ICS problem more thoroughly, this paper proposes a CNN algorithm based on a regularized relu activation function, which introduces a regularization factor into the original relu activation function according to the amplification factors in the feature extraction process of the CNN. The algorithm mitigates the internal covariate shift problem that exists with the traditional relu activation function and improves the classification accuracy of the CNN.

**1. Introduction**

A neural network must be trained before it can perform object classification. The task of the training process is to repeatedly modify every neuron’s parameters by a specific optimization extent, based on the training dataset, until the classification accuracy of the network meets the target. The optimization extent of a neuron’s parameters is decided by a specific optimization algorithm according to the parameter gradient. During training, the parameter gradient is calculated from the feature and the error that the corresponding neuron has received. The feature is transmitted to the neuron through many transit neurons from the input layer, and the error is transmitted to it through many transit neurons from the output layer, so the parameter gradient of the neuron is easily affected by all the transit neurons of the network. That is, if the parameters of a certain transit neuron change, the received feature or the received error changes as well. This phenomenon is known as internal covariate shift. At every iteration cycle of the training stage, all parameters of the network are adjusted synchronously, so the transmitted feature traffic and error traffic vary along with the adjustment. The previously calculated parameter gradients therefore no longer reflect the current feature and error status of the neurons once the synchronous adjustment is applied, and the ICS phenomenon degrades the effect of the training.

At present, the main method for addressing the ICS problem is the batch normalization (BN) algorithm. The BN algorithm normalizes the input variable to a small range according to the mini-batch mean and the mini-batch variance, so it can effectively mitigate the ICS problem and speed up training. However, the performance of the BN algorithm depends on how well the samples in every mini-batch represent the overall distribution, so the mini-batch size and the feature distribution of each mini-batch have a large influence on the training effect. When the mini-batch size is small, the distribution of each mini-batch may differ from that of the overall training dataset, so an improvement in classification accuracy is not guaranteed when the BN algorithm is used. When the mini-batch size is large, on the other hand, more computing power and memory are consumed.

The contributions of this paper are a detailed analysis of the reasons why the ICS phenomenon exists in the convolutional neural network, and a new algorithm based on a regularized relu activation function. This algorithm can effectively mitigate both the ICS problem and the vanishing gradient problem without increasing the complexity of neural network training.

The remainder of this paper is organized into four sections. Section 2 gives the background, focusing on an analysis of the reasons for the ICS phenomenon and on previous work. Section 3 elaborates on the regularized relu algorithm. Section 4 gives the detailed experiment method, the testing results, and a discussion. Section 5 presents the conclusions.

**2. Background and previous work**

The neural network training process mainly includes the feature extraction process, the error backpropagation process, and the parameter learning process. These processes are realized by a plurality of neurons that are interconnected layer by layer. The neuron interactions in these processes are the main reason for the ICS phenomenon, so reducing these interactions is the main research direction for fixing the ICS problem.

2.1 Neuron interactions in the feature extracting process

The main task of the CNN is to extract features from the original images and to map each original image to the probability that it belongs to the corresponding class. The CNN is mainly composed of a plurality of neurons interconnected layer by layer. It includes an input layer, an output layer, and one or more hidden layers, as shown in the following figure:

Fig 1. Feature extracting process of the CNN

The function of the input layer is to encode the pixel intensities of the original images to obtain one or more data matrices reflecting the differentiated features of the original images. These data matrices serve as the input images of the hidden layer for further feature extraction.

The neural network includes one or more hidden layers, and each hidden layer contains one or more feature maps. A feature map is responsible for mapping the data of the input images to one or more feature data by means of the neurons that belong to the feature map. These feature data, directly or after a pooling operation, together form an image that becomes the input image of the next hidden layer. The main task of a neuron in the feature map is to perform feature extraction: it summarizes the data in a certain area of one or more input images according to the weight parameters and the bias parameter of the neuron, and then generates feature data through its activation function. The functional model of a neuron is as follows:

z=Σw_{k}*x_{k}+b, a=θ(z)

Where x_{k} represents the input data, w_{k} is the weight parameter of the neuron, b is the bias parameter of the neuron, z is the weighted sum variable, a is the output activation, and θ(z) is the activation function of the neuron. The main function of the activation function is to calculate the output value from the weighted sum variable and to limit the range of the output value. Traditional activation function types include sigmoid, tanh, and relu.
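The functional model above can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the numeric values are hypothetical and are not taken from the experiments in this paper.

```python
import math

def weighted_sum(inputs, weights, bias):
    """z = sum_k w_k * x_k + b for a single neuron."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def sigmoid(z):
    """Traditional sigmoid activation, output limited to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Traditional relu activation, output limited to [0, +inf)."""
    return max(0.0, z)

def neuron_output(inputs, weights, bias, activation):
    """a = theta(z), where theta is the chosen activation function."""
    return activation(weighted_sum(inputs, weights, bias))

# Illustrative values (assumptions): a neuron with a receptive field of 4 data.
x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, -0.25, 0.1, 0.2]
b = 0.1
z = weighted_sum(x, w, b)
print(z, neuron_output(x, w, b, relu), neuron_output(x, w, b, sigmoid))
```

The same weighted sum feeds either activation; only the range limiting differs.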

One neuron can simultaneously perform feature extraction on multiple different input images. One input image can be feature-extracted simultaneously by multiple neurons that belong to the same or to different feature maps. In the input image, the area from which a neuron extracts features is called a local receptive field, and the size of the local receptive field is called the convolution filter size of the neuron. Different neurons of the same feature map correspond to different local receptive fields of the input image.

In addition to neurons, the output layer includes classifiers. The main function of a classifier is to convert the output feature value of the corresponding output layer neuron into the probability that the original image belongs to the corresponding class. Generally, one classifier corresponds to one class.

2.2 Neuron interactions in error backpropagation process

At every iteration cycle of the training stage, the corresponding parameter gradient must be calculated before the optimization algorithm can optimize the parameters. A parameter gradient corresponds to one or more neuron parameter gradients. If a certain weight belongs exclusively to one neuron, the weight gradient is the same as the neuron weight gradient. If different neurons share the same weight, the gradient of the shared weight is the sum of the corresponding neuron weight gradients. The neuron weight gradient and the neuron bias gradient are calculated as follows:

G_{wl}= E_{l}*f'(z)*a_{l-1}

G_{bl}= E_{l}*f'(z)

Where G_{wl} and G_{bl} are respectively the neuron weight gradient and the neuron bias gradient of the l^{th} hidden layer, and f'(z) is the derivative of the corresponding activation function. The a_{l-1} is the input data to which the neuron weight corresponds; it is generally the output activation of the previous-layer neuron to which the neuron is connected. The E_{l} is the neuron’s output error, which is obtained as the partial derivative of the system loss function with respect to the neuron’s output variable.
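As a minimal sketch of the two gradient formulas and of the shared-weight rule, the following Python fragment may help. The helper names and the sample numbers are hypothetical, and relu is used as the activation purely for illustration.

```python
def relu_derivative(z):
    """f'(z) for the traditional relu: 1 in the positive range, else 0."""
    return 1.0 if z > 0 else 0.0

def neuron_gradients(output_error, z, prev_activations, activation_derivative):
    """Return (G_w, G_b) for one neuron.

    G_b = E_l * f'(z)
    G_w = E_l * f'(z) * a_{l-1}, one entry per input weight.
    """
    delta = output_error * activation_derivative(z)
    grad_w = [delta * a for a in prev_activations]
    return grad_w, delta

def shared_weight_gradient(per_neuron_gradients):
    """A weight shared by several neurons receives the sum of their gradients."""
    return sum(per_neuron_gradients)

# Illustrative values (assumptions, not experimental data).
grad_w, grad_b = neuron_gradients(0.5, 1.2, [1.0, 2.0], relu_derivative)
print(grad_w, grad_b)
print(shared_weight_gradient([0.5, 0.25, -0.25]))
```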

For the hidden layer neurons, E_{l} can be derived with the chain rule for composite functions. The system loss function takes the outputs of the classifiers as its input variables, and the outputs of the classifiers are obtained by performing feature extraction from the lower layers to the higher layers, so the system loss function can be regarded as a multivariate composite function of a large number of neuron variables in the network. Therefore, the output error of a certain neuron can be obtained by an iterative calculation based on the output errors of the next-layer neurons, according to the chain rule. The output error calculation process is shown in the following figure:

Fig 2. Error backpropagation process of the CNN

In the above figure, neuron 2, neuron 3, and neuron 4 are the next-layer neurons to which neuron 1 is connected, and the markers E_{2}*f'(z)*W_{22}, E_{3}*f'(z)*W_{32}, and E_{4}*f'(z)*W_{42} are respectively the backpropagation errors from neuron 2, neuron 3, and neuron 4. According to the chain rule, the output error of neuron 1 is the sum of all backpropagation errors from the next-layer neurons, that is, E_{2}*f'(z)*W_{22}+E_{3}*f'(z)*W_{32}+E_{4}*f'(z)*W_{42}. Here W_{22}, W_{32}, and W_{42} are the corresponding weight parameters, f'(z) is the derivative of the corresponding activation function, and E_{2}, E_{3}, and E_{4} are the output errors of the corresponding next-layer neurons.
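The summation for neuron 1 can be written as a short Python sketch. The function name is hypothetical and the numbers are illustrative; the sketch also assumes that each f'(z) is evaluated at the weighted sum of the respective next-layer neuron, which the figure does not state explicitly.

```python
def backpropagated_error(next_errors, next_derivatives, connecting_weights):
    """Output error of a neuron as the sum of E_j * f'(z_j) * W_j
    over the next-layer neurons j it is connected to."""
    return sum(
        e * d * w
        for e, d, w in zip(next_errors, next_derivatives, connecting_weights)
    )

# Illustrative values for neurons 2, 3, and 4 (assumptions).
errors = [0.2, -0.1, 0.4]        # E_2, E_3, E_4
derivatives = [1.0, 1.0, 0.0]    # f'(z) of each next-layer neuron
weights = [0.5, 0.3, 0.7]        # W_22, W_32, W_42
error_neuron_1 = backpropagated_error(errors, derivatives, weights)
print(error_neuron_1)
```

A next-layer neuron whose derivative is 0 (neuron 4 in this example) contributes nothing to the error of neuron 1.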

For the output layer neurons, which are located at the end of the neural network, E_{l} is obtained as the partial derivative of the system loss function with respect to the output layer neuron’s output variable. If the classifier is of the softmax type and the cross-entropy loss function is used, the output error of an output layer neuron is derived as below:

E_{i}=a_{i}-Y_{i}

Where a_{i} is the actual output probability of the classifier to which the output layer neuron corresponds, and Y_{i} is the expected output probability of the classifier. If the real class of the original image coincides with the class to which the classifier corresponds, Y_{i} is 1; otherwise Y_{i} is 0.
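The relation E_{i}=a_{i}-Y_{i} can be checked numerically. The sketch below assumes that the error is taken with respect to the value fed into the softmax classifier, and compares the analytic expression with a central finite difference of the cross-entropy loss. All names and input values are illustrative assumptions.

```python
import math

def softmax(zs):
    """Softmax classifier outputs for a list of output-layer values."""
    shift = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Cross-entropy loss for a one-hot target vector."""
    return -sum(y * math.log(p) for p, y in zip(probs, target) if y > 0)

def output_errors(probs, target):
    """E_i = a_i - Y_i for every output-layer neuron."""
    return [a - y for a, y in zip(probs, target)]

zs = [2.0, -1.0]          # values fed to the two classifiers (class 0, class 1)
target = [1.0, 0.0]       # the real class is class 0
analytic = output_errors(softmax(zs), target)

eps = 1e-6
numeric = []
for i in range(len(zs)):
    plus = list(zs)
    minus = list(zs)
    plus[i] += eps
    minus[i] -= eps
    diff = cross_entropy(softmax(plus), target) - cross_entropy(softmax(minus), target)
    numeric.append(diff / (2 * eps))

print(analytic, numeric)
```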

2.3 The formation mechanism of the ICS phenomenon

In the neural network training stage, the neurons’ weight parameters and bias parameters are generally trained by adjusting all parameters of the network synchronously at every iteration cycle. When the adjustment is applied, the parameters of the adjacent neurons change at once, so the feature and error traffic that a neuron receives may differ from the values before the synchronous adjustment, although the original images at the input layer have not changed. Because the parameter gradient is calculated from the received feature and the received error, the previously calculated parameter gradients no longer reflect the current feature and error status once the synchronous adjustment is applied. This variation of the received feature or error traffic may be very large, because the same input data are extracted repeatedly by multiple neurons in every hidden layer. The internal covariate shift phenomenon is shown below:

Fig 3. Formation mechanism of internal covariate shift

In the above figure, the input image contains 6 data. Data G and H belong only to local receptive field 1, data M and L belong only to local receptive field 2, and data E and F belong to both local receptive fields 1 and 2. Hidden layer 2 contains feature maps 21 and 22: feature map 21 includes neurons A and B, and feature map 22 includes neurons C and D. As shown in the figure, the four neurons of hidden layer 2 each use the same convolution filter size of 4 to acquire the input data, and each generates its own output activation carrying the feature information of the same input data. These output activations then converge on neuron P of hidden layer 3 and finally form the output activation of neuron P. Because neurons A and C both perform feature extraction on local receptive field 1, and neurons B and D both perform feature extraction on local receptive field 2, data G, H, M, and L are each extracted 2 times, and data E and F are each extracted 4 times (as shown in the figure). The same feature from the same input data has therefore been amplified in value by multiple neurons of hidden layer 2 before it arrives at neuron P. This means that a variation of the input data in the input image brings a multiplied variation to the output activation of neuron P during training, and thus the internal covariate shift phenomenon is formed. Moreover, because error backpropagation follows the reverse path of feature propagation, the repeated convolution by multiple neurons also makes the previous neurons receive more backpropagation error traffic, which is a further reason for the internal covariate shift phenomenon.
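The extraction counts quoted for Fig 3 can be reproduced with a small counting sketch. The dictionary layout is an assumption made for illustration; the membership of each datum in the receptive fields and the assignment of neurons to receptive fields follow the description above.

```python
from collections import Counter

# Data covered by each local receptive field (filter size 4).
receptive_fields = {
    1: ["G", "H", "E", "F"],
    2: ["E", "F", "M", "L"],
}

# Receptive field read by each neuron of hidden layer 2.
# A and B belong to feature map 21, C and D to feature map 22.
neuron_fields = {"A": 1, "B": 2, "C": 1, "D": 2}

extraction_counts = Counter()
for neuron, field in neuron_fields.items():
    extraction_counts.update(receptive_fields[field])

total_extractions = sum(extraction_counts.values())
print(dict(extraction_counts), total_extractions)
```

The 6 input data are extracted 16 times in total before reaching neuron P, which is the amplification that the text attributes to repeated convolution.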

2.4 Previous work to fix the ICS problem

Internal covariate shift, as a common problem in the neural network training stage, makes it difficult to optimize the parameters accurately during training. At present, the main method for mitigating this problem is batch normalization. Batch normalization is implemented by calculating, per mini-batch, the mean and variance of each input variable to a BN layer and using these statistics to perform the normalization. The batch normalization process is shown below [6].

µ_{B} = (1/m)*Σ_{i=1}^{m} x_{i} // mini-batch mean

σ^{2}_{B} = (1/m)*Σ_{i=1}^{m} (x_{i}-µ_{B})^{2} // mini-batch variance

x̂_{i} = (x_{i}-µ_{B})/√(σ^{2}_{B}+ε) // normalize

y_{i} = γ*x̂_{i}+β // scale and shift

In the above formulas, x_{i} is the weighted sum variable that is input to the BN layer, and every x_{i} corresponds to a certain image of the mini-batch. Here m is the number of samples in the mini-batch and ε is a small constant that keeps the division numerically stable. µ_{B} is the mean value of all x_{i} variables of the mini-batch, and σ^{2}_{B} is the variance calculated from µ_{B} and all x_{i} variables of the mini-batch. After all original samples of one mini-batch are input to the neural network, the neurons located in front of the neuron perform feature extraction on these input samples layer by layer. Finally, when the neuron forms the weighted sum variable for a sample of the mini-batch from the previous-layer neurons’ output activations, the corresponding variable x_{i} is formed. The variable y_{i} is the output of the BN function. As the operation result that corresponds to a certain variable x_{i}, it is transmitted to the activation function for further operations. The parameters β and γ allow automatic scaling and shifting of the normalization output. These two parameters are learned by the model as part of the neural network training process.
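The four steps listed above (mini-batch mean, mini-batch variance, normalize, scale and shift) can be sketched for a single input variable as follows. This is a minimal illustration under stated assumptions: the function name and sample values are hypothetical, and the small constant `eps` inside the square root follows the usual formulation of batch normalization rather than anything specified in this paper.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Apply batch normalization to one variable over a mini-batch.

    xs holds one weighted-sum value x_i per sample of the mini-batch.
    """
    size = len(xs)
    mean = sum(xs) / size                                    # mini-batch mean
    var = sum((x - mean) ** 2 for x in xs) / size            # mini-batch variance
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in xs]  # normalize
    return [gamma * xh + beta for xh in x_hat]               # scale and shift

# Illustrative mini-batch of four samples (assumed values).
y = batch_norm([2.0, 4.0, 6.0, 8.0])
mean_y = sum(y) / len(y)
var_y = sum((v - mean_y) ** 2 for v in y) / len(y)
print(y, mean_y, var_y)
```

With the default γ=1 and β=0 the outputs have a mean of about 0 and a variance of about 1 over the mini-batch; other values of γ and β rescale and shift that range.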

From the above formulas, it can be seen that batch normalization converts the input variable into a value with a mean of approximately 0 and a variance of approximately 1 over the mini-batch. This means that the BN function keeps the output activation of the neuron in a low-value range, which decreases the mutual influence between this-layer neurons and next-layer neurons, so the ICS problem is mitigated. However, the BN algorithm also has a disadvantage. When the mini-batch size is small, the feature distributions of different mini-batches differ more from one another, so the mean parameter µ_{B} and the variance parameter σ^{2}_{B} are not consistent between mini-batches and differ from those of the overall training dataset. Because of these different values of µ_{B} and σ^{2}_{B}, the network may learn different representative values from different mini-batches although the corresponding features are basically the same, which affects the classification accuracy of the model. On the other hand, if the mini-batch size is set too large, then, considering the large number of neurons in the network, the training time becomes longer because of the wider iterative calculation in the BN process. More intermediate variables are also required, and these consume more memory.
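The sensitivity of the mini-batch statistics to the mini-batch size can be illustrated with a toy experiment. The sketch below uses synthetic Gaussian data rather than image features, so it only illustrates the statistical effect and is not a reproduction of any result in this paper. The names, the seed, and the batch sizes 4 and 64 are arbitrary assumptions.

```python
import random
import statistics

def batch_means(samples, batch_size):
    """Mean of each consecutive, non-overlapping mini-batch."""
    return [
        statistics.mean(samples[i:i + batch_size])
        for i in range(0, len(samples) - batch_size + 1, batch_size)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
population = [rng.gauss(0.0, 1.0) for _ in range(6400)]

# How much the mini-batch mean moves from one mini-batch to the next.
spread_small = statistics.pstdev(batch_means(population, 4))
spread_large = statistics.pstdev(batch_means(population, 64))
print(spread_small, spread_large)
```

The mini-batch mean fluctuates noticeably more for the small batch size, which corresponds to the inconsistency between mini-batches described above.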

Therefore, to optimize the parameters accurately during training and to make training more convenient, a more effective solution for mitigating the ICS phenomenon is needed.

**3. Proposed method**

To solve the ICS problem more conveniently and further improve the performance of the deep convolutional neural network, the regularized relu activation function algorithm is proposed.

3.1 The definition of regularized relu activation function

This algorithm introduces a regularization factor into the traditional relu activation function according to the convolution filter size, the number of this-layer feature maps, the number of neurons in one this-layer feature map, and the number of data in one input image. The regularized relu activation function is defined as below:

f(z)=max(0,z*c/(s*n*m))

In the above function, the term c/(s*n*m) is the regularization coefficient, which regularizes the output value. It consists of the feature map regularization factor 1/n and the convolution regularization factor c/(s*m).

In the feature map regularization factor 1/n, the parameter n is the number of feature maps in this layer. The factor regularizes the traffic output by all this-layer neurons so that together they output only the feature traffic of one feature map. When this factor is deployed in all this-layer neurons, 1/n, as the reciprocal of the number of this-layer feature maps, compensates for the influence of repeated convolution by multiple this-layer feature maps.

The convolution regularization factor c/(s*m) compensates for the feature amplification caused by multiple neurons in one feature map whose local receptive fields overlap. Here the parameter c is the number of data in one input image, the parameter m is the convolution filter size of the neuron, and the parameter s is the number of neurons in one this-layer feature map. The part c/m is the number of neurons needed to cover one input image without repeated convolution. Therefore, when this factor is deployed in all this-layer neurons, c/(s*m), as the quotient of the necessary number of neurons divided by the number of neurons in one this-layer feature map, compensates for the influence of repeated convolution by multiple neurons in one this-layer feature map.
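A minimal Python sketch of the regularized relu definition f(z)=max(0,z*c/(s*n*m)) is given below. The function names are hypothetical. The example values are read from the layer described for Fig 3 in Section 2.3 (c=6 data in the input image, s=2 neurons per feature map, n=2 feature maps, filter size m=4); applying the coefficient to that layer is an illustration, not a configuration taken from the experiment.

```python
def regularization_coefficient(c, s, n, m):
    """c/(s*n*m): feature map factor 1/n times convolution factor c/(s*m).

    c: number of data in one input image
    s: number of neurons in one this-layer feature map
    n: number of feature maps in this layer
    m: convolution filter size of the neuron
    """
    return c / (s * n * m)

def regularized_relu(z, c, s, n, m):
    """f(z) = max(0, z * c / (s * n * m))."""
    return max(0.0, z * regularization_coefficient(c, s, n, m))

coefficient = regularization_coefficient(c=6, s=2, n=2, m=4)
print(coefficient)                                   # 6 / 16
print(regularized_relu(2.0, c=6, s=2, n=2, m=4))     # positive input is scaled
print(regularized_relu(-2.0, c=6, s=2, n=2, m=4))    # negative input is cut off
```

For this layer the coefficient is 6/16. Since the four neurons perform 16 extractions on the 6 input data, scaling by 6/16 brings the total back to the scale of a single pass over the input image.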

3.2 The advantage of regularized relu activation function

When the synchronous parameter adjustment changes the transmitted feature or error during training, the regularized relu activation functions deployed on all neurons of all feature maps prevent the traffic variation from being amplified by these parallel neurons. From the viewpoint of traffic inhibition, the whole hidden layer can be regarded as a single virtual feature map that performs one-time feature extraction on the data of the input image, and all neurons in the hidden layer can be regarded as one or more virtual neurons in this virtual feature map, whose extraction areas on the input image do not overlap. Therefore, the synchronous parameter adjustment does not bring excessive changes to the received feature or error traffic, and the ICS phenomenon is alleviated. After the regularized relu activation function has been deployed in all neurons of the hidden layer, its effect is as shown below.

Fig 4. Impact-map of Regularized relu activation function

On the other hand, the curve of the regularized relu activation function grows linearly within the positive input range. The function therefore has a constant derivative in the positive input range, so it can address the vanishing gradient problem in the same way as the traditional relu activation function. The curve of the regularized relu activation function is as follows:
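The constant derivative in the positive input range can be contrasted with the sigmoid derivative, which shrinks as the input grows. The sketch below is illustrative only: the function names are hypothetical, and the parameter values c=6, s=2, n=2, m=4 are assumed for the example.

```python
import math

def sigmoid_derivative(z):
    """Derivative of the sigmoid, which approaches 0 for large |z|."""
    sig = 1.0 / (1.0 + math.exp(-z))
    return sig * (1.0 - sig)

def regularized_relu_derivative(z, c, s, n, m):
    """Derivative of max(0, z*c/(s*n*m)): the coefficient for z > 0, else 0."""
    return c / (s * n * m) if z > 0 else 0.0

for z in (0.5, 5.0, 50.0):
    print(z, sigmoid_derivative(z), regularized_relu_derivative(z, c=6, s=2, n=2, m=4))
```

The regularized relu column stays at the same value for every positive input, while the sigmoid column falls toward zero.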

Fig 5. The curve of Regularized relu function

**4. Experiment and discussion**

Two frequently used nonlinear activation functions, sigmoid and relu, are selected as references for the regularized relu activation function algorithm. The goal of the experiment is to compare the performance of the three activation function models. The experiment environment, testing method, and testing results are described below.

4.1 Experiment environment

The neural network used in the experiment is composed of 1 input layer, 6 convolution layers, 2 pooling layers, and 1 output layer. The neural network architecture and the corresponding entity configurations are shown below:

Fig 6. The neural network architecture of the models

Table 1. The entity configurations of the models

The input layer is responsible for producing object data matrices by performing binarization operations on the original images (which include pixel maps for three colors: red, green, and blue) from the training dataset or the test dataset. These object data matrices are transmitted to the hidden layer through three data channels, one per color. The output layer includes 2 neurons, for class 0 and class 1 respectively. The classifiers are of the softmax type, and the cross-entropy loss function is used to calculate the system loss. The pooling layers use mean-pooling, but other pooling methods (for example max-pooling) are also supported by the regularized relu activation function model.

To test the regularized relu model, the regularized relu activation function needs to be deployed on all neurons of every hidden layer, from convolution layer 1 to convolution layer 6. In addition, the data of the input layer generally need to be amplified. The magnification needs to be adjusted according to the output values of the output layer neurons, which prevents these output values from becoming too small because of the inhibition effect of the feature map regularization factor and the convolution regularization factor in every hidden layer. In this experiment, the magnification is set to 10000 when the maximum value of the input layer data is 100.

4.2 Training dataset and test dataset

The Cifar-10 dataset is used for this experiment. Cifar-10 is a standard dataset collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. It consists of 60000 32×32 color images in 10 classes, with 6000 images in each class. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class.

Because of limited computing power, the first 800 pictures tagged as class 0 or class 1 in the Cifar-10 training batches are selected as the training dataset of the experiment, and the first 1000 pictures tagged as class 0 or class 1 in the Cifar-10 test batch are selected as the test dataset. Because all the models use the same training dataset and test dataset, this selection method does not affect the correctness of the testing conclusion.

4.3 Evaluating index

Accuracy is selected as the performance evaluating index of the neural network model. It is defined as below:

Accuracy = (TP+TN)/(TP+TN+FP+FN)

Here TP means True Positives (correctly identified positives: the image belongs to the positive class and the model predicts the positive class), TN means True Negatives (correctly identified negatives: the image belongs to the negative class and the model predicts the negative class), FP means False Positives (incorrectly identified positives: the image belongs to the negative class but the model predicts the positive class), and FN means False Negatives (incorrectly identified negatives: the image belongs to the positive class but the model predicts the negative class).
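The accuracy definition can be sketched for the two-class case as follows. Treating class 1 as the positive class is an assumption made only for this example, and the label lists are made-up values rather than experimental data.

```python
def confusion_counts(true_labels, predicted_labels, positive=1):
    """Count TP, TN, FP, and FN for a two-class problem."""
    tp = tn = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP+TN)/(TP+TN+FP+FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Illustrative labels (assumptions).
truth = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(truth, predicted)
print(counts, accuracy(*counts))
```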

4.4 Optimization algorithm selection

In recent years many neural network optimization algorithms have been proposed, for example SGD, RMSProp, Adam, and MBGD. These algorithms were contributed to improve the neural network training effect. Among them, the Adam algorithm has been used heavily in recent years because of its good classification accuracy and convergence speed, so it is selected for the model testing. Because all the tested models face the same vanishing gradient problem and internal covariate shift phenomenon, this choice of optimization algorithm does not affect the correctness of the testing conclusion.

The Adam algorithm combines the EWMA (exponentially weighted moving average) of the parameter gradient and the EWMA of the squared parameter gradient to form the parameter adjustment extent at every iteration cycle. In this way it not only substantially alleviates parameter oscillation during training but also adapts the parameter learning rate to the progress of the training: the learning rate of a parameter is reduced to a certain degree after the parameter has received enough training.

For the Adam algorithm, several variables are calculated to produce the parameter adjustment extent at every iteration cycle. The calculation formulas of the Adam algorithm are as follows [4]:

m_{t}=β_{1}*m_{t-1}+(1-β_{1})*g_{t}, (Update the EWMA of parameter gradient)

v_{t}=β_{2}*v_{t-1}+(1-β_{2})*g_{t}^{2}, (Update the EWMA of parameter gradient square)

m̂_{t}=m_{t}/(1-β_{1}^{t}), (Compute corrected EWMA of parameter gradient)

v̂_{t}=v_{t}/(1-β_{2}^{t}), (Compute corrected EWMA of parameter gradient square)

W_{t}←W_{t-1}-η*m̂_{t}/(√v̂_{t}+ε), (Update parameter value)

Where g_{t} is the parameter gradient at time t, m_{t} and m_{t-1} are respectively the EWMA of the parameter gradient at time t and time t-1, v_{t} and v_{t-1} are respectively the EWMA of the squared parameter gradient at time t and time t-1, m̂_{t} and v̂_{t} are respectively the corrected EWMAs of the parameter gradient and the squared parameter gradient, and W_{t} and W_{t-1} are respectively the neuron parameter value at time t and time t-1.

For the above five Adam formulas, some coefficients must be preset before parameter learning. Among them, η is the learning rate coefficient, ε is a small constant, β_{1} is the EWMA coefficient of the parameter gradient, and β_{2} is the EWMA coefficient of the squared parameter gradient. The β_{1}^{t} and β_{2}^{t} are respectively the correction coefficients for m̂_{t} and v̂_{t}; they have larger values in the first iteration cycles and gradually decrease toward zero in subsequent iteration cycles.
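One Adam iteration for a single parameter can be sketched as below. The sketch follows the standard published form of Adam, in which the corrected gradient EWMA is divided by the square root of the corrected squared-gradient EWMA plus ε. The function name, the starting weight, and the gradient value are hypothetical, and the default coefficients are the commonly used ones.

```python
import math

def adam_step(w, g, m_prev, v_prev, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Perform one Adam update for a single parameter.

    w: parameter value at time t-1, g: parameter gradient at time t,
    m_prev, v_prev: EWMAs at time t-1, t: iteration index starting at 1.
    Returns the new parameter value and the updated EWMAs.
    """
    m = beta1 * m_prev + (1 - beta1) * g        # EWMA of the gradient
    v = beta2 * v_prev + (1 - beta2) * g * g    # EWMA of the squared gradient
    m_hat = m / (1 - beta1 ** t)                # corrected gradient EWMA
    v_hat = v / (1 - beta2 ** t)                # corrected squared-gradient EWMA
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w_new, m, v

# First iteration from zero-initialized EWMAs (illustrative values).
w1, m1, v1 = adam_step(w=1.0, g=0.5, m_prev=0.0, v_prev=0.0, t=1)
print(w1, m1, v1)
```

In the first iteration the corrections cancel the small initial EWMAs, so the parameter moves by almost exactly the learning rate in the direction opposite to the gradient.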

4.5 Hyper-parameter tuning

In the experiment, the sigmoid, relu, and regularized relu models use the same initial weights. Every initial weight is a random value between -1 and 1 produced by the random module of Python. The Adam-related parameters follow common practice: the learning rate coefficient η is set to 0.001, the parameter ε is set to 0.00000001, the gradient EWMA coefficient β_{1} is set to 0.9, the squared-gradient EWMA coefficient β_{2} is set to 0.999, the initial value of β_{1}^{t} is set to 0.9, and the initial value of β_{2}^{t} is set to 0.999.

4.6 Testing steps and overfitting consideration

Because they are different datasets, the training dataset and the test dataset may contain somewhat different features for the same classes. A model that is trained excessively on the training dataset may learn noise or less representative features from it, so when the trained model is used to classify the test dataset, its classification accuracy is generally lower than on the training dataset. This is the overfitting phenomenon. To measure the accuracy of the activation function models and eliminate the influence of overfitting, the peak accuracy testing method is selected.

The peak accuracy testing method is divided into several steps. First, start the feature learning process of the model and train the model on the training dataset epoch by epoch. When each epoch is completed, save all model parameters as the model status of that epoch and check the classification accuracy on the training dataset. When the classification accuracy on the training dataset does not increase, or fluctuates within a small range, for over fifteen consecutive epochs, the learning has converged, which means the model has finished learning all necessary features of the training dataset. Then, among all the saved model statuses, find the one with the highest classification accuracy on the test dataset. This highest classification accuracy on the test dataset represents the accuracy of the model.
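The steps of the peak accuracy testing method can be sketched as two small helpers operating on per-epoch accuracy histories. This is only one possible reading of the procedure: the function names are hypothetical, the accuracy histories are made-up numbers, and the `tolerance` value that defines "a small range" is an assumption because the text does not quantify it.

```python
def has_converged(train_accuracies, patience=15, tolerance=0.005):
    """True once training accuracy has stayed within `tolerance`
    for more than `patience` consecutive epochs."""
    if len(train_accuracies) <= patience:
        return False
    window = train_accuracies[-(patience + 1):]
    return max(window) - min(window) <= tolerance

def peak_test_accuracy(test_accuracies):
    """Return (epoch index, accuracy) of the saved model status
    with the highest test accuracy."""
    best_epoch = max(range(len(test_accuracies)), key=lambda i: test_accuracies[i])
    return best_epoch, test_accuracies[best_epoch]

# Illustrative per-epoch histories (assumptions, not experimental data).
train_history = [0.60, 0.75, 0.85] + [0.95] * 16
test_history = [0.55, 0.70, 0.78, 0.80] + [0.79] * 15

if has_converged(train_history):
    print(peak_test_accuracy(test_history))
```

Separating the convergence check (on the training curve) from the peak selection (on the test curve) mirrors the order of the steps described above.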

4.7 Testing results and discussion

The testing results are shown in the following figures:

Fig 7. Model accuracy of Sigmoid

Fig 8. Model accuracy of Relu

Fig 9. Model accuracy of Regularized relu

Fig 10. Model accuracy comparison for Sigmoid, Relu, and Regularized relu

From all the above figures, it is seen that the peak accuracy status which is marked keeps for over 15 epochs in both the training curve and the test curve, it indicates the training convergence arrived. So it can be concluded that the classification accuracy of the sigmoid model is 77.74%, the classification accuracy of the relu model is 76.93%, and the classification accuracy of the regularized relu model is 82.98%, this regularized relu model has a higher classification accuracy than other two models in this testing environment.

Secondly, the classification accuracy of the relu model is lowest in the three models, it indicates the ICS problem has affected the parameter optimization during the relu model training. For the sigmoid model, because of the activation scope limiting effect of the sigmoid function, the sigmoid model has a certain restrictive effect on the ICS problem than the relu model.

Additionally, the accuracy of the sigmoid model is higher than that of the relu model, which indicates that the sigmoid model is only mildly affected by its vanishing gradient problem. This is because the CNN in the testing environment is not very deep: it has only six hidden layers, so the vanishing gradient problem is not serious in this testing environment.

**5. Conclusion**

The amplification of features and errors by multiple neurons distributed across multiple feature maps is the main cause of the internal covariate shift phenomenon. The feature map regularization factor and the convolution regularization factor in the regularized relu activation function jointly regularize the output activations of this-layer neurons within the scale of a single feature extraction on the input images. They compensate for the multiplicative variation of the transmitted features or errors during neural network training, and concurrently alleviate both the vanishing gradient problem and the internal covariate shift phenomenon in deep convolutional neural networks. Compared with traditional nonlinear activation functions, the regularized relu activation function algorithm can improve classification accuracy, and it does not require additional intermediate variables, which would consume more computing power and memory space.

**References**

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems. URL http://imagenet-classification-with-deep-convolutional-nn.pdf (nvidia.cn).

[2] A. L. Maas, A. Y. Hannun, and A. Y. Ng. 2013. “Rectifier nonlinearities improve neural network acoustic models.” Proc. ICML, volume 30. URL http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=38E2FA51A64FA003D871ADAD2069167B?doi=10.1.1.693.1422&rep=rep1&type=pdf

[3] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. 2016. “Fast and accurate deep network learning by exponential linear units (elus).” ICLR. URL https://arxiv.org/pdf/1511.07289v2.pdf

[4] D. Kingma and J. Ba. 2015. “Adam: A method for stochastic optimization.” ICLR. URL http://cslt.riit.tsinghua.edu.cn/mediawiki/images/b/b0/2015_ADAM-A_method_for_stochastic_optimization.pdf

[5] M. A. Nielsen. 2015. *Neural Networks and Deep Learning*. Determination Press. URL http://neuralnetworksanddeeplearning.com

[6] S. Ioffe and C. Szegedy. 2015. “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” International Conference on Machine Learning. URL https://arxiv.org/pdf/1502.03167v2.pdf

[7] T. Tieleman and G. Hinton. 2012. “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude.” COURSERA: Neural networks for machine learning.

[8] W. Shang, K. Sohn, D. Almeida, and H. Lee. 2016. “Understanding and improving convolutional neural networks via concatenated rectified linear units.” Proceedings of the 33rd International Conference on Machine Learning, volume 48. URL https://arxiv.org/pdf/1603.05201v1.pdf

**Glossary of Key terms**

Convolutional neural network |
A convolutional neural network is a class of deep neural network most commonly applied to analyzing visual imagery.
It accepts original images and outputs a classification that identifies the class of the object to which each original image corresponds. |

Neural network training |
Neural network training sets the parameters of all neurons in the neural network. The training process performs parameter learning cyclically for all neurons based on the training dataset until the classification accuracy of the neural network meets the target. |

Training dataset |
The training dataset is the set of data samples, labeled by class, that is used to train the neural network model. |

Test dataset |
The test dataset is a set of data samples used to test the classification accuracy of the neural network model. |

Optimization algorithm |
An algorithm that updates the model parameters based on the error estimated by the system loss function. |

Parameter learning |
The action that updates the weight parameter or the bias parameter based on the error estimated by the system loss function. |

System loss function |
The system loss function evaluates the degree of error, that is, the gap between the actual output and the expected output of every classifier. Different types of system loss function can be chosen for a neural network; for a softmax classifier, the cross-entropy loss function is generally chosen: L=-∑Y_{i}*ln(a_{i}), where a_{i} is the output probability of each classifier and Y_{i} is the expected output value of the classifier. If the real class of the original image coincides with the class to which the classifier corresponds, Y_{i} is 1; otherwise, Y_{i} is 0. |

Parameter gradient |
The parameter gradient is the partial derivative of the system loss function with respect to the parameter variable. The parameter may be exclusive to one neuron or shared by multiple neurons belonging to the same feature map. |

Neuron parameter gradient |
The neuron parameter gradient is the partial derivative of the system loss function with respect to the weight parameter variable that the neuron uses to connect to a low-layer neuron. |

Softmax |
The softmax function is used to calculate probabilities. Its formula is a_{i}=e^{z_{i}}/Σe^{z_{j}}, where a_{i} is the output probability of the classifier, z_{i} is the output feature value of the neuron to which the classifier corresponds, and each z_{j} is the output feature value of a different output-layer neuron. |
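As a minimal illustration of the softmax and cross-entropy formulas given in this glossary, the following Python sketch computes both for one sample. The function names and the example values are assumptions for illustration; subtracting the maximum before exponentiating is a common numerical-stability step that is not part of the formulas themselves and does not change the result.

```python
import math

def softmax(z):
    """a_i = e^{z_i} / sum_j e^{z_j} for a list of output-layer values z."""
    m = max(z)  # assumed stability trick: shifting z leaves the ratios unchanged
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(a, y):
    """L = -sum_i Y_i * ln(a_i), where y is the one-hot expected output."""
    return -sum(yi * math.log(ai) for ai, yi in zip(a, y) if yi > 0)

# Hypothetical output-layer values for a three-class problem.
probs = softmax([2.0, 1.0, 0.1])
loss = cross_entropy(probs, [1, 0, 0])  # the real class is the first one
```

Because only one Y_{i} equals 1, the loss reduces to the negative logarithm of the probability assigned to the real class.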

Mini-batch |
With the MBGD optimization algorithm or the BN algorithm, the neural network is trained mini-batch by mini-batch, and every mini-batch includes a specific number of samples. In every iteration cycle, the optimization algorithm modifies each parameter based on all the parameter gradients produced by all the samples of the mini-batch. |

Mini-batch size |
Mini-batch size refers to the number of samples contained in the mini-batch. |

Mini-batch feature distribution status |
Mini-batch feature distribution status refers to the distribution, over feature value ranges, of the number of samples in the mini-batch whose feature at a given location falls within each range. |

Iteration cycle |
In the neural network training stage, the parameter optimization process is divided into many steps according to time cycles. In every time cycle, parameter optimization is applied to all parameters of the neural network based on the current original image or the current mini-batch. The time cycle used by the optimization algorithm is called an iteration cycle. |

Output error |
The neuron output error is the partial derivative of the system loss function with respect to the neuron output, where the neuron output is treated as an intermediate variable of the system loss compound function. |

Error backpropagation process |
The error backpropagation process calculates the output error of a neuron from the output errors of the high-layer neurons connected to that neuron. |

Multivariate compound function |
The definition of a multivariate compound function can be illustrated by an example: if L=δ(a,b), where a=f(u), b=g(v), u=φ(x,y), v=ψ(x,y), then L becomes a function of x and y through a, b, u and v. L is then called a multivariate compound function, and a, b, u and v are called the intermediate variables of the compound function. |

Compound function chain derivative rule |
An example of the compound function chain derivative rule: if L=δ(a,b), where a=f(u), b=g(v), u=φ(x,y), v=ψ(x,y), then the partial derivative of L with respect to x is ∂L/∂x=(∂L/∂a)*(da/du)*(∂u/∂x)+(∂L/∂b)*(db/dv)*(∂v/∂x)=(∂L/∂a)*f’(u)*(∂u/∂x)+(∂L/∂b)*g’(v)*(∂v/∂x). Based on this rule, it can be deduced that the output error (like ∂L/∂x above) of a given neuron is the sum of the backpropagation errors from all the next-layer neurons, where every backpropagation error is the product of the output error (like ∂L/∂a or ∂L/∂b above) of the corresponding next-layer neuron, the derivative of the corresponding activation function (like f’(u) or g’(v) above), and the corresponding neuron weight value (like ∂u/∂x or ∂v/∂x above). |
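The chain rule above can be checked numerically. The sketch below picks concrete functions purely for illustration (δ(a,b)=a*b, f(u)=u², g(v)=3v, φ(x,y)=x*y, ψ(x,y)=x+y are assumptions, not functions used by the paper) and compares the chain-rule result with a central finite difference.

```python
def loss(x, y):
    """L = delta(a, b) with the assumed example functions."""
    u, v = x * y, x + y          # u = phi(x, y), v = psi(x, y)
    a, b = u ** 2, 3.0 * v       # a = f(u),      b = g(v)
    return a * b                 # L = delta(a, b)

def dloss_dx_chain(x, y):
    """Partial derivative of L with respect to x via the chain rule."""
    u, v = x * y, x + y
    a, b = u ** 2, 3.0 * v
    dL_da, dL_db = b, a              # partial derivatives of delta(a, b) = a * b
    f_prime, g_prime = 2.0 * u, 3.0  # f'(u) and g'(v)
    du_dx, dv_dx = y, 1.0            # partial derivatives of phi and psi w.r.t. x
    return dL_da * f_prime * du_dx + dL_db * g_prime * dv_dx

def dloss_dx_numeric(x, y, h=1e-6):
    """Central finite-difference approximation of the same derivative."""
    return (loss(x + h, y) - loss(x - h, y)) / (2.0 * h)

chain_value = dloss_dx_chain(1.5, 2.0)
numeric_value = dloss_dx_numeric(1.5, 2.0)
```

The two sums in `dloss_dx_chain` correspond to the two backpropagation paths through a and b, which is the same structure as the sum over next-layer neurons described above.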

Activation function |
The activation function is mainly used to limit the output value of neurons to a certain range. Traditional activation function types include sigmoid, tanh, and relu. |

Relu |
Relu is a type of activation function. It calculates the output value from the input data and decides whether the output value should be kept in the model. The formula of the relu function is a=max(0,z): the output value is 0 when z<0 and is equal to z otherwise. |

Tanh |
Tanh is a type of activation function. It calculates the output value from the input data and decides whether the output value should be kept in the model. The formula of the tanh function is a=(e^{z}-e^{-z})/(e^{z}+e^{-z}), and its output value is greater than -1 and less than 1. |

Sigmoid |
Sigmoid is a type of activation function. It calculates the output value from the input data and decides whether the output value should be kept in the model. The formula of the sigmoid function is a=1/(1+e^{-z}), and its output value is greater than 0 and less than 1. |

Vanishing gradient problem |
When a nonlinear activation function such as sigmoid or tanh is used, it responds only weakly to saturating inputs: the derivative of the activation function is close to zero in the saturated input state. Because the derivative of the activation function takes part in the neuron’s error backpropagation process during parameter training, this vanishing derivative can lead to very small neuron output errors and very small parameter gradients in the shallow layers of the neural network. This is the vanishing gradient problem, which makes it difficult to train the parameters of the shallow layers sufficiently. |
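To give a rough sense of scale, the following Python sketch multiplies the sigmoid derivatives along a chain of layers. It is a simplified illustration under stated assumptions: one neuron per layer, all weights ignored (treated as 1), and the input values 0.0 and 4.0 chosen arbitrarily as a non-saturated and a saturated case. The depth of six matches the six hidden layers of the testing environment.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid; its maximum value is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def derivative_product(pre_activations):
    """Product of activation derivatives along one backpropagation path."""
    factor = 1.0
    for z in pre_activations:
        factor *= sigmoid_prime(z)
    return factor

best_case = derivative_product([0.0] * 6)   # every layer at its largest derivative
saturated = derivative_product([4.0] * 6)   # every layer in the saturated region
```

Even in the best case the factor is 0.25 to the sixth power (about 0.00024), and it becomes several orders of magnitude smaller once the inputs saturate, which is why deeper sigmoid networks suffer more from this problem.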

Hidden layer |
An internal functional entity of the neural network that is neither the input layer nor the output layer. It can use a set of learnable filters to perform operations on the input image. |

Convolutional layer |
A layer that uses a set of learnable filters to perform operations on the input image. |

Filter |
A filter is a set of weight parameters used to extract a feature from all the data located in an area of the input image; each weight parameter generally corresponds to one data value. As the area moves, the filter extracts another feature for the current area. |
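The sliding behaviour of a filter can be sketched in a few lines of Python. This is a minimal illustration only: the function name `apply_filter`, the stride of 1, the absence of padding and bias, and the example image and weights are all assumptions made for the example.

```python
def apply_filter(image, weights):
    """Slide a square k x k filter over a 2-D image (stride 1, no padding).

    Each output value is the weighted sum of the image area currently
    covered by the filter, with one weight per data value.
    """
    k = len(weights)
    out = []
    for r in range(len(image) - k + 1):
        row = []
        for c in range(len(image[0]) - k + 1):
            row.append(sum(weights[i][j] * image[r + i][c + j]
                           for i in range(k) for j in range(k)))
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
weights = [[1, 0],
           [0, -1]]
feature_map = apply_filter(image, weights)
```

The same four weights are reused at every position, which is what makes the weight parameters shared by all neurons of one feature map.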

Pooling layer |
The pooling layer in a CNN architecture abstracts image features by means of the pooling operation. |

Pooling |
The pooling operation combines multiple feature values of one feature map into one value by taking the maximum, by taking the average, or by another algorithm. |
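A minimal Python sketch of the pooling operation follows. The function name `pool2d`, the non-overlapping 2×2 windows, and the example feature map are assumptions for illustration; other window sizes and strides are equally possible.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Combine each non-overlapping size x size window into a single value."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            if mode == "max":
                out_row.append(max(window))                 # max pooling
            else:
                out_row.append(sum(window) / len(window))   # average pooling
        pooled.append(out_row)
    return pooled

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 2, 7, 8]]
max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="avg")
```

With these settings a 4×4 feature map is reduced to 2×2, so each pooled value summarizes four original feature values.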

This-layer |
The hidden layer or the output layer where this neuron is located. |

Previous-layer |
The hidden layer or the input layer whose output this neuron uses as input. |

Next-layer |
The hidden layer or the output layer to which this neuron transmits its output activation. |

SGD |
The SGD (stochastic gradient descent) algorithm is a neural network optimization algorithm. It calculates the gradient of each neural network parameter in every iteration cycle and then modifies the parameters directly based on the gradients. |

MBGD |
The MBGD (mini-batch gradient descent) algorithm is a neural network optimization algorithm that updates the parameters according to the loss on a batch of training data. The modification extent of a parameter is calculated from the average of the gradients generated by the training samples of the batch. |

RMSProp |
The RMSProp (root mean square propagation) algorithm is a neural network optimization algorithm based on an adaptive learning rate. It provides an attenuation factor for the parameter learning rate according to the EWMA (exponentially weighted moving average) of the squared parameter gradient. |

Exponentially weighted moving average |
The exponentially weighted moving average (EWMA) is an average that tracks the change of a variable. Its formula is EWMA(t)=p*EWMA(t-1)+(1-p)*x(t), where EWMA(t) is the moving average at time t, p is a mixing coefficient between 0 and 1, and x(t) is the value of variable x at time t. |
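The EWMA formula, and the way an RMSProp-style optimizer uses it to attenuate the learning rate, can be sketched as follows. The function names and the hyperparameter defaults (`lr`, `p`, `eps`) are assumptions chosen for illustration, not the settings used in the experiments; the update rule follows the commonly used form of the algorithm.

```python
import math

def ewma_update(previous, x, p=0.9):
    """EWMA(t) = p * EWMA(t-1) + (1 - p) * x(t)."""
    return p * previous + (1.0 - p) * x

def rmsprop_step(param, grad, sq_avg, lr=0.001, p=0.9, eps=1e-8):
    """One RMSProp-style update of a single parameter.

    sq_avg is the running EWMA of the squared gradient; the learning rate
    is divided by its square root (eps avoids division by zero).
    """
    sq_avg = ewma_update(sq_avg, grad * grad, p)
    param = param - lr * grad / (math.sqrt(sq_avg) + eps)
    return param, sq_avg

# Hypothetical single update starting from param = 1.0 with gradient 2.0.
new_param, new_sq_avg = rmsprop_step(1.0, 2.0, 0.0)
```

Parameters with consistently large gradients accumulate a large `sq_avg` and therefore take smaller effective steps, while parameters with small gradients take relatively larger ones.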

Python |
A kind of computer programming language. |