Loss Function
Regularization: Add term to loss
$L=\frac{1}{N}\sum_{i=1}^{N}\sum_{j\neq y_i}\max(0, f(x_i;W)_j-f(x_i;W)_{y_i}+1)+\lambda R(W)$
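As a concrete illustration, here is a minimal NumPy sketch of this regularized multiclass SVM loss; the function name and the choice of L2 regularization for $R(W)$ are assumptions made for the example.

```python
import numpy as np

def svm_loss(W, X, y, lam):
    """Multiclass SVM (hinge) loss with an L2 regularization term.

    W: (D, C) weights, X: (N, D) inputs, y: (N,) integer labels, lam: lambda.
    """
    N = X.shape[0]
    scores = X @ W                                 # (N, C), scores[i, j] = f(x_i; W)_j
    correct = scores[np.arange(N), y][:, None]     # (N, 1), score of the true class
    margins = np.maximum(0, scores - correct + 1)  # hinge terms with margin 1
    margins[np.arange(N), y] = 0                   # skip the j = y_i term
    data_loss = margins.sum() / N
    reg_loss = lam * np.sum(W * W)                 # assume R(W) = sum of squared weights
    return data_loss + reg_loss

# Example: 5 random samples, 4-dim features, 3 classes
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
X = rng.normal(size=(5, 4))
y = rng.integers(0, 3, size=5)
print(svm_loss(W, X, y, lam=1e-3))
```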
In common use:
Dropout
In each forward pass, randomly set some neurons to zero
Probability of dropping is a hyperparameter; 0.5 is common
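A minimal sketch of training-time (inverted) dropout, assuming a drop probability `p` and a pre-activation array `x`; the scaling by $1/(1-p)$ is one common way to keep the test-time pass unchanged.

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True):
    """Inverted dropout: zero each unit with probability p during training.

    Scaling the surviving units by 1/(1-p) keeps the expected activation
    unchanged, so no extra scaling is needed at test time.
    """
    if not train:
        return x
    mask = (np.random.rand(*x.shape) >= p) / (1.0 - p)
    return x * mask

x = np.random.randn(2, 5)
print(dropout_forward(x, p=0.5))        # roughly half the entries zeroed
print(dropout_forward(x, train=False))  # unchanged at test time
```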
Data Augmentation
- Horizontal Flipping
- Random Cropping
- Random Scaling
- Color Jittering
- Random Translation
- Random Shearing
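As one possible realization, these augmentations can be composed with `torchvision.transforms` (assuming torchvision is available; the parameter values are arbitrary):

```python
import torchvision.transforms as T

# One possible training-time augmentation pipeline for 32x32 images (values are illustrative)
train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                      # horizontal flipping
    T.RandomCrop(32, padding=4),                        # random cropping
    T.RandomAffine(degrees=0, translate=(0.1, 0.1),     # random translation
                   scale=(0.8, 1.2),                    # random scaling
                   shear=10),                           # random shearing
    T.ColorJitter(brightness=0.2, contrast=0.2,
                  saturation=0.2, hue=0.05),            # color jittering
    T.ToTensor(),
])
```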
Regression
L1 Loss
$L(y, \hat{y})=w(\theta)|\hat{y}-y|$
L2 Loss
$L(y, \hat{y})=w(\theta)(\hat{y}-y)^2$
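A small NumPy sketch of the two regression losses, averaged over samples and with the weight term $w(\theta)$ dropped (i.e. $w(\theta)=1$) for simplicity:

```python
import numpy as np

def l1_loss(y, y_hat):
    return np.mean(np.abs(y_hat - y))   # mean absolute error

def l2_loss(y, y_hat):
    return np.mean((y_hat - y) ** 2)    # mean squared error

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 1.5, 2.0])
print(l1_loss(y, y_hat), l2_loss(y, y_hat))  # 0.666..., 0.5
```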
Classification
Hinge Loss Function
$L(y, \hat{y}) = \max{(0, 1-\hat{y}y)}$
$y\in\{-1, 1\}$
Cross-Entropy Loss Function
In binary classification, $L(y, \hat{y}) = -y\log{(\hat{y})}-(1-y)\log{(1-\hat{y})}$
In multiclass classification (class number = M), $L = -\sum_{c=1}^{M}y_{o,c}\log{(p_{o,c})}$
- $M$ - number of classes (dog, cat, fish)
- $\log$ - the natural log
- $y_{o,c}$ - binary indicator (0 or 1) of whether class label $c$ is the correct classification for observation $o$
- $p_{o,c}$ - predicted probability that observation $o$ is of class $c$
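A minimal NumPy sketch of this cross-entropy for a single observation with a one-hot label; the names are illustrative.

```python
import numpy as np

def cross_entropy(y_onehot, p):
    """L = -sum_c y_{o,c} * log(p_{o,c}) for one observation."""
    return -np.sum(y_onehot * np.log(p))

# 3 classes (dog, cat, fish); the true class is "cat"
y_onehot = np.array([0, 1, 0])
p = np.array([0.2, 0.7, 0.1])          # predicted probabilities, sum to 1
print(cross_entropy(y_onehot, p))      # -log(0.7) ≈ 0.357
```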
Exponential Loss Function
$L(y,\hat{y}) = \exp{(-y\hat{y})}$
SGD, Stochastic Gradient Descent
During gradient descent, optimization methods are commonly used to speed up convergence.
Plain SGD updates the parameters at the current position along the negative gradient direction (descent; following the positive gradient would be ascent), without considering the direction or magnitude of previous gradients. Momentum introduces a new variable $v$ that accumulates past gradients (as an exponentially decaying average), which accelerates learning.
Intuitively, if the current gradient direction agrees with the accumulated historical gradient direction, the current gradient is reinforced and the step taken is larger; if it disagrees with the accumulated direction, the magnitude of the step is reduced.
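A minimal update-rule sketch of SGD with momentum, assuming a function `grad(w)` that returns the gradient at `w`; the hyperparameter values are illustrative.

```python
import numpy as np

def sgd_momentum(w, grad, lr=1e-2, rho=0.9, steps=200):
    """Plain SGD plus a velocity v that accumulates past gradients."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        v = rho * v + g          # exponentially decaying accumulation of gradients
        w = w - lr * v           # step along the accumulated direction
    return w

# Example: minimize f(w) = ||w||^2, whose gradient is 2w
w0 = np.array([3.0, -4.0])
print(sgd_momentum(w0, grad=lambda w: 2 * w))  # close to [0, 0]
```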
Nesterov Momentum
AdaGrad (Adaptive Gradient)
Usually, we use the same learning rate for all parameters at every update. The idea of AdaGrad is that at each update (each iteration), different parameters use different learning rates.
It adds element-wise scaling of the gradient based on the historical sum of squares in each dimension.
RMSProp: “Leaky AdaGrad”
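A sketch of both per-parameter update rules, again assuming the current gradient `g` is available: AdaGrad accumulates the full history of squared gradients, while RMSProp "leaks" it with a decay rate.

```python
import numpy as np

def adagrad_step(w, g, cache, lr=1e-2, eps=1e-8):
    cache += g * g                                # historical sum of squares per dimension
    return w - lr * g / (np.sqrt(cache) + eps), cache

def rmsprop_step(w, g, cache, lr=1e-2, decay=0.99, eps=1e-8):
    cache = decay * cache + (1 - decay) * g * g   # leaky (decaying) accumulation
    return w - lr * g / (np.sqrt(cache) + eps), cache

# Example: a few steps on f(w) = ||w||^2 (gradient 2w)
w = np.array([3.0, -4.0]); cache = np.zeros_like(w)
for _ in range(10):
    w, cache = rmsprop_step(w, 2 * w, cache)
print(w)
```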
Adam
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models
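A minimal Adam sketch using the defaults quoted above, with bias-corrected first and second moments; `grad(w)` is an assumed gradient function.

```python
import numpy as np

def adam(w, grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    m = np.zeros_like(w)                       # first moment (momentum-like)
    v = np.zeros_like(w)                       # second moment (RMSProp-like)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

print(adam(np.array([3.0, -4.0]), grad=lambda w: 2 * w))  # close to [0, 0]
```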
Gradient Descent
1 Dimension
- Randomly pick an initial value $w^0$
- Compute $\frac{\partial L}{\partial w}\big|_{w=w^0}$
- $w^1\gets w^0-\eta\frac{\partial L}{\partial w}\big|_{w=w^0}$, where $\eta$ stands for the learning rate
2 Dimensions
$y=wx_1+b$
- Randomly pick initial values $w^0, b^0$
- Compute $\frac{\partial L}{\partial w}\big|_{w=w^0, b=b^0}$ and $\frac{\partial L}{\partial b}\big|_{w=w^0, b=b^0}$
- Update $w$ and $b$ iteratively: $w^1\gets w^0-\eta\frac{\partial L}{\partial w}\big|_{w=w^0, b=b^0}$, $b^1\gets b^0-\eta\frac{\partial L}{\partial b}\big|_{w=w^0, b=b^0}$, where $\eta$ stands for the learning rate
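A short sketch of this two-parameter update loop on a toy least-squares problem $L=\sum_i(wx_i+b-y_i)^2$; the data and hyperparameters are made up.

```python
import numpy as np

# Toy data generated from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

w, b, eta = 0.0, 0.0, 0.01                # initial values and learning rate
for _ in range(5000):
    y_hat = w * x + b
    dL_dw = np.sum(2 * (y_hat - y) * x)   # ∂L/∂w at the current (w, b)
    dL_db = np.sum(2 * (y_hat - y))       # ∂L/∂b at the current (w, b)
    w, b = w - eta * dL_dw, b - eta * dL_db

print(w, b)   # close to 2 and 1
```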
Learning rate decay
Step
Reduce learning rate at a few fixed points. E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90
Cosine
$\alpha_t=\frac{1}{2}\alpha_0(1+\cos (t\pi/T))$
- $\alpha_0$: initial learning rate
- $\alpha_t$: learning rate at epoch t
- $T$: total number of epochs
Linear
$\alpha_t=\alpha_0(1-t/T)$
Inverse Sqrt
$\alpha_t = \alpha_0/\sqrt{t}$
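The four schedules above written as small Python functions of the epoch $t$; the step milestones and factor follow the ResNet example, while the other values are illustrative.

```python
import math

def step_lr(t, lr0=0.1, milestones=(30, 60, 90), factor=0.1):
    return lr0 * factor ** sum(t >= m for m in milestones)  # multiply by 0.1 at each milestone

def cosine_lr(t, lr0=0.1, T=100):
    return 0.5 * lr0 * (1 + math.cos(t * math.pi / T))

def linear_lr(t, lr0=0.1, T=100):
    return lr0 * (1 - t / T)

def inv_sqrt_lr(t, lr0=0.1):
    return lr0 / math.sqrt(t)          # for t >= 1

for t in (1, 30, 60, 90):
    print(step_lr(t), round(cosine_lr(t), 4), linear_lr(t), round(inv_sqrt_lr(t), 4))
```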
Looking at learning curves
Losses may be noisy; use a scatter plot and also plot a moving average to see trends better.
Activation Function
introduce non-linear properties to the network
Sigmoid
Formula
\[\sigma(x) = \frac{1}{1+e^{-x}}\]
Features
- Squashes numbers to range [0, 1]
Problems
- Saturated neurons “kill” the gradients
- Sigmoid outputs are not zero-centered: the gradient on $w$ is either all positive or all negative, so the weights can only be updated in one of those two directions at each step
- A bit computationally expensive
Tanh
Formula
\[\tanh(x) = \frac{\sinh(x)}{\cosh(x)}=\frac{e^x-e^{-x}}{e^x+e^{-x}}\]
Features
- Squashes number to range [-1, 1]
- Zero centered
Problems
- Saturated neurons “kill” the gradients
ReLU
Formula
\[f(x) = \max(0, x)\]
Features
- Does not saturate (in >0 region)
- Very computationally efficient
- Converges much faster
- More biologically plausible
Problems
- Not zero-centered
- “Kills” the gradients in the $\le 0$ region, which is known as the “dead ReLU” problem
Leaky ReLU
Formula
\[f(x) = \max(0.01x, x)\]
Features
- Does not saturate (in >0 region)
- Very computationally efficient
- Converges much faster
- Will not “die”
References
Parametric Rectifier (PReLU)
\[f(x) = \max(\alpha x, x)\]
ELU
Formula
\[f(x) = \left\{\begin{array}{ll} x & \text{if } x > 0\\ \alpha(\exp{(x)}-1) & \text{if } x \le 0\end{array}\right.\]
Features
- All benefits of ReLU
- Closer to zero mean outputs
- Negative saturation regime compared with Leaky ReLU adds some robustness to noise
Problems
- Computation requires exp()
Maxout
There is another popular variant called Maxout, which is a generalized form of ReLU and Leaky ReLU.
Formula
\[\max(w_1^T x+b_1, w_2^T x+b_2)\]
Features
- Generalizes ReLU and Leaky ReLU
- Does not saturate
- Does not die
Problems
- Doubles the number of parameters / neuron
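For reference, the activation functions above as NumPy one-liners; Maxout is shown for the two-piece case with assumed weight/bias arguments.

```python
import numpy as np

sigmoid    = lambda x: 1 / (1 + np.exp(-x))            # squashes to (0, 1)
tanh       = lambda x: np.tanh(x)                      # squashes to (-1, 1), zero-centered
relu       = lambda x: np.maximum(0, x)
leaky_relu = lambda x, a=0.01: np.maximum(a * x, x)    # PReLU learns a instead of fixing it
elu        = lambda x, a=1.0: np.where(x > 0, x, a * (np.exp(x) - 1))

def maxout(x, w1, b1, w2, b2):
    return np.maximum(w1.T @ x + b1, w2.T @ x + b2)    # element-wise max of two affine maps

x = np.linspace(-2, 2, 5)
print(sigmoid(x), relu(x), elu(x), sep="\n")
```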
Layers
Fully Connected Layer
Each output is a dot product between the input vector and a row of the weight matrix (plus a bias)
Convolution Layer
Number of parameters in the layer: $K\times (F^2\times C + 1)$, where $K$ is the number of filters, $F$ the filter size, and $C$ the number of input channels
$+1$ for the bias of each filter
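For example (values chosen for illustration), $K=10$ filters of size $F=5$ applied to a $C=3$-channel input give
\[10\times(5^2\times 3+1)=10\times 76=760\ \text{parameters}\]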
GANs, Generative Adversarial Networks
Generative Model G (Generator)
Captures the data distribution and generates samples close to the real distribution
Discriminative Model D (Discriminator)
Estimates the probability that a sample came from the training data rather than G
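For reference, G and D are trained jointly on the standard GAN minimax objective:
\[\min_G\max_D V(D,G)=\mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)]+\mathbb{E}_{z\sim p_z(z)}[\log(1-D(G(z)))]\]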