WGAN still suffered from training difficulties and slow convergence. Soon afterwards, WGAN author Martin Arjovsky acknowledged the problem on reddit, arguing that the root cause was the way the original design enforced the Lipschitz constraint:

I am now pretty convinced that the problems that happen sometimes in WGANs is due to the specific way of how weight clipping works. It’s just a terrible way of enforcing a Lipschitz constraint, and better ways are out there. I feel like apologizing for being too lazy and sticking to what could be done in one line of torch code.

A simple alternative (less than 5 lines of code) has been found by Montréal students. It works on quite a few settings (inc 100 layer resnets) with default hyperparameters. Arxiv coming this or next week, stay tuned.

\begin{align} L(D) &= - \mathbb{E}_{x \sim P_r} [D(x)] + \mathbb{E}_{x \sim P_g} [D(x)] \\ L(G) &= - \mathbb{E}_{x \sim P_g} [D(x)] \end{align}
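To make the two losses concrete, here is a minimal NumPy sketch with a toy linear critic (the name `critic` and all constants are illustrative assumptions, not from the paper; a real $D$ is a deep network):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0])  # weights of a toy linear critic D(x) = w . x

def critic(x):
    """Toy critic D(x); stands in for a deep network."""
    return x @ w

# Fake minibatches of real and generated samples
x_real = rng.normal(size=(64, 2))
x_fake = rng.normal(size=(64, 2)) + 3.0

# WGAN losses:  L(D) = -E_r[D(x)] + E_g[D(x)],   L(G) = -E_g[D(x)]
loss_d = -critic(x_real).mean() + critic(x_fake).mean()
loss_g = -critic(x_fake).mean()
```

Minimizing `loss_d` pushes $D$ up on real samples and down on generated ones; minimizing `loss_g` pushes the generator toward samples the critic scores highly.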

The Lipschitz constraint requires that, over the entire sample space $\mathcal{X}$, the $L_p$ norm of the discriminator's gradient $\nabla_x D(x)$ not exceed some finite constant $K$:
$$\| \nabla_x D(x) \|_p \leq K, \quad \forall x \in \mathcal{X}$$

A straightforward remedy is to enforce the constraint through an extra penalty term in the discriminator loss, for example a hinge-style term that punishes only gradient norms exceeding $K$:

$$\mathrm{ReLU}\big[ \| \nabla_x D(x) \|_p - K \big]$$

or a quadratic term that pulls the gradient norm toward $K$ from both sides:

$$\big[ \| \nabla_x D(x) \|_p - K \big]^2$$

Setting $K = 1$ without loss of generality and adding the quadratic penalty with weight $\lambda$ yields:

$$L(D) = - \mathbb{E}_{x \sim P_r} [D(x)] + \mathbb{E}_{x \sim P_g} [D(x)] + \lambda \mathbb{E}_{x \sim \mathcal{X}} \big[ \| \nabla_x D(x) \|_p - 1 \big]^2$$

It is impractical to sample uniformly from the whole space $\mathcal{X}$, so the penalty is instead evaluated on random interpolates between real and generated samples:

$$x_r \sim P_r, \quad x_g \sim P_g, \quad \epsilon \sim \mathrm{Uniform}[0, 1]$$

$$\hat{x} = \epsilon x_r + (1 - \epsilon) x_g$$

Writing $P_{\hat{x}}$ for the distribution of these interpolates, the discriminator loss with gradient penalty becomes:

$$L(D) = - \mathbb{E}_{x \sim P_r} [D(x)] + \mathbb{E}_{x \sim P_g} [D(x)] + \lambda \mathbb{E}_{x \sim P_{\hat{x}}} \big[ \| \nabla_x D(x) \|_p - 1 \big]^2$$
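The interpolation and penalty can be sketched as follows, again with a toy linear critic so that the gradient $\nabla_x D(x) = w$ is available in closed form (a real implementation would obtain it via automatic differentiation; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.array([3.0, 4.0])  # toy linear critic D(x) = w . x, so grad_x D(x) = w

x_real = rng.normal(size=(64, 2))
x_fake = rng.normal(size=(64, 2))

# Interpolate between real and generated samples: x_hat = eps*x_r + (1-eps)*x_g
eps = rng.uniform(size=(64, 1))
x_hat = eps * x_real + (1 - eps) * x_fake

# For the linear critic the gradient at every x_hat is simply w
grads = np.tile(w, (64, 1))
grad_norms = np.linalg.norm(grads, axis=1)  # L2 norm per sample

lam = 10.0
gradient_penalty = lam * ((grad_norms - 1.0) ** 2).mean()
# Here ||w||_2 = 5, so the penalty is 10 * (5 - 1)^2 = 160
```

Minimizing this term drives the critic's gradient norm at the interpolates toward 1, which is exactly the soft version of the 1-Lipschitz constraint.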

• weight clipping acts globally on the whole sample space, but because it constrains the discriminator's gradient norm only indirectly, it can easily cause vanishing or exploding gradients;

• The text is modeled at the granularity of English characters rather than words, so the vocabulary holds only twenty to thirty symbols, which greatly shrinks the search space
• The text length is only 32
• The generator is not the usual LSTM architecture but a multi-layer deconvolutional network: it takes a Gaussian noise vector as input and emits all 32 characters in a single pass

A related variant avoids computing gradients altogether by penalizing the difference quotient between pairs of interpolated samples:

$$L(D) = - \mathbb{E}_{x \sim P_r} [D(x)] + \mathbb{E}_{x \sim P_g} [D(x)] + \lambda \mathbb{E}_{x_1 \sim P_{\hat{x}},\, x_2 \sim P_{\hat{x}}} \left[ \frac{|D(x_1) - D(x_2)|}{\| x_1 - x_2 \|_p} - 1 \right]^2$$
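This difference-quotient form can be sketched the same way. Note that for a $K$-Lipschitz critic the quotient never exceeds $K$; for the toy linear critic below (an illustrative assumption, as before) it is bounded by $\|w\|_2 = 1$ via Cauchy-Schwarz:

```python
import numpy as np

rng = np.random.default_rng(2)
w = np.array([0.6, 0.8])  # toy linear critic D(x) = w . x, with ||w||_2 = 1

def critic(x):
    return x @ w

# Two independent batches of (interpolated) samples x1, x2
x1 = rng.normal(size=(64, 2))
x2 = rng.normal(size=(64, 2))

# Difference quotient |D(x1) - D(x2)| / ||x1 - x2||_2 per pair
num = np.abs(critic(x1) - critic(x2))
den = np.linalg.norm(x1 - x2, axis=1)
quotients = num / den

lam = 10.0
penalty = lam * ((quotients - 1.0) ** 2).mean()
# For a linear critic, |w . (x1 - x2)| <= ||w||_2 * ||x1 - x2||_2,
# so every quotient is at most ||w||_2 = 1
```

One caveat of this form: the quotient only probes $D$ along the chord from $x_1$ to $x_2$, so it is a looser constraint than penalizing the full gradient norm at each point.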