Paper Reading: GloVe: Global Vectors for Word Representation
Pennington, J., Socher, R., & Manning, C. D. (2014, October). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543).
A new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods.
Advantages
- Efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix.
- Produces a word vector space with meaningful substructure
Related Work
There are two main model families for learning word vectors:
| Model family | Example | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Global matrix factorization methods | latent semantic analysis (LSA) | efficiently leverage statistical information | do poorly on the word analogy task |
| Local context window methods | the skip-gram model | do better on the word analogy task | poorly utilize the statistics of the corpus |
The GloVe Model (GloVe, for Global Vectors)
The statistics of word occurrences in a corpus is the primary source of information available to all unsupervised methods for learning word representations.
Notation
- $X$: the matrix of word-word co-occurrence counts
- $X_{ij}$: tabulates the number of times word $j$ occurs in the context of word $i$
- $X_i = \sum_k X_{ik}$: the number of times any word appears in the context of word $i$
- $P_{ij} = P(j \mid i) = X_{ij}/X_i$: the probability that word $j$ appears in the context of word $i$
Advantage
- Compared to the raw probabilities
  - the ratio is better able to distinguish relevant words from irrelevant words
  - the ratio is better able to discriminate between the two relevant words (see the counting sketch after this list)
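A minimal counting sketch of these quantities, assuming a toy two-sentence corpus and a symmetric context window (the paper counts over a multi-billion-token corpus and down-weights a co-occurrence at distance $d$ by $1/d$; both refinements are omitted here):

```python
from collections import defaultdict

def cooccurrence_counts(corpus, window=10):
    """Build the word-word co-occurrence counts X[i][j]: how often word j
    appears within `window` words of word i (symmetric context, unit weights)."""
    X = defaultdict(lambda: defaultdict(float))
    for sentence in corpus:
        for pos, word in enumerate(sentence):
            lo, hi = max(0, pos - window), min(len(sentence), pos + window + 1)
            for ctx in range(lo, hi):
                if ctx != pos:
                    X[word][sentence[ctx]] += 1.0
    return X

def cooccur_prob(X, i, j):
    """P_ij = X_ij / X_i: probability that word j appears in the context of word i."""
    X_i = sum(X[i].values())
    return X[i][j] / X_i if X_i > 0 else 0.0

if __name__ == "__main__":
    corpus = [
        "ice is a solid form of water".split(),
        "steam is a gas formed from water".split(),
    ]
    X = cooccurrence_counts(corpus)
    for k in ("solid", "gas", "water"):
        p_ik, p_jk = cooccur_prob(X, "ice", k), cooccur_prob(X, "steam", k)
        ratio = p_ik / p_jk if p_jk > 0 else float("inf")
        print(f"P({k}|ice) / P({k}|steam) = {ratio:.2f}")
```

On this toy corpus the ratios are degenerate (infinite or zero when a count is missing), but the mechanics match the notation above: the ratio is large for a word related to ice but not steam (solid), small for the reverse (gas), and near one for a word related to both (water).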
Method
Starting point
The appropriate starting point for word vector learning should be ratios of co-occurrence probabilities rather than the probabilities themselves.
The ratio $P_{ik}/P_{jk}$ depends on three words $i$, $j$, and $k$:
\[F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}\]
- $w\in \mathbb{R}^d$: word vectors
- $\tilde{w}\in \mathbb{R}^d$: separate context word vectors
- $F$: may depend on some as-yet unspecified parameters
- $\frac{P_{ik}}{P_{jk}}$: extracted from the corpus
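For intuition, the paper probes this ratio with $i =$ ice and $j =$ steam (stated here qualitatively; the exact corpus values are in the paper's Table 1): the ratio should be large when the probe word $k$ is related to $i$ but not $j$, small in the reverse case, and close to one when $k$ is related to both or to neither:

\[\frac{P(\text{solid}\mid\text{ice})}{P(\text{solid}\mid\text{steam})} \gg 1, \qquad \frac{P(\text{gas}\mid\text{ice})}{P(\text{gas}\mid\text{steam})} \ll 1, \qquad \frac{P(\text{water}\mid\text{ice})}{P(\text{water}\mid\text{steam})} \approx 1 \approx \frac{P(\text{fashion}\mid\text{ice})}{P(\text{fashion}\mid\text{steam})}\]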
Selecting $F$
- Let $F$ encode the information present in the ratio $P_{ik}/P_{jk}$ in the word vector space. Restrict consideration to those functions $F$ that depend only on the difference of the two target words: \(F(w_i-w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}\)
- Prevent $F$ from mixing the vector dimensions in undesirable ways by taking the dot product of its arguments: \(F((w_i-w_j)^T \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}\)
- Require $F$ to be a homomorphism between the groups $(\mathbb{R}, +)$ and $(\mathbb{R}_{>0}, \times)$: \(F((w_i-w_j)^T \tilde{w}_k) = \frac{F(w_i^T \tilde{w}_k)}{F(w_j^T \tilde{w}_k)}\), which is solved by \(F(w_i^T\tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}\). Thus $F = \exp$, or $w_i^T\tilde{w}_k = \log(P_{ik}) = \log(X_{ik}) - \log(X_i)$
- The term $\log(X_i)$ is independent of $k$, so absorb it into a bias $b_i$ for $w_i$; adding an additional bias $\tilde{b}_k$ for $\tilde{w}_k$ restores the symmetry: \(w_i^T\tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik})\)
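As a quick check that $F = \exp$ satisfies the homomorphism requirement, and how the bias form then follows:

\[\begin{aligned}
\exp\big((w_i - w_j)^T \tilde{w}_k\big) &= \frac{\exp(w_i^T \tilde{w}_k)}{\exp(w_j^T \tilde{w}_k)} = \frac{P_{ik}}{P_{jk}} \\
w_i^T \tilde{w}_k &= \log P_{ik} = \log X_{ik} - \log X_i \\
w_i^T \tilde{w}_k + b_i + \tilde{b}_k &= \log X_{ik} \quad (b_i \text{ absorbs } \log X_i;\ \tilde{b}_k \text{ restores the } i \leftrightarrow k \text{ symmetry})
\end{aligned}\]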
- Address the problem that the logarithm diverges whenever its argument is zero by proposing a new weighted least squares regression model: \(J = \sum_{i,j=1}^{V} f(X_{ij})\,(w_i^T\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2\), where $V$ is the size of the vocabulary. The weighting function $f$ should obey the following properties:
- $f(0)=0$. If $f$ is viewed as a continuous function, it should vanish as $x\rightarrow 0$ fast enough that $\lim_{x\rightarrow 0} f(x)\log^2 x$ is finite
- $f(x)$ should be non-decreasing so that rare co-occurrences are not overweighted
- $f(x)$ should be relatively small for large values of $x$, so that frequent co-occurrences are not overweighted
\(f(x) =
\left\{
\begin{array}{ll}
(x/x_{\max})^\alpha & \mathrm{if}\quad x < x_{\max}\\
1 & \mathrm{otherwise}
\end{array}
\right.\)
The paper fixes $x_{\max} = 100$; $\alpha = 3/4$ gives a modest improvement over a linear version with $\alpha = 1$.
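A minimal NumPy sketch of the weighting function and the objective $J$, assuming a dense co-occurrence matrix and randomly initialized parameters with the defaults $x_{\max}=100$, $\alpha=3/4$ (the paper instead iterates only over the nonzero $X_{ij}$ and trains with AdaGrad; here the zero entries are simply masked out by $f$, and the loss is only evaluated, not optimized):

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): (x / x_max)^alpha below x_max, 1 above, and f(0) = 0."""
    x = np.asarray(x, dtype=float)
    f = np.where(x < x_max, (x / x_max) ** alpha, 1.0)
    return np.where(x > 0, f, 0.0)  # zero counts contribute nothing to J

def glove_loss(X, W, W_tilde, b, b_tilde):
    """J = sum_{i,j} f(X_ij) * (w_i^T w~_j + b_i + b~_j - log X_ij)^2."""
    f = weight(X)
    log_X = np.log(np.where(X > 0, X, 1.0))  # placeholder where X_ij = 0; masked by f
    residual = W @ W_tilde.T + b[:, None] + b_tilde[None, :] - log_X
    return float(np.sum(f * residual ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V, d = 50, 10                                    # vocabulary size, vector dimension
    X = rng.poisson(2.0, size=(V, V)).astype(float)  # stand-in co-occurrence counts
    W = rng.normal(scale=0.1, size=(V, d))           # word vectors w_i
    W_tilde = rng.normal(scale=0.1, size=(V, d))     # context vectors w~_j
    b, b_tilde = np.zeros(V), np.zeros(V)            # biases b_i, b~_j
    print("J =", glove_loss(X, W, W_tilde, b, b_tilde))
```

Because $X$ is symmetric up to the choice of context window, $W$ and $\tilde{W}$ differ only by their random initialization; the paper uses the sum $W + \tilde{W}$ as the final word vectors.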