Lecture 1: Introduction and Word Vectors

Key result: word meaning can be represented rather well by a large vector of real numbers.

1. The course

  • fundamental knowledge
  • advanced technology
  • practice: PyTorch

2. Human language and word meaning

How to get computers to understand language.

How do we represent the meaning of a word?

  • symbol $\Leftrightarrow$ idea or thing
  • denotational semantics

Traditional NLP solution:

  • WordNet, a thesaurus containing lists of synonym sets and hypernyms
    • great as a resource but missing nuance
      • only correct in some contexts
    • missing new meanings of words
      • impossible to keep up-to-date
    • subjective
    • requires human labor to create and adapt
    • can’t compute accurate word similarity
  • one-hot vectors
    • regard words as discrete symbols
    • vector dimension equals the vocabulary size, which can be huge
    • no natural similarity between one-hot vectors (see the sketch below)
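
To make the "no natural similarity" point concrete, here is a minimal sketch (the tiny vocabulary is made up for illustration, not from the lecture): any two distinct one-hot vectors have dot product 0, so related words look exactly as unrelated as arbitrary ones.

```python
import numpy as np

# Toy vocabulary; in practice the vocabulary can contain hundreds of thousands of words.
vocab = ["motel", "hotel", "banana"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word: all zeros except a single 1."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# The dot product between any two *different* one-hot vectors is always 0,
# so this representation carries no notion of similarity.
print(one_hot("motel") @ one_hot("hotel"))   # 0.0
print(one_hot("motel") @ one_hot("banana"))  # 0.0
print(one_hot("motel") @ one_hot("motel"))   # 1.0 (only identical words match)
```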

Solution

  • learn to encode similarity in the vectors themselves
  • Distributional semantics: a word’s meaning is given by the words that frequently appear close-by
  • When a word $w$ appears in a text, its context is the set of words that appear nearby (within a fixed-size window)
  • Use the many contexts of $w$ to build up a representation of $w$ (a simple count-based sketch follows this list)
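
As a minimal illustration of "use the many contexts of $w$", here is a count-based sketch (the tiny corpus and window size are made up; word2vec below learns dense vectors rather than raw counts, but the windowed-context idea is the same):

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus (not from the lecture).
corpus = "the cat sat on the mat the dog sat on the rug".split()
window = 2  # fixed-size context window

# For each word w, count the words appearing within `window` positions of it.
contexts = defaultdict(Counter)
for t, w in enumerate(corpus):
    for j in range(-window, window + 1):
        if j != 0 and 0 <= t + j < len(corpus):
            contexts[w][corpus[t + j]] += 1

# The context counts act as a (sparse, count-based) representation of each word.
print(contexts["cat"])  # Counter({'the': 1, 'sat': 1, 'on': 1})
print(contexts["dog"])  # similar contexts to 'cat' -> similar representation
```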

Word vectors

  • We will build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts (a small similarity sketch follows this list).
  • Word vectors are also called word embeddings or (neural) word representations. They are a distributed representation.
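
A tiny sketch of similarity living in the vectors themselves (the 3-dimensional vectors below are made-up toy values; real embeddings are learned and typically have 100–300 dimensions):

```python
import numpy as np

# Made-up low-dimensional "embeddings", purely for illustration.
emb = {
    "hotel":  np.array([0.9, 0.1, 0.3]),
    "motel":  np.array([0.8, 0.2, 0.35]),
    "banana": np.array([0.0, 0.9, -0.4]),
}

def cosine(u, v):
    """Cosine similarity: dot product of the normalized vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(emb["hotel"], emb["motel"]))   # high: words used in similar contexts
print(cosine(emb["hotel"], emb["banana"]))  # low
```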

3. Word2vec introduction

Word2vec (Mikolov et al. 2013) is a framework for learning word vectors.

Idea:

  • We have a large corpus (“body”) of text
  • Every word in a fixed vocabulary is represented by a vector
  • Go through each position $t$ in the text, which has a center word $c$ and context (“outside”) words $o$ (see the pair-enumeration sketch after this list)
  • Use the similarity of word vectors for $c$ and $o$ to calculate the probability of $o$ given $c$ (or vice versa)
  • Keep adjusting the word vectors to maximize this probability.
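
A minimal sketch of how the (center, outside) pairs are enumerated as we walk through the text (the toy sentence and window size are illustrative, not from the lecture):

```python
def skipgram_pairs(tokens, window=2):
    """Enumerate (center word, outside word) pairs for every position t."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

# Toy sentence, purely illustrative.
tokens = "problems turning into banking crises".split()
for c, o in skipgram_pairs(tokens, window=2):
    print(f"P({o} | {c})")  # each pair contributes one probability to the likelihood
```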

4. Word2vec objective function

Goal

For each position $t = 1,\dots,T$, predict context words within a window of fixed size $m$, given center word $w_t$. Data likelihood:

\[\textrm{Likelihood} = L(\theta) = \prod_{t=1}^T\prod_{\substack{-m\le j\le m \\ j\neq 0}} P(w_{t+j}\mid w_t;\theta)\]

Here $\theta$ denotes all variables to be optimized, and $\prod_{t=1}^T$ means each word in the text is used in turn as a center word.

Objective function

The objective function (also called cost or loss function) $J(\theta)$ is the (average) negative log likelihood:

\[J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T}\sum_{t=1}^T\sum_{\substack{-m\le j\le m \\ j\neq 0}} \log P(w_{t+j}\mid w_t;\theta)\]

Minimizing objective function $\Leftrightarrow$ Maximizing predictive accuracy

To compute $P(w_{t+j}\mid w_t)$, we use two vectors per word $w$:

  • $v_w$ when $w$ is a center word
  • $u_w$ when $w$ is a context word
  • Then for a center word $c$ and a context word $o$, the probability is given by the softmax \(P(o|c) = \frac{\exp(u_o^Tv_c)}{\sum_{w\in V}\exp(u_w^Tv_c)}\) (a small numerical sketch follows this list)
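
A minimal sketch of this probability on toy parameters (the random matrices $U$ and $V$, and the word indices, are illustrative placeholders for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, d = 5, 4                   # toy vocabulary size and embedding dimension
U = rng.normal(size=(n_vocab, d))   # rows are u_w: "outside" (context) vectors
V = rng.normal(size=(n_vocab, d))   # rows are v_w: center-word vectors

def p_o_given_c(o, c):
    """P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)   (naive softmax)."""
    scores = U @ V[c]               # u_w . v_c for every word w in the vocabulary
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()

o, c = 2, 0                         # arbitrary toy word indices
print(p_o_given_c(o, c))            # a probability in (0, 1)
print(-np.log(p_o_given_c(o, c)))   # this pair's contribution to J(theta)
```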

The softmax function maps arbitrary values $x_i$ to a probability distribution $p_i$:

  • $\mathbb{R}^n \rightarrow [0, 1]^n$
  • “max” because it amplifies the probability of the largest $x_i$ (but still returns a distribution)
  • “soft” because it still assigns some probability to smaller $x_i$

It is frequently used in deep learning; a minimal implementation follows below.
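
A minimal softmax sketch (subtracting the max is a standard numerical trick, not part of the formula above; it does not change the result but avoids overflow):

```python
import numpy as np

def softmax(x):
    """Map arbitrary real values to a probability distribution."""
    z = np.exp(x - np.max(x))   # shift by max(x) for numerical stability
    return z / z.sum()

x = np.array([3.0, 1.0, 0.2])
p = softmax(x)
print(p)        # ~[0.836, 0.113, 0.051]: the largest x_i is amplified ("max") ...
print(p.sum())  # 1.0 ... but every x_i still gets some probability ("soft")
```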

Gradient (with respect to the center vector $v_c$)

\[\begin{align} \frac{\partial}{\partial v_c} \log P(o|c) &= \frac{\partial}{\partial v_c} \log \frac{\exp(u_o^Tv_c)}{\sum_{w\in V}\exp(u_w^Tv_c)}\\ &= \frac{\partial}{\partial v_c} \log \exp(u_o^Tv_c)- \frac{\partial}{\partial v_c} \log \sum_{w\in V}\exp(u_w^Tv_c) \\ &= \frac{\partial}{\partial v_c} u_o^Tv_c - \frac{\partial}{\partial v_c} \log \sum_{w\in V}\exp(u_w^Tv_c) \\ &= u_o - \frac{\partial}{\partial v_c} \log \sum_{w\in V}\exp(u_w^Tv_c) \\ &= u_o - \frac{1}{\sum_{w\in V}\exp(u_w^Tv_c)}\times \frac{\partial}{\partial v_c}\sum_{w\in V}\exp(u_w^Tv_c) \\ &= u_o - \frac{1}{\sum_{w\in V}\exp(u_w^Tv_c)}\times \sum_{w\in V}\frac{\partial}{\partial v_c}\exp(u_w^Tv_c) \\ &= u_o - \frac{1}{\sum_{w\in V}\exp(u_w^Tv_c)}\times \sum_{w\in V}\exp(u_w^Tv_c) \frac{\partial}{\partial v_c} u_w^Tv_c\\ &= u_o - \frac{1}{\sum_{w\in V}\exp(u_w^Tv_c)}\times \sum_{w\in V}\exp(u_w^Tv_c)\, u_w\\ &= u_o - \frac{\sum_{w\in V}\exp(u_w^Tv_c)\, u_w}{\sum_{w\in V}\exp(u_w^Tv_c)}\\ &= u_o - \sum_{x\in V}\frac{\exp(u_x^Tv_c)}{\sum_{w\in V}\exp(u_w^Tv_c)}\, u_x\\ &= u_o - \sum_{x\in V} P(x|c)\, u_x \\ &= \textrm{observed} - \textrm{expected} \end{align}\]
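
The final line can be checked numerically. A small sketch under toy parameters (random $U$ and $V$ and arbitrary indices $o$, $c$, all illustrative): the analytic gradient $u_o - \sum_{x} P(x|c)\, u_x$ should match a finite-difference estimate of $\partial \log P(o|c) / \partial v_c$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_vocab, d = 6, 3
U = rng.normal(size=(n_vocab, d))   # outside vectors u_w (one row per word)
V = rng.normal(size=(n_vocab, d))   # center vectors v_w (one row per word)
o, c = 4, 2                         # arbitrary toy word indices

def log_p(v_c):
    """log P(o | c) as a function of the center vector v_c."""
    scores = U @ v_c
    scores -= scores.max()          # numerical stability; cancels in the ratio
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradient: observed vector minus expected (probability-weighted) vector.
probs = np.exp(U @ V[c] - (U @ V[c]).max())
probs /= probs.sum()
analytic = U[o] - probs @ U         # u_o - sum_x P(x|c) u_x

# Finite-difference check of d log P(o|c) / d v_c.
eps, numeric = 1e-6, np.zeros(d)
for i in range(d):
    e = np.zeros(d); e[i] = eps
    numeric[i] = (log_p(V[c] + e) - log_p(V[c] - e)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```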

5. Optimization basics

6. Looking at word vectors