King - man + woman = queen: the hidden algebraic structure of words

[2019] Carl Allen and Timothy Hospedales recently received a Best Paper Honourable Mention at the world’s largest conference in Machine Learning for explaining a curious phenomenon found amongst word representations.

Much of the current wave of advances in Machine Learning, or Artificial Intelligence, comes from our ability to extract underlying information from vast amounts of data. By analysing millions of images or billions of words, we are able to automatically recognise objects in pictures, convert human handwriting or speech to digital text, and translate text into many other languages. When it comes to words, for a computer to perform any "reasoning", e.g. to recognise that alternatively worded search queries are essentially the same, words need to be represented numerically as vectors of numbers termed "embeddings". Intuitively, where two words are similar in some respect, this can be reflected by certain values in their embeddings being similar. In recent years, successful algorithms such as Word2Vec and GloVe have been developed that learn word embeddings by extracting information from huge text sources such as Wikipedia. Word embeddings now underpin much state-of-the-art technology, e.g. Alexa and Google Translate (see the related links below).

A curious phenomenon identified amongst the word embeddings of Word2Vec and GloVe is that analogies, e.g. "man is to king as woman is to ...?" or "Paris is to France as Rome is to ...?", can often be solved simply by adding and subtracting embeddings. For example, the word embedding for queen is found to be the one closest to the result of computing king - man + woman. Of course, this result might come as no surprise if the embedding systems were trained to achieve this, but they aren't! While such "word algebra" relationships may seem to fit with our intuition, the fact that they frequently materialise among word embeddings has been a source of intrigue.
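
As a rough illustration of the arithmetic described above, the following Python sketch solves the analogy over a handful of hand-picked, hypothetical 3-dimensional vectors. These are not real Word2Vec or GloVe embeddings, which typically have hundreds of dimensions and are learned from text. The function name `solve_analogy`, the use of cosine similarity, and the convention of excluding the three query words from the candidates are assumptions made for this example.

```python
import math

# Hypothetical toy embeddings, hand-picked for illustration only.
# The three dimensions can loosely be read as (royalty, maleness, femaleness).
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, vectors):
    """Answer "a is to b as c is to ...?" by finding the word whose
    embedding is closest to b - a + c, excluding the query words."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# "man is to king as woman is to ...?"  i.e. king - man + woman
answer = solve_analogy("man", "king", "woman", embeddings)
print(answer)
```

With these toy vectors the nearest remaining word is queen. The point of interest in the article is that the same arithmetic often works on embeddings that were never trained to satisfy it.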

[Figure: Word Embeddings]

In their recent paper, Carl Allen and Timothy Hospedales show why this phenomenon arises and, in doing so, reveal an underlying mathematical relationship between the co-occurrence statistics of words that form analogies.

It's all just statistics

Allen and Hospedales show that the observed phenomenon is underpinned by a known relationship between the values, or parameters, in a word embedding and statistics (in the data) of how often the word corresponding to that embedding "co-occurs" with all other words, i.e. appears within a short distance of each of them. Those statistics reveal much about a word and, in essence, uniquely characterise each word in a way that allows words to be compared. For example, if two words both tend to co-occur with a common collection of other words, then they are likely to be similar; if a word tends to co-occur with words of a particular subject, then it is likely to relate to that subject. As such, by reflecting co-occurrence statistics, word embeddings capture something of the meaning of words. Importantly, that meaning is captured numerically and in a way that allows the addition and subtraction of embeddings to be interpreted, which is a key contribution of the paper.
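
To make "co-occurrence statistics" concrete, here is a minimal Python sketch that counts how often words appear within a short window of each other in a tiny made-up corpus, and then computes pointwise mutual information (PMI) from those counts. The corpus, the window size of 2, and the helper names are all assumptions for illustration. The article does not name the "known relationship"; the one usually cited in the literature is that Word2Vec-style embeddings approximately factorise a matrix of PMI values (shifted by a constant), so PMI is used here on that assumption.

```python
import math
from collections import Counter

# Tiny made-up corpus, for illustration only.
corpus = [
    "the king rules the kingdom",
    "the queen rules the kingdom",
    "the man eats an apple",
    "the woman eats an apple",
]

WINDOW = 2  # assumed "short distance": up to 2 words on either side


def cooccurrence_counts(sentences, window):
    """Count ordered (word, context) pairs that appear within `window`
    positions of each other in the same sentence."""
    pair_counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1
    return pair_counts


def pmi(word, context, pair_counts):
    """Pointwise mutual information: log( p(word, context) / (p(word) p(context)) ).
    Returns -inf when the pair never co-occurs."""
    total = sum(pair_counts.values())
    word_total = sum(c for (w, _), c in pair_counts.items() if w == word)
    context_total = sum(c for (_, x), c in pair_counts.items() if x == context)
    joint = pair_counts[(word, context)]
    if joint == 0:
        return float("-inf")
    return math.log(joint * total / (word_total * context_total))


counts = cooccurrence_counts(corpus, WINDOW)
print(counts[("king", "rules")], counts[("king", "apple")])
print(pmi("king", "rules", counts), pmi("queen", "rules", counts))
```

In this toy corpus, king and queen co-occur with exactly the same context words, so their statistics (and PMI values) coincide, which is the sense in which shared co-occurrence patterns indicate similarity.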

By considering the statistics captured in word embeddings, Allen and Hospedales show that the word embeddings of paraphrases (e.g. king is a paraphrase of man and royal) form a specific geometric relationship. They then show that analogies are in fact equivalent to a particular type of paraphrase and, as such, their word embeddings also form geometric relationships that turn out to be precisely those observed, subject to error terms that can be explicitly defined.
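
Written out in symbols, the relationships described above take roughly the following form. This is a simplified sketch rather than the paper's exact statement: the notation w_x for the embedding of word x is an assumption made here, the additive form of the paraphrase relation is our reading of the result, and the explicitly defined error terms are hidden inside the approximation sign.

```latex
% Paraphrase: "king" paraphrases the word set {man, royal}
\mathbf{w}_{\text{king}} \approx \mathbf{w}_{\text{man}} + \mathbf{w}_{\text{royal}}

% Analogy as a paraphrase between word sets: {man, queen} and {king, woman}
\mathbf{w}_{\text{man}} + \mathbf{w}_{\text{queen}} \approx \mathbf{w}_{\text{king}} + \mathbf{w}_{\text{woman}}

% Rearranged, this is the observed "word algebra"
\mathbf{w}_{\text{queen}} \approx \mathbf{w}_{\text{king}} - \mathbf{w}_{\text{man}} + \mathbf{w}_{\text{woman}}
```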

Beyond explaining the analogy phenomenon itself, the work sheds light on an interesting mathematical connection between word co-occurrence statistics and semantic relationships, of which analogy is only one; others include relatedness and similarity. It also dispels some of the "black-box magic" of word embeddings that pervades much current machine learning based on neural networks (of which Word2Vec and GloVe are relatively simple cases), giving a meaningful interpretation of what is being learned and paving the way for a better understanding of the many downstream tasks in which word embeddings are deployed.

 

Related links

How Alexa Is Learning to Converse More Naturally

How does Google translate work? 

Guide to Word2Vec

Carl Allen's Blog

Analogies Explained: Towards Understanding Word Embeddings