LDA for mortals
I’ve been intrigued by LDA topic models for a few weeks now. I actually built one before I really knew what I was doing. That kind of frightens and excites me. On one hand, LDA provides rich output which is easy for the humanist researcher to interpret. On the other hand, shouldn’t I know what kind of machinery I’m operating that’s spitting out results? I would certainly like to know more than what’s provided in documentation of whatever software implementation I’m using (currently gensim), but I don’t necessarily need to be able to prove every last algorithm before I kick the tires and start building models.
I echo Ted Underwood’s observation that it can be difficult to develop a basic intuition of LDA from the hardcore computer science literature. Proving that it works, understanding how it works, and making it work, are very different things. My goal for this exercise was to bridge the gaps in my brain between the technical theoretical papers and the high level LDA articles out there by developing a rudimentary LDA model from scratch.
Generative, so what?
Most academic papers I encountered begin by describing LDA as a generative model. That is, a model that can randomly generate observable data (documents). Blei, Ng, and Jordan (2003) outline this process in their seminal paper on the topic:
LDA assumes the following generative process for each document w in a corpus D:
- Choose N ∼ Poisson(ξ).
- Choose θ ∼ Dir(α).
- For each of the N words wₙ:
  (a) Choose a topic zₙ ∼ Multinomial(θ).
  (b) Choose a word wₙ from p(wₙ | zₙ, β), a multinomial probability conditioned on the topic zₙ.
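To make this concrete, here is a rough R sketch of the generative story (not the training algorithm). The number of topics, the toy vocabulary, the hyperparameter values, the Poisson mean, and the little `rdirichlet` helper are all my own illustrative choices, not part of any particular implementation.

```r
# Sketch of the generative process only, with made-up values.
# Dirichlet draws via the standard normalized-gamma trick.
rdirichlet <- function(n, alpha) {
  x <- matrix(rgamma(n * length(alpha), shape = alpha), nrow = n, byrow = TRUE)
  x / rowSums(x)
}

K     <- 2                                                 # number of topics (assumed)
vocab <- c("turkey", "holiday", "race", "movie", "space")  # toy vocabulary
alpha <- rep(1, K)                                         # prior over topic mixtures
eta   <- rep(1, length(vocab))                             # prior over word distributions

beta  <- rdirichlet(K, eta)     # one word distribution per topic
N     <- rpois(1, lambda = 6)   # document length ~ Poisson(xi)
theta <- rdirichlet(1, alpha)   # this document's topic mixture ~ Dir(alpha)

doc <- character(N)
for (n in seq_len(N)) {
  z_n    <- sample(1:K, 1, prob = theta[1, ])     # choose a topic for word n
  doc[n] <- sample(vocab, 1, prob = beta[z_n, ])  # choose a word from that topic
}
doc  # a randomly "generated" document
```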
I found this generative process confusing at first. I was initially mixing up the generative and inference/training algorithms. They are different, very different. I already have my documents… why would I want to probabilistically generate new documents with my LDA model? Short answer: you probably don’t. However, the generative approach lets you do some cool things:
- Assign probabilities to just about everything: words, documents and topics. Probabilistic Latent Semantic Indexing (pLSI), the precursor to LDA, pushed some boundaries in information retrieval, but fell short in providing a probabilistic model at the document level.
- The ability to calculate probabilities (topic assignments) for previously unseen documents, which is not possible with pLSI.
- Model updates. That is, you can continue training your model with new documents when they come in without repeatedly starting from scratch and re-training on the full corpus. Online learning should be supported in most LDA implementations.
There is even some new research (Zhai and Boyd-Graber, 2013) exploring breeds of LDA that allow for infinite vocabularies – assigning probabilities to new documents with words outside of the training vocabulary.
Statistical inference… how we actually train an LDA model
So the goal of LDA is to estimate the two hidden parameters commonly denoted θ and φ, where θ represents the topic-document distribution and φ represents the word-topic distribution.
However, when you do some fancy math, it becomes clear that the posterior distribution over these parameters is intractable. That is, we’re left with integrals that cannot be computed analytically, and the space is high-dimensional enough that numerical approximation gets hairy. How to handle this issue is the topic of a large swath of the LDA literature. Most of it advocates one of two methods:
- Variational inference seems to be the preferred method for getting fast and accurate results in most software implementations.
- Markov chain Monte Carlo (sampling) methods have the benefit of being unbiased and easy to understand. However, MCMC methods tend to be computationally inefficient when working with large corpora.
Building LDA from scratch
Purpose
I have no intention for anyone (including myself) to use this code for anything beyond pedagogical purposes. There are countless ways to optimize the following approach. I simply wanted to put the theory into action as clearly as possible and dispel any myths in my head that LDA is black magic. I leave the implementation and optimization to the experts.
Getting started
Of all the academic papers I read trying to wrap my head around LDA, I found Steyvers and Griffiths’ Probabilistic Topic Models paper to be the most helpful for understanding implementation. Their 2004 paper, Finding Scientific Topics, which everyone seems to cite, provides a more thorough derivation of the Bayesian magic, but I found the subsequent paper most useful. They clearly walk through how to conduct inference using Collapsed Gibbs Sampling, which I translated into R code.
Edwin Chen’s Introduction to Latent Dirichlet Allocation post walks through an example of this Collapsed Gibbs Sampling process in plain English and is a good place to start.
I did find some other homegrown R and Python implementations from Shuyo and Matt Hoffman, which are also great resources. I modeled my implementation after Shuyo’s code in particular.
So let’s code it
I generated a very trivial corpus of 8 documents. To focus just on the LDA mechanics, I opted for the simplest of R objects: lists and matrices.
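Something like the following sketch; the eight documents here are just made-up sentences, so substitute anything you like.

```r
# a trivial, made-up corpus of 8 short documents
rawdocs <- c(
  "eat turkey on turkey day holiday",
  "i like to eat cake on holiday",
  "turkey trot race on thanksgiving holiday",
  "snail race the turtle",
  "time travel space race",
  "movie on thanksgiving",
  "movie at air and space museum is cool movie",
  "aspiring movie star"
)
docs <- strsplit(rawdocs, split = " ")  # a list of word vectors, one per document
```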
Abstracting the words away - now we have a numbers problem.
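One way to do that abstraction, assuming the `docs` list from above; `vocab` and the integer-coded `docs` are the only objects we need going forward.

```r
vocab <- unique(unlist(docs))                           # unique terms in the corpus
docs  <- lapply(docs, function(doc) match(doc, vocab))  # replace each word with its vocab index
```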
Now we need to generate some basic count matrices. First we randomly assign topics to each word in each document. Then we create a word-topic count matrix.
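A sketch of those two steps, assuming `docs` and `vocab` from above and picking K = 2 topics for this toy example:

```r
K <- 2          # number of topics (assumed for this toy example)
set.seed(1234)  # so the random initialization is reproducible

# randomly assign a topic (1..K) to every token in every document
ta <- lapply(docs, function(doc) sample(1:K, length(doc), replace = TRUE))

# word-topic count matrix: wt[t, v] = number of tokens of vocab word v
# currently assigned to topic t
wt <- matrix(0, K, length(vocab), dimnames = list(1:K, vocab))
for (d in 1:length(docs)) {
  for (w in 1:length(docs[[d]])) {
    ti <- ta[[d]][w]    # topic of this token
    wi <- docs[[d]][w]  # vocab index of this token
    wt[ti, wi] <- wt[ti, wi] + 1
  }
}
```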
Now we generate a document-topic count matrix where the counts correspond to the number of tokens assigned to each topic for each document.
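Again as a sketch, built from the `ta` assignments above:

```r
# document-topic count matrix: dt[d, t] = number of tokens in document d
# currently assigned to topic t
dt <- matrix(0, length(docs), K)
for (d in 1:length(docs)) {
  for (t in 1:K) {
    dt[d, t] <- sum(ta[[d]] == t)
  }
}
```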
I’m beginning to mix notation from the code and theory. Oops. Anything in italics refers to theory; anything in an inline code block refers to my iterators and objects in R code. If you’re following along with Steyvers and Griffiths, `wt` corresponds to the C^WT matrix and `dt` corresponds to the C^DT matrix.
Now we get to the fun part where LDA starts learning topics. We will iteratively update the random (bad) topic assignments we populated into `ta`, one token at a time. This is where I had to brush up on my Bayesian statistics.

Rather than directly estimating our parameters of interest (φ and θ), we can instead estimate the posterior over z, the topic assignment of each token. This is `p_z`, the full-conditional distribution for z, which gives the probability that token w belongs to topic t.
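Below is my rough sketch of that learning loop (Collapsed Gibbs Sampling), assuming the `docs`, `vocab`, `K`, `ta`, `wt`, and `dt` objects built above; the hyperparameter values and the tiny number of iterations are arbitrary choices for this toy example.

```r
alpha <- 1       # document-topic Dirichlet prior (assumed value)
eta   <- 1       # topic-word Dirichlet prior (assumed value)
iterations <- 3  # passes through the corpus (far too few for real use)

for (i in 1:iterations) {
  for (d in 1:length(docs)) {          # for each document
    for (w in 1:length(docs[[d]])) {   # for each token in that document
      t0  <- ta[[d]][w]                # current topic assignment of this token
      wid <- docs[[d]][w]              # vocab index of this token

      # remove the current token from the count matrices
      dt[d, t0]   <- dt[d, t0]   - 1
      wt[t0, wid] <- wt[t0, wid] - 1

      # full-conditional distribution over topics for this token
      denom_a <- sum(dt[d, ]) + K * alpha           # number of tokens in document d
      denom_b <- rowSums(wt) + length(vocab) * eta  # number of tokens in each topic
      p_z <- (wt[, wid] + eta) / denom_b * (dt[d, ] + alpha) / denom_a

      # sample a new topic from p_z and add the token back to the counts
      t1 <- sample(1:K, 1, prob = p_z / sum(p_z))
      ta[[d]][w]  <- t1
      dt[d, t1]   <- dt[d, t1]   + 1
      wt[t1, wid] <- wt[t1, wid] + 1

      if (t0 != t1) {
        cat("iter:", i, "doc:", d, "word:", vocab[wid],
            "topic:", t0, "=>", t1, "\n")  # log each re-assignment
      }
    }
  }
}
```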
The left side of `p_z` (`(wt[,wid] + eta) / denom_b`) represents the probability of word w under topic t. After many tokens of a word have been assigned to a particular topic, LDA increasingly wants to assign new occurrences of that word to the same topic.
The right side of `p_z` (`(dt[d,] + alpha) / denom_a`) represents the probability of topic t in document d. If many tokens in a document have been assigned to a particular topic, LDA wants to assign more tokens to that topic at each iteration of Gibbs Sampling.
Example: So if 90% of the tokens in a document belong to topic 1 at the start of an iteration of Gibbs Sampling AND those same tokens in other documents are assigned to topic 1, it’s a no-brainer for LDA. Topic 1 it is.
However, if those 90% of tokens that belonged to topic 1 in our document all belong to topic 2 in the other documents, then it’s a toss-up and LDA is forced to spread the wealth across the two topics.
Defining some notation…

`p_z` = P(zᵢ = j | z₋ᵢ, wᵢ, dᵢ)

where

- P(zᵢ = j) is the probability that token i is assigned to topic j
- z₋ᵢ represents the topic assignments of all other tokens
- wᵢ is the word index of token i (1…`length(vocab)`)
- dᵢ is the document index of token i (1…`length(docs)`)
This is really the only “magic” part of the whole LDA learning process. To dispel the magic and truly understand why this works, read Griffiths and Steyvers; they walk through the derivation of the full-conditional distribution (`p_z`) much more elegantly than I could.
The printed output above is a window into the learning process. We printed a line each time a token got reassigned to a new topic. Remember we only made 3 passes (`iterations <- 3`) through the corpus, so our topic assignments are likely still pretty terrible. In practice, with many more iterations, these re-assignments would eventually stabilize.
So now we’ve finished learning topics. We just have to analyze them. `theta` normalizes the counts in the document-topic count matrix `dt` to compute the probability that a document belongs to each topic. `theta` is technically the posterior mean of the parameter θ.
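A minimal version of that calculation, using the `dt` counts and `alpha` prior from the sampler above: add the prior back in and normalize each row to sum to 1.

```r
# posterior mean of theta: a documents-by-topics matrix of probabilities
theta <- (dt + alpha) / rowSums(dt + alpha)
```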
`phi` normalizes the word-topic count matrix `wt` to compute the probabilities of words given topics. `phi` is technically the posterior mean of the parameter φ.
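Likewise, a minimal sketch using the `wt` counts and `eta` prior from above: each row of `phi` is one topic’s distribution over the vocabulary.

```r
# posterior mean of phi: a topics-by-vocabulary matrix of probabilities
phi <- (wt + eta) / rowSums(wt + eta)
```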