Introduction
An important development in text analytics was the introduction of the Latent Dirichlet Allocation (LDA) algorithm (also called topic modeling) in 2003. LDA can be viewed as a non-negative matrix factorization algorithm. A matrix factorization consists of decomposing a matrix into a product of two or more matrices. It turns out that these linear algebra techniques have applications in data analysis, generally referred to as dimension reduction methods. Examples of matrix factorization methods in statistics include Factor Analysis, Principal Component Analysis, and Latent Dirichlet Allocation. They are all latent variable models, which use observed variables to infer the values of unobserved (or hidden) variables. The basic idea of these methods is to find \(\theta_{D,K}\) and \(\phi_{K,V}\) (two sets of hidden variables) from \(W_{D,V}\), the set of observed variables, such that: \[W_{D,V} \simeq \theta_{D,K}*\phi_{K,V}\] where \(D\) is the number of observations, \(V\) is the number of variables, and \(K\) is the number of latent variables. We want \(K \ll V\), and “hopefully” we can infer a meaning for each of the \(K\) columns of \(\theta_{D,K}\) from the corresponding rows of \(\phi_{K,V}\). It also turns out that most of the information about the observations (the rows of \(W_{D,V}\)) is captured in the reduced matrix \(\theta_{D,K}\), hence the idea of data dimension reduction. A major challenge in dimension reduction is deciding on an appropriate value for \(K\).
To help fix ideas, let’s assume we have the exam scores of 100 students in the following subjects: Gaelic, English, History, Arithmetic, Algebra, and Geometry (this is not a text data example, but it is a good one to illustrate the idea of latent variable models). The dataset is \(W_{D,V} = W_{100,6}\); that is, 100 observations and 6 variables. Let’s assume we want to collapse the \(V = 6\) variables into \(K=2\) variables. Let’s further assume that the first new variable may be termed “Humanities”, and the second may be termed “Math” (this is a sensible assumption!). Thus, we want to create a \(\theta_{100,2}\) matrix that captures most of the information about the students’ grades in the 6 subjects. With the two variables, humanities and math, we can quickly learn about the students with the help of, for example, a simple scatterplot. The \(\phi\) matrix helps us infer the meanings of the columns of \(\theta\) as humanities and math because (hopefully) one row has large coefficients for Gaelic, English, and History and small coefficients for Arithmetic, Algebra, and Geometry, while the other row has large coefficients for Arithmetic, Algebra, and Geometry and small coefficients for Gaelic, English, and History. I hope this example provides an intuition of what matrix factorization aims to achieve when used for data analysis. The goal is to reduce the dimension of the data, i.e., to reduce the number of variables. The meaning of each new variable is inferred by guessing a name associated with the original variables that have the highest coefficients for that new variable. In the future, I will provide a numerical example within the context of Factor Analysis, which is a building block for understanding latent variable models.
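To make this concrete, here is a minimal sketch (in Python, using scikit-learn’s NMF, one possible matrix factorization) that factorizes a synthetic 100-by-6 grade matrix into \(\theta_{100,2}\) and \(\phi_{2,6}\). The simulated scores, the choice of package, and the variable names are assumptions made purely for illustration, not part of the exam data described above.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Synthetic scores of 100 students on 6 subjects:
# Gaelic, English, History, Arithmetic, Algebra, Geometry.
humanities_skill = rng.normal(60, 10, size=(100, 1))
math_skill = rng.normal(60, 10, size=(100, 1))
noise = rng.normal(0, 5, size=(100, 6))
W = np.clip(np.hstack([np.repeat(humanities_skill, 3, axis=1),
                       np.repeat(math_skill, 3, axis=1)]) + noise,
            0, 100)                      # W_{100,6}, non-negative

model = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
theta = model.fit_transform(W)           # theta_{100,2}: one row per student
phi = model.components_                  # phi_{2,6}: weight of each subject in each component

print(phi.round(2))                      # which subjects load on which component?
```

Inspecting the rows of `phi` is how one would decide to label the two new columns of `theta` “humanities” and “math”.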
In LDA, the \(W\) matrix is a matrix of word counts, the \(\theta\) matrix is a matrix of topic proportions within each document, and the \(\phi\) matrix is a matrix of each word’s relative importance for each topic.
LDA: the model
This section provides a mathematical exposition of topic modeling and presents the data generative process used to estimate the \(\theta\) and \(\phi\) matrices. LDA is a generative model that represents documents as being generated by a random mixture over latent variables called topics (David M. Blei, Ng, and Jordan 2003). A topic is defined as a distribution over words. For a given corpus (a collection of documents) of \(D\) documents, each of length \(N_{d}\), the generative process for LDA is defined as follows:
For each topic \(k\), draw a distribution over words \(\phi_k \sim Dirichlet(\beta)\), with \(k \in \{1, 2, ..., K\}\)
For each document \(d\):
Draw a vector of topic proportions \(\theta_d \sim Dirichlet(\alpha)\)
For each word \(n\), with \(n \in \{1, 2, ..., N_d\}\):
Draw a topic assignment \(z_{d,n} \sim multinomial(\theta_d)\) with \(z_{d,n} \in \{1, 2, ..., K\}\)
Draw a word \(w_{d,n} \sim multinomial(\phi_{k = z_{d,n}})\) with \(w_{d,n} \in \{1, 2, ..., V\}\)
Note: Only the words \(w\) are observed.
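To make the generative process concrete, here is a minimal simulation sketch in Python; the corpus sizes, the symmetric \(\alpha\) and \(\beta\) values, and all variable names are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

D, K, V = 5, 3, 20          # documents, topics, vocabulary size (assumed)
N_d = 50                    # words per document (assumed constant for simplicity)
alpha, beta = 0.5, 0.1      # symmetric Dirichlet hyperparameters (assumed)

# For each topic k, draw a distribution over words: phi_k ~ Dirichlet(beta)
phi = rng.dirichlet(np.full(V, beta), size=K)        # K x V

docs = []
for d in range(D):
    # Draw the topic proportions of document d: theta_d ~ Dirichlet(alpha)
    theta_d = rng.dirichlet(np.full(K, alpha))       # length K
    words = []
    for n in range(N_d):
        z_dn = rng.choice(K, p=theta_d)              # topic assignment of word n
        w_dn = rng.choice(V, p=phi[z_dn])            # word drawn from that topic
        words.append(w_dn)
    docs.append(words)

print(docs[0][:10])          # first 10 word indices of the first document
```

Only the lists of word indices in `docs` would be observed; `theta_d`, `phi`, and the assignments `z_dn` are the hidden variables that inference has to recover.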
The above generative process allows us to construct an explicit closed-form expression for the joint likelihood of the observed and hidden variables. Markov Chain Monte Carlo (MCMC) and Variational Bayes methods can then be used to estimate the parameters \(\theta\) and \(\phi\) (see David M. Blei, Ng, and Jordan (2003); David M. Blei (2012) for further exposition of the method). We derive the posterior distribution of the \(\theta\)s and \(\phi\)s in the next section.
Deriving the \(\theta\) and \(\phi\) values
A topic \(\phi_{k}\) is a distribution over the \(V\) unique words, each having a proportion \(\phi_{k,v}\); i.e., \(\phi_{k,v}\) is the relative importance of the word \(v\) for the definition (or interpretation) of the topic \(k\). It is assumed that:
\[\phi_{k}\sim Dirichlet_{V}(\beta)\] That is: \[p(\phi_{k}|\beta)=\frac{1}{B(\beta)}\prod_{v=1}^{V}\phi_{k,v}^{\beta_{v}-1}\]
Where \(B(\beta)=\frac{\prod_{v=1}^{V}\Gamma(\beta_{v})}{\Gamma(\sum_{v=1}^{V}\beta_{v})}\), and \(\beta=(\beta_{1},...,\beta_{V})\). Since we have K independent topics (by assumption), \[p(\phi|\beta)=\prod_{k=1}^{K}\frac{1}{B(\beta)}\prod_{v=1}^{V}\phi_{k,v}^{\beta_{v}-1}\]
A document \(d\) is a distribution over the \(K\) topics, each having a proportion \(\theta_{d,k}\); i.e., \(\theta_{d,k}\) is the relative importance of the topic \(k\) in the document \(d\). We assume:
\[\theta_{d}\sim Dirichlet_{K}(\alpha)\]
That is:
\[p(\theta_{d}|\alpha)=\frac{1}{B(\alpha)}\prod_{k=1}^{K}\theta_{d,k}^{\alpha_{k}-1}\]
And since we have D independent documents (by assumption),\[p(\theta|\alpha)=\prod_{d=1}^{D}\frac{1}{B(\alpha)}\prod_{k=1}^{K}\theta_{d,k}^{\alpha_{k}-1}\]
It is further assumed that \(\beta_{v}=\beta\) and \(\alpha_{k}=\alpha\); that is, the priors are symmetric.
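As a small sanity check of these prior densities, the sketch below evaluates \(log(p(\phi|\beta))\) and \(log(p(\theta|\alpha))\) as sums of independent Dirichlet log-densities with scipy; the sampled matrices and the hyperparameter values are assumptions for illustration.

```python
import numpy as np
from scipy.stats import dirichlet

rng = np.random.default_rng(0)
D, K, V = 5, 3, 20
alpha, beta = 0.5, 0.1                               # symmetric hyperparameters (assumed)

phi = rng.dirichlet(np.full(V, beta), size=K)        # K topics, each a distribution over V words
theta = rng.dirichlet(np.full(K, alpha), size=D)     # D documents, each a distribution over K topics

# Independence across topics (and across documents) turns the joint prior
# into a sum of Dirichlet log-densities.
log_p_phi = sum(dirichlet.logpdf(phi_k, np.full(V, beta)) for phi_k in phi)
log_p_theta = sum(dirichlet.logpdf(theta_d, np.full(K, alpha)) for theta_d in theta)

print(log_p_phi, log_p_theta)
```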
Let \(z\) be the latent topic assignment variable; i.e., the random variable \(z_{d,n}\) assigns the word \(w_{d,n}\) in document \(d\) to a topic \(k\). \(z_{d,n}\) is a vector of zeros with a 1 at the \(k^{th}\) position \((z_{d,n}=[0,0,...,1,0,...])\). Define \(z_{d,n,k}=I(z_{d,n}=k)\), where \(I\) is an indicator function that equals 1 when \(z_{d,n}\) is the topic \(k\), and \(0\) otherwise. We assume:
\[z_{d,n}\sim Multinomial(\theta_{d})\]
That is: \[p(z_{d,n}=k|\theta_{d}) =\theta_{d,k}\] which can be written compactly as \[p(z_{d,n}|\theta_{d}) = \prod_{k=1}^{K}\theta_{d,k}^{z_{d,n,k}}\]
A document is assumed to have \(N_{d}\) independent words, and since we assume D independent documents, we have:
\[p(z|\theta) =\prod_{d=1}^{D}\prod_{n=1}^{N_{d}}\prod_{k=1}^{K}\theta_{d,k}^{z_{d,n,k}}\] \[= \prod_{d=1}^{D}\prod_{k=1}^{K}\prod_{n=1}^{N_{d}}\theta_{d,k}^{z_{d,n,k}}\] \[= \prod_{d=1}^{D}\prod_{k=1}^{K}\prod_{v=1}^{V}\theta_{d,k}^{n_{d,v}*z_{d,v,k}}\] \[= \prod_{d=1}^{D}\prod_{v=1}^{V}\prod_{k=1}^{K}\theta_{d,k}^{n_{d,v}*z_{d,v,k}}\]
\(n_{d,v}\) is the count of the word \(v\) in document \(d\). The third equality groups the \(N_{d}\) tokens by word type: all occurrences of the word \(v\) in document \(d\) share the topic indicator \(z_{d,v,k}\), and there are \(n_{d,v}\) of them.
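This regrouping can be checked numerically: in the toy sketch below (the document, the assignments, and the names are made up for illustration), the token-level and count-level expressions of \(log(p(z|\theta))\) agree.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3
theta_d = rng.dirichlet(np.full(K, 1.0))       # topic proportions of one document (assumed)

# A toy document given as the word type (vocabulary index) of each token;
# all tokens of the same word type share one topic assignment.
tokens = np.array([0, 0, 2, 2, 2, 5])
z_by_type = {0: 1, 2: 0, 5: 2}                 # topic assigned to each word type

# Token-level sum: over tokens n, log theta_{d, z_{d,n}}
log_p_tokens = sum(np.log(theta_d[z_by_type[v]]) for v in tokens)

# Count-level sum: over word types v, n_{d,v} * log theta_{d, z_{d,v}}
types, counts = np.unique(tokens, return_counts=True)
log_p_counts = sum(n_dv * np.log(theta_d[z_by_type[v]])
                   for v, n_dv in zip(types, counts))

print(np.isclose(log_p_tokens, log_p_counts))  # True
```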
The word \(w_{d,n}\) is drawn from the word distribution \(\phi_{k}\) of its assigned topic:
\[w_{d,n}|z_{d,n},\phi\sim Multinomial(\phi_{k=z_{d,n}})\]
\[p(w_{d,n} =v|z_{d,n}=k,\phi)=\phi_{k,v}\] which, using the indicator notation defined below, can be written compactly as \[p(w_{d,n}|z_{d,n},\phi)= \prod_{v=1}^{V}\prod_{k=1}^{K}\phi_{k,v}^{w_{d,n,v}*z_{d,n,k}}\]
\(w_{d,n}\) is a vector of zeros with a 1 at the \(v^{th}\) position. Define \(w_{d,n,v}=I(w_{d,n}=v)\), where \(I\) is an indicator function that equals \(1\) when \(w_{d,n}\) is the word \(v\), and \(0\) otherwise.
There are D independent documents, each having \(N_{d}\) independent words, so: \[p(w|\phi,z)=\prod_{d=1}^{D}\prod_{n=1}^{N_{d}}\prod_{v=1}^{V}\prod_{k=1}^{K}\phi_{k,v}^{w_{d,n,v}*z_{d,n,k}}\]
Grouping the tokens by word type as before: \[p(w|\phi,z)=\prod_{d=1}^{D}\prod_{v=1}^{V}\prod_{k=1}^{K}\phi_{k,v}^{n_{d,v}*z_{d,v,k}}\]
The joint distribution of the observed words \(w\) and the unobserved (or hidden) variables \(\theta\), \(z\), and \(\phi\) is given by:
\[P(w,z,\theta,\phi|\alpha,\beta)=p(\theta|\alpha)p(z|\theta)p(w|\phi,z)p(\phi|\beta)\]
The goal is to get the posterior distribution of the unobserved variables: \[p(z,\theta,\phi|w,\alpha,\beta)=\frac{P(w,z,\theta,\phi|\alpha,\beta)}{\int\int\sum_{z}P(w,z,\theta,\phi|\alpha,\beta)d\theta d\phi}\]
\(\int\int\sum_{z}P(w,z,\theta,\phi|\alpha,\beta)d\theta d\phi\) is intractable, so approximation methods are used to approximate the posterior distribution. The seminal LDA paper (David M. Blei, Ng, and Jordan 2003) uses Mean Field Variational Bayes (an optimization method) to approximate the posterior distribution (see Bishop (2006), p. 462, or David M Blei, Kucukelbir, and McAuliffe (2017) for an exposition of the theory of the variational method). Mean field variational inference uses the following approximation: \[p(z,\theta,\phi|w,\alpha,\beta)\simeq q(z,\theta,\phi)=q(z)q(\theta)q(\phi)\]
From Bishop (2006), p. 466, we have: \[q^{*}(z)\propto exp\left\{ E_{\theta,\phi}\left[log(p(z|\theta))+log(p(w|\phi,z))\right]\right\}\]
\[q^{*}(\theta)\propto exp\left\{ E_{z,\phi}\left[log(p(\theta|\alpha))+log(p(z|\theta))\right]\right\}\]
\[q^{*}(\phi)\propto exp\left\{ E_{\theta,z}\left[log(p(\phi|\beta))+log(p(w|\phi,z))\right]\right\}\]
Using the expressions above, we have:
\[log(q^{*}(z)) \propto E_{\theta,\phi}\left[\sum_{d=1}^{D}\sum_{v=1}^{V}\sum_{k=1}^{K}n_{d,v}*z_{d,v,k}\left(log(\theta_{d,k})+log(\phi_{k,v})\right)\right]\] \[\propto \sum_{d=1}^{D}\sum_{v=1}^{V}\sum_{k=1}^{K}n_{d,v}*z_{d,v,k}\left(E(log(\theta_{d,k}))+E(log(\phi_{k,v}))\right)\]
Note that \[x|p\sim Multinomial_{K}(p)\iff log\left(p(x|p)\right)=\sum_{k=1}^{K}x_{k}log(p_{k})\] Let’s define \(log(p_{k})=E(log(\theta_{d,k}))+E(log(\phi_{k,v}))\) (up to a normalizing constant), so that \(p_{k}\propto exp(E(log(\theta_{d,k}))+E(log(\phi_{k,v})))\). Thus,
\[q^{*}(z)\propto\prod_{d=1}^{D}\prod_{v=1}^{V}\prod_{k=1}^{K}\left[exp(E(log(\theta_{d,k}))+E(log(\phi_{k,v})))\right]^{n_{d,v}*z_{d,v,k}}\]
That is, \[z_{d,v}|w_{d},\theta_{d},\phi\sim Multinomial_{K}(p)\] with \(p=(p_{1},...,p_{K})\),
and by the multinomial properties, \(E(z_{d,v,k})=p_{k}\propto exp(E(log(\theta_{d,k}))+E(log(\phi_{k,v})))\), where the \(p_{k}\) are normalized so that \(\sum_{k=1}^{K}p_{k}=1\).
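By a standard property of the Dirichlet distribution (not derived in this post), \(E(log(\theta_{d,k}))=\psi(\tilde{\alpha}_{d,k})-\psi(\sum_{k}\tilde{\alpha}_{d,k})\) and \(E(log(\phi_{k,v}))=\psi(\tilde{\beta}_{k,v})-\psi(\sum_{v}\tilde{\beta}_{k,v})\), where \(\psi\) is the digamma function and \(\tilde{\alpha}\), \(\tilde{\beta}\) are the variational Dirichlet parameters derived just below. A minimal sketch of the resulting responsibility update, with array shapes and names assumed for illustration:

```python
import numpy as np
from scipy.special import digamma

def responsibilities(alpha_tilde_d, beta_tilde):
    """E(z_{d,v,k}) for one document, given the current variational parameters.

    alpha_tilde_d : shape (K,)   -- Dirichlet parameter of q(theta_d)
    beta_tilde    : shape (K, V) -- Dirichlet parameters of q(phi_1), ..., q(phi_K)
    Returns an array of shape (V, K) whose rows sum to 1.
    """
    e_log_theta = digamma(alpha_tilde_d) - digamma(alpha_tilde_d.sum())                # (K,)
    e_log_phi = digamma(beta_tilde) - digamma(beta_tilde.sum(axis=1, keepdims=True))   # (K, V)
    log_p = e_log_theta[:, None] + e_log_phi                                           # (K, V)
    p = np.exp(log_p - log_p.max(axis=0))        # subtract the max for numerical stability
    return (p / p.sum(axis=0)).T                 # normalize over the topics k

# Toy usage with assumed sizes and random positive parameters
K, V = 3, 6
rng = np.random.default_rng(2)
Ez = responsibilities(rng.gamma(1.0, size=K), rng.gamma(1.0, size=(K, V)))
print(Ez.sum(axis=1))                            # each row sums to 1
```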
\[q^{*}(\theta) \propto exp\left\{ E_{z}\left[\sum_{d}\sum_{k}(\alpha-1)log(\theta_{d,k})+\sum_{d}\sum_{k}\sum_{v}n_{d,v}*z_{d,v,k}log(\theta_{d,k})\right]\right\}\] \[= \prod_{d=1}^{D}\prod_{k=1}^{K}exp\left\{ (\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,k})-1)log(\theta_{d,k})\right\}\] \[= \prod_{d=1}^{D}\prod_{k=1}^{K}\theta_{d,k}^{\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,k})-1}\]
Thus, the approximate posterior distribution of the topic proportions in a document \(d\) is:
\[\theta_{d}|w_{d},\alpha\sim Dirichlet_{K}(\tilde{\alpha}_{d})\] where \(\tilde{\alpha}_{d}=\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,.})\). Note that \(\tilde{\alpha}_{d}\) is a \(K\)-dimensional vector.
By the properties of the Dirichlet distribution, the expected value of \(\theta_{d}|\tilde{\alpha}_{d}\) is given by:
\[E(\theta_{d}|\tilde{\alpha}_{d})=\frac{\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,.})}{\sum_{k=1}^{K}[\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,k})]}\]
The numerical estimation of \(E(\theta_{d}|\tilde{\alpha}_{d})\) gives the estimates of the topic proportions within each document \(d\), \((\hat\theta_{d})\). It is worth noting that \(E(z_{d,v,k})\) can be interpreted as the responsibility that topic \(k\) takes for explaining the occurrence of the word \(v\) in document \(d\). Ignoring for a moment the denominator of the equation above, \(E(\theta_{d,k}|\tilde{\alpha}_{d})\) is similar to a regression equation where \(n_{d,v}\) are the observed word counts in document \(d\), and \(E(z_{d,v,k})\) are the parameter estimates (or weights) of the words. This illustrates that the importance of a topic in a document is driven by the high presence of words \((n_{d,v})\) referring to that topic, and by the weight of these words \((E(z_{d,v,k}))\).
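A small numerical sketch of this update for a single document; the counts, the responsibilities, and the \(\alpha\) value are made up for illustration.

```python
import numpy as np

alpha = 0.5                               # symmetric prior (assumed)
n_d = np.array([4, 0, 2, 1])              # n_{d,v}: counts of V = 4 words in document d
Ez_d = np.array([[0.9, 0.1],              # E(z_{d,v,k}): responsibilities, shape (V, K)
                 [0.5, 0.5],
                 [0.2, 0.8],
                 [0.1, 0.9]])

alpha_tilde_d = alpha + n_d @ Ez_d        # K-dimensional vector alpha-tilde_d
E_theta_d = alpha_tilde_d / alpha_tilde_d.sum()
print(E_theta_d)                          # estimated topic proportions of document d
```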
Similarly,
\[q^{*}(\phi) \propto exp\left\{ E_{z}\left[\sum_{k=1}^{K}\sum_{v=1}^{V}(\beta-1)log(\phi_{k,v})+\sum_{d=1}^{D}\sum_{k=1}^{K}\sum_{v=1}^{V}n_{d,v}*z_{d,v,k}log(\phi_{k,v})\right]\right\}\] \[= \prod_{k=1}^{K}\prod_{v=1}^{V}exp\left\{ (\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,v,k})-1)log(\phi_{k,v})\right\}\] \[= \prod_{k=1}^{K}\prod_{v=1}^{V}\phi_{k,v}^{\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,v,k})-1}\]
Thus, the approximate posterior distribution of the word distribution in a topic \(k\), \(\hat\phi_{k}\), is:
\(\phi_{k}|w,\beta\sim Dirichlet_{V}(\tilde{\beta}_{k})\)
where \(\tilde{\beta}_{k,v}=\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,v,k})\). Note that \(\tilde{\beta}_{k}\) is a \(V\)-dimensional vector.
And the expected value of \(\phi_{k}|\tilde{\beta}_{k}\) is given by:
\[ E(\phi_{k}|\tilde{\beta}_{k})=\frac{\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,v,k})}{\sum_{v=1}^{V}(\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,v,k}))} \]
The numerical estimation of \(E(\phi_{k}|\tilde{\beta}_{k})\) gives the estimates of each word’s relative importance for each topic \(k\), \((\hat\phi_{k})\). Ignoring the denominator in the equation above, \(E(\phi_{k,v}|\tilde{\beta}_{k,v})\) is a weighted sum of the frequencies of the word \(v\) in each of the documents \((n_{d,v})\), the weights being the responsibility topic \(k\) takes for explaining the occurrence of the word \(v\) in document \(d\) \((E(z_{d,v,k}))\).
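Putting the updates together gives a coordinate ascent loop: alternate the responsibility update, the \(\tilde{\alpha}\) update, and the \(\tilde{\beta}\) update until the parameters stabilize. The sketch below is a minimal, unoptimized implementation of the count-based variant derived in this post (it is not the exact scheme of David M. Blei, Ng, and Jordan (2003)); the toy corpus, the hyperparameter values, the fixed number of iterations, and all function and variable names are assumptions.

```python
import numpy as np
from scipy.special import digamma

def lda_cavi(n, K, alpha=0.5, beta=0.1, n_iter=100, seed=0):
    """Coordinate-ascent variational inference for the count-based LDA derivation.

    n : array of shape (D, V) with the word counts n_{d,v}.
    Returns (E_theta, E_phi): topic proportions (D, K) and word weights per topic (K, V).
    """
    rng = np.random.default_rng(seed)
    D, V = n.shape
    alpha_tilde = alpha + rng.gamma(1.0, size=(D, K))    # parameters of q(theta_d)
    beta_tilde = beta + rng.gamma(1.0, size=(K, V))      # parameters of q(phi_k)

    for _ in range(n_iter):
        # E(log theta) and E(log phi) under the current Dirichlet factors
        e_log_theta = digamma(alpha_tilde) - digamma(alpha_tilde.sum(axis=1, keepdims=True))
        e_log_phi = digamma(beta_tilde) - digamma(beta_tilde.sum(axis=1, keepdims=True))

        # Responsibilities E(z_{d,v,k}), shape (D, V, K), normalized over k
        log_p = e_log_theta[:, None, :] + e_log_phi.T[None, :, :]
        log_p -= log_p.max(axis=2, keepdims=True)
        Ez = np.exp(log_p)
        Ez /= Ez.sum(axis=2, keepdims=True)

        # alpha_tilde_{d,k} = alpha + sum_v n_{d,v} E(z_{d,v,k})
        alpha_tilde = alpha + np.einsum('dv,dvk->dk', n, Ez)
        # beta_tilde_{k,v} = beta + sum_d n_{d,v} E(z_{d,v,k})
        beta_tilde = beta + np.einsum('dv,dvk->kv', n, Ez)

    E_theta = alpha_tilde / alpha_tilde.sum(axis=1, keepdims=True)
    E_phi = beta_tilde / beta_tilde.sum(axis=1, keepdims=True)
    return E_theta, E_phi

# Toy usage: word counts of D = 4 documents over a vocabulary of V = 6 words
n = np.array([[5, 3, 0, 0, 1, 0],
              [4, 4, 1, 0, 0, 0],
              [0, 0, 4, 5, 0, 2],
              [0, 1, 3, 6, 0, 1]])
E_theta, E_phi = lda_cavi(n, K=2)
print(E_theta.round(2))
```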
Here, we have derived the posterior expected values of the \(\theta\)s and \(\phi\)s using the word counts \(n_{d,v}\), which is slightly different from David M. Blei, Ng, and Jordan (2003). Posterior formulae similar to the ones derived here can be found in Murphy (2012), p. 962.
In sum, the rows of \(\phi_{K,V}=\left[E(\phi_{k}|\tilde{\beta}_{k})\right]_{K,V}\) are useful for interpreting (or identifying) the themes, whose relative importance in each document is represented by the columns of \(\theta_{D,K}=\left[E(\theta_{d}|\tilde{\alpha}_{d})\right]_{D,K}\).
Practical tools for estimating the topic distributions of a corpus exist (see Grun and Hornik (2011); Silge and Robinson (2017, Chap. 6)).
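For readers working in Python rather than R, scikit-learn also ships a variational LDA implementation; a minimal usage sketch with an assumed toy document-term matrix:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix (D = 4 documents, V = 6 words), assumed for illustration
n = np.array([[5, 3, 0, 0, 1, 0],
              [4, 4, 1, 0, 0, 0],
              [0, 0, 4, 5, 0, 2],
              [0, 1, 3, 6, 0, 1]])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta_hat = lda.fit_transform(n)                                    # D x K topic proportions
phi_hat = lda.components_ / lda.components_.sum(axis=1)[:, None]    # K x V word weights per topic
print(theta_hat.round(2))
```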
References
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
Blei, David M, Alp Kucukelbir, and Jon D McAuliffe. 2017. “Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association, no. just-accepted. Taylor & Francis.
Blei, David M. 2012. “Probabilistic Topic Models.” Commun. ACM 55 (4). New York, NY, USA: ACM: 77–84. doi:10.1145/2133806.2133826.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” J. Mach. Learn. Res. 3 (March). JMLR.org: 993–1022. http://dl.acm.org/citation.cfm?id=944919.944937.
Grun, Bettina, and Kurt Hornik. 2011. “Topicmodels: An R Package for Fitting Topic Models.” Journal of Statistical Software, Articles 40 (13): 1–30. doi:10.18637/jss.v040.i13.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT press.
Silge, J., and D. Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly Media, Incorporated. https://books.google.com/books?id=7bQzMQAACAAJ.