Introduction
An important development in text analytics is the invention of the Latent Dirichlet Allocation (LDA) algorithm (also called topic modeling) in 2003. LDA can be viewed as a non-negative matrix factorization algorithm. A matrix factorization consists of decomposing a matrix into a product of two or more matrices. It turns out that these linear algebra techniques have applications for data analysis, generally referred to as data dimension reduction methods. Examples of matrix factorization methods in statistics include Factor Analysis, Principal Component Analysis, and Latent Dirichlet Allocation. They are all latent variable models, which use observed variables to infer the values of unobserved (or hidden) variables. The basic idea of these methods is to find $\theta_{D,K}$ and $\phi_{K,V}$ (two sets of hidden variables) from $W_{D,V}$, the set of observed variables, such that: $$W_{D,V} \simeq \theta_{D,K}\, \phi_{K,V}$$ where $D$ is the number of observations, $V$ is the number of variables, and $K$ is the number of latent variables. We want $K \ll V$, and “hopefully” we can infer a meaning for each of the $K$ columns of $\theta_{D,K}$ from each of the $K$ rows of $\phi_{K,V}$. It also turns out that most of the information about the observations (the rows of $W_{D,V}$) is captured in the reduced matrix $\theta_{D,K}$, hence the idea of data dimension reduction. A major challenge in data dimension reduction is deciding on the appropriate value for $K$.
To help fix ideas, let’s assume we have the exam scores of 100 students on the following subjects: Gaelic, English, History, Arithmetic, Algebra, and Geometry (this is not a text data example, but it is a good one to illustrate the idea of latent variable models). The dataset is $W_{D,V} = W_{100,6}$; that is, 100 observations and 6 variables. Let’s assume we want to collapse the $V = 6$ variables into $K = 2$ variables. Let’s further assume that the first new variable may be termed “Humanities”, and the second may be termed “Math” (a sensible assumption!). Thus, we want to create a $\theta_{100,2}$ matrix that captures most of the information about the students’ grades on the 6 subjects. With the two variables, humanities and math, we can quickly learn about the students with the help of, for example, a simple scatterplot. The $\phi$ matrix helps us infer the meanings of the columns of $\theta$ as humanities and math because (hopefully) one row has large coefficients for Gaelic, English, and History, and small coefficients for Arithmetic, Algebra, and Geometry; and the second row has large coefficients for Arithmetic, Algebra, and Geometry, and small coefficients for Gaelic, English, and History. I hope this example provides an intuition of what matrix factorization aims to achieve when used for data analysis. The goal is to reduce the dimension of the data, i.e., to reduce the number of variables. The meaning of each new variable is inferred by guessing a name associated with the original variables that have the highest coefficients for that new variable. In the future, I will provide a numerical example within the context of Factor Analysis. Factor analysis is a building block for understanding latent variable models.
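The sketch below illustrates this setup with simulated exam scores. The subject loadings are invented for illustration, and scikit-learn’s NMF stands in for a generic non-negative factorization of $W \simeq \theta\,\phi$; it is a minimal sketch, not the estimation method discussed later in this post.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(42)
D, V, K = 100, 6, 2
subjects = ["Gaelic", "English", "History", "Arithmetic", "Algebra", "Geometry"]

# Simulate scores driven by two hidden abilities ("humanities" and "math")
ability = rng.uniform(40, 90, size=(D, K))
loadings = np.array([[1.0, 1.0, 1.0, 0.2, 0.1, 0.1],   # humanities loads on the first 3 subjects
                     [0.1, 0.2, 0.1, 1.0, 1.0, 1.0]])  # math loads on the last 3 subjects
W = np.clip(ability @ loadings + rng.normal(0, 5, size=(D, V)), 0, None)  # W_{100,6}

# Factorize W ~= theta @ phi with K = 2 latent variables
model = NMF(n_components=K, init="nndsvda", random_state=0, max_iter=500)
theta = model.fit_transform(W)   # (100, 2): each student summarized by 2 numbers
phi = model.components_          # (2, 6): used to name the 2 new variables

for k in range(K):
    top = np.argsort(phi[k])[::-1][:3]
    print(f"latent variable {k}:", [subjects[v] for v in top])
```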
In LDA, the $W$ matrix is a matrix of word counts, the $\theta$ matrix is a matrix of topic proportions within each document, and the $\phi$ matrix is a matrix of each word’s relative importance for each topic.
LDA: the model
This section provides a mathematical exposition of topic modeling and presents the generative process underlying the estimation of the $\theta$ and $\phi$ matrices. LDA is a generative model that represents documents as being generated by a random mixture over latent variables called topics (David M. Blei, Ng, and Jordan 2003). A topic is defined as a distribution over words. For a given corpus (a collection of documents) of $D$ documents, each of length $N_d$, the generative process for LDA is defined as follows (a short simulation sketch follows the list):
For each topic $k$, draw a distribution over words $\phi_k \sim \text{Dirichlet}(\beta)$, with $k \in \{1, 2, \dots, K\}$
For each document $d$:
Draw a vector of topic proportions $\theta_d \sim \text{Dirichlet}(\alpha)$
For each word $n$ in document $d$:
Draw a topic assignment $z_{d,n} \sim \text{Multinomial}(\theta_d)$, with $z_{d,n} \in \{1, 2, \dots, K\}$
Draw a word $w_{d,n} \sim \text{Multinomial}(\phi_{k = z_{d,n}})$, with $w_{d,n} \in \{1, 2, \dots, V\}$
Note: only the words $w$ are observed.
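To make the generative process concrete, here is a minimal simulation sketch; the dimensions and hyperparameter values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, D, N_d = 3, 50, 20, 100      # topics, vocabulary size, documents, words per document
alpha, beta = 0.5, 0.1             # symmetric Dirichlet hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=K)      # one distribution over words per topic
theta = rng.dirichlet(np.full(K, alpha), size=D)   # topic proportions for each document

docs = []
for d in range(D):
    z = rng.choice(K, size=N_d, p=theta[d])              # topic assignment for each word slot
    w = np.array([rng.choice(V, p=phi[k]) for k in z])   # each word drawn from its topic
    docs.append(w)                                       # only these words are observed
```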
The above generative process allows us to construct an explicit closed-form expression for the joint likelihood of the observed and hidden variables. Markov chain Monte Carlo (MCMC) and variational Bayes methods can then be used to estimate the parameters $\theta$ and $\phi$ (see David M. Blei, Ng, and Jordan (2003); David M. Blei (2012) for further exposition of the method). We derive the posterior distribution of the $\theta$s and $\phi$s in the next section.
Deriving the θ and ϕ values
A topic $\phi_k$ is a distribution over $V$ unique words, each having a proportion $\phi_{k,v}$; i.e., $\phi_{k,v}$ is the relative importance of the word $v$ for the definition (or interpretation) of the topic $k$. It is assumed that:
$$\phi_k \sim \text{Dirichlet}_V(\beta)$$
That is:
$$p(\phi_k \mid \beta) = \frac{1}{B(\beta)} \prod_{v=1}^{V} \phi_{k,v}^{\beta_v - 1}$$
where $B(\beta) = \frac{\prod_{v=1}^{V} \Gamma(\beta_v)}{\Gamma\left(\sum_{v=1}^{V} \beta_v\right)}$ and $\beta = (\beta_1, \dots, \beta_V)$. Since we have $K$ independent topics (by assumption), $$p(\phi \mid \beta) = \prod_{k=1}^{K} \frac{1}{B(\beta)} \prod_{v=1}^{V} \phi_{k,v}^{\beta_v - 1}$$
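As a quick numerical check of the density above (assuming numpy and scipy are available), the normalizing constant $1/B(\beta) = \Gamma\left(\sum_v \beta_v\right) / \prod_v \Gamma(\beta_v)$ can be computed with log-gamma functions and compared against scipy’s Dirichlet density:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import dirichlet

rng = np.random.default_rng(1)
V = 5
beta = np.full(V, 0.5)          # symmetric prior, beta_v = beta
phi_k = rng.dirichlet(beta)     # one topic drawn from Dirichlet_V(beta)

# log(1 / B(beta)) = log Gamma(sum_v beta_v) - sum_v log Gamma(beta_v)
log_norm = gammaln(beta.sum()) - gammaln(beta).sum()
log_density = log_norm + np.sum((beta - 1.0) * np.log(phi_k))

print(log_density, dirichlet(beta).logpdf(phi_k))   # the two values agree
```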
A document $d$ is a distribution over $K$ topics, each having a proportion $\theta_{d,k}$; i.e., $\theta_{d,k}$ is the relative importance of the topic $k$ in the document $d$. We assume:
$$\theta_d \sim \text{Dirichlet}_K(\alpha)$$
That is:
$$p(\theta_d \mid \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} \theta_{d,k}^{\alpha_k - 1}$$
And since we have $D$ independent documents (by assumption), $$p(\theta \mid \alpha) = \prod_{d=1}^{D} \frac{1}{B(\alpha)} \prod_{k=1}^{K} \theta_{d,k}^{\alpha_k - 1}$$
It is further assumed that the priors are symmetric: $\beta_v = \beta$ for all $v$, and $\alpha_k = \alpha$ for all $k$.
Let $z$ be the latent topic assignment variable; i.e., the random variable $z_{d,n}$ assigns the word $w_{d,n}$ to the topic $k$ in document $d$. $z_{d,n}$ is a vector of zeros with a 1 at the $k$th position ($z_{d,n} = [0, 0, \dots, 1, 0, \dots]$). Define $z_{d,n,k} = I(z_{d,n} = k)$, where $I$ is an indicator function that assigns 1 to the random variable $z_{d,n}$ when $z_{d,n}$ is the topic $k$, and 0 otherwise. We assume:
$$z_{d,n} \sim \text{Multinomial}(\theta_d)$$
That is: $$p(z_{d,n} = k \mid \theta_d) = \theta_{d,k}, \quad \text{so} \quad p(z_{d,n} \mid \theta_d) = \prod_{k=1}^{K} \theta_{d,k}^{z_{d,n,k}}$$
A document is assumed to have $N_d$ independent words, and since we assume $D$ independent documents, we have:
$$p(z \mid \theta) = \prod_{d=1}^{D} \prod_{n=1}^{N_d} \prod_{k=1}^{K} \theta_{d,k}^{z_{d,n,k}} = \prod_{d=1}^{D} \prod_{k=1}^{K} \prod_{n=1}^{N_d} \theta_{d,k}^{z_{d,n,k}} = \prod_{d=1}^{D} \prod_{k=1}^{K} \prod_{v=1}^{V} \theta_{d,k}^{n_{d,v}\, z_{d,v,k}} = \prod_{d=1}^{D} \prod_{v=1}^{V} \prod_{k=1}^{K} \theta_{d,k}^{n_{d,v}\, z_{d,v,k}}$$
where $n_{d,v}$ is the count of the word $v$ in document $d$; the last two equalities group the $N_d$ word positions by word type, with $z_{d,v,k}$ the topic indicator for the word type $v$ in document $d$.
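This regrouping from word positions $n$ to word types $v$ is easy to see in code; continuing the hypothetical simulation sketched above, the count matrix $n_{d,v}$ is:

```python
import numpy as np

# docs and V come from the generative-process sketch above
n = np.stack([np.bincount(w, minlength=V) for w in docs])   # n[d, v] = count of word v in document d
assert n.sum() == sum(len(w) for w in docs)                 # every word position is counted exactly once
```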
The word $w_{d,n}$ is drawn from the topic’s word distribution $\phi_k$:
$$w_{d,n} \mid \phi_{k = z_{d,n}} \sim \text{Multinomial}(\phi_{k = z_{d,n}})$$
$$p(w_{d,n} = v \mid \phi_{k = z_{d,n}}) = \phi_{k,v} = \prod_{v=1}^{V} \prod_{k=1}^{K} \phi_{k,v}^{w_{d,n,v}\, z_{d,n,k}}$$
$w_{d,n}$ is a vector of zeros with a 1 at the $v$th position. Define $w_{d,n,v} = I(w_{d,n} = v)$, where $I$ is an indicator function that assigns 1 to the random variable $w_{d,n}$ when $w_{d,n}$ is the word $v$, and 0 otherwise.
There are $D$ independent documents, each having $N_d$ independent words, so: $$p(w \mid \phi, z) = \prod_{d=1}^{D} \prod_{n=1}^{N_d} \prod_{v=1}^{V} \prod_{k=1}^{K} \phi_{k,v}^{w_{d,n,v}\, z_{d,n,k}}$$
Grouping the word positions by word type, as before:
$$p(w \mid \phi, z) = \prod_{d=1}^{D} \prod_{v=1}^{V} \prod_{k=1}^{K} \phi_{k,v}^{n_{d,v}\, z_{d,v,k}}$$
The joint distribution of the observed words $w$ and the unobserved (or hidden) variables $\theta$, $z$, and $\phi$ is given by:
$$p(w, z, \theta, \phi \mid \alpha, \beta) = p(\theta \mid \alpha)\, p(z \mid \theta)\, p(w \mid \phi, z)\, p(\phi \mid \beta)$$
The goal is to get the posterior distribution of the unobserved variables: $$p(z, \theta, \phi \mid w, \alpha, \beta) = \frac{p(w, z, \theta, \phi \mid \alpha, \beta)}{\int\!\!\int \sum_{z} p(w, z, \theta, \phi \mid \alpha, \beta)\, d\theta\, d\phi}$$
The denominator $\int\!\!\int \sum_{z} p(w, z, \theta, \phi \mid \alpha, \beta)\, d\theta\, d\phi$ is intractable, so approximation methods are used to approximate the posterior distribution. The seminal LDA paper (David M. Blei, Ng, and Jordan 2003) uses mean field variational Bayes, an optimization method, to approximate the posterior distribution (see Bishop (2006), p. 462, or David M. Blei, Kucukelbir, and McAuliffe (2017) for an exposition of the theory of variational methods). Mean field variational inference uses the following factorized approximation: $$p(z, \theta, \phi \mid w, \alpha, \beta) \simeq q(z, \theta, \phi) = q(z)\, q(\theta)\, q(\phi)$$
From Bishop (2006), p. 466, we have:
$$q^*(z) \propto \exp\left\{ E_{\theta,\phi}\left[ \log p(z \mid \theta) + \log p(w \mid \phi, z) \right] \right\}$$
$$q^*(\theta) \propto \exp\left\{ E_{z,\phi}\left[ \log p(\theta \mid \alpha) + \log p(z \mid \theta) \right] \right\}$$
$$q^*(\phi) \propto \exp\left\{ E_{\theta,z}\left[ \log p(\phi \mid \beta) + \log p(w \mid \phi, z) \right] \right\}$$
Using the expressions above, we have:
$$\log q^*(z) \propto E_{\theta,\phi}\left[ \sum_{d=1}^{D} \sum_{v=1}^{V} \sum_{k=1}^{K} n_{d,v}\, z_{d,v,k} \left( \log \theta_{d,k} + \log \phi_{k,v} \right) \right] \propto \sum_{d=1}^{D} \sum_{v=1}^{V} \sum_{k=1}^{K} n_{d,v}\, z_{d,v,k} \left( E(\log \theta_{d,k}) + E(\log \phi_{k,v}) \right)$$
Note that $x \mid p \sim \text{Multinomial}_K(p) \iff \log p(x \mid p) = \sum_{k=1}^{K} x_k \log p_k$. Let’s define $\log p_k = E(\log \theta_{d,k}) + E(\log \phi_{k,v})$, so $p_k = \exp\left( E(\log \theta_{d,k}) + E(\log \phi_{k,v}) \right)$. Thus,
$$q^*(z) \propto \prod_{d=1}^{D} \prod_{v=1}^{V} \prod_{k=1}^{K} \left[ \exp\left( E(\log \theta_{d,k}) + E(\log \phi_{k,v}) \right) \right]^{n_{d,v}\, z_{d,v,k}}$$
That is, $z_{d,v} \mid w_d, \theta_d, \phi_k \sim \text{Multinomial}_K(p_k)$,
and by the multinomial properties, $E(z_{d,v,k}) = p_k \propto \exp\left( E(\log \theta_{d,k}) + E(\log \phi_{k,v}) \right)$, with the $p_k$ normalized so that $\sum_{k=1}^{K} p_k = 1$.
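In practice, $E(\log \theta_{d,k})$ and $E(\log \phi_{k,v})$ are digamma functions of the variational Dirichlet parameters derived below: $E(\log x_k) = \psi(\tilde{a}_k) - \psi\left(\sum_k \tilde{a}_k\right)$ for $x \sim \text{Dirichlet}(\tilde{a})$, where $\psi$ is the digamma function. A minimal sketch of the responsibility computation for one document follows; the function and variable names are mine, not from the paper.

```python
import numpy as np
from scipy.special import digamma

def e_log_dirichlet(param):
    """E[log x] under a Dirichlet with the given parameters, taken over the last axis."""
    return digamma(param) - digamma(param.sum(axis=-1, keepdims=True))

def responsibilities(alpha_tilde_d, beta_tilde):
    """E(z_{d,v,k}) for one document: a (K, V) array, normalized over the K topics."""
    log_p = e_log_dirichlet(alpha_tilde_d)[:, None] + e_log_dirichlet(beta_tilde)  # (K, V)
    log_p -= log_p.max(axis=0, keepdims=True)        # subtract the max for numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=0, keepdims=True)
```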
$$q^*(\theta) \propto \exp\left\{ E_z\left[ \sum_{d} \sum_{k} (\alpha - 1) \log \theta_{d,k} + \sum_{d} \sum_{k} \sum_{v} n_{d,v}\, z_{d,v,k} \log \theta_{d,k} \right] \right\} = \prod_{d=1}^{D} \prod_{k=1}^{K} \exp\left\{ \left( \alpha + \sum_{v=1}^{V} n_{d,v}\, E(z_{d,v,k}) - 1 \right) \log \theta_{d,k} \right\} = \prod_{d=1}^{D} \prod_{k=1}^{K} \theta_{d,k}^{\alpha + \sum_{v=1}^{V} n_{d,v}\, E(z_{d,v,k}) - 1}$$
Thus, the approximate posterior distribution of the topic proportions in a document $d$ is:
$$\theta_d \mid w_d, \alpha \sim \text{Dirichlet}_K(\tilde{\alpha}_d)$$ where $\tilde{\alpha}_d = \alpha + \sum_{v=1}^{V} n_{d,v}\, E(z_{d,v,\cdot})$. Note that $\tilde{\alpha}_d$ is a vector of dimension $K$.
By the properties of the Dirichlet distribution, the expected value of $\theta_d \mid \tilde{\alpha}_d$ is given by:
$$E(\theta_d \mid \tilde{\alpha}_d) = \frac{\alpha + \sum_{v=1}^{V} n_{d,v}\, E(z_{d,v,\cdot})}{\sum_{k=1}^{K} \left[ \alpha + \sum_{v=1}^{V} n_{d,v}\, E(z_{d,v,k}) \right]}$$
The numerical estimation of $E(\theta_d \mid \tilde{\alpha}_d)$ gives the estimates of the topic proportions within each document $d$, $\hat{\theta}_d$. It is worth noting that $E(z_{d,v,k})$ can be interpreted as the responsibility that topic $k$ takes for explaining the observation of the word $v$ in document $d$. Ignoring for a moment the denominator of the equation above, $E(\theta_{d,k} \mid \tilde{\alpha}_{d,k})$ is similar to a regression equation where the $n_{d,v}$ are the observed counts of the words in document $d$, and the $E(z_{d,v,k})$ are the parameter estimates (or weights) of the words. This illustrates that the importance of a topic in a document is driven by the presence of words referring to that topic ($n_{d,v}$) and by the weights of these words ($E(z_{d,v,k})$).
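A sketch of the corresponding update for one document, reusing the hypothetical names from the sketch above:

```python
import numpy as np

def update_theta_d(n_d, r_d, alpha):
    """n_d: (V,) word counts of document d; r_d: (K, V) responsibilities E(z_{d,v,k})."""
    alpha_tilde_d = alpha + r_d @ n_d                   # tilde-alpha_d, one entry per topic
    e_theta_d = alpha_tilde_d / alpha_tilde_d.sum()     # E(theta_d | tilde-alpha_d)
    return alpha_tilde_d, e_theta_d
```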
Similarly,
$$q^*(\phi) \propto \exp\left\{ E_z\left[ \sum_{k=1}^{K} \sum_{v=1}^{V} (\beta - 1) \log \phi_{k,v} + \sum_{d=1}^{D} \sum_{k=1}^{K} \sum_{v=1}^{V} n_{d,v}\, z_{d,v,k} \log \phi_{k,v} \right] \right\} = \prod_{k=1}^{K} \prod_{v=1}^{V} \exp\left\{ \left( \beta + \sum_{d=1}^{D} n_{d,v}\, E(z_{d,v,k}) - 1 \right) \log \phi_{k,v} \right\} = \prod_{k=1}^{K} \prod_{v=1}^{V} \phi_{k,v}^{\beta + \sum_{d=1}^{D} n_{d,v}\, E(z_{d,v,k}) - 1}$$
Thus, the approximate posterior distribution of the word proportions in a topic $k$ is:
$$\phi_k \mid w, \beta \sim \text{Dirichlet}_V(\tilde{\beta}_k)$$
where $\tilde{\beta}_{k,v} = \beta + \sum_{d=1}^{D} n_{d,v}\, E(z_{d,v,k})$ for each word $v$. Note that $\tilde{\beta}_k$ is a vector of dimension $V$.
And the expected value of $\phi_k \mid \tilde{\beta}_k$ is given by:
$$E(\phi_k \mid \tilde{\beta}_k) = \frac{\beta + \sum_{d=1}^{D} n_{d,v}\, E(z_{d,v,k})}{\sum_{v=1}^{V} \left( \beta + \sum_{d=1}^{D} n_{d,v}\, E(z_{d,v,k}) \right)}$$
The numerical estimation of $E(\phi_k \mid \tilde{\beta}_k)$ gives the estimates of the words’ relative importance for each topic $k$, $\hat{\phi}_k$. Ignoring the denominator in the equation above, $E(\phi_{k,v} \mid \tilde{\beta}_{k,v})$ is a weighted sum of the frequencies of the word $v$ in each of the documents ($n_{d,v}$), the weights being the responsibility topic $k$ takes for explaining the observation of the word $v$ in document $d$ ($E(z_{d,v,k})$).
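And a sketch of the corresponding update for the topics, again with hypothetical names:

```python
import numpy as np

def update_phi(n, r, beta):
    """n: (D, V) word counts; r: (D, K, V) responsibilities E(z_{d,v,k}) for all documents."""
    beta_tilde = beta + np.einsum("dv,dkv->kv", n, r)            # tilde-beta, one row per topic
    e_phi = beta_tilde / beta_tilde.sum(axis=1, keepdims=True)   # E(phi_k | tilde-beta_k)
    return beta_tilde, e_phi
```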
Here, we have derived the posterior expected values of the $\theta$s and $\phi$s using the word counts $n_{d,v}$, which is slightly different from David M. Blei, Ng, and Jordan (2003). Posterior formulae similar to those derived here can be found in Murphy (2012), p. 962.
In sum, the rows of $\phi_{K,V} = [E(\phi_k \mid \tilde{\beta}_k)]_{K,V}$ are useful for interpreting (or identifying) the themes, whose relative importance in each document is represented by the columns of $\theta_{D,K} = [E(\theta_d \mid \tilde{\alpha}_d)]_{D,K}$.
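Putting the pieces together, a minimal coordinate-ascent loop might look as follows. It reuses the hypothetical helper functions sketched above and simply runs a fixed number of iterations instead of monitoring the evidence lower bound.

```python
import numpy as np

def lda_cavi(n, K, alpha=0.1, beta=0.01, n_iter=100, seed=0):
    """n: (D, V) word-count matrix; returns E(theta) of shape (D, K) and E(phi) of shape (K, V)."""
    rng = np.random.default_rng(seed)
    D, V = n.shape
    alpha_tilde = rng.gamma(100.0, 0.01, size=(D, K))   # random positive initialization
    beta_tilde = rng.gamma(100.0, 0.01, size=(K, V))
    for _ in range(n_iter):
        r = np.stack([responsibilities(alpha_tilde[d], beta_tilde) for d in range(D)])  # (D, K, V)
        for d in range(D):
            alpha_tilde[d], _ = update_theta_d(n[d], r[d], alpha)
        beta_tilde, e_phi = update_phi(n, r, beta)
    e_theta = alpha_tilde / alpha_tilde.sum(axis=1, keepdims=True)
    return e_theta, e_phi
```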
Practical tools for estimating the topic distributions of a corpus exist (see Grun and Hornik (2011); Silge and Robinson (2017, Chap. 6)).
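For example, in Python (the cited references discuss R tooling, namely topicmodels and tidytext; scikit-learn provides a comparable variational implementation), a hypothetical two-topic fit might look like this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat", "cats and dogs are pets",
        "stocks fell as markets slid", "investors traded shares and bonds"]

vectorizer = CountVectorizer()
W = vectorizer.fit_transform(docs)              # the D x V word-count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(W)
theta_hat = lda.transform(W)                    # D x K estimated topic proportions
phi_hat = lda.components_ / lda.components_.sum(axis=1, keepdims=True)   # K x V word weights

vocab = vectorizer.get_feature_names_out()
for k, row in enumerate(phi_hat):
    print(f"topic {k}:", [vocab[i] for i in row.argsort()[::-1][:5]])
```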
References
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. 2017. “Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association 112 (518). Taylor & Francis: 859–77.
Blei, David M. 2012. “Probabilistic Topic Models.” Commun. ACM 55 (4). New York, NY, USA: ACM: 77–84. doi:10.1145/2133806.2133826.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” J. Mach. Learn. Res. 3 (March). JMLR.org: 993–1022. http://dl.acm.org/citation.cfm?id=944919.944937.
Grun, Bettina, and Kurt Hornik. 2011. “topicmodels: An R Package for Fitting Topic Models.” Journal of Statistical Software 40 (13): 1–30. doi:10.18637/jss.v040.i13.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT press.
Silge, J., and D. Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly Media, Incorporated. https://books.google.com/books?id=7bQzMQAACAAJ.