how to determine the number of topics for LDA?


Solution 1

Unfortunately, there is no hard science that yields the correct answer to your question. To the best of my knowledge, the hierarchical Dirichlet process (HDP) is quite possibly the best way to arrive at the optimal number of topics.

If you are looking for deeper analyses, this paper on HDP reports the advantages of HDP in determining the number of groups.

Solution 2

A reliable way is to compute the topic coherence for different numbers of topics and choose the model that gives the highest coherence. That said, the model with the highest score may not always fit the bill, so inspect the resulting topics as well.


See this topic modeling example.

Solution 3

Some people use the harmonic mean for finding the optimal number of topics; I tried it as well, but the results were unsatisfactory. So my suggestion is: if you are using R, the package "ldatuning" is useful. It provides four metrics for choosing the optimal number of topics. Perplexity and log-likelihood based V-fold cross-validation are also very good options for topic modeling, although V-fold cross-validation can be time-consuming on large datasets. See "A heuristic approach to determine an appropriate number of topics in topic modeling".

Important links:
https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4597325/
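For reference, the harmonic-mean estimator mentioned above (from Griffiths and Steyvers, "Finding Scientific Topics") averages P(w|z) over a series of Gibbs samples. The log-likelihood values below are made-up numbers purely to show the log-space arithmetic:

```python
import math

def log_harmonic_mean(log_likelihoods):
    """Harmonic-mean estimate of log P(w|T) from per-sample log P(w|z)
    values, computed stably in log space via the log-sum-exp trick."""
    m = max(-ll for ll in log_likelihoods)  # shift for numerical stability
    s = sum(math.exp(-ll - m) for ll in log_likelihoods)
    return -(m + math.log(s) - math.log(len(log_likelihoods)))

# Hypothetical log P(w|z) values from several Gibbs iterations (the "series")
samples = [-1050.0, -1047.5, -1049.2, -1048.1]
estimate = log_harmonic_mean(samples)
# The estimate always lies between the smallest and largest sampled values
```

One would compute this estimate for each candidate number of topics T and pick the T with the largest value, though (as noted above) the estimator is known to behave poorly in practice.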

Solution 4

Let k = number of topics

There is no single best way, and I am not even sure whether there is any standard practice for this.

Method 1: Try out different values of k and select the one with the largest likelihood.

Method 2: Instead of LDA, see if you can use HDP-LDA

Method 3: If the HDP-LDA is infeasible on your corpus (because of corpus size), then take a uniform sample of your corpus and run HDP-LDA on that, take the value of k as given by HDP-LDA. For a small interval around this k, use Method 1.

Solution 5

Since I am working on the same problem, I just want to add the method proposed by Wang et al. (2019) in their paper "Optimization of Topic Recognition Model for News Texts Based on LDA". Besides giving a good overview, they suggest a new method: first train a word2vec model (e.g. using the word2vec package), then apply a clustering algorithm capable of finding density peaks (e.g. from the densityClust package), and finally use the number of clusters found as the number of topics for the LDA algorithm.

If time permits, I will try this out. I also wonder if the word2vec model can make the LDA obsolete.

[Flowchart from the paper]




Updated on July 09, 2022

Comments

  • Chelsea Wang (almost 2 years ago)

    I am new to LDA and want to use it in my work. However, some problems have come up.

    In order to get the best performance, I want to estimate the best number of topics. After reading "Finding Scientific Topics", I know that I can first calculate log P(w|z) and then use the harmonic mean of a series of P(w|z) values to estimate P(w|T).

    My question is: what does "a series of" mean?

  • kfmfe04 (over 2 years ago)
    LDA is a generative model, word2vec is not (it's just an embedding model), so the latter cannot render LDA obsolete. This approach replaces the need to specify number of topics in LDA with the need to specify number of features in word2vec.
  • Karsten W. (over 2 years ago)
    What do you mean "specify the number of features in word2vec"? I believe it is necessary to specify whether a density peak is a cluster center or not, but otherwise no further assumptions are necessary.
  • kfmfe04 (over 2 years ago)
    You need to pick the number of dimensions: typically, between 100 and 1000 - that's a parameter that impacts the quality of your embeddings (up to a point).