Structural Topic Modelling

Topic modelling and word clustering are common natural language processing (NLP) approaches to obtaining insight into text. They have been used to facilitate qualitative text analysis by automating topic extraction and grouping semantically similar text. Structural topic models (STM) are generative models that extend topic models such as Latent Dirichlet Allocation (LDA) and the Correlated Topic Model (CTM). These models identify latent topics in text. A topic is a probability distribution over words: each word has a probability of belonging to that topic. A document is a mixture of topics, and in CTM and STM the topic proportions can also be correlated.

Unlike LDA and CTM, STM allows document-level covariates (metadata) to be associated with each document. These covariates describe the circumstances in which the data were generated, such as the date on which a survey was collected or the trust that collected it, and may influence the topics mentioned. For example, feedback collected during the winter may include details about longer wait times and demand on services. Essentially, STM enables context to be added when generating a topic model.

In general, the model iterates through each word in a document. Based on a prior distribution over the topic proportions, the model assigns the word to a topic. The metadata covariates can influence the prevalence of the topics. In this way, documents with similar covariates will tend to mention similar topics and use more similar words to discuss them. The value of STM lies in its ability to discover topics in a corpus and estimate the effect of the associated metadata. The analyst is then able to examine the relationship between variables and topics in the text, which supports model interpretability and hypothesis testing. Applications of STM have included examining public opinion of the UK government throughout the COVID-19 pandemic, understanding the causes of user dissatisfaction from complaints, and identifying topics present in aviation incident reports. STM has also been used alongside other text analytic methods, such as sentiment prediction and hierarchical clustering, to improve the usability of Intelligent Personal Assistants.
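
The generative idea described above, in which covariates shift topic prevalence and each word is then assigned to a topic, can be illustrated with a minimal Python sketch. It is not the stm package's estimation procedure. It assumes a logistic-normal prior whose mean is a linear function of the covariates, and it simplifies the covariance to independent noise. The vocabulary, the topic-word probabilities (beta), the coefficients (gamma) and the function names are all hypothetical values chosen for illustration.

```python
import math
import random


def topic_proportions(covariates, gamma, sigma=0.0, rng=None):
    """Topic proportions for one document, given its covariates.

    covariates: list of P values (here an intercept and a winter indicator).
    gamma: P x (K-1) coefficients; the last topic is a reference fixed at 0.
    sigma: standard deviation of optional independent Gaussian noise
           (a simplification of a full covariance matrix).
    """
    eta = []
    for k in range(len(gamma[0])):
        mean = sum(x * gamma[p][k] for p, x in enumerate(covariates))
        noise = rng.gauss(0.0, sigma) if (rng is not None and sigma > 0) else 0.0
        eta.append(mean + noise)
    eta.append(0.0)  # reference topic
    # Softmax maps the unconstrained values onto proportions that sum to 1.
    peak = max(eta)
    weights = [math.exp(e - peak) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]


def generate_document(theta, beta, vocab, n_words, rng):
    """For each word position, draw a topic from theta, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        words.append(rng.choices(vocab, weights=beta[topic])[0])
    return words


# Hypothetical two-topic example: topic 0 is about waiting, topic 1 about staff.
vocab = ["wait", "queue", "staff", "kind"]
beta = [
    [0.50, 0.40, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
]
# Rows: intercept, winter indicator. One column because K - 1 = 1.
gamma = [[-1.0], [2.0]]

rng = random.Random(0)
summer_theta = topic_proportions([1.0, 0.0], gamma)
winter_theta = topic_proportions([1.0, 1.0], gamma)
winter_doc = generate_document(winter_theta, beta, vocab, 8, rng)

print("summer:", summer_theta)
print("winter:", winter_theta)
print("winter document:", winter_doc)
```

With these made-up coefficients, the expected share of the waiting topic rises from about 0.27 for a summer document to about 0.73 for a winter one, which mirrors the example of winter feedback mentioning longer wait times.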

STM was implemented using the stm R package. The searchK function is used to determine the number of topics (K) for the model by comparing candidate values of K on semantic coherence, exclusivity, held-out log-likelihood and the lower bound.

The best-performing models are evaluated qualitatively by manually inspecting representative texts (those with the highest estimated proportion of a given topic) and the words most associated with each topic, such as the words ranked highest by FREX score and those with the highest probability. The FREX score is a weighted harmonic mean of a word's frequency in a topic (its probability of appearing in that topic) and its exclusivity to that topic.
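
To make the FREX calculation concrete, the following Python sketch computes it from a topic-word probability matrix. It assumes the usual formulation, a weighted harmonic mean of the within-topic empirical CDF ranks of exclusivity and frequency, and it is not the stm package's own code. The function names, the three-word vocabulary, the probabilities and the weight of 0.7 are hypothetical.

```python
def ecdf(values):
    """Empirical CDF rank of each value: the share of values less than or equal to it."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]


def frex(beta, w=0.5):
    """FREX scores for every topic and word.

    beta: K x V matrix of word probabilities, one row per topic.
    w: weight on exclusivity (1 - w goes to frequency).
    FREX = 1 / (w / ECDF(exclusivity) + (1 - w) / ECDF(frequency))
    """
    n_topics = len(beta)
    n_words = len(beta[0])
    # Total probability of each word across topics, used to normalise exclusivity.
    word_totals = [sum(beta[k][v] for k in range(n_topics)) for v in range(n_words)]
    scores = []
    for k in range(n_topics):
        frequency = beta[k]
        exclusivity = [beta[k][v] / word_totals[v] for v in range(n_words)]
        freq_rank = ecdf(frequency)
        excl_rank = ecdf(exclusivity)
        scores.append(
            [1.0 / (w / e + (1.0 - w) / f) for e, f in zip(excl_rank, freq_rank)]
        )
    return scores


# Hypothetical example: "the" is the most probable word in both topics,
# but it is shared equally between them, so it is not exclusive to either.
vocab = ["wait", "staff", "the"]
beta = [
    [0.3, 0.1, 0.6],
    [0.1, 0.3, 0.6],
]

scores = frex(beta, w=0.7)
for k, row in enumerate(scores):
    top_by_prob = vocab[max(range(len(vocab)), key=lambda v: beta[k][v])]
    top_by_frex = vocab[max(range(len(vocab)), key=lambda v: row[v])]
    print(f"topic {k}: top by probability = {top_by_prob}, top by FREX = {top_by_frex}")
```

In this toy example the word with the highest probability in each topic is "the", while FREX promotes "wait" and "staff", which is why both rankings are worth inspecting when labelling topics.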