TM-LDA: Efficient Online Modeling of Latent Topic Transitions in Social Media

Latent topic analysis has emerged as one of the most effective methods for classifying, clustering and retrieving textual data. However, existing models such as Latent Dirichlet Allocation (LDA) were developed for static corpora of relatively large documents. In contrast, much of the textual content on the web, and especially in social media, is temporally sequenced and comes in short fragments, including microblog posts on sites such as Twitter and Weibo, status updates on social networking sites such as Facebook and LinkedIn, and comments on content sharing sites such as YouTube. In this paper we propose a novel topic model, Temporal-LDA or TM-LDA, for efficiently mining text streams such as a sequence of posts from the same author, by modeling the topic transitions that naturally arise in these data. TM-LDA learns the transition parameters among topics by minimizing the prediction error on the topic distribution in subsequent postings. After training, TM-LDA is thus able to accurately predict the expected topic distribution in future posts. To make these predictions practical in a realistic online setting, we develop an efficient updating algorithm that adjusts the topic transition parameters as new documents stream in. Our empirical results, on a corpus of more than 30 million microblog posts, show that TM-LDA significantly outperforms state-of-the-art static LDA models for estimating the topic distribution of new documents over time. We also demonstrate that TM-LDA is able to highlight interesting variations of common topic transitions, such as the differences in the work-life rhythm of cities, and factors associated with area-specific problems and complaints.
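To make the idea of learning topic transitions by minimizing prediction error concrete, the following is a minimal sketch and not the method as specified in the paper. It assumes a linear transition matrix `T` fitted by ordinary least squares, so that a post's topic distribution times `T` approximates the topic distribution of the author's next post. The squared-error objective, the function names, the running-sum update, and the toy data are all illustrative assumptions; in practice the rows of `A` and `B` would come from an LDA model applied to consecutive posts.

```python
import numpy as np


def fit_transition_matrix(A, B):
    """Return T minimizing ||A @ T - B||_F (assumed squared-error objective).

    A: (n, K) topic distributions of posts.
    B: (n, K) topic distributions of the posts that follow them.
    """
    T, *_ = np.linalg.lstsq(A, B, rcond=None)
    return T


def predict_next(theta, T):
    """Predict the next topic distribution, clipped and renormalized."""
    pred = np.clip(theta @ T, 0.0, None)
    total = pred.sum()
    if total > 0:
        return pred / total
    return np.full_like(pred, 1.0 / pred.size)


class IncrementalTransition:
    """Hypothetical online variant: keep running sums A^T A and A^T B
    so T can be refreshed without revisiting earlier posts."""

    def __init__(self, num_topics):
        self.G = np.zeros((num_topics, num_topics))  # accumulates A^T A
        self.H = np.zeros((num_topics, num_topics))  # accumulates A^T B

    def update(self, A_new, B_new):
        self.G += A_new.T @ A_new
        self.H += A_new.T @ B_new

    def transition_matrix(self):
        # Solve the normal equations G @ T = H in the least-squares sense.
        T, *_ = np.linalg.lstsq(self.G, self.H, rcond=None)
        return T


# Toy data (assumption): 3 topics and a known transition matrix.
rng = np.random.default_rng(0)
true_T = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.3, 0.6]])
A = rng.dirichlet(np.ones(3), size=200)  # "current" posts
B = A @ true_T                           # noiseless "next" posts

T_batch = fit_transition_matrix(A, B)

online = IncrementalTransition(num_topics=3)
for start in range(0, 200, 50):          # feed the stream in chunks
    online.update(A[start:start + 50], B[start:start + 50])
T_online = online.transition_matrix()

next_topics = predict_next(np.array([1.0, 0.0, 0.0]), T_batch)
```

On this noiseless toy data both the batch and the chunked fits should recover `true_T` up to floating-point error. With real LDA output the fit would only be approximate, and whether a linear map with a squared-error loss is the right choice is a modeling assumption of this sketch.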