Topic modeling with latent Dirichlet allocation (LDA) and visualization with t-SNE.

The code snippets in this post are meant to aid understanding as you read along. For the full working code, please refer to this repo.

We’ll first introduce what topic modeling and t-SNE are, and then apply these techniques to two datasets: 20 Newsgroups and tweets.

What is topic modeling?

Topic models are a suite of algorithms/statistical models that uncover the hidden topics in a collection of documents. Intuitively, given that a document is about a particular topic, one would expect certain words to appear in the document more or less frequently: ‘algorithms’, ‘compiler’, and ‘array’ will appear more often in documents about computer science, ‘democracy’, ‘politician’, and ‘policy’ in documents about politics, and ‘the’, ‘a’, and ‘is’ may appear equally in both. Furthermore, a document typically concerns multiple topics in different proportions, especially so in cross-disciplinary documents (e.g., 60% about biology, 25% about statistics, and 15% computer science in a bioinformatics article). A topic model captures this intuition in a mathematical framework to examine and discover what the topics might be and what each document’s balance of topics is.

Popular topic modeling algorithms include latent semantic analysis (LSA), hierarchical Dirichlet process (HDP), and latent Dirichlet allocation (LDA), among which LDA has shown great results in practice and has therefore been widely adopted. This post uses LDA for topic modeling (if you’d love to know more about the theory behind LDA and are comfortable reading formulas, refer to this paper).
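To make the "documents are mixtures of topics" intuition concrete, here is a minimal sketch of the generative story that LDA assumes, written with the standard library only. The vocabulary, the two topics, and their word probabilities are made up for illustration; a real model learns them from data.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one document following LDA's generative story."""
    # 1. Draw this document's topic proportions (its "balance of topics").
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word slot according to theta.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words

# Toy vocabulary and two hand-made topics (illustrative numbers, not learned).
vocab = ["algorithm", "compiler", "array", "democracy", "policy", "the"]
topic_word = [
    [0.30, 0.25, 0.25, 0.01, 0.01, 0.18],  # "computer science"
    [0.01, 0.01, 0.01, 0.40, 0.39, 0.18],  # "politics"
]

rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5],
                                 n_words=12, rng=rng)
print("topic proportions:", [round(t, 2) for t in theta])
print("document:", " ".join(words))
```

Training an LDA model runs this story in reverse: given only the words, it infers plausible topic-word distributions and per-document topic proportions.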

t-SNE

t-SNE, or t-distributed stochastic neighbor embedding, is a dimensionality reduction algorithm for high-dimensional data visualization. This is partly to mitigate the fact that humans cannot (at least not yet) perceive vector spaces beyond 3-D.

Here is an example of reducing 784-D digit representations and visualizing them in 3-D space (credits: Google Embedding Project):

t-SNE is nondeterministic, and its results depend on the data batch. In other words, the same high-dimensional data point might be transformed into different 2-D or 3-D vectors in different batches, because its position is determined relative to the other data points in the batch.

Implementations of t-SNE are available in various languages, but their speeds vary. For example, I compared the C++ implementation with a Python wrapper against the Python sklearn version, and found the former to be roughly 3x faster at transforming a matrix:
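One way to see why an embedding is only meaningful relative to its batch: t-SNE compares points through pairwise similarities that are normalized over all pairs in the batch. The sketch below computes only the low-dimensional Student-t similarities from the t-SNE formulation; it is not a full t-SNE implementation, and the coordinates are made up. Random initialization and a non-convex optimization add further run-to-run variation.

```python
import numpy as np

def low_dim_similarities(Y):
    """Student-t similarities q_ij between embedded points, as used by t-SNE."""
    # Squared Euclidean distance between every pair of embedded points.
    diff = Y[:, None, :] - Y[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Heavy-tailed Student-t kernel (one degree of freedom).
    kernel = 1.0 / (1.0 + sq_dist)
    np.fill_diagonal(kernel, 0.0)  # a point is not its own neighbor
    # Normalize over ALL pairs in the batch.
    return kernel / kernel.sum()

# The same three points, alone and as part of a larger batch (made-up coordinates).
Y_small = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
Y_large = np.vstack([Y_small, [[5.0, 5.0], [5.0, 6.0]]])

q_small = low_dim_similarities(Y_small)
q_large = low_dim_similarities(Y_large)

# Same pair of points, different similarity, because the batch changed.
print(q_small[0, 1], q_large[0, 1])
```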

Environment

15-inch MacBook Pro, macOS Sierra
2.2 GHz Intel Core i7 processor
16 GB 1600 MHz DDR3 memory


1. Transform a 10,000 x 50 matrix to 10,000 x 2

C++ & Python

real    1m2.662s
user    1m0.575s
sys     0m1.929s


Python sklearn

real    3m29.883s
user    2m22.748s
sys     1m7.010s


2. Transform a 20,000 x 50 matrix to 20,000 x 2

C++ & Python

real    2m40.250s
user    2m32.400s
sys     0m6.420s


Python sklearn

real    6m54.163s
user    4m17.524s
sys     2m31.693s


3. Transform a 1,000,000 x 25 matrix to 1,000,000 x 2

C++ & Python

real    224m55.747s
user    216m21.606s
sys     8m21.412s


Python sklearn

out of memory... :(
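To work out the speedups from the wall-clock ("real") times above, a small helper can convert the `time`-style stamps to seconds. This is only arithmetic on the numbers already reported; the helper name `to_seconds` is my own.

```python
import re

def to_seconds(stamp):
    """Convert a `time`-style stamp such as '3m29.883s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError("unexpected time format: {!r}".format(stamp))
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

# Wall-clock ("real") times copied from the runs above: (C++ wrapper, sklearn).
real_times = {
    "10,000 x 50": ("1m2.662s", "3m29.883s"),
    "20,000 x 50": ("2m40.250s", "6m54.163s"),
}

speedups = {}
for shape, (cpp, sk) in real_times.items():
    speedups[shape] = to_seconds(sk) / to_seconds(cpp)
    print("{}: {:.2f}x".format(shape, speedups[shape]))
```

The two ratios come out at about 3.3x and 2.6x, which is where the "roughly 3x" figure comes from.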


The author of t-SNE says that they have “applied the technique on data sets with up to 30 million examples” (though he didn’t specify the dimensionality of the data or the runtime). If you have an even larger dataset, you can scale up your hardware, adjust the parameters (e.g., the angle param in sklearn’s t-SNE), or try an alternative (such as LargeVis, the authors of which claim “Comparing to tSNE, LargeVis significantly reduces the computational cost of the graph construction step”. I haven’t tested it yet).

Putting it together: 20 Newsgroups example

Enough theory: let’s get our hands dirty. In this section we’ll apply LDA to the 20 Newsgroups dataset to discover the underlying topics in each document and visualize them as groups using t-SNE.

Getting the data

Fortunately, sklearn has functions to easily retrieve and filter the 20 Newsgroups data:

from sklearn.datasets import fetch_20newsgroups

# we only want to keep the body of the documents!
remove = ('headers', 'footers', 'quotes')

# fetch train and test data
newsgroups_train = fetch_20newsgroups(subset='train', remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', remove=remove)

# a list of 18,846 cleaned news in string format
# only keep purely alphabetic tokens & make them all lower case
news = [' '.join(w for w in raw.lower().split() if w.isalpha())
        for raw in newsgroups_train.data + newsgroups_test.data]


Training an LDA model

After we get the cleaned data, we can vectorize the tokens and train an LDA model:

import lda
from sklearn.feature_extraction.text import CountVectorizer

n_topics = 20 # number of topics
n_iter = 500 # number of iterations

# vectorizer: ignore English stopwords & words that occur in fewer than 5 documents
cvectorizer = CountVectorizer(min_df=5, stop_words='english')
cvz = cvectorizer.fit_transform(news)

# train an LDA model
lda_model = lda.LDA(n_topics=n_topics, n_iter=n_iter)
X_topics = lda_model.fit_transform(cvz)


where X_topics is an 18,846 (num_news) by 20 (n_topics) matrix. Note that we have a nice probabilistic interpretation here: each row is a probability distribution (learned by our LDA model) over topics for one news item (e.g., X_topics[0][0] is the probability of the first news item belonging to the first topic).
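As a quick illustration of that interpretation, here is a toy stand-in for X_topics with made-up numbers (3 documents, 4 topics). It checks that every row sums to 1 and reads off the most likely topic along with how confident that assignment is.

```python
import numpy as np

# Toy stand-in for X_topics: 3 documents x 4 topics (made-up numbers).
X_toy = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.15, 0.60, 0.20],
])

# Each row is a distribution over topics, so it sums to 1.
row_sums = X_toy.sum(axis=1)

# Most likely topic per document and the probability behind that choice.
best_topic = X_toy.argmax(axis=1)
best_prob = X_toy.max(axis=1)

for doc, (topic, prob) in enumerate(zip(best_topic, best_prob)):
    print("doc {}: topic {} with probability {:.2f}".format(doc, topic, prob))
```

Note the second document: its "most likely" topic is picked from a uniform row, so the assignment carries very little confidence.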

Reducing to 2-D with t-SNE

We have a learned LDA model. But we cannot visually inspect how good our model is. t-SNE comes to the rescue:

from sklearn.manifold import TSNE

# a t-SNE model
# angle value close to 1 means sacrificing accuracy for speed
# pca initialization usually leads to better results
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, angle=.99, init='pca')

# 20-D -> 2-D
tsne_lda = tsne_model.fit_transform(X_topics)


Visualizing groups and their keywords

Now we are ready to visualize the news groups and keywords, using the popular Python visualization library bokeh.

First we do some setup work (import classes & functions, set params, etc.):

import numpy as np
import bokeh.plotting as bp
from bokeh.plotting import save
from bokeh.models import HoverTool

n_top_words = 5 # number of keywords we show

# 20 colors
colormap = np.array([
"#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c",
"#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5",
"#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f",
"#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5"
])


Then we find the most likely topic for each news item:

_lda_keys = []
for i in range(X_topics.shape[0]):
    _lda_keys.append(X_topics[i].argmax())


and get top words for each topic:

topic_summaries = []
topic_word = lda_model.topic_word_  # all topic words
# scikit-learn >= 1.0; on older versions use get_feature_names()
vocab = cvectorizer.get_feature_names_out()
for i, topic_dist in enumerate(topic_word):
    # sort words by weight, then take the n_top_words heaviest
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words + 1):-1]
    topic_summaries.append(' '.join(topic_words))
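The slice `[:-(n_top_words + 1):-1]` is compact but cryptic, so here is a small worked example on a toy topic with made-up weights. It also shows an equivalent form that sorts in descending order first.

```python
import numpy as np

vocab_toy = np.array(["bible", "compiler", "god", "health", "patient"])
# Made-up word weights for one topic (same order as vocab_toy).
topic_dist = np.array([0.30, 0.02, 0.45, 0.13, 0.10])
n_top = 3

# argsort gives indices from lowest to highest weight ...
order = np.argsort(topic_dist)
# ... so walking backwards from the end yields the n_top heaviest words.
top_words = vocab_toy[order][:-(n_top + 1):-1]
print(" ".join(top_words))

# Equivalent: sort by descending weight, then take the first n_top.
top_words_alt = vocab_toy[np.argsort(-topic_dist)[:n_top]]
```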


Last but not least, we plot the news (each point representing one news item):

title = '20 newsgroups LDA viz'
num_example = len(X_topics)

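# note: Bokeh 3.x renamed plot_width/plot_height to width/height
plot_lda = bp.figure(plot_width=1400, plot_height=1100,
                     title=title,
                     tools="pan,wheel_zoom,box_zoom,reset,hover,save",
                     x_axis_type=None, y_axis_type=None, min_border=1)

# keep coordinates, colors, and hover fields together in one data source
plot_lda.scatter(x="x", y="y", color="color",
                 source=bp.ColumnDataSource({
                     "x": tsne_lda[:num_example, 0],
                     "y": tsne_lda[:num_example, 1],
                     "color": colormap[_lda_keys][:num_example],
                     "content": news[:num_example],
                     "topic_key": _lda_keys[:num_example]
                 }))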


and plot the crucial words for each topic:

# use the first news item assigned to each topic as the coordinate for its crucial words
topic_coord = np.full((X_topics.shape[1], 2), np.nan)
for topic_num in _lda_keys:
    if not np.isnan(topic_coord).any():
        break  # every topic already has a coordinate
    topic_coord[topic_num] = tsne_lda[_lda_keys.index(topic_num)]

# plot crucial words
for i in range(X_topics.shape[1]):
    plot_lda.text(topic_coord[i, 0], topic_coord[i, 1], [topic_summaries[i]])

# hover tools
hover = plot_lda.select(dict(type=HoverTool))
hover.tooltips = {"content": "@content - topic: @topic_key"}

# save the plot
save(plot_lda, '{}.html'.format(title))


That’s a lot of code… but if you’ve made it this far, you’ll get an interactive plot like this:

But it looks so messy…

I know! We can see some patterns in the visualization but the plot is busy and hard to interpret. What went wrong here: is it the LDA model or the t-SNE transformation?

It turns out that when we assign a major topic to each document, there are cases where the probability of even the most probable topic is rather low (an extreme case being each of the 20 topics getting 5%, i.e., a uniform distribution). In other words, our model cannot confidently (with a big margin) assign a topic to such news.

One workaround is to add a threshold that filters out unconfident assignments. Simply put these lines right after training the LDA model and before using t-SNE for dimensionality reduction:

import numpy as np

threshold = 0.5
_idx = np.amax(X_topics, axis=1) > threshold  # mask of docs above the threshold
X_topics = X_topics[_idx]
news = [doc for doc, keep in zip(news, _idx) if keep]  # keep news aligned with X_topics


and re-run the code to get this:

Looks much better: isolated and clear-cut groups! We achieved this, however, at the expense of removing unconfident assignments (in this case more than half of the data). What this shows is that there is only so much our LDA model can learn from this dataset, and that it cannot confidently assign a topic to every news item.

That said, the top words learned for each topic make some sense if you examine them closely: e.g., ‘medical health use number patient’ (healthcare) against ‘god jesus christian bible’ (religion).

Try playing with the parameters and see if you can find something more interesting!
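Before settling on a threshold, it can help to check how much data each candidate value would discard. The sketch below does this on a synthetic document-topic matrix drawn from a Dirichlet distribution, so the printed fractions are illustrative only and say nothing about the 20 Newsgroups model; the helper name `kept_fraction` is my own.

```python
import numpy as np

def kept_fraction(doc_topic, threshold):
    """Fraction of documents whose most probable topic exceeds `threshold`."""
    confident = np.amax(doc_topic, axis=1) > threshold
    return confident.mean()

# Synthetic doc-topic matrix for illustration only: 1,000 documents, 20 topics.
rng = np.random.RandomState(0)
doc_topic = rng.dirichlet(alpha=np.full(20, 0.2), size=1000)

for threshold in (0.3, 0.5, 0.7):
    print("threshold {:.1f}: keep {:.0%} of documents".format(
        threshold, kept_fraction(doc_topic, threshold)))
```

With the real model you would pass X_topics (before filtering) instead of the synthetic matrix.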

Tweets example

Twitter has become one of the most popular news and social networking service (SNS) platforms. In the last blogpost, Real-time Twitter trend discovery, we discussed how to visualize Twitter trends in real time. Yet we can also use a corpus of tweets to model topics.

Instead of putting the tweets in memory for real-time processing, we want to save the tweets to disk and accumulate a certain amount (at least millions) to model topics effectively.

First we need to establish a connection to the Twitter API: please check this section for how to do it. With the credentials, we can scrape live tweets:

import datetime

def datetime_filename(prefix='output_'):
    """
    Creates filename with current datetime string suffix.
    """
    outputname = prefix + '{:%Y%m%d%H%M%S}utc.txt'.format(datetime.datetime.utcnow())
    return outputname

def scrape(tweets_per_file=100000):
    """
    Scrape live tweets. GetStreamSample() gets ~1,000 English
    tweets per min, or 1.5 million/day. For easier reference,
    we save 100k tweets per file.
    """
    # `api` is the authenticated API object created when setting up the connection
    f = open(datetime_filename(prefix='en_tweet_'), 'w', encoding='utf-8')
    tweet_count = 0
    try:
        for line in api.GetStreamSample():
            if 'text' in line and line.get('lang') == 'en':
                text = line['text'].replace('\n', ' ')  # one tweet per line
                f.write('{}\n'.format(text))
                tweet_count += 1
                if tweet_count % tweets_per_file == 0:  # new batch
                    f.close()
                    f = open(datetime_filename(prefix='en_tweet_'), 'w', encoding='utf-8')
    except KeyboardInterrupt:
        pass  # stop scraping on Ctrl-C
    finally:
        f.close()
    return tweet_count


Give it at least a day or two to accumulate a decent number of tweets. The connection can sometimes be interrupted: just re-run the script so that new tweets keep being saved to disk.

After getting enough tweets, we can load the tweets, process them, vectorize them and compute tf-idf scores, train an LDA model, reduce to 2-D, and visualize the results. Refer to the complete script here.
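As a rough idea of what the "load and process" step could look like, here is a minimal sketch using only the standard library. The cleaning rules (dropping URLs, @mentions, and non-letter characters) are my own assumptions for illustration and are not necessarily what the complete script does; `clean_tweet` and `load_tweets` are hypothetical helper names.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, and non-letter characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace

def load_tweets(paths):
    """Read one tweet per line from the given files, skipping empty results."""
    tweets = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                cleaned = clean_tweet(line)
                if cleaned:
                    tweets.append(cleaned)
    return tweets

print(clean_tweet("RT @user: Loving the NEW video!! https://t.co/abc123 #fun"))
```

With files written by the scraper above, something like `load_tweets(glob.glob('en_tweet_*utc.txt'))` would collect the cleaned tweets into one list, ready for vectorizing.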

You’ll get a plot that looks like this:

This is a visualization of a model trained on 2 million tweets, with only 5,000 data points (tweets) shown. The model has learned some good clusters: ‘sex girl porn’ representing porn-related tweets, ‘video liked new’ representing social network content, and ‘trump hilary cliton’ representing politics and the election (the tweets were indeed collected during the heat of the 2016 election).

Some last words

In this blogpost we discussed topic modeling and t-SNE-enabled visualization, and applied these algorithms to two datasets, 20 Newsgroups and tweets. I’d also recommend another library, gensim, which is great for modeling topics, training word/sentence vectors, and many other natural language processing tasks. Specific to topic modeling and LDA, gensim also has a multicore version that can parallelize and speed up model training. Coupled with another library, pyLDAvis, we can do some interesting visualization (see below). Give them a try!

That’s all for this blogpost! Bye for now.