Topic Modeling and Semantic Clustering with spaCy

Fouad Roumieh
Oct 14, 2023

When dealing with a large collection of documents, it becomes essential to organize and uncover the underlying themes or topics within the text. This is where techniques like Topic Modeling and Semantic Clustering come into play. In this article, we’ll explore how to use the spaCy library in conjunction with Gensim, a topic modeling library, to achieve these goals.

What is Topic Modeling?

Topic Modeling is a statistical modeling technique that identifies abstract topics in a collection of documents. It assumes that each document consists of a mixture of topics, and that each topic has a probability of generating different words. It helps in understanding the underlying themes and subjects discussed in a large corpus of text. To put it simply, Topic Modeling is like organizing your toys into different boxes based on what they are: if you have a lot of different toys, you might have a "car" box, a "doll" box, and so on.
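
To make the "mixture of topics" idea concrete, here is a toy sketch with hand-made numbers (not real model output) showing how a document's word probabilities follow from its topic mixture:

# Toy illustration: each topic is a probability distribution over words,
# and each document is a mixture of topics. Numbers are invented.
topics = {
    "cars":  {"wheel": 0.4, "engine": 0.4, "drive": 0.2},
    "dolls": {"dress": 0.5, "hair": 0.3, "house": 0.2},
}

# A document that is mostly about cars, with a little doll content
document_mixture = {"cars": 0.8, "dolls": 0.2}

# Probability that this document generates the word "wheel"
p_wheel = sum(document_mixture[t] * topics[t].get("wheel", 0.0)
              for t in topics)
print(p_wheel)  # 0.8 * 0.4 + 0.2 * 0.0 = 0.32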

What is Semantic Clustering?

Semantic Clustering is a process of organizing documents into clusters based on the similarity of their semantic meaning. It involves using techniques that consider the context, relationships, and intrinsic meaning of the text rather than just statistical or syntactic patterns. The goal is to group documents with similar meanings, even if the wording is different.
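
spaCy itself offers a quick way to see meaning-based similarity in action. The sketch below is illustrative and assumes the medium English model (en_core_web_md) is installed, since the small model ships without full word vectors:

import spacy

# The medium model includes word vectors; similarity scores from
# the small model (en_core_web_sm) are not meaningful.
nlp = spacy.load("en_core_web_md")

doc1 = nlp("The car broke down on the highway.")
doc2 = nlp("My vehicle stopped working on the road.")
doc3 = nlp("She baked a chocolate cake.")

# Different wording, similar meaning -> higher similarity score
print(doc1.similarity(doc2))  # relatively high
print(doc1.similarity(doc3))  # relatively low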

Using spaCy and Gensim for Topic Modeling and Semantic Clustering

Step 1: Preprocessing with spaCy

Before we proceed to topic modeling, we need to preprocess our documents. spaCy is a powerful Natural Language Processing (NLP) library that can help us with this. We’ll use it for tokenization, lemmatization, and removing stop words from our text.

import spacy

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Sample documents
documents = [
"Natural language processing is a field of AI.",
"Topic modeling helps in uncovering the main themes in a collection of documents.",
"Semantic clustering groups similar documents together based on meaning.",
"SpaCy is a popular NLP library.",
"Gensim is commonly used for topic modeling."
]

# Preprocess the documents using spaCy
def preprocess(doc):
    # Run the spaCy pipeline on the raw text
    doc = nlp(doc)
    # Lemmatize and drop stop words and punctuation
    tokens = [token.lemma_ for token in doc
              if not token.is_stop and not token.is_punct]
    return tokens

# Apply preprocessing to each document
processed_docs = [preprocess(doc) for doc in documents]

Output

# Output: Processed Documents
# [['natural', 'language', 'processing', 'field', 'AI'],
#  ['topic', 'modeling', 'help', 'uncover', 'main', 'theme', 'collection', 'document'],
#  ['semantic', 'clustering', 'group', 'similar', 'document', 'based', 'meaning'],
#  ['spaCy', 'popular', 'NLP', 'library'],
#  ['gensim', 'commonly', 'use', 'topic', 'modeling']]

Step 2: Creating a Bag of Words (BoW)

Next, we create a Bag of Words (BoW) representation for each document using Gensim. A BoW is a way of extracting features from the text for analysis. Gensim first builds a dictionary that maps each unique token to an integer ID, and each document is then represented as a list of (token ID, frequency) pairs.

from gensim import corpora

# Create a dictionary and corpus for topic modeling with Gensim
dictionary = corpora.Dictionary(processed_docs)
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

Output

# Output: Dictionary and Corpus
# Dictionary(26 unique tokens: ['AI', 'field', 'language', 'natural', 'processing']...)
# [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], ...]
# (the exact tokens and IDs depend on the spaCy model's lemmatizer)
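
To make those ID pairs readable, each (token ID, count) pair can be mapped back to its token through the dictionary:

# Map the first document's BoW pairs back to readable tokens
for token_id, count in corpus[0]:
    print(f"{dictionary[token_id]}: {count}")

# e.g.
# AI: 1
# field: 1
# language: 1
# natural: 1
# processing: 1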

Step 3: Topic Modeling with LDA

We then apply the Latent Dirichlet Allocation (LDA) algorithm using Gensim to our BoW representation. This helps us identify the underlying topics in the documents.

from gensim.models.ldamodel import LdaModel
from pprint import pprint

# Build the LDA (Latent Dirichlet Allocation) model
# (random_state pins the seed so runs are reproducible)
lda_model = LdaModel(corpus, num_topics=3, id2word=dictionary,
                     passes=15, random_state=42)

# Print the topics and their corresponding words
pprint(lda_model.print_topics(num_words=5))

Output

# Output: Topics and Their Corresponding Words
# [(0, '0.073*"topic" + 0.073*"modeling" + 0.073*"document" + ...'),
#  (1, '0.073*"document" + 0.073*"clustering" + 0.073*"group" + ...'),
#  (2, '0.073*"document" + 0.073*"meaning" + 0.073*"semantic" + ...')]

LDA assumes that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. LDA tries to find groups of words (topics) that frequently occur together across the corpus.
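
To inspect a single topic more directly, Gensim's show_topic returns the top (word, probability) pairs for a given topic ID:

# Inspect the most probable words for topic 0
for word, prob in lda_model.show_topic(0, topn=5):
    print(f"{word}: {prob:.3f}")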

Step 4: Semantic Clustering

Finally, we perform semantic clustering by assigning each document to the topic that best represents its content.

# Perform semantic clustering
for i, doc in enumerate(processed_docs):
    bow = dictionary.doc2bow(doc)
    print(f"Document {i+1} belongs to topic: {lda_model.get_document_topics(bow)}")

Output

# Output: Document Topics and Probabilities
Document 1 belongs to topic: [(0, 0.73), (1, 0.12), (2, 0.15)]
Document 2 belongs to topic: [(0, 0.08), (1, 0.85), (2, 0.07)]
Document 3 belongs to topic: [(0, 0.12), (1, 0.18), (2, 0.70)]
Document 4 belongs to topic: [(0, 0.90), (1, 0.04), (2, 0.06)]
Document 5 belongs to topic: [(0, 0.10), (1, 0.80), (2, 0.10)]

In this example, each line represents a document, and the numbers in the parentheses indicate the topic index and the probability of the document belonging to that topic. The document is assigned to the topic with the highest probability.
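
If you want that assignment as an explicit label rather than a probability list, a minimal sketch of the final argmax step looks like this:

# Assign each document to its single most probable topic
for i, doc in enumerate(processed_docs):
    bow = dictionary.doc2bow(doc)
    topic_probs = lda_model.get_document_topics(bow)
    best_topic, best_prob = max(topic_probs, key=lambda tp: tp[1])
    print(f"Document {i+1} -> topic {best_topic} (p={best_prob:.2f})")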

The exact numbers will vary with the data, the training parameters, and the random seed, but this is the general format of the output you get when performing semantic clustering with Latent Dirichlet Allocation (LDA).
