Get started with Gensim for basic NLP tasks

Gensim is an open-source Python package for natural language processing, with a particular focus on topic modeling. It is designed as a topic modeling library that lets users apply common academic models in production or in their own projects. In this article, we will look at the library, its main features, and a few basic NLP tasks it supports. The main points are listed below.

Contents

  1. What is Gensim?
  2. Characteristics of Gensim
  3. NLP practice with Gensim
    1. Create a dictionary from a list of phrases
    2. Bag of Words
    3. Create a bigram
    4. Create a TF-IDF matrix

Let’s talk about the Gensim library first.

What is Gensim?

Gensim is open-source software that performs unsupervised topic modeling and natural language processing using modern statistical machine learning. It is written in Python, with Cython used for performance. It is designed to handle large collections of text using data streaming and incremental online algorithms, which sets it apart from most other machine learning packages, which are designed solely for in-memory processing.

Gensim is not a full NLP research library (like NLTK); rather, it is a mature, focused, and efficient collection of NLP tools for topic modeling. It also includes tools for loading pre-trained word embeddings in a variety of formats, and for using and querying a loaded embedding.

Characteristics of Gensim

Here are some of the characteristics of Gensim.

Gensim provides efficient multi-core implementations of common techniques, including Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), Random Projections (RP), and Hierarchical Dirichlet Process (HDP), to speed up processing and retrieval on clusters of machines.

Thanks to its incremental online training algorithms, Gensim can process massive, web-scale corpora. It scales because the input corpus never needs to be held in random-access memory (RAM) all at once. In other words, the memory used by its algorithms does not depend on the size of the corpus.

Gensim is a robust system that has been used in a variety of systems by a variety of people. You can easily plug in your own input corpus or data stream, and it is simple to extend with other vector space algorithms.

NLP practice with Gensim

In this section, we will cover some of the basic NLP tasks using Gensim. Let’s start by creating the dictionary first.

1. Create a dictionary from a list of phrases

Gensim requires words (or tokens) to be translated into unique identifiers in order to work on text documents. To do this, Gensim lets you create a Dictionary object that maps each word to a unique id. We can do this by turning our text/sentences into lists of words and passing them to corpora.Dictionary().

The Dictionary object is typically used to generate a "bag of words" corpus. The dictionary and the bag-of-words corpus together are the inputs to Gensim's topic models and other models.

Here is the snippet that creates the dictionary for a given text.

from pprint import pprint
from gensim import corpora

text = [
   "Gensim is an open-source library for",
   "unsupervised topic modeling and",
   "natural language processing."
]
# split each sentence into separate words
text_tokens = [[tok for tok in doc.split()] for doc in text]
# create the dictionary
dict_ = corpora.Dictionary(text_tokens)
# show the tokens and their ids
pprint(dict_.token2id)

2. Bag of Words

The corpus (a bag of words) is the next important thing to learn if you want to use Gensim effectively. It is an object that holds, for each document, the id of every word together with the number of times that word appears.

To create a bag-of-words corpus, pass each tokenized document to the dictionary's doc2bow() method. With allow_update=True, the dictionary is also updated with any new words it encounters. To generate the BoW, we'll continue from the tokenized text of the previous example.

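from pprint import pprint
from gensim import corpora

# tokens (text is the list of sentences from the previous example)
text_tokens = [[tok for tok in doc.split()] for doc in text]
# create an empty dictionary
dict_ = corpora.Dictionary()
# build the BoW corpus; allow_update=True adds new words to the dictionary as they are seen
BoW_corpus = [dict_.doc2bow(doc, allow_update=True) for doc in text_tokens]
pprint(BoW_corpus)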

The (0, 1) in the first element of the output indicates that the word with id 0 appears once in the first sentence. Similarly, the (10, 1) in the third element of the list indicates that the word with id 10 occurs once in the third sentence. And so on.

3. Create a bigram

Certain words in a text regularly appear in pairs (bigrams) or groups of three (trigrams), because the terms only name the real entity when they are joined. Forming bigrams and trigrams from sentences is important, especially when working with bag-of-words models. It is quick and easy with Gensim's Phrases model. Since a trained Phrases model supports indexing, it is enough to pass a tokenized sentence (a list of words) to the model to get that sentence back with its bigrams joined.

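from pprint import pprint
from gensim.models.phrases import Phrases
# Build the bigram model from the tokenized sentences of the earlier examples
bigram = Phrases(text_tokens, min_count=3, threshold=10)
# Apply the model to the first sentence; detected bigrams are joined with "_".
# On a corpus this small no word pair occurs min_count times, so the tokens come back unchanged.
pprint(bigram[text_tokens[0]])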

4. Create a TF-IDF matrix

Unlike the plain bag-of-words corpus, the Term Frequency-Inverse Document Frequency (TF-IDF) model reduces the weight of tokens (words) that appear frequently across documents. TF-IDF is calculated by multiplying a local component, the term frequency (TF), by a global component, the inverse document frequency (IDF), and then normalizing the result to unit length. As a result, words that appear in many documents are given less weight.
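As a rough illustration of the calculation described above, the sketch below computes the weights by hand in plain Python. It uses one common variant (raw term counts, a base-2 logarithm for IDF, and unit-length normalization); the sample documents and the tfidf helper are made up for the example, and Gensim's own implementation may differ in its details.

```python
import math
from collections import Counter

# Made-up tokenized documents; "gensim", "topic" and "modeling" each occur in two of them
docs = [
    ["gensim", "is", "a", "topic", "modeling", "library"],
    ["topic", "modeling", "finds", "themes"],
    ["gensim", "streams", "large", "corpora"],
]

n_docs = len(docs)
# Document frequency: in how many documents each token appears
doc_freq = Counter(tok for doc in docs for tok in set(doc))


def tfidf(doc):
    """Hypothetical helper: TF * IDF per token, normalized to unit length."""
    term_freq = Counter(doc)
    raw = {
        tok: count * math.log2(n_docs / doc_freq[tok])
        for tok, count in term_freq.items()
    }
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {tok: weight / norm for tok, weight in raw.items()} if norm else raw


weights = tfidf(docs[0])
for tok, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(tok, round(weight, 2))
```

In the first document, "is" appears in only one document while "gensim" appears in two, so "is" ends up with the larger weight even though both occur once.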

There are several variants of the TF and IDF formulas. Below is how we can get the TF-IDF matrix. The snippets below first show the word frequencies given by the BoW corpus and then the weights given by TF-IDF.

from gensim import corpora, models
from gensim.utils import simple_preprocess
import numpy as np
# data to be processed
documents = [
   "Gensim is an open-source library for  ",
   "unsupervised topic modeling and",
   "natural language processing."]

# Create the Dictionary and Corpus
tokenized = [simple_preprocess(line) for line in documents]
mydict = corpora.Dictionary(tokenized)
corpus = [mydict.doc2bow(tokens) for tokens in tokenized]

# Show the word frequencies in the corpus
for bow in corpus:
    print([[mydict[token_id], freq] for token_id, freq in bow])

Now on to TF-IDF. We just need to fit the model on the corpus and then loop over the transformed documents to read the weight of each word.

# Create the TF-IDF model
tfidf = models.TfidfModel(corpus, smartirs="ntc")

# Show the TF-IDF weights, rounded to two decimals
for weighted_doc in tfidf[corpus]:
    print([[mydict[token_id], np.around(weight, decimals=2)] for token_id, weight in weighted_doc])

Last words

In this article, we discussed Gensim, a Python library that provides building blocks for NLP algorithms and pipelines, with a particular focus on topic modeling. As a getting-started guide, it covered a few basic NLP tasks: building a dictionary, creating a bag-of-words corpus, detecting bigrams, and computing TF-IDF weights.
