NLP Entity Extraction/NER using Python NLTK

Oct 13, 2023

Have you ever wondered how a tool like ChatGPT is able to understand what we write? Well, the answer is NLP, or Natural Language Processing.

What is NLP?

I will put it in simple words so that everyone can follow, instead of using complicated terms. I feel AI/ML engineers have a responsibility to explain these topics to a wider audience as much as they can.

Natural Language Processing (NLP) is like teaching computers to understand and communicate with humans in the way we naturally speak and write. It’s a smart way for machines to read, interpret, and respond to human language, enabling them to help us with various tasks like understanding texts, translating languages, summarizing articles, and even having conversations. It’s like giving computers the ability to speak our language and help us better!

Entity Extraction

One essential task in NLP is entity extraction, also known as Named Entity Recognition (NER). It is a technique that involves identifying and categorizing specific entities or objects within a given text. These entities can be individuals, organizations, locations, time expressions, quantities, monetary values, percentages, and more.

For example, let’s say we have this sentence: “F. Henly was born in San Francisco and he works at Microsoft.”

In this sentence, the named entities can be extracted at a high level as follows:

Person: “F. Henly”
Location: “San Francisco”
Organization: “Microsoft”

One may ask: what is this useful for? Entity extraction helps link named entities to relevant knowledge bases, which provides additional context and information about the entities mentioned in the text. As a simple example, imagine asking a chatbot “Where does F. Henly work?” To answer, it first has to recognize “F. Henly” as a person and then find the organization linked to that name.
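As a rough illustration of the chatbot idea, the sketch below stores the three entities from the example sentence in a small dictionary and answers the question with a lookup. The knowledge_base structure and the where_does_work helper are hypothetical names made up for this illustration; a real system would link entities to an external knowledge base rather than a hand-written dictionary.

```python
# Toy illustration only: entities from the example sentence, stored by hand.
# `knowledge_base` and `where_does_work` are hypothetical names, not NLTK APIs.
knowledge_base = {
    "F. Henly": {
        "type": "Person",
        "born_in": "San Francisco",  # a Location entity
        "works_at": "Microsoft",     # an Organization entity
    }
}


def where_does_work(person):
    """Return the organization linked to `person`, or None if unknown."""
    record = knowledge_base.get(person)
    return record["works_at"] if record else None


print(where_does_work("F. Henly"))  # Microsoft
```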

Natural Language Toolkit

Now, back to the original topic of the article: what is NLTK? The project's website describes it as follows:

“NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.”

Let’s explore NLTK with an example.

Setting Up the Environment

First, we need to install NLTK (for example with pip install nltk) and download its data packages. Open your Python environment and run the following commands:

import nltk

# Download necessary NLTK data packages
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
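If the script is run repeatedly, it may be convenient to download a package only when it is missing. The sketch below is one possible way to do that with nltk.data.find(). The helper name ensure_nltk_data and the lookup paths are assumptions based on NLTK's usual data layout, not something taken from the original example.

```python
import nltk

# Lookup paths for the four packages downloaded above.
# These paths are an assumption based on NLTK's usual data layout.
REQUIRED_RESOURCES = {
    "punkt": "tokenizers/punkt",
    "averaged_perceptron_tagger": "taggers/averaged_perceptron_tagger",
    "maxent_ne_chunker": "chunkers/maxent_ne_chunker",
    "words": "corpora/words",
}


def ensure_nltk_data(resources=REQUIRED_RESOURCES):
    """Download each package only if nltk.data.find() cannot locate it.

    Returns the list of package names that had to be downloaded.
    """
    downloaded = []
    for package, path in resources.items():
        try:
            nltk.data.find(path)
        except LookupError:
            nltk.download(package, quiet=True)
            downloaded.append(package)
    return downloaded
```

Calling ensure_nltk_data() once at the start of a script would then replace the four download lines. On newer NLTK releases the tokenizer, tagger, or chunker may ask for differently named packages; in that case the LookupError message names the resource to download.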

Loading and Preprocessing Text Data

Now that we have installed the packages we need, let’s explore how NER can be done with NLTK. For the sake of simplicity, we’ll use the same text as in our previous example and preprocess it by tokenizing the text into words and tagging the words with their respective parts of speech.

sample_text = "F. Henly was born in San Francisco and he works at Microsoft."
tokens = nltk.word_tokenize(sample_text)
tagged_tokens = nltk.pos_tag(tokens)

So far we haven’t done the entity extraction part yet, but let me explain what we did here:

1. We start by loading a sample text, which can be any piece of text you choose, such as a news article, a paragraph from a book, or even a single sentence.

sample_text = "F. Henly was born in San Francisco and he works at Microsoft."

2. Next, we need to tokenize the text. Tokenization is the process of breaking the text into smaller units, such as words or punctuation, which are called tokens. NLTK provides a convenient function for tokenization.

tokens = nltk.word_tokenize(sample_text)

In this line of code, we’re using nltk.word_tokenize() to split the sample_text into a list of words or tokens. For example, the sentence "F. Henly was born in San Francisco and he works at Microsoft." would be tokenized into:

["F.", "Henly", "was", "born", "in", "San", "Francisco", "and", "he", "works", "at", "Microsoft", "."]

3. The last part is tagging:

tagged_tokens = nltk.pos_tag(tokens)

Here, nltk.pos_tag() assigns a part-of-speech tag to each token. For example, "Henly" is tagged as a proper noun (NNP), and "born" is tagged as a past participle verb (VBN). The end of the article explains these tags further.

[('F.', 'NNP'), ('Henly', 'NNP'), ('was', 'VBD'), ('born', 'VBN'), ('in', 'IN'), ('San', 'NNP'), ('Francisco', 'NNP'), ('and', 'CC'), ('he', 'PRP'), ('works', 'VBZ'), ('at', 'IN'), ('Microsoft', 'NNP'), ('.', '.')]
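The two preprocessing steps can be inspected without running the models. The sketch below copies the tagged tokens from the output shown above, compares the tokenizer's result with a plain whitespace split, and then filters the proper nouns. The variable names naive_tokens and proper_nouns are introduced here for illustration only.

```python
# Token and tag values copied from the outputs shown above,
# so this runs without NLTK or its models.
sample_text = "F. Henly was born in San Francisco and he works at Microsoft."
tagged_tokens = [
    ("F.", "NNP"), ("Henly", "NNP"), ("was", "VBD"), ("born", "VBN"),
    ("in", "IN"), ("San", "NNP"), ("Francisco", "NNP"), ("and", "CC"),
    ("he", "PRP"), ("works", "VBZ"), ("at", "IN"), ("Microsoft", "NNP"),
    (".", "."),
]

# A plain whitespace split leaves the final period glued to "Microsoft.",
# whereas the tokenizer output above separates it into its own token.
naive_tokens = sample_text.split()
print(naive_tokens[-1])  # Microsoft.

# Keep only the tokens tagged as proper nouns (NNP).
proper_nouns = [word for word, tag in tagged_tokens if tag == "NNP"]
print(proper_nouns)  # ['F.', 'Henly', 'San', 'Francisco', 'Microsoft']
```

The proper-noun list hints at where the entities are, but it does not say that "San" and "Francisco" belong together or what kind of entity each one is. That is the job of the chunking step.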

Now we have tagged_tokens, which contains both the tokens and their respective part-of-speech tags. This is a crucial preprocessing step for named entity recognition. Let’s see the final step, which is the actual entity extraction.

NER

To extract the entities, all we need is to call “ne_chunk” on the list of tagged tokens:

entities = nltk.ne_chunk(tagged_tokens)
print(entities)

Entities output:


(S
  F./NNP
  (PERSON Henly/NNP)
  was/VBD
  born/VBN
  in/IN
  (GPE San/NNP Francisco/NNP)
  and/CC
  he/PRP
  works/VBZ
  at/IN
  (ORGANIZATION Microsoft/NNP)
  ./.)

Let's break down and explain each line in the output above:

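F./NNP: "F." is a word tagged as a proper noun (NNP), indicating it's likely a name or a specific entity. Note that the chunker left it outside the PERSON chunk, so the initial is not recognized as part of the person's name.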
(PERSON Henly/NNP): Here, "Henly" is identified as a person's name (PERSON) and tagged as a proper noun (NNP).
was/VBD: "was" is a verb (VBD) in past tense.
born/VBN: "born" is a verb (VBN) in past participle form.
in/IN: "in" is a preposition (IN).
(GPE San/NNP Francisco/NNP): "San Francisco" is recognized as a geopolitical entity (GPE) and tagged as proper nouns (NNP).
and/CC: "and" is a coordinating conjunction (CC).
he/PRP: "he" is a pronoun (PRP), referring to a person.
works/VBZ: "works" is a verb (VBZ) in the third person singular present tense.
at/IN: "at" is a preposition (IN).
(ORGANIZATION Microsoft/NNP): "Microsoft" is identified as an organization (ORGANIZATION) and tagged as a proper noun (NNP).
./.): "." is a punctuation mark (.) indicating the end of a sentence. The closing parenthesis ends the sentence tree (S).
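The printed tree is handy for reading, but in a program we usually want the entities as plain data. The object returned by ne_chunk is an nltk Tree, so one way to pull out the labeled chunks is to walk its top-level children. The sketch below rebuilds the tree from the output above by hand so that it runs without the models; extract_named_entities is a helper name introduced here, not part of NLTK.

```python
from nltk.tree import Tree

# Hand-built copy of the tree printed above, so the models are not needed.
entities = Tree("S", [
    ("F.", "NNP"),
    Tree("PERSON", [("Henly", "NNP")]),
    ("was", "VBD"),
    ("born", "VBN"),
    ("in", "IN"),
    Tree("GPE", [("San", "NNP"), ("Francisco", "NNP")]),
    ("and", "CC"),
    ("he", "PRP"),
    ("works", "VBZ"),
    ("at", "IN"),
    Tree("ORGANIZATION", [("Microsoft", "NNP")]),
    (".", "."),
])


def extract_named_entities(tree):
    """Collect (label, text) pairs from the chunks directly under the root."""
    found = []
    for node in tree:
        # Plain (word, tag) tuples are ordinary tokens; Tree nodes are entities.
        if isinstance(node, Tree):
            text = " ".join(word for word, tag in node.leaves())
            found.append((node.label(), text))
    return found


print(extract_named_entities(entities))
# [('PERSON', 'Henly'), ('GPE', 'San Francisco'), ('ORGANIZATION', 'Microsoft')]
```

With the real pipeline, the same helper could be applied directly to the result of nltk.ne_chunk(tagged_tokens).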

To learn more about tagging in NLTK, here is a good source from the official documentation: Categorizing and Tagging Words