Natural Language Processing

Natural Language Processing Economy Insights

There are two fundamental approaches to shortening (summarizing) text: extractive and abstractive. The former extracts words and phrases from the original text to create a summary. The latter learns an internal linguistic representation in order to create a human-like summary by paraphrasing the original text.
2021-10-02, by Ted Jackman, Independent Financial Adviser

#ML || #NLP || #Science

Preprocessing of Data

Data preprocessing is the data mining stage in which the original, raw data is transformed into an understandable format that can be analyzed further.

Tokenization

Tokenization is the process of breaking a text document into separate words called tokens.

After tokenization, a sentence is represented as a sequence of individual words (tokens).

The Natural Language Toolkit (NLTK) is a popular open-source Python package used for all sorts of NLP tasks. In this article, we will use NLTK for all the steps of text preprocessing.
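As a minimal sketch of tokenization, the snippet below splits a sentence into tokens with NLTK's `wordpunct_tokenize`, a regex-based tokenizer that needs no extra data download. The example sentence is made up for illustration, and the fallback function is an assumption added only so the sketch still runs when NLTK is not installed. NLTK's `word_tokenize` is the more common choice, but it requires downloading tokenizer data first.

```python
import re

try:
    # Regex-based NLTK tokenizer; works without downloading any NLTK data.
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Fallback (assumption for this sketch): the same pattern that
    # NLTK's WordPunctTokenizer uses, applied with the standard library.
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)

sentence = "Tokenization breaks a sentence into words."  # illustrative input
tokens = wordpunct_tokenize(sentence)
print(tokens)
# ['Tokenization', 'breaks', 'a', 'sentence', 'into', 'words', '.']
```

Note that punctuation comes out as its own token, so "words" is a loose description of what a tokenizer returns.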

Removing stop words

Stop words are commonly used words that add little or no information to the text. Words like "the", "is", and "a" carry almost no meaning on their own and mostly add noise to the data.

The NLTK library has a built-in stop word list that you can use to remove stop words from text. However, this list is not universal: depending on the scope of the task, we can also create our own set of stop words.

NLTK's predefined list of stop words can be used as is, or we can add words to it and remove words from it depending on the specific task.
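The following is a small sketch of stop word removal, assuming NLTK's English stop word corpus has been downloaded (for example with `nltk.download("stopwords")`). The helper `remove_stop_words`, the sample tokens, and the tiny fallback set are hypothetical and exist only for illustration. It also shows one way to adapt the list to a task: adding a domain word and keeping "not" so that negation is preserved.

```python
try:
    from nltk.corpus import stopwords
    # Raises LookupError if the "stopwords" corpus has not been downloaded.
    stop_words = set(stopwords.words("english"))
except (ImportError, LookupError):
    # Assumption: small hand-written stand-in, used only when NLTK
    # or its stop word data is unavailable.
    stop_words = {"the", "is", "a", "an", "this", "not", "and", "of"}

def remove_stop_words(tokens, stop_words):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["The", "toolkit", "is", "a", "popular", "library"]
filtered = remove_stop_words(tokens, stop_words)
print(filtered)  # ['toolkit', 'popular', 'library']

# Task-specific list: add a domain word, and drop "not" from the list
# so that negation survives filtering.
custom_stop_words = (stop_words | {"popular"}) - {"not"}
filtered_custom = remove_stop_words(tokens, custom_stop_words)
print(filtered_custom)  # ['toolkit', 'library']

review = remove_stop_words(["This", "is", "not", "a", "toy"], custom_stop_words)
print(review)  # ['not', 'toy']
```

Storing the stop words in a set keeps each membership check fast, and building a modified copy leaves the original list untouched for other tasks.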

Ted Jackman, contributor to abundance.org.uk