hackajob Insider

4 Libraries to Hack Text Processing in Python

Written by hackajob Staff | Feb 23, 2021 12:00:00 AM

Natural language processing is one of the areas of artificial intelligence that has seen major growth in the last few years, and it's easy to see why.

We use it in our daily lives for tasks such as translation, extracting information from text, summarisation, classification and many more. With recent advances in deep learning for text, such as LSTMs (Long Short-Term Memory networks), CNNs (Convolutional Neural Networks), and Transformers, we've been able to see amazing results on these tasks when we have enough data.

However, sometimes we might not have the amount of data needed, or perhaps deep learning just isn't the best idea for our project. In this case, we need to use other methods and come up with simpler strategies for addressing text processing tasks. You might be wondering 'how can I do this?' - well, you're in the right place. Keep reading to see the 4 libraries we love that you can use to make text processing easier in Python.

NLTK

NLTK (Natural Language ToolKit) is one of the most comprehensive libraries for text processing in Python. With NLTK, you can do all the standard text processing steps: tokenisation, stemming and lemmatisation, text normalisation, stop word removal, part-of-speech tagging, and more.

Our personal favourite feature in NLTK is its large English word list, the 'words' corpus. It can be downloaded and imported with:

import nltk
nltk.download('words')  # one-off download of the corpus data
from nltk.corpus import words

Once downloaded, it can be used for spellchecking, searching for keywords and spotting interesting out-of-vocabulary terms in the text, without using more complicated algorithms. Give it a try and tell us what you think!
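To make the out-of-vocabulary idea concrete, here is a minimal sketch. The helper names load_english_vocabulary and out_of_vocabulary are our own, not part of NLTK, and the regular-expression tokeniser is a deliberately simple assumption. The NLTK import sits inside the loader so the lookup function can also be used with any other set of words.

```python
import re


def load_english_vocabulary():
    """Return NLTK's 'words' corpus as a set of lowercase strings.

    Hypothetical helper: downloads the corpus on first use, so it needs
    NLTK installed and a network connection the first time it runs.
    """
    import nltk
    nltk.download("words", quiet=True)
    from nltk.corpus import words
    return {word.lower() for word in words.words()}


def out_of_vocabulary(text, vocabulary):
    """List the tokens in `text` that do not appear in `vocabulary`."""
    # Very simple tokeniser (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in vocabulary]
```

Calling out_of_vocabulary(some_text, load_english_vocabulary()) would then flag candidate misspellings and unusual terms. Bear in mind that a plain word list may not contain every inflected form (plurals, verb endings), so stemming or lemmatising the tokens first is likely to reduce false alarms.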

SPACY

spaCy's creators describe it as the "best way to prepare text for deep learning". It provides language models for many languages. From Norwegian to Macedonian, you'll be pleasantly surprised with their inventory. With spaCy, you can design and implement text processing pipelines, train machine learning models, as well as load and use pre-trained word vectors and transformer models.

Our favourite use of spaCy's power is named entity recognition, or in other words, extracting the names of people, places, and organisations from the text. Normally, training your own algorithms for this task would require thousands or millions of records for good results, but spaCy's built-in functionality can quickly provide decent accuracy and coverage.
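As a rough illustration of named entity recognition, the sketch below loads a pre-trained pipeline and collects the entities it finds. The function names are ours, and we assume the small English model en_core_web_sm has already been installed (for example with python -m spacy download en_core_web_sm); any other spaCy pipeline with an NER component should work the same way.

```python
from collections import defaultdict


def extract_entities(text, model_name="en_core_web_sm"):
    """Return (entity text, entity label) pairs found by a spaCy pipeline.

    Assumes spaCy and the named model are installed; the import is kept
    inside the function so the grouping helper below works without them.
    """
    import spacy
    nlp = spacy.load(model_name)
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


def group_by_label(entities):
    """Group (text, label) pairs into a dict of label -> list of texts."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)
```

For example, group_by_label(extract_entities(article_text)) gives you one list per entity type, such as PERSON, ORG or GPE in the English models. The exact entities returned depend on the model and its version, so treat the output as a starting point rather than ground truth.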

FUZZYWUZZY

Need to do string matching in Python, but there are spelling mistakes or different forms of words? For this task we're loving fuzzywuzzy. Fuzzywuzzy is a Python library that uses Levenshtein distance (also called edit distance) to do fuzzy (or in other words, approximate) string matching.
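If edit distance is new to you, the following pure-Python sketch shows the idea: count the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. This is a textbook dynamic-programming version written for illustration only; it is not how fuzzywuzzy computes its scores internally.

```python
def levenshtein(a, b):
    """Return the edit distance between strings `a` and `b`."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]
```

A small distance relative to the string lengths suggests a likely typo or spelling variant, which is the intuition behind the similarity scores a fuzzy matching library reports.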

What makes fuzzywuzzy better than other libraries based on Levenshtein distance is that you can match not only on whole strings, but also on the individual tokens the string consists of, using the methods partial_ratio, token_sort_ratio and token_set_ratio. It also implements searching for the closest match from a list of candidates.
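Here is a small sketch of how these scorers might be compared side by side. The wrapper functions default_scorers, compare and closest_match are our own naming; the fuzzywuzzy calls they rely on are fuzz.ratio, fuzz.partial_ratio, fuzz.token_sort_ratio, fuzz.token_set_ratio and process.extractOne. We assume fuzzywuzzy is installed, and import it lazily so compare can be tried with any scoring function.

```python
def default_scorers():
    """Return fuzzywuzzy's whole-string and token-based scorers by name."""
    from fuzzywuzzy import fuzz  # assumes fuzzywuzzy is installed
    return {
        "ratio": fuzz.ratio,
        "partial_ratio": fuzz.partial_ratio,
        "token_sort_ratio": fuzz.token_sort_ratio,
        "token_set_ratio": fuzz.token_set_ratio,
    }


def compare(a, b, scorers):
    """Apply every scorer to the pair (a, b) and return name -> score."""
    return {name: scorer(a, b) for name, scorer in scorers.items()}


def closest_match(query, candidates):
    """Return fuzzywuzzy's best (candidate, score) pair for `query`."""
    from fuzzywuzzy import process  # assumes fuzzywuzzy is installed
    return process.extractOne(query, candidates)
```

Running compare("new york mets", "mets new york", default_scorers()) is a quick way to see how much the token-based scorers differ from the plain ratio when word order changes, and closest_match picks the best candidate from a list for you.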

CLEANCO

Ever had to match different spellings of company names, such as "Microsoft" and "Microsoft Corp."? You might already approach such a problem by using a combination of regular expressions, text preprocessing, n-grams, fuzzy text matching, or even a language model. We're about to make it way easier for you: try the Python library cleanco, which lets you directly strip company type terms (such as "Corp.") from the names of organisations from different countries, and save tons of working time.
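As a sketch of how this could fit into a matching workflow, the code below groups raw company names by their cleaned form. It assumes a cleanco release that exposes the basename function (the 2.x line does; older releases used a different interface). The helpers cleanco_basename and group_company_names are our own, and the cleaning function is passed in as a parameter so you can swap in something else.

```python
from collections import defaultdict


def cleanco_basename(name):
    """Strip company type terms from `name` using cleanco's basename."""
    from cleanco import basename  # assumes a cleanco version with basename
    return basename(name)


def group_company_names(names, clean=cleanco_basename):
    """Group raw names by their cleaned, case-insensitive form.

    `clean` is any function mapping a raw name to a cleaned name;
    by default it is the cleanco wrapper above.
    """
    groups = defaultdict(list)
    for name in names:
        key = clean(name).casefold().strip()
        groups[key].append(name)
    return dict(groups)
```

With the default cleaner, names such as "Microsoft" and "Microsoft Corp." should end up under the same key, after which a fuzzy matcher can take care of any remaining spelling differences.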

We hope that these four libraries can help you the next time you're using Python and that you enjoy using them as much as we do!
