Using SpaCy for Natural Language Processing

A guide to spaCy for everyone: from installation to training the model on your own data.

Aradhita Bhandari
8 min read · Dec 4, 2020

It’s easy to forget how powerful the human brain is. Not only do we understand language, we’ve figured out how to teach computers to understand human language (referred to as natural language) and made NLP easy for anyone to use. It’s all around us now. If you use a computer or smartphone today, you’ve probably taken advantage of machine translation, sentence generation, and error correction.

People have even made it easier for normal users to look behind the curtain and watch the magic happen. Stick around even if you don’t know programming. Trust me, you won’t need it.

Natural Language Processing is for everybody. Let me show you how.

I considered using NLTK for this article, since it’s aimed at research and teaching, but spaCy is closer to my heart: it’s the library I learned to work with on my own. SpaCy is free, open source, designed for production, and wonderfully documented.

NLTK, spaCy, and PyTorch all work with Python. If you’re a Java enthusiast, be sure to check out Apache OpenNLP afterwards.

Let’s get right into it

If you have Python 3, pip, and Jupyter notebooks installed, you can install spaCy on your own system. Or you can use an online Colab notebook.

Here’s the colab where I ran all of this code: SpaCy for NLP

Go to File -> Save a copy in Drive

Now, feel free to play around as you wish! Ctrl+Enter executes the cell you’re in, and Runtime -> Run all will run all the cells for you.
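If you’d rather run everything locally, the setup looks something like this. These are the standard install commands for spaCy v2, the version current when I wrote this; in a Colab or Jupyter cell, prefix each command with a `!`:

```bash
pip install -U spacy

# download the small and large English models used in this article
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_lg
```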

So, what is SpaCy?

SpaCy is a library for Natural Language Processing that can process and “understand” large volumes of text. SpaCy does this through a variety of features. You need to load a core statistical model for the features we’ll be using today. “Statistical model” essentially means that the “decisions” spaCy makes are actually statistical predictions.

Although I’ll start at the very basics, by the end of this article, you’ll know how to train spaCy to make new predictions.

When we call spacy.load('en'), we’re calling the default English model 'en_core_web_sm'. The vocab length of the small model is only 478, compared to 1,340,242 for the large model. The small model is sufficient for preprocessing, so we’ll start there.
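Here’s a minimal sketch of what that looks like in code (the exact vocab sizes will vary with your spaCy and model versions):

```python
import spacy

# the small model; spacy.load('en') is shorthand for this in spaCy v2
nlp = spacy.load('en_core_web_sm')
print(len(nlp.vocab))     # a small vocab to start with

# the large model ships with far more lexemes (and word vectors)
nlp_lg = spacy.load('en_core_web_lg')
print(len(nlp_lg.vocab))
```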

The difference in size between the small language model and the large language model

What’s a Vocab?

Strings take a lot of space to store, so spaCy stores them as hash values instead. Since the hash values hold no meaning on their own, the mapping between strings and hashes is stored in the Vocab.

Vocab is sort of like a translator between the hash values and the words. Each entry in the vocab is an object called a Lexeme, which has a hash value, a text, and some other data, like whether the string consists of digits or alphabetic characters (like_num or is_alpha).
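A quick sketch of the Vocab in action (the example sentence is my own; attribute names are as in spaCy v2):

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Toto chased 2 cats")

# the string store maps both ways: text -> hash and hash -> text
cats_hash = nlp.vocab.strings["cats"]
print(cats_hash)                     # a 64-bit integer
print(nlp.vocab.strings[cats_hash])  # 'cats'

# a Lexeme is the context-independent vocab entry for a string
lexeme = nlp.vocab["2"]
print(lexeme.text, lexeme.like_num, lexeme.is_alpha)  # 2 True False
```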

What does the model do?

As soon as the model is called on a text, it puts the text through a preprocessing pipeline. A preprocessing pipeline is a series of steps (functions) that the model performs to “understand” the text and prepare it for further processing.

The object returned after preprocessing is called a Doc. A Doc may look like a normal string of text, but it isn’t one. The doc object contains the data that was found during preprocessing, such as the tokens, sentences, entities, and parts of speech found in the text.
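You can poke at a Doc to convince yourself it’s more than a string (a small sketch; the example sentence is my own):

```python
doc = nlp("Dorothy lived in Kansas. Toto made her laugh.")

print(type(doc))        # <class 'spacy.tokens.doc.Doc'>, not str
print(nlp.pipe_names)   # the pipeline steps, e.g. ['tagger', 'parser', 'ner']
print(list(doc.sents))  # sentences found by the parser
print(doc.ents)         # named entities found by the NER step
```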

The Preprocessing Pipeline

Tokenization is performed before the rest of the steps in the pipeline. It splits the text into tokens: the smallest units of the string that carry semantic meaning, i.e. words and punctuation. The rest of the pipeline can be customized.
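For example (notice how the contraction and the punctuation become their own tokens):

```python
doc = nlp("It wasn't a dream; it was real!")
print([token.text for token in doc])
# ['It', 'was', "n't", 'a', 'dream', ';', 'it', 'was', 'real', '!']
```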

Tagger

The tagger assigns a part of speech (POS) to every token. The POS tag indicates how a token functions in meaning as well as grammatically within the sentence. We all know the common parts of speech, such as noun, pronoun, adjective, verb, and adverb; the fine-grained tag set spaCy uses for English, based on the Penn Treebank, distinguishes 36 of them.
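Here’s a sketch of how you’d inspect the tags, and tally them the way the chart below does (the sentence is my own; spacy.explain turns a tag into a plain-English description):

```python
from collections import Counter

doc = nlp("The cyclone set the house down very gently.")
for token in doc:
    print(token.text, token.pos_, token.tag_, spacy.explain(token.tag_))

# count the fine-grained tags across a whole document
print(Counter(token.tag_ for token in doc))
```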

Frequency of POS tags in the Oz document. Can you guess what each tag means?

Parser

The parser detects sentence boundaries. It also shows the dependencies the tokens have on each other. This is called dependency parsing. Unlike humans, spaCy cannot “instinctively” understand which words depend on others. However, it has been trained on a lot of data to predict dependencies between words.

The text output format for dependency parsing is quite difficult to understand.

The root is clearly marked but the dependencies are confusing
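Output like that comes from a loop along these lines (a sketch; the sentence mirrors the examples used above):

```python
doc = nlp("Dorothy lived with Uncle Harry and Aunt Em.")
for token in doc:
    # each token points at its syntactic head via a labelled dependency
    print(f"{token.text:10} {token.dep_:10} <- {token.head.text}")
```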

DisplaCy is a visualiser that shows the dependencies between the tokens. The tokens are joined by arrows, and the type of dependency is labelled on each arrow.
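Rendering the arrows takes one call (jupyter=True is for notebooks; from a plain script you’d use displacy.serve instead):

```python
from spacy import displacy

doc = nlp("Dorothy lived with Uncle Harry and Aunt Em.")
displacy.render(doc, style="dep", jupyter=True)
```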

The spaCy Visualiser is a great external tool that you can use to interactively explore the dependencies in a sentence.

The interface is interactive. Try it out!

Named Entity Recognition (NER)

Named entities are “real-world objects” that have been assigned a name: a place, a country, a product, or a work of art, for example. Roughly speaking, a proper noun. However, as you’ll see, NER relies on more than the POS of the token.

When NLP is used in production, it is often important to find the people and places in large volumes of text, which is why NER is included in the basic pipeline.
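Listing the entities the model found is straightforward (the exact output depends on the model version):

```python
doc = nlp("Charles Dickens wrote about the French Revolution "
          "in A Tale of Two Cities.")
for ent in doc.ents:
    print(ent.text, ent.label_, spacy.explain(ent.label_))
```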

Dorothy is not marked as an entity

As you can see, Dorothy, who is most definitely a person, is not considered a named entity here. If you go back to where we printed the POS of the tokens, Dorothy is considered a proper noun. These discrepancies happen because spaCy is not taught rules. It has to make its own statistical predictions based on the data given to it.

You’ll also notice that ‘Uncle Harry’ and ‘Aunt Em’ are correctly identified as one person each. While a POS tag is given to each individual token, a single named entity can span multiple tokens. Look at ‘Charles Dickens’ and ‘The French Revolution’.

A Tale of Two Cities is not marked as a work of art

Changing the pipeline

A very practical feature of spaCy is that it allows you to modify the pipeline. You can add steps (functions) that process the doc as you desire.

I’ve showcased this with a custom sentence parser that splits the sentence at commas. You can add any function you want here, even one that just prints out the length of the doc. We’ll look at a more practical use of changing the pipeline when we train the model.
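Here’s a minimal sketch of that comma-splitting component, written against the spaCy v2 API (in v3, custom components are registered with a decorator instead):

```python
def comma_sentencizer(doc):
    """Mark the token after every comma as a sentence start."""
    for i, token in enumerate(doc[:-1]):
        if token.text == ",":
            doc[i + 1].is_sent_start = True
    return doc

# must run before the parser, which finalises sentence boundaries
nlp.add_pipe(comma_sentencizer, before="parser")

doc = nlp("It was the best of times, it was the worst of times, "
          "it was the age of wisdom, it was the age of foolishness")
for sent in doc.sents:
    print(sent.text)
```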

It’s much easier to read the first sentence of A Tale of Two Cities like this :)

Comparing Similarities

SpaCy can predict similarity. It returns the similarity between two objects on a scale from 0 (no similarity) to 1 (completely the same). For similarity, you’ll need either the medium or the large model.

Similarity is more complex than comparing texts word by word. Texts are compared using word vectors, which are multi-dimensional meaning representations of words. I won’t get into the details here. While you’re playing around, you’ll find that short phrases fare better than long documents full of irrelevant words.
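A small sketch with the medium model (the sentences are my own, and the score will vary by model version):

```python
nlp_md = spacy.load("en_core_web_md")  # the medium model includes word vectors

doc1 = nlp_md("I like salty fries and hamburgers.")
doc2 = nlp_md("Fast food tastes very good.")
print(doc1.similarity(doc2))  # a float; higher means more similar
```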

Notice how the sentence order has been changed, something that plagiarism checkers don’t usually pick up on.

If you want to see NLTK’s cosine similarity used on a larger document, check out this code where I used it to compare the similarity between two Jane Austen books.

Training the Model

This is the fun bit.

As you have seen, the named entity recognition is not the greatest.

NER is mainly used in applications where it’s important to find the names of people. What happens when you need spaCy to identify something else that’s necessary for your application?

If you worked with shelters, you might want to know when an animal is being put up for adoption on social media. If you worked at a company, you’d want to know when a specific product is mentioned in an open-ended customer support form.

We’d need to recognise ‘fridge’ as a product and ‘Horses’ and ‘dog’ as animals

Let’s teach spaCy to recognise two new types of entities: products and animals.

We’ll provide labelled training data to spaCy, containing the exact character positions of each entity and its NER label. Our training data is not very good for practical purposes: it is too small and only marks one entity per sentence. This may cause the model to learn in some unexpected ways.

Making changes to the training data (adding a new label type or adding more data) won’t affect how the code runs, so you can experiment with this as well!

At this point, explaining the code fully would require a bit more machine learning background. The code has comments that give an overview of what is happening. Essentially, we are teaching spaCy how to make statistical predictions based on the limited data we have provided.
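For reference, here is a condensed sketch of the training loop in spaCy v2 style (the version this article was written against; v3 later overhauled the training API). The three example sentences and their character offsets are purely illustrative:

```python
import random
import spacy
from spacy.util import minibatch, compounding

# a tiny illustrative training set; real training needs far more examples
TRAIN_DATA = [
    ("Horses are too tall", {"entities": [(0, 6, "ANIMAL")]}),
    ("The fridge stopped working", {"entities": [(4, 10, "PRODUCT")]}),
    ("Do they bite? No, my dog is friendly",
     {"entities": [(21, 24, "ANIMAL")]}),
]

nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
ner.add_label("ANIMAL")  # PRODUCT already exists in the pretrained model

# disable the other pipes so only the NER weights get updated
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes):
    # resume_training keeps the pretrained weights instead of resetting them
    optimizer = nlp.resume_training()
    for itn in range(30):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for batch in minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer,
                       drop=0.35, losses=losses)
        print(itn, losses)

# try the retrained model on an unseen sentence
doc = nlp("My dog chewed through the fridge door")
print([(ent.text, ent.label_) for ent in doc.ents])
```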

Dog is still not identified as an animal

While our training data is limited, we can still see that spaCy has learned how to identify (some) animals and products.

Using the untrained model vs Using the trained model (notice date)

You can see here that spaCy does not just memorize words and sentences. No training sentence mentioned weak legs, yet spaCy identifies ‘Horses’ as an animal. The second and third sentences both use the adjective “nice”, but the animal and the person are correctly identified. The trained model is better at identifying people. Although “snake” is identified in one sentence, it isn’t in the last one. Unfortunately, our model also unlearned ‘today’ as a date.

This is a basic introduction to NLP with spaCy, focusing mostly on the powerful preprocessing capabilities of the library. The more you play around with the code, the better you’ll understand how it works. I hope you learned something new today!

Coding is for everyone ❤
