Creating a custom spaCy tokenizer to use with Prodigy - virtual7 GmbH

By Bogdan Zegheanu, 22 September 2021

Recently I've been working on an ML pipeline to recognize SKILLs in domain-specific documents.
This is basically a NER (Named Entity Recognition) task: you have a corpus of text and want to find the specific words or spans of words that denote an entity, in this case a SKILL.

For the learning part I've been using spaCy. And because every learning task needs annotations to learn from, you also need an annotation tool that makes it as easy as possible to annotate the SKILLs in the text.
In my case this is handled by Prodigy. There are many reasons to use Prodigy, but a compelling one is that it is made by the same team that created spaCy, so you can expect good integration between the two.

For the purpose of this article, I will assume some knowledge of these tools. For more information see here:
spaCy: https://spacy.io/
Prodigy: https://prodi.gy/

Prodigy and spaCy models

When starting an annotation task with Prodigy, you can pass a spaCy model on the command line so that Prodigy can use it.

spaCy models: https://spacy.io/usage/models

Prodigy can use the model in two ways:

Active learning. If you are using the ner.teach recipe, the model is updated as you annotate, and if the entity you want is already trained in the model, it shows up pre-labeled in the Prodigy UI.
Manual annotation. If you are using the ner.manual recipe, only the Tokenizer component of the model is used.

We are using ner.manual, so we only need the Tokenizer from the model. In this case we can actually pass a blank model, which contains only the Tokenizer. This is done by passing "blank:de" (or "blank:en" for English) in place of the model parameter.

So we run Prodigy like this from the command line:

prodigy ner.manual skills_dataset blank:de data/text_to_annotate.jsonl --label SKILL
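For reference, the input file passed to the recipe is newline-delimited JSON: one record per line, with the raw text under a "text" key. Here is a minimal sketch for producing such a file from a list of strings. The helper name write_jsonl is made up for illustration and is not part of Prodigy.

```python
import json
from pathlib import Path


def write_jsonl(texts, path):
    """Write one {"text": ...} JSON object per line (hypothetical helper)."""
    path = Path(path)
    # Create the target folder (e.g. data/) if it does not exist yet
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for text in texts:
            # ensure_ascii=False keeps German umlauts readable in the file
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```

Calling write_jsonl(list_of_texts, "data/text_to_annotate.jsonl") would then produce the file used in the command above.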

Problem

So what's the problem? I mentioned that the documents are domain-specific. I should also mention that they are in German.

This is all well and good, as spaCy already has generic trained models for German. However, the Tokenizer doesn't quite work for what we want to achieve here. Specifically, German makes heavy use of hyphens (-) to join words into compounds. For our use case this complicates things, because words which are SKILLs end up concatenated with another 'qualifier' into a single token.

Let me give you an example:

"SOAP interfaces" becomes something like SOAP-Schnittstellen in German. The default German tokenizer treats this as a single token, so Prodigy will only let you annotate the whole compound and not just "SOAP".

Modify the tokenizer in the blank model

What we can do is change the Tokenizer component in the blank model. To do this, we need to open a Python editor and write a few lines.

Modify the tokenizer and export the modified model on disk

import spacy

# Initialize a blank German model
nlp_de = spacy.blank("de")
# Extend the default infix rules with the hyphen as a separator for the loaded model.
# list(...) makes this work whether the defaults are stored as a tuple or a list.
infix_rules = list(nlp_de.Defaults.infixes) + [r"[-]"]
infix_re = spacy.util.compile_infix_regex(infix_rules)
# To see the tokenization without the hyphen infix, comment out the following line.
nlp_de.tokenizer.infix_finditer = infix_re.finditer
# Export the modified model to disk
nlp_de.to_disk("/home/bogdan/work/nlp/extended_de_model")

You can also change the suffix and prefix rules, which apply at the boundaries of a token. For more information on how to do this, see: https://spacy.io/usage/linguistic-features#native-tokenizers

Package the exported model

What the short script above does is change the way the tokenizer works and then export the blank model to disk.

After exporting the model to disk, we need to package it using the spacy package command.
But before that, we need to edit the ./extended_de_model/meta.json file to give the exported model a new name. To do so, edit the "name" field in meta.json, for example: "name": "blank_de_custom_tokenizer"
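If you prefer not to edit the file by hand, the same change can be scripted. The following is only a sketch: it assumes meta.json is a plain JSON object with a "name" field, and the helper name set_model_name is made up for illustration.

```python
import json
from pathlib import Path


def set_model_name(model_dir, new_name):
    """Set the "name" field in <model_dir>/meta.json and return the updated meta."""
    meta_path = Path(model_dir) / "meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    meta["name"] = new_name
    # Write the file back, leaving all other fields untouched
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta
```

For the model exported above, this would be called as set_model_name("./extended_de_model", "blank_de_custom_tokenizer").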

Then we can run the spacy package command:

python -m spacy package ./extended_de_model/ ./model_packages

Build a Python module from the packaged model

Now we have our packaged model in ./model_packages/de_blank_de_custom_tokenizer-0.0.0.

To make it available to the Prodigy CLI, we need to turn it into an installable Python module.

We do this by running python setup.py sdist in that folder and then installing the resulting .tar.gz archive from the dist/ subfolder with pip install.
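Once the module has been built and installed into the active environment (for example with pip), a quick way to confirm that Python can actually find it is to look it up by name. This is just a sanity-check sketch using the standard library. The helper name is_model_installed is made up, and the check only tells you that the package is importable, not that the model itself loads correctly.

```python
import importlib.util


def is_model_installed(package_name):
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


# The package name follows the folder name produced by spacy package
print(is_model_installed("de_blank_de_custom_tokenizer"))
```

If this prints False, Prodigy will not be able to load the model by name either, and the install step is worth re-checking.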

Passing the new model into Prodigy

Now we are ready to pass the model to Prodigy:

prodigy ner.manual skills_dataset de_blank_de_custom_tokenizer data/text_to_annotate.jsonl --label SKILL

Now we will get three tokens for SOAP-Schnittstellen:

SOAP, -, Schnittstellen
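To see why the extra infix rule yields exactly these three tokens, here is a toy illustration in plain Python. It is a simplified model of the idea only, not spaCy's actual implementation: a real tokenizer also handles prefixes, suffixes and special cases. The function split_on_infixes is made up for this sketch. It takes a finditer callable, much like the one assigned to the tokenizer's infix_finditer.

```python
import re


def split_on_infixes(text, infix_finditer):
    """Split text at every infix match, keeping each matched infix as its own piece."""
    pieces = []
    start = 0
    for match in infix_finditer(text):
        if match.start() > start:
            pieces.append(text[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(text):
        pieces.append(text[start:])
    return pieces


hyphen_infix = re.compile(r"[-]")  # the rule added in this article
no_infix = re.compile(r"(?!)")     # a pattern that never matches

print(split_on_infixes("SOAP-Schnittstellen", hyphen_infix.finditer))
# ['SOAP', '-', 'Schnittstellen']
print(split_on_infixes("SOAP-Schnittstellen", no_infix.finditer))
# ['SOAP-Schnittstellen']
```

Without a rule that matches the hyphen, the compound stays in one piece, which is what happens with the default German rules.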

A possible alternative

Instead of modifying the tokenizer, we could pass the --highlight-chars parameter to Prodigy (https://prodi.gy/docs/named-entity-recognition). With this, we can select only the part of the token we are interested in.

However, there are two downsides to this:

Annotation speed. Annotating becomes more tedious, as you now have to select every letter of a word/token. Without this flag, you annotate a word/token simply by double-clicking it.
Tokenizer during training. As the flag's documentation also mentions, it is recommended to use the same tokenizer for annotation as for training. Otherwise, when SOAP-Schnittstellen comes up again during training, it is a single token and will not line up with the "SOAP" span that you annotated.
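The second point can be made concrete with a small sketch. An annotation made on characters is only usable for token-based training if it starts and ends on token boundaries. The helpers below are made up for illustration and assume, for simplicity, that the tokens follow each other with no whitespace in between, which holds for this example.

```python
def token_offsets(tokens):
    """Return (start, end) character offsets for tokens laid out back to back."""
    offsets = []
    pos = 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok)
    return offsets


def span_aligns(span_start, span_end, offsets):
    """True if the character span starts and ends exactly on token boundaries."""
    starts = {s for s, _ in offsets}
    ends = {e for _, e in offsets}
    return span_start in starts and span_end in ends


# "SOAP" annotated as characters 0..4 of "SOAP-Schnittstellen"
default_tokens = ["SOAP-Schnittstellen"]         # one token
custom_tokens = ["SOAP", "-", "Schnittstellen"]  # hyphen split off

print(span_aligns(0, 4, token_offsets(default_tokens)))  # False
print(span_aligns(0, 4, token_offsets(custom_tokens)))   # True
```

With the default tokenization the "SOAP" span ends in the middle of a token, so it cannot be mapped onto tokens. With the custom tokenizer it lines up.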

Notes about Prodigy and spaCy versions

What I describe in this article applies to Prodigy 1.10.* and spacy<3.0.

At the time of writing, the latest version of Prodigy does not yet support spacy>=3.0.

With spacy>=3.0 we can skip a step: we can add --name extended_de_model to the package command so that we don't have to change meta.json ourselves.