ValueError: [E949] Unable to align tokens for the predicted and reference docs. #12932
-
Hi! Sorry to hear you've been having issues with this, let's look into it in more detail. You didn't include the full stack trace, and there are two code paths from which this error can be raised, so it's hard to say exactly where it originates. The tokenizer function that you created defines how the words/characters are split into tokens, but some sort of alignment still needs to happen when you're training. In spaCy terminology, an `Example` holds two `Doc` objects: the reference doc with the gold-standard annotations, and the predicted doc produced by the pipeline. The E949 error is raised when the tokens of those two docs can't be aligned, which only works when both texts are the same apart from whitespace. So what seems to happen with your WordLevel tokenizer is that the tokens it produces no longer add up to the original input text, and the alignment fails. So the main question is this: does your custom tokenizer actually change the underlying text (other than whitespace)?
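One quick way to answer that question yourself, as a minimal sketch: ignoring whitespace, the concatenation of the tokens must equal the original text, which is essentially the condition behind E949 (spaCy's real alignment is more forgiving, e.g. about capitalization, but this captures the idea). The vocab and tokens below are made-up examples, not your actual tokenizer's output:

```python
def preserves_text(text, tokens):
    """Return True if the token sequence can be aligned back to the
    original text: ignoring whitespace, the concatenated tokens must
    equal the input. This mirrors the condition behind E949."""
    return "".join(tokens).replace(" ", "") == text.replace(" ", "")

text = "Apple is looking at buying a startup"

# A plain whitespace split trivially preserves the text.
assert preserves_text(text, text.split())

# A WordLevel tokenizer with a fixed vocab replaces out-of-vocabulary
# words with an unknown token, which *does* change the underlying text.
vocab = {"Apple", "is", "looking", "at", "buying", "a"}
wordlevel_tokens = [w if w in vocab else "[UNK]" for w in text.split()]
assert not preserves_text(text, wordlevel_tokens)  # "startup" became "[UNK]"
```

This is also consistent with BPE working fine for you: BPE falls back to smaller subword pieces instead of an `[UNK]` token, so the original text is always recoverable from the tokens.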
-
Hi! I referred to spaCy's custom tokenization docs here: https://spacy.io/usage/linguistic-features#custom-tokenizer-training
and tried using a custom-trained tokenizer in my NER project.
Here is my functions.py file:
and in my config.cfg:
I trained different tokenizers. The BPE one worked without any hiccups, but training with the WordLevel tokenizer fails:
It seems that spaCy is not using my custom tokenizer for prediction. Or is it an issue with an additional alignment step I have to include in the config?
I used https://huggingface.co/docs/tokenizers/quicktour to train my custom tokenizers.
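For reference, the wiring pattern from the linked spaCy docs: the tokenizer factory registered in `functions.py` is referenced from `config.cfg` by its registry name, with its arguments filled in from the config. The names below (`my_wordlevel_tokenizer`, `tokenizer_file`) are hypothetical placeholders, not the actual names from this project:

```
[nlp.tokenizer]
# Registry name given to the factory in functions.py (hypothetical)
@tokenizers = "my_wordlevel_tokenizer"
# Argument passed to that factory, e.g. a saved tokenizers JSON file (hypothetical)
tokenizer_file = "tokenizer.json"
```

Note that this only controls which tokenizer produces the *predicted* doc at training and inference time; the reference doc still comes from your annotated data, which is why the two texts must match for alignment to succeed.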