SpanRuler on Retokenized tokens links back to original token text, not the token text with a split (space) introduced #13457

vrunm · 2024-04-25T14:00:43Z

Goal: First split a token into two tokens, then use SpanRuler to label the two resulting tokens as a single span with one label.

Problem: The text of the labeled span is the original text (a single token) rather than the two tokens joined by a separating space (i.e. the text after retokenization).

What I did:

I added a custom token splitter as the first pipeline component. It correctly splits the single token into two tokens.
I then detect the two split tokens using a SpanRuler. Note that the SpanRuler matches a pattern of two separate tokens (i.e. pattern=['abc', 'efg']) and correctly detects nothing if the pattern is the original single token (pattern='abcefg').

Note that the custom splitter respects spaCy's non-destructive tokenization.

Thanks for any help.

Minimal Reproducible Example:

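import spacy
from spacy.language import Language

# Custom component: split the first token 'abcefg' into 'abc' + 'efg'.
@Language.component('splitter')
def splitter(doc):
    with doc.retokenize() as retokenizer:
        # The concatenated subtoken texts must equal the original token text.
        retokenizer.split(doc[0], ['abc', 'efg'], heads=[doc[0], doc[0]])
    return doc

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('splitter', first=True)  # run the splitter before all other components
sp_ruler = nlp.add_pipe('span_ruler')
# Match the two split tokens as one span labeled 'testing'.
sp_ruler.add_patterns([{'label': 'testing', 'pattern': [{'TEXT': 'abc'}, {'TEXT': 'efg'}]}])

doc = nlp('abcefg')

print([(tok.text, i) for i, tok in enumerate(doc)])
# SpanRuler stores its matches under doc.spans['ruler'] by default.
print([(type(span), span.text, span.label_) for span in doc.spans['ruler']])
print(len(doc.spans['ruler']))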

Actual Output:

> [('abc', 0), ('efg', 1)]
> [(<class 'spacy.tokens.span.Span'>, 'abcefg', 'testing')]
> 1

Expected output:

> [('abc', 0), ('efg', 1)]
> [(<class 'spacy.tokens.span.Span'>, 'abc efg', 'testing')]  # notice the space in the text, expected due to custom re-tokenization
> 1
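
A minimal sketch of one way to build the space-separated string from the span's tokens. It assumes a blank English pipeline (`spacy.blank("en")`, used here only so that no trained model is needed) and slices the span directly instead of running the SpanRuler; the variable names are illustrative. `Span.text` is assembled from each token's text plus its trailing whitespace, and `retokenizer.split` does not add whitespace between the subtokens, so joining the token texts explicitly is one option:

```python
import spacy

# Assumption: a blank pipeline is enough here; only the tokenizer runs.
nlp = spacy.blank("en")
doc = nlp("abcefg")

# Same split as in the example above.
with doc.retokenize() as retokenizer:
    retokenizer.split(doc[0], ["abc", "efg"], heads=[doc[0], doc[0]])

span = doc[0:2]  # stand-in for the span the SpanRuler would produce

# Span.text follows the underlying document text, which contains no space.
raw_text = span.text

# Joining the token texts yields the space-separated form.
joined_text = " ".join(tok.text for tok in span)

# Trailing whitespace recorded on each token after the split.
whitespace_flags = [tok.whitespace_ for tok in span]

print(repr(raw_text), repr(joined_text), whitespace_flags)
```

Whether the joined string is the right representation depends on the use case; the character offsets of the span still refer to the original text 'abcefg'.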

Replies: 0 comments
