SpanRuler on Retokenized tokens links back to original token text, not the token text with a split (space) introduced #13457
Unanswered
vrunm asked this question in Help: Coding & Implementations
Replies: 0 comments
Goal: First split a token into two tokens, then use SpanRuler to label the two retokenized tokens as a single span with one label.
Problem: The labeled span's text is the original text (a single token) rather than the two tokens joined by a separating space (i.e. the text after retokenization).
What I did:
I add a custom tokenizer splitter as the first pipeline stage. It correctly splits the single token into two tokens.
I then detect the two split tokens with a SpanRuler. Note that the SpanRuler matches a pattern of two separate tokens (i.e. pattern=['abc', 'efg']) and correctly finds nothing when the pattern is the original single token (pattern='abcefg').
Note that the custom retokenizer respects spaCy's non-destructive tokenization.
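For readers without the original example, the steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the code from the post: the component name `split_abcefg`, the label `MY_LABEL`, the sample sentence, and the choice of `heads` are all assumptions, and it presumes a spaCy version that ships the `span_ruler` factory. It shows that `span.text` returns the underlying document text, while joining the token texts gives the space-separated form.

```python
import spacy
from spacy.language import Language


@Language.component("split_abcefg")  # hypothetical component name
def split_abcefg(doc):
    # Split every "abcefg" token into "abc" + "efg".
    # The orths must concatenate to the original token text.
    with doc.retokenize() as retokenizer:
        for token in doc:
            if token.text == "abcefg":
                # Assumed heads: attach "abc" to "efg" and make "efg" its own head.
                retokenizer.split(
                    token, ["abc", "efg"], heads=[(token, 1), (token, 1)]
                )
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("split_abcefg")
ruler = nlp.add_pipe("span_ruler")
# Two-token pattern, matching the tokens produced by the split.
ruler.add_patterns(
    [{"label": "MY_LABEL", "pattern": [{"ORTH": "abc"}, {"ORTH": "efg"}]}]
)

doc = nlp("see abcefg here")
span = doc.spans["ruler"][0]  # "ruler" is the SpanRuler's default spans key

tokens = [t.text for t in span]           # the two split tokens
span_text = span.text                     # slice of the original document text
joined = " ".join(t.text for t in span)   # token texts joined with a space

print(tokens, repr(span_text), repr(joined))
```

Because the split does not insert whitespace into the document, the first sub-token has no trailing space, so `span.text` still reads as the original string. If the space-separated form is what is wanted, joining the token texts as in `joined` is one way to get it.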
Thanks for any help.
Minimal Reproducible Example:
Actual Output:
Expected output: