Skip to content

Report partial matches #678

Closed as not planned
Closed as not planned
@Ekleog

Description

@Ekleog

Hello,

First, this issue's description may overlap a bit with #425, by attempting to do what #25 wanted.
However, I believe the solution described below to handle it would probably be much easier, and I think it would be a good thing to handle my use case even if #425 were implemented (even though it could be worked around by using longest_potential_match_len in such a world).
So, I thought it was worth it suggesting this intermediate solution, that would hopefully be reasonable to implement even though #425 appears to not be as of today, and solve at least some issues :)

Problem description

I'm trying to use regexes from streaming nom parsers, for parsing SMTP information.

In order to do so, the nom way of doing things is basically to, when receiving a packet, optimistically parse it, and if the parser reports that it needs additional data to proceed, then wait for enough data to come in and parse again -- this way most matches are done in one pass, even though it means one additional case for the rare case where a message was split between two packets.

However, in order to use regexes from this, there has to be a way to be able to know whether the regex didn't match because it couldn't match, or if it didn't match because it didn't have a complete match yet.

Proposed solution

The solution I can think of is simple with only basic theory of automata -- not having any idea how regex's engine works in practice, I cannot say whether it would be simple in practice, though.

The idea is the following: if the automaton has reached a failing state before the end of the input, return that there can be no match. If it has not, and reaches the end of the input in a start or intermediate state, return that there may be a match that would come up if there were more input.

Note that this scheme works well only for regexes that start with ^ or similar -- but any regex that doesn't would potentially match later on, which means that this would be working as intended.

As a further evolution that might be enough to handle part of #425's problem too, it might make sense to also report when the regex was last in its start state, so that the caller could just buffer starting from the last time the regex was in its start state and re-run from that last position the next time. It's less optimized than doing it all in one pass, but hopefully would be fast enough (by making the user rescan only the parts of the buffer that could contain a match), while unlocking additional use cases.

What do you think about this idea? Would it be simple enough to deserve implementing, until #425 would be solved, or would I be better off implementing this myself based on the regex_automata crate?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions