-
Notifications
You must be signed in to change notification settings - Fork 1.6k
<regex>
: Revise parsing of escape sequences
#5380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
<regex>
: Revise parsing of escape sequences
#5380
Conversation
I am reviewing this now - thanks as always for the extremely detailed PR description with citations, it is super appreciated for these complex changes. I edited it to fix a typo that made a citation harder to understand - please meow if I messed it up instead.
Yes, I would be in favor of that, in a followup PR - it seems analogous to implementing a C++ Defect Report, and I suspect 99.9% of users would already expect that behavior. |
Thanks for this amazing overhaul and extreme attention to detail! 😻 I pushed one significant change to the product code and a few to the test code - please meow if I messed anything up. |
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
Resolved adjacent-edit conflicts with #5364 in
|
⬅️ 🏃 🎉 |
Fixes #5244. There is still one remaining issue related to escape sequences in basic regular expressions. I created #5379 to track this.
Changes
[\b]
now matches "backspace" (see the semantics of the production in the recent ECMAScript standard, unchanged since the 3rd edition). I repurposed and renamed one of the_L_flags
to enable this oddity in the parser for ECMAScript. This flag is only set for ECMAScript and had previously been used to represent that the backslash can escape itself in ECMAScript -- but this is superfluous because the flag_L_ident_ECMA
implies the same.[\z]
and[\Z]
no longer match a custom character class "z" provided by a custom regex traits class. Instead, they will match the characters "z" and "Z", as the escapes also do outside of squared character classes. (This change will also make it easier to solve <regex> mishandles locale-based character classes outside of the char range #992 in the future.)\0
represents the literal NUL. The standard also demands that such an escape must not be followed by another decimal digit (which is still specified this way in recent ECMAScript standards), so we now throw anerror_escape
in this case. As a consequence of this change, a backreference escape can no longer be written with leading zeros (see also the most recent ECMAScript standard).\|
is no longer interpreted to mean the character|
. Instead, it is rejected, since its meaning is undefined according to the POSIX standard version referenced by the C++ standard (Section 9.3.2). Users can write|
instead.\|
implementation-defined (in a not particularly good way, I think): It states that\|
could either be interpreted as the character|
or an actual pipe (Section 9.3.2 in the recent standard). The standard then continues that a future version may require\|
to denote a pipe.In any case, the interpretation of
\|
as the character|
, which was implemented before this PR, is not the one recommended by the current version of the standard.]
is accepted now, as the POSIX standard requires because it does not declare]
to be a special character). (I noticed this issue when I added[\]]
as a test case.)_L_flags
for this, so I check one of the flags that are set for basic regular expressions but not for any of the other supported grammars. Setting_L_paren_bal
for basic regular expressions would make the parser accept wrong usages of\}
and\)
as well.[\a]
matches the backslash and a. This was how these escape sequences were interpreted before this PR.std::regex
does in ECMAScript mode, too), but others stick closer to the underlying extended regular expressions semantics and treat them as backslash + character.\}
and\]
in extended regular expressionsAccording to the POSIX standard referenced by the C++ standard, the escape sequence
\}
is undefined in extended regular expressions. However, more recent POSIX versions do define it and it also feels unbalanced to reject\}
while\{
is accepted, so I opted to keep supporting the escape sequence\}
.\[
but not\]
. More recent POSIX versions define\]
as well in basic and extended regular expressions. Should we add support for\]
?