Skip to content

<regex>: Revise parsing of escape sequences #5380

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

muellerj2
Copy link
Contributor

@muellerj2 muellerj2 commented Mar 30, 2025

Fixes #5244. There is still one remaining issue related to escape sequences in basic regular expressions. I created #5379 to track this.

Changes

  • ECMAScript: [\b] now matches "backspace" (see the semantics of the production in the recent ECMAScript standard, unchanged since the 3rd edition). I repurposed and renamed one of the _L_flags to enable this oddity in the parser for ECMAScript. This flag is only set for ECMAScript and had previously been used to represent that the backslash can escape itself in ECMAScript -- but this is superfluous because the flag _L_ident_ECMA implies the same.
  • ECMAScript: [\z] and [\Z] no longer match a custom character class "z" provided by a custom regex traits class. Instead, they will match the characters "z" and "Z", as the escapes also do outside of squared character classes. (This change will also make it easier to solve <regex> mishandles locale-based character classes outside of the char range #992 in the future.)
  • ECMAScript: The standard states that a decimal escape starting with digit \0 represents the literal NUL. The standard also demands that such an escape must not be followed by another decimal digit (which is still specified this way in recent ECMAScript standards), so we now throw an error_escape in this case. As a consequence of this change, a backreference escape can no longer be written with leading zeros (see also the most recent ECMAScript standard).
  • basic, grep: The escape sequence \| is no longer interpreted to mean the character |. Instead, it is rejected, since its meaning is undefined according to the POSIX standard version referenced by the C++ standard (Section 9.3.2). Users can write | instead.
    • The current edition of the POSIX standard makes the meaning of \|implementation-defined (in a not particularly good way, I think): It states that \| could either be interpreted as the character | or an actual pipe (Section 9.3.2 in the recent standard). The standard then continues that a future version may require \| to denote a pipe.
      In any case, the interpretation of \| as the character |, which was implemented before this PR, is not the one recommended by the current version of the standard.
  • basic, grep: The regular expression ] is accepted now, as the POSIX standard requires because it does not declare ] to be a special character). (I noticed this issue when I added [\]] as a test case.)
    • Unfortunately, there is no truly suitable _L_flags for this, so I check one of the flags that are set for basic regular expressions but not for any of the other supported grammars. Setting _L_paren_bal for basic regular expressions would make the parser accept wrong usages of \} and \) as well.
  • awk: As required by the POSIX standard (see here), octal escape sequences and all awk-specific character escapes are now recognized within squared character classes (before this PR, only the escapes \a, \b, \f, \n, \r, \t and \v were recognized). On the other hand, unknown escape sequences are now rejected in character classes, because their meaning is ambiguous:
    • The underlying POSIX extended regular expressions don't treat \ as an escape character in character classes, so [\a] matches the backslash and a. This was how these escape sequences were interpreted before this PR.
    • The POSIX standard leaves the interpretation of unknown sequences undefined in the awk specification.
    • Many awk implementations treat such unknown sequences as identity escapes (as std::regex does in ECMAScript mode, too), but others stick closer to the underlying extended regular expressions semantics and treat them as backslash + character.

\} and \] in extended regular expressions

According to the POSIX standard referenced by the C++ standard, the escape sequence \} is undefined in extended regular expressions. However, more recent POSIX versions do define it and it also feels unbalanced to reject \} while \{ is accepted, so I opted to keep supporting the escape sequence \}.

  • The parser supports the escape sequence \[ but not \]. More recent POSIX versions define \] as well in basic and extended regular expressions. Should we add support for \]?

@muellerj2 muellerj2 requested a review from a team as a code owner March 30, 2025 11:33
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Mar 30, 2025
@StephanTLavavej StephanTLavavej added bug Something isn't working regex meow is a substring of homeowner labels Mar 31, 2025
@StephanTLavavej StephanTLavavej self-assigned this Mar 31, 2025
@StephanTLavavej
Copy link
Member

I am reviewing this now - thanks as always for the extremely detailed PR description with citations, it is super appreciated for these complex changes. I edited it to fix a typo that made a citation harder to understand - please meow if I messed it up instead.

The parser supports the escape sequence \[ but not \]. More recent POSIX versions define \] as well in basic and extended regular expressions. Should we add support for \]?

Yes, I would be in favor of that, in a followup PR - it seems analogous to implementing a C++ Defect Report, and I suspect 99.9% of users would already expect that behavior.

@StephanTLavavej
Copy link
Member

Thanks for this amazing overhaul and extreme attention to detail! 😻 I pushed one significant change to the product code and a few to the test code - please meow if I messed anything up.

@StephanTLavavej StephanTLavavej removed their assignment Apr 9, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Apr 9, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Apr 9, 2025
@StephanTLavavej StephanTLavavej self-assigned this Apr 9, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej
Copy link
Member

Resolved adjacent-edit conflicts with #5364 in <regex> where:

  • it changed to _ClassAtom(bool) and this PR changed to _ClassEscape3()
  • it changed to _ClassAtom(const bool _Initial) and this PR expanded "check for valid escape sequence"

@StephanTLavavej StephanTLavavej merged commit 0e63fc6 into microsoft:main Apr 10, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Apr 10, 2025
@StephanTLavavej
Copy link
Member

⬅️ 🏃 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regex meow is a substring of homeowner
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

<regex>: Some escape sequences are mishandled
2 participants