Experimental support for unicode identifiers #1499

WardBrian · 2025-03-10T19:34:55Z

Rebase of #1407. Copied from there:

I know for a fact that this requires a few changes to the JSON data handler in stan-dev/stan so that it recognizes Unicode names, which is just one of several reasons this is a draft.

The basic overview:
OCaml strings should be treated mostly like arrays of bytes, and ocamllex handles inputs as sequences of bytes. We can define rules that recognize UTF-8-compatible bytes, and then validate them after the fact against Unicode Standard Annex #31 (UAX31): Unicode Identifiers.
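
To make the two-stage idea concrete, here is a minimal sketch in Python rather than the OCaml used in this PR. The byte-level pattern, the function name `validate_identifier`, and the use of `str.isidentifier()` as a stand-in for the UAX31 check are all assumptions for illustration. Python's identifier rules are based on UAX31 but are not necessarily identical to what stanc3 enforces.

```python
import re

# Stage 1 (hypothetical lexer rule): accept ASCII identifier bytes plus any
# byte with the high bit set, which is how UTF-8 encodes non-ASCII characters.
# This deliberately over-accepts; it is not Stan's actual grammar.
CANDIDATE = re.compile(rb"[A-Za-z_\x80-\xff][A-Za-z0-9_\x80-\xff]*")


def validate_identifier(raw: bytes) -> str:
    """Return the identifier as text, or raise ValueError if it is invalid."""
    if CANDIDATE.fullmatch(raw) is None:
        raise ValueError(f"not a candidate identifier: {raw!r}")

    # Fast path: pure ASCII input needs no Unicode validation.
    if raw.isascii():
        return raw.decode("ascii")

    # Stage 2a: the bytes must be well-formed UTF-8.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"invalid UTF-8 in identifier: {raw!r}") from exc

    # Stage 2b: stand-in for the UAX31 check (assumption: Python's own
    # identifier rules approximate the rules the compiler would apply).
    if not text.isidentifier():
        raise ValueError(f"not a valid Unicode identifier: {text!r}")
    return text
```

The lexer stays simple because it only looks at bytes; everything that needs Unicode knowledge happens once per candidate, after the match.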

For most of the compiler we then treat identifiers as plain bytes, which is fine because we never do things like take subslices of variable names.
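
A small Python sketch of why the bytes-only view is safe (the symbol table and helper names here are hypothetical, not stanc3 code): whole-name operations such as equality and hashing behave the same on the encoded bytes, while slicing is the one operation that could split a multi-byte character.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Report whether the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A toy symbol table keyed by the raw UTF-8 bytes of each name.
symbol_table = {"σ_y".encode("utf-8"): "real"}

raw_name = "σ_y".encode("utf-8")

# Lookups compare whole byte strings, so no Unicode awareness is needed.
found = symbol_table[raw_name]

# The byte length differs from the character count: σ takes two bytes.
byte_length = len(raw_name)

# Slicing by byte index can cut a character in half, which is why it
# matters that the compiler never subslices variable names.
broken_prefix_ok = is_valid_utf8(raw_name[:1])
```

Here `found` is `"real"`, `byte_length` is 4 for a three-character name, and `broken_prefix_ok` is `False`.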

Finally, at output time, we already had string escaping (since #952), so most of the code-gen works fine. Recent C++ standards require that compilers support UTF-8 names based on the same UAX31 rules linked above, but older ones may not. For now it generates "universal character names", which seem to be the legacy version of this, so hopefully older compilers will be happy with it.
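
As a rough illustration of the universal character name (UCN) form, the sketch below rewrites each non-ASCII code point as `\uXXXX` (four hex digits) or `\UXXXXXXXX` (eight hex digits), which is the C++ syntax for UCNs. It is written in Python for brevity; the function name `to_ucn` is hypothetical, and the exact output produced by this PR's code-gen may differ.

```python
def to_ucn(identifier: str) -> str:
    """Rewrite non-ASCII characters as C++ universal character names."""
    pieces = []
    for ch in identifier:
        code_point = ord(ch)
        if code_point < 0x80:
            # ASCII characters are emitted unchanged.
            pieces.append(ch)
        elif code_point <= 0xFFFF:
            # Basic Multilingual Plane: \u followed by four hex digits.
            pieces.append(f"\\u{code_point:04X}")
        else:
            # Supplementary planes: \U followed by eight hex digits.
            pieces.append(f"\\U{code_point:08X}")
    return "".join(pieces)


print(to_ucn("α_1"))  # prints \u03B1_1
```

The generated C++ source stays pure ASCII this way, so it does not depend on the compiler's handling of the source file encoding.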

Submission Checklist

Run unit tests
Documentation
- If a user-facing change was made, the documentation PR is here:

Release notes

stanc3 can now accept a flag --allow-unicode which enables the use of non-ASCII characters in Stan files. All files are expected to be encoded in UTF-8.
This is experimental and may not work with older C++ compilers.

Copyright and Licensing

By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the BSD 3-clause license (https://opensource.org/licenses/BSD-3-Clause)

codecov · 2025-03-10T21:23:11Z

Codecov Report

Attention: Patch coverage is 74.39024% with 21 lines in your changes missing coverage. Please review.

Project coverage is 89.41%. Comparing base (3aa50a9) to head (68db0d8).

Files with missing lines	Patch %	Lines
src/common/Unicode.ml	55.81%	19 Missing ⚠️
src/frontend/Identifiers.ml	91.66%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1499      +/-   ##
==========================================
- Coverage   89.52%   89.41%   -0.12%     
==========================================
  Files          66       68       +2     
  Lines        9684     9757      +73     
==========================================
+ Hits         8670     8724      +54     
- Misses       1014     1033      +19

Files with missing lines	Coverage Δ
src/driver/Entry.ml	`93.75% <100.00%> (ø)`
src/driver/Flags.ml	`100.00% <ø> (ø)`
src/frontend/Errors.ml	`100.00% <100.00%> (ø)`
src/stan_math_backend/Cpp.ml	`88.91% <100.00%> (+0.11%)`	⬆️
src/stan_math_backend/Cpp_Json.ml	`100.00% <100.00%> (ø)`
src/stanc/CLI.ml	`98.11% <100.00%> (+0.01%)`	⬆️
src/frontend/Identifiers.ml	`91.66% <91.66%> (ø)`
src/common/Unicode.ml	`55.81% <55.81%> (ø)`

... and 1 file with indirect coverage changes

WardBrian force-pushed the redo-unicode branch from bf92135 to 7b5b915 · March 11, 2025 14:28

WardBrian added 8 commits March 13, 2025 14:27

First pass at unicode support

092bf06

Fast path for ASCII

2d466b1

Hide feature behind flag, more testing

796e711

Split utf-16 tests to improve output readability

0ba1518

Reorganize, code-gen UCNs

25625ce

More tests

2e8b66c

Add some notes on potential further validation

0012e20

Likely needs more library support: dbuenzli/uucp#25

Update CI

68db0d8

WardBrian force-pushed the redo-unicode branch from d25b890 to 68db0d8 · March 13, 2025 18:34
