Commit 2b691e4

Clarify string descriptions
This came out of seeing what (if anything) we want to merge out of #875; it makes the following copy editing changes:

- Clearly list allowed codepoints at the start of every string type.
- "Unicode" on its own doesn't necessarily mean anything; UTF-16 or UCS-2 is "Unicode". Perhaps a bit pedantic, but "UTF-8" or "codepoints" are "more correct". Similarly, a "character" or "Unicode character" is quite a tricky thing to define. Multiple codepoints can be one "character". Most of the time "codepoint" is really what's intended.
- Don't link to some random page for base64 decode. Guess we could link to Wikipedia, but seems a bit redundant to me.

Fixes #875
1 parent 736fdc5 commit 2b691e4

File tree

1 file changed

+25
-25
lines changed

toml.md

Lines changed: 25 additions & 25 deletions
@@ -260,12 +260,11 @@ The above TOML maps to the following JSON.
260260
## String
261261

262262
There are four ways to express strings: basic, multi-line basic, literal, and
263-
multi-line literal. All strings must contain only Unicode characters.
263+
multi-line literal. All strings must be encoded as UTF-8.
264264

265-
**Basic strings** are surrounded by quotation marks (`"`). Any Unicode character
266-
may be used except those that must be escaped: quotation mark, backslash, and
267-
the control characters other than tab (U+0000 to U+0008, U+000A to U+001F,
268-
U+007F).
265+
**Basic strings** are surrounded by quotation marks (`"`). Any codepoint may be
266+
used except those that must be escaped: quotation mark, backslash, and the
267+
control characters other than tab (U+0000 to U+0008, U+000A to U+001F, U+007F).
269268

270269
```toml
271270
str = "I'm a string. \"You can quote me\". Name\tJos\xE9\nLocation\tSF."
@@ -282,19 +281,18 @@ For convenience, some popular characters have a compact escape sequence.
282281
\e - escape (U+001B)
283282
\" - quote (U+0022)
284283
\\ - backslash (U+005C)
285-
\xHH - unicode (U+00HH)
286-
\uHHHH - unicode (U+HHHH)
287-
\UHHHHHHHH - unicode (U+HHHHHHHH)
284+
\xHH - codepoint (U+00HH)
285+
\uHHHH - codepoint (U+HHHH)
286+
\UHHHHHHHH - codepoint (U+HHHHHHHH)
288287
```
289288

290-
Any Unicode character may be escaped with the `\xHH`, `\uHHHH`, or `\UHHHHHHHH`
289+
Any codepoint may be escaped with the `\xHH`, `\uHHHH`, or `\UHHHHHHHH`
291290
forms. The escape codes must be Unicode
292291
[scalar values](https://unicode.org/glossary/#unicode_scalar_value).
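
As an illustration of the scalar-value rule, here is a minimal Python sketch of the check a parser might apply to the number decoded from a `\xHH`, `\uHHHH`, or `\UHHHHHHHH` escape. The function name `is_scalar_value` is hypothetical and not part of TOML or any library; the ranges follow the Unicode definition of a scalar value (any codepoint except the surrogates U+D800 to U+DFFF).

```python
def is_scalar_value(codepoint: int) -> bool:
    """Return True if `codepoint` is a Unicode scalar value.

    Hypothetical helper for illustration only.
    """
    # Valid codepoints run from U+0000 to U+10FFFF.
    if not 0 <= codepoint <= 0x10FFFF:
        return False
    # Surrogates (U+D800 to U+DFFF) are codepoints but not scalar values.
    return not 0xD800 <= codepoint <= 0xDFFF
```

Under this sketch, `\xE9` (U+00E9) would be accepted, while `\uD800` and `\U00110000` would be rejected.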
293292

294-
Keep in mind that all TOML strings are sequences of Unicode characters, _not_
295-
byte sequences. For binary data, avoid using these escape codes. Instead,
296-
external binary-to-text encoding strategies, like hexadecimal sequences or
297-
[Base64](https://www.base64decode.org/), are recommended for converting between
293+
All TOML strings are UTF-8 encoded, _not_ byte sequences. For binary data, avoid
294+
using these escape codes. Instead, external binary-to-text encoding strategies,
295+
like hexadecimal sequences or base64, are recommended for converting between
298296
bytes and strings.
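
One way to follow this recommendation, sketched in Python with the standard `base64` module. The helper names `bytes_to_toml_text` and `toml_text_to_bytes` are hypothetical; the sketch assumes the application, not the TOML parser, is responsible for the conversion.

```python
import base64


def bytes_to_toml_text(data: bytes) -> str:
    """Encode raw bytes as base64 text suitable for a TOML string value."""
    # The standard base64 alphabet (A-Z, a-z, 0-9, '+', '/', '=') contains
    # no quotation marks, backslashes, or control characters, so the result
    # needs no escaping inside a basic string.
    return base64.b64encode(data).decode("ascii")


def toml_text_to_bytes(text: str) -> bytes:
    """Decode a base64 string read from a TOML document back to bytes."""
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(text, validate=True)
```

For example, the two bytes `0x00 0xFF` would be stored as `blob = "AP8="` and decoded back by the application after parsing.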
299297

300298
All other escape sequences not listed above are reserved; if they are used, TOML
@@ -307,6 +305,11 @@ like to break up a very long string into multiple lines. TOML makes this easy.
307305
side and allow newlines. A newline immediately following the opening delimiter
308306
will be trimmed. All other whitespace and newline characters remain intact.
309307

308+
Any codepoint may be used except those that must be escaped: backslash and the
309+
control characters other than tab, line feed, and carriage return (U+0000 to
310+
U+0008, U+000B, U+000C, U+000E to U+001F, U+007F). Carriage returns (U+000D) are
311+
only allowed as part of a newline sequence.
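
The codepoint ranges above can be read as a simple predicate. The following Python sketch is illustrative only: the name `must_escape_in_multiline_basic` is hypothetical, and it deliberately does not check the two context-dependent rules (a carriage return must be part of a newline sequence, and a run of three quotation marks would end the string).

```python
def must_escape_in_multiline_basic(ch: str) -> bool:
    """Return True if `ch` cannot appear unescaped in a multi-line basic string.

    Hypothetical helper; it looks at a single codepoint in isolation.
    """
    cp = ord(ch)
    if ch == "\\":
        return True
    # Control characters other than tab (U+0009), line feed (U+000A),
    # and carriage return (U+000D).
    return (
        cp <= 0x08
        or cp in (0x0B, 0x0C)
        or 0x0E <= cp <= 0x1F
        or cp == 0x7F
    )
```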
312+
310313
```toml
311314
str1 = """
312315
Roses are red
@@ -349,11 +352,6 @@ str3 = """\
349352
"""
350353
```
351354

352-
Any Unicode character may be used except those that must be escaped: backslash
353-
and the control characters other than tab, line feed, and carriage return
354-
(U+0000 to U+0008, U+000B, U+000C, U+000E to U+001F, U+007F). Carriage returns
355-
(U+000D) are only allowed as part of a newline sequence.
356-
357355
You can write a quotation mark, or two adjacent quotation marks, anywhere inside
358356
a multi-line basic string. They can also be written just inside the delimiters.
359357

@@ -371,8 +369,10 @@ If you're a frequent specifier of Windows paths or regular expressions, then
371369
having to escape backslashes quickly becomes tedious and error-prone. To help,
372370
TOML supports literal strings which do not allow escaping at all.
373371

374-
**Literal strings** are surrounded by single quotes. Like basic strings, they
375-
must appear on a single line:
372+
**Literal strings** are surrounded by single quotes and don't support `\`
373+
escapes. Any codepoint may be used except for control characters other than tab.
374+
375+
Like basic strings, they must appear on a single line:
376376

377377
```toml
378378
# What you see is what you get.
@@ -383,11 +383,13 @@ regex = '<\i\c*\s*>'
383383
```
384384

385385
Since there is no escaping, there is no way to write a single quote inside a
386-
literal string enclosed by single quotes. Luckily, TOML supports a multi-line
387-
version of literal strings that solves this problem.
386+
literal string enclosed by single quotes. TOML supports a multi-line version of
387+
literal strings that solves this problem.
388388

389389
**Multi-line literal strings** are surrounded by three single quotes on each
390-
side and allow newlines. Like literal strings, there is no escaping whatsoever.
390+
side and allow newlines. Like literal strings, there are no `\` escapes. Any
391+
codepoint may be used except for control characters other than tab.
392+
391393
A newline immediately following the opening delimiter will be trimmed. TOML
392394
parsers must normalize newlines in the same manner as multi-line basic strings.
393395

@@ -417,8 +419,6 @@ apos15 = "Here are fifteen apostrophes: '''''''''''''''"
417419
str = ''''That,' she said, 'is still pointless.''''
418420
```
419421

420-
Control characters other than tab are not permitted in a literal string.
421-
422422
## Integer
423423

424424
Integers are whole numbers. Positive numbers may be prefixed with a plus sign.
