byte sequence <> string #146
This LGTM (though I'm not qualified to comment on the content of the note).
infra.bs
Outdated
@@ -313,6 +313,14 @@ contains, in the range 0x61 (a) to 0x7A (z), inclusive, by 0x20.
<a>byte sequence</a> <var>B</var>, if the <a>byte-lowercase</a> of <var>A</var> is the
<a>byte-lowercase</a> of <var>B</var>.

<p>To <dfn export for="byte sequence">map</dfn> a <a>byte sequence</a> <var>input</var> to a
I wonder about a name like "simple decode" or "numeric decode" or similar. "map" seems pretty generic and conflicts with my intuition that "the process of mapping a byte sequence to a string is called decoding".
"Simple decode" seems bad because we should not use "simple". That's up to the reader to determine. And "numeric decode" also isn't really clear.
I'd be okay with "iso-8859-1 decode" and "iso-8859-1 encode". Those don't conflict with the Encoding Standard and a note can explain the difference with the iso-8859-1 label which maps to windows-1252.
infra.bs
Outdated
bytes, in the same order.

<p class=note>This matches the behavior of a ISO-8859-1 decoder, except no such decoder is defined
as the Encoding Standard maps "iso-8859-1" to <a>windows-1252</a>. [[ENCODING]]
We should be pretty cautious with this note as some people I know are very confused/mad by the mapping of iso-8859-1 to windows-1252. I'd rather phrase this in a way that makes it sound less like the Encoding Standard is being "wrong". And I think it's a valuable opportunity to spell out this confusing space for people. Including myself.
Something like
This process corresponds somewhat to what has historically been called ISO-8859-1 decoding, at least for some ranges of bytes. However, in reality, software that performs such decoding has behavior for bytes in the range 128–255 that goes beyond the ISO-8859-1 specification, leading to the modern practice of aliasing "iso-8859-1" to "windows-1252" and performing the corresponding algorithm from the Encoding Standard. The algorithm here is different, treating all bytes directly as their corresponding code points instead of using the windows-1252 index for bytes in the range 128–255. [[ENCODING]] [[ISO8859]].
([[ISO8859]] would go to https://www.iso.org/standard/28245.html I guess? Which doesn't let you view the PDF without paying? Wikipedia also gives ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf as a link.)
Then we'll need a similar reverse note.
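To make the difference described in the proposed note concrete, here is a minimal Python sketch. It is an illustration under assumptions, not part of the PR: Python's `latin-1` codec stands in for the direct byte-to-code-point mapping, and Python's `cp1252` codec only approximates the Encoding Standard's windows-1252 (Python rejects a few bytes, such as 0x81, that the Encoding Standard does map).

```python
# Illustrative only: Python codecs stand in for the algorithms discussed here.
data = bytes([0x41, 0x80, 0xE9])

# Direct mapping: every byte becomes the code point with the same value.
direct = data.decode("latin-1")

# windows-1252-style decoding: byte 0x80 is looked up in an index instead.
indexed = data.decode("cp1252")

print([hex(ord(c)) for c in direct])   # ['0x41', '0x80', '0xe9']
print([hex(ord(c)) for c in indexed])  # ['0x41', '0x20ac', '0xe9']
```

Only the middle byte differs: the direct mapping yields U+0080, while the index-based decoder yields U+20AC.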
I see you went with just removing the note entirely instead of expanding it. I guess that is OK, but I would like to add one myself later, as I believe this is a confusing area that it would be good to clarify.
infra.bs
Outdated
@@ -313,6 +313,11 @@ contains, in the range 0x61 (a) to 0x7A (z), inclusive, by 0x20.
<a>byte sequence</a> <var>B</var>, if the <a>byte-lowercase</a> of <var>A</var> is the
<a>byte-lowercase</a> of <var>B</var>.

<p>To <dfn export>isomorphic decode</dfn> a <a>byte sequence</a> <var>input</var>, return a
<a>string</a> whose <a for=string>length</a> is equal to <var>input</var>'s
<a for="byte sequence">length</a> and whose code points have the same values as <var>input</var>'s
Link code point
infra.bs
Outdated
<p>To <dfn export>isomorphic encode</dfn> a <a>string</a> <var>input</var>, run these steps:</p>

<ol>
<li><p>Assert: <var>input</var> contains no code points greater than U+00FF.
Link code point
infra.bs
Outdated
<ol>
<li><p>Assert: <var>input</var> contains no code points greater than U+00FF.

<li><p>Return a byte sequence whose <a for="byte sequence">length</a> is equal to
Link byte sequence
infra.bs
Outdated
<li><p>Return a byte sequence whose <a for="byte sequence">length</a> is equal to
<var>input</var>'s <a for=string>length</a> and whose bytes have the same values as
<var>input</var>'s code points, in the same order.
Link code point
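For readers following along, the two encode steps quoted in the hunk above can be sketched in Python as follows. The function name `isomorphic_encode` is a hypothetical helper chosen for this illustration, and using a Python `assert` to model the spec's "Assert" step is an assumption about how one might check it, not something the PR prescribes.

```python
def isomorphic_encode(text: str) -> bytes:
    """Sketch of the quoted encode steps (hypothetical helper, not a real API)."""
    # Step 1: the input must contain no code points greater than U+00FF.
    assert all(ord(ch) <= 0xFF for ch in text), "code point greater than U+00FF"
    # Step 2: one byte per code point, same values, same order.
    return bytes(ord(ch) for ch in text)


encoded = isomorphic_encode("caf\u00e9")
print(encoded)  # b'caf\xe9'
```

The result has the same length as the input string (4), which is the length requirement stated in the second step.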
infra.bs
Outdated
<li><p>Assert: <var>input</var> contains no code points greater than U+00FF.

<li><p>Return a byte sequence whose <a for="byte sequence">length</a> is equal to
<var>input</var>'s <a for=string>length</a> and whose bytes have the same values as
Link bytes
infra.bs
Outdated
<p>To <dfn export>isomorphic decode</dfn> a <a>byte sequence</a> <var>input</var>, return a
<a>string</a> whose <a for=string>length</a> is equal to <var>input</var>'s
<a for="byte sequence">length</a> and whose code points have the same values as <var>input</var>'s
bytes, in the same order.
Link bytes
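As a rough illustration of the decode definition quoted in the hunk above, here is a minimal Python sketch. The name `isomorphic_decode` is a hypothetical helper for this example only; Python strings are sequences of code points, which is assumed to be close enough to the spec's notion of a string for values up to U+00FF.

```python
def isomorphic_decode(data: bytes) -> str:
    """Sketch of the quoted decode definition (hypothetical helper)."""
    # One code point per byte, same values, same order.
    return "".join(chr(b) for b in data)


result = isomorphic_decode(bytes([0x48, 0x69, 0xFF]))
print(len(result), [hex(ord(c)) for c in result])  # 3 ['0x48', '0x69', '0xff']
```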
Since ISO-8859-1 is such a mess it seemed better to avoid getting into it. We can just teach people about isomorphic encode/decode and if they want to learn more that's more stackoverflow/IRC territory.
I guess that's fair.
Wait! We should also update
|
This would improve whatwg/fetch#579 and is also needed to properly define ByteString in IDL. There might be a few other places that can use it. This also makes polymorphic algorithms less needed as we just map back and forth.
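The "map back and forth" property mentioned above can be checked with a short Python sketch. This is an illustration under assumptions rather than anything from the PR: it inlines the byte-to-code-point and code-point-to-byte mappings and confirms that all 256 byte values survive a round trip unchanged.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decode direction: each byte becomes the code point with the same value.
as_string = "".join(chr(b) for b in all_bytes)

# Encode direction: each code point becomes the byte with the same value.
back = bytes(ord(ch) for ch in as_string)

print(back == all_bytes)                       # True
print(all(ord(ch) <= 0xFF for ch in as_string))  # True
```

Because the decoded string never contains a code point above U+00FF, encoding it again always satisfies the precondition of the encode steps.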