byte sequence <> string #146

Merged
annevk merged 5 commits into master from annevk/byte-sequence-string
Aug 15, 2017

Conversation

annevk
Member

@annevk annevk commented Aug 15, 2017

This would improve whatwg/fetch#579 and is also needed to properly define ByteString in IDL. There might be a few other places that can use it. This also reduces the need for polymorphic algorithms, since we can just map back and forth.



@annevk annevk requested review from tobie and domenic August 15, 2017 13:04
Collaborator

@tobie tobie left a comment

This LGTM (though I'm not qualified to comment on the content of the note).

infra.bs Outdated
@@ -313,6 +313,14 @@ contains, in the range 0x61 (a) to 0x7A (z), inclusive, by 0x20.
<a>byte sequence</a> <var>B</var>, if the <a>byte-lowercase</a> of <var>A</var> is the
<a>byte-lowercase</a> of <var>B</var>.

<p>To <dfn export for="byte sequence">map</dfn> a <a>byte sequence</a> <var>input</var> to a
Member

I wonder about a name like "simple decode" or "numeric decode" or similar. "map" seems pretty generic and conflicts with my intuition that "the process of mapping a byte sequence to a string is called decoding".

Member Author

"Simple decode" seems bad because we should not use "simple". That's up to the reader to determine. And "numeric decode" also isn't really clear.

I'd be okay with "iso-8859-1 decode" and "iso-8859-1 encode". Those don't conflict with the Encoding Standard and a note can explain the difference with the iso-8859-1 label which maps to windows-1252.

infra.bs Outdated
bytes, in the same order.

<p class=note>This matches the behavior of a ISO-8859-1 decoder, except no such decoder is defined
as the Encoding Standard maps "iso-8859-1" to <a>windows-1252</a>. [[ENCODING]]
Member

@domenic domenic Aug 15, 2017

We should be pretty cautious with this note, as some people I know are very confused by, or mad about, the mapping of iso-8859-1 to windows-1252. I'd rather phrase this in a way that makes it sound less like the Encoding Standard is being "wrong". And I think it's a valuable opportunity to spell out this confusing space for people. Including myself.

Something like

This process corresponds somewhat to what has historically been called ISO-8859-1 decoding, at least for some ranges of bytes. However, in reality, software that performs such decoding has behavior for bytes in the range 128–255 that goes beyond the ISO-8859-1 specification, leading to the modern practice of aliasing "iso-8859-1" to "windows-1252" and performing the corresponding algorithm from the Encoding Standard. The algorithm here is different, treating all bytes directly as their corresponding code points instead of using the windows-1252 index for bytes in the range 128–255. [[ENCODING]] [[ISO8859]].

([[ISO8859]] would go to https://www.iso.org/standard/28245.html I guess? Which doesn't let you view the PDF without paying? Wikipedia also gives ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf as a link.)

Then we'll need a similar reverse note.
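
As a rough illustration of the difference the proposed note describes (not part of the pull request), here is a minimal Python sketch. Python's "cp1252" codec is used only as a stand-in for the Encoding Standard's windows-1252 decoder; that is an assumption, since cp1252 leaves a handful of bytes undefined and so does not match the standard for every input. The sample bytes are arbitrary.

```python
# Arbitrary sample: "A", a byte in the 0x80-0x9F range, and 0xE9.
sample = bytes([0x41, 0x80, 0xE9])

# Direct mapping: each byte value becomes the code point with the same value.
direct = "".join(chr(b) for b in sample)

# Stand-in for windows-1252 decoding (assumption: Python's cp1252 codec).
via_cp1252 = sample.decode("cp1252")

print([hex(ord(c)) for c in direct])      # ['0x41', '0x80', '0xe9']
print([hex(ord(c)) for c in via_cp1252])  # ['0x41', '0x20ac', '0xe9']
```

In this sample only the byte 0x80 decodes differently: the direct mapping yields U+0080, while the windows-1252 table yields U+20AC.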

Member

@domenic domenic left a comment

I see you went with just removing the note entirely instead of expanding it. I guess that is OK, but I would like to add one myself later, as I believe this is a confusing area that it would be good to clarify.

infra.bs Outdated
@@ -313,6 +313,11 @@ contains, in the range 0x61 (a) to 0x7A (z), inclusive, by 0x20.
<a>byte sequence</a> <var>B</var>, if the <a>byte-lowercase</a> of <var>A</var> is the
<a>byte-lowercase</a> of <var>B</var>.

<p>To <dfn export>isomorphic decode</dfn> a <a>byte sequence</a> <var>input</var>, return a
<a>string</a> whose <a for=string>length</a> is equal to <var>input</var>'s
<a for="byte sequence">length</a> and whose code points have the same values as <var>input</var>'s
Member

Link code point

infra.bs Outdated
<p>To <dfn export>isomorphic encode</dfn> a <a>string</a> <var>input</var>, run these steps:</p>

<ol>
<li><p>Assert: <var>input</var> contains no code points greater than U+00FF.
Member

Link code point

infra.bs Outdated
<ol>
<li><p>Assert: <var>input</var> contains no code points greater than U+00FF.

<li><p>Return a byte sequence whose <a for="byte sequence">length</a> is equal to
Member

Link byte sequence

infra.bs Outdated

<li><p>Return a byte sequence whose <a for="byte sequence">length</a> is equal to
<var>input</var>'s <a for=string>length</a> and whose bytes have the same values as
<var>input</var>'s code points, in the same order.
Member

Link code point
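
The two "isomorphic encode" steps quoted in the hunks above could be sketched in Python roughly as follows. This is an illustration only, not code from the pull request: the function name is hypothetical, and raising ValueError is this sketch's stand-in for the spec's "Assert" step, which describes a caller error rather than prescribing an exception.

```python
def isomorphic_encode(text: str) -> bytes:
    """Hypothetical helper mirroring the two quoted steps."""
    # Step 1 (the "Assert"): no code point may be greater than U+00FF.
    if any(ord(c) > 0xFF for c in text):
        raise ValueError("input contains a code point greater than U+00FF")
    # Step 2: one byte per code point, same value, same order.
    return bytes(ord(c) for c in text)


encoded = isomorphic_encode("caf\u00e9")
print(encoded)       # b'caf\xe9'
print(len(encoded))  # 4, the same as the input string's length
```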

infra.bs Outdated
<li><p>Assert: <var>input</var> contains no code points greater than U+00FF.

<li><p>Return a byte sequence whose <a for="byte sequence">length</a> is equal to
<var>input</var>'s <a for=string>length</a> and whose bytes have the same values as
Member

Link bytes

infra.bs Outdated
<p>To <dfn export>isomorphic decode</dfn> a <a>byte sequence</a> <var>input</var>, return a
<a>string</a> whose <a for=string>length</a> is equal to <var>input</var>'s
<a for="byte sequence">length</a> and whose code points have the same values as <var>input</var>'s
bytes, in the same order.
Member

Link bytes
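
For readers who want to see the "isomorphic decode" definition quoted above in executable form, here is a minimal Python sketch. It is an illustration under assumptions, not code from the pull request: the function name is hypothetical, and the final comparison with Python's "latin-1" codec is a property of Python, not something the pull request states.

```python
def isomorphic_decode(data: bytes) -> str:
    """Hypothetical helper: same length, same values, same order."""
    # Each byte value becomes the code point with the same value.
    return "".join(chr(b) for b in data)


all_bytes = bytes(range(256))
decoded = isomorphic_decode(all_bytes)

# The result has the same length as the input ...
assert len(decoded) == len(all_bytes)
# ... and its code points have the same values as the bytes, in order.
assert [ord(c) for c in decoded] == list(all_bytes)
# Python's "latin-1" codec gives the same result for every byte value.
assert decoded == all_bytes.decode("latin-1")
```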

@annevk
Member Author

annevk commented Aug 15, 2017

Since ISO-8859-1 is such a mess it seemed better to avoid getting into it. We can just teach people about isomorphic encode/decode and if they want to learn more that's more stackoverflow/IRC territory.

@domenic
Member

domenic commented Aug 15, 2017

I guess that's fair.

@domenic
Member

domenic commented Aug 15, 2017

Wait!

We should also update

To get a byte sequence out of a string, use an operation such as UTF-8 encode from the Encoding Standard. [ENCODING]
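
To illustrate why it matters which operation that sentence points readers to, here is a small Python sketch (not from the pull request; how the sentence was eventually reworded is not shown in this conversation). It compares UTF-8 encode with an inline approximation of isomorphic encode on a single arbitrary code point.

```python
text = "\u00e9"  # a single code point, U+00E9

# UTF-8 encode: code points above U+007F take more than one byte.
utf8_bytes = text.encode("utf-8")

# Isomorphic encode, sketched inline: one byte per code point, same value.
# Only valid here because every code point in text is at most U+00FF.
iso_bytes = bytes(ord(c) for c in text)

print(utf8_bytes)  # b'\xc3\xa9'
print(iso_bytes)   # b'\xe9'
```

The two operations agree only for strings made up entirely of code points in the range U+0000 to U+007F.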

@annevk annevk merged commit 8d7447e into master Aug 15, 2017
@annevk annevk deleted the annevk/byte-sequence-string branch August 15, 2017 16:54
3 participants