byte sequence <> string #146

annevk · 2017-08-15T12:57:39Z

This would improve whatwg/fetch#579 and is also needed to properly define ByteString in IDL. There might be a few other places that can use it. This also makes polymorphic algorithms less necessary, since we can just map back and forth.

tobie

This LGTM (though I'm not qualified to comment on the content of the note).

domenic · 2017-08-15T13:47:00Z

infra.bs

@@ -313,6 +313,14 @@ contains, in the range 0x61 (a) to 0x7A (z), inclusive, by 0x20.
 <a>byte sequence</a> <var>B</var>, if the <a>byte-lowercase</a> of <var>A</var> is the
 <a>byte-lowercase</a> of <var>B</var>.

+<p>To <dfn export for="byte sequence">map</dfn> a <a>byte sequence</a> <var>input</var> to a


I wonder about a name like "simple decode" or "numeric decode" or similar. "map" seems pretty generic and conflicts with my intuition that "the process of mapping a byte sequence to a string is called decoding".

annevk

"Simple decode" seems bad because we should not use "simple". That's up to the reader to determine. And "numeric decode" also isn't really clear.

I'd be okay with "iso-8859-1 decode" and "iso-8859-1 encode". Those don't conflict with the Encoding Standard, and a note can explain how they differ from the iso-8859-1 label, which maps to windows-1252.

domenic · 2017-08-15T14:02:37Z

infra.bs

+bytes, in the same order.
+
+<p class=note>This matches the behavior of a ISO-8859-1 decoder, except no such decoder is defined
+as the Encoding Standard maps "iso-8859-1" to <a>windows-1252</a>. [[ENCODING]]


We should be pretty cautious with this note, as some people I know are very confused by, or mad about, the mapping of iso-8859-1 to windows-1252. I'd rather phrase this in a way that makes it sound less like the Encoding Standard is being "wrong". And I think it's a valuable opportunity to spell out this confusing space for people. Including myself.

Something like

This process corresponds somewhat to what has historically been called ISO-8859-1 decoding, at least for some ranges of bytes. However, in reality, software that performs such decoding has behavior for bytes in the range 128–255 that goes beyond the ISO-8859-1 specification, leading to the modern practice of aliasing "iso-8859-1" to "windows-1252" and performing the corresponding algorithm from the Encoding Standard. The algorithm here is different, treating all bytes directly as their corresponding code points instead of using the windows-1252 index for bytes in the range 128–255. [[ENCODING]] [[ISO8859]].

([[ISO8859]] would go to https://www.iso.org/standard/28245.html I guess? Which doesn't let you view the PDF without paying? Wikipedia also gives ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf as a link.)

Then we'll need a similar reverse note.
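To make the difference described in the proposed note concrete, here is a minimal Python sketch. It is an illustration only, not part of the PR: `sample` is a made-up input, and Python's `cp1252` codec stands in for windows-1252 (it is not identical to the Encoding Standard's index for every byte, but it agrees on the bytes used here).

```python
# Hypothetical sample input: "A", then 0x80 and 0xE9 from the 128-255 range.
sample = bytes([0x41, 0x80, 0xE9])

# Treat every byte directly as the code point with the same value.
direct = "".join(chr(b) for b in sample)

# Decode with Python's cp1252 codec (assumed stand-in for windows-1252).
win = sample.decode("cp1252")

print([hex(ord(c)) for c in direct])  # ['0x41', '0x80', '0xe9']
print([hex(ord(c)) for c in win])     # ['0x41', '0x20ac', '0xe9']
```

Only byte 0x80 differs in this sample: the direct mapping yields U+0080, while the windows-1252 index yields U+20AC.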

domenic

I see you went with just removing the note entirely instead of expanding it. I guess that is OK, but I would like to add one myself later, as I believe this is a confusing area that it would be good to clarify.

domenic · 2017-08-15T16:27:43Z

infra.bs

@@ -313,6 +313,11 @@ contains, in the range 0x61 (a) to 0x7A (z), inclusive, by 0x20.
 <a>byte sequence</a> <var>B</var>, if the <a>byte-lowercase</a> of <var>A</var> is the
 <a>byte-lowercase</a> of <var>B</var>.

+<p>To <dfn export>isomorphic decode</dfn> a <a>byte sequence</a> <var>input</var>, return a
+<a>string</a> whose <a for=string>length</a> is equal to <var>input</var>'s
+<a for="byte sequence">length</a> and whose code points have the same values as <var>input</var>'s


Link code point

domenic · 2017-08-15T16:27:53Z

infra.bs

+<p>To <dfn export>isomorphic encode</dfn> a <a>string</a> <var>input</var>, run these steps:</p>
+
+<ol>
+ <li><p>Assert: <var>input</var> contains no code points greater than U+00FF.


Link code point

domenic · 2017-08-15T16:28:01Z

infra.bs

+<ol>
+ <li><p>Assert: <var>input</var> contains no code points greater than U+00FF.
+
+ <li><p>Return a byte sequence whose <a for="byte sequence">length</a> is equal to


Link byte sequence

domenic · 2017-08-15T16:28:10Z

infra.bs

+
+ <li><p>Return a byte sequence whose <a for="byte sequence">length</a> is equal to
+ <var>input</var>'s <a for=string>length</a> and whose bytes have the same values as
+ <var>input</var>'s code points, in the same order.


Link code point

domenic · 2017-08-15T16:28:15Z

infra.bs

+ <li><p>Assert: <var>input</var> contains no code points greater than U+00FF.
+
+ <li><p>Return a byte sequence whose <a for="byte sequence">length</a> is equal to
+ <var>input</var>'s <a for=string>length</a> and whose bytes have the same values as


domenic · 2017-08-15T16:28:20Z

infra.bs

+<p>To <dfn export>isomorphic decode</dfn> a <a>byte sequence</a> <var>input</var>, return a
+<a>string</a> whose <a for=string>length</a> is equal to <var>input</var>'s
+<a for="byte sequence">length</a> and whose code points have the same values as <var>input</var>'s
+bytes, in the same order.
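
The two definitions quoted in these hunks can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the spec's own text: the function names simply mirror the `<dfn>` terms, and a Python `str` is assumed to be an adequate stand-in for an Infra string.

```python
def isomorphic_decode(data: bytes) -> str:
    """Return a string whose code points have the same values as the bytes."""
    return "".join(chr(b) for b in data)


def isomorphic_encode(text: str) -> bytes:
    """Return bytes whose values equal the string's code points."""
    # Mirrors the "Assert" step: no code point may be greater than U+00FF.
    assert all(ord(c) <= 0xFF for c in text), "code point greater than U+00FF"
    return bytes(ord(c) for c in text)


# Every possible byte value survives a decode/encode round trip.
all_bytes = bytes(range(256))
round_tripped = isomorphic_encode(isomorphic_decode(all_bytes))
print(round_tripped == all_bytes)  # True
```

Because the mapping is one-to-one over 0x00 to 0xFF, the lengths of input and output always match, which is what the definitions require.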


annevk · 2017-08-15T16:33:46Z

Since ISO-8859-1 is such a mess, it seemed better to avoid getting into it. We can just teach people about isomorphic encode/decode, and if they want to learn more, that's more Stack Overflow/IRC territory.

domenic · 2017-08-15T16:35:26Z

I guess that's fair.

domenic · 2017-08-15T16:36:54Z

Wait!

We should also update

To get a byte sequence out of a string, use an operation such as UTF-8 encode from the Encoding Standard. [ENCODING]
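
As a hedged illustration of why that sentence needs updating once a second string-to-bytes operation exists, the Python sketch below compares UTF-8 encoding with the byte-per-code-point mapping this PR adds. The input `text` is a made-up example.

```python
# Hypothetical input containing one non-ASCII code point, U+00E9.
text = "caf\u00e9"

# UTF-8 encode: U+00E9 becomes the two bytes 0xC3 0xA9.
utf8_bytes = text.encode("utf-8")

# Byte-per-code-point mapping: U+00E9 becomes the single byte 0xE9.
iso_bytes = bytes(ord(c) for c in text)

print(utf8_bytes)  # b'caf\xc3\xa9'
print(iso_bytes)   # b'caf\xe9'
```

The two operations agree on ASCII input and diverge for code points from U+0080 to U+00FF, so the guidance has to say which one a caller should pick.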

byte sequence <> string

0633fc8

annevk requested review from tobie and domenic August 15, 2017 13:04

tobie approved these changes Aug 15, 2017

use <code>

d7c3cc4

domenic requested changes Aug 15, 2017

isomorphic

f5e739a

annevk requested a review from domenic August 15, 2017 15:12

annevk mentioned this pull request Aug 15, 2017

Switch btoa() to take ByteString and atob() to return a ByteString? whatwg/html#2911

Open

domenic reviewed Aug 15, 2017

link things

3723a5f

domenic approved these changes Aug 15, 2017

adjust note

9570220

annevk merged commit 8d7447e into master Aug 15, 2017

annevk deleted the annevk/byte-sequence-string branch August 15, 2017 16:54
