Big5 encoding mishandles some trailing bytes, with possible XSS #171

Open
@mihnita

Some byte sequences are valid lead/trail pairs according to the description at https://encoding.spec.whatwg.org/#big5-decoder, but have no corresponding Unicode code point in the index-big5.txt mapping table.

In this case the first byte is converted to U+FFFD, but the second one is left "as is". In some cases that second byte can be a backslash (\, 0x5C), which can "escape" the closing quote of a JavaScript string and potentially result in XSS exploits.

Example (attached): the 83 5C sequence
According to the algorithm at https://encoding.spec.whatwg.org/#big5-decoder

  • the first byte is a lead byte (case 5, byte is in the range 0x81 to 0xFE, inclusive)
  • for the second byte we have case 3
    • 3.1. Let offset be 0x40 (byte is less than 0x7F)
    • 3.2. byte is in the range 0x40 to 0x7E, inclusive, so pointer is set to (lead − 0x81) × 157 + (byte − offset). Here that is (0x83 − 0x81) × 157 + (0x5C − 0x40), which is 0x156 (342 decimal)
    • there is no mapping for pointer 342 (0x156) in index-big5.txt, and because byte is an ASCII byte we prepend byte to stream (case 3.6) and return error (case 3.7)

The end result is a U+FFFD (from the error) followed by a 5C (the trailing byte, "as is")
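The steps above can be traced with a small Python sketch. It is not a complete Big5 decoder: the `index` argument is a hypothetical stand-in for index-big5.txt (passed as an empty dict here, since the only thing that matters is that pointer 342 has no entry), the function names are made up for illustration, and the spec's special two-code-point pointers are left out.

```python
REPLACEMENT = "\ufffd"  # what "return error" becomes in replacement mode


def big5_pointer(lead: int, trail: int):
    """Pointer for a lead/trail pair, or None if trail is out of range."""
    offset = 0x40 if trail < 0x7F else 0x62
    if 0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE:
        return (lead - 0x81) * 157 + (trail - offset)
    return None


def decode_big5_sketch(data: bytes, index: dict) -> str:
    """Follow the decoder steps quoted above; `index` maps pointer -> code point."""
    out = []
    queue = list(data)
    lead = 0
    while queue:
        byte = queue.pop(0)
        if lead:
            pointer = big5_pointer(lead, byte)
            lead = 0
            code_point = index.get(pointer) if pointer is not None else None
            if code_point is not None:
                out.append(chr(code_point))
                continue
            if byte <= 0x7F:
                queue.insert(0, byte)  # "prepend byte to stream"
            out.append(REPLACEMENT)    # "return error"
            continue
        if byte <= 0x7F:
            out.append(chr(byte))      # ASCII passes through unchanged
        elif 0x81 <= byte <= 0xFE:
            lead = byte                # remember the lead byte
        else:
            out.append(REPLACEMENT)
    if lead:
        out.append(REPLACEMENT)        # stream ended after a lead byte
    return "".join(out)


print(hex(big5_pointer(0x83, 0x5C)))            # 0x156
print(repr(decode_big5_sketch(b"\x83\x5c", {})))  # '\ufffd\\'
```

Because the trail byte is pushed back onto the stream, it is decoded again as plain ASCII on the next iteration, which is how the backslash survives.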

You can see this in the attached file.
When opened in either Chrome or Firefox, the text is rendered as the "Unicode REPLACEMENT CHARACTER" (correct) followed by a backslash (incorrect).

This is a valid lead-trail byte sequence that should either be replaced by one single U+FFFD character, or by two U+FFFD characters, whatever the policy is (I think the second case).

But the trailing byte should definitely not be left "as is".

The possible exploit can use the trailing byte (which is backslash) to escape the end of a string, for example.
Checking the console of Firefox you will see the 'SyntaxError: "" string literal contains an unescaped line break' message. In Chrome the message is 'Uncaught SyntaxError: Invalid or unexpected token'
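To make the escaping concrete, here is a rough Python illustration. The contents of the attached file are not reproduced here, so the string literal below is a hypothetical example of what a page might contain, and `closing_quote_is_escaped` is a helper invented for this sketch. It only counts trailing backslashes and is not a JavaScript parser.

```python
def closing_quote_is_escaped(js_string_literal: str) -> bool:
    """True if the final double quote is preceded by an odd number of backslashes."""
    if not js_string_literal.endswith('"'):
        raise ValueError("expected text ending in a double quote")
    body = js_string_literal[:-1]
    backslashes = len(body) - len(body.rstrip("\\"))
    return backslashes % 2 == 1


# Hypothetical source bytes: a quoted string containing only 83 5C.
# After decoding as described above, the text handed to the JS parser is:
decoded = '"\ufffd\\"'   # quote, U+FFFD, backslash, quote

print(closing_quote_is_escaped(decoded))   # True: the closing quote is now \"
print(closing_quote_is_escaped('"abc"'))   # False: an ordinary literal
```

With the closing quote escaped, the literal runs on into whatever follows it, which is consistent with the unterminated-string errors both browsers report.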

I did not check, but this might also happen in other DBCS (Double-Byte Character Set) encodings whose second byte can fall in the ASCII range (for instance Shift JIS?).

big5.zip

Labels: needs implementer interest, security/privacy