Comment tokenizer can break when using mbstring function overloading #692

Lexx918 · 2015-08-25T16:54:33Z

I tried as it is written here: guidelines for contributing
And..
Tests for a clean repository:

Tests: 222, Assertions: 69, Failures: 5, Skipped: 4.

And tests after applying my patch:

Tests: 222, Assertions: 69,              Skipped: 4.

:)

P.S. Reason for changes: http://pear.php.net/bugs/bug.php?id=20943

aik099 · 2015-08-25T17:08:38Z

CodeSniffer/Tokenizers/Comment.php

@@ -54,7 +54,7 @@ public function tokenizeString($string, $eolChar, $stackPtr)
        */

        for ($c = 0; $c < $numChars; $c++) {
-            if ($string[$c] !== '/' && $string[$c] !== '*') {
+            if (substr($string, $c, 1) !== '/' && substr($string, $c, 1) !== '*') {


To my knowledge old code does the same as new code. Am I missing something.

Old code won't work on PHP7, but new code will however. That's good thing.

Old code works perfectly fine on PHP7. The unit tests have been running on it most of the year.

aik099 · 2015-08-25T17:10:32Z

Are you passing the --encoding="UTF-8" to the phpcs/phpcbf command? If you're not passing, then it won't work.

Lexx918 · 2015-08-25T17:50:19Z

We use --encoding=utf-8. All files in UTF-8, and the comments contain many russian letters.

gsherwood · 2015-08-25T22:09:55Z

If you've just switched to substr(), it means you're probably overloading the built-in string functions using the multibyte extension. PHP_CodeSniffer doesn't do multibyte support like that. It can't rely on people doing the overloading.

PHP_CodeSniffer uses the iconv extension and the encoding that you supply instead. It uses the special iconv_* functions to do length and substring actions where needed.

So in this case, there may very well be a problem, but your fix wont actually solve it for the majority of users. None of your submissions actually displayed your comment correctly, so I'll see if I can replicate it myself and add a test. Then I'll see which of those function calls needs fixing and if I can do it with iconv.

gsherwood · 2015-08-25T22:38:45Z

Also, if you're getting unit test failures, you probably don't have the iconv extension installed. The failing tests are likely due to the multibyte test comments that are in the test files.

gsherwood · 2015-08-25T22:49:55Z

I've tried replicating commenting problems with various multi-byte characters and I can't do it.

I remember writing this code and not needing any iconv functions in it because it doesn't actually matter if the lengths of the tokens are wrong. The chars I'm looking for are fixed lengths (like * and / and newlines) so comments tokenize fine even if you don't pass the encoding option to PHPCS. I've tested this and it is still working fine.

So I don't know what it going on with your code. Maybe it is a result of a problem higher up. Just check you have the iconv extension installed or the encoding option is doing nothing. Besides that, if you can give me a code snippet that fails, I'll do more testing on it.

aik099 · 2015-08-26T06:22:47Z

The chars I'm looking for are fixed lengths (like * and / and newlines) so comments tokenize fine even if you don't pass the encoding option to PHPCS.

This may not be true if say a russian letter in UTF-8 is encoded as 3 bytes, while ASCII code of individual byte matches the * or /. I'm not sure if that's even would happen in real life with any UTF-8 byte.

Lexx918 · 2015-08-26T06:51:19Z

Just check you have the iconv extension ..

Maybe you are right. I tried to repeat the bug locally on Win7. The newly set php, cloned repository phpcs and exec:
i:\www\PHP_CodeSniffer-orig> php scripts/phpcs -vv --standard=Generic --encoding=utf-8 --report-width=180 I:\www\phpDoc_utf8_test.php
result: http://my.jetscreenshot.com/5130/20150826-mjyj-78kb.jpg
Then repeat it on Ubuntu.
result: http://my.jetscreenshot.com/5130/20150826-jjnn-94kb.jpg
Source phpDoc_utf8_test.php here https://gist.github.com/Lexx918/8c1ed67e93de2b1ba263 (utf-8, unix-line-ending)
What should I check in the settings of php/OS first?

Lexx918 · 2015-08-26T07:05:25Z

Nonetheless:

Warning Internally, PHP strings are byte arrays. As a result, accessing or modifying a string using array brackets is not multi-byte safe, and should only be done with strings that are in a single-byte encoding such as ISO-8859-1. http://php.net/manual/en/language.types.string.php
The optional parameter offset can be used to specify the alternate place from which to start the search (in bytes). http://php.net/preg_match

gsherwood · 2015-08-27T00:20:13Z

Internally, PHP strings are byte arrays. As a result, accessing or modifying a string using array brackets is not multi-byte safe

I'm well aware, but it shouldn't matter in this case because of the specific checks I am doing. As I said above, I remember extensively testing this code while I was writing it because I originally had iconv stuff everywhere and eventually realised it didn't need to support multi-byte strings because they are fixed elsewhere in the main tokenizer. And performance is a huge deal for tokenizing, so even switching to substr or preg* sees a massive decrease on large code bases.

Lexx918 · 2015-08-27T10:39:49Z

Show me, please, how you check description of tag phpDoc.
https://github.com/squizlabs/PHP_CodeSniffer/blob/master/CodeSniffer/Standards/Generic/Tests/Commenting/DocCommentUnitTest.inc#L174
There is only a description phpDoc. Instead description tags. And it is checked using preg_match.
Try to write a few tags with descriptions of varying lengths.

gsherwood · 2015-09-03T04:40:17Z

Show me, please, how you check description of tag phpDoc.

I'm not really sure what you are asking. Are you reporting a related bug? It's probably going to be easier if you post code that produces errors so I can create tests and fix the problems.

Lexx918 · 2015-09-07T15:48:11Z

Ooops! Sorry. I forgot to say, we are using the option:

mbstring.internal_encoding = UTF-8
mbstring.func_overload = 7

As a result, some string functions work with symbols (substr, strlen), and some with bytes (string[index], preg_match). The offset error accumulates. This causes an error in the later iterations of the loop.
If you enable this option, you will see bugs in the tests.
And no, I can't turn them off :(

…tion overloading

gsherwood · 2015-10-06T23:33:06Z

I've added a commit that changes the way the comment tokenizer works so that references to positions inside strings work correctly. This works because those positions are only used for whitespace and known characters (/ and *) before any multi-byte characters are found on a line, and each line is processed independently instead of as one big string.

gsherwood · 2015-11-25T04:56:08Z

Just realised this should have been closed last month.

Added support multi-byte encoding on Comment-Tokenizer

d39c0c3

aik099 reviewed Aug 25, 2015
View reviewed changes

gsherwood added the Status: awaiting feedback label Aug 25, 2015

gsherwood changed the title ~~Added support multi-byte encoding on Comment-Tokenizer~~ Comment tokenizer can break when using mbstring function overloading Oct 6, 2015

gsherwood added Type: bug and removed Status: awaiting feedback labels Oct 6, 2015

gsherwood added a commit that referenced this pull request Oct 6, 2015

Fixed bug #692 : Comment tokenizer can break when using mbstring func…

dae32f3

…tion overloading

gsherwood closed this Nov 25, 2015

Comment tokenizer can break when using mbstring function overloading #692

Comment tokenizer can break when using mbstring function overloading #692

Uh oh!

Conversation

Lexx918 commented Aug 25, 2015

Uh oh!

aik099 Aug 25, 2015

Choose a reason for hiding this comment

Uh oh!

gsherwood Aug 25, 2015

Choose a reason for hiding this comment

Uh oh!

aik099 commented Aug 25, 2015

Uh oh!

Lexx918 commented Aug 25, 2015

Uh oh!

gsherwood commented Aug 25, 2015

Uh oh!

gsherwood commented Aug 25, 2015

Uh oh!

gsherwood commented Aug 25, 2015

Uh oh!

aik099 commented Aug 26, 2015

Uh oh!

Lexx918 commented Aug 26, 2015

Uh oh!

Lexx918 commented Aug 26, 2015

Uh oh!

gsherwood commented Aug 27, 2015

Uh oh!

Lexx918 commented Aug 27, 2015

Uh oh!

gsherwood commented Sep 3, 2015

Uh oh!

Lexx918 commented Sep 7, 2015

Uh oh!

gsherwood commented Oct 6, 2015

Uh oh!

gsherwood commented Nov 25, 2015

Uh oh!

Uh oh!