Skip to content

Comment tokenizer can break when using mbstring function overloading #692

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Comment tokenizer can break when using mbstring function overloading #692

wants to merge 1 commit into from

Conversation

Lexx918
Copy link

@Lexx918 Lexx918 commented Aug 25, 2015

I tried as it is written here: guidelines for contributing
And..
Tests for a clean repository:

Tests: 222, Assertions: 69, Failures: 5, Skipped: 4.

And tests after applying my patch:

Tests: 222, Assertions: 69,              Skipped: 4.

:)

P.S. Reason for changes: http://pear.php.net/bugs/bug.php?id=20943

@@ -54,7 +54,7 @@ public function tokenizeString($string, $eolChar, $stackPtr)
*/

for ($c = 0; $c < $numChars; $c++) {
if ($string[$c] !== '/' && $string[$c] !== '*') {
if (substr($string, $c, 1) !== '/' && substr($string, $c, 1) !== '*') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my knowledge old code does the same as new code. Am I missing something.

Old code won't work on PHP7, but new code will however. That's good thing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old code works perfectly fine on PHP7. The unit tests have been running on it most of the year.

@aik099
Copy link
Contributor

aik099 commented Aug 25, 2015

Are you passing the --encoding="UTF-8" to the phpcs/phpcbf command? If you're not passing, then it won't work.

@Lexx918
Copy link
Author

Lexx918 commented Aug 25, 2015

We use --encoding=utf-8. All files in UTF-8, and the comments contain many russian letters.

@gsherwood
Copy link
Member

If you've just switched to substr(), it means you're probably overloading the built-in string functions using the multibyte extension. PHP_CodeSniffer doesn't do multibyte support like that. It can't rely on people doing the overloading.

PHP_CodeSniffer uses the iconv extension and the encoding that you supply instead. It uses the special iconv_* functions to do length and substring actions where needed.

So in this case, there may very well be a problem, but your fix wont actually solve it for the majority of users. None of your submissions actually displayed your comment correctly, so I'll see if I can replicate it myself and add a test. Then I'll see which of those function calls needs fixing and if I can do it with iconv.

@gsherwood
Copy link
Member

Also, if you're getting unit test failures, you probably don't have the iconv extension installed. The failing tests are likely due to the multibyte test comments that are in the test files.

@gsherwood
Copy link
Member

I've tried replicating commenting problems with various multi-byte characters and I can't do it.

I remember writing this code and not needing any iconv functions in it because it doesn't actually matter if the lengths of the tokens are wrong. The chars I'm looking for are fixed lengths (like * and / and newlines) so comments tokenize fine even if you don't pass the encoding option to PHPCS. I've tested this and it is still working fine.

So I don't know what it going on with your code. Maybe it is a result of a problem higher up. Just check you have the iconv extension installed or the encoding option is doing nothing. Besides that, if you can give me a code snippet that fails, I'll do more testing on it.

@aik099
Copy link
Contributor

aik099 commented Aug 26, 2015

The chars I'm looking for are fixed lengths (like * and / and newlines) so comments tokenize fine even if you don't pass the encoding option to PHPCS.

This may not be true if say a russian letter in UTF-8 is encoded as 3 bytes, while ASCII code of individual byte matches the * or /. I'm not sure if that's even would happen in real life with any UTF-8 byte.

@Lexx918
Copy link
Author

Lexx918 commented Aug 26, 2015

Just check you have the iconv extension ..

Maybe you are right. I tried to repeat the bug locally on Win7. The newly set php, cloned repository phpcs and exec:
i:\www\PHP_CodeSniffer-orig> php scripts/phpcs -vv --standard=Generic --encoding=utf-8 --report-width=180 I:\www\phpDoc_utf8_test.php
result: http://my.jetscreenshot.com/5130/20150826-mjyj-78kb.jpg
Then repeat it on Ubuntu.
result: http://my.jetscreenshot.com/5130/20150826-jjnn-94kb.jpg
Source phpDoc_utf8_test.php here https://gist.github.com/Lexx918/8c1ed67e93de2b1ba263 (utf-8, unix-line-ending)
What should I check in the settings of php/OS first?

@Lexx918
Copy link
Author

Lexx918 commented Aug 26, 2015

Nonetheless:

  • Warning Internally, PHP strings are byte arrays. As a result, accessing or modifying a string using array brackets is not multi-byte safe, and should only be done with strings that are in a single-byte encoding such as ISO-8859-1. http://php.net/manual/en/language.types.string.php
  • The optional parameter offset can be used to specify the alternate place from which to start the search (in bytes). http://php.net/preg_match

@gsherwood
Copy link
Member

Internally, PHP strings are byte arrays. As a result, accessing or modifying a string using array brackets is not multi-byte safe

I'm well aware, but it shouldn't matter in this case because of the specific checks I am doing. As I said above, I remember extensively testing this code while I was writing it because I originally had iconv stuff everywhere and eventually realised it didn't need to support multi-byte strings because they are fixed elsewhere in the main tokenizer. And performance is a huge deal for tokenizing, so even switching to substr or preg* sees a massive decrease on large code bases.

@Lexx918
Copy link
Author

Lexx918 commented Aug 27, 2015

Show me, please, how you check description of tag phpDoc.
https://github.com/squizlabs/PHP_CodeSniffer/blob/master/CodeSniffer/Standards/Generic/Tests/Commenting/DocCommentUnitTest.inc#L174
There is only a description phpDoc. Instead description tags. And it is checked using preg_match.
Try to write a few tags with descriptions of varying lengths.

@gsherwood
Copy link
Member

Show me, please, how you check description of tag phpDoc.

I'm not really sure what you are asking. Are you reporting a related bug? It's probably going to be easier if you post code that produces errors so I can create tests and fix the problems.

@Lexx918
Copy link
Author

Lexx918 commented Sep 7, 2015

Ooops! Sorry. I forgot to say, we are using the option:

mbstring.internal_encoding = UTF-8
mbstring.func_overload = 7

As a result, some string functions work with symbols (substr, strlen), and some with bytes (string[index], preg_match). The offset error accumulates. This causes an error in the later iterations of the loop.
If you enable this option, you will see bugs in the tests.
And no, I can't turn them off :(

@gsherwood gsherwood changed the title Added support multi-byte encoding on Comment-Tokenizer Comment tokenizer can break when using mbstring function overloading Oct 6, 2015
gsherwood added a commit that referenced this pull request Oct 6, 2015
@gsherwood
Copy link
Member

I've added a commit that changes the way the comment tokenizer works so that references to positions inside strings work correctly. This works because those positions are only used for whitespace and known characters (/ and *) before any multi-byte characters are found on a line, and each line is processed independently instead of as one big string.

@gsherwood
Copy link
Member

Just realised this should have been closed last month.

@gsherwood gsherwood closed this Nov 25, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants