rework tests for the PL_check change #17351

tonycoz · 2019-12-09T00:30:56Z

This re-works #17240 while retaining the original commit and functional change.

Nicholas has an alternative fix that's more back-portable (currently in #17343), but he also says in Re-implement IO::Handle getline and getlines as XS code:

PL_check itself should not need this level of
protection because it should be part of the interpreter's state, not global
state.

So this patch is still needed to fix the thread-safety issue.

If Nicholas's patch is applied the original io/getlines.t test is no longer meaningful, so I added similar code to XS::APItest, a test case for it and removed io/getlines.t.

jkeenan · 2019-12-09T01:22:09Z

I was about to push a smoke-me branch for this, but decided to first test it locally. One test failure:

$ cd t;./perl harness -v ../ext/XS-APItest/t/pl_check.t;cd -
This Perl not built to support threads
Compilation failed in require at t/pl_check.t line 10.
BEGIN failed--compilation aborted at t/pl_check.t line 10.
Dubious, test returned 255 (wstat 65280, 0xff00)
No subtests run 

Test Summary Report
-------------------
../ext/XS-APItest/t/pl_check.t (Wstat: 65280 Tests: 0 Failed: 0)
  Non-zero exit status: 255
  Parse errors: No plan found in TAP output

Thank you very much.
Jim Keenan

jkeenan · 2019-12-09T01:28:36Z

I was about to push a smoke-me branch for this, but decided to first test it locally. One test failure:

$ cd t;./perl harness -v ../ext/XS-APItest/t/pl_check.t;cd -
This Perl not built to support threads
Compilation failed in require at t/pl_check.t line 10.
BEGIN failed--compilation aborted at t/pl_check.t line 10.
Dubious, test returned 255 (wstat 65280, 0xff00)
No subtests run 

Test Summary Report
-------------------
../ext/XS-APItest/t/pl_check.t (Wstat: 65280 Tests: 0 Failed: 0)
  Non-zero exit status: 255
  Parse errors: No plan found in TAP output

Thank you very much.
Jim Keenan

Try this:

diff --git a/ext/XS-APItest/t/pl_check.t b/ext/XS-APItest/t/pl_check.t
index 60bd94852a..e359ab8700 100644
--- a/ext/XS-APItest/t/pl_check.t
+++ b/ext/XS-APItest/t/pl_check.t
@@ -6,13 +6,10 @@ use Config;
 BEGIN {
     require '../../t/test.pl';
 }
+BEGIN { plan skip_all => 'no threads' unless $Config{useithreads} }
 
 use threads;
 
-
-$Config{usethreads}
-  or plan skip_all => "test requires threads";
-
 # do not use XS::APItest in this test
 
 use constant thread_count => 20;

jkeenan · 2019-12-09T13:09:05Z

@tonycoz I think these code changes have performance problems and are not ready for merging.

I corrected the smoke-me branch I had created from your p.r. and proceeded to smoke it. On FreeBSD-11, running my most comprehensive configuration, the test suite as a whole aborted at the 9 hour max I have set in smokecurrent_config. I got failures in these individual test files:

$ grep -E '(Failed test|FAILED)' smokecurrent.log
# Failed test 4 -    right time (waited 6 secs for 3-sec alarm) at op/alarm.t line 54
[2019-12-09 04:10:58+0000] ../t/op/alarm.t.............................................FAILED
# Failed test 4 -    right time (waited 6 secs for 3-sec alarm) at op/alarm.t line 54
[2019-12-09 04:30:32+0000] ../t/op/alarm.t.............................................FAILED
# Failed test 955 - Test Perl 73464 at ./test.pl line 1062
[2019-12-09 05:11:30+0000] ../t/re/pat_thr.t...........................................FAILED
# Failed test 58 - should not COW on long string with substr and m//g at ./re/speed.t line 156
[2019-12-09 06:10:16+0000] ../t/re/speed_thr.t.........................................FAILED
# Failed test 4 -    right time (waited 16 secs for 3-sec alarm) at op/alarm.t line 54
[2019-12-09 07:00:31+0000] ../t/op/alarm.t.............................................FAILED
# Failed test 4 -    right time (waited 6 secs for 3-sec alarm) at op/alarm.t line 54
[2019-12-09 08:14:20+0000] ../t/op/alarm.t.............................................FAILED
# Failed test 4 -    right time (waited 17 secs for 3-sec alarm) at op/alarm.t line 54
[2019-12-09 09:07:47+0000] ../t/op/alarm.t.............................................FAILED
# Failed test 955 - Test Perl 73464 at ./test.pl line 1062
[2019-12-09 09:54:34+0000] ../t/re/pat_thr.t...........................................FAILED
# Failed test 955 - Test Perl 73464 at ./test.pl line 1062
# Failed test 4 -    right time (waited 6 secs for 3-sec alarm) at op/alarm.t line 54
[2019-12-09 10:42:06+0000] ../t/re/pat_thr.t...........................................FAILED
[2019-12-09 10:42:06+0000] ../t/op/alarm.t.............................................FAILED
# Failed test 955 - Test Perl 73464 at ./test.pl line 1062
# Failed test 4 -    right time (waited 16 secs for 3-sec alarm) at op/alarm.t line 54

Here is an example of the failure in t/re/pat_thr.t:

# Failed test 955 - Test Perl 73464 at ./test.pl line 1062
#      got "# Test process timed out - terminating"
# expected "Timeout"
# PROG:
#
#                 BEGIN{require q(test.pl);}
#                 watchdog(3);
#                 $SIG{ALRM} = sub {print "Timeout\n"; exit(1)};
#                 alarm 1;
#                 $_ = "a" x 1000 . "b" x 1000 . "c" x 1000;
#                 /.*a.*b.*c.*[de]/;
# 
# STATUS: 9

While it's not unusual to get failures during smoke-testing in t/re/pat_thr.t and t/op/alarm.t in some of the resource-constrained VMs sitting on this host, it is rare to get timeout failures on the host itself. And it's even rarer for the smoke-test run to exhaust the allocated 9 hours.

(Because the run aborted, no report was generated, but I can gzip the log and send it to you if you want. See http://perl5.test-smoke.org/report/101038 for a more typical smoke-test run on this host with this configuration.)

Granted, these failures could be due to transitory conditions on the host. But this needs more investigation before proceeding.

Thank you very much.
Jim Keenan

tonycoz · 2019-12-09T23:54:32Z

@tonycoz I think these code changes have performance problems and are not ready for merging.

I corrected the smoke-me branch I had created from your p.r. and proceeded to smoke it. On FreeBSD-11, running my most comprehensive configuration, the test suite as a whole aborted at the 9 hour max I have set in smokecurrent_config. I got failures in these individual test files:

$ grep -E '(Failed test|FAILED)' smokecurrent.log
# Failed test 4 -    right time (waited 6 secs for 3-sec alarm) at op/alarm.t line 54
[2019-12-09 04:10:58+0000] ../t/op/alarm.t.............................................FAILED
# Failed test 4 -    right time (waited 6 secs for 3-sec alarm) at op/alarm.t line 54
[2019-12-09 04:30:32+0000] ../t/op/alarm.t.............................................FAILED
# Failed test 955 - Test Perl 73464 at ./test.pl line 1062
[2019-12-09 05:11:30+0000] ../t/re/pat_thr.t...........................................FAILED
# Failed test 58 - should not COW on long string with substr and m//g at ./re/speed.t line 156
[2019-12-09 06:10:16+0000] ../t/re/speed_thr.t.........................................FAILED
# Failed test 4 -    right time (waited 16 secs for 3-sec alarm) at op/alarm.t line 54
[2019-12-09 07:00:31+0000] ../t/op/alarm.t.............................................FAILED
# Failed test 4 -    right time (waited 6 secs for 3-sec alarm) at op/alarm.t line 54
[2019-12-09 08:14:20+0000] ../t/op/alarm.t.............................................FAILED
# Failed test 4 -    right time (waited 17 secs for 3-sec alarm) at op/alarm.t line 54
[2019-12-09 09:07:47+0000] ../t/op/alarm.t.............................................FAILED
# Failed test 955 - Test Perl 73464 at ./test.pl line 1062
[2019-12-09 09:54:34+0000] ../t/re/pat_thr.t...........................................FAILED
# Failed test 955 - Test Perl 73464 at ./test.pl line 1062
# Failed test 4 -    right time (waited 6 secs for 3-sec alarm) at op/alarm.t line 54
[2019-12-09 10:42:06+0000] ../t/re/pat_thr.t...........................................FAILED
[2019-12-09 10:42:06+0000] ../t/op/alarm.t.............................................FAILED
# Failed test 955 - Test Perl 73464 at ./test.pl line 1062
# Failed test 4 -    right time (waited 16 secs for 3-sec alarm) at op/alarm.t line 54

Here is an example of the failure in t/re/pat_thr.t:

# Failed test 955 - Test Perl 73464 at ./test.pl line 1062
#      got "# Test process timed out - terminating"
# expected "Timeout"
# PROG:
#
#                 BEGIN{require q(test.pl);}
#                 watchdog(3);
#                 $SIG{ALRM} = sub {print "Timeout\n"; exit(1)};
#                 alarm 1;
#                 $_ = "a" x 1000 . "b" x 1000 . "c" x 1000;
#                 /.*a.*b.*c.*[de]/;
# 
# STATUS: 9

While it's not unusual to get failures during smoke-testing in t/re/pat_thr.t and t/op/alarm.t in some of the resource-constrained VMs sitting on this host, it is rare to get timeout failures on the host itself. And it's even rarer for the smoke-test run to exhaust the allocated 9 hours.

(Because the run aborted, no report was generated, but I can gzip the log and send it to you if you want. See http://perl5.test-smoke.org/report/101038 for a more typical smoke-test run on this host with this configuration.)

Yes please.

The underlying change is pretty simple, I'd expect it to increase memory usage per interpreter by 3200 bytes, and start up time per interpreter is increased by copying that from the parent thread or the global version (which shouldn't be noticeable unless you're specifically benchmarking.

At compile-time (for perl code, not the perl build itself) we're accessing an interpreter variable instead of a global, which shouldn't be any more expensive if my_perl (aTHX) ends up in a register.

The tests should only have an effect on those tests, or any tests running in parallel with them, the new test in the first commit spawns 20 threads and tries to load XS::APItest in each thread, and that's moderately expensive, but my old Q6600 has no problem with it.

Granted, these failures could be due to transitory conditions on the host. But this needs more investigation before proceeding.

It might be useful to record the results of vmstat 1 5 before and after each run to get some idea of system load, and maybe uptime for the load average report.

Recording vmstat for the entire job would be better, but that's a little more difficult to automate.

jkeenan · 2019-12-11T03:39:43Z

@tonycoz I think these code changes have performance problems and are not ready for merging.
[snip]

Granted, these failures could be due to transitory conditions on the host. But this needs more investigation before proceeding.

[snip]

This turned out to be the case. In the last week, instead of having only 1 or 2 virtual box VMs running on the host, I had 4. This sapped memory. For example, when I had 4 VMs running, I observed this line in top on the host:

Mem: 74M Active, 2448K Inact, 248K Laundry, 7675M Wired, 151M Free

But when I shut down 3 of those 4 VMs, I observed this:

Mem: 58M Active, 244M Inact, 20K Laundry, 3421M Wired, 4179M Free

I then ran a new test of the branch I had created from Tony's p.r. using the same configuration. The result is here: smoke-report 101513. All tests PASS and the running time is actually slightly shorter than usual (though probably not to a statistically significant degree).

Upshot: I no longer object to p.r. 17351.

Thank you very much.
Jim Keenan

Nicholas Clark's fix for IO makes the test in niner's patch meaningless, so test it separately.

Fixes issue Perl#14816

io/handle.t depends on IO::Handle using the PL_check hack, but Nicholas's back portable fix no longer uses that Fix threaded perl detection, thanks to James Keenan.

This failed on Win32 like in the Perl#17351 CI checks with: ../dist/if/t/if.t .................................................. ok Can't call method "sockdomain" on an undefined value at t/cachepropagate-tcp.t line 46. # Looks like your test exited with 9 just after 5. ../dist/IO/t/cachepropagate-tcp.t .................................. Dubious, test returned 9 (wstat 2304, 0x900) Failed 3/8 subtests I suspect what happened is there was a race between the parent accepting the connection and the child exiting and closing the connection. The Microsoft documentation for accept() indicates one possible reason for failure is: WSAECONNRESET An incoming connection was indicated, but was subsequently terminated by the remote peer prior to accepting the call. which I suspect happened here. So I've: - added a basic error check for the result of accept() - made the child to wait for the parent to close the socket - the parent explicitly closes the socket

This failed on Win32 like in the #17351 CI checks with: ../dist/if/t/if.t .................................................. ok Can't call method "sockdomain" on an undefined value at t/cachepropagate-tcp.t line 46. # Looks like your test exited with 9 just after 5. ../dist/IO/t/cachepropagate-tcp.t .................................. Dubious, test returned 9 (wstat 2304, 0x900) Failed 3/8 subtests I suspect what happened is there was a race between the parent accepting the connection and the child exiting and closing the connection. The Microsoft documentation for accept() indicates one possible reason for failure is: WSAECONNRESET An incoming connection was indicated, but was subsequently terminated by the remote peer prior to accepting the call. which I suspect happened here. So I've: - added a basic error check for the result of accept() - made the child to wait for the parent to close the socket - the parent explicitly closes the socket

tonycoz force-pushed the 17240-rework-tests branch from ef894f4 to bf9eb61 Compare December 9, 2019 02:51

tonycoz and others added 3 commits December 12, 2019 09:50

provide a test for the PL_check mis-behaviour

88c2575

Nicholas Clark's fix for IO makes the test in niner's patch meaningless, so test it separately.

Move PL_check to the interp vars to fix threading issues

45755f1

Fixes issue Perl#14816

un-TODO pl_check.t and remove io/handle.t

a8ba221

io/handle.t depends on IO::Handle using the PL_check hack, but Nicholas's back portable fix no longer uses that Fix threaded perl detection, thanks to James Keenan.

tonycoz force-pushed the 17240-rework-tests branch from bf9eb61 to a8ba221 Compare December 11, 2019 23:10

tonycoz merged commit c8b3a65 into Perl:blead Dec 12, 2019

This was referenced Dec 12, 2019

Thread race in dist/IO/IO.xs #14816

Closed

harden IO's cachepropagate-tcp #17360

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

rework tests for the PL_check change #17351

rework tests for the PL_check change #17351

Uh oh!

tonycoz commented Dec 9, 2019

Uh oh!

jkeenan commented Dec 9, 2019

Uh oh!

jkeenan commented Dec 9, 2019

Uh oh!

jkeenan commented Dec 9, 2019

Uh oh!

tonycoz commented Dec 9, 2019

Uh oh!

jkeenan commented Dec 11, 2019

Uh oh!

Uh oh!

rework tests for the PL_check change #17351

rework tests for the PL_check change #17351

Uh oh!

Conversation

tonycoz commented Dec 9, 2019

Uh oh!

jkeenan commented Dec 9, 2019

Uh oh!

jkeenan commented Dec 9, 2019

Uh oh!

jkeenan commented Dec 9, 2019

Uh oh!

tonycoz commented Dec 9, 2019

Uh oh!

jkeenan commented Dec 11, 2019

Uh oh!

Uh oh!