oc exec hangs for six hours #13662

Closed
stevekuznetsov opened this issue Apr 6, 2017 · 18 comments
Labels: component/cli, kind/test-flake, priority/P0

@stevekuznetsov
Contributor

[INFO] Validating exec
Running test/end-to-end/core.sh:463: executing 'oc exec -p frontend-1-nsqdp id' expecting success and text '1000'...
SUCCESS after 0.266s: test/end-to-end/core.sh:463: executing 'oc exec -p frontend-1-nsqdp id' expecting success and text '1000'
Standard output from the command:
uid=1000050000 gid=0(root) groups=0(root),1000050000

Standard error from the command:
W0406 10:51:16.968491   15548 cmd.go:337] -p POD_NAME is DEPRECATED and will be removed in a future version. Use exec POD_NAME instead.

Running test/end-to-end/core.sh:464: executing 'oc rsh pod/frontend-1-nsqdp id -u' expecting success and text '1000'...
Killed by signal 15.

as seen here

/cc @ncdc

@stevekuznetsov added the component/cli, kind/test-flake, and priority/P1 labels on Apr 6, 2017
@ncdc
Contributor

ncdc commented Apr 6, 2017 via email

@stevekuznetsov
Contributor Author

@ncdc can you triage this by passing it on to someone else, then?

@ncdc
Contributor

ncdc commented Apr 6, 2017

@derekwaynecarr @mfojtik do you have anyone who can investigate?

@stevekuznetsov
Contributor Author

also: https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_integration/871/

This one is on oc rsh:

Running test/end-to-end/core.sh:488: executing 'oc rsh frontend-1-90m28 ls /tmp/sample-app' expecting success and text 'application-template-stibuild'...
Connection to 172.18.11.10 closed by remote host.

This is a very new regression and seems to be happening often.

@enj
Contributor

enj commented Apr 6, 2017

Seen in #11647 - https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_integration/872/consoleFull#-107583498258b6e51eb7608a5981914356

Excerpt:

Running test/end-to-end/core.sh:464: executing 'oc rsh pod/frontend-1-bn6v4 id -u' expecting success and text '1000'...
Connection to 172.18.0.183 closed by remote host.

Context:

logs: ok
[INFO] Starting build from /tmp/openshift/test-end-to-end/artifacts/stiAppConfig.json with non-existing commit...
Running test/end-to-end/core.sh:456: executing 'oc start-build test --commit=fffffff --wait' expecting failure...
SUCCESS after 0.207s: test/end-to-end/core.sh:456: executing 'oc start-build test --commit=fffffff --wait' expecting failure
There was no output from the command.
Standard error from the command:
Error from server (Forbidden): buildconfigs "test" is forbidden: buildconfigs.build.openshift.io "test" not found

[INFO] Validating exec
Running test/end-to-end/core.sh:463: executing 'oc exec -p frontend-1-bn6v4 id' expecting success and text '1000'...
SUCCESS after 9.809s: test/end-to-end/core.sh:463: executing 'oc exec -p frontend-1-bn6v4 id' expecting success and text '1000'
Standard output from the command:
uid=1000050000 gid=0(root) groups=0(root),1000050000
uid=1000050000 gid=0(root) groups=0(root),1000050000

Standard error from the command:
W0406 12:06:12.530529   31700 cmd.go:337] -p POD_NAME is DEPRECATED and will be removed in a future version. Use exec POD_NAME instead.

Running test/end-to-end/core.sh:464: executing 'oc rsh pod/frontend-1-bn6v4 id -u' expecting success and text '1000'...
Connection to 172.18.0.183 closed by remote host.
++ export status=FAILURE
++ status=FAILURE
+ set +o xtrace
########## FINISHED STAGE: FAILURE: RUN INTEGRATION TESTS ##########
Build step 'Execute shell' marked build as failure

@stevekuznetsov
Contributor Author

Seen in https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_integration/874/consoleFull

[INFO] Starting build from /tmp/openshift/test-end-to-end/artifacts/stiAppConfig.json with non-existing commit...
Running test/end-to-end/core.sh:456: executing 'oc start-build test --commit=fffffff --wait' expecting failure...
SUCCESS after 0.214s: test/end-to-end/core.sh:456: executing 'oc start-build test --commit=fffffff --wait' expecting failure
There was no output from the command.
Standard error from the command:
Error from server (Forbidden): buildconfigs "test" is forbidden: buildconfigs.build.openshift.io "test" not found

[INFO] Validating exec
Running test/end-to-end/core.sh:463: executing 'oc exec -p frontend-1-b0sd0 id' expecting success and text '1000'...
Connection to 172.18.7.107 closed by remote host.

@mfojtik should we be seeing the new API groups right now?

@stevekuznetsov
Contributor Author

Every single build since https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/604/ has this

@stevekuznetsov
Contributor Author

stevekuznetsov commented Apr 6, 2017

7:07 AM: test_pull_request_origin 569 promotes to a merge for #12733
9:21 AM: test_pull_request_origin 604 starts - last test job to succeed
9:37 AM: merge_pull_request_origin 265 merges #13630
9:45 AM: merge_pull_request_origin 266 begins to process #13529 but hangs for 6 hours
10:34 AM: #13529, #13652, and #13313 merge via @mfojtik's process_pull_requests call with merge-pretest-success.

merges into release-1.5 are fine:

4:13 PM: merge_pull_request_origin 267 merges #13418
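
A six-hour hang like the one in the timeline above suggests the stuck command was only stopped when something outside the test finally killed it. As a minimal sketch (not what the origin test harness actually does), a wrapper around GNU coreutils `timeout` can bound a single command. The helper name `run_with_deadline` and the 60-second deadline in the usage comment are assumptions for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command, but give up after a deadline
# instead of waiting until the whole CI job is killed.
# Assumes GNU coreutils `timeout` is on PATH.
run_with_deadline() {
  local seconds="$1"; shift
  local rc=0
  # Send SIGTERM at the deadline, then SIGKILL 5 seconds later if the
  # command is still running.
  timeout --signal=TERM --kill-after=5 "${seconds}" "$@" || rc=$?
  # `timeout` exits with 124 when the deadline expired.
  if [[ "${rc}" -eq 124 ]]; then
    echo "TIMEOUT after ${seconds}s: $*" >&2
  fi
  return "${rc}"
}

# Usage (not executed here; the 60s deadline is an arbitrary example):
#   run_with_deadline 60 oc rsh pod/frontend-1-nsqdp id -u
```

With a bound like this, a hung `oc rsh` would surface as a test failure within the deadline rather than as a job that ends with `Killed by signal 15.` hours later.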

@stevekuznetsov
Contributor Author

PR to test reverting #13630: #13664

@stevekuznetsov
Contributor Author

No changes went into aos-cd-jobs or origin-ci-tool in the last couple of days

@stevekuznetsov
Contributor Author

stevekuznetsov commented Apr 6, 2017

On the successful test run at 4PM we did not see the new API groups in the buildconfigs "test" not found message.

EDIT: that was release-1.5 so that seems reasonable.

@enj
Contributor

enj commented Apr 6, 2017

> Every single failure but one (disk out of space) had the new API group -- @enj did your impersonating client stuff touch this?

@stevekuznetsov that PR did not touch API groups at all. It was just a refactor of existing code into helper methods.

@stevekuznetsov
Contributor Author

Successful jobs before #13630 also had the same not found error, so that seems like a red herring

@enj
Contributor

enj commented Apr 6, 2017

@stevekuznetsov doesn't #13529 (comment) look like a better candidate?

@stevekuznetsov
Contributor Author

Man, such an obvious one. Let's not merge-pretest-success in the future.

@stevekuznetsov
Contributor Author

The error message is definitely a red herring -- from the last build we have available, on Mar 30, 2017 7:45:36 PM:

[INFO] Starting build from /tmp/openshift/test-end-to-end/artifacts/stiAppConfig.json with non-existing commit...
Running test/end-to-end/core.sh:467: executing 'oc start-build test --commit=fffffff --wait' expecting failure...
SUCCESS after 0.210s: test/end-to-end/core.sh:467: executing 'oc start-build test --commit=fffffff --wait' expecting failure
There was no output from the command.
Standard error from the command:
Error from server (Forbidden): buildconfigs "test" is forbidden: buildconfigs.build.openshift.io "test" not found

[INFO] Validating exec
Running test/end-to-end/core.sh:474: executing 'oc exec -p frontend-1-1xpqx id' expecting success and text '1000'...
SUCCESS after 0.302s: test/end-to-end/core.sh:474: executing 'oc exec -p frontend-1-1xpqx id' expecting success and text '1000'
Standard output from the command:
uid=1000050000 gid=0(root) groups=0(root),1000050000

Standard error from the command:
W0330 20:40:38.423803   29241 cmd.go:337] -p POD_NAME is DEPRECATED and will be removed in a future version. Use exec POD_NAME instead.

The hang does seem to be from #13529 although I can't find the test run that @soltysh was quoting there

@stevekuznetsov
Contributor Author

As an aside, the test there at test/end-to-end/core.sh:456 hasn't been touched in forever, but I am about 9999% sure we are aliasing a failure there: we should be failing due to a non-existent commit, but we are failing on a 404. @mfojtik, you added that a couple of years ago in 818cc10, so maybe you can find someone to look at it and make the test valid?
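
To illustrate the aliasing concern: a check that only expects failure passes on any non-zero exit, including the 404. A minimal sketch of a stricter check follows. The helper name `expect_failure_and_text` is hypothetical, not the harness's own function, and the expected text in the usage comment is a placeholder, since the actual error for a non-existent commit does not appear in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only when the command fails AND its
# combined stdout/stderr contains the expected text.
expect_failure_and_text() {
  local expected="$1"; shift
  local output rc=0
  output="$("$@" 2>&1)" || rc=$?
  if [[ "${rc}" -eq 0 ]]; then
    echo "FAILURE: command succeeded but was expected to fail: $*" >&2
    return 1
  fi
  # -F matches the expected text literally rather than as a regex.
  if ! grep -qF -- "${expected}" <<<"${output}"; then
    echo "FAILURE: command failed, but output lacked '${expected}':" >&2
    echo "${output}" >&2
    return 1
  fi
}

# Usage (not executed here; EXPECTED_COMMIT_ERROR is a placeholder):
#   expect_failure_and_text "EXPECTED_COMMIT_ERROR" \
#     oc start-build test --commit=fffffff --wait
```

Under a check like this, the `buildconfigs "test" is forbidden ... not found` response would fail the test instead of being counted as the expected failure.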

@soltysh
Contributor

soltysh commented Apr 7, 2017

> The hang does seem to be from #13529 although I can't find the test run that @soltysh was quoting there

I picked one of the test runs and saw it there. I'll wait for your revert to land and then submit the fix again.
