This repository was archived by the owner on Jul 23, 2020. It is now read-only.

Slow mount of PVC for che workspaces - can prevent workspaces from starting. #4079

Closed
rhopp opened this issue Jul 26, 2018 · 21 comments

Comments

@rhopp
Collaborator

rhopp commented Jul 26, 2018

gitlab issue - https://gitlab.cee.redhat.com/dtsd/housekeeping/issues/2169
Mounting PVC which has lots of files is very slow and it can even prevent che workspaces from starting.

To put it into perspective: it's enough to have 3 workspaces with the Angular [1] example to get into a state where workspaces fail to start.

Right now this may not be such a big issue (though we have had people affected by it in the past), but it won't scale in the future... Even if we raise the startup timeout for workspaces, waiting ~10 minutes or more for startup would be unbearable.

Steps to reproduce

Basically, fill up the claim-che-workspace PVC with lots of files and try to start a workspace - observe that it either takes too long or fails after ~10 minutes.
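For the synthetic variant, a throwaway pod that mounts the PVC and fills it with files could look roughly like this (a sketch; the pod name, image, file count, and mount path are illustrative, not taken from this issue):

```sh
# Throwaway pod that mounts claim-che-workspace and fills it with many
# small files. Pod name, image, file count, and mount path are illustrative.
oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pvc-filler
spec:
  restartPolicy: Never
  containers:
  - name: filler
    image: busybox
    command: ["sh", "-c", "mkdir -p /projects/junk && for i in $(seq 1 20000); do touch /projects/junk/f-$i; done"]
    volumeMounts:
    - name: workspace
      mountPath: /projects
  volumes:
  - name: workspace
    persistentVolumeClaim:
      claimName: claim-che-workspace
EOF
```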

OR use a "more real world" approach:

  • Create a workspace from the factory: https://che.openshift.io/f?id=factory3db802wkfbg8wox5
    This link spawns a workspace with the Angular quickstart and immediately starts a build - this generates (downloads) a few thousand files.
  • Stop that workspace
  • Create a new workspace with the same (or any other) quickstart and do a build (a.k.a. generate lots of files)
  • Stop the second workspace
  • Create a third workspace - when I used 2 workspaces with the Angular quickstart, this third one failed to start for me.

[1] - Created from factory - https://www.eclipse.org/che/getting-started/cloud-osio/

@rhopp
Collaborator Author

rhopp commented Jul 26, 2018

@ScrewTSW also created a job which monitors mount times of the PVC. This PVC was pre-populated with files equivalent to two built Angular quickstarts (~20000 files, if I remember correctly). @ScrewTSW could you please provide a link to the Zabbix graph?
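For context, such a probe can be as simple as timing how long a pod that mounts the pre-populated PVC takes to become Ready (a sketch; the pod and file names are illustrative and the Zabbix reporting is omitted):

```sh
# Time how long a pod mounting the pre-populated PVC takes to become Ready;
# that interval is dominated by the volume mount. Names are illustrative.
start=$(date +%s)
oc apply -f pvc-mount-probe.yaml   # a pod spec that mounts claim-che-workspace
oc wait --for=condition=Ready pod/pvc-mount-probe --timeout=600s
echo "mount + start took $(( $(date +%s) - start ))s"
oc delete pod pvc-mount-probe
```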

@rhopp rhopp added the type/bug label Jul 26, 2018
@stevengutz stevengutz added the priority/P4 Normal label Jul 26, 2018
@ScrewTSW
Collaborator

ScrewTSW commented Jul 26, 2018

Here are the job results with two Angular projects installed.

Number of files in project:
[screenshot]

Events from openshift console:
[screenshot]

Link to the Zabbix graph with the collected data:
https://zabbix.devshift.net:9443/zabbix/charts.php?ddreset=1

@rhopp
Collaborator Author

rhopp commented Jul 26, 2018

This has one more implication...

Normally, when a workspace is deleted, an rm-<workspace-name> pod is spawned. Its purpose is to delete the workspace's files from the PVC.

But when the PVC has lots of files, the mount fails (or takes longer than the startup timeout of the rm- pod), and the user is basically stuck: they cannot start any workspace AND they cannot remove the data from the PVC (well... technically they can deploy their own pod with the PVC mounted and delete the files manually from a terminal, but this will stop working once we revoke admin rights from users' -che namespaces).
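That manual workaround could look roughly like this (a sketch; the pod name is illustrative and <workspace-dir> is a placeholder):

```sh
# Attach a throwaway pod to the PVC and delete workspace files by hand.
# Pod name is illustrative; <workspace-dir> is a placeholder. This only
# works while users can still create pods in their -che namespace.
oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pvc-cleanup
spec:
  restartPolicy: Never
  containers:
  - name: cleanup
    image: busybox
    command: ["sleep", "3600"]   # keep the pod alive so we can rsh into it
    volumeMounts:
    - name: workspace
      mountPath: /projects
  volumes:
  - name: workspace
    persistentVolumeClaim:
      claimName: claim-che-workspace
EOF
oc rsh pvc-cleanup rm -rf /projects/<workspace-dir>
oc delete pod pvc-cleanup
```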

WDYT @davidfestal?

@slemeur
Collaborator

slemeur commented Jul 29, 2018

@ibuziuk
Collaborator

ibuziuk commented Jul 30, 2018

@rhopp could you please test how it works on prod-preview, where deployments are used instead of bare pods - eclipse-che/che#10021 (comment)?

@rhopp
Collaborator Author

rhopp commented Jul 30, 2018

@ibuziuk On prod-preview, when the failure happens, it comes with a weird message (a timeout of 0 milliseconds, even though it fails after a few minutes):

Error when starting agent
Unable to start workspace agent. Error when trying to start the workspace agent: Timed out waiting for [0] milliseconds for [Deployment] with name:[workspaceekokh2u0tab43lgn.dockerimage] in namespace [rhopp-preview-che].

@slemeur
Collaborator

slemeur commented Jul 30, 2018

That's the one I hit last week indeed.

[screenshot]

@ibuziuk
Collaborator

ibuziuk commented Jul 30, 2018

@rhopp @slemeur but the message in the OpenShift events is still about the FailedMount problem, right?
I believe the message exposed to Che will be better once the following regression is fixed - eclipse-che/che#10559
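For reference, one way to check the event stream for that signal on the cluster side (FailedMount is the standard kubelet event reason for volume mount failures; the namespace is a placeholder):

```sh
# List volume mount failures in the user's -che namespace, newest last.
oc get events -n <user>-che --field-selector reason=FailedMount \
  --sort-by=.lastTimestamp
```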

@sbose78
Collaborator

sbose78 commented Sep 14, 2018

Could you please tell me if there's been progress on resolving this?

@ScrewTSW
Collaborator

ScrewTSW commented Oct 5, 2018

@sbose78 Hello.
I've tested the new update on the free-stg cluster. I really tried to break it, but I was unable to :D
I mounted a simple pod with the PVC attached and created 30,000 artificial files + 6 large ones, besides those that already existed from the workspaces previously run with that account (around 8,000 more real files).
I tried mounting the volume and creating new workspaces back and forth, and I didn't see any issues with the mount or any increase in the time it took to mount the volume and start the pod/workspace.
I even tried the https://che.prod-preview.openshift.io/dashboard/#/load-factory?id=factoryqlvwzsnbpcnvwxbo factory. The workspace start and project cloning worked without any issues.
The build is where the factory failed, but that's already a known bug.
The cluster seems to be stable. 👍 for promoting to prod-preview
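For reference, generating that test data from a shell inside such a pod could look like this (a sketch; the paths and the large-file size are illustrative, the counts are from the comment above):

```sh
# Inside a pod with the PVC mounted at /projects: create 30,000 small
# files plus 6 large ones (the 512 MB size is illustrative).
mkdir -p /projects/stress
for i in $(seq 1 30000); do touch "/projects/stress/f-$i"; done
for i in $(seq 1 6); do
  dd if=/dev/zero of="/projects/stress/big-$i" bs=1M count=512
done
```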

@sbose78
Collaborator

sbose78 commented Oct 5, 2018

@ScrewTSW Fantastic! Thank you.

The build is where the factory failed, but that's already a known bug.

yeah, this should be OK.

@ibuziuk
Collaborator

ibuziuk commented Nov 5, 2018

@ScrewTSW could you please verify and close this issue if all the starter clusters have been updated and the issue is no longer reproducible? cc: @rhopp

@ibuziuk
Collaborator

ibuziuk commented Nov 9, 2018

All starter clusters have been updated to OpenShift 3.11 this week. However, yet another requirement for fixing this slow volume mount issue is having the gluster-subvol driver fix [1] deployed to the prod clusters. I will provide an update here when the fix is available on the prod clusters and QA verification can be done.

[1] gluster/gluster-subvol#24

@ibuziuk
Collaborator

ibuziuk commented Dec 12, 2018

@rhopp @ScrewTSW @Katka92 the gluster-subvol fix is applied on the 1a cluster. Will you be able to verify that the volume mount problem is not reproducible there and +1 applying the same fix to all the other starter clusters?

@ppitonak
Collaborator

@ibuziuk
Collaborator

ibuziuk commented Dec 13, 2018

@ppitonak are we positive that the tests started to pass because of the gluster-subvol fix (were those failing due to slow volume mount before?) and not because of the fix for #4626?

@ppitonak
Collaborator

I'm not sure, but:

  • you commented in this issue on Dec 12, 18:16 UTC
  • the fix for #4626 got into production on Dec 12, 19:13 UTC
  • the first successful build started on Dec 12, 18:35 UTC

@ibuziuk
Collaborator

ibuziuk commented Dec 13, 2018

@rhopp could someone from the QA team with an account provisioned against 1a make sure that the problem is not reproducible by following the steps to reproduce from the description?

@rhopp
Collaborator Author

rhopp commented Dec 13, 2018

The PVC mount issue seems to be fixed on the 1a cluster. The workspace container starts quickly even with lots of files inside.

I've encountered another (not so critical) issue while testing that... I'll report it as a separate issue.

@dak1n1

dak1n1 commented Dec 13, 2018

The update has now been applied to the rest of OSIO.

@ibuziuk
Collaborator

ibuziuk commented Dec 14, 2018

@rhopp closing this issue

@ibuziuk ibuziuk closed this as completed Dec 14, 2018