This repository was archived by the owner on Jul 23, 2020. It is now read-only.

Slow mount of PVC for che workspaces - can prevent workspaces from starting. #4079

Closed
rhopp opened this issue Jul 26, 2018 · 21 comments

Comments

@rhopp
Collaborator

rhopp commented Jul 26, 2018

gitlab issue - https://gitlab.cee.redhat.com/dtsd/housekeeping/issues/2169
Mounting PVC which has lots of files is very slow and it can even prevent che workspaces from starting.

To put it into perspective: it's enough to have 3 workspaces with the Angular [1] example to get into a state where workspaces fail to start.

Right now this may not be such a big issue (though we have had people affected by it in the past), but it won't scale in the future... Even if we raise the startup timeout for workspaces, waiting ~10 minutes or more for startup would be unbearable.

Steps to reproduce

Basically, fill up the claim-che-workspace PVC with lots of files and try to start a workspace - observe that it either takes too long or fails after ~10 minutes.
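For the synthetic variant, a throwaway pod that mounts the PVC and fills it with files could look roughly like this (a sketch; the pod name, image, file count, and mount path are illustrative, not taken from this issue):

```sh
# Throwaway pod that mounts claim-che-workspace and fills it with many
# small files. Pod name, image, file count, and mount path are illustrative.
oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pvc-filler
spec:
  restartPolicy: Never
  containers:
  - name: filler
    image: busybox
    command: ["sh", "-c", "mkdir -p /projects/junk && for i in $(seq 1 20000); do touch /projects/junk/f-$i; done"]
    volumeMounts:
    - name: workspace
      mountPath: /projects
  volumes:
  - name: workspace
    persistentVolumeClaim:
      claimName: claim-che-workspace
EOF
```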

OR use a "more real world" approach:

  • Create a workspace from the factory: https://che.openshift.io/f?id=factory3db802wkfbg8wox5
    This link spawns a workspace with the Angular quickstart and immediately starts a build - this generates (downloads) a few thousand files.
  • Stop that workspace
  • Create a new workspace with the same (or any other) quickstart and do a build (a.k.a. generate lots of files)
  • Stop the second workspace
  • Create a third workspace - when I used 2 workspaces with the Angular quickstart, this third one failed to start for me.

[1] - Created from factory - https://www.eclipse.org/che/getting-started/cloud-osio/

@rhopp
Collaborator Author

rhopp commented Jul 26, 2018

@ScrewTSW also created a job which monitors mount times of the PVC. This PVC was pre-populated with files equivalent to two built Angular quickstarts (~20000 files, if I remember correctly). @ScrewTSW could you please provide a link to the Zabbix graph?
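For context, such a probe can be as simple as timing how long a pod that mounts the pre-populated PVC takes to become Ready (a sketch; the pod and file names are illustrative and the Zabbix reporting is omitted):

```sh
# Time how long a pod mounting the pre-populated PVC takes to become Ready;
# that interval is dominated by the volume mount. Names are illustrative.
start=$(date +%s)
oc apply -f pvc-mount-probe.yaml   # a pod spec that mounts claim-che-workspace
oc wait --for=condition=Ready pod/pvc-mount-probe --timeout=600s
echo "mount + start took $(( $(date +%s) - start ))s"
oc delete pod pvc-mount-probe
```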

@rhopp rhopp added the type/bug label Jul 26, 2018
@stevengutz stevengutz added the priority/P4 Normal label Jul 26, 2018
@ScrewTSW
Collaborator

ScrewTSW commented Jul 26, 2018

Here are the job results with two Angular projects installed.

Number of files in project:
[screenshot]

Events from openshift console:
[screenshot]

Link to the Zabbix graph with the collected data:
https://zabbix.devshift.net:9443/zabbix/charts.php?ddreset=1

@rhopp
Collaborator Author

rhopp commented Jul 26, 2018

This has one more implication...

Normally, when a workspace is deleted, an rm-<workspace-name> pod is spawned. Its purpose is to delete the workspace's files from the PVC.

But when the PVC has lots of files, the mount fails (or takes longer than the startup timeout of the rm- pod), and the user is basically stuck: they cannot start any workspace AND they cannot remove the data from the PVC (well... technically they can deploy their own pod with the PVC mounted and delete the files manually from a terminal, but this will stop working once we revoke admin rights from users' -che namespaces).
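That manual workaround could look roughly like this (a sketch; the pod name is illustrative and <workspace-dir> is a placeholder):

```sh
# Attach a throwaway pod to the PVC and delete workspace files by hand.
# Pod name is illustrative; <workspace-dir> is a placeholder. This only
# works while users can still create pods in their -che namespace.
oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pvc-cleanup
spec:
  restartPolicy: Never
  containers:
  - name: cleanup
    image: busybox
    command: ["sleep", "3600"]   # keep the pod alive so we can rsh into it
    volumeMounts:
    - name: workspace
      mountPath: /projects
  volumes:
  - name: workspace
    persistentVolumeClaim:
      claimName: claim-che-workspace
EOF
oc rsh pvc-cleanup rm -rf /projects/<workspace-dir>
oc delete pod pvc-cleanup
```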

WDYT @davidfestal?

@slemeur
Collaborator

slemeur commented Jul 29, 2018

@ibuziuk
Collaborator

ibuziuk commented Jul 30, 2018

@rhopp could you please test how it works on prod-preview, where deployments are used instead of bare pods - eclipse-che/che#10021 (comment)?

@rhopp
Collaborator Author

rhopp commented Jul 30, 2018

@ibuziuk On prod-preview, when the failure happens, it comes with a weird message (a timeout of 0 milliseconds, even though it fails after a few minutes):

Error when starting agent
Unable to start workspace agent. Error when trying to start the workspace agent: Timed out waiting for [0] milliseconds for [Deployment] with name:[workspaceekokh2u0tab43lgn.dockerimage] in namespace [rhopp-preview-che].

@slemeur
Collaborator

slemeur commented Jul 30, 2018

That's the one I hit last week indeed.

[screenshot]

@ibuziuk
Collaborator

ibuziuk commented Jul 30, 2018

@rhopp @slemeur but the message in the OpenShift events is still about the FailedMount problem, right?
I believe the message exposed to Che will be better once the following regression is fixed - eclipse-che/che#10559
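For reference, one way to check the event stream for that signal on the cluster side (FailedMount is the standard kubelet event reason for volume mount failures; the namespace is a placeholder):

```sh
# List volume mount failures in the user's -che namespace, newest last.
oc get events -n <user>-che --field-selector reason=FailedMount \
  --sort-by=.lastTimestamp
```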

@sbose78
Collaborator

sbose78 commented Sep 14, 2018

Could you please tell me if there's been progress on resolving this?

@ScrewTSW
Collaborator

ScrewTSW commented Oct 5, 2018

@sbose78 Hello.
I've tested the new update on the free-stg cluster. I really tried to break it, but I was unable to :D
I mounted a simple pod with the PVC attached and created 30,000 artificial files + 6 large ones, besides those that already existed from the workspaces previously run with that account (around 8,000 more real files).
I tried mounting the volume and creating new workspaces back and forth, and I didn't see any issues with the mount or any increase in the time it took to mount the volume and start the pod/workspace.
I even tried the https://che.prod-preview.openshift.io/dashboard/#/load-factory?id=factoryqlvwzsnbpcnvwxbo factory. The workspace start and project cloning worked without any issues.
The build is where the factory failed, but that's already a known bug.
The cluster seems to be stable. 👍 for promoting to prod-preview
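For reference, generating that test data from a shell inside such a pod could look like this (a sketch; the paths and the large-file size are illustrative, the counts are from the comment above):

```sh
# Inside a pod with the PVC mounted at /projects: create 30,000 small
# files plus 6 large ones (the 512 MB size is illustrative).
mkdir -p /projects/stress
for i in $(seq 1 30000); do touch "/projects/stress/f-$i"; done
for i in $(seq 1 6); do
  dd if=/dev/zero of="/projects/stress/big-$i" bs=1M count=512
done
```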

@sbose78
Collaborator

sbose78 commented Oct 5, 2018

@ScrewTSW Fantastic! Thank you.

The build is where the factory failed, but that's already a known bug.

yeah, this should be OK.

@ibuziuk
Collaborator

ibuziuk commented Nov 5, 2018

@ScrewTSW could you please verify and close this issue if all the starter clusters have been updated and the issue is no longer reproducible? cc: @rhopp

@ibuziuk
Collaborator

ibuziuk commented Nov 9, 2018

All starter clusters have been updated to OpenShift 3.11 this week. However, yet another requirement for fixing this slow volume mount issue is having the gluster-subvol driver fix [1] deployed to the prod clusters. I will provide an update here when the fix is available on the prod clusters and QA verification can be done.

[1] gluster/gluster-subvol#24

@ibuziuk
Collaborator

ibuziuk commented Dec 12, 2018

@rhopp @ScrewTSW @Katka92 the gluster-subvol fix is applied on the 1a cluster. Will you be able to verify that the volume mount problem is not reproducible there and +1 applying the same fix to all the other starter clusters?

@ppitonak
Collaborator

@ibuziuk
Collaborator

ibuziuk commented Dec 13, 2018

@ppitonak are we positive that the tests started to pass because of the gluster-subvol fix (were those failing due to slow volume mount before?) and not because of the fix for #4626?

@ppitonak
Collaborator

I'm not sure, but:

  • you commented in this issue on Dec 12, 18:16 UTC
  • the fix for #4626 got into production on Dec 12, 19:13 UTC
  • the first successful build started on Dec 12, 18:35 UTC

@ibuziuk
Collaborator

ibuziuk commented Dec 13, 2018

@rhopp could someone from the QA team with an account provisioned against 1a make sure that the problem is not reproducible by following the steps to reproduce from the description?

@rhopp
Collaborator Author

rhopp commented Dec 13, 2018

The PVC mount issue seems to be fixed on the 1a cluster. The workspace container starts quickly even with lots of files inside.

I've encountered another (not so critical) issue while testing that... I'll report it as a separate issue.

@dak1n1

dak1n1 commented Dec 13, 2018

The update has now been applied to the rest of OSIO.

@ibuziuk
Collaborator

ibuziuk commented Dec 14, 2018

@rhopp closing this issue

@ibuziuk ibuziuk closed this as completed Dec 14, 2018