[perf] MCAD takes a very long time to delete a large number of AppWrappers

As part of the MCAD load test, I created 1000 AppWrappers that do _not_ fit in the cluster (they request a large amount of CPU).
Once all of these AppWrappers are in one of these states: `[Queueing, HeadOfLine, Pending, Failed]`, the main test ends, and the cleanup starts.
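
For reference, the end-of-test condition described above can be sketched as a small check over the observed states. This is only an illustration: the function name and the idea of passing in a plain list of state strings are assumptions, and how the states are read from the cluster is left out.

```python
# States the issue treats as "settled" for AppWrappers that cannot be scheduled.
SETTLED_STATES = {"Queueing", "HeadOfLine", "Pending", "Failed"}


def all_settled(states):
    """Return True when every observed state is in SETTLED_STATES.

    `states` is any iterable of state strings, one per AppWrapper. How they
    are collected from the cluster is up to the caller (assumption).
    """
    states = list(states)
    # An empty list means nothing was observed yet, so the test must not end.
    return bool(states) and all(s in SETTLED_STATES for s in states)
```

Treating an empty list as "not settled" avoids ending the main test before any AppWrapper has been observed.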

All the AppWrappers are then deleted with `oc delete AppWrappers --all -n <namespace>`.
The timing of this call is shown in blue in the figure below.
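
A minimal sketch of how the duration of such a call could be measured, assuming a hypothetical helper named `timed_run` (the load test itself drives this through Ansible, so this is not the code that produced the figure):

```python
import subprocess
import time


def timed_run(cmd):
    """Run `cmd` (a list of arguments) and return (return_code, elapsed_seconds)."""
    start = time.monotonic()  # monotonic clock: unaffected by wall-clock adjustments
    completed = subprocess.run(cmd, check=False)
    return completed.returncode, time.monotonic() - start
```

For the cleanup step it would be called as `timed_run(["oc", "delete", "AppWrappers", "--all", "-n", namespace])`, where `namespace` is whichever namespace the test uses.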

Once this call returns, I create a canary AppWrapper, and wait for it to be executed.
This step is shown in red in the figure below.
The [Ansible logs](https://rhods-baremetal-results.s3.amazonaws.com/local-ci/codeflare/codeflare-light/20230713_1351/000__test/000__test-case_cpu_light_unschedulable/000__mcad_load_test_multiple_values/002__mcad_load_test_value__aw.count%3D1000/001__mcad_load_test/001__codeflare__cleanup_appwrappers/_ansible.log) of this command confirm that most of the `23 minutes` is spent _before_ the `.status.controllerfirsttimestamp` even gets filled.
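
The wait on the canary can be pictured as a polling loop that records how long it takes before a field such as `.status.controllerfirsttimestamp` becomes non-empty. The sketch below is an assumption about the shape of such a loop, not the test's actual code; `wait_for` and its parameters are hypothetical names, and the probe that reads the field is supplied by the caller.

```python
import time


def wait_for(probe, timeout_s=1800.0, interval_s=5.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll `probe()` until it returns a truthy value.

    Returns (value, waited_seconds). Raises TimeoutError if no truthy value
    is seen within `timeout_s`.
    """
    start = clock()
    while True:
        value = probe()
        if value:
            return value, clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"no value after {timeout_s}s")
        sleep(interval_s)
```

A probe could, for example, run `oc get` on the canary with `-o jsonpath='{.status.controllerfirsttimestamp}'` and return the stripped output; an empty string then means the controller has not yet picked up the AppWrapper.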

![image](https://github.com/project-codeflare/multi-cluster-app-dispatcher/assets/7559202/c4a1182a-92ce-42d1-aea6-2b598123ae4f)

All the details of the scale test are at [this address](https://rhods-baremetal-results.s3.amazonaws.com/local-ci/codeflare/codeflare-light/20230713_1351/000__test/000__test-case_cpu_light_unschedulable/001__plots/report_00_report:_error_report.html) ([files here](https://rhods-baremetal-results.s3.amazonaws.com/index.html#local-ci/codeflare/codeflare-light/20230713_1351/000__test/000__test-case_cpu_light_unschedulable/000__mcad_load_test_multiple_values/002__mcad_load_test_value__aw.count=1000/001__mcad_load_test/001__codeflare__cleanup_appwrappers/)). Mind that there was a typo in the code (the wrong file was read as part of the visualizer parsing), which makes the cleanup phase appear to be 5 minutes long (this was the _test_ length :D).

This other plot (from [this test](https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift-psap_ci-artifacts/844/pull-ci-openshift-psap-ci-artifacts-main-codeflare-e2e/1678995592128237568/artifacts/e2e/test/artifacts/000__test-case_cpu_light_unschedulable/000__mcad_load_test_multiple_values/004__mcad_load_test_value__aw.count=1000/002__plots/report_01_report:_resource_allocation_timelines.html)) confirms that none of the 1000 AppWrappers created in the first 5 minutes of the test are discovered in the first `25 minutes` of the test:
![image](https://github.com/project-codeflare/multi-cluster-app-dispatcher/assets/7559202/b2324d38-332c-4a46-8322-45b1407b8e9b)



