Skip to content

RHOAIENG-16587: fix(test): ensure papermill tests run successfully for all supported notebooks #834

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

andyatmiami
Copy link
Contributor

@andyatmiami andyatmiami commented Jan 12, 2025

Description

Our testing framework had a fundamental issue in that it would, for certain images, naively run the test suite of parent images for a child image. In the case of tensorflow images, this caused problems because the (child) tensorflow image would purposefully downgraded a package version given compatability issues. Furthermore, when attempting to verify the changes to address this core issue, numerous other bugs were encountered that required to be fixed.

Significant changes introduced in this commit:

  • logic of the test-% Makefile target was moved into a shell script named test_jupyter_with_papermill.sh
  • test resources required to be copied to pod are now retrieved via the locally checked out repo
    • previously there were pulled from a remote branch via wget (defaulting to main branch)
    • this ensure our PR checks are now always leveraging any updated files
  • test_notebook.ipynb files now expect an expected_versions.json file to exist within the same directory.
    • expected_versions.json file is derived from the relevant ...-notebook-imagestream.yaml manifest and leveraged when asserting on dependency versions
    • admittedly the duplication of various helper functions across all the notebook files is annoying - but helps to keep the .ipynb files self-contained
  • ...-notebook-imagestream.yaml manifest had annotations updated to include any package that had a unit test in test_notebook.ipynb asserting versions
  • CUDA tensorflow unit test that converts to ONNX was updated to be actually functional

Minor changes introduced in this commit:

  • use ifdef OS in Makefile to avoid warnings about undefined variable
  • all test_notebook.ipynb files:
    • have an id attribute defined in metadata
    • specify the nbformat as 4.5
  • the more compute-intensive notebooks had readinessProbe and livenessProbe settings updated to be less aggressive
    • was observing liveness checks sporadically failing while the notebook was being tested - and this update seemed to help
  • trustyai notebook now runs the minimal and datascience papermill tests (similar to tensorflow and pytorch) instead of including the test code within its own test_noteook.ipynb file
  • various "quality of life" improvements where introduced into test_jupyter_with_papermill.sh
    • properly invoke tests for any valid/supported target
      • previously certain test targets required manual intervention in order to run end to end
    • improved logging (particularly when running with the -x flag)
    • more modular structure to hopefully improve readability
    • script now honors proper shell return code semantics (i.e. returns 0 on success)

It should also be noted that while most intel notebooks are now passing the papermill tests - there are issues with the intel tensorflow unit tests still. More detail is captured in the JIRA ticket related to this commit. However, we are imminently removing intel logic from the notebooks repo entirely... so I didn't wanna burn any more time trying to get that last notebook to pass as it will be removed shortly!

Further details on expected_versions.json:

  • yq (which is installed via the Makefile is used to:
    • query the relevant imagestream manifest to parse out the ["opendatahub.io/notebook-software"] and ["opendatahub.io/notebook-python-dependencies"] annotations from the first element of the spec.tags attribute
    • inject name/version elements for nbdime and nbgitpuller (which are asserted against in minimal notebook tests)
    • convert this yaml list to a JSON object of the form: {<package>: <version>}
  • this JSON object is then copied into the running notebook workload in the same directly that test_notebook.ipynb resides
  • each test_notebook.ipynb has a couple helper functions defined to then interact with this file:
    • def load_expected_versions() -> dict:
    • def get_expected_version(dependency_name: str) -> str:
      • The argument provided to the get_expected_version function should match the name attribute of the JSON structure defined in the imagestream manifest

Related-to: https://issues.redhat.com/browse/RHOAIENG-16587

How Has This Been Tested?

Testing can be performed simply by invoking make with a target that matches the test-% pattern-matching rule.

ex: gmake test-jupyter-pytorch-ubi9-python-3.11 -e NOTEBOOK_REPO_BRANCH_BASE=https://raw.githubusercontent.com/andyatmiami/notebooks/fix-e2e-asserts for basic sanity checking. However, all known/supported/expected test-% patterns should be verified.

I have also created a utility in https://gitlab.cee.redhat.com/astonebe/notebook-utils that can be used to more efficiently test notebooks "in bulk". While it still runs sequentially, it handles deploying, testing, and undeploying a given list of targets.

  • I used this utility script to test all notebooks that have a corresponding imagestream manifest.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Copy link
Contributor

openshift-ci bot commented Jan 12, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@jiridanek
Copy link
Member

Alternative proposal

The package versions we should have in our images are the ones given in the docs, which is at https://access.redhat.com/articles/rhoai-supported-configs (section "Supported workbench images"). This table is generated (or written manually, before red-hat-data-services#476), based on our ImageStream manifests at https://github.com/opendatahub-io/notebooks/tree/main/manifests/base (and https://github.com/red-hat-data-services/notebooks/tree/main/manifests/base for RHOAI)

opendatahub.io/notebook-python-dependencies: |
[
{"name": "ROCm-PyTorch", "version": "2.4"},
{"name": "Tensorboard", "version": "2.16"},
{"name": "Kafka-Python-ng", "version": "2.2"},
{"name": "Matplotlib", "version": "3.9"},
{"name": "Numpy", "version": "2.1"},

We should not duplicate that information in the test ipython notebook files. The tests should check that what's in the image corresponds to what's given in the manifests.

@andyatmiami
Copy link
Contributor Author

andyatmiami commented Jan 12, 2025

So, given today in our tests we are duplicating this information...

do you want me to address your suggestion in this PR or a follow up PR?

For emphasis - I wholly agree with your proposal - I've just scope-creeped the heck outta this present PR already 😎

@jiridanek
Copy link
Member

do you want me to address your suggestion in this PR or a follow up PR?

We did not do enough PRs of this kind for me to discern a pattern, but on my previous team, there was preference for moving in small steps, e.g. first move script embedded in Makefile to it's own file unchanged, then modify said script.

Guess this is largely a mater of personal taste. And also of review speed, it's hard to split things up when reviews for the parts are not forthcoming.

@andyatmiami
Copy link
Contributor Author

/test all

@jstourac
Copy link
Member

I got to this for a brief review. I agree with Jiri to re-use what we define in our manifest files in imagestreams for the testing purposes instead of duplicating - but yeah, let's do it step by step as you wrote above. So, I am happy to see these tests being extracted from the Makefile making it a bit more readable. What I'm not a big fan here is that we're creating yet another bash script for this. I think we should utilize the existing pytest framework with the testcontainers for as much tests as possible so that we can run these in a unified environment.

Anyway, since this aims only to extract these out of Makefile and to keep this compatible with the existing approach, I'm not against this in general right now. But to the future, I think we should think to migrate this again - this time to the pytest where most of our tests will be, hopefully.

@jiridanek
Copy link
Member

I'm with Jan, pulling the bash code out of Makefile is a good improvement in itself.

@andyatmiami andyatmiami force-pushed the fix-e2e-asserts branch 2 times, most recently from ee21ef4 to 5dd5618 Compare January 17, 2025 19:51
@andyatmiami andyatmiami changed the title fix(test): provide ability to override versions used in asserts during papermill tests fix(test): ensure papermill tests run successfully for all supported notebooks Jan 17, 2025
@andyatmiami andyatmiami changed the title fix(test): ensure papermill tests run successfully for all supported notebooks RHOAIENG-16587: fix(test): ensure papermill tests run successfully for all supported notebooks Jan 17, 2025
@andyatmiami andyatmiami marked this pull request as ready for review January 17, 2025 19:57
@openshift-ci openshift-ci bot requested review from dibryant and jiridanek January 17, 2025 19:58
@opendatahub-io opendatahub-io deleted a comment from openshift-ci bot Jan 17, 2025
@opendatahub-io opendatahub-io deleted a comment from openshift-ci bot Jan 17, 2025
@andyatmiami
Copy link
Contributor Author

/test notebooks-ubi9-e2e-tests

(weird GCP errors on building the cluster - hopefully transient 🤞 )

@andyatmiami
Copy link
Contributor Author

PASSING TESTS!

image

@andyatmiami
Copy link
Contributor Author

/test notebook-rocm-jupyter-tf-ubi9-python-3-11-pr-image-mirror notebook-rocm-jupyter-pyt-ubi9-python-3-11-pr-image-mirror

@andyatmiami
Copy link
Contributor Author

/test notebook-rocm-jupyter-pyt-ubi9-python-3-11-pr-image-mirror

@@ -128,6 +128,7 @@ venv/
ENV/
env.bak/
venv.bak/
.DS_store
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🍎

@jiridanek
Copy link
Member

/lgtm

this is a huge improvement over what we had before, so yeah, I'm happy to have it merged

…notebooks

Our testing framework had a fundamental issue in that it would, for certain images, naively run the test suite of parent images for a child image.  In the case of `tensorflow` images, this caused problems because the (child) `tensorflow` image would **purposefully** _downgraded_ a package version given compatability issues. Furthermore, when attempting to verify the changes to address this core issue, numerous other bugs were encountered that required to be fixed.

Significant changes introduced in this commit:
- logic of the `test-%` Makefile target was moved into a shell script named `test_jupyter_with_papermill.sh`
- test resources required to be copied to pod are now retrieved via the locally checked out repo
    - previously there were pulled from a remote branch via `wget` (defaulting to `main` branch)
    - this ensure our PR checks are now always leveraging any updated files
- test_notebook.ipynb files now expect an `expected_versions.json` file to exist within the same directory.
    - `expected_versions.json` file is derived from the relevant `...-notebook-imagestream.yaml` manifest and leveraged when asserting on dependency versions
		- admittedly the duplication of various helper functions across all the notebook files is annoying - but helps to keep the `.ipynb` files self-contained
- `...-notebook-imagestream.yaml` manifest had annotations updated to include any package that had a unit test in `test_notebook.ipynb` asserting versions
- CUDA tensorflow unit test that converts to ONNX was updated to be actually functional

Minor changes introduced in this commit:
- use `ifdef OS` in Makefile to avoid warnings about undefined variable
- all `test_notebook.ipynb` files:
    - have an `id` attribute defined in metadata
		- specify the `nbformat` as `4.5`
- the more compute-intensive notebooks had `readinessProbe` and `livenessProbe` settings updated to be less aggressive
    - was observing liveness checks sporadically failing while the notebook was being tested - and this update seemed to help
- `trustyai` notebook now runs the minimal and datascience papermill tests (similar to `tensorflow` and `pytorch`) instead of including the test code within its own `test_noteook.ipynb` file
- various "quality of life" improvements where introduced into `test_jupyter_with_papermill.sh`
    - properly invoke tests for any valid/supported target
        - previously certain test targets required manual intervention in order to run end to end
    - improved logging (particularly when running with the `-x` flag)
		- more modular structure to hopefully improve readability
		- script now honors proper shell return code semantics (i.e. returns `0` on success)

It should also be noted that while most `intel` notebooks are now passing the papermill tests - there are issues with the `intel` `tensorflow` unit tests still.  More detail is captured in the JIRA ticket related to this commit.  However, we are imminently removing `intel` logic from the `notebooks` repo entirely... so I didn't wanna burn any more time trying to get that last notebook to pass as it will be removed shortly!

Further details on `expected_versions.json`:
- `yq` (which is installed via the `Makefile` is used to:
    - query the relevant imagestream manifest to parse out the `["opendatahub.io/notebook-software"]` and `["opendatahub.io/notebook-python-dependencies"]` annotations from the first element of the `spec.tags` attribute
		- inject name/version elements for `nbdime` and `nbgitpuller` (which are asserted against in `minimal` notebook tests)
		- convert this `yaml` list to a JSON object of the form: `{<package>: <version>}`
- this JSON object is then copied into the running notebook workload in the same directly that `test_notebook.ipynb` resides
- each `test_notebook.ipynb` has a couple helper functions defined to then interact with this file:
    - `def load_expected_versions() -> dict:`
		- `def get_expected_version(dependency_name: str) -> str:
- The argument provided to the `get_expected_version` function should match the `name` attribute of the JSON structure defined in the imagestream manifest

Related-to: https://issues.redhat.com/browse/RHOAIENG-16587
@jiridanek
Copy link
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 20, 2025
@atheo89
Copy link
Member

atheo89 commented Jan 20, 2025

/lgtm

@opendatahub-io opendatahub-io deleted a comment from openshift-ci bot Jan 20, 2025
@andyatmiami
Copy link
Contributor Author

/override ci/prow/intel-notebooks-e2e-tests ci/prow/notebooks-ubi9-e2e-tests ci/prow/rocm-notebooks-e2e-tests

Copy link
Contributor

openshift-ci bot commented Jan 20, 2025

@andyatmiami: Overrode contexts on behalf of andyatmiami: ci/prow/intel-notebooks-e2e-tests, ci/prow/notebooks-ubi9-e2e-tests, ci/prow/rocm-notebooks-e2e-tests

In response to this:

/override ci/prow/intel-notebooks-e2e-tests ci/prow/notebooks-ubi9-e2e-tests ci/prow/rocm-notebooks-e2e-tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@andyatmiami
Copy link
Contributor Author

/override ci/prow/notebook-cuda-jupyter-tf-ubi9-python-3-11-pr-image-mirror

(same old build timeout BS.. i'm working on getting to the bottom of that!)

Copy link
Contributor

openshift-ci bot commented Jan 20, 2025

@andyatmiami: Overrode contexts on behalf of andyatmiami: ci/prow/notebook-cuda-jupyter-tf-ubi9-python-3-11-pr-image-mirror

In response to this:

/override ci/prow/notebook-cuda-jupyter-tf-ubi9-python-3-11-pr-image-mirror

(same old build timeout BS.. i'm working on getting to the bottom of that!)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@andyatmiami
Copy link
Contributor Author

/override ci/prow/notebook-jupyter-intel-tf-ubi9-python-3-11-pr-image-mirror

(same old quay 502 issues and retries not handling it)

error: unable to copy layer sha256:45d40005ddd1bccd29a7b1abc2a88c08732ab10f9904c997cc0c322f77319bce to quay.io/opendatahub/workbench-images: received unexpected HTTP status: 502 Bad Gateway
info: Mirroring completed in 33.41s (4.343MB/s)
error: one or more errors occurred while uploading images
2025-01-20T17:08:23+00:00 ERROR Unable to mirror image
{"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:84","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2025-01-20T17:08:23Z"}
error: failed to execute wrapped command: exit status 1

Copy link
Contributor

openshift-ci bot commented Jan 20, 2025

@andyatmiami: Overrode contexts on behalf of andyatmiami: ci/prow/notebook-jupyter-intel-tf-ubi9-python-3-11-pr-image-mirror

In response to this:

/override ci/prow/notebook-jupyter-intel-tf-ubi9-python-3-11-pr-image-mirror

(same old quay 502 issues and retries not handling it)

error: unable to copy layer sha256:45d40005ddd1bccd29a7b1abc2a88c08732ab10f9904c997cc0c322f77319bce to quay.io/opendatahub/workbench-images: received unexpected HTTP status: 502 Bad Gateway
info: Mirroring completed in 33.41s (4.343MB/s)
error: one or more errors occurred while uploading images
2025-01-20T17:08:23+00:00 ERROR Unable to mirror image
{"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:84","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2025-01-20T17:08:23Z"}
error: failed to execute wrapped command: exit status 1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@andyatmiami
Copy link
Contributor Author

/override ci/prow/notebook-jupyter-pytorch-ubi9-python-3-11-pr-image-mirror

(same old build timeout BS.. i'm working on getting to the bottom of that!)

Copy link
Contributor

openshift-ci bot commented Jan 20, 2025

@andyatmiami: Overrode contexts on behalf of andyatmiami: ci/prow/notebook-jupyter-pytorch-ubi9-python-3-11-pr-image-mirror

In response to this:

/override ci/prow/notebook-jupyter-pytorch-ubi9-python-3-11-pr-image-mirror

(same old build timeout BS.. i'm working on getting to the bottom of that!)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@andyatmiami
Copy link
Contributor Author

/override ci/prow/notebook-rocm-jupyter-pyt-ubi9-python-3-11-pr-image-mirror

(same old build timeout BS.. i'm working on getting to the bottom of that!)

Copy link
Contributor

openshift-ci bot commented Jan 20, 2025

@andyatmiami: Overrode contexts on behalf of andyatmiami: ci/prow/notebook-rocm-jupyter-pyt-ubi9-python-3-11-pr-image-mirror

In response to this:

/override ci/prow/notebook-rocm-jupyter-pyt-ubi9-python-3-11-pr-image-mirror

(same old build timeout BS.. i'm working on getting to the bottom of that!)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@andyatmiami
Copy link
Contributor Author

/override ci/prow/notebook-rocm-jupyter-tf-ubi9-python-3-11-pr-image-mirror

(same old build timeout BS.. i'm working on getting to the bottom of that!)

Copy link
Contributor

openshift-ci bot commented Jan 20, 2025

@andyatmiami: Overrode contexts on behalf of andyatmiami: ci/prow/notebook-rocm-jupyter-tf-ubi9-python-3-11-pr-image-mirror

In response to this:

/override ci/prow/notebook-rocm-jupyter-tf-ubi9-python-3-11-pr-image-mirror

(same old build timeout BS.. i'm working on getting to the bottom of that!)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@andyatmiami
Copy link
Contributor Author

/override ci/prow/images

(same old build timeout BS.. i'm working on getting to the bottom of that!)

Copy link
Contributor

openshift-ci bot commented Jan 20, 2025

@andyatmiami: Overrode contexts on behalf of andyatmiami: ci/prow/images

In response to this:

/override ci/prow/images

(same old build timeout BS.. i'm working on getting to the bottom of that!)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@andyatmiami
Copy link
Contributor Author

/approve

Copy link
Contributor

openshift-ci bot commented Jan 20, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyatmiami

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 5c34753 into opendatahub-io:main Jan 20, 2025
29 checks passed
jiridanek pushed a commit to rhoai-ide-konflux/notebooks that referenced this pull request Mar 30, 2025
RHOAIENG-16587: fix(test): ensure papermill tests run successfully for all supported notebooks
jesuino pushed a commit to jesuino/notebooks that referenced this pull request Jun 17, 2025
…lux/component-updates/component-update-odh-workbench-jupyter-pytorch-rocm-py311-ubi9-n-v2-22

Update odh-workbench-jupyter-pytorch-rocm-py311-ubi9-n-v2-22 to 181b0a5
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants