Endpoint failing after initially passing ping health check #4115

Open
@nfarley-soaren

Describe the bug
I am trying to deploy an MLflow model to a new endpoint using a custom Docker container. Endpoint creation initially seems to proceed without any problems, and the container even passes the first ping health checks. After a little while it stops responding and I get the error: 'The primary container for production variant xxxxx did not pass the ping health check'. I have previously deployed multiple other models without running into this problem. The model itself loads and scores locally without issues.
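For anyone trying to reproduce the "passes at first, then stops responding" behaviour outside SageMaker, a probe along these lines could be pointed at a locally running copy of the container. This is only a minimal sketch: the URL `http://localhost:8080/ping` is an assumption about how the container is exposed locally, the function names `probe` and `_http_status` are hypothetical, and the 2-second timeout mirrors the response deadline SageMaker documents for `/ping`.

```python
import time
import urllib.request


def _http_status(url, timeout_s):
    """Return the HTTP status code for a GET on url (raises on errors/timeouts)."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.status


def probe(url, attempts, interval_s, timeout_s=2.0, fetch=None, sleep=time.sleep):
    """Call the health endpoint repeatedly and record each outcome.

    Returns a list of (attempt_index, status_or_error_repr, elapsed_seconds).
    `fetch` and `sleep` are injectable so the loop can be exercised offline.
    """
    fetch = fetch or _http_status
    results = []
    for i in range(attempts):
        start = time.monotonic()
        try:
            status = fetch(url, timeout_s)
        except Exception as exc:  # timeouts, refused connections, HTTP errors
            status = repr(exc)
        results.append((i, status, time.monotonic() - start))
        if i < attempts - 1:
            sleep(interval_s)
    return results


if __name__ == "__main__":
    # Assumed local address; adjust to wherever the container is published.
    for attempt, status, elapsed in probe("http://localhost:8080/ping", 5, 5.0):
        print(f"attempt={attempt} status={status} elapsed={elapsed:.3f}s")
```

Running it for longer than the endpoint stayed healthy would show whether the container also stops answering locally, and whether the failure is a timeout or a refused connection.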

To reproduce
A clear, step-by-step set of instructions to reproduce the bug.

Expected behavior
I would expect either a successful deployment or a specific error if the deployment fails.

Screenshots or logs
I've added the logs from CloudWatch below. Unfortunately, they aren't particularly informative:

[2023-09-11 16:29:10 +0000] [17720] [INFO] Starting gunicorn 20.1.0

[2023-09-11 16:29:10 +0000] [17720] [INFO] Listening at: http://127.0.0.1:8000 (17720)
[2023-09-11 16:29:10 +0000] [17720] [INFO] Using worker: gevent
[2023-09-11 16:29:10 +0000] [17728] [INFO] Booting worker with pid: 17728
[2023-09-11 16:29:10 +0000] [17729] [INFO] Booting worker with pid: 17729
[2023-09-11 16:29:10 +0000] [17730] [INFO] Booting worker with pid: 17730
[2023-09-11 16:29:10 +0000] [17731] [INFO] Booting worker with pid: 17731
10.32.0.2 - - [11/Sep/2023:16:29:15 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:18 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:23 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:28 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:33 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:38 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:43 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:48 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:53 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:58 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:03 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:08 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:13 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:18 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:23 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:28 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:33 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
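Since the excerpt contains only successful pings, it can help to pin down exactly when the last `200` was served and how regular the checks were. The following is a rough sketch for summarizing access-log lines in the format shown above; the function name `summarize_pings` and the regular expression are illustrative, not part of any SageMaker or gunicorn tooling.

```python
import re
from datetime import datetime

# Matches access-log lines such as:
# 10.32.0.2 - - [11/Sep/2023:16:29:15 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
PING_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "GET /ping HTTP/1\.1" (?P<status>\d{3}) ')


def summarize_pings(log_text):
    """Return (count, last_ok_timestamp, max_gap_seconds) for successful /ping requests."""
    stamps = []
    for line in log_text.splitlines():
        match = PING_RE.search(line)
        if match and match.group("status") == "200":
            stamps.append(
                datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
            )
    if not stamps:
        return 0, None, None
    # Seconds between consecutive successful pings.
    gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return len(stamps), stamps[-1], max(gaps, default=0.0)
```

Applied to the excerpt above, this reports 17 successful pings, the last at 16:30:33 +0000, with no gap longer than 5 seconds. In other words, the captured log ends while the container is still answering, so the failure itself is not visible in it.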

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 1.24.0
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): MLflow/CatBoost
  • Framework version: 2.3.2/1.1.1
  • Python version: 3.10.4
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): Y

Additional context
Add any other context about the problem here.
