Description
Describe the bug
I am trying to deploy an MLflow model to a new endpoint using a custom Docker container. Endpoint creation appears to start normally, and the container initially passes the ping health check. After a short while it stops responding and the deployment fails with the error: 'The primary container for production variant xxxxx did not pass the ping health check'. I have previously deployed multiple other models without running into this problem. The model itself loads and scores locally without issues.
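For anyone trying to narrow this down, the health check can be approximated against a locally running copy of the container. The following is only a minimal sketch under stated assumptions: the container is started locally with port 8080 published (the port SageMaker sends `/ping` requests to), the 2-second timeout mirrors the documented `/ping` request timeout, and the function names `probe_ping` and `watch` are hypothetical helpers, not part of any SDK.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe_ping(url, timeout_s=2.0):
    """Send one GET to `url`; return (status_code_or_None, elapsed_seconds).

    A status of None means no HTTP response arrived (refused, reset, timeout).
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a non-2xx status code.
        status = exc.code
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # No usable response at all.
        status = None
    return status, time.monotonic() - start


def watch(url, interval_s=5.0, attempts=60, timeout_s=2.0):
    """Probe repeatedly and stop at the first response that is not a 200."""
    results = []
    for i in range(attempts):
        status, elapsed = probe_ping(url, timeout_s)
        results.append((status, elapsed))
        print(f"attempt {i + 1}: status={status} elapsed={elapsed:.3f}s")
        if status != 200:
            break
        time.sleep(interval_s)
    return results


if __name__ == "__main__":
    # Assumed local address; adjust to wherever the container is reachable.
    watch("http://localhost:8080/ping")
```

If the local container keeps answering 200 well past the point where the endpoint fails, the problem is more likely specific to the hosted environment than to the image itself.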
To reproduce
A clear, step-by-step set of instructions to reproduce the bug.
Expected behavior
I would expect either a successful deployment or a specific error if the deployment fails.
Screenshots or logs
I've added the logs from CloudWatch below. Unfortunately, they aren't particularly informative:
[2023-09-11 16:29:10 +0000] [17720] [INFO] Starting gunicorn 20.1.0
[2023-09-11 16:29:10 +0000] [17720] [INFO] Listening at: http://127.0.0.1:8000 (17720)
[2023-09-11 16:29:10 +0000] [17720] [INFO] Using worker: gevent
[2023-09-11 16:29:10 +0000] [17728] [INFO] Booting worker with pid: 17728
[2023-09-11 16:29:10 +0000] [17729] [INFO] Booting worker with pid: 17729
[2023-09-11 16:29:10 +0000] [17730] [INFO] Booting worker with pid: 17730
[2023-09-11 16:29:10 +0000] [17731] [INFO] Booting worker with pid: 17731
10.32.0.2 - - [11/Sep/2023:16:29:15 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:18 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:23 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:28 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:33 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:38 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:43 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:48 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:53 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:58 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:03 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:08 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:13 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:18 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:23 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:28 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:33 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
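Since the access-log lines above are the only signal available, one way to get a little more out of them is to extract the `/ping` timestamps and look at when the last successful ping happened and how far apart the pings were. This is a rough sketch that assumes the log format shown above; `parse_pings` and `summarize` are hypothetical helper names.

```python
import re
from datetime import datetime

# Matches access-log lines such as:
# 10.32.0.2 - - [11/Sep/2023:16:29:15 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
# gunicorn startup lines do not match because no request string follows the bracket.
PING_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "GET /ping HTTP/1\.1" (?P<status>\d{3})')


def parse_pings(lines):
    """Return a list of (timestamp, status) for every /ping access-log line."""
    pings = []
    for line in lines:
        match = PING_RE.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
            pings.append((ts, int(match.group("status"))))
    return pings


def summarize(pings):
    """Summarize ping count, failures, largest gap, and last ping time."""
    gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(pings, pings[1:])]
    return {
        "count": len(pings),
        "non_200": sum(1 for _, status in pings if status != 200),
        "max_gap_s": max(gaps) if gaps else 0.0,
        "last_ping": pings[-1][0] if pings else None,
    }


if __name__ == "__main__":
    sample = [
        "[2023-09-11 16:29:10 +0000] [17720] [INFO] Starting gunicorn 20.1.0",
        '10.32.0.2 - - [11/Sep/2023:16:29:15 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"',
        '10.32.0.2 - - [11/Sep/2023:16:29:18 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"',
        '10.32.0.2 - - [11/Sep/2023:16:29:23 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"',
    ]
    print(summarize(parse_pings(sample)))
```

Comparing `last_ping` with the time the endpoint reported the failure would show whether requests stopped arriving at the container or whether the container stopped logging responses.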
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 1.24.0
- Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): MLflow/CatBoost
- Framework version: 2.3.2/1.1.1
- Python version: 3.10.4
- CPU or GPU: CPU
- Custom Docker image (Y/N): Y
Additional context
Add any other context about the problem here.