Describe the bug
HyperparameterTuner does not preserve container mode parameters (container_entry_point and container_arguments) when creating training jobs, causing tuning jobs to fail. Individual training jobs work correctly with container mode, but hyperparameter tuning jobs lose the container configuration and fall back to script mode logic, resulting in failures.
To reproduce
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

# Create estimator with container mode
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_type="ml.m5.large",
    instance_count=1,
    container_entry_point=["python", "-m", "my_module"],
    container_arguments=["train", "model1"],
)

# Create hyperparameter tuner
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="accuracy",
    objective_type="Maximize",
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(0.001, 0.1)
    },
    max_jobs=2,
    max_parallel_jobs=1,
)

# This will fail - individual training jobs missing container parameters
tuner.fit()
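For contrast, the same estimator trains successfully when used directly. A minimal sketch (the channel name and S3 URI are illustrative):

# Works: the resulting training job keeps ContainerEntrypoint/ContainerArguments.
# (Channel name and S3 URI are illustrative.)
estimator.fit({"training": "s3://my-bucket/train/"}, wait=False)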
Expected behavior
The hyperparameter tuning job should preserve the container mode configuration and set ContainerEntrypoint and ContainerArguments in the AlgorithmSpecification of individual training jobs, just like when calling estimator.fit() directly.
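With the reproduction above, the expected AlgorithmSpecification of each individual training job would look roughly like this (values taken from the estimator in the example; a sketch, not captured output):

"AlgorithmSpecification": {
    "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
    "TrainingInputMode": "File",
    "ContainerEntrypoint": ["python", "-m", "my_module"],
    "ContainerArguments": ["train", "model1"]
}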
Screenshots or logs
Individual training job within tuning job shows missing container parameters:
"AlgorithmSpecification": {
"TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
"TrainingInputMode": "File",
"MetricDefinitions": [...],
"EnableSageMakerMetricsTimeSeries": false
// Missing: ContainerEntrypoint and ContainerArguments
}
Training jobs fail with:
AlgorithmError: Framework Error:
AttributeError: 'NoneType' object has no attribute 'endswith'
System information
- SageMaker Python SDK version: 2.244.2
- Framework name: Custom container (Estimator class)
- Framework version: N/A
- Python version: 3.10
- CPU or GPU: CPU
- Custom Docker image (Y/N): Y
Additional context
Root cause analysis
The issue is in two locations in the SDK:
1. sagemaker/job.py - Missing container parameter extraction

The _Job._load_config() method (lines 117-124) only extracts basic configuration and ignores container mode parameters:
return {
    "input_config": input_config,
    "role": role,
    "output_config": output_config,
    "resource_config": resource_config,
    "stop_condition": stop_condition,
    "vpc_config": vpc_config,
    # Missing: container_entry_point, container_arguments
}
2. sagemaker/session.py - Missing container parameter handling

The _map_training_config() method (line 3584+) doesn't accept container parameters in its signature and doesn't include them in the AlgorithmSpecification (lines 3685-3694). The method signature is missing container_entry_point and container_arguments parameters, and the AlgorithmSpecification construction only includes:
algorithm_spec = {"TrainingInputMode": input_mode}
if metric_definitions is not None:
algorithm_spec["MetricDefinitions"] = metric_definitions
if algorithm_arn:
algorithm_spec["AlgorithmName"] = algorithm_arn
else:
algorithm_spec["TrainingImage"] = image_uri
# Missing: ContainerEntrypoint and ContainerArguments
Comparison with working code
Individual training jobs work because session.train() correctly handles container parameters (lines 1266-1270):
if container_entry_point is not None:
    train_request["AlgorithmSpecification"]["ContainerEntrypoint"] = container_entry_point
if container_arguments is not None:
    train_request["AlgorithmSpecification"]["ContainerArguments"] = container_arguments
Code path analysis
Working path (individual training jobs):
estimator.fit() → session.train() → ✅ Includes container parameters

Broken path (hyperparameter tuning):
tuner.fit() → _TuningJob._prepare_training_config()
- → _Job._load_config() → ❌ Drops container parameters
- → session._map_training_config() → ❌ Doesn't handle container parameters
Verification
- ✅ Container mode works with estimator.fit() (individual training jobs)
- ❌ Container mode fails with tuner.fit() (hyperparameter tuning)
- ✅ Script mode works with tuner.fit()
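One way to confirm which keys the spawned training jobs actually received is to inspect their AlgorithmSpecification via the AWS SDK; a sketch (the tuning job name is illustrative):

import boto3

sm = boto3.client("sagemaker")
jobs = sm.list_training_jobs_for_hyper_parameter_tuning_job(
    HyperParameterTuningJobName="my-tuning-job"  # illustrative name
)
for summary in jobs["TrainingJobSummaries"]:
    desc = sm.describe_training_job(TrainingJobName=summary["TrainingJobName"])
    spec = desc["AlgorithmSpecification"]
    # On the broken path both keys are absent; with the fix they should be present.
    print(spec.get("ContainerEntrypoint"), spec.get("ContainerArguments"))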
Impact
This prevents users from using container mode with hyperparameter tuning, forcing them to use script mode for tuning jobs even when their training logic is containerized.
Suggested fix
- Update _Job._load_config() to extract container parameters from the estimator:
# Add to the return dict:
config = {
    "input_config": input_config,
    "role": role,
    "output_config": output_config,
    "resource_config": resource_config,
    "stop_condition": stop_condition,
    "vpc_config": vpc_config,
}

# Add container mode parameters
if hasattr(estimator, "container_entry_point") and estimator.container_entry_point:
    config["container_entry_point"] = estimator.container_entry_point
if hasattr(estimator, "container_arguments") and estimator.container_arguments:
    config["container_arguments"] = estimator.container_arguments

return config
- Update the _map_training_config() signature to accept container parameters and include them in the AlgorithmSpecification:
def _map_training_config(
    cls,
    static_hyperparameters,
    input_mode,
    role,
    output_config,
    stop_condition,
    # ... existing params ...
    container_entry_point=None,  # Add this
    container_arguments=None,  # Add this
):
    # ... existing code ...

    # Add to AlgorithmSpecification:
    if container_entry_point is not None:
        algorithm_spec["ContainerEntrypoint"] = container_entry_point
    if container_arguments is not None:
        algorithm_spec["ContainerArguments"] = container_arguments
This would align the hyperparameter tuning code path with the working individual training job implementation.