SageMaker SDK Bug Report: HyperparameterTuner Missing Container Mode Support #5184

Open
@josh-gree

Describe the bug
HyperparameterTuner does not preserve container mode parameters (container_entry_point and container_arguments) when creating training jobs, causing tuning jobs to fail. Individual training jobs work correctly with container mode, but hyperparameter tuning jobs lose the container configuration and fall back to script mode logic, resulting in failures.

To reproduce

from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

# Create estimator with container mode
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_type="ml.m5.large",
    instance_count=1,
    container_entry_point=["python", "-m", "my_module"],
    container_arguments=["train", "model1"],
)

# Create hyperparameter tuner
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="accuracy",
    objective_type="Maximize",
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(0.001, 0.1)
    },
    max_jobs=2,
    max_parallel_jobs=1,
)

# The child training jobs fail because the container parameters are not passed through
tuner.fit()

Expected behavior
The hyperparameter tuning job should preserve the container mode configuration and set ContainerEntrypoint and ContainerArguments in the AlgorithmSpecification of individual training jobs, just like when calling estimator.fit() directly.

Screenshots or logs
Individual training job within tuning job shows missing container parameters:

"AlgorithmSpecification": {
    "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
    "TrainingInputMode": "File",
    "MetricDefinitions": [...],
    "EnableSageMakerMetricsTimeSeries": false
    // Missing: ContainerEntrypoint and ContainerArguments
}

Training jobs fail with:

AlgorithmError: Framework Error: 
AttributeError: 'NoneType' object has no attribute 'endswith'
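
To check whether a given training job shows the same symptom, the job description can be inspected for the two container mode keys. The following is a minimal sketch: `missing_container_keys` and `CONTAINER_KEYS` are hypothetical names introduced here, and the sample dictionary only mirrors the shape of the excerpt above.

```python
from typing import Any, Dict, List

# Keys that container mode is expected to set in AlgorithmSpecification.
CONTAINER_KEYS = ("ContainerEntrypoint", "ContainerArguments")


def missing_container_keys(job_description: Dict[str, Any]) -> List[str]:
    """Return the container mode keys absent from a training job description."""
    spec = job_description.get("AlgorithmSpecification", {})
    return [key for key in CONTAINER_KEYS if key not in spec]


# Description shaped like the log excerpt above (metric definitions omitted).
tuning_child_job = {
    "AlgorithmSpecification": {
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
        "TrainingInputMode": "File",
        "EnableSageMakerMetricsTimeSeries": False,
    }
}

print(missing_container_keys(tuning_child_job))
# → ['ContainerEntrypoint', 'ContainerArguments']
```

In practice the dictionary would presumably come from the boto3 SageMaker client's `describe_training_job(TrainingJobName=...)` call for one of the tuning job's child training jobs.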

System information

  • SageMaker Python SDK version: 2.244.2
  • Framework name: Custom container (Estimator class)
  • Framework version: N/A
  • Python version: 3.10
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): Y

Additional context

Root cause analysis

The issue is in two locations in the SDK:

1. sagemaker/job.py - Missing container parameter extraction

The _Job._load_config() method (lines 117-124) returns only the basic job configuration and omits the container mode parameters:

return {
    "input_config": input_config,
    "role": role,
    "output_config": output_config,
    "resource_config": resource_config,
    "stop_condition": stop_condition,
    "vpc_config": vpc_config,
    # Missing: container_entry_point, container_arguments
}

2. sagemaker/session.py - Missing container parameter handling

The _map_training_config() method (line 3584+) does not accept container_entry_point or container_arguments in its signature, and its AlgorithmSpecification construction (lines 3685-3694) only includes:

algorithm_spec = {"TrainingInputMode": input_mode}
if metric_definitions is not None:
    algorithm_spec["MetricDefinitions"] = metric_definitions

if algorithm_arn:
    algorithm_spec["AlgorithmName"] = algorithm_arn
else:
    algorithm_spec["TrainingImage"] = image_uri

# Missing: ContainerEntrypoint and ContainerArguments

Comparison with working code

Individual training jobs work because session.train() correctly handles container parameters (lines 1266-1270):

if container_entry_point is not None:
    train_request["AlgorithmSpecification"]["ContainerEntrypoint"] = container_entry_point

if container_arguments is not None:
    train_request["AlgorithmSpecification"]["ContainerArguments"] = container_arguments

Code path analysis

Working path (individual training jobs):

  1. estimator.fit() → session.train() → ✅ Includes container parameters

Broken path (hyperparameter tuning):

  1. tuner.fit() → _TuningJob._prepare_training_config()
  2. _Job._load_config() → ❌ Drops container parameters
  3. session._map_training_config() → ❌ Doesn't handle container parameters
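
The difference between the two paths can be illustrated with a small simulation. This is a simplified model, not the SDK code: `train_path`, `load_config`, and `tuning_path` are hypothetical stand-ins for the methods named above, and the estimator is a plain namespace that carries only the attributes relevant here.

```python
from types import SimpleNamespace

# Stand-in for an Estimator configured for container mode.
estimator = SimpleNamespace(
    image_uri="my-image:latest",
    container_entry_point=["python", "-m", "my_module"],
    container_arguments=["train", "model1"],
)


def train_path(est):
    """Simplified model of estimator.fit() -> session.train()."""
    request = {"AlgorithmSpecification": {"TrainingImage": est.image_uri}}
    if est.container_entry_point is not None:
        request["AlgorithmSpecification"]["ContainerEntrypoint"] = est.container_entry_point
    if est.container_arguments is not None:
        request["AlgorithmSpecification"]["ContainerArguments"] = est.container_arguments
    return request


def load_config(est):
    """Simplified model of _Job._load_config(): no container keys are returned."""
    return {"role": "example-role"}


def tuning_path(est):
    """Simplified model of tuner.fit() -> _load_config() -> _map_training_config()."""
    config = load_config(est)
    # Nothing downstream re-reads the container attributes, so they are lost here.
    return {
        "RoleArn": config["role"],
        "AlgorithmSpecification": {"TrainingImage": est.image_uri},
    }


print(sorted(train_path(estimator)["AlgorithmSpecification"]))
print(sorted(tuning_path(estimator)["AlgorithmSpecification"]))
```

The first line printed lists all three keys, while the second lists only TrainingImage, which matches the AlgorithmSpecification shown in the logs above.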

Verification

  • ✅ Container mode works with estimator.fit() (individual training jobs)
  • ❌ Container mode fails with tuner.fit() (hyperparameter tuning)
  • ✅ Script mode works with tuner.fit()

Impact

This prevents users from using container mode with hyperparameter tuning, forcing them to use script mode for tuning jobs even when their training logic is containerized.

Suggested fix

  1. Update _Job._load_config() to extract container parameters from the estimator:
# Build the config dict first instead of returning it directly:
config = {
    "input_config": input_config,
    "role": role,
    "output_config": output_config,
    "resource_config": resource_config,
    "stop_condition": stop_condition,
    "vpc_config": vpc_config,
}

# Add container mode parameters
entry_point = getattr(estimator, "container_entry_point", None)
if entry_point:
    config["container_entry_point"] = entry_point

arguments = getattr(estimator, "container_arguments", None)
if arguments:
    config["container_arguments"] = arguments

return config
  2. Update _map_training_config() signature to accept container parameters and include them in AlgorithmSpecification:
def _map_training_config(
    cls,
    static_hyperparameters,
    input_mode,
    role,
    output_config,
    stop_condition,
    # ... existing params ...
    container_entry_point=None,  # Add this
    container_arguments=None,    # Add this
):
    # ... existing code ...
    
    # Add to AlgorithmSpecification:
    if container_entry_point is not None:
        algorithm_spec["ContainerEntrypoint"] = container_entry_point
        
    if container_arguments is not None:
        algorithm_spec["ContainerArguments"] = container_arguments

This would align the hyperparameter tuning code path with the working individual training job implementation.
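
A regression check for the second part of the fix could look roughly like the following. This is a sketch under stated assumptions: `build_algorithm_spec` is a hypothetical standalone function that copies the AlgorithmSpecification logic quoted in this issue and adds the proposed container handling. It is not the SDK's actual method.

```python
from typing import Any, Dict, List, Optional


def build_algorithm_spec(
    input_mode: str,
    image_uri: Optional[str] = None,
    algorithm_arn: Optional[str] = None,
    metric_definitions: Optional[List[Dict[str, str]]] = None,
    container_entry_point: Optional[List[str]] = None,
    container_arguments: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical standalone version of the AlgorithmSpecification logic."""
    algorithm_spec: Dict[str, Any] = {"TrainingInputMode": input_mode}
    if metric_definitions is not None:
        algorithm_spec["MetricDefinitions"] = metric_definitions

    if algorithm_arn:
        algorithm_spec["AlgorithmName"] = algorithm_arn
    else:
        algorithm_spec["TrainingImage"] = image_uri

    # Proposed addition: forward container mode settings when provided.
    if container_entry_point is not None:
        algorithm_spec["ContainerEntrypoint"] = container_entry_point
    if container_arguments is not None:
        algorithm_spec["ContainerArguments"] = container_arguments
    return algorithm_spec


spec = build_algorithm_spec(
    input_mode="File",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
    container_entry_point=["python", "-m", "my_module"],
    container_arguments=["train", "model1"],
)
assert spec["ContainerEntrypoint"] == ["python", "-m", "my_module"]
assert spec["ContainerArguments"] == ["train", "model1"]

# Script mode callers pass neither argument, so their spec is unchanged.
legacy = build_algorithm_spec(input_mode="File", image_uri="my-image:latest")
assert "ContainerEntrypoint" not in legacy and "ContainerArguments" not in legacy
```

Defaulting both new parameters to None keeps existing script mode callers unaffected, which is the same convention session.train() already follows.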
