[Bug] ModelTrainer drops TrainingJobName for PipelineSession, breaking use_custom_job_prefix on TrainingStep #5776

@rojo1997

Description
PySDK Version

  • [ ] PySDK V2 (2.x)
  • [x] PySDK V3 (3.x)

Describe the bug
When a ModelTrainer is executed under a PipelineSession (i.e. produces a TrainingStep), ModelTrainer._create_training_job_args explicitly removes training_job_name from the request before serializing it to PascalCase:

# sagemaker/train/model_trainer.py (sagemaker-train 1.8.0)
if boto3 or isinstance(self.sagemaker_session, PipelineSession):
    if isinstance(self.sagemaker_session, PipelineSession):
        training_request.pop("training_job_name", None)
    # Convert snake_case to PascalCase for AWS API
    pipeline_request = {to_pascal_case(k): v for k, v in training_request.items()}
    serialized_request = serialize(pipeline_request)
    return serialized_request

Because the key is popped, the resulting request dict has no TrainingJobName. Downstream, TrainingStep.arguments (with PipelineDefinitionConfig(use_custom_job_prefix=True)) relies on TrainingJobName being present in the request so the prefix is preserved (and trim_request_dict removes it when use_custom_job_prefix=False).

The net effect is that use_custom_job_prefix=True is silently ignored for TrainingStep when the step is built from a ModelTrainer: every pipeline execution produces a random auto-generated training job name instead of the configured base_job_name prefix.
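The effect of the pop can be illustrated with a self-contained sketch (the `to_pascal_case` helper is reimplemented here in simplified form for illustration; the real SDK helper may differ):

```python
# Simplified stand-in for the SDK's snake_case -> PascalCase helper.
def to_pascal_case(s: str) -> str:
    return "".join(part.capitalize() for part in s.split("_"))

# A minimal training request as ModelTrainer might build it (illustrative values).
training_request = {"training_job_name": "my-prefix-2024", "role_arn": "arn:..."}

# Current behavior: the key is popped before the PascalCase conversion,
# so the serialized request never contains TrainingJobName.
req = dict(training_request)
req.pop("training_job_name", None)
current = {to_pascal_case(k): v for k, v in req.items()}
# -> {"RoleArn": "arn:..."}; use_custom_job_prefix has nothing to preserve.

# Fixed behavior: keep the key; trim_request_dict can still strip it later
# when use_custom_job_prefix=False.
fixed = {to_pascal_case(k): v for k, v in training_request.items()}
# -> TrainingJobName survives, so the configured prefix can be honored.
```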

This is the same class of bug as #3991 and #4590 (which were about TransformStep), but for the new V3 ModelTrainer → TrainingStep path.

To reproduce

from sagemaker.core.workflow.pipeline import Pipeline
from sagemaker.core.workflow.pipeline_context import PipelineSession
from sagemaker.core.workflow.pipeline_definition_config import PipelineDefinitionConfig
from sagemaker.train.model_trainer import ModelTrainer
# ... build a ModelTrainer `trainer` with base_job_name="my-prefix" ...

pipeline_session = PipelineSession()
trainer.sagemaker_session = pipeline_session

step_args = trainer._create_training_job_args()
assert "TrainingJobName" in step_args, step_args  # FAILS — key was popped

pipeline = Pipeline(
    name="repro",
    steps=[...],  # TrainingStep built from trainer
    sagemaker_session=pipeline_session,
    pipeline_definition_config=PipelineDefinitionConfig(use_custom_job_prefix=True),
)
# Pipeline executions will NOT use "my-prefix-..." as the training job name.

Expected behavior
TrainingJobName should remain in the request dict so that PipelineDefinitionConfig(use_custom_job_prefix=True) produces training jobs named with the configured prefix. When use_custom_job_prefix=False, TrainingStep.arguments/trim_request_dict will strip the key as usual.

A minimal fix is to stop popping the key:

if boto3 or isinstance(self.sagemaker_session, PipelineSession):
    pipeline_request = {to_pascal_case(k): v for k, v in training_request.items()}
    serialized_request = serialize(pipeline_request)
    return serialized_request

As a workaround we currently monkey-patch _create_training_job_args to re-insert TrainingJobName = _get_unique_name(self.base_job_name) when the session is a PipelineSession.
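For reference, the workaround looks roughly like the following. This is a hedged sketch against a stand-in class so it runs standalone; `_get_unique_name` is approximated here, and the real SDK helper and request shape may differ:

```python
import uuid

def _get_unique_name(base: str) -> str:
    # Approximation of the SDK helper: configured prefix plus a unique suffix.
    return f"{base}-{uuid.uuid4().hex[:8]}"

class FakeTrainer:
    """Stand-in for ModelTrainer under a PipelineSession (for illustration)."""
    base_job_name = "my-prefix"

    def _create_training_job_args(self):
        # Mimics the buggy behavior: TrainingJobName already popped upstream.
        return {"RoleArn": "arn:..."}

# Monkey-patch: wrap the original method and re-insert the missing key.
_original = FakeTrainer._create_training_job_args

def _patched(self):
    request = _original(self)
    # Re-insert TrainingJobName so use_custom_job_prefix=True has a prefix
    # to preserve; setdefault leaves any existing value untouched.
    request.setdefault("TrainingJobName", _get_unique_name(self.base_job_name))
    return request

FakeTrainer._create_training_job_args = _patched

args = FakeTrainer()._create_training_job_args()
# args["TrainingJobName"] now starts with the configured "my-prefix-".
```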

Screenshots or logs
N/A — silent misbehavior; the pipeline executes but job names use the default random name instead of the configured prefix.

System information

  • SageMaker Python SDK version: sagemaker-train 1.8.0, sagemaker-core 2.8.0, sagemaker-mlops 1.8.0, sagemaker-serve 1.8.0 (also reproduces on 1.7.1 / 2.7.1)
  • Framework name or algorithm: custom (source_code via ModelTrainer)
  • Framework version: N/A
  • Python version: 3.13
  • CPU or GPU: CPU (irrelevant, bug is SDK-side)
  • Custom Docker image (Y/N): Y

Additional context
Related closed issues for other step types: #3991 (TransformStep), #4590 (TransformStep/ProcessingStep).
