[RL] [Ray] Update #568 errors #578

Closed
@Guilherme-B

First of all, thank you for updating the Ray container to add checkpoint support; it was definitely missing and rather important.

Today I tested the changes from #568, and a few things seem to be broken.
Both the checkpoints and the outputs appear to be saved properly. However, the overall training job still fails with the following error:

TERMINATED trials:
- PPO_RoboschoolReacher-v1_0: TERMINATED [pid=92], 1774 s, 65 iter, 1638000 ts, 18.4 rew
Saved model configuration.
Traceback (most recent call last):
  File "train-reacher.py", line 51, in <module>
    MyLauncher().train_main()
  File "/opt/ml/code/sagemaker_rl/ray_launcher.py", line 250, in train_main
    launcher.launch()
  File "/opt/ml/code/sagemaker_rl/ray_launcher.py", line 242, in launch
    config=config)
  File "/opt/ml/code/sagemaker_rl/ray_launcher.py", line 208, in save_checkpoint_and_serving_model
    self.copy_checkpoints_to_model_output()
  File "/opt/ml/code/sagemaker_rl/ray_launcher.py", line 183, in copy_checkpoints_to_model_output
    raise RuntimeError("Failed to save checkpoint files - .tune_metadata or .extra_data")
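
To narrow down which of the two files named in the error is actually absent, a small diagnostic along these lines could be run against the checkpoint directory. This is only a sketch: `missing_companions` is a hypothetical helper, not part of `ray_launcher.py`, and it assumes the companion files sit in the same directory as the checkpoint and end with the suffixes quoted in the error message.

```python
import os

# Suffixes taken from the RuntimeError message in the traceback above.
COMPANION_SUFFIXES = (".tune_metadata", ".extra_data")


def missing_companions(checkpoint_dir, suffixes=COMPANION_SUFFIXES):
    """Return the suffixes that no file in checkpoint_dir ends with.

    Hypothetical helper: it only inspects file names and does not
    validate the contents of the checkpoint files.
    """
    names = os.listdir(checkpoint_dir)
    return [s for s in suffixes if not any(n.endswith(s) for n in names)]
```

An empty list would mean both kinds of file are present, which would point at the copy step rather than at the checkpoint writer.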

The evaluation process also fails, because it cannot find the checkpoint files:

LocalMultiGPUOptimizer devices ['/cpu:0']
Traceback (most recent call last):
  File "evaluate-ray.py", line 110, in <module>
    run(args, parser)
  File "evaluate-ray.py", line 79, in run
    agent.restore(args.checkpoint)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 253, in restore
    self._restore(checkpoint_path)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/ppo/ppo.py", line 153, in _restore
    extra_data = pickle.load(open(checkpoint_path + ".extra_data", "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/model/checkpoint.extra_data'
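
Since the traceback shows `_restore` opening `checkpoint_path + ".extra_data"`, the evaluation script could check for that file before calling `agent.restore`, so that the failure names every missing file at once. The following is a minimal sketch under that assumption; `check_restorable` is a hypothetical helper and is not part of `evaluate-ray.py` or Ray.

```python
import os


def check_restorable(checkpoint_path, suffixes=(".extra_data",)):
    """Verify the checkpoint and its companion files exist before restoring.

    Hypothetical helper. The ".extra_data" suffix comes from the traceback
    above; pass additional suffixes if other companion files are required.
    """
    required = [checkpoint_path] + [checkpoint_path + s for s in suffixes]
    missing = [p for p in required if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(
            "Checkpoint incomplete, missing: " + ", ".join(missing)
        )
    return checkpoint_path
```

In the evaluation script this might be used as `agent.restore(check_restorable(args.checkpoint))`, which leaves the restore call itself unchanged.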

All of the above happened while running an exact, unmodified copy of the repository.

Side note 1 - Remember to update the examples available through SageMaker's notebook instances so that they mirror these changes; otherwise some people might mistakenly base their code on the old sources.

Side note 2 - Callbacks can be quite a useful tool as well; I am not sure whether support for them is planned.
