Description
First of all, thank you for updating the Ray container to add checkpoint support; it was definitely missing and rather important.
Today I tested those modifications (#568), and a few things seem to be out of order:
Both the checkpoints and the outputs seem to be saved properly. However, the overall process fails with the following error:
TERMINATED trials:
- PPO_RoboschoolReacher-v1_0: TERMINATED [pid=92], 1774 s, 65 iter, 1638000 ts, 18.4 rew
Saved model configuration.
Traceback (most recent call last):
  File "train-reacher.py", line 51, in <module>
    MyLauncher().train_main()
  File "/opt/ml/code/sagemaker_rl/ray_launcher.py", line 250, in train_main
    launcher.launch()
  File "/opt/ml/code/sagemaker_rl/ray_launcher.py", line 242, in launch
    config=config)
  File "/opt/ml/code/sagemaker_rl/ray_launcher.py", line 208, in save_checkpoint_and_serving_model
    self.copy_checkpoints_to_model_output()
  File "/opt/ml/code/sagemaker_rl/ray_launcher.py", line 183, in copy_checkpoints_to_model_output
    raise RuntimeError("Failed to save checkpoint files - .tune_metadata or .extra_data")
The evaluation process fails as well: it cannot find the checkpoint's .extra_data file:
LocalMultiGPUOptimizer devices ['/cpu:0']
Traceback (most recent call last):
  File "evaluate-ray.py", line 110, in <module>
    run(args, parser)
  File "evaluate-ray.py", line 79, in run
    agent.restore(args.checkpoint)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 253, in restore
    self._restore(checkpoint_path)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/ppo/ppo.py", line 153, in _restore
    extra_data = pickle.load(open(checkpoint_path + ".extra_data", "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/model/checkpoint.extra_data'
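Both tracebacks point at the same companion files, so a quick check of what actually sits next to the checkpoint may help narrow this down. The following is only a minimal sketch: the helper names `missing_companion_files` and `list_checkpoint_dir` are hypothetical, and the suffix list is taken from the two error messages above, so it is an assumption that these are the only companion files Ray expects.

```python
import os

# Suffixes copied from the error messages in this issue. Assumption: these
# are the only companion files that need to sit next to the checkpoint.
COMPANION_SUFFIXES = (".tune_metadata", ".extra_data")


def missing_companion_files(checkpoint_path, suffixes=COMPANION_SUFFIXES):
    """Return the companion file paths that do not exist for a checkpoint."""
    return [
        checkpoint_path + suffix
        for suffix in suffixes
        if not os.path.isfile(checkpoint_path + suffix)
    ]


def list_checkpoint_dir(checkpoint_path):
    """Return the sorted file names in the checkpoint's directory, or [] if it is absent."""
    directory = os.path.dirname(checkpoint_path) or "."
    if not os.path.isdir(directory):
        return []
    return sorted(os.listdir(directory))
```

Calling `missing_companion_files("/opt/ml/input/data/model/checkpoint")` just before `agent.restore(args.checkpoint)` in evaluate-ray.py, and printing `list_checkpoint_dir(...)` for the same path, would show whether the files were never copied or were copied under a different name.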
All of the above happened while running an unmodified copy of the repository.
Side note 1 - Remember to update the examples available through SageMaker's notebook instances to mirror these changes; otherwise some people might mistakenly bind their code to the old sources.
Side note 2 - Callbacks can be quite a useful tool as well; I am not sure whether support for them is planned.