Commit eb48aba

jarednielsen authored and rahul003 committed

Docs (aws#60)

* Rework README to point directly to framework pages
* WIP
* WIP
* Rename documentation to docs
* Updated sagemaker.md
* Move README.md to top-level and delete old
* Add 'how-to-use' to tensorflow.md
1 parent f8661a8 commit eb48aba

File tree

11 files changed: +283 −258 lines

README.md

Lines changed: 98 additions & 23 deletions
````diff
@@ -1,32 +1,107 @@
-## Tornasole
+# Sagemaker Debugger

-Tornasole is an upcoming AWS service designed to be a debugger
-for machine learning models. It lets you go beyond just looking
-at scalars like losses and accuracies during training and
-gives you full visibility into all tensors 'flowing through the graph'
-during training or inference.
+- [Overview](#overview)
+- [Example: SageMaker Zero-Code-Change](#example-sagemaker-zero-code-change)
+- [Example: Running Locally](#example-running-locally)
+- [How It Works](#how-it-works)

-Using Tornasole is a two step process:
+## Overview
+Sagemaker Debugger is an AWS service to automatically debug your machine learning training process.
+It helps you develop better, faster, cheaper models by catching common errors quickly. It supports
+TensorFlow, PyTorch, MXNet, and XGBoost on Python 3.6+.

-### Saving tensors
+- Zero-code-change experience on SageMaker and AWS Deep Learning containers.
+- Automated anomaly detection and state assertions.
+- Real-time training job monitoring and visibility into any tensor value.
+- Distributed training and TensorBoard support.

-This needs the `tornasole` package built for the appropriate framework.
-It allows you to collect the tensors you want at the frequency
-that you want, and save them for analysis.
-Please follow the appropriate Readme page to install the correct version.
+There are two ways to use it: automatic mode and configurable mode.

+- Automatic mode: No changes to your training script. Specify the rules you want and launch a SageMaker Estimator job.
+- Configurable mode: More powerful. Lets you specify exactly which tensors and collections to save, via the Python API within your script.

-#### [Tornasole TensorFlow](docs/tensorflow/README.md)
-#### [Tornasole MXNet](docs/mxnet/README.md)
-#### [Tornasole PyTorch](docs/pytorch/README.md)
-#### [Tornasole XGBoost](docs/xgboost/README.md)

-### Analysis
-Please refer **[this page](docs/rules/README.md)** for more details about how to analyze.
-The analysis of these tensors can be done on a separate machine in parallel with the training job.
+## Example: SageMaker Zero-Code-Change
+This example uses the zero-script-change experience, where you can use your training script as-is.
+See the [example notebooks](https://link.com) for more details.
+```python
+import sagemaker
+from sagemaker.debugger import rule_configs, Rule, CollectionConfig

-## ContactUs
-We would like to hear from you. If you have any question or feedback, please reach out to us [email protected]
+# Choose a built-in rule to monitor your training job
+rule = Rule.sagemaker(
+    rule_configs.exploding_tensor(),
+    rule_parameters={
+        "tensor_regex": ".*"
+    },
+    collections_to_save=[
+        CollectionConfig(name="weights"),
+        CollectionConfig(name="losses"),
+    ],
+)

-## License
-This library is licensed under the Apache 2.0 License.
+# Pass the rule to the estimator
+sagemaker_simple_estimator = sagemaker.tensorflow.TensorFlow(
+    entry_point="script.py",
+    role=sagemaker.get_execution_role(),
+    framework_version="1.15",
+    py_version="py3",
+    rules=[rule],
+)
+
+sagemaker_simple_estimator.fit()
+```
+
+That's it! SageMaker will automatically monitor your training job for you and create a CloudWatch
+event if you run into exploding tensor values.
+
+If you want greater configuration and control, we offer that too: use the configurable mode described above.
+
+## Example: Running Locally
+Requires Python 3.6+; this example uses tf.keras. Run
+```
+pip install smdebug
+```
+
+To use Sagemaker Debugger, simply add a callback hook:
+```python
+import smdebug.tensorflow as smd
+hook = smd.KerasHook(out_dir=args.out_dir)
+
+model = tf.keras.models.Sequential([ ... ])
+model.compile(
+    optimizer='adam',
+    loss='sparse_categorical_crossentropy',
+)
+
+# Add the hook as a callback
+model.fit(x_train, y_train, epochs=args.epochs, callbacks=[hook])
+model.evaluate(x_test, y_test, callbacks=[hook])
+
+# Create a trial to inspect the saved tensors
+trial = smd.create_trial(out_dir=args.out_dir)
+print(f"Saved tensor values for {trial.tensors()}")
+print(f"Loss values were {trial.tensor('CrossEntropyLoss:0')}")
+```
+
+## How It Works
+SageMaker Debugger uses a `hook` to store the values of tensors throughout the training process. A separate process, called a `rule` job,
+simultaneously monitors and validates these outputs to ensure that training is progressing as expected.
+A rule might check for vanishing gradients, exploding tensor values, or poor weight initialization.
+If a rule is triggered, it raises a CloudWatch event and stops the training job, saving you time
+and money.
+
+SageMaker Debugger can be used inside or outside of SageMaker. There are three main use cases:
+- SageMaker Zero-Script-Change: Specify which rules to use when setting up the estimator and run your existing script with no changes needed. See the first example above.
+- SageMaker Bring-Your-Own-Container: Specify the rules to use, and modify your training script.
+- Non-SageMaker: Write custom rules (or manually analyze the tensors) and modify your training script. See the second example above.
+
+The setups differ because SageMaker Zero-Script-Change uses custom framework forks of TensorFlow, PyTorch, MXNet, and XGBoost that save tensors automatically.
+These framework forks are not available in custom containers or non-SageMaker environments, so there you must modify your training script.
+
+See the [SageMaker page](https://link.com) for details on the SageMaker Zero-Script-Change and BYOC experience.\
+See the framework pages for details on modifying the training script:
+- [TensorFlow](https://link.com)
+- [PyTorch](https://link.com)
+- [MXNet](https://link.com)
+- [XGBoost](https://link.com)
````
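The hook-writes/rule-reads split in the new "How It Works" section can be sketched in a few lines of plain Python. This is an illustrative toy, not the smdebug implementation; the function name and threshold are invented for the sketch:

```python
import math

def exploding_tensor_rule(step_values, threshold=1e6):
    """Toy exploding-tensor rule: scan the tensor values a hook saved at
    each step and return the first step containing NaN/inf or a value
    beyond the threshold, or None if training looks healthy."""
    for step, values in sorted(step_values.items()):
        if any(math.isnan(v) or math.isinf(v) or abs(v) > threshold for v in values):
            return step  # the rule "fires"; the real service would raise a CloudWatch event
    return None

# Values a hook might have written out at steps 0, 10, and 20
saved = {0: [0.5, 1.2], 10: [3.4e2, 9.9], 20: [float("inf"), 1.0]}
print(exploding_tensor_rule(saved))  # fires at step 20
```

Because the rule only reads saved outputs, it can run as a separate job in parallel with training, which is exactly why a triggered rule can stop the job without instrumenting the training loop itself.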

documentation/API.md renamed to docs/API.md

Lines changed: 20 additions & 43 deletions
````diff
@@ -10,55 +10,32 @@ These objects exist across all frameworks.
 - [SaveConfig](#saveconfig)
 - [ReductionConfig](#reductionconfig)

----
-## SageMaker Zero-Code-Change vs. Python API
+## Glossary

-There are two ways to use sagemaker-debugger: SageMaker Zero-Code-Change or Python API.
+The imports assume `import smdebug.{tensorflow,pytorch,mxnet,xgboost} as smd`.

-SageMaker Zero-Code-Change will use a custom framework fork to automatically instantiate the hook, register tensors, and create collections.
-All you need to do is decide which built-in rules to use. Further documentation is available on [AWS Docs](https://link.com).
-```python
-import sagemaker
-from sagemaker.debugger import rule_configs, Rule, CollectionConfig, DebuggerHookConfig, TensorBoardOutputConfig
-
-hook_config = DebuggerHookConfig(
-    s3_output_path=args.s3_path,
-    container_local_path=args.local_path,
-    hook_parameters={
-        "save_steps": "0,20,40,60,80"
-    },
-    collection_configs={
-        { "CollectionName": "weights" },
-        { "CollectionName": "biases" },
-    },
-)
+**Hook**: The main interface to use during training. This object can be passed as a model hook/callback
+in TensorFlow and Keras. It keeps track of collections and writes output files at each step.
+- `hook = smd.Hook(out_dir="/tmp/mnist_job")`

-rule = Rule.sagemaker(
-    rule_configs.exploding_tensor(),
-    rule_parameters={
-        "tensor_regex": ".*"
-    },
-    collections_to_save=[
-        CollectionConfig(name="weights"),
-        CollectionConfig(name="losses"),
-    ],
-)
+**Mode**: One of "train", "eval", "predict", or "global". Helpful for segmenting data based on the phase
+you're in. Defaults to "global".
+- `train_mode = smd.modes.TRAIN`

-sagemaker_simple_estimator = sagemaker.tensorflow.TensorFlow(
-    entry_point="script.py",
-    role=sagemaker.get_execution_role(),
-    framework_version="1.15",
-    py_version="py3",
-    rules=[rule],
-    debugger_hook_config=hook_config,
-)
+**Collection**: A group of tensors. Each collection contains its own save configuration and regexes for
+tensors to include/exclude.
+- `collection = hook.get_collection("losses")`

-sagemaker_simple_estimator.fit()
-```
+**SaveConfig**: A Python dict specifying how often to save losses and tensors.
+- `save_config = smd.SaveConfig(save_interval=10)`
+
+**ReductionConfig**: Allows you to save a reduction, such as 'mean' or 'l1 norm', instead of the full tensor.
+- `reduction_config = smd.ReductionConfig(reductions=['min', 'max', 'mean'], norms=['l1'])`
+
+**Trial**: The main interface to use when analyzing a completed training job. Access collections and tensors. See the [trials documentation](https://link.com).
+- `trial = smd.create_trial(out_dir="/tmp/mnist_job")`

-The Python API requires more configuration but is also more flexible. You must write your own custom rules
-instead of using SageMaker's built-in rules, but you can use it with a custom container in SageMaker or in your own
-environment. It is described further below.
+**Rule**: A condition that will trigger an exception and terminate the training job early, for example a vanishing gradient. See the [rules documentation](https://link.com).


 ---
````
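The ReductionConfig entry in the glossary above is easiest to grasp as a space-saving trade: instead of serializing a whole tensor, the hook stores a handful of scalars. A toy sketch in plain Python (not the smdebug API; names and the reduction set are illustrative):

```python
def reduce_tensor(values, reductions=("min", "max", "mean"), norms=("l1",)):
    """Toy ReductionConfig: collapse a tensor (here, a flat list of floats)
    to a small dict of scalar reductions instead of the full values."""
    ops = {
        "min": min,
        "max": max,
        "mean": lambda v: sum(v) / len(v),
    }
    norm_ops = {
        "l1": lambda v: sum(abs(x) for x in v),
        "l2": lambda v: sum(x * x for x in v) ** 0.5,
    }
    out = {r: ops[r](values) for r in reductions}
    out.update({f"{n}_norm": norm_ops[n](values) for n in norms})
    return out

# A 4-element "tensor" collapses to four scalars
print(reduce_tensor([-1.0, 2.0, -3.0, 4.0]))
# {'min': -3.0, 'max': 4.0, 'mean': 0.5, 'l1_norm': 10.0}
```

A rule such as `exploding_tensor` only needs the max, so saving reductions can cut storage and I/O dramatically for large weight tensors, at the cost of losing per-element values for later analysis.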
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.

docs/sagemaker.md

Lines changed: 138 additions & 0 deletions
New file contents:

# SageMaker

There are two cases for SageMaker:
- Zero-Script-Change (ZSC): Here you specify which rules to use, and run your existing script.
  - Supported in Deep Learning Containers: `TensorFlow==1.15, PyTorch==1.3, MXNet==1.6`
- Bring-Your-Own-Container (BYOC): Here you specify the rules to use, and modify your training script.
  - Supported with `TensorFlow==1.13/1.14/1.15, PyTorch==1.2/1.3, MXNet==1.4/1.5/1.6`

Table of Contents
- [Configuration Details](#configuration-details)
- [Using a Custom Container](#using-a-custom-container)

## Configuration Details
The DebuggerHookConfig is the main object.

```python
rule = sagemaker.debugger.Rule.sagemaker(
    base_config: dict,  # Use an import, e.g. sagemaker.debugger.rule_configs.exploding_tensor()
    name: str=None,
    instance_type: str=None,
    container_local_path: str=None,
    volume_size_in_gb: int=None,
    other_trials_s3_input_paths: str=None,
    rule_parameters: dict=None,
    collections_to_save: list[sagemaker.debugger.CollectionConfig]=None,
)
```

```python
hook_config = sagemaker.debugger.DebuggerHookConfig(
    s3_output_path: str,
    container_local_path: str=None,
    hook_parameters: dict=None,
    collection_configs: list[sagemaker.debugger.CollectionConfig]=None,
)
```

```python
tb_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path: str,
    container_local_path: str=None,
)
```

```python
collection_config = sagemaker.debugger.CollectionConfig(
    name: str,
    parameters: dict,
)
```

A full example script is below:
```python
import sagemaker
from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig, TensorBoardOutputConfig, CollectionConfig

hook_parameters = {
    "include_regex": "my_regex,another_regex",  # comma-separated string of regexes
    "save_interval": 100,
    "save_steps": "1,2,3,4",  # comma-separated string of steps to save
    "start_step": 1,
    "end_step": 2000,
    "reductions": "min,max,mean,std,abs_variance,abs_sum,abs_l2_norm",
}
weights_config = CollectionConfig(name="weights")
biases_config = CollectionConfig(name="biases")
losses_config = CollectionConfig(name="losses")
tb_config = TensorBoardOutputConfig(s3_output_path="s3://my-bucket/tensorboard")

hook_config = DebuggerHookConfig(
    s3_output_path="s3://my-bucket/smdebug",
    hook_parameters=hook_parameters,
    collection_configs=[weights_config, biases_config, losses_config],
)

exploding_tensor_rule = Rule.sagemaker(
    base_config=rule_configs.exploding_tensor(),
    rule_parameters={
        "tensor_regex": ".*",
    },
    collections_to_save=[weights_config, losses_config],
)
vanishing_gradient_rule = Rule.sagemaker(base_config=rule_configs.vanishing_gradient())

# Or use sagemaker.pytorch.PyTorch or sagemaker.mxnet.MXNet
sagemaker_simple_estimator = sagemaker.tensorflow.TensorFlow(
    entry_point=simple_entry_point_script,
    role=sagemaker.get_execution_role(),
    base_job_name=args.job_name,
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    framework_version="1.15",
    py_version="py3",
    # smdebug-specific arguments below
    rules=[exploding_tensor_rule, vanishing_gradient_rule],
    debugger_hook_config=hook_config,
    tensorboard_output_config=tb_config,
)

sagemaker_simple_estimator.fit()
```

## Using a Custom Container
To use a custom container (without the framework forks), you should modify your script.
Use the same SageMaker Estimator setup as shown above, and in your script, call

```python
hook = smd.{hook_class}.create_from_json_file()
```

and modify the rest of your script as shown in the API docs. Click on your desired framework below.
- [TensorFlow](https://link.com)
- [PyTorch](https://link.com)
- [MXNet](https://link.com)
- [XGBoost](https://link.com)


## Comprehensive Rule List
The full list of rules is:

| Rule Name | Behavior |
| --- | --- |
| `vanishing_gradient` | Detects a vanishing gradient. |
| `all_zero` | ??? |
| `check_input_images` | ??? |
| `similar_across_runs` | ??? |
| `weight_update_ratio` | ??? |
| `exploding_tensor` | ??? |
| `unchanged_tensor` | ??? |
| `loss_not_decreasing` | ??? |
| `dead_relu` | ??? |
| `confusion` | ??? |
| `overfit` | ??? |
| `tree_depth` | ??? |
| `tensor_variance` | ??? |
| `overtraining` | ??? |
| `poor_weight_initialization` | ??? |
| `saturated_activation` | ??? |
| `nlp_sequence_ratio` | ??? |
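The `hook_parameters` in the example above combine several scheduling knobs (`save_interval`, `save_steps`, `start_step`, `end_step`). Their interaction can be modeled with a short sketch in plain Python. This is a toy model of the intended semantics, not smdebug's code, and the exact precedence (e.g. whether `end_step` is exclusive) is an assumption:

```python
def should_save(step, save_interval=100, save_steps=None, start_step=0, end_step=None):
    """Toy save schedule: decide whether a hook saves tensors at `step`.
    An explicit save_steps list overrides the interval; start_step/end_step
    bound the window (end_step assumed exclusive)."""
    if step < start_step:
        return False
    if end_step is not None and step >= end_step:
        return False
    if save_steps is not None:
        return step in save_steps
    return step % save_interval == 0

params = {"save_interval": 100, "start_step": 1, "end_step": 2000}
print([s for s in (0, 100, 200, 2500) if should_save(s, **params)])  # [100, 200]
```

Thinking of the parameters this way explains why `save_steps` is useful for debugging a known-bad step while `save_interval` suits continuous monitoring.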

documentation/tensorflow.md renamed to docs/tensorflow.md

Lines changed: 27 additions & 0 deletions
````diff
@@ -3,12 +3,23 @@
 SageMaker Zero-Code-Change supported container: TensorFlow 1.15. See the [AWS Docs](https://link.com) for details.\
 Python API supported versions: TensorFlow 1.13, 1.14, 1.15. Keras 2.3.

+
 ## Contents
+- [How to Use](#how-to-use)
 - [Keras Example](#keras-example)
 - [MonitoredSession Example](#monitored-session-example)
 - [Estimator Example](#estimator-example)
 - [Full API](#full-api)

+## How to Use
+1. `import smdebug.tensorflow as smd`
+2. Instantiate a hook: use `smd.{hook_class}.create_from_json_file()` in a SageMaker environment, or `smd.{hook_class}()` elsewhere.
+3. Pass the hook to the model as a callback.
+4. If using a custom container or running outside of SageMaker, wrap the optimizer with `optimizer = hook.wrap_optimizer(optimizer)`.
+
+(Optional) Configure collections. See the [Common API](https://link.com) page for details on how to do this.
+
 ## tf.keras Example
 ```python
 import smdebug.tensorflow as smd
@@ -140,3 +151,19 @@ wrap_optimizer(
 )
 ```
 Adds functionality to the optimizer object to log gradients. Returns the original optimizer and doesn't change the optimization process.
+
+## Concepts
+The steps to use SageMaker Debugger in any framework are:
+
+1. Create a `hook`.
+2. Register your model and optimizer with the hook.
+3. Specify the `rule` to be used.
+4. After training, create a `trial` to manually analyze the tensors.
+
+See the [API page](https://link.com) for more details.
+
+## Detailed Links
+- [Full API](https://link.com)
+- [Rules and Trials](https://link.com)
+- [Distributed Training](https://link.com)
+- [TensorBoard](https://link.com)
````
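The optional "Configure collections" step added above relies on collections selecting tensors by include regexes (comma-separated, as in the `include_regex` hook parameter). A toy sketch of that selection in plain Python — illustrative only, not the smdebug matching code:

```python
import re

def build_collection(tensor_names, include_regex):
    """Toy collection membership: keep tensors whose names match any of
    the comma-separated regex patterns (substring search, like re.search)."""
    patterns = [re.compile(p) for p in include_regex.split(",")]
    return [name for name in tensor_names if any(p.search(name) for p in patterns)]

names = ["dense/weights:0", "dense/bias:0", "loss:0", "Adam/lr:0"]
print(build_collection(names, "weights,bias"))  # ['dense/weights:0', 'dense/bias:0']
```

Substring search (rather than full-string match) is what makes a short pattern like `"weights"` pick up every layer's weight tensor regardless of its scope prefix.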
File renamed without changes.

0 commit comments
