Skip to content

1904 Add early stop handler #1921

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Apr 3, 2021
Merged

Conversation

Nic-Ma
Copy link
Contributor

@Nic-Ma Nic-Ma commented Apr 2, 2021

Fixes #1904 .

Description

This PR added the EarlyStopHandler for workflows.

Status

Ready

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 2, 2021

/black

@Nic-Ma Nic-Ma requested review from ericspod, nvkevlu, rijobro and wyli April 2, 2021 03:54
@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 2, 2021

/black

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 2, 2021

/black

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 2, 2021

/black

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 2, 2021

/black

wyli
wyli previously approved these changes Apr 2, 2021
Copy link
Contributor

@wyli wyli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@Nic-Ma Nic-Ma enabled auto-merge (squash) April 2, 2021 11:48
@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 2, 2021

Hi @vfdev-5 ,

Could you please help check the error message in the CI:

======================================================================
ERROR: test_early_stop_val_metric (tests.test_handler_early_stop.TestHandlerEarlyStop)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/MONAI/MONAI/tests/test_handler_early_stop.py", line 58, in test_early_stop_val_metric
    handler.set_trainer(trainer=trainer)
  File "/__w/MONAI/MONAI/monai/handlers/earlystop_handler.py", line 90, in set_trainer
    cumulative_delta=self.cumulative_delta,
  File "/opt/conda/lib/python3.6/site-packages/ignite/handlers/early_stopping.py", line 74, in __init__
    self.logger = setup_logger(__name__ + "." + self.__class__.__name__)
  File "/opt/conda/lib/python3.6/site-packages/ignite/utils.py", line 149, in setup_logger
    import ignite.distributed as idist
AttributeError: module 'ignite' has no attribute 'distributed'

Do you have any experience with it?

Thanks in advance.

@vfdev-5
Copy link
Member

vfdev-5 commented Apr 2, 2021

Hi @Nic-Ma , yes this looks very strange. Usually, this can happen if an old version is installed, e.g. <0.4 but checking the logs seems like it is using 0.4.4 version.

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 2, 2021

Hi @Nic-Ma , yes this looks very strange. Usually, this can happen if an old version is installed, e.g. <0.4 but checking the logs seems like it is using 0.4.4 version.

Thanks for your check.
As the test of CheckpointSaver can pass, what's the difference between checkpoint and earlystopping handlers in ignite?

Thanks.

@vfdev-5
Copy link
Member

vfdev-5 commented Apr 2, 2021

@Nic-Ma the issue is related to setup_logger from ignite.utils where import ignite.distributed is called inside the function. I have a feeling but may be wrong that monai's lazy importing optional_import method could be a root of the issue...

A minimal repro could be

setup_logger, _ = optional_import("ignite.utils", "0.4.4", exact_version, "setup_logger")
logger = setup_logger("test")

what's the difference between checkpoint and earlystopping handlers in ignite?

they are rather different. EarlyStopping sets up a logger by default. By the way, do you use ignite's TerminateOnNan somewhere ?

@wyli
Copy link
Contributor

wyli commented Apr 2, 2021

strange I couldn't replicate the issue locally:

Python 3.7.9 (default, Aug 31 2020, 12:42:55) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from monai.utils import exact_version, optional_import
>>> setup_logger, _ = optional_import("ignite.utils", "0.4.4", exact_version, "setup_logger")
>>> logger = setup_logger("test")
>>> logger.info("hello")
2021-04-02 10:50:54,156 test INFO: hello
>>> import ignite
>>> ignite.__version__
'0.4.4'
>>> 

@vfdev-5
Copy link
Member

vfdev-5 commented Apr 2, 2021

@wyli thanks for testing it. Yes, from my side I couldn't repro it neither and even init of EarlyStopping works

from monai.utils import exact_version, optional_import

Engine, _ = optional_import("ignite.engine", "0.4.4", exact_version, "Engine")
EarlyStopping, _ = optional_import("ignite.handlers", "0.4.4", exact_version, "EarlyStopping")

engine = Engine(lambda e, b: None)
handler = EarlyStopping(4, lambda e: 0.0, engine)

This is also a bit unsurprising as build / quick-py3 (ubuntu-latest) (pull_request) ci and TestHandlerEarlyStop are passing.

May be related to the machine env on those build / GPU-quick-py3 (PT16+CUDAXYZ) ?

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 2, 2021

Hi @vfdev-5 ,

Seems it's a random issue in the tests, should we import ignite.distributed first in the CI test?

Thanks.

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 2, 2021

@vfdev-5 , maybe import ignite.distributed will do some logic for GPU? All the CI running together caused GPU resource issue?

Thanks.

@vfdev-5
Copy link
Member

vfdev-5 commented Apr 2, 2021

@Nic-Ma yes, I see that it fails on some of GPU tests and passes on others...

Yes, ignite.distributed is intended to simplify things while working with distibuted configuration and GPUs (we also have GPU tests and haven't seen such problems). I wonder if the environment (e.g. build / GPU-quick-py3 (PT17+CUDA110) (pull_request)) could be replicated locally on other machines with GPU ?

@wyli
Copy link
Contributor

wyli commented Apr 2, 2021

I couldn't replicate the issue locally using exactly the same configure as build / GPU-quick-py3 (PT17+CUDA110)... this is a very interesting one :D

@wyli wyli disabled auto-merge April 2, 2021 17:02
@wyli
Copy link
Contributor

wyli commented Apr 2, 2021

I'm able to replicate this with some randomness:

docker run --gpus all --rm -ti --ipc=host --net=host -v ~/Documents/MONAI:/opt/monai nvcr.io/nvidia/pytorch:20.09-py3
# within the docker container:
python -m pip install --upgrade pip wheel
python -m pip install -r requirements-dev.txt
BUILD_MONAI=0 ./runtests.sh --quick --unittests

but I couldn't locate the issue further...

some debug info:

Starting test: test_early_stop_train_loss (tests.test_handler_early_stop.TestHandlerEarlyStop)...
> /opt/conda/lib/python3.6/site-packages/ignite/utils.py(150)setup_logger()
-> import ignite.distributed as idist
(Pdb) import ignite
(Pdb) print(ignite.__dict__)
{'__name__': 'ignite', '__doc__': None, '__package__': 'ignite', '__loader__': <_frozen_importlib_external.SourceFileLoader object at 0x7fd17c102ef0>, '__spec__': ModuleSpec(name='ignite', loader=<_frozen_importlib_external.SourceFileLo
ader object at 0x7fd17c102ef0>, origin='/opt/conda/lib/python3.6/site-packages/ignite/__init__.py', submodule_search_locations=['/opt/conda/lib/python3.6/site-packages/ignite']), '__path__': ['/opt/conda/lib/python3.6/site-packages/igni
te'], '__file__': '/opt/conda/lib/python3.6/site-packages/ignite/__init__.py', '__cached__': '/opt/conda/lib/python3.6/site-packages/ignite/__pycache__/__init__.cpython-36.pyc', '__builtins__': {'__name__': 'builtins', '__doc__': "Built
-in functions, exceptions, and other objects.\n\nNoteworthy: None is the `nil' object; Ellipsis represents `...' in slices.", '__package__': '', '__loader__': <class '_frozen_importlib.BuiltinImporter'>, '__spec__': ModuleSpec(name='bui
ltins', loader=<class '_frozen_importlib.BuiltinImporter'>), '__build_class__': <built-in function __build_class__>, '__import__': <built-in function __import__>, 'abs': <built-in function abs>, 'all': <built-in function all>, 'any': <b
uilt-in function any>, 'ascii': <built-in function ascii>, 'bin': <built-in function bin>, 'callable': <built-in function callable>, 'chr': <built-in function chr>, 'compile': <built-in function compile>, 'delattr': <built-in function d
elattr>, 'dir': <built-in function dir>, 'divmod': <built-in function divmod>, 'eval': <built-in function eval>, 'exec': <built-in function exec>, 'format': <built-in function format>, 'getattr': <built-in function getattr>, 'globals': 
<built-in function globals>, 'hasattr': <built-in function hasattr>, 'hash': <built-in function hash>, 'hex': <built-in function hex>, 'id': <built-in function id>, 'input': <built-in function input>, 'isinstance': <built-in function is
instance>, 'issubclass': <built-in function issubclass>, 'iter': <built-in function iter>, 'len': <built-in function len>, 'locals': <built-in function locals>, 'max': <built-in function max>, 'min': <built-in function min>, 'next': <bu
ilt-in function next>, 'oct': <built-in function oct>, 'ord': <built-in function ord>, 'pow': <built-in function pow>, 'print': <built-in function print>, 'repr': <built-in function repr>, 'round': <built-in function round>, 'setattr': 
<built-in function setattr>, 'sorted': <built-in function sorted>, 'sum': <built-in function sum>, 'vars': <built-in function vars>, 'None': None, 'Ellipsis': Ellipsis, 'NotImplemented': NotImplemented, 'False': False, 'True': True, 'bo
ol': <class 'bool'>, 'memoryview': <class 'memoryview'>, 'bytearray': <class 'bytearray'>, 'bytes': <class 'bytes'>, 'classmethod': <class 'classmethod'>, 'complex': <class 'complex'>, 'dict': <class 'dict'>, 'enumerate': <class 'enumer
ate'>, 'filter': <class 'filter'>, 'float': <class 'float'>, 'frozenset': <class 'frozenset'>, 'property': <class 'property'>, 'int': <class 'int'>, 'list': <class 'list'>, 'map': <class 'map'>, 'object': <class 'object'>, 'range': <cla
ss 'range'>, 'reversed': <class 'reversed'>, 'set': <class 'set'>, 'slice': <class 'slice'>, 'staticmethod': <class 'staticmethod'>, 'str': <class 'str'>, 'super': <class 'super'>, 'tuple': <class 'tuple'>, 'type': <class 'type'>, 'zip'
: <class 'zip'>, '__debug__': True, 'BaseException': <class 'BaseException'>, 'Exception': <class 'Exception'>, 'TypeError': <class 'TypeError'>, 'StopAsyncIteration': <class 'StopAsyncIteration'>, 'StopIteration': <class 'StopIteration
'>, 'GeneratorExit': <class 'GeneratorExit'>, 'SystemExit': <class 'SystemExit'>, 'KeyboardInterrupt': <class 'KeyboardInterrupt'>, 'ImportError': <class 'ImportError'>, 'ModuleNotFoundError': <class 'ModuleNotFoundError'>, 'OSError': <
class 'OSError'>, 'EnvironmentError': <class 'OSError'>, 'IOError': <class 'OSError'>, 'EOFError': <class 'EOFError'>, 'RuntimeError': <class 'RuntimeError'>, 'RecursionError': <class 'RecursionError'>, 'NotImplementedError': <class 'No
tImplementedError'>, 'NameError': <class 'NameError'>, 'UnboundLocalError': <class 'UnboundLocalError'>, 'AttributeError': <class 'AttributeError'>, 'SyntaxError': <class 'SyntaxError'>, 'IndentationError': <class 'IndentationError'>, '
TabError': <class 'TabError'>, 'LookupError': <class 'LookupError'>, 'IndexError': <class 'IndexError'>, 'KeyError': <class 'KeyError'>, 'ValueError': <class 'ValueError'>, 'UnicodeError': <class 'UnicodeError'>, 'UnicodeEncodeError': <
class 'UnicodeEncodeError'>, 'UnicodeDecodeError': <class 'UnicodeDecodeError'>, 'UnicodeTranslateError': <class 'UnicodeTranslateError'>, 'AssertionError': <class 'AssertionError'>, 'ArithmeticError': <class 'ArithmeticError'>, 'Floati
ngPointError': <class 'FloatingPointError'>, 'OverflowError': <class 'OverflowError'>, 'ZeroDivisionError': <class 'ZeroDivisionError'>, 'SystemError': <class 'SystemError'>, 'ReferenceError': <class 'ReferenceError'>, 'BufferError': <c
lass 'BufferError'>, 'MemoryError': <class 'MemoryError'>, 'Warning': <class 'Warning'>, 'UserWarning': <class 'UserWarning'>, 'DeprecationWarning': <class 'DeprecationWarning'>, 'PendingDeprecationWarning': <class 'PendingDeprecationWa
rning'>, 'SyntaxWarning': <class 'SyntaxWarning'>, 'RuntimeWarning': <class 'RuntimeWarning'>, 'FutureWarning': <class 'FutureWarning'>, 'ImportWarning': <class 'ImportWarning'>, 'UnicodeWarning': <class 'UnicodeWarning'>, 'BytesWarning
': <class 'BytesWarning'>, 'ResourceWarning': <class 'ResourceWarning'>, 'ConnectionError': <class 'ConnectionError'>, 'BlockingIOError': <class 'BlockingIOError'>, 'BrokenPipeError': <class 'BrokenPipeError'>, 'ChildProcessError': <cla
ss 'ChildProcessError'>, 'ConnectionAbortedError': <class 'ConnectionAbortedError'>, 'ConnectionRefusedError': <class 'ConnectionRefusedError'>, 'ConnectionResetError': <class 'ConnectionResetError'>, 'FileExistsError': <class 'FileExis
tsError'>, 'FileNotFoundError': <class 'FileNotFoundError'>, 'IsADirectoryError': <class 'IsADirectoryError'>, 'NotADirectoryError': <class 'NotADirectoryError'>, 'InterruptedError': <class 'InterruptedError'>, 'PermissionError': <class
 'PermissionError'>, 'ProcessLookupError': <class 'ProcessLookupError'>, 'TimeoutError': <class 'TimeoutError'>, 'open': <built-in function open>, 'quit': Use quit() or Ctrl-D (i.e. EOF) to exit, 'exit': Use exit() or Ctrl-D (i.e. EOF) 
to exit, 'copyright': Copyright (c) 2001-2019 Python Software Foundation.
All Rights Reserved.

Copyright (c) 2000 BeOpen.com.
All Rights Reserved.

Copyright (c) 1995-2001 Corporation for National Research Initiatives.
All Rights Reserved.

Copyright (c) 1991-1995 Stichting Mathematisch Centrum, Amsterdam.
All Rights Reserved., 'credits':     Thanks to CWI, CNRI, BeOpen.com, Zope Corporation and a cast of thousands
    for supporting Python development.  See www.python.org for more information., 'license': Type license() to see the full license text, 'help': Type help() for interactive help, or help(object) for help about object., '__pybind11_inte
rnals_v3__': <capsule object NULL at 0x7fd2130f6f90>, '__pybind11_internals_v4_gcc_libstdcpp_cxxabi1011__': <capsule object NULL at 0x7fd1a15e5ed0>, '__pybind11_internals_v4_gcc_libstdcpp_cxxabi1013__': <capsule object NULL at 0x7fd1877
36660>}, 'ignite': <module 'ignite' from '/opt/conda/lib/python3.6/site-packages/ignite/__init__.py'>, '__version__': '0.4.4'}
(Pdb)

@vfdev-5
Copy link
Member

vfdev-5 commented Apr 2, 2021

@wyli i tried as well to repro locally. Results are flaky: on the 3rd run I could have this ignite.distributed issue. Will investigate further...

There is also another reproducible issue with warp (if used with BUILD_MONAI=0 ./runtests.sh --quick --unittests or BUILD_MONAI=0 python ./tests/runner.py -p test_warp):

======================================================================
ERROR: test_grad (tests.test_warp.TestWarp)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/monai/tests/test_warp.py", line 114, in test_grad
    gradcheck(warp_layer, (input_image, ddf), atol=1e-2, eps=1e-2)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 288, in gradcheck
    func_out = func(*tupled_inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/monai/monai/networks/blocks/warp.py", line 111, in forward
    image, grid, mode=self._interp_mode, padding_mode=f"{self._padding_mode}", align_corners=True
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 3364, in grid_sample
    "'bilinear' or 'nearest', but got: '{}'".format(mode))
ValueError: nn.functional.grid_sample(): expected mode to be 'bilinear' or 'nearest', but got: 'bicubic'

I also remarked this exception (also present in the logs here):

 Starting test: test_dataloader (tests.test_thread_buffer.TestDataLoader)...
 Exception in thread Thread-1327:
 Traceback (most recent call last):
   File "/opt/conda/lib/python3.6/threading.py", line 916, in _bootstrap_inner
     self.run()
   File "/opt/conda/lib/python3.6/threading.py", line 864, in run
     self._target(*self._args, **self._kwargs)
   File "/__w/MONAI/MONAI/monai/data/thread_buffer.py", line 45, in enqueue_values
     for src_val in self.src:
 TypeError: 'super' object is not iterable

@wyli
Copy link
Contributor

wyli commented Apr 2, 2021

right, bicubic is a new option in torch.nn.functional.grid_sampler introduced in torch 1.8.1, I'll add some check early next week

@vfdev-5
Copy link
Member

vfdev-5 commented Apr 2, 2021

@wyli I passed some time inspecting what happens and here are my conclusions:

If we run test_print_info and test_handler_early_stop, it should repro the issue and should fail with AttributeError: module 'ignite' has no attribute 'distributed'.

The failure is related to

output["Pytorch Ignite"] = get_package_version("ignite")

and

del sys.modules[dep_name]

Phew, it is not ignite's failure :)

Other remarks:

  • IMO, it would be better to order test_modules here :
    for test_module in {os.path.basename(f)[:-3] for f in files}:

Otherwise, tests order is arbitrary. A fix could be:

test_modules = sorted({os.path.basename(f)[:-3] for f in files})
for test_module in test_modules: 
  • To improve logs readability, it would be helpful to flush stdout with:
- ${cmdPrefix}${cmd} ./tests/runner.py -p "test_((?!integration).)"
+ ${cmdPrefix}${cmd} -u ./tests/runner.py -p "test_((?!integration).)"

Thanks

@wyli
Copy link
Contributor

wyli commented Apr 2, 2021

@wyli I passed some time inspecting what happens and here are my conclusions:

If we run test_print_info and test_handler_early_stop, it should repro the issue and should fail with AttributeError: module 'ignite' has no attribute 'distributed'.

The failure is related to

output["Pytorch Ignite"] = get_package_version("ignite")

and

del sys.modules[dep_name]

Phew, it is not ignite's failure :)

Other remarks:

  • IMO, it would be better to order test_modules here :
    for test_module in {os.path.basename(f)[:-3] for f in files}:

Otherwise, tests order is arbitrary. A fix could be:

test_modules = sorted({os.path.basename(f)[:-3] for f in files})
  • To improve logs readability, it would be helpful to flush stdout with:
- ${cmdPrefix}${cmd} ./tests/runner.py -p "test_((?!integration).)"
+ ${cmdPrefix}${cmd} -u ./tests/runner.py -p "test_((?!integration).)"

Thanks

@vfdev-5 really nice investigations, many thanks for looking into these. cc @Nic-Ma @rijobro

(incidentally, I feel it might be helpful to have shuffled test cases if we assume they should work independently.)

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 3, 2021

Hi @vfdev-5 and @wyli ,

Thanks so much for your debug and investigation!!
So seems it's related to this PR: #1296, @rijobro could you please help take a look this line:
https://github.com/Project-MONAI/MONAI/blob/master/monai/utils/module.py#L266
Maybe you should not delete module in get_package_version?

Thanks in advance.

@wyli wyli dismissed their stale review April 3, 2021 11:07

ci/cd errors

@wyli wyli mentioned this pull request Apr 3, 2021
3 tasks
@wyli
Copy link
Contributor

wyli commented Apr 3, 2021

I also remarked this exception (also present in the logs here):

 Starting test: test_dataloader (tests.test_thread_buffer.TestDataLoader)...
 Exception in thread Thread-1327:
 Traceback (most recent call last):
   File "/opt/conda/lib/python3.6/threading.py", line 916, in _bootstrap_inner
     self.run()
   File "/opt/conda/lib/python3.6/threading.py", line 864, in run
     self._target(*self._args, **self._kwargs)
   File "/__w/MONAI/MONAI/monai/data/thread_buffer.py", line 45, in enqueue_values
     for src_val in self.src:
 TypeError: 'super' object is not iterable

Hi @vfdev-5 I couldn't open the above link, could you please update? I'll look into it. Thanks!

@vfdev-5
Copy link
Member

vfdev-5 commented Apr 3, 2021

Hi @wyli

Hi @vfdev-5 I couldn't open the above link, could you please update? I'll look into it. Thanks!

The link is on the full log of one of the previously failling jobs. The same thing is present in one of current jobs:

incidentally, I feel it might be helpful to have shuffled test cases if we assume they should work independently.

I understand your point on that, however it could be more complicated to reproduce the failure as it has happend here.

PS: thanks a lot for the invitation to join github org !

@wyli wyli closed this in #1930 Apr 3, 2021
@wyli wyli reopened this Apr 3, 2021
@Nic-Ma Nic-Ma enabled auto-merge (squash) April 3, 2021 15:10
@Nic-Ma Nic-Ma merged commit 1fd5e0a into Project-MONAI:master Apr 3, 2021
nsrivathsa pushed a commit to nsrivathsa/MONAI that referenced this pull request Apr 12, 2021
* [DLMED] add EarlyStop handler

Signed-off-by: Nic Ma <[email protected]>

* [MONAI] python code formatting

Signed-off-by: monai-bot <[email protected]>

* [DLMED] enhance validation handler

Signed-off-by: Nic Ma <[email protected]>

* [DLMED] add set_trainer support

Signed-off-by: Nic Ma <[email protected]>

* [DLMED] add more check

Signed-off-by: Nic Ma <[email protected]>

* [DLMED] fix flake8 issue

Signed-off-by: Nic Ma <[email protected]>

* [DLMED] update according to comments

Signed-off-by: Nic Ma <[email protected]>

* [DLMED] fix flake8 issue

Signed-off-by: Nic Ma <[email protected]>

Co-authored-by: monai-bot <[email protected]>
Co-authored-by: Wenqi Li <[email protected]>
Signed-off-by: Neha Srivathsa <[email protected]>
@Nic-Ma Nic-Ma deleted the 1904-add-early-stop branch July 2, 2021 23:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Need to add Early-Stop handler
4 participants