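I want to train a model on multiple GPUs using TensorFlow's MirroredStrategy. I have used strategy = tf.distribute.MirroredStrategy() many times before, but this time when I ran it I got an unusually long error and the process hung for a long time.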
Traceback (most recent call last):
  File "/research/dept8/gds/anafees/anaconda3/lib/python3.8/threading.py", line 932, in _bootstrap_inner
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
    self.run()
  File "/research/dept8/gds/anafees/anaconda3/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/research/dept8/gds/anafees/anaconda3/lib/python3.8/multiprocessing/pool.py", line 519, in _handle_workers
    cls._wait_for_updates(current_sentinels, change_notifier)
  File "/research/dept8/gds/anafees/anaconda3/lib/python3.8/multiprocessing/pool.py", line 499, in _wait_for_updates
    wait(sentinels, timeout=timeout)
NameError: name 'wait' is not defined