There are a few reports of this problem, but I have had no luck finding an answer. In short, here is a minimal code snippet:
import tensorflow as tf
from tensorflow.keras import layers
print(tf.__version__)
# 2.3.1
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(loss='mse', optimizer='sgd')
dataset = tf.data.Dataset.from_tensors(([1.],[1.])).repeat(100).batch(10)
model.fit(dataset,epochs=4)
After running it I get:
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0','/job:localhost/replica:0/task:0/device:GPU:1')
Epoch 1/4
INFO:tensorflow:batch_all_reduce: 2 all-reduces with algorithm = nccl,num_packs = 1
INFO:tensorflow:batch_all_reduce: 2 all-reduces with algorithm = nccl,num_packs = 1
10/10 [==============================] - 1s 93ms/step - loss: 807385211185512087331799040.0000
Epoch 2/4
10/10 [==============================] - 1s 93ms/step - loss: nan
Epoch 3/4
10/10 [==============================] - 1s 93ms/step - loss: nan
Epoch 4/4
10/10 [==============================] - 1s 93ms/step - loss: nan
10/10 [==============================] - 0s 48ms/step - loss: nan
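As far as I understand it, the two all-reduces in the log presumably correspond to the gradients of the layer's two variables (kernel and bias), and a synchronous sum all-reduce over the replicas should give the same gradient as a single device would. Below is a minimal pure-Python sketch of that expectation, not TensorFlow's actual implementation. The helper names and the initial weight are hypothetical, and I assume each replica scales its loss by the global batch size.

```python
# Pure-Python sketch (no TensorFlow): two replicas each process half of a
# global batch of 10 identical samples (x = 1, y = 1). Each replica scales
# its gradient by the GLOBAL batch size, then a sum all-reduce combines them.
GLOBAL_BATCH = 10
NUM_REPLICAS = 2


def scaled_grad(w, b, n_samples):
    """Gradient of sum((w*x + b - y)^2) / GLOBAL_BATCH over n_samples, with x = y = 1.

    Because x = 1, the gradient with respect to w and to b is the same number.
    """
    err = w + b - 1.0
    return n_samples * 2.0 * err / GLOBAL_BATCH


def all_reduce_sum(values):
    """Stand-in for the cross-device sum; hypothetical helper, not a TF API."""
    return sum(values)


w, b = -1.446, 0.0  # hypothetical initial kernel and bias

per_replica = [
    scaled_grad(w, b, GLOBAL_BATCH // NUM_REPLICAS) for _ in range(NUM_REPLICAS)
]
mirrored_grad = all_reduce_sum(per_replica)
single_device_grad = scaled_grad(w, b, GLOBAL_BATCH)

print(mirrored_grad, single_device_grad)
```

If this picture is right, the mirrored run should follow the same loss curve as the single-device run, which is why the exploding loss above surprises me.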
Without the strategy, the output looks normal and the loss is computed correctly:
Epoch 1/4
10/10 [==============================] - 0s 2ms/step - loss: 4.2581
Epoch 2/4
10/10 [==============================] - 0s 2ms/step - loss: 1.8821
Epoch 3/4
10/10 [==============================] - 0s 2ms/step - loss: 0.8319
Epoch 4/4
10/10 [==============================] - 0s 2ms/step - loss: 0.3677
10/10 [==============================] - 0s 1ms/step - loss: 0.2284
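As a sanity check, the single-device numbers can be reproduced by hand. This is only a sketch under stated assumptions: plain SGD with learning rate 0.01 (which I believe is the Keras default for optimizer='sgd'), a zero-initialized bias, and an initial kernel value of -1.446 that I back-calculated from the first epoch (the real initial value is random, so this number is hypothetical). With x = 1 and y = 1 the error shrinks by a factor of 0.96 per step, so the epoch-mean loss should shrink by 0.96**20, roughly 0.442, per epoch.

```python
# Hand simulation of plain SGD on y = w*x + b with x = 1 and target = 1.
LEARNING_RATE = 0.01    # assumed: Keras default for optimizer='sgd'
STEPS_PER_EPOCH = 10    # 100 samples / batch size 10
EPOCHS = 4


def simulate(w0, b0=0.0):
    """Return (mean loss per epoch, loss after the last update)."""
    w, b = w0, b0
    epoch_means = []
    for _ in range(EPOCHS):
        batch_losses = []
        for _ in range(STEPS_PER_EPOCH):
            err = w * 1.0 + b - 1.0
            # Loss is recorded before the weight update, as in a training step.
            batch_losses.append(err ** 2)
            grad = 2.0 * err  # identical for w and b because x = 1
            w -= LEARNING_RATE * grad
            b -= LEARNING_RATE * grad
        epoch_means.append(sum(batch_losses) / STEPS_PER_EPOCH)
    final_loss = (w + b - 1.0) ** 2
    return epoch_means, final_loss


# Hypothetical initial kernel, chosen to match the first logged epoch.
epoch_means, final_loss = simulate(w0=-1.446)
ratios = [later / earlier for earlier, later in zip(epoch_means, epoch_means[1:])]

print(epoch_means)
print(final_loss)
print(ratios)
```

The logged losses above (4.2581, 1.8821, 0.8319, 0.3677) have the same epoch-to-epoch ratio of about 0.442, so the run without the strategy behaves as plain SGD should, and the model and learning rate by themselves do not explain the divergence.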
As the runtime environment I use the TensorFlow container nvcr.io/nvidia/tensorflow:20.10-tf2-py3 from Nvidia GPU Cloud, so it is up to date and compatible with all sorts of drivers. I also tried the newer version, 20.12-tf2-py3.