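Today I proudly installed a second RTX 2070 in my computer to further speed up TensorFlow 2.2. To my disappointment, the Python script that ran fine on one GPU no longer works. I tried to reduce it to a minimal working example, which works with this line: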
strategy = tf.distribute.MirroredStrategy(devices=["/cpu:0"])
If I replace this line with any of the following lines
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0","/gpu:1"])
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
strategy = tf.distribute.MirroredStrategy()
I get the following error message:
Starting training
Epoch 1/5
2020-05-23 22:52:59.205856: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-23 22:52:59.400434: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-23 22:53:00.881437: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-05-23 22:53:00.898484: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "example3.py", line 77, in <module>
    main()
  File "example3.py", line 70, in main
    model.fit(x=training_generator,workers=1,epochs=5,steps_per_epoch = len(training_generator))
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 848, in fit
    tmp_logs = train_function(iterator)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 644, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1665, in _filtered_call
    self.captured_inputs)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1746, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 598, in call
    ctx=ctx)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node sequential/conv2d/Conv2D (defined at /usr/lib64/python3.6/threading.py:916) ]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node sequential/conv2d/Conv2D (defined at /usr/lib64/python3.6/threading.py:916) ]]
     [[div_no_nan_1/ReadVariableOp/_14]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_1034]
Errors may have originated from an input operation.
Input Source operations connected to node sequential/conv2d/Conv2D:
 cond_1/Identity (defined at example3.py:70)
Input Source operations connected to node sequential/conv2d/Conv2D:
 cond_1/Identity (defined at example3.py:70)
Function call stack:
train_function -> train_function
2020-05-23 22:53:00.943669: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
     [[{{node PyFunc}}]]
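A commonly reported trigger for "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR" is TensorFlow reserving nearly all GPU memory at startup, which leaves cuDNN unable to allocate its own workspace. Whether that applies to this setup is an assumption; nothing in the log above confirms it. A minimal sketch for testing the idea is to enable memory growth before any strategy or model is created. The helper name enable_gpu_memory_growth is hypothetical; the two tf.config calls are the TensorFlow 2.x API.

```python
def enable_gpu_memory_growth():
    """Ask TensorFlow to allocate GPU memory on demand instead of up front.

    Assumption: the cuDNN failure is memory related. This must run before
    any GPU is initialized (before creating a strategy or a model),
    otherwise TensorFlow raises a RuntimeError.
    Returns the number of GPUs that were configured.
    """
    # Imported inside the function so that defining the helper
    # does not itself require TensorFlow.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)
```

It would be called once at the very top of main(), before the strategy line, for example as print("GPUs configured:", enable_gpu_memory_growth()).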
Here is the complete code to reproduce the error:
import tensorflow as tf
import numpy as np
from PIL import Image, ImageDraw


class DataGenerator(tf.keras.utils.Sequence):
    """Yields batches of random shapes (rectangle, ellipse, triangle) with their labels."""

    def __init__(self, BatchSize, PicX, PicY, Color):
        self._BatchSize = BatchSize
        self._dim = (PicX, PicY)
        self._Color = Color

    def __len__(self):
        # Number of batches per epoch.
        return 100

    def create_random_form(self):
        # Draw one random shape on a dark grey background.
        img = Image.new('RGB', self._dim, (50, 50, 50))
        draw = ImageDraw.Draw(img)
        label = np.random.randint(3)
        x0 = np.random.randint(int((self._dim[0]-5)/2))+1
        x1 = np.random.randint(int((self._dim[0]-5)/2))+int(self._dim[0]/2)
        y0 = np.random.randint(int((self._dim[1]-5)/2))
        y1 = np.random.randint(int((self._dim[1]-5)/2))+int(self._dim[1]/2)
        if label == 0:
            draw.rectangle((x0, y0, x1, y1), fill=self._Color)
        elif label == 1:
            draw.ellipse((x0, y0, x1, y1), fill=self._Color)
        else:
            draw.polygon([(x0, y0), (x0, y1), (x1, y1)], fill=self._Color)
        return img, label

    def __getitem__(self, index):
        # Build one batch of images scaled to [0, 1] plus integer labels.
        X = np.empty((self._BatchSize, *self._dim, 3))
        y = np.empty((self._BatchSize), dtype=int)
        for i in range(0, self._BatchSize):
            img, label = self.create_random_form()
            X[i,] = tf.keras.preprocessing.image.img_to_array(img) / 255.0
            y[i] = label
        return X, y


def main():
    PicX = 300
    PicY = 300
    Color = (255, 255, 255)
    #save_some_pics(20)
    print("Starting a minimal, self-contained error reproduction")
    #strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0","/gpu:1"])
    #strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0"])
    #strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
    with strategy.scope():
        model = tf.keras.Sequential()
        model.add(tf.keras.layers.Conv2D(32, (9, 9), activation='relu', input_shape=(PicX, PicY, 3)))
        model.add(tf.keras.layers.MaxPooling2D((9, 9)))
        # Kernel size assumed to match the first layer.
        model.add(tf.keras.layers.Conv2D(64, (9, 9), activation='relu'))
        model.add(tf.keras.layers.MaxPooling2D((9, 9)))
        model.add(tf.keras.layers.Flatten())
        model.add(tf.keras.layers.Dense(64, activation='relu'))
        model.add(tf.keras.layers.Dense(3, activation='softmax'))
        model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.summary()
    training_generator = DataGenerator(10, PicX, PicY, Color)
    print("Starting training")
    model.fit(x=training_generator, workers=1, epochs=5, steps_per_epoch=len(training_generator))
    test_generator = DataGenerator(10, PicX, PicY, Color)
    test_loss, test_acc = model.evaluate(test_generator)
    print("Test loss {}, test accuracy {}".format(test_loss, test_acc))


if __name__ == '__main__':
    main()
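The strategy variants that are commented in and out in main() could also be chosen through a small helper, which makes it easier to compare runs with zero, one, or two GPUs. This is only a sketch: gpu_device_names and make_strategy are hypothetical names introduced here, not TensorFlow APIs, and mapping a single GPU to OneDeviceStrategy is an assumption about how one might want to test.

```python
def gpu_device_names(num_gpus):
    """Return TensorFlow device strings such as "/gpu:0" for the first num_gpus GPUs."""
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")
    return ["/gpu:{}".format(i) for i in range(num_gpus)]


def make_strategy(num_gpus):
    """Pick a distribution strategy for the requested number of GPUs.

    0 GPUs    -> MirroredStrategy on the CPU (the variant that works in the post)
    1 GPU     -> OneDeviceStrategy on /gpu:0
    2 or more -> MirroredStrategy across the listed GPUs
    """
    # Imported here so gpu_device_names() can be used without TensorFlow.
    import tensorflow as tf

    if num_gpus == 0:
        return tf.distribute.MirroredStrategy(devices=["/cpu:0"])
    if num_gpus == 1:
        return tf.distribute.OneDeviceStrategy(device="/gpu:0")
    return tf.distribute.MirroredStrategy(devices=gpu_device_names(num_gpus))
```

In main(), the strategy line would then read strategy = make_strategy(2) for the failing two-GPU case, or make_strategy(1) for the single-GPU comparison.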
Running the script on the CPU works, just as it did when there was only one GPU in the computer. With
strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
training also proceeds normally:
2020-05-23 23:11:56.890690: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-23 23:11:56.891333: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-23 23:11:56.891912: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-23 23:11:56.892554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7377 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-05-23 23:11:56.892873: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-23 23:11:56.893483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7377 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2070, pci bus id: 0000:02:00.0, compute capability: 7.5)
Starting training
Epoch 1/5
2020-05-23 23:11:58.036789: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
100/100 [==============================] - 44s 438ms/step - loss: 8.8931 - accuracy: 0.4841
Epoch 2/5
100/100 [==============================] - 44s 437ms/step - loss: 0.8959 - accuracy: 0.6444
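I have run out of ideas for what to try, and searching for the error message did not turn up many suggestions. Any ideas are highly appreciated!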