TensorFlow stops working after adding a second GPU (CUDNN_STATUS_INTERNAL_ERROR)

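Today I proudly installed a second RTX 2070 in my computer to further speed up TensorFlow 2.2. To my disappointment, the Python script that ran on one GPU no longer works. I tried to reduce it to a minimal working example, which works with the line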

strategy = tf.distribute.MirroredStrategy(devices=["/cpu:0"])

If I replace this line with any of the following lines

strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0","/gpu:1"])
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
strategy = tf.distribute.MirroredStrategy()

I get an error message like this:

Starting training
Epoch 1/5
2020-05-23 22:52:59.205856: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-23 22:52:59.400434: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-23 22:53:00.881437: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-05-23 22:53:00.898484: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "example3.py",line 77,in <module>
    main()
  File "example3.py",line 70,in main
    model.fit(x=training_generator,workers=1,epochs=5,steps_per_epoch = len(training_generator))
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py",line 66,in _method_wrapper
    return method(self,*args,**kwargs)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py",line 848,in fit
    tmp_logs = train_function(iterator)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py",line 580,in __call__
    result = self._call(*args,**kwds)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py",line 644,in _call
    return self._stateless_fn(*args,**kwds)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py",line 2420,in __call__
    return graph_function._filtered_call(args,kwargs)  # pylint: disable=protected-access
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py",line 1665,in _filtered_call
    self.captured_inputs)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py",line 1746,in _call_flat
    ctx,args,cancellation_manager=cancellation_manager))
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py",line 598,in call
    ctx=ctx)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/eager/execute.py",line 60,in quick_execute
    inputs,attrs,num_outputs)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize,so try looking to see if a warning log message was printed above.
         [[node sequential/conv2d/Conv2D (defined at /usr/lib64/python3.6/threading.py:916) ]]
  (1) Unknown:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize,so try looking to see if a warning log message was printed above.
         [[node sequential/conv2d/Conv2D (defined at /usr/lib64/python3.6/threading.py:916) ]]
         [[div_no_nan_1/ReadVariableOp/_14]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_1034]

Errors may have originated from an input operation.
Input Source operations connected to node sequential/conv2d/Conv2D:
 cond_1/Identity (defined at example3.py:70)

Input Source operations connected to node sequential/conv2d/Conv2D:
 cond_1/Identity (defined at example3.py:70)

Function call stack:
train_function -> train_function

2020-05-23 22:53:00.943669: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing Generatordataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
         [[{{node PyFunc}}]]
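
Since the log shows cuDNN failing to create a handle twice, one way to narrow this down might be to run a single small convolution on each GPU separately and see which device fails. The following is only a sketch under assumptions: the helper name `conv_on_each_gpu` is made up for illustration, and the tensor shapes are arbitrary. It returns an empty result when TensorFlow is not installed or no GPU is visible.

```python
def conv_on_each_gpu():
    """Run one tiny convolution per visible GPU and report which ones succeed.

    Hypothetical helper, not part of the original script.
    Returns a dict such as {"/gpu:0": "ok", "/gpu:1": "failed: ..."}.
    """
    try:
        import tensorflow as tf
    except ImportError:
        # Lets the sketch be imported on a machine without TensorFlow.
        return {}

    results = {}
    gpus = tf.config.experimental.list_physical_devices("GPU")
    for index in range(len(gpus)):
        name = "/gpu:{}".format(index)
        try:
            with tf.device(name):
                images = tf.random.normal((1, 32, 32, 3))   # one small NHWC image
                kernels = tf.random.normal((3, 3, 3, 8))    # 3x3 kernel, 3 -> 8 channels
                out = tf.nn.conv2d(images, kernels, strides=1, padding="SAME")
                out.numpy()  # force the result back to the host
            results[name] = "ok"
        except tf.errors.OpError as err:
            # cuDNN initialization problems surface here as UnknownError.
            results[name] = "failed: {}".format(err.message)
    return results


if __name__ == "__main__":
    for device, status in conv_on_each_gpu().items():
        print(device, status)
```

If one device reports a failure while the other reports "ok", that would point at a single card or its memory configuration rather than at the distribution strategy itself.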

Here is the complete code to reproduce the error:

import tensorflow as tf
import numpy as np
from PIL import Image,ImageDraw

class DataGenerator(tf.keras.utils.Sequence):
    def __init__(self,BatchSize,PicX,PicY,Color):
        self._BatchSize = BatchSize
        self._dim = (PicX,PicY)
        self._Color = Color

    def __len__(self):
        return 100

    def create_random_form(self):
        img = Image.new('RGB',self._dim,(50,50,50))
        draw = ImageDraw.Draw(img)
        label = np.random.randint(3)
        x0 = np.random.randint(int((self._dim[0]-5)/2))+1
        x1 = np.random.randint(int((self._dim[0]-5)/2))+int(self._dim[0]/2)
        y0 = np.random.randint(int((self._dim[1]-5)/2))
        y1 = np.random.randint(int((self._dim[1]-5)/2))+int(self._dim[1]/2)
        if label == 0:
            draw.rectangle((x0,y0,x1,y1),fill=self._Color)
        elif label == 1:
            draw.ellipse((x0,y0,x1,y1),fill=self._Color)
        else:
            draw.polygon([(x0,y0),(x0,y1),(x1,y1)],fill=self._Color)
        return img,label

    def __getitem__(self,index):
        X = np.empty((self._BatchSize,*self._dim,3))
        y = np.empty((self._BatchSize),dtype=int)
        for i in range(0,self._BatchSize):
            img,label = self.create_random_form()
            X[i,] = tf.keras.preprocessing.image.img_to_array(img) / 255.0
            y[i] = label
        return X,y

def main():
    PicX = 300
    PicY = 300
    Color = (255,255,255)
    #save_some_pics(20)
    print("Starting a minimal,self-contained error reproduction")
    #strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0","/gpu:1"])
    #strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0"])
    #strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
    with strategy.scope():
        model = tf.keras.Sequential()
        model.add(tf.keras.layers.Conv2D(32,(9,9),activation='relu',input_shape=(PicX,PicY,3)))
        model.add(tf.keras.layers.MaxPooling2D((9,9)))
        # kernel size was lost in the post; (9,9) is assumed to match the first layer
        model.add(tf.keras.layers.Conv2D(64,(9,9),activation='relu'))
        model.add(tf.keras.layers.MaxPooling2D((9,9)))
        model.add(tf.keras.layers.Flatten())
        model.add(tf.keras.layers.Dense(64,activation='relu'))     
        model.add(tf.keras.layers.Dense(3,activation='softmax'))
        model.compile(optimizer='rmsprop',loss='sparse_categorical_crossentropy',metrics=['accuracy'])
    print(model.summary())
    training_generator = DataGenerator(10,PicX,PicY,Color)
    print("Starting training")
    model.fit(x=training_generator,workers=1,epochs=5,steps_per_epoch = len(training_generator))
    test_generator = DataGenerator(10,PicX,PicY,Color)
    test_loss,test_acc = model.evaluate(test_generator)
    print("Test loss {},test accuracy {}".format(test_loss,test_acc))    

if __name__ == '__main__':
    main()

Running it on the CPU works, as does running it with only one GPU in the computer, and with

strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")

training starts normally:

2020-05-23 23:11:56.890690: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1),but there must be at least one NUMA node,so returning NUMA node zero
2020-05-23 23:11:56.891333: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1),so returning NUMA node zero
2020-05-23 23:11:56.891912: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1),so returning NUMA node zero
2020-05-23 23:11:56.892554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7377 MB memory) -> physical GPU (device: 0,name: GeForce RTX 2070,pci bus id: 0000:01:00.0,compute capability: 7.5)
2020-05-23 23:11:56.892873: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1),so returning NUMA node zero
2020-05-23 23:11:56.893483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7377 MB memory) -> physical GPU (device: 1,pci bus id: 0000:02:00.0,compute capability: 7.5)
Starting training
Epoch 1/5
2020-05-23 23:11:58.036789: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
100/100 [==============================] - 44s 438ms/step - loss: 8.8931 - accuracy: 0.4841
Epoch 2/5
100/100 [==============================] - 44s 437ms/step - loss: 0.8959 - accuracy: 0.6444

I have run out of ideas for what to try, and searching for the error message did not turn up much. Any ideas are highly appreciated!
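
For anyone reproducing this: a setting that is often suggested for "Could not create cudnn handle" errors is GPU memory growth, which makes TensorFlow allocate GPU memory on demand instead of reserving nearly all of it at startup. Whether it helps with this particular two-GPU setup is not verified here. The sketch below assumes it is called at the very top of `main()`, before any strategy or model is created. The function name `enable_memory_growth` is made up for illustration.

```python
def enable_memory_growth():
    """Turn on memory growth for every visible GPU.

    Hypothetical helper, not part of the original script.
    Returns the number of GPUs that were configured (0 if TensorFlow is
    not installed or no GPU is visible).
    """
    try:
        import tensorflow as tf
    except ImportError:
        # Lets the sketch be imported on a machine without TensorFlow.
        return 0

    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before the GPUs are initialized; TensorFlow raises
        # RuntimeError if a GPU has already been used.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


if __name__ == "__main__":
    print("GPUs configured for memory growth:", enable_memory_growth())
```

Memory growth has to be set consistently on all GPUs before the first GPU operation, which is why the call belongs ahead of the `tf.distribute.MirroredStrategy(...)` line rather than inside `strategy.scope()`.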
