TensorFlow not using multiple GPUs - getting OOM

2024-05-15 • Q&A

I'm hitting an OOM on a multi-GPU machine because TF 2.3 seems to allocate tensors on only one GPU.

tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at conv_ops.cc:539 : 
Resource exhausted: OOM when allocating tensor with shape[20532,64,48,32] 
and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc.

But when I run my code, TensorFlow does recognize multiple GPUs:

Adding visible gpu devices: 0,1,2

Is there something else I need to do to get TF to use all the GPUs?

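The short answer is yes: you do need to do more work to get TF to use multiple GPUs (it already detects them, as your log shows, but it places everything on GPU:0 by default). You should read the guide linked below, but the tl;dr is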

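import tensorflow as tf

# Replicate the model across all visible GPUs (data parallelism).
mirrored_strategy = tf.distribute.MirroredStrategy()

# Build and compile the model inside the scope so its variables are mirrored.
with mirrored_strategy.scope():
  ...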

https://www.tensorflow.org/guide/distributed_training#using_tfdistributestrategy_with_tfkerasmodelfit
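To make the tl;dr concrete, here is a minimal sketch of the pattern from that guide. The tiny model, the random data, the channels-last 64x128x3 input shape, and the per-replica batch size of 4 are all placeholders I made up for illustration, not your actual setup. Note that MirroredStrategy is data parallelism: each GPU gets a full copy of the model and a slice of every batch, so it does not pool GPU memory into one big device.

```python
import tensorflow as tf

# One replica per visible GPU; with no GPU present it falls back to one CPU replica.
strategy = tf.distribute.MirroredStrategy()
num_replicas = strategy.num_replicas_in_sync

# Variables have to be created inside the scope so they are mirrored on every replica.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(64, 128, 3)),  # hypothetical input shape
        tf.keras.layers.Conv2D(8, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Batch the dataset with the GLOBAL batch size; Keras splits each batch across replicas.
per_replica_batch = 4  # hypothetical value
global_batch = per_replica_batch * num_replicas

images = tf.random.uniform((global_batch * 2, 64, 128, 3))
labels = tf.random.uniform((global_batch * 2, 1))
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(global_batch)

history = model.fit(dataset, epochs=1, verbose=0)
print("replicas in sync:", num_replicas)
```

On a machine with three GPUs this should report three replicas, with each one processing a third of every global batch.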

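But in your case something else is going on. This tensor may be the one that triggered the OOM, but it is quite likely that other large tensors allocated earlier had already used up the memory.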

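The first dimension, your batch size, is 20532, which is really large. Since it factors as 2**2 × 3 × 29 × 59, I'm guessing you are using CHW format and your source images were 3x64x128, trimmed after a few convolutions. I suspect an unintentional broadcast. Print model.summary() and look at the size of the tensor coming out of each layer. You may also want to take a look at your batching.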
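For a sense of scale, the arithmetic on the shape in your error message can be checked in a few lines of plain Python. This assumes "type float" in the log means 32-bit floats (4 bytes per element), which is what TF's float dtype is; the helper name prime_factors is just something I wrote for this post.

```python
from math import prod

# Shape copied from the OOM message above.
shape = (20532, 64, 48, 32)
BYTES_PER_FLOAT32 = 4

num_elements = prod(shape)
size_bytes = num_elements * BYTES_PER_FLOAT32
size_gib = size_bytes / 2**30


def prime_factors(n):
    """Return the prime factors of n in ascending order, by trial division."""
    factors = []
    divisor = 2
    while divisor * divisor <= n:
        while n % divisor == 0:
            factors.append(divisor)
            n //= divisor
        divisor += 1
    if n > 1:
        factors.append(n)
    return factors


batch_factors = prime_factors(shape[0])

print(f"elements:      {num_elements:,}")    # 2,018,377,728
print(f"size:          {size_gib:.2f} GiB")  # 7.52 GiB
print(f"batch factors: {batch_factors}")     # [2, 2, 3, 29, 59]
```

So this one activation needs about 7.5 GiB by itself. Even split evenly over your three GPUs that is roughly 2.5 GiB per GPU for a single layer's output, which is why I would fix the shapes first and worry about the distribution strategy second.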
