GKE - unable to get CUDA working with PyTorch

I have set up a Kubernetes node with an NVIDIA Tesla K80 and, following this tutorial, I am trying to run a PyTorch Docker image that uses the NVIDIA and CUDA drivers.

Both the NVIDIA drivers and the CUDA drivers are accessible inside my pod under /usr/local:

$> ls /usr/local
bin  cuda  cuda-10.0  etc  games  include  lib  man  nvidia  sbin  share  src
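One way to double-check that the driver library itself is usable from inside the pod, independently of PyTorch, is to load it directly with ctypes (a minimal diagnostic sketch; the libcuda.so.1 name and the lib64 path are assumptions based on the usual GKE driver-installer layout, not part of the original setup):

import ctypes

# Assumption: the GKE driver installer exposes the driver library here;
# adjust the path if the node image lays it out differently.
libcuda = ctypes.CDLL("/usr/local/nvidia/lib64/libcuda.so.1")

# cuInit must be called once before other driver API calls.
assert libcuda.cuInit(0) == 0, "cuInit failed"

version = ctypes.c_int()
assert libcuda.cuDriverGetVersion(ctypes.byref(version)) == 0
# The driver API encodes versions as 1000*major + 10*minor,
# so 10000 means the driver supports CUDA up to 10.0.
print("Max CUDA version supported by the driver:", version.value)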

My image, based on nvidia/cuda:10.0-runtime-ubuntu18.04, is also able to see my GPU:

$> /usr/local/nvidia/bin/nvidia-smi
Fri Nov  8 16:24:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P8    35W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

But after installing PyTorch 1.3.0, even with LD_LIBRARY_PATH set to /usr/local/nvidia/lib64:/usr/local/cuda/lib64, I cannot get PyTorch to recognize my CUDA installation:

$> python3 -c "import torch; print(torch.cuda.is_available())"
False

$> python3
Python 3.6.8 (default, Oct  7 2019, 12:59:55)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print('\t\ttorch.cuda.current_device()    =', torch.cuda.current_device())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 386, in current_device
    _lazy_init()
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 192, in _lazy_init
    _check_driver()
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 111, in _check_driver
    of the CUDA driver.""".format(str(torch._C._cuda_getDriverVersion())))
AssertionError:
The NVIDIA driver on your system is too old (found version 10000).
Please update your GPU driver by downloading and installing a new
version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install
a PyTorch version that has been compiled with your version
of the CUDA driver.
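For reference, the 10000 in this message is the driver-API encoding of CUDA 10.0 (1000 × major + 10 × minor). A small diagnostic sketch to print the CUDA version this PyTorch wheel was compiled against next to the version the driver reports (torch._C._cuda_getDriverVersion is the same internal call used in the assertion above):

import torch

print("torch version         :", torch.__version__)
# CUDA toolkit version this wheel was compiled against
# (the default pip wheel for 1.3.0 is likely built for CUDA 10.1).
print("compiled against CUDA :", torch.version.cuda)
# Maximum CUDA version the installed driver supports,
# encoded as 1000*major + 10*minor (10000 -> 10.0).
print("driver-supported CUDA :", torch._C._cuda_getDriverVersion())

If the wheel's CUDA version is newer than what the driver supports, this assertion is exactly what one would expect to see.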

The error above is strange, because the CUDA version of my image is 10.0 and the Google GKE documentation states:

    The latest supported CUDA version is 10.0

Moreover, GKE's DaemonSet automatically installs the NVIDIA drivers:

    After you add GPU nodes to your cluster, you need to install NVIDIA's device drivers on the nodes.

    Google provides a DaemonSet that automatically installs the drivers for you. Refer to the following sections for installation instructions for Container-Optimized OS (COS) and Ubuntu nodes.

    To deploy the installation DaemonSet, run the following command:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

I have tried everything I can think of, with no success...

kukududushen answered:

I solved my problem by downgrading my PyTorch version, building my Docker image from pytorch/pytorch:1.2-cuda10.0-cudnn7-devel.

I still don't fully understand why it wasn't working; my best guess is that PyTorch 1.3.0 and CUDA 10.0 are not compatible.
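For anyone hitting the same issue, here is a quick sanity check to run inside the downgraded image (a sketch only; the expected values assume the pytorch/pytorch:1.2-cuda10.0-cudnn7-devel base and the K80 node described above):

import torch

print(torch.__version__)           # expected: a 1.2.x release
print(torch.version.cuda)          # expected to start with '10.0', matching the driver
print(torch.cuda.is_available())   # expected: True
if torch.cuda.is_available():
    # Should report the Tesla K80 visible in nvidia-smi above.
    print(torch.cuda.get_device_name(0))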
